Which RISC-V instructions does the ET-SoC-1 give us?

Continuing the recent theme, I was given an ET-SoC-1 PCIe board, which is now installed in my home lab. The first order of business is confirming exactly which RISC-V instructions are supported by its minion CPU cores. We could try to learn this from documentation, or from the system emulator, or from the C compiler (all of which exist), but the ground truth can only be confirmed by testing on real hardware. This requires writing some code to interact with the hardware, and while there is a high-level runtime intended for this, it is more illuminating to jump in at a slightly lower level. We begin with:

int fd = open("/dev/et0_ops", O_RDWR | O_CLOEXEC);
if (fd < 0) FATAL("Could not open PCIe device");

The kernel driver creates two files per PCIe card: /dev/et<N>_ops and /dev/et<N>_mgmt. In broad strokes, the former is useful for launching compute kernels, whereas the latter is useful for updating firmware. Different filesystem permissions can be applied to the two files: perhaps only root should be able to update the firmware, but any user should be able to launch kernels.

Launching a kernel on the device requires uploading some RISC-V code to it, and in turn that requires choosing somewhere in the device's address space to place said code. Code for minion cores has to live within the device's DRAM, which is a 32 GiB region starting at address 0x80_0000_0000, but firmware takes a little bit for itself. There's an ioctl to determine how much is available:

struct dram_info dram_info;
if (ioctl(fd, ETSOC1_IOCTL_GET_USER_DRAM_INFO, &dram_info) < 0) {
  FATAL("Could not issue ETSOC1_IOCTL_GET_USER_DRAM_INFO");
}
printf("Have %llu bytes of DRAM starting at 0x%llx\n",
  (long long unsigned)dram_info.size,
  (long long unsigned)dram_info.base);

This prints 34265366528 (32 GiB minus 90 MiB) and 0x8005801000 (~88 MiB after 0x80_0000_0000). Proper host software would create a memory allocator at this point to dynamically manage this region, but for this post we'll just bump allocate starting from dram_info.base.
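As a rough illustration of what bump allocation means here, the sketch below hands out device addresses from the region reported by the ioctl. The names bump_alloc, bump_init, and bump_take are hypothetical (they are not part of the driver or the runtime), and the 64-byte alignment in the usage note is simply a convenient choice.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical bump allocator over the user-visible DRAM window.
struct bump_alloc {
  uint64_t next; // next free device address
  uint64_t end;  // one past the last usable device address
};

struct bump_alloc bump_init(uint64_t base, uint64_t size) {
  struct bump_alloc a = { .next = base, .end = base + size };
  return a;
}

// Returns a device address aligned to `align` (which must be a power of
// two), or 0 if the region is exhausted. Nothing is ever freed.
uint64_t bump_take(struct bump_alloc* a, uint64_t size, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  uint64_t start = (a->next + align - 1) & ~(align - 1);
  if (start < a->next || start > a->end || size > a->end - start) return 0;
  a->next = start + size;
  return start;
}
```

Seeding this with dram_info.base and dram_info.size, a first 12-byte allocation lands at dram_info.base, and a second 64-byte-aligned allocation lands at dram_info.base + 64.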

To actually launch a kernel, we need to think about queues. Firmware on the device initializes some submission queues (SQ) and completion queues (CQ), and the kernel driver knows how to push onto an SQ and how to pop from a CQ. These queues are small at the moment: each SQ can hold just over 1 KiB, and each CQ just under 1½ KiB. Each queue contains some number of messages, and we begin with a little helper function to (ask the kernel driver to) push a message onto an SQ:

uint16_t sq_push(int fd, struct cmn_header_t* msg, uint8_t flags) {
  struct cmd_desc msg_desc = {
    .cmd = msg,
    .size = msg->size,
    .flags = flags,
  };
  uint16_t tag = msg->tag_id = (uint16_t)rand();
  for (;;) {
    if (ioctl(fd, ETSOC1_IOCTL_PUSH_SQ, &msg_desc) < 0) {
      if (errno == EAGAIN) continue;
      FATAL("Could not issue ETSOC1_IOCTL_PUSH_SQ");
    }
    return tag;
  }
}

One kind of message we can push to a /dev/et0_ops SQ is struct device_ops_kernel_launch_cmd_t, which in particular includes:

code_start_address: The address of some RISC-V code on the device.
pointer_to_args: An arbitrary 64-bit value to be passed to the RISC-V code in the a0 register. To pass more than this, the values can be placed somewhere in device memory, and the address of those values passed.
shire_mask: A bitmask of which minion tiles to execute the RISC-V code on; every set bit will cause the code to be executed 64 times (because there are 32 minions per tile, and 2 hardware threads per minion).
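To make the shire_mask arithmetic concrete, here is a small sketch that counts how many hardware threads a given mask will run the code on. The function name and the choice of a 64-bit mask type are assumptions for illustration; the 32 and 2 come from the description above.

```c
#include <stdint.h>

enum { MINIONS_PER_TILE = 32, THREADS_PER_MINION = 2 };

// Number of times the kernel code is executed for a given shire_mask:
// one execution per hardware thread of every selected tile.
unsigned launches_for_shire_mask(uint64_t shire_mask) {
  unsigned tiles = 0;
  // Clear the lowest set bit on each iteration to count set bits.
  for (; shire_mask != 0; shire_mask &= shire_mask - 1) ++tiles;
  return tiles * MINIONS_PER_TILE * THREADS_PER_MINION;
}
```

With a mask of 0x1 this gives 64, and a mask of 0x5 (two tiles) gives 128.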

If the optional CMD_FLAGS_KERNEL_LAUNCH_ARGS_EMBEDDED flag is specified, then we get what Vulkan calls "push constants": the SQ message can include a little bit of data immediately after struct device_ops_kernel_launch_cmd_t, which the ioctl will push to the device for us, and firmware on the device will copy to pointer_to_args prior to invoking code_start_address. Kernel arguments aren't required for this post, but we can (ab)use this mechanism to push the RISC-V code itself to the device.

The resulting message is:

struct {
  struct device_ops_kernel_launch_cmd_t launch;
  uint32_t rv_code[3];
} __attribute__((packed, aligned(8))) launch_cmd = {
  .launch = {
    .command_info = {
      .cmd_hdr = {
        .size = sizeof(launch_cmd),
        .msg_id = DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_CMD,
        .flags = CMD_FLAGS_KERNEL_LAUNCH_ARGS_EMBEDDED,
      }
    },
    .code_start_address = dram_info.base,
    .pointer_to_args = dram_info.base,
    .shire_mask = 0x1,
  },
  .rv_code = {
    0x00000013, // nop
    0x00800513, // li a0, 8
    0x00000073, // ecall
  },
};
uint16_t tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);

The pushed RISC-V code consists of three instructions: a nop which'll come in handy later, and then two instructions to perform a syscall (the number 8 is SYSCALL_RETURN_FROM_KERNEL).
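For anyone wondering where the hex values come from, the first two are ordinary I-type encodings of addi (nop is addi x0, x0, 0 and li a0, 8 is addi a0, x0, 8). The sketch below rebuilds them from the standard field layout; rv_itype and rv_addi are hypothetical helper names rather than anything from the device headers.

```c
#include <assert.h>
#include <stdint.h>

// Standard RISC-V I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode.
uint32_t rv_itype(uint32_t opcode, uint32_t funct3, uint32_t rd,
                  uint32_t rs1, int32_t imm) {
  assert(rd < 32 && rs1 < 32 && imm >= -2048 && imm <= 2047);
  return (((uint32_t)imm & 0xfff) << 20) | (rs1 << 15) | (funct3 << 12) |
         (rd << 7) | opcode;
}

// addi uses opcode 0x13 (OP-IMM) with funct3 = 0.
uint32_t rv_addi(uint32_t rd, uint32_t rs1, int32_t imm) {
  return rv_itype(0x13, 0, rd, rs1, imm);
}
```

Here rv_addi(0, 0, 0) yields 0x00000013 and rv_addi(10, 0, 8) yields 0x00800513, since a0 is register x10.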

Once the device firmware has popped this SQ message and completed running the RISC-V code, it'll push a CQ message. Firmware can also push unsolicited CQ messages for other reasons; proper host software should do something with these, but we'll just ignore them in the interest of brevity. This leads to a helper function for popping CQ messages until the message we want arrives:

void cq_pop_until(int fd, struct rsp_desc* dst, uint16_t kind, uint16_t tag) {
  for (;;) {
    if (ioctl(fd, ETSOC1_IOCTL_POP_CQ, dst) < 0) {
      if (errno == EAGAIN) continue;
      FATAL("Could not issue ETSOC1_IOCTL_POP_CQ");
    }
    struct cmn_header_t* hdr = (struct cmn_header_t*)dst->rsp;
    if (hdr->msg_id == kind && hdr->tag_id == tag) return;
  }
}

Using this, we can obtain the struct device_ops_kernel_launch_rsp_t telling us how the kernel launch went:

char rsp_buf[256];
struct rsp_desc rsp_desc = {
  .rsp = rsp_buf,
  .size = sizeof(rsp_buf),
};
cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
printf("Kernel launch status: %u\n", ((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status);

This prints 0, meaning DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_KERNEL_COMPLETED.

So far so good: we can launch a trivial kernel, and it completes successfully. The neat part is that we only need a tiny modification to use this for probing which RISC-V instructions are supported by the ET-SoC-1's minion CPUs. The key is the nop at the start of rv_code: we can replace this with any other RISC-V instruction, and if the kernel still completes successfully then the instruction was supported, whereas if the kernel fails then the instruction wasn't supported. Firmware on the device handles the grungy details of catching the invalid instruction and getting everything neat and tidy again ready for running the next kernel (similar to how your operating system catches segfaults and limits their impact to terminating just the single faulty process rather than the whole machine).

Trying this out merely requires changing the nop to something invalid, launching the new kernel, and printing the status again:

launch_cmd.rv_code[0] = 0x1234deaf;
tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);
cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
printf("Faulty launch status: %u\n", ((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status);

This prints 2, meaning DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_EXCEPTION.

As it happens, firmware on the device can give us more details about the exception by populating a struct execution_context_t somewhere in device memory. We need to allocate enough device memory to hold an execution_context_t array, with one array element per hardware thread. As we set shire_mask to 0x1, the kernel runs on hardware threads 0 through 63, so we need a 64-element array. Bump allocating the device memory is easy enough: all we need to do is add .exception_buffer = dram_info.base + 64, to the definition of launch_cmd. Pulling the array back to the host is slightly more involved, motivating another little helper function:

uint16_t async_memcpy_from_device(int fd, void* dst, uint64_t src, size_t size) {
  struct {
    struct cmn_header_t header;
    struct dma_read_node node;
  } __attribute__((packed, aligned(8))) dma_cmd = {
    .header = {
      .size = sizeof(dma_cmd),
      .msg_id = DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_CMD,
    },
    .node = {
      .dst_host_virt_addr = (uint64_t)(uintptr_t)dst,
      .src_device_phy_addr = src,
      .size = size,
    }
  };
  return sq_push(fd, &dma_cmd.header, CMD_DESC_FLAG_DMA);
}

We need to allocate some special host memory to be the target of the async memcpy, but doing so is just an mmap call, and then we can do the copy and look at the execution_context_t::scause we get back:

const size_t contexts_size = sizeof(execution_context_t) * 64;
void* dma_buf = mmap(NULL, contexts_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (dma_buf == MAP_FAILED) FATAL("Could not allocate %llu byte DMA buffer", (long long unsigned)contexts_size);
tag = async_memcpy_from_device(fd, dma_buf, launch_cmd.launch.exception_buffer, contexts_size);
cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_RSP, tag);
printf("Faulty launch scause: %u\n", (unsigned)((execution_context_t*)dma_buf)->scause);

This also prints 2, but the 2 now means something different: it comes from the RISC-V Instruction Set Manual: Volume II: Privileged Architecture scause value list, which says 2 means Illegal instruction. This is exactly what we expect, but we'll observe some other values in due course.

Next up, we need a list of encoded RISC-V instructions to try running. We could consult the opcode map in the RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture for standard RISC-V instructions and the ET Programmer's Reference Manual for custom instructions, but transcribing instruction encodings out of manuals is dull work, especially when someone else has already done it for us: the fork of binutils for this device has riscv-opc.h with lots of standard encodings, and esperanto-opc.h for the custom ones. Each of the #define MATCH_<INSN> <ENCODING> lines therein gives us one possible encoding of INSN, typically with all register operands set to x0 or f0 and any immediate operands set to 0. We can start by pulling out a handful of instructions from each file:

// Standard:
#define MATCH_LD          0x3003
#define MATCH_FENCE          0xf
#define MATCH_FENCE_I     0x100f
#define MATCH_DIV      0x2004033
#define MATCH_FMUL_S  0x10000053
// Custom:
#define MATCH_FMUL_PS 0x1000007b
#define MATCH_FDIV_PI 0x1e00007b
#define MATCH_BITMIXB 0x8000703b

const struct insn_entry {
  const char* name;
  uint32_t encoding;
} g_insns[] = {
  {"ld",      MATCH_LD},
  {"fence",   MATCH_FENCE},
  {"fence.i", MATCH_FENCE_I},
  {"div",     MATCH_DIV},
  {"fmul.s",  MATCH_FMUL_S},
  {"fmul.ps", MATCH_FMUL_PS},
  {"fdiv.pi", MATCH_FDIV_PI},
  {"bitmixb", MATCH_BITMIXB},
  {NULL, 0},
};

Each instruction can then be tested in turn:

for (const struct insn_entry* i = g_insns; i->name; ++i) {
  launch_cmd.rv_code[0] = i->encoding;
  tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);
  cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
  if (((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status == DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_KERNEL_COMPLETED) {
    printf("%-8s -> OK\n", i->name);
  } else {
    tag = async_memcpy_from_device(fd, dma_buf, launch_cmd.launch.exception_buffer, contexts_size);
    cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_RSP, tag);
    printf("%-8s -> scause %u\n", i->name, (unsigned)((execution_context_t*)dma_buf)->scause);
  }
}

This prints:

`ld`	→	scause 5
`fence`	→	OK
`fence.i`	→	scause 30
`div`	→	OK
`fmul.s`	→	OK
`fmul.ps`	→	OK
`fdiv.pi`	→	scause 30
`bitmixb`	→	scause 2

The same documentation as before tells us that scause of 5 means Load access fault, which is to be expected: the encoded instruction is ld x0, 0(x0), and while this is a valid instruction, 0(x0) isn't a valid address in the minion's memory map. More curious is scause of 30: the standard documentation puts this under Designated for custom use, and so we need to look at the aforementioned ET Programmer's Reference Manual to see it described as M-code emulation. This means that the hardware's instruction decoder does recognise the instruction as valid, but the hardware doesn't natively implement the instruction; instead it is asking firmware to invisibly (albeit slowly) emulate it. Unfortunately, the firmware logic for instruction emulation hasn't been written yet, so we get a very visible exception rather than invisible emulation. The distinction between Illegal instruction and M-code emulation is somewhat arbitrary: firmware could choose to perform emulation in response to Illegal instruction, and could choose not to perform emulation in response to M-code emulation (as seen in the current firmware where the emulation logic hasn't been written yet). Despite it being arbitrary, I'll maintain the distinction.

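The testing code can then be improved to interpret scause values, treating any exception other than Illegal instruction or M-code emulation (such as the Load access fault from ld) as evidence that the instruction itself is supported: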

for (const struct insn_entry* i = g_insns; i->name; ++i) {
  launch_cmd.rv_code[0] = i->encoding;
  tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);
  cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
  const char* status = "OK";
  if (((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status != DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_KERNEL_COMPLETED) {
    tag = async_memcpy_from_device(fd, dma_buf, launch_cmd.launch.exception_buffer, contexts_size);
    cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_RSP, tag);
    switch (((execution_context_t*)dma_buf)->scause) {
    case  2: status = "Invalid"; break;
    case 30: status = "Emulate"; break;
    }
  }
  printf("%-8s -> %s\n", i->name, status);
}

This now prints:

`ld`	→	OK
`fence`	→	OK
`fence.i`	→	Emulate
`div`	→	OK
`fmul.s`	→	OK
`fmul.ps`	→	OK
`fdiv.pi`	→	Emulate
`bitmixb`	→	Invalid

Testing more instructions is just a matter of adding more entries to g_insns. I've put my list of entries along with all supporting code up as a gist, which you're welcome to read through, but you'll need a real hardware device to make the code useful. Alternatively, keep on reading here as I go through the results from running it on the device in my lab.

All of RV64I is OK: this is addi, addiw, slti, sltiu, andi, ori, xori, slli, slliw, srli, srliw, srai, sraiw, lui, auipc, add, addw, sub, subw, slt, sltu, and, or, xor, sll, sllw, srl, srlw, sra, sraw, jal, jalr, beq, bne, blt, bltu, bge, bgeu, ld, lw, lwu, lh, lhu, lb, lbu, sd, sw, sh, sb, fence, ecall, ebreak, and various assembler pseudo-instructions expanding to these.

Standard extensions are fairly quickly enumerated:

M extension: all OK; this is mul, mulh, mulhsu, mulhu, mulw, div, divu, rem, remu, divw, divuw, remw, remuw.
Zicsr extension: all OK; this is csrrw, csrrs, csrrc, csrrwi, csrrsi, csrrci.
Zifencei extension: fence.i is emulated.
Supervisor / Machine-Mode privileged instructions: mret and sret and wfi are all OK, sfence.vma is emulated.
F extension: mostly OK (frcsr, fscsr, frrm, fsrm, fsrmi, frflags, fsflags, fsflagsi, flw, fsw, fadd.s, fsub.s, fmul.s, fmin.s, fmax.s, fmadd.s, fmsub.s, fnmsub.s, fnmadd.s, fcvt.w.s, fcvt.wu.s, fcvt.s.w, fcvt.s.wu, fsgnj.s, fsgnjn.s, fsgnjx.s, fmv.x.s, fmv.s.x, feq.s, flt.s, fle.s, fclass.s), but with a few instructions emulated (fdiv.s, fsqrt.s, fcvt.l.s, fcvt.lu.s, fcvt.s.l, fcvt.s.lu).
C extension: instructions valid as per their corresponding non-compressed counterpart.
All other standard extensions are invalid (A, B, D, Q, V, Zfa, Zfh, Zicntr, Zicond, and so on and so forth).

Rather more interesting are the non-standard instructions. There are various ways of grouping these, but I'll start with custom scalar integer arithmetic: packb is OK, but bitmixb is invalid. The behaviour of packb is just rd = (rs1 & 0xff) | ((rs2 & 0xff) << 8). The behaviour of bitmixb is far more interesting, performing a variety of bit interleavings of two 8-bit values, of the kind you might want for 2D texture address swizzling on a GPU. Conceptually this instruction takes three inputs (two 8-bit values and a 16-bit control), but RISC-V doesn't do three-input instructions, so two 8-bit inputs are packed together in a single input register, which no doubt is part of the motivation for packb. Alas, bitmixb is invalid, but perhaps a future chip will have it.
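The packb formula is simple enough to model directly on the host. This is just the expression above wrapped in a function (the function itself is a host-side model, not device code), which can be handy for preparing test vectors.

```c
#include <stdint.h>

// Host-side model of packb: low byte of rs1 in bits 0-7,
// low byte of rs2 in bits 8-15, everything else zero.
uint64_t packb(uint64_t rs1, uint64_t rs2) {
  return (rs1 & 0xff) | ((rs2 & 0xff) << 8);
}
```

For example, packb(0x1234, 0xabcd) gives 0xcd34.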

Next up are the cache-aware narrow store instructions: shl, shg, sbl, and sbg are all OK. These instructions exist because L1 and L2 caches are not coherent on the ET-SoC-1. The standard RISC-V sh / sb instructions operate at per-minion L1D, whereas the shl / sbl instructions do not interact with L1 at all and instead operate at per-tile L2, and then shg / sbg do not interact with L1 or L2 at all and instead operate at per-ASIC L3 (or similar, depending on the exact address). Due to the non-coherence, software needs to be very aware of the cache hierarchy. If writing entire cache lines (which are aligned 64 byte ranges), software can write at any cache level, and then rely on either implicit or explicit cache eviction to propagate the lines to higher cache levels (at which point they can become visible to other cores). If writing less than a cache line, and other cores are writing other parts of the same line, then all writers need to direct their writes to a cache which is common to all writers: L2 (and hence l suffix instruction) if all writers are in the same tile, L3 (and hence g suffix instruction) otherwise. Subsequent readers also need to use a load instruction which operates at that same cache, or need to explicitly flush any lower-level caches prior to issuing regular loads.
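The rule for sub-line writes can be summarised as a small decision function. This is only a sketch of the reasoning above: the enum and function names are made up, and it assumes each writer is identified by the index of the tile it runs on.

```c
#include <stddef.h>

enum store_level { STORE_L1, STORE_L2, STORE_L3 };

// Given the tile index of every core writing part of one cache line,
// pick the cache level that all of those writers have in common.
enum store_level narrow_store_level(const unsigned* writer_tiles, size_t n) {
  if (n <= 1) return STORE_L1;       // sole writer: plain sh / sb will do
  for (size_t i = 1; i < n; ++i) {
    if (writer_tiles[i] != writer_tiles[0]) {
      return STORE_L3;               // writers span tiles: shg / sbg
    }
  }
  return STORE_L2;                   // all in one tile: shl / sbl
}
```

Readers of the line then need to load from that same level, or flush their lower-level caches first, as described above.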

Moving on, all of amoadd[lg].[wd], amomin[lg].[wd], amominu[lg].[wd], amomax[lg].[wd], amomaxu[lg].[wd], amoand[lg].[wd], amoor[lg].[wd], amoxor[lg].[wd], amoswap[lg].[wd], and amocmpswap[lg].[wd] are OK. These instructions are inspired by the standard Zaamo instructions, but with the same L2 (l) or L3 (g) suffix as before, and all available as either 32-bit (w) or 64-bit (d). Degenerate forms of amoswap can be used as cache-aware variants of sw and sd, hence bespoke swl / swg / sdl / sdg instructions aren't required, and degenerate forms of amoor can be used for cache-aware loads.
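The degenerate-AMO trick is easier to see with a host-side analogy. The sketch below uses C11 atomics rather than the device instructions: a swap whose old value is discarded behaves as a store, and an OR with zero behaves as a load. How the assembler spells the degenerate forms on the device (presumably via x0 operands) is my assumption, not something I've checked in the manual.

```c
#include <stdatomic.h>
#include <stdint.h>

// Store expressed as a swap: write the new value, ignore the old one.
void store_via_swap(_Atomic uint32_t* p, uint32_t v) {
  (void)atomic_exchange(p, v);
}

// Load expressed as an OR with zero: memory is left unchanged,
// and the previous (i.e. current) value is returned.
uint32_t load_via_or(_Atomic uint32_t* p) {
  return atomic_fetch_or(p, 0);
}
```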

Other than amocmpswap, 32-bit variants of the scalar AMOs also exist in SIMD form: famoadd[lg].pi, famomin[lg].pi, famominu[lg].pi, famomax[lg].pi, famomaxu[lg].pi, famoand[lg].pi, famoor[lg].pi, famoxor[lg].pi, and famoswap[lg].pi are all OK. These all operate in a scatter / gather fashion: each SIMD lane forms its target address as rs2 + fs1.i32<i>, and arbitrary lanes can be skipped by setting the m0 mask register appropriately. Wrapping up the SIMD AMOs, famomin[lg].ps and famomax[lg].ps are also OK: they perform an fp32 min/max rather than an i32 or u32 min/max.
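To illustrate the scatter / gather addressing, here is a host-side sketch of which addresses a SIMD AMO would touch. It assumes that the per-lane offsets are sign-extended before being added to rs2, and that bit i of m0 enables lane i; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { SIMD_LANES = 8 };

// Writes the target address of every enabled lane to `out`
// and returns how many lanes were enabled.
size_t enabled_lane_addresses(uint64_t rs2, const int32_t fs1[SIMD_LANES],
                              uint8_t m0, uint64_t out[SIMD_LANES]) {
  size_t n = 0;
  for (int i = 0; i < SIMD_LANES; ++i) {
    if (!((m0 >> i) & 1)) continue;               // lane skipped by mask
    out[n++] = rs2 + (uint64_t)(int64_t)fs1[i];   // rs2 + fs1.i32<i>
  }
  return n;
}
```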

Unlike their scalar counterparts, degenerate AMOs don't need to be used for cache-aware SIMD scatters / gathers. For L1D scatters, fscw.ps is OK, and for L2 / L3 fscw[lg].ps are OK. Narrower versions of these are also OK: fsch.ps and fsch[lg].ps for 16-bit values (from the low bits of each 32-bit lane), and fscb.ps and fscb[lg].ps for 8-bit values (again from the low bits of each 32-bit lane). Gather variants are the same, just starting with fg rather than fsc: all of fgw.ps, fgw[lg].ps, fgh.ps, fgh[lg].ps, fgb.ps, and fgb[lg].ps are OK. There are also "restricted" SIMD scatters / gathers, where all eight memory accesses of the instruction are within the same aligned 32-byte region: fsc32[whb].ps for scatters and fg32[whb].ps for gathers, all of which are OK (but aren't available in cache-aware variants; these restricted instructions always target L1D).
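Whether a given set of lane addresses qualifies for the restricted forms can be checked on the host with a few lines of C. This is a sketch with a made-up function name; it treats an access as qualifying only if both its first and last byte fall in the same aligned 32-byte region as the first access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// True if every access of `access_size` bytes lies within one
// aligned 32-byte region (the one containing the first access).
bool in_one_aligned_32b_region(const uint64_t* addrs, size_t n,
                               unsigned access_size) {
  assert(access_size >= 1 && access_size <= 32);
  if (n == 0) return true;
  const uint64_t mask = ~(uint64_t)31;
  const uint64_t region = addrs[0] & mask;
  for (size_t i = 0; i < n; ++i) {
    uint64_t last = addrs[i] + access_size - 1;
    if ((addrs[i] & mask) != region || (last & mask) != region) return false;
  }
  return true;
}
```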

We're almost done with SIMD memory instructions, but there's one more batch to go. flw.ps and flw[lg].ps are OK: they load from consecutive memory locations rather than being a gather, but still respect m0 as a lane-enable mask. The naming follows the usual pattern for L1D / L2 / L3. When lane-enable isn't required, there's instead flq2, which is OK. There are no L2 / L3 variants of flq2, but flq2 is exactly what a compiler might want for spilling registers to the stack, and the stack is per-minion, so L1D is fine for that. Each of these load instructions has a matching store instruction: fsw.ps, fsw[lg].ps, and fsq2 are all reported as OK by the test program, though code in the emulator suggests that A0 silicon doesn't support lane-enables for fsw[lg].ps. Finally, fbc.ps is OK: it loads one 32-bit value from memory, and then broadcasts it to all lanes (subject to m0).

For broadcasting immediates, fbci.pi and fbci.ps are OK: they both have a 20-bit immediate (as is conventional for RISC-V), but differ in how they expand that to 32 bits. Meanwhile, fbcx.ps is OK and broadcasts from a GPR rather than an immediate. As per usual, these instructions all respect m0 as a lane-enable mask, so using any of them with a one-hot mask acts like an insert rather than a broadcast. In the other direction, fmvs.x.ps and fmvz.x.ps are OK: they extract a single lane to a GPR, either sign-extending or zero-extending to get from 32 bits to 64 bits. For shuffling lanes around within a SIMD register, fswizz.ps is OK: it has an 8-bit immediate encoding an arbitrary four-lane shuffle, which is applied to lanes 0-3 and 4-7. For rearranging registers rather than lanes, fcmov.ps and fcmovm.ps are OK, performing conditional moves (or variable blends in SSE/AVX terms).
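The "one-hot mask turns a broadcast into an insert" observation can be modelled in a few lines. This sketch assumes that bit i of m0 enables lane i and that disabled lanes keep their previous contents; masked_broadcast is a hypothetical name for the model.

```c
#include <stdint.h>

// Host-side model of a broadcast under a lane-enable mask:
// enabled lanes receive `value`, disabled lanes are left untouched.
void masked_broadcast(uint32_t dst[8], uint32_t value, uint8_t m0) {
  for (int i = 0; i < 8; ++i) {
    if ((m0 >> i) & 1) dst[i] = value;
  }
}
```

Calling this with m0 = 0xff fills all eight lanes, whereas m0 = 1 << 3 changes only lane 3.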

Masks have been mentioned in passing, but also have a few dedicated instructions: maskand, maskor, maskxor, and masknot are all OK for performing bitwise manipulation of mask registers. For initializing a single mask register from a GPR or an immediate, mov.m.x is OK. Meanwhile, mova.x.m and mova.m.x are OK for doing bulk moves of all eight mask registers to / from one 64-bit GPR. There's no instruction for moving just one mask to a GPR, but maskpopc and maskpopcz are OK: they count the number of 1s or 0s in the mask and put the result in a GPR. The semantics of maskpopc.rast seem like a cute extension of that, but unfortunately the instruction is invalid.
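Here is a host-side model of the mask counting and bulk-move behaviour. The two count functions follow directly from the description; the byte order in pack_masks (mask register i in byte i of the GPR) is purely my assumption, as are the function names.

```c
#include <stdint.h>

// Number of 1 bits in an 8-bit mask (one bit per SIMD lane).
unsigned maskpopc(uint8_t m) {
  unsigned n = 0;
  for (; m != 0; m &= (uint8_t)(m - 1)) ++n;
  return n;
}

// Number of 0 bits in an 8-bit mask.
unsigned maskpopcz(uint8_t m) {
  return 8 - maskpopc(m);
}

// Eight 8-bit masks packed into one 64-bit value.
// ASSUMPTION: mask register i occupies byte i.
uint64_t pack_masks(const uint8_t m[8]) {
  uint64_t x = 0;
  for (int i = 0; i < 8; ++i) x |= (uint64_t)m[i] << (8 * i);
  return x;
}
```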

At long last, we get to SIMD arithmetic. Starting with 32-bit integers, fand.pi, fandi.pi, for.pi, fxor.pi, fnot.pi, fadd.pi, faddi.pi, fsub.pi, fmul.pi, fmulh.pi, and fmulhu.pi are all OK, each being an 8x32b SIMD equivalent to a corresponding scalar instruction. Also OK are fmin.pi, fminu.pi, fmax.pi, and fmaxu.pi, which don't have scalar equivalents in ET-SoC-1, but do the obvious thing. There are bitwise shifts in the form of fsll.pi, fslli.pi, fsra.pi, fsrai.pi, fsrl.pi, and fsrli.pi, which are all OK, but have a subtle difference to their scalar counterparts: the shift amount isn't taken mod 32, so a shift amount greater than 31 causes the entire original value to be shifted out. The instruction fslloi.pi is invalid, but doesn't appear in the manual nor in the simulator, so I infer that it's a shift purely from its name. Next up, fsat8.pi and fsatu8.pi are both OK, with semantics of clamping each lane to the limits of int8_t or uint8_t, and then zero-extending from 8 bits back up to 32 bits. Even more specialised are fpackreph.pi and fpackrepb.pi, which are both OK, taking the low 16 or 8 bits of each lane, concatenating them to form 128 or 64 bits, then broadcasting that back up to 256 bits. To wrap up the section, fdiv.pi, fdivu.pi, frem.pi, and fremu.pi are all emulated.
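The shift and saturation semantics are easy to get subtly wrong, so here is a per-lane host-side sketch of them as described above. The function names are hypothetical, and treating the input of the unsigned saturate as a signed 32-bit value is my assumption.

```c
#include <stdint.h>

// Scalar-style 32-bit shift: the amount is taken mod 32.
uint32_t scalar_sllw_model(uint32_t x, uint32_t amt) {
  return x << (amt & 31);
}

// SIMD-style shifts: amounts above 31 shift everything out.
uint32_t simd_sll_lane(uint32_t x, uint32_t amt) {
  return amt > 31 ? 0 : x << amt;
}
uint32_t simd_srl_lane(uint32_t x, uint32_t amt) {
  return amt > 31 ? 0 : x >> amt;
}

// fsat8.pi-style: clamp to int8_t range, then zero-extend the low 8 bits.
uint32_t fsat8_lane(int32_t x) {
  if (x < -128) x = -128;
  if (x > 127) x = 127;
  return (uint32_t)(uint8_t)x;
}

// fsatu8.pi-style: clamp to uint8_t range (input assumed signed).
uint32_t fsatu8_lane(int32_t x) {
  if (x < 0) x = 0;
  if (x > 255) x = 255;
  return (uint32_t)x;
}
```

Note that fsat8_lane(-200) is 0x80 rather than a negative number, because the clamped value is zero-extended.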

For integer SIMD comparisons, feq.pi, fle.pi, flt.pi, and fltu.pi are all OK, performing lane-wise comparisons and then placing the result in another SIMD register where each lane is either -1 (comparison true) or 0 (comparison false). For results instead in a mask register, fltm.pi and fsetm.pi are both OK: the former performing signed less-than, and the latter checking for not-equal-to-zero.
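The two result styles can be contrasted with a small host-side model. The function names mirror the instructions but are just models, and the mapping of lane i to bit i of the mask result is an assumption.

```c
#include <stdint.h>

// flt.pi-style: each result lane is -1 if a < b (signed), else 0.
void flt_pi(int32_t dst[8], const int32_t a[8], const int32_t b[8]) {
  for (int i = 0; i < 8; ++i) dst[i] = a[i] < b[i] ? -1 : 0;
}

// fltm.pi-style: same comparison, but one mask bit per lane.
uint8_t fltm_pi(const int32_t a[8], const int32_t b[8]) {
  uint8_t m = 0;
  for (int i = 0; i < 8; ++i) m |= (uint8_t)((a[i] < b[i]) << i);
  return m;
}

// fsetm.pi-style: mask bit set for every lane that is not zero.
uint8_t fsetm_pi(const int32_t a[8]) {
  uint8_t m = 0;
  for (int i = 0; i < 8; ++i) m |= (uint8_t)((a[i] != 0) << i);
  return m;
}
```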

We then reach FP32 SIMD arithmetic, with fadd.ps, fsub.ps, fmul.ps, fmin.ps, fmax.ps, fmadd.ps, fmsub.ps, fnmadd.ps, fnmsub.ps, fsgnj.ps, fsgnjn.ps, fsgnjx.ps and fclass.ps all being OK as obvious 8x32b SIMD equivalents to scalar instructions from the F extension. Just like in F, fdiv.ps and fsqrt.ps are emulated. To aid with that emulation, frcp.ps is OK, computing the approximate reciprocal with at most 1 ULP of approximation error. The similarly-named frcp.fix.rast is however invalid. Continuing the approximate theme, fexp.ps and flog.ps are both OK, computing the base-2 exponential or logarithm with at most 1 ULP of approximation error, but then fsin.ps and frsq.ps (reciprocal square root) are both emulated. Completing this section, fround.ps and ffrc.ps are both OK: the former rounds a floating-point value to have a zero fractional component, and the latter gives just the fractional component.

FP32 SIMD comparisons are no surprise given the integer SIMD comparisons. feq.ps and flt.ps and fle.ps are all OK, with results as a SIMD register. Their variants feqm.ps and fltm.ps and flem.ps are also OK, this time with results in a mask register.

When it comes to SIMD data type conversions, fcvt.f16.ps and fcvt.pw.ps and fcvt.pwu.ps are all OK, as are their inverses fcvt.ps.f16 and fcvt.ps.pw and fcvt.ps.pwu. All other SIMD variants of fcvt are invalid.

The remaining SIMD instructions are cubeface.ps, cubefaceidx.ps, cubesgnsc.ps, and cubesgntc.ps, all of which are invalid.

That concludes all the RISC-V instructions of the ET-SoC-1's minion CPU cores. However, RISC-V instructions aren't the full story, as additional specialised functionality is made available via dedicated CSRs:

Tensor operations are all expressed via CSR writes.
Most cache control operations (such as prefetching and evicting) are expressed via CSR writes, though some are instead ESR writes.
Synchronization operations (fast local barriers and fast credit counters) are also expressed as a combination of CSRs and ESRs.

All of this specialised functionality is worthy of study, but I've got to draw a line somewhere: functionality exposed via CSRs will have to wait for a future post.