corsix.org

Previously, in part 6, I took a deep dive into Tensix Vector, but if you're buying Tenstorrent hardware, you almost certainly want fast matrix multiplication. Furthermore, if that is your view, then everything else on the chip is secondary to matrix multiplication: SRAM and DRAM are for storing matrices close to the compute logic, Tensix Vector is for operating on matrix multiplication results without having to send them back to the host over PCIe, and so on.

Back in part 5, we saw one representation of a T tile, emphasizing the RISC-V cores. A different representation of the same thing instead emphasizes Tensix Matrix and the data paths to/from it:

The above diagram requires a few remarks:

Part 5's diagram shows Tensix instruction pipes and Tensix Sync between the "T" RV cores and the Tensix execution resources; those things do exist, but I've elided them from today's diagram.
Any of the various Tensix execution resources can be instructed from any of the "T0" / "T1" / "T2" RV cores, but today's diagram shows the traditional setup whereby "T0" instructs Tensix Unpack, "T1" instructs Tensix Matrix and Tensix Vector, and "T2" instructs Tensix Pack.
Similarly, any RV core can issue read or write commands to either of the NoC routers, but today's diagram shows the simplest possible convention of a 1:1 relationship between RV cores and NoC routers (the recently merged tt-metal#13376 enables more complex relationships).
The thin arrows in and out of the NoC routers are the various tile-to-tile connections we saw all the way back in part 1.

The focus might be on Tensix Matrix, but the data path between Tensix Matrix and L1 goes through Tensix Unpack and Tensix Pack. The very quick summary of these units is that they can perform some light data type conversion and some light reshaping, but mainly just shovel data between L1 and SrcB / SrcA / Dst. The bidirectional arrow on the diagram between Tensix Pack and L1 is there because Tensix Pack can perform either L1 = Dst or L1 += Dst, with the latter eliminating some of the need for Tensix Unpack to write to Dst. Tensix Pack can also perform some flavors of ReLU, though only as L1 = ReLU(Dst) or L1 += ReLU(Dst), and not as L1 = ReLU(L1 + Dst).

The sub-32-bit types that Tensix Unpack can write to SrcA and SrcB include:

With that interlude complete, we can finally look at Tensix Matrix. It is capable of a few different operations, but the big one is matrix multiplication, doing Dst += SrcB @ SrcA, where Dst and SrcB are both 8x16 and SrcA is 16x16:

The TT-Metal API tends to expose 32x32 blocks, matmul for which can be built out of the 8x16 matmul, one possible arrangement being:

For the primitive 8x16 operation, what gets applied to each scalar element of Dst is d = d + a₀·b₀ + a₁·b₁ + ⋯ + a₁₅·b₁₅, where a_i ranges over a column of SrcA and b_i ranges over a row of SrcB. For a 32x32 block, the primitive 8x16 operation needs to be applied 16 times in total (because there are 8 chunks of Dst, and each chunk requires a 32-element dot product rather than just a 16-element dot product).

The scalar + and · in the previous paragraph are floating-point addition and multiplication. If you're not a hardware designer, you might view floating-point addition and multiplication as magical primitive operations provided by hardware. If you are a hardware designer, you'll know that nothing is magic: floating-point addition and multiplication need to eventually be implemented via a mixture of integer comparisons, shifts, additions, and multiplications. To really reinforce the point that nothing is magic, I can show you code for bf16 fused-multiply-add implemented using just integer operations. Any actual hardware would be described in (System)Verilog rather than C, but there are more people who can read C than Verilog. If you look through that code, you'll see a few major pieces:

Decompose the bf16 inputs to their sign / exponent / mantissa (lines 20-21).
For normal numbers, re-attach the implicit one bit at the top of the mantissa (line 21). For denormal numbers, use __builtin_clz and << to find and reposition their most significant one bit (lines 22-24).
Check for any input being NaN or infinity, and return an appropriate NaN or infinity if so (lines 33-45).
Perform the multiply: xor of signs, integer addition of exponents, integer multiplication of mantissas (lines 31, 47-49).
Perform the addition: determine the largest exponent, use >> on the mantissa of the other value to equalize the exponents, then integer add or subtract the mantissas (lines 51-73).
Turn the result back into bf16: __builtin_clz and << or >> to get exactly 10 bits of mantissa, clamp exponent to the range 0-255 (possibly another >> here if the value is denormal), then pack sign and exponent and mantissa into 16 bits (lines 75-85).
Considering all the bits thrown away by >> in previous steps, perform rounding on the result (rounding to nearest, ties to even). This is just line 88, though it relies heavily on p_m <<= 3, z_m <<= 3 and on sticky_shift in earlier steps.

If you're aiming for full IEEE754 conformance, then all of the above steps are required. If instead you're designing hardware for AI, then any step that can be removed and/or simplified is a massive win: each Tensix Matrix unit on a Wormhole chip wants to perform 2048 floating-point multiplies and 2048 floating-point additions per cycle, and there are 80 Tensix Matrix units on each chip, so any cost savings in either multiplication or addition get amplified 16,3840 times. Even a small simplification becomes noticable when amplified that much. I don't know exactly what simplifications Tenstorrent have opted for, but there are potential savings in every step:

Perform decomposition somewhere else (e.g. in Tensix Unpack rather than Tensix Matrix).
Treat all denormals as zero, thus saving on __builtin_clz and <<.
Correctly handle only some cases of inputs being NaN or infinity.
Use only some of the mantissa for multiplication rather than all of it.
If adding more than two values together, determine the largest exponent of all of them, then use >> to equalize all their exponents at once (some academic work calls this a Class IV multi-term adder). All the mantissas can then be added together using a multiple-input integer adder, rather than a tree of two-input adders.
If the result is going to immediately feed into step 1 of another multiply/add, skip step 6 and the subsequent step 1. Additionally, if the result is denormal, treat it as zero to save on >>.
Use a cheaper rounding mode: either rounding toward zero (no rounding bits required), or rounding to nearest with ties away from zero (just one rounding bit required).

Treating denormals as zero on input (step 2) and on output (step 6) are relatively common simplifications; usually even CPUs have options for it. The more unusual simplifications are in steps 4 and 5. For step 4, if you want to be able to multiply a variety of floating-point formats between bfp2 and tf32, then you'd normally need an 11b by 11b multiplier for this step:

That said, fp8 (e5m2) only requires 3b by 3b multiply, and bf16 only requires 8b by 8b, so the full 11b by 11b might feel wasteful if lower-precision operations the majority of your workload. The particular simplification that Tenstorrent have opted for in Wormhole is a 7b by 5b multiplier, which can be invoked up to four times to cover (almost) the full range:

This has a number of interesting implications:

All of bfp2 / bfp4 / fp8 (e5m2) can get away with just LoFi(delity).
Bfp8 requires LoFi + HiFi2 to fully consume all the mantissa bits, but you could opt for just LoFi if throwing away two bits from SrcA is acceptable.
Bf16 requires all of LoFi + HiFi2 + HiFi3 + HiFi4 to consume all the mantissa bits, but only a few bits are consumed by HiFi4, so it could reasonably be skipped, and not very much is consumed by HiFi3, so it too could possibly be skipped.
Fp16 requires all of LoFi + HiFi2 + HiFi3 + HiFi4, and even then, one bit from SrcA isn't consumed.
If more than just LoFi is used, there's an extra rounding step (to the precision of Dst) between each fidelity stage.
The programmer chooses how many fidelity stages their workload requires.

At this point, we can cross-check with the advertised TFLOP/s numbers. We'd expect 4.096 TFLOP/s from each Tensix Matrix unit when LoFi is in use (e.g. for fp8), half that when LoFi + HiFi2 is in use (e.g. for bfp8), and half that again when all of LoFi + HiFi2 + HiFi3 + HiFi4 are required (e.g. for fp16). With 72 usable Tensix Matrix units on an n150s board and 128 usable on an n300s board, this would mean:

	n150s	n300s
Just LoFi, e.g. fp8 (e5m2)	294.9 TFLOP/s	524.3 TFLOP/s
LoFi+HiFi2, e.g. bfp8	147.5 TFLOP/s	262.1 TFLOP/s
LoFi+HiFi2+HiFi3+HiFi4, e.g. fp16	73.7 TFLOP/s	131.1 TFLOP/s

Meanwhile, the one-pager from the Tenstorrent sales page reports:

The bfp8 and fp16 numbers are exactly as expected. The oddity is the fp8 number, which is only 88.9% (i.e. sixteen eighteenths) of expected. This suggests that at such low precision, the bottleneck becomes data transfer (e.g. 16 cycles to multiply a 32x32 tile, but 18 cycles to get the data in or get the data out).

To hit these TFLOP/s numbers, "all" you have to do is write code for one of the RV "T" cores to issue a matrix multiplication instruction every cycle, and write code for the other four RV cores to keep data flowing in and out at the right rate. The Macro-Op Expander and Replay Expander seen in part 5 help with this task: the expanders can output a Tensix instruction every cycle without themselves needing to be instructed every cycle, giving the RV cores a bit of time to do other things (such as control flow). Then repeat this 72 times for an n150s board, or 128 times for an n300s board. In either case, that'll result in several hundred RV cores busily working away!

That wraps up part 7; there will likely be a long gap before the next part due to a certain factory game expansion pack being released next week.

Back in part 4, we were considering the entire Wormhole PCIe card, and then in part 5 we zoomed in on a single T tile. Today I'm going to zoom in even more, looking at the box that part 5 labelled "Tensix Vector (SFPU)". To draw a very rough analogy to GPU graphics programming, Tensix Unpack/Matrix/Pack are somewhat like a (configurable) fixed-function pipeline, whereas Tensix Vector can execute arbitrary shader programs. To instead draw a very rough analogy to GPU AI programming, Tensix Unpack/Matrix/Pack are like tensor cores, whereas Tensix Vector is like CUDA cores. That said, neither analogy is entirely accurate, as fundamentally this hardware is trying to be its own thing rather than trying to be a GPU. Continuing the AI theme, Unpack/Matrix/Pack can execute (amongst other things) linear layers consisting of matrix multiplication, optionally adding a bias, and then optionally ReLU, but once you stray too much beyond this, you'll need to pull in Tensix Vector. Tanh? Tensix Vector. Dropout? Tensix Vector. Cumsum? You guessed it, Tensix Vector.

The Tenstorrent documentation and code often refer to Tensix Vector as "SFPU", but I'll stick to calling it Tensix Vector. The hardware is wrapped with an API/toolchain/compiler called SFPI, which has an associated documentation page. I'll try to explain the raw underlying hardware, though I'll occasionally make reference to things the SFPI toolchain does. The documentation makes reference to an emulator in main.cc, which I can't find per se, but sfpu.cc gets somewhat close. Unfortunately, it operates at the level of a partially-lowered compiler IR, so some interpretation is required to map between that IR and actual machine instructions. Speaking of machine instructions, we saw the general layout of Tensix instructions in part 5. As a reminder, these are disjoint from RISC-V instructions, so there's no relation between the RISC-V "V" (for Vector) extension and Tensix Vector, and the general instruction layout is:

Whereas RISC-V "V" tries to present arbitrary-length vectors, Tensix Vector is a good old SIMD instruction set, like AArch64 NEON or x86 SSE/AVX, with 32 SIMD lanes in Wormhole. Each lane consists of 32 bits, which depending on the instruction are viewed as either fp32 or int32 or signmag32.

With the introduction done, we can start to get into the details. The remainder of this post gets very low-level and very dense, so skip it if that's not for you.

Execution environment

Before diving into the Tensix Vector instructions, it is useful to consider the environment in which the instructions execute. The important parts are:

	Size
Vector registers (`L0`-`L7`)	8 registers, 32 lanes per register, 32b per lane
Fixed constants	4 values, 32b each
Programmable constants	4 "constants", 8 lanes per constant, 32b per lane
Flags active	1b
Per lane flags	32 lanes, 1b per lane
Flag stack	Between 0 and 8 entries, (1+32×1)b per entry
PRNG	32 lanes, 32b LFSR per lane (with caveats)
`Dst`	Either 512 rows, 16 lanes per row, 32b per lane Or 1024 rows, 16 lanes per row, 16b per lane
`RWC_Dst`	10b

The vector registers are called L0 through L7, which is a somewhat poor choice of naming scheme, given that L1 could easily instead refer to the 1464 KiB of SRAM on each tile. Considering the vector registers and the constants all together, there are 16 possible operands, which are encoded into 4b fields in instructions like so:

Encoding	Meaning
`0` - `7`	Vector Registers `L0` - `L7`
`8`	Fixed Constant 0.8373 (bit pattern `0x3F56594B`)
`9`	Fixed Constant 0.0 (bit pattern `0x00000000`)
`10`	Fixed Constant 1.0 (bit pattern `0x3F800000`)
`11`	Programmable constant, though toolchain requires it to be -1.0
`12`	Programmable constant (`vConstIntPrgm0` / `vConstFloatPrgm0`)
`13`	Programmable constant (`vConstIntPrgm1` / `vConstFloatPrgm1`)
`14`	Programmable constant (`vConstIntPrgm2` / `vConstFloatPrgm2`)
`15`	Fixed Constant `lane_number << 1` (i.e. 0, 2, 4, ..., 62)

The programmable constants are set using the SFPCONFIG instruction, which we'll cover later. The toolchain exposes two names for each, differing in type, but they're backed by the same storage. The programmable constants usually have the same value in all eight of their lanes, but in the event that the lanes end up with different values, four copies of the constant are stacked horizontally to form 32 lanes. The fixed constants 0.8373 and 0.0 and 1.0 have the same value in every lane, and then the final fixed constant has a different value in every lane.

Next up are flags. Flags can be active or inactive. If flags are active, then there is a 1b flag per lane controlling whether that lane is enabled. Initially all lanes are enabled, and then various instructions can "refine" the per-lane flag: lanes which fail the condition switch from enabled to disabled (whilst previously disabled lanes remain disabled, with neither their contents nor their flags being updated). The toolchain exposes refinement through the v_and macro. If flags are inactive, then all lanes are enabled regardless of the 1b flag per lane. There is also a stack on to which all this state can be pushed and then later popped. Contrast this to Apple G13: there each lane has a disabled counter rather than a stack of flags.

We then find some PRNG state, which can optionally be used for stochastic rounding. The seeding mechanism leaves something to be desired though, as does the state update function, so I'd recommend avoiding the PRNG if you care your random numbers having high quality and low correlation.

The final notable part of the execution environment is Dst: the large 2D piece of memory that the Tensix Matrix unit writes the result of matrix operations to. The rows of this memory are 16 scalars wide, the combination of 16 rows is typically called a 16x16 face (which is what a lot of the LLK code operates on), and then the combination of four such faces is typically called a 32x32 tile (which is what the TT-Metal API exposes). Expressed differently, 64 rows of Dst are required for holding a 32x32 tile. The SFPLOAD and SFPSTORE instructions transfer data between a single vector register and some rows of Dst (they do not transfer between a vector register and main memory!), with the target rows determined by the summation of an immediate operand to SFPLOAD / SFPSTORE and the RWC_Dst variable, taken modulo the number of rows of Dst (512 when it is holding 32b scalars, 1024 when holding 16b scalars). The toolchain exposes RWC_Dst via the slightly questionable syntax dst_reg++.

Notation

I'll use VD to mean an arbitrary vector register used as the output (and often also an input) of an instruction. I'll use VA / VB / VC to mean arbitrary vector registers or constants used as inputs to an instruction. When instructions operate on fixed registers, I'll use the names L0 through L7. Scalar inputs that come from N bits within the instruction itself are referred to as ImmN. Signed immediates (in two's complement form) of N+1 bits will be ±ImmN, the range of which is -(2^N) through (+2^N)-1.

Some instructions can operate in several distinct modes, in which case they'll be listed multiple times in different sections and marked with (‡) each time.

Instruction encoding

The Mod0 family of encodings put a "VD" field at the top, then a modifier field, then immediates at the bottom:

Meanwhile, the Mod1 family of encodings put a modifier field at the bottom, then "VD", then other operands, then immediates at the top:

Each instruction links through to emulation code for that instruction, giving (my best guess of) its precise encoding and behaviour. In each case, the encoding will be one of the above, but the opcode varies by instruction, as does the interpretation of Mod0 / Mod1.

With the stage set, we can now take a brief look at all the instructions handled by Tensix Vector.

Int32 arithmetic and bitwise operations

We begin with some gentle integer instructions:

	Per-lane behaviour (int32)
`SFPIADD`	`VD = VC ± VD` or `VD = VC ± Imm11` Optionally refine flags based on `VD < 0` or inverse thereof
`SFPAND`	`VD &= VC`
`SFPOR`	`VD \|= VC`
`SFPXOR`	`VD ^= VC`
`SFPNOT`	`VD = ~VC`
`SFPLZ`	`VD = CountLeadingZeros(VC)` Optionally refine flags based on `VC != 0` or inverse thereof
`SFPABS` (‡)	`VD = Abs(VC)`
`SFPSHFT`	`VD = VD << (VC % 32)` or `VD = VD >> (-VC % 32)` or `VD = VD << Imm5` or `VD = VD >> -Imm5`
`SFPSHFT2` (‡)	`VD = VB << (VC % 32)` or `VD = VB >> (-VC % 32)`
`SFPSETCC`	Refine flags based on `VC != 0` or `VC < 0` or inverse of either

Nothing greatly surprising here, though it is a shame that so many instructions treat VD as both an input and an output (this isn't for lack of encoding space, as there's plently of that, and isn't for lack of register file ports, as SFPMAD requires three read ports and a write port, so I'm not sure of the rationale here). Shifts are all done modulo 32, with the sign of the shift amount determining whether the shift is left or right. Right shifts are always unsigned, though apparently Blackhole adds support for signed right shifts. There's also a somewhat insane variant of SFPSHFT2 that shifts VB by an immediate, but bits 12 through 15 specify both VB and (part of) the immediate, so the possible options there are L0 << 0, L1 << 1, L2 << 2, and so forth.

Flags are generally refined based on the sign or the zero-ness of the result. The conditions VC != 0 and VC < 0 are native, as are their inverses VC == 0 and VC >= 0. The non-native VC > 0 is achieved by refining on VC >= 0 and then refining on VC != 0. Its inverse (VC <= 0) is achieved by refining on VC >= 0 and then refining on VC != 0 and then issuing SFPCOMPC to invert the overall result. Three instructions for VC <= 0 isn't great, but again is addressed in Blackhole. Comparisons where the right hand side isn't zero are done by subtracting the two operands, and then comparing the subtraction result against zero. This causes < / <= / > / >= to do the wrong thing if overflow occurs during the subtraction, which is mildly concerning.

Flag stack

	Per-lane behaviour
`SFPENCC`	Configure whether flags are active, also set flags
`SFPPUSHC`	Push copy of flags on to flag stack
`SFPCOMPC`	Invert per-lane flags, using top of stack as context
`SFPPOPC`	Pop from flag stack into flags, or read top of stack into flags

The SFPENCC instruction is used to initialise the flags subsystem: it can set flags to active, and initialise the per-lane flags to either enable or disable all lanes.

SFPPUSHC and SFPPOPC mostly do what you'd expect. If SFPPUSHC is used more than eight times, then it'll start overwriting previous entries. The stack size counter is four bits, and it too will wrap if SFPPUSHC is used sixteen times. If SFPPOPC is used with the size counter equal to zero, then the counter will underflow to fiveteen, but the resultant flags state will always be all lanes active. I would not advise trying to do anything clever with stack underflow or overflow.

SFPCOMPC is slightly interesting: it inverts the per-lane flags, but does this subject to the the state on the top of the stack; lanes that would be disabled in that state are set to disabled rather than being inverted.

Fp32 field manipulation

Up next are some unconventional, though not unwelcome, instructions to manipulate the three fields of an IEEE754 float:

	Per-lane behaviour (fp32 sign/exponent/mantissa)
`SFPEXEXP`	`VD = VC.Exponent` or `VD = VC.Exponent - 127` Optionally refine flags based on `VD < 0` or inverse thereof
`SFPEXMAN`	`VD = { 0, !Imm1, VC.Mantissa}`
`SFPMOV` (‡)	`VD = {!VC.Sign, VC.Exponent, VC.Mantissa}`
`SFPSETSGN`	`VD = { VD.Sign, VC.Exponent, VC.Mantissa}` or `VD = { Imm1, VC.Exponent, VC.Mantissa}`
`SFPABS` (‡)	`VD = { 0, VC.Exponent, VC.Mantissa}`
`SFPSETEXP`	`VD = { VC.Sign, VD.Mantissa & 255, VC.Mantissa}` or `VD = { VC.Sign, VD.Exponent, VC.Mantissa}` or `VD = { VC.Sign, Imm8, VC.Mantissa}`
`SFPSETMAN`	`VD = { VC.Sign, VC.Exponent, VD.Mantissa}` or `VD = { VC.Sign, VC.Exponent, Imm12 << 11}`
`SFPDIVP2`	`VD = { VC.Sign, Imm8, VC.Mantissa}` or `VD = { VC.Sign, VC.Exponent ± Imm7, VC.Mantissa}`

There is no SFPEXSGN instruction, as integer instructions suffice for this: SFPSETCC can refine flags based on the sign bit, and SFPSHFT can do a right shift by 31 to extract just the sign bit.

The SFPDIVP2 instruction can perform addition/subtraction on the exponent field, thereby providing multiplication or division by a power of two, though the arithmetic will wrap around if it overflows, so some care is required. The only saving grace is that the VC.Exponent ± Imm7 form will leave VC untouched if it starts as ±Inf or ±NaN. If wrapping is a concern, use SFPMULI instead (described in the next section).

There is some overlap between instructions here; an absolute-value function can be built from SFPSETSGN, or SFPABS can be used for this. Similarly, one mode of SFPSETEXP is completely identical to one mode of SFPDIVP2.

Fp32 arithmetic

Then we reach the floating point multiply/add unit:

	Per-lane behaviour (fp32)
`SFPMUL`	`VD = VA * VB + 0`
`SFPADD`	`VD = 1 * VB + VC`
`SFPMAD`	`VD = VA * VB + VC`
`SFPMULI`	`VD *= Bf16ToFp32(Imm16)`
`SFPADDI`	`VD += Bf16ToFp32(Imm16)`
`SFPLUT`	`TmpA, TmpC = LUT({L0.Low16, L1.Low16, L2.Low16}, Abs(L3)) VD = TmpA * Abs(L3) + TmpC`
`SFPLUTFP32`	`TmpA, TmpC = LUT({L0, L1, L2, L4, L5, L6}, Abs(L3)) VD = TmpA * Abs(L3) + TmpC`

All of these instructions take two cycles, i.e. VD is not available until two cycles after the instruction is issued. An SFPNOP instruction must be inserted if the next instruction would otherwise want to consume VD (Blackhole relieves the compiler of this duty).

There is no fp32 subtract instruction; it is instead achieved by SFPMAD with VB set to -1.0. Most ISAs with a floating-point fused-multiply-add instruction have variants of the instruction to negate the result of the multiplication and/or the negate the addend, as doing so is incredibly cheap in hardware. This glaring omission is seemingly corrected in Blackhole.

The SFPADD instruction always has VA set to the constant 1.0 by the compiler, allowing hardware to treat SFPADD exactly like SFPMAD if it so desires. Similarly, SFPMUL always has VC set to the constant 0.0 by the compiler, allowing hardware to treat SFPMUL exactly like SFPMAD. The chip I'm playing with indeed treats SFPADD and SFPMUL exactly like SFPMAD, though future chips might be able to just add or just multiply faster than SFPMAD (e.g. Zen 4 takes four cycles for a multiply-add, but just 3 cycles for either a multiply or an add).

There are no dedicated fp32 comparison instructions (though see the min/max mode of SFPSWAP described later), as the integer SFPSETCC generally suffices, though this does mean that -NaN is considered less than -Inf and +Inf is considered less than +NaN. It would also mean that -0 is considered less than +0, but it looks like all arithmetic instructions normalize -0 to +0 (similarly, it looks like all denormal inputs are treated as zero and all denormal outputs are flushed to +0; see also Tenstorrent's statement on infinities and NaNs and denormals).

The unusual instructions are SFPLUT and SFPLUTFP32, which create a 3-element or 6-element table from various bits of L0/L1/L2 and optionally L4/L5/L6, then use the magnitude of L3 to determine which table element to use, extract TmpA and TmpC from said element, calculate VD = TmpA * Abs(L3) + TmpC, then optionally overwrite the sign of the result with the original sign of L3. These instructions allow for various unary functions to be approximated in a piecewise linear fashion (similar in spirit, though not at all in implementation, to genlut in Apple's AMX).

For SFPLUT, the table ranges are:

	`TmpA` (multiplicand)	`TmpC` (addend)
`Abs(L3) < 1.0`	`Fp8ToFp32((L0 >> 8) & 255)`	`Fp8ToFp32(L0 & 255)`
`1.0 ≤ Abs(L3) < 2.0`	`Fp8ToFp32((L1 >> 8) & 255)`	`Fp8ToFp32(L1 & 255)`
`2.0 ≤ Abs(L3)`	`Fp8ToFp32((L2 >> 8) & 255)`	`Fp8ToFp32(L2 & 255)`

Whereas for SFPLUTFP32 in mode FP16_3ENTRY_TABLE:

	`TmpA` (multiplicand)	`TmpC` (addend)
`Abs(L3) < 1.0`	`Fp16ToFp32(L0 >> 16)`	`Fp16ToFp32(L0 & 0xffff)`
`1.0 ≤ Abs(L3) < 2.0`	`Fp16ToFp32(L1 >> 16)`	`Fp16ToFp32(L1 & 0xffff)`
`2.0 ≤ Abs(L3)`	`Fp16ToFp32(L2 >> 16)`	`Fp16ToFp32(L2 & 0xffff)`

For SFPLUTFP32 in mode FP32_3ENTRY_TABLE:

	`TmpA` (multiplicand)	`TmpC` (addend)
`Abs(L3) < 1.0`	`L0`	`L4`
`1.0 ≤ Abs(L3) < 2.0`	`L1`	`L5`
`2.0 ≤ Abs(L3)`	`L2`	`L6`

For SFPLUTFP32 in mode FP16_6ENTRY_TABLE1:

	`TmpA` (multiplicand)	`TmpC` (addend)
`Abs(L3) < 0.5`	`Fp16ToFp32(L0 & 0xffff)`	`Fp16ToFp32(L4 & 0xffff)`
`0.5 ≤ Abs(L3) < 1.0`	`Fp16ToFp32(L0 >> 16)`	`Fp16ToFp32(L4 >> 16)`
`1.0 ≤ Abs(L3) < 1.5`	`Fp16ToFp32(L1 & 0xffff)`	`Fp16ToFp32(L5 & 0xffff)`
`1.5 ≤ Abs(L3) < 2.0`	`Fp16ToFp32(L1 >> 16)`	`Fp16ToFp32(L5 >> 16)`
`2.0 ≤ Abs(L3) < 3.0`	`Fp16ToFp32(L2 & 0xffff)`	`Fp16ToFp32(L6 & 0xffff)`
`3.0 ≤ Abs(L3)`	`Fp16ToFp32(L2 >> 16)`	`Fp16ToFp32(L6 >> 16)`

And finally SFPLUTFP32 in mode FP16_6ENTRY_TABLE2:

	`TmpA` (multiplicand)	`TmpC` (addend)
`Abs(L3) < 0.5`	`Fp16ToFp32(L0 & 0xffff)`	`Fp16ToFp32(L4 & 0xffff)`
`0.5 ≤ Abs(L3) < 1.0`	`Fp16ToFp32(L0 >> 16)`	`Fp16ToFp32(L4 >> 16)`
`1.0 ≤ Abs(L3) < 1.5`	`Fp16ToFp32(L1 & 0xffff)`	`Fp16ToFp32(L5 & 0xffff)`
`1.5 ≤ Abs(L3) < 2.0`	`Fp16ToFp32(L1 >> 16)`	`Fp16ToFp32(L5 >> 16)`
`2.0 ≤ Abs(L3) < 4.0`	`Fp16ToFp32(L2 & 0xffff)`	`Fp16ToFp32(L6 & 0xffff)`
`4.0 ≤ Abs(L3)`	`Fp16ToFp32(L2 >> 16)`	`Fp16ToFp32(L6 >> 16)`

Many of the instructions in this section also support a mode whereby the result of the instruction isn't written to VD, but is instead written to the register number in the low four bits of L7. This can be viewed as a kind of scatter operation. SFPMAD also supports a kind of gather operation: rather than reading from VA, the multiplicand input can be taken from the register number in the low four bits of L7.

Min / max / swap

	Per-lane behaviour (fp32 or signmag32)
`SFPSWAP` (‡)	`VD, VC = Min(VD, VC), Max(VD, VC)`
`SFPSWAP` (‡)	`VD, VC = VC, VD`

This instruction takes two cycles, possibly because it has two destinations and there's only one write port on the register file, and SFPSWAP must be followed by SFPNOP. When doing min / max, it uses the total ordering whereby -NaN < -Inf < finite negative values < -0 < +0 < finite positive values < +Inf < +NaN. The smaller of the two inputs ends up in VD, and the larger in VC. There are also variants which compute Min,Max for some groups of 8 lanes, and Max,Min for other groups of 8 lanes.

This is not an arithmetic instruction, so it does not flush denormals on input or on output. This means it can also be used for 32-bit integers in sign/magnitude form. The plain swap can also be used on int32 lanes.

Data type conversions to / from fp32

	Per-lane behaviour
`SFPSTOCHRND` (‡)	`VD = Fp32ToBf16(VC) << 16`
`SFPSTOCHRND` (‡)	`VD = Fp32ToTf32(VC)`
`SFPSTOCHRND` (‡)	`VD = Fp32ToInt32(Min(Abs(VC), 255))` or `VD = Fp32ToInt32(Min(Abs(VC), 65535))`
`SFPSTOCHRND` (‡)	`VD = Fp32ToSignMag32(Clamp(VC, ±127))` or `VD = Fp32ToSignMag32(Clamp(VC, ±32767))`
`SFPCAST`	`VD = SignMag32ToFp32(VC)`

All of the above support two rounding modes, either stochastic rounding or round to nearest (SFPSTOCHRND resolves ties away from zero, SFPCAST resolves ties to even, which seems like a strange discrepancy). The stochastic rounding relies on the hardware PRNG, though as mentioned in the introduction, the quality of its randomness is poor: adjacent vector lanes will have 30 out of 32 bits in common, and consecutive random values within a lane will have 31 out of 32 bits in common. This leads to significant correlation between random values if more than one random value is obtained.

The PRNG state can also be observed directly with an oddball variant of SFPMOV:

	Per-lane behaviour
`SFPMOV` (‡)	`VD = RandomInt32()`
`SFPNOP`	No-op, delay subsequent instructions by one cycle

SFPNOP is listed here as it is required for PRNG seeding: the seeding procedure involves writing the new seed to the PRNG_SEED::Seed_Val configuration register and then executing a bunch of SFPNOP instructions.

Rounding and clamping of sign / magnitude integers

	Per-lane behaviour (signmag32)
`SFPSTOCHRND` (‡)	`VD = Min(Round(Abs(VC) >> (VB % 32)), 255)` or `VD = Min(Round(Abs(VC) >> Imm5), 255)`
`SFPSTOCHRND` (‡)	`VD = Clamp(Round(VC >> (VB % 32)), ±127)` or `VD = Clamp(Round(VC >> Imm5), ±127)`

All of the above support two rounding modes, based on the shifted-out bits: either stochastic rounding or round to nearest with ties away from zero. The toolchain uses the names int32_to_uint8 and int32_to_int8 for these operations. The PRNG used for stochastic rounding is the same as in the previous section.

Note that the lane type here is signmag32: the high bit is a sign bit, and then the low 31 bits are a magnitude. When the magnitude is clamped, it stays in the low bits. Negative zero is allowed as an input, but is always normalised to +0 on output.

Constants

	Per-lane behaviour
`SFPLOADI`	`VD = Bf16ToFp32(Imm16)` or `VD = Fp16ToFp32(Imm16)` or `VD = Imm16` or `VD = ±Imm15` or `VD.High16 = Imm16` or `VD.Low16 = Imm16`
`SFPCONFIG` (‡)	`SelectedProgrammableConstant = L0[0:8]`

There are various modes of SFPLOADI for setting all lanes of a vector register to a 16-bit immediate. A 32-bit immediate can be formed by using two SFPLOADI instructions: Bf16ToFp32 or High16 to set the high 16 bits, and then Low16 to set just the low 16 bits. A selection of interesting 32-bit values can also be formed in a single cycle by using SFPSETSGN / SFPDIVP2 / SFPSETMAN with VC set to one of the fixed constants.

To load a value into one of the programmable constants, first use SFPLOADI to load it into all lanes of L0, then use SFPCONFIG to copy L0[0:8] into one of the programmable constants.

Cross-lane data movement

	Whole-vector behaviour
`SFPMOV` (‡)	`VD = VC`
`SFPSHFT2` (‡)	`L0, L1, L2, L3 = L1, L2, L3, Zeros` or `L0, L1, L2, L3 = L1, L2, L3, {L0[8:32], Zeros[0:8]}` or `L0, L1, L2, L3 = L1, L2, L3, RotateLanesRight(VC)`
`SFPSHFT2` (‡)	`VD = RotateLanesRight(VC)` or `VD = ShiftLanesRight(VC)`
`SFPTRANSP`	`Transpose(L0, L1, L2, L3); Transpose(L4, L5, L6, L7)`

The RotateLanesRight function rotates each group of eight lanes right by one lane, so VD = RotateLanesRight(VC) does VD[i] = VC[i&7 ? i-1 : i+7]. The similar VD = ShiftLanesRight(VC) is meant to do VD[i] = i&7 ? VC[i-1] : 0, but a hardware bug means that instead of every 8^th lane getting zero, it gets whatever the most recent RotateLanesRight wrote to that lane. Between this and the comedy mode that can do L0 << 0 or L1 << 1 or L2 << 2 etc, I get the impression that SFPSHFT2 was poorly specified and/or poorly tested. Hopefully it is all fixed in Blackhole.

The variants of SFPSHFT2 involving RotateLanesRight / ShiftLanesRight require two cycles to execute. If it weren't for this, the variant of SFPSHFT2 which moves zeros to L3 would be redundant, as it could be implemented with the RotateLanesRight variant with constant zero as VC.

Meanwhile, SFPTRANSP causes the following transformation:

	`L0`	`L1`	`L2`	`L3`
`[ 0: 8]`	`L0[0:8]`	`L0[8:16]`	`L0[16:24]`	`L0[24:32]`
`[ 8:16]`	`L1[0:8]`	`L1[8:16]`	`L1[16:24]`	`L1[24:32]`
`[16:24]`	`L2[0:8]`	`L2[8:16]`	`L2[16:24]`	`L2[24:32]`
`[24:32]`	`L3[0:8]`	`L3[8:16]`	`L3[16:24]`	`L3[24:32]`

	`L4`	`L5`	`L6`	`L7`
`[ 0: 8]`	`L4[0:8]`	`L4[8:16]`	`L4[16:24]`	`L4[24:32]`
`[ 8:16]`	`L5[0:8]`	`L5[8:16]`	`L5[16:24]`	`L5[24:32]`
`[16:24]`	`L6[0:8]`	`L6[8:16]`	`L6[16:24]`	`L6[24:32]`
`[24:32]`	`L7[0:8]`	`L7[8:16]`	`L7[16:24]`	`L7[24:32]`

The naïve implementation of this instruction would either require 8 cycles to execute, or require a register file with 8 write ports. Neither of these things seems likely, so perhaps what we're seeing is 8x 32b as the fundamental unit of storage, L0/L1/L2/L3 being backed by 16 units of storage, and the SFPTRANSP instruction flipping how L0/L1/L2/L3 map on to that storage (ditto L4/L5/L6/L7, and their backing 16 units of storage). The modes of SFPSHFT2 which write to all four of L0 through L3 might pull a similar trick; actually writing to one register and just shuffling indices for the others.

Transfer between Dst and vector registers

At long last, we reach the means of getting data in and out of the vector world:

	Whole-vector behaviour
`SFPLOAD`	`VD = Dst[R:R+4, 0:15:2]` or `VD = Dst[R:R+4, 1:16:2]`
`SFPSTORE`	`Dst[R:R+4, 0:15:2] = VD` or `Dst[R:R+4, 1:16:2] = VD`

Given that rows of Dst have 16 lanes, and vector registers have 32 lanes, you might expect SFPLOAD / SFPSTORE to reference two rows of Dst at a time. This is not the case; they instead reference half of four rows at a time. With Imm10 denoting the 10-bit immediate in SFPLOAD / SFPSTORE, the initial row R is (RWC_Dst + Imm10) & 0x3fc. If (RWC_Dst + Imm10) & 2 is zero, then the even columns of Dst[R:R+4] are referenced, whereas if (RWC_Dst + Imm10) & 2 is non-zero, then the odd columns of Dst[R:R+4] are referenced. Row R corresponds to vector lanes [0:8], R+1 to [8:16], R+2 to [16:24], and R+3 to [24:32], which neatly matches up with some of the cross-lane data movement instructions.

SFPLOAD / SFPSTORE can also increment RWC_Dst after performing the data transfer. The mechanism for this is somewhat involved:

Each "T" core has 8 addr_mod_dest configuration registers, containing amongst other things an incr field (the other fields are important too, but I don't fully understand them yet).
Three bits are required to select which addr_mod_dest to use: the low two bits come from the SFPLOAD / SFPSTORE instruction, whereas the high bit comes from the per "T" core ADDR_MOD_SET::Base configuration register.

The SETRWC and INCRWC instructions can also be used to modify RWC_Dst. Furthermore, these instructions can also modify RWC_SrcA, RWC_SrcB, and RWC_Fidelity; Tensix Matrix instructions make use of all of these, but Tensix Vector only needs RWC_Dst. Meanwhile, Tensix Pack and Unpack use totally different sets of counters for selecting their memory locations and their Dst / SrcA / SrcB locations.

When SFPLOAD / SFPSTORE access Dst, the lane width of Dst is either 16 bits per lane or 32 bits per lane, controlled by the ALU_ACC_CTRL_SFPU::Fp32_enabled configuration register. A data type conversion is also performed as part of the access, revealing a variety of possible formats for the lanes of Dst:

`Dst` lane type	↔	Vector lane type
fp32	↔	fp32
fp16 (with slightly permuted bits)	↔	fp32
bf16	↔	fp32
int32	↔	int32
signmag32	↔	int32
signmag8 (permuted and packed into 16 bits)	↔	signmag32
signmag11 (permuted and packed into 16 bits)	↔	int32
signmag16 (with slightly permuted bits)	↔	signmag16 (in half a lane)

If SFPLOAD / SFPSTORE do not specify a data type conversion, then the value of the ALU_FORMAT_SPEC_REG1::SrcB configuration register is used to infer the data type of Dst, and an appropriate conversion is chosen based on this. This is what the Tenstorrent documentation means when it says that on Wormhole, the destination register format is always determined by the runtime.

There's also a SFPLOADMACRO instruction, which is similar to SFPLOAD, but then executes a little pre-programmed instruction sequence. In part 5 we saw the Macro-Op Expander and the Replay Expander; SFPLOADMACRO is yet another mechanism for one instuction to expand to several, albeit limited in scope to Tensix Vector. I can only find one example of Tenstorrent code using this mechanism, which is enough to confirm its existence, but not enough for me to extrapolate further.

Conclusion

We've seen everything that Tensix Vector can do in Wormhole. Constructing useful high-level functionality from the low-level pieces is left as an exercise for the reader (or you can use what Tenstorrent have already built). That wraps up part 6; if you're reading along, then part 7 is next.

Previously, in part 4, we identified the 128 usable T tiles on my Wormhole n300s board. These tiles are the workhorse of the board, so it is about time we took a detailed look inside one of them. Ignoring all the NoC functionality, my best guess as to the contents of each T tile is this diagram:

Starting at the top of the diagram, we have 1464 KiB of SRAM, which is directly mapped into the tile-local address space starting at address 0. It is connected to lots of other components within the tile, and other tiles can also access it via NoC requests (again, I have not shown any of the NoC functionality on the above diagram). The advertised capacity is 1.5 MB of SRAM; if you were hoping for 1.5 MiB, then you'd need 72 KiB more than the 1464 KiB shown, but you can find that distributed across the tile (32 KiB in Dst, 30 KiB in the core-local RAMs, 4 KiB in SrcA, 4 KiB in SrcB, 1 KiB in Lreg, and so on).

Moving down a row, we have five RISC-V RV32IM cores, which I've labelled as "B", "T0", "T1", "T2", and "NC". Each core has 32 GPRs, each 32 bits wide, along with a 32-bit program counter. The RV32IM instruction set can be roughly split into three pieces: load/store, ALU (arithmetic operations, bitwise operations, and multiply and divide), and branches - these execution resources are shown on the diagram within each core. The host system can put whatever RISC-V machine code it desires in L1, and the RISC-V cores will happily execute it. Said code will have exclusive bare-metal control of the cores; there are no interrupts, no user-mode/kernel-mode split, no hypervisor, etc. The RISC-V cores execute completely independently (of each other, and of the host), though there are mechanisms to synchronize them.

Moving down another row, things start to get interesting. Firstly, each core has 2 KiB or 4 KiB of core-local RAM mapped into the address space starting at address 0xFFB00000. The C/C++ call stack is usually located here, thereby decreasing the load on L1, albeit with the trade-off that pointers into the stack cannot be meaningfully passed between cores nor used as the source or destination pointer for NoC requests. Next up, the "NC" core has 16 KiB of instruction RAM mapped into the address space starting at address 0xFFC00000, presumably again to reduce the load on L1. Finally, this row contains three "Tensix" instruction pipes, one attached to each "T" core. This is where we leave the world of standard RISC-V instructions, and enter the world of Tenstorrent special sauce. One way of describing Tensix would be a massive AI coprocessor glued on to the three "T" cores, with emphasis on the word massive: the assorted Tensix pieces occupy much more area and perform vastly more FLOPs than the RISC-V cores that drive them. We'll look at the Tensix instruction pipes in more detail later, but the quick summary is that they ingest Tensix instructions and output (slightly modified) Tensix instructions. Said instructions are 32 bits wide, but other than the width being the same, the Tensix instruction set is completely unrelated to any RISC-V instruction set. The Tensix instruction set is also evolving with each Tenstorrent generation; Grayskull is slightly different to Wormhole, which in turn is slightly different to Blackhole, and so on.

Moving down again, we hit "Tensix Sync". At least conceptually, this unit ingests Tensix instructions coming out of the three pipes, and dispatches Tensix instructions to the eight backend execution resources. A handful of instructions relating to synchronization of the three inbound pipes execute at "Tensix Sync", either manipulating the mutexes and semaphores within "Tensix Sync", or selectively pausing an inbound pipe until certain conditions are met. Instructions leaving "Tensix Sync" are tagged with which pipe they originated from, which is relevant information for most backend instructions.

The next row of the diagram contains the eight Tensix backend execution resources, from left to right: Scalar (often called ThCon), ThCfg, Unpack, Matrix (often called FPU), Pack, Vector (often called SFPU), TDMA, and Xmov. For AI workloads, the star of the show is the Matrix unit, which amongst other things can dispatch Dst[8,16] = SrcB[8,16] @ SrcA[16,16] every cycle (which involves 2048 individual multipliers, each 7b x 5b, followed by the equivalent of 2048 individual additions). To the left of Matrix is the Unpack unit, which moves values from memory (in a variety of data formats, including some block-float ones) into SrcA and SrcB, and then the Pack unit on the other side does the inverse: moving values from Dst back out to memory. Also of note is the Vector unit for performing 32-wide SIMD. This unit cannot directly access memory, but it can do transfers in both directions between Dst and the eight SIMD registers. This is suited to performing non-linear functions on the results of matrix multiplies prior to writing said results out to memory. The Matrix and Vector units are sometimes collectively called "Math". All of these units contain far more configuration parameters than can fit into a 32-bit instruction, so there are lots of configuration registers scattered about the place, along with Scalar and ThCfg units to help drive all this configuration. The Tensix Scalar unit also has a set of 64 32-bit GPRs per pipe, meaning that it contains more GPRs than all of the RISC-V cores in the tile do (3 times 64 versus 5 times 32).

The final row of the diagram I've labelled as "L0 ???", as the descriptions of several Tensix instructions mention an L0, but I'm not particularly confident as to its presence or size or functionality. If it exists, possibly it is a hardware-managed cache that all Tensix loads transparently go through, and Tensix stores can either target or skip and write directly to L1 (for when the stored values are less valuable than the pre-existing contents of the cache).

We can now look at some of the pieces in more detail.

Tensix Instruction Pipe

Each of the three Tensix instruction pipes looks something like this:

Tensix instructions enter at the top via two means. The conceptually simpler means is the MMIO box in the top right of the diagram; any "T" core can write a 32-bit value to address 0xFFE40000 to push a Tensix instruction into the pipe associated with that core. Said instructions are 32 bits wide, laid out as:

In contrast, 32-bit RISC-V instructions look totally different:

The Tensix opcode is 8 bits wide, but values ≥ 0xC0 aren't used, meaning that if a Tensix instruction is rotated left by two bits, it will never overlap with a 32-bit RISC-V instruction (it lands in the encoding space normally reserved for 16-bit RVC instructions, though not used for that purpose here):

This leads us to the box in the top left of the diagram: if a "T" core tries to execute an instruction whose low two bits are not 0b11, then the instruction bits will be rotated right by two and then treated as data to be written to the aforementioned 0xFFE40000. Regardless of the means of entry, once a Tensix instruction has entered the pipe, RISC-V execution and Tensix execution proceed completely independently of each other.

Next up, we hit the Macro-Op Expander, which is where the MOP_CFG(u16 zhi) and MOP(u1 template, u7 count1, u16 zlo) instructions execute (instructions other than MOP_CFG and MOP flow through the Macro-Op Expander unchanged). Of these, MOP_CFG just stores the 16-bit immediate to a 16-bit register within the expander, whereas MOP is the really interesting one; it causes the expander to run through one of the following templates:

Template 0	Template 1
`zmask = (zhi << 16) \| zlo; flags = mop_cfg[1]; for (i = 0; i <= count1; ++i) { if ((zmask & 1) == 0) { exec(mop_cfg[3]); if (flags & 0x02) { exec(mop_cfg[4]); exec(mop_cfg[5]); exec(mop_cfg[6]); } if (flags & 0x01) { exec(mop_cfg[2]); } } else { exec(mop_cfg[7]); if (flags & 0x01) { exec(mop_cfg[8]); } } zmask >>= 1; }`	i_count = mop_cfg[0] & 127; j_count = mop_cfg[1] & 127; j_inst = mop_cfg[5]; j_inst_flip = 0; if (mop_cfg[6] != NOP) { j_inst_flip = j_inst ^ mop_cfg[6]; j_count *= 2; } if (mop_cfg[2] == NOP && mop_cfg[3] != NOP) { if (i_count == 1 && j_count == 0) { i_count += 128; // Hardware bug } } for (i = 1; i <= i_count; ++i) { if (mop_cfg[2] != NOP) { exec(mop_cfg[2]); } for (j = 1; j <= j_count; ++j) { if (j != j_count) { exec(j_inst); } else if (i != i_count) { exec(mop_cfg[8]); } else { exec(mop_cfg[7]); } j_inst ^= j_inst_flip; } if (mop_cfg[3] != NOP) { exec(mop_cfg[3]); if (mop_cfg[4] != NOP) { exec(mop_cfg[4]); } } }

Template 0

Template 1

zmask = (zhi << 16) | zlo;
flags = mop_cfg[1];
for (i = 0; i <= count1; ++i) {
  if ((zmask & 1) == 0) {
    exec(mop_cfg[3]);
    if (flags & 0x02) {
      exec(mop_cfg[4]);
      exec(mop_cfg[5]);
      exec(mop_cfg[6]);
    }
    if (flags & 0x01) {
      exec(mop_cfg[2]);
    }
  } else {
    exec(mop_cfg[7]);
    if (flags & 0x01) {
      exec(mop_cfg[8]);
    }
  }
  zmask >>= 1;
}

i_count = mop_cfg[0] & 127;
j_count = mop_cfg[1] & 127;
j_inst = mop_cfg[5];
j_inst_flip = 0;
if (mop_cfg[6] != NOP) {
  j_inst_flip = j_inst ^ mop_cfg[6];
  j_count *= 2;
}
if (mop_cfg[2] == NOP && mop_cfg[3] != NOP) {
  if (i_count == 1 && j_count == 0) {
    i_count += 128; // Hardware bug
  }
}
for (i = 1; i <= i_count; ++i) {
  if (mop_cfg[2] != NOP) {
    exec(mop_cfg[2]);
  }
  for (j = 1; j <= j_count; ++j) {
    if (j != j_count) {
      exec(j_inst);
    } else if (i != i_count) {
      exec(mop_cfg[8]);
    } else {
      exec(mop_cfg[7]);
    }
    j_inst ^= j_inst_flip;
  }
  if (mop_cfg[3] != NOP) {
    exec(mop_cfg[3]);
    if (mop_cfg[4] != NOP) {
      exec(mop_cfg[4]);
    }
  }
}

Any call to exec(x) in the above causes the expander to output the Tensix instruction x. In this way, a single MOP instruction expands to a somewhat programmable sequence of instructions. The programmability comes from the immediate operands to MOP and the values stored in the mop_cfg registers. For the latter, each "T" core can set the mop_cfg registers of its associated pipe by writing to the uint32_t[9] starting at address 0xFFB80000.

Moving down a row in the diagram, we find a sneaky back door allowing the "B" core to inject Tensix instructions into any of the three pipes:

"B" core MMIO address	Semantics of 32-bit write
`0xFFE40000`	Push instruction into pipe associated with "T0"
`0xFFE50000`	Push instruction into pipe associated with "T1"
`0xFFE60000`	Push instruction into pipe associated with "T2"

This allows the "B" core to help initialize some of the state within the various Tensix units prior the "T" cores being turned on, but it probably isn't intended for much more than this.

Moving down to the final row, we hit the Replay Expander, which is where REPLAY(u5 idx, u5 len, u2 mode) instructions execute. The three possible modes of this instruction are:

Record: The next len instructions which enter the Replay Expander are swallowed up by the Replay Expander, and written to buffer[idx:idx+len].
Tee: The next len instructions which flow through the Replay Expander are written to buffer[idx:idx+len] in addition to flowing through.
Playback: The Replay Expander outputs buffer[idx:idx+len], one instruction at a time.

When not in Record mode, instructions other than REPLAY will flow through the Replay Expander unchanged (though the incoming stream is paused while Playback is in progress).

Tensix Sync

There are eight mutexes within this unit, each with four possible states:

Acquired by "T0" pipe
Acquired by "T1" pipe
Acquired by "T2" pipe
Released

Some instructions execute at Tensix Sync to manipulate these mutexes:

ATGETM(u3 mutex_index)

If the specified mutex is already acquired by the pipe on which ATGETM appeared, does nothing. Otherwise, pauses said pipe until the mutex is released, and then atomically acquires it for said pipe and unpauses the pipe.

ATRELM(u3 mutex_index)

If the specified mutex is already acquired by the pipe on which ATRELM appeared, then it is released. Otherwise, does nothing.

There are also eight semaphores within this unit, each having a four-bit counter value and a four-bit maximum value. Some instructions execute at Tensix Sync to manipulate these semaphores:

SEMINIT(u4 max, u4 ctr, u8 which_sems_mask)

Set the counter value and the maximum value of the specified semaphores to the given values.

SEMPOST(u8 which_sems_mask)

Increment the counter value of the specified semaphores, if not already equal to 15. Note that the upper limit is always 15; the maximum as set by SEMINIT is only used by SEMWAIT.

SEMGET(u8 which_sems_mask)

Decrement the counter value of the specified semaphores, if not already equal to zero.

SEMWAIT(u9 to_pause_mask, u8 which_sems_mask, u2 condition)

For as long as (any of) the specified semaphores have counter equal to zero (condition == 1) or have counter greater than or equal to their maximum (condition == 2), prevent the pipe on which SEMWAIT appeared from dispatching any instructions to the execution resources in to_pause_mask.

The "T" cores can also manipulate the semaphores via MMIO:

Reading from 0xFFE80020 + 4*i gives the counter value of semaphore i.
Writing 0 to 0xFFE80020 + 4*i does what SEMPOST(1u << i) would do.
Writing 1 to 0xFFE80020 + 4*i does what SEMGET(1u << i) would do.

One final instruction executes at Tensix Sync:

STALLWAIT(u9 to_pause_mask, u15 condition_mask)

Similar to SEMWAIT, but waits while (any of) assorted non-semaphore conditions are met. Said conditions can include various execution resources being busy, SrcA or SrcB being valid, and SrcA or SrcB being clear.

Any instructions not yet described will flow through Tensix Sync to one of the backend execution resources, though that flow can be paused while ATGETM or SEMWAIT or STALLWAIT are in progress.

Tensix Scalar (ThCon)

This unit contains 3x 64x 32-bit GPRs, the roles for which are typically statically assigned. Instructions manipulate the set of 64 GPRs corresponding to the pipe from which the instruction originally came. Each "T" core can also access its register set via MMIO to the uint32_t[64] starting at address 0xFFE00000.

Various ALU-style operations execute here to manipulate these GPRs:

SETDMAREG(u16 value, u1 mode, u6 gpr_idx, u1 lo_hi)

Sets the low 16 bits (lo_hi == 0) or high 16 bits (lo_hi == 1) of the specified GPR to the specified value, leaving the other bits unchanged. Does something totally different if mode == 1; consult the YAML for details.

ADDDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a + (b_is_const ? b : gprs[b]).

SUBDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a - (b_is_const ? b : gprs[b]).

MULDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = (gpr_a & 0xFFFF) * (b_is_const ? b : (gprs[b] & 0xFFFF)).
Note only low 16 bits of each input are used.

BITWOPDMAREG(u1 b_is_const, u2 op, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a &|^ (b_is_const ? b : gprs[b]),
where &|^ is & (op == 0) or | (op == 1) or ^ (op == 2).

CMPDMAREG(u1 b_is_const, u2 op, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a <==> (b_is_const ? b : gprs[b]),
where <==> is < (op == 1) or == (op == 2) or > (op == 0).

SHIFTDMAREG(u1 b_is_const, u1 op, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a <<>> (b_is_const ? b : gprs[b]),
where <<>> is << (op == 0) or >> (op == 1).

Then instructions to move between these GPRs and L0/L1:

LOADIND(u2 sz, u6 gpr_ofs, u1 lo_hi, u2 inc, u6 gpr_data, u6 gpr_base)

Loads from L1 to GPRs.
The L1 address is gpr_base*16 + ((gpr_ofs >> (lo_hi*16)) & 0xFFFF).
Various size modes:

sz == 3: Load 8 bits (high 24 bits of gpr_data unchanged).

sz == 2: Load 16 bits (high 16 bits of gpr_data unchanged).

sz == 1: Load 32 bits.

sz == 0: Load 128 bits (to four GPRs starting at gpr_data & 0x3c).

Also various options for incrementing after the load:

inc == 0: No auto-increment.

inc == 1: Increment the low/high 16 bits of gpr_ofs by 2.

inc == 2: Increment the low/high 16 bits of gpr_ofs by 4.

inc == 3: Increment the low/high 16 bits of gpr_ofs by 16.

STOREIND(u1 l1, u2 sz, u6 gpr_ofs, u1 lo_hi, u2 inc, u6 gpr_data, u6 gpr_base)

Stores from GPRs to L0/L1.
Other than the extra l1 operand, all operands as per LOADIND.

ATSWAP(u1 l1, u8 ofs_mask, u6 gpr_data, u6 gpr_base)

Does an atomic swap between GPRs and L0/L1 of up to 128 bits.
The L1 address is gpr_base*16. Four GPRs starting at gpr_data & 0x3c give 128 bits, which are partially swapped with the 128 bits at the L1 address: if bit i of ofs_mask is set, then bits i*16 through i*16+15 are swapped.

ATCAS(u1 l1, u4 set_val, u4 cmp_val, u2 ofs, u6 gpr_base)

Does an atomic compare/set against L0/L1. The logic is along the lines of:
uint32_t *word = gpr_base*16 + ofs*4;
retry:
atomic {
  if (*word != cmp_val) {
    goto retry; // Comparison failed
  }
  *word = set_val;
}
ATINCGET(u1 l1, u5 len, u2 ofs, u6 gpr_data, u6 gpr_base)

Does an atomic increment against L0/L1. The logic is along the lines of:
uint32_t *word = gpr_base*16 + ofs*4;
uint32_t incr_mask = (1u << (len + 1)) - 1;
atomic {
  uint32_t incremented = *word + gpr_data;
  gpr_data = *word;
  *word = (incremented & incr_mask) | (*word &~ incr_mask);
}
ATINCGETPTR(u1 l1, u1 no_incr, u5 incr_log2, u4 len, u2 ofs, u6 gpr_data, u6 gpr_base)

Does an atomic FIFO operation against L0/L1. The logic is along the lines of:
struct fifo_ctl_t {
  uint32_t rd;
  uint32_t wr;
  uint32_t pad[2];
} *fifo = gpr_base*16;
uint32_t *word = gpr_base*16 + ofs*4;
uint32_t fifo_capacity = 1u << (len - 1);
uint32_t fifo_mask = (1u << len) - 1;
retry:
atomic {
  if (ofs & 1) {
    uint32_t fifo_size = (fifo->wr - fifo->rd) & fifo_mask;
    if (fifo_size == fifo_capacity) {
      goto retry; // Cannot write to full FIFO
    }
  } else {
    if (fifo->rd == fifo->wr) {
      goto retry; // Cannot read from empty FIFO
    }
  }
  uint32_t incremented = *word + (!no_incr << incr_log2);
  gpr_data = *word;
  *word = (incremented & fifo_mask) | (*word &~ fifo_mask);
}

Two instructions move between GPRs and the 1 MiB range of address space starting at 0xFFB00000, though they cannot access the 2 KiB / 4 KiB core-local RAMs within this range:

LOADREG(u6 gpr_data, u18 ofs)

Does gpr_data = *(0xFFB00000 | (ofs << 2)).

STOREREG(u6 gpr_data, u18 ofs)

Does *(0xFFB00000 | (ofs << 2)) = gpr_data.

Configuration Registers

There are two broad categories of configuration registers:

261 per-pipe registers, each of which being between 1 and 16 bits wide, packed into 57x 16b per pipe (so 3x 57x 16b total). A packed 16b group is set using the SETC16(u6 idx, u16 val) instruction, which executes on the ThCfg unit. I have not found any MMIO region exposing these registers. Contents includes:
- CFG_STATE_ID::StateID
- DEST_TARGET_REG_CFG_MATH::Offset
- ADDR_MOD_SET::Base
- ADDR_MOD_{AB, DST, PACK, BIAS}_SEC[0-7]::*
- SRCA_SET::{Base, SetOvrdWithAddr}
- SRCB_SET::Base
- CLR_DVALID::{SrcA, SrcB}_Disable
- FIDELITY_BASE::Phase
- UNPACK_MISC_CFG::CfgContext{Offset, CntReset, CntInc}[01]
- NOC_OVERLAY_MSG_CLEAR::{StreamId, MsgNum}_[01]
- CG_CTRL_{EN, KICK}::*
- PERF_CNT_CMD::Cmd[0-3]{Start, Stop}
- ENABLE_ACC_STATS::Enable
- FPU_BIAS_SEL::Pointer
- FP16A_FORCE::Enable
248+26+39+174 unit-specific registers, each of which being between 1 and 32 bits wide, packed into (72+14+8+28)x 32b. There are two copies of each of these registers, with the per-pipe CFG_STATE_ID::StateID configuration register determining which copy is in use by a given pipe. Both copies are accessible via MMIO from the "B" or "T" cores, the 1^st as uint32_t[188] at 0xFFEF0000, and the 2^nd as uint32_t[188] at 0xFFEF02F0. A packed 32b group can be moved to / from a Tensix Scalar GPR using the RDCFG(u6 gpr, u8 idx) / WRCFG(u6 gpr, u1 wr128, u8 idx) instructions, and 8b-aligned subgroups can be manipulated using the RMWCIB[0-3](u8 mask, u8 bits, u8 idx) instructions. We have:
- 248 registers, packed into 72x 32b, that nominally live in Tensix Scalar, but mostly control other units. These can be set using REG2FLOP rather than WRCFG.
  - THCON_SEC[01]_REG0::TileDescriptor
  - THCON_SEC[01]_REG[189]::* for Tensix Pack?
  - THCON_SEC[01]_REG[23457]::* for Tensix Unpack?
  - THCON_SEC[01]_REG6::* for Tensix Xmov?
- 26 registers, packed into 14x 32b, for Tensix Unpack:
  - UNP[01]_ADDR_CTRL_XY_REG_[01]::[XY]stride
  - UNP[01]_ADDR_CTRL_ZW_REG_[01]::[ZW]stride
  - UNP[01]_ADDR_BASE_REG_[01]::Base
  - UNP[01]_FORCED_SHARED_EXP::shared_exp
  - UNP[01]_ADD_DEST_ADDR_CNTR::add_dest_addr_cntr
  - UNP0_BLOBS_Y_START_CNTX_{01,23}::blobs_y_start
- 39 registers, packed into 8x 32b, for Tensix Matrix and Tensix Vector:
  - ALU_FORMAT_SPEC_REG::{SrcA, SrcB, Dstacc}_{val, override}
  - ALU_FORMAT_SPEC_REG0::{SrcAUnsigned, SrcBUnsigned, SrcA}
  - ALU_FORMAT_SPEC_REG1::SrcB
  - ALU_FORMAT_SPEC_REG2::Dstacc
  - ALU_ROUNDING_MODE::{Fpu, Gasket, Packer}_srnd_en
  - ALU_ACC_CTRL::*
  - STACC_RELU::{ApplyRelu, ReluThreshold}
  - DISABLE_RISC_BP::*
  - ECC_SCRUBBER::*
  - STATE_RESET::EN
  - DEST_OFFSET::Enable
  - DEST_REGW_BASE::Base
  - INT_DESCALE::{Enable, Mode}
- 174 registers, packed into 28x 32b, for Tensix Pack:
  - PCK0_ADDR_CTRL_XY_REG_[01]::[XY]stride
  - PCK0_ADDR_CTRL_ZW_REG_[01]::[ZW]stride
  - PCK0_ADDR_BASE_REG_[01]::Base
  - PCK_DEST_RD_CTRL::*
  - PCK_EDGE_MODE::mode
  - PCK_EDGE_TILE_FACE_SET_SELECT::{select, enable}
  - PCK_EDGE_TILE_ROW_SET_SELECT::select
  - PCK_EDGE_OFFSET_SEC[0-3]::mask
  - PACK_COUNTERS_SEC[0-3]::*
  - PACK_CONCAT_MASK_SEC[0-3]::pack_concat_mask
  - TILE_ROW_SET_MAPPING_[0-3]::row_set_mapping_[0-15]
  - TILE_FACE_SET_MAPPING_[0-3]::face_set_mapping_[0-15]

I'm not going to make any attempt to explain the details of every configuration register, or really any configuration register, as that would take far too long.

General shape of low-level kernels

What we've seen so far should make Tenstorrent's low-level-kernels slightly more scrutable. Each LLK has an init step which configures the Macro-Op Expander and the Replay Expander and the Tensix Scalar GPRs and the relevant configuration registers, and then a runtime step which takes advantage of all that pre-programming. These LLKs are wrapped by things in Metalium's llk_api directory, which in turn are wrapped by things in Metalium's compute_kernel_api directory, which is the API that developers are meant to use.

The LLKs make use of various instructions not yet covered; you'll have to consult the mostly-accurate YAML file outlining every instruction, or the C header generated from that YAML for further details. The general pattern of that header is that TT_OP_X(...) generates the encoding of instruction X (e.g. for later MMIO use), TT_X(...) generates the encoding of X and immediately does an MMIO write to push it into the instruction pipe, and TTI_X(...) uses the T6 as RVC trick to generate the encoding of X and splat it into the RISC-V instruction stream (so TTI_X can be used instead of TT_X when all the operands are compile-time constants).

An obvious next step would be dissecting a matrix multiplication kernel to describe how it orchestrates the Unpack and Matrix and Pack units, but this post is long enough already, so it'll have to wait for another time. That wraps up part 5; if you're reading along, then part 6 is next.

Previously, in parts 2 and 3, I played around with the 1^st ASIC on my n300s board, but there are of course two Wormhole ASICs on the n300s board. As part reminder and part new information, we can augment the circuit board photo we saw in part 1 with a connectivity schematic:

Photo	Schematic

(Circuit board photo from Tenstorrent's own sales pages)

Each E tile can manage 100Gb ethernet, i.e. simultaneous transmit at 100Gb/s and receive at 100Gb/s. The 1^st ASIC has E0 and E1 connected to one QSFP-DD cage, E6 and E7 connected to the other QSFP-DD cage, E8 and E9 connected to the 2^nd ASIC, and E14 and E15 connected to a Warp 100 Bridge connector on the right. The other eight E tiles are not connected to anything on these particular boards. Meanwhile, the 2^nd ASIC has E0 and E1 connected to the 1^st ASIC, E6 and E7 connected to a Warp 100 Bridge connector on the right, and none of the other E tiles connected to anything.

The PCIe tile on the 2^nd ASIC is similarly not connected to anything. There's a little SPI flash memory containing firmware and configuration for the ARC tiles, which can serve as an awkward communication channel: the 1^st ASIC can write revised firmware/configuration to the flash, then trigger a board-level reset to cause both ARC tiles to re-load their firmware and configuration from the flash. Other than using tt-flash to occasionally update the firmware, and tt-topology to occasionally update the configuration, you likely won't be using this channel. That leaves ethernet as the primary means of communication between the two ASICs on the board, so to make any use of the 2^nd ASIC, we're going to have to talk ethernet.

From the host, we can use the PCIe link to do whatever we want to the E8 / E9 tiles on the 1^st ASIC, but until we've established ethernet communication, we have no way to affect the E0 / E1 tiles that they communicate with. Whatever we transmit from the E8 / E9 tiles will, at least initially, be received and processed by the base firmware on the E0 / E1 tiles. The details of that processing logic aren't necessarily documented by Tenstorrent, but the base firmware on the E8 / E9 tiles knows how to form and transmit ethernet packets that'll be received and understood by the base firmware on the E0 / E1 tiles. Hence we don't want to mess with the E8 / E9 tiles too much, as we'll need to ask the firmware on them to do our bidding. That means we'll need to understand the interface that the base firmware on the E8 / E9 tiles presents to the host. This interface isn't really documented either, but at least there are relevant header files. We start with a basic queue/ring structure:

struct eth_queue_t {
  uint32_t wr_req_counter;
  uint32_t wr_resp_counter;
  uint32_t rd_req_counter;
  uint32_t rd_resp_counter;
  uint32_t error_counter;
  uint32_t padding0[3]; // Aligns next field to 16 bytes
  uint32_t wr_idx;
  uint32_t padding1[3]; // Aligns next field to 16 bytes
  uint32_t rd_idx;
  uint32_t padding2[3]; // Aligns next field to 16 bytes
  routing_cmd_t contents[4];
};

Nothing too surprising in eth_queue_t; it starts with some counters that the base firmware increments in various scenarios, then wr_idx and rd_idx, and then space for four elements. The size of the queue, which is always between zero and four (inclusive), is given by (wr_idx - rd_idx) % 8. An empty queue will have wr_idx == rd_idx, whereas a full queue will have (wr_idx - rd_idx) % 8 == 4. To push on to the queue, assuming it isn't full, populate contents[wr_idx % 4] then do wr_idx = (wr_idx + 1) % 8. To pop from the queue, assuming it isn't empty, consume contents[rd_idx % 4] and then do rd_idx = (rd_idx + 1) % 8. Aside: the choice of % 8 is unfortunate; % 2³² would have worked equally well, and % 2³² is completely free on any 32-bit or 64-bit CPU (whereas % 8 is very cheap but not quite free).

Each element of the queue is an instance of the routing_cmd_t structure:

struct routing_cmd_t {
  uint32_t target_addr;
  uint16_t target_noc_xy; // From lo to hi: 4 bits zero, 6 bits NoC X, 6 bits NoC Y
  uint16_t target_shelf_xy; // From lo to hi: 6 bits shelf-level X, 6 bits shelf-level Y, 4 bits unused
  union {
    uint32_t inline_data;
    uint32_t data_block_length;
  };
  uint32_t flags;
  uint16_t target_rack_xy; // From lo to hi: 8 bits rack X (rack #), 8 bits rack Y (shelf #)
  uint16_t reserved[5];
  uint32_t data_block_dma_addr;
};

// Request flags:
#define CMD_WR_REQ         (1u << 0)
#define CMD_RD_REQ         (1u << 2)
#define CMD_DATA_BLOCK_DMA (1u << 4)
#define CMD_DATA_BLOCK     (1u << 6)
#define CMD_BROADCAST      (1u << 7)
#define CMD_USE_NOC1       (1u << 9)
#define CMD_TIMESTAMP      (1u << 10)
#define CMD_ORDERED        (1u << 12)

// Response flags:
#define CMD_WR_ACK                 (1u << 1)
#define CMD_RD_DATA                (1u << 3)
#define CMD_DATA_BLOCK_UNAVAILABLE (1u << 30)
#define CMD_DEST_UNREACHABLE       (1u << 31)

This structure requires slighly more explanation. A request will be either CMD_WR_REQ or CMD_RD_REQ, along with a bunch of optional flags. If we ignore the CMD_BROADCAST flag, these write requests and read requests target a particular location in the address space of a particular tile. The tile-local address is given in the target_addr field, and the tile in question is identified by a combination of the target_noc_xy and target_shelf_xy and target_rack_xy fields. That is, rather than using IPv4 or IPv6 addresses, a custom 6-dimensional addressing scheme is used. We already saw the NoC X and Y dimensions in part 1, noting that they are interleaved versus the actual physical placement of tiles, which is why (amongst other things) the right edge appears as the middle column and the bottom edge appears as the middle row:

If there are multiple Wormhole ASICs in a single server, then they too can be arranged into a logical grid, giving the shelf-level X and Y dimensions:

Finally, an aisle of server racks in a datacenter gives rack-level X (rack #) and Y (shelf #) dimensions:

That completes the detour describing the addressing scheme. Returning to the routing_cmd_t structure, the data to be written (for write requests) can either be a 4 byte value in inline_data, or a small block of data up to 1KB in size somewhere near the routing_cmd_t structure (set the CMD_DATA_BLOCK flag, put the length in data_block_length), or a large block of data up to 3.75GB in size sitting in host DRAM accessible via DMA (set both CMD_DATA_BLOCK and CMD_DATA_BLOCK_DMA, put the length in data_block_length, and the physical memory address in data_block_dma_addr - the kernel-mode driver can be used to obtain such addresses). For read requests, the options are similar: a 4 byte result can go directly in inline_data, or a small result up to 1KB in size can be written somewhere near the routing_cmd_t structure, or a large result up to 3.75GB in size can be written to host DRAM via DMA. The routing_cmd_t structure is used for responses as well as requests, though a different set of flags are applicable to responses, and the only interesting fields on responses are flags and inline_data. The high bits of response flags indicate errors, while the low four bits should contain either CMD_WR_ACK or CMD_RD_DATA. Everything is then wrapped up in a eth_base_firmware_queues_t structure:

struct eth_base_firmware_queues_t {
  uint64_t latency_counter[16];
  eth_queue_t sq; // Contains requests, for host -> E tile
  eth_queue_t reserved;
  eth_queue_t cq; // Contains responses, for E tile -> host
  char padding[4096 - sizeof(uint64_t)*16 - sizeof(eth_queue_t)*3];
  char buffers[4][1024];
};

Skipping over the latency_counter field, this contains a submission queue (sq), in to which the host pushes routing_cmd_t objects containing requests, and a completion queue (cq) from which the host pops routing_cmd_t objects containing responses. Each of the index fields has a single writer:

Field	Writer	Readers
`sq.wr_idx`	Host (as part of pushing)	Host, E tile
`sq.rd_idx`	E tile (as part of popping)	Host, E tile
`cq.wr_idx`	E tile (as part of pushing)	Host, E tile
`cq.rd_idx`	Host (as part of popping)	Host, E tile

The buffers field contains four 1KB buffers, used for requests or responses which have CMD_DATA_BLOCK set, but CMD_DATA_BLOCK_DMA unset. In such cases, request sq.contents[i] uses buffers[i], and response cq.contents[i] also uses buffers[i]. A little bit of care is required to ensure that a buffer isn't used by two different routing_cmd_t objects at once, but assuming that the queue indices start off aligned, and that every request generates a response, then the response to sq.contents[i] will end up in cq.contents[i], and at most one of these two things will require buffers[i].

Each E tile contains a single eth_base_firmware_queues_t structure in its SRAM, the address of which is stored at tile-local address 0x170. The host uses PCIe reads and writes to interact with this structure, and it is the responsibility of host software to avoid having multiple host threads interact with the same structure at the same time. The host can submit requests to read/write against any tile in the 6-dimensional space, and the base firmware on the E tile to which the request is submitted will do one of three things:

If the target tile is the E tile itself, the request can be completed using RISC-V load/store instructions.
Otherwise, if the target tile is on the same ASIC as the E tile, the request can be completed using NoC #0 (default) or NoC #1 (if CMD_USE_NOC1 is set).
Otherwise, the request can be forwarded to a different E tile; either to the E tile at the other end of the ethernet link, or to one of the other E tiles on the same ASIC. The receiving E tile will then do one of the same three things.

In the simple setup of a single n300s board, the rack # is 0, the shelf # is 0, and then the shelf-level coordinates are (0, 0) for the ASIC connected to PCIe and (1, 0) for the other ASIC. In more complex setups, tt-topology should be used to assign coordinates to ASICs.

Back in part 2, we obtained the value of RV_ADDR_NOC0_MC_DISABLE_COL on the 1^st ASIC, thereby determining which rows were disabled. Knowing what we now know about ethernet tiles, we can obtain RV_ADDR_NOC0_MC_DISABLE_COL on both ASICs. To make things interesting, we'll have the host make a request to tile E10 at NoC coordinates (8, 6), but have the target of the request be RV_ADDR_NOC0_MC_DISABLE_COL on tile E2 at coordinates (8, 0). When targetting the 2^nd ASIC, this'll require an ethernet hop between E8 and E0, as shown:

Continuing with the code from part 2, which deliberately eschews the useful software layers provided by Tenstorrent, we can start by obtaining the base firmware queues structure on tile (8, 6):

char* l1_tlb = set_tlb(dev, TLB_IDX_0, TLB_CFG_UNICAST(8, 6), 0);
uint32_t q_addr = *(volatile uint32_t*)(l1_tlb + 0x170);
eth_base_firmware_queues_t* q = (eth_base_firmware_queues_t*)(l1_tlb + q_addr);

We can then prepare the request for the base firmware, setting the target as RV_ADDR_NOC0_MC_DISABLE_COL on tile (8, 0):

routing_cmd_t c;
c.target_rack_xy = (0 << 0) + (0 << 8);
c.target_shelf_xy = (shelf_x << 0) + (shelf_y << 6);
c.target_noc_xy = (8 << 4) + (0 << 10);
c.target_addr = RV_ADDR_NOC0_MC_DISABLE_COL;
c.flags = CMD_RD_REQ;

Dispatching the command to the submission queue and reaping the result from the completion queue involves some grungy work:

void do_eth_cmd(eth_base_firmware_queues_t* q, routing_cmd_t* c) {
  // Spin while sq full
  uint32_t wr_idx = q->sq.wr_idx;
  uint32_t rd_idx;
  do {
    rd_idx = *(volatile uint32_t*)&q->sq.rd_idx;
  } while ((wr_idx - rd_idx) & 4u);

  // Push to sq
  routing_cmd_t* qc = q->sq.contents + (wr_idx & 3u);
  *(volatile __m256i*)qc = _mm256_loadu_si256((__m256i*)c);
  _mm_sfence();
  *(volatile uint32_t*)&q->sq.wr_idx = (wr_idx + 1) & 7u;

  // Spin while cq empty
  rd_idx = q->cq.rd_idx;
  do {
    wr_idx = *(volatile uint32_t*)&q->cq.wr_idx;
  } while (rd_idx == wr_idx);

  // Wait for cq entry to be populated
  qc = q->cq.contents + (rd_idx & 3u);
  do {
    _mm256_storeu_si256((__m256i*)c, *(volatile __m256i*)qc);
  } while (c->flags == 0);

  // Pop from cq
  *(volatile uint32_t*)&q->cq.rd_idx = (rd_idx + 1) & 7u;
}

One subtle point in the above is that the base firmware initially pushes an entry on to the completion queue with flags set to zero, and then populates the entry properly as a 2^nd stage, so (somewhat unfortunately) two loops are required to pop from the completion queue.

We can then wrap all this up in some more loops:

for (uint32_t shelf_y = 0; shelf_y < 2; ++shelf_y) {
  for (uint32_t shelf_x = 0; shelf_x < 2; ++shelf_x) {
    routing_cmd_t c;
    c.target_rack_xy = (0 << 0) + (0 << 8);
    c.target_shelf_xy = (shelf_x << 0) + (shelf_y << 6);
    c.target_noc_xy = (8 << 4) + (0 << 10);
    c.target_addr = RV_ADDR_NOC0_MC_DISABLE_COL;
    c.flags = CMD_RD_REQ;
    do_eth_cmd(q, &c);
    printf("(%u, %u) -> ", shelf_x, shelf_y);
    if (c.flags == CMD_RD_DATA) {
      printf("value %u\n", c.inline_data);
    } else {
      printf("error %#08x\n", c.flags);
    }
  }
}

For my n300s board, the above prints:

(0, 0) -> value 3137
(1, 0) -> value 2121
(0, 1) -> error 0x80000008
(1, 1) -> error 0x80000008

There are CMD_DEST_UNREACHABLE errors for target_shelf_xy of (0, 1) and (1, 1), which makes sense. The value 3137 is what we already saw in part 2, and corresponds to bitmask 0b110001000001. The value 2121 is new, and tells us that the bitmask for the 2^nd ASIC is 0b100001001001. This means that the 128 usable T tiles on my n300s board are at:

At this point we could use a routing_cmd_t to send new RISC-V code to the E tiles on the 2^nd ASIC and have that code implement a communication protocol of our choosing rather than relying on the base firmware, but this blog post is long enough already, so it'll have to wait for another time. The complete code for this post consists of 201 lines, though half of that is unchanged from part 2. That wraps up part 4; if you're reading along, then part 5 is next.

Continuing the trend from part 2 of eschewing the useful software layers provided by Tenstorrent, and instead manually poking around in various address spaces, every T tile contains a 64-bit counter at tile-local address 0xFFB121F0 and 0xFFB121F8 which starts at zero when the chip is powered on, and increments by one every clock cycle. Every T tile also contains a soft-reset register at tile-local address 0xFFB121B0; if this register contains 0x47800 then all five Baby RISC-V cores are held in soft reset, and then individual bits can be cleared to take individual cores out of soft reset (i.e. allow them to run).

With these two pieces of information, we can do something interesting: use a NoC multicast write to take one core out of reset on every tile, have RISC-V code on every tile record its cycle counter somewhere as soon as it comes out of reset, then collect and plot the results.

In order to start from a clean slate, we'll want to use a NoC multicast write to put all cores into soft-reset, specifying (0, 0) through (9, 11) inclusive as the multicast rectangle, and relying on the multicast disable row/column we saw in part 2 to ensure that the multicast only goes to T tiles:

#define RV_ADDR_SOFT_RESET 0xFFB121B0

#define SOFT_RESET_ALL_CORES 0x47800

char* reg_tlb = set_tlb(dev, TLB_IDX_UC0, TLB_CFG_MULTICAST(0, 0, 9, 11), RV_ADDR_SOFT_RESET);
*(volatile uint32_t*)(reg_tlb + RV_ADDR_SOFT_RESET) = SOFT_RESET_ALL_CORES;

With all the cores held in soft-reset, it is safe to send them new code. The SRAM (or, for D tiles, DRAM) within a tile starts at tile-local address 0, and execution will also start at address 0 when soft-reset is cleared, so we can send some RISC-V code to tile-local addresses starting at 0, again multicasting it out. The code will read from the tile-local 64-bit cycle counter at 0xFFB121F0 and 0xFFB121F8, then write it to tile-local address 128:

const uint32_t rv_code[] = {
  0xFFB12537, // lui a0, 0xFFB12
  0x1F052583, // lw a1, 0x1F0(a0)
  0x1F852603, // lw a2, 0x1F8(a0)
  0x08B02023, // sw a1, 128(x0)
  0x08C02223, // sw a2, 132(x0)
  0x0000006F, // loop: j loop
};
char* l1_tlb = set_tlb(dev, TLB_IDX_0, TLB_CFG_MULTICAST(0, 0, 9, 11), 0);
memcpy(l1_tlb, rv_code, sizeof(rv_code));

We can then perform a multicast to bring one core out of reset on each T tile:

*(volatile uint32_t*)(reg_tlb + RV_ADDR_SOFT_RESET) = SOFT_RESET_ALL_CORES & (SOFT_RESET_ALL_CORES - 1);

We can't use multicast to collect the results - instead we need to perform a unicast read against each T tile in turn. That requires knowing the tile coordinates of each T tile, and said grid isn't entirely regular: it'll be disturbed by a column of D tiles, and be disturbed by a row of E tiles, and have one or two disabled rows. We can sidestep this problem by using a convenient translation feature: an X coordinate of 16 will be replaced with 0 (PCIe / ARC / D column), 17 will be replaced with 5 (2^nd D column), then 18 through 25 will be replaced with the column indices containing T tiles. Similarly, a Y coordinate of 16 will be replaced with 0 (E0-E7 row), 17 will be replaced with 6 (E8-E15 row), and 18 through 25 or 26 will be replaced with whatever row indices contain active T tiles (if you need a reminder of the coordinate grid, see part 1). This allows us to easily iterate over the active T tiles:

uint64_t times[8][8];
for (uint32_t y = 0; y < 8; ++y) {
  for (uint32_t x = 0; x < 8; ++x) {
    l1_tlb = set_tlb(dev, TLB_IDX_0, TLB_CFG_UNICAST(18 + x, 18 + y), 0);
    times[y][x] = *(volatile uint64_t*)(l1_tlb + 128);
  }
}

For neatness, we can then put everything back into reset:

*(volatile uint32_t*)(reg_tlb + RV_ADDR_SOFT_RESET) = SOFT_RESET_ALL_CORES;

With T denoting the minimum value seen in the times matrix, I observe:

If instead multicasting via NoC #1 (by adding TLB_CFG_NOC1 to the TLB_CFG_MULTICAST result), and calling S the minimum value seen this time, I observe:

Both sets of measurements suggest that the tile-to-tile propagation delay might be around 9 cycles, but the numbers are far from perfect. The imperfections are very clear if we plot both sets of measurements at the same time, and look at just the row containing the PCIe tile:

Going rightwards, the first tile is "T+3" and the last is "T+75", meaning 72 cycles to traverse 8 tiles. Going leftwards, the first tile is "S+0" and the last is "S+72", again meaning 72 cycles to traverse 8 tiles. However, going rightwards, the 2^nd tile is "T+0", which isn't great: taken at face value it would mean that the multicast reached the 2^nd tile before reaching the first, which is clearly nonsense. There is one obvious explanation for this: the cycle counters on different tiles aren't perfectly aligned - they're meant to all start from 0 when the chip powers on, but powering on is a physically complex process, so some tiles might start counting a few cycles before or after others.

If the tile-to-tile latency was identical for every hop, and we called this unknown quantity X, then what we'd hope to see is:

Regardless of what S or T or X actually are, it so happens that the average of the two expressions in each tile is (S + T)/2 + 4X. As this expression should be the same for all tiles, we can use it to correct for the different counter start times between the different tiles. We need to assume that there is a per-tile counter adjustment, with all readings taken on a given tile adjusted by the same amount, and then set those adjustments so that "should be the same" becomes "is the same". Because I'm lazy, I'll assume that all tiles within a given column have the same adjustment, which isn't quite true, but it'll do for now. After computing and applying this adjustment, the NoC #0 measurements are:

And NoC #1 are:

The results still aren't perfect, but they're good enough for me to conclude that the tile-to-tile propagation delay is 9 clock cycles (i.e. 9 nanoseconds when the clock is running at 1GHz), and that imperfections in measurements are due to the aforementioned laziness. For tile-to-tile communication there'll be some latency to get on to the NoC, then a propagation delay for every traversed tile, and then some latency to get off the NoC. For messages requiring a response, there'll be all that twice, as after the request has done all that, the response needs to get on to the NoC, then propagate back to the requestor, then get off the NoC. For NoC reads (and presumably NoC writes-with-acknowledgement, if you use them), that response travels on the same NoC as the request, so if requestee and respondee are in the same row, the combination of request and response will have 10 tiles (90 cycles) of propagation delay, versus 12 tiles (108 cycles) of propagation delay if they're in the same column, and 10+12 tiles (198 cycles) if they're in different row and column.

That wraps up part 3. The complete code comes out to 164 lines, but a lot of it is common with part 2's 100 lines. If you're reading along, part 4 is next.

Tenstorrent Wormhole Series Part 7: Bits of the MatMul

Tenstorrent Wormhole Series Part 6: Vector instruction set

Tenstorrent Wormhole Series Part 5: Taking apart T tiles

Tenstorrent Wormhole Series Part 4: A touch of Ethernet

Tenstorrent Wormhole Series Part 3: NoC propagation delay