corsix.org

Previously, we considered the physicalities of a Tenstorrent Wormhole card, ending on the remark that one or two rows of T tiles will be disabled in every chip shipped to customers. That naturally begs the question: if you're a customer with a chip (like me), how do you determine which rows are disabled?

It should be noted that most people shouldn't need to care about this question, as a combination of various Tenstorrent-provided software layers should successfully abstract away this difference. That said, I'm not most people; I want to characterise and understand how these cards work at a very low level. Consequently, I'm going to be ignoring most of the Tenstorrent-provided software layers; the kernel-mode driver is fine, and some closed-source firmware is unavoidable at the moment, but I'll ignore the user-mode driver along with all of TT-NN/TT-Metalium and TT-Buda. Again, if you are most people, you probably want to be using those software layers rather than doing what I'm about to do.

Opening the kernel driver is simple enough:

int fd = open("/dev/tenstorrent/0", O_RDWR | O_CLOEXEC);
ASSERT(fd >= 0);

We can then ask the kernel driver what memory ranges (i.e. PCIe bars) it has available for mapping:

#define TENSTORRENT_IOCTL_QUERY_MAPPINGS 0xFA02

struct tenstorrent_mapping {
  uint32_t mapping_id;
  uint32_t reserved;
  uint64_t mapping_base;
  uint64_t mapping_size;
};

#define TENSTORRENT_MAPPING_RESOURCE0_UC 1
#define TENSTORRENT_MAPPING_RESOURCE0_WC 2
#define TENSTORRENT_MAPPING_RESOURCE2_UC 5

unsigned char resource_to_mapping[8] = {0};
struct tenstorrent_mapping mappings[sizeof(resource_to_mapping) + 1];
mappings[0].mapping_size = sizeof(resource_to_mapping);
ASSERT(ioctl(fd, TENSTORRENT_IOCTL_QUERY_MAPPINGS, &mappings[0].mapping_size) >= 0);
mappings[0].mapping_size = 0;
for (unsigned i = 1; i <= sizeof(resource_to_mapping); ++i) {
  uint32_t resource = mappings[i].mapping_id;
  if (resource < sizeof(resource_to_mapping)) {
    resource_to_mapping[resource] = i;
  }
}

To make some future things easier, I want to map these resources in a very particular way:

The first 464MB of resource 0, as write-combining memory.
Then the next 32MB of resource 0, as uncacheable memory.
Then the middle/final 16MB of resource 2, as uncacheable memory.

This sums to a neat 512MB, so it needs one mmap call to reserve a contiguous 512MB range of virtual address space, followed by one mmap call per resource range. If resource 0 isn't available as WC, or less than 464MB is available as WC, then mapping it as uncacheable is an acceptable fallback:

#define BAR0_WC_SIZE (464 << 20)
#define BAR0_SIZE    (496 << 20)
#define MMAP_SIZE    (512 << 20)

#define BAR4_SOC_TARGET_ADDRESS 0x1E000000

struct tenstorrent_mapping* bar0uc = mappings + resource_to_mapping[TENSTORRENT_MAPPING_RESOURCE0_UC];
struct tenstorrent_mapping* bar0wc = mappings + resource_to_mapping[TENSTORRENT_MAPPING_RESOURCE0_WC];
struct tenstorrent_mapping* bar4uc = mappings + resource_to_mapping[TENSTORRENT_MAPPING_RESOURCE2_UC];
ASSERT(bar0uc->mapping_size >= BAR0_SIZE);
ASSERT(bar4uc->mapping_size >= MMAP_SIZE - BAR4_SOC_TARGET_ADDRESS);

char* dev = mmap(NULL, MMAP_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ASSERT(dev != MAP_FAILED);

uint32_t wc_size = bar0wc->mapping_size;
if (wc_size) {
  if (wc_size > BAR0_WC_SIZE) {
    wc_size = BAR0_WC_SIZE;
  }
  if (mmap(dev, wc_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, bar0wc->mapping_base) == MAP_FAILED) {
    wc_size = 0;
  }
}
ASSERT(mmap(dev + wc_size, BAR0_SIZE - wc_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, bar0uc->mapping_base + wc_size) != MAP_FAILED);
ASSERT(mmap(dev + BAR0_SIZE, MMAP_SIZE - BAR0_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, bar4uc->mapping_base + (BAR0_SIZE - BAR4_SOC_TARGET_ADDRESS)) != MAP_FAILED);

This gives us a 512MB window for talking to the Wormhole ASIC, but what wonders does this window contain? It so happens that the 16MB from resource 2 contains an assortment of configuration registers on the ARC and PCIe tiles. Meanwhile, the remainder of the window maps to the PCIe tile, and any read/write performed by the host against this window gets translated into a NoC read/write by the PCIe tile. The details of that translation can be tweaked by using some configuration registers in the aforementioned resource 2. The general shape cannot be tweaked: the 496MB range subdivides into 156 pieces of size 1MB, 10 pieces of size 2MB, and 20 pieces of size 16MB. After that, things get tweakable: for each piece, we can specify the X/Y coordinates of the tile on the NoC to read/write (or the X/Y coordinates of a rectangular range of tiles for multicast writes), which 1MB/2MB/16MB-aligned range of address space within the tile to target, whether to use NoC #0 or #1, and a few other properties. The Tenstorrent software calls these pieces TLBs, which is not to be confused with the TLB used within a CPU to translate between virtual and physical addresses. Mapping the first 464MB of resource 0 as write-combining means that most of the pieces are write-combining; only the final two 16MB pieces fall within the uncacheable part (note that this WC/UC difference only affects whether the host buffers up writes before passing them along to the PCIe tile; once the PCIe tile receives the PCIe transaction, it doesn't care whether WC or UC was used to get there).

The configuration registers controlling these pieces start at address 0x1FC00000, and consist of 8 bytes per piece. We can wrap up the details within a set_tlb function, which takes a piece index (0 ≤ idx < 156+10+20) and details of what to target, configures that piece, and then returns a pointer to the relevant piece:

#define TLB_CONFIG_ADDR 0x1FC00000

#define TLB_CFG_UNICAST(x, y) (((y) << 6) + (x))
#define TLB_CFG_MULTICAST(x_start, y_start, x_end, y_end) ((1 << 25) + ((y_start) << 18) + ((x_start) << 12) + ((y_end) << 6) + (x_end))
#define TLB_CFG_NOC1 (1 << 24)

static char* set_tlb(char* dev, uint32_t idx, uint64_t cfg, uint32_t suitable_for_addr) {
  char* result = dev;
  uint32_t abits;
  if (idx < 156) {
    abits = 20;
    result += (idx << 20);
  } else if (idx < 166) {
    abits = 21;
    result += (156 << 20) + ((idx - 156) << 21);
  } else {
    abits = 24;
    result += (156 << 20) + (10 << 21) + ((idx - 166) << 24);
  }
  cfg = (cfg << (36 - abits)) + (suitable_for_addr >>= abits);
  ((volatile uint64_t*)(dev + TLB_CONFIG_ADDR))[idx] = cfg;
  return result - (suitable_for_addr << abits);
}

We can use set_tlb to go and poke around in the address space of any tile on the NoC. I'm going to interrogate the ethernet tile at logical coordinates (1, 0), as ethernet tiles are never disabled in the way that T tiles can be. Like most of the tiles containing RISC-V cores, its tile-local address space contains various interesting things at/above address 0xFF000000, including "multicast disable row" and "multicast disable column" at 0xFFB20108 and 0xFFB20110:

#define TLB_IDX_UC0 184

#define RV_ADDR_NOC0_MC_DISABLE_ROW 0xFFB20108
#define RV_ADDR_NOC0_MC_DISABLE_COL 0xFFB20110

char* reg_tlb = set_tlb(dev, TLB_IDX_UC0, TLB_CFG_UNICAST(1, 0), RV_ADDR_NOC0_MC_DISABLE_ROW);
printf("%u, %u\n",
    *(volatile uint32_t*)(reg_tlb + RV_ADDR_NOC0_MC_DISABLE_ROW),
    *(volatile uint32_t*)(reg_tlb + RV_ADDR_NOC0_MC_DISABLE_COL));

On my system, for my card, this prints 33, 3137, which in binary is 0b100001, 0b110001000001. Plotting these masks as "X"s on the tile grid gives:

The tiles not marked with "X" are the usable T tiles, meaning that for my chip, what I have is effectively:

I suspect that the final two rows of T tiles were disabled for consistency reasons rather than because of actual defects, but I'll never know for sure!

That wraps up part 2. The complete code comes out to 100 lines, which isn't too shabby. If you're reading along, part 3 is next.

A company called Tenstorrent design and sell PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards. After taking the shell and the heatsinks off, the underlying circuit board looks like:

Wormhole n150s	Wormhole n300s

(Circuit board photos from Tenstorrent's own sales pages)

At the top of the board, there are two QSFP-DD cages for ethernet, which are intended for direct-attach cables between cards, as shown here. The left edge has a PCIe 4.0 x16 connector, which is connected to the immediately adjacent Wormhole ASIC. The Wormhole ASIC is surrounded by six GDDR6 chips (two on the left, four on the right), with each GDDR6 chip containing two 1GB banks, for a total of 12GB GDDR6 connected to the Wormhole ASIC. The right edge has two of what Tenstorrent call a Warp 100 Bridge connector, which seems to be a proprietary 100Gb ethernet connector. After this, the two boards diverge: the n300s card has another Wormhole ASIC with another 12GB of GDDR6 attached, whereas the n150s card has a bunch of empty space. The 2^nd Wormhole ASIC on the n300s card is not connected to PCIe. Instead, on the n300s card, there are internal ethernet links between the two ASICs. Accordingly, for the host to communicate with the 2^nd ASIC, it needs to use PCIe to communicate with the 1^st ASIC, and ask the 1^st ASIC to use its internal ethernet to communicate with the 2^nd ASIC's internal ethernet.

The Wormhole ASIC itself is described as a 10 by 12 grid of tiles, arranged as:

There are a few different types of tile:

Kind	Count	Contents (per tile)
ARC	1	1x Argonaut RISC Core (also connected to PCIe)
D	6 x 3	Bridge to 2x 1GB GDDR6 (shared by three tiles)
E	16	1x Baby RISC-V CPU (E variant) 256K SRAM Bridge to 100Gb ethernet (if connected)
PCIe	1	Bridge to host over PCIe
T	80 (not all usable)	5x Baby RISC-V CPU (B/T/T/T/NC variants) 1.5MB SRAM 1x Matrix unit (2048 multipliers, each 5b x 7b) 1x Vector/SIMD unit (32 lanes, each 32b wide)

Each E tile can manage 100Gb ethernet (that is, simultaneous transmit at 100Gb/s and receive at 100Gb/s). On the n300s card, E8 of the 1^st ASIC is connected to E0 of the 2^nd ASIC, and E9 of the 1^st ASIC is connected to E1 of the 2^nd ASIC. Other tiles are connected to the QSFP-DD cages and to the Warp 100 Bridge connectors (possibly two E tiles on the 2^nd ASIC route to the Warp, and four E tiles on the 1^st ASIC route to the QSFP-DD cages). There's not enough I/O on the n150s/n300s boards to allow all the E tiles of a single ASIC to be connected, let alone all the E tiles of two ASICs to be connected; you'd need to buy a Galaxy Server to find a form factor with enough I/O to saturate all the E tiles.

Each tile has four outbound connections to its (north/east/south/west) neighbours, along with four corresponding inbound connections from its neighbours, with each connection being 32 bytes wide (apparently upgraded to 64 bytes wide in the next-generation Blackhole ASIC). The combination of the east-bound and south-bound channels are called NoC (Network on Chip) #0, whilst the combination of the west-bound and north-bound channels are called NoC #1. Going eastward from the eastmost edge lands you on the westmost edge (e.g. going east from the T at (9, 3) gets you to PCIe at (0, 3)). A similar thing is true on the other three edges: they are each connected to the edge on the opposite side. Taking the eastbound connections in row 3 as an example, we have:

It looks like the path between most neighbours is short, but then the wraparound path is extremely long. This would make both software and hardware designers unhappy, so the physical reality interleaves tiles to equalize the distances:

A similar interleaving is done to the columns, meaning that the actual physical locations of the various tiles within the grid is closer to:

The various tile-to-tile connections aren't shown on the above, as it would be a horrible mess of overlapping and dodging lines. It does however match the circuit board photos: the DRAM tiles are on the left and right edges, and tiles routing to the same DRAM chip are adjacent (e.g. the three D5 tiles are physically adjacent, even though their logical Y coordinates are 3, 8, 4). The ethernet tiles also end up on the physical edges, which makes sense. Thankfully, software mostly doesn't need to worry about this interleaving (except for HARVESTING_NOC_LOCATIONS, which you'll note are the Y coordinates of an interleaved column of T tiles), and can instead stick to the original logical view:

That said, there's one physical reality that software can't escape: manufacturing defects. You could imagine TSMC having a master template Wormhole ASIC and then photocopying it thousands of times to create the ASICs that actually ship to customers. As with actual photocopying, the photocopies are slightly lower quality than the original. Sometimes the defects are benign, but other times the defects completely ruin the piece of logic or storage that they occur within. Throwing away an entire ASIC because of a single defect would not be viable, so there are various strategies to deal with the defects. One such employed strategy is to disable (or "harvest") an entire row of T tiles if there is a defect in any of the T tiles in that row, effectively pretending that all the T tiles in that row are empty tiles (in the same way that e.g. (0, 2) is empty but still participates in the connection grid). An ASIC with one row of disabled T tiles can be used in an n150s board, which is why the n150s is reported as having 72 T tiles despite the ASIC having 80. Similarly, an ASIC with two rows of disabled T tiles can be used in an n300s board, which is why the n300s is reported as having 64 T tiles per ASIC despite there originally being 80. If the photocopying process had fewer defects than planned for, then one (n150s) or two (n300s) rows of T tiles are disabled regardless, for consistency (of what customers receive, of power usage, etc.). The non-T tiles within a disabled row are kept, though they become slightly harder to address. In particular, some workloads want to really minimise latency to DRAM, so want to run on a T tile immediately adjacent to a D tile. Some rows have 2 D tiles and thus four T tiles immediately adjacent to a D tile, whereas other rows have only 1 D tile and thus only two T tiles immediately adjacent to a D tile. Whilst each PCIe card will have the same number of usable T tiles (72 for n150s, 64 x 2 for n300s), the number of usable T tiles immediately adjacent to a D tile will vary randomly from card to card depending on which rows were disabled.

That wraps up part 1. If you're reading along, part 2 is next.

The AArch64 FEAT_CSSC (Common Short Sequence Compression) extension adds a few instructions operating on general purpose registers. The extension is optional since Armv8.7, and mandatory since Armv8.9 if FEAT_AdvSIMD is implemented. The most interesting addition is a popcnt instruction operating on general purpose registers, but the full list of new instructions is:

CSSC instruction	Equivalent pre-CSSC sequence (†)
`ABS Rd, Rn`	`TST Rn, Rn CNEG Rd, Rn, MI`
`CNT Rd, Rn`	`FMOV d0, Rn CNT v0.8B, v0.8B ADDV b0, v0.8B FMOV Rd, s0`
`CTZ Rd, Rn`	`RBIT Rd, Rn CLZ Rd, Rd`
`SMAX Rd, Rn, Rm`	`CMP Rn, Rm CSEL Rd, Rn, Rm, GT`
`SMAX Rd, Rn, #i8`	`MOV Rtmp, #i8 CMP Rn, Rtmp CSEL Rd, Rn, Rtmp, GT`
`SMIN Rd, Rn, Rm`	`CMP Rn, Rm CSEL Rd, Rn, Rm, LT`
`SMIN Rd, Rn, #i8`	`MOV Rtmp, #i8 CMP Rn, Rtmp CSEL Rd, Rn, Rtmp, LT`
`UMAX Rd, Rn, Rm`	`CMP Rn, Rm CSEL Rd, Rn, Rm, HI`
`UMAX Rd, Rn, #u8`	`MOV Rtmp, #u8 CMP Rn, Rtmp CSEL Rd, Rn, Rtmp, HI`
`UMIN Rd, Rn, Rm`	`CMP Rn, Rm CSEL Rd, Rn, Rm, LO`
`UMIN Rd, Rn, #u8`	`MOV Rtmp, #u8 CMP Rn, Rtmp CSEL Rd, Rn, Rtmp, LO`

(†) Except that the CSSC instructions do not mutate flags, and CSSC CNT does not mutate any SIMD registers.

I find myself writing some CPU emulators at the moment, which has caused the AArch64 (aka. ARM64) RMIF instruction to become my new favourite instruction. It takes a 64-bit general purpose register, rotates it right by a specified number of bits, then selectively merges the low four bits into the flags register. A 6-bit field in the instruction gives the rotate amount, and a 4-bit field in the instruction gives a mask of which flag bits to overwrite versus which to leave unchanged.

One use of rmif is to emulate the x86 bt reg, imm instruction, which extracts one bit from a general purpose register, writes that bit to the C flag, and leaves other flags unchanged. Thus bt reg, imm in x86 becomes rmif reg, #((imm - 1) & 63), #2 in AArch64.

At the other end of the spectrum is the x86 inc instruction, which adds 1 to a general purpose register, and then sets most flags based on this addition, but leaves the C flag unchanged. To emulate inc reg, we can first save off the old value of the C flag (via csinc tmp, wzr, wzr, cc or adc tmp, wzr, wzr), then do adds reg, reg, #1 to perform the addition and set all the flags, then rmif tmp, #63, #2 to restore the old value of the C flag.

As another example, the AArch32 muls instruction sets the N and Z flags based on the result of the multiplication, but leaves the C and V flags unchanged. To emulate this on AArch64, we can save off all the flags (mrs tmp, NZCV), then do the multiplication, then set N and Z based on the result but also clobber C and V (ands wzr, dst, dst or adds wzr, dst, #0), then restore the old values of C and V (rmif tmp, #28, #3).

Distributing dynamically-linked ELF executables on Linux can be arduous. Some downstream effects of this include:

Distributing Linux applications in source form.
Distributing Linux applications as Docker containers.
Relying on OS package managers to provide pre-built packages.
The popularity of Go, with its statically-linked ELF executables.

At first, the problem doesn't look arduous: an ELF executable can contain an rpath or runpath attribute telling the dynamic linker where to find its shared object dependencies, and if that attribute starts with the magic placeholder $ORIGIN/, then the dynamic linker will look in the directory containing the executable (or a directory nearby) for its shared object dependencies. For example, if my_executable depended upon libz.so.1, and my_executable had an rpath or runpath of $ORIGIN/libs, then the executable and the library could be distributed using the following directory structure:

my_executable
libs/
  libz.so.1

This is great, but it has one limitation: an ELF executable also contains an attribute telling the kernel where to find the dynamic linker, and that attribute has to be an absolute path (or a path relative to the current working directory); it cannot be a path relative to the executable. On contemporary x86-64 systems, that absolute path tends to be /lib64/ld-linux-x86-64.so.2. This forces ELF executables to use whatever the system provides at /lib64/ld-linux-x86-64.so.2, which is typically version N of glibc's dynamic linker, for some N. In turn, this forces the ELF executable to use version N of the rest of glibc (libc.so.6, libm.so.6, libpthread.so.0, etc).

Continuing the example, it is likely that my_executable and libz.so.1 were built against some version M of glibc. If M ≤ N, then everything will work fine, but problems often crop up when M > N. One commonly touted solution is to set up a build environment with a very old version M of glibc, build my_executable and libz.so.1 in that environment, and then distribute them and hope for M ≤ N.

The polyfill-glibc project presents another possible solution: build my_executable and libz.so.1 against whatever version of glibc is convenient, and then run polyfill-glibc --target-glibc=N my_executable libz.so.1 to make them compatible with version N of glibc.

Sometimes we don't want either of these solutions, and what we want is to distribute the required version of glibc along with the executable, as in:

my_executable
libs/
  ld-linux-x86-64.so.2
  libc.so.6
  libz.so.1

We can get close to this by adding a launcher script:

launch_my_executable
my_executable
libs/
  ld-linux-x86-64.so.2
  libc.so.6
  libz.so.1

Where launch_my_executable is something like:

#!/usr/bin/env bash

ORIGIN="$(dirname "$(readlink -f "$0")")"
exec "$ORIGIN/libs/ld-linux-x86-64.so.2" --library-path "$ORIGIN/libs:$LD_LIBRARY_PATH" "$ORIGIN/my_executable" "$@"

This will work most of the time, though comes with caveats:

Execing the dynamic linker is a slightly obscure feature, which can increase the chance of hitting bugs (e.g. BZ#16381, BZ#24900).
If my_executable tries to open("/proc/self/exe") or readlink("/proc/self/exe") or similar, it'll get ld-linux-x86-64.so.2 rather than itself.
Having a launcher script is aesthetically displeasing.

As an alternative without these caveats, there's an experimental tool in the polyfill-glibc repository called set_relative_interp. For our running example, the tool would be invoked as:

$ set_relative_interp my_executable libs/ld-linux-x86-64.so.2

After running the tool as above, my_executable will use $ORIGIN/libs/ld-linux-x86-64.so.2 as its dynamic linker.

Tenstorrent Wormhole Series Part 2: Which disabled rows?

Tenstorrent Wormhole Series Part 1: Physicalities

Arm CSSC quick reference

My new favourite AArch64 CPU instruction: rotate then merge in to flags (RMIF)

For want of a relative path