Tenstorrent Wormhole Series Part 4: A touch of Ethernet

Previously, in parts 2 and 3, I played around with the 1st ASIC on my n300s board, but there are of course two Wormhole ASICs on the n300s board. As part reminder and part new information, we can augment the circuit board photo we saw in part 1 with a connectivity schematic:

[Photo | Schematic]
(Circuit board photo from Tenstorrent's own sales pages)

Each E tile can manage 100Gb ethernet, i.e. simultaneous transmit at 100Gb/s and receive at 100Gb/s. The 1st ASIC has E0 and E1 connected to one QSFP-DD cage, E6 and E7 connected to the other QSFP-DD cage, E8 and E9 connected to the 2nd ASIC, and E14 and E15 connected to a Warp 100 Bridge connector on the right. The other eight E tiles are not connected to anything on these particular boards. Meanwhile, the 2nd ASIC has E0 and E1 connected to the 1st ASIC, E6 and E7 connected to a Warp 100 Bridge connector on the right, and none of the other E tiles connected to anything.

The PCIe tile on the 2nd ASIC is similarly not connected to anything. There's a little SPI flash memory containing firmware and configuration for the ARC tiles, which can serve as an awkward communication channel: the 1st ASIC can write revised firmware/configuration to the flash, then trigger a board-level reset to cause both ARC tiles to re-load their firmware and configuration from the flash. Other than using tt-flash to occasionally update the firmware, and tt-topology to occasionally update the configuration, you likely won't be using this channel. That leaves ethernet as the primary means of communication between the two ASICs on the board, so to make any use of the 2nd ASIC, we're going to have to talk ethernet.

From the host, we can use the PCIe link to do whatever we want to the E8 / E9 tiles on the 1st ASIC, but until we've established ethernet communication, we have no way to affect the E0 / E1 tiles that they communicate with. Whatever we transmit from the E8 / E9 tiles will, at least initially, be received and processed by the base firmware on the E0 / E1 tiles. The details of that processing logic aren't necessarily documented by Tenstorrent, but the base firmware on the E8 / E9 tiles knows how to form and transmit ethernet packets that'll be received and understood by the base firmware on the E0 / E1 tiles. Hence we don't want to mess with the E8 / E9 tiles too much, as we'll need to ask the firmware on them to do our bidding. That means we'll need to understand the interface that the base firmware on the E8 / E9 tiles presents to the host. This interface isn't really documented either, but at least there are relevant header files. We start with a basic queue/ring structure:

struct eth_queue_t {
  uint32_t wr_req_counter;
  uint32_t wr_resp_counter;
  uint32_t rd_req_counter;
  uint32_t rd_resp_counter;
  uint32_t error_counter;
  uint32_t padding0[3]; // Aligns next field to 16 bytes
  uint32_t wr_idx;
  uint32_t padding1[3]; // Aligns next field to 16 bytes
  uint32_t rd_idx;
  uint32_t padding2[3]; // Aligns next field to 16 bytes
  routing_cmd_t contents[4];
};

Nothing too surprising in eth_queue_t; it starts with some counters that the base firmware increments in various scenarios, then wr_idx and rd_idx, and then space for four elements. The size of the queue, which is always between zero and four (inclusive), is given by (wr_idx - rd_idx) % 8. An empty queue will have wr_idx == rd_idx, whereas a full queue will have (wr_idx - rd_idx) % 8 == 4. To push on to the queue, assuming it isn't full, populate contents[wr_idx % 4] then do wr_idx = (wr_idx + 1) % 8. To pop from the queue, assuming it isn't empty, consume contents[rd_idx % 4] and then do rd_idx = (rd_idx + 1) % 8. Aside: the choice of % 8 is unfortunate; % 2^32 would have worked equally well, and % 2^32 is completely free on any 32-bit or 64-bit CPU (whereas % 8 is very cheap but not quite free).
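
For illustration, here's a minimal sketch of that queue discipline as plain C helpers; the helper names are my own, not from any Tenstorrent header, and real host code additionally needs volatile accesses and fencing, as shown later:

// Sketch of the queue discipline described above; helper names are mine.
static int eth_queue_is_full(const eth_queue_t* q) {
  return ((q->wr_idx - q->rd_idx) % 8) == 4;
}

static int eth_queue_is_empty(const eth_queue_t* q) {
  return q->wr_idx == q->rd_idx;
}

static void eth_queue_push(eth_queue_t* q, const routing_cmd_t* c) {
  // Caller must first check that the queue isn't full.
  q->contents[q->wr_idx % 4] = *c;
  q->wr_idx = (q->wr_idx + 1) % 8;
}

static void eth_queue_pop(eth_queue_t* q, routing_cmd_t* out) {
  // Caller must first check that the queue isn't empty.
  *out = q->contents[q->rd_idx % 4];
  q->rd_idx = (q->rd_idx + 1) % 8;
}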

Each element of the queue is an instance of the routing_cmd_t structure:

struct routing_cmd_t {
  uint32_t target_addr;
  uint16_t target_noc_xy; // From lo to hi: 4 bits zero, 6 bits NoC X, 6 bits NoC Y
  uint16_t target_shelf_xy; // From lo to hi: 6 bits shelf-level X, 6 bits shelf-level Y, 4 bits unused
  union {
    uint32_t inline_data;
    uint32_t data_block_length;
  };
  uint32_t flags;
  uint16_t target_rack_xy; // From lo to hi: 8 bits rack X (rack #), 8 bits rack Y (shelf #)
  uint16_t reserved[5];
  uint32_t data_block_dma_addr;
};

// Request flags:
#define CMD_WR_REQ         (1u << 0)
#define CMD_RD_REQ         (1u << 2)
#define CMD_DATA_BLOCK_DMA (1u << 4)
#define CMD_DATA_BLOCK     (1u << 6)
#define CMD_BROADCAST      (1u << 7)
#define CMD_USE_NOC1       (1u << 9)
#define CMD_TIMESTAMP      (1u << 10)
#define CMD_ORDERED        (1u << 12)

// Response flags:
#define CMD_WR_ACK                 (1u << 1)
#define CMD_RD_DATA                (1u << 3)
#define CMD_DATA_BLOCK_UNAVAILABLE (1u << 30)
#define CMD_DEST_UNREACHABLE       (1u << 31)

This structure requires slightly more explanation. A request will be either CMD_WR_REQ or CMD_RD_REQ, along with a bunch of optional flags. If we ignore the CMD_BROADCAST flag, these write requests and read requests target a particular location in the address space of a particular tile. The tile-local address is given in the target_addr field, and the tile in question is identified by a combination of the target_noc_xy and target_shelf_xy and target_rack_xy fields. That is, rather than using IPv4 or IPv6 addresses, a custom 6-dimensional addressing scheme is used. We already saw the NoC X and Y dimensions in part 1, noting that they are interleaved versus the actual physical placement of tiles, which is why (amongst other things) the right edge appears as the middle column and the bottom edge appears as the middle row:

If there are multiple Wormhole ASICs in a single server, then they too can be arranged into a logical grid, giving the shelf-level X and Y dimensions:

Finally, an aisle of server racks in a datacenter gives rack-level X (rack #) and Y (shelf #) dimensions:
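
Putting the three coordinate pairs together, the bit layouts in the routing_cmd_t field comments above suggest a packing helper along these lines (a sketch of my own, not code from the headers):

// Pack the 6-dimensional target address into a routing_cmd_t, following
// the bit layouts in the field comments above. The helper name is my own.
static void set_cmd_target(routing_cmd_t* c,
                           uint32_t rack_x, uint32_t rack_y,
                           uint32_t shelf_x, uint32_t shelf_y,
                           uint32_t noc_x, uint32_t noc_y,
                           uint32_t addr) {
  c->target_rack_xy  = (uint16_t)((rack_x << 0) | (rack_y << 8));
  c->target_shelf_xy = (uint16_t)((shelf_x << 0) | (shelf_y << 6));
  c->target_noc_xy   = (uint16_t)((noc_x << 4) | (noc_y << 10));
  c->target_addr     = addr;
}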

That completes the detour describing the addressing scheme. Returning to the routing_cmd_t structure, the data to be written (for write requests) can either be a 4 byte value in inline_data, or a small block of data up to 1KB in size somewhere near the routing_cmd_t structure (set the CMD_DATA_BLOCK flag, put the length in data_block_length), or a large block of data up to 3.75GB in size sitting in host DRAM accessible via DMA (set both CMD_DATA_BLOCK and CMD_DATA_BLOCK_DMA, put the length in data_block_length, and the physical memory address in data_block_dma_addr - the kernel-mode driver can be used to obtain such addresses). For read requests, the options are similar: a 4 byte result can go directly in inline_data, or a small result up to 1KB in size can be written somewhere near the routing_cmd_t structure, or a large result up to 3.75GB in size can be written to host DRAM via DMA. The routing_cmd_t structure is used for responses as well as requests, though a different set of flags is applicable to responses, and the only interesting fields on responses are flags and inline_data. The high bits of response flags indicate errors, while the low four bits should contain either CMD_WR_ACK or CMD_RD_DATA. Everything is then wrapped up in an eth_base_firmware_queues_t structure:

struct eth_base_firmware_queues_t {
  uint64_t latency_counter[16];
  eth_queue_t sq; // Contains requests, for host -> E tile
  eth_queue_t reserved;
  eth_queue_t cq; // Contains responses, for E tile -> host
  char padding[4096 - sizeof(uint64_t)*16 - sizeof(eth_queue_t)*3];
  char buffers[4][1024];
};

Skipping over the latency_counter field, this contains a submission queue (sq), in to which the host pushes routing_cmd_t objects containing requests, and a completion queue (cq) from which the host pops routing_cmd_t objects containing responses. Each of the index fields has a single writer:

Field      Writer                       Readers
sq.wr_idx  Host (as part of pushing)    Host, E tile
sq.rd_idx  E tile (as part of popping)  Host, E tile
cq.wr_idx  E tile (as part of pushing)  Host, E tile
cq.rd_idx  Host (as part of popping)    Host, E tile

The buffers field contains four 1KB buffers, used for requests or responses which have CMD_DATA_BLOCK set, but CMD_DATA_BLOCK_DMA unset. In such cases, request sq.contents[i] uses buffers[i], and response cq.contents[i] also uses buffers[i]. A little bit of care is required to ensure that a buffer isn't used by two different routing_cmd_t objects at once, but assuming that the queue indices start off aligned, and that every request generates a response, then the response to sq.contents[i] will end up in cq.contents[i], and at most one of these two things will require buffers[i].
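
As a sketch of how a small (non-DMA) block write might be staged under that convention - my own illustration, not code from the base firmware headers - the payload is copied into the buffer matching the submission queue slot that the request will occupy:

// Stage a small (<= 1KB) block write: the payload goes into buffers[i] for
// the same slot i that the request will occupy in sq. Assumes the caller
// has already checked that sq isn't full, and that len <= 1024.
static void stage_block_write(eth_base_firmware_queues_t* q,
                              routing_cmd_t* c,
                              const void* payload, uint32_t len) {
  uint32_t slot = q->sq.wr_idx % 4;
  memcpy(q->buffers[slot], payload, len);
  c->flags = CMD_WR_REQ | CMD_DATA_BLOCK;
  c->data_block_length = len;
}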

Each E tile contains a single eth_base_firmware_queues_t structure in its SRAM, the address of which is stored at tile-local address 0x170. The host uses PCIe reads and writes to interact with this structure, and it is the responsibility of host software to avoid having multiple host threads interact with the same structure at the same time. The host can submit requests to read/write against any tile in the 6-dimensional space, and the base firmware on the E tile to which the request is submitted will do one of three things:

  1. If the target tile is the E tile itself, the request can be completed using RISC-V load/store instructions.
  2. Otherwise, if the target tile is on the same ASIC as the E tile, the request can be completed using NoC #0 (default) or NoC #1 (if CMD_USE_NOC1 is set).
  3. Otherwise, the request can be forwarded to a different E tile; either to the E tile at the other end of the ethernet link, or to one of the other E tiles on the same ASIC. The receiving E tile will then do one of the same three things.

In the simple setup of a single n300s board, the rack # is 0, the shelf # is 0, and then the shelf-level coordinates are (0, 0) for the ASIC connected to PCIe and (1, 0) for the other ASIC. In more complex setups, tt-topology should be used to assign coordinates to ASICs.

Back in part 2, we obtained the value of RV_ADDR_NOC0_MC_DISABLE_COL on the 1st ASIC, thereby determining which rows were disabled. Knowing what we now know about ethernet tiles, we can obtain RV_ADDR_NOC0_MC_DISABLE_COL on both ASICs. To make things interesting, we'll have the host make a request to tile E10 at NoC coordinates (8, 6), but have the target of the request be RV_ADDR_NOC0_MC_DISABLE_COL on tile E2 at coordinates (8, 0). When targeting the 2nd ASIC, this'll require an ethernet hop between E8 and E0, as shown:

Continuing with the code from part 2, which deliberately eschews the useful software layers provided by Tenstorrent, we can start by obtaining the base firmware queues structure on tile (8, 6):

char* l1_tlb = set_tlb(dev, TLB_IDX_0, TLB_CFG_UNICAST(8, 6), 0);
uint32_t q_addr = *(volatile uint32_t*)(l1_tlb + 0x170);
eth_base_firmware_queues_t* q = (eth_base_firmware_queues_t*)(l1_tlb + q_addr);

We can then prepare the request for the base firmware, setting the target as RV_ADDR_NOC0_MC_DISABLE_COL on tile (8, 0):

routing_cmd_t c;
c.target_rack_xy = (0 << 0) + (0 << 8);
c.target_shelf_xy = (shelf_x << 0) + (shelf_y << 6);
c.target_noc_xy = (8 << 4) + (0 << 10);
c.target_addr = RV_ADDR_NOC0_MC_DISABLE_COL;
c.flags = CMD_RD_REQ;

Dispatching the command to the submission queue and reaping the result from the completion queue involves some grungy work:

void do_eth_cmd(eth_base_firmware_queues_t* q, routing_cmd_t* c) {
  // Spin while sq full
  uint32_t wr_idx = q->sq.wr_idx;
  uint32_t rd_idx;
  do {
    rd_idx = *(volatile uint32_t*)&q->sq.rd_idx;
  } while ((wr_idx - rd_idx) & 4u);

  // Push to sq
  routing_cmd_t* qc = q->sq.contents + (wr_idx & 3u);
  *(volatile __m256i*)qc = _mm256_loadu_si256((__m256i*)c);
  _mm_sfence();
  *(volatile uint32_t*)&q->sq.wr_idx = (wr_idx + 1) & 7u;

  // Spin while cq empty
  rd_idx = q->cq.rd_idx;
  do {
    wr_idx = *(volatile uint32_t*)&q->cq.wr_idx;
  } while (rd_idx == wr_idx);

  // Wait for cq entry to be populated
  qc = q->cq.contents + (rd_idx & 3u);
  do {
    _mm256_storeu_si256((__m256i*)c, *(volatile __m256i*)qc);
  } while (c->flags == 0);

  // Pop from cq
  *(volatile uint32_t*)&q->cq.rd_idx = (rd_idx + 1) & 7u;
}

One subtle point in the above is that the base firmware initially pushes an entry on to the completion queue with flags set to zero, and then populates the entry properly as a 2nd stage, so (somewhat unfortunately) two loops are required to pop from the completion queue.

We can then wrap all this up in some more loops:

for (uint32_t shelf_y = 0; shelf_y < 2; ++shelf_y) {
  for (uint32_t shelf_x = 0; shelf_x < 2; ++shelf_x) {
    routing_cmd_t c;
    c.target_rack_xy = (0 << 0) + (0 << 8);
    c.target_shelf_xy = (shelf_x << 0) + (shelf_y << 6);
    c.target_noc_xy = (8 << 4) + (0 << 10);
    c.target_addr = RV_ADDR_NOC0_MC_DISABLE_COL;
    c.flags = CMD_RD_REQ;
    do_eth_cmd(q, &c);
    printf("(%u, %u) -> ", shelf_x, shelf_y);
    if (c.flags == CMD_RD_DATA) {
      printf("value %u\n", c.inline_data);
    } else {
      printf("error %#08x\n", c.flags);
    }
  }
}

For my n300s board, the above prints:

(0, 0) -> value 3137
(1, 0) -> value 2121
(0, 1) -> error 0x80000008
(1, 1) -> error 0x80000008

There are CMD_DEST_UNREACHABLE errors for target_shelf_xy of (0, 1) and (1, 1), which makes sense. The value 3137 is what we already saw in part 2, and corresponds to bitmask 0b110001000001. The value 2121 is new, and tells us that the bitmask for the 2nd ASIC is 0b100001001001. This means that the 128 usable T tiles on my n300s board are at:
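
To make the mask-to-rows decoding mechanical, here's a small sketch; it assumes that bit y of the mask corresponds to NoC row y and that (per part 1) rows 0 and 6 hold the E tiles, so every other set bit marks a harvested row of T tiles:

// Decode an RV_ADDR_NOC0_MC_DISABLE_COL value into harvested T tile rows,
// assuming rows 0 and 6 are the ethernet rows (always set in the mask).
static void print_harvested_rows(uint32_t mask) {
  for (uint32_t y = 0; y < 12; ++y) {
    if (((mask >> y) & 1u) && y != 0 && y != 6) {
      printf("T row at NoC Y = %u is harvested\n", y);
    }
  }
}

For 2121 (0b100001001001) this reports rows 3 and 11; for 3137 (0b110001000001) it reports rows 10 and 11.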

At this point we could use a routing_cmd_t to send new RISC-V code to the E tiles on the 2nd ASIC and have that code implement a communication protocol of our choosing rather than relying on the base firmware, but this blog post is long enough already, so it'll have to wait for another time. The complete code for this post consists of 201 lines, though half of that is unchanged from part 2. That wraps up part 4; if you're reading along, then part 5 is next.

Tenstorrent Wormhole Series Part 3: NoC propagation delay

Continuing the trend from part 2 of eschewing the useful software layers provided by Tenstorrent, and instead manually poking around in various address spaces, every T tile contains a 64-bit counter at tile-local address 0xFFB121F0 and 0xFFB121F8 which starts at zero when the chip is powered on, and increments by one every clock cycle. Every T tile also contains a soft-reset register at tile-local address 0xFFB121B0; if this register contains 0x47800 then all five Baby RISC-V cores are held in soft reset, and then individual bits can be cleared to take individual cores out of soft reset (i.e. allow them to run).
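
As an aside, the host can read this counter too, through a TLB window covering the 0xFFBxxxxx register range (as configured in part 2). The two 32-bit halves can't be read atomically over PCIe, so a coherent read wants the usual high/low/high dance; a minimal sketch, assuming 0xFFB121F0 holds the low word and 0xFFB121F8 the high word:

// Read one tile's 64-bit cycle counter from the host, via a TLB window
// (reg_tlb from set_tlb in part 2, configured for a 16MB piece covering the
// 0xFF000000+ register range). Retries if the high word changed mid-read.
static uint64_t read_cycle_counter(char* reg_tlb) {
  for (;;) {
    uint32_t hi1 = *(volatile uint32_t*)(reg_tlb + 0xFFB121F8);
    uint32_t lo  = *(volatile uint32_t*)(reg_tlb + 0xFFB121F0);
    uint32_t hi2 = *(volatile uint32_t*)(reg_tlb + 0xFFB121F8);
    if (hi1 == hi2) return ((uint64_t)hi1 << 32) | lo;
  }
}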

With these two pieces of information, we can do something interesting: use a NoC multicast write to take one core out of reset on every tile, have RISC-V code on every tile record its cycle counter somewhere as soon as it comes out of reset, then collect and plot the results.

In order to start from a clean slate, we'll want to use a NoC multicast write to put all cores into soft-reset, specifying (0, 0) through (9, 11) inclusive as the multicast rectangle, and relying on the multicast disable row/column we saw in part 2 to ensure that the multicast only goes to T tiles:

#define RV_ADDR_SOFT_RESET 0xFFB121B0

#define SOFT_RESET_ALL_CORES 0x47800

char* reg_tlb = set_tlb(dev, TLB_IDX_UC0, TLB_CFG_MULTICAST(0, 0, 9, 11), RV_ADDR_SOFT_RESET);
*(volatile uint32_t*)(reg_tlb + RV_ADDR_SOFT_RESET) = SOFT_RESET_ALL_CORES;

With all the cores held in soft-reset, it is safe to send them new code. The SRAM (or, for D tiles, DRAM) within a tile starts at tile-local address 0, and execution will also start at address 0 when soft-reset is cleared, so we can send some RISC-V code to tile-local addresses starting at 0, again multicasting it out. The code will read from the tile-local 64-bit cycle counter at 0xFFB121F0 and 0xFFB121F8, then write it to tile-local address 128:

const uint32_t rv_code[] = {
  0xFFB12537, // lui a0, 0xFFB12
  0x1F052583, // lw a1, 0x1F0(a0)
  0x1F852603, // lw a2, 0x1F8(a0)
  0x08B02023, // sw a1, 128(x0)
  0x08C02223, // sw a2, 132(x0)
  0x0000006F, // loop: j loop
};
char* l1_tlb = set_tlb(dev, TLB_IDX_0, TLB_CFG_MULTICAST(0, 0, 9, 11), 0);
memcpy(l1_tlb, rv_code, sizeof(rv_code));

We can then perform a multicast to bring one core out of reset on each T tile, by clearing the lowest set bit of the reset mask:

*(volatile uint32_t*)(reg_tlb + RV_ADDR_SOFT_RESET) = SOFT_RESET_ALL_CORES & (SOFT_RESET_ALL_CORES - 1);

We can't use multicast to collect the results - instead we need to perform a unicast read against each T tile in turn. That requires knowing the tile coordinates of each T tile, and said grid isn't entirely regular: it'll be disturbed by a column of D tiles, and be disturbed by a row of E tiles, and have one or two disabled rows. We can sidestep this problem by using a convenient translation feature: an X coordinate of 16 will be replaced with 0 (PCIe / ARC / D column), 17 will be replaced with 5 (2nd D column), then 18 through 25 will be replaced with the column indices containing T tiles. Similarly, a Y coordinate of 16 will be replaced with 0 (E0-E7 row), 17 will be replaced with 6 (E8-E15 row), and 18 through 25 or 26 will be replaced with whatever row indices contain active T tiles (if you need a reminder of the coordinate grid, see part 1). This allows us to easily iterate over the active T tiles:

uint64_t times[8][8];
for (uint32_t y = 0; y < 8; ++y) {
  for (uint32_t x = 0; x < 8; ++x) {
    l1_tlb = set_tlb(dev, TLB_IDX_0, TLB_CFG_UNICAST(18 + x, 18 + y), 0);
    times[y][x] = *(volatile uint64_t*)(l1_tlb + 128);
  }
}

For neatness, we can then put everything back into reset:

*(volatile uint32_t*)(reg_tlb + RV_ADDR_SOFT_RESET) = SOFT_RESET_ALL_CORES;

With T denoting the minimum value seen in the times matrix, I observe:

If instead multicasting via NoC #1 (by adding TLB_CFG_NOC1 to the TLB_CFG_MULTICAST result), and calling S the minimum value seen this time, I observe:

Both sets of measurements suggest that the tile-to-tile propagation delay might be around 9 cycles, but the numbers are far from perfect. The imperfections are very clear if we plot both sets of measurements at the same time, and look at just the row containing the PCIe tile:

Going rightwards, the first tile is "T+3" and the last is "T+75", meaning 72 cycles to traverse 8 tiles. Going leftwards, the first tile is "S+0" and the last is "S+72", again meaning 72 cycles to traverse 8 tiles. However, going rightwards, the 2nd tile is "T+0", which isn't great: taken at face value it would mean that the multicast reached the 2nd tile before reaching the first, which is clearly nonsense. There is one obvious explanation for this: the cycle counters on different tiles aren't perfectly aligned - they're meant to all start from 0 when the chip powers on, but powering on is a physically complex process, so some tiles might start counting a few cycles before or after others.

If the tile-to-tile latency was identical for every hop, and we called this unknown quantity X, then what we'd hope to see is:

Regardless of what S or T or X actually are, it so happens that the average of the two expressions in each tile is (S + T)/2 + 4X. As this expression should be the same for all tiles, we can use it to correct for the different counter start times between the different tiles. We need to assume that there is a per-tile counter adjustment, with all readings taken on a given tile adjusted by the same amount, and then set those adjustments so that "should be the same" becomes "is the same". Because I'm lazy, I'll assume that all tiles within a given column have the same adjustment, which isn't quite true, but it'll do for now. After computing and applying this adjustment, the NoC #0 measurements are:

And NoC #1 are:
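
For concreteness, here's one way such a per-column adjustment could be computed; this is a sketch under the assumptions above, with the two measurement runs saved into hypothetical times_noc0 and times_noc1 matrices:

// Compute a per-column counter adjustment, assuming (as argued above) that
// the per-tile average (noc0 + noc1) / 2 ought to be the same everywhere.
// times_noc0 and times_noc1 are the raw [y][x] readings from the two runs.
double col_avg[8], adjusted_noc0[8][8], adjusted_noc1[8][8];
for (uint32_t x = 0; x < 8; ++x) {
  col_avg[x] = 0.0;
  for (uint32_t y = 0; y < 8; ++y) {
    col_avg[x] += (double)(times_noc0[y][x] + times_noc1[y][x]) * 0.5;
  }
  col_avg[x] /= 8.0;
}
for (uint32_t y = 0; y < 8; ++y) {
  for (uint32_t x = 0; x < 8; ++x) {
    double adj = col_avg[x] - col_avg[0];  // column 0 as an arbitrary reference
    adjusted_noc0[y][x] = (double)times_noc0[y][x] - adj;
    adjusted_noc1[y][x] = (double)times_noc1[y][x] - adj;
  }
}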

The results still aren't perfect, but they're good enough for me to conclude that the tile-to-tile propagation delay is 9 clock cycles (i.e. 9 nanoseconds when the clock is running at 1GHz), and that imperfections in measurements are due to the aforementioned laziness. For tile-to-tile communication there'll be some latency to get on to the NoC, then a propagation delay for every traversed tile, and then some latency to get off the NoC. For messages requiring a response, there'll be all that twice, as after the request has done all that, the response needs to get on to the NoC, then propagate back to the requestor, then get off the NoC. For NoC reads (and presumably NoC writes-with-acknowledgement, if you use them), that response travels on the same NoC as the request, so if requester and responder are in the same row, the combination of request and response will have 10 tiles (90 cycles) of propagation delay, versus 12 tiles (108 cycles) of propagation delay if they're in the same column, and 10+12 tiles (198 cycles) if they're in different row and column.

That wraps up part 3. The complete code comes out to 164 lines, but a lot of it is common with part 2's 100 lines. If you're reading along, part 4 is next.

Tenstorrent Wormhole Series Part 2: Which disabled rows?

Previously, we considered the physicalities of a Tenstorrent Wormhole card, ending on the remark that one or two rows of T tiles will be disabled in every chip shipped to customers. That naturally begs the question: if you're a customer with a chip (like me), how do you determine which rows are disabled?

It should be noted that most people shouldn't need to care about this question, as a combination of various Tenstorrent-provided software layers should successfully abstract away this difference. That said, I'm not most people; I want to characterise and understand how these cards work at a very low level. Consequently, I'm going to be ignoring most of the Tenstorrent-provided software layers; the kernel-mode driver is fine, and some closed-source firmware is unavoidable at the moment, but I'll ignore the user-mode driver along with all of TT-NN/TT-Metalium and TT-Buda. Again, if you are most people, you probably want to be using those software layers rather than doing what I'm about to do.

Opening the kernel driver is simple enough:

int fd = open("/dev/tenstorrent/0", O_RDWR | O_CLOEXEC);
ASSERT(fd >= 0);

We can then ask the kernel driver what memory ranges (i.e. PCIe bars) it has available for mapping:

#define TENSTORRENT_IOCTL_QUERY_MAPPINGS 0xFA02

struct tenstorrent_mapping {
  uint32_t mapping_id;
  uint32_t reserved;
  uint64_t mapping_base;
  uint64_t mapping_size;
};

#define TENSTORRENT_MAPPING_RESOURCE0_UC 1
#define TENSTORRENT_MAPPING_RESOURCE0_WC 2
#define TENSTORRENT_MAPPING_RESOURCE2_UC 5

unsigned char resource_to_mapping[8] = {0};
struct tenstorrent_mapping mappings[sizeof(resource_to_mapping) + 1];
mappings[0].mapping_size = sizeof(resource_to_mapping);
ASSERT(ioctl(fd, TENSTORRENT_IOCTL_QUERY_MAPPINGS, &mappings[0].mapping_size) >= 0);
mappings[0].mapping_size = 0;
for (unsigned i = 1; i <= sizeof(resource_to_mapping); ++i) {
  uint32_t resource = mappings[i].mapping_id;
  if (resource < sizeof(resource_to_mapping)) {
    resource_to_mapping[resource] = i;
  }
}

To make some future things easier, I want to map these resources in a very particular way:

  1. The first 464MB of resource 0, as write-combining memory.
  2. Then the next 32MB of resource 0, as uncacheable memory.
  3. Then the middle/final 16MB of resource 2, as uncacheable memory.

This sums to a neat 512MB, so it needs one mmap call to reserve a contiguous 512MB range of virtual address space, followed by one mmap call per resource range. If resource 0 isn't available as WC, or less than 464MB is available as WC, then mapping it as uncacheable is an acceptable fallback:

#define BAR0_WC_SIZE (464 << 20)
#define BAR0_SIZE    (496 << 20)
#define MMAP_SIZE    (512 << 20)

#define BAR4_SOC_TARGET_ADDRESS 0x1E000000

struct tenstorrent_mapping* bar0uc = mappings + resource_to_mapping[TENSTORRENT_MAPPING_RESOURCE0_UC];
struct tenstorrent_mapping* bar0wc = mappings + resource_to_mapping[TENSTORRENT_MAPPING_RESOURCE0_WC];
struct tenstorrent_mapping* bar4uc = mappings + resource_to_mapping[TENSTORRENT_MAPPING_RESOURCE2_UC];
ASSERT(bar0uc->mapping_size >= BAR0_SIZE);
ASSERT(bar4uc->mapping_size >= MMAP_SIZE - BAR4_SOC_TARGET_ADDRESS);

char* dev = mmap(NULL, MMAP_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ASSERT(dev != MAP_FAILED);

uint32_t wc_size = bar0wc->mapping_size;
if (wc_size) {
  if (wc_size > BAR0_WC_SIZE) {
    wc_size = BAR0_WC_SIZE;
  }
  if (mmap(dev, wc_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, bar0wc->mapping_base) == MAP_FAILED) {
    wc_size = 0;
  }
}
ASSERT(mmap(dev + wc_size, BAR0_SIZE - wc_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, bar0uc->mapping_base + wc_size) != MAP_FAILED);
ASSERT(mmap(dev + BAR0_SIZE, MMAP_SIZE - BAR0_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, bar4uc->mapping_base + (BAR0_SIZE - BAR4_SOC_TARGET_ADDRESS)) != MAP_FAILED);

This gives us a 512MB window for talking to the Wormhole ASIC, but what wonders does this window contain? It so happens that the 16MB from resource 2 contains an assortment of configuration registers on the ARC and PCIe tiles. Meanwhile, the remainder of the window maps to the PCIe tile, and any read/write performed by the host against this window gets translated into a NoC read/write by the PCIe tile. The details of that translation can be tweaked by using some configuration registers in the aforementioned resource 2. The general shape cannot be tweaked: the 496MB range subdivides into 156 pieces of size 1MB, 10 pieces of size 2MB, and 20 pieces of size 16MB. After that, things get tweakable: for each piece, we can specify the X/Y coordinates of the tile on the NoC to read/write (or the X/Y coordinates of a rectangular range of tiles for multicast writes), which 1MB/2MB/16MB-aligned range of address space within the tile to target, whether to use NoC #0 or #1, and a few other properties. The Tenstorrent software calls these pieces TLBs, which is not to be confused with the TLB used within a CPU to translate between virtual and physical addresses. Mapping the first 464MB of resource 0 as write-combining means that most of the pieces are write-combining; only the final two 16MB pieces fall within the uncacheable part (note that this WC/UC difference only affects whether the host buffers up writes before passing them along to the PCIe tile; once the PCIe tile receives the PCIe transaction, it doesn't care whether WC or UC was used to get there).

The configuration registers controlling these pieces start at address 0x1FC00000, and consist of 8 bytes per piece. We can wrap up the details within a set_tlb function, which takes a piece index (0 ≤ idx < 156+10+20) and details of what to target, configures that piece, and then returns a pointer to the relevant piece:

#define TLB_CONFIG_ADDR 0x1FC00000

#define TLB_CFG_UNICAST(x, y) (((y) << 6) + (x))
#define TLB_CFG_MULTICAST(x_start, y_start, x_end, y_end) ((1 << 25) + ((y_start) << 18) + ((x_start) << 12) + ((y_end) << 6) + (x_end))
#define TLB_CFG_NOC1 (1 << 24)

static char* set_tlb(char* dev, uint32_t idx, uint64_t cfg, uint32_t suitable_for_addr) {
  char* result = dev;
  uint32_t abits;
  if (idx < 156) {
    abits = 20;
    result += (idx << 20);
  } else if (idx < 166) {
    abits = 21;
    result += (156 << 20) + ((idx - 156) << 21);
  } else {
    abits = 24;
    result += (156 << 20) + (10 << 21) + ((idx - 166) << 24);
  }
  cfg = (cfg << (36 - abits)) + (suitable_for_addr >>= abits);
  ((volatile uint64_t*)(dev + TLB_CONFIG_ADDR))[idx] = cfg;
  return result - (suitable_for_addr << abits);
}

We can use set_tlb to go and poke around in the address space of any tile on the NoC. I'm going to interrogate the ethernet tile at logical coordinates (1, 0), as ethernet tiles are never disabled in the way that T tiles can be. Like most of the tiles containing RISC-V cores, its tile-local address space contains various interesting things at/above address 0xFF000000, including "multicast disable row" and "multicast disable column" at 0xFFB20108 and 0xFFB20110:

#define TLB_IDX_UC0 184

#define RV_ADDR_NOC0_MC_DISABLE_ROW 0xFFB20108
#define RV_ADDR_NOC0_MC_DISABLE_COL 0xFFB20110

char* reg_tlb = set_tlb(dev, TLB_IDX_UC0, TLB_CFG_UNICAST(1, 0), RV_ADDR_NOC0_MC_DISABLE_ROW);
printf("%u, %u\n",
    *(volatile uint32_t*)(reg_tlb + RV_ADDR_NOC0_MC_DISABLE_ROW),
    *(volatile uint32_t*)(reg_tlb + RV_ADDR_NOC0_MC_DISABLE_COL));

On my system, for my card, this prints 33, 3137, which in binary is 0b100001, 0b110001000001. Plotting these masks as "X"s on the tile grid gives:

The tiles not marked with "X" are the usable T tiles, meaning that for my chip, what I have is effectively:

I suspect that the final two rows of T tiles were disabled for consistency reasons rather than because of actual defects, but I'll never know for sure!

That wraps up part 2. The complete code comes out to 100 lines, which isn't too shabby. If you're reading along, part 3 is next.

Tenstorrent Wormhole Series Part 1: Physicalities

A company called Tenstorrent design and sell PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards. After taking the shell and the heatsinks off, the underlying circuit board looks like:

[Wormhole n150s | Wormhole n300s]
(Circuit board photos from Tenstorrent's own sales pages)

At the top of the board, there are two QSFP-DD cages for ethernet, which are intended for direct-attach cables between cards, as shown here. The left edge has a PCIe 4.0 x16 connector, which is connected to the immediately adjacent Wormhole ASIC. The Wormhole ASIC is surrounded by six GDDR6 chips (two on the left, four on the right), with each GDDR6 chip containing two 1GB banks, for a total of 12GB GDDR6 connected to the Wormhole ASIC. The right edge has two of what Tenstorrent call a Warp 100 Bridge connector, which seems to be a proprietary 100Gb ethernet connector. After this, the two boards diverge: the n300s card has another Wormhole ASIC with another 12GB of GDDR6 attached, whereas the n150s card has a bunch of empty space. The 2nd Wormhole ASIC on the n300s card is not connected to PCIe. Instead, on the n300s card, there are internal ethernet links between the two ASICs. Accordingly, for the host to communicate with the 2nd ASIC, it needs to use PCIe to communicate with the 1st ASIC, and ask the 1st ASIC to use its internal ethernet to communicate with the 2nd ASIC's internal ethernet.

The Wormhole ASIC itself is described as a 10 by 12 grid of tiles, arranged as:

There are a few different types of tile:

Kind  Count                Contents (per tile)
ARC   1                    1x Argonaut RISC Core (also connected to PCIe)
D     6 x 3                Bridge to 2x 1GB GDDR6 (shared by three tiles)
E     16                   1x Baby RISC-V CPU (E variant)
                           256K SRAM
                           Bridge to 100Gb ethernet (if connected)
PCIe  1                    Bridge to host over PCIe
T     80 (not all usable)  5x Baby RISC-V CPU (B/T/T/T/NC variants)
                           1.5MB SRAM
                           1x Matrix unit (2048 multipliers, each 5b x 7b)
                           1x Vector/SIMD unit (32 lanes, each 32b wide)

Each E tile can manage 100Gb ethernet (that is, simultaneous transmit at 100Gb/s and receive at 100Gb/s). On the n300s card, E8 of the 1st ASIC is connected to E0 of the 2nd ASIC, and E9 of the 1st ASIC is connected to E1 of the 2nd ASIC. Other tiles are connected to the QSFP-DD cages and to the Warp 100 Bridge connectors (possibly two E tiles on the 2nd ASIC route to the Warp, and four E tiles on the 1st ASIC route to the QSFP-DD cages). There's not enough I/O on the n150s/n300s boards to allow all the E tiles of a single ASIC to be connected, let alone all the E tiles of two ASICs to be connected; you'd need to buy a Galaxy Server to find a form factor with enough I/O to saturate all the E tiles.

Each tile has four outbound connections to its (north/east/south/west) neighbours, along with four corresponding inbound connections from its neighbours, with each connection being 32 bytes wide (apparently upgraded to 64 bytes wide in the next-generation Blackhole ASIC). The combination of the east-bound and south-bound channels are called NoC (Network on Chip) #0, whilst the combination of the west-bound and north-bound channels are called NoC #1. Going eastward from the eastmost edge lands you on the westmost edge (e.g. going east from the T at (9, 3) gets you to PCIe at (0, 3)). A similar thing is true on the other three edges: they are each connected to the edge on the opposite side. Taking the eastbound connections in row 3 as an example, we have:

It looks like the path between most neighbours is short, but then the wraparound path is extremely long. This would make both software and hardware designers unhappy, so the physical reality interleaves tiles to equalize the distances:

A similar interleaving is done to the columns, meaning that the actual physical locations of the various tiles within the grid is closer to:

The various tile-to-tile connections aren't shown on the above, as it would be a horrible mess of overlapping and dodging lines. It does however match the circuit board photos: the DRAM tiles are on the left and right edges, and tiles routing to the same DRAM chip are adjacent (e.g. the three D5 tiles are physically adjacent, even though their logical Y coordinates are 3, 8, 4). The ethernet tiles also end up on the physical edges, which makes sense. Thankfully, software mostly doesn't need to worry about this interleaving (except for HARVESTING_NOC_LOCATIONS, which you'll note are the Y coordinates of an interleaved column of T tiles), and can instead stick to the original logical view:

That said, there's one physical reality that software can't escape: manufacturing defects. You could imagine TSMC having a master template Wormhole ASIC and then photocopying it thousands of times to create the ASICs that actually ship to customers. As with actual photocopying, the photocopies are slightly lower quality than the original. Sometimes the defects are benign, but other times the defects completely ruin the piece of logic or storage that they occur within.

Throwing away an entire ASIC because of a single defect would not be viable, so there are various strategies to deal with the defects. One such employed strategy is to disable (or "harvest") an entire row of T tiles if there is a defect in any of the T tiles in that row, effectively pretending that all the T tiles in that row are empty tiles (in the same way that e.g. (0, 2) is empty but still participates in the connection grid). An ASIC with one row of disabled T tiles can be used in an n150s board, which is why the n150s is reported as having 72 T tiles despite the ASIC having 80. Similarly, an ASIC with two rows of disabled T tiles can be used in an n300s board, which is why the n300s is reported as having 64 T tiles per ASIC despite there originally being 80. If the photocopying process had fewer defects than planned for, then one (n150s) or two (n300s) rows of T tiles are disabled regardless, for consistency (of what customers receive, of power usage, etc.).

The non-T tiles within a disabled row are kept, though they become slightly harder to address. In particular, some workloads want to really minimise latency to DRAM, so want to run on a T tile immediately adjacent to a D tile. Some rows have 2 D tiles and thus four T tiles immediately adjacent to a D tile, whereas other rows have only 1 D tile and thus only two T tiles immediately adjacent to a D tile. Whilst each PCIe card will have the same number of usable T tiles (72 for n150s, 64 x 2 for n300s), the number of usable T tiles immediately adjacent to a D tile will vary randomly from card to card depending on which rows were disabled.

That wraps up part 1. If you're reading along, part 2 is next.

Arm CSSC quick reference

The AArch64 FEAT_CSSC (Common Short Sequence Compression) extension adds a few instructions operating on general purpose registers. The extension is optional since Armv8.7, and mandatory since Armv8.9 if FEAT_AdvSIMD is implemented. The most interesting addition is a popcnt instruction operating on general purpose registers, but the full list of new instructions is:

CSSC instruction     Equivalent pre-CSSC sequence (†)
ABS Rd, Rn           TST  Rn, Rn
                     CNEG Rd, Rn, MI
CNT Rd, Rn           FMOV d0, Rn
                     CNT  v0.8B, v0.8B
                     ADDV b0, v0.8B
                     FMOV Rd, s0
CTZ Rd, Rn           RBIT Rd, Rn
                     CLZ  Rd, Rd
SMAX Rd, Rn, Rm      CMP  Rn, Rm
                     CSEL Rd, Rn, Rm, GT
SMAX Rd, Rn, #i8     MOV  Rtmp, #i8
                     CMP  Rn, Rtmp
                     CSEL Rd, Rn, Rtmp, GT
SMIN Rd, Rn, Rm      CMP  Rn, Rm
                     CSEL Rd, Rn, Rm, LT
SMIN Rd, Rn, #i8     MOV  Rtmp, #i8
                     CMP  Rn, Rtmp
                     CSEL Rd, Rn, Rtmp, LT
UMAX Rd, Rn, Rm      CMP  Rn, Rm
                     CSEL Rd, Rn, Rm, HI
UMAX Rd, Rn, #u8     MOV  Rtmp, #u8
                     CMP  Rn, Rtmp
                     CSEL Rd, Rn, Rtmp, HI
UMIN Rd, Rn, Rm      CMP  Rn, Rm
                     CSEL Rd, Rn, Rm, LO
UMIN Rd, Rn, #u8     MOV  Rtmp, #u8
                     CMP  Rn, Rtmp
                     CSEL Rd, Rn, Rtmp, LO

(†) Except that the CSSC instructions do not mutate flags, and CSSC CNT does not mutate any SIMD registers.
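
In practice, the CNT row is the one most people meet indirectly: a plain popcount in C, as sketched below, can compile down to a single general-purpose-register CNT when the compiler is allowed to target CSSC (via an -march option with the cssc extension enabled; the exact spelling depends on the toolchain), rather than bouncing through the SIMD registers as in the pre-CSSC sequence.

#include <stdint.h>

// With CSSC available, the FMOV / CNT / ADDV / FMOV sequence in the table
// above collapses to a single CNT on a general purpose register.
uint64_t popcount_u64(uint64_t x) {
  return (uint64_t)__builtin_popcountll(x);  // GCC/Clang builtin
}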
