Anthropic are currently in the tech news for (re)producing a C compiler using $20,000 of Claude's time, but I'm more interested in their compiler performance take-home challenge from two weeks ago. One part of Anthropic is showing that compiler engineers might be obsolete in the medium term, whereas another part is trying to find and hire the best compiler engineers on the planet. That both things can be simultaneously true is amusing to me, but perhaps not surprising: software engineers should be in the game of automating away their current problems so that they can move onward to new problems. Just as the higher education sector has expert lecturers teaching novice undergraduates, the AI sector can have expert humans teaching novice AIs.

Anyway, enough philosophical musing, I want to look at that compiler take-home challenge. It revolves around this little computation graph:

Several copies of this graph can be chained together vertically ("rounds"), and then multiple copies can be placed side-by-side horizontally ("batch size"). For example, here are two vertical copies and three horizontal copies, along with the initial loads and final stores:

The full challenge involves 16 vertical copies and 32 horizontal copies, meaning 512 copies in total. The challenge is to take this computational graph and schedule it on a hypothetical CPU capable of ~10 instructions per cycle, trying to minimise the total cycle count. For example, scheduling it in 1300 cycles involves considering a grid which is ~10 cells wide and 1300 cells tall, then placing each box from the diagram in one of those cells. Placing boxes into grid cells isn't all that hard, but I've been glossing over an important fact which makes it hard: the ~10 cells in each grid row are not all the same. Each grid row in fact consists of 7½ "valu" cells, 2 "load" cells, 2 "store" cells, and finally 1 "flow" cell. Most of the boxes from the diagram have to go into a "valu" cell, but there are a few exceptions:

What looked simple now looks a bit harder: each of the 512 "gather" boxes requires 8 "load" cells, giving 4096 "load" cells in total, and with only 2 "load" cells per row the grid has to be at least 2048 cells tall. At this point, if I told you it was possible to make everything fit into a grid less than 1000 cells tall, you might think I was delusional. I assure you it is possible though, courtesy of two main strategies:

  1. Reducing the number of boxes on the diagram.
  2. Replacing some of the "+ base" and "gather" boxes with alternatives.
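To make the load-cell arithmetic concrete, here is a small sketch (the per-cycle slot counts come from the description above; the function name is mine) of the lower bound each slot type imposes on grid height:

```python
import math

# Slot cells available per cycle, i.e. per grid row.
SLOTS = {"valu": 7.5, "load": 2, "store": 2, "flow": 1}

def min_cycles(counts):
    """Lower bound on grid height: each slot type independently
    limits how quickly its instructions can be issued."""
    return max(math.ceil(counts.get(kind, 0) / cap)
               for kind, cap in SLOTS.items())

# 512 gathers at 8 "load" cells each bound the height on their own:
print(min_cycles({"load": 512 * 8}))  # 2048
```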

I don't have much to say about strategy one, as the majority of the changes can be shown using a single diagram:

Strategy two is more interesting. The gather operation can be replaced with a selection tree: preload the values at every possible idx, and use the outputs of all the earlier "& 1" boxes to select the appropriate value from among the preloaded ones. Each binary select operation requires either a single "flow" cell, or one or two "valu" cells. The number of binary select operations required to replace a "+ base" and "gather" pair then depends on the vertical level within the graph:

| Level | Gather-style | Select-style (ignoring one-off overheads) |
| --- | --- | --- |
| 0 | 1x "valu" + 8x "load" | Free |
| 1 | 1x "valu" + 8x "load" | 1x ("flow" or "valu") |
| 2 | 1x "valu" + 8x "load" | 2x ("flow" or "valu") + 1x ("flow" or 2x "valu") |
| 3 | 1x "valu" + 8x "load" | 4x ("flow" or "valu") + 3x ("flow" or 2x "valu") |
| 4 | 1x "valu" + 8x "load" | 8x ("flow" or "valu") + 7x ("flow" or 2x "valu") |
| 5 | 1x "valu" + 8x "load" | 16x ("flow" or "valu") + 15x ("flow" or 2x "valu") |
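Reading the pattern off the table (a closed form I have inferred from it, not something stated by the challenge itself): at level k ≥ 1, a select tree needs 2^(k-1) of the cheaper selects and 2^(k-1) − 1 of the pricier ones.

```python
def select_tree_cost(level):
    """Binary selects needed to replace one "+ base"/"gather" pair
    with a selection tree, ignoring one-off preload overheads.
    Returns (cheap, pricey): a cheap select fits in one "flow" or
    one "valu" cell; a pricey one in one "flow" or two "valu" cells."""
    if level == 0:
        return (0, 0)  # free: only one possible preloaded value
    cheap = 2 ** (level - 1)
    return (cheap, cheap - 1)

for lvl in range(6):
    print(lvl, select_tree_cost(lvl))
```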

More than 280 gathers can be gainfully replaced with selection trees. Doing so massively reduces the number of "load" cells required, at the cost of increasing the number of "flow" cells required. In turn, some of those "flow" cells can be traded for "valu" cells. Part of the challenge is finding just the right instruction mix so as to balance "valu", "load", and "flow" in the 7½ : 2 : 1 ratio. If the full graph is considered as a whole, it isn't too hard to balance the overall instruction counts to hit 7½ : 2 : 1. Unfortunately, this isn't enough: every single one of the ~1000 cycles wants to hit that 7½ : 2 : 1 ratio, which means instruction selection and instruction scheduling are intertwined problems. This is where the real meat of the challenge is found: the space of possible instruction selections and schedules is vast, and so the winner is the person (or system) capable of finding the best point in that search space. If you can find a good point, submit it to the leaderboard (or email Anthropic asking for a job).
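To illustrate why per-cycle balance is stricter than overall balance, here is a toy feasibility check on a candidate schedule. The challenge doesn't say here how the half "valu" slot is exposed, so I've modelled it as an 8th "valu" slot on even cycles only; that modelling choice and the names are mine.

```python
# Per-cycle caps for the whole-numbered slot types.
CAPS = {"load": 2, "store": 2, "flow": 1}

def fits(schedule):
    """True if every cycle of the schedule (a list of dicts mapping
    instruction kind -> count issued) respects the slot limits.
    The half "valu" slot is assumed to be an 8th slot available on
    even cycles only."""
    for i, cycle in enumerate(schedule):
        if cycle.get("valu", 0) > (8 if i % 2 == 0 else 7):
            return False
        if any(cycle.get(kind, 0) > cap for kind, cap in CAPS.items()):
            return False
    return True

# A perfectly balanced pair of cycles: 15 valu + 4 load + 2 flow
# over two cycles, i.e. exactly the 7.5 : 2 : 1 ratio.
print(fits([{"valu": 8, "load": 2, "flow": 1},
            {"valu": 7, "load": 2, "flow": 1}]))  # True
```

Note that the same instruction counts fail if bunched up: 10 "valu" in one cycle and 5 in the next averages fine but fits in neither cycle, which is exactly why selection and scheduling can't be solved separately.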

You might also ask whether this challenge is a reasonable proxy for actual compiler engineering experience. It is easy to cheat the default test harness, but I'm going to ignore that: the objective is clearly to find compiler engineers rather than security engineers, and human review of submissions can easily catch such cheating. There are also various well-known compiler problems which the challenge doesn't touch: there's no control flow, all instructions have single-cycle latency, memory hazards can be (almost) entirely ignored, there are no cache effects, and no doubt all sorts of other things. The target machine is clearly fictitious, but there is more VLIW hardware in the AI space than many people realise, and that final ½ "valu" slot is an endearing complication. Overall I think it is an OK proxy, though I much prefer it as an unlimited-time challenge rather than its original format of a two-hour timed exam.