Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: it is hard to compile perfectly. The difficulty scales with the number of opcodes the interpreter has - more than 100, in CPython's case - but we can get a feel for how well a compiler is doing by looking at just one opcode.

For this exercise I'll look at CPython 3.6's LOAD_FAST opcode, which in C is:

TARGET(LOAD_FAST) {
  PyObject *value = GETLOCAL(oparg);
  if (value == NULL) {
    format_exc_check_arg(PyExc_UnboundLocalError,
                         UNBOUNDLOCAL_ERROR_MSG,
                         PyTuple_GetItem(co->co_varnames, oparg));
    goto error;
  }
  Py_INCREF(value);
  PUSH(value);
  FAST_DISPATCH();
}

This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:

TARGET(LOAD_FAST) {
  PyObject *value = fastlocals[oparg];
  if (value == NULL) {
    // ... error handling ...
  }
  value->ob_refcnt++;
  *stack_pointer++ = value;
  if (_Py_TracingPossible) {
    // ... slow path ...
  }
  f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
  uint16_t word = *next_instr;
  opcode = word & 255;
  oparg = word >> 8;
  next_instr++;
  goto *opcode_targets[opcode];
}

With this code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:

; PyObject *value = fastlocals[oparg]
48 8B 44 24 38           mov     rax, [rsp+38h]
49 63 F5                 movsxd  rsi, r13d
48 8B 04 F0              mov     rax, [rax+rsi*8]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 86 3A 00 00        jz      loc_545A2B
; value->ob_refcnt++
; *stack_pointer++ = value
4C 89 F2                 mov     rdx, r14
48 83 00 01              add     qword ptr [rax], 1
49 83 C6 08              add     r14, 8
48 89 02                 mov     [rdx], rax
; if (_Py_TracingPossible)
8B 05 97 AA 3B 00        mov     eax, cs:_Py_TracingPossible
85 C0                    test    eax, eax
0F 85 A3 CD FF FF        jnz     loc_53ED64
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
; next_instr++
48 89 EA                 mov     rdx, rbp
48 2B 14 24              sub     rdx, [rsp]
48 83 C5 02              add     rbp, 2
48 D1 FA                 sar     rdx, 1
01 D2                    add     edx, edx
89 53 78                 mov     [rbx+78h], edx
; word = *next_instr
0F B7 55 FE              movzx   edx, word ptr [rbp-2]
; opcode = word & 255
44 0F B6 C2              movzx   r8d, dl
; oparg = word >> 8
0F B6 F6                 movzx   esi, dh
; goto *opcode_targets[opcode]
49 63 D0                 movsxd  rdx, r8d
41 89 F5                 mov     r13d, esi
48 8B 14 D5 20 51 60 00  mov     rdx, ds:opcode_targets_11490[rdx*8]
FF E2                    jmp     rdx

From this, we can infer that gcc made the following choices:

- next_instr lives in rbp, stack_pointer in r14, f in rbx, and oparg in r13d.
- fastlocals and first_instr don't get registers: they live in stack slots, at [rsp+38h] and [rsp] respectively.
- oparg is pulled out of the instruction word by reading dh, which saves a shift. Its home is r13d, but dh isn't encodable in an instruction with a REX prefix [2], so it has to bounce through esi first.

We can also spot a number of sad things:

- Loading a local costs two dependent loads, because fastlocals has to be fetched from its stack slot first.
- The movsxd instructions sign-extend oparg and opcode before using them as indices, even though neither value can ever be negative [1].
- The f->f_lasti computation takes six instructions, two of which (sar rdx, 1 followed by add edx, edx) halve the byte difference and then immediately double it again; the byte difference could have been stored directly.

It feels like there are quite a few things which gcc could improve upon. Then again, gcc might still be doing better than its rivals; to find out, we'll have to look at a few other compilers. With that in mind, here is what Clang on OSX x64 does:

; PyObject *value = fastlocals[oparg]
49 63 F7                 movsxd  rsi, r15d
49 8B 84 F6 78 01 00 00  mov     rax, [r14+rsi*8+178h]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 51 41 00 00        jz      loc_F7B6D
; value->ob_refcnt++
48 FF 00                 inc     qword ptr [rax]
; *stack_pointer++ = value
49 89 45 00              mov     [r13+0], rax
49 83 C5 08              add     r13, 8
; if (_Py_TracingPossible)
44 8B 3D D2 D5 1A 00     mov     r15d, cs:__Py_TracingPossible
45 85 FF                 test    r15d, r15d
0F 85 B7 BB FF FF        jnz     loc_EF5EE
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B 85 48 FE FF FF     mov     rax, [rbp-1B8h]
48 2B 85 28 FE FF FF     sub     rax, [rbp-1D8h]
48 D1 F8                 sar     rax, 1
48 98                    cdqe
48 01 C0                 add     rax, rax
41 89 46 78              mov     [r14+78h], eax
; word = *next_instr
48 8B 95 48 FE FF FF     mov     rdx, [rbp-1B8h]
0F B7 02                 movzx   eax, word ptr [rdx]
; opcode = word & 255
0F B6 D8                 movzx   ebx, al
; oparg = word >> 8
0F B6 C4                 movzx   eax, ah
41 89 C7                 mov     r15d, eax
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 95 48 FE FF FF     mov     [rbp+var_1B8], rdx
; goto *opcode_targets[opcode]
48 63 C3                 movsxd  rax, ebx
48 8D 0D 87 01 13 00     lea     rcx, _opcode_targets_11343
48 8B 04 C1              mov     rax, [rcx+rax*8]
FF E0                    jmp     rax

From this, we can infer that clang made the following choices:

- f lives in r14, stack_pointer in r13, and oparg in r15d.
- next_instr and first_instr live in stack slots ([rbp-1B8h] and [rbp-1D8h]); next_instr is loaded, advanced, and stored back on every dispatch.
- fastlocals isn't cached anywhere: it is recomputed from f each time, by folding the f->f_localsplus offset (178h) into the addressing mode - which at least avoids gcc's extra load.
- The jump table address is materialised with a rip-relative lea each time, this being position-independent code.

Again, we can critique this assembly:

- Keeping next_instr in memory costs a load before and a store after every dispatch; gcc kept it in a register.
- The cdqe and the movsxd before the jump table load are unnecessary sign-extensions [1].
- As with gcc, oparg takes two instructions (movzx eax, ah then mov r15d, eax) to reach its home register, courtesy of the REX restriction on ah [2].

Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.

Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed goto statements, and instead has to fall back to a switch statement. Bearing that in mind, the assembly is:

; PyObject *value = fastlocals[oparg]
48 63 C3                  movsxd  rax, ebx
49 8B 8C C7 78 01 00 00   mov     rcx, [r15+rax*8+178h]
; if (value == NULL)
48 85 C9                  test    rcx, rcx
0F 84 5C 68 04 00         jz      loc_1E0791BF
; value->ob_refcnt++
48 FF 01                  inc     qword ptr [rcx]
; *stack_pointer++ = value
49 89 0C 24               mov     [r12], rcx
49 83 C4 08               add     r12, 8
4C 89 64 24 48            mov     [rsp+48h], r12
; end of switch case
E9 77 FF FF FF            jmp     loc_1E0328EF

loc_1E0328EF:
; if (_Py_TracingPossible)
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B C2                 mov     rax, rdx
49 2B C6                 sub     rax, r14
48 D1 F8                 sar     rax, 1
03 C0                    add     eax, eax
83 3D 8F 2F 33 00 00     cmp     cs:_Py_TracingPossible, 0
41 89 47 78              mov     [r15+78h], eax
0F 85 DB 3D 04 00        jnz     loc_1E0766E6
; word = *next_instr
0F B7 1A                 movzx   ebx, word ptr [rdx]
; opcode = word & 255
44 0F B6 EB              movzx   r13d, bl
; oparg = word >> 8
C1 EB 08                 shr     ebx, 8
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 54 24 40           mov     [rsp+40h], rdx
; spill oparg
89 5C 24 70              mov     dword ptr [rsp+70h], ebx
; align instruction stream
0F 1F 40 00              nop     dword ptr [rax+00h]
                         db      66h, 66h, 66h
66 0F 1F 84 00 00 00 00  nop     word ptr [rax+rax+00000000h]
; check opcode in valid range
41 8D 45 FF              lea     eax, [r13-1]
3D 9D 00 00 00           cmp     eax, 9Dh
0F 87 48 6A 04 00        ja      loc_1E079387
; goto *opcode_targets[opcode] (actually jump to switch case)
48 63 C8                 movsxd  rcx, eax
41 8B 84 88 CC 67 03 00  mov     eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0                 add     rax, r8
FF E0                    jmp     rax

For MSVC, we can infer:

- f lives in r15, stack_pointer in r12, next_instr in rdx, first_instr in r14, opcode in r13d, and the instruction word (and then oparg) in ebx; r8 holds the image base, for rebasing jump table offsets.
- Like clang, fastlocals is recomputed from f by folding the f->f_localsplus offset into the addressing mode.
- Despite having register homes, stack_pointer, next_instr, and oparg are all mirrored back to stack slots ([rsp+48h], [rsp+40h], and [rsp+70h]).

As per the established pattern, the critique on MSVC's code is:

- Every opcode ends with a jmp back to a single shared dispatch point - an extra taken branch per executed instruction.
- All dispatch then funnels through one shared indirect jmp, so the branch predictor sees every opcode transition at a single address, rather than one address per opcode as with computed goto.
- opcode is range-checked (cmp / ja) before the jump; computed goto needs no such check.
- The jump table holds 32-bit offsets rather than pointers, so every dispatch pays an extra add to rebase the offset against the image base in r8.
- stack_pointer, next_instr, and oparg are all mirrored to stack slots even though they have register homes.
- oparg is extracted with shr rather than by reading bh [2].
- The alignment nops before the dispatch sequence sit in the executed path.

MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed goto statements is certainly painful though, and accounts for four entries in the critique list.

Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for LOAD_FAST:

; PyObject *value = fastlocals[oparg]
48 8B 94 CB 00 01 00 00  mov     rdx, [rbx+rcx*8+100h]
; word = *next_instr
41 0F B7 04 2E           movzx   eax, word ptr [r14+rbp]
; if (value == NULL)
48 85 D2                 test    rdx, rdx
0F 84 F7 12 00 00        jz      loc_1E00696D
; *stack_pointer++ = value
49 89 14 24              mov     [r12], rdx
49 83 C4 08              add     r12, 8
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
89 2B                    mov     [rbx], ebp
; value->ob_refcnt++
48 83 02 01              add     qword ptr [rdx], 1
; if (_Py_TracingPossible)
41 F6 47 F8 01           test    byte ptr [r15-8], 1
0F 85 8F BA FF FF        jnz     loc_1E00111E
; oparg = word >> 8
0F B6 CC                 movzx   ecx, ah
; opcode = word & 255
0F B6 C0                 movzx   eax, al
; next_instr++
83 C5 02                 add     ebp, 2
; goto *opcode_targets[opcode]
41 FF 24 C7              jmp     qword ptr [r15+rax*8]

My assembly makes the following register assignment choices:

- f lives in rbx, pre-biased by offsetof(PyFrameObject, f_lasti), so that f->f_lasti is just [rbx] and fastlocals is [rbx+100h].
- next_instr lives in ebp, stored as a byte offset from first_instr rather than as a pointer; first_instr itself is in r14.
- oparg lives in rcx, and opcode in eax - the instruction word is loaded straight into eax and picked apart there.
- stack_pointer lives in r12.
- r15 holds the address of the opcode_targets jump table, with _Py_TracingPossible kept as a single byte just below the table, at [r15-8].

The combination of storing next_instr as an offset and keeping f biased by offsetof(PyFrameObject, f_lasti) means that f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr)) is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping f biased has no downside, and the occasional other upside: some fields toward the end of PyFrameObject can be addressed with a one-byte displacement rather than a four-byte displacement. Storing next_instr as an offset has the minor downside of making the *next_instr memory operand slightly more complex ([r14+rbp] rather than [rbp]), but this is a very low cost; in exchange, the offset approach makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on next_instr++.

Keeping the jump table address in r15 is expensive - POSIX x64 only has six non-volatile registers, and this burns one of them on a runtime constant - but it makes opcode dispatch cheap, which matters given that dispatch is replicated into all 100+ opcodes, and it has some upsides (e.g. rip-relative lea instructions can instead be r15-relative, and thus execute on a wider range of ports). I also change _Py_TracingPossible from a 32-bit variable to a 1-bit variable, and put it just before the jump table, so that it can be addressed with a one-byte offset from r15.

The other notable thing to point out is pulling word = *next_instr up towards the start of the instruction stream: I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.

That is one opcode - LOAD_FAST - considered in detail. Only 100+ other opcodes to go...

[1] There are two kinds of assignment to oparg: one kind we've already seen, namely oparg = word >> 8, which fairly obviously can't make oparg negative. The other kind is in the EXTENDED_ARG opcode, which does oparg |= oldoparg << 8;: we have to appeal to language lawyering to claim that oldoparg being non-negative implies that oldoparg << 8 is non-negative (signed overflow is undefined and all that). From there it is one simple step to claim that oparg being non-negative and oldoparg << 8 being non-negative implies that oparg | (oldoparg << 8) is non-negative.

[2] The ah/bh/ch/dh registers can only be accessed if a REX prefix is not used. The r8 through r15 registers can only be accessed if a REX prefix is used. QED.

[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...