Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: it is hard to compile perfectly. The difficulty scales with the number of opcodes the interpreter has - more than 100, in CPython's case - but we can get a feel for how well a compiler is doing by looking at just one opcode.

For this exercise I'll look at CPython 3.6's LOAD_FAST opcode, which in C is:

TARGET(LOAD_FAST) {
  PyObject *value = GETLOCAL(oparg);
  if (value == NULL) {
    format_exc_check_arg(PyExc_UnboundLocalError,
                         UNBOUNDLOCAL_ERROR_MSG,
                         PyTuple_GetItem(co->co_varnames, oparg));
    goto error;
  }
  Py_INCREF(value);
  PUSH(value);
  FAST_DISPATCH();
}

This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:

TARGET(LOAD_FAST) {
  PyObject *value = fastlocals[oparg];
  if (value == NULL) {
    // ... error handling ...
  }
  value->ob_refcnt++;
  *stack_pointer++ = value;
  if (_Py_TracingPossible) {
    // ... slow path ...
  }
  f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
  uint16_t word = *next_instr;
  opcode = word & 255;
  oparg = word >> 8;
  next_instr++;
  goto *opcode_targets[opcode];
}

With this code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:

; PyObject *value = fastlocals[oparg]
48 8B 44 24 38           mov     rax, [rsp+38h]
49 63 F5                 movsxd  rsi, r13d
48 8B 04 F0              mov     rax, [rax+rsi*8]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 86 3A 00 00        jz      loc_545A2B
; value->ob_refcnt++
; *stack_pointer++ = value
4C 89 F2                 mov     rdx, r14
48 83 00 01              add     qword ptr [rax], 1
49 83 C6 08              add     r14, 8
48 89 02                 mov     [rdx], rax
; if (_Py_TracingPossible)
8B 05 97 AA 3B 00        mov     eax, cs:_Py_TracingPossible
85 C0                    test    eax, eax
0F 85 A3 CD FF FF        jnz     loc_53ED64
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
; next_instr++
48 89 EA                 mov     rdx, rbp
48 2B 14 24              sub     rdx, [rsp]
48 83 C5 02              add     rbp, 2
48 D1 FA                 sar     rdx, 1
01 D2                    add     edx, edx
89 53 78                 mov     [rbx+78h], edx
; word = *next_instr
0F B7 55 FE              movzx   edx, word ptr [rbp-2]
; opcode = word & 255
44 0F B6 C2              movzx   r8d, dl
; oparg = word >> 8
0F B6 F6                 movzx   esi, dh
; goto *opcode_targets[opcode]
49 63 D0                 movsxd  rdx, r8d
41 89 F5                 mov     r13d, esi
48 8B 14 D5 20 51 60 00  mov     rdx, ds:opcode_targets_11490[rdx*8]
FF E2                    jmp     rdx

From this, we can infer that gcc made the following choices:

- next_instr lives in rbp, stack_pointer in r14, f in rbx, and oparg in r13d.
- fastlocals and first_instr don't get registers: they live in stack slots, at [rsp+38h] and [rsp] respectively.
- oparg is pulled out of the instruction word by reading dh, which saves a shift. Its home is r13d, but dh isn't encodable in an instruction with a REX prefix [2], so it has to bounce through esi first.

We can also spot a number of sad things:

- Loading a local costs two dependent loads, because fastlocals has to be fetched from its stack slot first.
- The movsxd instructions sign-extend oparg and opcode before using them as indices, even though neither value can ever be negative [1].
- The f->f_lasti computation takes six instructions, two of which (sar rdx, 1 followed by add edx, edx) halve the byte difference and then immediately double it again; the byte difference could have been stored directly.

It feels like there are quite a few things which gcc could improve upon. Then again, gcc might still be doing better than its rivals; to find out, we'll have to look at a few other compilers. With that in mind, here is what Clang on OSX x64 does:

; PyObject *value = fastlocals[oparg]
49 63 F7                 movsxd  rsi, r15d
49 8B 84 F6 78 01 00 00  mov     rax, [r14+rsi*8+178h]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 51 41 00 00        jz      loc_F7B6D
; value->ob_refcnt++
48 FF 00                 inc     qword ptr [rax]
; *stack_pointer++ = value
49 89 45 00              mov     [r13+0], rax
49 83 C5 08              add     r13, 8
; if (_Py_TracingPossible)
44 8B 3D D2 D5 1A 00     mov     r15d, cs:__Py_TracingPossible
45 85 FF                 test    r15d, r15d
0F 85 B7 BB FF FF        jnz     loc_EF5EE
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B 85 48 FE FF FF     mov     rax, [rbp-1B8h]
48 2B 85 28 FE FF FF     sub     rax, [rbp-1D8h]
48 D1 F8                 sar     rax, 1
48 98                    cdqe
48 01 C0                 add     rax, rax
41 89 46 78              mov     [r14+78h], eax
; word = *next_instr
48 8B 95 48 FE FF FF     mov     rdx, [rbp-1B8h]
0F B7 02                 movzx   eax, word ptr [rdx]
; opcode = word & 255
0F B6 D8                 movzx   ebx, al
; oparg = word >> 8
0F B6 C4                 movzx   eax, ah
41 89 C7                 mov     r15d, eax
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 95 48 FE FF FF     mov     [rbp+var_1B8], rdx
; goto *opcode_targets[opcode]
48 63 C3                 movsxd  rax, ebx
48 8D 0D 87 01 13 00     lea     rcx, _opcode_targets_11343
48 8B 04 C1              mov     rax, [rcx+rax*8]
FF E0                    jmp     rax

From this, we can infer that clang made the following choices:

- f lives in r14, stack_pointer in r13, and oparg in r15d.
- next_instr and first_instr live in stack slots ([rbp-1B8h] and [rbp-1D8h]); next_instr is loaded, advanced, and stored back on every dispatch.
- fastlocals isn't cached anywhere: it is recomputed from f each time, by folding the f->f_localsplus offset (178h) into the addressing mode - which at least avoids gcc's extra load.
- The jump table address is materialised with a rip-relative lea each time, this being position-independent code.

Again, we can critique this assembly:

- Keeping next_instr in memory costs a load before and a store after every dispatch; gcc kept it in a register.
- The cdqe and the movsxd before the jump table load are unnecessary sign-extensions [1].
- As with gcc, oparg takes two instructions (movzx eax, ah then mov r15d, eax) to reach its home register, courtesy of the REX restriction on ah [2].

Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.

Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed goto statements, and instead has to fall back to a switch statement. Bearing that in mind, the assembly is:

; PyObject *value = fastlocals[oparg]
48 63 C3                  movsxd  rax, ebx
49 8B 8C C7 78 01 00 00   mov     rcx, [r15+rax*8+178h]
; if (value == NULL)
48 85 C9                  test    rcx, rcx
0F 84 5C 68 04 00         jz      loc_1E0791BF
; value->ob_refcnt++
48 FF 01                  inc     qword ptr [rcx]
; *stack_pointer++ = value
49 89 0C 24               mov     [r12], rcx
49 83 C4 08               add     r12, 8
4C 89 64 24 48            mov     [rsp+48h], r12
; end of switch case
E9 77 FF FF FF            jmp     loc_1E0328EF

loc_1E0328EF:
; if (_Py_TracingPossible)
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B C2                 mov     rax, rdx
49 2B C6                 sub     rax, r14
48 D1 F8                 sar     rax, 1
03 C0                    add     eax, eax
83 3D 8F 2F 33 00 00     cmp     cs:_Py_TracingPossible, 0
41 89 47 78              mov     [r15+78h], eax
0F 85 DB 3D 04 00        jnz     loc_1E0766E6
; word = *next_instr
0F B7 1A                 movzx   ebx, word ptr [rdx]
; opcode = word & 255
44 0F B6 EB              movzx   r13d, bl
; oparg = word >> 8
C1 EB 08                 shr     ebx, 8
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 54 24 40           mov     [rsp+40h], rdx
; spill oparg
89 5C 24 70              mov     dword ptr [rsp+70h], ebx
; align instruction stream
0F 1F 40 00              nop     dword ptr [rax+00h]
                         db      66h, 66h, 66h
66 0F 1F 84 00 00 00 00  nop     word ptr [rax+rax+00000000h]
; check opcode in valid range
41 8D 45 FF              lea     eax, [r13-1]
3D 9D 00 00 00           cmp     eax, 9Dh
0F 87 48 6A 04 00        ja      loc_1E079387
; goto *opcode_targets[opcode] (actually jump to switch case)
48 63 C8                 movsxd  rcx, eax
41 8B 84 88 CC 67 03 00  mov     eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0                 add     rax, r8
FF E0                    jmp     rax

For MSVC, we can infer:

- f lives in r15, stack_pointer in r12, next_instr in rdx, first_instr in r14, opcode in r13d, and the instruction word (and then oparg) in ebx; r8 holds the image base, for rebasing jump table offsets.
- Like clang, fastlocals is recomputed from f by folding the f->f_localsplus offset into the addressing mode.
- Despite having register homes, stack_pointer, next_instr, and oparg are all mirrored back to stack slots ([rsp+48h], [rsp+40h], and [rsp+70h]).

As per the established pattern, the critique on MSVC's code is:

- Every opcode ends with a jmp back to a single shared dispatch point - an extra taken branch per executed instruction.
- All dispatch then funnels through one shared indirect jmp, so the branch predictor sees every opcode transition at a single address, rather than one address per opcode as with computed goto.
- opcode is range-checked (cmp / ja) before the jump; computed goto needs no such check.
- The jump table holds 32-bit offsets rather than pointers, so every dispatch pays an extra add to rebase the offset against the image base in r8.
- stack_pointer, next_instr, and oparg are all mirrored to stack slots even though they have register homes.
- oparg is extracted with shr rather than by reading bh [2].
- The alignment nops before the dispatch sequence sit in the executed path.

MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed goto statements is certainly painful though, and accounts for four entries in the critique list.

Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for LOAD_FAST:

; PyObject *value = fastlocals[oparg]
48 8B 94 CB 00 01 00 00  mov     rdx, [rbx+rcx*8+100h]
; word = *next_instr
41 0F B7 04 2E           movzx   eax, word ptr [r14+rbp]
; if (value == NULL)
48 85 D2                 test    rdx, rdx
0F 84 F7 12 00 00        jz      loc_1E00696D
; *stack_pointer++ = value
49 89 14 24              mov     [r12], rdx
49 83 C4 08              add     r12, 8
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
89 2B                    mov     [rbx], ebp
; value->ob_refcnt++
48 83 02 01              add     qword ptr [rdx], 1
; if (_Py_TracingPossible)
41 F6 47 F8 01           test    byte ptr [r15-8], 1
0F 85 8F BA FF FF        jnz     loc_1E00111E
; oparg = word >> 8
0F B6 CC                 movzx   ecx, ah
; opcode = word & 255
0F B6 C0                 movzx   eax, al
; next_instr++
83 C5 02                 add     ebp, 2
; goto *opcode_targets[opcode]
41 FF 24 C7              jmp     qword ptr [r15+rax*8]

My assembly makes the following register assignment choices:

- f lives in rbx, pre-biased by offsetof(PyFrameObject, f_lasti), so that f->f_lasti is just [rbx] and fastlocals is [rbx+100h].
- next_instr lives in ebp, stored as a byte offset from first_instr rather than as a pointer; first_instr itself is in r14.
- oparg lives in rcx, and opcode in eax - the instruction word is loaded straight into eax and picked apart there.
- stack_pointer lives in r12.
- r15 holds the address of the opcode_targets jump table, with _Py_TracingPossible kept as a single byte just below the table, at [r15-8].

The combination of storing next_instr as an offset and keeping f biased by offsetof(PyFrameObject, f_lasti) means that f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr)) is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping f biased has no downside, and the occasional other upside: some fields toward the end of PyFrameObject can be addressed with a one-byte displacement rather than a four-byte displacement. Storing next_instr as an offset has the minor downside of making the *next_instr memory operand slightly more complex ([r14+rbp] rather than [rbp]), but this is a very low cost; in exchange, the offset approach makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on next_instr++.

Keeping the jump table address in r15 is expensive - POSIX x64 only has six non-volatile registers, and this burns one of them on a runtime constant - but it makes opcode dispatch cheap, which matters given that dispatch is replicated into all 100+ opcodes, and it has some upsides (e.g. rip-relative lea instructions can instead be r15-relative, and thus execute on a wider range of ports). I also change _Py_TracingPossible from a 32-bit variable to a 1-bit variable, and put it just before the jump table, so that it can be addressed with a one-byte offset from r15.

The other notable thing to point out is pulling word = *next_instr up towards the start of the instruction stream: I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.

That is one opcode - LOAD_FAST - considered in detail. Only 100+ other opcodes to go...

[1] There are two kinds of assignment to oparg: one kind we've already seen, namely oparg = word >> 8, which fairly obviously can't make oparg negative. The other kind is in the EXTENDED_ARG opcode, which does oparg |= oldoparg << 8;: we have to appeal to language lawyering to claim that oldoparg being non-negative implies that oldoparg << 8 is non-negative (signed overflow is undefined and all that). From there it is one simple step to claim that oparg being non-negative and oldoparg << 8 being non-negative implies that oparg | (oldoparg << 8) is non-negative.

[2] The ah/bh/ch/dh registers can only be accessed if a REX prefix is not used. The r8 through r15 registers can only be accessed if a REX prefix is used. QED.

[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...