What do compilers do with the CPython interpreter main loop?
Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: none of the major compilers handles it perfectly. The difficulty scales with the number of opcodes the interpreter has - more than 100 in CPython's case - but we can get a feel for how well a compiler is doing by looking at just one opcode.
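To make the discussion concrete, here is a minimal sketch of the kind of dispatch loop in question - a toy two-opcode interpreter (the opcodes and encoding are invented for illustration; CPython's real loop is far larger), using the computed-goto extension that gcc and clang provide:

#include <stdint.h>
#include <stdio.h>

int run(const uint16_t *code) {
    /* One entry per opcode; &&label is the GNU computed-goto extension. */
    static void *targets[] = { &&op_push, &&op_halt };
    int stack[16], *sp = stack;
    const uint16_t *next_instr = code;
    uint16_t word; int opcode, oparg;

/* Fetch a 16-bit word, decode opcode (low byte) and oparg (high byte),
   then jump indirectly through the table. */
#define DISPATCH() do { word = *next_instr++; \
                        opcode = word & 255;  \
                        oparg = word >> 8;    \
                        goto *targets[opcode]; } while (0)
    DISPATCH();
op_push:
    *sp++ = oparg;      /* push the 8-bit immediate operand */
    DISPATCH();
op_halt:
    return sp[-1];      /* return the top of the stack */
}

int main(void) {
    /* opcode 0 = push, opcode 1 = halt; operand lives in the high byte */
    const uint16_t prog[] = { 0 | (42 << 8), 1 };
    printf("%d\n", run(prog));   /* prints 42 */
    return 0;
}

Note how the dispatch sequence (fetch, decode, indirect jump) is replicated into every opcode, just as CPython's FAST_DISPATCH macro does; this replicated tail is exactly the code the rest of this article pores over.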
For this exercise I'll look at CPython 3.6's LOAD_FAST opcode, which in C is:
TARGET(LOAD_FAST) {
    PyObject *value = GETLOCAL(oparg);
    if (value == NULL) {
        format_exc_check_arg(PyExc_UnboundLocalError,
                             UNBOUNDLOCAL_ERROR_MSG,
                             PyTuple_GetItem(co->co_varnames, oparg));
        goto error;
    }
    Py_INCREF(value);
    PUSH(value);
    FAST_DISPATCH();
}
This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:
TARGET(LOAD_FAST) {
    PyObject *value = fastlocals[oparg];
    if (value == NULL) {
        // ... error handling ...
    }
    value->ob_refcnt++;
    *stack_pointer++ = value;
    if (_Py_TracingPossible) {
        // ... slow path ...
    }
    f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
    uint16_t word = *next_instr;
    opcode = word & 255;
    oparg = word >> 8;
    next_instr++;
    goto *opcode_targets[opcode];
}
With this code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:
; PyObject *value = fastlocals[oparg]
48 8B 44 24 38 mov rax, [rsp+38h]
49 63 F5 movsxd rsi, r13d
48 8B 04 F0 mov rax, [rax+rsi*8]
; if (value == NULL)
48 85 C0 test rax, rax
0F 84 86 3A 00 00 jz loc_545A2B
; value->ob_refcnt++
; *stack_pointer++ = value
4C 89 F2 mov rdx, r14
48 83 00 01 add qword ptr [rax], 1
49 83 C6 08 add r14, 8
48 89 02 mov [rdx], rax
; if (_Py_TracingPossible)
8B 05 97 AA 3B 00 mov eax, cs:_Py_TracingPossible
85 C0 test eax, eax
0F 85 A3 CD FF FF jnz loc_53ED64
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
; next_instr++
48 89 EA mov rdx, rbp
48 2B 14 24 sub rdx, [rsp]
48 83 C5 02 add rbp, 2
48 D1 FA sar rdx, 1
01 D2 add edx, edx
89 53 78 mov [rbx+78h], edx
; word = *next_instr
0F B7 55 FE movzx edx, word ptr [rbp-2]
; opcode = word & 255
44 0F B6 C2 movzx r8d, dl
; oparg = word >> 8
0F B6 F6 movzx esi, dh
; goto *opcode_targets[opcode]
49 63 D0 movsxd rdx, r8d
41 89 F5 mov r13d, esi
48 8B 14 D5 20 51 60 00 mov rdx, ds:opcode_targets_11490[rdx*8]
FF E2 jmp rdx
From this, we can infer that gcc made the following choices:
- `fastlocals` is stored on the stack at `[rsp+38h]`.
- `oparg` is stored in the register `r13d`.
- `stack_pointer` is stored in the register `r14`.
- `next_instr` is stored in the register `rbp`.
- `first_instr` is stored on the stack at `[rsp]`.
- `f` is stored in the register `rbx`.
We can also spot a number of sad things:
- A memory load could be avoided if gcc knew that `[rsp+38h]` was `rbx+178h` (`fastlocals` is an array member of `f`).
- `movsxd rsi, r13d` could be avoided if gcc knew that `r13d` (`oparg`) was never negative (`oparg` is stupidly declared as `int` rather than `unsigned int`, and proving that it can never be negative requires a bit of effort and language-lawyering [1]; a small illustration follows this list).
- `mov rdx, r14` feels silly; gcc should transpose `add r14, 8` and `mov [rdx], rax`, and then this instruction wouldn't be required (as a register-register move, it doesn't consume any execution resources, but it still consumes icache space and decoding resources).
- `sar rdx, 1` and `add edx, edx` could be avoided if gcc knew that `first_instr` and `next_instr` were always even. The preceding `sub` could also be avoided if `next_instr` was maintained as an offset from `first_instr` rather than an absolute pointer.
- `movsxd rdx, r8d` could be avoided if gcc realized that `r8d` (`opcode`) was never negative (which it should be able to spot fairly easily: it is the result of a `movzx` instruction, and thus can only be in the range 0 through 255).
- `mov r13d, esi` could be avoided by doing `movzx esi, dh` directly into `r13d` instead of into `esi` (except that `movzx r13d, dh` cannot be expressed in machine code [2], so the compiler would have to swap things around and put `oparg` in say `ebp` and `next_instr` in say `r13`).
- `mov rdx, ds:opcode_targets_11490[rdx*8]` and `jmp rdx` could be merged into a single instruction `jmp ds:opcode_targets_11490[rdx*8]` (thus saving an instruction decode and a byte or two of icache).
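The `movsxd` item deserves a moment. Here is a hedged sketch (hypothetical functions, not CPython code) of why the signedness of the index matters on x64:

#include <stdint.h>

int64_t load_signed(int64_t *fastlocals, int oparg) {
    /* oparg could be negative, so the 32-bit value must be
       sign-extended to 64 bits (a movsxd) before it can be scaled
       into the address computation. */
    return fastlocals[oparg];
}

int64_t load_unsigned(int64_t *fastlocals, uint32_t oparg) {
    /* An unsigned index only needs zero-extension. At a function
       boundary the compiler may still emit a plain 32-bit mov to be
       safe, but inside the interpreter loop the movzx that decodes
       oparg has already cleared the upper 32 bits, so the extension
       would cost nothing there. */
    return fastlocals[oparg];
}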
It feels like there are quite a few things which gcc could improve upon. Despite that, gcc might still be doing better than the competition; to find out, we have to look at some other compilers. With that in mind, here is what Clang on OSX x64 does:
; PyObject *value = fastlocals[oparg]
49 63 F7 movsxd rsi, r15d
49 8B 84 F6 78 01 00 00 mov rax, [r14+rsi*8+178h]
; if (value == NULL)
48 85 C0 test rax, rax
0F 84 51 41 00 00 jz loc_F7B6D
; value->ob_refcnt++
48 FF 00 inc qword ptr [rax]
; *stack_pointer++ = value
49 89 45 00 mov [r13+0], rax
49 83 C5 08 add r13, 8
; if (_Py_TracingPossible)
44 8B 3D D2 D5 1A 00 mov r15d, cs:__Py_TracingPossible
45 85 FF test r15d, r15d
0F 85 B7 BB FF FF jnz loc_EF5EE
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B 85 48 FE FF FF mov rax, [rbp-1B8h]
48 2B 85 28 FE FF FF sub rax, [rbp-1D8h]
48 D1 F8 sar rax, 1
48 98 cdqe
48 01 C0 add rax, rax
41 89 46 78 mov [r14+78h], eax
; word = *next_instr
48 8B 95 48 FE FF FF mov rdx, [rbp-1B8h]
0F B7 02 movzx eax, word ptr [rdx]
; opcode = word & 255
0F B6 D8 movzx ebx, al
; oparg = word >> 8
0F B6 C4 movzx eax, ah
41 89 C7 mov r15d, eax
; next_instr++
48 83 C2 02 add rdx, 2
48 89 95 48 FE FF FF mov [rbp+var_1B8], rdx
; goto *opcode_targets[opcode]
48 63 C3 movsxd rax, ebx
48 8D 0D 87 01 13 00 lea rcx, _opcode_targets_11343
48 8B 04 C1 mov rax, [rcx+rax*8]
FF E0 jmp rax
From this, we can infer that clang made the following choices:
- `oparg` is stored in the register `r15d`.
- `stack_pointer` is stored in the register `r13`.
- `next_instr` is stored on the stack at `[rbp-1B8h]`.
- `first_instr` is stored on the stack at `[rbp-1D8h]`.
- `f` is stored in the register `r14`.
Again, we can critique this assembly:
- `movsxd rsi, r15d` could be avoided if clang knew that `oparg` was never negative.
- Clang does realise that `fastlocals` can be expressed as `r14+178h`, which is good.
- `inc qword ptr [rax]` versus `add qword ptr [rax], 1` is debatable, though I tend to have a slight preference for `add` (`inc` might suffer a partial flags stall, but on the flip side is a slightly shorter encoding).
- The choice of `r13` for `stack_pointer` is a shame, as it requires `[r13+0]` rather than just `[r13]`. It would have been better to put say `f` in `r13` and `stack_pointer` in `r14`.
- Clang gets points for putting `mov [r13+0], rax` and `add r13, 8` in the "right" order, thus avoiding the extra `mov` performed by gcc.
- `next_instr` should really have been assigned a register rather than a stack slot.
- `cdqe`, `add rax, rax`, `mov [r14+78h], eax` is a disappointing instruction sequence: given the 32-bit store at the end, Clang should have spotted that `add rax, rax` could be turned into `add eax, eax`, and thus that `cdqe` could be dropped entirely.
- The memory load `mov rdx, [rbp-1B8h]` could be avoided by instead doing `mov rdx, rax` immediately after the preceding `mov rax, [rbp-1B8h]`.
- The `mov` in `movzx eax, ah`, `mov r15d, eax` could be eliminated had Clang chosen a more suitable register for `oparg` [2].
- `movsxd rax, ebx` could be avoided if Clang realized that `ebx` (`opcode`) was never negative (which it should be able to spot fairly easily: it is the result of a `movzx` instruction, and thus can only be in the range 0 through 255).
- Clang has generated position-independent code, thus requiring `lea rcx, _opcode_targets_11343` rather than fusing the address of the jump table into the next memory load.
- `mov rax, [rcx+rax*8]` and `jmp rax` could be fused into a single instruction.
Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.
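Before moving on, it's worth unpacking the sar-then-double critique that applies to both gcc and clang. A hedged C sketch (invented function names) of the two ways to compute `f->f_lasti`:

#include <stdint.h>

int f_lasti_scaled(const uint16_t *next_instr, const uint16_t *first_instr) {
    /* Subtracting two uint16_t* pointers divides the byte distance
       by sizeof(uint16_t); the multiplication then doubles it again.
       Compilers emit sar + add rather than cancelling the pair, as
       they can't prove the byte distance is even. */
    return (int)(sizeof(uint16_t) * (int)(next_instr - first_instr));
}

int f_lasti_bytes(const uint16_t *next_instr, const uint16_t *first_instr) {
    /* Computing the distance in bytes needs just one subtraction. */
    return (int)((const char *)next_instr - (const char *)first_instr);
}

The first form is what CPython's source asks for; a compiler could only cancel the halving against the doubling if it could prove both pointers 2-aligned, which neither gcc nor clang manages.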
Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed goto statements, and instead has to fall back to using a switch statement. Bearing that in mind, the assembly is:
; PyObject *value = fastlocals[oparg]
48 63 C3 movsxd rax, ebx
49 8B 8C C7 78 01 00 00 mov rcx, [r15+rax*8+178h]
; if (value == NULL)
48 85 C9 test rcx, rcx
0F 84 5C 68 04 00 jz loc_1E0791BF
; value->ob_refcnt++
48 FF 01 inc qword ptr [rcx]
; *stack_pointer++ = value
49 89 0C 24 mov [r12], rcx
49 83 C4 08 add r12, 8
4C 89 64 24 48 mov [rsp+48h], r12
; end of switch case
E9 77 FF FF FF jmp loc_1E0328EF
loc_1E0328EF:
; if (_Py_TracingPossible)
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B C2 mov rax, rdx
49 2B C6 sub rax, r14
48 D1 F8 sar rax, 1
03 C0 add eax, eax
83 3D 8F 2F 33 00 00 cmp cs:_Py_TracingPossible, 0
41 89 47 78 mov [r15+78h], eax
0F 85 DB 3D 04 00 jnz loc_1E0766E6
; word = *next_instr
0F B7 1A movzx ebx, word ptr [rdx]
; opcode = word & 255
44 0F B6 EB movzx r13d, bl
; oparg = word >> 8
C1 EB 08 shr ebx, 8
; next_instr++
48 83 C2 02 add rdx, 2
48 89 54 24 40 mov [rsp+40h], rdx
; spill oparg
89 5C 24 70 mov dword ptr [rsp+70h], ebx
; align instruction stream
0F 1F 40 00 nop dword ptr [rax+00h]
db 66h, 66h, 66h
66 0F 1F 84 00 00 00 00 nop word ptr [rax+rax+00000000h]
; check opcode in valid range
41 8D 45 FF lea eax, [r13-1]
3D 9D 00 00 00 cmp eax, 9Dh
0F 87 48 6A 04 00 ja loc_1E079387
; goto *opcode_targets[opcode] (actually jump to switch case)
48 63 C8 movsxd rcx, eax
41 8B 84 88 CC 67 03 00 mov eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0 add rax, r8
FF E0 jmp rax
For MSVC, we can infer:
- `oparg` is stored in the register `ebx`, and also on the stack at `[rsp+70h]`.
- `stack_pointer` is stored in the register `r12`, and also on the stack at `[rsp+48h]`.
- `next_instr` is stored in register `rdx`, and also on the stack at `[rsp+40h]`.
- `first_instr` is stored in register `r14`.
- `f` is stored in the register `r15`.
As per the established pattern, the critique of MSVC's code is:
- `movsxd rax, ebx` is, yet again, the fault of `oparg` being signed.
- Like clang, MSVC realised that `fastlocals` can be expressed as `r15+178h`, which is good.
- Like clang, MSVC opted for `inc` over `add`, which is debatable.
- Like clang, MSVC put `mov [r12], rcx` and `add r12, 8` in the right order.
- MSVC's choice of register for `stack_pointer` is good, as it avoids `+0` in memory operands.
- MSVC's spilling of registers to the stack is kind of sad.
- `jmp loc_1E0328EF` is suffered because MSVC lacks computed `goto` statements.
- `f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))` and `if (_Py_TracingPossible)` are interleaved slightly, which is cute.
- Like gcc, MSVC avoids the trap of `cdqe` and `add rax, rax`, instead opting for `add eax, eax`. Alas, like gcc, it still doesn't realise that neither the `sar rax, 1` nor the `add eax, eax` are necessary.
- `shr ebx, 8` is done instead of `movzx ebx, bh` - I'd have a slight preference for the latter, but there isn't much in it. At least the choice of registers avoids the gratuitous `mov` done by gcc and clang.
- The aligning of the instruction stream is - as far as I can tell - completely redundant, as nothing jumps to the aligned instruction.
- Rather than having a jump table for opcodes 0 through 255 inclusive, MSVC has a jump table for opcodes 1 through 158 inclusive. This saves a bit of virtual address space, but comes at the cost of `lea eax, [r13-1]`, `cmp eax, 9Dh`, `ja loc_1E079387`.
- `movsxd rcx, eax` could be avoided if MSVC realized that `eax` (`opcode - 1`) was never negative (which it should be able to spot fairly easily: by means of `cmp eax, 9Dh`, `ja loc_1E079387` it has already verified that `eax` is between 0 and 157 inclusive).
- MSVC's jump table contains 32-bit offsets rather than 64-bit absolute pointers. This saves on address space and dcache and relocations, but comes at the cost of the `add rax, r8` instruction to turn an offset into an absolute pointer.
MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed goto statements is certainly painful though, and accounts for four entries in the critique list.
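To see where those extra instructions come from, here is the computed-goto toy from earlier rewritten the way MSVC must compile it (again a hedged sketch with an invented two-opcode encoding): all dispatch funnels through a single switch, which re-checks the opcode range and jumps back to the top of the loop after every case:

#include <stdint.h>

int run_switch(const uint16_t *code) {
    int stack[16], *sp = stack;
    const uint16_t *next_instr = code;
    for (;;) {
        uint16_t word = *next_instr++;
        int opcode = word & 255;
        int oparg = word >> 8;
        switch (opcode) {       /* compiled as range check + jump table */
        case 0:                 /* push immediate */
            *sp++ = oparg;
            break;              /* jumps back to the top of the loop */
        case 1:                 /* halt: return top of stack */
            return sp[-1];
        default:
            return -1;          /* unreachable for well-formed code */
        }
    }
}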
Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for LOAD_FAST:
; PyObject *value = fastlocals[oparg]
48 8B 94 CB 00 01 00 00 mov rdx, [rbx+rcx*8+100h]
; word = *next_instr
41 0F B7 04 2E movzx eax, word ptr [r14+rbp]
; if (value == NULL)
48 85 D2 test rdx, rdx
0F 84 F7 12 00 00 jz loc_1E00696D
; *stack_pointer++ = value
49 89 14 24 mov [r12], rdx
49 83 C4 08 add r12, 8
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
89 2B mov [rbx], ebp
; value->ob_refcnt++
48 83 02 01 add qword ptr [rdx], 1
; if (_Py_TracingPossible)
41 F6 47 F8 01 test byte ptr [r15-8], 1
0F 85 8F BA FF FF jnz loc_1E00111E
; oparg = word >> 8
0F B6 CC movzx ecx, ah
; opcode = word & 255
0F B6 C0 movzx eax, al
; next_instr++
83 C5 02 add ebp, 2
; goto *opcode_targets[opcode]
41 FF 24 C7 jmp qword ptr [r15+rax*8]
My assembly makes the following register assignment choices:
- `oparg` is stored in the register `ecx` (at least on Windows; on other platforms it would be in `edi` - in both cases this happens to coincide with the register which the normal calling convention uses for the first argument).
- `stack_pointer` is stored in the register `r12`.
- The offset of `next_instr` from `first_instr` is stored in register `ebp`.
- `first_instr` is stored in register `r14`.
- `f+78h` is stored in the register `rbx` (in particular, this means that accesses to `f->f_lasti` don't need a displacement, and accesses to some fields toward the end of `PyFrameObject` can be accessed with a one-byte displacement rather than a four-byte displacement).
- The address of the jump table is stored in register `r15`.
The combination of storing `next_instr` as an offset and keeping `f` biased by `offsetof(PyFrameObject, f_lasti)` means that `f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))` is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping `f` biased has no downside, and has the occasional other upside (accesses to some fields toward the end of `PyFrameObject` can be accessed with a one-byte displacement rather than a four-byte displacement). Storing `next_instr` as an offset has the minor downside of making the `*next_instr` memory operand slightly more complex (`[r14+rbp]` rather than `[rbp]`), but this is a very low cost, and the offset approach also makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on `next_instr++`.
Keeping the jump table address in `r15` is expensive (as POSIX x64 only has six non-volatile registers, and this burns one of those six for a runtime constant), but makes opcode dispatch cheap (which is important, given that dispatch is replicated into all 100+ opcodes), and has some upsides (e.g. rip-relative `lea` instructions can instead be `r15`-relative, and thus be executed on a wider range of ports). I also change `_Py_TracingPossible` from being a 32-bit variable to being a 1-bit variable, and put this variable just before the jump table (so that it can be addressed with a one-byte offset from `r15`).
The other notable thing to point out is pulling `word = *next_instr` up towards the start of the instruction stream - I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.
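Here is a hedged C rendering of the two representation tricks; FrameStub and its layout are stand-ins for the real PyFrameObject (whose f_lasti field sits at offset 78h, per the listings above):

#include <stdint.h>

/* Stand-in for PyFrameObject: all that matters here is that f_lasti
   lives at a fixed offset (0x78 in CPython 3.6's layout). */
typedef struct { char pad[0x78]; int f_lasti; } FrameStub;

/* What the C source asks for: subtract, halve, double, store. */
void store_lasti_plain(FrameStub *f, const uint16_t *next_instr,
                       const uint16_t *first_instr) {
    f->f_lasti = (int)(sizeof(uint16_t) * (int)(next_instr - first_instr));
}

/* With next_instr kept as a byte offset from first_instr, and f kept
   pre-biased so that it points directly at f_lasti, the same store
   becomes a single instruction (the "mov [rbx], ebp" above). */
void store_lasti_biased(int *f_lasti_ptr, int instr_offset) {
    *f_lasti_ptr = instr_offset;
}

/* Fetching the next instruction word under the offset representation
   becomes a base+index load (the "[r14+rbp]" operand above). */
uint16_t fetch_word(const uint16_t *first_instr, int instr_offset) {
    return *(const uint16_t *)((const char *)first_instr + instr_offset);
}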
That is one opcode - LOAD_FAST - considered in detail. Only 100+ other opcodes to go...
[1] There are two kinds of assignment to `oparg`: one kind we've already seen, namely `oparg = word >> 8`, which fairly obviously can't make `oparg` negative. The other kind is in the `EXTENDED_ARG` opcode, which does `oparg |= oldoparg << 8;`: we have to appeal to language lawyering to claim that `oldoparg` being non-negative implies that `oldoparg << 8` is non-negative (signed overflow is undefined and all that). From there it is one simple step to claim that `oparg` being non-negative and `oldoparg << 8` being non-negative implies that `oparg | (oldoparg << 8)` is non-negative.
[2] The `ah`/`bh`/`ch`/`dh` registers can only be accessed if a REX prefix is not used. The `r8` through `r15` registers can only be accessed if a REX prefix is used. QED.
[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...