Much has been written on the internet about converting integers to strings (e.g. A, B, C). Normally, when this is considered, half of the problem (or the interesting trick in the solution) revolves around knowing how long the resulting string will be. As such, I'd like to consider a slight variant of the problem: converting an integer between 0 and 999999999 to a nine digit string, and always emitting the full nine digits (so 456 becomes `000000456`, for example). We can begin with a simple libc-based approach:

``````
#include <stdint.h>
#include <stdio.h>

void sprintf9(char* p, uint32_t n) {
  sprintf(p, "%09u", n);
}
``````

Using `sprintf` is short and pithy, but potentially not the most efficient choice. Another option might be:

``````
#include <stdint.h>

void divmod9(char* p, uint32_t n) {
  for (uint32_t i = 9; i--; n /= 10) {
    p[i] = '0' + (n % 10);
  }
}
``````

It might look like this approach is going to be slowed down by looping and dividing quite a lot, but Clang will fully unroll the loop, turn all the division into multiplication, and turn the addition into bitwise-or. The result is impressively efficient. In particular, the divide-by-multiplying trick is something that everyone should read up on - I'd recommend this explanation by "fish" for those unfamiliar with it.

A third alternative is this enigma from LuaJIT:

``````
#define WINT_R(x, sh, sc) { \
  uint32_t d = (x * (((1<<sh)+sc-1)/sc)) >> sh; \
  x -= d * sc; \
  *p++ = (char)('0' + d); }

void lj_strfmt_wuint9(char* p, uint32_t n) {
  uint32_t v = n / 10000, w;
  n -= v * 10000;
  w = v / 10000;
  v -= w * 10000;
  *p++ = (char)('0' + w);
  WINT_R(v, 23, 1000)
  WINT_R(v, 12, 100)
  WINT_R(v, 10, 10)
  *p++ = (char)('0' + v);
  WINT_R(n, 23, 1000)
  WINT_R(n, 12, 100)
  WINT_R(n, 10, 10)
  *p++ = (char)('0' + n);
}
``````

If you consider `(x * (((1<<sh)+sc-1)/sc)) >> sh` in the realm of mathematical real numbers, rather than the realm of C integer arithmetic, then the expression looks kind of like `x * (((1 << sh) / sc) >> sh)`, which in turn looks kind of like `x * (1 / sc)`, which is of course `x / sc`. This intuition turns out to be spot-on: this expression is the same divide-by-multiplying trick as used by Clang - `sc` is the value we want to divide by (what fish calls d), `sh` is what fish calls k, and `(((1<<sh)+sc-1)/sc)` is what fish calls m. The choices of `sh` look slightly magical, but 23 is the smallest k which suffices for dividing numbers in the range 0 through 9999 by 1000, 12 is the smallest k for 0 through 999 by 100, and 10 is the smallest k for 0 through 99 by 10.

Another option is this SSE2 implementation, which is clear as mud:

``````
#include <emmintrin.h>
#include <stdint.h>

void vectorised9(char* p, uint32_t n) {
  __m128i a = _mm_set1_epi32(n);
  __m128i b = _mm_srli_epi64(
    _mm_mul_epu32(a, _mm_set1_epi32(879609303)), 43);
  __m128i c = _mm_shuffle_epi32(_mm_mul_epu32(b,
    _mm_setr_epi32(10000, 0, 429497, 0)), 0x47);
  p[0] = '0' | _mm_cvtsi128_si32(c);
  __m128i d = _mm_sub_epi32(_mm_unpacklo_epi64(b, a),
    _mm_mul_epu32(c, _mm_setr_epi32(10000, 0, 1, 0)));
  __m128i e = _mm_srli_epi32(
    _mm_mul_epu32(d, _mm_set1_epi32(5243)), 19);
  __m128i f = _mm_or_si128(e,
    _mm_shuffle_epi32(_mm_sub_epi32(d,
      _mm_mul_epu32(e, _mm_set1_epi32(100))), 0x91));
  __m128i g = _mm_mulhi_epu16(f, _mm_set1_epi32(6554));
  __m128i h = _mm_slli_si128(_mm_sub_epi32(f,
    _mm_mullo_epi16(g, _mm_set1_epi32(10))), 2);
  __m128i i = _mm_packus_epi16(_mm_or_si128(g, h), h);
  _mm_storel_epi64((__m128i*)(p + 1),
    _mm_or_si128(i, _mm_set1_epi32(0x30303030)));
}
``````

After you get over all the underscores, you can see that this code is full of the same divide-by-multiplying trick. For example, the assignment to `g` is doing `WINT_R(f, 16, 10)` four times in parallel (note the choice of k = 16 rather than the minimal k = 10; it could use 10, but using 16 allows it to get a shift for free as part of `_mm_mulhi_epu16`).

Of course, a consideration of different implementations wouldn't be complete without a benchmark. As such, I present a highly unscientific benchmark of how long it takes a single core of my MacBook to convert every integer between 0 and 999999999 to a string:

|                    | clang -O2 | clang -O2 -m32 -msse2 |
|--------------------|-----------|-----------------------|
| `sprintf9`         | 1m31s     | 1m52s                 |
| `divmod9`          | 8.7s      | 11.7s                 |
| `lj_strfmt_wuint9` | 9.5s      | 11.4s                 |
| `vectorised9`      | 5.2s      | 5.2s                  |

We can conclude that vectorisation is a win, that 32-bit code is slightly slower than 64-bit code (except for the vectorised solution), that LuaJIT's cleverness isn't an improvement upon the simple `divmod9` (once Clang has optimised the hell out of `divmod9`), and that `sprintf` is dog slow.

Of course, your choice of compiler is important. I've used Clang for the above, as it is the default easy option for MacBooks, but we shouldn't forget gcc. Then again, perhaps we should forget gcc: its 64-bit compilation of `vectorised9` turns `_mm_set1_epi32(n)` into a store-to-memory, a load-from-memory, and a shuffle — the shuffle is required, but the store and load are definitely not. Meanwhile, gcc's 32-bit compilation of `vectorised9` turns `_mm_storel_epi64` into a store-to-memory, two loads-from-memory, and two more stores-to-memory — everything except the first store is completely superfluous.

Next time we'll see how converting 9-digit integers to strings turns out to be very useful in the context of converting floating point numbers to strings.

LuaJIT#311 is the kind of bug report which compiler authors love: a relatively small self-contained file of code, which when fed through the compiler, does the wrong thing. The issue is now resolved, but I'd like to write about the thought process by which it was resolved.

With a bunch of code omitted for brevity, the issue claims that the following file, `row.lua`, is problematic:

``````
-- ... lots of code ...

for i=0,100 do
  jit.flush()

  r=TIntArrayNative.new{ 0,1,2,-1,-2,-3 }
  res=ff( TIntArrayNative.new{ 1,2,3,2,1,0 } )
  assert( cEql(r, res), 'ff1 failed' )

  r=TIntArrayNative.new{ 0,0,0 }
  res=ff( TIntArrayNative.new{ 0, 0, 0 } )
  assert( cEql(r, res), 'ff2 failed' )

  r=TIntArrayNative.new{ 0,1,-1 }
  res=ff( TIntArrayNative.new{ 0, 1, 0 } )
  assert( cEql(r, res), 'ff3 failed' )

  r=TIntArrayNative.new{ 0,-1,-2,-3, 1,2,3 }
  res=ff( TIntArrayNative.new{ 0,-1,-2,-4,0,1,2 } )
  assert( cEql(r, res), 'ff4 failed' )
end
``````

From this code, we can conclude that the issue isn't perfectly reproducible (hence the 100-iteration loop), and that the issue only affects some traces and not others (the call to `jit.flush` at the top of the loop will result in LuaJIT sometimes choosing to compile different traces on each iteration). Bearing this in mind, trying to reproduce the issue was often fruitless:

``````
$ luajit.exe row.lua && echo OK
OK
``````

But, sometimes, an issue would present itself:

``````
$ luajit.exe row.lua && echo OK
luajit.exe: row.lua:199: ff4 failed
stack traceback:
[C]: in function 'assert'
row.lua:199: in main chunk
[C]: at 0x7ff61f3c1eb0
``````

Given the proportion of times the issue didn't appear, "not perfectly reproducible" was perhaps an understatement, so increasing reproducibility was the first priority. To begin, the `-jdump=t` flag can give an idea of what code gets selected for tracing - the output is along the lines of:

``````
$ luajit.exe -jdump=t row.lua
---- TRACE flush

---- TRACE 1 start row.lua:69
---- TRACE 1 stop -> return

---- TRACE flush

---- TRACE 1 start row.lua:107
---- TRACE 1 stop -> loop

---- TRACE 2 start row.lua:69
---- TRACE 2 stop -> return

---- TRACE flush

---- TRACE 1 start row.lua:74
---- TRACE 1 stop -> return

---- TRACE 2 start row.lua:69
---- TRACE 2 stop -> return

---- TRACE 3 start row.lua:170
---- TRACE 3 abort row.lua:175 -- leaving loop in root trace

---- TRACE flush

---- TRACE 1 start row.lua:107
---- TRACE 1 stop -> loop

---- TRACE 2 start row.lua:145
---- TRACE 2 stop -> loop

---- TRACE flush

---- TRACE 1 start row.lua:69
---- TRACE 1 stop -> return

---- TRACE 2 start row.lua:170
---- TRACE 2 stop -> loop
``````

Now comes the game of deciding which trace site to investigate. The traces which finish with `stop -> return` are less interesting than the traces which finish with `stop -> loop` (as looping traces have more opportunities for mischief). With that in mind, the `row.lua:107` loop is:

``````
      for _,v in ipairs(n) do
        a[k] = v
        k = k + 1
      end
``````

The `row.lua:145` loop is:

``````
  for i=1,#a do
    if a[i] > a[i-1] then
      if r[i-1] > 0 then
        r[i] = r[i-1] + 1
      else
        r[i] = 1
      end
    end

    if a[i] < a[i-1] then
      if r[i-1] < 0 then
        r[i] = r[i-1] - 1
      else
        r[i] = -1
      end
    end
  end
``````

The `row.lua:170` loop is:

``````
  for i=0,#a1 do
    if a1[i] ~= a2[i] then
      return false
    end
  end
``````

Of these three loops, `row.lua:145` contains the most branches, and therefore is the most likely to sometimes have LuaJIT end up choosing a different sequence of branch directions to trace. Given the hypothesis that branch choice is crucial to the issue, this seemed like a good loop to focus on. One way to focus on it is to prevent JIT compilation of other things (by means of `jit.on()` and `jit.off()`), and then to play around with JIT compiler parameters relating to trace selection, eventually stumbling upon this:

``````
@@ -142,6 +142,8 @@ end
 function ff(a)
   local r=TIntArrayNative.new(#a)
 
+  jit.opt.start("hotloop=2")
+  jit.on()
   for i=1,#a do
     if a[i] > a[i-1] then
       if r[i-1] > 0 then
@@ -159,6 +161,7 @@ function ff(a)
       end
     end
   end
+  jit.off()
 
   return r
 end
``````

With this diff in place, the issue becomes reproducible every time, and the next stage of investigation can begin - by which I mean the level of data dumping can be increased from `-jdump=t` all the way up to `-jdump=bitmsrx`:

``````
$ luajit.exe -jdump=bitmsrx row.lua
---- TRACE flush

---- TRACE 1 start row.lua:69
... some bytecode ...
---- TRACE 1 IR
... some IR ...
---- TRACE 1 mcode 184
... some machine code ...
---- TRACE 1 stop -> return

---- TRACE 2 start row.lua:147
... lots of bytecode ...
---- TRACE 2 IR
... lots of IR ...
0039 ------------ LOOP ------------
... more IR ...
---- TRACE 2 mcode 225
... lots of machine code ...
---- TRACE 2 stop -> loop

---- TRACE 2 exit 4
---- TRACE 2 exit 4
---- TRACE 3 start row.lua:74
... some bytecode ...
---- TRACE 3 IR
... some IR ...
---- TRACE 3 mcode 108
... some machine code ...
---- TRACE 3 stop -> return

---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 2 exit 2
---- TRACE 2 exit 2
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 3 exit 0
---- TRACE 4 start 3/0 row.lua:75
... some bytecode ...
---- TRACE 4 IR
... some IR ...
---- TRACE 4 mcode 99
... some machine code ...
---- TRACE 4 stop -> return

---- TRACE 2 exit 1
---- TRACE 2 exit 5
---- TRACE 2 exit 7
---- TRACE 2 exit 1
---- TRACE 2 exit 1
luajit.exe: row.lua:202: ff4 failed
stack traceback:
[C]: in function 'assert'
row.lua:202: in main chunk
[C]: at 0x0100001440
``````

The loop of interest has become `TRACE 2`, and some of the metamethods it invokes have become `TRACE 1`, `TRACE 3`, and `TRACE 4`. After a quick look over the IR of all these traces, it was the IR of `TRACE 2` which caught my attention:

``````
---- TRACE 2 IR
....              SNAP   #0   [ ---- ]
0001 rax   >  int SLOAD  #4    CRI
0002       >  int LE     0001  +2147483646
0003 rbp      int SLOAD  #3    CI
0004 r10   >  cdt SLOAD  #1    T
0006       >  int EQ     0005  +102
0012          i64 BSHL   0003  +3
....              SNAP   #1   [ ---- ---- ---- 0003 0001 ---- 0003 ]
0021       >  i64 GE     0019  0015
....              SNAP   #2   [ ---- ---- ---- 0003 0001 ---- ---- ]
0022       >  i64 GT     0019  0015
....              SNAP   #3   [ ---- ---- ---- 0003 0001 ---- 0003 ]
0023 rsi   >  cdt SLOAD  #2    T
0025       >  int EQ     0024  +102
....              SNAP   #4   [ ---- ---- ---- 0003 0001 ---- 0003 ]
0034       >  i64 GE     0032  +0
0036          i64 XSTORE 0035  -1
0037 rbp    + int ADD    0003  +1
....              SNAP   #5   [ ---- ---- ---- ]
0038       >  int LE     0037  0001
....              SNAP   #6   [ ---- ---- ---- 0037 0001 ---- 0037 ]
0039 ------------ LOOP ------------
0040          i64 BSHL   0037  +3
....              SNAP   #7   [ ---- ---- ---- 0037 0001 ---- 0037 ]
0045       >  i64 GE     0044  0043
....              SNAP   #8   [ ---- ---- ---- 0037 0001 ---- 0037 ]
0046       >  i64 GT     0044  0043
0049          i64 XSTORE 0048  -1
0050 rbp    + int ADD    0037  +1
....              SNAP   #9   [ ---- ---- ---- ]
0051       >  int LE     0050  0001
0052 rbp      int PHI    0037  0050
---- TRACE 2 mcode 225
``````

This being a looping trace, all of the instructions before `-- LOOP --` should correspond to one iteration of the loop, and similarly all of the instructions after `-- LOOP --` should correspond to one iteration of the loop (the difference being that the second set of instructions can rely on a bunch of assumptions and invariants set up by the first set of instructions, and thus can be shorter and more efficient). For example, `0012 i64 BSHL 0003 +3` and `0040 i64 BSHL 0037 +3` represent the same thing (namely, the multiplication of `i` by 8 as part of array indexing). Similarly, `0021 > i64 GE 0019 0015` and `0045 > i64 GE 0044 0043` both represent the `>` in `if a[i] > a[i-1] then`, and `0022 > i64 GT 0019 0015` and `0046 > i64 GT 0044 0043` both represent the `<` in `if a[i] < a[i-1] then`.

The thing which jumped out at me was `0034 > i64 GE 0032 +0` (which comes from the `<` in `if r[i-1] < 0 then`), which has no corresponding instruction after `-- LOOP --`. In other words, LuaJIT concluded that it only needed to check `r[i-1] < 0` was `false` once, and that thereafter, it could assume `r[i-1] < 0` was `false` without needing to check. The question became: why did LuaJIT conclude this?

To answer this question, attention moved to the implementation of LuaJIT's "LOOP" optimisation (in `lj_opt_loop.c`), and in particular the part which emits all of the instructions after `-- LOOP --`:

``````
/* Copy and substitute all recorded instructions and snapshots. */
for (ins = REF_FIRST; ins < invar; ins++) {
  ...
  /* Substitute instruction operands. */
  ir = IR(ins);
  op1 = ir->op1;
  if (!irref_isk(op1)) op1 = subst[op1];
  op2 = ir->op2;
  if (!irref_isk(op2)) op2 = subst[op2];
  if (irm_kind(lj_ir_mode[ir->o]) == IRM_N &&
      op1 == ir->op1 && op2 == ir->op2) {
    subst[ins] = (IRRef1)ins;  /* Shortcut. */
  } else {
    /* Re-emit substituted instruction to the FOLD/CSE/etc. pipeline. */
    IRType1 t = ir->t;
    IRRef ref = tref_ref(emitir(ir->ot & ~IRT_ISPHI, op1, op2));
    subst[ins] = (IRRef1)ref;
    ...
  }
}
``````

The key parts of this code are:

• `for (ins = REF_FIRST; ins < invar; ins++)`, which loops over all instructions prior to `-- LOOP --`.
• `emitir(ir->ot & ~IRT_ISPHI, op1, op2)`, which either emits an instruction, or emits a new constant, or finds a previously-emitted equivalent instruction or constant.
• Assigning to `subst` after emitting an instruction, and reading from `subst` when said instruction appears as an operand to a later instruction.

To see what happened to `0034 > i64 GE 0032 +0`, a conditional breakpoint can be set on the `ir = IR(ins)` line (with the condition being `J->cur.traceno == 2 && ins == REF_BASE + 34`):

The instruction's opcode (`(IROp)ir->o`) is `GE`, as expected. The original left-hand operand (`ir->op1`) is `0x8020`, which after dropping the `8` and converting the `20` from hex to decimal, gives `0032`, as expected. The remaining watches are less obvious: they're all less than `0x8000`, which means that they are constants rather than instructions. Knowing that they happen to be 64-bit signed integer constants, their values are just another watch away:

This shows that the right-hand operand (`op2`) is the constant `0`, as expected. Furthermore, LuaJIT thinks that the left-hand operand (`op1`) is the constant `-1`. Given that the original source code was `r[i-1] < 0`, this might seem slightly surprising, but it is in fact LuaJIT being clever: the previous loop iteration did `r[i] = -1` (this is `0036 i64 XSTORE 0035 -1` in the IR), and `i` increments by one on each iteration (this is `0037 rbp + int ADD 0003 +1` / `0050 rbp + int ADD 0037 +1` / `0052 rbp int PHI 0037 0050` in the IR), and so LuaJIT is correct to conclude that on this iteration, `r[i-1]` is `-1` (reaching this conclusion requires quite a lot of cleverness around re-association and forwarding and aliasing, but that's a story for another day).

So, at this point, everything is lined up for `emitir` to emit (or find or etc) the instruction `int64_t(-1) >= int64_t(0)`. Given both operands are constants, this instruction should get constant-folded, and indeed `emitir` ends up at `fold_kfold_int64comp` from `lj_opt_fold.c`:

``````
/* Special return values for the fold functions. */
enum {
  ...
  FAILFOLD,   /* Guard would always fail. */
  DROPFOLD,   /* Guard eliminated. */
  ...
};
...
#define CONDFOLD(cond)  ((TRef)FAILFOLD + (TRef)(cond))
...
LJFOLD(LT KINT64 KINT64)
LJFOLD(GE KINT64 KINT64)
LJFOLD(LE KINT64 KINT64)
LJFOLD(GT KINT64 KINT64)
LJFOLD(ULT KINT64 KINT64)
LJFOLD(UGE KINT64 KINT64)
LJFOLD(ULE KINT64 KINT64)
LJFOLD(UGT KINT64 KINT64)
LJFOLDF(kfold_int64comp)
{
#if LJ_HASFFI
  uint64_t a = ir_k64(fleft)->u64, b = ir_k64(fright)->u64;
  switch ((IROp)fins->o) {
  case IR_LT: return CONDFOLD(a < b);
  case IR_GE: return CONDFOLD(a >= b);
  case IR_LE: return CONDFOLD(a <= b);
  case IR_GT: return CONDFOLD(a > b);
  case IR_ULT: return CONDFOLD((uint64_t)a < (uint64_t)b);
  case IR_UGE: return CONDFOLD((uint64_t)a >= (uint64_t)b);
  case IR_ULE: return CONDFOLD((uint64_t)a <= (uint64_t)b);
  case IR_UGT: return CONDFOLD((uint64_t)a > (uint64_t)b);
  default: lua_assert(0); return FAILFOLD;
  }
#else
  UNUSED(J); lua_assert(0); return FAILFOLD;
#endif
}
``````

Any sane person should conclude that `int64_t(-1) >= int64_t(0)` is `false` (`FAILFOLD`), but if you follow the code in the above snippet, you'll see that it returns `DROPFOLD` (`true`) for `int64_t(-1) IR_GE int64_t(0)`, because it always does unsigned comparison. Therein lies the bug: the four instructions with `U` in their name are meant to do unsigned comparison, but the four instructions without `U` in their name are meant to do signed comparison. At runtime, they do indeed do this, but at constant-folding-time, the signed comparisons do the wrong thing if one operand is negative and the other is non-negative.

The fix is simple: have `fold_kfold_int64comp` perform signed comparisons for the four instructions which represent signed comparisons.

Last time, I bemoaned what compilers did to some of the CPython interpreter main loop. Following those remarks, there are three obvious courses of action:

1. Make targeted improvements to the compilers.
2. Write the interpreter main loop directly in assembly.
3. Tweak the C source code to make it more amenable to good compilation.

Option three is the easiest to explore, so let's start with a random benchmark to use as a baseline:

``````
Python 3.6.0+ (default, Mar  7 2017, 00:04:40)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
>>> from performance.benchmarks.bm_nbody import bench_nbody
>>> bench_nbody(10, 'sun', 100000)
9.907924063038081
``````

The following patch is a little long, but each individual change is relatively boring, and all the changes are motivated by what we saw in gcc's assembly:

• Using `f->f_localsplus` instead of `fastlocals`, to save on a memory load.
• Using `oparg` as an unsigned type, to save on `movsxd` instructions.
• Computing `next_instr - first_instr` as a difference of `char*`s rather than `uint16_t*`s, to save on a `sar` and an `add`.
``````
diff --git a/Python/ceval.c b/Python/ceval.c
index d5172b9..79ccf2a 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -729,7 +729,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
int opcode;        /* Current opcode */
int oparg;         /* Current opcode argument, if any */
enum why_code why; /* Reason for block stack unwind */
-    PyObject **fastlocals, **freevars;
+    PyObject **freevars;
PyObject *retval = NULL;            /* Return value */
PyCodeObject *co;
@@ -865,7 +865,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
/* Code access macros */

/* The integer overflow is checked by an assertion below. */
-#define INSTR_OFFSET()  (sizeof(_Py_CODEUNIT) * (int)(next_instr - first_instr))
+#define INSTR_OFFSET()  ((char*)next_instr - (char*)first_instr)
#define NEXTOPARG()  do { \
_Py_CODEUNIT word = *next_instr; \
opcode = _Py_OPCODE(word); \
@@ -959,7 +959,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)

/* Local variable macros */

-#define GETLOCAL(i)     (fastlocals[i])
+#define GETLOCAL(i)     (f->f_localsplus[i])

/* The SETLOCAL() macro must not DECREF the local variable in-place and
then store the new value; it must copy the old value to a temporary
@@ -1045,7 +1045,6 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
co = f->f_code;
names = co->co_names;
consts = co->co_consts;
-    fastlocals = f->f_localsplus;
freevars = f->f_localsplus + co->co_nlocals;
assert(PyBytes_Check(co->co_code));
assert(PyBytes_GET_SIZE(co->co_code) <= INT_MAX);
@@ -1228,7 +1227,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
FAST_DISPATCH();

-            PyObject *value = GETLOCAL(oparg);
+            PyObject *value = GETLOCAL((unsigned)oparg);
if (value == NULL) {
format_exc_check_arg(PyExc_UnboundLocalError,
UNBOUNDLOCAL_ERROR_MSG,
@@ -1242,7 +1241,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)

-            PyObject *value = GETITEM(consts, oparg);
+            PyObject *value = GETITEM(consts, (unsigned)oparg);
Py_INCREF(value);
PUSH(value);
FAST_DISPATCH();
@@ -1251,7 +1250,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
PREDICTED(STORE_FAST);
TARGET(STORE_FAST) {
PyObject *value = POP();
-            SETLOCAL(oparg, value);
+            SETLOCAL((unsigned)oparg, value);
FAST_DISPATCH();
}

@@ -1526,7 +1525,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)

TARGET(LIST_APPEND) {
PyObject *v = POP();
-            PyObject *list = PEEK(oparg);
+            PyObject *list = PEEK((size_t)(unsigned)oparg);
int err;
err = PyList_Append(list, v);
Py_DECREF(v);
@@ -1731,7 +1730,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
_Py_IDENTIFIER(__annotations__);
PyObject *ann_dict;
PyObject *ann = POP();
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
int err;
if (f->f_locals == NULL) {
PyErr_Format(PyExc_SystemError,
@@ -2155,7 +2154,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(STORE_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *v = POP();
PyObject *ns = f->f_locals;
int err;
@@ -2176,7 +2175,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(DELETE_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *ns = f->f_locals;
int err;
if (ns == NULL) {
@@ -2198,7 +2197,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
TARGET(UNPACK_SEQUENCE) {
PyObject *seq = POP(), *item, **items;
if (PyTuple_CheckExact(seq) &&
-                PyTuple_GET_SIZE(seq) == oparg) {
+                PyTuple_GET_SIZE(seq) == (Py_ssize_t)(size_t)(unsigned)oparg) {
items = ((PyTupleObject *)seq)->ob_item;
while (oparg--) {
item = items[oparg];
@@ -2206,7 +2205,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
PUSH(item);
}
} else if (PyList_CheckExact(seq) &&
-                       PyList_GET_SIZE(seq) == oparg) {
+                       PyList_GET_SIZE(seq) == (Py_ssize_t)(size_t)(unsigned)oparg) {
items = ((PyListObject *)seq)->ob_item;
while (oparg--) {
item = items[oparg];
@@ -2215,7 +2214,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}
} else if (unpack_iterable(seq, oparg, -1,
stack_pointer + oparg)) {
} else {
/* unpack_iterable() raised an exception */
Py_DECREF(seq);
@@ -2241,7 +2240,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(STORE_ATTR) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *owner = TOP();
PyObject *v = SECOND();
int err;
@@ -2255,7 +2254,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(DELETE_ATTR) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *owner = POP();
int err;
err = PyObject_SetAttr(owner, name, (PyObject *)NULL);
@@ -2266,7 +2265,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(STORE_GLOBAL) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *v = POP();
int err;
err = PyDict_SetItem(f->f_globals, name, v);
@@ -2277,7 +2276,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(DELETE_GLOBAL) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
int err;
err = PyDict_DelItem(f->f_globals, name);
if (err != 0) {
@@ -2289,7 +2288,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *locals = f->f_locals;
PyObject *v;
if (locals == NULL) {
@@ -2340,7 +2339,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *v;
if (PyDict_CheckExact(f->f_globals)
&& PyDict_CheckExact(f->f_builtins))
@@ -2385,9 +2384,9 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(DELETE_FAST) {
-            PyObject *v = GETLOCAL(oparg);
+            PyObject *v = GETLOCAL((unsigned)oparg);
if (v != NULL) {
-                SETLOCAL(oparg, NULL);
+                SETLOCAL((unsigned)oparg, NULL);
DISPATCH();
}
format_exc_check_arg(
@@ -2488,7 +2487,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(BUILD_TUPLE) {
-            PyObject *tup = PyTuple_New(oparg);
+            PyObject *tup = PyTuple_New((unsigned)oparg);
if (tup == NULL)
goto error;
while (--oparg >= 0) {
@@ -2500,7 +2499,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(BUILD_LIST) {
-            PyObject *list =  PyList_New(oparg);
+            PyObject *list =  PyList_New((unsigned)oparg);
if (list == NULL)
goto error;
while (--oparg >= 0) {
@@ -2571,7 +2570,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
Py_DECREF(item);
}
if (err != 0) {
Py_DECREF(set);
goto error;
@@ -2601,7 +2600,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)

TARGET(BUILD_MAP) {
Py_ssize_t i;
-            PyObject *map = _PyDict_NewPresized((Py_ssize_t)oparg);
+            PyObject *map = _PyDict_NewPresized((size_t)(unsigned)oparg);
if (map == NULL)
goto error;
for (i = oparg; i > 0; i--) {
@@ -2684,12 +2683,12 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
PyObject *map;
PyObject *keys = TOP();
if (!PyTuple_CheckExact(keys) ||
-                PyTuple_GET_SIZE(keys) != (Py_ssize_t)oparg) {
+                PyTuple_GET_SIZE(keys) != (Py_ssize_t)(size_t)(unsigned)oparg) {
PyErr_SetString(PyExc_SystemError,
goto error;
}
-            map = _PyDict_NewPresized((Py_ssize_t)oparg);
+            map = _PyDict_NewPresized((size_t)(unsigned)oparg);
if (map == NULL) {
goto error;
}
@@ -2746,7 +2745,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
for (i = oparg; i > 0; i--) {
PyObject *arg = PEEK(i);
if (_PyDict_MergeEx(sum, arg, 2) < 0) {
-                    PyObject *func = PEEK(2 + oparg);
+                    PyObject *func = PEEK(2 + (unsigned)oparg);
if (PyErr_ExceptionMatches(PyExc_AttributeError)) {
PyErr_Format(PyExc_TypeError,
"%.200s%.200s argument after ** "
@@ -2810,7 +2809,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *owner = TOP();
PyObject *res = PyObject_GetAttr(owner, name);
Py_DECREF(owner);
@@ -2835,7 +2834,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(IMPORT_NAME) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *fromlist = POP();
PyObject *level = TOP();
PyObject *res;
@@ -2869,7 +2868,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(IMPORT_FROM) {
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
PyObject *from = TOP();
PyObject *res;
res = import_from(from, name);
@@ -2880,7 +2879,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}

TARGET(JUMP_FORWARD) {
-            JUMPBY(oparg);
+            JUMPBY((unsigned)oparg);
FAST_DISPATCH();
}

@@ -2894,7 +2893,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}
if (cond == Py_False) {
Py_DECREF(cond);
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
FAST_DISPATCH();
}
err = PyObject_IsTrue(cond);
@@ -2902,7 +2901,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
if (err > 0)
err = 0;
else if (err == 0)
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
else
goto error;
DISPATCH();
@@ -2918,14 +2917,14 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
}
if (cond == Py_True) {
Py_DECREF(cond);
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
FAST_DISPATCH();
}
err = PyObject_IsTrue(cond);
Py_DECREF(cond);
if (err > 0) {
err = 0;
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
}
else if (err == 0)
;
@@ -2943,7 +2942,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 FAST_DISPATCH();
             }
             if (cond == Py_False) {
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
                 FAST_DISPATCH();
             }
             err = PyObject_IsTrue(cond);
@@ -2953,7 +2952,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 err = 0;
             }
             else if (err == 0)
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
             else
                 goto error;
             DISPATCH();
@@ -2968,13 +2967,13 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
                 FAST_DISPATCH();
             }
             if (cond == Py_True) {
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
                 FAST_DISPATCH();
             }
             err = PyObject_IsTrue(cond);
             if (err > 0) {
                 err = 0;
-                JUMPTO(oparg);
+                JUMPTO((unsigned)oparg);
             }
             else if (err == 0) {
@@ -2987,7 +2986,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)

         PREDICTED(JUMP_ABSOLUTE);
         TARGET(JUMP_ABSOLUTE) {
-            JUMPTO(oparg);
+            JUMPTO((unsigned)oparg);
 #if FAST_LOOPS
             /* Enabling this path speeds-up all while and for-loops by bypassing
                the per-loop checks for signals.  By default, this should be turned-off
@@ -3065,7 +3064,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
             /* iterator ended normally */
             Py_DECREF(iter);
-            JUMPBY(oparg);
+            JUMPBY((unsigned)oparg);
             PREDICT(POP_BLOCK);
             DISPATCH();
         }
@@ -3076,7 +3075,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
         }

         TARGET(CONTINUE_LOOP) {
-            retval = PyLong_FromLong(oparg);
+            retval = PyLong_FromLong((unsigned)oparg);
             if (retval == NULL)
                 goto error;
             why = WHY_CONTINUE;
@@ -3755,7 +3754,7 @@ format_missing(const char *kind, PyCodeObject *co, PyObject *names)

 static void
 missing_arguments(PyCodeObject *co, Py_ssize_t missing, Py_ssize_t defcount,
-                  PyObject **fastlocals)
+                  PyFrameObject *f)
 {
     Py_ssize_t i, j = 0;
     Py_ssize_t start, end;
@@ -3793,7 +3792,7 @@ missing_arguments(PyCodeObject *co, Py_ssize_t missing, Py_ssize_t defcount,

 static void
 too_many_positional(PyCodeObject *co, Py_ssize_t given, Py_ssize_t defcount,
-                    PyObject **fastlocals)
+                    PyFrameObject *f)
 {
     int plural;
     Py_ssize_t kwonly_given = 0;
@@ -3863,7 +3862,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
     PyCodeObject* co = (PyCodeObject*)_co;
     PyFrameObject *f;
     PyObject *retval = NULL;
-    PyObject **fastlocals, **freevars;
+    PyObject **freevars;
     PyObject *x, *u;
     const Py_ssize_t total_args = co->co_argcount + co->co_kwonlyargcount;
@@ -3883,7 +3882,6 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
     if (f == NULL) {
         return NULL;
     }
-    fastlocals = f->f_localsplus;
     freevars = f->f_localsplus + co->co_nlocals;

     /* Create a dictionary for keyword parameters (**kwags) */
@@ -3990,7 +3988,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,

     /* Check the number of positional arguments */
     if (argcount > co->co_argcount && !(co->co_flags & CO_VARARGS)) {
-        too_many_positional(co, argcount, defcount, fastlocals);
+        too_many_positional(co, argcount, defcount, f);
         goto fail;
     }

@@ -4004,7 +4002,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
             }
         }
         if (missing) {
-            missing_arguments(co, missing, defcount, fastlocals);
+            missing_arguments(co, missing, defcount, f);
             goto fail;
         }
         if (n > m)
@@ -4039,7 +4037,7 @@ _PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
                 missing++;
         }
         if (missing) {
-            missing_arguments(co, missing, -1, fastlocals);
+            missing_arguments(co, missing, -1, f);
             goto fail;
         }
     }
@@ -4845,7 +4843,6 @@ _PyFunction_FastCall(PyCodeObject *co, PyObject **args, Py_ssize_t nargs,
 {
     PyFrameObject *f;
-    PyObject **fastlocals;
     Py_ssize_t i;
     PyObject *result;

@@ -4861,11 +4858,9 @@ _PyFunction_FastCall(PyCodeObject *co, PyObject **args, Py_ssize_t nargs,
         return NULL;
     }

-    fastlocals = f->f_localsplus;
-
     for (i = 0; i < nargs; i++) {
         Py_INCREF(*args);
-        fastlocals[i] = *args++;
+        f->f_localsplus[(size_t)i] = *args++;
     }
     result = PyEval_EvalFrameEx(f,0);

@@ -5335,9 +5330,8 @@ unicode_concatenate(PyObject *v, PyObject *w,
         switch (opcode) {
         case STORE_FAST:
         {
-            PyObject **fastlocals = f->f_localsplus;
-            if (GETLOCAL(oparg) == v)
-                SETLOCAL(oparg, NULL);
+            if (GETLOCAL((unsigned)oparg) == v)
+                SETLOCAL((unsigned)oparg, NULL);
             break;
         }
         case STORE_DEREF:
@@ -5352,7 +5346,7 @@ unicode_concatenate(PyObject *v, PyObject *w,
         case STORE_NAME:
         {
             PyObject *names = f->f_code->co_names;
-            PyObject *name = GETITEM(names, oparg);
+            PyObject *name = GETITEM(names, (unsigned)oparg);
             PyObject *locals = f->f_locals;
             if (PyDict_CheckExact(locals) &&
                 PyDict_GetItem(locals, name) == v) {
``````

With all of these changes applied, we get a 1.3% speedup:

``````Python 3.6.0+ (default, Mar  7 2017, 00:06:13)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux
>>> from performance.benchmarks.bm_nbody import bench_nbody
>>> bench_nbody(10, 'sun', 100000)
9.777307317999657
``````

A 1.3% speedup is simultaneously not very much, and also a surprisingly large amount for eliminating just a few instructions here and there. Of course, your mileage may vary, and this is just one randomly chosen benchmark.

Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: it is hard to compile perfectly. The difficulty of compilation scales with the number of opcodes the interpreter has - more than 100 in CPython's case - but we can get a feel for how well a compiler is doing by looking at just one opcode.

For this exercise I'll look at CPython 3.6's `LOAD_FAST` opcode, which in C is:

``````TARGET(LOAD_FAST) {
PyObject *value = GETLOCAL(oparg);
if (value == NULL) {
format_exc_check_arg(PyExc_UnboundLocalError,
UNBOUNDLOCAL_ERROR_MSG,
PyTuple_GetItem(co->co_varnames, oparg));
goto error;
}
Py_INCREF(value);
PUSH(value);
FAST_DISPATCH();
}
``````

This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:

``````TARGET(LOAD_FAST) {
PyObject *value = fastlocals[oparg];
if (value == NULL) {
// ... error handling ...
}
value->ob_refcnt++;
*stack_pointer++ = value;
if (_Py_TracingPossible) {
// ... slow path ...
}
f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
uint16_t word = *next_instr;
opcode = word & 255;
oparg = word >> 8;
next_instr++;
goto *opcode_targets[opcode];
}
``````

With this code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:

``````; PyObject *value = fastlocals[oparg]
48 8B 44 24 38           mov     rax, [rsp+38h]
49 63 F5                 movsxd  rsi, r13d
48 8B 04 F0              mov     rax, [rax+rsi*8]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 86 3A 00 00        jz      loc_545A2B
; value->ob_refcnt++
; *stack_pointer++ = value
4C 89 F2                 mov     rdx, r14
48 83 00 01              add     qword ptr [rax], 1
49 83 C6 08              add     r14, 8
48 89 02                 mov     [rdx], rax
; if (_Py_TracingPossible)
8B 05 97 AA 3B 00        mov     eax, cs:_Py_TracingPossible
85 C0                    test    eax, eax
0F 85 A3 CD FF FF        jnz     loc_53ED64
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
; next_instr++
48 89 EA                 mov     rdx, rbp
48 2B 14 24              sub     rdx, [rsp]
48 83 C5 02              add     rbp, 2
48 D1 FA                 sar     rdx, 1
01 D2                    add     edx, edx
89 53 78                 mov     [rbx+78h], edx
; word = *next_instr
0F B7 55 FE              movzx   edx, word ptr [rbp-2]
; opcode = word & 255
44 0F B6 C2              movzx   r8d, dl
; oparg = word >> 8
0F B6 F6                 movzx   esi, dh
; goto *opcode_targets[opcode]
49 63 D0                 movsxd  rdx, r8d
41 89 F5                 mov     r13d, esi
48 8B 14 D5 20 51 60 00  mov     rdx, ds:opcode_targets_11490[rdx*8]
FF E2                    jmp     rdx
``````

From this, we can infer that gcc made the following choices:

• `fastlocals` is stored on the stack at `[rsp+38h]`.
• `oparg` is stored in the register `r13d`.
• `stack_pointer` is stored in the register `r14`.
• `next_instr` is stored in the register `rbp`.
• `first_instr` is stored on the stack at `[rsp]`.
• `f` is stored in the register `rbx`.

We can also spot a number of sad things:

• A memory load could be avoided if gcc knew that `[rsp+38h]` was `rbx+178h` (`fastlocals` is an array member of `f`).
• `movsxd rsi, r13d` could be avoided if gcc knew that `r13d` (`oparg`) was never negative (`oparg` is stupidly declared as `int` rather than `unsigned int`, and proving that it can never be negative requires a bit of effort and language-lawyering [1]).
• `mov rdx, r14` feels silly; gcc should transpose `add r14, 8` and `mov [rdx], rax`, then this instruction wouldn't be required (as a register-register move, it doesn't consume any execution resources, but it still consumes icache space and decoding resources).
• `sar rdx, 1` and `add edx, edx` could be avoided if gcc knew that `first_instr` and `next_instr` were always even. The preceding `sub` could also be avoided if `next_instr` was maintained as an offset from `first_instr` rather than an absolute pointer.
• `movsxd rdx, r8d` could be avoided if gcc realized that `r8d` (`opcode`) was never negative (which it should be able to spot fairly easily: it is the result of a `movzx` instruction, and thus can only be in the range 0 through 255).
• `mov r13d, esi` could be avoided by doing `movzx esi, dh` directly into `r13d` instead of into `esi` (except that `movzx r13d, dh` cannot be expressed in machine code [2], so the compiler would have to swap things around and put `oparg` in say `ebp` and `next_instr` in say `r13`).
• `mov rdx, ds:opcode_targets_11490[rdx*8]` and `jmp rdx` could be merged into a single instruction `jmp ds:opcode_targets_11490[rdx*8]` (thus saving an instruction decode and a byte or two of icache).

It feels like there are quite a few things which gcc could improve upon. Despite that, it could be the case that gcc is doing better than other compilers. To find out, we'll have to consider a few other compilers. With that in mind, we can look at what Clang on OSX x64 does:

``````; PyObject *value = fastlocals[oparg]
49 63 F7                 movsxd  rsi, r15d
49 8B 84 F6 78 01 00 00  mov     rax, [r14+rsi*8+178h]
; if (value == NULL)
48 85 C0                 test    rax, rax
0F 84 51 41 00 00        jz      loc_F7B6D
; value->ob_refcnt++
48 FF 00                 inc     qword ptr [rax]
; *stack_pointer++ = value
49 89 45 00              mov     [r13+0], rax
49 83 C5 08              add     r13, 8
; if (_Py_TracingPossible)
44 8B 3D D2 D5 1A 00     mov     r15d, cs:__Py_TracingPossible
45 85 FF                 test    r15d, r15d
0F 85 B7 BB FF FF        jnz     loc_EF5EE
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B 85 48 FE FF FF     mov     rax, [rbp-1B8h]
48 2B 85 28 FE FF FF     sub     rax, [rbp-1D8h]
48 D1 F8                 sar     rax, 1
48 98                    cdqe
48 01 C0                 add     rax, rax
41 89 46 78              mov     [r14+78h], eax
; word = *next_instr
48 8B 95 48 FE FF FF     mov     rdx, [rbp-1B8h]
0F B7 02                 movzx   eax, word ptr [rdx]
; opcode = word & 255
0F B6 D8                 movzx   ebx, al
; oparg = word >> 8
0F B6 C4                 movzx   eax, ah
41 89 C7                 mov     r15d, eax
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 95 48 FE FF FF     mov     [rbp+var_1B8], rdx
; goto *opcode_targets[opcode]
48 63 C3                 movsxd  rax, ebx
48 8D 0D 87 01 13 00     lea     rcx, _opcode_targets_11343
48 8B 04 C1              mov     rax, [rcx+rax*8]
FF E0                    jmp     rax
``````

From this, we can infer that clang made the following choices:

• `oparg` is stored in the register `r15d`.
• `stack_pointer` is stored in the register `r13`.
• `next_instr` is stored on the stack at `[rbp-1B8h]`.
• `first_instr` is stored on the stack at `[rbp-1D8h]`.
• `f` is stored in the register `r14`.

Again, we can critique this assembly:

• `movsxd rsi, r15d` could be avoided if clang knew that `oparg` was never negative.
• Clang does realise that `fastlocals` can be expressed as `r14+178h`, which is good.
• `inc qword ptr [rax]` versus `add qword ptr [rax], 1` is debatable, though I tend to have a slight preference for `add` (`inc` might suffer a partial flags stall, but on the flip side is a slightly shorter encoding).
• The choice of `r13` for `stack_pointer` is a shame, as it requires `[r13+0]` rather than just `[r13]`. It would have been better to put say `f` in `r13` and `stack_pointer` in `r14`.
• Clang gets points for putting `mov [r13+0], rax` and `add r13, 8` in the "right" order, thus avoiding the extra `mov` performed by gcc.
• `next_instr` should really have been assigned a register rather than a stack slot.
• `cdqe`, `add rax, rax`, `mov [r14+78h], eax` is a disappointing instruction sequence: given the 32-bit store at the end, Clang should have spotted that `add rax, rax` could be turned into `add eax, eax`, and thus that `cdqe` could be dropped entirely.
• The memory load `mov rdx, [rbp-1B8h]` could be avoided by instead doing `mov rdx, rax` immediately after the preceding `mov rax, [rbp-1B8h]`.
• The `mov` in `movzx eax, ah`, `mov r15d, eax` could be eliminated had Clang chosen a more suitable register for `oparg` [2].
• `movsxd rax, ebx` could be avoided if Clang realized that `ebx` (`opcode`) was never negative (which it should be able to spot fairly easily: it is the result of a `movzx` instruction, and thus can only be in the range 0 through 255).
• Clang has generated position-independent code, thus requiring `lea rcx, _opcode_targets_11343` rather than fusing the address of the jump table into the next memory load.
• `mov rax, [rcx+rax*8]` and `jmp rax` could be fused into a single instruction.

Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.

Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed `goto` statements, and instead has to fall back to using a `switch` statement. Bearing that in mind, the assembly is:

``````; PyObject *value = fastlocals[oparg]
48 63 C3                  movsxd  rax, ebx
49 8B 8C C7 78 01 00 00   mov     rcx, [r15+rax*8+178h]
; if (value == NULL)
48 85 C9                  test    rcx, rcx
0F 84 5C 68 04 00         jz      loc_1E0791BF
; value->ob_refcnt++
48 FF 01                  inc     qword ptr [rcx]
; *stack_pointer++ = value
49 89 0C 24               mov     [r12], rcx
49 83 C4 08               add     r12, 8
4C 89 64 24 48            mov     [rsp+48h], r12
; end of switch case
E9 77 FF FF FF            jmp     loc_1E0328EF

loc_1E0328EF:
; if (_Py_TracingPossible)
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
48 8B C2                 mov     rax, rdx
49 2B C6                 sub     rax, r14
48 D1 F8                 sar     rax, 1
03 C0                    add     eax, eax
83 3D 8F 2F 33 00 00     cmp     cs:_Py_TracingPossible, 0
41 89 47 78              mov     [r15+78h], eax
0F 85 DB 3D 04 00        jnz     loc_1E0766E6
; word = *next_instr
0F B7 1A                 movzx   ebx, word ptr [rdx]
; opcode = word & 255
44 0F B6 EB              movzx   r13d, bl
; oparg = word >> 8
C1 EB 08                 shr     ebx, 8
; next_instr++
48 83 C2 02              add     rdx, 2
48 89 54 24 40           mov     [rsp+40h], rdx
; spill oparg
89 5C 24 70              mov     dword ptr [rsp+70h], ebx
; align instruction stream
0F 1F 40 00              nop     dword ptr [rax+00h]
db      66h, 66h, 66h
66 0F 1F 84 00 00 00 00  nop     word ptr [rax+rax+00000000h]
; check opcode in valid range
41 8D 45 FF              lea     eax, [r13-1]
3D 9D 00 00 00           cmp     eax, 9Dh
0F 87 48 6A 04 00        ja      loc_1E079387
48 63 C8                 movsxd  rcx, eax
41 8B 84 88 CC 67 03 00  mov     eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0                 add     rax, r8
FF E0                    jmp     rax
``````

For MSVC, we can infer:

• `oparg` is stored in the register `ebx`, and also on the stack at `[rsp+70h]`.
• `stack_pointer` is stored in the register `r12`, and also on the stack at `[rsp+48h]`.
• `next_instr` is stored in register `rdx`, and also on the stack at `[rsp+40h]`.
• `first_instr` is stored in register `r14`.
• `f` is stored in the register `r15`.

As per the established pattern, the critique on MSVC's code is:

• `movsxd rax, ebx` is, yet again, the fault of `oparg` being signed.
• Like clang, MSVC realised that `fastlocals` can be expressed as `r15+178h`, which is good.
• Like clang, MSVC opted for `inc` over `add`, which is debatable.
• Like clang, MSVC put `mov [r12], rcx` and `add r12, 8` in the right order.
• MSVC's choice of register for `stack_pointer` is good, as it avoids `+0` in memory operands.
• MSVC's spilling of registers to the stack is kind of sad.
• `jmp loc_1E0328EF` is incurred because MSVC lacks support for computed `goto` statements.
• `f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))` and `if (_Py_TracingPossible)` are interleaved slightly, which is cute.
• Like gcc, MSVC avoids the trap of `cdqe` and `add rax, rax`, instead opting for `add eax, eax`. Alas, like gcc, it still doesn't realise that neither the `sar rax, 1` nor the `add eax, eax` are necessary.
• `shr ebx, 8` is done instead of `movzx ebx, bh` - I'd have a slight preference for the latter, but there isn't much in it. At least the choice of registers avoids the gratuitous `mov` done by gcc and clang.
• The aligning of the instruction stream is - as far as I can tell - completely redundant, as nothing jumps to the aligned instruction.
• Rather than having a jump table for opcodes 0 through 255 inclusive, MSVC has a jump table for opcodes 1 through 158 inclusive. This saves a bit of virtual address space, but comes at the cost of `lea eax, [r13-1]`, `cmp eax, 9Dh`, `ja loc_1E079387`.
• `movsxd rcx, eax` could be avoided if MSVC realized that `eax` (`opcode - 1`) was never negative (which it should be able to spot fairly easily: by means of `cmp eax, 9Dh`, `ja loc_1E079387` it has already verified that `eax` is between 0 and 157 inclusive).
• MSVC's jump table contains 32-bit offsets rather than 64-bit absolute pointers. This saves on address space and dcache and relocations, but comes at the cost of the `add rax, r8` instruction to turn an offset into an absolute pointer.

MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed `goto` statements is certainly painful though, and accounts for four entries in the critique list.

Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for `LOAD_FAST`:

``````; PyObject *value = fastlocals[oparg]
48 8B 94 CB 00 01 00 00  mov     rdx, [rbx+rcx*8+100h]
; word = *next_instr
41 0F B7 04 2E           movzx   eax, word ptr [r14+rbp]
; if (value == NULL)
48 85 D2                 test    rdx, rdx
0F 84 F7 12 00 00        jz      loc_1E00696D
; *stack_pointer++ = value
49 89 14 24              mov     [r12], rdx
49 83 C4 08              add     r12, 8
; f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
89 2B                    mov     [rbx], ebp
; value->ob_refcnt++
48 83 02 01              add     qword ptr [rdx], 1
; if (_Py_TracingPossible)
41 F6 47 F8 01           test    byte ptr [r15-8], 1
0F 85 8F BA FF FF        jnz     loc_1E00111E
; oparg = word >> 8
0F B6 CC                 movzx   ecx, ah
; opcode = word & 255
0F B6 C0                 movzx   eax, al
; next_instr++
83 C5 02                 add     ebp, 2
; goto *opcode_targets[opcode]
41 FF 24 C7              jmp     qword ptr [r15+rax*8]
``````

My assembly makes the following register assignment choices:

• `oparg` is stored in the register `ecx` (at least on Windows; on other platforms it would be in `edi` - in both cases this happens to coincide with the register which the normal calling convention uses for the first argument).
• `stack_pointer` is stored in the register `r12`.
• The offset of `next_instr` from `first_instr` is stored in register `ebp`.
• `first_instr` is stored in register `r14`.
• `f+78h` is stored in the register `rbx` (in particular, this means that accesses to `f->f_lasti` don't need a displacement, and accesses to some fields toward the end of `PyFrameObject` can be accessed with a one-byte displacement rather than a four-byte displacement).
• The address of the jump table is stored in register `r15`.

The combination of storing `next_instr` as an offset and keeping `f` biased by `offsetof(PyFrameObject, f_lasti)` means that `f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))` is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping `f` biased has no downside, and the occasional other upside (as noted, some fields toward the end of `PyFrameObject` become accessible with a one-byte displacement rather than a four-byte one). Storing `next_instr` as an offset has the minor downside of making the `*next_instr` memory operand slightly more complex (`[r14+rbp]` rather than `[rbp]`), but this is a very low cost, and the offset approach also makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on `next_instr++`. Keeping the jump table address in `r15` is expensive (as POSIX x64 only has six non-volatile registers, and this burns one of those six for a runtime constant), but makes opcode dispatch cheap (which is important, given that dispatch is replicated into all 100+ opcodes), and has some upsides (e.g. `rip`-relative `lea` instructions can instead be `r15`-relative, and thus be executed on a wider range of ports). I also change `_Py_TracingPossible` from being a 32-bit variable to being a 1-bit variable, and put this variable just before the jump table (so that it can be addressed with a one-byte offset from `r15`). The other notable thing to point out is pulling `word = *next_instr` up towards the start of the instruction stream - I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.

That is one opcode - `LOAD_FAST` - considered in detail. Only 100+ other opcodes to go...

[1] There are two kinds of assignment to `oparg`: one kind we've already seen, namely `oparg = word >> 8`, which fairly obviously can't make `oparg` negative. The other kind is in the `EXTENDED_ARG` opcode, which does `oparg |= oldoparg << 8;`: we have to appeal to language lawyering to claim that `oldoparg` being non-negative implies that `oldoparg << 8` is non-negative (signed overflow is undefined and all that). Then one simple step to claim that `oparg` being non-negative and `oldoparg << 8` being non-negative implies `oparg | (oldoparg << 8)` is non-negative.

[2] The `ah`/`bh`/`ch`/`dh` registers can only be accessed if a REX prefix is not used. The `r8` through `r15` registers can only be accessed if a REX prefix is used. QED.

[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...

There are various ways of estimating π, though one cheap and cheerful way is to estimate the area of a circle: imagine a grid of 1x1 squares, draw a perfect circle of radius `r`, and estimate the area of the circle as being the number of squares whose midpoint is within the circle. We can implement this in a few lines of Lua:

``````r = 1000
for x = -r, r do
for y = -r, r do
if x^2 + y^2 <= r^2 then
n = (n or 0) + 1
end
end
end
print(n / r^2)
``````

With a small amount of thought, the inner `for` and `if` can be replaced with an analytical solution: for any given `x`, the `y` values satisfying `x^2 + y^2 <= r^2` are exactly those with `|y| <= sqrt(r^2 - x^2)`, and there are `2 * math.floor(math.sqrt(r^2 - x^2)) + 1` such integers:

``````r = 1000
for x = -r, r do
n = (n or 0) + math.floor(math.sqrt(r^2 - x^2)) * 2 + 1
end
print(n / r^2)
``````

At this point, we can bump `r` up to `10000000`, run the above through LuaJIT, and get an estimate of `3.1415926535059` in less than a tenth of a second. Not a bad estimate for a few lines of code and a tiny amount of compute time.

The estimate itself is nice, but I find it more interesting to look at what LuaJIT is doing behind the scenes with this code. To begin with, we can inspect the bytecode - if source code is an array of characters, then bytecode is an array of instructions (plus an array of constants, plus a few other bits and bobs), and the first thing LuaJIT does with any source code is turn it into bytecode. We can see the bytecode by using the `-bl` argument:

``````\$ luajit -bl pi.lua
-- BYTECODE -- pi.lua:0-6
0001    KNUM     0   0      ; 10000000
0002    GSET     0   0      ; "r"
0003    GGET     0   0      ; "r"
0004    UNM      0   0
0005    GGET     1   0      ; "r"
0006    KSHORT   2   1
0007    FORI     0 => 0029
0008 => GGET     4   1      ; "n"
0009    IST          4
0010    JMP      5 => 0012
0011    KSHORT   4   0
0012 => GGET     5   2      ; "math"
0013    TGETS    5   5   3  ; "floor"
0014    GGET     6   2      ; "math"
0015    TGETS    6   6   4  ; "sqrt"
0016    GGET     7   0      ; "r"
0017    KSHORT   8   2
0018    POW      7   7   8
0019    KSHORT   8   2
0020    POW      8   3   8
0021    SUBVV    7   7   8
0022    CALL     6   0   2
0023    CALLM    5   2   0
0024    MULVN    5   5   1  ; 2
0025    ADDVV    4   4   5
0026    ADDVN    4   4   2  ; 1
0027    GSET     4   1      ; "n"
0028    FORL     0 => 0008
0029 => GGET     0   5      ; "print"
0030    GGET     1   1      ; "n"
0031    GGET     2   0      ; "r"
0032    KSHORT   3   2
0033    POW      2   2   3
0034    DIVVV    1   1   2
0035    CALL     0   1   2
0036    RET0     0   1
``````

Having said that bytecode is an array of instructions, the first column in the above (containing `0001` through `0036`) is the array index. There is actually also an instruction at array index 0 which `-bl` doesn't show us, which in this case is `0000 FUNCV 9`. The next column contains either ` ` or `=>` - it is blank for most instructions, and is `=>` for any instruction which is the target of a jump. The next column (`KNUM`, `GSET`, ..., `RET0`) contains the instruction name. The next few columns contain numbers, the meaning of which depends upon the instruction name. Finally, some instructions have a comment printed after the `;`.

We can run through the various different kinds of instruction in the above (in order of first appearance):

• `FUNCV` - Special instruction which marks the start of a vararg Lua function (Lua turns files into functions by wrapping them in `function(...)`...`end`, which is a vararg function with no fixed parameters). The sole number is the number of local variable slots required by the function (I'm calling them local variable slots, but they can also hold invisible variables and temporary values - I'll refrain from calling them registers, even though that is a common term for them).
• `KNUM` - Load a numeric constant from the array of constants into the array of local variables. The first number is the local variable index, the second number is the constant index. The comment is the value of the constant.
• `GSET` - Store a value into the table of globals. The first number is the index of the local variable to store, the second number is the index of the string constant containing the name of the global (numeric constants and string constants are in conceptually separate arrays, so when `KNUM` talks about constant #0 and `GSET` talks about constant #0, they are not talking about the same constant). The comment is the name of the global.
• `GGET` - Like `GSET`, but for loading out of the global table rather than storing into it. The first number is the index of the local variable to load into, the second number is the index of the string constant containing the name of the global. The comment is the name of the global.
• `UNM` - Perform unary negation (aka. unary minus). The second number is the index of some local variable `v`, and the first number is the local variable index at which to store `-v`.
• `KSHORT` - Like `KNUM`, but only suitable for small integer constants, and avoids an array lookup. The first number is a local variable index, and the second number is the numeric value to be stored in that local.
• `FORI` - Start of a numeric `for` loop. The second number (in this case `=> 0029`) is the instruction to jump to if the loop body doesn't execute even once. The first number refers to four consecutive local variables (that is, a value of `i` refers to `i`, `i+1`, `i+2`, and `i+3`). The first three local variables respectively initially contain the loop start value (`-r`), the loop end value (`r`), and the loop step size (`1`). As loop iterations are performed, the first local variable is incremented by the loop step value. At the start of every loop iteration, the first local variable is copied to the fourth local variable (Lua allows the programmer to assign to the loop iteration variable during a loop, but doing so doesn't affect the loop, so the first local variable is the real loop iteration variable, and the fourth local variable is a copy which can be assigned-to without affecting the loop).
• `IST` - Test whether a local variable is truthy (`nil` and `false` are falsey, all other values are truthy). The sole number is the index of the local variable to test. If the local variable is truthy, the next instruction is executed, otherwise the next instruction is skipped. The instruction is printed as `IST 4`, though it could equally be represented as `IST - 4` - the `4` is really the second argument, and the first argument is ignored (this makes `IST` more like the `ISTC` instruction, which does use its first argument).
• `JMP` - Causes control flow to jump to the second number (in this case `=> 0012`). The first number is the number of live local variables (or, equivalently, the index of the first dead local variable) - this number isn't used by LuaJIT's interpreter, but it is used as a hint by the JIT compiler.
• `TGETS` - Perform a table lookup using a string key. The second number is the index of some local variable `v` which should contain a table (or something with an `__index` metamethod), the third number is the index of some string constant `s`, and the first number is the index of the local variable into which `v[s]` should be stored. The comment is the string constant.
• `POW` - Perform exponentiation (raising some value to the power of another). All three numbers are local variable indices, which if we label them `a`, `b`, and `c` (in that order), then the instruction does `a = b ^ c` (invoking metamethods if either `b` or `c` aren't numbers).
• `SUBVV` - Like `POW`, but for subtraction rather than exponentiation. That is, it does `a = b - c`, where `a`, `b`, and `c` are the three local variable indices given as numbers after the instruction name (the `VV` suffix indicates that `b` and `c` are local variable indices; there is also `SUBVN` where `b` and `c` are a local variable index and a numeric constant index respectively, and `SUBNV` where `b` and `c` are a numeric constant index and a local variable index respectively).
• `CALL` - Call a function (or something with a `__call` metamethod) passing a fixed number of arguments. The first number is the index of some local variable `v` containing the function to call. The second number (say `b`) is the number of return values plus one, and the third number (say `c`) is the number of arguments to pass plus one. The arguments are in the local variable slots immediately after `v` (that is, `v+1`, `v+2`, ..., `v+c-1`), unless you're using the GC64 mode of LuaJIT (in which case they are in `v+2`, `v+3`, ..., `v+c`). The return values replace `v` and the slots after it (that is, `v`, `v+1`, ..., `v+b-2`). In our case of `CALL 6 0 2`, there are -1 return value(s) and 1 argument(s); the -1 indicates infinity (that is, keep all the return values of the call, don't throw any away, and don't substitute `nil` for any missing return values).
• `CALLM` - Call a function (or something with a `__call` metamethod) passing a potentially infinite number of arguments. The first two numbers are the same as for `CALL`: the index of some local variable `v` containing the function to call, and the number of return values plus one. The third number is added to the number of values generated by the previous instruction to give the number of arguments (`CALLM` must immediately follow something which returns an "infinite" number of values). The location of arguments and return values is the same as for `CALL`.
• `MULVN` - Perform multiplication by a numeric constant. The first two numbers, say `a` and `b`, are local variable indices. The third number, say `c`, is a numeric constant index. The instruction does `a = b * c`. The comment is the value of the numeric constant `c`.
• `ADDVV` - Like `SUBVV`, but for addition rather than subtraction. That is, all three numbers are local variable indices, and it does `a = b + c`.
• `ADDVN` - Like `ADDVV`, but where the third number is a numeric constant index rather than a local variable index. The comment is the value of the numeric constant.
• `FORL` - End of a numeric `for` loop. The second number (in this case `=> 0008`) is the instruction to jump to if the loop body is to be executed again. The first number is as for `FORI`.
• `DIVVV` - Like `SUBVV`, but for division rather than subtraction.
• `RET0` - Return to the calling function, returning zero values. The two numbers are always `0` and `1` (to make `RET0` look like other less-specialised `RET` instructions).

Phew. That gets us through the bytecode listing. There's a lot of it, some of it is executed just once, and other bits of it are executed many, many times. The LuaJIT interpreter doesn't care; it'll happily churn through bytecode all day long. At some point, though, the JIT part of LuaJIT comes into play. To explore this part, we'll switch from `-bl` (which dumps bytecode instead of running it) to `-jdump=bitmsrx` (which runs bytecode and also dumps various pieces of information as JIT compilation happens and JIT-compiled code executes):

``````\$ luajit -jdump=bitmsrx pi.lua
``````

The first thing to be output by `-jdump=bitmsrx` is the following:

``````---- TRACE 1 start pi.lua:2
0008  GGET     4   1      ; "n"
0009  IST          4
0010  JMP      5 => 0012
0012  GGET     5   2      ; "math"
0013  TGETS    5   5   3  ; "floor"
0014  GGET     6   2      ; "math"
0015  TGETS    6   6   4  ; "sqrt"
0016  GGET     7   0      ; "r"
0017  KSHORT   8   2
0018  POW      7   7   8
0019  KSHORT   8   2
0020  POW      8   3   8
0021  SUBVV    7   7   8
0022  CALL     6   0   2
0000  . FUNCC               ; math.sqrt
0023  CALLM    5   2   0
0000  . FUNCC               ; math.floor
0024  MULVN    5   5   1  ; 2
0026  ADDVN    4   4   2  ; 1
0027  GSET     4   1      ; "n"
0028  FORL     0 => 0008
``````

This tells us that trace #1 starts at line 2 of `pi.lua`, and is followed by the bytecode instructions which comprise the trace. This should look similar to the `-bl` dump from earlier, albeit with a few differences. We start at instruction index `0008` - the JIT compiler doesn't compile everything, only the things it thinks are worth compiling, and in this case it thinks that the loop body (which starts at instruction `0008`) is worth compiling. The column which contained either ` ` or `=>` is gone - traces are strictly linear sequences with no branches, and hence no jump targets within them, and hence no need for `=>` jump target indicators. Instruction `0011` isn't listed in the trace, as it isn't part of the trace - in the particular flow of execution recorded by the JIT compiler, the `0` branch of `(n or 0)` wasn't taken. The other major difference happens at function calls: when a call happens, tracing follows the call, and starts including bytecode instructions from the called function in the trace. The function known as `math.sqrt` consists of the single bytecode instruction `0000 FUNCC`, the effect of which is three-fold:

1. Start a C function (cf. the `FUNCV` instruction we saw earlier for starting a vararg Lua function).
2. Go off and run the C code associated with the function.
3. Return to the calling function (cf. the `RET0` instruction we saw earlier for returning from a Lua function).

Like `math.sqrt`, `math.floor` also consists of just a single bytecode instruction. In both cases, the bytecode instruction is included in the trace; the `. ` marker between the array index and the instruction name denotes a call frame level (`. . ` denotes two levels of call frame, etc.).

Actually, `-jdump=bitmsrx` is telling us a lie: `math.sqrt` does consist of just a single bytecode instruction, and that instruction does do steps 1 and 3 from the above list, but its step 2 is "Go off and run the assembly code for `math.sqrt`". This super-specialised bytecode instruction is only used by `math.sqrt`, and doesn't have a name per se, so reporting its name as `FUNCC` is perhaps not the worst lie in the world. Similarly, `math.floor` consists of a single super-specialised bytecode instruction (not all standard library functions follow this pattern - some are just plain old C functions - but most of the `math` library happens to be implemented in assembly rather than C).

We talk about bytecode instructions being included in a trace, but the bytecode isn't actually retained in the trace. Instead, as each bytecode instruction is executed, some so-called IR instructions are appended to the trace. After bytecode recording has finished for the trace, we get the next chunk of output from `-jdump=bitmsrx`, which is the full list of IR instructions:

``````---- TRACE 1 IR
....              SNAP   #0   [ ---- ]
0001 rbx   >  int SLOAD  #2    CRI
0002       >  int LE     0001  +2147483646
0003 rbp   >  int SLOAD  #1    CI
0004 rax      fun SLOAD  #0    R
0005 rax      tab FLOAD  0004  func.env
0006          int FLOAD  0005  tab.hmask
0007       >  int EQ     0006  +63
0008 r14      p32 FLOAD  0005  tab.node
0009 r12   >  p32 HREFK  0008  "n"  @19
0010       >  num HLOAD  0009
0011       >  p32 HREFK  0008  "math" @54
0012 r15   >  tab HLOAD  0011
0013          int FLOAD  0012  tab.hmask
0014       >  int EQ     0013  +31
0015 r13      p32 FLOAD  0012  tab.node
0016       >  p32 HREFK  0015  "floor" @14
0017       >  fun HLOAD  0016
0018       >  p32 HREFK  0015  "sqrt" @15
0019       >  fun HLOAD  0018
0020       >  p32 HREFK  0008  "r"  @12
0021 xmm0  >  num HLOAD  0020
0022 [8]      num MUL    0021  0021
0023 xmm1     num CONV   0003  num.int
0024 xmm1     num MUL    0023  0023
0025 xmm0     num SUB    0022  0024
0026       >  fun EQ     0019  math.sqrt
0027 xmm0     num FPMATH 0025  sqrt
0028       >  fun EQ     0017  math.floor
0029 xmm7     num FPMATH 0027  floor
0030 xmm7     num ADD    0029  0029
0031 xmm7     num ADD    0030  0010
0032 xmm7   + num ADD    0031  +1
0033          num HSTORE 0009  0032
0034 rbp    + int ADD    0003  +1
....              SNAP   #1   [ ---- ]
0035       >  int LE     0034  0001
....              SNAP   #2   [ ---- 0034 0001 ---- 0034 ]
0036 ------------ LOOP ------------
0037 xmm5     num CONV   0034  num.int
0038 xmm5     num MUL    0037  0037
0039 xmm6     num SUB    0022  0038
0040 xmm0     num FPMATH 0039  sqrt
0041 xmm6     num FPMATH 0040  floor
0042 xmm6     num ADD    0041  0041
0043 xmm7     num ADD    0042  0032
0044 xmm7   + num ADD    0043  +1
0045          num HSTORE 0009  0044
0046 rbp    + int ADD    0034  +1
....              SNAP   #3   [ ---- ]
0047       >  int LE     0046  0001
0048 rbp      int PHI    0034  0046
0049 xmm7     num PHI    0032  0044
``````

Ignoring the `.... SNAP` lines for a moment, the first column (containing `0001` through `0049`) is the index into the IR array. The next column contains either a register name (like `rbx` or `xmm7`) or a stack offset (like `[8]`) or is blank. This column isn't populated as the IR is created - instead it is populated as the IR is turned into machine code: if the result of the instruction gets assigned to a register or a stack slot, then that register or stack slot is recorded (some instructions don't have results, or don't have their result materialised, and so this column remains blank for them). The next column contains either `>` or ` `: the `>` symbol indicates that the instruction is a so-called guard instruction: these instructions need to be executed even if their result is otherwise unused, and these instructions are also able to "fail" (failure, if/when it happens, causes execution to leave the JIT-compiled code and return to the interpreter). The next column contains either `+` or ` `: the `+` symbol indicates that the instruction is used later on in a `PHI` instruction - as with the register/stack column, this column isn't populated as the IR is recorded, and instead it is populated by the LOOP optimisation pass (as it is this optimisation pass which emits `PHI` instructions). The next column contains the type of the IR instruction, which in our case is one of:

• `int` - 32-bit signed integer (arising either from a 32-bit field which the IR accesses, or from a numeric local variable which LuaJIT determined would be better manipulated as an integer than as a double-precision float).
• `fun` - Pointer to a GC-owned function (e.g. the pointer address you see when calling `tostring` on a function value).
• `tab` - Pointer to a GC-owned table (e.g. the pointer address you see when calling `tostring` on a table value).
• `p32` - Arbitrary 32-bit pointer (on 32-bit systems, this is a bog standard pointer, on 64-bit systems it is pointer to something in the low 4GB of address space).

The next columns contain the instruction name, and then the two (or occasionally one or three) operands to the instruction. We can run through the IR instructions in order of first appearance:

• `SLOAD` - Load a value from a slot. Most of the time, slots correspond to local variables, though not all of the time, and the numbering doesn't match up. Each Lua call frame uses a consecutive range of slots: the first few slots are used to store metadata about the call frame, and the remaining slots are used to store local variables. Bytecode instructions cannot refer to call frame metadata, so their local variable indices use `0` to refer to the first local variable slot. IR instructions can refer to call frame metadata, so slot `#0` refers to the first piece of frame metadata. For now, we'll say that there is only ever one slot of metadata (though continuation frames and GC64 mode both double the amount of metadata), meaning that slot `#1` refers to the first local variable slot, slot `#2` refers to the next local variable slot, and so on. The first operand is the slot number to load, and the second operand is a bitmask of the following:
• `P` - Coalesce with parent trace (our example doesn't have parent traces, so let's gloss over this).
• `F` - Load the "ft/sz" part of a frame metadata slot (our example doesn't use this, so let's gloss over it).
• `T` - The generated machine code needs to check that the value in the slot is of the expected type.
• `C` - Convert a value of type `number` (i.e. `double`) to an `int` as part of loading it.
• `R` - Read-only slot; no need to perform a deferred store on trace exit (we'll cover deferred stores later).
• `I` - Inherited by exits/side traces (our example doesn't have side traces, so let's gloss over this).
• `LE` - Test whether the first operand is less than or equal to the second operand. The instruction must have the same type as both of its operands, and the type determines the kind of less-than-or-equal comparison to perform. The instruction fails if the first operand is not less than or equal to the second. The operands are either IR instructions (expressed as four digits, e.g. `0001`) or constants (expressed with a leading `+` or `-`, e.g. `+2147483646`).
• `FLOAD` - Load a well-known field of a well-known structure type. The first operand is an IR instruction which gives a pointer to a well-known structure type, the second operand names the field and the structure. The type of the first operand must match the structure named in the second operand, and the type of the instruction must match the field named in the second operand. In our case, we see the following for the second operand:
• `func.env` - The pointer to the environment table (aka. global table) of the given function object.
• `tab.hmask` - The size of the hash part of the given table object (actually the mask, but size and mask are trivially related).
• `tab.node` - The pointer to the hash part of the given table object.
• `EQ` - Like `LE`, but for testing equality rather than less-than-or-equal. Additionally, constant operands can be pointers, and these pointers can be expressed in a variety of means. In our case, `math.sqrt` and `math.floor` both refer to pointers (the IR really does contain pointer values, and it is merely the pretty-printing done by `-jdump=bitmsrx` which gives friendly names to those pointer values).
• `HREFK` - Look up a constant key in the hash part of a table. The first operand must refer to an `FLOAD` of `tab.node`, the second operand is the constant key, and the third operand is the index within the table at which we expect to find the key (unlike other JIT compilers, LuaJIT doesn't do shape or hidden-type analysis, instead it has hash table index hints for constant keys). The instruction fails if the key is not present (or is present, but isn't at the expected index). The instruction type is always `p32` (or `p64` in GC64 mode), as the instruction gives a pointer into the hash table.
• `HLOAD` - Load a value from the hash part of a table. The sole operand must refer to an `HREF`, `HREFK`, or `NEWREF` instruction (one of these three instructions performs the key lookup, and then `HLOAD` performs the value load). The type of the instruction specifies the type of value we expect to load, and the instruction fails if the type of the value in the table is not this.
• `MUL` - Multiply two values. Like `LE`, the instruction must have the same type as both of its operands, and the type determines the kind of multiplication to perform. Again, like `LE`, the operands are either IR instructions (expressed as four digits, e.g. `0001`) or constants (expressed with a leading `+` or `-`, e.g. `+2147483646`). Of note is that the `POW` instructions in the bytecode have been turned into `MUL` instructions in the IR (as `v ^ 2` is the same as `v * v`, and the latter is much cheaper to compute). Also of note is that the `MULVN` instruction in the bytecode didn't turn into a `MUL` instruction in the IR (as `v * 2` is the same as `v + v`, and the latter is slightly cheaper to compute).
• `CONV` - Convert a value from one type to another. In our case, from an int (`int32_t`) to a num (`double`). This occurs because LuaJIT narrowed the loop iteration variable from a double to an int, and needs to convert it back again before performing arithmetic which might overflow.
• `SUB` - Like `MUL`, but performs subtraction rather than multiplication.
• `FPMATH` - Perform a well-known unary floating-point function. The first operand specifies the input to the function, and the second operand names the function. The first operand and the instruction itself must both have type `num`.
• `ADD` - Like `MUL`, but performs addition rather than multiplication.
• `HSTORE` - Store a value into the hash part of a table. The first operand must refer to an `HREF`, `HREFK`, or `NEWREF` instruction. The second operand specifies the value to be stored, and the type of the instruction must have the same type as the second operand.
• `LOOP` - Marker instruction inserted by the LOOP optimisation pass. The instructions prior to the `LOOP` instruction correspond to one iteration of the loop body. The instructions after the `LOOP` instruction also correspond to one iteration of the loop body. The difference is that every instruction after the `LOOP` instruction can assume that all of the instructions before the `LOOP` instruction have already successfully executed. Consequently, there are generally far fewer instructions after `LOOP` than before it. Loop-invariant expressions (like `math.floor`, `math.sqrt`, and `r^2` in our case) drop out of this optimisation: the expression is computed for the first time before the `LOOP` instruction, and then after the `LOOP` instruction the previously computed value can be re-used. Loop-invariant guard instructions (like `EQ` and `LE`) drop out similarly: provided the loop body doesn't affect the guard condition, the guard only needs to be checked before the `LOOP` instruction, and can be omitted afterwards.
• `PHI` - Pseudo-instruction inserted by the LOOP optimisation pass. Wherever the first operand appears after the `LOOP` instruction, it actually refers to a phi of the first and second operands. See Wikipedia's article on SSA for details of phi functions, but in short, the two operands of the `PHI` instruction need to be assigned to the same register in order to make the loop behave properly.

The effectiveness of the LOOP optimisation pass is really quite impressive: instructions `0001` through `0022` disappear completely, as do `0026` and `0028`. The bytecode performs seven table lookups per iteration (`n`, `math`, `floor`, `math`, `sqrt`, `r`, `n`). Common-subexpression-elimination and load-forwarding during the bytecode to IR conversion causes the IR before the `LOOP` instruction to contain just five `HREFK` instructions (i.e. `n` and `math` are looked up once each rather than twice each). Despite the table store to `n` within the loop, these five `HREFK` instructions are all determined to be loop-invariant (good alias analysis is to thank here). The `HLOAD` instructions for `math`, `floor`, `sqrt`, and `r` are also determined to be loop-invariant. The `HLOAD` for `n` isn't loop-invariant, but forwarding saves us, so there is no `HLOAD` for `n` after the `LOOP` instruction (that is, the reason for the `HLOAD` not being loop-invariant is the `HSTORE` within the loop body, but LuaJIT can forward the stored value rather than having to reload it). The `HSTORE` instruction for `n` is still done after the `LOOP` instruction: stores to local variables are deferred to trace exits, but stores to tables (including stores to global variables) are not deferred.

On that note, we can begin to consider the `.... SNAP` lines in the IR dump. Each of these lines corresponds to a so-called snapshot. A snapshot is used to transition from JIT-compiled code back to the interpreter, and consists of:

• The bytecode instruction at which the interpreter should resume.
• Deferred slot-stores which need to be performed (the IR has an `SLOAD` instruction, but no corresponding `SSTORE` instruction, and instead all slot-stores are deferred until trace exit).
• The slot index of the innermost call frame metadata (in our case this is always zero, but it can be non-zero when tracing follows a `CALL` instruction into a function and the called function contains some IR which can fail).

Sadly, for snapshots, `-jdump` doesn't report the bytecode instruction at which the interpreter will resume. It does report the deferred stores though: reading from left to right, each token between `[` and `]` represents a slot, with the leftmost token being slot #0. For example, `[ ---- 0034 0001 ---- 0034 ]` means that when exiting with this snapshot:

• Slot #0 (call frame metadata) has not been modified.
• Slot #1 (local variable 0) should be overwritten with the result of IR instruction `0034`.
• Slot #2 (local variable 1) should be overwritten with the result of IR instruction `0001`.
• Slot #3 (local variable 2) has not been modified.
• Slot #4 (local variable 3) should be overwritten with the result of IR instruction `0034`.

Call frames are denoted with `|` symbols in snapshots, but we'll gloss over this as it doesn't occur in our example. If an IR instruction can fail, then when it fails, the nearest preceding snapshot is used to return to the interpreter. In our example, this means instructions `0001` through `0034` use snapshot `#0`, `0035` uses `#1`, `0036` through `0046` use `#2`, and `0047` through `0049` use `#3`. It is worth dwelling on snapshots for a moment, and viewing them as a form of transactional rollback. For example, (in our case) if instruction `0001` fails, then snapshot `#0` is used. If instruction `0028` fails, then snapshot `#0` is still used, despite various table lookups, some arithmetic, and a call to `math.sqrt` all happening between instructions `0001` and `0028`. This means that if instruction `0028` fails, then after restoring the interpreter state, the interpreter will repeat the lookups, the arithmetic, and the `math.sqrt` call (presumably it wouldn't repeat the `math.floor` call, as a failure of instruction `0028` would mean that `math.floor` no longer corresponded to the builtin floor function).

With that, we can move on to the next chunk of output from `-jdump=bitmsrx`, which is the generated assembly (though LuaJIT actually generates machine code directly rather than generating assembly, so what is shown is really the output of `-jdump`'s own bundled disassembler):

``````---- TRACE 1 mcode 507
f125fdfa  mov dword [0x00043410], 0x1
f125fe05  movsd xmm7, [rdx+0x8]
f125fe0a  cvttsd2si ebx, xmm7
f125fe0e  xorps xmm6, xmm6
f125fe11  cvtsi2sd xmm6, ebx
f125fe15  ucomisd xmm7, xmm6
f125fe19  jnz 0xf1250010  ->0
f125fe1f  jpe 0xf1250010  ->0
f125fe25  cmp ebx, 0x7ffffffe
f125fe2b  jg 0xf1250010 ->0
f125fe31  movsd xmm7, [rdx]
f125fe35  cvttsd2si ebp, xmm7
f125fe39  xorps xmm6, xmm6
f125fe3c  cvtsi2sd xmm6, ebp
f125fe40  ucomisd xmm7, xmm6
f125fe44  jnz 0xf1250010  ->0
f125fe4a  jpe 0xf1250010  ->0
f125fe50  mov eax, [rdx-0x8]
f125fe53  mov eax, [rax+0x8]
f125fe56  cmp dword [rax+0x1c], +0x3f
f125fe5a  jnz 0xf1250010  ->0
f125fe60  mov r14d, [rax+0x14]
f125fe64  mov rdi, 0xfffffffb00054868
f125fe6e  cmp rdi, [r14+0x1d0]
f125fe75  jnz 0xf1250010  ->0
f125fe7b  lea r12d, [r14+0x1c8]
f125fe82  cmp dword [r12+0x4], 0xfffeffff
f125fe8b  jnb 0xf1250010  ->0
f125fe91  mov rdi, 0xfffffffb00048d48
f125fe9b  cmp rdi, [r14+0x518]
f125fea2  jnz 0xf1250010    ->0
f125fea8  cmp dword [r14+0x514], -0x0c
f125feb0  jnz 0xf1250010  ->0
f125feb6  mov r15d, [r14+0x510]
f125febd  cmp dword [r15+0x1c], +0x1f
f125fec2  jnz 0xf1250010  ->0
f125fec8  mov r13d, [r15+0x14]
f125fecc  mov rdi, 0xfffffffb00049150
f125fed6  cmp rdi, [r13+0x158]
f125fedd  jnz 0xf1250010  ->0
f125fee3  cmp dword [r13+0x154], -0x09
f125feeb  jnz 0xf1250010  ->0
f125fef1  mov rdi, 0xfffffffb000491e0
f125fefb  cmp rdi, [r13+0x170]
f125ff02  jnz 0xf1250010  ->0
f125ff08  cmp dword [r13+0x16c], -0x09
f125ff10  jnz 0xf1250010  ->0
f125ff16  mov rdi, 0xfffffffb0004ebc8
f125ff20  cmp rdi, [r14+0x128]
f125ff27  jnz 0xf1250010  ->0
f125ff2d  cmp dword [r14+0x124], 0xfffeffff
f125ff38  jnb 0xf1250010  ->0
f125ff3e  movsd xmm0, [r14+0x120]
f125ff47  mulsd xmm0, xmm0
f125ff4b  movsd [rsp+0x8], xmm0
f125ff51  xorps xmm1, xmm1
f125ff54  cvtsi2sd xmm1, ebp
f125ff58  mulsd xmm1, xmm1
f125ff5c  subsd xmm0, xmm1
f125ff60  cmp dword [r13+0x168], 0x000491b8
f125ff6b  jnz 0xf1250010  ->0
f125ff71  sqrtsd xmm0, xmm0
f125ff75  cmp dword [r13+0x150], 0x00049128
f125ff80  jnz 0xf1250010  ->0
f125ff86  roundsd xmm7, xmm0, 0x09
f125ff8c  addsd xmm7, xmm7
f125ff90  addsd xmm7, [r12]
f125ff96  addsd xmm7, [0x00064bc8]
f125ff9f  movsd [r12], xmm7
f125ffa8  cmp ebp, ebx
f125ffaa  jg 0xf1250014 ->1
->LOOP:
f125ffb0  movsd xmm0, [rsp+0x8]
f125ffb6  xorps xmm5, xmm5
f125ffb9  cvtsi2sd xmm5, ebp
f125ffbd  mulsd xmm5, xmm5
f125ffc1  movaps xmm6, xmm0
f125ffc4  subsd xmm6, xmm5
f125ffc8  sqrtsd xmm0, xmm6
f125ffcc  roundsd xmm6, xmm0, 0x09
f125ffd2  addsd xmm6, xmm6
f125ffd6  addsd xmm7, xmm6
f125ffda  addsd xmm7, [0x00064bc8]
f125ffe3  movsd [r12], xmm7
f125ffec  cmp ebp, ebx
f125ffee  jle 0xf125ffb0  ->LOOP
f125fff0  jmp 0xf125001c  ->3
``````

The first column gives the memory address of the instruction, and the remainder of the line gives the instruction in Intel-ish syntax (which happens to be a syntax I'm fond of; AT&T syntax needs to die). I'm not going to explain the semantics of each individual instruction (there are multi-thousand page Intel manuals for that), but there are a number of interesting things to point out:

• The first instruction, `mov dword [0x00043410], 0x1`, is writing its own trace number to the `global_State::vmstate` field.
• The next instruction, `movsd xmm7, [rdx+0x8]`, is loading slot #2 into `xmm7` (on trace entry, `rdx` points to the first local variable slot, and slots are 8 bytes wide). The next six instructions (`cvttsd2si`, `xorps`, `cvtsi2sd`, `ucomisd`, `jnz`, `jpe`) convert `xmm7` from a double to an int (in `ebx`), and then verify that the conversion was lossless. Note the `jpe`, which checks `xmm7` wasn't NaN, which is really a check that the local variable in slot #2 was of type `number` (LuaJIT represents all values of all other types in a way which looks like NaN). In aggregate, these seven instructions correspond to the single IR instruction `0001 rbx > int SLOAD #2 CRI` (the IR dump mentions `rbx`, but in this case it means the low 32 bits of `rbx`, i.e. `ebx`).
• Jumps which correspond to failed instructions, like `jnz 0xf1250010 ->0`, list two things: the absolute address of a so-called exit stub (`f1250010` in this case), and the snapshot number which the stub corresponds to (`0` in this case). An exit stub is a small piece of code which effectively does `push <snapshot-number>; jmp lj_vm_exit_handler`. In turn, `lj_vm_exit_handler` is a piece of assembly code which captures the contents of every register, calls `lj_trace_exit`, and then resumes the interpreter. In turn, `lj_trace_exit` is a piece of C code which takes the snapshot number and the current trace number and the captured register state, and uses all of this to perform the deferred slot stores specified by the snapshot (along with setting the interpreter's instruction pointer to the resumption point specified in the snapshot, and a bunch of other things, but those are details for another time).
• `mov eax, [rdx-0x8]` is performing `0004 rax fun SLOAD #0 R` - this `SLOAD` has no type conversion, and also cannot fail, so it turns into just one assembly instruction. The next instruction, `mov eax, [rax+0x8]`, is performing `0005 rax tab FLOAD 0004 func.env` (i.e. the `env` field is at offset `8` of the `func` struct).
• `cmp dword [rax+0x1c], +0x3f` corresponds to one and a half IR instructions: it is all of `0006 int FLOAD 0005 tab.hmask`, and also the first half of `0007 > int EQ 0006 +63` (`0x3f` is `63` in hex). We say that the memory load has been fused into the comparison operand.
• The four-instruction sequence `mov rdi, 0xfffffffb00054868`, `cmp rdi, [r14+0x1d0]`, `jnz 0xf1250010 ->0`, `lea r12d, [r14+0x1c8]` corresponds to the single IR instruction `0009 r12 > p32 HREFK 0008 "n" @19`: each key-value pair in a hash table takes up 24 bytes, and 19 times 24 is `0x1c8` (plus 8 is `0x1d0`; the value is at 0 mod 24, the key is at 8 mod 24, and a chain pointer is at 16 mod 24). The constant `0xfffffffb00054868` is really two 32-bit constants in one: `fffffffb` is the type code for strings, and `00054868` is the interned address of the string `"n"` (LuaJIT interns all strings, so string equality is pointer equality).
• The two-instruction sequence `cmp dword [r12+0x4], 0xfffeffff`, `jnb 0xf1250010 ->0` is the type-test from `0010 > num HLOAD 0009`, that is, it is checking that the type of the value in the key-value pair is a number. Note that the "actual" load of the value from the key-value pair is done much later as part of `addsd xmm7, [r12]`.
• `jg 0xf1250014 ->1` is the first instruction to refer to exit stub 1, from which we can infer that each exit stub is a mere four bytes long (as `->0` was at `0xf1250010`). How you fit `push <snapshot-number>; jmp lj_vm_exit_handler` into four bytes is left as an exercise to the reader (look at `asm_exitstub_gen` in `lj_asm_x86.h` to determine whether your answer is correct).
• The lookup and call to `math.sqrt`, which is Lord knows how many CPU instructions when executed by the interpreter, becomes the single instruction `sqrtsd xmm0, xmm6`.
• The lookup and call to `math.floor` has become `roundsd xmm6, xmm0, 0x09`. This instruction is only available on CPUs which support SSE 4.1 - had the example been run on a machine which didn't support SSE 4.1, then LuaJIT would have emitted something else (namely, a call to `lj_vm_floor_sse`, which is an assembly function with a slightly special calling convention).
• `addsd xmm7, [0x00064bc8]` corresponds to either `0032 xmm7 + num ADD 0031 +1` or `0044 xmm7 + num ADD 0043 +1`: there is no encoding of `addsd` which accepts an immediate, so LuaJIT has put `(double)1.0` at memory address `0x00064bc8`.

The next piece of output from `-jdump=bitmsrx` tells us that recording (and IR optimisation and machine code generation) of the trace has finished, and that the trace is a looping trace:

``````---- TRACE 1 stop -> loop
``````

The final piece of output from `-jdump=bitmsrx` tells us that execution of the trace finished and snapshot #3 was used to restore the interpreter state (noting that `-jdump=bitmsrx` doesn't tell us when execution of a trace starts):

``````---- TRACE 1 exit 3
``````

Finally, we get our estimate of π courtesy of the interpreter executing (the bytecode corresponding to) `print(n / r^2)`:

``````3.1415926535059
``````