Last time, we had a need to compute `mod_P(A * x32)` where `A` was a polynomial with degree at most 31, and `mod_P` took the remainder after dividing by a polynomial `P` with degree 32. The traditional Barrett reduction approach involved finding the quotient and then calculating the remainder from that:

``````Q = ((A * x32) * (x63 // P)) // x63
R =  (A * x32) - Q * P
return R``````

Via a combination of the `* x32`, and the degree of `P` being 32, and the context being Z2 polynomials, this simplified to:

``````Q = ((A * x32) * (x63 // P)) // x63
return (Q * P)[0:32]``````

The resultant C code (for bit-reversed polynomials) was very short:

``````uint32_t mul_x32_mod_P(uint32_t A) {
__m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
__m128i a = _mm_cvtsi32_si128(A);
__m128i b = _mm_clmulepi64_si128(a, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);
}
``````

Though our particular case permitted this simplification, it is not permitted in general. In Faster Remainder by Direct Computation (D. Lemire, O. Kaser, and N. Kurz, 2018), a different approach was presented: the remainder can be computed directly, rather than computing the quotient and getting the remainder from that. Said paper is about integers, and its main Theorem 1 is effectively:

Provided that `d in [1, 2N)` and `n in [0, 2N)` and `F ≥ N + log2(d)`, then:
`n % d` equals `(((n * ceil(2F / d)) % 2F) * d) // 2F`.

Translated from the language of integers to the language of polynomials, this theorem would be:

Provided that `S`, `P`, `T` are polynomials satisfying `deg((S * P) // T) ≤ 0`, then:
`S % P` equals `(((S * (T // P)) % T) * P) // T`.

This lets us turn `% P` into multiplication by `T // P` followed by `% T` followed by `* P` followed by `// T`. The `% T` and `// T` steps are trivial if `T` is chosen to be `xn` (with `n ≥ deg(S) + deg(P)` to make the precondition hold), and `T // P` can be precomputed if `P` is fixed.

It turns out that the translated theorem is true, and that the proof isn't too hard. The proof builds upon the Barrett reduction property, which is:

Provided that `deg(S // T) ≤ 0`, `S // P` equals `(S * (T // P)) // T`.

``````(((S * (T // P)) % T) * P) // T
``````

We're going to be overwhelmed by parentheses if we're not careful, so I'm going to say that a sequence of mixed `*` and `%` and `//` operations always happens left-to-right, thus allowing some of the parentheses to be dropped:

``````S * (T // P) % T * P // T
``````

From here, the `%` can be expanded: (`F % G` equals `F - (F // G * G)`)

``````((S * (T // P)) - (S * (T // P) // T * T)) * P // T
``````

Then push the `-` to the top level:

``````(S * (T // P) * P // T) - (S * (T // P) // T * T * P // T)
``````

Making use of our `deg((S * P) // T) ≤ 0` precondition, by the Barrett reduction property, `(S * P) // P` equals `(S * P) * (T // P) // T`:

``````(S * P // P) - (S * (T // P) // T * T * P // T)
``````

We can then simplify away `* G // G` for any `G`:

``````S - (S * (T // P) // T * P)
``````

Our `deg((S * P) // T) ≤ 0` precondition implies `deg(S // T) ≤ 0`, so by the Barrett reduction property, `S // P` equals `S * (T // P) // T`:

``````S - (S // P * P)
``````

Which is the definition of:

``````S % P
``````

Thus our supposed equality is an actual equality.

A recent post presented a faster way of computing CRC32C on x86. Part of that was `crc32_4k_pclmulqdq`, based on Intel's whitepaper and Chromium's implementation thereof. The Intel whitepaper is the traditional exposition for this `crc32_4k_pclmulqdq` function, but I don't like the exposition therein, as I think it spends too much time on some obvious things, and not nearly enough time on some non-obvious things. I present an alternative exposition here.

We begin by thinking about CPU-native data types. The types `u32` and `i32` should be familiar to most programmers, being 32-bit unsigned integers and 32-bit two's complement signed integers respectively. Two-operand addition can be defined for both of them; `u32 + u32 ↦ u32` and `i32 + i32 ↦ i32`. The neat thing is that the hardware circuit for both types of addition is the same, and is what we call the CPU `add` instruction. On 64-bit CPUs, these 32-bit types live in the least significant bits of 64-bit registers. I'll use `u32_z` and `i32_z` for such things, where the `_z` suffix denotes zero-extension of the 32-bit value to the 64-bit storage. Note that value-preserving conversion from `u32_z` to `u64` is a no-op, whereas value-preserving conversion from `i32_z` to `i64` requires sign extension (`movsx` instruction on x86, `sxtw` instruction on ARM).

Some less common CPU-native data types are `p32` and `r32`: polynomials over Z2 of degree at most 31 (`p32`), and bit-reversed versions thereof (`r32`). As with `u` and `i` types, these `p` and `r` types exist at a variety of bit-widths, and (at times) live in the least significant bits of wider registers. The `rbit` instruction on ARM is a value-preserving conversion `p32 ⇆ r32` (x86 has no such thing).

The ARM `crc32cb` instruction, and the matching overload of the x86 `crc` instruction, has type signature `r32, r8 ↦ r32_z` and implementation:

``````def crc32cb(acc, val):
val = val * x32
acc = acc * x8 + val
P = x32 + x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 +
x18 + x14 + x13 + x11 + x10 + x9  + x8  + x6 + 1
q, r = floor_divide(acc, P)
return r``````

To avoid some future repitition, we can pull out the tail of `crc32cb` and call it `mod_P` (with type signature `rN ↦ r32` for any `N`):

``````def crc32cb(acc, val):
val = val * x32
acc = acc * x8 + val
return mod_P(acc)
def mod_P(acc):
P = x32 + x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 +
x18 + x14 + x13 + x11 + x10 + x9  + x8  + x6 + 1
q, r = floor_divide(acc, P)
return r``````

The ARM `crc32ch` instruction, and the matching overload of the x86 `crc` instruction, has type signature `r32, r16 ↦ r32_z` and implementation:

``````def crc32ch(acc, val):
q, r = floor_divide(val, x8)
acc = crc32cb(acc, q)
acc = crc32cb(acc, r)
return acc``````

A totally equivalent implementation is the following, which is identical to the `crc32cb` implementation except for one number:

``````def crc32ch(acc, val):
val = val * x32
acc = acc * x16 + val # The x8 on this line has changed to x16
return mod_P(acc)``````

The `crc32cw` (`r32, r32 ↦ r32_z`) and `crc32cx` (`r32, r64 ↦ r32_z`) instructions can be similarly implemented either as two calls to a lower-width implementation, or as `crc32cb` with a different scaling factor when merging `acc` with `val`. In all cases, the equivalence of the two implementations arises due to the nature of `mod_P`: so long as `mod_P` is done once at the end, it can be done any number of times at any point in the preceding computation. It is useful to do so regularly in order to bound the size of `acc`, but it doesn't need to be done. In the extreme, if `acc` could grow without bound, a valid multi-byte implementation `r32, r8[] ↦ r32_z` would be:

``````def crc32c_bytes(acc, bytes):
for val in bytes:
acc = acc * x8 + val * x32
return mod_P(acc)``````

Something similar operating 64-bits at a time rather than 8-bits at a time (`r32, r64[] ↦ r32_z`) would be:

``````def crc32c_qwords(acc, qwords):
for val in qwords:
acc = acc * x64 + val * x32
return mod_P(acc)``````

At this width, the math can be fiddled around with slightly to do `* x32` once at the end of the loop rather than in every loop iteration:

``````def crc32c_qwords(acc, qwords):
acc = acc * x32 + qwords[0]
for i in range(1, len(qwords)):
acc = acc * x64 + qwords[i]
acc = acc * x32
return mod_P(acc)``````

In preparation for SIMD, we can go wider still, and operate on 128 bits at a time (`r32, r128[] ↦ r32_z`):

``````def crc32c_xwords(acc, xwords):
acc = acc * x96 + xwords[0]
for i in range(1, len(xwords)):
acc = acc * x128 + xwords[i]
acc = acc * x32
return mod_P(acc)``````

Furthermore, multiple accumulators could be used (this being a standard latency-hiding trick):

``````def crc32c_xwords(acc, xwords):
acc0 = xwords[0] + acc * x96
acc1 = xwords[1]
for i in range(2, len(xwords), 2):
acc0 = acc0 * x256 + xwords[i + 0]
acc1 = acc1 * x256 + xwords[i + 1]
acc = acc0 * x128 + acc1
acc = acc * x32
return mod_P(acc)``````

This is starting to look very similar to `crc32_4k_pclmulqdq`, albeit we've got a way to go still: having even more accumulators, implementing polynomial `+` and `*`, preventing the accumulators from getting too big, and implementing the final `mod_P`.

To prevent the accumulators from getting too big, we can insert some extra `mod_P` calls (remembering that these can be freely inserted at any point in the computation):

``````def crc32c_xwords(acc, xwords):
acc0 = xwords[0] + acc * x96
acc1 = xwords[1]
for i in range(2, len(xwords), 2):
acc0 = mod_P(acc0 * x256)
acc1 = mod_P(acc1 * x256)
acc0 += xwords[i + 0]
acc1 += xwords[i + 1]
acc = acc0 * x128 + acc1
acc = acc * x32
return mod_P(acc)``````

This solves the accumulator problem, albeit at the cost of introducing the problem of implementing the extra `mod_P` calls. That problem is slightly fiddly, so we'll come back to it later, and instead address polynomial `+` and `*` next.

For small `N`, every CPU implements Z2 polynomial `+` with signature `pN_z + pN_z ↦ pN_z` — it is the `xor` instruction on x86, and the `eor` instruction on ARM. Every CPU also implements Z2 polynomial `+` with signature `rN_z + rN_z ↦ rN_z` — it is also the `xor` instruction (in the same way that the `add` instruction is both `u32` and `i32`). Modern CPUs implement Z2 polynomial `*` under the name of `pclmulqdq` (x86) or `pmull` / `pmull2` (ARM). The kernel of this instruction is normally seen as `p64 * p64 ↦ p127_z`, though it can equally be seen as `r64 * r64 ↦ r127_z`. Value-preserving conversion from `p127_z` to `p128` is a no-op (like going from `u32_z` to `u64`), whereas value-preserving conversion from `r127_z` to `r128` requires a left-shift by one (like going from `i32_z` to `i64` requires a sign extension). When operating on `r128`, a right-shift by one is multiplication by x1. Doing both shifts in sequence results in no shift at all, and just a multiplication by x1. For `N` ≤ 64 and `M` ≤ 64, the kernel can alternatively be seen as `pN_z * pM_z ↦ p(N+M-1)_z` or as `rN_z * rM_z ↦ r(N+M-1)_z`. As the output of the kernel can have up to 127 bits, it goes into a 128-bit SIMD register. This causes the two inputs to the kernel to also come from 128-bit SIMD registers, and thus causes the type signature of the overall instruction to be `p128, p128 ↦ p127_z` or `r128, r128 ↦ r127_z`. To get the inputs down from `p128` to `p64` (or from `r128` to `r64`), `floor_divide` by x64 is done, and then either the quotient or the remainder is used as the 64-bit input to the kernel. On x86, an immediate argument to `pclmulqdq` controls whether to take the quotient or the remainder of each input. On ARM, there is slightly less flexibility: either both inputs have their remainder taken or both have their quotient taken. In practice, the decreased flexibility is rarely a problem.

With this knowledge of how to do polynomial `+` and `*`, we can return to the question of how to do `acc0 = mod_P(acc0 * x256)`. The type of `acc0` is going to be `r128` in order to match the SIMD register width, and a `floor_divide` by x64 can split it into two `r64` pieces:

``````q, r = floor_divide(acc0, x64)
assert acc0 == q * x64 + r``````

We can then do some arbitrary-looking algebraic manipulations of the expression we want:

``````   mod_P(acc0 * x256)
== mod_P((q * x64 + r) * x256)
== mod_P(q * x256+64 + r * x256)
== mod_P(q * x256+64-33 * x33 + r * x256-33 * x33)``````

The next trick is where the magic happens: because of the `mod_P` call at the end of `crc32c_xwords`, we're free to insert to remove `mod_P` calls anywhere else. We previously used this freedom to insert some extra `mod_P` calls, and we're going to use it again now to remove those `mod_P` calls and insert new ones elsewhere:

``````   mod_P(q * x256+64-33  * x33 + r *       x256-33  * x33)
-> q * mod_P(x256+64-33) * x33 + r * mod_P(x256-33) * x33``````

Note that these two expressions give different values (hence `->` rather than `==`), but this difference evaporates away after the `mod_P` call at the end of `crc32c_xwords`. This new expression is very convenient to compute: `q` has type `r64_z`, and `mod_P(x256+64-33)` is a constant with type `r32_z`, so their product has type `r95_z`. A value-preserving conversion from `r95_z` to `r128` involves a left-shift by 33 bits, and then `* x33` is a right-shift by 33 bits. These two shifts cancel each other out, meaning that we effectively get the `* x33` for free. In other words, a single CPU instruction can obtain `q` from `acc0` and compute `q * mod_P(x256+64-33) * x33`. Similarly, a single CPU instruction can obtain `r` from `acc0` and compute `r * mod_P(x256-33) * x33`. Adding the two terms is another CPU instruction - `pxor` on x86, or `eor` on ARM (see also `eor3` on ARM, and also note that the M1 can fuse `pmull` with `eor`, albeit the fused operation still takes two retire slots). We can wrap this expression up in a function, call it `mul_xN_mod_P`, and make use of it in `crc32c_xwords`:

``````def mul_xN_mod_P(acc, N): # r128, int ↦ r128
q, r = floor_divide(acc, x64)
return q * mod_P(xN+64-33) * x33 + r * mod_P(xN-33) * x33
def crc32c_xwords(acc, xwords):
acc0 = xwords[0] + acc * x96
acc1 = xwords[1]
for i in range(2, len(xwords), 2):
acc0 = mul_xN_mod_P(acc0, 256)
acc1 = mul_xN_mod_P(acc1, 256)
acc0 += xwords[i + 0]
acc1 += xwords[i + 1]
acc = acc0 * x128 + acc1
acc = acc * x32
return mod_P(acc)``````

This `mul_xN_mod_P` is also useful for doing `acc = acc0 * x128 + acc1` at the end:

``````def crc32c_xwords(acc, xwords):
acc0 = xwords[0] + acc * x96
acc1 = xwords[1]
for i in range(2, len(xwords), 2):
acc0 = mul_xN_mod_P(acc0, 256)
acc1 = mul_xN_mod_P(acc1, 256)
acc0 += xwords[i + 0]
acc1 += xwords[i + 1]
acc = mul_xN_mod_P(acc0, 128) + acc1
return mod_P(acc * x32)``````

At this point, only three problems remain: having even more accumulators, precomputing the various `mod_P(xM)` constants, and computing the final `mod_P(acc * x32)`.

The optimal number of accumulators depends on the throughput and latency of the CPU's polynomial `+` and `*` operations. `+` is typically latency 1 and throughput better than or equal to the throughput of `*`. On recent Intel chips, `*` (i.e. `pclmulqdq`) has a latency of between 5 and 7 cycles, and a throughput of 1 per cycle. On the M1, `*` (i.e. `pmull` / `pmull2`) has a latency of 3 cycles, and a throughput of 4 per cycle (these numbers are incredible compared to Intel's). We need two `*` and one `+` for each `mul_xN_mod_P` call. This makes 4 (or possibly 5) accumulators optimal on Intel, and 12 accumulators optimal on M1. A version with four accumulators looks like:

``````def crc32c_xwords(acc, xwords):
acc0 = xwords[0] + acc * x96
acc1 = xwords[1]
acc2 = xwords[2]
acc3 = xwords[3]
for i in range(4, len(xwords), 4):
acc0 = mul_xN_mod_P(acc0, 128*4)
acc1 = mul_xN_mod_P(acc1, 128*4)
acc2 = mul_xN_mod_P(acc2, 128*4)
acc3 = mul_xN_mod_P(acc3, 128*4)
acc0 += xwords[i + 0]
acc1 += xwords[i + 1]
acc2 += xwords[i + 2]
acc3 += xwords[i + 3]
acc0 = mul_xN_mod_P(acc0, 128) + acc1
acc2 = mul_xN_mod_P(acc2, 128) + acc3
acc = mul_xN_mod_P(acc0, 128*2) + acc2
return mod_P(acc * x32)``````

If we have a hardware `crc32c` instruction, then the remaining two problems are easy. `mod_P(acc * x32)` is just `crc32c(0, acc)`. In our case, `acc` is `r128`, and the widest hardware `crc32c` instruction only takes `r64`, but following the pattern from the start of this post, a wide CRC can be constructed from two narrow CRCs:

``````def crc32c_128(acc, val): # r32, r128 ↦ r32_z
q, r = floor_divide(val, x64)
acc = crc32c_64(acc, q)
acc = crc32c_64(acc, r)
return acc``````

Meanwhile, `mod_P(xM)` is just `acc = xM % 32` followed by `M//32` iterations of `acc = mod_P(acc * x32)`, and the latter is just a CRC instruction. In concrete C code, this is just:

``````uint32_t magic(uint64_t M) { // int ↦ r32_z
uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0);
return acc;
}
``````

For bonus points, note that `magic` can be implemented to run in `O(log(M))` time rather than `O(M)` time:

``````uint32_t magic(uint64_t M) {
uint64_t stack = ~(uint64_t)1;
for (; M > 191; M = (M >> 1) - 16) {
stack = (stack << 1) + (M & 1);
}
stack = ~stack;
uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0);
for (;;) {
uint32_t low = stack & 1;
if (!(stack >>= 1)) return acc;
__m128i x = _mm_cvtsi32_si128(acc);
uint64_t y = _mm_cvtsi128_si64(_mm_clmulepi64_si128(x, x, 0));
acc = _mm_crc32_u64(0, y << low);
}
}
``````

For more bonus points, the ARM translation of this is:

``````uint32_t magic(uint64_t M) {
uint64_t stack = ~(uint64_t)1;
for (; M > 191; M = (M >> 1) - 16) {
stack = (stack << 1) + (M & 1);
}
stack = ~stack;
uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
for (M >>= 5; M; --M) acc = __crc32cw(acc, 0);
for (;;) {
uint32_t low = stack & 1;
if (!(stack >>= 1)) return acc;
poly8x8_t x = vreinterpret_p8_u64(vmov_n_u64(acc));
uint64_t y = vgetq_lane_u64(vreinterpretq_u64_p16(vmull_p8(x, x)), 0);
acc = __crc32cd(0, y << low);
}
}
``````

If we do not have a hardware `crc32c` instruction, or we want to change `P` to something which lacks a hardware instruction, then things are harder, but not insurmountable. All we have to do is build the equivalent of the 32-bit CRC instruction:

``````def crc32cw(acc, val): # r32, r32 ↦ r32_z
acc = (acc + val) * x32
return mod_P(acc)``````

Expanding the `mod_P` gets us to:

``````def crc32cw(acc, val): # r32, r32 ↦ r32_z
acc = (acc + val) * x32
P = x32 + x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 +
x18 + x14 + x13 + x11 + x10 + x9  + x8  + x6 + 1
q, r = floor_divide(acc, P)
return r``````

We can compute `q` by Barrett reduction, and then use that to compute `r`:

``````def crc32cw(acc, val): # r32, r32 ↦ r32_z
acc = (acc + val) * x32
P = x32 + x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 +
x18 + x14 + x13 + x11 + x10 + x9  + x8  + x6 + 1
q = (acc * (x63 // P)) // x63                # Anything ≥ 63 works
r = acc - q * P
return r``````

As we're operating in Z2, the `-` in `r = acc - q * P` can be replaced by `+`. As `deg(r)` < 32 and `deg(acc)` ≥ 32, the `acc` in `r = acc - q * P` can be replaced with term-slicing. After doing both these things, we're left with:

``````def crc32cw(acc, val): # r32, r32 ↦ r32_z
acc = (acc + val) * x32
P = x32 + x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 +
x18 + x14 + x13 + x11 + x10 + x9  + x8  + x6 + 1
q = (acc * (x63 // P)) // x63                # Anything ≥ 63 works
return (q * P)[0:32]``````

As concrete C code, this is:

``````uint32_t crc32cw(uint32_t acc, uint32_t val) {
__m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
__m128i a = _mm_cvtsi32_si128(acc ^ val);
__m128i b = _mm_clmulepi64_si128(a, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);
}
``````

Where the magic numbers in `k` come from:

``````def floor_divide(S, P):
Q = 0
while S:
d = S.bit_length() - P.bit_length()
if d < 0:
break
Q ^= 1 << d
S ^= P << d
return Q, S

def rev32(F):
G = 0
for _ in range(32):
G = (G << 1) + (F & 1)
F >>= 1
assert F == 0
return G

P = 0x11EDC6F41
Q, _ = floor_divide(1 << 63, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)
``````

We can then build a 64-bit CRC function as two calls to `crc32cw`, or directly:

``````uint32_t crc32cx(uint32_t acc, uint64_t val) {
__m128i k = _mm_setr_epi32(0xdea713f1, 0x4869ec38, 0x05ec76f1, 1);
__m128i a = _mm_cvtsi64_si128(acc ^ val);
__m128i b = _mm_clmulepi64_si128(a, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);
}
``````

Where the magic numbers in `k` come from:

``````P = 0x11EDC6F41
Q, _ = floor_divide(1 << 95, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)
``````

For bonus points, the zlib CRC equivalent (`P = 0x104C11DB7`) is the following:

``````uint32_t crc32x(uint32_t acc, uint64_t val) {
__m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
__m128i a = _mm_cvtsi64_si128(acc ^ val);
__m128i b = _mm_clmulepi64_si128(a, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);
}
``````

This can be plugged into Chromium's implementation to give a more succinct and slightly faster tail; everything under "Fold 128-bits to 64-bits" and "Barret reduce to 32-bits" and "Return the crc32" can be replaced by:

``````    __m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
__m128i b = _mm_clmulepi64_si128(x1, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
c = _mm_blend_epi16(c, _mm_setzero_si128(), 0xcf); // or: c = _mm_and_si128(c, _mm_setr_epi32(0, 0, ~0, 0));
__m128i a = _mm_xor_si128(c, x1);
b = _mm_clmulepi64_si128(a, k, 0x01);
c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);
``````

Let `F` denote some field (e.g. the real numbers, or integers modulo some prime), and let `F[x]` denote the ring of polynomials over that field, and let `P`, `S`, `T` denote elements of that polynomial ring. Polynomial division can then be defined; division of one polynomial by another yields a (unique) quotient and a (unique) remainder. Following the Python 3 notation where `//` denotes floor division, let `S // P` denote the quotient obtained from dividing `S` by `P`. The key property of Barrett reduction is then:

Provided that `deg(S // T) ≤ 0`, `S // P` equals `(S * (T // P)) // T`.

This lets us turn division by `P` into multiplication by `T // P` followed by division by `T`. Said division is trivial if `T` is chosen to be `xn` (with `n ≥ deg(S)` to make the precondition hold), and `T // P` can be precomputed if `P` is fixed.

Proving that this property is true turns out to be surprisingly elusive; all the good expositions on the internet are about Barrett reduction for integers rather than for polynomials. Accordingly, I'm providing a proof in this post. The proof proceeds in four parts:

1. Defining `/` (true division rather than floor division) for (extended) polynomials, and defining term-slicing `P[j:k]` for (extended) polynomials.
2. Showing that `S // P` equals `(S * (T / P)) // T`.
3. Showing that `(S*(T/P))//T` equals `(S*(T//P))//T` plus `(S*(T/P)[-∞:0])//T`.
4. Showing that `(S*(T/P)[-∞:0])//T` equals zero. This final part is the only part which requires the precondition of `deg(S // T) ≤ 0`.

These four parts combine in the obvious way to prove the overall reduction property.

Part 1: Definitions

`P` being an element of the polynomial ring `F[x]` just means that `P` equals `sum(pi * xi)`, where `i` runs over some finite set of non-negative integers, and each `pi` is a non-zero element of `F`. Note that `deg(P)` denotes the maximum such `i` from this set (or -∞ if this set is empty).

Extended polynomials can be defined similarly; `P` equals `sum(pi * xi)`, where `i` runs over some (possibly infinite) set of (possibly negative) integers, and each `pi` is a non-zero element of `F`. The requirement on `pi` being non-zero can also be dropped, provided that `deg(P)` is redefined as the maximum `i` such that `pi` is non-zero. As described, `deg(P)` could be +∞, but we'll restrict ourselves to `deg(P)` being finite (or being -∞ for the zero polynomial).

Every polynomial is trivially an extended polynomial. Going the other way is harder; an extended polynomial can only be trivially turned into a polynomial when it has a finite number of terms and has no `xi` terms with negative `i`.

It is also useful to define term-slicing; given `P` and integer `j` and integer `k`, `P[j:k]` denotes `sum(pi * xi)` restricted to just `j ≤ i < k`. As a shorthand, `P[j]` denotes `P[j:j+1]`. It is trivially true that `P` equals `P[-∞:j]` plus `P[j:+∞]`.

Addition and multiplication of extended polynomials follows that of normal polynomials. Division is where things get interesting; the whole point of introducing extended polynomials is to allow the definition of true division (`/`) for extended polynomials, which is just like floor division (`//`) for normal polynomials, except that it produces a quotient with an infinite number of (negative `i`) terms (and no remainder), rather than terminating with a finite quotient and a remainder. For comparison, the division algorithms are:

``````def floor_division(S, P):
Q = 0
while S != 0 and deg(S) >= deg(P):
T = S[deg(S)] / P[deg(P)]
Q = Q + T
S = S - T * P
return Q, S
``````
``````def true_division(S, P):
Q = 0
while S != 0:
T = S[deg(S)] / P[deg(P)]
Q = Q + T
S = S - T * P
return Q
``````

Note that as written `true_division` often never terminates. For actual computation, it needs to be re-expressed as a generator which yields terms of `Q` in descending `i` order (and then only a finite number of terms used). Both algorithms involve division of monomonials, which is just division of the field elements and subtraction of the `x` powers.

Given the similarity in algorithms, it is clear that `S // P` equals `(S / P)[0:+∞]`. Various other properties hold; most notably `(S / P) // T` equals `S // (P * T)`.

One benefit of true division over floor division is that true division is associative with multiplication; that is `(S / P) * T` equals `S / (P * T)`. Accordingly, `S / P * T` can expressed without any parentheses. Additionally, it commutes with multiplication, so `S / P * T` equals `S * T / P`, and it commutes with itself in so far as `S / P / T` equals `S / T / P`.

Part 2: `S // P` equals `(S * (T / P)) // T`

This is just algebraic manipulation:

``````S // P == (S      / P    ) // 1
== (S *  T / P / T) // 1
== (S *  T / P    ) // T
== (S * (T / P)   ) // T
``````

Part 3: `(S*(T/P))//T` equals `(S*(T//P))//T` plus `(S*(T/P)[-∞:0])//T`

This involves term-slicing on `T / P`, splitting it into `(T / P)[-∞:0]` plus `(T / P)[0:+∞]`, then noting that the latter is `T // P`:

``````(S*(T/P))//T == ( S*((T /P)[0:+∞] +    (T/P)[-∞:0]) )//T
== ( S*((T//P)       +    (T/P)[-∞:0]) )//T
==  (S* (T//P))//T   + (S*(T/P)[-∞:0] )//T
``````

Part 4: `(S*(T/P)[-∞:0])//T` equals zero

``````(S*(T/P)[-∞:0])//T == ( S    * (T/P)[-∞:0]  )//T
== ( S    * (T/P)[-∞:0]/T)//1
== ((S/T) * (T/P)[-∞:0]  )//1
== ((S/T) * (T/P)[-∞:0]  )[0:+∞]
``````

As `deg(S // T) ≤ 0` (by hypothesis), it is also the case that `deg(S / T) ≤ 0`. Clearly also `deg((T/P)[-∞:0]) < 0` by the nature of term-slicing. Hence the `deg` of their product is also less than zero (as multiplication of polynomials sums their degrees). Hence the `[0:+∞]` term-slicing results in no terms — or in other words, results in the zero polynomial.

Computing the CRC-32C of a 4KB buffer on a modern x86 chip can be done as a simple loop around the `crc32` instruction:

``````uint32_t crc32_4k(uint32_t acc, char* buf) {
for (char* end = buf + 4096; buf < end; buf += 8) {
acc = _mm_crc32_u64(acc, *(uint64_t*)buf);
}
return acc;
}
``````

Note that this is bit-reflected CRC32-C with polynomial 0x11EDC6F41, which is different to the zlib CRC (polynomial 0x104C11DB7). Furthermore, there is no bitwise-inverse here; some texts apply `~` to `acc` on input and again on output, as in:

``````uint32_t crc32_4k_inverse(uint32_t acc, char* buf) {
return ~crc32_4k(~acc, buf);
}
``````

The loop within `crc32_4k` is fine, but leaves a lot of performance on the table: the underlying `crc32` instruction has latency 3 and throughput 1 on every Intel chip since its introduction in Nehalem. To fully hide the latency, three independent CRC streams need to be calculated in parallel. Some mathematical tricks allow us to chop the 4KB buffer into multiple chunks, compute the CRC of each chunks independently, and then merge the results together. For a good exposition of the tricks, see Option 12 and Option 13 of komrad36/CRC. As covered in that exposition, it can be convenient to chop into four chunks rather than three, with the fourth being exactly 8 bytes long. This leads to the following function:

``````uint32_t crc32_4k_three_way(uint32_t acc_a, char* buf) {
// Four chunks:
//  Chunk A: 1360 bytes from 0 through 1360
//  Chunk B: 1360 bytes from 1360 through 2720
//  Chunk C: 1368 bytes from 2720 through 4088
//  Chunk D: 8 bytes from 4088 through 4096
uint32_t acc_b = 0;
uint32_t acc_c = 0;
for (char* end = buf + 1360; buf < end; buf += 8) {
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)buf);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 1360));
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 1360*2));
}
// Merge together A and B, leaving space for C+D
// kA == magic((1360+1368+8)*8-33)
// kB == magic((     1368+8)*8-33)
__m128i kAkB = _mm_setr_epi32(/*kA*/ 0x8A074012, 0, /*kB*/ 0x93E106A4, 0);
__m128i vec_a = _mm_clmulepi64_si128(_mm_cvtsi32_si128(acc_a), kAkB, 0x00);
__m128i vec_b = _mm_clmulepi64_si128(_mm_cvtsi32_si128(acc_b), kAkB, 0x10);
uint64_t ab = _mm_cvtsi128_si64(_mm_xor_si128(vec_a, vec_b));
// Final 8 bytes of C
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 1360*2));
// Merge together C, AB, and D
return _mm_crc32_u64(acc_c, ab ^ *(uint64_t*)(buf + 1360*2 + 8));
}
``````

The main loop of `crc32_4k_three_way` contains three times as much code as `crc32_4k`, and runs approximately three times faster. The magic happens in the merging step, with literal magic numbers. The magic numbers can be generated with the following function, which computes the CRC of a `1` bit followed by `n` zero bits:

``````uint32_t magic(uint32_t n) {
uint32_t crc = ((uint32_t)1) << (31 - (n & 31));
if (n & 32) crc = _mm_crc32_u32(crc, 0);
for (n >>= 6; n; --n) crc = _mm_crc32_u64(crc, 0);
return crc;
}
``````

Read the aforementioned exposition for details, but the quick summary is that merging step for a given chunk requires a magic number corresponding to how many bits appear in subsequent chunks. Chunks C/D doesn't require any magic, as there are no subsequent chunks. Chunk B has 1368+8 bytes in subsequent chunks, hence `magic((1368+8)*8-33)`, and chunk A has 1360+1368+8 bytes in subsequent chunks, hence `magic((1360+1368+8)*8-33)`. The `-33` is actually `-32-1`: `-32` to counter the shift performed by the final `_mm_crc32_u64`, and `-1` to counter the bit-reversed inputs/outputs to `_mm_clmulepi64_si128`.

There is also a completely different way of computing CRC-32C (or any other 32-bit CRC), based around the `pclmulqdq` instruction (which is what `_mm_clmulepi64_si128` compiles to). The canonical exposition for this is Intel's whitepaper, and the following code is inspired by Chromium's implementation of said whitepaper:

``````uint32_t crc32_4k_pclmulqdq(uint32_t acc, char* buf) {
// First block of 64 is easy.
__m128i x2 = _mm_loadu_si128((__m128i*)(buf + 16));
__m128i x3 = _mm_loadu_si128((__m128i*)(buf + 32));
__m128i x4 = _mm_loadu_si128((__m128i*)(buf + 48));
x1 = _mm_xor_si128(_mm_cvtsi32_si128(acc), x1);
// Parallel fold remaining blocks of 64.
// k1 == magic(4*128+32-1)
// k2 == magic(4*128-32-1)
__m128i k1k2 = _mm_setr_epi32(/*k1*/ 0x740EEF02, 0, /*k2*/ 0x9E4ADDF8, 0);
char* end = buf + 4096 - 64;
do {
__m128i x5 = _mm_clmulepi64_si128(x1, k1k2, 0x00);
x1 = _mm_clmulepi64_si128(x1, k1k2, 0x11);
__m128i x6 = _mm_clmulepi64_si128(x2, k1k2, 0x00);
x2 = _mm_clmulepi64_si128(x2, k1k2, 0x11);
__m128i x7 = _mm_clmulepi64_si128(x3, k1k2, 0x00);
x3 = _mm_clmulepi64_si128(x3, k1k2, 0x11);
__m128i x8 = _mm_clmulepi64_si128(x4, k1k2, 0x00);
x4 = _mm_clmulepi64_si128(x4, k1k2, 0x11);
x5 = _mm_xor_si128(x5, _mm_loadu_si128((__m128i*)(buf + 64)));
x1 = _mm_xor_si128(x1, x5);
x6 = _mm_xor_si128(x6, _mm_loadu_si128((__m128i*)(buf + 80)));
x2 = _mm_xor_si128(x2, x6);
x7 = _mm_xor_si128(x7, _mm_loadu_si128((__m128i*)(buf + 96)));
x3 = _mm_xor_si128(x3, x7);
x8 = _mm_xor_si128(x8, _mm_loadu_si128((__m128i*)(buf + 112)));
x4 = _mm_xor_si128(x4, x8);
buf += 64;
} while (buf < end);
// Fold together the four parallel streams into one.
// k3 == magic(128+32-1)
// k4 == magic(128-32-1)
__m128i k3k4 = _mm_setr_epi32(/*k3*/ 0xF20C0DFE, 0, /*k4*/ 0x493C7D27, 0);
__m128i x5 = _mm_clmulepi64_si128(x1, k3k4, 0x00);
x1 = _mm_clmulepi64_si128(x1, k3k4, 0x11);
x5 = _mm_xor_si128(x5, x2);
x1 = _mm_xor_si128(x1, x5);
x5 = _mm_clmulepi64_si128(x3, k3k4, 0x00);
x3 = _mm_clmulepi64_si128(x3, k3k4, 0x11);
x5 = _mm_xor_si128(x5, x4);
x3 = _mm_xor_si128(x3, x5);
// k5 == magic(2*128+32-1)
// k6 == magic(2*128-32-1)
__m128i k5k6 = _mm_setr_epi32(/*k5*/ 0x3DA6D0CB, 0, /*k6*/ 0xBA4FC28E, 0);
x5 = _mm_clmulepi64_si128(x1, k5k6, 0x00);
x1 = _mm_clmulepi64_si128(x1, k5k6, 0x11);
x5 = _mm_xor_si128(x5, x3);
x1 = _mm_xor_si128(x1, x5);
// Apply missing <<32 and fold down to 32-bits.
acc = _mm_crc32_u64(0, _mm_extract_epi64(x1, 0));
return _mm_crc32_u64(acc, _mm_extract_epi64(x1, 1));
}
``````

Note that `crc32_4k_three_way` consists of lots of `_mm_crc32_u64` calls followed by a few `_mm_clmulepi64_si128` calls, whereas `crc32_4k_pclmulqdq` is the opposite: lots of `_mm_clmulepi64_si128` calls followed by a few `_mm_crc32_u64` calls. This is very convenient, as on all Intel chips since Broadwell, `_mm_clmulepi64_si128` has throughput 1, and executes on a separate port to `_mm_crc32_u64`. This means that we should be able to issue one `_mm_clmulepi64_si128` and one `_mm_crc32_u64` per cycle, thus doubling our speed. The latency of `_mm_clmulepi64_si128` has varied slightly over time; 5 cycles on Broadwell, 7 cycles on Skylake and Coffee Lake and Cannon Lake, then 6 cycles on Ice Lake. In the best case (5 cycles), the main loop of `crc32_4k_pclmulqdq` consumes 64 bytes every 7 cycles, and in the worst case (7 cycles), the main loop of `crc32_4k_pclmulqdq` consumes 64 bytes every 9 cycles. In contrast, the main loop of `crc32_4k_three_way` consumes 24 bytes every 3 cycles, which is 72 bytes every 9 cycles. By combining the two main loops, we should be able to consume 728 bytes every 9 cycles. Splitting 4096 into two pieces in a 72:64 ratio gives 2168.47:1927.53, which we'll round to 2176:1920. The `_mm_clmulepi64_si128` main loop can work through the 1920, while the `crc32_4k_three_way` main loop works through the 2176 (as 728+728+720). The resultant code is an unholy fusion of `crc32_4k_three_way` and `crc32_4k_pclmulqdq`, which interleaves the code from both:

``````uint32_t crc32_4k_fusion(uint32_t acc_a, char* buf) {
// Four chunks:
//  Chunk A: 728 bytes from 0 through 728
//  Chunk B: 728 bytes from 728 through 1456
//  Chunk C: 720 bytes from 1456 through 2176
//  Chunk D: 1920 bytes from 2176 through 4096
// First block of 64 from D is easy.
char* buf2 = buf + 2176;
__m128i x2 = _mm_loadu_si128((__m128i*)(buf2 + 16));
__m128i x3 = _mm_loadu_si128((__m128i*)(buf2 + 32));
__m128i x4 = _mm_loadu_si128((__m128i*)(buf2 + 48));
uint32_t acc_b = 0;
uint32_t acc_c = 0;
// Parallel fold remaining blocks of 64 from D, and 24 from each of A/B/C.
// k1 == magic(4*128+32-1)
// k2 == magic(4*128-32-1)
__m128i k1k2 = _mm_setr_epi32(/*k1*/ 0x740EEF02, 0, /*k2*/ 0x9E4ADDF8, 0);
char* end = buf + 4096 - 64;
do {
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)buf);
__m128i x5 = _mm_clmulepi64_si128(x1, k1k2, 0x00);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728));
x1 = _mm_clmulepi64_si128(x1, k1k2, 0x11);
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 728*2));
__m128i x6 = _mm_clmulepi64_si128(x2, k1k2, 0x00);
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)(buf + 8));
x2 = _mm_clmulepi64_si128(x2, k1k2, 0x11);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728 + 8));
__m128i x7 = _mm_clmulepi64_si128(x3, k1k2, 0x00);
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 728*2 + 8));
x3 = _mm_clmulepi64_si128(x3, k1k2, 0x11);
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)(buf + 16));
__m128i x8 = _mm_clmulepi64_si128(x4, k1k2, 0x00);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728 + 16));
x4 = _mm_clmulepi64_si128(x4, k1k2, 0x11);
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 728*2 + 16));
x5 = _mm_xor_si128(x5, _mm_loadu_si128((__m128i*)(buf2 + 64)));
x1 = _mm_xor_si128(x1, x5);
x6 = _mm_xor_si128(x6, _mm_loadu_si128((__m128i*)(buf2 + 80)));
x2 = _mm_xor_si128(x2, x6);
x7 = _mm_xor_si128(x7, _mm_loadu_si128((__m128i*)(buf2 + 96)));
x3 = _mm_xor_si128(x3, x7);
x8 = _mm_xor_si128(x8, _mm_loadu_si128((__m128i*)(buf2 + 112)));
x4 = _mm_xor_si128(x4, x8);
buf2 += 64;
buf += 24;
} while (buf2 < end);
// Next 24 bytes from A/B/C, and 8 more from A/B, then merge A/B/C.
// Meanwhile, fold together D's four parallel streams.
// k3 == magic(128+32-1)
// k4 == magic(128-32-1)
__m128i k3k4 = _mm_setr_epi32(/*k3*/ 0xF20C0DFE, 0, /*k4*/ 0x493C7D27, 0);
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)buf);
__m128i x5 = _mm_clmulepi64_si128(x1, k3k4, 0x00);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728));
x1 = _mm_clmulepi64_si128(x1, k3k4, 0x11);
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 728*2));
__m128i x6 = _mm_clmulepi64_si128(x3, k3k4, 0x00);
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)(buf + 8));
x3 = _mm_clmulepi64_si128(x3, k3k4, 0x11);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728 + 8));
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 728*2 + 8));
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)(buf + 16));
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728 + 16));
x5 = _mm_xor_si128(x5, x2);
acc_c = _mm_crc32_u64(acc_c, *(uint64_t*)(buf + 728*2 + 16));
x1 = _mm_xor_si128(x1, x5);
acc_a = _mm_crc32_u64(acc_a, *(uint64_t*)(buf + 24));
// k5 == magic(2*128+32-1)
// k6 == magic(2*128-32-1)
__m128i k5k6 = _mm_setr_epi32(/*k5*/ 0x3DA6D0CB, 0, /*k6*/ 0xBA4FC28E, 0);
x6 = _mm_xor_si128(x6, x4);
x3 = _mm_xor_si128(x3, x6);
x5 = _mm_clmulepi64_si128(x1, k5k6, 0x00);
acc_b = _mm_crc32_u64(acc_b, *(uint64_t*)(buf + 728 + 24));
x1 = _mm_clmulepi64_si128(x1, k5k6, 0x11);
// kC == magic((        1920)*8-33)
__m128i kCk0 = _mm_setr_epi32(/*kC*/ 0xF48642E9, 0, 0, 0);
__m128i vec_c = _mm_clmulepi64_si128(_mm_cvtsi32_si128(acc_c), kCk0, 0x00);
// kB == magic((    720+1920)*8-33)
// kA == magic((728+720+1920)*8-33)
__m128i kAkB = _mm_setr_epi32(/*kA*/ 0x155AD968, 0, /*kB*/ 0x2E7D11A7, 0);
__m128i vec_a = _mm_clmulepi64_si128(_mm_cvtsi32_si128(acc_a), kAkB, 0x00);
__m128i vec_b = _mm_clmulepi64_si128(_mm_cvtsi32_si128(acc_b), kAkB, 0x10);
x5 = _mm_xor_si128(x5, x3);
x1 = _mm_xor_si128(x1, x5);
uint64_t abc = _mm_cvtsi128_si64(_mm_xor_si128(_mm_xor_si128(vec_c, vec_a), vec_b));
// Apply missing <<32 and fold down to 32-bits.
uint32_t crc = _mm_crc32_u64(0, _mm_extract_epi64(x1, 0));
crc = _mm_crc32_u64(crc, abc ^ _mm_extract_epi64(x1, 1));
return crc;
}
``````

To give an idea of relative performance, I ran some very crude benchmarks of every function on a range of servers. The resultant gigabytes per second (GB/s) measurements are given below. As the servers have different clock speeds (both advertised and burst), I've assumed `crc32_4k` runs at 21.33 bits per cycle (b/c) on each, and used this to translate the other GB/s measurements to b/c.

``````                    Broadwell @ 3.00GHz  Skylake @ 3.20GHz    Ice Lake @ 2.40GHz
crc32_4k  10.19 GB/s 21.3 b/c  12.64 GB/s 21.3 b/c  11.59 GB/s 21.3 b/c
crc32_4k_pclmulqdq  23.32 GB/s 48.8 b/c  28.12 GB/s 47.5 b/c  27.45 GB/s 50.5 b/c
crc32_4k_three_way  26.99 GB/s 56.5 b/c  33.26 GB/s 56.1 b/c  29.18 GB/s 53.7 b/c
crc32_4k_fusion  44.44 GB/s 93.0 b/c  55.65 GB/s 93.9 b/c  50.36 GB/s 92.7 b/c
``````

PS. If running on ARM rather than Intel, see dougallj's Faster CRC32 on the Apple M1.

Update: See https://github.com/corsix/fast-crc32/.

The Raft distributed consensus protocol is normally seen as having an array of log entries. There is however a generalisation of Raft which replaces the array of log entries with a tree of nodes, and then (standard) Raft can be seen as a very simple tree that contains no branches. The resultant protocol can be called Tree Raft, and is very similar in spirit to Chained Raft. This post describes Tree Raft and contrasts it to standard Raft.

In standard Raft, the log is an array, indexed by `uint64_t index`, with each array element being a log entry:

``````struct LogEntry {
uint64_t term;
string contents;
};
``````

In Tree Raft, the log is a tree. The tree can be stored in a map/dictionary whose key type is `NodeRef` and value type is `Node`:

``````struct NodeRef {
uint64_t index;
uint64_t term;
};
struct Node {
string contents;
NodeRef parent;
};
``````

There are two restrictions on the `parent` field: if `Node n` has key `(i, t)` then `n.parent.index == i - 1` and `n.parent.term <= t`. This first restriction allows an optimisation: `parent.index` does not need to be stored; only `parent.term` needs to be stored.

There are two obvious orderings that can be defined on `NodeRef`: by ascending index then ascending term (lexicographic), and by ascending term then ascending index (transposed lexicographic). Both orderings turn out to be useful in different places. Note that a node comes after its parent in both orderings.

As an optimisation, a `Node` and (some number of) its transitive parents can be represented as an array of `LogEntry` plus the parent term of the oldest `Node`. This optimised representation is the only representation in standard Raft, and it appears twice: the per-server log takes this form (with the parent term of the oldest `Node` being `lastTerm` in the InstallSnapshot RPC), and the AppendEntries RPC takes this form (with the parent term of the oldest `Node` being `prevLogTerm`). A Tree Raft implementation could use this optimised form for committed nodes, and only use the more expensive map/dictionary representation for uncommitted nodes.

Note that a `NodeRef` identifies a single `Node`, but because every node identifies its parent, a `NodeRef` implicitly identifies an entire chain of nodes. This is similar to a commit hash in git: a hash identifies a single commit, but a commit references its parent commit(s), and thus a single hash identifies an entire graph of commits leading to that point. Standard Raft describes this as the Log Matching property: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.

A Tree Raft server maintains two cursors into the tree: `NodeRef commit` and `NodeRef head`. The `NodeRef commit` cursor replaces the `uint64_t commitIndex` from standard Raft. The `NodeRef head` cursor replaces "last log entry" everywhere that it comes up: the RequestVote RPC contains the candidate's `NodeRef head`, and other servers are willing to vote for candidates whose `NodeRef head` is greater than or equal to theirs (using transposed lexicographic order). The leader is able to create a new node by using the key `(head.index + 1, currentTerm)` and then setting its `head` to this new node. For every follower, there will be a path in the tree from the follower's `NodeRef head` to the leader's `NodeRef head`, and one objective of a follower is to move its `head` cursor along this path toward the leader's `head` cursor (note that the path might involve going up the tree before going back down). Similarly, for every follower, there will be a path in the tree from the follower's `NodeRef commit` to the leader's `NodeRef commit`, and another objective of a follower is to move its `commit` cursor along this path toward the leader's `commit` cursor (crucially, this path is always downwards and never backtracks). The leader is responsible for moving its `commit` cursor toward its `head` cursor when it is safe to do so, and the safety constraint here is very similar to standard Raft: a strict majority of servers must have their `head.term` equal to the leader's `currentTerm`, and within this majority, the minimum `head.index` gives the safe commit point. On every server, the `commit` cursor will either be equal to the `head` cursor, or be an ancestor of the `head` cursor. This permits an optimisation: only the `index` of the `commit` cursor needs to be stored, as its `term` can be reconstructed by finding the ancestor of `head` with the given `index`. Standard Raft contains this optimisation, as it only stores `uint64_t commitIndex`. It can be seen that standard Raft is just Tree Raft where a server stores the chain of nodes between the root and its `head` (a very simple non-branching tree) as its log, and does not store any other nodes.

Tree Raft replaces the AppendEntries RPC with an AddNodes RPC. The `prevLogIndex`, `prevLogTerm`, `entries[]`, and `leaderCommit` arguments are replaced by the leader's `NodeRef head`, `pair<NodeRef, Node> nodes[]`, and the leader's `NodeRef commit`. Note that these arguments can be optimised somewhat (only including `commit.index` and not `commit.term`, using the array of `LogEntry` representation for `nodes`, not including `head` if the last `nodes` entry is the head, not including the term of nodes when equal to the leader's `currentTerm`, and so forth). After processing the RPC, followers respond with their `currentTerm` and their (possibly moved) `NodeRef head`. The receiver implementation is:

1. If `term < currentTerm`, finish processing.
2. Add all the `nodes` to your map.
3. If there is a path from your `NodeRef head` to the leader's `NodeRef head`, set your `head` to the leader's.
4. If there is a path from your `NodeRef commit` to the leader's `NodeRef commit`, and the leader's `commit` either equals your `head` or is an ancestor of your `head`, set your `commit` to the leader's (this can be deferred until after the response has been sent, and the movement can be done one step at a time).

Note that if a follower misses an AddNodes RPC (due to packet loss, or leader failure, or late startup), then the nodes added in step 2 might not have a path back to the root; that is, there'll be a node in the map whose parent is not in the map. In this case, steps 3 and 4 will not be able to find a path to the new nodes. In a departure from standard Raft, in Tree Raft, followers are responsible for identifying these missing parent nodes and issuing Replay RPCs (against a randomly chosen server) in an attempt to obtain them. Once obtained, steps 3 and 4 can be retried. This has a number of nice properties: leaders no longer need to track `nextIndex` for each follower, the replay workload is shared between followers rather than being yet another task that the leader is responsible for, and the AddNodes RPC can be a UDP multicast to all followers. Furthermore, as nodes are not removed from the map as part of processing an AddNodes RPC, there are slightly nicer forward progress guarantees during leader changeover. As yet another nice property, if implementing the PreVote extension (which is necessary for liveness), then PreVote rejects can carry the (server's latest view of) the leader's `currentTerm` and `NodeRef head` and `NodeRef commit`, and then the receiver of the reject can treat this like an `AddNodes` and then replay the missing nodes (note that `currentTerm` is monotonic, and within a term, the leader's `head` and `commit` are also monotonic, hence stale views of this triplet can be arbitrated). Consequently, followers can keep up with the leader even when direct communication between the leader and the follower is not possible, avoiding the need for a long recovery process once direct communication is restored.

As nodes are not removed from the map as part of the AddNodes RPC, it is worth considering when nodes do get removed. A server cannot remove its `head` node, nor any ancestor of its `head` node (except when moving these nodes into a snapshot), but it is safe to remove any other node at any time. Standard Raft takes a very aggressive policy: when `head` moves backward (as part of moving it toward the common parent of the follower's `head` and the leader's `head`, i.e. moving it along the path in the tree from follower's `head` to leader's `head`), immediately remove everything that can be removed. A very lax policy is also possible: never remove nodes. A good middle ground is to remove during commitment: when committing a node, remove any other nodes which have the same `index` but a different `term`, and when removing a node, also remove its children (recursively). This fits very nicely with using the optimised array representation for committed nodes. If taking this middle ground, then it becomes attractive to store the uncommitted nodes in a sorted map, with nodes in lexicographic order, as then nodes with the same `index` can be found with a range scan, and the children of a node can also be found with a range scan (find all nodes with `index + 1` using range scan, then filter based on their `parent.term`).

To efficiently answer "is there a path between node X and node Y" queries, it can be useful for each server to store an additional `NodeRef furthest_ancestor` field on every `Node`. Initially, `furthest_ancestor` is set to `parent`, but when it is discovered that `n.furthest_ancestor` exists in the map, `n.furthest_ancestor` is replaced by `lookup(n.furthest_ancestor).furthest_ancestor` (giving a union-find-like behaviour). If two nodes have the same `furthest_ancestor` then a path exists between them. As an optimisation, `furthest_ancestor` does not need to be stored for committed nodes, as it'll always refer to the root for them. Furthermore, `furthest_ancestor` precisely identifies the node which needs to be the target of a Replay RPC.

``page: 3 4 5 6 7``