# An alternative exposition of crc32_4k_pclmulqdq

A recent post presented a faster way of computing CRC32C on x86. Part of that was `crc32_4k_pclmulqdq`, based on Intel's whitepaper and Chromium's implementation thereof. The Intel whitepaper is the traditional exposition for this `crc32_4k_pclmulqdq` function, but I don't *like* the exposition therein, as I think it spends too much time on some obvious things, and not nearly enough time on some non-obvious things. I present an alternative exposition here.

We begin by thinking about CPU-native data types. The types `u32` and `i32` should be familiar to most programmers, being 32-bit unsigned integers and 32-bit two's complement signed integers respectively. Two-operand addition can be defined for both of them; `u32 + u32 ↦ u32` and `i32 + i32 ↦ i32`. The neat thing is that the hardware circuit for both types of addition is the same, and is what we call the CPU `add` instruction. On 64-bit CPUs, these 32-bit types live in the least significant bits of 64-bit registers. I'll use `u32_z` and `i32_z` for such things, where the `_z` suffix denotes zero-extension of the 32-bit value to the 64-bit storage. Note that value-preserving conversion from `u32_z` to `u64` is a no-op, whereas value-preserving conversion from `i32_z` to `i64` requires sign extension (`movsx` instruction on x86, `sxtw` instruction on ARM).
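
For example, the two conversions can be modeled in Python (a sketch of my own; Python ints are unbounded, so both extensions have to be spelled out explicitly):

```
def u32_z_to_u64(v):
    # Value-preserving u32_z -> u64: a no-op, the upper bits are already zero.
    return v

def i32_z_to_i64(v):
    # Value-preserving i32_z -> i64: needs sign extension.
    return v - (1 << 32) if v & 0x80000000 else v

assert u32_z_to_u64(0xFFFFFFFF) == 4294967295   # u32 reading of the bits
assert i32_z_to_i64(0xFFFFFFFF) == -1           # i32 reading of the same bits
```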

Some less common CPU-native data types are `p32` and `r32`: polynomials over Z_{2} of degree at most 31 (`p32`), and bit-reversed versions thereof (`r32`). As with `u` and `i` types, these `p` and `r` types exist at a variety of bit-widths, and (at times) live in the least significant bits of wider registers. The `rbit` instruction on ARM is a value-preserving conversion `p32 ⇆ r32` (x86 has no such thing).
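
To make the `p` ⇆ `r` relationship concrete, here's a tiny Python helper (my own sketch, not from the whitepaper; `rev(v, 32)` plays the role of ARM's `rbit`):

```
def rev(v, n):
    # Bit-reverse an n-bit value.
    return int(format(v, '0%db' % n)[::-1], 2)

# The polynomial x^{1} + x^{0} is 0b11 as a p32, and 0xC0000000 as an r32.
assert rev(0b11, 32) == 0xC0000000
assert rev(0xC0000000, 32) == 0b11   # bit-reversal is its own inverse
```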

The ARM `crc32cb` instruction, and the matching overload of the x86 `crc32` instruction, has type signature `r32, r8 ↦ r32_z` and implementation:

```
def crc32cb(acc, val):
    val = val * x^{32}
    acc = acc * x^{8} + val
    P = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} +
        x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1
    q, r = floor_divide(acc, P)
    return r
```

To avoid some future repetition, we can pull out the tail of `crc32cb` and call it `mod_P` (with type signature `rN ↦ r32` for any `N`):

```
def crc32cb(acc, val):
    val = val * x^{32}
    acc = acc * x^{8} + val
    return mod_P(acc)

def mod_P(acc):
    P = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} +
        x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1
    q, r = floor_divide(acc, P)
    return r
```
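
Nothing so far depends on hardware, so we can turn this pseudocode into an executable Python model (a sketch of my own, not from the whitepaper): polynomials are plain ints with bit `i` holding the coefficient of x^{i}, and `clmul` stands in for carry-less multiplication. Reusing `rev` from earlier to map between bit-reversed register images and polynomials, the model reproduces the standard CRC-32C check value for `"123456789"`:

```
def clmul(a, b):
    # Multiply two Z_{2} polynomials (ints with bit i = coefficient of x^{i}).
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def floor_divide(S, P):
    # Polynomial floor division; returns (quotient, remainder).
    Q = 0
    while S:
        d = S.bit_length() - P.bit_length()
        if d < 0:
            break
        Q ^= 1 << d
        S ^= P << d
    return Q, S

P = 0x11EDC6F41  # the polynomial P above, one bit per coefficient

def mod_P(acc):
    return floor_divide(acc, P)[1]

def crc32cb(acc, val):
    return mod_P(clmul(acc, 1 << 8) ^ clmul(val, 1 << 32))

# The usual CRC-32C convention inverts the accumulator at the start and the
# end; the check value for b"123456789" is the well-known 0xE3069283.
acc = rev(0xFFFFFFFF, 32)
for byte in b"123456789":
    acc = crc32cb(acc, rev(byte, 8))
assert rev(acc, 32) ^ 0xFFFFFFFF == 0xE3069283
```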

The ARM `crc32ch` instruction, and the matching overload of the x86 `crc32` instruction, has type signature `r32, r16 ↦ r32_z` and implementation:

```
def crc32ch(acc, val):
    q, r = floor_divide(val, x^{8})
    acc = crc32cb(acc, q)
    acc = crc32cb(acc, r)
    return acc
```

A *totally equivalent* implementation is the following, which is identical to the `crc32cb` implementation except for one number:

```
def crc32ch(acc, val):
    val = val * x^{32}
    acc = acc * x^{16} + val  # The x^{8} on this line has changed to x^{16}
    return mod_P(acc)
```
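
Using the Python model from earlier, this equivalence can be checked exhaustively over all 16-bit inputs (my own sketch; `crc32ch_split` is the two-call version, `crc32ch_direct` the one-number-changed version):

```
def crc32ch_split(acc, val):
    q, r = floor_divide(val, 1 << 8)   # val = q * x^{8} + r
    return crc32cb(crc32cb(acc, q), r)

def crc32ch_direct(acc, val):
    return mod_P(clmul(acc, 1 << 16) ^ clmul(val, 1 << 32))

for val in range(1 << 16):
    assert crc32ch_split(0x12345678, val) == crc32ch_direct(0x12345678, val)
```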

The `crc32cw` (`r32, r32 ↦ r32_z`) and `crc32cx` (`r32, r64 ↦ r32_z`) instructions can be similarly implemented *either* as two calls to a lower-width implementation, or as `crc32cb` with a different scaling factor when merging `acc` with `val`. In all cases, the equivalence of the two implementations arises due to the nature of `mod_P`: so long as `mod_P` is done *once* at the end, it can be done *any number of times* at *any point* in the preceding computation. It is *useful* to do so regularly in order to bound the size of `acc`, but it doesn't *need* to be done. In the extreme, if `acc` could grow without bound, a valid multi-byte implementation `r32, r8[] ↦ r32_z` would be:

```
def crc32c_bytes(acc, bytes):
    for val in bytes:
        acc = acc * x^{8} + val * x^{32}
    return mod_P(acc)
```
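
In the Python model, we can confirm that deferring `mod_P` to the very end agrees with reducing after every byte (a sketch reusing the helpers from above):

```
def crc32c_bytes(acc, vals):   # vals: a list of r8 polynomials
    for val in vals:
        acc = clmul(acc, 1 << 8) ^ clmul(val, 1 << 32)
    return mod_P(acc)

vals = [rev(b, 8) for b in b"123456789"]
reduced_every_byte = rev(0xFFFFFFFF, 32)
for val in vals:
    reduced_every_byte = crc32cb(reduced_every_byte, val)
assert crc32c_bytes(rev(0xFFFFFFFF, 32), vals) == reduced_every_byte
```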

Something similar operating 64 bits at a time rather than 8 bits at a time (`r32, r64[] ↦ r32_z`) would be:

```
def crc32c_qwords(acc, qwords):
    for val in qwords:
        acc = acc * x^{64} + val * x^{32}
    return mod_P(acc)
```

At this width, the math can be fiddled around with slightly to do the `* x^{32}` once at the end of the loop rather than in every loop iteration:

```
def crc32c_qwords(acc, qwords):
    acc = acc * x^{32} + qwords[0]
    for i in range(1, len(qwords)):
        acc = acc * x^{64} + qwords[i]
    acc = acc * x^{32}
    return mod_P(acc)
```

In preparation for SIMD, we can go wider still, and operate on 128 bits at a time (`r32, r128[] ↦ r32_z`):

```
def crc32c_xwords(acc, xwords):
    acc = acc * x^{96} + xwords[0]
    for i in range(1, len(xwords)):
        acc = acc * x^{128} + xwords[i]
    acc = acc * x^{32}
    return mod_P(acc)
```

Furthermore, multiple accumulators could be used (this being a standard latency-hiding trick):

```
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^{96}
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = acc0 * x^{256} + xwords[i + 0]
        acc1 = acc1 * x^{256} + xwords[i + 1]
    acc = acc0 * x^{128} + acc1
    acc = acc * x^{32}
    return mod_P(acc)
```

This is starting to look very similar to `crc32_4k_pclmulqdq`, albeit we've got a way to go still: having *even more* accumulators, implementing polynomial `+` and `*`, preventing the accumulators from getting too big, and implementing the final `mod_P`.

To prevent the accumulators from getting too big, we can insert some extra `mod_P` calls (remembering that these can be freely inserted at *any* point in the computation):

```
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^{96}
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mod_P(acc0 * x^{256})
        acc1 = mod_P(acc1 * x^{256})
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = acc0 * x^{128} + acc1
    acc = acc * x^{32}
    return mod_P(acc)
```

This solves the accumulator problem, albeit at the cost of introducing the problem of implementing the extra `mod_P` calls. That problem is slightly fiddly, so we'll come back to it later, and instead address polynomial `+` and `*` next.

For small `N`, every CPU implements Z_{2} polynomial `+` with signature `pN_z + pN_z ↦ pN_z` — it is the `xor` instruction on x86, and the `eor` instruction on ARM. Every CPU *also* implements Z_{2} polynomial `+` with signature `rN_z + rN_z ↦ rN_z` — it is *also* the `xor` instruction (in the same way that the `add` instruction is both `u32` and `i32`). Modern CPUs implement Z_{2} polynomial `*` under the name of `pclmulqdq` (x86) or `pmull` / `pmull2` (ARM). The kernel of this instruction is normally seen as `p64 * p64 ↦ p127_z`, though it can equally be seen as `r64 * r64 ↦ r127_z`. Value-preserving conversion from `p127_z` to `p128` is a no-op (like going from `u32_z` to `u64`), whereas value-preserving conversion from `r127_z` to `r128` requires a *left-shift* by one (like going from `i32_z` to `i64` requires a sign extension). When operating on `r128`, a *right-shift* by one is multiplication by x^{1}. Doing both shifts in sequence results in no shift at all, and just a multiplication by x^{1}. For `N` ≤ 64 and `M` ≤ 64, the kernel can alternatively be seen as `pN_z * pM_z ↦ p(N+M-1)_z` or as `rN_z * rM_z ↦ r(N+M-1)_z`.

As the output of the kernel can have up to 127 bits, it goes into a 128-bit SIMD register. This causes the two inputs to the kernel to also come from 128-bit SIMD registers, and thus causes the type signature of the overall instruction to be `p128, p128 ↦ p127_z` or `r128, r128 ↦ r127_z`. To get the inputs down from `p128` to `p64` (or from `r128` to `r64`), `floor_divide` by x^{64} is done, and then *either* the quotient or the remainder is used as the 64-bit input to the kernel. On x86, an immediate argument to `pclmulqdq` controls whether to take the quotient or the remainder of each input. On ARM, there is slightly less flexibility: either both inputs have their remainder taken or both have their quotient taken. In practice, the decreased flexibility is rarely a problem.
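
As a Python model of this operand selection (a sketch following the x86 immediate encoding, where bit 0 of the immediate picks the half of the first operand and bit 4 the half of the second):

```
MASK64 = (1 << 64) - 1

def pclmulqdq(a, b, imm8):
    # floor_divide by x^{64} is just a split at bit 64: the quotient is the
    # high half, the remainder is the low half.
    lhs = (a >> 64) if (imm8 & 0x01) else (a & MASK64)
    rhs = (b >> 64) if (imm8 & 0x10) else (b & MASK64)
    return clmul(lhs, rhs)   # degree ≤ 126, so it fits in 128 bits
```

So an immediate of `0x00` multiplies the two remainders, `0x11` multiplies the two quotients, and `0x01` / `0x10` mix one of each.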

With this knowledge of how to do polynomial `+` and `*`, we can return to the question of how to do `acc0 = mod_P(acc0 * x^{256})`. The type of `acc0` is going to be `r128` in order to match the SIMD register width, and a `floor_divide` by x^{64} can split it into two `r64` pieces:

```
q, r = floor_divide(acc0, x^{64})
assert acc0 == q * x^{64} + r
```

We can then do some arbitrary-looking algebraic manipulations of the expression we want:

```
   mod_P(acc0 * x^{256})
== mod_P((q * x^{64} + r) * x^{256})
== mod_P(q * x^{256+64} + r * x^{256})
== mod_P(q * x^{256+64-33} * x^{33} + r * x^{256-33} * x^{33})
```

The next trick is where the magic happens: because of the `mod_P` call at the end of `crc32c_xwords`, we're free to insert or remove `mod_P` calls *anywhere* else. We previously used this freedom to insert some extra `mod_P` calls, and we're going to use it again now to remove those `mod_P` calls and insert new ones elsewhere:

```
   mod_P(q * x^{256+64-33} * x^{33} + r * x^{256-33} * x^{33})
-> q * mod_P(x^{256+64-33}) * x^{33} + r * mod_P(x^{256-33}) * x^{33}
```

Note that these two expressions give different values (hence `->` rather than `==`), but this difference evaporates away after the `mod_P` call at the end of `crc32c_xwords`. This new expression is *very* convenient to compute: `q` has type `r64_z`, and `mod_P(x^{256+64-33})` is a constant with type `r32_z`, so their product has type `r95_z`. A value-preserving conversion from `r95_z` to `r128` involves a left-shift by 33 bits, and then `* x^{33}` is a right-shift by 33 bits. These two shifts cancel each other out, meaning that we effectively get the `* x^{33}` for free. In other words, a single CPU instruction can obtain `q` from `acc0` and compute `q * mod_P(x^{256+64-33}) * x^{33}`. Similarly, a single CPU instruction can obtain `r` from `acc0` and compute `r * mod_P(x^{256-33}) * x^{33}`. Adding the two terms is another CPU instruction: `pxor` on x86, or `eor` on ARM (see also `eor3` on ARM, and also note that the M1 can fuse `pmull` with `eor`, albeit the fused operation still takes two retire slots). We can wrap this expression up in a function, call it `mul_xN_mod_P`, and make use of it in `crc32c_xwords`:

```
def mul_xN_mod_P(acc, N):  # r128, int ↦ r128
    q, r = floor_divide(acc, x^{64})
    return q * mod_P(x^{N+64-33}) * x^{33} + r * mod_P(x^{N-33}) * x^{33}

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^{96}
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mul_xN_mod_P(acc0, 256)
        acc1 = mul_xN_mod_P(acc1, 256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = acc0 * x^{128} + acc1
    acc = acc * x^{32}
    return mod_P(acc)
```
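
In the Python model, `mul_xN_mod_P` can be checked for the congruence it is meant to provide (a sketch; the trailing `* x^{33}` is an explicit shift here, whereas on hardware it cancels against the `r95_z ↦ r128` conversion as described above):

```
def mul_xN_mod_P(acc, N):
    # Fold a 128-bit polynomial: congruent to acc * x^{N} (mod P).
    q, r = floor_divide(acc, 1 << 64)
    folded = clmul(q, mod_P(1 << (N + 64 - 33))) ^ clmul(r, mod_P(1 << (N - 33)))
    return clmul(folded, 1 << 33)

acc = 0x0123456789ABCDEF0123456789ABCDEF
assert mod_P(mul_xN_mod_P(acc, 256)) == mod_P(clmul(acc, 1 << 256))
```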

This `mul_xN_mod_P` is also useful for doing `acc = acc0 * x^{128} + acc1` at the end:

```
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^{96}
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mul_xN_mod_P(acc0, 256)
        acc1 = mul_xN_mod_P(acc1, 256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = mul_xN_mod_P(acc0, 128) + acc1
    return mod_P(acc * x^{32})
```

At this point, only three problems remain: having *even more* accumulators, precomputing the various `mod_P(x^{M})` constants, and computing the final `mod_P(acc * x^{32})`.

The optimal number of accumulators depends on the throughput and latency of the CPU's polynomial `+` and `*` operations. `+` is typically latency 1 and throughput better than or equal to the throughput of `*`. On recent Intel chips, `*` (i.e. `pclmulqdq`) has a latency of between 5 and 7 cycles, and a throughput of 1 per cycle. On the M1, `*` (i.e. `pmull` / `pmull2`) has a latency of 3 cycles, and a throughput of 4 per cycle (these numbers are *incredible* compared to Intel's). We need two `*` and one `+` for each `mul_xN_mod_P` call. This makes 4 (or possibly 5) accumulators optimal on Intel, and 12 accumulators optimal on M1. A version with four accumulators looks like:

```
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^{96}
    acc1 = xwords[1]
    acc2 = xwords[2]
    acc3 = xwords[3]
    for i in range(4, len(xwords), 4):
        acc0 = mul_xN_mod_P(acc0, 128*4)
        acc1 = mul_xN_mod_P(acc1, 128*4)
        acc2 = mul_xN_mod_P(acc2, 128*4)
        acc3 = mul_xN_mod_P(acc3, 128*4)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
        acc2 += xwords[i + 2]
        acc3 += xwords[i + 3]
    acc0 = mul_xN_mod_P(acc0, 128) + acc1
    acc2 = mul_xN_mod_P(acc2, 128) + acc3
    acc = mul_xN_mod_P(acc0, 128*2) + acc2
    return mod_P(acc * x^{32})
```
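
As a consistency check in the Python model, the four-accumulator version can be compared against the single-accumulator `crc32c_xwords` from earlier (a sketch; `mul_xN_mod_P` here is the model version from above, and the input length must be a multiple of four xwords):

```
import random

def crc32c_xwords_simple(acc, xwords):
    acc = clmul(acc, 1 << 96) ^ xwords[0]
    for xw in xwords[1:]:
        acc = clmul(acc, 1 << 128) ^ xw
    return mod_P(clmul(acc, 1 << 32))

def crc32c_xwords_4acc(acc, xwords):
    acc0 = xwords[0] ^ clmul(acc, 1 << 96)
    acc1, acc2, acc3 = xwords[1], xwords[2], xwords[3]
    for i in range(4, len(xwords), 4):
        acc0 = mul_xN_mod_P(acc0, 128 * 4) ^ xwords[i + 0]
        acc1 = mul_xN_mod_P(acc1, 128 * 4) ^ xwords[i + 1]
        acc2 = mul_xN_mod_P(acc2, 128 * 4) ^ xwords[i + 2]
        acc3 = mul_xN_mod_P(acc3, 128 * 4) ^ xwords[i + 3]
    acc0 = mul_xN_mod_P(acc0, 128) ^ acc1
    acc2 = mul_xN_mod_P(acc2, 128) ^ acc3
    acc = mul_xN_mod_P(acc0, 128 * 2) ^ acc2
    return mod_P(clmul(acc, 1 << 32))

xwords = [random.getrandbits(128) for _ in range(16)]
assert crc32c_xwords_4acc(0x12345678, xwords) == crc32c_xwords_simple(0x12345678, xwords)
```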

If we have a hardware `crc32c` instruction, then the remaining two problems are easy. `mod_P(acc * x^{32})` is just `crc32c(0, acc)`. In our case, `acc` is `r128`, and the widest hardware `crc32c` instruction only takes `r64`, but following the pattern from the start of this post, a wide CRC can be constructed from two narrow CRCs:

```
def crc32c_128(acc, val):  # r32, r128 ↦ r32_z
    q, r = floor_divide(val, x^{64})
    acc = crc32c_64(acc, q)
    acc = crc32c_64(acc, r)
    return acc
```

Meanwhile, `mod_P(x^{M})` is just `acc = x^{M % 32}` followed by `M//32` iterations of `acc = mod_P(acc * x^{32})`, and the latter is just a CRC instruction. In concrete C code, this is just:

```
uint32_t magic(uint64_t M) { // int ↦ r32_z
  uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
  for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0);
  return acc;
}
```
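
A Python cross-check of what `magic` computes, namely the bit-reversed r32 register image of `mod_P(x^{M})` (a sketch; `rev`, `clmul`, and `mod_P` are from the model above):

```
def magic_model(M):
    acc = 0x80000000 >> (M & 31)   # register image of x^{M % 32}
    for _ in range(M >> 5):        # each step: acc = mod_P(acc * x^{32})
        acc = rev(mod_P(clmul(rev(acc, 32), 1 << 32)), 32)
    return acc

for M in (0, 31, 32, 95, 256 + 64 - 33, 256 - 33):
    assert magic_model(M) == rev(mod_P(1 << M), 32)
```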

For bonus points, note that `magic` can be implemented to run in `O(log(M))` time rather than `O(M)` time:

```
uint32_t magic(uint64_t M) {
  // Halve M until it is small, pushing one (complemented) bit per step.
  uint64_t stack = ~(uint64_t)1;
  for (; M > 191; M = (M >> 1) - 16) {
    stack = (stack << 1) + (M & 1);
  }
  stack = ~stack;
  // Compute mod_P(x^M) for the now-small M, as in the O(M) version.
  uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
  for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0);
  // Replay the halving steps in reverse; each squaring doubles the exponent.
  for (;;) {
    uint32_t low = stack & 1;
    if (!(stack >>= 1)) return acc;
    __m128i x = _mm_cvtsi32_si128(acc);
    uint64_t y = _mm_cvtsi128_si64(_mm_clmulepi64_si128(x, x, 0));
    acc = _mm_crc32_u64(0, y << low);
  }
}
```

For more bonus points, the ARM translation of this is:

```
uint32_t magic(uint64_t M) {
  uint64_t stack = ~(uint64_t)1;
  for (; M > 191; M = (M >> 1) - 16) {
    stack = (stack << 1) + (M & 1);
  }
  stack = ~stack;
  uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
  for (M >>= 5; M; --M) acc = __crc32cw(acc, 0);
  for (;;) {
    uint32_t low = stack & 1;
    if (!(stack >>= 1)) return acc;
    poly8x8_t x = vreinterpret_p8_u64(vmov_n_u64(acc));
    uint64_t y = vgetq_lane_u64(vreinterpretq_u64_p16(vmull_p8(x, x)), 0);
    acc = __crc32cd(0, y << low);
  }
}
```

If we do *not* have a hardware `crc32c` instruction, or we want to change `P` to something which lacks a hardware instruction, then things are harder, but not insurmountable. All we have to do is *build* the equivalent of the 32-bit CRC instruction:

```
def crc32cw(acc, val):  # r32, r32 ↦ r32_z
    acc = (acc + val) * x^{32}
    return mod_P(acc)
```

Expanding the `mod_P` gets us to:

```
def crc32cw(acc, val):  # r32, r32 ↦ r32_z
    acc = (acc + val) * x^{32}
    P = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} +
        x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1
    q, r = floor_divide(acc, P)
    return r
```

We can compute `q` by Barrett reduction, and then use that to compute `r`:

```
def crc32cw(acc, val):  # r32, r32 ↦ r32_z
    acc = (acc + val) * x^{32}
    P = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} +
        x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1
    q = (acc * (x^{63} // P)) // x^{63}  # Anything ≥ 63 works
    r = acc - q * P
    return r
```
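
The claim in that comment can be probed with the Python model: with the constant `x^{63} // P`, Barrett's quotient is exact for every input of degree at most 63 (a randomized sketch rather than a proof):

```
import random

MU = floor_divide(1 << 63, P)[0]   # x^{63} // P, degree 31
for _ in range(10000):
    acc = random.getrandbits(64)   # degree ≤ 63
    q = clmul(acc, MU) >> 63       # (acc * (x^{63} // P)) // x^{63}
    assert q == floor_divide(acc, P)[0]
```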

As we're operating in Z_{2}, the `-` in `r = acc - q * P` can be replaced by `+`. And as `deg(r)` < 32 while `acc` has no terms of degree less than 32 (thanks to its trailing `* x^{32}` factor), the `acc` in `r = acc - q * P` contributes nothing below x^{32}, so it can be dropped in favour of term-slicing `q * P`. After doing both these things, we're left with:

```
def crc32cw(acc, val):  # r32, r32 ↦ r32_z
    acc = (acc + val) * x^{32}
    P = x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} +
        x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1
    q = (acc * (x^{63} // P)) // x^{63}  # Anything ≥ 63 works
    return (q * P)[0:32]
```

As concrete C code, this is:

```
uint32_t crc32cw(uint32_t acc, uint32_t val) {
  __m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
  __m128i a = _mm_cvtsi32_si128(acc ^ val);
  __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
  __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
  return _mm_extract_epi32(c, 2);
}
```

Where the magic numbers in `k` come from:

```
def floor_divide(S, P):
    Q = 0
    while S:
        d = S.bit_length() - P.bit_length()
        if d < 0:
            break
        Q ^= 1 << d
        S ^= P << d
    return Q, S

def rev32(F):
    G = 0
    for _ in range(32):
        G = (G << 1) + (F & 1)
        F >>= 1
    assert F == 0
    return G

P = 0x11EDC6F41
Q, _ = floor_divide(1 << 63, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)
```

We can then build a 64-bit CRC function as two calls to `crc32cw`, or directly:

```
uint32_t crc32cx(uint32_t acc, uint64_t val) {
  __m128i k = _mm_setr_epi32(0xdea713f1, 0x4869ec38, 0x05ec76f1, 1);
  __m128i a = _mm_cvtsi64_si128(acc ^ val);
  __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
  __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
  return _mm_extract_epi32(c, 2);
}
```

Where the magic numbers in `k` come from:

```
P = 0x11EDC6F41
Q, _ = floor_divide(1 << 95, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)
```

For bonus points, the zlib CRC equivalent (`P = 0x104C11DB7`) is the following:

```
uint32_t crc32x(uint32_t acc, uint64_t val) {
  __m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
  __m128i a = _mm_cvtsi64_si128(acc ^ val);
  __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
  __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
  return _mm_extract_epi32(c, 2);
}
```

This can be plugged into Chromium's implementation to give a more succinct and slightly faster tail; everything under "Fold 128-bits to 64-bits" and "Barret reduce to 32-bits" and "Return the crc32" can be replaced by:

```
__m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
__m128i b = _mm_clmulepi64_si128(x1, k, 0x00);
__m128i c = _mm_clmulepi64_si128(b, k, 0x10);
c = _mm_blend_epi16(c, _mm_setzero_si128(), 0xcf); // or: c = _mm_and_si128(c, _mm_setr_epi32(0, 0, ~0, 0));
__m128i a = _mm_xor_si128(c, x1);
b = _mm_clmulepi64_si128(a, k, 0x01);
c = _mm_clmulepi64_si128(b, k, 0x10);
return _mm_extract_epi32(c, 2);
```