Contrasting Intel AMX and Apple AMX

Intel has an x64 instruction set extension called AMX, while Apple has a totally different aarch64 instruction set extension also called AMX.

Register file

Intel AMX: 8 tmm registers, each a 16 by 16 matrix of 32-bit elements (technically, each can be configured to be any size - square or rectangular - between 1 by 1 and 16 by 16, though element size is fixed at 32 bits regardless). Total architectural state 8 kilobytes.

Apple AMX: Total architectural state 5 kilobytes, broken down as:

  - 8 x registers, each a 64-byte vector (0.5 kilobytes total)
  - 8 y registers, each a 64-byte vector (0.5 kilobytes total)
  - 64 z registers, each a 64-byte vector (4 kilobytes total)

The vectors have 8/16/32/64-bit elements, like regular SIMD registers. Note that Intel AMX does not need vector registers, as Intel already has AVX512 with 64-byte vectors (32 of which are in the AVX512 architectural register file).
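
For concreteness, the arithmetic behind the two totals (a trivial check of my own; the Apple figure assumes the x / y / z breakdown listed above):

intel_amx_bytes = 8 * (16 * 16 * 4)          # 8 tmm registers of 16x16 32-bit elements = 8192
apple_amx_bytes = 8 * 64 + 8 * 64 + 64 * 64  # x + y + z register pools = 5120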

Data types

Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.

Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction), though bf16 support is added on M2 and later.

Computational operations

Intel AMX: Matrix multiplication of any two tmm registers, accumulating onto a third tmm register. For the multiplication, matrix elements are themselves (very small) vectors, combined via dot product. This is the only operation. Viewed differently, this is doing 16×64 by 64×16 (int8) or 16×32 by 32×16 (bf16) matmul, then adding onto a 16×16 matrix.

Apple AMX: Outer product of any x register with any y register (viewed as a column vector), accumulating onto any (matrix view) z register. For the multiplication, x / y elements are scalars (depending on the data type, this might be viewed as doing 16×1 by 1×16 matmul then adding onto a 16×16 matrix). Alternatively, pointwise product of any x register with any y register (viewed as a row vector), accumulating onto any (vector view) z register. Many more operations as well, though the general theme is {outer or pointwise} {multiplication or addition or subtraction}, possibly followed by right-shift, possibly followed by integer saturation. The most exotic exceptions to the theme are min / max / popcnt.
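
As a rough illustration of the difference, here is a schematic Python sketch (my own, with made-up function names, ignoring masking, register sizes and memory layout): Intel AMX's one operation is a matmul whose scalar step is a small dot product, whereas Apple AMX's core operation is a rank-1 (outer product) update.

def intel_amx_style_matmul(C, A, B):
    # C: 16x16 of i32; A, B: 16x16 where each element is a 4-vector of i8
    for i in range(16):
        for j in range(16):
            for k in range(16):
                C[i][j] += sum(A[i][k][t] * B[k][j][t] for t in range(4))

def apple_amx_style_outer_product(Z, X, Y):
    # Z: 16x16 of f32 (matrix view); X, Y: 16 f32 each
    for i in range(16):
        for j in range(16):
            Z[i][j] += Y[i] * X[j]   # y viewed as a column vector, x as a row vector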

Memory operations

Intel AMX: Matrix load or store (up to 1 kilobyte), configurable with a base address (register + immediate offset) and a row stride (register or zero, optionally shifted left by 1-3 bits).

Apple AMX: Vector load or store (64 bytes), configurable with a base address (register). Also load or store pair (128 bytes), though the two registers must be consecutive, and the row stride is fixed at 64 bytes, and the base address must be 128-byte aligned. Loads or stores with y effectively give a free vector transpose, as y registers can be viewed as column vectors.

Masking modes

Intel AMX: Each tmm register can be configured to any size - square or rectangular - between 1 by 1 and 16 by 16. This is (mostly) equivalent to saying that a tmm register is always 16 by 16, but has an associated mask on each dimension to only enable some number of leading rows and columns. These per-register masks are architectural state.

Apple AMX: Per-dimension masking is available on a per-instruction basis (though notably not for loads / stores). Available masking modes are: all rows (columns), even/odd rows (columns) only, first N rows (columns) only, last N rows (columns) only, row (column) N only.

Note that both of these approaches are different to Intel's AVX512 approach, which is a separate register file containing 8 mask registers (k0 through k7) and every operation optionally taking a mask register as an input.

Other

Apple AMX contains a very interesting instruction called genlut. In the forward direction ("lookup"), it is somewhere between AVX512's vpshufb and vgatherps. In the backward direction ("generate") it is something like an inverse vpshufb, performing arbitrary 2/3/4/5-bit quantisation. When used in both directions, it can be useful for piecewise linear interpolation, or as an alternative to AVX512's vfpclassps / vfixupimmps.

Computing parity using CRC

It is a party trick rather than anything practically useful, but the following four functions all compute the same thing:

#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h> // SSE4.2 intrinsics: _mm_popcnt_u64, _mm_crc32_u64

uint8_t parity_b(const uint64_t* data, size_t n) {
    uint8_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < 64; ++j) {
            if (data[i] & (1ull << j)) acc ^= 1;
        }
    }
    return acc;
}
uint8_t parity_p(const uint64_t* data, size_t n) {
    uint8_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += _mm_popcnt_u64(data[i]);
    }
    return acc & 1;
}
uint8_t parity_x(const uint64_t* data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc ^= data[i];
    }
    return _mm_popcnt_u64(acc) & 1;
}
uint8_t parity_c(const uint64_t* data, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = _mm_crc32_u64(acc, data[i]);
    }
    return _mm_popcnt_u64(acc) & 1;
}

Going from parity_b to parity_p is an easy step: replace the inner loop with a popcnt. Going from parity_p to parity_x is also an easy step: lift the popcnt out of the loop. Hence it is not too surprising that parity_b and parity_p and parity_x all compute the same thing. The parity_c function is like parity_x, except replacing ^= with crc, and the surprising thing is that parity_c computes the same thing as all the other functions. To understand why, we need to look at the polynomial used by the crc instruction (the CRC32-C polynomial):

x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} + x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1

This polynomial can be expressed as the product of two separate polynomials:

(x + 1)(x^{31} + x^{30} + x^{29} + x^{28} + x^{26} + x^{24} + x^{23} + x^{21} + x^{20} + x^{18} + x^{13} + x^{10} + x^{8} + x^{5} + x^{4} + x^{3} + x^{2} + x^{1} + 1)
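
This factorisation is easy to check mechanically; a quick verification I wrote (not part of the original argument), representing Z2 polynomials as Python ints whose bit i is the coefficient of x^{i}:

CRC32C = 0x11EDC6F41   # the CRC32-C polynomial above, as a bit mask
FACTOR = sum(1 << i for i in (31, 30, 29, 28, 26, 24, 23, 21, 20, 18,
                              13, 10, 8, 5, 4, 3, 2, 1, 0))
# Multiplying a Z2 polynomial Q by (x + 1) is just (Q << 1) xor Q.
assert (FACTOR << 1) ^ FACTOR == CRC32C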

Accordingly, CRC32-C isn't really a 32-bit checksum, it is really two separate things packed into 32 bits: a 1-bit checksum and a 31-bit checksum. The method of packing is slightly funky, and thus the method of unpacking is also slightly funky: from a mathematical point of view, to get the 1-bit checksum out of the 32-bit value, we need to view the 32-bit value as a Z2 polynomial and then compute the remainder modulo x + 1. This is equivalent to iteratively applying the reduction rule x = 1 until nothing more can be reduced. If we were to start with the 4-bit polynomial b3 * x^{3} + b2 * x^{2} + b1 * x + b0 and iteratively apply x = 1, we'd get b3 * 1*1*1 + b2 * 1*1 + b1 * 1 + b0, which is just b3 + b2 + b1 + b0, which is just another way of saying popcnt (mod 2). The same thing is true starting with a 32-bit polynomial: the remainder modulo x + 1 is just popcnt (mod 2). In other words, we can use popcnt to unpack the 1-bit checksum out of the 32-bit value, and a 1-bit checksum is just parity. Hence why parity_c computes the same thing as all the other functions.

This combination of a 1-bit checksum and a 31-bit checksum can be seen as a good thing for checksumming purposes, as the two parts protect against different problems. Notably, every bit-flip in the data being checksummed will invert the 1-bit checksum, and hence any odd number of bit-flips will cause the 1-bit checksum to mismatch. The (ahem) flip-side is that the 1-bit checksum will match for any even number of bit-flips, which is where the 31-bit checksum comes in: it needs to protect against as many even number of bit-flips as possible.
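
For the sceptical, here is a standalone cross-check I wrote (not from the original post): a plain-Python bit-at-a-time CRC32-C with an initial accumulator of 0 (matching how parity_c chains the hardware instruction), whose popcount parity should always agree with the parity of the input bits:

import random

def crc32c(data: bytes, acc: int = 0) -> int:
    for byte in data:
        acc ^= byte
        for _ in range(8):
            acc = (acc >> 1) ^ (0x82F63B78 if acc & 1 else 0)  # bit-reversed CRC32-C polynomial
    return acc

random.seed(0)
for _ in range(100):
    data = random.randbytes(random.randrange(64))
    parity = bin(int.from_bytes(data, "little")).count("1") & 1
    assert bin(crc32c(data)).count("1") & 1 == parity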

Polynomial remainders by direct computation

Last time, we had a need to compute mod_P(A * x^32) where A was a polynomial with degree at most 31, and mod_P took the remainder after dividing by a polynomial P with degree 32. The traditional Barrett reduction approach involved finding the quotient and then calculating the remainder from that:

Q = ((A * x^32) * (x^63 // P)) // x^63
R =  (A * x^32) - Q * P
return R

Via a combination of the * x^32, and the degree of P being 32, and the context being Z2 polynomials, this simplified to:

Q = ((A * x^32) * (x^63 // P)) // x^63
return (Q * P)[0:32]

The resultant C code (for bit-reversed polynomials) was very short:

uint32_t mul_x32_mod_P(uint32_t A) {
    __m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi32_si128(A);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}

Though our particular case permitted this simplification, it is not permitted in general. In Faster Remainder by Direct Computation (D. Lemire, O. Kaser, and N. Kurz, 2018), a different approach was presented: the remainder can be computed directly, rather than computing the quotient and getting the remainder from that. Said paper is about integers, and its main Theorem 1 is effectively:

Provided that d in [1, 2^N) and n in [0, 2^N) and F ≥ N + log2(d), then:
n % d equals (((n * ceil(2^F / d)) % 2^F) * d) // 2^F.

Translated from the language of integers to the language of polynomials, this theorem would be:

Provided that S, P, T are polynomials satisfying deg((S * P) // T) ≤ 0, then:
S % P equals (((S * (T // P)) % T) * P) // T.

This lets us turn % P into multiplication by T // P followed by % T followed by * P followed by // T. The % T and // T steps are trivial if T is chosen to be x^n (with n ≥ deg(S) + deg(P) to make the precondition hold), and T // P can be precomputed if P is fixed.
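
Before proving it, here is a quick randomized check of the translated theorem (my own sketch, with hypothetical helper names; Z2 polynomials are represented as Python ints whose bit i is the coefficient of x^i):

import random

def clmul(a, b):                       # polynomial multiplication over Z2
    r, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            r ^= a << i
        i += 1
    return r

def divmod2(s, p):                     # polynomial floor division: (quotient, remainder)
    q = 0
    while s and s.bit_length() >= p.bit_length():
        d = s.bit_length() - p.bit_length()
        q ^= 1 << d
        s ^= p << d
    return q, s

random.seed(1)
for _ in range(1000):
    P = random.getrandbits(32) | (1 << 32)       # deg(P) = 32
    S = random.getrandbits(64)                   # deg(S) <= 63
    T = 1 << 96                                  # T = x^96, so deg((S * P) // T) <= 0
    step = clmul(S, divmod2(T, P)[0])            # S * (T // P)
    step = divmod2(step, T)[1]                   # ... % T
    step = clmul(step, P)                        # ... * P
    assert divmod2(step, T)[0] == divmod2(S, P)[1]   # ... // T  ==  S % P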

It turns out that the translated theorem is true, and that the proof isn't too hard. The proof builds upon the Barrett reduction property, which is:

Provided that deg(S // T) ≤ 0, S // P equals (S * (T // P)) // T.

We start with the complicated side of our (supposed) equality:

(((S * (T // P)) % T) * P) // T

We're going to be overwhelmed by parentheses if we're not careful, so I'm going to say that a sequence of mixed * and % and // operations always happens left-to-right, thus allowing some of the parentheses to be dropped:

S * (T // P) % T * P // T

From here, the % can be expanded: (F % G equals F - (F // G * G))

((S * (T // P)) - (S * (T // P) // T * T)) * P // T

Then push the - to the top level:

(S * (T // P) * P // T) - (S * (T // P) // T * T * P // T)

Making use of our deg((S * P) // T) ≤ 0 precondition, by the Barrett reduction property, (S * P) // P equals (S * P) * (T // P) // T:

(S * P // P) - (S * (T // P) // T * T * P // T)

We can then simplify away * G // G for any G:

S - (S * (T // P) // T * P)

Our deg((S * P) // T) ≤ 0 precondition implies deg(S // T) ≤ 0, so by the Barrett reduction property, S // P equals S * (T // P) // T:

S - (S // P * P)

Which is the definition of:

S % P

Thus our supposed equality is an actual equality.

An alternative exposition of crc32_4k_pclmulqdq

A recent post presented a faster way of computing CRC32C on x86. Part of that was crc32_4k_pclmulqdq, based on Intel's whitepaper and Chromium's implementation thereof. The Intel whitepaper is the traditional exposition for this crc32_4k_pclmulqdq function, but I don't like the exposition therein, as I think it spends too much time on some obvious things, and not nearly enough time on some non-obvious things. I present an alternative exposition here.

We begin by thinking about CPU-native data types. The types u32 and i32 should be familiar to most programmers, being 32-bit unsigned integers and 32-bit two's complement signed integers respectively. Two-operand addition can be defined for both of them; u32 + u32 ↦ u32 and i32 + i32 ↦ i32. The neat thing is that the hardware circuit for both types of addition is the same, and is what we call the CPU add instruction. On 64-bit CPUs, these 32-bit types live in the least significant bits of 64-bit registers. I'll use u32_z and i32_z for such things, where the _z suffix denotes zero-extension of the 32-bit value to the 64-bit storage. Note that value-preserving conversion from u32_z to u64 is a no-op, whereas value-preserving conversion from i32_z to i64 requires sign extension (movsx instruction on x86, sxtw instruction on ARM).

Some less common CPU-native data types are p32 and r32: polynomials over Z2 of degree at most 31 (p32), and bit-reversed versions thereof (r32). As with u and i types, these p and r types exist at a variety of bit-widths, and (at times) live in the least significant bits of wider registers. The rbit instruction on ARM is a value-preserving conversion p32 ⇆ r32 (x86 has no such thing).

The ARM crc32cb instruction, and the matching overload of the x86 crc instruction, has type signature r32, r8 ↦ r32_z and implementation:

def crc32cb(acc, val):
    val = val * x^32
    acc = acc * x^8 + val
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9  + x^8  + x^6 + 1
    q, r = floor_divide(acc, P)
    return r

To avoid some future repetition, we can pull out the tail of crc32cb and call it mod_P (with type signature rN ↦ r32 for any N):

def crc32cb(acc, val):
    val = val * x^32
    acc = acc * x^8 + val
    return mod_P(acc)
def mod_P(acc):
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9  + x^8  + x^6 + 1
    q, r = floor_divide(acc, P)
    return r

The ARM crc32ch instruction, and the matching overload of the x86 crc instruction, has type signature r32, r16 ↦ r32_z and implementation:

def crc32ch(acc, val):
    q, r = floor_divide(val, x^8)
    acc = crc32cb(acc, q)
    acc = crc32cb(acc, r)
    return acc

A totally equivalent implementation is the following, which is identical to the crc32cb implementation except for one number:

def crc32ch(acc, val):
    val = val * x^32
    acc = acc * x^16 + val # The x^8 on this line has changed to x^16
    return mod_P(acc)

The crc32cw (r32, r32 ↦ r32_z) and crc32cx (r32, r64 ↦ r32_z) instructions can be similarly implemented either as two calls to a lower-width implementation, or as crc32cb with a different scaling factor when merging acc with val. In all cases, the equivalence of the two implementations arises due to the nature of mod_P: so long as mod_P is done once at the end, it can be done any number of times at any point in the preceding computation. It is useful to do so regularly in order to bound the size of acc, but it doesn't need to be done. In the extreme, if acc could grow without bound, a valid multi-byte implementation r32, r8[] ↦ r32_z would be:

def crc32c_bytes(acc, bytes):
    for val in bytes:
        acc = acc * x^8 + val * x^32
    return mod_P(acc)
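
Since everything from here on leans on that freedom to defer or insert mod_P, here is a quick randomized check of the two identities it boils down to (my own sketch, not from the post; Z2 polynomials as Python ints, with hypothetical helpers clmul and mod_p):

import random

P = 0x11EDC6F41            # the CRC32-C polynomial, as a bit mask

def clmul(a, b):           # polynomial multiplication over Z2
    r, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            r ^= a << i
        i += 1
    return r

def mod_p(s):              # remainder after dividing by P
    while s.bit_length() >= P.bit_length():
        s ^= P << (s.bit_length() - P.bit_length())
    return s

random.seed(2)
for _ in range(1000):
    a, b = random.getrandbits(128), random.getrandbits(128)
    assert mod_p(a ^ b) == mod_p(mod_p(a) ^ b)              # reducing an addend early is harmless
    assert mod_p(clmul(a, b)) == mod_p(clmul(mod_p(a), b))  # reducing a multiplicand early is harmless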

Something similar operating 64 bits at a time rather than 8 bits at a time (r32, r64[] ↦ r32_z) would be:

def crc32c_qwords(acc, qwords):
    for val in qwords:
        acc = acc * x^64 + val * x^32
    return mod_P(acc)

At this width, the math can be fiddled around with slightly to do * x^32 once at the end of the loop rather than in every loop iteration:

def crc32c_qwords(acc, qwords):
    acc = acc * x^32 + qwords[0]
    for i in range(1, len(qwords)):
        acc = acc * x^64 + qwords[i]
    acc = acc * x^32
    return mod_P(acc)

In preparation for SIMD, we can go wider still, and operate on 128 bits at a time (r32, r128[] ↦ r32_z):

def crc32c_xwords(acc, xwords):
    acc = acc * x^96 + xwords[0]
    for i in range(1, len(xwords)):
        acc = acc * x^128 + xwords[i]
    acc = acc * x^32
    return mod_P(acc)

Furthermore, multiple accumulators could be used (this being a standard latency-hiding trick):

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = acc0 * x^256 + xwords[i + 0]
        acc1 = acc1 * x^256 + xwords[i + 1]
    acc = acc0 * x^128 + acc1
    acc = acc * x^32
    return mod_P(acc)

This is starting to look very similar to crc32_4k_pclmulqdq, albeit we've got a way to go still: having even more accumulators, implementing polynomial + and *, preventing the accumulators from getting too big, and implementing the final mod_P.

To prevent the accumulators from getting too big, we can insert some extra mod_P calls (remembering that these can be freely inserted at any point in the computation):

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mod_P(acc0 * x^256)
        acc1 = mod_P(acc1 * x^256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = acc0 * x^128 + acc1
    acc = acc * x^32
    return mod_P(acc)

This solves the accumulator problem, albeit at the cost of introducing the problem of implementing the extra mod_P calls. That problem is slightly fiddly, so we'll come back to it later, and instead address polynomial + and * next.

For small N, every CPU implements Z2 polynomial + with signature pN_z + pN_z ↦ pN_z — it is the xor instruction on x86, and the eor instruction on ARM. Every CPU also implements Z2 polynomial + with signature rN_z + rN_z ↦ rN_z — it is also the xor instruction (in the same way that the add instruction is both u32 and i32).

Modern CPUs implement Z2 polynomial * under the name of pclmulqdq (x86) or pmull / pmull2 (ARM). The kernel of this instruction is normally seen as p64 * p64 ↦ p127_z, though it can equally be seen as r64 * r64 ↦ r127_z. Value-preserving conversion from p127_z to p128 is a no-op (like going from u32_z to u64), whereas value-preserving conversion from r127_z to r128 requires a left-shift by one (like going from i32_z to i64 requires a sign extension). When operating on r128, a right-shift by one is multiplication by x^1. Doing both shifts in sequence results in no shift at all, and just a multiplication by x^1. For N ≤ 64 and M ≤ 64, the kernel can alternatively be seen as pN_z * pM_z ↦ p(N+M-1)_z or as rN_z * rM_z ↦ r(N+M-1)_z.

As the output of the kernel can have up to 127 bits, it goes into a 128-bit SIMD register. This causes the two inputs to the kernel to also come from 128-bit SIMD registers, and thus causes the type signature of the overall instruction to be p128, p128 ↦ p127_z or r128, r128 ↦ r127_z. To get the inputs down from p128 to p64 (or from r128 to r64), floor_divide by x^64 is done, and then either the quotient or the remainder is used as the 64-bit input to the kernel. On x86, an immediate argument to pclmulqdq controls whether to take the quotient or the remainder of each input. On ARM, there is slightly less flexibility: either both inputs have their remainder taken or both have their quotient taken. In practice, the decreased flexibility is rarely a problem.
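
To make the kernel concrete, here is a plain-Python model of it that I find helpful (my own sketch; inputs and outputs are ordinary ints whose bit i is the coefficient of x^i, and the immediate decoding follows the usual pclmulqdq encoding, where bit 0 of the immediate selects the half of the first operand and bit 4 the half of the second — so the 0x00 and 0x10 immediates seen in the C snippets in this series mean low-of-first with low-of-second, and low-of-first with high-of-second, respectively):

def clmul64(a: int, b: int) -> int:     # the kernel: p64 * p64 ↦ p127_z
    acc = 0
    for i in range(64):
        if (b >> i) & 1:
            acc ^= a << i
    return acc

def pclmulqdq(a: int, b: int, imm: int) -> int:    # a, b: 128-bit ints
    lo = (1 << 64) - 1
    a64 = (a >> 64) if (imm & 0x01) else (a & lo)  # immediate bit 0 picks a's half
    b64 = (b >> 64) if (imm & 0x10) else (b & lo)  # immediate bit 4 picks b's half
    return clmul64(a64, b64)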

With this knowledge of how to do polynomial + and *, we can return to the question of how to do acc0 = mod_P(acc0 * x^256). The type of acc0 is going to be r128 in order to match the SIMD register width, and a floor_divide by x^64 can split it into two r64 pieces:

q, r = floor_divide(acc0, x^64)
assert acc0 == q * x^64 + r

We can then do some arbitrary-looking algebraic manipulations of the expression we want:

   mod_P(acc0 * x^256)
== mod_P((q * x^64 + r) * x^256)
== mod_P(q * x^(256+64) + r * x^256)
== mod_P(q * x^(256+64-33) * x^33 + r * x^(256-33) * x^33)

The next trick is where the magic happens: because of the mod_P call at the end of crc32c_xwords, we're free to insert or remove mod_P calls anywhere else. We previously used this freedom to insert some extra mod_P calls, and we're going to use it again now to remove those mod_P calls and insert new ones elsewhere:

   mod_P(q * x^(256+64-33)  * x^33 + r *       x^(256-33)  * x^33)
-> q * mod_P(x^(256+64-33)) * x^33 + r * mod_P(x^(256-33)) * x^33

Note that these two expressions give different values (hence -> rather than ==), but this difference evaporates away after the mod_P call at the end of crc32c_xwords. This new expression is very convenient to compute: q has type r64_z, and mod_P(x^(256+64-33)) is a constant with type r32_z, so their product has type r95_z. A value-preserving conversion from r95_z to r128 involves a left-shift by 33 bits, and then * x^33 is a right-shift by 33 bits. These two shifts cancel each other out, meaning that we effectively get the * x^33 for free. In other words, a single CPU instruction can obtain q from acc0 and compute q * mod_P(x^(256+64-33)) * x^33. Similarly, a single CPU instruction can obtain r from acc0 and compute r * mod_P(x^(256-33)) * x^33. Adding the two terms is another CPU instruction - pxor on x86, or eor on ARM (see also eor3 on ARM, and also note that the M1 can fuse pmull with eor, albeit the fused operation still takes two retire slots). We can wrap this expression up in a function, call it mul_xN_mod_P, and make use of it in crc32c_xwords:

def mul_xN_mod_P(acc, N): # r128, int ↦ r128
    q, r = floor_divide(acc, x^64)
    return q * mod_P(x^(N+64-33)) * x^33 + r * mod_P(x^(N-33)) * x^33
def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mul_xN_mod_P(acc0, 256)
        acc1 = mul_xN_mod_P(acc1, 256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = acc0 * x^128 + acc1
    acc = acc * x^32
    return mod_P(acc)

This mul_xN_mod_P is also useful for doing acc = acc0 * x^128 + acc1 at the end:

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    for i in range(2, len(xwords), 2):
        acc0 = mul_xN_mod_P(acc0, 256)
        acc1 = mul_xN_mod_P(acc1, 256)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
    acc = mul_xN_mod_P(acc0, 128) + acc1
    return mod_P(acc * x^32)

At this point, only three problems remain: having even more accumulators, precomputing the various mod_P(x^M) constants, and computing the final mod_P(acc * x^32).

The optimal number of accumulators depends on the throughput and latency of the CPU's polynomial + and * operations. + is typically latency 1 and throughput better than or equal to the throughput of *. On recent Intel chips, * (i.e. pclmulqdq) has a latency of between 5 and 7 cycles, and a throughput of 1 per cycle. On the M1, * (i.e. pmull / pmull2) has a latency of 3 cycles, and a throughput of 4 per cycle (these numbers are incredible compared to Intel's). We need two * and one + for each mul_xN_mod_P call. This makes 4 (or possibly 5) accumulators optimal on Intel, and 12 accumulators optimal on M1. A version with four accumulators looks like:

def crc32c_xwords(acc, xwords):
    acc0 = xwords[0] + acc * x^96
    acc1 = xwords[1]
    acc2 = xwords[2]
    acc3 = xwords[3]
    for i in range(4, len(xwords), 4):
        acc0 = mul_xN_mod_P(acc0, 128*4)
        acc1 = mul_xN_mod_P(acc1, 128*4)
        acc2 = mul_xN_mod_P(acc2, 128*4)
        acc3 = mul_xN_mod_P(acc3, 128*4)
        acc0 += xwords[i + 0]
        acc1 += xwords[i + 1]
        acc2 += xwords[i + 2]
        acc3 += xwords[i + 3]
    acc0 = mul_xN_mod_P(acc0, 128) + acc1
    acc2 = mul_xN_mod_P(acc2, 128) + acc3
    acc = mul_xN_mod_P(acc0, 128*2) + acc2
    return mod_P(acc * x^32)

If we have a hardware crc32c instruction, then the remaining two problems are easy. mod_P(acc * x^32) is just crc32c(0, acc). In our case, acc is r128, and the widest hardware crc32c instruction only takes r64, but following the pattern from the start of this post, a wide CRC can be constructed from two narrow CRCs:

def crc32c_128(acc, val): # r32, r128 ↦ r32_z
    q, r = floor_divide(val, x^64)
    acc = crc32c_64(acc, q)
    acc = crc32c_64(acc, r)
    return acc

Meanwhile, mod_P(x^M) is just acc = x^(M % 32) followed by M//32 iterations of acc = mod_P(acc * x^32), and the latter is just a CRC instruction. In concrete C code, this is just:

uint32_t magic(uint64_t M) { // int ↦ r32_z
    uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
    for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0);
    return acc;
}
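
As a cross-check on what magic is computing, here is a pure-Python model I wrote (the name magic_model is mine): it reduces x^M modulo P directly and then bit-reverses the 32-bit remainder into the r32 encoding, which should match the value produced by magic(M):

def magic_model(M: int) -> int:
    P = 0x11EDC6F41                          # x^32 + x^28 + ... + 1
    r = 1 << M                               # x^M
    while r.bit_length() > 32:               # reduce modulo P
        r ^= P << (r.bit_length() - 33)
    return int(format(r, "032b")[::-1], 2)   # p32 ↦ r32 (bit reversal)

assert magic_model(0) == 0x80000000          # x^0 in the r32 encoding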

For bonus points, note that magic can be implemented to run in O(log(M)) time rather than O(M) time:

uint32_t magic(uint64_t M) {
    uint64_t stack = ~(uint64_t)1;
    for (; M > 191; M = (M >> 1) - 16) {
        stack = (stack << 1) + (M & 1);
    }
    stack = ~stack;
    uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
    for (M >>= 5; M; --M) acc = _mm_crc32_u32(acc, 0);
    for (;;) {
        uint32_t low = stack & 1;
        if (!(stack >>= 1)) return acc;
        __m128i x = _mm_cvtsi32_si128(acc);
        uint64_t y = _mm_cvtsi128_si64(_mm_clmulepi64_si128(x, x, 0));
        acc = _mm_crc32_u64(0, y << low);
    }
}

For more bonus points, the ARM translation of this is:

uint32_t magic(uint64_t M) {
    uint64_t stack = ~(uint64_t)1;
    for (; M > 191; M = (M >> 1) - 16) {
        stack = (stack << 1) + (M & 1);
    }
    stack = ~stack;
    uint32_t acc = ((uint32_t)0x80000000) >> (M & 31);
    for (M >>= 5; M; --M) acc = __crc32cw(acc, 0);
    for (;;) {
        uint32_t low = stack & 1;
        if (!(stack >>= 1)) return acc;
        poly8x8_t x = vreinterpret_p8_u64(vmov_n_u64(acc));
        uint64_t y = vgetq_lane_u64(vreinterpretq_u64_p16(vmull_p8(x, x)), 0);
        acc = __crc32cd(0, y << low);
    }
}

If we do not have a hardware crc32c instruction, or we want to change P to something which lacks a hardware instruction, then things are harder, but not insurmountable. All we have to do is build the equivalent of the 32-bit CRC instruction:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    return mod_P(acc)

Expanding the mod_P gets us to:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9  + x^8  + x^6 + 1
    q, r = floor_divide(acc, P)
    return r

We can compute q by Barrett reduction, and then use that to compute r:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9  + x^8  + x^6 + 1
    q = (acc * (x^63 // P)) // x^63              # Anything ≥ 63 works
    r = acc - q * P
    return r

As we're operating in Z2, the - in r = acc - q * P can be replaced by +. As deg(r) < 32 and deg(acc) ≥ 32, the acc in r = acc - q * P can be replaced with term-slicing. After doing both these things, we're left with:

def crc32cw(acc, val): # r32, r32 ↦ r32_z
    acc = (acc + val) * x^32
    P = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 +
        x^18 + x^14 + x^13 + x^11 + x^10 + x^9  + x^8  + x^6 + 1
    q = (acc * (x^63 // P)) // x^63              # Anything ≥ 63 works
    return (q * P)[0:32]

As concrete C code, this is:

uint32_t crc32cw(uint32_t acc, uint32_t val) {
    __m128i k = _mm_setr_epi32(0, 0xdea713f1, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi32_si128(acc ^ val);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}

Where the magic numbers in k come from:

def floor_divide(S, P):
    Q = 0
    while S:
        d = S.bit_length() - P.bit_length()
        if d < 0:
            break
        Q ^= 1 << d
        S ^= P << d
    return Q, S

def rev32(F):
    G = 0
    for _ in range(32):
        G = (G << 1) + (F & 1)
        F >>= 1
    assert F == 0
    return G

P = 0x11EDC6F41
Q, _ = floor_divide(1 << 63, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)

We can then build a 64-bit CRC function as two calls to crc32cw, or directly:

uint32_t crc32cx(uint32_t acc, uint64_t val) {
    __m128i k = _mm_setr_epi32(0xdea713f1, 0x4869ec38, 0x05ec76f1, 1);
    __m128i a = _mm_cvtsi64_si128(acc ^ val);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}

Where the magic numbers in k come from:

P = 0x11EDC6F41
Q, _ = floor_divide(1 << 95, P)
print(hex(rev32(Q >> 32)))
print(hex(rev32(Q & 0xffffffff)))
print(hex(rev32(P >> 1)))
print(P & 1)

For bonus points, the zlib CRC equivalent (P = 0x104C11DB7) is the following:

uint32_t crc32x(uint32_t acc, uint64_t val) {
    __m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
    __m128i a = _mm_cvtsi64_si128(acc ^ val);
    __m128i b = _mm_clmulepi64_si128(a, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);
}

This can be plugged into Chromium's implementation to give a more succinct and slightly faster tail; everything under "Fold 128-bits to 64-bits" and "Barret reduce to 32-bits" and "Return the crc32" can be replaced by:

    __m128i k = _mm_setr_epi32(0xf7011641, 0xb4e5b025, 0xdb710641, 1);
    __m128i b = _mm_clmulepi64_si128(x1, k, 0x00);
    __m128i c = _mm_clmulepi64_si128(b, k, 0x10);
    c = _mm_blend_epi16(c, _mm_setzero_si128(), 0xcf); // or: c = _mm_and_si128(c, _mm_setr_epi32(0, 0, ~0, 0));
    __m128i a = _mm_xor_si128(c, x1);
    b = _mm_clmulepi64_si128(a, k, 0x01);
    c = _mm_clmulepi64_si128(b, k, 0x10);
    return _mm_extract_epi32(c, 2);

Barrett reduction for polynomials

Let F denote some field (e.g. the real numbers, or integers modulo some prime), and let F[x] denote the ring of polynomials over that field, and let P, S, T denote elements of that polynomial ring. Polynomial division can then be defined; division of one polynomial by another yields a (unique) quotient and a (unique) remainder. Following the Python 3 notation where // denotes floor division, let S // P denote the quotient obtained from dividing S by P. The key property of Barrett reduction is then:

Provided that deg(S // T) ≤ 0, S // P equals (S * (T // P)) // T.

This lets us turn division by P into multiplication by T // P followed by division by T. Said division is trivial if T is chosen to be x^n (with n ≥ deg(S) to make the precondition hold), and T // P can be precomputed if P is fixed.

Proving that this property is true turns out to be surprisingly elusive; all the good expositions on the internet are about Barrett reduction for integers rather than for polynomials. Accordingly, I'm providing a proof in this post. The proof proceeds in four parts:

  1. Defining / (true division rather than floor division) for (extended) polynomials, and defining term-slicing P[j:k] for (extended) polynomials.
  2. Showing that S // P equals (S * (T / P)) // T.
  3. Showing that (S*(T/P))//T equals (S*(T//P))//T plus (S*(T/P)[-∞:0])//T.
  4. Showing that (S*(T/P)[-∞:0])//T equals zero. This final part is the only part which requires the precondition of deg(S // T) ≤ 0.

These four parts combine in the obvious way to prove the overall reduction property.

Part 1: Definitions

P being an element of the polynomial ring F[x] just means that P equals sum(p_i * x^i), where i runs over some finite set of non-negative integers, and each p_i is a non-zero element of F. Note that deg(P) denotes the maximum such i from this set (or -∞ if this set is empty).

Extended polynomials can be defined similarly; P equals sum(p_i * x^i), where i runs over some (possibly infinite) set of (possibly negative) integers, and each p_i is a non-zero element of F. The requirement on p_i being non-zero can also be dropped, provided that deg(P) is redefined as the maximum i such that p_i is non-zero. As described, deg(P) could be +∞, but we'll restrict ourselves to deg(P) being finite (or being -∞ for the zero polynomial).

Every polynomial is trivially an extended polynomial. Going the other way is harder; an extended polynomial can only be trivially turned into a polynomial when it has a finite number of terms and has no x^i terms with negative i.

It is also useful to define term-slicing; given P and integer j and integer k, P[j:k] denotes sum(p_i * x^i) restricted to just j ≤ i < k. As a shorthand, P[j] denotes P[j:j+1]. It is trivially true that P equals P[-∞:j] plus P[j:+∞].

Addition and multiplication of extended polynomials follows that of normal polynomials. Division is where things get interesting; the whole point of introducing extended polynomials is to allow the definition of true division (/) for extended polynomials, which is just like floor division (//) for normal polynomials, except that it produces a quotient with an infinite number of (negative i) terms (and no remainder), rather than terminating with a finite quotient and a remainder. For comparison, the division algorithms are:

def floor_division(S, P):
    Q = 0
    while S != 0 and deg(S) >= deg(P):
        T = S[deg(S)] / P[deg(P)]
        Q = Q + T
        S = S - T * P
    return Q, S
def true_division(S, P):
    Q = 0
    while S != 0:
        T = S[deg(S)] / P[deg(P)]
        Q = Q + T
        S = S - T * P
    return Q

Note that as written true_division often never terminates. For actual computation, it needs to be re-expressed as a generator which yields terms of Q in descending i order (and then only a finite number of terms used). Both algorithms involve division of monomials, which is just division of the field elements and subtraction of the x powers.

Given the similarity in algorithms, it is clear that S // P equals (S / P)[0:+∞]. Various other properties hold; most notably (S / P) // T equals S // (P * T).

One benefit of true division over floor division is that true division is associative with multiplication; that is (S / P) * T equals S / (P * T). Accordingly, S / P * T can expressed without any parentheses. Additionally, it commutes with multiplication, so S / P * T equals S * T / P, and it commutes with itself in so far as S / P / T equals S / T / P.

Part 2: S // P equals (S * (T / P)) // T

This is just algebraic manipulation:

S // P == (S      / P    ) // 1
       == (S *  T / P / T) // 1
       == (S *  T / P    ) // T
       == (S * (T / P)   ) // T

Part 3: (S*(T/P))//T equals (S*(T//P))//T plus (S*(T/P)[-∞:0])//T

This involves term-slicing on T / P, splitting it into (T / P)[-∞:0] plus (T / P)[0:+∞], then noting that the latter is T // P:

(S*(T/P))//T == ( S*((T /P)[0:+∞] +    (T/P)[-∞:0]) )//T
             == ( S*((T//P)       +    (T/P)[-∞:0]) )//T
             ==  (S* (T//P))//T   + (S*(T/P)[-∞:0] )//T

Part 4: (S*(T/P)[-∞:0])//T equals zero

Start with some algebraic manipulation:

(S*(T/P)[-∞:0])//T == ( S    * (T/P)[-∞:0]  )//T
                   == ( S    * (T/P)[-∞:0]/T)//1
                   == ((S/T) * (T/P)[-∞:0]  )//1
                   == ((S/T) * (T/P)[-∞:0]  )[0:+∞]

As deg(S // T) ≤ 0 (by hypothesis), it is also the case that deg(S / T) ≤ 0. Clearly also deg((T/P)[-∞:0]) < 0 by the nature of term-slicing. Hence the deg of their product is also less than zero (as multiplication of polynomials sums their degrees). Hence the [0:+∞] term-slicing results in no terms — or in other words, results in the zero polynomial.
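
Having proved the property, a quick randomized sanity check never hurts. Here is one I wrote (a sketch, not part of the original proof), over the rationals so that the field divisions are visible, with polynomials represented as lists of coefficients (index i holding the coefficient of x^i):

from fractions import Fraction
import random

def deg(p):
    return len(p) - 1 if p else -1

def trim(p):
    while p and p[-1] == 0:
        p.pop()
    return p

def mul(a, b):
    r = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            r[i + j] += ai * bj
    return trim(r)

def floordiv(s, p):
    s, q = list(s), [Fraction(0)] * max(deg(s) - deg(p) + 1, 1)
    while deg(s) >= deg(p):
        d = deg(s) - deg(p)
        c = s[-1] / p[-1]
        q[d] = c
        for i, pi in enumerate(p):
            s[i + d] -= c * pi
        trim(s)
    return trim(q)

random.seed(3)
for _ in range(200):
    P = [Fraction(random.randint(-3, 3)) for _ in range(5)] + [Fraction(1)]  # deg(P) = 5
    S = [Fraction(random.randint(-3, 3)) for _ in range(9)] + [Fraction(1)]  # deg(S) = 9
    T = [Fraction(0)] * 9 + [Fraction(1)]                                    # T = x^9, so deg(S // T) = 0
    assert floordiv(mul(S, floordiv(T, P)), T) == floordiv(S, P)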
