Galois field instructions on 2021 CPUs

CPUs like to operate on data types such as uint8_t, uint16_t, uint32_t, and uint64_t. These data types can be viewed as integers modulo 2⁸, 2¹⁶, 2³², and 2⁶⁴ respectively. Addition and multiplication can be defined on these integers modulo 2^N in a way which is familiar to most people; for example, in uint8_t math, we have 199 + 201 == 144 and 16 * 20 == 64. uint8_t math can be viewed as infinite-precision integer math combined with the reduction rule of 256 == 0; for example in infinite-precision integer math we have 199 + 201 == 400 == 256 + 144 and 16 * 20 == 320 == 256 + 64.

Reduction rules of the form 2^N == 0 (for N > 1) are CPU-friendly, but not particularly math-friendly. On the other hand, reduction rules of the form P == 0 for some prime number P are math-friendly, but not CPU-friendly. With infinite-precision integer math combined with the reduction rule of say 251 == 0, we have 199 + 201 == 149 and 16 * 20 == 69. As 251 is prime, this reduction rule is math-friendly. By that, I mean that for any x other than zero, there is some y such that x * y == 1. For example, taking x of 16, we have 16 * 204 == 1. This property is not true for the 256 == 0 reduction rule; with that rule, there is no y such that 16 * y == 1. Where it exists, this y can be called inv(x) or 1/x.

If we want to keep the CPU-friendly uint8_t, uint16_t, uint32_t, and uint64_t data types, and also keep the math-friendly property of having inv(x) exist for all non-zero x, then we are actually in luck, albeit we have to use some exotic definitions of addition and multiplication. The exotic definitions come from polynomial math rather than integer math. A string of bits b₀, b₁, ..., b_n-1 can be viewed as the polynomial b₀ + b₁ * x¹ + ... + b_n-1 * x^n-1, at which point addition or multiplication of two strings of bits can be done by converting both strings to polynomials, doing addition or multiplication of those two polynomials, and then converting the resultant polynomial back to a bit string. That final back-conversion step requires that every b_i in the resultant polynomial is either 0 or 1, which we can ensure by applying the reduction rule 2 == 0 to every b_i. In the case of multiplication, the resultant bit string (or, equivalently, polynomial) is often going to be twice as long as its inputs. This should not come as a surprise, as the same thing happens with infinite-precision integer math. Continuing the trend of non-surprising things, just as reduction rules can be used to bring infinite-precision integer math down to fixed bit lengths, reduction rules can also be used to bring infinite-precision polynomial math down to fixed bit lengths. As yet another non-surprise, some polynomial reduction rules are math-friendly, and others are not. Some math-friendly polynomial reduction rules are:

x⁸ == x⁴ + x³ + x + 1
x¹⁶ == x⁵ + x³ + x + 1
x³² == x⁷ + x³ + x² + 1
x⁶⁴ == x⁴ + x³ + x + 1

(These reduction rules come from Table of Low-Weight Binary Irreducible Polynomials, and in some sense are the simplest reduction rules for each xⁱ on the left hand side of the ==.)

By choosing one of these reduction rules, we can define exotic addition and multiplication for any of uint8_t, uint16_t, uint32_t, and uint64_t. For example, exotic addition or multiplication of two uint8_t values can be done by converting both values to polynomials, doing infinite-precision polynomial addition or multiplication, applying the reduction rule x⁸ == x⁴ + x³ + x + 1 to the polynomial (to get rid of x⁸, x⁹, and all other higher terms), and then converting the polynomial back to a string of 8 bits (remembering the 2 == 0 reduction applied to each b_i in the polynomial). It turns out that this exotic addition doesn't depend on the choice of polynomial reduction rule, and that the 2 == 0 reduction means that this exotic addition is actually just bitwise XOR of the original values. Unfortunately, exotic multiplication does depend on the choice of polynomial reduction rule, and for any math-friendly choice, exotic multiplication doesn't turn out to be the same as an everyday operation.

With one of these polynomial reduction rules in hand, we can define exotic addition and multiplication for each of the uint8_t, uint16_t, uint32_t, and uint64_t data types. Furthermore, assuming that the polynomial reduction rule is math-friendly, this exotic multiplication has an inv(x) which exists for all non-zero x. In other words, we have what mathematicians call a field. As the number of elements (2⁸, 2¹⁶, 2³², and 2⁶⁴, respectively) is finite in every case, they are furthermore called Galois fields. This is shortened to GF, giving GF(2⁸), GF(2¹⁶), GF(2³²), and GF(2⁶⁴).

With that long introduction done, the question becomes: how can addition, multiplication, and inv of GF(2⁸) / GF(2¹⁶) / GF(2³²) / GF(2⁶⁴) values be done efficiently on contemporary CPUs? Addition is just XOR, so the interesting questions are around multiplication and inv.

For GF(2⁸), inv could be implemented as a uint8_t[256] lookup table. Multiplication could be implemented as a uint8_t[256][256] lookup table. This latter table is starting to get a little large, but there's a time/space tradeoff possible: analogues to log and exp exist in GF(2^N), and both of log and exp can be implemented as uint8_t[256] lookup tables. Multiplication of two values thus becomes two log lookups, an integer addition, and an exp lookup.

For GF(2¹⁶), the same broad strokes apply, with each of inv and log and exp implementable as uint16_t[65536] lookup tables.

For GF(2³²) and GF(2⁶⁴), the lookup tables would become too big, so alternative approaches are required. Recall that multiplication consists of two major substeps: infinite-precision polynomial multiplication, and then applying a reduction rule. For the infinite-precision polynomial multiplication part, some uncommon CPU instructions come to our aid:

On x86 / amd64, pclmulqdq takes two 64-bit polynomials and returns the 128-bit polynomial product. As a fun extra twist, the inputs are actually 128-bit registers, and an immediate operand can be used to select between the low/high 64-bit halves of each input.
On ARMv8-A / AArch64, pmul takes two 8-bit polynomials, computes the 16-bit polynomial product, and then returns the low 8 terms of that product. As a fun extra twist, this is done in SIMD fashion, either 8-wide or 16-wide.
On ARMv8-A / AArch64, pmull takes two 8-bit polynomials or two 64-bit polynomials and returns the 16-bit or 128-bit polynomial product. The variant with a 16-bit result is done as 8-side SIMD. The variant with a 128-bit result is equivalent to pclmulqdq (selecting the low halves of each input). It is notable that pmull followed by eor is one of the few fused instruction patterns on the Apple M1.
On ARMv8-A / AArch64, pmull2 is like pmull, but selecting the high halves of each input. The variant with a 128-bit result is equivalent to pclmulqdq (selecting the high halves of each input).

The pclmulqdq instruction is available as the _mm_clmulepi64_si128 intrinsic:

#include <stdint.h>
#include <wmmintrin.h>

__m128i poly_mul(uint64_t a, uint64_t b) {
    return _mm_clmulepi64_si128(_mm_cvtsi64_si128(a),
                                _mm_cvtsi64_si128(b), 0);
}

To complete the GF(2⁶⁴) multiplication, this infinite-precision polynomial multiplication needs to be followed by reduction back down to 64 bits using a suitable reduction rule. With the x⁶⁴ == x⁴ + x³ + x + 1 rule, the right hand side of the rule can be represented as the bit string (1 << 4) + (1 << 3) + (1 << 1) + (1 << 0), or 0x1b for short. This rule can be applied in full by using pclmulqdq twice:

uint64_t poly_reduce(__m128i x, uint64_t r = 0x1b) {
    __m128i xr = _mm_cvtsi64_si128(r);
    __m128i x2 = _mm_clmulepi64_si128(x, xr, 1);
    x ^= x2;
    x ^= _mm_clmulepi64_si128(x2, xr, 1);
    return _mm_cvtsi128_si64(x);
}

The two pieces combine to give multiplication in GF(2⁶⁴):

uint64_t gf_mul(uint64_t a, uint64_t b, uint64_t r = 0x1b) {
    return poly_reduce(poly_mul(a, b), r);
}

If performing a dot product rather than just one multiplication, then the reduction step can be performed just once rather than after every multiplication:

uint64_t gf_dot(const uint64_t* as, const uint64_t* bs,
                uint32_t n, uint64_t r = 0x1b) {
    __m128i x = _mm_setzero_si128();
    for (uint32_t i = 0; i < n; ++i) {
        x ^= poly_mul(as[i], bs[i]);
    }
    return poly_reduce(x, r);
}

Thanks to some neat mathematical properties, the inv operation can be implemented in terms of multiplication:

uint64_t gf_inv(uint64_t x, uint64_t r = 0x1b) {
    uint64_t y = x = gf_mul(x, x, r);
    for (int i = 2; i < 64; ++i) {
      x = gf_mul(x, x, r);
      y = gf_mul(x, y, r);
    }
    return y;
}

For GF(2³²), things are actually slightly harder, as shifts by 32 bits are required in the reduction step rather than shifts by 64 bits, and pclmulqdq / pmull2 can only do a "free" shift by 64 bits, so explicit shift/shuffle instructions are required. One interesting option is to use the x86 crc32 instruction, as the reduction x³² == x²⁸ + x²⁷ + x²⁶ + x²⁵ + x²³ + x²² + x²⁰ + x¹⁹ + x¹⁸ + x¹⁴ + x¹³ + x¹¹ + x¹⁰ + x⁹ + x⁸ + x⁶ + 1 is performed within this instruction, but this forces you into a particular reduction rule, and also subjects you to the other steps performed within crc32, namely some shifts and bit reflections.

Going back down to GF(2⁸), very recent Intel chips have added three very relevant instructions under the name of GFNI (Galois Field New Instructions). The first of these is gf2p8mulb, which takes two uint8_t values, performs infinite-precision polynomial multiplication, and then does reduction using the x⁸ == x⁴ + x³ + x + 1 rule to return a uint8_t value. This is available at any SIMD width, i.e. 16 bytes at a time in xmm registers, 32 bytes at a time in ymm registers (AVX), or 64 bytes at a time in zmm registers (AVX512). The other two instructions are gf2p8affineqb and gf2p8affineinvqb, which both follow the same sequence of steps:

Take some uint8_t value as input.
Optionally perform the inv operation on the uint8_t (with reference to x⁸ == x⁴ + x³ + x + 1). gf2p8affineinvqb performs this step, whereas gf2p8affineqb skips it.
Form a new uint8_t where every bit of the output uint8_t is the horizontal XOR of any subset of bits from the input uint8_t.
XOR the uint8_t with an eight-bit immediate.

Step 3 requires a uint64_t operand, as there are eight output bits, and each of those requires an eight-bit mask to specify the input subset. Both of gf2p8affineqb and gf2p8affineinvqb are available at any SIMD width, i.e. 16/32/64 bytes at a time. The second input to the instruction is another vector of the same width; it specifies the operands for step 3, but as this operand is uint64_t rather than uint8_t, the same operand is used for each group of eight bytes. If just the inv from step 2 is desired, then step 3 can be disabled by specifying that output bit i is formed from just input bit i, and step 4 can be disabled by specifying a constant of zero.

(As an aside, step 3 on its own allows for all sorts of bit tricks. Amongst other things, the bits within an uint8_t can be permuted or shifted in any fashion, including reversal or rotation.)

One final case to consider is multiplication by a constant in GF(2⁸). That is, multiplying a with c for lots of different values of a, where both a and c are uint8_t. This extends to doing dot products where one of the vectors is constant, or doing matrix/vector multiplication where the matrix is constant. One option would be to replace c with a uint8_t[256] table, at which point multiplication against c is just a table lookup. A related observation is that gf_mul(a, c) == gf_mul(a & 0xf0, c) ^ gf_mul(a & 0x0f, c), and both of the multiplications on the right only have 16 possible inputs for the a & 0xf0 and a & 0x0f terms. This means that they can both be implemented with a uint8_t[16] table, bringing the memory requirement down from uint8_t[256] to uint8_t[16+16]. At this smaller size, the uint8_t[16] tables can exist in SIMD registers rather than in memory, and table lookup becomes a pshufb (amd64) or tbl (AArch64) instruction rather than a memory load. Even better, multiple such table lookups can be performed in parallel for any SIMD width. This is the approach used by Intel's ISA-L EC routines.