CPUs like to operate on data types such as `uint8_t`, `uint16_t`, `uint32_t`, and `uint64_t`. These data types can be viewed as integers modulo 28, 216, 232, and 264 respectively. Addition and multiplication can be defined on these integers modulo 2N in a way which is familiar to most people; for example, in `uint8_t` math, we have `199 + 201 == 144` and `16 * 20 == 64`. `uint8_t` math can be viewed as infinite-precision integer math combined with the reduction rule of `256 == 0`; for example in infinite-precision integer math we have `199 + 201 == 400 == 256 + 144` and `16 * 20 == 320 == 256 + 64`.

Reduction rules of the form `2N == 0` (for `N > 1`) are CPU-friendly, but not particularly math-friendly. On the other hand, reduction rules of the form `P == 0` for some prime number `P` are math-friendly, but not CPU-friendly. With infinite-precision integer math combined with the reduction rule of say `251 == 0`, we have `199 + 201 == 149` and `16 * 20 == 69`. As `251` is prime, this reduction rule is math-friendly. By that, I mean that for any `x` other than zero, there is some `y` such that `x * y == 1`. For example, taking `x` of 16, we have `16 * 204 == 1`. This property is not true for the `256 == 0` reduction rule; with that rule, there is no `y` such that `16 * y == 1`. Where it exists, this `y` can be called `inv(x)` or `1/x`.

If we want to keep the CPU-friendly `uint8_t`, `uint16_t`, `uint32_t`, and `uint64_t` data types, and also keep the math-friendly property of having `inv(x)` exist for all non-zero `x`, then we are actually in luck, albeit we have to use some exotic definitions of addition and multiplication. The exotic definitions come from polynomial math rather than integer math. A string of bits `b0`, `b1`, ..., `bn-1` can be viewed as the polynomial `b0 + b1 * x1 + ... + bn-1 * xn-1`, at which point addition or multiplication of two strings of bits can be done by converting both strings to polynomials, doing addition or multiplication of those two polynomials, and then converting the resultant polynomial back to a bit string. That final back-conversion step requires that every `bi` in the resultant polynomial is either `0` or `1`, which we can ensure by applying the reduction rule `2 == 0` to every `bi`. In the case of multiplication, the resultant bit string (or, equivalently, polynomial) is often going to be twice as long as its inputs. This should not come as a surprise, as the same thing happens with infinite-precision integer math. Continuing the trend of non-surprising things, just as reduction rules can be used to bring infinite-precision integer math down to fixed bit lengths, reduction rules can also be used to bring infinite-precision polynomial math down to fixed bit lengths. As yet another non-surprise, some polynomial reduction rules are math-friendly, and others are not. Some math-friendly polynomial reduction rules are:

• `x8 == x4 + x3 + x + 1`
• `x16 == x5 + x3 + x + 1`
• `x32 == x7 + x3 + x2 + 1`
• `x64 == x4 + x3 + x + 1`

(These reduction rules come from Table of Low-Weight Binary Irreducible Polynomials, and in some sense are the simplest reduction rules for each `xi` on the left hand side of the `==`.)

By choosing one of these reduction rules, we can define exotic addition and multiplication for any of `uint8_t`, `uint16_t`, `uint32_t`, and `uint64_t`. For example, exotic addition or multiplication of two `uint8_t` values can be done by converting both values to polynomials, doing infinite-precision polynomial addition or multiplication, applying the reduction rule `x8 == x4 + x3 + x + 1` to the polynomial (to get rid of `x8`, `x9`, and all other higher terms), and then converting the polynomial back to a string of 8 bits (remembering the `2 == 0` reduction applied to each `bi` in the polynomial). It turns out that this exotic addition doesn't depend on the choice of polynomial reduction rule, and that the `2 == 0` reduction means that this exotic addition is actually just bitwise XOR of the original values. Unfortunately, exotic multiplication does depend on the choice of polynomial reduction rule, and for any math-friendly choice, exotic multiplication doesn't turn out to be the same as an everyday operation.

With one of these polynomial reduction rules in hand, we can define exotic addition and multiplication for each of the `uint8_t`, `uint16_t`, `uint32_t`, and `uint64_t` data types. Furthermore, assuming that the polynomial reduction rule is math-friendly, this exotic multiplication has an `inv(x)` which exists for all non-zero `x`. In other words, we have what mathematicians call a field. As the number of elements (28, 216, 232, and 264, respectively) is finite in every case, they are furthermore called Galois fields. This is shortened to GF, giving GF(28), GF(216), GF(232), and GF(264).

With that long introduction done, the question becomes: how can addition, multiplication, and `inv` of GF(28) / GF(216) / GF(232) / GF(264) values be done efficiently on contemporary CPUs? Addition is just XOR, so the interesting questions are around multiplication and `inv`.

For GF(28), `inv` could be implemented as a `uint8_t[256]` lookup table. Multiplication could be implemented as a `uint8_t[256][256]` lookup table. This latter table is starting to get a little large, but there's a time/space tradeoff possible: analogues to `log` and `exp` exist in GF(2N), and both of `log` and `exp` can be implemented as `uint8_t[256]` lookup tables. Multiplication of two values thus becomes two `log` lookups, an integer addition, and an `exp` lookup.

For GF(216), the same broad strokes apply, with each of `inv` and `log` and `exp` implementable as `uint16_t[65536]` lookup tables.

For GF(232) and GF(264), the lookup tables would become too big, so alternative approaches are required. Recall that multiplication consists of two major substeps: infinite-precision polynomial multiplication, and then applying a reduction rule. For the infinite-precision polynomial multiplication part, some uncommon CPU instructions come to our aid:

• On x86 / amd64, `pclmulqdq` takes two 64-bit polynomials and returns the 128-bit polynomial product. As a fun extra twist, the inputs are actually 128-bit registers, and an immediate operand can be used to select between the low/high 64-bit halves of each input.
• On ARMv8-A / AArch64, `pmul` takes two 8-bit polynomials, computes the 16-bit polynomial product, and then returns the low 8 terms of that product. As a fun extra twist, this is done in SIMD fashion, either 8-wide or 16-wide.
• On ARMv8-A / AArch64, `pmull` takes two 8-bit polynomials or two 64-bit polynomials and returns the 16-bit or 128-bit polynomial product. The variant with a 16-bit result is done as 8-side SIMD. The variant with a 128-bit result is equivalent to `pclmulqdq` (selecting the low halves of each input). It is notable that `pmull` followed by `eor` is one of the few fused instruction patterns on the Apple M1.
• On ARMv8-A / AArch64, `pmull2` is like `pmull`, but selecting the high halves of each input. The variant with a 128-bit result is equivalent to `pclmulqdq` (selecting the high halves of each input).

The `pclmulqdq` instruction is available as the `_mm_clmulepi64_si128` intrinsic:

``````#include <stdint.h>
#include <wmmintrin.h>

__m128i poly_mul(uint64_t a, uint64_t b) {
return _mm_clmulepi64_si128(_mm_cvtsi64_si128(a),
_mm_cvtsi64_si128(b), 0);
}
``````

To complete the GF(264) multiplication, this infinite-precision polynomial multiplication needs to be followed by reduction back down to 64 bits using a suitable reduction rule. With the `x64 == x4 + x3 + x + 1` rule, the right hand side of the rule can be represented as the bit string `(1 << 4) + (1 << 3) + (1 << 1) + (1 << 0)`, or `0x1b` for short. This rule can be applied in full by using `pclmulqdq` twice:

``````uint64_t poly_reduce(__m128i x, uint64_t r = 0x1b) {
__m128i xr = _mm_cvtsi64_si128(r);
__m128i x2 = _mm_clmulepi64_si128(x, xr, 1);
x ^= x2;
x ^= _mm_clmulepi64_si128(x2, xr, 1);
return _mm_cvtsi128_si64(x);
}
``````

The two pieces combine to give multiplication in GF(264):

``````uint64_t gf_mul(uint64_t a, uint64_t b, uint64_t r = 0x1b) {
return poly_reduce(poly_mul(a, b), r);
}
``````

If performing a dot product rather than just one multiplication, then the reduction step can be performed just once rather than after every multiplication:

``````uint64_t gf_dot(const uint64_t* as, const uint64_t* bs,
uint32_t n, uint64_t r = 0x1b) {
__m128i x = _mm_setzero_si128();
for (uint32_t i = 0; i < n; ++i) {
x ^= poly_mul(as[i], bs[i]);
}
return poly_reduce(x, r);
}
``````

Thanks to some neat mathematical properties, the `inv` operation can be implemented in terms of multiplication:

``````uint64_t gf_inv(uint64_t x, uint64_t r = 0x1b) {
uint64_t y = x = gf_mul(x, x, r);
for (int i = 2; i < 64; ++i) {
x = gf_mul(x, x, r);
y = gf_mul(x, y, r);
}
return y;
}
``````

For GF(232), things are actually slightly harder, as shifts by 32 bits are required in the reduction step rather than shifts by 64 bits, and `pclmulqdq` / `pmull2` can only do a "free" shift by 64 bits, so explicit shift/shuffle instructions are required. One interesting option is to use the x86 `crc32` instruction, as the reduction `x32 == x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 + x18 + x14 + x13 + x11 + x10 + x9 + x8 + x6 + 1` is performed within this instruction, but this forces you into a particular reduction rule, and also subjects you to the other steps performed within `crc32`, namely some shifts and bit reflections.

Going back down to GF(28), very recent Intel chips have added three very relevant instructions under the name of GFNI (Galois Field New Instructions). The first of these is `gf2p8mulb`, which takes two `uint8_t` values, performs infinite-precision polynomial multiplication, and then does reduction using the `x8 == x4 + x3 + x + 1` rule to return a `uint8_t` value. This is available at any SIMD width, i.e. 16 bytes at a time in `xmm` registers, 32 bytes at a time in `ymm` registers (AVX), or 64 bytes at a time in `zmm` registers (AVX512). The other two instructions are `gf2p8affineqb` and `gf2p8affineinvqb`, which both follow the same sequence of steps:

1. Take some `uint8_t` value as input.
2. Optionally perform the `inv` operation on the `uint8_t` (with reference to `x8 == x4 + x3 + x + 1`). `gf2p8affineinvqb` performs this step, whereas `gf2p8affineqb` skips it.
3. Form a new `uint8_t` where every bit of the output `uint8_t` is the horizontal XOR of any subset of bits from the input `uint8_t`.
4. XOR the `uint8_t` with an eight-bit immediate.

Step 3 requires a `uint64_t` operand, as there are eight output bits, and each of those requires an eight-bit mask to specify the input subset. Both of `gf2p8affineqb` and `gf2p8affineinvqb` are available at any SIMD width, i.e. 16/32/64 bytes at a time. The second input to the instruction is another vector of the same width; it specifies the operands for step 3, but as this operand is `uint64_t` rather than `uint8_t`, the same operand is used for each group of eight bytes. If just the `inv` from step 2 is desired, then step 3 can be disabled by specifying that output bit i is formed from just input bit i, and step 4 can be disabled by specifying a constant of zero.

(As an aside, step 3 on its own allows for all sorts of bit tricks. Amongst other things, the bits within an `uint8_t` can be permuted or shifted in any fashion, including reversal or rotation.)

One final case to consider is multiplication by a constant in GF(28). That is, multiplying `a` with `c` for lots of different values of `a`, where both `a` and `c` are `uint8_t`. This extends to doing dot products where one of the vectors is constant, or doing matrix/vector multiplication where the matrix is constant. One option would be to replace `c` with a `uint8_t[256]` table, at which point multiplication against `c` is just a table lookup. A related observation is that `gf_mul(a, c) == gf_mul(a & 0xf0, c) ^ gf_mul(a & 0x0f, c)`, and both of the multiplications on the right only have 16 possible inputs for the `a & 0xf0` and `a & 0x0f` terms. This means that they can both be implemented with a `uint8_t[16]` table, bringing the memory requirement down from `uint8_t[256]` to `uint8_t[16+16]`. At this smaller size, the `uint8_t[16]` tables can exist in SIMD registers rather than in memory, and table lookup becomes a `pshufb` (amd64) or `tbl` (AArch64) instruction rather than a memory load. Even better, multiple such table lookups can be performed in parallel for any SIMD width. This is the approach used by Intel's ISA-L EC routines.