corsix.org

ARM processors have an rbit instruction for reversing bits. On these processors, the following function can be executed as three instructions; rbit followed by rbit followed by cmp, with the result in flags:

bool rbit_lt(uint32_t a, uint32_t b) {
  return __builtin_arm_rbit(a) < __builtin_arm_rbit(b);
}

x86 processors lack an rbit instruction, but following a derivation from the expired patent US 6347318 B1, there is a different three-instruction sequence for x86 processors which achieves the same thing. Start by pulling out the less-than comparison:

bool rbit_lt(uint32_t a, uint32_t b) {
  return lt(__builtin_arm_rbit(a), __builtin_arm_rbit(b));
}

bool lt(uint32_t a, uint32_t b) {
  return a < b;
}

Then replace the less-than comparison with a much more complicated version. This is based on the observation that the less-than comparison only cares about the most significant bit position where the two inputs differ. In turn, most significant bit position is just reversed least significant bit position, and standard bit tricks can be used to get the least significant set bit:

bool lt(uint32_t a, uint32_t b) {
  return (msb(a ^ b) & b) != 0;
}

uint32_t msb(uint32_t x) {
  return __builtin_arm_rbit(lsb(__builtin_arm_rbit(x)));
}

uint32_t lsb(uint32_t x) {
  return x & -x;
}

Then merge (the more complicated) lt back into rbit_lt:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (msb(__builtin_arm_rbit(a ^ b)) & __builtin_arm_rbit(b)) != 0;
}

Then expand msb:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (__builtin_arm_rbit(lsb(a ^ b)) & __builtin_arm_rbit(b)) != 0;
}

Then the reversals can be dropped:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (lsb(a ^ b) & b) != 0;
}

This is three instructions; xor followed by blsi followed by test, with the result in flags. The blsi instruction is part of the BMI1 instruction set, and thus not present on older x86 processors. If we care about older processors, then xor-lsb can be replaced with with equivalent sub-lsb:

bool rbit_lt(uint32_t a, uint32_t b) {
  return (lsb(a - b) & b) != 0;
}

Then lsb expanded:

bool rbit_lt(uint32_t a, uint32_t b) {
  return ((a - b) & (b - a) & b) != 0;
}

This comes out to four instructions on ARM; sub, sub, and, and tst. As x86 lacks relevant three-operand operations, it comes out to five on x86: mov, sub, sub, and, and test (or sub, mov, neg, and, and test), though the mov will be almost free on recent x86 processors.

Software RAID is a mechanism for aggregating multiple hard drives together, with the aim of improving at least one of:

Storage capacity.
Data throughput rate (read speed, or write speed, or both).
Mean time between failure (MTBF).

Optimising purely for storage capacity is easy; given N hard drives each with 1 terabyte of storage capacity, it isn't hard to imagine how to aggregate them into a system with N terabytes of storage capacity. Optimising purely for data throughput rate is similarly easy. Optimising for MTBF is the interesting part. In order to improve MTBF, the aggregation needs to ensure that data is not lost even when some hard drives fail. It isn't hard to see that some capacity needs to be sacrified in order to improve MTBF. One simple approach would be to write the same data to every hard drive, so N hard drives each with 1 terabyte of storage capacity would yield an aggregate with just 1 terabyte of storage capacity, but any N-1 drives could fail without data loss. Another simple approach would be to arrange the hard drives in pairs and write any given piece of data to both drives in some pair, yielding an aggregate system with N/2 terabytes of storage capacity and which allows one hard drive failure within each pair. A slightly more complex approach would be to nominate one hard drive as storing the XOR of all the others, yielding N-1 terabytes of storage and allowing one hard drive failure.

Compared against these simple aggregations, a Reed-Solomon (RS) aggregation initially looks like magic. An RS(10,4) system is an aggregation of 14 hard drives which has the capacity of 10 hard drives, and allows any 4 of the 14 to fail without data being lost. The general RS(N,M) construction requires N+M hard drives, has the capacity of N hard drives, and allows any M of the N+M to fail without data being lost. The system is described at the level of hard drives, but the underlying mechanism operates at the level of small words. As hard drives store bytes, it is useful for a byte to be an integer number of words, or for a word to be an integer number of bytes. For this reason, the word size is typically one of: 1 bit, 2 bits, 4 bits, 8 bits, 16 bits, 32 bits, or 64 bits.

In the framing of matrix math, the RS(N,M) system can be viewed as taking a vector of N words and multiplying it by some N×(N+M) matrix (of words) to give a vector of N+M words. The N+M words are then stored on N+M different hard drives, and any N of the output N+M words can be used to recover all N original input words. For this system to work, "all" we need is for every N×N submatrix of the N×(N+M) matrix to be invertible. To see why, assume we have N output words, then take the N×N submatrix which gives just those outputs, invert it, and multiply it by the outputs to give the original inputs. Note that throughout this piece I'm using the most general meaning of submatrix; a submatrix is formed by deleting zero or more rows and zero or more columns - in particular, a submatrix need not be a contiguous block of the original matrix. It is convenient if some N×N submatrix of the N×(N+M) matrix is the identity matrix, as then N of the output words are identical to N of the input words, and this is actually easy to achieve: take any N×N submatrix, invert it, and then multiply this inverse with the full N×(N+M) matrix. Accordingly, we can view the N×(N+M) matrix as the concatenation of an N×N identity matrix with an N×M matrix.

The problem thus decomposes to:

Coming up with an N×(N+M) matrix which happens to have every N×N submatrix be invertible.
Being able to invert N×N matrices when the time comes.
Efficiently doing K×N by N×M matrix multiplication for large K (K of 1 is the basic case, but in practice K will be the file system block size divided by the word size).

For step 2, it helps if all the non-zero elements of the matrix are themselves invertible. This isn't true in general for integers modulo 2^W, so the W-bit words need to be viewed as elements of GF(2^W) rather than integers modulo 2^W.

For step 1, there are a few common approaches: Vandermonde matrices, Cauchy matrices, and special case constructions. The latter two approaches tend to operate directly on the view of the concatenation of an N×N identity matrix with an N×M matrix. Every N×N submatrix of such a concatenation being invertible is equivalent to every square submatrix (of any size) of the N×M matrix being invertible. To see why, we can use matrix determinants and Laplace expansion: first note that matrix invertibility is equivalent to having a non-zero determinant, then consider some N×N submatrix of the concatenation, and do Laplace expansion of all the parts of the submatrix which came from the identity part of the concatenation. After repeated expansion, the remaining minor will be some square submatrix of the N×M matrix, and modulo signs, the determinant of this minor will equal the determinant of the N×N submatrix.

Special case constructions are worth considering first. Firstly, if the word size is just 1 bit, then the only N×M matrix meeting the invertibility requirement is the N×1 matrix with a value of one in every position: no zeroes can appear in the N×M matrix (as the 1×1 square submatrix containing that zero isn't invertible), and 1-bit words can only be zero or one, so the N×M matrix has to be entirely ones, at which point M ≥ 2 is impossible (as any 2×2 square submatrix consisting entirely of ones isn't invertible). This special case is exactly the case of having one hard drive store the XOR of all the others. One possible conclusion is that word sizes larger than 1 bit are the key component of doing better than the simple approaches outlined at the start of this piece. The second special case to consider is M of 2 with the N×M matrix being the stacking of two N×1 vectors: the first vector consisting entirely of ones, and the second vector containing N distinct non-zero elements of GF(2^W). This meets the invertibility requirements (every 1×1 submatrix is non-zero, and every 2×2 submatrix can have its determinant easily computed and seen to be non-zero). Such a matrix can be constructed provided that N < 2^W, and is a common RAID6 construction.

Next up are Cauchy matrices. A square Cauchy matrix is always invertible, and every submatrix of a Cauchy matrix is a Cauchy matrix. Taken together, this means that an N×M Cauchy matrix meets the requirement of every square submatrix being invertible. Furthermore, an N×M Cauchy matrix is easy to construct: choose N+M distinct elements of GF(2^W), call the first N of these X, and the last M of these Y, then the N×M matrix has inv(X[i] ^ Y[j]) in position i,j. Such a matrix can be constructed provided that N+M ≤ 2^W. For RS(10,4) this would mean that words need to be at least 4 bits.

Finally we get to Vandermonde matrices. This construction gives an N×(N+M) matrix meeting the requirement of every N×N submatrix being invertible, which then requires further processing to get to the concatenation-with-identity form. Some texts instead directly concatenate an N×M Vandermonde matrix to an identity matrix, but this does not work in general. Again, N+M distinct elements of GF(2^W) are chosen, but this time each element gives rise to an N×1 vector. For an element e, the vector is [1, e, e*e, e*e*e, ...]. These vectors are stacked to give an N×(N+M) Vandermonde matrix. Every N×N submatrix of this N×(N+M) matrix is itself a Vandermonde matrix, and as the N elements which define the submatrix are distinct, the determinant is non-zero, and thus the submatrix is invertible.

Vandermonde matrices are appealing because they contain lots of 1 values; the first element in every vector is 1, and an entire vector of 1 is obtained when e == 1. If presence of these 1 values in particular locations can be presumed, then some computationally expensive GF(2^W) multiplications can be elided. For example, in an RS(10,4) system, if all these 1 values could be presumed, then the number of GF(2^W) multiplications required to calculate 4 checksum words from 10 input words reduces from 40 to just 27. Unfortunately, the transformation of the N×(N+M) Vandermonde matrix to concatenation-with-identity form need not preserve any of these nice 1 values. All is not lost though; if we have an identity matrix concatencated with an N×M matrix (be that a transformed Vandermonde matrix, or a Cauchy matrix, or any other construction), then the N×M matrix can be transformed in certain ways without affecting the invertibility properties of (every square submatrix of) that N×M matrix. In particular, any row and/or column can be scaled by any non-zero constant without affecting invertibility. This means we can look at every row in turn, and scale that row by the inv of the row's first element, and then do the same for every column. This will give an N×M matrix where the first row and the first column are both entirely 1.

As a worked example for RS(10,4), here is a 10×14 Vandermonde matrix in GF(2⁸) with x⁸ == x⁴ + x³ + x + 1, elements given as their hex representation:

01 01 01 01 01 01 01 01 01 01 01 01 01 01
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d
00 01 04 05 10 11 14 15 40 41 44 45 50 51
00 01 08 0f 40 55 78 6b 36 7f 9e d1 ed b0
00 01 10 11 1b 1a 0b 0a ab aa bb ba b0 b1
00 01 20 33 6c 72 3a 36 2f 8d c2 72 01 bc
00 01 40 55 ab a1 9c 82 63 89 d5 2b 0c ed
00 01 80 ff 9a 13 65 a3 35 ad 43 3e 50 5d
00 01 1b 1a 5e 5f 45 44 b3 b2 a8 a9 ed ec
00 01 36 2e 63 38 85 c7 ef 55 7c df b0 50

The inverse of the leftmost 10×10 submatrix is:

01 ee 01 ab 42 42 00 45 43 43
00 01 ef ee 45 07 45 45 00 43
00 ad 68 99 a5 66 8a 45 78 28
00 3f 3e dc 0c 23 cf 45 50 28
00 c5 2f 88 b6 8c 0f 45 9b 89
00 0d ed cd 70 c9 4a 45 12 89
00 95 90 ad e0 92 85 45 e8 f2
00 1d fd e8 b2 d7 c0 45 1a f2
00 ce 92 17 f1 3a 00 00 90 10
00 f3 85 17 cb 3a 00 00 80 10

Multiplying these two matrices gives concatenation-with-identity form:

01 00 00 00 00 00 00 00 00 00 81 96 b3 da
00 01 00 00 00 00 00 00 00 00 96 81 da b3
00 00 01 00 00 00 00 00 00 00 af b8 6e 06
00 00 00 01 00 00 00 00 00 00 b8 af 06 6e
00 00 00 00 01 00 00 00 00 00 d2 c4 0c 65
00 00 00 00 00 01 00 00 00 00 c4 d2 65 0c
00 00 00 00 00 00 01 00 00 00 fe e8 d5 bd
00 00 00 00 00 00 00 01 00 00 e8 fe bd d5
00 00 00 00 00 00 00 00 01 00 03 02 05 04
00 00 00 00 00 00 00 00 00 01 02 03 04 05

Then rescaling rows and columns of the rightmost 10×4 block to give 1 values along its leading row and column:

01 00 00 00 00 00 00 00 00 00 01 01 01 01
00 01 00 00 00 00 00 00 00 00 01 2c 5e 2e
00 00 01 00 00 00 00 00 00 00 01 45 4e 1e
00 00 00 01 00 00 00 00 00 00 01 2d c6 b1
00 00 00 00 01 00 00 00 00 00 01 d9 7a 94
00 00 00 00 00 01 00 00 00 00 01 fe d0 56
00 00 00 00 00 00 01 00 00 00 01 5e 53 da
00 00 00 00 00 00 00 01 00 00 01 2e da 7e
00 00 00 00 00 00 00 00 01 00 01 30 f6 85
00 00 00 00 00 00 00 00 00 01 01 3c a4 d5

It can be exhaustively verified that every 10×10 submatrix is invertible.

Alternatively, for doing a Cauchy construction, start with N distinct values vertically, and M more horizontally:

     0a 0b 0c 0d
    ------------
00 |
01 |
02 |
03 |
04 |
05 |
06 |
07 |
08 |
09 |

Then construct the N×M matrix one element at a time by doing GF(2⁸) inv of the XOR of the corresponding vertical value and horizontal value:

     0a 0b 0c 0d
    ------------
00 | 29 c0 b0 e1
01 | c0 29 e1 b0
02 | e8 4f e5 c7
03 | 4f e8 c7 e5
04 | e5 c7 e8 4f
05 | c7 e5 4f e8
06 | b0 e1 29 c0
07 | e1 b0 c0 29
08 | 8d f6 cb 52
09 | f6 8d 52 cb

Then rescale each row and column to give 1 values along the leading row and column:

     0a 0b 0c 0d
    ------------
00 | 01 01 01 01
01 | 01 2c 5e 2e
02 | 01 45 4e 1e
03 | 01 2d c6 b1
04 | 01 d9 7a 94
05 | 01 fe d0 56
06 | 01 5e 53 da
07 | 01 2e da 7e
08 | 01 30 f6 85
09 | 01 3c a4 d5

Then remove the guide values and instead concatenate with an identity matrix:

01 00 00 00 00 00 00 00 00 00 01 01 01 01
00 01 00 00 00 00 00 00 00 00 01 2c 5e 2e
00 00 01 00 00 00 00 00 00 00 01 45 4e 1e
00 00 00 01 00 00 00 00 00 00 01 2d c6 b1
00 00 00 00 01 00 00 00 00 00 01 d9 7a 94
00 00 00 00 00 01 00 00 00 00 01 fe d0 56
00 00 00 00 00 00 01 00 00 00 01 5e 53 da
00 00 00 00 00 00 00 01 00 00 01 2e da 7e
00 00 00 00 00 00 00 00 01 00 01 30 f6 85
00 00 00 00 00 00 00 00 00 01 01 3c a4 d5

This happens to be the exact same 10×14 matrix as given by the Vandermonde construction. This is no coincidence; it can be shown that taking the inverse of a square Vandermonde matrix and multiplying it by a different Vandermonde matrix (which is effectively what the Vandermonde construction does for the rightmost N×M block) yields a Cauchy matrix, modulo some per-row and per-column scaling. That scaling, whatever it is, gets undone by forcing the leading row and column to 1.

CPUs like to operate on data types such as uint8_t, uint16_t, uint32_t, and uint64_t. These data types can be viewed as integers modulo 2⁸, 2¹⁶, 2³², and 2⁶⁴ respectively. Addition and multiplication can be defined on these integers modulo 2^N in a way which is familiar to most people; for example, in uint8_t math, we have 199 + 201 == 144 and 16 * 20 == 64. uint8_t math can be viewed as infinite-precision integer math combined with the reduction rule of 256 == 0; for example in infinite-precision integer math we have 199 + 201 == 400 == 256 + 144 and 16 * 20 == 320 == 256 + 64.

Reduction rules of the form 2^N == 0 (for N > 1) are CPU-friendly, but not particularly math-friendly. On the other hand, reduction rules of the form P == 0 for some prime number P are math-friendly, but not CPU-friendly. With infinite-precision integer math combined with the reduction rule of say 251 == 0, we have 199 + 201 == 149 and 16 * 20 == 69. As 251 is prime, this reduction rule is math-friendly. By that, I mean that for any x other than zero, there is some y such that x * y == 1. For example, taking x of 16, we have 16 * 204 == 1. This property is not true for the 256 == 0 reduction rule; with that rule, there is no y such that 16 * y == 1. Where it exists, this y can be called inv(x) or 1/x.

If we want to keep the CPU-friendly uint8_t, uint16_t, uint32_t, and uint64_t data types, and also keep the math-friendly property of having inv(x) exist for all non-zero x, then we are actually in luck, albeit we have to use some exotic definitions of addition and multiplication. The exotic definitions come from polynomial math rather than integer math. A string of bits b₀, b₁, ..., b_n-1 can be viewed as the polynomial b₀ + b₁ * x¹ + ... + b_n-1 * x^n-1, at which point addition or multiplication of two strings of bits can be done by converting both strings to polynomials, doing addition or multiplication of those two polynomials, and then converting the resultant polynomial back to a bit string. That final back-conversion step requires that every b_i in the resultant polynomial is either 0 or 1, which we can ensure by applying the reduction rule 2 == 0 to every b_i. In the case of multiplication, the resultant bit string (or, equivalently, polynomial) is often going to be twice as long as its inputs. This should not come as a surprise, as the same thing happens with infinite-precision integer math. Continuing the trend of non-surprising things, just as reduction rules can be used to bring infinite-precision integer math down to fixed bit lengths, reduction rules can also be used to bring infinite-precision polynomial math down to fixed bit lengths. As yet another non-surprise, some polynomial reduction rules are math-friendly, and others are not. Some math-friendly polynomial reduction rules are:

x⁸ == x⁴ + x³ + x + 1
x¹⁶ == x⁵ + x³ + x + 1
x³² == x⁷ + x³ + x² + 1
x⁶⁴ == x⁴ + x³ + x + 1

(These reduction rules come from Table of Low-Weight Binary Irreducible Polynomials, and in some sense are the simplest reduction rules for each xⁱ on the left hand side of the ==.)

By choosing one of these reduction rules, we can define exotic addition and multiplication for any of uint8_t, uint16_t, uint32_t, and uint64_t. For example, exotic addition or multiplication of two uint8_t values can be done by converting both values to polynomials, doing infinite-precision polynomial addition or multiplication, applying the reduction rule x⁸ == x⁴ + x³ + x + 1 to the polynomial (to get rid of x⁸, x⁹, and all other higher terms), and then converting the polynomial back to a string of 8 bits (remembering the 2 == 0 reduction applied to each b_i in the polynomial). It turns out that this exotic addition doesn't depend on the choice of polynomial reduction rule, and that the 2 == 0 reduction means that this exotic addition is actually just bitwise XOR of the original values. Unfortunately, exotic multiplication does depend on the choice of polynomial reduction rule, and for any math-friendly choice, exotic multiplication doesn't turn out to be the same as an everyday operation.

With one of these polynomial reduction rules in hand, we can define exotic addition and multiplication for each of the uint8_t, uint16_t, uint32_t, and uint64_t data types. Furthermore, assuming that the polynomial reduction rule is math-friendly, this exotic multiplication has an inv(x) which exists for all non-zero x. In other words, we have what mathematicians call a field. As the number of elements (2⁸, 2¹⁶, 2³², and 2⁶⁴, respectively) is finite in every case, they are furthermore called Galois fields. This is shortened to GF, giving GF(2⁸), GF(2¹⁶), GF(2³²), and GF(2⁶⁴).

With that long introduction done, the question becomes: how can addition, multiplication, and inv of GF(2⁸) / GF(2¹⁶) / GF(2³²) / GF(2⁶⁴) values be done efficiently on contemporary CPUs? Addition is just XOR, so the interesting questions are around multiplication and inv.

For GF(2⁸), inv could be implemented as a uint8_t[256] lookup table. Multiplication could be implemented as a uint8_t[256][256] lookup table. This latter table is starting to get a little large, but there's a time/space tradeoff possible: analogues to log and exp exist in GF(2^N), and both of log and exp can be implemented as uint8_t[256] lookup tables. Multiplication of two values thus becomes two log lookups, an integer addition, and an exp lookup.

For GF(2¹⁶), the same broad strokes apply, with each of inv and log and exp implementable as uint16_t[65536] lookup tables.

For GF(2³²) and GF(2⁶⁴), the lookup tables would become too big, so alternative approaches are required. Recall that multiplication consists of two major substeps: infinite-precision polynomial multiplication, and then applying a reduction rule. For the infinite-precision polynomial multiplication part, some uncommon CPU instructions come to our aid:

On x86 / amd64, pclmulqdq takes two 64-bit polynomials and returns the 128-bit polynomial product. As a fun extra twist, the inputs are actually 128-bit registers, and an immediate operand can be used to select between the low/high 64-bit halves of each input.
On ARMv8-A / AArch64, pmul takes two 8-bit polynomials, computes the 16-bit polynomial product, and then returns the low 8 terms of that product. As a fun extra twist, this is done in SIMD fashion, either 8-wide or 16-wide.
On ARMv8-A / AArch64, pmull takes two 8-bit polynomials or two 64-bit polynomials and returns the 16-bit or 128-bit polynomial product. The variant with a 16-bit result is done as 8-side SIMD. The variant with a 128-bit result is equivalent to pclmulqdq (selecting the low halves of each input). It is notable that pmull followed by eor is one of the few fused instruction patterns on the Apple M1.
On ARMv8-A / AArch64, pmull2 is like pmull, but selecting the high halves of each input. The variant with a 128-bit result is equivalent to pclmulqdq (selecting the high halves of each input).

The pclmulqdq instruction is available as the _mm_clmulepi64_si128 intrinsic:

#include <stdint.h>
#include <wmmintrin.h>

__m128i poly_mul(uint64_t a, uint64_t b) {
    return _mm_clmulepi64_si128(_mm_cvtsi64_si128(a),
                                _mm_cvtsi64_si128(b), 0);
}

To complete the GF(2⁶⁴) multiplication, this infinite-precision polynomial multiplication needs to be followed by reduction back down to 64 bits using a suitable reduction rule. With the x⁶⁴ == x⁴ + x³ + x + 1 rule, the right hand side of the rule can be represented as the bit string (1 << 4) + (1 << 3) + (1 << 1) + (1 << 0), or 0x1b for short. This rule can be applied in full by using pclmulqdq twice:

uint64_t poly_reduce(__m128i x, uint64_t r = 0x1b) {
    __m128i xr = _mm_cvtsi64_si128(r);
    __m128i x2 = _mm_clmulepi64_si128(x, xr, 1);
    x ^= x2;
    x ^= _mm_clmulepi64_si128(x2, xr, 1);
    return _mm_cvtsi128_si64(x);
}

The two pieces combine to give multiplication in GF(2⁶⁴):

uint64_t gf_mul(uint64_t a, uint64_t b, uint64_t r = 0x1b) {
    return poly_reduce(poly_mul(a, b), r);
}

If performing a dot product rather than just one multiplication, then the reduction step can be performed just once rather than after every multiplication:

uint64_t gf_dot(const uint64_t* as, const uint64_t* bs,
                uint32_t n, uint64_t r = 0x1b) {
    __m128i x = _mm_setzero_si128();
    for (uint32_t i = 0; i < n; ++i) {
        x ^= poly_mul(as[i], bs[i]);
    }
    return poly_reduce(x, r);
}

Thanks to some neat mathematical properties, the inv operation can be implemented in terms of multiplication:

uint64_t gf_inv(uint64_t x, uint64_t r = 0x1b) {
    uint64_t y = x = gf_mul(x, x, r);
    for (int i = 2; i < 64; ++i) {
      x = gf_mul(x, x, r);
      y = gf_mul(x, y, r);
    }
    return y;
}

For GF(2³²), things are actually slightly harder, as shifts by 32 bits are required in the reduction step rather than shifts by 64 bits, and pclmulqdq / pmull2 can only do a "free" shift by 64 bits, so explicit shift/shuffle instructions are required. One interesting option is to use the x86 crc32 instruction, as the reduction x³² == x²⁸ + x²⁷ + x²⁶ + x²⁵ + x²³ + x²² + x²⁰ + x¹⁹ + x¹⁸ + x¹⁴ + x¹³ + x¹¹ + x¹⁰ + x⁹ + x⁸ + x⁶ + 1 is performed within this instruction, but this forces you into a particular reduction rule, and also subjects you to the other steps performed within crc32, namely some shifts and bit reflections.

Going back down to GF(2⁸), very recent Intel chips have added three very relevant instructions under the name of GFNI (Galois Field New Instructions). The first of these is gf2p8mulb, which takes two uint8_t values, performs infinite-precision polynomial multiplication, and then does reduction using the x⁸ == x⁴ + x³ + x + 1 rule to return a uint8_t value. This is available at any SIMD width, i.e. 16 bytes at a time in xmm registers, 32 bytes at a time in ymm registers (AVX), or 64 bytes at a time in zmm registers (AVX512). The other two instructions are gf2p8affineqb and gf2p8affineinvqb, which both follow the same sequence of steps:

Take some uint8_t value as input.
Optionally perform the inv operation on the uint8_t (with reference to x⁸ == x⁴ + x³ + x + 1). gf2p8affineinvqb performs this step, whereas gf2p8affineqb skips it.
Form a new uint8_t where every bit of the output uint8_t is the horizontal XOR of any subset of bits from the input uint8_t.
XOR the uint8_t with an eight-bit immediate.

Step 3 requires a uint64_t operand, as there are eight output bits, and each of those requires an eight-bit mask to specify the input subset. Both of gf2p8affineqb and gf2p8affineinvqb are available at any SIMD width, i.e. 16/32/64 bytes at a time. The second input to the instruction is another vector of the same width; it specifies the operands for step 3, but as this operand is uint64_t rather than uint8_t, the same operand is used for each group of eight bytes. If just the inv from step 2 is desired, then step 3 can be disabled by specifying that output bit i is formed from just input bit i, and step 4 can be disabled by specifying a constant of zero.

(As an aside, step 3 on its own allows for all sorts of bit tricks. Amongst other things, the bits within an uint8_t can be permuted or shifted in any fashion, including reversal or rotation.)

One final case to consider is multiplication by a constant in GF(2⁸). That is, multiplying a with c for lots of different values of a, where both a and c are uint8_t. This extends to doing dot products where one of the vectors is constant, or doing matrix/vector multiplication where the matrix is constant. One option would be to replace c with a uint8_t[256] table, at which point multiplication against c is just a table lookup. A related observation is that gf_mul(a, c) == gf_mul(a & 0xf0, c) ^ gf_mul(a & 0x0f, c), and both of the multiplications on the right only have 16 possible inputs for the a & 0xf0 and a & 0x0f terms. This means that they can both be implemented with a uint8_t[16] table, bringing the memory requirement down from uint8_t[256] to uint8_t[16+16]. At this smaller size, the uint8_t[16] tables can exist in SIMD registers rather than in memory, and table lookup becomes a pshufb (amd64) or tbl (AArch64) instruction rather than a memory load. Even better, multiple such table lookups can be performed in parallel for any SIMD width. This is the approach used by Intel's ISA-L EC routines.

Starting with CUDA 10.1, nVidia introduced a new library called cuBLASLt, as an alternative to the cuBLAS function cublasGemmEx (and related gemm routines). The reasons for using cuBLASLt include, but are not limited to:

To be able to specify the CUDA stream, and cuBLAS workspace memory, on a per-function-call basis.
To make use of certain nVidia-provided epilogues fused into the matrix multiplication kernel.
To make use of integer tensor cores on sufficiently modern GPUs.

On the other hand, there's one big reason for not using cuBLASLt: it is significantly more complicated to use than cublasGemmEx.

To start, a cublasLtHandle_t is required - this can come from cublasLtCreate, or an existing cublasHandle_t can be cast to cublasLtHandle_t. However, lots of setup needs to be performed before cublasLtMatmul can be used to actually compute a matrix multiplication. The first piece of setup is to initialise a cublasLtMatmulDesc_t object describing some of the attributes of the matrix multiplication. Such an object is created by cublasLtMatmulDescCreate followed by zero or more calls to cublasLtMatmulDescSetAttribute. Some of the common attributes on this object are:

Attribute	Default Value
`CUBLASLT_MATMUL_DESC_COMPUTE_TYPE`	`cublasLtMatmulDescCreate` parameter
`CUBLASLT_MATMUL_DESC_SCALE_TYPE`	`cublasLtMatmulDescCreate` parameter
`CUBLASLT_MATMUL_DESC_POINTER_MODE`	`CUBLASLT_POINTER_MODE_HOST`
`CUBLASLT_MATMUL_DESC_TRANSA`	`CUBLAS_OP_N`
`CUBLASLT_MATMUL_DESC_TRANSB`	`CUBLAS_OP_N`
`CUBLASLT_MATMUL_DESC_TRANSC`	`CUBLAS_OP_N`
`CUBLASLT_MATMUL_DESC_EPILOGUE`	`CUBLASLT_EPILOGUE_DEFAULT` (none)

Next up, a cublasLtMatrixLayout_t object needs to be initialised for each of the three matrix shapes involved in the matrix multiplication. Such an object is created by cublasLtMatrixLayoutCreate followed by zero or more calls to cublasLtMatrixLayoutSetAttribute. Some of the common attributes on this object are:

Attribute	Default Value
`CUBLASLT_MATRIX_LAYOUT_TYPE`	`cublasLtMatrixLayoutCreate` parameter
`CUBLASLT_MATRIX_LAYOUT_ROWS`	`cublasLtMatrixLayoutCreate` parameter
`CUBLASLT_MATRIX_LAYOUT_COLS`	`cublasLtMatrixLayoutCreate` parameter
`CUBLASLT_MATRIX_LAYOUT_LD`	`cublasLtMatrixLayoutCreate` parameter
`CUBLASLT_MATRIX_LAYOUT_ORDER`	`CUBLASLT_ORDER_COL`
`CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT`	`1` (not batched)

The third thing which cublasLtMatmul requires is a cublasLtMatmulAlgo_t, but such a thing isn't created directly. Instead, a cublasLtMatmulPreference_t object is initialised, and cublasLtMatmulAlgoGetHeuristic is used to take all of the previously created objects and spit out a list of potential cublasLtMatmulAlgo_t objects. A cublasLtMatmulPreference_t object is created by cublasLtMatmulPreferenceCreate followed by zero or more calls to cublasLtMatmulPreferenceSetAttribute. Some of the common attributes on this object are:

Attribute	Default Value
`CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES`	`0`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES`	`256`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES`	`256`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES`	`256`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES`	`256`

With all these objects in hand, cublasLtMatmulAlgoGetHeuristic can be called to populate an array of cublasLtMatmulHeuristicResult_t objects. Once populated, the algo field is a ready-to-use cublasLtMatmulAlgo_t. This field is relatively opaque, but some information about it can be obtained if desired. This information comes in three places: firstly there are other fields on cublasLtMatmulHeuristicResult_t (namely wavesCount and workspaceSize), secondly read-only attributes can be queried using cublasLtMatmulAlgoCapGetAttribute (e.g. CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS and CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES through CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES), and thirdly read-write attributes can be queried using cublasLtMatmulAlgoConfigGetAttribute.

With a cublasLtMatmulAlgo_t object chosen (typically the front of the array passed to cublasLtMatmulAlgoGetHeuristic), cublasLtMatmul can finally be used. This function computes D = alpha * (A @ B) + beta * C, which when C == D is equivalent to cublasGemmEx.

That concludes basic usage notes, but if CUBLAS_COMPUTE_32I (or CUBLAS_COMPUTE_32I_PEDANTIC) is being used, then there's another whole chapter of usage notes. This chapter starts by noting that the list of supported configurations for integer matrix multiplication is (at least currently) very limited:

CUDA_R_32I destination, computation using legacy CUDA cores:
- scaleType must be CUDA_R_32I, but only 0 or 1 supported
- A matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
- B matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
- C matrix must be CUDA_R_32I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
- CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
CUDA_R_8I destination, computation using legacy CUDA cores:
- scaleType must be CUDA_R_32F
- A matrix must be CUDA_R_8I with CUBLAS_OP_T
- B matrix must be CUDA_R_8I with CUBLAS_OP_N
- C matrix must be CUDA_R_8I
- CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
CUDA_R_32I destination, computation using integer tensor cores:
- scaleType must be CUDA_R_32I, but only 0 or 1 supported
- A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
- B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
- C matrix must be CUDA_R_32I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
- CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
CUDA_R_8I destination, computation using integer tensor cores:
- scaleType must be CUDA_R_32F
- A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
- B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
- C matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32

Of particular note are the strange CUBLASLT_MATRIX_LAYOUT_ORDER values required for using integer tensor cores. In C/C++ terminology, CUBLASLT_ORDER_COL can be thought of as a two-dimensional array indexed as [I][J], with the [J] part packed densely, and the leading dimension specifying how to advance by one in the [I] part. In the same terminology, CUBLASLT_ORDER_COL32 can be thought of as a three-dimensional array indexed as [I/32][J][I%32], with the [J][I%32] part packed densely, and the leading dimension specifying how to advance by one in the [I/32] part.

The CUBLASLT_ORDER_COL4_4R2_8C and CUBLASLT_ORDER_COL32_2R_4R4 layouts are even more exotic. Rather than trying to explain their layout, it is best to consider them to be completely opaque layouts. Thankfully, the cublasLtMatrixTransform function is provided to convert between layouts, so a matrix can be constructed using a known simple layout (such as CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW) and then converted to CUBLASLT_ORDER_COL4_4R2_8C or CUBLASLT_ORDER_COL32_2R_4R4 using cublasLtMatrixTransform. To use cublasLtMatrixTransform, a cublasLtMatrixTransformDesc_t object is required. Such an object is created by cublasLtMatrixTransformDescCreate followed by zero or more calls to cublasLtMatrixTransformDescSetAttribute. Some of the common attributes on this object are:

Attribute	Default Value
`CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE`	`cublasLtMatrixTransformDescCreate` parameter
`CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE`	`CUBLASLT_POINTER_MODE_HOST`
`CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA`	`CUBLAS_OP_N`
`CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB`	`CUBLAS_OP_N`

cublasLtMatrixTransform can also be used to convert in or out of CUBLASLT_ORDER_COL32, but said conversions obviously come with a time cost, so it is better to keep matrices in the CUBLASLT_ORDER_COL32 format, and perform all operations on them in that layout. This might mean rewriting a bunch of CUDA kernels to understand the layout, if said kernels care about the two dimensional structure of the matrix.

For very many years now, x86 and x64 processors have been able to take an arithmetic instruction immediately followed by a conditional jump instruction, and execute the two instructions as if they were a single instruction (which is called "macro-op fusion"). When writing very hot loops in assembly code by hand, this can effectively give you an extra instruction for free.

In Intel chips, fusion first appeared in the original Core microarchitecture, launched in 2006. An improved version of fusion made its debut in the Nehalem microarchitecture (Nhlm) a few years later in 2008. The Sandy Bridge microarchitecture (SnBr) released in 2011 contained yet more improvements to fusion. Things remained fairly stable through until Cannon Lake (CnLk), but then Ice Lake in 2019 removed support for a few cases of fusion (INC and DEC instructions, and instructions with memory operands).

In AMD chips, fusion first appeared in the Bulldozer microarchitecture (Bdzr), launched in 2011. It supported fusing TEST and CMP instructions with all flag-based conditional jumps. Things remained fairly stable through until Zen 3 in 2020, which added support for a number of other instructions.

The table below summarises which instructions can fuse with which jumps on which microarchitectures. Combinations which are supported, but which are pointless in practice, are parenthesised.

	TEST	AND	OR XOR	CMP	ADD SUB	INC DEC
J[N]O	(Core+)	(SnBr+)	-	-	-	-
J[N]S	Core+	SnBr+	-	-	-	-
J[N]P JP[EO]	Core+	SnBr+	-	-	-	-
J[N]B J[N]AE J[N]C	(Core+)	(SnBr+)	-	Core+	SnBr+	-
J[N]BE J[N]A	(Core+)	(SnBr+)	-	Core+	SnBr+	-
J[N]Z J[N]E	Core+	SnBr+	-	Core+	SnBr+	SnBr-CnLk
J[N]L J[N]GE	(Core+)	(SnBr+)	-	Nhlm+	SnBr+	SnBr-CnLk
J[N]G J[N]LE	Core+	SnBr+	-	Nhlm+	SnBr+	SnBr-CnLk
	TEST	AND	OR XOR	CMP	ADD SUB	INC DEC
J[N]O	(Bdzr+)	(Zen3+)	(Zen3+)	Bdzr+	Zen3+	Zen3+
J[N]S	Bdzr+	Zen3+	Zen3+	Bdzr+	Zen3+	Zen3+
J[N]P JP[EO]	Bdzr+	Zen3+	Zen3+	Bdzr+	Zen3+	Zen3+
J[N]B J[N]AE J[N]C	(Bdzr+)	(Zen3+)	(Zen3+)	Bdzr+	Zen3+	(Zen3+)
J[N]BE J[N]A	(Bdzr+)	(Zen3+)	(Zen3+)	Bdzr+	Zen3+	(Zen3+)
J[N]Z J[N]E	Bdzr+	Zen3+	Zen3+	Bdzr+	Zen3+	Zen3+
J[N]L J[N]GE	(Bdzr+)	(Zen3+)	(Zen3+)	Bdzr+	Zen3+	Zen3+
J[N]G J[N]LE	Bdzr+	Zen3+	Zen3+	Bdzr+	Zen3+	Zen3+

The following paragraphs describe the various conditional jumps, and in particular what it means to fuse with ADD.

Jumps which depend only upon the result bits (for TEST, the result is what AND would give, and for CMP, the result is what SUB would give):

	Description	Flags	Fusion (SnBr)
JS	Jump if sign	`SF = 1`	`TEST` `AND`
JNS	Jump if not sign	`SF = 0`	`TEST` `AND`
JP JPE	Jump if parity Jump if parity even	`PF = 1`	`TEST` `AND`
JNP JPO	Jump if not parity Jump if parity odd	`PF = 0`	`TEST` `AND`
JZ JE	Jump if zero Jump if equal	`ZF = 1`	`TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB`
JNZ JNE	Jump if not zero Jump if not equal	`ZF = 0`	`TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB`

SF is a copy of the most significant bit of the result (which for conceptually signed results, indicates a result less than zero), PF is set based on the population count of the low 8 bits of the result (PF = 1 implies that said population count was even), and ZF is set if all bits of the result are zero (which for CMP and SUB implies that the two operands were equal).

Jumps which conceptually operate on unsigned operands:

	Description	Flags	Fusion (SnBr)
JB JNAE JC	Jump if below Jump if not above or equal Jump if carry	`CF = 1`	`CMP` `ADD` `SUB`
JNB JAE JNC	Jump if not below Jump if above or equal Jump if not carry	`CF = 0`	`CMP` `ADD` `SUB`
JBE JNA	Jump if below or equal Jump if not above	`CF = 1 or ZF = 1`	`CMP` `ADD` `SUB`
JNBE JA	Jump if not below or equal Jump if above	`CF = 0 and ZF = 0`	`CMP` `ADD` `SUB`

For an N-bit result, CF is the (N+1)^th bit. Fusion with TEST and AND is technically possible, but CF is always zero following these instructions, making it pointless. The "above" and "below" descriptions are from the perspective of CMP or SUB with unsigned operands. To extend the description to ADD, consider the instruction to be taking two N-bit unsigned operands, zero-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "above" and "below" are whether this (N+1)-bit signed result is above or below zero.

Jumps which conceptually operate on signed operands:

	Description	Flags	Fusion (SnBr)
JO	Jump if overflow	`OF = 1`	-
JNO	Jump if not overflow	`OF = 0`	-
JL JNGE	Jump if less Jump if not greater or equal	`SF ≠ OF`	`INC` `DEC` `CMP` `ADD` `SUB`
JNL JGE	Jump if not less Jump if greater or equal	`SF = OF`	`INC` `DEC` `CMP` `ADD` `SUB`
JNG JLE	Jump if not greater Jump if less or equal	`SF ≠ OF or ZF = 1`	`TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB`
JG JNLE	Jump if greater Jump if not less or equal	`SF = OF and ZF = 0`	`TEST` `AND` `INC` `DEC` `CMP` `ADD` `SUB`

Conceptually, the arithmetic instructions take two N-bit signed operands (the 2^nd operand being 1 for INC and DEC), sign extend both to (N+1)-bits, compute an (N+1)-bit signed result, then OF is set to the exclusive-or of the two most signicant bits of those (N+1). This means that SF ≠ OF corresponds to the most significant bit of this signed (N+1)-bit result (in the same way that CF corresponds to the most significant bit of the (N+1)-bit result when the operands are zero-extended). In other words, OF is set if sign-extending the N-bit result to (N+1)-bits gives a different result to sign-extending the operands to (N+1)-bits and then performing (N+1)-bit arithmetic. Fusion with TEST and AND is technically possible for all these jumps, but OF is always zero following TEST or AND, making many combinations pointless. The "greater" and "less" descriptions are from the perspective of CMP or SUB with signed operands. To extend the description to ADD, consider the instruction to be taking two N-bit signed operands, sign-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "greater" and "less" are whether this (N+1)-bit signed result is greater or less than zero. The INC and DEC instructions are equivalent to ADD and SUB with the 2^nd operand being 1 (except for INC and DEC not touching CF).

In the context of optimising hot loops, it is common to be iterating over the half-open range [0, N), by some step S. Given all the references above to comparing to zero, it is often beneficial to transform this range to [-N, 0), as then the loop-control instructions can be ADD reg, S followed by a conditional jump, rather than ADD reg, S followed by CMP reg, N followed by a conditional jump. In the case of fusing ADD reg, S with a conditional jump, if N is known to be non-zero, then the conditional jump can be JA or JAE (or synonyms thereof), relying on CF = 1 occuring when the counter transitions from negative to non-negative. If N might zero, and a single iteration of the loop is desired (or harmless) even when N is zero, then the conditional jump cannot be JA or JAE, but instead can be JL (or a synonym thereof), relying on sign extension of both operands to (N+1) bits and then jumping if the (N+1)-bit result is less than zero.

Comparison after bit reversal

Reed-Solomon for software RAID

Galois field instructions on 2021 CPUs

cuBLASLt notes

x86 macro-op fusion notes