Galois field instructions on 2021 CPUs

CPUs like to operate on data types such as uint8_t, uint16_t, uint32_t, and uint64_t. These data types can be viewed as integers modulo 28, 216, 232, and 264 respectively. Addition and multiplication can be defined on these integers modulo 2N in a way which is familiar to most people; for example, in uint8_t math, we have 199 + 201 == 144 and 16 * 20 == 64. uint8_t math can be viewed as infinite-precision integer math combined with the reduction rule of 256 == 0; for example in infinite-precision integer math we have 199 + 201 == 400 == 256 + 144 and 16 * 20 == 320 == 256 + 64.

Reduction rules of the form 2N == 0 (for N > 1) are CPU-friendly, but not particularly math-friendly. On the other hand, reduction rules of the form P == 0 for some prime number P are math-friendly, but not CPU-friendly. With infinite-precision integer math combined with the reduction rule of say 251 == 0, we have 199 + 201 == 149 and 16 * 20 == 69. As 251 is prime, this reduction rule is math-friendly. By that, I mean that for any x other than zero, there is some y such that x * y == 1. For example, taking x of 16, we have 16 * 204 == 1. This property is not true for the 256 == 0 reduction rule; with that rule, there is no y such that 16 * y == 1. Where it exists, this y can be called inv(x) or 1/x.

If we want to keep the CPU-friendly uint8_t, uint16_t, uint32_t, and uint64_t data types, and also keep the math-friendly property of having inv(x) exist for all non-zero x, then we are actually in luck, albeit we have to use some exotic definitions of addition and multiplication. The exotic definitions come from polynomial math rather than integer math. A string of bits b0, b1, ..., bn-1 can be viewed as the polynomial b0 + b1 * x1 + ... + bn-1 * xn-1, at which point addition or multiplication of two strings of bits can be done by converting both strings to polynomials, doing addition or multiplication of those two polynomials, and then converting the resultant polynomial back to a bit string. That final back-conversion step requires that every bi in the resultant polynomial is either 0 or 1, which we can ensure by applying the reduction rule 2 == 0 to every bi. In the case of multiplication, the resultant bit string (or, equivalently, polynomial) is often going to be twice as long as its inputs. This should not come as a surprise, as the same thing happens with infinite-precision integer math. Continuing the trend of non-surprising things, just as reduction rules can be used to bring infinite-precision integer math down to fixed bit lengths, reduction rules can also be used to bring infinite-precision polynomial math down to fixed bit lengths. As yet another non-surprise, some polynomial reduction rules are math-friendly, and others are not. Some math-friendly polynomial reduction rules are:

(These reduction rules come from Table of Low-Weight Binary Irreducible Polynomials, and in some sense are the simplest reduction rules for each xi on the left hand side of the ==.)

By choosing one of these reduction rules, we can define exotic addition and multiplication for any of uint8_t, uint16_t, uint32_t, and uint64_t. For example, exotic addition or multiplication of two uint8_t values can be done by converting both values to polynomials, doing infinite-precision polynomial addition or multiplication, applying the reduction rule x8 == x4 + x3 + x + 1 to the polynomial (to get rid of x8, x9, and all other higher terms), and then converting the polynomial back to a string of 8 bits (remembering the 2 == 0 reduction applied to each bi in the polynomial). It turns out that this exotic addition doesn't depend on the choice of polynomial reduction rule, and that the 2 == 0 reduction means that this exotic addition is actually just bitwise XOR of the original values. Unfortunately, exotic multiplication does depend on the choice of polynomial reduction rule, and for any math-friendly choice, exotic multiplication doesn't turn out to be the same as an everyday operation.

With one of these polynomial reduction rules in hand, we can define exotic addition and multiplication for each of the uint8_t, uint16_t, uint32_t, and uint64_t data types. Furthermore, assuming that the polynomial reduction rule is math-friendly, this exotic multiplication has an inv(x) which exists for all non-zero x. In other words, we have what mathematicians call a field. As the number of elements (28, 216, 232, and 264, respectively) is finite in every case, they are furthermore called Galois fields. This is shortened to GF, giving GF(28), GF(216), GF(232), and GF(264).

With that long introduction done, the question becomes: how can addition, multiplication, and inv of GF(28) / GF(216) / GF(232) / GF(264) values be done efficiently on contemporary CPUs? Addition is just XOR, so the interesting questions are around multiplication and inv.

For GF(28), inv could be implemented as a uint8_t[256] lookup table. Multiplication could be implemented as a uint8_t[256][256] lookup table. This latter table is starting to get a little large, but there's a time/space tradeoff possible: analogues to log and exp exist in GF(2N), and both of log and exp can be implemented as uint8_t[256] lookup tables. Multiplication of two values thus becomes two log lookups, an integer addition, and an exp lookup.

For GF(216), the same broad strokes apply, with each of inv and log and exp implementable as uint16_t[65536] lookup tables.

For GF(232) and GF(264), the lookup tables would become too big, so alternative approaches are required. Recall that multiplication consists of two major substeps: infinite-precision polynomial multiplication, and then applying a reduction rule. For the infinite-precision polynomial multiplication part, some uncommon CPU instructions come to our aid:

The pclmulqdq instruction is available as the _mm_clmulepi64_si128 intrinsic:

#include <stdint.h>
#include <wmmintrin.h>

__m128i poly_mul(uint64_t a, uint64_t b) {
    return _mm_clmulepi64_si128(_mm_cvtsi64_si128(a),
                                _mm_cvtsi64_si128(b), 0);
}

To complete the GF(264) multiplication, this infinite-precision polynomial multiplication needs to be followed by reduction back down to 64 bits using a suitable reduction rule. With the x64 == x4 + x3 + x + 1 rule, the right hand side of the rule can be represented as the bit string (1 << 4) + (1 << 3) + (1 << 1) + (1 << 0), or 0x1b for short. This rule can be applied in full by using pclmulqdq twice:

uint64_t poly_reduce(__m128i x, uint64_t r = 0x1b) {
    __m128i xr = _mm_cvtsi64_si128(r);
    __m128i x2 = _mm_clmulepi64_si128(x, xr, 1);
    x ^= x2;
    x ^= _mm_clmulepi64_si128(x2, xr, 1);
    return _mm_cvtsi128_si64(x);
}

The two pieces combine to give multiplication in GF(264):

uint64_t gf_mul(uint64_t a, uint64_t b, uint64_t r = 0x1b) {
    return poly_reduce(poly_mul(a, b), r);
}

If performing a dot product rather than just one multiplication, then the reduction step can be performed just once rather than after every multiplication:

uint64_t gf_dot(const uint64_t* as, const uint64_t* bs,
                uint32_t n, uint64_t r = 0x1b) {
    __m128i x = _mm_setzero_si128();
    for (uint32_t i = 0; i < n; ++i) {
        x ^= poly_mul(as[i], bs[i]);
    }
    return poly_reduce(x, r);
}

Thanks to some neat mathematical properties, the inv operation can be implemented in terms of multiplication:

uint64_t gf_inv(uint64_t x, uint64_t r = 0x1b) {
    uint64_t y = x = gf_mul(x, x, r);
    for (int i = 2; i < 64; ++i) {
      x = gf_mul(x, x, r);
      y = gf_mul(x, y, r);
    }
    return y;
}

For GF(232), things are actually slightly harder, as shifts by 32 bits are required in the reduction step rather than shifts by 64 bits, and pclmulqdq / pmull2 can only do a "free" shift by 64 bits, so explicit shift/shuffle instructions are required. One interesting option is to use the x86 crc32 instruction, as the reduction x32 == x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 + x18 + x14 + x13 + x11 + x10 + x9 + x8 + x6 + 1 is performed within this instruction, but this forces you into a particular reduction rule, and also subjects you to the other steps performed within crc32, namely some shifts and bit reflections.

Going back down to GF(28), very recent Intel chips have added three very relevant instructions under the name of GFNI (Galois Field New Instructions). The first of these is gf2p8mulb, which takes two uint8_t values, performs infinite-precision polynomial multiplication, and then does reduction using the x8 == x4 + x3 + x + 1 rule to return a uint8_t value. This is available at any SIMD width, i.e. 16 bytes at a time in xmm registers, 32 bytes at a time in ymm registers (AVX), or 64 bytes at a time in zmm registers (AVX512). The other two instructions are gf2p8affineqb and gf2p8affineinvqb, which both follow the same sequence of steps:

  1. Take some uint8_t value as input.
  2. Optionally perform the inv operation on the uint8_t (with reference to x8 == x4 + x3 + x + 1). gf2p8affineinvqb performs this step, whereas gf2p8affineqb skips it.
  3. Form a new uint8_t where every bit of the output uint8_t is the horizontal XOR of any subset of bits from the input uint8_t.
  4. XOR the uint8_t with an eight-bit immediate.

Step 3 requires a uint64_t operand, as there are eight output bits, and each of those requires an eight-bit mask to specify the input subset. Both of gf2p8affineqb and gf2p8affineinvqb are available at any SIMD width, i.e. 16/32/64 bytes at a time. The second input to the instruction is another vector of the same width; it specifies the operands for step 3, but as this operand is uint64_t rather than uint8_t, the same operand is used for each group of eight bytes. If just the inv from step 2 is desired, then step 3 can be disabled by specifying that output bit i is formed from just input bit i, and step 4 can be disabled by specifying a constant of zero.

(As an aside, step 3 on its own allows for all sorts of bit tricks. Amongst other things, the bits within an uint8_t can be permuted or shifted in any fashion, including reversal or rotation.)

One final case to consider is multiplication by a constant in GF(28). That is, multiplying a with c for lots of different values of a, where both a and c are uint8_t. This extends to doing dot products where one of the vectors is constant, or doing matrix/vector multiplication where the matrix is constant. One option would be to replace c with a uint8_t[256] table, at which point multiplication against c is just a table lookup. A related observation is that gf_mul(a, c) == gf_mul(a & 0xf0, c) ^ gf_mul(a & 0x0f, c), and both of the multiplications on the right only have 16 possible inputs for the a & 0xf0 and a & 0x0f terms. This means that they can both be implemented with a uint8_t[16] table, bringing the memory requirement down from uint8_t[256] to uint8_t[16+16]. At this smaller size, the uint8_t[16] tables can exist in SIMD registers rather than in memory, and table lookup becomes a pshufb (amd64) or tbl (AArch64) instruction rather than a memory load. Even better, multiple such table lookups can be performed in parallel for any SIMD width. This is the approach used by Intel's ISA-L EC routines.

cuBLASLt notes

Starting with CUDA 10.1, nVidia introduced a new library called cuBLASLt, as an alternative to the cuBLAS function cublasGemmEx (and related gemm routines). The reasons for using cuBLASLt include, but are not limited to:

  1. To be able to specify the CUDA stream, and cuBLAS workspace memory, on a per-function-call basis.
  2. To make use of certain nVidia-provided epilogues fused into the matrix multiplication kernel.
  3. To make use of integer tensor cores on sufficiently modern GPUs.

On the other hand, there's one big reason for not using cuBLASLt: it is significantly more complicated to use than cublasGemmEx.

To start, a cublasLtHandle_t is required - this can come from cublasLtCreate, or an existing cublasHandle_t can be cast to cublasLtHandle_t. However, lots of setup needs to be performed before cublasLtMatmul can be used to actually compute a matrix multiplication. The first piece of setup is to initialise a cublasLtMatmulDesc_t object describing some of the attributes of the matrix multiplication. Such an object is created by cublasLtMatmulDescCreate followed by zero or more calls to cublasLtMatmulDescSetAttribute. Some of the common attributes on this object are:

AttributeDefault Value
CUBLASLT_MATMUL_DESC_COMPUTE_TYPEcublasLtMatmulDescCreate parameter
CUBLASLT_MATMUL_DESC_SCALE_TYPEcublasLtMatmulDescCreate parameter
CUBLASLT_MATMUL_DESC_POINTER_MODECUBLASLT_POINTER_MODE_HOST
CUBLASLT_MATMUL_DESC_TRANSACUBLAS_OP_N
CUBLASLT_MATMUL_DESC_TRANSBCUBLAS_OP_N
CUBLASLT_MATMUL_DESC_TRANSCCUBLAS_OP_N
CUBLASLT_MATMUL_DESC_EPILOGUECUBLASLT_EPILOGUE_DEFAULT (none)

Next up, a cublasLtMatrixLayout_t object needs to be initialised for each of the three matrix shapes involved in the matrix multiplication. Such an object is created by cublasLtMatrixLayoutCreate followed by zero or more calls to cublasLtMatrixLayoutSetAttribute. Some of the common attributes on this object are:

AttributeDefault Value
CUBLASLT_MATRIX_LAYOUT_TYPEcublasLtMatrixLayoutCreate parameter
CUBLASLT_MATRIX_LAYOUT_ROWScublasLtMatrixLayoutCreate parameter
CUBLASLT_MATRIX_LAYOUT_COLScublasLtMatrixLayoutCreate parameter
CUBLASLT_MATRIX_LAYOUT_LDcublasLtMatrixLayoutCreate parameter
CUBLASLT_MATRIX_LAYOUT_ORDERCUBLASLT_ORDER_COL
CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT1 (not batched)

The third thing which cublasLtMatmul requires is a cublasLtMatmulAlgo_t, but such a thing isn't created directly. Instead, a cublasLtMatmulPreference_t object is initialised, and cublasLtMatmulAlgoGetHeuristic is used to take all of the previously created objects and spit out a list of potential cublasLtMatmulAlgo_t objects. A cublasLtMatmulPreference_t object is created by cublasLtMatmulPreferenceCreate followed by zero or more calls to cublasLtMatmulPreferenceSetAttribute. Some of the common attributes on this object are:

AttributeDefault Value
CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES0
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES256
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES256
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES256
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES256

With all these objects in hand, cublasLtMatmulAlgoGetHeuristic can be called to populate an array of cublasLtMatmulHeuristicResult_t objects. Once populated, the algo field is a ready-to-use cublasLtMatmulAlgo_t. This field is relatively opaque, but some information about it can be obtained if desired. This information comes in three places: firstly there are other fields on cublasLtMatmulHeuristicResult_t (namely wavesCount and workspaceSize), secondly read-only attributes can be queried using cublasLtMatmulAlgoCapGetAttribute (e.g. CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS and CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES through CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES), and thirdly read-write attributes can be queried using cublasLtMatmulAlgoConfigGetAttribute.

With a cublasLtMatmulAlgo_t object chosen (typically the front of the array passed to cublasLtMatmulAlgoGetHeuristic), cublasLtMatmul can finally be used. This function computes D = alpha * (A @ B) + beta * C, which when C == D is equivalent to cublasGemmEx.

That concludes basic usage notes, but if CUBLAS_COMPUTE_32I (or CUBLAS_COMPUTE_32I_PEDANTIC) is being used, then there's another whole chapter of usage notes. This chapter starts by noting that the list of supported configurations for integer matrix multiplication is (at least currently) very limited:

  1. CUDA_R_32I destination, computation using legacy CUDA cores:
    • scaleType must be CUDA_R_32I, but only 0 or 1 supported
    • A matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
    • B matrix must be CUDA_R_8I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
    • C matrix must be CUDA_R_32I with either CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW
    • CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
  2. CUDA_R_8I destination, computation using legacy CUDA cores:
    • scaleType must be CUDA_R_32F
    • A matrix must be CUDA_R_8I with CUBLAS_OP_T
    • B matrix must be CUDA_R_8I with CUBLAS_OP_N
    • C matrix must be CUDA_R_8I
    • CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
  3. CUDA_R_32I destination, computation using integer tensor cores:
    • scaleType must be CUDA_R_32I, but only 0 or 1 supported
    • A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
    • B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
    • C matrix must be CUDA_R_32I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
    • CUBLASLT_MATMUL_DESC_EPILOGUE must be CUBLASLT_EPILOGUE_DEFAULT
  4. CUDA_R_8I destination, computation using integer tensor cores:
    • scaleType must be CUDA_R_32F
    • A matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32
    • B matrix must be CUDA_R_8I with CUBLAS_OP_T and CUBLASLT_ORDER_COL4_4R2_8C (Turing or Ampere) or CUBLASLT_ORDER_COL32_2R_4R4 (Ampere)
    • C matrix must be CUDA_R_8I with CUBLAS_OP_N and CUBLASLT_ORDER_COL32

Of particular note are the strange CUBLASLT_MATRIX_LAYOUT_ORDER values required for using integer tensor cores. In C/C++ terminology, CUBLASLT_ORDER_COL can be thought of as a two-dimensional array indexed as [I][J], with the [J] part packed densely, and the leading dimension specifying how to advance by one in the [I] part. In the same terminology, CUBLASLT_ORDER_COL32 can be thought of as a three-dimensional array indexed as [I/32][J][I%32], with the [J][I%32] part packed densely, and the leading dimension specifying how to advance by one in the [I/32] part.

The CUBLASLT_ORDER_COL4_4R2_8C and CUBLASLT_ORDER_COL32_2R_4R4 layouts are even more exotic. Rather than trying to explain their layout, it is best to consider them to be completely opaque layouts. Thankfully, the cublasLtMatrixTransform function is provided to convert between layouts, so a matrix can be constructed using a known simple layout (such as CUBLASLT_ORDER_COL or CUBLASLT_ORDER_ROW) and then converted to CUBLASLT_ORDER_COL4_4R2_8C or CUBLASLT_ORDER_COL32_2R_4R4 using cublasLtMatrixTransform. To use cublasLtMatrixTransform, a cublasLtMatrixTransformDesc_t object is required. Such an object is created by cublasLtMatrixTransformDescCreate followed by zero or more calls to cublasLtMatrixTransformDescSetAttribute. Some of the common attributes on this object are:

AttributeDefault Value
CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPEcublasLtMatrixTransformDescCreate parameter
CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODECUBLASLT_POINTER_MODE_HOST
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSACUBLAS_OP_N
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSBCUBLAS_OP_N

cublasLtMatrixTransform can also be used to convert in or out of CUBLASLT_ORDER_COL32, but said conversions obviously come with a time cost, so it is better to keep matrices in the CUBLASLT_ORDER_COL32 format, and perform all operations on them in that layout. This might mean rewriting a bunch of CUDA kernels to understand the layout, if said kernels care about the two dimensional structure of the matrix.

x86 macro-op fusion notes

For very many years now, x86 and x64 processors have been able to take an arithmetic instruction immediately followed by a conditional jump instruction, and execute the two instructions as if they were a single instruction (which is called "macro-op fusion"). When writing very hot loops in assembly code by hand, this can effectively give you an extra instruction for free.

In Intel chips, fusion first appeared in the original Core microarchitecture, launched in 2006. An improved version of fusion made its debut in the Nehalem microarchitecture (Nhlm) a few years later in 2008. The Sandy Bridge microarchitecture (SnBr) released in 2011 contained yet more improvements to fusion. Things remained fairly stable through until Cannon Lake (CnLk), but then Ice Lake in 2019 removed support for a few cases of fusion (INC and DEC instructions, and instructions with memory operands).

In AMD chips, fusion first appeared in the Bulldozer microarchitecture (Bdzr), launched in 2011. It supported fusing TEST and CMP instructions with all flag-based conditional jumps. Things remained fairly stable through until Zen 3 in 2020, which added support for a number of other instructions.

The table below summarises which instructions can fuse with which jumps on which microarchitectures. Combinations which are supported, but which are pointless in practice, are parenthesised.


TEST

AND
OR
XOR

CMP
ADD
SUB
INC
DEC
J[N]O(Core+)(SnBr+)----
J[N]SCore+SnBr+----
J[N]P
JP[EO]
Core+SnBr+----
J[N]B
J[N]AE
J[N]C
(Core+)(SnBr+)-Core+SnBr+-
J[N]BE
J[N]A
(Core+)(SnBr+)-Core+SnBr+-
J[N]Z
J[N]E
Core+SnBr+-Core+SnBr+SnBr-CnLk
J[N]L
J[N]GE
(Core+)(SnBr+)-Nhlm+SnBr+SnBr-CnLk
J[N]G
J[N]LE
Core+SnBr+-Nhlm+SnBr+SnBr-CnLk

TEST

AND
OR
XOR

CMP
ADD
SUB
INC
DEC
J[N]O(Bdzr+)(Zen3+)(Zen3+)Bdzr+Zen3+Zen3+
J[N]SBdzr+Zen3+Zen3+Bdzr+Zen3+Zen3+
J[N]P
JP[EO]
Bdzr+Zen3+Zen3+Bdzr+Zen3+Zen3+
J[N]B
J[N]AE
J[N]C
(Bdzr+)(Zen3+)(Zen3+)Bdzr+Zen3+(Zen3+)
J[N]BE
J[N]A
(Bdzr+)(Zen3+)(Zen3+)Bdzr+Zen3+(Zen3+)
J[N]Z
J[N]E
Bdzr+Zen3+Zen3+Bdzr+Zen3+Zen3+
J[N]L
J[N]GE
(Bdzr+)(Zen3+)(Zen3+)Bdzr+Zen3+Zen3+
J[N]G
J[N]LE
Bdzr+Zen3+Zen3+Bdzr+Zen3+Zen3+
The following paragraphs describe the various conditional jumps, and in particular what it means to fuse with ADD.
Jumps which depend only upon the result bits (for TEST, the result is what AND would give, and for CMP, the result is what SUB would give):
DescriptionFlagsFusion (SnBr)
JSJump if signSF = 1TEST AND
JNSJump if not signSF = 0TEST AND
JP
JPE
Jump if parity
Jump if parity even
PF = 1TEST AND
JNP
JPO
Jump if not parity
Jump if parity odd
PF = 0TEST AND
JZ
JE
Jump if zero
Jump if equal
ZF = 1TEST AND INC DEC
CMP ADD SUB
JNZ
JNE
Jump if not zero
Jump if not equal
ZF = 0TEST AND INC DEC
CMP ADD SUB
SF is a copy of the most significant bit of the result (which for conceptually signed results, indicates a result less than zero), PF is set based on the population count of the low 8 bits of the result (PF = 1 implies that said population count was even), and ZF is set if all bits of the result are zero (which for CMP and SUB implies that the two operands were equal).
Jumps which conceptually operate on unsigned operands:
DescriptionFlagsFusion (SnBr)
JB
JNAE
JC
Jump if below
Jump if not above or equal
Jump if carry
CF = 1CMP ADD SUB
JNB
JAE
JNC
Jump if not below
Jump if above or equal
Jump if not carry
CF = 0CMP ADD SUB
JBE
JNA
Jump if below or equal
Jump if not above
CF = 1 or ZF = 1CMP ADD SUB
JNBE
JA
Jump if not below or equal
Jump if above
CF = 0 and ZF = 0CMP ADD SUB
For an N-bit result, CF is the (N+1)th bit. Fusion with TEST and AND is technically possible, but CF is always zero following these instructions, making it pointless. The "above" and "below" descriptions are from the perspective of CMP or SUB with unsigned operands. To extend the description to ADD, consider the instruction to be taking two N-bit unsigned operands, zero-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "above" and "below" are whether this (N+1)-bit signed result is above or below zero.
Jumps which conceptually operate on signed operands:
DescriptionFlagsFusion (SnBr)
JOJump if overflowOF = 1-
JNOJump if not overflowOF = 0-
JL
JNGE
Jump if less
Jump if not greater or equal
SF ≠ OFINC DEC
CMP ADD SUB
JNL
JGE
Jump if not less
Jump if greater or equal
SF = OFINC DEC
CMP ADD SUB
JNG
JLE
Jump if not greater
Jump if less or equal
SF ≠ OF or ZF = 1TEST AND INC DEC
CMP ADD SUB
JG
JNLE
Jump if greater
Jump if not less or equal
SF = OF and ZF = 0TEST AND INC DEC
CMP ADD SUB
Conceptually, the arithmetic instructions take two N-bit signed operands (the 2nd operand being 1 for INC and DEC), sign extend both to (N+1)-bits, compute an (N+1)-bit signed result, then OF is set to the exclusive-or of the two most signicant bits of those (N+1). This means that SF ≠ OF corresponds to the most significant bit of this signed (N+1)-bit result (in the same way that CF corresponds to the most significant bit of the (N+1)-bit result when the operands are zero-extended). In other words, OF is set if sign-extending the N-bit result to (N+1)-bits gives a different result to sign-extending the operands to (N+1)-bits and then performing (N+1)-bit arithmetic. Fusion with TEST and AND is technically possible for all these jumps, but OF is always zero following TEST or AND, making many combinations pointless. The "greater" and "less" descriptions are from the perspective of CMP or SUB with signed operands. To extend the description to ADD, consider the instruction to be taking two N-bit signed operands, sign-extending both to (N+1) bits, and producing an (N+1)-bit signed result. Then "greater" and "less" are whether this (N+1)-bit signed result is greater or less than zero. The INC and DEC instructions are equivalent to ADD and SUB with the 2nd operand being 1 (except for INC and DEC not touching CF).
In the context of optimising hot loops, it is common to be iterating over the half-open range [0, N), by some step S. Given all the references above to comparing to zero, it is often beneficial to transform this range to [-N, 0), as then the loop-control instructions can be ADD reg, S followed by a conditional jump, rather than ADD reg, S followed by CMP reg, N followed by a conditional jump. In the case of fusing ADD reg, S with a conditional jump, if N is known to be non-zero, then the conditional jump can be JA or JAE (or synonyms thereof), relying on CF = 1 occuring when the counter transitions from negative to non-negative. If N might zero, and a single iteration of the loop is desired (or harmless) even when N is zero, then the conditional jump cannot be JA or JAE, but instead can be JL (or a synonym thereof), relying on sign extension of both operands to (N+1) bits and then jumping if the (N+1)-bit result is less than zero.

AVX-512 notes

I have access to some Intel Skylake-X CPUs and some Intel Cascade Lake CPUs, both of which support AVX-512 instructions. AVX-512 has lots of subsets; both of these CPUs support the F, CD, VL, DQ, and BW subsets. Additionally, Cascade Lake supports the VNNI subset. A small part of this post is about VNNI, but other than that, everything is applicable to both Skylake-X and Cascade Lake.

There are new 512-bit-wide vector registers zmm0 through zmm31, which extend ymm0 through ymm15 both in their width and their number. The low parts of these new registers are available as ymm16 through ymm31, or xmm16 through xmm31, should you have code which would benefit from more registers rather than wider registers. There are also new 64-bit-wide "mask" registers k0 through k7.

Starting with "mask" registers:

Most vector instructions which allow a memory operand and have a lane width of 32-bits or 64-bits now support the memory operand being an embedded broadcast of a 32-bit or 64-bit value. In terms of syntax, this is done by putting {1toN} after the memory operand, for example vpaddd zmm0, zmm0, dword ptr [rax] {1to16}. Some instructions gain optional modifiers for controlling the rounding mode, or for suppressing exceptions. As a quirk of the instruction encoding, all three pieces of functionality (broadcasting, rounding mode control, and suppressing exceptions) are enabled/disabled by the same bit, which might cause surprises.

Assorted new floating-point instructions:

Assorted new integer instructions:

There are new instructions for converting between unsigned integers and floating-point values, in the form of v(cvt|cvtt)[ps][sd]2u(dq|si|qq) and vcvtu(dq|si|qq)2[ps][sd]. Also new are packed conversions between int64 and floating-point values, in the form of v(cvt|cvtt)p[sd]2qq and vcvtqq2p[sd].

Assorted new permutation and shuffling and blending instructions:

Converting floats to strings, part 1

Much has been written about converting floating point numbers to strings: Bruce Dawson has lots to say, David M. Gay's venerable dtoa is a classic, and Florian Loitsch's relatively recent grisu paper is worth studying. Often the problem is framed as "converting a floating point number to the shortest possible decimal string representation", but this framing is neither neccessary nor sufficient for implementing the %e / %f / %g formats of sprintf. Furthermore, this framing introduces significant complexity. As such, I'd like to begin by considering a much simpler framing: convert a double-precision floating point number to a decimal string, such that the string gives the exact mathematical real number represented by the float.

First of all, we need a quick reminder of what an IEEE754 double-precision floating point number looks like under the hood. If we ignore negative numbers, infinities, NaNs, and denormals, then we have just an 11-bit exponent e and a 53-bit mantissa m, which together represent the number m (an integer) times 2e (where e might be negative). Denormals also fit the pattern of m times 2e, albeit with a different encoding of m and e.

Armed with this knowledge, we can pull the m and e fields out of a number:

typedef union {
  double n;
  uint64_t u64;
  struct {
    uint32_t lo;
    uint32_t hi;
  } u32;
} TValue;

void decode(double n) {
  TValue t;
  t.n = n;
  if ((t.u32.hi << 1) >= 0xffe00000) {
    if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
      printf("NaN\n");
    } else {
      printf("Infinity\n");
    }
  } else {
    int32_t e = (t.u32.hi >> 20) & 0x7ff;
    uint64_t m = t.u32.hi & 0xfffff;
    if (e == 0) {
      e++;
    } else {
      m |= 0x100000;
    }
    e -= 1043;
    if (t.u32.lo) {
      e -= 32;
      m = (m << 32) | t.u32.lo;
    }
    printf("%llu * 2^%d\n", (long long unsigned)m, (int)e);
  }
}

Some example outputs of this decode function are:

nOutput
0./0.NaN
1./0.Infinity
0.0 * 2^-1042
1.1048576 * 2^-20
10.1310720 * 2^-17
0.17205759403792794 * 2^-56
1e-3082024022533073106 * 2^-1074
pow(2, 1020)1048576 * 2^1000

These outputs are not in decimal, and they're not particularly convenient for human comprehension, but they are 100% accurate.

From m and e values, one classical approach for getting decimal digits is:

  1. Convert m to a binary bignum (easy).
  2. Multiply the binary bignum by 2e (easy).
  3. In a loop, compute the binary bignum modulo ten to get one digit (hard), and then divide the binary bignum by ten (hard).

I prefer the following approach, as it gets rid of the hard part:

  1. Convert m to a decimal bignum (not easy, but not hard).
  2. Multiply the decimal bignum by 2e (not easy, but not hard).
  3. In a loop, print the digits of the decimal bignum (easy).

What do I mean by binary bignum and decimal bignum? A binary bignum is a number stored as multiple uint32_t pieces, with each piece being in the range 0 through 232-1, and the number being sum(piecei * (232)i). On the other hand, a decimal bignum is a number stored as multiple uint32_t pieces, with each piece being in the range 0 through 109-1, and the number being sum(piecei * (109)i). 109 is chosen as it is the largest power of ten which fits into a uint32_t (even better, 109 requires 29.9 bits to store, so only 2.1 bits out of every 32 are wasted).

I'd like to represent a decimal bignum using the following variables:

uint32_t nd[128];
int32_t ndlo;
int32_t ndhi;

For ndlo <= i <= ndhi, piecei is nd[i & 127], and otherwise piecei is zero.

As it happens, we already know how to print the digits of an individual decimal bignum piece. Using one of those functions from last time, printing an entire decimal bignum is painless:

void nd_print(char* p, uint32_t* nd, int32_t ndlo, int32_t ndhi) {
  int32_t i;
  for (i = ndhi; i >= 0; --i) {
    nasonov9(p, nd[i & 127]); p += 9;
  }
  *p++ = '.';
  for (; i >= ndlo; --i) {
    nasonov9(p, nd[i & 127]); p += 9;
  }
  *p = 0;
}

Multiplying by 2e takes a bit of effort. Let's start by considering the case where e is negative, at which point we're really dividing by 2-e. In turn, this is dividing by 2 -e times. Just dividing by 2 doesn't take that much code:

int32_t nd_div2(uint32_t* nd, int32_t ndlo, int32_t ndhi) {
  uint32_t i = ndhi & 127, carry = 0;
  for (;;) {
    uint32_t val = nd[i];
    nd[i] = (val >> 1) + carry;
    carry = (val & 1) * 500000000;
    if (i == (ndlo & 127)) break;
    i = (i - 1) & 127;
  }
  if (carry) nd[--ndlo & 127] = carry;
  return ndlo;
}

We can generalise this to dividing by 2k:

int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
                 uint32_t k) {
  uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
  uint32_t i = ndhi & 127, carry = 0;
  for (;;) {
    uint32_t val = nd[i];
    nd[i] = (val >> k) + carry;
    carry = (val & mask) * mul;
    if (i == (ndlo & 127)) break;
    i = (i - 1) & 127;
  }
  if (carry) nd[--ndlo & 127] = carry;
  return ndlo;
}

The above will work for k between zero and nine inclusive; if k is larger than nine, then 1000000000 >> k can no longer be represented by an integer. As such, the complete solution has to start by dividing in batches of 29:

int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
                 uint32_t k) {
  while (k >= 9) {
    uint32_t i = ndhi & 127, carry = 0;
    for (;;) {
      uint32_t val = nd[i];
      nd[i] = (val >> 9) + carry;
      carry = (val & 0x1ff) * 1953125;
      if (i == (ndlo & 127)) break;
      i = (i - 1) & 127;
    }
    if (carry) nd[--ndlo & 127] = carry;
    k -= 9;
  }
  if (k) {
    uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
    uint32_t i = ndhi & 127, carry = 0;
    for (;;) {
      uint32_t val = nd[i];
      nd[i] = (val >> k) + carry;
      carry = (val & mask) * mul;
      if (i == (ndlo & 127)) break;
      i = (i - 1) & 127;
    }
    if (carry) nd[--ndlo & 127] = carry;
  }
  return ndlo;
}

We can then go through the same process for multiplying, starting with multiplication by 2:

int32_t nd_mul2(uint32_t* nd, int32_t ndhi) {
  uint32_t carry_in = 0;
  for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
    uint32_t val = (nd[i] << 1) | carry_in;
    carry_in = val / 1000000000;
    nd[i] = val - carry_in * 1000000000;
  }
  if (carry_in) nd[++ndhi] = carry_in;
  return ndhi;
}

By promoting val to 64-bits, this can be generalised to small k:

int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k) {
  uint32_t carry_in = 0;
  for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
    uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
    carry_in = (uint32_t)(val / 1000000000);
    nd[i] = (uint32_t)val - carry_in * 1000000000;
  }
  if (carry_in) nd[++ndhi] = carry_in;
  return ndhi;
}

This time the constraint on k comes from wanting val to be no more than (109)2, which limits k to 29. It also turns out to be useful to make carry_in a parameter, all of which leads to the complete code for multiplying by 2k:

int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k,
                 uint32_t carry_in) {
  while (k >= 29) {
    for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
      uint64_t val = ((uint64_t)nd[i] << 29) | carry_in;
      carry_in = (uint32_t)(val / 1000000000);
      nd[i] = (uint32_t)val - carry_in * 1000000000;
    }
    if (carry_in) {
      nd[++ndhi] = carry_in; carry_in = 0;
    }
    k -= 29;
  }
  if (k) {
    for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
      uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
      carry_in = (uint32_t)(val / 1000000000);
      nd[i] = (uint32_t)val - carry_in * 1000000000;
    }
    if (carry_in) nd[++ndhi] = carry_in;
  }
  return ndhi;
}

We can plug these routines into the decode function from earlier to create a print function:

void print(double n) {
  TValue t;
  t.n = n;
  if ((t.u32.hi << 1) >= 0xffe00000) {
    if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
      printf("NaN\n");
    } else {
      printf("Infinity\n");
    }
  } else {
    char buf[1154];
    uint32_t nd[128];
    int32_t ndlo = 0;
    int32_t ndhi = 0;
    int32_t e = (t.u32.hi >> 20) & 0x7ff;
    nd[0] = t.u32.hi & 0xfffff;
    if (e == 0) {
      e++;
    } else {
      nd[0] |= 0x100000;
    }
    e -= 1043;
    if (t.u32.lo) {
      e -= 32;
      nd[0] = (nd[0] << 3) | (t.u32.lo >> 29);
      ndhi = nd_mul2k(nd, ndhi, 29, t.u32.lo & 0x1fffffff);
    }
    if (e >= 0) {
      ndhi = nd_mul2k(nd, ndhi, (uint32_t)e, 0);
    } else {
      ndlo = nd_div2k(nd, ndlo, ndhi, (uint32_t)-e);
    }
    nd_print(buf, nd, ndlo, ndhi);
    printf("%s\n", buf);
  }
}

Some example outputs of this print function are:

nOutput
0./0.NaN
1./0.Infinity
0.000000000.
1.000000001.
10.000000010.
0.1000000000000000000.1000000000000000055511
15123125782702118158340454101562500000000
1e-308000000000000000000.0000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000009
99999999999999909326625337248461995470488
73403204569370722504933164788134100221702
36685306110285951575783017584918228243784
38792553200763769833775473829862512856683
41346193998972906543693727922885247662294
86591679434355446221493480729436132941672
16662821737555414480159115639791276054897
20142038977058035153396077150619905566488
97702602917109778267250244017165230316273
90652604144008597950935492433262042405635
56399326294969169893097546113480479123599
46979384052000893178607312050101591177117
04697471514344499487123311264707354172378
09953873785021982614510236627959137966047
18812599767273565216024053297899062477635
21525981391443887618575275588619928089116
90506171197530846785775640581096161907433
18668839610809435427125598308539800029848
265694454312324523925781250000000
pow(2, 1020)00000001123558209288947442330815744243140
45851123561183894160795893800723582922378
43810195794279832650471001320007117491962
08485367436055090103890580296441496713277
36104933390540928297688887250778808824658
17684505312860552384417646403930092119569
40880170232270940691778664363999670287115
4982269052209770601514008576.

Some of these outputs could be tidied up by trimming leading and trailing zeroes, but other than that, this is the output we were aiming for. We can see that the double-precision floating point number closest to 0.1 is actually ever so slightly more than 0.1, that the double-precision floating point number closest to 1e-308 is actually ever so slightly less than 1e-308, and that in some cases the string representation of the exact decimal represented by a float can be extremely long (though, to be fair, 1e-308 is almost as bad as it gets for length).

Next time we'll see how to adapt nd_print into something which can behave like the %e, %f, and %g formats of sprintf (i.e. outputs which you might actually want, as opposed to pedantic exact representations).

page: 3 4 5 6 7