A whirlwind tour of AArch64 vector instructions (NEON)

32 vector registers, each 128 bits wide. Also a control register (FPCR) and a status register (FPSR), though scalar comparison instructions use PSTATE. Each vector register is divided into lanes, as described below.

128 bits might seem small compared to AVX-512, but for most vector instructions, the M1's Firestorm cores can issue four instances of the instruction per cycle, which gets you to a similar place. There are also AMX units on the M1.

A lane is 1/8/16/32/64/128 bits wide. The common lane types are:

     | 1-bit | 8-bit     | 16-bit     | 32-bit     | 64-bit     | 128-bit
uint | bit   | uint8 [q] | uint16 [q] | uint32 [q] | uint64 [q] |
sint |       | sint8 [q] | sint16 [q] | sint32 [q] | sint64 [q] |
fp   |       |           | fp16 [a]   | fp32       | fp64       |
bf   |       |           | bf16 [b]   |            |            |
poly | bit   | poly8     | poly16     |            | poly64 [c] | poly128 [c]
[a] Requires fp16 extension, and improved further by fp16fml extension.
[b] Requires bf16 extension (present on Apple M2, but not on M1).
[c] Requires crypto extension.
[q] Often with the choice of truncate on overflow or saturate on overflow.

The syntactic names for the 32 registers vary based on the lane width and also how much of the register is being used:

Lane width | Low scalar | Low 64 bits    | All 128 bits
1 bit      | N/A        | V0.8B … V31.8B | V0.16B … V31.16B
8 bits     | B0 … B31   | V0.8B … V31.8B | V0.16B … V31.16B
16 bits    | H0 … H31   | V0.4H … V31.4H | V0.8H … V31.8H
32 bits    | S0 … S31   | V0.2S … V31.2S | V0.4S … V31.4S
64 bits    | D0 … D31   | V0.1D … V31.1D | V0.2D … V31.2D
128 bits   | Q0 … Q31   | N/A            | V0.1Q … V31.1Q
In some contexts, individual lanes can be referenced. The syntax for this appends [0] through [15] to one of the above, for example V7.4S[3] denotes the most significant 32-bit lane of V7. Writes to a single lane with this syntax generally preserve the other lanes, whereas in all other cases, writes to the low bits of a register generally zero the remaining bits (though see FPCR.NEP).

Vector instructions have up to six input registers, and up to one output register (with the exception of loads, which can have up to four output registers). In most cases, all the registers involved in a single instruction have the same lane width, though there are exceptions (for example, the widening and narrowing instructions described later).

In most cases, operations are lane-wise: lane i of the output register is formed by combining lane i of each input register, though again there are exceptions (for example, the pairwise and across-lanes reductions, and the permutes described under "Data movement").

GPR to / from vector

Instruction | Direction     | Behaviour
FMOV        | GPR to vector | Truncate to lane width, then zero extend
DUP         | GPR to vector | Truncate then replicate to all lanes
INS         | GPR to vector | Truncate then insert to arbitrary lane
FMOV        | Vector to GPR | Take low lane, zero extend to GPR width
UMOV        | Vector to GPR | Arbitrary lane, zero extend to GPR width
SMOV        | Vector to GPR | Arbitrary lane, sign extend to GPR width
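
As a rough illustration, here is a minimal sketch using ACLE intrinsics (the wrapper names are made up); each one typically compiles to the instruction named in its comment:

#include <arm_neon.h>

// Illustrative wrappers showing the GPR <-> vector moves.
uint32x4_t splat_u32(uint32_t x)                  { return vdupq_n_u32(x); }         // DUP (GPR to all lanes)
uint32x4_t insert_lane2(uint32x4_t v, uint32_t x) { return vsetq_lane_u32(x, v, 2); } // INS (GPR to lane 2)
uint32_t   extract_lane1(uint32x4_t v)            { return vgetq_lane_u32(v, 1); }    // UMOV (lane 1 to GPR)
// (SMOV appears when a narrower lane is sign extended to a GPR.)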

A number of data type conversion instructions can also have a GPR as source or destination (see "Data type conversion, integer to float" and "Data type conversion, float to integer").

Load / store

A scalar load moves 8/16/32/64 bits from memory to (part of) a vector register, whereas a vector load moves 64/128 bits. This can be repeated up to four times, reading from consecutive memory locations and writing to distinct vector registers (which must be consecutively numbered, except for LDP / LDNP).

                                   | ×1                     | ×2                     | ×3          | ×4
Scalar (low lane, zero others)     | LDR, LDUR              | LDP, LDNP              |             |
Scalar (any lane, preserve others) | LD1 (SS 1R)            | LD2 (SS 2R)            | LD3 (SS 3R) | LD4 (SS 4R)
Scalar (replicate to all lanes)    | LD1R                   | LD2R                   | LD3R        | LD4R
Vector                             | LD1 (MS 1R), LDR, LDUR | LD1 (MS 2R), LDP, LDNP | LD1 (MS 3R) | LD1 (MS 4R)
Vector, transposed                 |                        | LD2 (MS 2R)            | LD3 (MS 3R) | LD4 (MS 4R)

With the exception of scalar replicating to all lanes, all load instructions have a corresponding store instruction performing the inverse - replace LD with ST.

The SS or MS suffix denotes what the ARM reference manual calls "single structure" or "multiple structures", and is followed by the number of destination registers (1, 2, 3, or 4). The operand syntax relates to this suffix.

The vector transposed loads perform one of the following transposes, where M denotes the number of destination registers (2, 3, or 4):

                | 8-bit lanes | 16-bit lanes | 32-bit lanes | 64-bit lanes
64-bit vectors  | 8×M ↦ M×8   | 4×M ↦ M×4    | 2×M ↦ M×2    |
128-bit vectors | 16×M ↦ M×16 | 8×M ↦ M×8    | 4×M ↦ M×4    | 2×M ↦ M×2

The addressing modes are all over the place:

       | Base     | Offset                                   | Writeback mode
LDR    | Xn or SP | signed imm9                              | Pre- or post-index
LDUR   | Xn or SP | signed imm9                              | No writeback
LDR    | Xn or SP | unsigned imm12, scaled                   | No writeback
LDR    | Xn or SP | Xm or Wm, optional extend/scale          | No writeback
LDR    | PC       | signed imm19, times 4                    | No writeback
LDP    | Xn or SP | signed imm7, scaled                      | Pre- or post-index
LDP    | Xn or SP | signed imm7, scaled                      | No writeback
LDNP   | Xn or SP | signed imm7, scaled                      | No writeback
Others | Xn or SP | none, Xm, or imm equal to transfer size  | Post-index (no writeback when there is no offset)

Data movement

Moving lanes around:

          | Source lane | Destination lane       | Other lanes
FMOV      | Low lane    | Low lane               | Zeroed
MOV (DUP) | Arbitrary   | Low lane               | Zeroed
MOV (INS) | Arbitrary   | Arbitrary              | Preserved
DUP       | Arbitrary   | All lanes (replicated) | N/A
MOV (ORR) | All lanes   | All lanes (1:1)        | N/A

Various reversals:

              | in bytes | in u16s | in u32s | in u64s | in u128s
Reverse bits  | RBIT     |         |         |         |
Reverse bytes | no-op    | REV16   | REV32   | REV64   | TBL
Reverse u16s  |          | no-op   | REV32   | REV64   | TBL
Reverse u32s  |          |         | no-op   | REV64   | TBL
Reverse u64s  |          |         |         | no-op   | EXT

Various ways of creating one vector from two:

           | First step                    | Second step
TRN1, TRN2 | Discard odd or even lanes     | Interleave lanes
ZIP1, ZIP2 | Discard high half or low half | Interleave lanes
UZP1, UZP2 | Concatenate vectors           | Discard odd or even lanes
EXT        | Concatenate vectors           | Take a contiguous 16-byte span
Note that EXT with the same source vector for both operands gives a rotation by a whole number of bytes. Note that some math instructions come with a free UZP1 and UZP2 (annotated as [p]).
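
As a minimal sketch of the EXT-as-rotation trick (ACLE intrinsics, illustrative function name):

#include <arm_neon.h>

// EXT with the same source twice rotates the 16 bytes of x by 5 byte
// positions: the result is bytes 5..15 of x followed by bytes 0..4.
uint8x16_t rotate_bytes_by_5(uint8x16_t x) {
    return vextq_u8(x, x, 5);
}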

The TBL, TBX family concatenate 1-4 vectors from consecutively numbered registers to form a table T of 16/32/48/64 bytes, then another byte vector serves as indices into said table:

    | Per-byte behaviour
TBL | Di = (Yi < len(T)) ? T[Yi] : 0
TBX | Di = (Yi < len(T)) ? T[Yi] : Di
Note that x86 pshufb does Di = (Yi < 128) ? X[Yi & 15] : 0, which is similar.
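
The single-register form of TBL is exposed in C as vqtbl1q_u8; a minimal sketch (illustrative function name):

#include <arm_neon.h>

// TBL with a one-register table: out-of-range indices (>= 16) produce 0,
// in contrast to pshufb, which only looks at the top bit of each index.
uint8x16_t shuffle_bytes(uint8x16_t table, uint8x16_t indices) {
    return vqtbl1q_u8(table, indices);
}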

Immediates

One flavour of FMOV loads constants of the form ±(1.0 + m/16) × 2^e (where 0 ≤ m ≤ 15 and −3 ≤ e ≤ 4) into the low fp16/fp32/fp64 lane, and either zeros the other lanes or replicates the constant to the other lanes.

One flavour of MOVI loads a constant into the low 64-bit lane, each byte of which is independently either 0 or 255, and either zeros the high u64 lane or replicates the constant there.

The remaining flavours of MOVI and MVNI load a constant into the low 8/16/32-bit lane, one byte of which is an arbitrary value, bytes to the left either all 0 or all 255, and bytes to the right either all 0 or all 255, then replicate this constant to all lanes.

The bitwise BIC and ORR instructions support constants of the form: 16/32-bit lane, one byte of which is an arbitrary value, other bytes all 0, replicated to all lanes.

Various comparison instructions support comparison against constant zero:

        | Signed     | Unsigned     | Floating
X == 0  | CMEQ X, #0 | CMEQ X, #0   | FCMEQ X, #0.0
X <= 0  | CMLE X, #0 | CMEQ X, #0   | FCMLE X, #0.0
X < 0   | CMLT X, #0 | always false | FCMLT X, #0.0
X > 0   | CMGT X, #0 | CMTST X, X   | FCMGT X, #0.0
X >= 0  | CMGE X, #0 | always true  | FCMGE X, #0.0
X <=> 0 | N/A        | N/A          | FCMP X, #0.0

Shifts

Note >>R is used to denote a rounding right shift: if the most significant bit shifted out was a 1, then 1 is added to the result. If shifting right by N, this is equivalent to adding 1 << (N - 1) to the input before shifting.

In-lane shifts by immediate:

             | Per-lane behaviour
SHL          | D = X << N
SQSHL, UQSHL | D = sat(X << N)
SQSHLU       | D = sat(X << N) (X signed, D unsigned)
SLI          | D = (X << N) | bzhi(D, N)   (bzhi clears all but the low N bits)
SSHR, USHR   | D = X >> N
SRSHR, URSHR | D = X >>R N
SSRA, USRA   | D += X >> N
SRSRA, URSRA | D += X >>R N
SRI          | D = (X >> N) | bzlo(D, N)   (bzlo clears all but the high N bits)

In-lane variable shifts, where the shift amount is a signed value in the low 8 bits of each lane in the 2nd operand:

               | Per-lane behaviour (Y > 0) | Per-lane behaviour (Y < 0)
SSHL, USHL     | D = X << Y                 | D = X >> -Y
SRSHL, URSHL   | D = X << Y                 | D = X >>R -Y
SQSHL, UQSHL   | D = sat(X << Y)            | D = X >> -Y
SQRSHL, UQRSHL | D = sat(X << Y)            | D = X >>R -Y

Widening shifts by immediate (from 8-bit lanes to 16-bit, 16-bit lanes to 32-bit, or 32-bit lanes to 64-bit), where X is sign-extended or zero-extended before use, and D lanes are twice the width of X lanes:

                             | Per-lane behaviour (twice-width D)
SSHLL, SSHLL2, USHLL, USHLL2 | D = X << N   (where 0 ≤ N < bitwidth(X))
SHLL, SHLL2                  | D = X << bitwidth(X)

Narrowing shifts by immediate (from 64-bit lanes to 32-bit, 32-bit lanes to 16-bit, or 16-bit lanes to 8-bit), where D lanes are half the width of X lanes. In all cases, 1 ≤ N ≤ bitwidth(D):

                    | Per-lane behaviour (half-width D)
XTN, XTN2           | D = truncate(X)
SHRN, SHRN2         | D = truncate(X >> N)
RSHRN, RSHRN2       | D = truncate(X >>R N)
SQXTN, SQXTN2       | D = sat(X) (signed)
UQXTN, UQXTN2       | D = sat(X) (unsigned)
SQXTUN, SQXTUN2     | D = sat(X) (X signed, D unsigned)
SQSHRN, SQSHRN2     | D = sat(X >> N) (signed)
UQSHRN, UQSHRN2     | D = sat(X >> N) (unsigned)
SQSHRUN, SQSHRUN2   | D = sat(X >> N) (X signed, D unsigned)
SQRSHRN, SQRSHRN2   | D = sat(X >>R N) (signed)
UQRSHRN, UQRSHRN2   | D = sat(X >>R N) (unsigned)
SQRSHRUN, SQRSHRUN2 | D = sat(X >>R N) (X signed, D unsigned)

There is no narrowing shift from 8-bit lanes to something narrower. This is a notable difference to x86, where pmovmskb can pull 1 bit out of every 8-bit lane. That said, SHRN from 16-bit lanes to 8-bit lanes with N set to 4 does something interesting: it pulls 4 bits out of every 8-bit lane, taking the high 4 bits of even lanes and the low 4 bits of odd lanes. Alternating between high/low halves is weird, but innocuous if every 8-bit lane starts out containing either 0 or 255.
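
A minimal sketch of that SHRN trick (ACLE intrinsics, illustrative function name), assuming every byte of the input mask is either 0 or 255:

#include <arm_neon.h>
#include <stdint.h>

// SHRN #4 on the 16-bit view of a per-byte 0x00/0xff mask yields one nibble
// per original byte, packed into a 64-bit scalar (a pshufb-free "movemask").
uint64_t mask_to_nibbles(uint8x16_t mask) {
    uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(mask), 4);
    return vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
}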

Subject to the sha3 extension, two instructions provide rotates in 64-bit lanes:

     | Per-lane behaviour (64-bit lanes only)
RAX1 | D = X xor rotate_left(Y, 1)
XAR  | D = rotate_right(X xor Y, N)

Note that rotate by immediate (without xor, for any lane width, and without needing sha3) can be constructed from SHL followed by USRA.
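
As a minimal sketch of that construction (ACLE intrinsics, illustrative function name; shift amounts have to be compile-time constants), here is a rotate of 32-bit lanes left by 7:

#include <arm_neon.h>

// Rotate each 32-bit lane left by 7: SHL #7 gives the low part, then
// USRA #25 adds in the wrapped-around high bits (the two parts do not
// overlap, so the addition behaves like an or).
uint32x4_t rotate_left_7(uint32x4_t x) {
    uint32x4_t t = vshlq_n_u32(x, 7);
    return vsraq_n_u32(t, x, 25);
}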

Bitwise

Assorted instructions operating bitwise:

         | Description       | Per-bit behaviour
AND      | And               | D = X and Y
BCAX [s] | Clear and xor     | D = X xor (Y and not Z)
BIC      | Clear             | D = X and not Y
BIC      | Clear (immediate) | D = D and not imm [i]
BIF      | Insert if false   | D = Y ? D : X
BIT      | Insert if true    | D = Y ? X : D
BSL      | Select            | D = D ? X : Y
EOR      | Xor               | D = X xor Y
EOR3 [s] | Xor (three-way)   | D = X xor Y xor Z
NOT      | Not               | D = not X
ORN      | Or not            | D = X or not Y
ORR      | Or                | D = X or Y
ORR      | Or (immediate)    | D = D or imm [i]
[s] Requires sha3 extension.
[i] Immediate is a 16-bit or 32-bit constant where one byte is an arbitrary imm8 and other bytes are zero, broadcast to all 16-bit or 32-bit lanes.

Bit counting instructions:

    | Counts                         | Possible lane widths
CLS | Leading sign bits              | 8-bit, 16-bit, 32-bit
CLZ | Leading zero bits (i.e. lzcnt) | 8-bit, 16-bit, 32-bit
CNT | Non-zero bits (i.e. popcnt)    | 8-bit [a]
[a] Other lane widths can be achieved by a follow-up UADDLP or ADDV.
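
As a minimal sketch of footnote [a] (ACLE intrinsics, illustrative function name), a per-u32 population count can be built from CNT followed by two UADDLPs:

#include <arm_neon.h>

// Population count of each 32-bit lane: CNT counts bits per byte, then two
// pairwise widening adds (UADDLP) sum the byte counts within each 32-bit lane.
uint32x4_t popcount_per_u32(uint32x4_t x) {
    uint8x16_t per_byte = vcntq_u8(vreinterpretq_u8_u32(x));
    uint16x8_t per_u16  = vpaddlq_u8(per_byte);
    return vpaddlq_u16(per_u16);
}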

There are no horizontal bitwise instructions in the traditional sense, though various horizontal reductions can be constructed from other instructions.

Integer math

Assorted instructions operating lanewise, with 8/16/32/64-bit lanes:

                                | Per-lane behaviour
ABS                             | D = abs(X)
SQABS                           | D = sat(abs(X))
NEG                             | D = -X
SQNEG                           | D = sat(-X)
ADD, SUB                        | D = X ± Y
ADDP                            | D = A + B [p]
ADDV                            | D0 = X0 + X1 + ⋯ + Xn-1 [v]
SQADD, UQADD, SUQADD, USQADD    | D = sat(X + Y)
SQSUB, UQSUB                    | D = sat(X - Y)
SABD, UABD [d]                  | D = abs(X - Y)
SABA, UABA [d]                  | D += abs(X - Y)
SHADD, SHSUB, UHADD, UHSUB [d]  | D = (X ± Y) >> 1
SRHADD, URHADD [d]              | D = (X + Y) >>R 1
MUL [b] [d]                     | D = X * Y
MLA, MLS [b] [d]                | D ±= X * Y
SQDMULH [a] [b] [d]             | D = sat((2 * X * Y) >> bitwidth(D))
SQRDMULH [a] [b] [d]            | D = sat((2 * X * Y) >>R bitwidth(D))
SQRDMLAH, SQRDMLSH [a] [b] [d]  | D = sat(D ± ((2 * X * Y) >>R bitwidth(D)))
SMIN, UMIN [d]                  | D = min(X, Y)
SMINP, UMINP [d]                | D = min(A, B) [p]
SMINV, UMINV [d]                | D0 = min(X0, X1, …, Xn-1) [v]
SMAX, UMAX [d]                  | D = max(X, Y)
SMAXP, UMAXP [d]                | D = max(A, B) [p]
SMAXV, UMAXV [d]                | D0 = max(X0, X1, …, Xn-1) [v]
CMEQ [z]                        | D = (X == Y) ? ones_mask : 0
CMGE, CMHS [z]                  | D = (X >= Y) ? ones_mask : 0
CMGT, CMHI [z]                  | D = (X > Y) ? ones_mask : 0
CMTST                           | D = (X & Y) ? ones_mask : 0
[a] Not available for 8-bit lanes. Not available as unsigned.
[b] When using 16/32-bit lanes, can broadcast a single lane of Y to all lanes of Y.
[d] Not available for 64-bit lanes.
[p] Ai is concat(X, Y)[2*i+0], Bi is concat(X, Y)[2*i+1], i.e. adjacent pairs.
[v] Low lane of D gets sum/min/max of all lanes of X, rest of D cleared.
[z] Operands can be registers or constant zero (at least logically).

Assorted instructions operating lanewise, with 8/16/32-bit lanes for X and Y, and D lanes twice as wide (X/Y sign-extended or zero-extended before use):

                                 | Per-lane behaviour (twice-width D)
SXTL, SXTL2, UXTL, UXTL2         | D = X (i.e. just sign/zero extend)
SABDL, SABDL2, UABDL, UABDL2     | D = abs(X - Y)
SABAL, SABAL2, UABAL, UABAL2     | D += abs(X - Y)
SADDL, SADDL2, UADDL, UADDL2     | D = X + Y
SSUBL, SSUBL2, USUBL, USUBL2     | D = X - Y
SADDLP, UADDLP                   | D = A + B [p]
SADALP, UADALP                   | D += A + B [p]
SADDLV, UADDLV                   | D0 = X0 + X1 + ⋯ + Xn-1 [v]
SMULL, SMULL2, UMULL, UMULL2 [b] | D = X * Y
SMLAL, SMLAL2, UMLAL, UMLAL2 [b] | D += X * Y
SMLSL, SMLSL2, UMLSL, UMLSL2 [b] | D -= X * Y
SQDMULL, SQDMULL2 [a] [b]        | D = sat(2 * X * Y)
SQDMLAL, SQDMLAL2 [a] [b]        | D = sat(D + sat(2 * X * Y))
SQDMLSL, SQDMLSL2 [a] [b]        | D = sat(D - sat(2 * X * Y))
[a] Not available for 8-bit lanes of X/Y. Not available as unsigned.
[b] When using 16/32-bit lanes of Y, can broadcast a single lane of Y to all lanes.
[p] Ai is X[2*i+0], Bi is X[2*i+1], i.e. adjacent pairs.
[v] Low lane of D gets sum of all lanes of X, rest of D cleared.
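
As a small illustration of the across-lanes [v] forms (a minimal sketch using ACLE intrinsics; the function name is made up), UADDLV sums all sixteen byte lanes into a single wider scalar:

#include <arm_neon.h>
#include <stdint.h>

// UADDLV: widening sum of all 16 byte lanes; the intrinsic returns the scalar
// directly (in the instruction, it lands in the low lane of the destination).
uint16_t sum_of_bytes(uint8x16_t x) {
    return vaddlvq_u8(x);
}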

A few instructions operating lanewise, with 16/32/64-bit lanes for D and X, and Y lanes half as wide (Y sign-extended or zero-extended before use):

                             | Per-lane behaviour (half-width Y)
SADDW, SADDW2, UADDW, UADDW2 | D = X + Y
SSUBW, SSUBW2, USUBW, USUBW2 | D = X - Y

A few instructions operating lanewise, with 16/32/64-bit lanes for X and Y, and D lanes half as wide:

                                 | Per-lane behaviour (half-width D)
ADDHN, ADDHN2, SUBHN, SUBHN2     | D = (X ± Y) >> bitwidth(D)
RADDHN, RADDHN2, RSUBHN, RSUBHN2 | D = (X ± Y) >>R bitwidth(D)

Dense linear algebra instructions:

          | Behaviour               | D type    | X type   | Y type
SDOT [b]  | D[i] += dot(X[i], Y[i]) | s32[4]    | s8[4][4] | s8[4][4]
UDOT [b]  | D[i] += dot(X[i], Y[i]) | u32[4]    | u8[4][4] | u8[4][4]
USDOT [b] | D[i] += dot(X[i], Y[i]) | s32[4]    | u8[4][4] | s8[4][4]
SUDOT [b] | D[i] += dot(X[i], Y[i]) | s32[4]    | s8[4][4] | u8[4][4]
SMMLA     | D += X @ Y^T            | s32[2][2] | s8[2][8] | s8[2][8]
UMMLA     | D += X @ Y^T            | u32[2][2] | u8[2][8] | u8[2][8]
USMMLA    | D += X @ Y^T            | s32[2][2] | u8[2][8] | s8[2][8]
[b] Can broadcast a 32-bit lane of Y to all 32-bit lanes.

Two oddball instructions operate on 32-bit unsigned lanes containing fixed-point numbers with 32 fractional bits (i.e. range is 0 through 1-ε):

        | Per-lane behaviour
URECPE  | D = sat(0.5 * X^-1)   (approximate, using just top 9 bits)
URSQRTE | D = sat(0.5 * X^-0.5) (approximate, using just top 9 bits)

Float math

A broad range of floating-point math instructions are available, operating on fp32 or fp64 lanes (or fp16 subject to the fp16 extension), in either vector form or scalar form:

                   | Per-lane behaviour
FABS               | D = abs(X)
FNEG               | D = -X
FADD, FSUB         | D = X ± Y
FADDP              | D = A + B [p]
FABD               | D = abs(X - Y)
FMUL [b]           | D = X * Y
FMULX [b]          | D = X * Y (except that ±0 times ±infinity is ±2.0)
FNMUL [s]          | D = -(X * Y)
FMADD, FMSUB [s]   | D = Z ± X * Y
FNMADD, FNMSUB [s] | D = -(Z ± X * Y)
FMLA, FMLS [b]     | D ±= X * Y
FDIV               | D = X / Y
FSQRT              | D = X^0.5
FRECPX [s]         | D = X^-1 (crude approximation [a], using no fractional bits)
FRECPE             | D = X^-1 (approximate, using just 8 fractional bits)
FRECPS             | D = 2.0 - X * Y [c]
FRSQRTE            | D = X^-0.5 (approximate, using just 8 fractional bits)
FRSQRTS            | D = 1.5 - 0.5 * X * Y [d]
FMIN, FMINNM       | D = min(X, Y) [m]
FMINP, FMINNMP     | D = min(A, B) [m] [p]
FMINV, FMINNMV     | D0 = min(X0, X1, …, Xn-1) [m] [v]
FMAX, FMAXNM       | D = max(X, Y) [m]
FMAXP, FMAXNMP     | D = max(A, B) [m] [p]
FMAXV, FMAXNMV     | D0 = max(X0, X1, …, Xn-1) [m] [v]
FCMEQ [z]          | D = (X == Y) ? ones_mask : 0
FCMGE [z]          | D = (X >= Y) ? ones_mask : 0
FCMGT [z]          | D = (X > Y) ? ones_mask : 0
FACGE              | D = (abs(X) >= abs(Y)) ? ones_mask : 0
FACGT              | D = (abs(X) > abs(Y)) ? ones_mask : 0
[a] Clears fraction bits, then adds one to exponent if zero, then bitwise inverse of exponent bits. Can be used with FMULX as part of vector normalisation.
[b] Can broadcast a single lane of Y to all lanes of Y.
[c] Useful as part of a Newton-Raphson step where successive approximations to a^-1 are computed as x[n+1] = x[n] * (2.0 - a * x[n]). See FRECPE, and the sketch after these notes.
[d] Useful as part of a Newton-Raphson step where successive approximations to a^-0.5 are computed as x[n+1] = x[n] * (1.5 - 0.5 * a * x[n] * x[n]). See FRSQRTE.
[m] Note that min/max are not quite equivalent to comparison followed by selection, due to signed zeros and NaNs. The NM variants return the non-NaN operand if exactly one operand is NaN.
[p] Ai is concat(X, Y)[2*i+0], Bi is concat(X, Y)[2*i+1], i.e. adjacent pairs.
[s] Scalar form only, no vector form.
[v] Low lane of D gets min/max of all lanes of X, rest of D cleared.
[z] Operands can be registers or constant zero (at least logically).
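
As a minimal sketch of the Newton-Raphson refinement described in notes [c] and [d] (ACLE intrinsics, illustrative function name; real code would usually apply the step twice for close to full fp32 precision):

#include <arm_neon.h>

// Approximate 1/a: FRECPE gives an ~8-bit estimate, then one FRECPS-based
// Newton-Raphson step (x * (2 - a*x)) roughly doubles the number of good bits.
float32x4_t approx_reciprocal(float32x4_t a) {
    float32x4_t x = vrecpeq_f32(a);
    x = vmulq_f32(x, vrecpsq_f32(a, x));
    return x;
}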

Various per-lane rounding instructions with floating-point inputs and outputs (see "Data type conversion, float to integer" for integer outputs):

         | Rounding                        | Range [r]      | Exceptions
FRINT32X | Mode from FPCR                  | -2^31 … 2^31-1 | Inexact, InvalidOp
FRINT32Z | Toward zero (truncate)          | -2^31 … 2^31-1 | Inexact, InvalidOp
FRINT64X | Mode from FPCR                  | -2^63 … 2^63-1 | Inexact, InvalidOp
FRINT64Z | Toward zero (truncate)          | -2^63 … 2^63-1 | Inexact, InvalidOp
FRINTA   | To nearest, ties away from zero | Unbounded      |
FRINTI   | Mode from FPCR                  | Unbounded      |
FRINTM   | Toward minus infinity (floor)   | Unbounded      |
FRINTN   | To nearest, ties toward even    | Unbounded      |
FRINTP   | Toward positive infinity (ceil) | Unbounded      |
FRINTX   | Mode from FPCR                  | Unbounded      | Inexact
FRINTZ   | Toward zero (truncate)          | Unbounded      |
[r] Out of range results (in either direction) replaced by -2^31 or -2^63.

Mixed-width operations and dense linear algebra:

                   | Behaviour                   | D type     | X/Y type
FMLAL, FMLSL [b]   | D[i] ±= X[i+0] * Y[i+0]     | fp32[4]    | fp16[8]
FMLAL2, FMLSL2 [b] | D[i] ±= X[i+4] * Y[i+4]     | fp32[4]    | fp16[8]
BFMLALB [b]        | D[i] += X[2*i+0] * Y[2*i+0] | fp32[4]    | bf16[8]
BFMLALT [b]        | D[i] += X[2*i+1] * Y[2*i+1] | fp32[4]    | bf16[8]
BFDOT [b]          | D[i] += dot(X[i], Y[i])     | fp32[4]    | bf16[4][2]
BFMMLA             | D += X @ Y^T                | fp32[2][2] | bf16[2][4]
[b] Can broadcast a single lane of Y to all lanes of Y (for BFDOT, a lane is 32 bits).

Float comparisons involving PSTATE

The FCMP, FCMPE family perform a three-way comparison of the low fp16/fp32/fp64 lane of two operands, writing the result to PSTATE:

Following FCMP X, Y, the meaning of condition codes is:

EQ | X0 == Y0                  | NE | !(X0 == Y0)
LS | X0 <= Y0                  | HI | !(X0 <= Y0)
LO | X0 < Y0                   | HS | !(X0 < Y0)
MI | X0 < Y0                   | PL | !(X0 < Y0)
CC | X0 < Y0                   | CS | !(X0 < Y0)
GT | X0 > Y0                   | LE | !(X0 > Y0)
GE | X0 >= Y0                  | LT | !(X0 >= Y0)
VS | is_nan(X0) or is_nan(Y0)  | VC | !is_nan(X0) and !is_nan(Y0)

The FCCMP, FCCMPE family perform a conditional three-way comparison of the low fp16/fp32/fp64 lane of two operands: some condition is evaluated against the contents of PSTATE; if true, the instruction behaves like FCMP/FCMPE; if false, a four-bit immediate is written to the relevant PSTATE bits.

The FCSEL instruction uses PSTATE to conditionally select between two scalar fp16/fp32/fp64 operands: D0 = cond ? X0 : Y0 (other lanes of D cleared).

Data type conversion, float to float

Vector form, FPCR rounding:

          | to bf16         | to fp16       | to fp32       | to fp64
From bf16 | no change       | via fp32      | SSHLL, SSHLL2 | via fp32
From fp16 | via fp32        | no change     | FCVTL, FCVTL2 | via fp32
From fp32 | BFCVTN, BFCVTN2 | FCVTN, FCVTN2 | no change     | FCVTL, FCVTL2
From fp64 | via fp32 [x]    | via fp32 [x]  | FCVTN, FCVTN2 | no change
[x] Using FCVTXN or FCVTXN2, which employ round-to-odd rounding mode.

For scalar conversions, FCVT can convert from any of fp16/fp32/fp64 to any other of fp16/fp32/fp64.

Data type conversion, integer to float

Vector form, FPCR rounding, free division by a power of two afterwards:

                | to fp16        | to fp32        | to fp64
From s16 or u16 | SCVTF or UCVTF | via s32 or u32 | via s64 or u64
From s32 or u32 | via fp32       | SCVTF or UCVTF | via s64 or u64
From s64 or u64 | via fp64       | via fp64       | SCVTF or UCVTF

SCVTF and UCVTF can also take a GPR as input (32-bit or 64-bit, signed or unsigned), and convert that to any of fp16/fp32/fp64, again with a free division by a power of two afterwards.

Data type conversion, float to integer

A family of conversion instructions exists, available both in vector form and in a scalar form whose destination can be a vector register or a GPR:

                   | Rounding                        | Overflow
FCVTAS, FCVTAU     | To nearest, ties away from zero | Saturate
FCVTMS, FCVTMU     | Toward minus infinity (floor)   | Saturate
FCVTNS, FCVTNU     | To nearest, ties toward even    | Saturate
FCVTPS, FCVTPU     | Toward positive infinity (ceil) | Saturate
FCVTZS, FCVTZU [f] | Toward zero (truncate)          | Saturate
FJCVTZS [j]        | Toward zero (truncate)          | Modulo 2^32
[f] Free multiplication by a power of two possible before the conversion.
[j] Only exists in fp64 to s32 GPR form. Also sets PSTATE.
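
Footnote [f] can be exercised from C via the fractional-bits form of the conversion intrinsics; a minimal sketch (the Q24.8 format and the function name are just illustrative choices):

#include <arm_neon.h>

// FCVTZS with 8 fractional bits: multiply by 2^8, truncate toward zero,
// and saturate on overflow, all in one instruction.
int32x4_t fp32_to_q24_8(float32x4_t x) {
    return vcvtq_n_s32_f32(x, 8);
}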

Complex float math

A pair of floating point lanes can represent a complex floating point number, where the low scalar lane contains the real part of the complex number and the high scalar lane contains the imaginary part of the complex number. A 128-bit register can then contain 4 fp16 complex lanes, or 2 fp32 complex lanes, or a single fp64 complex lane. A few instructions exist for manipulating these:

               | Real part of result    | Imaginary part of result
FCADD #90      | Re(D) = Re(X) - Im(Y)  | Im(D) = Im(X) + Re(Y)
FCADD #270     | Re(D) = Re(X) + Im(Y)  | Im(D) = Im(X) - Re(Y)
FCMLA #0 [b]   | Re(D) += Re(X) * Re(Y) | Im(D) += Re(X) * Im(Y)
FCMLA #90 [b]  | Re(D) -= Im(X) * Im(Y) | Im(D) += Im(X) * Re(Y)
FCMLA #180 [b] | Re(D) -= Re(X) * Re(Y) | Im(D) -= Re(X) * Im(Y)
FCMLA #270 [b] | Re(D) += Im(X) * Im(Y) | Im(D) -= Im(X) * Re(Y)
[b] Can broadcast a complex lane (i.e. 2 scalars) of Y to all complex lanes of Y.
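
For example, a full complex multiply-accumulate D += X * Y can be built from the #0 and #90 rotations. A minimal sketch using the ACLE fcma intrinsics (so it assumes the fcma extension; the function name is made up):

#include <arm_neon.h>

// Two fp32 complex lanes per vector (real in the even scalar lanes, imaginary
// in the odd ones). FCMLA #0 contributes Re*Re and Re*Im, FCMLA #90 contributes
// -Im*Im and Im*Re, which together give d += x * y in complex arithmetic.
float32x4_t complex_mla(float32x4_t d, float32x4_t x, float32x4_t y) {
    d = vcmlaq_f32(d, x, y);
    d = vcmlaq_rot90_f32(d, x, y);
    return d;
}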

Polynomial math

The PMUL, PMULL, and PMULL2 instructions all perform D = X * Y, where all lanes of D/X/Y contain ℤ2 polynomials. This is alternatively known as carryless multiplication (pclmulqdq on x86).

                  | D lanes                                      | X/Y lanes
PMUL              | 8-bit poly (high 7 bits of result discarded) | 8-bit poly
PMULL, PMULL2     | 16-bit poly (top bit always clear)           | 8-bit poly
PMULL, PMULL2 [c] | 128-bit poly (top bit always clear)          | 64-bit poly
[c] Requires crypto extension.

For ℤ2 polynomial addition/subtraction, see EOR or EOR3. Polynomial division and remainder against a constant Y can be performed via multiplication.
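
As a minimal sketch (assuming the crypto extension; the function name is made up), the 64-bit-lane form of PMULL is the usual building block for CRC-style code:

#include <arm_neon.h>
#include <stdint.h>

// PMULL: carryless multiply of two 64-bit polynomials, giving a 128-bit
// polynomial (whose top bit is always clear).
uint64x2_t clmul_u64(uint64_t a, uint64_t b) {
    poly128_t product = vmull_p64((poly64_t)a, (poly64_t)b);
    return vreinterpretq_u64_p128(product);
}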

Cryptography

Some instructions are provided to accelerate AES encryption. A single round of AES encryption consists of AddRoundKey (just xor), then SubBytes and ShiftRows (in either order), then optionally MixColumns (performed for every round except the last). The provided instructions are:

       | Steps
AESE   | AddRoundKey then ShiftRows and SubBytes
AESMC  | MixColumns
AESD   | Inverse AddRoundKey then inverse ShiftRows and inverse SubBytes
AESIMC | Inverse MixColumns

Note that x86 AES instructions are slightly different, for example aesenc there does ShiftRows and SubBytes, then MixColumns, then AddRoundKey.

Some instructions are provided to accelerate SHA-1 hashes: SHA1C, SHA1H, SHA1M, SHA1P, SHA1SU0, SHA1SU1.

Some instructions are provided to accelerate SHA-2 hashes: SHA256H, SHA256H2, SHA256SU0, SHA256SU1 for SHA-256, and SHA512H, SHA512H2, SHA512SU0, SHA512SU1 for SHA-512.

Some instructions are provided to accelerate SHA-3 hashes: EOR3, RAX1, XAR, BCAX. See "Shifts" or "Bitwise" for descriptions.

Some instructions are provided to accelerate SM3 hashes: SM3SS1, SM3TT1A, SM3TT1B, SM3TT2A, SM3TT2B, SM3PARTW1, SM3PARTW2.

Some instructions are provided to accelerate SM4 encryption: SM4E, SM4EKEY.

The many ways of converting FP32 to FP16

An IEEE 754 single-precision 32-bit binary float (henceforth FP32) can store:

- A sign: + or -.
- 23 mantissa bits m1 through m23, representing the significand 1.m1m2…m23 in binary.
- An exponent e in the range -126 ≤ e ≤ 127.
- Alternatively, one of the special values: ±0, a denormal, ±infinity, or NaN.

An IEEE 754 half-precision 16-bit binary float (henceforth FP16) can store:

- A sign: + or -.
- 10 mantissa bits m1 through m10, representing the significand 1.m1m2…m10 in binary.
- An exponent e in the range -14 ≤ e ≤ 15.
- Alternatively, one of the special values: ±0, a denormal, ±infinity, or NaN.

Converting from FP32 to FP16 thus involves getting rid of m11 through m23, and dealing with the reduced range of e. If e starts in the range -14 ≤ e ≤ 15 and m11 through m23 are all 0, then this conversion is trivial. If any of m11 through m23 are 1 then a rounding decision has to be made, typical options for which include:

- Round to nearest, ties toward even.
- Round to nearest, ties away from zero.
- Round toward zero (truncate).
- Round toward +∞ or toward -∞.

If e starts in the range e < -14 (i.e. underflow), then the resultant FP16 has to end up with e = -14, as it cannot go any smaller. Some number of mi will be discarded, and so a rounding decision again has to be made. A fun bonus decision crops up when rounding toward ±∞ and the starting FP32 is denormal: should denormals be treated as zero? (for other rounding modes, denormal inputs will naturally round to zero) Regardless of rounding mode, if the resultant FP16 is denormal, then there's the decision of whether to flush to zero.

If e starts in the range 15 < e (i.e. overflow), then there's no particularly great FP16 to convert to, with the options being the appropriately-signed infinity or the largest-magnitude finite FP16 value (±65504).

Choosing between infinity and 65504 is yet again a rounding decision, with the unintuitive quirk that 65520 is considered to be nearer to infinity than to 65504.

If the starting FP32 is NaN, then there's a choice about what to do with the sign bit (either preserve it, or force it to a particular value), and again a choice about what to do with the payload bits (either preserve some of them, or force them to a particular canonical pattern).

All said, there are clearly lots of decision points in the conversion process, so it is no surprise that different implementations make different choices. We'll look at a bunch of software and hardware implementations, and the choices that they make (as they don't always advertise their decisions).

The summary of implementations and most important choices is:

Implementation       | Rounding                         | NaNs
numpy                | To nearest, ties toward even     | Preserve (1)
CPython              | To nearest, ties toward even (†) | ±Canonical
James Tursa (Matlab) | To nearest, ties away from zero  | -Canonical
Fabian "ryg" Giesen  | To nearest, details vary         | ±Canonical
Maratyszcza          | To nearest, ties toward even     | ±Canonical
VCVTPS2PH (x86)      | Configurable                     | Preserve (2)
FCVT / FCVTN (ARM)   | Configurable                     | Configurable
(†) Except that overflows throw an exception
(1) Top 10 bits of payload preserved, then LSB set to 1 if required
(2) Top bit of payload set to 1, then next 9 bits preserved

Jumping into the details, we start with numpy, which consists of very well-commented bit manipulations:

uint16_t numpy_floatbits_to_halfbits(uint32_t f) {
  uint16_t h_sgn = (uint16_t)((f & 0x80000000u) >> 16);
  uint32_t f_exp = f & 0x7f800000u;
  uint32_t f_sig = f & 0x007fffffu;

  // Exponent overflow/NaN converts to signed inf/NaN
  if (f_exp >= 0x47800000u) {
    if ((f_exp == 0x7f800000u) && (f_sig != 0)) {
      // NaN - propagate the flag in the significand...
      uint16_t ret = (uint16_t)(0x7c00u + (f_sig >> 13));
      ret += (ret == 0x7c00u); // ...but make sure it stays a NaN
      return h_sgn + ret;
    } else {
      // (overflow to) signed inf
      return (uint16_t)(h_sgn + 0x7c00u);
    }
  }

  // Exponent underflow converts to a subnormal half or signed zero
  if (f_exp <= 0x38000000u) {
    // Signed zeros, subnormal floats, and floats with small
    // exponents all convert to signed zero half-floats.
    if (f_exp < 0x33000000u) {
      return h_sgn;
    }
    // Make the subnormal significand
    f_exp >>= 23;
    f_sig += 0x00800000u;
    f_sig >>= (113 - f_exp);
    // Handle rounding by adding 1 to the bit beyond half precision
    //
    // If the last bit in the half significand is 0 (already even),
    // and the remaining bit pattern is 1000...0, then we do not add
    // one to the bit after the half significand. However, the
    // (113 - f_exp) shift can lose up to 11 bits, so the || checks
    // them in the original. In all other cases, we can just add one.
    if (((f_sig & 0x3fffu) != 0x1000u) || (f & 0x07ffu)) {
      f_sig += 0x1000u;
    }
    uint16_t h_sig = (uint16_t)(f_sig >> 13);
    // If the rounding causes a bit to spill into h_exp, it will
    // increment h_exp from zero to one and h_sig will be zero.
    // This is the correct result.
    return (uint16_t)(h_sgn + h_sig);
  }

  // Regular case with no overflow or underflow
  uint16_t h_exp = (uint16_t)((f_exp - 0x38000000u) >> 13);
  // Handle rounding by adding 1 to the bit beyond half precision
  //
  // If the last bit in the half significand is 0 (already even), and
  // the remaining bit pattern is 1000...0, then we do not add one to
  // the bit after the half significand. In all other cases, we do.
  if ((f_sig & 0x3fffu) != 0x1000u) {
      f_sig += 0x1000u;
  }
  uint16_t h_sig = (uint16_t)(f_sig >> 13);
  // If the rounding causes a bit to spill into h_exp, it will
  // increment h_exp by one and h_sig will be zero. This is the
  // correct result. h_exp may increment to 15, at greatest, in
  // which case the result overflows to a signed inf.
  return (uint16_t)(h_sgn + h_exp + h_sig);
}

CPython takes a very different approach, involving copysign and frexp:

int PyFloat_Pack2(double x) {
  uint16_t sign, bits;
  int e;

  if (x == 0.0) {
    sign = (copysign(1.0, x) == -1.0);
    e = 0;
    bits = 0;
  } else if (isinf(x)) {
    sign = (x < 0.0);
    e = 0x1f;
    bits = 0;
  } else if (isnan(x)) {
    // There are 2046 distinct half-precision NaNs (1022 signaling
    // and 1024 quiet), but there are only two quiet NaNs that don't
    // arise by quieting a signaling NaN; we get those by setting
    // the topmost bit of the fraction field and clearing all other
    // fraction bits. We choose the one with the appropriate sign.
    sign = (copysign(1.0, x) == -1.0);
    e = 0x1f;
    bits = 0x200;
  } else {
    sign = (x < 0.0);
    if (sign) {
      x = -x;
    }

    double f = frexp(x, &e);
    // Normalize f to be in the range [1.0, 2.0)
    f *= 2.0;
    e--;

    if (e >= 16) {
      goto Overflow;
    } else if (e < -25) {
      // |x| < 2**-25. Underflow to zero.
      f = 0.0;
      e = 0;
    } else if (e < -14) {
      // |x| < 2**-14. Gradual underflow
      f = ldexp(f, 14 + e);
      e = 0;
    } else /* if (!(e == 0 && f == 0.0)) */ {
      e += 15;
      f -= 1.0; // Get rid of leading 1
    }

    f *= 1024.0; // 2**10
    // Round to even
    bits = (uint16_t)f; // Note the truncation
    assert(bits < 1024);
    assert(e < 31);
    if ((f - bits > 0.5) || ((f - bits == 0.5) && (bits % 2 == 1))) {
      if (++bits == 1024) {
        // The carry propagated out of a string of 10 1 bits.
        bits = 0;
        if (++e == 31) {
          goto Overflow;
        }
      }
    }
  }

  return (sign << 15) | (e << 10) | bits;
Overflow:
  PyErr_SetString(PyExc_OverflowError,
                  "float too large to pack with e format");
  return -1;
}

The "Round to even" part of this CPython code was explicitly put in to match the numpy behaviour, however it diverges in its treatment of finite inputs that can't plausibly be represented by a finite FP16 - numpy will round them to infinity, whereas CPython will throw. This can be seen in the following example:

import numpy
import struct

def np_f2h(x):
  return int(numpy.array(x, 'u4').view('f4').astype('f2').view('u2'))

def py_f2h(x):
  f, = struct.unpack('f', struct.pack('I', x))
  return struct.unpack('H', struct.pack('e', f))[0]

print(hex(np_f2h(0x49800000))) # 0x7c00
print(hex(py_f2h(0x49800000))) # OverflowError

The other difference is in treatment of NaN values, where numpy tries to preserve them as much as possible, whereas CPython only preserves the sign bit and otherwise turns all NaN values into a canonical quiet NaN:

print(hex(np_f2h(0xffffffff))) # 0xffff
print(hex(py_f2h(0xffffffff))) # 0xfe00

A different lineage of code comes from James Tursa. It was originally written for Matlab and requires a MathWorks account to download, but the interesting part found its way to GitHub regardless, and is transcribed with slight modifications here:

uint16_t tursa_floatbits_to_halfbits(uint32_t x) {
  uint32_t xs = x & 0x80000000u; // Pick off sign bit
  uint32_t xe = x & 0x7f800000u; // Pick off exponent bits
  uint32_t xm = x & 0x007fffffu; // Pick off mantissa bits
  uint16_t hs = (uint16_t)(xs >> 16); // Sign bit
  if (xe == 0) { // Denormal will underflow, return a signed zero
    return hs;
  }
  if (xe == 0x7f800000u) { // Inf or NaN
    if (xm == 0) { // If mantissa is zero ...
      return (uint16_t)(hs | 0x7c00u); // Signed Inf
    } else {
      return (uint16_t)0xfe00u; // NaN, only 1st mantissa bit set
    }
  }
  // Normalized number
  int hes = ((int)(xe >> 23)) - 127 + 15; // Exponent unbias the single, then bias the halfp
  if (hes >= 0x1f) { // Overflow
    return (uint16_t)(hs | 0x7c00u); // Signed Inf
  } else if (hes <= 0) { // Underflow
    uint16_t hm;
    if ((14 - hes) > 24) { // Mantissa shifted all the way off & no rounding possibility
      hm = (uint16_t)0u; // Set mantissa to zero
    } else {
      xm |= 0x00800000u; // Add the hidden leading bit
      hm = (uint16_t)(xm >> (14 - hes)); // Mantissa
      if ((xm >> (13 - hes)) & 1u) { // Check for rounding
        hm += (uint16_t)1u; // Round, might overflow into exp bit, but this is OK
      }
    }
    return (hs | hm); // Combine sign bit and mantissa bits, biased exponent is zero
  } else {
    uint16_t he = (uint16_t)(hes << 10); // Exponent
    uint16_t hm = (uint16_t)(xm >> 13); // Mantissa
    if (xm & 0x1000u) { // Check for rounding
      return (hs | he | hm) + (uint16_t)1u; // Round, might overflow to inf, this is OK
    } else {
      return (hs | he | hm); // No rounding
    }
  }
}

Though it shares a lot in spirit with the numpy code, this code has different rounding behaviour to what we've seen before; ties are rounded away from zero rather than towards even. It also has yet another take on NaNs: all input NaNs get turned into a single output NaN, with not even the sign bit preserved.

The code from James Tursa was adopted into ISPC, meaning that ISPC had the same rounding and NaN behaviours. Then came along Fabian "ryg" Giesen, deploying his usual optimisation skills, and also fixing the rounding behaviour (back to ties toward even) and the NaN behaviour (preserve sign bit) at the same time. Several variants were written along that journey (and then adopted back by ISPC), but I'll choose this one to showcase:

uint16_t float_to_half_fast3_rtne(uint32_t x) {
  uint32_t x_sgn = x & 0x80000000u;
  x ^= x_sgn;

  uint16_t o;
  if (x >= 0x47800000u) { // result is Inf or NaN
    o = (x > 0x7f800000u) ? 0x7e00  // NaN->qNaN
                          : 0x7c00; // and Inf->Inf
  } else { // (De)normalized number or zero
    if (x < 0x38800000u) { // resulting FP16 is subnormal or zero
      // use a magic value to align our 10 mantissa bits at
      // the bottom of the float. as long as FP addition
      // is round-to-nearest-even this just works.
      union { uint32_t u; float f; } f, denorm_magic;
      f.u = x;
      denorm_magic.u = ((127 - 14) + (23 - 10)) << 23;

      f.f += denorm_magic.f;

      // and one integer subtract of the bias later, we have our
      // final float!
      o = f.u - denorm_magic.u;
    } else {
      uint32_t mant_odd = (x >> 13) & 1; // resulting mantissa is odd

      // update exponent, rounding bias part 1
      x += ((15 - 127) << 23) + 0xfff;
      // rounding bias part 2
      x += mant_odd;
      // take the bits!
      o = x >> 13;
    }
  }
  return (x_sgn >> 16) | o;
}

The interesting part of the above is how it handles the e < -14 case. Both the numpy code and the code from James Tursa handled this case with a variable distance bitshift and then some fiddly rounding logic, whereas the above makes the observation that there's existing hardware in a contemporary CPU for doing shifting and fiddly rounding: the floating point addition circuit. The first step of a floating point addition circuit is a variable distance bitshift of the smaller operand, with the aim of making its exponent equal to that of the larger operand. The above engineers the larger operand to have e = -1, which is what we want, as FP16 bottoms out at e = -14 and there are 13 (23 - 10) extra mantissa bits in FP32 compared to FP16. Any bits that get shifted out as part of exponent equalisation will contribute to rounding, which is exactly what we want.

Taking the idea further, we have this gem from Maratyszcza:

uint16_t fp16_ieee_from_fp32_value(uint32_t x) {
  uint32_t x_sgn = x & 0x80000000u;
  uint32_t x_exp = x & 0x7f800000u;
  x_exp = (x_exp < 0x38800000u) ? 0x38800000u : x_exp; // max(e, -14)
  x_exp += 15u << 23; // e += 15
  x &= 0x7fffffffu; // Discard sign

  union { uint32_t u; float f; } f, magic;
  f.u = x;
  magic.u = x_exp;

  // If 15 < e then inf, otherwise e += 2
  f.f = (f.f * 0x1.0p+112f) * 0x1.0p-110f;

  // If we were in the x_exp >= 0x38800000u case:
  // Replace f's exponent with that of x_exp, and discard 13 bits of
  // f's significand, leaving the remaining bits in the low bits of
  // f, relying on FP addition being round-to-nearest-even. If f's
  // significand was previously `a.bcdefghijk`, then it'll become
  // `1.000000000000abcdefghijk`, from which `bcdefghijk` will become
  // the FP16 mantissa, and `a` will add onto the exponent. Note that
  // rounding might add one to all this.
  // If we were in the x_exp < 0x38800000u case:
  // Replace f's exponent with the minimum FP16 exponent, discarding
  // however many bits are required to make that happen, leaving
  // whatever is left in the low bits.
  f.f += magic.f;

  uint32_t h_exp = (f.u >> 13) & 0x7c00u; // low 5 bits of exponent
  uint32_t h_sig = f.u & 0x0fffu; // low 12 bits (10 are mantissa)
  h_sig = (x > 0x7f800000u) ? 0x0200u : h_sig; // any NaN -> qNaN
  return (x_sgn >> 16) + h_exp + h_sig;
}

The 15 < e case is elegantly folded into the infinity case by using the floating point multiply circuit and the constant 0x1.0p+112f. After that, the x_exp < 0x38800000u case plays out similarly to ryg's code: add a fixed constant to cause the result to appear in the low bits. The x_exp >= 0x38800000u case uses a dynamic magic number rather than a constant magic number, with the aim of shifting off exactly 13 bits, getting the rounding done by the floating point addition circuit, and again having the result appear in the low bits. The final trickery is in computing the FP16 exponent bits; starting with the 8 bits of FP32 exponent, the top 3 bits are discarded, and modulo-32 arithmetic is done on the low 5 bits. Modulo 32, the bias adjustment is +16 ((15-127) % 32), which gets done in two parts: a fixed +15, and then either +0 or +1 depending on whether the result is denormal or not.

Sadly, at some point, new hardware makes software tricks obsolete. On x86, said new hardware was the F16C instruction set, which (amongst other things) adds a VCVTPS2PH instruction for converting (a vector of) FP32 to FP16. Four different rounding modes are available: round to nearest with ties toward even, round toward +∞, round toward -∞, round toward zero. Note that round to nearest with ties away from zero isn't an option. An immediate operand specifies the rounding mode, or it can come from MXCSR.RC. NaN signs are preserved, and NaN payloads are mostly preserved: the MSB of the input payload is forced to 1, and then the top 10 bits of the input payload become the output payload. When rounding toward +∞ or -∞, MXCSR.DAZ causes input denormals to be treated as zero. On the other hand, MXCSR.FTZ is ignored, so denormal results are never flushed to zero. Code to exercise this instruction looks incredibly boring:

uint16_t cvtps2ph(uint32_t x) {
    __m128 xmm = _mm_castsi128_ps(_mm_cvtsi32_si128(x));
    return _mm_extract_epi16(_mm_cvtps_ph(xmm, 0), 0);
}

On ARMv8, we have FCVT (scalar) and FCVTN / FCVTN2 (vector), which are capable of performing FP32 to FP16 conversion. FPCR controls the rounding mode, with the same options as on x86. FPCR also controls denormal flushing, of both inputs and outputs. Finally, FPCR.DN controls whether NaNs are canonicalised (without even the sign bit preserved), or whether they have the same preservation behaviour as x86.
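
A minimal sketch of driving the vector form from C (the function name is made up; rounding and denormal flushing follow whatever FPCR currently holds):

#include <arm_neon.h>

// FCVTN: narrow four fp32 lanes to four fp16 lanes, rounding per FPCR.
float16x4_t narrow_to_fp16(float32x4_t x) {
    return vcvt_f16_f32(x);
}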

Optimising SDI CRC, part 2

Yesterday, we looked at optimising an SDI CRC computation. After a bunch of mathematical optimisations, we reached this point:

uint64_t pack60(const uint16_t* data) {
    return (((uint64_t)data[ 0]) <<  4) ^
           (((uint64_t)data[ 2]) << 14) ^
           (((uint64_t)data[ 4]) << 24) ^
           (((uint64_t)data[ 6]) << 34) ^
           (((uint64_t)data[ 8]) << 44) ^
           (((uint64_t)data[10]) << 54);
}

__m128i pack120(const uint16_t* data) {
    return _mm_set_epi64x(pack60(data + 12) >> 4, pack60(data));
}

__m128i xor_clmul(__m128i a, __m128i b) {
    return _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x00),
                         _mm_clmulepi64_si128(a, b, 0x11));
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        { // *= x^120 semi-mod P
            __m128i k = _mm_setr_epi32(
                0, 0x4b334000 /* x^120+64-1 mod P */,
                0, 0x96d30000 /* x^120-1    mod P */);
            c = xor_clmul(c, k);
            y = xor_clmul(y, k);
        }
        { // +=
            c = _mm_xor_si128(c, pack120(data + i));
            y = _mm_xor_si128(y, pack120(data + i + 1));
        }
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}

This ran at approximately 11 bits per cycle:

Version              | Broadwell | Skylake | Ice Lake
_mm_clmulepi64_si128 | 10.8      | 11.0    | 12.8

From here, the question is how to best implement pack120. Compared to yesterday, there is a very different approach based around _mm_shuffle_epi8:

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i d1 = _mm_loadu_si128((__m128i*)(data + i));
            __m128i d2 = _mm_loadu_si128((__m128i*)(data + i + 8));
            __m128i d3 = _mm_loadu_si128((__m128i*)(data + i + 16));
            __m128i k1 = _mm_setr_epi16(16, 16, 64, 64, 1, 1, 4, 4);
            __m128i k2 = _mm_setr_epi16(16,  0, 64,  0, 1, 0, 4, 0);
            __m128i k3 = _mm_setr_epi16( 0, 16,  0, 64, 0, 1, 0, 4);
            d1 = _mm_mullo_epi16(k1, d1);
            __m128i k4=_mm_setr_epi8( 0, 1, -1,  8,  9, -1, -1, -1,
                                      2, 3, -1, 10, 11, -1, -1, -1);
            __m128i k5=_mm_setr_epi8(-1, 4,  5, -1, 12, 13, -1, -1,
                                     -1, 6,  7, -1, 14, 15, -1, -1);
            d1 = _mm_shuffle_epi8(d1, k4) ^ _mm_shuffle_epi8(d1, k5);
            __m128i d1m; // Force a store to memory
            asm("vmovdqa %1, %0" : "=m"(d1m) : "x"(d1) : "memory");
            __m128i cd = _mm_packus_epi32(_mm_madd_epi16(d2, k2),
                                          _mm_madd_epi16(d3, k2));
            __m128i yd = _mm_packus_epi32(_mm_madd_epi16(d2, k3),
                                          _mm_madd_epi16(d3, k3));
            __m128i k6=_mm_setr_epi8(-1, -1, -1, -1, -1,  0,  1, -1,
                                      4,  5,  8,  9, -1, 12, 13, -1);
            __m128i k7=_mm_setr_epi8(-1, -1, -1, -1, -1, -1,  2,  3,
                                     -1,  6,  7, 10, 11, -1, 14, 15);
            cd = _mm_shuffle_epi8(cd, k6) ^ _mm_shuffle_epi8(cd, k7)
               ^ _mm_loadu_si64(&d1m);
            yd = _mm_shuffle_epi8(yd, k6) ^ _mm_shuffle_epi8(yd, k7)
               ^ _mm_loadu_si64(8 + (char*)&d1m);
            c = _mm_xor_si128(c, cd);
            y = _mm_xor_si128(y, yd);
        }
...
}

This is very impressive, getting us to the region of 20 bits per cycle:

Version          | Broadwell | Skylake | Ice Lake
_mm_shuffle_epi8 | 20.2      | 19.7    | 23.4

If targeting AVX2, then some pairs of operations can be collapsed:

__m256i broadcast128(const uint16_t* data) {
    __m256i result;
    asm("vbroadcasti128 %1, %0" : "=x"(result) :
                                   "m"(*(const __m128i*)data));
    return result;
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i d1 = _mm_loadu_si128((__m128i*)(data + i));
            __m256i d2 = broadcast128(data + i + 8);
            __m256i d3 = broadcast128(data + i + 16);
            __m128i k1 = _mm_setr_epi16(
                16, 16, 64, 64, 1, 1, 4, 4);
            __m256i k23 = _mm256_setr_epi16(
                16,  0, 64,  0, 1, 0, 4, 0,
                 0, 16,  0, 64, 0, 1, 0, 4);
            d1 = _mm_mullo_epi16(k1, d1);
            __m128i k4 = _mm_setr_epi8(
                0,  1, -1,  8,  9, -1, -1, -1,
                2,  3, -1, 10, 11, -1, -1, -1);
            __m128i k5 = _mm_setr_epi8(
                -1, 4,  5, -1, 12, 13, -1, -1,
                -1, 6,  7, -1, 14, 15, -1, -1);
            d1 = _mm_shuffle_epi8(d1, k4) ^ _mm_shuffle_epi8(d1, k5);
            __m128i d1m; // Force a store to memory
            asm("vmovdqa %1, %0" : "=m"(d1m) : "x"(d1) : "memory");
            __m256i cdyd = _mm256_packus_epi32(
                _mm256_madd_epi16(d2, k23),
                _mm256_madd_epi16(d3, k23));
            __m256i k6 = _mm256_setr_epi8(
                -1, -1, -1, -1, -1,  0,  1, -1,
                 4,  5,  8,  9, -1, 12, 13, -1,
                -1, -1, -1, -1, -1,  0,  1, -1, 
                 4,  5,  8,  9, -1, 12, 13, -1);
            __m256i k7 = _mm256_setr_epi8(
                -1, -1, -1, -1, -1, -1,  2,  3,
                -1,  6,  7, 10, 11, -1, 14, 15,
                -1, -1, -1, -1, -1, -1,  2,  3,
                -1,  6,  7, 10, 11, -1, 14, 15);
            cdyd = _mm256_shuffle_epi8(cdyd, k6)
                 ^ _mm256_shuffle_epi8(cdyd, k7);
            __m256i m; // Force a store to memory
            asm("vmovdqa %1, %0" : "=m"(m) : "x"(cdyd) : "memory");
            __m128i cd = _mm256_castsi256_si128(cdyd)
                       ^ _mm_loadu_si64(&d1m);
            __m128i yd = _mm_loadu_si128(1 + (__m128i*)&m)
                       ^ _mm_loadu_si64(8 + (char*)&d1m);
            c = _mm_xor_si128(c, cd);
            y = _mm_xor_si128(y, yd);
        }
...
}

This gains us a few bits per cycle on all three processors:

Version | Broadwell | Skylake | Ice Lake
AVX2    | 22.3      | 23.2    | 25.4

If targeting AVX512, then the xor3 trick from yesterday can be applied on top:

__m128i xor3(__m128i a, __m128i b, __m128i c) {
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
    for (size_t i = 0; i < n; i += 24) {
        // *= x^120 semi-mod P
        // +=
        __m128i k = _mm_setr_epi32(
            0, 0x4b334000 /* x^120+64-1 mod P */,
            0, 0x96d30000 /* x^120-1    mod P */);
        __m128i d1 = _mm_loadu_si128((__m128i*)(data + i));
        __m256i d2 = broadcast128(data + i + 8);
        __m256i d3 = broadcast128(data + i + 16);
        __m128i k1 = _mm_setr_epi16(
            16, 16, 64, 64, 1, 1, 4, 4);
        __m256i k23 = _mm256_setr_epi16(
            16,  0, 64,  0, 1, 0, 4, 0,
             0, 16,  0, 64, 0, 1, 0, 4);
        d1 = _mm_mullo_epi16(k1, d1);
        __m128i k4 = _mm_setr_epi8(
            0,  1, -1,  8,  9, -1, -1, -1,
            2,  3, -1, 10, 11, -1, -1, -1);
        __m128i k5 = _mm_setr_epi8(
            -1, 4,  5, -1, 12, 13, -1, -1,
            -1, 6,  7, -1, 14, 15, -1, -1);
        d1 = _mm_shuffle_epi8(d1, k4) ^ _mm_shuffle_epi8(d1, k5);
        __m128i d1m; // Force a store to memory
        asm("vmovdqa %1, %0" : "=m"(d1m) : "x"(d1) : "memory");
        __m256i cdyd = _mm256_packus_epi32(
            _mm256_madd_epi16(d2, k23),
            _mm256_madd_epi16(d3, k23));
        __m256i k6 = _mm256_setr_epi8(
            -1, -1, -1, -1, -1,  0,  1, -1,
             4,  5,  8,  9, -1, 12, 13, -1,
            -1, -1, -1, -1, -1,  0,  1, -1, 
             4,  5,  8,  9, -1, 12, 13, -1);
        __m256i k7 = _mm256_setr_epi8(
            -1, -1, -1, -1, -1, -1,  2,  3,
            -1,  6,  7, 10, 11, -1, 14, 15,
            -1, -1, -1, -1, -1, -1,  2,  3,
            -1,  6,  7, 10, 11, -1, 14, 15);
        cdyd = _mm256_shuffle_epi8(cdyd, k6)
             ^ _mm256_shuffle_epi8(cdyd, k7);
        __m256i m; // Force a store to memory
        asm("vmovdqa %1, %0" : "=m"(m) : "x"(cdyd) : "memory");
        __m128i cd = _mm256_castsi256_si128(cdyd)
                   ^ _mm_loadu_si64(&d1m);
        __m128i yd = _mm_loadu_si128(1 + (__m128i*)&m)
                   ^ _mm_loadu_si64(8 + (char*)&d1m);
        c = xor3(_mm_clmulepi64_si128(c, k, 0x00),
                 _mm_clmulepi64_si128(c, k, 0x11), cd);
        y = xor3(_mm_clmulepi64_si128(y, k, 0x00),
                 _mm_clmulepi64_si128(y, k, 0x11), yd);
    }
...
}

On applicable processors, this gains another few bits per cycle:

Version | Broadwell | Skylake | Ice Lake
AVX512  | N/A       | 25.8    | 28.9

Optimising SDI CRC

An SDI data stream consists of 10-bit data. In hardware, this is not a problem. In contemporary software, 10-bit values are slightly awkward, and are generally represented in the low 10 bits of a uint16_t. SDI has two separate streams of 10-bit data (called c and y), which software might choose to represent in an interleaved fashion. Finally, each stream has an 18-bit CRC computed over it (polynomial x^18 + x^5 + x^4 + x^0), where the 18-bit value might be represented in the low 18 bits of a uint32_t. Given all this, the CRCs could be computed using the following code:

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    uint32_t c = crcs[0];
    uint32_t y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c ^= data[i];
        y ^= data[i+1];
        for (int k = 0; k < 10; k++) {
            c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
            y = y & 1 ? (y >> 1) ^ 0x23000 : y >> 1;
        }
    }
    crcs[0] = c;
    crcs[1] = y;
}

The above code is short, but it isn't fast: on my three test systems, it manages approximately 0.5 bits per cycle. Expressed differently, that is 40 cycles per iteration of the outer loop (which is ingesting 20 bits per iteration).

Version        | Broadwell | Skylake | Ice Lake
Starting point | 0.536     | 0.528   | 0.477

At the cost of a 2KiB table, this can be improved to approximately 2.1 bits per cycle, which is better, but still not great:

uint16_t table[1u << 10];

void make_table() {
    for (uint32_t idx = 0; idx < (1u << 10); idx++) {
        uint32_t c = idx;
        for (int k = 0; k < 10; k++) {
            c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
        }
        table[idx] = (c >> 2);
    }
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    uint32_t c = crcs[0];
    uint32_t y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c ^= data[i];
        y ^= data[i+1];
        c = (table[c & 0x3ff] << 2) ^ (c >> 10);
        y = (table[y & 0x3ff] << 2) ^ (y >> 10);
    }
    crcs[0] = c;
    crcs[1] = y;
}

Version    | Broadwell | Skylake | Ice Lake
2KiB table | 2.10      | 2.15    | 2.05

To do better, some mathematical tricks need to be pulled. The polynomial ring ℤ2[x] is relevant, along with an isomorphism between bit strings and said polynomials. The original code can be re-expressed using some notation borrowed from polynomials:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0];
    polynomial y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c += data[i]   * x^8;
        y += data[i+1] * x^8;
        for (int k = 0; k < 10; k++) {
            c = (c * x^1) mod P; // P is x^18 + x^5 + x^4 + x^0
            y = (y * x^1) mod P;
        }
    }
    crcs[0] = c;
    crcs[1] = y;
}

The first mathematical trick is eliminating the inner loop:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0];
    polynomial y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c = (c * x^10 + data[i]   * x^18) mod P;
        y = (y * x^10 + data[i+1] * x^18) mod P;
    }
    crcs[0] = c;
    crcs[1] = y;
}

The next mathematical trick is pulling * x^18 and mod P out of the loop:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-18;
    polynomial y = crcs[1] * x^-18;
    for (size_t i = 0; i < n; i += 2) {
        c = c * x^10 + data[i];
        y = y * x^10 + data[i+1];
    }
    crcs[0] = (c * x^18) mod P;
    crcs[1] = (y * x^18) mod P;
}

At this point, a step back towards practical implementation is required, which takes the form of unrolling the loop 12 times:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-18;
    polynomial y = crcs[1] * x^-18;
    for (size_t i = 0; i < n; i += 24) {
        c = c * x^120 + pack120(data + i);
        y = y * x^120 + pack120(data + i + 1);
    }
    crcs[0] = (c * x^18) mod P;
    crcs[1] = (y * x^18) mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^60 + pack60(data + 12);
}

polynomial pack60(const polynomial* data) {
    polynomial result = 0;
    for (size_t i = 0; i < 12; i += 2) {
        result = result * x^10 + data[i];
    }
    return result;
}

The * x^60 in the middle of pack120 is slightly awkward, and would be more convenient as * x^64. This can be achieved by shuffling a few things around:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-14;
    polynomial y = crcs[1] * x^-14;
    for (size_t i = 0; i < n; i += 24) {
        c = c * x^120 + pack120(data + i);
        y = y * x^120 + pack120(data + i + 1);
    }
    crcs[0] = (c * x^14) mod P;
    crcs[1] = (y * x^14) mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^64 + pack60(data + 12) * x^4;
}

polynomial pack60(const polynomial* data) {
    polynomial result = 0;
    for (size_t i = 0; i < 12; i += 2) {
        result = result * x^10 + data[i];
    }
    return result;
}

To continue back towards a practical implementation, the size of each polynomial variable needs to be bounded. For that, I'll introduce the semi-mod operator, where for a polynomial F, (F * x^n) semi-mod P is defined as (F mod x^64) * (x^n mod P) + (F div x^64) * (x^(n+64) mod P). This is then tactically deployed in a few places:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = (crcs[0] * x^-14) semi-mod P;
    polynomial y = (crcs[1] * x^-14) semi-mod P;
    for (size_t i = 0; i < n; i += 24) {
        c = (c * x^120) semi-mod P;
        y = (y * x^120) semi-mod P;
        c = c + pack120(data + i);
        y = y + pack120(data + i + 1);
    }
    c = (c * x^14) semi-mod P;
    y = (y * x^14) semi-mod P;
    crcs[0] = c mod P;
    crcs[1] = y mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^64 + pack60(data + 12) * x^4;
}

polynomial pack60(const polynomial* data) {
    polynomial result = 0;
    for (size_t i = 0; i < 12; i += 2) {
        result = result * x^10 + data[i];
    }
    return result;
}

The next transformation is unrolling pack60:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = (crcs[0] * x^-14) semi-mod P;
    polynomial y = (crcs[1] * x^-14) semi-mod P;
    for (size_t i = 0; i < n; i += 24) {
        c = (c * x^120) semi-mod P;
        y = (y * x^120) semi-mod P;
        c = c + pack120(data + i);
        y = y + pack120(data + i + 1);
    }
    c = (c * x^14) semi-mod P;
    y = (y * x^14) semi-mod P;
    crcs[0] = c mod P;
    crcs[1] = y mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^64 + pack60(data + 12) * x^4;
}

polynomial pack60(const polynomial* data) {
    return data[ 0] * x^50 + data[ 2] * x^40 + data[ 4] * x^30
         + data[ 6] * x^20 + data[ 8] * x^10 + data[10];
}

Moving back toward the practical realm, pack60 can return a 64-bit value. This value will be a polynomial in reversed bit order, meaning that the MSB represents x^0 and the LSB represents x^63. pack60 thus becomes:

uint64_t pack60(const uint16_t* data) {
    return (((uint64_t)data[ 0]) <<  4) ^
           (((uint64_t)data[ 2]) << 14) ^
           (((uint64_t)data[ 4]) << 24) ^
           (((uint64_t)data[ 6]) << 34) ^
           (((uint64_t)data[ 8]) << 44) ^
           (((uint64_t)data[10]) << 54);
}

Next up is pack120, which can return a 128-bit value. Again, this value will be a polynomial in reversed bit order, so the MSB represents x^0 and the LSB represents x^127. pack120 thus becomes:

__m128i pack120(const uint16_t* data) {
    return _mm_set_epi64x(pack60(data + 12) >> 4, pack60(data));
}

The c and y polynomials within the main function can also be 128-bit values, in the same reversed bit order. At this point, _mm_clmulepi64_si128 needs to be introduced, which can perform multiplication of two 64-bit polynomials, either in regular bit order or in reversed bit order. For polynomials F and G in regular bit order, it performs F * G with the result in regular order. For polynomials F and G in reversed bit order, it performs F * G * x^1 with the result in reversed order. Accordingly, _mm_clmulepi64_si128 can be used to implement the semi-mod operator. It can also be used to implement mod P. It is neither short nor readable, but the resultant code is:

__m128i xor_clmul(__m128i a, __m128i b) {
    // (a.lo * b.lo) ^ (a.hi * b.hi), all in GF(2)[x]
    return _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x00),
                         _mm_clmulepi64_si128(a, b, 0x11));
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        { // *= x^120 semi-mod P
            __m128i k = _mm_setr_epi32(
                0, 0x4b334000 /* x^120+64-1 mod P */,
                0, 0x96d30000 /* x^120-1    mod P */);
            c = xor_clmul(c, k);
            y = xor_clmul(y, k);
        }
        { // +=
            c = _mm_xor_si128(c, pack120(data + i));
            y = _mm_xor_si128(y, pack120(data + i + 1));
        }
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}
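As an aside, the reversed-bit-order behaviour of _mm_clmulepi64_si128 described above can be checked in isolation. The following standalone sketch (clmul_ref and rev64 are helper names introduced just for this check; they are not part of the code above) verifies that multiplying two bit-reversed 64-bit operands yields the bit-reversed representation of F * G * x^1; compile with -msse4.1 -mpclmul.

#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

// Reference 64x64 -> 128-bit carry-less multiply, regular bit order
// (bit i of each operand is the coefficient of x^i).
static void clmul_ref(uint64_t a, uint64_t b, uint64_t out[2]) {
    out[0] = out[1] = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            out[0] ^= a << i;
            if (i) out[1] ^= a >> (64 - i);
        }
    }
}

static uint64_t rev64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) r |= ((x >> i) & 1) << (63 - i);
    return r;
}

int main(void) {
    uint64_t a = 0x123456789abcdef0, b = 0x0fedcba987654321;
    uint64_t c[2];
    clmul_ref(a, b, c); // regular-order product, at most 127 significant bits
    __m128i r = _mm_clmulepi64_si128(_mm_set_epi64x(0, rev64(a)),
                                     _mm_set_epi64x(0, rev64(b)), 0x00);
    uint64_t rlo = (uint64_t)_mm_cvtsi128_si64(r);
    uint64_t rhi = (uint64_t)_mm_extract_epi64(r, 1);
    // Reversing all 128 bits of r should give the regular-order product
    // multiplied by x, i.e. c shifted left by one bit.
    uint64_t expect_lo = c[0] << 1;
    uint64_t expect_hi = (c[1] << 1) | (c[0] >> 63);
    printf("%s\n", (rev64(rhi) == expect_lo && rev64(rlo) == expect_hi)
                       ? "reversed-order property holds" : "mismatch");
    return 0;
}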

This _mm_clmulepi64_si128 version comes in at approximately 11 bits per cycle:

Version                            Broadwell  Skylake  Ice Lake
_mm_clmulepi64_si128               10.8       11.0     12.8

This is good, but we can do better still. At this point, all the mathematical tricks have been pulled, so the next avenue of exploration is vectorisation. In particular, there are four pack60 calls per iteration, and it should be possible to perform two calls at once by operating at 128-bit vector width:

__m128i pmovzxwq(const uint16_t* data) {
    __m128i out;
    asm("vpmovzxwq %1, %0" : "=x"(out) :
                              "m"(*(const uint32_t*)data));
    return out;
}

__m128i pack60x2(const uint16_t* data) {
    __m128i result;
    result  = _mm_slli_epi64(pmovzxwq(data     ),  4);
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 14);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 24);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 34);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 44);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 54);
    return result;
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i lo = pack60x2(data + i);
            __m128i hi = _mm_srli_epi64(pack60x2(data + i + 12), 4);
            c = _mm_xor_si128(c, _mm_unpacklo_epi64(lo, hi));
            y = _mm_xor_si128(y, _mm_unpackhi_epi64(lo, hi));
        }
...
}

This gives an extra 0.7 bits per cycle on Broadwell. On Skylake and Ice Lake, it improves things by more than 2 bits per cycle:

Version                            Broadwell  Skylake  Ice Lake
pack60x2                           11.5       13.2     15.2

From here, a few shift instructions can be eliminated by having two variants of pack60x2:

__m128i pack60x2_4(const uint16_t* data) {
    __m128i result;
    result  = _mm_slli_epi64(pmovzxwq(data     ),  4);
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 14);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 24);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 34);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 44);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 54);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i result;
    result  =                pmovzxwq(data     );
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 10);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 20);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 30);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 40);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 50);
    return result;
}

void crc_sdi_pack(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i lo = pack60x2_4(data + i);
            __m128i hi = pack60x2_0(data + i + 12);
            c = _mm_xor_si128(c, _mm_unpacklo_epi64(lo, hi));
            y = _mm_xor_si128(y, _mm_unpackhi_epi64(lo, hi));
        }
...
}

This makes basically no difference on Skylake, but gives a minor improvement on both Broadwell and Ice Lake:

Version                            Broadwell  Skylake  Ice Lake
pack60x2_4 and pack60x2_0          12.6       13.2     15.9

At this point it is worth reviewing exactly which micro-ops are being executed per iteration of the loop (which is now ingesting 240 bits per iteration):

Instruction                    Count  BDW μops   SKL μops   ICL μops
_mm_clmulepi64_si128 x, x, i   4      p0         p5         p5
pmovzxwq x, m                  12     p5 + p23   p5 + p23   p15 + p23
_mm_slli_epi64 x, i            11     p0         p01        p01
_mm_xor_si128 x, x             14     p015       p015       p015
_mm_unpack(lo|hi)_epi64 x, x   2      p5         p5         p15

On Skylake, we're bound by port 5 (which is why eliminating the shifts didn't help): 240 bits require 18 μops executed by port 5 (4 from the clmuls, 12 from pmovzxwq, 2 from the unpacks), and 240 bits divided by 18 cycles is 13.3 bits per cycle, which is pretty much the measured 13.2 bits per cycle. Things are slightly less neat on Broadwell and Ice Lake, as suboptimal dispatch of μops to execution ports causes us to be bound on a port even though clairvoyant dispatch could do slightly better. Broadwell could be described as being bound on port 0 (due to some xor μops being dispatched to port 0, even though it would be better to dispatch them all to port 1), and Ice Lake could be described as being bound on a mix of port 1 and port 5 (due to some xor μops not being dispatched to port 0).

To get past this bound, some work needs to be removed from port 5. Each pair of pmovzxwq and _mm_slli_epi64 causes a memory μop on port 2 or 3, a shuffle μop on port 5 (or port 1 on Ice Lake), and an arithmetic μop on port 0 (or port 1 on Skylake / Ice Lake). The shuffle μop can be removed by using a different pair of instructions: broadcastss (one memory μop) followed by _mm_madd_epi16 against a constant (one arithmetic μop). This different pair can only be used when the shift amount, modulo 32, is between 0 and 14 (inclusive), as the constant values in question are int16_t values. Half of the cases satisfy this constraint on the shift amount, and can thus be replaced:

__m128i broadcastss(const uint16_t* data) {
    __m128i result;
    asm("vbroadcastss %1, %0" : "=x"(result) :
                                 "m"(*(const uint32_t*)data));
    return result;
}

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result;
    result  = _mm_madd_epi16(broadcastss(data     ), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift14);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  4), 24);
    result ^= _mm_madd_epi16(broadcastss(data +  6), shift34);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift44);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 54);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result;
    result  =                   pmovzxwq(data     );
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift10);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  4), 20);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  6), 30);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift40);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 50);
    return result;
}

This gives a nice improvement across the board, getting us to at least 18 bits per cycle on Skylake and Ice Lake:

Version                            Broadwell  Skylake  Ice Lake
broadcastss + _mm_madd_epi16       13.5       18.0     18.8
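As an aside, the equivalence of the two instruction pairs underlying that improvement can be demonstrated in isolation for the shift-by-4 case. This is purely illustrative and not part of any of the measured code; _mm_cvtepu16_epi64 and _mm_shuffle_epi32 stand in for the memory forms of pmovzxwq and vbroadcastss:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <immintrin.h>

// broadcastss(data) yields int16 lanes [d0,d1,d0,d1,d0,d1,d0,d1], and shift4
// has int16 lanes [16,0, 0,0, 0,16, 0,0], so _mm_madd_epi16 produces 32-bit
// lanes [d0<<4, 0, d1<<4, 0], i.e. 64-bit lanes [d0<<4, d1<<4], which is
// exactly what pmovzxwq followed by a shift left by 4 produces, provided the
// inputs are 10-bit samples (so no int16 sign issues).
int main(void) {
    uint16_t data[2] = {0x3a5, 0x17c}; // arbitrary 10-bit samples
    uint32_t word;
    memcpy(&word, data, 4);
    __m128i v = _mm_cvtsi32_si128((int)word);
    __m128i via_shuffle = _mm_slli_epi64(_mm_cvtepu16_epi64(v), 4);
    __m128i shift4 = _mm_setr_epi32(1u << 4, 0, 1u << (4 + 16), 0);
    __m128i via_madd = _mm_madd_epi16(_mm_shuffle_epi32(v, 0), shift4);
    printf("%s\n", memcmp(&via_shuffle, &via_madd, 16) ? "mismatch" : "equal");
    return 0;
}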

A few more pairs of pmovzxwq and _mm_slli_epi64 can be replaced with broadcastss and _mm_madd_epi16, albeit at the cost of one extra arithmetic μop for every two removed shuffle μops:

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result, hi;
    result  = _mm_madd_epi16(broadcastss(data     ), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift14);
    hi      = _mm_madd_epi16(broadcastss(data +  4), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  6), shift34);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift44);
    hi     ^= _mm_madd_epi16(broadcastss(data + 10), shift34);
    result ^= _mm_slli_epi64(hi, 20);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result, hi;
    result  =                   pmovzxwq(data     );
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift10);
    hi      =                   pmovzxwq(data +  4);
    hi     ^= _mm_madd_epi16(broadcastss(data +  6), shift10);
    result ^= _mm_slli_epi64(hi, 20);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift40);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 50);
    return result;
}

This is a minor regression on Broadwell (because at this point we're bound by port 0, and that is where the extra arithmetic μop goes), and a small improvement on Skylake and Ice Lake (where the bound is shuffle rather than arithmetic):

Version                            Broadwell  Skylake  Ice Lake
More broadcastss + _mm_madd_epi16  13.2       18.8     19.7

To break through the barrier of 20 bits per cycle, AVX512's ternary bitwise instruction can be used to merge together pairs of xor instructions:

__m128i xor3(__m128i a, __m128i b, __m128i c) {
    // 0x96 is the truth table for a ^ b ^ c
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result, hi;
    result = xor3(_mm_madd_epi16(broadcastss(data     ), shift4),
                  _mm_madd_epi16(broadcastss(data +  2), shift14),
                  _mm_madd_epi16(broadcastss(data +  6), shift34));
    hi     =      _mm_madd_epi16(broadcastss(data +  4), shift4)
           ^      _mm_madd_epi16(broadcastss(data + 10), shift34);
    result = xor3(result,
                  _mm_madd_epi16(broadcastss(data +  8), shift44),
                  _mm_slli_epi64(hi, 20));
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result, hi;
    hi     =                        pmovzxwq(data +  4)
           ^      _mm_madd_epi16(broadcastss(data +  6), shift10);
    result = xor3(                  pmovzxwq(data     ),
                  _mm_madd_epi16(broadcastss(data +  2), shift10),
                  _mm_slli_epi64(hi, 20));
    result = xor3(result,
                  _mm_madd_epi16(broadcastss(data +  8), shift40),
                  _mm_slli_epi64(   pmovzxwq(data + 10), 50));
    return result;
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        // *= x^120 semi-mod P
        // +=
        __m128i lo = pack60x2_4(data + i);
        __m128i hi = pack60x2_0(data + i + 12); 
        __m128i k = _mm_setr_epi32(
            0, 0x4b334000 /* x^120+64-1 mod P */,
            0, 0x96d30000 /* x^120-1    mod P */);
        c = xor3(_mm_clmulepi64_si128(c, k, 0x00),
                 _mm_clmulepi64_si128(c, k, 0x11),
                 _mm_unpacklo_epi64(lo, hi));
        y = xor3(_mm_clmulepi64_si128(y, k, 0x00),
                 _mm_clmulepi64_si128(y, k, 0x11),
                 _mm_unpackhi_epi64(lo, hi));
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}
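For completeness, here is a hypothetical driver (not from the original code) showing how the finished function is meant to be called: n must be a multiple of 24, data holds interleaved c / y pairs of 10-bit samples, and crcs carries the two CRC states in and out.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n); // defined above

int main(void) {
    enum { N = 24 * 100 };
    static uint16_t samples[N];
    for (size_t i = 0; i < N; i++) {
        samples[i] = (uint16_t)((i * 37u) & 0x3ff); // arbitrary 10-bit values
    }
    uint32_t crcs[2] = {0, 0}; // starting CRC state, zeros for illustration
    crc_sdi(crcs, samples, N);
    printf("c CRC = 0x%08x, y CRC = 0x%08x\n",
           (unsigned)crcs[0], (unsigned)crcs[1]);
    return 0;
}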

AVX512 isn't available on Broadwell, nor on the client variant of Skylake, but where it is available it pushes us well above 20 bits per cycle:

Version                            Broadwell  Skylake  Ice Lake
_mm_ternarylogic_epi64             N/A        21.9     23.5

To recap the journey, when measuring bits per cycle, we've got a 25x overall improvement on Broadwell, more than 40x overall improvement on Skylake, and nearly a 50x overall improvement on Ice Lake:

Version                            Broadwell  Skylake  Ice Lake
Starting point                     0.536      0.528    0.477
2KiB table                         2.10       2.15     2.05
_mm_clmulepi64_si128               10.8       11.0     12.8
pack60x2                           11.5       13.2     15.2
pack60x2_4 and pack60x2_0          12.6       13.2     15.9
broadcastss + _mm_madd_epi16       13.5       18.0     18.8
More broadcastss + _mm_madd_epi16  13.2       18.8     19.7
_mm_ternarylogic_epi64             N/A        21.9     23.5

Contrasting Intel AMX and Apple AMX

Intel has an x64 instruction set extension called AMX; meanwhile, Apple has a totally different AArch64 instruction set extension that is also called AMX.

Register file

Intel AMX: 8 tmm registers, each a 16 by 16 matrix of 32-bit elements (technically, each can be configured to be any size - square or rectangular - between 1 by 1 and 16 by 16, though element size is fixed at 32 bits regardless). Total architectural state 8 kilobytes.

Apple AMX: Total architectural state 5 kilobytes, broken down as:
x: 8 registers, each 64 bytes (0.5 kilobytes)
y: 8 registers, each 64 bytes (0.5 kilobytes)
z: 64 registers, each 64 bytes (4 kilobytes)

The vectors have 8/16/32/64-bit elements, like regular SIMD registers. Note that Intel AMX does not need vector registers, as Intel already has AVX512 with 64-byte vectors (32 of which are in the AVX512 architectural register file).

Data types

Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.

Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction), though bf16 support was added on the M2 and later.

Computational operations

Intel AMX: Matrix multiplication of any two tmm registers, accumulating onto a third tmm register. For the multiplication, matrix elements are themselves (very small) vectors, combined via dot product. This is the only operation. Viewed differently, this is doing 16×64 by 64×16 (int8) or 16×32 by 32×16 (bf16) matmul, then adding onto a 16×16 matrix.
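To make the Intel side concrete, here is a minimal sketch of the intrinsics call pattern. This is my own example, under several assumptions: Linux, an AMX-capable CPU, and a compiler given -mamx-tile -mamx-int8. It configures three full-size tiles and issues one tdpbssd; the pre-packing of B into AMX's four-bytes-per-dword layout, which a real matmul needs, is glossed over, since the point is just the shape of the operations.

#include <stdint.h>
#include <string.h>
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023  // Linux: request use of an XSAVE feature
#define XFEATURE_XTILEDATA  18      // the AMX tile data state component

struct tile_config {          // 64-byte layout consumed by ldtilecfg
    uint8_t  palette_id;      // 1 = the standard palette
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];       // bytes per row, for each tmm register
    uint8_t  rows[16];        // number of rows, for each tmm register
};

int main(void) {
    // Linux requires asking permission before touching tile data.
    syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);

    struct tile_config cfg;
    memset(&cfg, 0, sizeof cfg);
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64; // tmm0: C, 16x16 of i32
    cfg.rows[1] = 16; cfg.colsb[1] = 64; // tmm1: A, 16x64 of i8
    cfg.rows[2] = 16; cfg.colsb[2] = 64; // tmm2: B, 64x16 of i8 (pre-packed)
    _tile_loadconfig(&cfg);

    static int8_t  a[16][64], b[16][64]; // b assumed already in packed layout
    static int32_t c[16][16];
    _tile_zero(0);
    _tile_loadd(1, a, 64);               // last argument is row stride in bytes
    _tile_loadd(2, b, 64);
    _tile_dpbssd(0, 1, 2);               // C += A * B via int8 dot products
    _tile_stored(0, c, 64);
    _tile_release();
    return 0;
}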

Apple AMX: Outer product of any x register with any y register (viewed as a column vector), accumulating onto any (matrix view) z register. For the multiplication, x / y elements are scalars (depending on the data type, this might be viewed as doing 16×1 by 1×16 matmul then adding onto a 16×16 matrix). Alternatively, pointwise product of any x register with any y register (viewed as a row vector), accumulating onto any (vector view) z register. Many more operations as well, though the general theme is {outer or pointwise} {multiplication or addition or subtraction}, possibly followed by right-shift, possibly followed by integer saturation. The most exotic exceptions to the theme are min / max / popcnt.

Memory operations

Intel AMX: Matrix load or store (up to 1 kilobyte), configurable with a base address (register + immediate offset) and a row stride (register or zero, optionally shifted left by 1-3 bits).

Apple AMX: Vector load or store (64 bytes), configurable with a base address (register). Also load or store pair (128 bytes), though the two registers must be consecutive, and the row stride is fixed at 64 bytes, and the base address must be 128-byte aligned. Loads or stores with y effectively give a free vector transpose, as y registers can be viewed as column vectors.

Masking modes

Intel AMX: Each tmm register can be configured to any size - square or rectangular - between 1 by 1 and 16 by 16. This is (mostly) equivalent to saying that a tmm register is always 16 by 16, but has an associated mask on each dimension to only enable some number of leading rows and columns. These per-register masks are architectural state.

Apple AMX: Per-dimension masking is available on a per-instruction basis (though notably not for loads / stores). Available masking modes are: all rows (columns), even/odd rows (columns) only, first N rows (columns) only, last N rows (columns) only, row (column) N only.

Note that both of these approaches are different to Intel's AVX512 approach, which is a separate register file containing 8 mask registers (k0 through k7) and every operation optionally taking a mask register as an input.
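The AVX512 model can be illustrated in a few lines (an unrelated toy example, not connected to the code earlier on this page): the mask lives in a k register, and essentially any operation can consume it, either merging unselected lanes from another register or zeroing them.

#include <immintrin.h>

__m512 avx512_mask_demo(__m512 src, __m512 a, __m512 b) {
    __mmask16 k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OQ); // k[i] = a[i] < b[i]
    __m512 merged = _mm512_mask_add_ps(src, k, a, b);   // unselected lanes from src
    __m512 zeroed = _mm512_maskz_add_ps(k, a, b);       // unselected lanes zeroed
    return _mm512_add_ps(merged, zeroed);
}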

Other

Apple AMX contains a very interesting instruction called genlut. In the forward direction ("lookup"), it is somewhere between AVX512's vpshufb and vgatherps. In the backward direction ("generate") it is something like an inverse vpshufb, performing arbitrary 2/3/4/5-bit quantisation. When used in both directions, it can be useful for piecewise linear interpolation, or as an alternative to AVX512's vfpclassps / vfixupimmps.
