# A whirlwind tour of AArch64 vector instructions (NEON)

32 vector registers, each 128 bits wide. Also a control register (`FPCR`) and a status register (`FPSR`), though scalar comparison instructions use `PSTATE`. Each vector register can contain:

- A single lane in the low bits (scalar)
- Multiple lanes in the low 64 bits (back-compat with AArch32)
- Multiple lanes collectively occupying all 128 bits

128 bits might seem small compared to AVX-512, but for most vector instructions, the M1's Firestorm cores can issue four instances of the instruction per cycle, which gets you to a similar place. There are also AMX units on the M1.

A lane is 1/8/16/32/64/128 bits wide. The common lane types are:

Type | 1-bit | 8-bit | 16-bit | 32-bit | 64-bit | 128-bit |
---|---|---|---|---|---|---|

uint | bit | uint8 ^{[q]} | uint16 ^{[q]} | uint32 ^{[q]} | uint64 ^{[q]} | |

sint | | sint8 ^{[q]} | sint16 ^{[q]} | sint32 ^{[q]} | sint64 ^{[q]} | |

fp | | | fp16 ^{[a]} | fp32 | fp64 | |

bf | | | bf16 ^{[b]} | | | |

poly | bit | poly8 | poly16 | | poly64 ^{[c]} | poly128 ^{[c]} |

^{[a]}Requires `fp16` extension, and improved further by `fp16fml` extension.

^{[b]}Requires `bf16` extension (present on Apple M2, but not on M1).

^{[c]}Requires `crypto` extension.

^{[q]}Often with the choice of truncate on overflow or saturate on overflow.

The syntactic names for the 32 registers vary based on the lane width and also how much of the register is being used:

Lane width | Low scalar | Low 64 bits | All 128 bits |
---|---|---|---|

1 bit | N/A | `V0.8B` ⋯ `V31.8B` | `V0.16B` ⋯ `V31.16B` |

8 bits | `B0` ⋯ `B31` | `V0.8B` ⋯ `V31.8B` | `V0.16B` ⋯ `V31.16B` |

16 bits | `H0` ⋯ `H31` | `V0.4H` ⋯ `V31.4H` | `V0.8H` ⋯ `V31.8H` |

32 bits | `S0` ⋯ `S31` | `V0.2S` ⋯ `V31.2S` | `V0.4S` ⋯ `V31.4S` |

64 bits | `D0` ⋯ `D31` | `V0.1D` ⋯ `V31.1D` | `V0.2D` ⋯ `V31.2D` |

128 bits | `Q0` ⋯ `Q31` | N/A | `V0.1Q` ⋯ `V31.1Q` |

A specific lane can be referenced by appending `[0]` ⋯ `[15]` to one of the above, for example `V7.4S[3]` denotes the most significant 32-bit lane of `V7`. Writes to a single lane with this syntax generally preserve other lanes, whereas in all other cases, writes to the low bits of a register generally zero the remaining bits (though see `FPCR.NEP`).

Vector instructions have up to six input registers, and up to one output register (with the exception of loads, which can have up to four output registers). In most cases, all the registers involved in a single instruction have the same lane width, though there are exceptions:

- *Narrowing instructions* have an output register whose lane width is half that of the input registers. These only write to half of the output register, and typically come in pairs: one instruction writes to the low half and clears the high half, while the other writes to the high half and preserves the low half.
- *Widening instructions* have an output register whose lane width is twice that of the input registers. These only consume half of each input register, and typically come in pairs (consuming different halves of the inputs).
- *Partially widening instructions* have an output register whose lane width is twice that of *some* inputs, and equal to that of other inputs. Typically come in pairs (consuming different halves of the *applicable* inputs).
- *Widen then multiple accumulate instructions* have an output register whose lane width is twice or four times that of *some* input registers. Inputs are widened, then combined, then *multiple* intermediate results are accumulated onto each output lane. Do not come in pairs, as the entire input is used.

In most cases, operations are lane-wise: lane `i` of the output register is formed by combining lane `i` of each input register, though there are exceptions:

- Instructions with a single input register sometimes perform a scalar reduction: the low lane of the output register is formed by combining *all* lanes of the input register, and other lanes are zeroed. Annotated as ^{[v]}.
- Instructions with two input registers sometimes concatenate the inputs to form a 256-bit value and then operate on *adjacent* pairs of lanes: lane `i` of the output is made by combining lanes `i*2` and `i*2+1` of the concatenation. Annotated as ^{[p]}.
- Instructions sometimes recycle one of the inputs as an output, so lane `i` of the output register is formed by combining lane `i` of the output register with lane `i` of each input register.
- Some instructions allow a single lane from one of the inputs to be broadcast to all lanes (the term *by element* is used for this). Often not permitted for 8-bit lanes. Often slightly restricted for 16-bit lanes: source register must be `V0` through `V15` (i.e. cannot be `V16` through `V31`). Annotated as ^{[b]}.
- Data movement instructions (shuffles/permutes/etc) can have a bespoke relationship between input lanes and output lanes.

**GPR to / from vector**

Instruction | Direction | Behaviour |
---|---|---|

`FMOV` | GPR to vector | Truncate to lane width, then zero extend |

`DUP` | GPR to vector | Truncate then replicate to all lanes |

`INS` | GPR to vector | Truncate then insert to arbitrary lane |

`FMOV` | Vector to GPR | Take low lane, zero extend to GPR width |

`UMOV` | Vector to GPR | Arbitrary lane, zero extend to GPR width |

`SMOV` | Vector to GPR | Arbitrary lane, sign extend to GPR width |

A number of data type conversion instructions can also have a GPR as source or destination (see "Data type conversion, integer to float" and "Data type conversion, float to integer").

**Load / store**

A scalar load moves 8/16/32/64 bits from memory to (part of) a vector register, whereas a vector load moves 64/128 bits. This can be repeated up to four times, reading from consecutive memory locations and writing to distinct vector registers (which must be consecutively numbered, except for `LDP` / `LDNP`).

Load kind | ×1 | ×2 | ×3 | ×4 |
---|---|---|---|---|

Scalar (low lane, zero others) | `LDR` `LDUR` | `LDP` `LDNP` | | |

Scalar (any lane, preserve others) | `LD1` (SS 1R) | `LD2` (SS 2R) | `LD3` (SS 3R) | `LD4` (SS 4R) |

Scalar (replicate to all lanes) | `LD1R` | `LD2R` | `LD3R` | `LD4R` |

Vector | `LD1` (MS 1R) `LDR` `LDUR` | `LD1` (MS 2R) `LDP` `LDNP` | `LD1` (MS 3R) | `LD1` (MS 4R) |

Vector, transposed | | `LD2` (MS 2R) | `LD3` (MS 3R) | `LD4` (MS 4R) |

With the exception of scalar replicating to all lanes, all load instructions have a corresponding store instruction performing the inverse: replace `LD` with `ST`.

The SS or MS suffix denotes what the ARM reference manual calls "single structure" or "multiple structures", and is followed by the number of destination registers (1, 2, 3, or 4). The operand syntax relates to this suffix.

The vector transposed loads perform one of the following transposes, where M denotes the number of destination registers (2, 3, or 4):

Vector size | 8-bit lanes | 16-bit lanes | 32-bit lanes | 64-bit lanes |
---|---|---|---|---|

64-bit vectors | 8×M ↦ M×8 | 4×M ↦ M×4 | 2×M ↦ M×2 | |

128-bit vectors | 16×M ↦ M×16 | 8×M ↦ M×8 | 4×M ↦ M×4 | 2×M ↦ M×2 |

The addressing modes are all over the place:

Instruction | Base | Offset | Writeback mode |
---|---|---|---|

`LDR` | Xn or SP | signed `imm9` | Pre- or post-index |

`LDUR` | Xn or SP | signed `imm9` | No writeback |

`LDR` | Xn or SP | unsigned `imm12`, scaled | No writeback |

`LDR` | Xn or SP | Xm or Wm, optional extend/scale | No writeback |

`LDR` | PC | signed `imm19`, times 4 | No writeback |

`LDP` | Xn or SP | signed `imm7`, scaled | Pre- or post-index |

`LDP` | Xn or SP | signed `imm7`, scaled | No writeback |

`LDNP` | Xn or SP | signed `imm7`, scaled | No writeback |

Others | Xn or SP | unsigned `imm1`, scaled (or Xm) | Post-index |

**Data movement**

Moving lanes around:

Instruction | Source lane | Destination lane | Other lanes |
---|---|---|---|

`FMOV` | Low lane | Low lane | Zeroed |

`MOV` (`DUP` ) | Arbitrary | Low lane | Zeroed |

`MOV` (`INS` ) | Arbitrary | Arbitrary | Preserved |

`DUP` | Arbitrary | All lanes (replicated) | N/A |

`MOV` (`ORR` ) | All lanes | All lanes (1:1) | N/A |

Various reversals:

Operation | in bytes | in u16s | in u32s | in u64s | in u128s |
---|---|---|---|---|---|

Reverse bits | `RBIT` | | | | |

Reverse bytes | no-op | `REV16` | `REV32` | `REV64` | `TBL` |

Reverse u16s | | no-op | `REV32` | `REV64` | `TBL` |

Reverse u32s | | | no-op | `REV64` | `TBL` |

Reverse u64s | | | | no-op | `EXT` |

Various ways of creating one vector from two:

Instruction | First step | Second step |
---|---|---|

`TRN1` , `TRN2` | Discard odd or even lanes | Interleave lanes |

`ZIP1` , `ZIP2` | Discard high half or low half | Interleave lanes |

`UZP1` , `UZP2` | Concatenate vectors | Discard odd or even lanes |

`EXT` | Concatenate vectors | Take a contiguous 16-byte span |

`EXT` with the same source vector for both operands gives a rotation by a whole number of bytes. Note that some math instructions come with a free `UZP1` and `UZP2` (annotated as ^{[p]}).
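As a rough illustration of the table above, the following Python sketch models each instruction on plain lists of lanes. The function names and the list-of-lanes representation are assumptions made for this example (index 0 is the least significant lane); note that the real `EXT` counts its offset in bytes, so `ext` below matches it only when each list element stands for one byte.

```python
# Illustrative lane-level models; x and y are equal-length lists of lanes,
# with index 0 being the least significant lane.

def interleave(a, b):
    """Alternate lanes: a[0], b[0], a[1], b[1], ..."""
    return [lane for pair in zip(a, b) for lane in pair]

def trn1(x, y):
    return interleave(x[0::2], y[0::2])    # keep even lanes, then interleave

def trn2(x, y):
    return interleave(x[1::2], y[1::2])    # keep odd lanes, then interleave

def zip1(x, y):
    half = len(x) // 2
    return interleave(x[:half], y[:half])  # keep low halves, then interleave

def zip2(x, y):
    half = len(x) // 2
    return interleave(x[half:], y[half:])  # keep high halves, then interleave

def uzp1(x, y):
    return (x + y)[0::2]                   # concatenate, keep even lanes

def uzp2(x, y):
    return (x + y)[1::2]                   # concatenate, keep odd lanes

def ext(x, y, n):
    return (x + y)[n:n + len(x)]           # concatenate, take a contiguous span
```

With `x = [0, 1, 2, 3]` and `y = [10, 11, 12, 13]`, `zip1(x, y)` gives `[0, 10, 1, 11]`, and `ext(x, x, 1)` gives the rotation `[1, 2, 3, 0]`.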

The `TBL`, `TBX` family concatenate 1-4 vectors from consecutively numbered registers to form a table `T` of 16/32/48/64 bytes, then another byte vector serves as indices into said table:

Instruction | Per-byte behaviour |
---|---|

`TBL` | `D[i] = (Y[i] < len(T)) ? T[Y[i]] : 0` |

`TBX` | `D[i] = (Y[i] < len(T)) ? T[Y[i]] : D[i]` |

For comparison, x86 `pshufb` does `D[i] = (Y[i] < 128) ? X[Y[i] & 15] : 0`, which is similar.
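A minimal Python sketch of these lookups, assuming out-of-range indices behave as in the Arm documentation (`TBL` writes zero, `TBX` leaves the destination byte unchanged). The function names are chosen for this example only.

```python
def tbl(table: bytes, indices: bytes) -> bytes:
    """TBL: out-of-range indices produce zero."""
    return bytes(table[i] if i < len(table) else 0 for i in indices)

def tbx(table: bytes, indices: bytes, dest: bytes) -> bytes:
    """TBX: out-of-range indices keep the existing destination byte."""
    return bytes(table[i] if i < len(table) else d
                 for i, d in zip(indices, dest))

def pshufb(x: bytes, y: bytes) -> bytes:
    """x86 pshufb on a 16-byte table: top bit of the index selects zero."""
    return bytes(x[i & 15] if i < 128 else 0 for i in y)
```

One visible difference: with a 16-byte table, index 16 yields zero from `tbl` but wraps around to entry 0 in `pshufb`. A full byte reversal of a 128-bit vector is `tbl(v, bytes(range(15, -1, -1)))`.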

**Immediates**

One flavour of `FMOV` loads constants of the form `±(1.0 + m/16) × 2^e` (where `0 ≤ m ≤ 15` and `−3 ≤ e ≤ 4`) into the low fp16/fp32/fp64 lane, and either zeros the other lanes or replicates the constant to the other lanes.
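To get a feel for which constants that form covers, here is a small Python sketch that enumerates them. The function name is made up for this example; it simply evaluates the formula for every `m`, `e`, and sign.

```python
def fmov_immediates():
    """All values of the form ±(1.0 + m/16) × 2^e, 0 ≤ m ≤ 15, −3 ≤ e ≤ 4."""
    return sorted({
        sign * (1.0 + m / 16) * 2.0 ** e
        for sign in (1.0, -1.0)
        for m in range(16)
        for e in range(-3, 5)
    })
```

There are 256 such values, ranging in magnitude from 0.125 to 31.0. Common constants like 1.0, 0.5, and 2.5 are included, whereas 0.0 and 0.1 are not.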

One flavour of `MOVI` loads a constant into the low 64-bit lane, each byte of which is independently either 0 or 255, and either zeros the high u64 lane or replicates the constant there.

The remaining flavours of `MOVI` and `MVNI` load a constant into the low 8/16/32-bit lane, one byte of which is an arbitrary value, bytes to the left either all 0 or all 255, and bytes to the right either all 0 or all 255, then replicate this constant to all lanes.

The bitwise `BIC` and `ORR` instructions support constants of the form: 16/32-bit lane, one byte of which is an arbitrary value, other bytes all 0, replicated to all lanes.

Various comparison instructions support comparison against constant zero:

Comparison | Signed | Unsigned | Floating |
---|---|---|---|

`X == 0` | `CMEQ X, #0` | `CMEQ X, #0` | `FCMEQ X, #0.0` |

`X <= 0` | `CMLE X, #0` | `CMEQ X, #0` | `FCMLE X, #0.0` |

`X < 0` | `CMLT X, #0` | always false | `FCMLT X, #0.0` |

`X > 0` | `CMGT X, #0` | `CMTST X, X` | `FCMGT X, #0.0` |

`X >= 0` | `CMGE X, #0` | always true | `FCMGE X, #0.0` |

`X <=> 0` | N/A | N/A | `FCMP X, #0.0` |

**Shifts**

Note `>>R` is used to denote a rounding right shift: if the most significant bit shifted out was a 1, then 1 is added to the result. If shifting right by `N`, this is equivalent to adding `1 << (N - 1)` to the input before shifting.
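The equivalence between the two descriptions can be checked with a short Python sketch. Both function names are made up for this example, and Python's unbounded integers stand in for the wider intermediate precision that the hardware uses (so the addition cannot overflow).

```python
def rounding_shift_right(x: int, n: int) -> int:
    """`x >>R n`: add 1 if the most significant bit shifted out was a 1."""
    last_bit_out = (x >> (n - 1)) & 1
    return (x >> n) + last_bit_out

def rounding_shift_right_via_add(x: int, n: int) -> int:
    """Same result, by adding 1 << (n - 1) before an ordinary shift."""
    return (x + (1 << (n - 1))) >> n
```

For example, `7 >>R 2` is 2 (plain `7 >> 2` is 1), and `-5 >>R 1` is -2 (plain `-5 >> 1` is -3).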

In-lane shifts by immediate:

Instruction | Per-lane behaviour |
---|---|

`SHL` | `D = X << N` |

`SQSHL` , `UQSHL` | `D = sat(X << N)` |

`SQSHLU` | `D = sat(X << N)` (X signed, D unsigned) |

`SLI` | `D = (X << N) | bzhi(D, N)` (bzhi clears all but the low N bits) |

`SSHR` , `USHR` | `D = X >> N` |

`SRSHR` , `URSHR` | `D = X >>R N` |

`SSRA` , `USRA` | `D += X >> N` |

`SRSRA` , `URSRA` | `D += X >>R N` |

`SRI` | `D = (X >> N) | bzlo(D, N)` (bzlo clears all but the high N bits) |

In-lane variable shifts, where the shift amount is a *signed* value in the low 8 bits of each lane in the second operand:

Instruction | Per-lane behaviour (Y > 0) | Per-lane behaviour (Y < 0) |
---|---|---|

`SSHL` , `USHL` | `D = X << Y` | `D = X >> -Y` |

`SRSHL` , `URSHL` | `D = X << Y` | `D = X >>R -Y` |

`SQSHL` , `UQSHL` | `D = sat(X << Y)` | `D = X >> -Y` |

`SQRSHL` , `UQRSHL` | `D = sat(X << Y)` | `D = X >>R -Y` |

Widening shifts by immediate (from 8-bit lanes to 16-bit, 16-bit lanes to 32-bit, or 32-bit lanes to 64-bit), where X is sign-extended or zero-extended before use, and D lanes are twice the width of X lanes:

Instruction | Per-lane behaviour (twice-width D) |
---|---|

`SSHLL` , `SSHLL2` , `USHLL` , `USHLL2` | `D = X << N` (where `0 ≤ N < bitwidth(X)` ) |

`SHLL` , `SHLL2` | `D = X << bitwidth(X)` |

Narrowing shifts by immediate (from 64-bit lanes to 32-bit, 32-bit lanes to 16-bit, or 16-bit lanes to 8-bit), where D lanes are half the width of X lanes. In all cases, `1 ≤ N ≤ bitwidth(D)`:

Instruction | Per-lane behaviour (half-width D) |
---|---|

`XTN` , `XTN2` | `D = truncate(X)` |

`SHRN` , `SHRN2` | `D = truncate(X >> N)` |

`RSHRN` , `RSHRN2` | `D = truncate(X >>R N)` |

`SQXTN` , `SQXTN2` | `D = sat(X)` (signed) |

`UQXTN` , `UQXTN2` | `D = sat(X)` (unsigned) |

`SQXTUN` , `SQXTUN2` | `D = sat(X)` (X signed, D unsigned) |

`SQSHRN` , `SQSHRN2` | `D = sat(X >> N)` (signed) |

`UQSHRN` , `UQSHRN2` | `D = sat(X >> N)` (unsigned) |

`SQSHRUN` , `SQSHRUN2` | `D = sat(X >> N)` (X signed, D unsigned) |

`SQRSHRN` , `SQRSHRN2` | `D = sat(X >>R N)` (signed) |

`UQRSHRN` , `UQRSHRN2` | `D = sat(X >>R N)` (unsigned) |

`SQRSHRUN` , `SQRSHRUN2` | `D = sat(X >>R N)` (X signed, D unsigned) |

There is no narrowing shift from 8-bit lanes to something narrower. This is a notable difference to x86, where `pmovmskb` can pull 1 bit out of every 8-bit lane. That said, `SHRN` from 16-bit lanes to 8-bit lanes with `N` set to 4 does something interesting: it pulls 4 bits out of every 8-bit lane, taking the high 4 bits of even lanes and the low 4 bits of odd lanes. Alternating between high/low halves is weird, but innocuous if every 8-bit lane starts out containing either 0 or 255.
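The nibble-extraction behaviour can be modelled in a few lines of Python. This is only a sketch: the function name is invented here, and the 128-bit register is represented as 16 bytes in little-endian lane order.

```python
def shrn_16_to_8(byte_lanes: bytes, n: int) -> bytes:
    """SHRN from u16 lanes to u8 lanes: D = truncate(X >> n), 1 <= n <= 8."""
    out = bytearray()
    for i in range(0, len(byte_lanes), 2):
        lane16 = byte_lanes[i] | (byte_lanes[i + 1] << 8)  # little-endian pair
        out.append((lane16 >> n) & 0xFF)
    return bytes(out)
```

Feeding in a 16-byte comparison result where bytes 0, 3, and 15 are 255 (and the rest 0) gives a 64-bit value with nibbles 0, 3, and 15 set, i.e. `0xF00000000000F00F`: the same information as `pmovmskb`, at 4 bits per lane rather than 1.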

Subject to the `sha3` extension, two instructions provide rotates in 64-bit lanes:

Instruction | Per-lane behaviour (64-bit lanes only) |
---|---|

`RAX1` | `D = X xor rotate_left(Y, 1)` |

`XAR` | `D = rotate_right(X xor Y, N)` |

Note that rotate by immediate (without xor, for any lane width, and without needing `sha3`) can be constructed from `SHL` followed by `USRA`.
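A per-lane Python sketch of that construction, with invented function names. It relies on the fact that `SHL` leaves the low `n` bits of the lane zero, so the addition performed by `USRA` cannot carry and behaves like a bitwise-or.

```python
def shl(x: int, n: int, bits: int) -> int:
    """SHL: D = X << N, truncated to the lane width."""
    return (x << n) & ((1 << bits) - 1)

def usra(d: int, x: int, n: int, bits: int) -> int:
    """USRA: D += X >> N (unsigned), truncated to the lane width."""
    return (d + (x >> n)) & ((1 << bits) - 1)

def rotate_left(x: int, n: int, bits: int) -> int:
    """Rotate left by n (1 <= n <= bits - 1) using SHL then USRA."""
    d = shl(x, n, bits)
    return usra(d, x, bits - n, bits)
```

For example, `rotate_left(0x81, 1, 8)` gives `0x03`.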

**Bitwise**

Assorted instructions operating bitwise:

Instruction | Description | Per-bit behaviour |
---|---|---|

`AND` | And | `D = X and Y` |

`BCAX` ^{[s]} | Clear and xor | `D = X xor (Y and not Z)` |

`BIC` | Clear | `D = X and not Y` |

`BIC` | Clear (immediate) | `D = D and not imm` ^{[i]} |

`BIF` | Insert if false | `D = Y ? D : X` |

`BIT` | Insert if true | `D = Y ? X : D` |

`BSL` | Select | `D = D ? X : Y` |

`EOR` | Xor | `D = X xor Y` |

`EOR3` ^{[s]} | Xor (three-way) | `D = X xor Y xor Z` |

`NOT` | Not | `D = not X` |

`ORN` | Or not | `D = X or not Y` |

`ORR` | Or | `D = X or Y` |

`ORR` | Or (immediate) | `D = D or imm` ^{[i]} |

^{[s]}Requires `sha3` extension.

^{[i]}Immediate is a 16-bit or 32-bit constant where one byte is an arbitrary `imm8` and other bytes are zero, broadcast to all 16-bit or 32-bit lanes.

Bit counting instructions:

Instruction | Counts | Possible lane widths |
---|---|---|

`CLS` | Leading sign bits | 8-bit, 16-bit, 32-bit |

`CLZ` | Leading zero bits (i.e. lzcnt) | 8-bit, 16-bit, 32-bit |

`CNT` | Non-zero bits (i.e. popcnt) | 8-bit ^{[a]} |

^{[a]}Other lane widths can be achieved by a follow-up `UADDLP` or `ADDV`.

There are no horizontal bitwise instructions in the traditional sense, though various horizontal reductions can be constructed from other instructions:

- Branch if any bit set: `UMAXP` (any lane width) to reduce 128 bits to 64 bits, `FMOV` to move 64 bits to a GPR, then `CBNZ`. If GPRs are at a premium, can instead `UMAXP` (any lane width), then `CMTST` (with 64-bit lanes), then `FCMP` (any lane width) against literal floating-point zero, then `B.NE` (if floating-point exceptions around NaNs are a concern, insert `BIC` with an immediate between `CMTST` and `FCMP`). On x86, this is `ptest` followed by `jcc`.
- Within each group of 8/16/32/64 bits, set all bits if any bit set: `CMTST`. For the bitwise inverse of this, `CMEQ` against literal zero. Note that `CMTST` comes with a free bitwise-and (`CMEQ` doesn't), though this can be bypassed by specifying the same input register twice.
- Provided that each group of 8/16/32/64 bits is either all ones or all zeros, find the index of the first group of ones: `SHRN` (from 16-bit lanes to 8-bit lanes, shifting by 4 bits), then `FMOV` to move 64 bits to a GPR, then `RBIT`, then `CLZ`, then divide by 4/8/16/32.
- Horizontal pairwise operations on groups of 8/16/32/64 bits, where each group is either all ones or all zeros: `UMAXP` or `SMINP` give bitwise-or, `UMINP` or `SMAXP` give bitwise-and.
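The "index of the first group of ones" recipe can be sketched in Python as below. All names are invented for this example; the 128-bit mask is given as 16 bytes in little-endian lane order, and `rbit64` / `clz64` stand in for the GPR instructions `RBIT` and `CLZ`.

```python
def shrn_by_4(mask: bytes) -> bytes:
    """SHRN #4 from eight u16 lanes (16 bytes) to eight u8 lanes."""
    out = bytearray()
    for i in range(0, 16, 2):
        lane16 = mask[i] | (mask[i + 1] << 8)
        out.append((lane16 >> 4) & 0xFF)
    return bytes(out)

def rbit64(x: int) -> int:
    """Reverse the bits of a 64-bit value."""
    return int(format(x, "064b")[::-1], 2)

def clz64(x: int) -> int:
    """Count leading zero bits of a 64-bit value (64 when x is zero)."""
    return 64 - x.bit_length()

def first_group_of_ones(mask: bytes, group_bits: int) -> int:
    """Index of the first all-ones group; group_bits is 8, 16, 32, or 64."""
    gpr = int.from_bytes(shrn_by_4(mask), "little")  # FMOV to a GPR
    return clz64(rbit64(gpr)) // (group_bits // 2)   # divide by 4/8/16/32
```

In this model, when no group is set, `CLZ` sees zero and returns 64, so the result is the number of groups in the vector (16 for 8-bit groups, 2 for 64-bit groups).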

**Integer math**

Assorted instructions operating lanewise, with 8/16/32/64-bit lanes:

Instruction | Per-lane behaviour |
---|---|

`ABS` | `D = abs(X)` |

`SQABS` | `D = sat(abs(X))` |

`NEG` | `D = -X` |

`SQNEG` | `D = sat(-X)` |

`ADD` , `SUB` | `D = X ± Y` |

`ADDP` | `D = A + B` ^{[p]} |

`ADDV` | `D[0] = X[0] + X[1] + ⋯ + X[n-1]` ^{[v]} |

`SQADD` , `UQADD` , `SUQADD` , `USQADD` | `D = sat(X + Y)` |

`SQSUB` , `UQSUB` | `D = sat(X - Y)` |

`SABD` , `UABD` ^{[d]} | `D = abs(X - Y)` |

`SABA` , `UABA` ^{[d]} | `D += abs(X - Y)` |

`SHADD` , `SHSUB` , `UHADD` , `UHSUB` ^{[d]} | `D = (X ± Y) >> 1` |

`SRHADD` , `URHADD` ^{[d]} | `D = (X + Y) >>R 1` |

`MUL` ^{[b]} ^{[d]} | `D = X * Y` |

`MLA` , `MLS` ^{[b]} ^{[d]} | `D ±= X * Y` |

`SQDMULH` ^{[a]} ^{[b]} ^{[d]} | `D = sat((2 * X * Y) >> bitwidth(D))` |

`SQRDMULH` ^{[a]} ^{[b]} ^{[d]} | `D = sat((2 * X * Y) >>R bitwidth(D))` |

`SQRDMLAH` , `SQRDMLSH` ^{[a]} ^{[b]} ^{[d]} | `D = sat(D ± ((2 * X * Y) >>R bitwidth(D)))` |

`SMIN` , `UMIN` ^{[d]} | `D = min(X, Y)` |

`SMINP` , `UMINP` ^{[d]} | `D = min(A, B)` ^{[p]} |

`SMINV` , `UMINV` ^{[d]} | `D[0] = min(X[0], X[1], ⋯, X[n-1])` ^{[v]} |

`SMAX` , `UMAX` ^{[d]} | `D = max(X, Y)` |

`SMAXP` , `UMAXP` ^{[d]} | `D = max(A, B)` ^{[p]} |

`SMAXV` , `UMAXV` ^{[d]} | `D[0] = max(X[0], X[1], ⋯, X[n-1])` ^{[v]} |

`CMEQ` ^{[z]} | `D = (X == Y) ? ones_mask : 0` |

`CMGE` , `CMHS` ^{[z]} | `D = (X >= Y) ? ones_mask : 0` |

`CMGT` , `CMHI` ^{[z]} | `D = (X > Y) ? ones_mask : 0` |

`CMTST` | `D = (X & Y) ? ones_mask : 0` |

^{[a]}Not available for 8-bit lanes. Not available as unsigned.

^{[b]}When using 16/32-bit lanes, can broadcast a single lane of Y to all lanes of Y.

^{[d]}Not available for 64-bit lanes.

^{[p]}`A[i]` is `concat(X, Y)[2*i+0]`, `B[i]` is `concat(X, Y)[2*i+1]`, i.e. adjacent pairs.

^{[v]}Low lane of D gets sum/min/max of all lanes of X, rest of D cleared.

^{[z]}Operands can be registers or constant zero (at least logically).
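The doubling multiplies are easiest to read as fixed-point multiplication: with 16-bit lanes holding Q15 fractions, `SQDMULH` returns the Q15 product. The Python sketch below follows the per-lane formulas from the table; the function names and the `bits` parameter are assumptions for this example.

```python
def sat(value: int, bits: int) -> int:
    """Clamp to the signed range of a `bits`-wide lane."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def sqdmulh(x: int, y: int, bits: int = 16) -> int:
    """D = sat((2 * X * Y) >> bitwidth(D))"""
    return sat((2 * x * y) >> bits, bits)

def sqrdmulh(x: int, y: int, bits: int = 16) -> int:
    """D = sat((2 * X * Y) >>R bitwidth(D))"""
    return sat((2 * x * y + (1 << (bits - 1))) >> bits, bits)
```

For example, `sqdmulh(0x4000, 0x4000)` is `0x2000` (0.5 × 0.5 = 0.25 in Q15). Saturation only matters for the single case of -1.0 × -1.0, where `sqdmulh(-32768, -32768)` gives 32767 rather than wrapping.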

Assorted instructions operating lanewise, with 8/16/32-bit lanes for X and Y, and D lanes twice as wide (X/Y sign-extended or zero-extended before use):

Instruction | Per-lane behaviour (twice-width D) |
---|---|

`SXTL` , `SXTL2` , `UXTL` , `UXTL2` | `D = X` (i.e. just sign/zero extend) |

`SABDL` , `SABDL2` , `UABDL` , `UABDL2` | `D = abs(X - Y)` |

`SABAL` , `SABAL2` , `UABAL` , `UABAL2` | `D += abs(X - Y)` |

`SADDL` , `SADDL2` , `UADDL` , `UADDL2` | `D = X + Y` |

`SSUBL` , `SSUBL2` , `USUBL` , `USUBL2` | `D = X - Y` |

`SADDLP` , `UADDLP` | `D = A + B` ^{[p]} |

`SADALP` , `UADALP` | `D += A + B` ^{[p]} |

`SADDLV` , `UADDLV` | `D[0] = X[0] + X[1] + ⋯ + X[n-1]` ^{[v]} |

`SMULL` , `SMULL2` , `UMULL` , `UMULL2` ^{[b]} | `D = X * Y` |

`SMLAL` , `SMLAL2` , `UMLAL` , `UMLAL2` ^{[b]} | `D += X * Y` |

`SMLSL` , `SMLSL2` , `UMLSL` , `UMLSL2` ^{[b]} | `D -= X * Y` |

`SQDMULL` , `SQDMULL2` ^{[a]} ^{[b]} | `D = sat(2 * X * Y)` |

`SQDMLAL` , `SQDMLAL2` ^{[a]} ^{[b]} | `D = sat(D + sat(2 * X * Y))` |

`SQDMLSL` , `SQDMLSL2` ^{[a]} ^{[b]} | `D = sat(D - sat(2 * X * Y))` |

^{[a]}Not available for 8-bit lanes of X/Y. Not available as unsigned.

^{[b]}When using 16/32-bit lanes of Y, can broadcast a single lane of Y to all lanes.

^{[p]}`A[i]` is `X[2*i+0]`, `B[i]` is `X[2*i+1]`, i.e. adjacent pairs.

^{[v]}Low lane of D gets sum of all lanes of X, rest of D cleared.

A few instructions operating lanewise, with 16/32/64-bit lanes for D and X, and Y lanes half as wide (Y sign-extended or zero-extended before use):

Instruction | Per-lane behaviour (half-width Y) |
---|---|

`SADDW` , `SADDW2` , `UADDW` , `UADDW2` | `D = X + Y` |

`SSUBW` , `SSUBW2` , `USUBW` , `USUBW2` | `D = X - Y` |

A few instructions operating lanewise, with 16/32/64-bit lanes for X and Y, and D lanes half as wide:

Instruction | Per-lane behaviour (half-width D) |
---|---|

`ADDHN` , `ADDHN2` , `SUBHN` , `SUBHN2` | `D = (X ± Y) >> bitwidth(D)` |

`RADDHN` , `RADDHN2` , `RSUBHN` , `RSUBHN2` | `D = (X ± Y) >>R bitwidth(D)` |

Dense linear algebra instructions:

Instruction | Behaviour | D type | X type | Y type |
---|---|---|---|---|

`SDOT` ^{[b]} | `D[i] += dot(X[i], Y[i])` | s32[4] or u32[4] | s8[4][4] | s8[4][4] |

`UDOT` ^{[b]} | `D[i] += dot(X[i], Y[i])` | s32[4] or u32[4] | u8[4][4] | u8[4][4] |

`USDOT` ^{[b]} | `D[i] += dot(X[i], Y[i])` | s32[4] or u32[4] | u8[4][4] | s8[4][4] |

`SUDOT` ^{[b]} | `D[i] += dot(X[i], Y[i])` | s32[4] or u32[4] | s8[4][4] | u8[4][4] |

`SMMLA` | `D += X @ Y` | s32[2][2] or u32[2][2] | s8[2][8] | s8[2][8] |

`UMMLA` | `D += X @ Y` | s32[2][2] or u32[2][2] | u8[2][8] | u8[2][8] |

`USMMLA` | `D += X @ Y` | s32[2][2] or u32[2][2] | u8[2][8] | s8[2][8] |

^{[b]}Can broadcast a 32-bit lane of Y to all 32-bit lanes.

Two oddball instructions operate on 32-bit unsigned lanes containing fixed-precision numbers with 32 fractional bits (i.e. range is 0 through 1-ε):

Instruction | Per-lane behaviour |
---|---|

`URECPE` | `D = sat(0.5 * X^-1)` (approximate, using just top 9 bits) |

`URSQRTE` | `D = sat(0.5 * X^-0.5)` (approximate, using just top 9 bits) |

**Float math**

A broad range of floating-point math instructions are available, operating on fp32 or fp64 lanes (or fp16 subject to the `fp16` extension), in either vector form or scalar form:

Instruction | Per-lane behaviour |
---|---|

`FABS` | `D = abs(X)` |

`FNEG` | `D = -X` |

`FADD` , `FSUB` | `D = X ± Y` |

`FADDP` | `D = A + B` ^{[p]} |

`FABD` | `D = abs(X - Y)` |

`FMUL` ^{[b]} | `D = X * Y` |

`FMULX` ^{[b]} | `D = X * Y` (except that ±0 times ±infinity is ±2.0) |

`FNMUL` ^{[s]} | `D = -(X * Y)` |

`FMADD` , `FMSUB` ^{[s]} | `D = Z ± X * Y` |

`FNMADD` , `FNMSUB` ^{[s]} | `D = -(Z ± X * Y)` |

`FMLA` , `FMLS` ^{[b]} | `D ±= X * Y` |

`FDIV` | `D = X / Y` |

`FSQRT` | `D = X^0.5` |

`FRECPX` ^{[s]} | `D = X^-1` (crude approximate ^{[a]}, using no fractional bits) |

`FRECPE` | `D = X^-1` (approximate, using just 8 fractional bits) |

`FRECPS` | `D = 2.0 - X * Y` ^{[c]} |

`FRSQRTE` | `D = X^-0.5` (approximate, using just 8 fractional bits) |

`FRSQRTS` | `D = 1.5 - 0.5 * X * Y` ^{[d]} |

`FMIN` , `FMINNM` | `D = min(X, Y)` ^{[m]} |

`FMINP` , `FMINNMP` | `D = min(A, B)` ^{[m]} ^{[p]} |

`FMINV` , `FMINNMV` | `D[0] = min(X[0], X[1], ⋯, X[n-1])` ^{[m]} ^{[v]} |

`FMAX` , `FMAXNM` | `D = max(X, Y)` ^{[m]} |

`FMAXP` , `FMAXNMP` | `D = max(A, B)` ^{[m]} ^{[p]} |

`FMAXV` , `FMAXNMV` | `D[0] = max(X[0], X[1], ⋯, X[n-1])` ^{[m]} ^{[v]} |

`FCMEQ` ^{[z]} | `D = (X == Y) ? ones_mask : 0` |

`FCMGE` ^{[z]} | `D = (X >= Y) ? ones_mask : 0` |

`FCMGT` ^{[z]} | `D = (X > Y) ? ones_mask : 0` |

`FACGE` | `D = (abs(X) >= abs(Y)) ? ones_mask : 0` |

`FACGT` | `D = (abs(X) > abs(Y)) ? ones_mask : 0` |

^{[a]}Clears fraction bits, then adds one to exponent if zero, then bitwise inverse of exponent bits. Can be used with `FMULX` as part of vector normalisation.

^{[b]}Can broadcast a single lane of Y to all lanes of Y.

^{[c]}Useful as part of Newton-Raphson step where successive approximations to `a^-1` are computed as `x[n+1] = x[n] * (2.0 - a * x[n])`. See `FRECPE`.

^{[d]}Useful as part of Newton-Raphson step where successive approximations to `a^-0.5` are computed as `x[n+1] = x[n] * (1.5 - 0.5 * a * x[n] * x[n])`. See `FRSQRTE`.

^{[m]}Note that min/max are not *quite* equivalent to comparison followed by selection, due to signed zeros and NaNs. The `NM` variants return the non-NaN operand if exactly one operand is NaN.

^{[p]}`A[i]` is `concat(X, Y)[2*i+0]`, `B[i]` is `concat(X, Y)[2*i+1]`, i.e. adjacent pairs.

^{[s]}Scalar form only, no vector form.

^{[v]}Low lane of D gets min/max of all lanes of X, rest of D cleared.

^{[z]}Operands can be registers or constant zero (at least logically).
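The Newton-Raphson recurrences from the footnotes can be sketched in Python as follows. `frecps` and `frsqrts` mirror the per-lane formulas in the table; the refinement helpers and the hand-picked starting estimates (roughly 8 bits of accuracy, standing in for `FRECPE` / `FRSQRTE`) are assumptions made for this example.

```python
def frecps(x: float, y: float) -> float:
    """FRECPS: D = 2.0 - X * Y"""
    return 2.0 - x * y

def frsqrts(x: float, y: float) -> float:
    """FRSQRTS: D = 1.5 - 0.5 * X * Y"""
    return 1.5 - 0.5 * x * y

def refine_reciprocal(a: float, estimate: float, steps: int = 3) -> float:
    """x[n+1] = x[n] * (2.0 - a * x[n])"""
    x = estimate
    for _ in range(steps):
        x = x * frecps(a, x)
    return x

def refine_rsqrt(a: float, estimate: float, steps: int = 3) -> float:
    """x[n+1] = x[n] * (1.5 - 0.5 * a * x[n] * x[n])"""
    x = estimate
    for _ in range(steps):
        x = x * frsqrts(a * x, x)  # one extra multiply forms a * x[n]
    return x
```

Each step roughly doubles the number of correct bits, so starting from about 8 bits, a handful of steps is enough to reach full precision; `refine_reciprocal(3.0, 85 / 256)` lands within rounding error of 1/3.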

Various per-lane rounding instructions with floating-point inputs and outputs (see "Data type conversion, float to integer" for integer outputs):

Instruction | Rounding | Range ^{[r]} | Exceptions |
---|---|---|---|

`FRINT32X` | Mode from FPCR | -2^{31} ⋯ 2^{31}-1 | Inexact, InvalidOp |

`FRINT32Z` | Toward zero (truncate) | -2^{31} ⋯ 2^{31}-1 | Inexact, InvalidOp |

`FRINT64X` | Mode from FPCR | -2^{63} ⋯ 2^{63}-1 | Inexact, InvalidOp |

`FRINT64Z` | Toward zero (truncate) | -2^{63} ⋯ 2^{63}-1 | Inexact, InvalidOp |

`FRINTA` | To nearest, ties away from zero | Unbounded | |

`FRINTI` | Mode from FPCR | Unbounded | |

`FRINTM` | Toward minus infinity (floor) | Unbounded | |

`FRINTN` | To nearest, ties toward even | Unbounded | |

`FRINTP` | Toward positive infinity (ceil) | Unbounded | |

`FRINTX` | Mode from FPCR | Unbounded | Inexact |

`FRINTZ` | Toward zero (truncate) | Unbounded | |

^{[r]}Out of range results (in either direction) replaced by -2^{31} or -2^{63}.

Mixed-width operations and dense linear algebra:

Instruction | Behaviour | D type | X/Y type |
---|---|---|---|

`FMLAL` , `FMLSL` ^{[b]} | `D[i] ±= X[i+0] * Y[i+0]` | fp32[4] | fp16[8] |

`FMLAL2` , `FMLSL2` ^{[b]} | `D[i] ±= X[i+4] * Y[i+4]` | fp32[4] | fp16[8] |

`BFMLALB` ^{[b]} | `D[i] += X[2*i+0] * Y[2*i+0]` | fp32[4] | bf16[8] |

`BFMLALT` ^{[b]} | `D[i] += X[2*i+1] * Y[2*i+1]` | fp32[4] | bf16[8] |

`BFDOT` ^{[b]} | `D[i] += dot(X[i], Y[i])` | fp32[4] | bf16[4][2] |

`BFMMLA` | `D += X @ Y` | fp32[2][2] | bf16[2][4] |

^{[b]}Can broadcast a single lane of Y to all lanes of Y (for `BFDOT`, a lane is 32 bits).

**Float comparisons involving PSTATE**

The `FCMP`, `FCMPE` family perform a three-way comparison of the low fp16/fp32/fp64 lane of two operands, writing the result to `PSTATE`:

- `PSTATE.N` is set to `X[0] < Y[0]` (false if either operand NaN)
- `PSTATE.Z` is set to `X[0] == Y[0]` (false if either operand NaN)
- `PSTATE.C` is set to `!(X[0] < Y[0])` (true if either operand NaN)
- `PSTATE.V` is set to `is_nan(X[0]) or is_nan(Y[0])`
- If either `X[0]` or `Y[0]` is an SNaN, InvalidOp exception is raised
- `FCMPE` only: if either `X[0]` or `Y[0]` is a QNaN, InvalidOp exception is raised

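Following `FCMP X, Y`, the meaning of condition codes is:

Condition | Meaning | Condition | Meaning |
---|---|---|---|

EQ | `X[0] == Y[0]` | NE | `!(X[0] == Y[0])` |

LS | `X[0] <= Y[0]` | HI | `!(X[0] <= Y[0])` |

LO | `X[0] < Y[0]` | HS | `!(X[0] < Y[0])` |

MI | `X[0] < Y[0]` | PL | `!(X[0] < Y[0])` |

CC | `X[0] < Y[0]` | CS | `!(X[0] < Y[0])` |

GT | `X[0] > Y[0]` | LE | `!(X[0] > Y[0])` |

GE | `X[0] >= Y[0]` | LT | `!(X[0] >= Y[0])` |

VS | `is_nan(X[0]) or is_nan(Y[0])` | VC | `!is_nan(X[0]) and !is_nan(Y[0])` |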
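The way NaNs fall out of the condition codes is easier to see with a small model. The Python sketch below computes the four flags as described for `FCMP`, then evaluates the standard AArch64 condition codes against them; the function names are made up for this example, and exceptions are not modelled.

```python
import math

def fcmp_flags(x: float, y: float):
    """Return (N, Z, C, V) as set by FCMP on the low lanes x and y."""
    unordered = math.isnan(x) or math.isnan(y)
    n = (not unordered) and x < y
    z = (not unordered) and x == y
    c = not n              # true for equal, greater, or unordered
    v = unordered
    return n, z, c, v

def condition_holds(cond: str, flags) -> bool:
    """Evaluate an AArch64 condition code against (N, Z, C, V)."""
    n, z, c, v = flags
    table = {
        "EQ": z,                  "NE": not z,
        "HS": c, "CS": c,         "LO": not c, "CC": not c,
        "MI": n,                  "PL": not n,
        "VS": v,                  "VC": not v,
        "HI": c and not z,        "LS": not (c and not z),
        "GE": n == v,             "LT": n != v,
        "GT": (not z) and n == v, "LE": z or n != v,
    }
    return table[cond]
```

With a NaN operand, `GT`, `GE`, `LO`, and `LS` are all false while their opposites `LE`, `LT`, `HS`, and `HI` are all true, so picking the right condition code lets the branch go whichever way is wanted for the unordered case.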

The `FCCMP`, `FCCMPE` family perform a *conditional* three-way comparison of the low fp16/fp32/fp64 lane of two operands: some condition is evaluated against the contents of `PSTATE`; if true, the instruction behaves like `FCMP`/`FCMPE`; if false, a four-bit immediate is written to the relevant `PSTATE` bits.

The `FCSEL` instruction uses `PSTATE` to conditionally select between two scalar fp16/fp32/fp64 operands: `D[0] = cond ? X[0] : Y[0]` (other lanes of D cleared).

**Data type conversion, float to float**

Vector form, FPCR rounding:

Source type | to bf16 | to fp16 | to fp32 | to fp64 |
---|---|---|---|---|

From bf16 | no change | via fp32 | `SHLL` , `SHLL2` | via fp32 |

From fp16 | via fp32 | no change | `FCVTL` , `FCVTL2` | via fp32 |

From fp32 | `BFCVTN` , `BFCVTN2` | `FCVTN` , `FCVTN2` | no change | `FCVTL` , `FCVTL2` |

From fp64 | via fp32 ^{[x]} | via fp32 ^{[x]} | `FCVTN` , `FCVTN2` | no change |

^{[x]}Using `FCVTXN` or `FCVTXN2`, which employ round-to-odd rounding mode.

For scalar conversions, `FCVT` can convert from any of fp16/fp32/fp64 to any other of fp16/fp32/fp64.
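The bf16 to fp32 direction needs no dedicated conversion instruction because a bf16 value is just the top 16 bits of the fp32 with the same value, so widening each lane and shifting left by 16 is enough. A Python sketch of that bit-level relationship (function names invented here; the narrowing helper truncates rather than rounding, so it is a simplification and not a model of `BFCVTN`):

```python
import struct

def bf16_bits_to_float(bits16: int) -> float:
    """Widen to 32 bits, shift left by 16, reinterpret as fp32."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value

def float_to_bf16_bits_truncating(value: float) -> int:
    """Keep the top 16 bits of the fp32 encoding (no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16
```

For example, the bf16 bit pattern `0x3F80` is 1.0, and `0x4049` is 3.140625 (the closest bf16 below π).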

**Data type conversion, integer to float**

Vector form, FPCR rounding, free division by a power of two afterwards:

Source type | to fp16 | to fp32 | to fp64 |
---|---|---|---|

From s16 or u16 | `SCVTF` or `UCVTF` | via s32 or u32 | via s64 or u64 |

From s32 or u32 | via fp32 | `SCVTF` or `UCVTF` | via s64 or u64 |

From s64 or u64 | via fp64 | via fp64 | `SCVTF` or `UCVTF` |

`SCVTF` and `UCVTF` can also take a GPR as input (32-bit or 64-bit, signed or unsigned), and convert that to any of fp16/fp32/fp64, again with a free division by a power of two afterwards.

**Data type conversion, float to integer**

A family of conversion instructions exist, available in two forms:

- Vector destination, fp16 to s16/u16, fp32 to s32/u32, fp64 to s64/u64
- GPR destination, any of fp16/fp32/fp64 to any of s32/u32/s64/u64

Instruction | Rounding | Overflow |
---|---|---|

`FCVTAS` , `FCVTAU` | To nearest, ties away from zero | Saturate |

`FCVTMS` , `FCVTMU` | Toward minus infinity (floor) | Saturate |

`FCVTNS` , `FCVTNU` | To nearest, ties toward even | Saturate |

`FCVTPS` , `FCVTPU` | Toward positive infinity (ceil) | Saturate |

`FCVTZS` , `FCVTZU` ^{[f]} | Toward zero (truncate) | Saturate |

`FJCVTZS` ^{[j]} | Toward zero (truncate) | Modulo 2^{32} |

^{[f]}Free multiplication by a power of two possible before the conversion.

^{[j]}Only exists in fp64 to s32 GPR form. Also sets `PSTATE`.
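The difference between the "Saturate" and "Modulo 2^{32}" overflow columns can be sketched in Python for the fp64 to s32 case. Function names are invented for this example; the `FJCVTZS` model follows JavaScript's `ToInt32` rules (NaN and infinities give 0), and the `PSTATE` side effect is not modelled.

```python
import math

def fcvtzs_s32(x: float) -> int:
    """Truncate toward zero, saturating to the s32 range (NaN gives 0)."""
    if math.isnan(x):
        return 0
    if x >= 2.0 ** 31:
        return 2 ** 31 - 1
    if x < -(2.0 ** 31):
        return -(2 ** 31)
    return math.trunc(x)

def fjcvtzs(x: float) -> int:
    """Truncate toward zero, then wrap modulo 2^32 into the s32 range."""
    if math.isnan(x) or math.isinf(x):
        return 0
    wrapped = math.trunc(x) % 2 ** 32
    return wrapped - 2 ** 32 if wrapped >= 2 ** 31 else wrapped
```

For an input of 3e9, the saturating form gives 2147483647 whereas the modulo form gives -1294967296; for in-range inputs the two agree.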

**Complex float math**

A pair of floating point lanes can represent a complex floating point number, where the low scalar lane contains the real part of the complex number and the high scalar lane contains the imaginary part of the complex number. A 128-bit register can then contain 4 fp16 complex lanes, or 2 fp32 complex lanes, or a single fp64 complex lane. A few instructions exist for manipulating these:

Instruction | Real part of result | Imaginary part of result |
---|---|---|

`FCADD #90` | `Re(D) = Re(X) - Im(Y)` | `Im(D) = Im(X) + Re(Y)` |

`FCADD #270` | `Re(D) = Re(X) + Im(Y)` | `Im(D) = Im(X) - Re(Y)` |

`FCMLA #0` ^{[b]} | `Re(D) += Re(X) * Re(Y)` | `Im(D) += Re(X) * Im(Y)` |

`FCMLA #90` ^{[b]} | `Re(D) -= Im(X) * Im(Y)` | `Im(D) += Im(X) * Re(Y)` |

`FCMLA #180` ^{[b]} | `Re(D) -= Re(X) * Re(Y)` | `Im(D) -= Re(X) * Im(Y)` |

`FCMLA #270` ^{[b]} | `Re(D) += Im(X) * Im(Y)` | `Im(D) -= Im(X) * Re(Y)` |

^{[b]}Can broadcast a complex lane (i.e. 2 scalars) of Y to all complex lanes of Y.
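No single rotation performs a full complex multiply, but pairs of them do. The Python sketch below models one complex lane of `FCMLA` directly from the table (complex numbers as `(real, imag)` tuples, names invented for this example) and shows that rotations 0 and 90 together give `D += X * Y`.

```python
def fcmla(d, x, y, rotation):
    """One complex lane of FCMLA; d, x, y are (real, imag) tuples."""
    re, im = d
    xr, xi = x
    yr, yi = y
    if rotation == 0:
        return (re + xr * yr, im + xr * yi)
    if rotation == 90:
        return (re - xi * yi, im + xi * yr)
    if rotation == 180:
        return (re - xr * yr, im - xr * yi)
    if rotation == 270:
        return (re + xi * yi, im - xi * yr)
    raise ValueError("rotation must be 0, 90, 180, or 270")

def complex_multiply_accumulate(d, x, y):
    """D += X * Y, built from FCMLA #0 followed by FCMLA #90."""
    return fcmla(fcmla(d, x, y, 0), x, y, 90)
```

Other pairings are useful too: rotations 180 and 270 together give `D -= X * Y`, and rotations 0 and 270 together give `D += conj(X) * Y`.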

**Polynomial math**

The `PMUL`, `PMULL`, and `PMULL2` instructions all perform `D = X * Y`, where all lanes of D/X/Y contain ℤ₂ polynomials. This is alternatively known as carryless multiplication (`pclmulqdq` on x86).

Instruction | D lanes | X/Y lanes |
---|---|---|

`PMUL` | 8-bit poly (high 7 bits of result discarded) | 8-bit poly |

`PMULL` , `PMULL2` | 16-bit poly (top bit always clear) | 8-bit poly |

`PMULL` , `PMULL2` ^{[c]} | 128-bit poly (top bit always clear) | 64-bit poly |

^{[c]}Requires `crypto` extension.

For ℤ₂ polynomial addition/subtraction, see `EOR` or `EOR3`. Polynomial division and remainder against a constant `Y` can be performed via multiplication.
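For readers who have not met carryless multiplication before, here is a minimal Python sketch of one lane. The function names are made up for this example; bit `k` of an integer stands for the coefficient of `x^k`, and partial products are combined with xor instead of add.

```python
def clmul(x: int, y: int) -> int:
    """Multiply two ℤ₂ polynomials: shift-and-xor instead of shift-and-add."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        y >>= 1
    return result

def pmul8(x: int, y: int) -> int:
    """8-bit lanes in, 8-bit lane out: high 7 bits of the product discarded."""
    return clmul(x, y) & 0xFF

def pmull8(x: int, y: int) -> int:
    """8-bit lanes in, 16-bit lane out: the product fits in 15 bits."""
    return clmul(x, y)
```

For example, `clmul(0b11, 0b11)` is `0b101`, since (x + 1)² = x² + 1 when coefficients are taken modulo 2.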

**Cryptography**

Some instructions are provided to accelerate AES encryption. A single round of AES encryption consists of `AddRoundKey` (just xor), then `SubBytes` and `ShiftRows` (in either order), then optionally `MixColumns` (performed for every round except the last). The provided instructions are:

Instruction | Steps |
---|---|

AESE | `AddRoundKey` then `ShiftRows` and `SubBytes` |

AESMC | `MixColumns` |

AESD | Inverse `AddRoundKey` then inverse `ShiftRows` and inverse `SubBytes` |

AESIMC | Inverse `MixColumns` |

Note that x86 AES instructions are slightly different, for example `aesenc` there does `ShiftRows` and `SubBytes`, then `MixColumns`, then `AddRoundKey`.

Some instructions are provided to accelerate SHA-1 hashes: `SHA1C`, `SHA1H`, `SHA1M`, `SHA1P`, `SHA1SU0`, `SHA1SU1`.

Some instructions are provided to accelerate SHA-2 hashes: `SHA256H`, `SHA256H2`, `SHA256SU0`, `SHA256SU1` for SHA-256, and `SHA512H`, `SHA512H2`, `SHA512SU0`, `SHA512SU1` for SHA-512.

Some instructions are provided to accelerate SHA-3 hashes: `EOR3`, `RAX1`, `XAR`, `BCAX`. See "Shifts" or "Bitwise" for descriptions.

Some instructions are provided to accelerate SM3 hashes: `SM3SS1`, `SM3TT1A`, `SM3TT1B`, `SM3TT2A`, `SM3TT2B`, `SM3PARTW1`, `SM3PARTW2`.

Some instructions are provided to accelerate SM4 encryption: `SM4E`, `SM4EKEY`.