# A whirlwind tour of AArch64 vector instructions

32 vector registers, each 128 bits wide. There is also a control register (`FPCR`) and a status register (`FPSR`), though scalar comparison instructions use `PSTATE`. Each vector register can contain:

- A single lane in the low bits (scalar)
- Multiple lanes in the low 64 bits (back-compat with AArch32)
- Multiple lanes collectively occupying all 128 bits

128 bits might seem small compared to AVX-512, but for most vector instructions, the M1's Firestorm cores can issue four instances of the instruction per cycle, which gets you to a similar place. There are also AMX units on the M1.

A lane is 1/8/16/32/64/128 bits wide. The common lane types are:

| | 1-bit | 8-bit | 16-bit | 32-bit | 64-bit | 128-bit |
|---|---|---|---|---|---|---|
| uint | bit | uint8 ^{[q]} | uint16 ^{[q]} | uint32 ^{[q]} | uint64 ^{[q]} | |
| sint | | sint8 ^{[q]} | sint16 ^{[q]} | sint32 ^{[q]} | sint64 ^{[q]} | |
| fp | | | fp16 ^{[a]} | fp32 | fp64 | |
| bf | | | bf16 ^{[b]} | | | |
| poly | bit | poly8 | poly16 | | poly64 ^{[c]} | poly128 ^{[c]} |

^{[a]} Requires the `fp16` extension; improved further by the `fp16fml` extension. ^{[b]} Requires the `bf16` extension (present on Apple M2, but not on M1). ^{[c]} Requires the `crypto` extension. ^{[q]} Often with the choice of truncate on overflow or saturate on overflow.

The syntactic names for the 32 registers vary based on the lane width and also how much of the register is being used:

| Lane width | Low scalar | Low 64 bits | All 128 bits |
|---|---|---|---|
| 1 bit | N/A | `V0.8B` ⋯ `V31.8B` | `V0.16B` ⋯ `V31.16B` |
| 8 bits | `B0` ⋯ `B31` | `V0.8B` ⋯ `V31.8B` | `V0.16B` ⋯ `V31.16B` |
| 16 bits | `H0` ⋯ `H31` | `V0.4H` ⋯ `V31.4H` | `V0.8H` ⋯ `V31.8H` |
| 32 bits | `S0` ⋯ `S31` | `V0.2S` ⋯ `V31.2S` | `V0.4S` ⋯ `V31.4S` |
| 64 bits | `D0` ⋯ `D31` | `V0.1D` ⋯ `V31.1D` | `V0.2D` ⋯ `V31.2D` |
| 128 bits | `Q0` ⋯ `Q31` | N/A | `V0.1Q` ⋯ `V31.1Q` |

A single lane can be denoted by appending `[0]` ⋯ `[15]` to one of the above; for example, `V7.4S[3]` denotes the most significant 32-bit lane of `V7`. Writes to a single lane with this syntax generally preserve the other lanes, whereas in all other cases, writes to the low bits of a register generally zero the remaining bits (though see `FPCR.NEP`).

Vector instructions have up to six input registers, and up to one output register (with the exception of loads, which can have up to four output registers). In most cases, all the registers involved in a single instruction have the same lane width, though there are exceptions:

- *Narrowing instructions* have an output register whose lane width is half that of the input registers. These only write to half of the output register, and typically come in pairs: one instruction writes to the low half and clears the high half, while the other writes to the high half and preserves the low half.
- *Widening instructions* have an output register whose lane width is twice that of the input registers. These only consume half of each input register, and typically come in pairs (consuming different halves of the inputs).
- *Partially widening instructions* have an output register whose lane width is twice that of *some* inputs, and equal to that of other inputs. These typically come in pairs (consuming different halves of the *applicable* inputs).
- *Widen-then-accumulate instructions* have an output register whose lane width is twice or four times that of *some* input registers. Inputs are widened, then combined, then *multiple* intermediate results are accumulated onto each output lane. These do not come in pairs, as the entire input is used.

In most cases, operations are lane-wise: lane `i` of the output register is formed by combining lane `i` of each input register, though there are exceptions:

- Instructions with a single input register sometimes perform a scalar reduction: the low lane of the output register is formed by combining *all* lanes of the input register, and other lanes are zeroed. Annotated as ^{[v]}.
- Instructions with two input registers sometimes concatenate the inputs to form a 256-bit value and then operate on *adjacent* pairs of lanes: lane `i` of the output is made by combining lanes `i*2` and `i*2+1` of the concatenation. Annotated as ^{[p]}.
- Instructions sometimes use the output register as an extra input, so lane `i` of the output register is formed by combining the previous lane `i` of the output register with lane `i` of each input register.
- Some instructions allow a single lane from one of the inputs to be broadcast to all lanes (the term *by element* is used for this). Often not permitted for 8-bit lanes. Often slightly restricted for 16-bit lanes: the source register must be `V0` through `V15` (i.e. cannot be `V16` through `V31`). Annotated as ^{[b]}.
- Data movement instructions (shuffles/permutes/etc.) can have a bespoke relationship between input lanes and output lanes.

**GPR to / from vector**

| Instruction | Direction | Behaviour |
|---|---|---|
| `FMOV` | GPR to vector | Truncate to lane width, then zero extend |
| `DUP` | GPR to vector | Truncate then replicate to all lanes |
| `INS` | GPR to vector | Truncate then insert to arbitrary lane |
| `FMOV` | Vector to GPR | Take low lane, zero extend to GPR width |
| `UMOV` | Vector to GPR | Arbitrary lane, zero extend to GPR width |
| `SMOV` | Vector to GPR | Arbitrary lane, sign extend to GPR width |

A number of data type conversion instructions can also have a GPR as source or destination (see "Data type conversion, integer to float" and "Data type conversion, float to integer").

**Load / store**

A scalar load moves 8/16/32/64 bits from memory to (part of) a vector register, whereas a vector load moves 64/128 bits. This can be repeated up to four times, reading from consecutive memory locations and writing to distinct vector registers (which must be consecutively numbered, except for `LDP` / `LDNP`).

| | ×1 | ×2 | ×3 | ×4 |
|---|---|---|---|---|
| Scalar (low lane, zero others) | `LDR` `LDUR` | `LDP` `LDNP` | | |
| Scalar (any lane, preserve others) | `LD1` (SS 1R) | `LD2` (SS 2R) | `LD3` (SS 3R) | `LD4` (SS 4R) |
| Scalar (replicate to all lanes) | `LD1R` | `LD2R` | `LD3R` | `LD4R` |
| Vector | `LD1` (MS 1R) `LDR` `LDUR` | `LD1` (MS 2R) `LDP` `LDNP` | `LD1` (MS 3R) | `LD1` (MS 4R) |
| Vector, transposed | | `LD2` (MS 2R) | `LD3` (MS 3R) | `LD4` (MS 4R) |

With the exception of scalar replicating to all lanes, all load instructions have a corresponding store instruction performing the inverse: replace `LD` with `ST`.

The SS or MS suffix denotes what the ARM reference manual calls "single structure" or "multiple structures", and is followed by the number of destination registers (1, 2, 3, or 4). The operand syntax relates to this suffix.

The vector transposed loads perform one of the following transposes, where M denotes the number of destination registers (2, 3, or 4):

| | 8-bit lanes | 16-bit lanes | 32-bit lanes | 64-bit lanes |
|---|---|---|---|---|
| 64-bit vectors | 8×M ↦ M×8 | 4×M ↦ M×4 | 2×M ↦ M×2 | |
| 128-bit vectors | 16×M ↦ M×16 | 8×M ↦ M×8 | 4×M ↦ M×4 | 2×M ↦ M×2 |

The addressing modes are all over the place:

| | Base | Offset | Writeback mode |
|---|---|---|---|
| `LDR` | Xn or SP | signed `imm9` | Pre- or post-index |
| `LDUR` | Xn or SP | signed `imm9` | No writeback |
| `LDR` | Xn or SP | unsigned `imm12`, scaled | No writeback |
| `LDR` | Xn or SP | Xm or Wm, optional extend/scale | No writeback |
| `LDR` | PC | signed `imm19`, times 4 | No writeback |
| `LDP` | Xn or SP | signed `imm7`, scaled | Pre- or post-index |
| `LDP` | Xn or SP | signed `imm7`, scaled | No writeback |
| `LDNP` | Xn or SP | signed `imm7`, scaled | No writeback |
| Others | Xn or SP | unsigned `imm1`, scaled (or Xm) | Post-index |

**Data movement**

Moving lanes around:

| | Source lane | Destination lane | Other lanes |
|---|---|---|---|
| `FMOV` | Low lane | Low lane | Zeroed |
| `MOV` (`DUP`) | Arbitrary | Low lane | Zeroed |
| `MOV` (`INS`) | Arbitrary | Arbitrary | Preserved |
| `DUP` | Arbitrary | All lanes (replicated) | N/A |
| `MOV` (`ORR`) | All lanes | All lanes (1:1) | N/A |

Various reversals:

| | in bytes | in u16s | in u32s | in u64s | in u128s |
|---|---|---|---|---|---|
| Reverse bits | `RBIT` | | | | |
| Reverse bytes | no-op | `REV16` | `REV32` | `REV64` | `TBL` |
| Reverse u16s | | no-op | `REV32` | `REV64` | `TBL` |
| Reverse u32s | | | no-op | `REV64` | `TBL` |
| Reverse u64s | | | | no-op | `EXT` |

Various ways of creating one vector from two:

| | First step | Second step |
|---|---|---|
| `TRN1`, `TRN2` | Discard odd or even lanes | Interleave lanes |
| `ZIP1`, `ZIP2` | Discard high half or low half | Interleave lanes |
| `UZP1`, `UZP2` | Concatenate vectors | Discard odd or even lanes |
| `EXT` | Concatenate vectors | Take a contiguous 16-byte span |

`EXT` with the same source vector for both operands gives a rotation by a whole number of bytes. Note that some math instructions come with a free `UZP1` and `UZP2` (annotated as ^{[p]}).

The `TBL`, `TBX` family concatenate 1-4 vectors from consecutively numbered registers to form a table `T` of 16/32/48/64 bytes, then another byte vector serves as indices into said table:

| | Per-byte behaviour |
|---|---|
| `TBL` | `D`_{i} = (Y_{i} < len(T)) ? T[Y_{i}] : 0 |
| `TBX` | `D`_{i} = (Y_{i} < len(T)) ? T[Y_{i}] : D_{i} |

On x86, `pshufb` does `D`_{i} = (Y_{i} < 128) ? X[Y_{i} & 15] : 0, which is similar.

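
As a rough mental model, the per-byte behaviour can be sketched in Python (this is an illustration of the semantics, not tied to any intrinsic API):

```python
def tbl(table, indices):
    # TBL: each output byte is a table lookup; out-of-range indices give 0
    return [table[i] if i < len(table) else 0 for i in indices]

def tbx(table, indices, dest):
    # TBX: like TBL, but out-of-range indices preserve the destination byte
    return [table[i] if i < len(table) else d for i, d in zip(indices, dest)]
```

So with a one-register (16-byte) table, any index of 16 or above selects 0 under `tbl`, but leaves the destination byte alone under `tbx`.
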
**Immediates**

One flavour of `FMOV` loads constants of the form `±(1.0 + m/16) × 2^e` (where `0 ≤ m ≤ 15` and `−3 ≤ e ≤ 4`) into the low fp16/fp32/fp64 lane, and either zeros the other lanes or replicates the constant to the other lanes.
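
Concretely, that form gives 256 distinct constants; a quick Python sketch enumerating them (all of the values are exactly representable in binary floating point, so the arithmetic below is exact):

```python
def fmov_immediates():
    # Every constant expressible as +/-(1.0 + m/16) * 2**e
    vals = set()
    for sign in (1.0, -1.0):
        for m in range(16):          # 4-bit fraction
            for e in range(-3, 5):   # 3-bit exponent
                vals.add(sign * (1.0 + m / 16.0) * 2.0 ** e)
    return vals
```

The positive values run from 0.125 up to 31.0; notably, 0.0 is not encodable (use `MOVI` for that).
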

One flavour of `MOVI` loads a constant into the low 64-bit lane, each byte of which is independently either 0 or 255, and either zeros the high u64 lane or replicates the constant there.

The remaining flavours of `MOVI` and `MVNI` load a constant into the low 8/16/32-bit lane, one byte of which is an arbitrary value, bytes to the left either all 0 or all 255, and bytes to the right either all 0 or all 255, then replicate this constant to all lanes.

The bitwise `BIC` and `ORR` instructions support constants of the form: 16/32-bit lane, one byte of which is an arbitrary value, other bytes all 0, replicated to all lanes.

Various comparison instructions support comparison against constant zero:

| | Signed | Unsigned | Floating |
|---|---|---|---|
| `X == 0` | `CMEQ X, #0` | `CMEQ X, #0` | `FCMEQ X, #0.0` |
| `X <= 0` | `CMLE X, #0` | `CMEQ X, #0` | `FCMLE X, #0.0` |
| `X < 0` | `CMLT X, #0` | always false | `FCMLT X, #0.0` |
| `X > 0` | `CMGT X, #0` | `CMTST X, X` | `FCMGT X, #0.0` |
| `X >= 0` | `CMGE X, #0` | always true | `FCMGE X, #0.0` |
| `X <=> 0` | N/A | N/A | `FCMP X, #0.0` |

**Shifts**

Note `>>R` is used to denote a rounding right shift: if the most significant bit shifted out was a 1, then 1 is added to the result. If shifting right by `N`, this is equivalent to adding `1 << (N - 1)` to the input before shifting.
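
In Python terms, a small sketch of the two equivalent formulations (Python's `>>` is an arithmetic shift, so this models the signed case too):

```python
def srshr(x, n):
    # Rounding right shift (>>R): add 1 << (n-1) before shifting, which is
    # the same as adding 1 afterwards whenever the last bit shifted out was 1
    return (x + (1 << (n - 1))) >> n

# Contrast with a plain truncating/flooring shift:
assert (5 >> 1, srshr(5, 1)) == (2, 3)
assert (-5 >> 1, srshr(-5, 1)) == (-3, -2)
```
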

In-lane shifts by immediate:

| | Per-lane behaviour |
|---|---|
| `SHL` | `D = X << N` |
| `SQSHL` , `UQSHL` | `D = sat(X << N)` |
| `SQSHLU` | `D = sat(X << N)` (X signed, D unsigned) |
| `SLI` | `D = (X << N) | bzhi(D, N)` (bzhi clears all but the low N bits) |
| `SSHR` , `USHR` | `D = X >> N` |
| `SRSHR` , `URSHR` | `D = X >>R N` |
| `SSRA` , `USRA` | `D += X >> N` |
| `SRSRA` , `URSRA` | `D += X >>R N` |
| `SRI` | `D = (X >> N) | bzlo(D, N)` (bzlo clears all but the high N bits) |

In-lane variable shifts, where the shift amount is a *signed* value in the low 8 bits of each lane in the 2^{nd} operand:

| | Per-lane behaviour (Y > 0) | Per-lane behaviour (Y < 0) |
|---|---|---|
| `SSHL` , `USHL` | `D = X << Y` | `D = X >> -Y` |
| `SRSHL` , `URSHL` | `D = X << Y` | `D = X >>R -Y` |
| `SQSHL` , `UQSHL` | `D = sat(X << Y)` | `D = X >> -Y` |
| `SQRSHL` , `UQRSHL` | `D = sat(X << Y)` | `D = X >>R -Y` |

Widening shifts by immediate (from 8-bit lanes to 16-bit, 16-bit lanes to 32-bit, or 32-bit lanes to 64-bit), where X is sign-extended or zero-extended before use, and D lanes are twice the width of X lanes:

| | Per-lane behaviour (twice-width D) |
|---|---|
| `SSHLL` , `SSHLL2` , `USHLL` , `USHLL2` | `D = X << N` (where `0 ≤ N < bitwidth(X)`) |
| `SHLL` , `SHLL2` | `D = X << bitwidth(X)` |

Narrowing shifts by immediate (from 64-bit lanes to 32-bit, 32-bit lanes to 16-bit, or 16-bit lanes to 8-bit), where D lanes are half the width of X lanes. In all cases, `1 ≤ N ≤ bitwidth(D)`:

| | Per-lane behaviour (half-width D) |
|---|---|
| `XTN` , `XTN2` | `D = truncate(X)` |
| `SHRN` , `SHRN2` | `D = truncate(X >> N)` |
| `RSHRN` , `RSHRN2` | `D = truncate(X >>R N)` |
| `SQXTN` , `SQXTN2` | `D = sat(X)` (signed) |
| `UQXTN` , `UQXTN2` | `D = sat(X)` (unsigned) |
| `SQXTUN` , `SQXTUN2` | `D = sat(X)` (X signed, D unsigned) |
| `SQSHRN` , `SQSHRN2` | `D = sat(X >> N)` (signed) |
| `UQSHRN` , `UQSHRN2` | `D = sat(X >> N)` (unsigned) |
| `SQSHRUN` , `SQSHRUN2` | `D = sat(X >> N)` (X signed, D unsigned) |
| `SQRSHRN` , `SQRSHRN2` | `D = sat(X >>R N)` (signed) |
| `UQRSHRN` , `UQRSHRN2` | `D = sat(X >>R N)` (unsigned) |
| `SQRSHRUN` , `SQRSHRUN2` | `D = sat(X >>R N)` (X signed, D unsigned) |

There is no narrowing shift from 8-bit lanes to something narrower. This is a notable difference from x86, where `pmovmskb` can pull 1 bit out of every 8-bit lane. That said, `SHRN` from 16-bit lanes to 8-bit lanes with `N` set to 4 does something interesting: it pulls 4 bits out of every 8-bit lane, taking the high 4 bits of even lanes and the low 4 bits of odd lanes. Alternating between high/low halves is weird, but innocuous if every 8-bit lane starts out containing either 0 or 255.
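
A Python sketch of why that works (assuming little-endian byte order within each 16-bit lane; the helper names are illustrative only):

```python
def shrn_16_to_8(lanes16, n):
    # SHRN: shift each 16-bit lane right by n, keep the low 8 bits
    return [(x >> n) & 0xFF for x in lanes16]

def bytes_to_u16_lanes(bs):
    # Pair up bytes little-endian: the even byte is the low half of each lane
    return [bs[i] | (bs[i + 1] << 8) for i in range(0, len(bs), 2)]

# Each source byte is 0x00 or 0xFF, e.g. the result of a comparison:
mask_bytes = [0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0x00, 0x00]
out = shrn_16_to_8(bytes_to_u16_lanes(mask_bytes), 4)
# Each output byte now holds 4 bits from each of two source bytes,
# giving a 4-bit-per-lane mask packed into half the space.
```
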

Subject to the `sha3` extension, two instructions provide rotates in 64-bit lanes:

| | Per-lane behaviour (64-bit lanes only) |
|---|---|
| `RAX1` | `D = X xor rotate_left(Y, 1)` |
| `XAR` | `D = rotate_right(X xor Y, N)` |

Note that rotate by immediate (without xor, for any lane width, and without needing `sha3`) can be constructed from `SHL` followed by `USRA`.
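
A Python sketch of that two-instruction rotate (for `0 < r < lane_bits`; since the two shifted halves don't overlap, the `USRA` accumulate acts as an OR):

```python
def rotate_left(x, r, lane_bits=32):
    mask = (1 << lane_bits) - 1
    d = (x << r) & mask              # SHL:  D = X << r (truncated to the lane)
    d += (x & mask) >> (lane_bits - r)  # USRA: D += X >> (lane_bits - r)
    return d
```
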

**Bitwise**

Assorted instructions operating bitwise:

| | Description | Per-bit behaviour |
|---|---|---|
| `AND` | And | `D = X and Y` |
| `BCAX` ^{[s]} | Clear and xor | `D = X xor (Y and not Z)` |
| `BIC` | Clear | `D = X and not Y` |
| `BIC` | Clear (immediate) | `D = D and not imm` ^{[i]} |
| `BIF` | Insert if false | `D = Y ? D : X` |
| `BIT` | Insert if true | `D = Y ? X : D` |
| `BSL` | Select | `D = D ? X : Y` |
| `EOR` | Xor | `D = X xor Y` |
| `EOR3` ^{[s]} | Xor (three-way) | `D = X xor Y xor Z` |
| `NOT` | Not | `D = not X` |
| `ORN` | Or not | `D = X or not Y` |
| `ORR` | Or | `D = X or Y` |
| `ORR` | Or (immediate) | `D = D or imm` ^{[i]} |

^{[s]} Requires the `sha3` extension. ^{[i]} Immediate is a 16-bit or 32-bit constant where one byte is an arbitrary `imm8` and other bytes are zero, broadcast to all 16-bit or 32-bit lanes.

Bit counting instructions:

| | Counts | Possible lane widths |
|---|---|---|
| `CLS` | Leading sign bits | 8-bit, 16-bit, 32-bit |
| `CLZ` | Leading zero bits (i.e. lzcnt) | 8-bit, 16-bit, 32-bit |
| `CNT` | Non-zero bits (i.e. popcnt) | 8-bit ^{[a]} |

^{[a]} Other lane widths can be achieved by a follow-up `UADDLP` or `ADDV`.

There are no horizontal bitwise instructions in the traditional sense, though various horizontal reductions can be constructed from other instructions:

- Branch if any bit set: `UMAXP` (any lane width) to reduce 128 bits to 64 bits, `FMOV` to move 64 bits to a GPR, then `CBNZ`. If GPRs are at a premium, can instead `UMAXP` (any lane width), then `CMTST` (with 64-bit lanes), then `FCMP` (any lane width) against literal floating-point zero, then `B.NE` (if floating-point exceptions around NaNs are a concern, insert `BIC` with an immediate between `CMTST` and `FCMP`). On x86, this is `ptest` followed by `jcc`.
- Within each group of 8/16/32/64 bits, set all bits if any bit set: `CMTST`. For the bitwise inverse of this, `CMEQ` against literal zero. Note that `CMTST` comes with a free bitwise-and (`CMEQ` doesn't), though this can be bypassed by specifying the same input register twice.
- Provided that each group of 8/16/32/64 bits is either all ones or all zeros, find the index of the first group of ones: `SHRN` (from 16-bit lanes to 8-bit lanes, shifting by 4 bits), then `FMOV` to move 64 bits to a GPR, then `RBIT`, then `CLZ`, then divide by 4/8/16/32.
- Horizontal pairwise operations on groups of 8/16/32/64 bits, where each group is either all ones or all zeros: `UMAXP` or `SMINP` give bitwise-or, `UMINP` or `SMAXP` give bitwise-and.
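
The first reduction can be sketched in Python, modelling `UMAXP` on unsigned byte lanes (a semantic model, not real intrinsics):

```python
def umaxp(x, y):
    # UMAXP: concatenate the two input vectors, then take the max of each
    # adjacent pair of lanes
    z = x + y
    return [max(z[2 * i], z[2 * i + 1]) for i in range(len(z) // 2)]

v = [0] * 16
v[11] = 7                  # one stray nonzero byte somewhere in the vector
low64 = umaxp(v, v)[:8]    # the low 8 output lanes summarise all 16 inputs
any_bit_set = any(low64)   # FMOV these 64 bits to a GPR, then CBNZ
```
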

**Integer math**

Assorted instructions operating lanewise, with 8/16/32/64-bit lanes:

| | Per-lane behaviour |
|---|---|
| `ABS` | `D = abs(X)` |
| `SQABS` | `D = sat(abs(X))` |
| `NEG` | `D = -X` |
| `SQNEG` | `D = sat(-X)` |
| `ADD` , `SUB` | `D = X ± Y` |
| `ADDP` | `D = A + B` ^{[p]} |
| `ADDV` | `D` ^{[v]} |
| `SQADD` , `UQADD` , `SUQADD` , `USQADD` | `D = sat(X + Y)` |
| `SQSUB` , `UQSUB` | `D = sat(X - Y)` |
| `SABD` , `UABD` ^{[d]} | `D = abs(X - Y)` |
| `SABA` , `UABA` ^{[d]} | `D += abs(X - Y)` |
| `SHADD` , `SHSUB` , `UHADD` , `UHSUB` ^{[d]} | `D = (X ± Y) >> 1` |
| `SRHADD` , `URHADD` ^{[d]} | `D = (X + Y) >>R 1` |
| `MUL` ^{[b]} ^{[d]} | `D = X * Y` |
| `MLA` , `MLS` ^{[b]} ^{[d]} | `D ±= X * Y` |
| `SQDMULH` ^{[a]} ^{[b]} ^{[d]} | `D = sat((2 * X * Y) >> bitwidth(D))` |
| `SQRDMULH` ^{[a]} ^{[b]} ^{[d]} | `D = sat((2 * X * Y) >>R bitwidth(D))` |
| `SQRDMLAH` , `SQRDMLSH` ^{[a]} ^{[b]} ^{[d]} | `D = sat(D ± ((2 * X * Y) >>R bitwidth(D)))` |
| `SMIN` , `UMIN` ^{[d]} | `D = min(X, Y)` |
| `SMINP` , `UMINP` ^{[d]} | `D = min(A, B)` ^{[p]} |
| `SMINV` , `UMINV` ^{[d]} | `D` ^{[v]} |
| `SMAX` , `UMAX` ^{[d]} | `D = max(X, Y)` |
| `SMAXP` , `UMAXP` ^{[d]} | `D = max(A, B)` ^{[p]} |
| `SMAXV` , `UMAXV` ^{[d]} | `D` ^{[v]} |
| `CMEQ` ^{[z]} | `D = (X == Y) ? ones_mask : 0` |
| `CMGE` , `CMHS` ^{[z]} | `D = (X >= Y) ? ones_mask : 0` |
| `CMGT` , `CMHI` ^{[z]} | `D = (X > Y) ? ones_mask : 0` |
| `CMTST` | `D = (X & Y) ? ones_mask : 0` |

^{[a]} Not available for 8-bit lanes. Not available as unsigned. ^{[b]} When using 16/32-bit lanes, can broadcast a single lane of Y to all lanes of Y. ^{[d]} Not available for 64-bit lanes. ^{[p]} `A`_{i} is `concat(X, Y)`_{2*i+0}, `B`_{i} is `concat(X, Y)`_{2*i+1}, i.e. adjacent pairs. ^{[v]} Low lane of D gets sum/min/max of all lanes of X, rest of D cleared. ^{[z]} Operands can be registers or constant zero (at least logically).

Assorted instructions operating lanewise, with 8/16/32-bit lanes for X and Y, and D lanes twice as wide (X/Y sign-extended or zero-extended before use):

| | Per-lane behaviour (twice-width D) |
|---|---|
| `SXTL` , `SXTL2` , `UXTL` , `UXTL2` | `D = X` (i.e. just sign/zero extend) |
| `SABDL` , `SABDL2` , `UABDL` , `UABDL2` | `D = abs(X - Y)` |
| `SABAL` , `SABAL2` , `UABAL` , `UABAL2` | `D += abs(X - Y)` |
| `SADDL` , `SADDL2` , `UADDL` , `UADDL2` | `D = X + Y` |
| `SSUBL` , `SSUBL2` , `USUBL` , `USUBL2` | `D = X - Y` |
| `SADDLP` , `UADDLP` | `D = A + B` ^{[p]} |
| `SADALP` , `UADALP` | `D += A + B` ^{[p]} |
| `SADDLV` , `UADDLV` | `D` ^{[v]} |
| `SMULL` , `SMULL2` , `UMULL` , `UMULL2` ^{[b]} | `D = X * Y` |
| `SMLAL` , `SMLAL2` , `UMLAL` , `UMLAL2` ^{[b]} | `D += X * Y` |
| `SMLSL` , `SMLSL2` , `UMLSL` , `UMLSL2` ^{[b]} | `D -= X * Y` |
| `SQDMULL` , `SQDMULL2` ^{[a]} ^{[b]} | `D = sat(2 * X * Y)` |
| `SQDMLAL` , `SQDMLAL2` ^{[a]} ^{[b]} | `D = sat(D + sat(2 * X * Y))` |
| `SQDMLSL` , `SQDMLSL2` ^{[a]} ^{[b]} | `D = sat(D - sat(2 * X * Y))` |

^{[a]} Not available for 8-bit lanes of X/Y. Not available as unsigned. ^{[b]} When using 16/32-bit lanes of Y, can broadcast a single lane of Y to all lanes. ^{[p]} `A`_{i} is `X`_{2*i+0}, `B`_{i} is `X`_{2*i+1}, i.e. adjacent pairs. ^{[v]} Low lane of D gets sum of all lanes of X, rest of D cleared.

A few instructions operating lanewise, with 16/32/64-bit lanes for D and X, and Y lanes half as wide (Y sign-extended or zero-extended before use):

| | Per-lane behaviour (half-width Y) |
|---|---|
| `SADDW` , `SADDW2` , `UADDW` , `UADDW2` | `D = X + Y` |
| `SSUBW` , `SSUBW2` , `USUBW` , `USUBW2` | `D = X - Y` |

A few instructions operating lanewise, with 16/32/64-bit lanes for X and Y, and D lanes half as wide:

| | Per-lane behaviour (half-width D) |
|---|---|
| `ADDHN` , `ADDHN2` , `SUBHN` , `SUBHN2` | `D = (X ± Y) >> bitwidth(D)` |
| `RADDHN` , `RADDHN2` , `RSUBHN` , `RSUBHN2` | `D = (X ± Y) >>R bitwidth(D)` |
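
For instance, `ADDHN` on 16-bit lanes narrowing to 8-bit lanes can be sketched in Python as:

```python
def addhn(xs, ys, d_bits=8):
    # ADDHN: add the full-width lanes, then keep only the high half of each
    # result (truncated to the narrow lane width)
    return [((x + y) >> d_bits) & ((1 << d_bits) - 1) for x, y in zip(xs, ys)]
```

Note that any carry out of the full-width addition is discarded along with the top bits, as the second lane in the test below shows.
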

Dense linear algebra instructions:

| | Behaviour | D type | X type | Y type |
|---|---|---|---|---|
| `SDOT` ^{[b]} | `D += X · Y` | s32[4] | s8[4][4] | s8[4][4] |
| `UDOT` ^{[b]} | `D += X · Y` | u32[4] | u8[4][4] | u8[4][4] |
| `USDOT` ^{[b]} | `D += X · Y` | s32[4] | u8[4][4] | s8[4][4] |
| `SUDOT` ^{[b]} | `D += X · Y` | s32[4] | s8[4][4] | u8[4][4] |
| `SMMLA` | `D += X @ Y` | s32[2][2] | s8[2][8] | s8[2][8] |
| `UMMLA` | `D += X @ Y` | u32[2][2] | u8[2][8] | u8[2][8] |
| `USMMLA` | `D += X @ Y` | s32[2][2] | u8[2][8] | s8[2][8] |

Two oddball instructions operate on 32-bit unsigned lanes containing fixed-point numbers with 32 fractional bits (i.e. range is 0 through 1-ε):

| | Per-lane behaviour |
|---|---|
| `URECPE` | `D = sat(0.5 / X)` (approximate, using just top 9 bits) |
| `URSQRTE` | `D = sat(0.5 / sqrt(X))` (approximate, using just top 9 bits) |

**Float math**

A broad range of floating-point math instructions are available, operating on fp32 or fp64 lanes (or fp16 subject to the `fp16` extension), in either vector form or scalar form:

| | Per-lane behaviour |
|---|---|
| `FABS` | `D = abs(X)` |
| `FNEG` | `D = -X` |
| `FADD` , `FSUB` | `D = X ± Y` |
| `FADDP` | `D = A + B` ^{[p]} |
| `FABD` | `D = abs(X - Y)` |
| `FMUL` ^{[b]} | `D = X * Y` |
| `FMULX` ^{[b]} | `D = X * Y` (except that ±0 times ±infinity is ±2.0) |
| `FNMUL` ^{[s]} | `D = -(X * Y)` |
| `FMADD` , `FMSUB` ^{[s]} | `D = Z ± X * Y` |
| `FNMADD` , `FNMSUB` ^{[s]} | `D = -(Z ± X * Y)` |
| `FMLA` , `FMLS` ^{[b]} | `D ±= X * Y` |
| `FDIV` | `D = X / Y` |
| `FSQRT` | `D = sqrt(X)` |
| `FRECPX` ^{[s]} | `D ≈ 1 / X` (crude approximation ^{[a]}, using no fractional bits) |
| `FRECPE` | `D ≈ 1 / X` (approximate, using just 8 fractional bits) |
| `FRECPS` | `D = 2.0 - X * Y` ^{[c]} |
| `FRSQRTE` | `D ≈ 1 / sqrt(X)` (approximate, using just 8 fractional bits) |
| `FRSQRTS` | `D = 1.5 - 0.5 * X * Y` ^{[d]} |
| `FMIN` , `FMINNM` | `D = min(X, Y)` ^{[m]} |
| `FMINP` , `FMINNMP` | `D = min(A, B)` ^{[m]} ^{[p]} |
| `FMINV` , `FMINNMV` | `D` ^{[m]} ^{[v]} |
| `FMAX` , `FMAXNM` | `D = max(X, Y)` ^{[m]} |
| `FMAXP` , `FMAXNMP` | `D = max(A, B)` ^{[m]} ^{[p]} |
| `FMAXV` , `FMAXNMV` | `D` ^{[m]} ^{[v]} |
| `FCMEQ` ^{[z]} | `D = (X == Y) ? ones_mask : 0` |
| `FCMGE` ^{[z]} | `D = (X >= Y) ? ones_mask : 0` |
| `FCMGT` ^{[z]} | `D = (X > Y) ? ones_mask : 0` |
| `FACGE` | `D = (abs(X) >= abs(Y)) ? ones_mask : 0` |
| `FACGT` | `D = (abs(X) > abs(Y)) ? ones_mask : 0` |

^{[a]} Clears the fraction bits, then adds one to the exponent if it is zero, then takes the bitwise inverse of the exponent bits. Can be used with `FMULX` as part of vector normalisation.

^{[b]} Can broadcast a single lane of Y to all lanes of Y.

^{[c]} Useful as part of a Newton-Raphson step where successive approximations to `a`^{-1} are computed as x_{n+1} = x_{n} * (2.0 - a * x_{n}). See `FRECPE`.

^{[d]} Useful as part of a Newton-Raphson step where successive approximations to `a`^{-0.5} are computed as x_{n+1} = x_{n} * (1.5 - 0.5 * a * x_{n} * x_{n}). See `FRSQRTE`.

^{[m]} Note that min/max are not *quite* equivalent to comparison followed by selection, due to signed zeros and NaNs. The `NM` variants return the non-NaN operand if exactly one operand is NaN.

^{[p]} `A`_{i} is `concat(X, Y)`_{2*i+0}, `B`_{i} is `concat(X, Y)`_{2*i+1}, i.e. adjacent pairs.

^{[s]} Scalar form only, no vector form.

^{[v]} Low lane of D gets min/max of all lanes of X, rest of D cleared.

^{[z]} Operands can be registers or constant zero (at least logically).
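
The Newton-Raphson recipe from footnote [c] can be sketched in Python; the seed here is a crude stand-in for `FRECPE`'s table-based 8-bit estimate (an assumption for illustration, not the hardware's actual table):

```python
import math

def reciprocal(a, steps=4):
    # Seed: roughly invert the exponent (a stand-in for FRECPE's estimate)
    x = 2.0 ** -round(math.log2(a))
    for _ in range(steps):
        # One FRECPS (computing 2.0 - a*x) followed by an FMUL, per footnote [c]
        x = x * (2.0 - a * x)
    return x
```

Convergence is quadratic: each step roughly squares the relative error, which is why a few steps from even a poor seed suffice.
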

Various per-lane rounding instructions with floating-point inputs and outputs (see "Data type conversion, float to integer" for integer outputs):

| | Rounding | Range ^{[r]} | Exceptions |
|---|---|---|---|
| `FRINT32X` | Mode from FPCR | -2^{31} ⋯ 2^{31}-1 | Inexact, InvalidOp |
| `FRINT32Z` | Toward zero (truncate) | -2^{31} ⋯ 2^{31}-1 | Inexact, InvalidOp |
| `FRINT64X` | Mode from FPCR | -2^{63} ⋯ 2^{63}-1 | Inexact, InvalidOp |
| `FRINT64Z` | Toward zero (truncate) | -2^{63} ⋯ 2^{63}-1 | Inexact, InvalidOp |
| `FRINTA` | To nearest, ties away from zero | Unbounded | |
| `FRINTI` | Mode from FPCR | Unbounded | |
| `FRINTM` | Toward minus infinity (floor) | Unbounded | |
| `FRINTN` | To nearest, ties toward even | Unbounded | |
| `FRINTP` | Toward positive infinity (ceil) | Unbounded | |
| `FRINTX` | Mode from FPCR | Unbounded | Inexact |
| `FRINTZ` | Toward zero (truncate) | Unbounded | |

^{[r]} Out-of-range results (in either direction) are replaced by -2^{31} or -2^{63}.

Mixed-width operations and dense linear algebra:

| | Behaviour | D type | X/Y type |
|---|---|---|---|
| `FMLAL` , `FMLSL` ^{[b]} | `D ±= X * Y` (low halves of X/Y) | fp32[4] | fp16[8] |
| `FMLAL2` , `FMLSL2` ^{[b]} | `D ±= X * Y` (high halves of X/Y) | fp32[4] | fp16[8] |
| `BFMLALB` ^{[b]} | `D += X * Y` (even lanes of X/Y) | fp32[4] | bf16[8] |
| `BFMLALT` ^{[b]} | `D += X * Y` (odd lanes of X/Y) | fp32[4] | bf16[8] |
| `BFDOT` ^{[b]} | `D += X · Y` | fp32[4] | bf16[4][2] |
| `BFMMLA` | `D += X @ Y` | fp32[2][2] | bf16[2][4] |

^{[b]} Can broadcast a single lane of Y to all lanes of Y (for `BFDOT`, a lane is 32 bits).

**Float comparisons involving PSTATE**

The `FCMP`, `FCMPE` family perform a three-way comparison of the low fp16/fp32/fp64 lane of two operands, writing the result to `PSTATE`:

- `PSTATE.N` is set to X_{0} < Y_{0} (false if either operand is NaN)
- `PSTATE.Z` is set to X_{0} == Y_{0} (false if either operand is NaN)
- `PSTATE.C` is set to !(X_{0} < Y_{0}) (true if either operand is NaN)
- `PSTATE.V` is set to is_nan(X_{0}) or is_nan(Y_{0})
- If either X_{0} or Y_{0} is an SNaN, an InvalidOp exception is raised
- `FCMPE` only: if either X_{0} or Y_{0} is a QNaN, an InvalidOp exception is raised

Following `FCMP X, Y`, the meaning of the condition codes is:

| Cond | Meaning | Cond | Meaning |
|---|---|---|---|
| EQ | X_{0} == Y_{0} | NE | !(X_{0} == Y_{0}) |
| LS | X_{0} <= Y_{0} | HI | !(X_{0} <= Y_{0}) |
| LO | X_{0} < Y_{0} | HS | !(X_{0} < Y_{0}) |
| MI | X_{0} < Y_{0} | PL | !(X_{0} < Y_{0}) |
| CC | X_{0} < Y_{0} | CS | !(X_{0} < Y_{0}) |
| GT | X_{0} > Y_{0} | LE | !(X_{0} > Y_{0}) |
| GE | X_{0} >= Y_{0} | LT | !(X_{0} >= Y_{0}) |
| VS | is_nan(X_{0}) or is_nan(Y_{0}) | VC | !(is_nan(X_{0}) or is_nan(Y_{0})) |

The `FCCMP`, `FCCMPE` family perform a *conditional* three-way comparison of the low fp16/fp32/fp64 lane of two operands: some condition is evaluated against the contents of `PSTATE`; if true, the instruction behaves like `FCMP`/`FCMPE`; if false, a four-bit immediate is written to the relevant `PSTATE` bits.

The `FCSEL` instruction uses `PSTATE` to conditionally select between two scalar fp16/fp32/fp64 operands: D_{0} = cond ? X_{0} : Y_{0} (other lanes of D cleared).

**Data type conversion, float to float**

Vector form, FPCR rounding:

| | to bf16 | to fp16 | to fp32 | to fp64 |
|---|---|---|---|---|
| From bf16 | no change | via fp32 | `SHLL` , `SHLL2` | via fp32 |
| From fp16 | via fp32 | no change | `FCVTL` , `FCVTL2` | via fp32 |
| From fp32 | `BFCVTN` , `BFCVTN2` | `FCVTN` , `FCVTN2` | no change | `FCVTL` , `FCVTL2` |
| From fp64 | via fp32 ^{[x]} | via fp32 ^{[x]} | `FCVTN` , `FCVTN2` | no change |

^{[x]} Using `FCVTXN` or `FCVTXN2`, which employ the round-to-odd rounding mode.

For scalar conversions, `FCVT` can convert from any of fp16/fp32/fp64 to any other of fp16/fp32/fp64.

**Data type conversion, integer to float**

Vector form, FPCR rounding, free division by a power of two afterwards:

| | to fp16 | to fp32 | to fp64 |
|---|---|---|---|
| From s16 or u16 | `SCVTF` or `UCVTF` | via s32 or u32 | via s64 or u64 |
| From s32 or u32 | via fp32 | `SCVTF` or `UCVTF` | via s64 or u64 |
| From s64 or u64 | via fp64 | via fp64 | `SCVTF` or `UCVTF` |

`SCVTF` and `UCVTF` can also take a GPR as input (32-bit or 64-bit, signed or unsigned), and convert it to any of fp16/fp32/fp64, again with a free division by a power of two afterwards.

**Data type conversion, float to integer**

A family of conversion instructions exist, available in two forms:

- Vector destination, fp16 to s16/u16, fp32 to s32/u32, fp64 to s64/u64
- GPR destination, any of fp16/fp32/fp64 to any of s32/u32/s64/u64

| | Rounding | Overflow |
|---|---|---|
| `FCVTAS` , `FCVTAU` | To nearest, ties away from zero | Saturate |
| `FCVTMS` , `FCVTMU` | Toward minus infinity (floor) | Saturate |
| `FCVTNS` , `FCVTNU` | To nearest, ties toward even | Saturate |
| `FCVTPS` , `FCVTPU` | Toward positive infinity (ceil) | Saturate |
| `FCVTZS` , `FCVTZU` ^{[f]} | Toward zero (truncate) | Saturate |
| `FJCVTZS` ^{[j]} | Toward zero (truncate) | Modulo 2^{32} |

^{[f]}Free multiplication by a power of two possible before the conversion.

^{[j]} Only exists in fp64 to s32 GPR form. Also sets `PSTATE`.

**Complex float math**

A pair of floating point lanes can represent a complex floating point number, where the low scalar lane contains the real part of the complex number and the high scalar lane contains the imaginary part of the complex number. A 128-bit register can then contain 4 fp16 complex lanes, or 2 fp32 complex lanes, or a single fp64 complex lane. A few instructions exist for manipulating these:

| | Real part of result | Imaginary part of result |
|---|---|---|
| `FCADD #90` | `Re(D) = Re(X) - Im(Y)` | `Im(D) = Im(X) + Re(Y)` |
| `FCADD #270` | `Re(D) = Re(X) + Im(Y)` | `Im(D) = Im(X) - Re(Y)` |
| `FCMLA #0` ^{[b]} | `Re(D) += Re(X) * Re(Y)` | `Im(D) += Re(X) * Im(Y)` |
| `FCMLA #90` ^{[b]} | `Re(D) -= Im(X) * Im(Y)` | `Im(D) += Im(X) * Re(Y)` |
| `FCMLA #180` ^{[b]} | `Re(D) -= Re(X) * Re(Y)` | `Im(D) -= Re(X) * Im(Y)` |
| `FCMLA #270` ^{[b]} | `Re(D) += Im(X) * Im(Y)` | `Im(D) -= Im(X) * Re(Y)` |

^{[b]}Can broadcast a complex lane (i.e. 2 scalars) of Y to all complex lanes of Y.
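
A full complex multiply-accumulate is the `#0` rotation followed by the `#90` rotation; a Python sketch of the table above, over (real, imaginary) pairs:

```python
def fcmla(d, x, y, rot):
    # Per complex lane, accumulate one of the four rotations onto D
    out = []
    for (dr, di), (xr, xi), (yr, yi) in zip(d, x, y):
        if rot == 0:
            dr, di = dr + xr * yr, di + xr * yi
        elif rot == 90:
            dr, di = dr - xi * yi, di + xi * yr
        elif rot == 180:
            dr, di = dr - xr * yr, di - xr * yi
        elif rot == 270:
            dr, di = dr + xi * yi, di - xi * yr
        out.append((dr, di))
    return out

# FCMLA #0 then FCMLA #90 accumulates (1+2i) * (3+4i) = -5+10i:
acc = fcmla([(0.0, 0.0)], [(1.0, 2.0)], [(3.0, 4.0)], 0)
acc = fcmla(acc, [(1.0, 2.0)], [(3.0, 4.0)], 90)
```
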

**Polynomial math**

The `PMUL`, `PMULL`, and `PMULL2` instructions all perform `D = X * Y`, where all lanes of D/X/Y contain ℤ_{2} polynomials. This is alternatively known as carryless multiplication (`pclmulqdq` on x86).

| | D lanes | X/Y lanes |
|---|---|---|
| `PMUL` | 8-bit poly (high 7 bits of result discarded) | 8-bit poly |
| `PMULL` , `PMULL2` | 16-bit poly (top bit always clear) | 8-bit poly |
| `PMULL` , `PMULL2` ^{[c]} | 128-bit poly (top bit always clear) | 64-bit poly |

^{[c]} Requires the `crypto` extension.

For ℤ_{2} polynomial addition/subtraction, see `EOR` or `EOR3`. Polynomial division and remainder against a constant `Y` can be performed via multiplication.
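
A Python sketch of carryless multiplication (each set bit of Y selects a shifted copy of X to xor in, with no carries between bit positions; bit i is the coefficient of x^{i}):

```python
def clmul(x, y):
    # Multiply two Z2 polynomials packed into integers
    r = 0
    while y:
        if y & 1:
            r ^= x   # xor is Z2 polynomial addition
        x <<= 1
        y >>= 1
    return r
```

For example, (x + 1)^{2} = x^{2} + 1 over ℤ_{2}, i.e. `clmul(3, 3)` gives 5 rather than the ordinary product 9.
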

**Cryptography**

Some instructions are provided to accelerate AES encryption. A single round of AES encryption consists of `AddRoundKey` (just xor), then `SubBytes` and `ShiftRows` (in either order), then optionally `MixColumns` (performed for every round except the last). The provided instructions are:

| | Steps |
|---|---|
| `AESE` | `AddRoundKey` then `ShiftRows` and `SubBytes` |
| `AESMC` | `MixColumns` |
| `AESD` | Inverse `AddRoundKey` then inverse `ShiftRows` and inverse `SubBytes` |
| `AESIMC` | Inverse `MixColumns` |

Note that x86 AES instructions are slightly different; for example, `aesenc` there does `ShiftRows` and `SubBytes`, then `MixColumns`, then `AddRoundKey`.

Some instructions are provided to accelerate SHA-1 hashes: `SHA1C`, `SHA1H`, `SHA1M`, `SHA1P`, `SHA1SU0`, `SHA1SU1`.

Some instructions are provided to accelerate SHA-2 hashes: `SHA256H`, `SHA256H2`, `SHA256SU0`, `SHA256SU1` for SHA-256, and `SHA512H`, `SHA512H2`, `SHA512SU0`, `SHA512SU1` for SHA-512.

Some instructions are provided to accelerate SHA-3 hashes: `EOR3`, `RAX1`, `XAR`, `BCAX`. See "Shifts" or "Bitwise" for descriptions.

Some instructions are provided to accelerate SM3 hashes: `SM3SS1`, `SM3TT1A`, `SM3TT1B`, `SM3TT2A`, `SM3TT2B`, `SM3PARTW1`, `SM3PARTW2`.

Some instructions are provided to accelerate SM4 encryption: `SM4E`, `SM4EKEY`.