Contrasting Intel AMX and Apple AMX

Intel has an x64 instruction set extension called AMX. Apple has a totally different aarch64 instruction set extension, also called AMX.

Register file

Intel AMX: 8 tmm registers, each a 16 by 16 matrix of 32-bit elements (technically, each can be configured to be any size, square or rectangular, between 1 by 1 and 16 by 16, though element size is fixed at 32 bits regardless). Total architectural state 8 kilobytes.

Apple AMX: Total architectural state 5 kilobytes, broken down as:

8 x registers, each a 64-byte (row) vector
8 y registers, each a 64-byte (row or column) vector
Then z, which can be variously viewed as any of:
- 1 register, a 64 by 64 matrix of 8-bit elements
- 1 register, a 32 by 32 matrix of 32-bit elements
- 2 registers, each a 32 by 32 matrix of 16-bit elements
- 4 registers, each a 16 by 16 matrix of 32-bit elements
- 8 registers, each an 8 by 8 matrix of 64-bit elements
- 64 registers, each a 64-byte row vector

The vectors have 8/16/32/64-bit elements, like regular SIMD registers. Note that Intel AMX does not need vector registers, as Intel already has AVX512 with 64-byte vectors (32 of which are in the AVX512 architectural register file).
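
The totals above can be checked with a little arithmetic. The following is a small sketch in Python; the variable names are made up for illustration, and the sizes are just the ones listed in this section.

```python
# Sizes in bytes, taken from the register descriptions above.
INTEL_TMM_BYTES = 16 * 16 * 4          # 16 by 16 matrix of 32-bit elements
intel_total = 8 * INTEL_TMM_BYTES      # 8 tmm registers

apple_x = 8 * 64                       # 8 x registers, 64 bytes each
apple_y = 8 * 64                       # 8 y registers, 64 bytes each

# Each view of z: (register count, rows, columns, element size in bytes).
z_views = [
    (1, 64, 64, 1),
    (1, 32, 32, 4),
    (2, 32, 32, 2),
    (4, 16, 16, 4),
    (8, 8, 8, 8),
    (64, 1, 64, 1),                    # 64 row vectors of 64 bytes
]
z_sizes = [n * r * c * e for (n, r, c, e) in z_views]
apple_z = z_sizes[0]
apple_total = apple_x + apple_y + apple_z

print(intel_total, apple_total, z_sizes)
```

Every view of z covers the same 4096 bytes, which is why they are alternative views of one piece of state rather than separate registers.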

Data types

Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.

Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction), though bf16 support is added on M2 and later.
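
To make the f16 / bf16 distinction concrete, here is a small Python sketch that prints the bit pattern of the same value in both formats. The helper names are made up, and the bf16 conversion simply truncates an f32 to its top 16 bits, which is an assumption made for brevity (real hardware may round instead).

```python
import struct

def f16_bits(value):
    """Bit pattern of value as IEEE 754 half precision (1 sign, 5 exponent, 10 fraction bits)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def bf16_bits(value):
    """Bit pattern of value as bf16 (1 sign, 8 exponent, 7 fraction bits).

    Assumption for this sketch: convert by keeping the top 16 bits of the f32
    encoding (truncation, no rounding).
    """
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16

print(hex(f16_bits(1.0)), hex(bf16_bits(1.0)))  # 0x3c00 0x3f80
```

The two encodings of 1.0 differ because the exponent field has a different width and bias in each format.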

Computational operations

Intel AMX: Matrix multiplication of any two tmm registers, accumulating onto a third tmm register. For the multiplication, matrix elements are themselves (very small) vectors, combined via dot product. This is the only operation. Viewed differently, this is doing 16×64 by 64×16 (int8) or 16×32 by 32×16 (bf16) matmul, then adding onto a 16×16 matrix.
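
The "viewed differently" claim can be checked with a scalar model. This Python sketch (function names are made up; it models only the arithmetic, with unbounded Python integers rather than 32-bit accumulators, and no saturation or wraparound) treats each 32-bit element as a group of four int8 values, then compares the result against an ordinary 16×64 by 64×16 matmul.

```python
import random

ROWS, COLS, GROUP = 16, 16, 4   # tile shape, and int8 values per 32-bit element

def tile_dot_accumulate(c, a, b):
    """c[m][n] += sum over k of dot(a[m][k], b[k][n]).

    a and b are 16 by 16 tiles whose elements are 4-element int8 vectors;
    c is a 16 by 16 tile of integer accumulators.
    """
    for m in range(ROWS):
        for n in range(COLS):
            for k in range(COLS):
                c[m][n] += sum(x * y for x, y in zip(a[m][k], b[k][n]))

def plain_matmul(lhs, rhs):
    """Ordinary matrix product of two lists of lists."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*rhs)]
            for row in lhs]

rng = random.Random(0)
lhs = [[rng.randint(-128, 127) for _ in range(COLS * GROUP)] for _ in range(ROWS)]  # 16 x 64
rhs = [[rng.randint(-128, 127) for _ in range(COLS)] for _ in range(ROWS * GROUP)]  # 64 x 16

# Pack into tiles: each tile element holds four consecutive values along the
# shared (length 64) dimension.
a = [[lhs[m][GROUP * k:GROUP * k + GROUP] for k in range(COLS)] for m in range(ROWS)]
b = [[[rhs[GROUP * k + j][n] for j in range(GROUP)] for n in range(COLS)]
     for k in range(ROWS)]

c = [[0] * COLS for _ in range(ROWS)]
tile_dot_accumulate(c, a, b)
print(c == plain_matmul(lhs, rhs))
```

Note how the right-hand matrix has to be packed so that four consecutive rows end up inside one 32-bit element; the left-hand matrix is just its natural row-major layout.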

Apple AMX: Outer product of any x register with any y register (viewed as a column vector), accumulating onto any (matrix view) z register. For the multiplication, x / y elements are scalars (depending on the data type, this might be viewed as doing 16×1 by 1×16 matmul then adding onto a 16×16 matrix). Alternatively, pointwise product of any x register with any y register (viewed as a row vector), accumulating onto any (vector view) z register. Many more operations as well, though the general theme is {outer or pointwise} {multiplication or addition or subtraction}, possibly followed by right-shift, possibly followed by integer saturation. The most exotic exceptions to the theme are min / max / popcnt.
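
As a scalar model of the two main modes, the following Python sketch may help. The function names are made up, and only the multiply-accumulate arithmetic is modelled (no shifts, saturation, or data type conversion).

```python
def outer_accumulate(z, x, y):
    """z[j][i] += x[i] * y[j]: y acts as a column vector, x as a row vector."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            z[j][i] += xi * yj

def pointwise_accumulate(z_row, x, y):
    """z_row[i] += x[i] * y[i]: both x and y act as row vectors."""
    for i, (xi, yi) in enumerate(zip(x, y)):
        z_row[i] += xi * yi

# 16 elements per vector, as for 32-bit elements in a 64-byte register.
x = [float(i) for i in range(16)]
y = [float(10 * j) for j in range(16)]

z_matrix = [[0.0] * 16 for _ in range(16)]   # one 16 by 16 matrix view of z
outer_accumulate(z_matrix, x, y)

z_vector = [0.0] * 16                        # one row-vector view of z
pointwise_accumulate(z_vector, x, y)

print(z_matrix[2][3], z_vector[3])  # 60.0 90.0
```

A full matmul is then built by repeating the outer product once per step along the shared dimension, loading a fresh x and y each time.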

Memory operations

Intel AMX: Matrix load or store (up to 1 kilobyte), configurable with a base address (register + immediate offset) and a row stride (register or zero, optionally shifted left by 1-3 bits).
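
The base-plus-stride addressing can be modelled in a few lines of Python. This is a sketch of the addressing only; `tile_load` is a made-up helper name, not an intrinsic, and memory is just a bytes object.

```python
def tile_load(memory, base, stride, rows=16, row_bytes=64):
    """Gather `rows` rows of `row_bytes` bytes; row r starts at base + r * stride."""
    return [bytes(memory[base + r * stride: base + r * stride + row_bytes])
            for r in range(rows)]

# A row-major matrix with 16 rows of 256 bytes, where byte (r, c) holds (r + c) % 256.
PITCH = 256
memory = bytes((r + c) % 256 for r in range(16) for c in range(PITCH))

# Load the 16 row by 64 byte block that starts at column 64.
tile = tile_load(memory, base=64, stride=PITCH)
print(tile[3][0])  # 67

# A stride of zero reads the same 64 bytes into every row.
broadcast = tile_load(memory, base=0, stride=0)
print(all(row == broadcast[0] for row in broadcast))  # True
```

With a stride equal to the pitch of the enclosing matrix, one load pulls a sub-block out of a larger matrix without any repacking.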

Apple AMX: Vector load or store (64 bytes), configurable with a base address (register). Also load or store pair (128 bytes), though the two registers must be consecutive, and the row stride is fixed at 64 bytes, and the base address must be 128-byte aligned. Loads or stores with y effectively give a free vector transpose, as y registers can be viewed as column vectors.

Masking modes

Intel AMX: Each tmm register can be configured to any size - square or rectangular - between 1 by 1 and 16 by 16. This is (mostly) equivalent to saying that a tmm register is always 16 by 16, but has an associated mask on each dimension to only enable some number of leading rows and columns. These per-register masks are architectural state.

Apple AMX: Per-dimension masking is available on a per-instruction basis (though notably not for loads / stores). Available masking modes are: all rows (columns), even/odd rows (columns) only, first N rows (columns) only, last N rows (columns) only, row (column) N only.
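
The Apple masking modes can be sketched as a function that produces one enable bit per row (or column). In the Python below, the mode names and the helper functions are made up for illustration; only the behaviours come from the list of modes above.

```python
def lane_mask(mode, size, n=0):
    """Return a list of booleans, one per row (or column); True means enabled."""
    if mode == "all":
        return [True] * size
    if mode == "even":
        return [i % 2 == 0 for i in range(size)]
    if mode == "odd":
        return [i % 2 == 1 for i in range(size)]
    if mode == "first":            # first n only
        return [i < n for i in range(size)]
    if mode == "last":             # last n only
        return [i >= size - n for i in range(size)]
    if mode == "only":             # index n only
        return [i == n for i in range(size)]
    raise ValueError(f"unknown mode: {mode}")

def masked_outer_accumulate(z, x, y, x_mask, y_mask):
    """z[j][i] += x[i] * y[j], but only where both masks enable the element."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            if x_mask[i] and y_mask[j]:
                z[j][i] += xi * yj

z = [[0] * 4 for _ in range(4)]
masked_outer_accumulate(z, [1, 2, 3, 4], [1, 1, 1, 1],
                        lane_mask("first", 4, 2), lane_mask("only", 4, 3))
print(z)  # only row 3, columns 0 and 1, are updated
```

In this model, Intel's per-register configuration corresponds to always using the "first N" mode on both dimensions, with N stored alongside the register rather than supplied by each instruction.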

Note that both of these approaches are different to Intel's AVX512 approach, which is a separate register file containing 8 mask registers (k0 through k7) and every operation optionally taking a mask register as an input.

Other

Apple AMX contains a very interesting instruction called genlut. In the forward direction ("lookup"), it is somewhere between AVX512's vpshufb and vgatherps. In the backward direction ("generate") it is something like an inverse vpshufb, performing arbitrary 2/3/4/5-bit quantisation. When used in both directions, it can be useful for piecewise linear interpolation, or as an alternative to AVX512's vfpclassps / vfixupimmps.
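
As a conceptual illustration of using both directions together for piecewise linear interpolation, consider the Python sketch below. It is not a model of the real instruction: the function names, the comparison rule used in `generate` (index of the last threshold less than or equal to the value), and the table layout are all assumptions made for this sketch.

```python
import bisect

def generate(values, thresholds):
    """Quantise: for each value, the index of the last threshold <= value (0 if none)."""
    return [max(bisect.bisect_right(thresholds, v) - 1, 0) for v in values]

def lookup(indices, table):
    """Table lookup: for each index, the corresponding table entry."""
    return [table[i] for i in indices]

# Piecewise linear approximation of f(x) = x * x on [0, 4), with four segments
# (so a 2-bit index). Segment i covers [i, i + 1) and is slope * x + intercept.
thresholds = [0.0, 1.0, 2.0, 3.0]
slopes = [1.0, 3.0, 5.0, 7.0]
intercepts = [0.0, -2.0, -6.0, -12.0]

xs = [0.5, 2.5, 3.0]
idx = generate(xs, thresholds)              # backward direction
m = lookup(idx, slopes)                     # forward direction, twice
c = lookup(idx, intercepts)
approx = [mi * x + ci for mi, x, ci in zip(m, xs, c)]
print(idx, approx)  # [0, 2, 3] [0.5, 6.5, 9.0]
```

The quantisation step picks the segment, the lookups fetch that segment's coefficients, and an ordinary pointwise multiply-add finishes the job.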