# Contrasting Intel AMX and Apple AMX

Intel has an x64 instruction set extension called AMX; meanwhile, Apple has a *totally different* aarch64 instruction set extension also called AMX.

**Register file**

Intel AMX: 8 `tmm` registers, each a 16 by 16 matrix of 32-bit elements (technically, each can be configured to be any size - square or rectangular - between 1 by 1 and 16 by 16, though element size is fixed at 32 bits regardless). Total architectural state 8 kilobytes.

Apple AMX: Total architectural state 5 kilobytes, broken down as:

- 8 `x` registers, each a 64-byte (row) vector
- 8 `y` registers, each a 64-byte (row or column) vector
- Then `z`, which can be variously viewed as *any* of:
  - 1 register, a 64 by 64 matrix of 8-bit elements
  - 1 register, a 32 by 32 matrix of 32-bit elements
  - 2 registers, each a 32 by 32 matrix of 16-bit elements
  - 4 registers, each a 16 by 16 matrix of 32-bit elements
  - 8 registers, each an 8 by 8 matrix of 64-bit elements
  - 64 registers, each a 64-byte row vector

The vectors have 8/16/32/64-bit elements, like regular SIMD registers. Note that Intel AMX does not need vector registers, as Intel already has AVX512 with 64-byte vectors (32 of which are in the AVX512 architectural register file).
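The sizes quoted above can be checked with a little arithmetic. The following is a minimal sketch in Python; the helper name `matrix_bytes` and the `z_views` table are made up for illustration and do not come from any vendor documentation.

```python
# Back-of-envelope check of the architectural state sizes quoted above.
# All names here are illustrative, not taken from any vendor header.

def matrix_bytes(rows, cols, element_bits):
    """Size in bytes of a rows x cols matrix of element_bits-wide elements."""
    return rows * cols * element_bits // 8

# Intel AMX: 8 tmm registers, each at most 16 x 16 of 32-bit elements.
intel_total = 8 * matrix_bytes(16, 16, 32)

# Apple AMX: 8 x registers and 8 y registers of 64 bytes each, plus z.
# Each entry is (register count, rows, cols, element bits) for one view of z.
z_views = [
    (1, 64, 64, 8),
    (1, 32, 32, 32),
    (2, 32, 32, 16),
    (4, 16, 16, 32),
    (8, 8, 8, 64),
    (64, 1, 64, 8),  # 64 row vectors of 64 bytes each
]
z_sizes = [count * matrix_bytes(r, c, bits) for count, r, c, bits in z_views]
apple_total = 8 * 64 + 8 * 64 + z_sizes[0]

print(intel_total, apple_total, z_sizes)
```

Every view of `z` comes out at the same 4096 bytes, which is what makes them views of a single piece of state rather than separate register files.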

**Data types**

Intel AMX: Multiplicands are always 32-bit, either `i8[4]` or `u8[4]` or `bf16[2]`, combined via dot product. Accumulators are always 32-bit, either `i32` or `u32` or `f32`.

Apple AMX: Multiplicands are scalars, any of `i8` / `u8` / `i16` / `u16` / `f16` / `f32` / `f64`. Accumulators are any of `i16` / `u16` / `i32` / `u32` / `f16` / `f32` / `f64`. Note `f16` (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than `bf16` (8-bit exponent, 7-bit fraction), though `bf16` support is added on M2 and later.
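To make the `f16` versus `bf16` distinction concrete, here is a small Python sketch that splits each encoding into its sign, exponent, and fraction fields. It uses only the standard `struct` module. As a simplifying assumption, `bf16` is modelled as the top 16 bits of the `f32` encoding (plain truncation, no rounding); the function names are illustrative.

```python
import struct

def f16_fields(value):
    """Split an IEEE 754 half-precision encoding into (sign, exponent, fraction)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def bf16_fields(value):
    """Split a bf16 encoding into (sign, exponent, fraction).

    Assumption: bf16 is taken to be the top 16 bits of the f32 encoding
    (truncation, no rounding).
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    bits = bits32 >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

print(f16_fields(1.5))   # 5-bit biased exponent, 10-bit fraction
print(bf16_fields(1.5))  # 8-bit biased exponent, 7-bit fraction
```

For `1.5`, the `f16` exponent field is 15 (bias 15) and the `bf16` exponent field is 127 (bias 127, the same as `f32`), so `bf16` trades fraction bits for the full `f32` exponent range.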

**Computational operations**

Intel AMX: *Matrix multiplication* of any two `tmm` registers, accumulating onto a third `tmm` register. For the multiplication, matrix elements are themselves (very small) vectors, combined via dot product. This is the *only* operation. Viewed differently, this is doing 16×64 by 64×16 (int8) or 16×32 by 32×16 (bf16) matmul, then adding onto a 16×16 matrix.
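The "elements are small vectors, combined via dot product" idea can be sketched as a reference model in plain Python. This is an illustration of the description above rather than a bit-exact model of any instruction: the function name is hypothetical, each 32-bit `i8[4]` element is represented as a 4-tuple of integers, and accumulator overflow is not modelled.

```python
def tile_dot_accumulate(c, a, b):
    """Reference model of c += a @ b where elements are small vectors.

    a is M x K and b is K x N; each element is a tuple of 4 small integers
    (standing in for one 32-bit i8[4] element). c is M x N of plain integers
    (standing in for i32 accumulators; overflow is not modelled).
    """
    for m in range(len(c)):
        for n in range(len(c[0])):
            for k in range(len(b)):
                # Element "multiplication" is a 4-wide dot product.
                c[m][n] += sum(p * q for p, q in zip(a[m][k], b[k][n]))
    return c

# A 2 x 2 example; real tiles go up to 16 x 16.
a = [[(1, 2, 3, 4), (0, 0, 0, 1)],
     [(1, 1, 1, 1), (2, 0, 0, 0)]]
b = [[(1, 0, 0, 0), (1, 1, 1, 1)],
     [(5, 5, 5, 5), (0, 0, 0, -1)]]
c = [[0, 0], [0, 0]]
tile_dot_accumulate(c, a, b)
print(c)
```

Flattening the 4-tuples along the `k` dimension is what gives the "16×64 by 64×16" reading for int8: 16 elements of 4 bytes each per row.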

Apple AMX: *Outer product* of any `x` register with any `y` register (viewed as a column vector), accumulating onto any (matrix view) `z` register. For the multiplication, `x` / `y` elements are scalars (depending on the data type, this might be viewed as doing 16×1 by 1×16 matmul then adding onto a 16×16 matrix). Alternatively, *pointwise product* of any `x` register with any `y` register (viewed as a row vector), accumulating onto any (vector view) `z` register. Many more operations as well, though the general theme is {outer or pointwise} {multiplication or addition or subtraction}, possibly followed by right-shift, possibly followed by integer saturation. The most exotic exceptions to the theme are `min` / `max` / `popcnt`.
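The two basic Apple AMX shapes described above, outer product and pointwise product, can be sketched in plain Python as follows. The function names are hypothetical, and the choice that `y` indexes rows of `z` while `x` indexes columns is an assumption made for the sketch; shifts, saturation, and data type widths are not modelled.

```python
def outer_product_accumulate(z, x, y):
    """z[j][i] += x[i] * y[j], with y viewed as a column vector."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            z[j][i] += xi * yj
    return z

def pointwise_accumulate(z_row, x, y):
    """z_row[i] += x[i] * y[i], with y viewed as a row vector."""
    for i, (xi, yi) in enumerate(zip(x, y)):
        z_row[i] += xi * yi
    return z_row

z = outer_product_accumulate([[0, 0, 0], [0, 0, 0]], x=[1, 2, 3], y=[10, 20])
row = pointwise_accumulate([1, 1, 1], x=[1, 2, 3], y=[4, 5, 6])
print(z, row)
```

A full matrix multiplication is then a sequence of outer products, one per step along the shared dimension, all accumulating onto the same `z` matrix.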

**Memory operations**

Intel AMX: Matrix load or store (up to 1 kilobyte), configurable with a base address (register + immediate offset) and a row stride (register or zero, optionally shifted left by 1-3 bits).
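The addressing pattern of a strided matrix load can be sketched as below: row `r` of the tile is read from `base + r * stride`. This is a simplified model over a flat `bytes` buffer; the function name and parameters are illustrative, and faults, alignment, and the tile configuration are not modelled.

```python
def load_tile(memory, base, stride, rows, row_bytes):
    """Model of a strided matrix load: row r comes from base + r * stride."""
    return [bytes(memory[base + r * stride : base + r * stride + row_bytes])
            for r in range(rows)]

memory = bytes(range(256))
tile = load_tile(memory, base=8, stride=32, rows=3, row_bytes=4)
print(tile)

# A stride of zero reads the same row repeatedly (a row broadcast).
broadcast = load_tile(memory, base=8, stride=0, rows=2, row_bytes=4)
print(broadcast)
```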

Apple AMX: Vector load or store (64 bytes), configurable with a base address (register). Also load or store pair (128 bytes), though the two registers must be consecutive, and the row stride is fixed at 64 bytes, and the base address must be 128-byte aligned. Loads or stores with `y` effectively give a free vector transpose, as `y` registers can be viewed as column vectors.

**Masking modes**

Intel AMX: Each `tmm` register can be configured to any size - square or rectangular - between 1 by 1 and 16 by 16. This is (mostly) equivalent to saying that a `tmm` register is always 16 by 16, but has an associated mask on each dimension to only enable some number of leading rows and columns. These per-register masks are architectural state.

Apple AMX: Per-dimension masking is available on a per-instruction basis (though notably not for loads / stores). Available masking modes are: all rows (columns), even/odd rows (columns) only, first N rows (columns) only, last N rows (columns) only, row (column) N only.

Note that both of these approaches are different to Intel's AVX512 approach, which is a separate register file containing 8 mask registers (`k0` through `k7`) and every operation optionally taking a mask register as an input.
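The Apple AMX masking modes listed above can be expressed as a function from a mode to the set of enabled row (or column) indices. This is a sketch only: the mode names are invented for illustration and are not the instruction encodings. In these terms, the Intel AMX per-register size corresponds to a `first_n` mask on each dimension.

```python
def enabled_indices(mode, size, n=0):
    """Return the row (or column) indices enabled by a masking mode.

    The mode names are made up for this sketch; they are not the
    instruction encodings.
    """
    if mode == "all":
        return list(range(size))
    if mode == "even":
        return list(range(0, size, 2))
    if mode == "odd":
        return list(range(1, size, 2))
    if mode == "first_n":
        return list(range(min(n, size)))
    if mode == "last_n":
        return list(range(max(size - n, 0), size))
    if mode == "only_n":
        return [n] if 0 <= n < size else []
    raise ValueError(f"unknown mode: {mode}")

print(enabled_indices("even", 8))
print(enabled_indices("last_n", 8, n=3))
```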

**Other**

Apple AMX contains a *very* interesting instruction called `genlut`. In the forward direction ("lookup"), it is somewhere between AVX512's `vpshufb` and `vgatherps`. In the backward direction ("generate") it is something like an *inverse* `vpshufb`, performing arbitrary 2/3/4/5-bit quantisation. When used in both directions, it can be useful for piecewise linear interpolation, or as an alternative to AVX512's `vfpclassps` / `vfixupimmps`.
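As a purely conceptual illustration of the two directions, the Python sketch below quantises values to table indices ("generate") and maps indices back to table entries ("lookup"). It is not a model of the real instruction: the function names, the "last entry less than or equal to the value" rule, and the clamping of small values are all assumptions chosen to make the idea concrete.

```python
from bisect import bisect_right

def generate(values, table):
    """Quantise: map each value to the index of the last table entry <= it.

    Assumes table is sorted ascending; values below table[0] clamp to index 0.
    """
    return [max(bisect_right(table, v) - 1, 0) for v in values]

def lookup(indices, table):
    """Shuffle / dequantise: replace each index with the table entry it names."""
    return [table[i] for i in indices]

table = [0.0, 1.0, 2.0, 4.0]  # 4 entries, so every index fits in 2 bits
indices = generate([0.5, 3.9, 4.0, -1.0, 2.0], table)
print(indices)
print(lookup(indices, table))
```

Running "generate" against one table and "lookup" against a second table of per-segment values is the shape of computation that piecewise interpolation needs: find the segment, then fetch that segment's parameters.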