AVX-512 notes
I have access to some Intel Skylake-X CPUs and some Intel Cascade Lake CPUs, both of which support AVX-512 instructions. AVX-512 has lots of subsets; both of these CPUs support the F, CD, VL, DQ, and BW subsets. Additionally, Cascade Lake supports the VNNI subset. A small part of this post is about VNNI, but other than that, everything is applicable to both Skylake-X and Cascade Lake.
There are new 512-bit-wide vector registers zmm0 through zmm31, which extend ymm0 through ymm15 both in their width and their number. The low parts of these new registers are available as ymm16 through ymm31, or xmm16 through xmm31, should you have code which would benefit from more registers rather than wider registers. There are also new 64-bit-wide "mask" registers k0 through k7.
Starting with "mask" registers:
- Movement between "mask" registers and GPRs (or memory) is done with kmov[bwdq].
- Movement between "mask" registers and vector registers is done with vpmovm2[bwdq] (expand each bit of a mask register to an element of a vector register), vpmov[bwdq]2m (create a mask from the sign bits of vector elements), or vpbroadcastm(b2q|w2d) (zero-extend a mask and broadcast it to all elements of a vector).
- A limited range of arithmetic on "mask" registers is done with kadd[bwdq] / kand[bwdq] / kandn[bwdq] / knot[bwdq] / kor[bwdq] / kshiftl[bwdq] / kshiftr[bwdq] / kunpck(bw|wd|dq) / kxnor[bwdq] / kxor[bwdq], all of which are three-operand (except knot), and none of which affect flags. ktest[bwdq] and kortest[bwdq] compare two "mask" registers and write to flags.
- Most vector instructions can optionally take a "mask" register (but not k0), and use it to control which vector lanes are active for the instruction. Inactive lanes can either produce a zero output, or preserve the value of the destination register. In terms of syntax, this is done by putting {kN} after the destination register (preserving), or {kN}{z} (zero-ing), for example vmulps zmm0 {k1}{z}, zmm1, zmm2.
- Vector comparison instructions output to a "mask" register rather than a vector register (though vpmovm2[bwdq] can be used to expand that to a vector). The new vp(test|testn)m[bwdq] instructions similarly compare two vectors and output to a "mask" register.
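To make the preserving versus zero-ing distinction concrete, here is a rough scalar model in Python. It is only a sketch of the semantics described above, not real AVX-512 code: vectors are plain lists, masks are plain integers, and the function names (masked_op, movm2, mov2m) are made up for illustration.

```python
def masked_op(op, dst, src1, src2, mask, zeroing=False):
    """Model of a lane-wise instruction taking a {kN} or {kN}{z} mask."""
    out = []
    for i, (old, a, b) in enumerate(zip(dst, src1, src2)):
        if (mask >> i) & 1:
            out.append(op(a, b))  # active lane: compute as normal
        elif zeroing:
            out.append(0)         # {kN}{z}: inactive lane becomes zero
        else:
            out.append(old)       # {kN}: inactive lane keeps the old value
    return out


def movm2(mask, lanes, width):
    """Model of vpmovm2[bwdq]: each mask bit becomes an all-ones or all-zeros element."""
    ones = (1 << width) - 1
    return [ones if (mask >> i) & 1 else 0 for i in range(lanes)]


def mov2m(vec, width):
    """Model of vpmov[bwdq]2m: gather the sign bit of each element into a mask."""
    mask = 0
    for i, x in enumerate(vec):
        mask |= ((x >> (width - 1)) & 1) << i
    return mask
```

With mask 0b0101, only lanes 0 and 2 are active: a preserving multiply of [1, 2, 3, 4] by [5, 6, 7, 8] into a destination of all 9s gives [5, 9, 21, 9], whereas the zero-ing form gives [5, 0, 21, 0].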
Most vector instructions which allow a memory operand and have a lane width of 32-bits or 64-bits now support the memory operand being an embedded broadcast of a 32-bit or 64-bit value. In terms of syntax, this is done by putting {1toN} after the memory operand, for example vpaddd zmm0, zmm0, dword ptr [rax] {1to16}. Some instructions gain optional modifiers for controlling the rounding mode, or for suppressing exceptions. As a quirk of the instruction encoding, all three pieces of functionality (broadcasting, rounding mode control, and suppressing exceptions) are enabled/disabled by the same bit, which might cause surprises.
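As a sketch of what the embedded broadcast in the vpaddd example does, the following Python model reads a single 32-bit value from memory and adds it to every lane. The function name and the use of a bytes object to stand in for memory are assumptions made for illustration; only the broadcasting behaviour comes from the description above.

```python
import struct


def vpaddd_1to16(zmm, memory, offset):
    """Model of: vpaddd zmm0, zmm0, dword ptr [rax] {1to16}."""
    # One little-endian 32-bit load, regardless of how many lanes there are.
    (scalar,) = struct.unpack_from("<I", memory, offset)
    # The loaded value is used for every lane; 32-bit adds wrap around.
    return [(x + scalar) & 0xFFFFFFFF for x in zmm]
```

Without the {1to16} modifier, the same instruction would instead read 64 bytes from memory and use a different 32-bit value for each lane.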
Assorted new floating-point instructions:
- vrangeps does a min or max (optionally ignoring sign bits), and then optionally replaces the sign bit of the result with 0 or 1 or the sign bit of the first operand. One potential use-case is clamping values to be between -t and +t in a single operation.
- vreduceps is a combined vroundps and vsubps. Optionally it can also scale by 2^M for 0 <= M <= 15 on input and by a matching 2^-M on output.
- vrndscaleps is like vroundps, with the extra trick of optionally scaling by 2^M for 0 <= M <= 15 on input and by a matching 2^-M on output.
- vrcp14ps and vrsqrt14ps are variants of vrcpps (x^-1) and vrsqrtps (x^-0.5) with more precision (14 bits rather than 11).
- vfixupimmps and vfpclassps help with handling edge-cases around zeros / NaNs / infinities / denormals.
- vgetexpps and vgetmantps and vscalefps are also new.
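The scaling trick and the clamping use-case are easier to see with numbers. Below is a minimal scalar sketch in Python, under some loud assumptions: it uses doubles rather than float32, it only models round-to-nearest-even, it only handles finite inputs, and the function names are invented for illustration.

```python
import math


def rndscale(x, m):
    """Model of vrndscaleps: round x to the nearest multiple of 2**-m."""
    scale = 2.0 ** m                 # scale by 2^M on input
    return round(x * scale) / scale  # round (ties to even), then scale by 2^-M


def reduce_model(x, m):
    """Model of vreduceps: what is left of x after vrndscaleps-style rounding."""
    return x - rndscale(x, m)


def range_clamp(x, t):
    """Model of one vrangeps mode: min of the absolute values, then the sign of x."""
    return math.copysign(min(abs(x), abs(t)), x)
```

For example, rndscale(1.3, 2) rounds to the nearest quarter and gives 1.25, reduce_model(1.3, 2) gives the remaining 0.05 (up to floating-point error), and range_clamp(-5.0, 2.0) gives -2.0.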
Assorted new integer instructions:
- vpcmp[bwdq] generalises vpcmp(eq|gt)[bwdq], and is also available for unsigned integers as vpcmpu[bwdq].
- vp(min|max)[su]q and vpabsq and vpmullq are q versions of instructions previously only present for [bwd].
- vp(rol|ror)[dq] and vp(rol|ror)v[dq] are rotate instructions.
- vplzcnt[dq] are vectorised lzcnt instructions (similar to bsr).
- vdbpsadbw is an extension of vpsadbw.
- vpconflict[dq] perform pair-wise equality comparisons of source elements, outputting bitmasks in every lane.
- vpternlog[dq] subsumes all boolean functions of up to three inputs (though vp(and|andn|or|xor)[dq] should still be used where appropriate, due to their shorter encoding and lack of dependency on the output register).
- vpmov(|s|us)q[dwb] and vpmov(|s|us)d[wb] and vpmov(|s|us)wb provide down-conversion (in truncating, signed saturating, and unsigned saturating varieties) combined with packing. For example, vpmovsdb xmm0, zmm0 converts 16 int32 values into 16 int8 values (via saturation), and packs the results into the low 128 bits of the destination.
- On Cascade Lake, VNNI adds vpdpbusd as a fusion of vpmaddubsw + vpmaddwd + vpaddd. vpdpbusds is similar, but with saturation. vpdpwssd and vpdpwssds fuse just vpmaddwd + vpaddd. These instructions have a latency of 5 cycles, versus vpaddd's 1 cycle, so more accumulation registers are required in tight loops.
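Two of these are worth a small worked model. The Python sketch below treats vpternlog as a per-bit lookup into its 8-bit immediate, and treats one 32-bit lane of vpdpbusd as a four-element dot product of unsigned bytes with signed bytes. The function names are hypothetical, and the vpdpbusd model is my reading of the non-saturating form (exact products, wrapping 32-bit accumulate), so treat it as an assumption rather than a specification.

```python
def ternlog(a, b, c, imm8, width=32):
    """Model of vpternlog[dq] on one element: imm8 is a 3-input truth table."""
    out = 0
    for bit in range(width):
        # The three input bits form an index (a is the most significant).
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        out |= ((imm8 >> idx) & 1) << bit
    return out


def ternlog_imm8(f):
    """Derive imm8 for a boolean function by evaluating it on three magic constants."""
    return f(0xF0, 0xCC, 0xAA) & 0xFF


def vpdpbusd_lane(acc, unsigned_bytes, signed_bytes):
    """Model of one 32-bit lane of vpdpbusd (the non-saturating form)."""
    total = acc + sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    return (total + 2**31) % 2**32 - 2**31  # wrap to a signed 32-bit value
```

For instance, ternlog_imm8(lambda a, b, c: a ^ b ^ c) gives 0x96 for a three-way xor, and ternlog_imm8(lambda a, b, c: (a & b) | (~a & c)) gives 0xCA for a bitwise select.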
There are new instructions for converting between unsigned integers and floating-point values, in the form of v(cvt|cvtt)[ps][sd]2u(dq|si|qq) and vcvtu(dq|si|qq)2[ps][sd]. Also new are packed conversions between int64 and floating-point values, in the form of v(cvt|cvtt)p[sd]2qq and vcvtqq2p[sd].
Assorted new permutation and shuffling and blending instructions:
- vblendm(ps|pd) and vpblendm[bwdq] are per-lane blends, controlled by a "mask" register.
- vcompress(ps|pd) and vpcompress[dq] provide cross-lane packing, controlled by a "mask" register (can have a memory destination). This is like pext, but operating on lanes rather than bits.
- vexpand(ps|pd) and vpexpand[dq] provide cross-lane un-packing, controlled by a "mask" register (can have a memory source). This is like pdep, but operating on lanes rather than bits.
- vextract[fi](32x8|64x4) and vinsert[fi](32x8|64x4) manipulate the 256-bit halves of a 512-bit register.
- vshuf[fi](32x4|64x2) shuffle at 128-bit granularity from two sources. These instructions are to the 128-bit lanes of a 512-bit register as shufps is to the 32-bit lanes of a 128-bit register.
- vperm(ps|pd) and vperm[wdq] permute from one source, using indices from another source.
- vperm[ti]2(ps|pd) and vperm[ti]2[wdq] permute from two sources, using indices from another source. The i variant has the indices register as the destination. The t variant has a source register as the destination.
- valign[dq] concatenate two 512-bit registers and extract a contiguous 512-bit slice.
- vscatter[dq](ps|pd) and vpscatter[dq][dq] provide scattered stores.
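The compress, expand, two-source permute, and align operations can be sketched as scalar Python models. As before, this is an illustration only: vectors are lists with lane 0 first, masks are integers, 8 or 4 lanes are used instead of 16 to keep the examples short, and all of the function names are made up.

```python
def compress(src, mask, dst=None):
    """Model of vpcompress[dq]: pack the active lanes of src into the low lanes."""
    packed = [x for i, x in enumerate(src) if (mask >> i) & 1]
    # Remaining high lanes come from the old destination, or are zeroed.
    rest = dst[len(packed):] if dst is not None else [0] * (len(src) - len(packed))
    return packed + list(rest)


def expand(src, mask, dst=None):
    """Model of vpexpand[dq]: spread the low lanes of src out to the active lanes."""
    out = list(dst) if dst is not None else [0] * len(src)
    nxt = 0
    for i in range(len(src)):
        if (mask >> i) & 1:
            out[i] = src[nxt]
            nxt += 1
    return out


def permute2(table0, table1, indices):
    """Model of vperm[ti]2: each index selects a lane from the two tables joined together."""
    table = table0 + table1
    return [table[i % len(table)] for i in indices]


def align(hi, lo, imm):
    """Model of valign[dq]: join hi:lo, then take a slice starting at lane imm of lo."""
    n = len(lo)
    start = imm % n
    return (lo + hi)[start:start + n]
```

With mask 0b10100110 (lanes 1, 2, 5 and 7 active), compress([10, 11, 12, 13, 14, 15, 16, 17], mask) gives [11, 12, 15, 17, 0, 0, 0, 0], and expand of that result with the same mask gives [0, 11, 12, 0, 0, 15, 0, 17].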