I have access to some Intel Skylake-X CPUs and some Intel Cascade Lake CPUs, both of which support AVX-512 instructions. AVX-512 has lots of subsets; both of these CPUs support the F, CD, VL, DQ, and BW subsets. Additionally, Cascade Lake supports the VNNI subset. A small part of this post is about VNNI, but other than that, everything is applicable to both Skylake-X and Cascade Lake.

There are new 512-bit-wide vector registers zmm0 through zmm31, which extend ymm0 through ymm15 both in their width and their number. The low parts of these new registers are available as ymm16 through ymm31, or xmm16 through xmm31, should you have code which would benefit from more registers rather than wider registers. There are also new 64-bit-wide "mask" registers k0 through k7.

Starting with "mask" registers:

Most vector instructions which allow a memory operand and have a lane width of 32 bits or 64 bits now support the memory operand being an embedded broadcast of a 32-bit or 64-bit value. In terms of syntax, this is done by putting {1toN} after the memory operand, for example vpaddd zmm0, zmm0, dword ptr [rax] {1to16}. Some instructions gain optional modifiers for controlling the rounding mode, or for suppressing exceptions. As a quirk of the instruction encoding, all three pieces of functionality (broadcasting, rounding mode control, and suppressing exceptions) are enabled/disabled by the same bit (EVEX.b), whose meaning depends on the operands: with a memory operand it selects broadcasting, and with register-only operands it selects rounding mode control or exception suppression. One consequence is that a single instruction cannot combine a broadcast with a rounding override.

Assorted new floating-point instructions:

Assorted new integer instructions:

There are new instructions for converting between unsigned integers and floating-point values, in the form of v(cvt|cvtt)[ps][sd]2u(dq|si|qq) and vcvtu(dq|si|qq)2[ps][sd]. Also new are packed conversions between int64 and floating-point values, in the form of v(cvt|cvtt)p[sd]2qq and vcvtqq2p[sd].

Assorted new permutation and shuffling and blending instructions: