An SDI data stream consists of 10-bit data. In hardware, this is not a problem. In contemporary software, 10-bit values are slightly awkward, and are generally represented in the low 10 bits of a uint16_t. SDI has two separate streams of 10-bit data (called c and y), which software might choose to represent in an interleaved fashion. Finally, each stream has an 18-bit CRC computed over it (polynomial x^18 + x^5 + x^4 + x^0), where the 18-bit value might be represented in the low 18 bits of a uint32_t. Given all this, the CRCs could be computed using the following code:

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    uint32_t c = crcs[0];
    uint32_t y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c ^= data[i];
        y ^= data[i+1];
        for (int k = 0; k < 10; k++) {
            c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
            y = y & 1 ? (y >> 1) ^ 0x23000 : y >> 1;
        }
    }
    crcs[0] = c;
    crcs[1] = y;
}

The above code is short, but it isn't fast: on my three test systems, it manages approximately 0.5 bits per cycle. Expressed differently, that is 40 cycles per iteration of the outer loop (which is ingesting 20 bits per iteration).

Version         Broadwell  Skylake  Ice Lake
Starting point  0.536      0.528    0.477

At the cost of a 2KiB table, this can be improved to approximately 2.1 bits per cycle, which is better, but still not great:

uint16_t table[1u << 10];

void make_table() {
    for (uint32_t idx = 0; idx < (1u << 10); idx++) {
        uint32_t c = idx;
        for (int k = 0; k < 10; k++) {
            c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
        }
        table[idx] = (c >> 2);
    }
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    uint32_t c = crcs[0];
    uint32_t y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c ^= data[i];
        y ^= data[i+1];
        c = (table[c & 0x3ff] << 2) ^ (c >> 10);
        y = (table[y & 0x3ff] << 2) ^ (y >> 10);
    }
    crcs[0] = c;
    crcs[1] = y;
}

Version     Broadwell  Skylake  Ice Lake
2KiB table  2.10       2.15     2.05
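Before moving on, it is easy to check that the table formulation matches the bit-at-a-time original. The sketch below (helper names are my own) also checks the property the table relies on: after ten shift steps, the low two bits of the state are always clear, which is why storing c >> 2 in a uint16_t loses nothing:

```c
#include <assert.h>
#include <stdint.h>

// Ten bit-at-a-time steps, as in the original crc_sdi inner loop.
static uint32_t step10(uint32_t c) {
    for (int k = 0; k < 10; k++)
        c = c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
    return c;
}

static uint16_t tbl[1u << 10];

static void make_tbl(void) {
    for (uint32_t idx = 0; idx < (1u << 10); idx++)
        tbl[idx] = (uint16_t)(step10(idx) >> 2);
}

// One iteration of the bitwise version versus the table version.
static uint32_t crc10_bitwise(uint32_t c, uint16_t d) {
    return step10(c ^ d);
}

static uint32_t crc10_table(uint32_t c, uint16_t d) {
    c ^= d;
    return ((uint32_t)tbl[c & 0x3ff] << 2) ^ (c >> 10);
}
```

The table works because the XOR injections of 0x23000 sit at bit 12 or above, and are shifted right at most 9 further times, so they can never reach bits 0 or 1, and never themselves trigger further injections within the ten steps.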

To do better, some mathematical tricks need to be pulled. The polynomial ring ℤ₂[x] is relevant, along with an isomorphism between bit strings and said polynomials. The original code can be re-expressed, albeit borrowing some notation from polynomials:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0];
    polynomial y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c += data[i]   * x^8;
        y += data[i+1] * x^8;
        for (int k = 0; k < 10; k++) {
            c = (c * x^1) mod P; // P is x^18 + x^5 + x^4 + x^0
            y = (y * x^1) mod P;
        }
    }
    crcs[0] = c;
    crcs[1] = y;
}
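To make the isomorphism concrete: one step of the original inner loop is exactly multiplication by x^1 followed by reduction mod P, just expressed on bit-reversed 18-bit values (bit 0 holds the coefficient of x^17). A small model (helper names of my own choosing; P written as 0x40031 in regular bit order) can verify this exhaustively:

```c
#include <assert.h>
#include <stdint.h>

#define P 0x40031u  // x^18 + x^5 + x^4 + 1, regular bit order

// Reverse an 18-bit value, converting between regular bit order
// (bit i = coefficient of x^i) and the reflected order used by the CRC code.
static uint32_t rev18(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 18; i++) r |= ((v >> i) & 1) << (17 - i);
    return r;
}

// (F * x) mod P in regular bit order, for F < 2^18.
static uint32_t mulx_mod(uint32_t f) {
    f <<= 1;
    if (f & (1u << 18)) f ^= P;
    return f;
}

// One step of the inner loop from the original crc_sdi.
static uint32_t step(uint32_t c) {
    return c & 1 ? (c >> 1) ^ 0x23000 : c >> 1;
}
```

Note that 0x23000 is just rev18 applied to the low 18 bits of P, which is why the right-shifting code and the multiply-by-x view agree.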

The first mathematical trick is eliminating the inner loop:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0];
    polynomial y = crcs[1];
    for (size_t i = 0; i < n; i += 2) {
        c = (c * x^10 + data[i]   * x^18) mod P;
        y = (y * x^10 + data[i+1] * x^18) mod P;
    }
    crcs[0] = c;
    crcs[1] = y;
}

The next mathematical trick is pulling * x^18 and mod P out of the loop:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-18;
    polynomial y = crcs[1] * x^-18;
    for (size_t i = 0; i < n; i += 2) {
        c = c * x^10 + data[i];
        y = y * x^10 + data[i+1];
    }
    crcs[0] = (c * x^18) mod P;
    crcs[1] = (y * x^18) mod P;
}
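The legitimacy of this transformation can be checked with a software model of ℤ₂[x] arithmetic. The sketch below (all names mine) computes x^-18 mod P via the observation that x^-1 mod P = (P + 1) / x, then confirms that both formulations produce the same CRC for a few input words:

```c
#include <assert.h>
#include <stdint.h>

#define P 0x40031u  // x^18 + x^5 + x^4 + 1, regular bit order

// Carry-less multiplication; operands here are small enough (at most
// 18 bits for b) that the result fits in 64 bits.
static uint64_t clmul(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if (b >> i & 1) r ^= a << i;
    return r;
}

// Reduce a polynomial of degree < 63 mod P.
static uint32_t poly_mod(uint64_t a) {
    for (int i = 62; i >= 18; i--)
        if (a >> i & 1) a ^= (uint64_t)P << (i - 18);
    return (uint32_t)a;
}

// x^n mod P for positive or negative n. x^-1 mod P = 0x20018, because
// x * 0x20018 = P + 1, which is congruent to 1 mod P.
static uint32_t xpow_mod(int n) {
    uint32_t r = 1, base = n < 0 ? 0x20018 : 2;
    if (n < 0) n = -n;
    while (n--) r = poly_mod(clmul(r, base));
    return r;
}
```

The check below reduces the x^-18-scaled accumulator after every step purely to keep values narrow; reducing early never changes the final value mod P.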

At this point, a step back towards practical implementation is required, which takes the form of unrolling the loop 12 times:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-18;
    polynomial y = crcs[1] * x^-18;
    for (size_t i = 0; i < n; i += 24) {
        c = c * x^120 + pack120(data + i);
        y = y * x^120 + pack120(data + i + 1);
    }
    crcs[0] = (c * x^18) mod P;
    crcs[1] = (y * x^18) mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^60 + pack60(data + 12);
}

polynomial pack60(const polynomial* data) {
    polynomial result = 0;
    for (size_t i = 0; i < 12; i += 2) {
        result = result * x^10 + data[i];
    }
    return result;
}

The * x^60 in the middle of pack120 is slightly awkward, and would be more convenient as * x^64. This can be achieved by shuffling a few things around:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = crcs[0] * x^-14;
    polynomial y = crcs[1] * x^-14;
    for (size_t i = 0; i < n; i += 24) {
        c = c * x^120 + pack120(data + i);
        y = y * x^120 + pack120(data + i + 1);
    }
    crcs[0] = (c * x^14) mod P;
    crcs[1] = (y * x^14) mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^64 + pack60(data + 12) * x^4;
}

polynomial pack60(const polynomial* data) {
    polynomial result = 0;
    for (size_t i = 0; i < 12; i += 2) {
        result = result * x^10 + data[i];
    }
    return result;
}

To continue back towards a practical implementation, the size of each polynomial variable needs to be bounded. For that, I'll introduce the semi-mod operator, where for a polynomial F, (F * x^n) semi-mod P is defined as (F mod x^64) * (x^n mod P) + (F div x^64) * (x^(n+64) mod P). This is then tactically deployed in a few places:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = (crcs[0] * x^-14) semi-mod P;
    polynomial y = (crcs[1] * x^-14) semi-mod P;
    for (size_t i = 0; i < n; i += 24) {
        c = (c * x^120) semi-mod P;
        y = (y * x^120) semi-mod P;
        c = c + pack120(data + i);
        y = y + pack120(data + i + 1);
    }
    c = (c * x^14) semi-mod P;
    y = (y * x^14) semi-mod P;
    crcs[0] = c mod P;
    crcs[1] = y mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^64 + pack60(data + 12) * x^4;
}

polynomial pack60(const polynomial* data) {
    polynomial result = 0;
    for (size_t i = 0; i < 12; i += 2) {
        result = result * x^10 + data[i];
    }
    return result;
}
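It's worth convincing ourselves that semi-mod really is congruent to mod. A small model (names mine; operands kept to 32 bits so every carry-less product fits in a uint64_t) checks that reducing the semi-mod result gives the same answer as a genuine mod P:

```c
#include <assert.h>
#include <stdint.h>

#define P 0x40031u  // x^18 + x^5 + x^4 + 1, regular bit order

// Carry-less multiplication; b is at most 18 bits in this model.
static uint64_t clmul(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 19; i++)
        if (b >> i & 1) r ^= a << i;
    return r;
}

// Reduce a polynomial of degree < 63 mod P.
static uint32_t poly_mod(uint64_t a) {
    for (int i = 62; i >= 18; i--)
        if (a >> i & 1) a ^= (uint64_t)P << (i - 18);
    return (uint32_t)a;
}

// x^n mod P, computed one shift at a time.
static uint32_t xpow_mod(unsigned n) {
    uint32_t r = 1;
    while (n--) {
        r <<= 1;
        if (r & (1u << 18)) r ^= P;
    }
    return r;
}

// (F * x^n) semi-mod P, with F = F_hi * x^64 + F_lo.
static uint64_t semi_mod(uint32_t F_lo, uint32_t F_hi, unsigned n) {
    return clmul(F_lo, xpow_mod(n)) ^ clmul(F_hi, xpow_mod(n + 64));
}
```

The congruence holds because F * x^n = F_lo * x^n + F_hi * x^(n+64), and replacing each power of x with its residue mod P changes nothing modulo P; semi-mod merely stops short of the final reduction.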

The next transformation is unrolling pack60:

void crc_sdi(polynomial* crcs, const polynomial* data, size_t n) {
    polynomial c = (crcs[0] * x^-14) semi-mod P;
    polynomial y = (crcs[1] * x^-14) semi-mod P;
    for (size_t i = 0; i < n; i += 24) {
        c = (c * x^120) semi-mod P;
        y = (y * x^120) semi-mod P;
        c = c + pack120(data + i);
        y = y + pack120(data + i + 1);
    }
    c = (c * x^14) semi-mod P;
    y = (y * x^14) semi-mod P;
    crcs[0] = c mod P;
    crcs[1] = y mod P;
}

polynomial pack120(const polynomial* data) {
    return pack60(data) * x^64 + pack60(data + 12) * x^4;
}

polynomial pack60(const polynomial* data) {
    return data[ 0] * x^50 + data[ 2] * x^40 + data[ 4] * x^30 +
           data[ 6] * x^20 + data[ 8] * x^10 + data[10];
}

Moving back toward the practical realm, pack60 can return a 64-bit value. This value will be a polynomial in reversed bit order, meaning that the MSB represents x^0 and the LSB represents x^63. pack60 thus becomes:

uint64_t pack60(const uint16_t* data) {
    return (((uint64_t)data[ 0]) <<  4) ^
           (((uint64_t)data[ 2]) << 14) ^
           (((uint64_t)data[ 4]) << 24) ^
           (((uint64_t)data[ 6]) << 34) ^
           (((uint64_t)data[ 8]) << 44) ^
           (((uint64_t)data[10]) << 54);
}
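A quick property worth noting: the six shifted 10-bit fields in pack60 tile bits 4 through 63 without overlapping (assuming each data word really is a 10-bit value), so the XORs act as pure concatenation. A throwaway check (names mine):

```c
#include <assert.h>
#include <stdint.h>

// Build the union of the six 10-bit fields used by pack60, checking that
// no two fields overlap along the way.
static uint64_t pack60_coverage(void) {
    static const int shifts[6] = {4, 14, 24, 34, 44, 54};
    uint64_t mask = 0;
    for (int i = 0; i < 6; i++) {
        uint64_t field = 0x3ffull << shifts[i];
        assert((mask & field) == 0); // fields must not overlap
        mask |= field;
    }
    return mask;
}
```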

Next up is pack120, which can return a 128-bit value. Again, this value will be a polynomial in reversed bit order, so the MSB represents x^0 and the LSB represents x^127. pack120 thus becomes:

__m128i pack120(const uint16_t* data) {
    return _mm_set_epi64x(pack60(data + 12) >> 4, pack60(data));
}

The c and y polynomials within the main function can also be 128-bit values, in the same reversed bit order. At this point, _mm_clmulepi64_si128 needs to be introduced, which can perform multiplication of two 64-bit polynomials, either in regular bit order or in reversed bit order. For polynomials F and G in regular bit order, it performs F * G with the result in regular order. For polynomials F and G in reversed bit order, it performs F * G * x^1 with the result in reversed order. Accordingly, _mm_clmulepi64_si128 can be used to implement the semi-mod operator. It can also be used to implement mod P. It is neither short nor readable, but the resultant code is:

__m128i xor_clmul(__m128i a, __m128i b) {
    return _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x00),
                         _mm_clmulepi64_si128(a, b, 0x11));
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        { // *= x^120 semi-mod P
            __m128i k = _mm_setr_epi32(
                0, 0x4b334000 /* x^120+64-1 mod P */,
                0, 0x96d30000 /* x^120-1    mod P */);
            c = xor_clmul(c, k);
            y = xor_clmul(y, k);
        }
        { // +=
            c = _mm_xor_si128(c, pack120(data + i));
            y = _mm_xor_si128(y, pack120(data + i + 1));
        }
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}

This version comes in at approximately 11 bits per cycle:

Version               Broadwell  Skylake  Ice Lake
_mm_clmulepi64_si128  10.8       11.0     12.8
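The bit-order behaviour of carry-less multiplication deserves a demonstration. Scaling down to 8-bit operands (the width is immaterial; helper names are mine), multiplying bit-reversed inputs yields the bit-reversed value of F * G * x^1:

```c
#include <assert.h>
#include <stdint.h>

// Carry-less multiply of two 8-bit polynomials, 16-bit result.
static uint16_t clmul8(uint8_t a, uint8_t b) {
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b >> i & 1) r ^= (uint16_t)a << i;
    return r;
}

static uint8_t rev8(uint8_t v) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) r |= (uint8_t)(((v >> i) & 1) << (7 - i));
    return r;
}

static uint16_t rev16(uint16_t v) {
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) r |= (uint16_t)(((v >> i) & 1) << (15 - i));
    return r;
}
```

The stray x^1 factor appears because a product of two degree-63 (here, degree-7) polynomials has degree at most 126 (here, 14), one short of filling the double-width result, so reversing the double-width value shifts everything along by one bit.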

This is good, but we can do better still. At this point, all the mathematical tricks have been pulled, so the next avenue of exploration is vectorisation. In particular, there are four pack60 calls per iteration, and it should be possible to perform two calls at once by operating at 128-bit vector width:

__m128i pmovzxwq(const uint16_t* data) {
    __m128i out;
    asm("vpmovzxwq %1, %0" : "=x"(out) :
                              "m"(*(const uint32_t*)data));
    return out;
}

__m128i pack60x2(const uint16_t* data) {
    __m128i result;
    result  = _mm_slli_epi64(pmovzxwq(data     ),  4);
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 14);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 24);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 34);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 44);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 54);
    return result;
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i lo = pack60x2(data + i);
            __m128i hi = _mm_srli_epi64(pack60x2(data + i + 12), 4);
            c = _mm_xor_si128(c, _mm_unpacklo_epi64(lo, hi));
            y = _mm_xor_si128(y, _mm_unpackhi_epi64(lo, hi));
        }
...
}

This gives an extra 0.7 bits per cycle on Broadwell. On Skylake and Ice Lake, it improves things by more than 2 bits per cycle:

Version   Broadwell  Skylake  Ice Lake
pack60x2  11.5       13.2     15.2

From here, a few shift instructions can be eliminated by having two variants of pack60x2:

__m128i pack60x2_4(const uint16_t* data) {
    __m128i result;
    result  = _mm_slli_epi64(pmovzxwq(data     ),  4);
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 14);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 24);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 34);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 44);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 54);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i result;
    result  =                pmovzxwq(data     );
    result ^= _mm_slli_epi64(pmovzxwq(data +  2), 10);
    result ^= _mm_slli_epi64(pmovzxwq(data +  4), 20);
    result ^= _mm_slli_epi64(pmovzxwq(data +  6), 30);
    result ^= _mm_slli_epi64(pmovzxwq(data +  8), 40);
    result ^= _mm_slli_epi64(pmovzxwq(data + 10), 50);
    return result;
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
...
        { // +=
            __m128i lo = pack60x2_4(data + i);
            __m128i hi = pack60x2_0(data + i + 12);
            c = _mm_xor_si128(c, _mm_unpacklo_epi64(lo, hi));
            y = _mm_xor_si128(y, _mm_unpackhi_epi64(lo, hi));
        }
...
}

This makes basically no difference on Skylake, but gives a minor improvement on both Broadwell and Ice Lake:

Version                    Broadwell  Skylake  Ice Lake
pack60x2_4 and pack60x2_0  12.6       13.2     15.9

At this point it is worth reviewing exactly which micro-ops are being executed per iteration of the loop (which is now ingesting 240 bits per iteration):

Instruction                   Count  BDW μops   SKL μops   ICL μops
_mm_clmulepi64_si128 x, x, i  4      p0         p5         p5
pmovzxwq x, m                 12     p5 + p23   p5 + p23   p15 + p23
_mm_slli_epi64 x, i           11     p0         p01        p01
_mm_xor_si128 x, x            14     p015       p015       p015
_mm_unpack(lo|hi)_epi64 x, x  2      p5         p5         p15

On Skylake, we're bound by port 5 (which is why eliminating the shifts didn't help): 240 bits require 18 μops executed by port 5, and 240 bits divided by 18 cycles is 13.3 bits per cycle, which is pretty much the measured 13.2 bits per cycle. Things are slightly less neat on Broadwell and Ice Lake, as we're hitting suboptimal dispatch of μops to execution ports, causing us to be bound on a port even though clairvoyant dispatch could do slightly better. Broadwell could be described as being bound on port 0 (due to some xor μops being dispatched to port 0, even though it would be better to dispatch them all to port 1), and Ice Lake could be described as being bound on a mix of port 1 and port 5 (due to some xor μops not being dispatched to port 0).

To get past this bound, some work needs to be removed from port 5. Each pair of pmovzxwq and _mm_slli_epi64 causes a memory μop on port 2 or 3, a shuffle μop on port 5 (or port 1 on Ice Lake), and an arithmetic μop on port 0 (or port 1 on Skylake / Ice Lake). The shuffle μop can be removed by using a different pair of instructions: broadcastss (one memory μop) followed by _mm_madd_epi16 against a constant (one arithmetic μop). This different pair can only be used when the shift amount, modulo 32, is between 0 and 14 (inclusive), as the constant values in question are int16_t values. Half of the cases satisfy this constraint on the shift amount, and can thus be replaced:

__m128i broadcastss(const uint16_t* data) {
    __m128i result;
    asm("vbroadcastss %1, %0" : "=x"(result) :
                                 "m"(*(const uint32_t*)data));
    return result;
}

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result;
    result  = _mm_madd_epi16(broadcastss(data     ), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift14);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  4), 24);
    result ^= _mm_madd_epi16(broadcastss(data +  6), shift34);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift44);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 54);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result;
    result  =                   pmovzxwq(data     );
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift10);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  4), 20);
    result ^= _mm_slli_epi64(   pmovzxwq(data +  6), 30);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift40);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 50);
    return result;
}

This gives a nice improvement across the board, getting us to at least 18 bits per cycle on Skylake and Ice Lake:

Version                       Broadwell  Skylake  Ice Lake
broadcastss + _mm_madd_epi16  13.5       18.0     18.8
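The reason the int16_t constraint limits shift amounts (modulo 32) to 14 can be seen in a scalar model of one 32-bit lane of _mm_madd_epi16 (a sketch of mine): with a constant of 1 << s in one int16 half and 0 in the other, the multiply-add degenerates to a plain shift of the selected 10-bit sample, provided 1 << s still fits in a positive int16:

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of one 32-bit lane of _mm_madd_epi16: treat each 32-bit
// input as two int16 values, multiply pairwise, and sum into an int32.
static int32_t madd16_lane(uint32_t a, uint32_t b) {
    int16_t a0 = (int16_t)(a & 0xffff), a1 = (int16_t)(a >> 16);
    int16_t b0 = (int16_t)(b & 0xffff), b1 = (int16_t)(b >> 16);
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}
```

For s = 15, the constant 1 << 15 would be interpreted as a negative int16, and the "shift" would come out negated, which is why half of the shift amounts cannot use this trick.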

A few more pairs of pmovzxwq and _mm_slli_epi64 can be replaced with broadcastss and _mm_madd_epi16, albeit at the cost of one extra arithmetic μop for every two removed shuffle μops:

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result, hi;
    result  = _mm_madd_epi16(broadcastss(data     ), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift14);
    hi      = _mm_madd_epi16(broadcastss(data +  4), shift4);
    result ^= _mm_madd_epi16(broadcastss(data +  6), shift34);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift44);
    hi     ^= _mm_madd_epi16(broadcastss(data + 10), shift34);
    result ^= _mm_slli_epi64(hi, 20);
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result, hi;
    result  =                   pmovzxwq(data     );
    result ^= _mm_madd_epi16(broadcastss(data +  2), shift10);
    hi      =                   pmovzxwq(data +  4);
    hi     ^= _mm_madd_epi16(broadcastss(data +  6), shift10);
    result ^= _mm_slli_epi64(hi, 20);
    result ^= _mm_madd_epi16(broadcastss(data +  8), shift40);
    result ^= _mm_slli_epi64(   pmovzxwq(data + 10), 50);
    return result;
}

This is a minor regression on Broadwell (because at this point we're bound by port 0, and that is where the extra arithmetic μop goes), and a small improvement on Skylake and Ice Lake (where the bound is shuffle rather than arithmetic):

Version                            Broadwell  Skylake  Ice Lake
More broadcastss + _mm_madd_epi16  13.2       18.8     19.7

To break through the barrier of 20 bits per cycle, AVX512's ternary bitwise instruction can be used to merge together pairs of xor instructions:

__m128i xor3(__m128i a, __m128i b, __m128i c) {
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}

__m128i pack60x2_4(const uint16_t* data) {
    __m128i shift4  = _mm_setr_epi32(1u <<  4, 0, 1u << ( 4+16), 0);
    __m128i shift14 = _mm_setr_epi32(1u << 14, 0, 1u << (14+16), 0);
    __m128i shift34 = _mm_setr_epi32(0, 1u <<  2, 0, 1u << ( 2+16));
    __m128i shift44 = _mm_setr_epi32(0, 1u << 12, 0, 1u << (12+16));
    __m128i result, hi;
    result = xor3(_mm_madd_epi16(broadcastss(data     ), shift4),
                  _mm_madd_epi16(broadcastss(data +  2), shift14),
                  _mm_madd_epi16(broadcastss(data +  6), shift34));
    hi     =      _mm_madd_epi16(broadcastss(data +  4), shift4)
           ^      _mm_madd_epi16(broadcastss(data + 10), shift34);
    result = xor3(result,
                  _mm_madd_epi16(broadcastss(data +  8), shift44),
                  _mm_slli_epi64(hi, 20));
    return result;
}

__m128i pack60x2_0(const uint16_t* data) {
    __m128i shift10 = _mm_setr_epi32(1u << 10, 0, 1u << (10+16), 0);
    __m128i shift40 = _mm_setr_epi32(0, 1u << 8, 0, 1u << (8+16));
    __m128i result, hi;
    hi     =                        pmovzxwq(data +  4)
           ^      _mm_madd_epi16(broadcastss(data +  6), shift10);
    result = xor3(                  pmovzxwq(data     ),
                  _mm_madd_epi16(broadcastss(data +  2), shift10),
                  _mm_slli_epi64(hi, 20));
    result = xor3(result,
                  _mm_madd_epi16(broadcastss(data +  8), shift40),
                  _mm_slli_epi64(   pmovzxwq(data + 10), 50));
    return result;
}

void crc_sdi(uint32_t* crcs, const uint16_t* data, size_t n) {
    __m128i c = _mm_cvtsi32_si128(crcs[0]);
    __m128i y = _mm_cvtsi32_si128(crcs[1]);
    { // *= x^-14 semi-mod P
        __m128i k = _mm_cvtsi32_si128(
            0x9f380000 /* x^-14-(64-18)-32-1 mod P */);
        c = _mm_clmulepi64_si128(c, k, 0x00);
        y = _mm_clmulepi64_si128(y, k, 0x00);
    }
    for (size_t i = 0; i < n; i += 24) {
        // *= x^120 semi-mod P
        // +=
        __m128i lo = pack60x2_4(data + i);
        __m128i hi = pack60x2_0(data + i + 12); 
        __m128i k = _mm_setr_epi32(
            0, 0x4b334000 /* x^120+64-1 mod P */,
            0, 0x96d30000 /* x^120-1    mod P */);
        c = xor3(_mm_clmulepi64_si128(c, k, 0x00),
                 _mm_clmulepi64_si128(c, k, 0x11),
                 _mm_unpacklo_epi64(lo, hi));
        y = xor3(_mm_clmulepi64_si128(y, k, 0x00),
                 _mm_clmulepi64_si128(y, k, 0x11),
                 _mm_unpackhi_epi64(lo, hi));
    }
    { // *= x^14 semi-mod P
        __m128i k = _mm_setr_epi32(
            0, 0x14980000 /* x^14+64-1 mod P */,
            0, 0x00040000 /* x^14-1    mod P */);
        c = xor_clmul(c, k);
        y = xor_clmul(y, k);
    }
    { // mod P
        __m128i k = _mm_setr_epi32( /* x^128-1 div P */
            0x14980559, 0x4c9bb5d5,
            0x80040000, 0x5e405011);
        c = _mm_xor_si128(_mm_srli_si128(xor_clmul(c, k), 8),
                          _mm_clmulepi64_si128(c, k, 0x01));
        y = _mm_xor_si128(_mm_srli_si128(xor_clmul(y, k), 8),
                          _mm_clmulepi64_si128(y, k, 0x01));
        __m128i P = _mm_cvtsi32_si128(0x46001 /* P */);
        c = _mm_clmulepi64_si128(c, P, 0x00);
        y = _mm_clmulepi64_si128(y, P, 0x00);
    }
    crcs[0] = _mm_cvtsi128_si32(c);
    crcs[1] = _mm_cvtsi128_si32(y);
}

AVX512 isn't available on Broadwell, nor on the client variant of Skylake, but where it is available it pushes us well above 20 bits per cycle:

Version                 Broadwell  Skylake  Ice Lake
_mm_ternarylogic_epi64  N/A        21.9     23.5
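The immediate 0x96 passed to _mm_ternarylogic_epi64 is simply the 8-entry truth table of a ^ b ^ c. A scalar model of the instruction (a sketch of mine; a's bit is the high selector) shows why:

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of _mm_ternarylogic_epi64: for each bit position, the three
// input bits form a 3-bit index selecting one bit of the 8-bit immediate.
static uint64_t ternlog(uint64_t a, uint64_t b, uint64_t c, uint8_t imm) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)(((a >> i & 1) << 2) |
                                  ((b >> i & 1) << 1) |
                                   (c >> i & 1));
        r |= (uint64_t)(imm >> idx & 1) << i;
    }
    return r;
}
```

0x96 is 10010110 in binary: bit idx is set exactly when idx has odd parity, i.e. when a ^ b ^ c is 1, so one instruction replaces two XORs.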

To recap the journey, when measuring bits per cycle, we've got a 25x overall improvement on Broadwell, more than 40x overall improvement on Skylake, and nearly a 50x overall improvement on Ice Lake:

Version                            Broadwell  Skylake  Ice Lake
Starting point                     0.536      0.528    0.477
2KiB table                         2.10       2.15     2.05
_mm_clmulepi64_si128               10.8       11.0     12.8
pack60x2                           11.5       13.2     15.2
pack60x2_4 and pack60x2_0          12.6       13.2     15.9
broadcastss + _mm_madd_epi16       13.5       18.0     18.8
More broadcastss + _mm_madd_epi16  13.2       18.8     19.7
_mm_ternarylogic_epi64             N/A        21.9     23.5