The many ways of converting FP32 to FP16

An IEEE 754 single-precision 32-bit binary float (henceforth FP32) can store:

(-1)^s × 2^e × m₀.m₁m₂m₃m₄m₅m₆m₇m₈m₉m₁₀m₁₁m₁₂⋯m₂₁m₂₂m₂₃ with either:
- m₀ = 1, other m_i either 0 or 1, and -126 ≤ e ≤ 127 (normalised)
- m₀ = 0, other m_i either 0 or 1, and -126 = e (denormal or zero)
(-1)^s × ∞ (± infinity)
NaN (with a sign bit, and 23 bits of almost-arbitrary payload)

An IEEE 754 half-precision 16-bit binary float (henceforth FP16) can store:

(-1)^s × 2^e × m₀.m₁m₂m₃m₄m₅m₆m₇m₈m₉m₁₀ with either:
- m₀ = 1, other m_i either 0 or 1, and -14 ≤ e ≤ 15 (normalised)
- m₀ = 0, other m_i either 0 or 1, and -14 = e (denormal or zero)
(-1)^s × ∞ (± infinity)
NaN (with a sign bit, and 10 bits of almost-arbitrary payload)

Converting from FP32 to FP16 thus involves getting rid of m₁₁ through m₂₃, and dealing with the reduced range of e. If e starts in the range -14 ≤ e ≤ 15 and m₁₁ through m₂₃ are all 0, then this conversion is trivial. If any of m₁₁ through m₂₃ are 1 then a rounding decision has to be made, typical options for which include:

Round toward zero (i.e. discard all of m₁₁ through m₂₃)
Round toward +∞ or toward -∞ (i.e. add one or not based on s)
Round to nearest with ties away from zero (i.e. add one if m₁₁ is 1)
Round to nearest with ties toward even (i.e. add one if m₁₁ is 1 and at least one other of m₁₀ through m₂₃ is 1)

If e starts in the range e < -14 (i.e. underflow), then the resultant FP16 has to end up with e = -14, as it cannot go any smaller. Some number of m_i will be discarded, and so a rounding decision again has to be made. A fun bonus decision crops up when rounding toward ±∞ and the starting FP32 is denormal: should denormals be treated as zero? (for other rounding modes, denormal inputs will naturally round to zero) Regardless of rounding mode, if the resultant FP16 is denormal, then there's the decision of whether to flush to zero.

If e starts in the range 15 < e (i.e. overflow), then there's no particularly great FP16 to convert to, with the options being:

Infinity (with an appropriate sign)
e = 15, all m_i = 1 (i.e. a value of +65504 or -65504)
Throw an exception

Choosing between infinity and 65504 is yet again a rounding decision, with the unintuitive quirk that 65520 is considered to be nearer to infinity than to 65504.

If the starting FP32 is NaN, then there's choice about what to do with the sign bit (either preserve it, or force it to a particular value), and again a choice about what to do with the payload bits (either preserve some of them, or force them to a particular canonical pattern).

All said, there are clearly lots of decision points in the conversion process, so it is no surprise that different implementations make different choices. We'll look at a bunch of software and hardware implementations, and the choices that they make (as they don't always advertise their decisions).

The summary of implementations and most important choices is:

Implementation	Rounding	NaNs
numpy	Toward nearest, ties toward even	Preserve (1)
CPython	Toward nearest, ties toward even (†)	±Canonical
James Tursa (Matlab)	Toward nearest, ties away from zero	-Canonical
Fabian "ryg" Giesen	Toward nearest, details vary	±Canonical
Maratyszcza	Toward nearest, ties toward even	±Canonical
VCVTPS2PH (x86)	Configurable	Preserve (2)
FCVT / FCVTN (ARM)	Configurable	Configurable

(†) Except that overflows throw an exception
(1) Top 10 bits of payload preserved, then LSB set to 1 if required
(2) Top bit of payload set to 1, then next 9 bits preserved

Jumping into the details, we start with numpy, which consists of very well-commented bit manipulations:

uint16_t numpy_floatbits_to_halfbits(uint32_t f) {
  uint16_t h_sgn = (uint16_t)((f & 0x80000000u) >> 16);
  uint32_t f_exp = f & 0x7f800000u;
  uint32_t f_sig = f & 0x007fffffu;

  // Exponent overflow/NaN converts to signed inf/NaN
  if (f_exp >= 0x47800000u) {
    if ((f_exp == 0x7f800000u) && (f_sig != 0)) {
      // NaN - propagate the flag in the significand...
      uint16_t ret = (uint16_t)(0x7c00u + (f_sig >> 13));
      ret += (ret == 0x7c00u); // ...but make sure it stays a NaN
      return h_sgn + ret;
    } else {
      // (overflow to) signed inf
      return (uint16_t)(h_sgn + 0x7c00u);
    }
  }

  // Exponent underflow converts to a subnormal half or signed zero
  if (f_exp <= 0x38000000u) {
    // Signed zeros, subnormal floats, and floats with small
    // exponents all convert to signed zero half-floats.
    if (f_exp < 0x33000000u) {
      return h_sgn;
    }
    // Make the subnormal significand
    f_exp >>= 23;
    f_sig += 0x00800000u;
    f_sig >>= (113 - f_exp);
    // Handle rounding by adding 1 to the bit beyond half precision
    //
    // If the last bit in the half significand is 0 (already even),
    // and the remaining bit pattern is 1000...0, then we do not add
    // one to the bit after the half significand. However, the
    // (113 - f_exp) shift can lose up to 11 bits, so the || checks
    // them in the original. In all other cases, we can just add one.
    if (((f_sig & 0x3fffu) != 0x1000u) || (f & 0x07ffu)) {
      f_sig += 0x1000u;
    }
    uint16_t h_sig = (uint16_t)(f_sig >> 13);
    // If the rounding causes a bit to spill into h_exp, it will
    // increment h_exp from zero to one and h_sig will be zero.
    // This is the correct result.
    return (uint16_t)(h_sgn + h_sig);
  }

  // Regular case with no overflow or underflow
  uint16_t h_exp = (uint16_t)((f_exp - 0x38000000u) >> 13);
  // Handle rounding by adding 1 to the bit beyond half precision
  //
  // If the last bit in the half significand is 0 (already even), and
  // the remaining bit pattern is 1000...0, then we do not add one to
  // the bit after the half significand. In all other cases, we do.
  if ((f_sig & 0x3fffu) != 0x1000u) {
      f_sig += 0x1000u;
  }
  uint16_t h_sig = (uint16_t)(f_sig >> 13);
  // If the rounding causes a bit to spill into h_exp, it will
  // increment h_exp by one and h_sig will be zero. This is the
  // correct result. h_exp may increment to 15, at greatest, in
  // which case the result overflows to a signed inf.
  return (uint16_t)(h_sgn + h_exp + h_sig);
}

CPython takes a very different approach, involving copysign and frexp:

int PyFloat_Pack2(double x) {
  uint16_t sign, bits;
  int e;

  if (x == 0.0) {
    sign = (copysign(1.0, x) == -1.0);
    e = 0;
    bits = 0;
  } else if (isinf(x)) {
    sign = (x < 0.0);
    e = 0x1f;
    bits = 0;
  } else if (isnan(x)) {
    // There are 2046 distinct half-precision NaNs (1022 signaling
    // and 1024 quiet), but there are only two quiet NaNs that don't
    // arise by quieting a signaling NaN; we get those by setting
    // the topmost bit of the fraction field and clearing all other
    // fraction bits. We choose the one with the appropriate sign.
    sign = (copysign(1.0, x) == -1.0);
    e = 0x1f;
    bits = 0x200;
  } else {
    sign = (x < 0.0);
    if (sign) {
      x = -x;
    }

    double f = frexp(x, &e);
    // Normalize f to be in the range [1.0, 2.0)
    f *= 2.0;
    e--;

    if (e >= 16) {
      goto Overflow;
    } else if (e < -25) {
      // |x| < 2**-25. Underflow to zero.
      f = 0.0;
      e = 0;
    } else if (e < -14) {
      // |x| < 2**-14. Gradual underflow
      f = ldexp(f, 14 + e);
      e = 0;
    } else /* if (!(e == 0 && f == 0.0)) */ {
      e += 15;
      f -= 1.0; // Get rid of leading 1
    }

    f *= 1024.0; // 2**10
    // Round to even
    bits = (uint16_t)f; // Note the truncation
    assert(bits < 1024);
    assert(e < 31);
    if ((f - bits > 0.5) || ((f - bits == 0.5) && (bits % 2 == 1))) {
      if (++bits == 1024) {
        // The carry propagated out of a string of 10 1 bits.
        bits = 0;
        if (++e == 31) {
          goto Overflow;
        }
      }
    }
  }

  return (sign << 15) | (e << 10) | bits;
Overflow:
  PyErr_SetString(PyExc_OverflowError,
                  "float too large to pack with e format");
  return -1;
}

The "Round to even" part of this CPython code was explicitly put in to match the numpy behaviour, however it diverges in its treatment of finite inputs that can't plausibly be represented by a finite FP16 - numpy will round them to infinity, whereas CPython will throw. This can be seen in the following example:

import numpy
import struct

def np_f2h(x):
  return int(numpy.array(x, 'u4').view('f4').astype('f2').view('u2'))

def py_f2h(x):
  f, = struct.unpack('f', struct.pack('I', x))
  return struct.unpack('H', struct.pack('e', f))[0]

print(hex(np_f2h(0x49800000))) # 0x7c00
print(hex(py_f2h(0x49800000))) # OverflowError

The other difference is in treatment of NaN values, where numpy tries to preserve them as much as possible, whereas CPython only preserves the sign bit and otherwise turns all NaN values into a canonical quiet NaN:

print(hex(np_f2h(0xffffffff))) # 0xffff
print(hex(py_f2h(0xffffffff))) # 0xfe00

A different lineage of code comes from James Tursa. It was originally written for Matlab and requires a MathWorks account to download, but the interesting part found its way to GitHub regardless, and is transcribed with slight modifications here:

uint16_t tursa_floatbits_to_halfbits(uint32_t x) {
  uint32_t xs = x & 0x80000000u; // Pick off sign bit
  uint32_t xe = x & 0x7f800000u; // Pick off exponent bits
  uint32_t xm = x & 0x007fffffu; // Pick off mantissa bits
  uint16_t hs = (uint16_t)(xs >> 16); // Sign bit
  if (xe == 0) { // Denormal will underflow, return a signed zero
    return hs;
  }
  if (xe == 0x7f800000u) { // Inf or NaN
    if (xm == 0) { // If mantissa is zero ...
      return (uint16_t)(hs | 0x7c00u); // Signed Inf
    } else {
      return (uint16_t)0xfe00u; // NaN, only 1st mantissa bit set
    }
  }
  // Normalized number
  int hes = ((int)(xe >> 23)) - 127 + 15; // Exponent unbias the single, then bias the halfp
  if (hes >= 0x1f) { // Overflow
    return (uint16_t)(hs | 0x7c00u); // Signed Inf
  } else if (hes <= 0) { // Underflow
    uint16_t hm;
    if ((14 - hes) > 24) { // Mantissa shifted all the way off & no rounding possibility
      hm = (uint16_t)0u; // Set mantissa to zero
    } else {
      xm |= 0x00800000u; // Add the hidden leading bit
      hm = (uint16_t)(xm >> (14 - hes)); // Mantissa
      if ((xm >> (13 - hes)) & 1u) { // Check for rounding
        hm += (uint16_t)1u; // Round, might overflow into exp bit, but this is OK
      }
    }
    return (hs | hm); // Combine sign bit and mantissa bits, biased exponent is zero
  } else {
    uint16_t he = (uint16_t)(hes << 10); // Exponent
    uint16_t hm = (uint16_t)(xm >> 13); // Mantissa
    if (xm & 0x1000u) { // Check for rounding
      return (hs | he | hm) + (uint16_t)1u; // Round, might overflow to inf, this is OK
    } else {
      return (hs | he | hm); // No rounding
    }
  }
}

Though it shares a lot in spirit with the numpy code, this code has different rounding behaviour to what we've seen before; ties are rounded away from zero rather than towards even. It also has yet another take on NaNs: all input NaNs get turned into a single output NaN, with not even the sign bit preserved.

The code from James Tursa was adopted into ISPC, meaning that ISPC had the same rounding and NaN behaviours. Then came along Fabian "ryg" Giesen, deploying his usual optimisation skills, and also fixing the rounding behaviour (back to ties toward even) and the NaN behaviour (preserve sign bit) at the same time. Several variants were written along that journey (and then adopted back by ISPC), but I'll choose this one to showcase:

uint16_t float_to_half_fast3_rtne(uint32_t x) {
  uint32_t x_sgn = x & 0x80000000u;
  x ^= x_sgn;

  uint16_t o;
  if (x >= 0x47800000u) { // result is Inf or NaN
    o = (x > 0x7f800000u) ? 0x7e00  // NaN->qNaN
                          : 0x7c00; // and Inf->Inf
  } else { // (De)normalized number or zero
    if (x < 0x38800000u) { // resulting FP16 is subnormal or zero
      // use a magic value to align our 10 mantissa bits at
      // the bottom of the float. as long as FP addition
      // is round-to-nearest-even this just works.
      union { uint32_t u; float f; } f, denorm_magic;
      f.u = x;
      denorm_magic.u = ((127 - 14) + (23 - 10)) << 23;

      f.f += denorm_magic.f;

      // and one integer subtract of the bias later, we have our
      // final float!
      o = f.u - denorm_magic.u;
    } else {
      uint32_t mant_odd = (x >> 13) & 1; // resulting mantissa is odd

      // update exponent, rounding bias part 1
      x += ((15 - 127) << 23) + 0xfff;
      // rounding bias part 2
      x += mant_odd;
      // take the bits!
      o = x >> 13;
    }
  }
  return (x_sgn >> 16) | o;
}

The interesting part of the above is how it handles the e < -14 case. Both the numpy code and the code from James Tursa handled this case with a variable distance bitshift and then some fiddly rounding logic, whereas the above makes the observation that there's existing hardware in a contemporary CPU for doing shifting and fiddly rounding: the floating point addition circuit. The first step of a floating point addition circuit is a variable distance bitshift of the smaller operand, with the aim of making its exponent equal to that of the larger operand. The above engineers the larger operand to have e = -1, which is what we want, as FP16 bottoms out at e = -14 and there are 13 (23 - 10) extra mantissa bits in FP32 compared to FP16. Any bits that get shifted out as part of exponent equalisation will contribute to rounding, which is exactly what we want.

Taking the idea further, we have this gem from Maratyszcza:

uint16_t fp16_ieee_from_fp32_value(uint32_t x) {
  uint32_t x_sgn = x & 0x80000000u;
  uint32_t x_exp = x & 0x7f800000u;
  x_exp = (x_exp < 0x38800000u) ? 0x38800000u : x_exp; // max(e, -14)
  x_exp += 15u << 23; // e += 15
  x &= 0x7fffffffu; // Discard sign

  union { uint32_t u; float f; } f, magic;
  f.u = x;
  magic.u = x_exp;

  // If 15 < e then inf, otherwise e += 2
  f.f = (f.f * 0x1.0p+112f) * 0x1.0p-110f;

  // If we were in the x_exp >= 0x38800000u case:
  // Replace f's exponent with that of x_exp, and discard 13 bits of
  // f's significand, leaving the remaining bits in the low bits of
  // f, relying on FP addition being round-to-nearest-even. If f's
  // significand was previously `a.bcdefghijk`, then it'll become
  // `1.000000000000abcdefghijk`, from which `bcdefghijk` will become
  // the FP16 mantissa, and `a` will add onto the exponent. Note that
  // rounding might add one to all this.
  // If we were in the x_exp < 0x38800000u case:
  // Replace f's exponent with the minimum FP16 exponent, discarding
  // however many bits are required to make that happen, leaving
  // whatever is left in the low bits.
  f.f += magic.f;

  uint32_t h_exp = (f.u >> 13) & 0x7c00u; // low 5 bits of exponent
  uint32_t h_sig = f.u & 0x0fffu; // low 12 bits (10 are mantissa)
  h_sig = (x > 0x7f800000u) ? 0x0200u : h_sig; // any NaN -> qNaN
  return (x_sgn >> 16) + h_exp + h_sig;
}

The 15 < e case is elegantly folded into the infinity case by using the floating point multiply circuit and the constant 0x1.0p+112f. After that, the x_exp < 0x38800000u case plays out similarly to ryg's code: add a fixed constant to cause the result to appear in the low bits. The x_exp >= 0x38800000u case uses a dynamic magic number rather than a constant magic number, with the aim of shifting off exactly 13 bits, getting the rounding done by the floating point addition circuit, and again having the result appear in the low bits. The final trickery is in computing the FP16 exponent bits; starting with the 8 bits of FP32 exponent, the top 3 bits are discarded, and modulo-32 arithmetic is done on the low 5 bits. Modulo 32, the bias adjustment is +16 ((15-127) % 32), which gets done in two parts: a fixed +15, and then either +0 or +1 depending on whether the result is denormal or not.

Sadly, at some point, new hardware makes software tricks obsolete. On x86, said new hardware was the F16C instruction set, which (amongst other things) adds a VCVTPS2PH instruction for converting (a vector of) FP32 to FP16. Four different rounding modes are available: round to nearest with ties toward even, round toward +∞, round toward -∞, round toward zero. Note that round to nearest with ties away from zero isn't an option. An immediate operand specifies the rounding mode, or it can come from MXCSR.RC. NaN signs are preserved, and NaN payloads are mostly preserved: the MSB of the input payload is forced to 1, and then the top 10 bits of the input payload become the output payload. When rounding toward +∞ or -∞, MXCSR.DAZ causes input denormals to be treated as zero. On the other hand, MXCSR.FTZ is ignored, so denormal results are never flushed to zero. Code to exercise this instruction looks incredibly boring:

uint16_t cvtps2ph(uint32_t x) {
    __m128 xmm = _mm_castsi128_ps(_mm_cvtsi32_si128(x));
    return _mm_extract_epi16(_mm_cvtps_ph(xmm, 0), 0);
}

On ARMv8, we have FCVT (scalar) and FCVTN / FCVTN2 (vector), which are capable of performing FP32 to FP16 conversion. FPCR controls the rounding mode, with the same options as on x86. FPCR also controls denormal flushing, of both inputs and outputs. Finally, FPCR.DN controls whether NaNs are canonicalised (without even the sign bit preserved), or whether they have the same preservation behaviour as x86.