Much has been written about converting floating point numbers to strings: Bruce Dawson has lots to say, David M. Gay's venerable dtoa is a classic, and Florian Loitsch's relatively recent grisu paper is worth studying. Often the problem is framed as "converting a floating point number to the shortest possible decimal string representation", but this framing is neither neccessary nor sufficient for implementing the %e / %f / %g formats of sprintf. Furthermore, this framing introduces significant complexity. As such, I'd like to begin by considering a much simpler framing: convert a double-precision floating point number to a decimal string, such that the string gives the exact mathematical real number represented by the float.

First of all, we need a quick reminder of what an IEEE754 double-precision floating point number looks like under the hood. If we ignore negative numbers, infinities, NaNs, and denormals, then we have just an 11-bit exponent e and a 53-bit mantissa m, which together represent the number m (an integer) times 2e (where e might be negative). Denormals also fit the pattern of m times 2e, albeit with a different encoding of m and e.

Armed with this knowledge, we can pull the m and e fields out of a number:

typedef union {
  double n;
  uint64_t u64;
  struct {
    uint32_t lo;
    uint32_t hi;
  } u32;
} TValue;

void decode(double n) {
  TValue t;
  t.n = n;
  if ((t.u32.hi << 1) >= 0xffe00000) {
    if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
      printf("NaN\n");
    } else {
      printf("Infinity\n");
    }
  } else {
    int32_t e = (t.u32.hi >> 20) & 0x7ff;
    uint64_t m = t.u32.hi & 0xfffff;
    if (e == 0) {
      e++;
    } else {
      m |= 0x100000;
    }
    e -= 1043;
    if (t.u32.lo) {
      e -= 32;
      m = (m << 32) | t.u32.lo;
    }
    printf("%llu * 2^%d\n", (long long unsigned)m, (int)e);
  }
}

Some example outputs of this decode function are:

nOutput
0./0.NaN
1./0.Infinity
0.0 * 2^-1042
1.1048576 * 2^-20
10.1310720 * 2^-17
0.17205759403792794 * 2^-56
1e-3082024022533073106 * 2^-1074
pow(2, 1020)1048576 * 2^1000

These outputs are not in decimal, and they're not particularly convenient for human comprehension, but they are 100% accurate.

From m and e values, one classical approach for getting decimal digits is:

  1. Convert m to a binary bignum (easy).
  2. Multiply the binary bignum by 2e (easy).
  3. In a loop, compute the binary bignum modulo ten to get one digit (hard), and then divide the binary bignum by ten (hard).

I prefer the following approach, as it gets rid of the hard part:

  1. Convert m to a decimal bignum (not easy, but not hard).
  2. Multiply the decimal bignum by 2e (not easy, but not hard).
  3. In a loop, print the digits of the decimal bignum (easy).

What do I mean by binary bignum and decimal bignum? A binary bignum is a number stored as multiple uint32_t pieces, with each piece being in the range 0 through 232-1, and the number being sum(piecei * (232)i). On the other hand, a decimal bignum is a number stored as multiple uint32_t pieces, with each piece being in the range 0 through 109-1, and the number being sum(piecei * (109)i). 109 is chosen as it is the largest power of ten which fits into a uint32_t (even better, 109 requires 29.9 bits to store, so only 2.1 bits out of every 32 are wasted).

I'd like to represent a decimal bignum using the following variables:

uint32_t nd[128];
int32_t ndlo;
int32_t ndhi;

For ndlo <= i <= ndhi, piecei is nd[i & 127], and otherwise piecei is zero.

As it happens, we already know how to print the digits of an individual decimal bignum piece. Using one of those functions from last time, printing an entire decimal bignum is painless:

void nd_print(char* p, uint32_t* nd, int32_t ndlo, int32_t ndhi) {
  int32_t i;
  for (i = ndhi; i >= 0; --i) {
    nasonov9(p, nd[i & 127]); p += 9;
  }
  *p++ = '.';
  for (; i >= ndlo; --i) {
    nasonov9(p, nd[i & 127]); p += 9;
  }
  *p = 0;
}

Multiplying by 2e takes a bit of effort. Let's start by considering the case where e is negative, at which point we're really dividing by 2-e. In turn, this is dividing by 2 -e times. Just dividing by 2 doesn't take that much code:

int32_t nd_div2(uint32_t* nd, int32_t ndlo, int32_t ndhi) {
  uint32_t i = ndhi & 127, carry = 0;
  for (;;) {
    uint32_t val = nd[i];
    nd[i] = (val >> 1) + carry;
    carry = (val & 1) * 500000000;
    if (i == (ndlo & 127)) break;
    i = (i - 1) & 127;
  }
  if (carry) nd[--ndlo & 127] = carry;
  return ndlo;
}

We can generalise this to dividing by 2k:

int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
                 uint32_t k) {
  uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
  uint32_t i = ndhi & 127, carry = 0;
  for (;;) {
    uint32_t val = nd[i];
    nd[i] = (val >> k) + carry;
    carry = (val & mask) * mul;
    if (i == (ndlo & 127)) break;
    i = (i - 1) & 127;
  }
  if (carry) nd[--ndlo & 127] = carry;
  return ndlo;
}

The above will work for k between zero and nine inclusive; if k is larger than nine, then 1000000000 >> k can no longer be represented by an integer. As such, the complete solution has to start by dividing in batches of 29:

int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
                 uint32_t k) {
  while (k >= 9) {
    uint32_t i = ndhi & 127, carry = 0;
    for (;;) {
      uint32_t val = nd[i];
      nd[i] = (val >> 9) + carry;
      carry = (val & 0x1ff) * 1953125;
      if (i == (ndlo & 127)) break;
      i = (i - 1) & 127;
    }
    if (carry) nd[--ndlo & 127] = carry;
    k -= 9;
  }
  if (k) {
    uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
    uint32_t i = ndhi & 127, carry = 0;
    for (;;) {
      uint32_t val = nd[i];
      nd[i] = (val >> k) + carry;
      carry = (val & mask) * mul;
      if (i == (ndlo & 127)) break;
      i = (i - 1) & 127;
    }
    if (carry) nd[--ndlo & 127] = carry;
  }
  return ndlo;
}

We can then go through the same process for multiplying, starting with multiplication by 2:

int32_t nd_mul2(uint32_t* nd, int32_t ndhi) {
  uint32_t carry_in = 0;
  for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
    uint32_t val = (nd[i] << 1) | carry_in;
    carry_in = val / 1000000000;
    nd[i] = val - carry_in * 1000000000;
  }
  if (carry_in) nd[++ndhi] = carry_in;
  return ndhi;
}

By promoting val to 64-bits, this can be generalised to small k:

int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k) {
  uint32_t carry_in = 0;
  for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
    uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
    carry_in = (uint32_t)(val / 1000000000);
    nd[i] = (uint32_t)val - carry_in * 1000000000;
  }
  if (carry_in) nd[++ndhi] = carry_in;
  return ndhi;
}

This time the constraint on k comes from wanting val to be no more than (109)2, which limits k to 29. It also turns out to be useful to make carry_in a parameter, all of which leads to the complete code for multiplying by 2k:

int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k,
                 uint32_t carry_in) {
  while (k >= 29) {
    for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
      uint64_t val = ((uint64_t)nd[i] << 29) | carry_in;
      carry_in = (uint32_t)(val / 1000000000);
      nd[i] = (uint32_t)val - carry_in * 1000000000;
    }
    if (carry_in) {
      nd[++ndhi] = carry_in; carry_in = 0;
    }
    k -= 29;
  }
  if (k) {
    for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
      uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
      carry_in = (uint32_t)(val / 1000000000);
      nd[i] = (uint32_t)val - carry_in * 1000000000;
    }
    if (carry_in) nd[++ndhi] = carry_in;
  }
  return ndhi;
}

We can plug these routines into the decode function from earlier to create a print function:

void print(double n) {
  TValue t;
  t.n = n;
  if ((t.u32.hi << 1) >= 0xffe00000) {
    if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
      printf("NaN\n");
    } else {
      printf("Infinity\n");
    }
  } else {
    char buf[1154];
    uint32_t nd[128];
    int32_t ndlo = 0;
    int32_t ndhi = 0;
    int32_t e = (t.u32.hi >> 20) & 0x7ff;
    nd[0] = t.u32.hi & 0xfffff;
    if (e == 0) {
      e++;
    } else {
      nd[0] |= 0x100000;
    }
    e -= 1043;
    if (t.u32.lo) {
      e -= 32;
      nd[0] = (nd[0] << 3) | (t.u32.lo >> 29);
      ndhi = nd_mul2k(nd, ndhi, 29, t.u32.lo & 0x1fffffff);
    }
    if (e >= 0) {
      ndhi = nd_mul2k(nd, ndhi, (uint32_t)e, 0);
    } else {
      ndlo = nd_div2k(nd, ndlo, ndhi, (uint32_t)-e);
    }
    nd_print(buf, nd, ndlo, ndhi);
    printf("%s\n", buf);
  }
}

Some example outputs of this print function are:

nOutput
0./0.NaN
1./0.Infinity
0.000000000.
1.000000001.
10.000000010.
0.1000000000000000000.1000000000000000055511
15123125782702118158340454101562500000000
1e-308000000000000000000.0000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000000
00000000000000000000000000000000000000009
99999999999999909326625337248461995470488
73403204569370722504933164788134100221702
36685306110285951575783017584918228243784
38792553200763769833775473829862512856683
41346193998972906543693727922885247662294
86591679434355446221493480729436132941672
16662821737555414480159115639791276054897
20142038977058035153396077150619905566488
97702602917109778267250244017165230316273
90652604144008597950935492433262042405635
56399326294969169893097546113480479123599
46979384052000893178607312050101591177117
04697471514344499487123311264707354172378
09953873785021982614510236627959137966047
18812599767273565216024053297899062477635
21525981391443887618575275588619928089116
90506171197530846785775640581096161907433
18668839610809435427125598308539800029848
265694454312324523925781250000000
pow(2, 1020)00000001123558209288947442330815744243140
45851123561183894160795893800723582922378
43810195794279832650471001320007117491962
08485367436055090103890580296441496713277
36104933390540928297688887250778808824658
17684505312860552384417646403930092119569
40880170232270940691778664363999670287115
4982269052209770601514008576.

Some of these outputs could be tidied up by trimming leading and trailing zeroes, but other than that, this is the output we were aiming for. We can see that the double-precision floating point number closest to 0.1 is actually ever so slightly more than 0.1, that the double-precision floating point number closest to 1e-308 is actually ever so slightly less than 1e-308, and that in some cases the string representation of the exact decimal represented by a float can be extremely long (though, to be fair, 1e-308 is almost as bad as it gets for length).

Next time we'll see how to adapt nd_print into something which can behave like the %e, %f, and %g formats of sprintf (i.e. outputs which you might actually want, as opposed to pedantic exact representations).