# Converting floats to strings, part 1

Much has been written about converting floating point numbers to strings: Bruce Dawson has lots to say, David M. Gay's venerable dtoa is a classic, and Florian Loitsch's relatively recent grisu paper is worth studying. Often the problem is framed as "converting a floating point number to the shortest possible decimal string representation", but this framing is neither neccessary nor sufficient for implementing the `%e`

/ `%f`

/ `%g`

formats of `sprintf`

. Furthermore, this framing introduces significant complexity. As such, I'd like to begin by considering a much simpler framing: convert a double-precision floating point number to a decimal string, such that the string gives the exact mathematical real number represented by the float.

First of all, we need a quick reminder of what an IEEE754 double-precision floating point number looks like under the hood. If we ignore negative numbers, infinities, NaNs, and denormals, then we have just an 11-bit exponent *e* and a 53-bit mantissa *m*, which together represent the number *m* (an integer) times 2^{e} (where *e* might be negative). Denormals also fit the pattern of *m* times 2^{e}, albeit with a different encoding of *m* and *e*.

Armed with this knowledge, we can pull the *m* and *e* fields out of a number:

```
typedef union {
double n;
uint64_t u64;
struct {
uint32_t lo;
uint32_t hi;
} u32;
} TValue;
void decode(double n) {
TValue t;
t.n = n;
if ((t.u32.hi << 1) >= 0xffe00000) {
if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
printf("NaN\n");
} else {
printf("Infinity\n");
}
} else {
int32_t e = (t.u32.hi >> 20) & 0x7ff;
uint64_t m = t.u32.hi & 0xfffff;
if (e == 0) {
e++;
} else {
m |= 0x100000;
}
e -= 1043;
if (t.u32.lo) {
e -= 32;
m = (m << 32) | t.u32.lo;
}
printf("%llu * 2^%d\n", (long long unsigned)m, (int)e);
}
}
```

Some example outputs of this `decode`

function are:

n | Output |
---|---|

`0./0.` | `NaN` |

`1./0.` | `Infinity` |

`0.` | `0 * 2^-1042` |

`1.` | `1048576 * 2^-20` |

`10.` | `1310720 * 2^-17` |

`0.1` | `7205759403792794 * 2^-56` |

`1e-308` | `2024022533073106 * 2^-1074` |

`pow(2, 1020)` | `1048576 * 2^1000` |

These outputs are not in decimal, and they're not particularly convenient for human comprehension, but they are 100% accurate.

From *m* and *e* values, one classical approach for getting decimal digits is:

- Convert
*m*to a binary bignum (easy). - Multiply the binary bignum by 2
^{e}(easy). - In a loop, compute the binary bignum modulo ten to get one digit (hard), and then divide the binary bignum by ten (hard).

I prefer the following approach, as it gets rid of the hard part:

- Convert
*m*to a**decimal**bignum (not easy, but not hard). - Multiply the decimal bignum by 2
^{e}(not easy, but not hard). - In a loop, print the digits of the decimal bignum (easy).

What do I mean by binary bignum and decimal bignum? A binary bignum is a number stored as multiple `uint32_t`

pieces, with each piece being in the range 0 through 2^{32}-1, and the number being `sum(piece`

. On the other hand, a decimal bignum is a number stored as multiple _{i} * (2^{32})^{i})`uint32_t`

pieces, with each piece being in the range 0 through 10^{9}-1, and the number being `sum(piece`

. 10_{i} * (10^{9})^{i})^{9} is chosen as it is the largest power of ten which fits into a `uint32_t`

(even better, 10^{9} requires 29.9 bits to store, so only 2.1 bits out of every 32 are wasted).

I'd like to represent a decimal bignum using the following variables:

```
uint32_t nd[128];
int32_t ndlo;
int32_t ndhi;
```

For `ndlo <= i <= ndhi`

, `piece`

is _{i}`nd[i & 127]`

, and otherwise `piece`

is zero._{i}

As it happens, we already know how to print the digits of an individual decimal bignum piece. Using one of those functions from last time, printing an entire decimal bignum is painless:

```
void nd_print(char* p, uint32_t* nd, int32_t ndlo, int32_t ndhi) {
int32_t i;
for (i = ndhi; i >= 0; --i) {
nasonov9(p, nd[i & 127]); p += 9;
}
*p++ = '.';
for (; i >= ndlo; --i) {
nasonov9(p, nd[i & 127]); p += 9;
}
*p = 0;
}
```

Multiplying by 2^{e} takes a bit of effort. Let's start by considering the case where *e* is negative, at which point we're really dividing by 2^{-e}. In turn, this is dividing by 2 -*e* times. Just dividing by 2 doesn't take that much code:

```
int32_t nd_div2(uint32_t* nd, int32_t ndlo, int32_t ndhi) {
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> 1) + carry;
carry = (val & 1) * 500000000;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
return ndlo;
}
```

We can generalise this to dividing by 2^{k}:

```
int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
uint32_t k) {
uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> k) + carry;
carry = (val & mask) * mul;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
return ndlo;
}
```

The above will work for `k`

between zero and nine inclusive; if `k`

is larger than nine, then `1000000000 >> k`

can no longer be represented by an integer. As such, the complete solution has to start by dividing in batches of 2^{9}:

```
int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
uint32_t k) {
while (k >= 9) {
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> 9) + carry;
carry = (val & 0x1ff) * 1953125;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
k -= 9;
}
if (k) {
uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> k) + carry;
carry = (val & mask) * mul;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
}
return ndlo;
}
```

We can then go through the same process for multiplying, starting with multiplication by 2:

```
int32_t nd_mul2(uint32_t* nd, int32_t ndhi) {
uint32_t carry_in = 0;
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint32_t val = (nd[i] << 1) | carry_in;
carry_in = val / 1000000000;
nd[i] = val - carry_in * 1000000000;
}
if (carry_in) nd[++ndhi] = carry_in;
return ndhi;
}
```

By promoting `val`

to 64-bits, this can be generalised to small `k`

:

```
int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k) {
uint32_t carry_in = 0;
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
carry_in = (uint32_t)(val / 1000000000);
nd[i] = (uint32_t)val - carry_in * 1000000000;
}
if (carry_in) nd[++ndhi] = carry_in;
return ndhi;
}
```

This time the constraint on `k`

comes from wanting `val`

to be no more than (10^{9})^{2}, which limits `k`

to 29. It also turns out to be useful to make `carry_in`

a parameter, all of which leads to the complete code for multiplying by 2^{k}:

```
int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k,
uint32_t carry_in) {
while (k >= 29) {
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint64_t val = ((uint64_t)nd[i] << 29) | carry_in;
carry_in = (uint32_t)(val / 1000000000);
nd[i] = (uint32_t)val - carry_in * 1000000000;
}
if (carry_in) {
nd[++ndhi] = carry_in; carry_in = 0;
}
k -= 29;
}
if (k) {
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
carry_in = (uint32_t)(val / 1000000000);
nd[i] = (uint32_t)val - carry_in * 1000000000;
}
if (carry_in) nd[++ndhi] = carry_in;
}
return ndhi;
}
```

We can plug these routines into the `decode`

function from earlier to create a `print`

function:

```
void print(double n) {
TValue t;
t.n = n;
if ((t.u32.hi << 1) >= 0xffe00000) {
if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
printf("NaN\n");
} else {
printf("Infinity\n");
}
} else {
char buf[1154];
uint32_t nd[128];
int32_t ndlo = 0;
int32_t ndhi = 0;
int32_t e = (t.u32.hi >> 20) & 0x7ff;
nd[0] = t.u32.hi & 0xfffff;
if (e == 0) {
e++;
} else {
nd[0] |= 0x100000;
}
e -= 1043;
if (t.u32.lo) {
e -= 32;
nd[0] = (nd[0] << 3) | (t.u32.lo >> 29);
ndhi = nd_mul2k(nd, ndhi, 29, t.u32.lo & 0x1fffffff);
}
if (e >= 0) {
ndhi = nd_mul2k(nd, ndhi, (uint32_t)e, 0);
} else {
ndlo = nd_div2k(nd, ndlo, ndhi, (uint32_t)-e);
}
nd_print(buf, nd, ndlo, ndhi);
printf("%s\n", buf);
}
}
```

Some example outputs of this `print`

function are:

n | Output |
---|---|

`0./0.` | `NaN` |

`1./0.` | `Infinity` |

`0.` | `000000000.` |

`1.` | `000000001.` |

`10.` | `000000010.` |

`0.1` | `000000000000000000.1000000000000000055511` |

`1e-308` | `000000000000000000.0000000000000000000000` |

`pow(2, 1020)` | `00000001123558209288947442330815744243140` |

Some of these outputs could be tidied up by trimming leading and trailing zeroes, but other than that, this is the output we were aiming for. We can see that the double-precision floating point number closest to 0.1 is actually ever so slightly more than 0.1, that the double-precision floating point number closest to 1e-308 is actually ever so slightly less than 1e-308, and that in some cases the string representation of the exact decimal represented by a float can be extremely long (though, to be fair, 1e-308 is almost as bad as it gets for length).

Next time we'll see how to adapt `nd_print`

into something which can behave like the `%e`

, `%f`

, and `%g`

formats of `sprintf`

(i.e. outputs which you might actually want, as opposed to pedantic exact representations).