In the last episode we talked about the data representation of integers, a kind of fixed-point number. Today we’re going to learn about floating-point numbers.

Floating-point numbers are used to *approximate* real numbers. Because
everything in a computer is, eventually, just a finite sequence of bits, the
representation of floating-point numbers has to make trade-offs between
*range* and *precision*.

Because floating-point arithmetic is computationally complex, CPUs also have a dedicated set of instructions to accelerate it.

## Terminology

The terminology of floating-point numbers comes from *scientific notation*,
where a real number can be represented as follows:

```
1.2345 = 12345 × 10 ** -4
         -----   --    --
  significand^   ^base ^exponent
```

- *significand*, or *mantissa*: the significant digits
- *base*, or *radix*
- *exponent*, or *power*

So where is the *floating point*? It’s the `.` of `1.2345`. Imagine the dot
floating one place to the left, which gives the representation `.12345`
(with the exponent adjusted to compensate).

The dot is called the *radix point*: to us it looks like a *decimal point*,
but in a computer it’s really a *binary point*.

Now it becomes clear that, to represent a floating-point number in a computer,
we simply assign some bits to the *significand*, some to the *exponent*, and
potentially one bit to the *sign*, and that’s it.
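As a quick sanity check of the notation, here is a minimal C sketch that rebuilds a value from an integer significand, a base and an exponent. The function name `from_scientific` is made up for this illustration and is not part of any library.

```c
// Hypothetical helper: value = significand × base^exponent.
// The power is accumulated first, so the result needs a single
// multiplication or division at the end.
double from_scientific(long significand, int base, int exponent) {
    double power = 1.0;
    int n = exponent < 0 ? -exponent : exponent;
    for (int i = 0; i < n; i++) {
        power *= base;
    }
    return exponent < 0 ? significand / power : significand * power;
}
```

Calling `from_scientific(12345, 10, -4)` gives the `double` closest to `1.2345`. Moving the radix point only changes how the work is split between the significand and the exponent.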

## IEEE-754 32-bit Single-Precision Floats

It was called **single** back in IEEE-754-1985 and is now called **binary32**
in the relatively new IEEE-754-2008 standard.

```
      (8 bits)   (23 bits)
sign  exponent   fraction
0     011 1111 1 000 0000 0000 0000 0000 0000
31    30 .... 23 22 ....................... 0
```

- The *sign* part takes 1 bit to indicate the sign of the float (`0` for `+` and `1` for `-`). This is the same treatment as sign-magnitude.
- The *exponent* part takes 8 bits and uses the *offset-binary (biased) form* to represent a signed integer. It’s a variant form, since it reserves the `-127` pattern (all 0s) for zero and the `+128` pattern (all 1s) for non-numbers, so the exponent ranges only over `[-126, 127]` instead of `[-127, 128]`. It then chooses a zero offset of `127` among these 254 values (like using `128` in *excess-128*), a.k.a. the *exponent bias* in the standard.
- The *fraction* part takes 23 bits with an *implicit leading bit* `1`, and represents the actual *significand* with a total precision of 24 bits.
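The three fields described above can be pulled apart with shifts and masks. The following is a minimal sketch; `Binary32Fields` and `split_binary32` are names invented for this example.

```c
#include <stdint.h>

// Hypothetical struct for this sketch: the three raw fields of a binary32.
typedef struct {
    uint32_t sign;      // 1 bit:   bit 31
    uint32_t exponent;  // 8 bits:  bits 30..23, still biased by 127
    uint32_t fraction;  // 23 bits: bits 22..0, without the implicit leading 1
} Binary32Fields;

Binary32Fields split_binary32(uint32_t bits) {
    Binary32Fields f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For the bit pattern `0x3f800000` this yields sign `0`, exponent field `127` and fraction `0`.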

Don’t be confused by why it’s called *fraction* instead of *significand*!
It’s because the 23 bits in the representation hold only the fractional part
of the real significand in scientific notation.

The floating-point version of “scientific notation” is more like:

```
(leading 1)
1.fraction × 2 ^ exponent × sign
  (base-2)       (base-2)
```

So what number does the above bits represent?

```
S  F     × 2^E = R
+  1.(0) × 2^0 = 1
```

Aha! It’s the real number `1`!
Recall that `E = 0` here because the exponent uses a biased representation:
the stored field `0b0111 1111 = 127` minus the bias `127` gives `0`.

We will add more non-trivial examples later.
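In the meantime, the decoding steps above can be sketched in C. This is only an illustration for *normal* numbers; the function name `decode_normal_binary32` is hypothetical, and the special exponent fields `0x00` and `0xFF` are deliberately ignored.

```c
#include <stdint.h>

// Sketch: decode a *normal* binary32 bit pattern by hand.
// Zeros, subnormals, infinities and NaNs (exponent field 0x00 or 0xFF)
// are not handled here.
double decode_normal_binary32(uint32_t bits) {
    int sign          = (int)(bits >> 31);
    int exponent      = (int)((bits >> 23) & 0xFFu) - 127; // remove the bias
    uint32_t fraction = bits & 0x7FFFFFu;

    // 1.fraction: the implicit leading 1 plus fraction / 2^23
    double significand = 1.0 + (double)fraction / 8388608.0;

    // Multiply by 2^exponent with plain loops, so no math library is needed.
    double scale = 1.0;
    for (int i = 0; i < exponent; i++) scale *= 2.0;
    for (int i = 0; i > exponent; i--) scale /= 2.0;

    double value = significand * scale;
    return sign ? -value : value;
}
```

With the example bits, `decode_normal_binary32(0x3f800000)` evaluates to `1.0`, and `0x41100000` decodes to `9.0`.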

## Demoing Floats in C/C++

Writing sample code that converts between binary (in hex) and floats is not as straightforward as it is for integers. Luckily, there are still some hacks to do it:

### C - Unsafe Cast

We unsafely cast a pointer to reinterpret the same bits as a different type.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    float f1 = 0x3f800000;    // an integer literal: converted by value, not by bits
    printf("%f \n", f1);      // 1065353216.000000 (???)
    uint32_t u2 = 0x3f800000;
    float *f2 = (float *)&u2; // unsafe cast (breaks the strict aliasing rule)
    printf("%f \n", *f2);     // 1.000000
}
```
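Reading a `uint32_t` through a `float*` violates C’s aliasing rules, so compilers are allowed to miscompile it. A commonly recommended alternative is to copy the bytes with `memcpy`. The sketch below assumes `float` is a 32-bit IEEE-754 type; the helper names are made up for this post.

```c
#include <stdint.h>
#include <string.h>

// Copy the 4 bytes of a uint32_t into a float (assumes float is binary32).
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// The opposite direction: expose the bit pattern of a float.
uint32_t float_to_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

`bits_to_float(0x3f800000)` gives `1.0f`, matching the pointer-cast result, and mainstream compilers typically turn the `memcpy` into a plain register move.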

### C - Union Trick

Oh I really enjoyed this one… A union in C is not only an untagged union: its members also share the exact same chunk of memory. So we are doing the same reinterpretation, but in a more structured and technically fancier way.

```c
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    float pi = (float)M_PI;
    union {
        float f;
        uint32_t u;
    } f2u = { .f = pi }; // we store the data as a float
    printf("pi : %f\n : 0x%" PRIx32 "\n", pi, f2u.u); // but read it back as uint32_t
}
```

```
pi : 3.141593
 : 0x40490fdb
```

N.B. this trick is well-known as type punning:

> In computer science, type punning is a common term for any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

### C++ - `reinterpret_cast`

C++ provides this kind of type punning in the standard language:

```cpp
uint32_t u = 0x40490fdb;
float a = *reinterpret_cast<float*>(&u);
std::cout << a; // 3.14159
```

N.B. it still needs to be a conversion between pointers; see https://en.cppreference.com/w/cpp/language/reinterpret_cast.

Besides, C++17 adds a floating-point literal that can take hex, but it works in a different way, using an explicit radix point in the hex digits:

```cpp
float f = 0x1.2p3; // hex 1.2 (= 1.125 in decimal) times 2^3
std::cout << f;    // 9
```
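In C, hexadecimal floating-point literals of this form have been available since C99, and they map neatly onto the bit layout. The sketch below assumes a 32-bit IEEE-754 `float`; the helper names are invented for the example.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: return the raw bits of a float (assumes binary32).
uint32_t bits_of(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

// 0x1.2p3f = (1 + 2/16) × 2^3 = 9.0
float nine_from_hex(void) { return 0x1.2p3f; }

// 0x1.921fb6p+1f is the float closest to pi. The hex digits after "1." are
// the 23 fraction bits (0x490fdb) shifted left by one, and p+1 is the
// unbiased exponent (field value 128 minus the bias 127).
float pi_from_hex(void) { return 0x1.921fb6p+1f; }
```

Here `bits_of(pi_from_hex())` is `0x40490fdb`, the same pattern the union trick printed for pi.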

Let’s try the other direction:

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    double qNan = std::numeric_limits<double>::quiet_NaN();
    printf("0x%" PRIx64 "\n", *reinterpret_cast<uint64_t*>(&qNan));
    // 0x7ff8000000000000, the canonical qNaN!
}
```

## Representation of Non-Numbers

There is more in IEEE-754!

Real numbers don’t satisfy the closure property as integers do. Notably, the
set of real numbers is NOT closed under division! It could produce non-number
results such as **infinity** (e.g. `1/0`) and **NaN (Not-a-Number)** (e.g.
taking the square root of a negative number).

It would be algebraically ideal if the set of floating-point numbers were closed under all floating-point arithmetic. That would make many people’s lives easier. So the IEEE made it so! Non-number values are squeezed in.
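To see the closure in action, here is a small C sketch that produces both kinds of non-number through ordinary arithmetic. It assumes the platform follows IEEE-754 semantics for `float` (strictly speaking, C only guarantees this behaviour under its optional Annex F); the function names are made up for the example.

```c
// Assumes IEEE-754 semantics for float.
// `volatile` keeps the compiler from folding the divisions at compile time.
float make_infinity(void) {
    volatile float zero = 0.0f;
    return 1.0f / zero;   // +infinity
}

float make_nan(void) {
    volatile float zero = 0.0f;
    return zero / zero;   // NaN
}

// NaN is the only value that compares unequal to itself.
int is_nan(float x) {
    return x != x;
}
```

The results stay inside the floating-point set: `make_infinity() + 1.0f` is still infinity, and any arithmetic on `make_nan()` yields another NaN.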

We will also include the two zeros (`+0`/`-0`) in the comparison here,
since they are special too, being encoded with the reserved `0x00` exponent:

```
binary                                | hex       |
--------------------------------------------------------
0 00000000 00000000000000000000000  = 0000 0000 = +0
1 00000000 00000000000000000000000  = 8000 0000 = −0
0 11111111 00000000000000000000000  = 7f80 0000 = +infinity
1 11111111 00000000000000000000000  = ff80 0000 = −infinity
_ 11111111 10000000000000000000000  = _fc0 0000 = qNaN (canonical)
_ 11111111 00000000000000000000001  = _f80 0001 = sNaN (one of them)
```

```
      (8 bits)  (23 bits)
sign  exponent  fraction
0     00        0 ...0 0  = +0
1     00        0 ...0 0  = -0
0     FF        0 ...0 0  = +infinity
1     FF        0 ...0 0  = -infinity
_     FF        1 ...0 0  = qNaN (canonical)
_     FF        0 ...0 1  = sNaN (one of them)
```

The encodings of qNaN and sNaN are not strictly mandated by IEEE 754 and have been implemented differently on different processors. Luckily, both the x86 and ARM families use the most significant bit of the fraction to indicate whether a NaN is quiet.
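The table can be turned into a small classifier. This is a sketch that hard-codes the x86/ARM convention mentioned above (most significant fraction bit set means quiet); the enum and function names are invented for this post.

```c
#include <stdint.h>

typedef enum { F32_ZERO, F32_INFINITY, F32_QNAN, F32_SNAN, F32_OTHER } F32Class;

// Sketch: classify a binary32 bit pattern, assuming the convention that
// the most significant fraction bit (bit 22) marks a quiet NaN.
F32Class classify_binary32(uint32_t bits) {
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x7FFFFFu;

    if (exponent == 0x00u && fraction == 0) return F32_ZERO;
    if (exponent == 0xFFu) {
        if (fraction == 0)        return F32_INFINITY;
        if (fraction & 0x400000u) return F32_QNAN;
        return F32_SNAN;
    }
    return F32_OTHER; // every remaining pattern is an ordinary finite number
}
```

The sign bit is ignored throughout, which is why the `_` rows of the table match either sign.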

### More on NaN

If we look carefully into the IEEE 754-2008 spec, on *page 35, section 6.2.1*,
it actually defines anything with exponent `FF` that is not an infinity (an
infinity has all the fraction bits being `0`) as a NaN!

> All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0.

That implies we actually have `2 ** 24 - 2` NaNs in a 32-bit float!
The `24` comes from the `1` sign bit plus the `23` fraction bits, and the `2`
excluded are `+/- inf`.
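The count is small enough to verify by brute force. The sketch below (the function name is hypothetical) walks every bit pattern whose exponent field is all ones and counts those with a non-zero fraction.

```c
#include <stdint.h>

// Count the binary32 bit patterns that are NaN by brute force.
// Only patterns with an all-ones exponent can be NaN, so it is enough to
// walk the 2 signs × 2^23 fractions instead of all 2^32 patterns.
uint32_t count_binary32_nans(void) {
    uint32_t count = 0;
    for (uint32_t sign = 0; sign < 2; sign++) {
        for (uint32_t fraction = 0; fraction < (1u << 23); fraction++) {
            uint32_t bits = (sign << 31) | (0xFFu << 23) | fraction;
            if ((bits & 0x7FFFFFu) != 0) {  // fraction 0 would be ±infinity
                count++;
            }
        }
    }
    return count;
}
```

The function returns `16777214`, which is exactly `2 ** 24 - 2`.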

The continuous 22 bits inside the fraction look like quite a waste, and there
are even 51 such bits in a `double`! We will see how to make them useful
in later episodes (spoiler: they are known as the *NaN payload*).

It’s also worth noting how odd it is that the IEEE chose the MSB of the fraction instead of the sign bit to mark whether a NaN is quiet or signaling:

> It seems strange to me that the bit which signifies whether or not the NaN is signaling is the top bit of the mantissa rather than the sign bit; perhaps something about how floating point pipelines are implemented makes it less natural to use the sign bit to decide whether or not to raise a signal. – https://anniecherkaev.com/the-secret-life-of-nan

I guess it might be something related to the CPU pipeline? I don’t know yet.

## IEEE-754 64-bit Double-Precision Floats

Now, the 64-bit version of floating-point numbers, known as `double`, is just
a matter of scale:

```
      (11 bits)  (52 bits)
sign  exponent   fraction
0
63    62 .... 52 51 ....................... 0
```
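The same shift-and-mask approach works for the wider layout. In this sketch the names `Binary64Fields` and `split_binary64` are invented, and the bias of `1023` comes from the binary64 format definition rather than from the diagram above.

```c
#include <stdint.h>

// Hypothetical struct for this sketch: the three raw fields of a binary64.
typedef struct {
    uint64_t sign;      // 1 bit:   bit 63
    uint64_t exponent;  // 11 bits: bits 62..52, biased by 1023
    uint64_t fraction;  // 52 bits: bits 51..0, without the implicit leading 1
} Binary64Fields;

Binary64Fields split_binary64(uint64_t bits) {
    Binary64Fields f;
    f.sign     = bits >> 63;
    f.exponent = (bits >> 52) & 0x7FFu;
    f.fraction = bits & ((1ull << 52) - 1);
    return f;
}
```

For `0x7ff8000000000000`, the qNaN printed earlier, this gives an all-ones exponent field (`0x7ff`) and a fraction with only its top bit set.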

## IEEE-754-2008 16-bit Short Floats

The 2008 edition of IEEE-754 also standardizes the `short float` (called
**binary16** in the standard), which is in neither the C nor the C++ standard,
though compiler extensions might include it.

It looks like:

```
1 sign bit | 5 exponent bits | 10 fraction bits
S            E E E E E         M M M M M M M M M M
```
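Since most C compilers have no native type for it, a 16-bit float is usually handled as a raw `uint16_t`. Below is a minimal decoding sketch for *normal* values only; the function name is hypothetical, and the bias of `15` comes from the binary16 definition rather than from the layout above.

```c
#include <stdint.h>

// Sketch: decode a *normal* binary16 bit pattern (exponent field 1..30).
// Zeros, subnormals, infinities and NaNs are not handled.
double decode_normal_binary16(uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;
    int exponent = ((bits >> 10) & 0x1F) - 15;     // remove the bias of 15
    int fraction = bits & 0x3FF;

    double significand = 1.0 + fraction / 1024.0;  // implicit leading 1

    // Multiply by 2^exponent with plain loops, so no math library is needed.
    double scale = 1.0;
    for (int i = 0; i < exponent; i++) scale *= 2.0;
    for (int i = 0; i > exponent; i--) scale /= 2.0;

    double value = significand * scale;
    return sign ? -value : value;
}
```

For instance, `0x3c00` decodes to `1.0` and `0x7bff`, the largest finite pattern, decodes to `65504.0`.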
