Unlike integers, the floating point number system in a computer doesn’t use two’s complement to represent negative numbers. Floating point numbers have their own set of details, though, that make them quite complicated.

The floating point number formats can be quite confusing for people, because they are much farther removed from how we think about numbers than the integer formats, which I explained in an earlier article.

It’s not too difficult for the average programmer to do whiteboard calculations with two’s complement, or even as mental arithmetic. But whiteboard calculations with floating point numbers are quite difficult, and usually require the help of a calculator.

Even so, having at least a basic understanding of how floating point numbers work, what they’re good for, and what they’re not so good for can help you avoid some unpleasant surprises when using them in any of the major programming languages.
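For example, here is the classic surprise, shown as a small Python snippet (the same behavior shows up in most languages that use IEEE 754 doubles under the hood):

```python
import math

# Neither 0.1 nor 0.2 has an exact binary floating point
# representation, so their sum is not exactly 0.3.
print(0.1 + 0.2)           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)    # False

# Comparing with a tolerance sidesteps the surprise.
print(math.isclose(0.1 + 0.2, 0.3))  # True
```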

The opposite of floating point is fixed point. The integer formats are all fixed point formats: the “binary point” is placed so that only whole numbers can be represented, so you can imagine an integer in a computer as always having an implied “.000…” at the end.

The least significant bit in an integer can represent only 0 × 1 or 1 × 1, the next least significant only 0 × 2 or 1 × 2, the next only 0 × 4 or 1 × 4, and so on through 8, 16, 32 and beyond, until reaching the leftmost bit of an unsigned integer (I covered how negative numbers are represented in the previous article).
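To make that concrete, here is a small Python sketch (the value 0b1101 is just an arbitrary example) that adds up the weight contributed by each bit:

```python
n = 0b1101  # 13

# Walk the bits from least significant to most significant
# and sum the power-of-two weight of each set bit.
total = 0
for position in range(n.bit_length()):
    bit = (n >> position) & 1
    weight = bit * (1 << position)  # 0 * 2**position or 1 * 2**position
    print(f"bit {position}: {bit} x {2 ** position} = {weight}")
    total += weight

print(total)  # 13, the original value
```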

One possible fixed point system to represent numbers like 7.25 would be to allocate 64 bits and agree that a bit in the middle represents 0 × 1 or 1 × 1. The bits to the left of that correspond to 2, 4, 8, 16, 32, etc. And the bits to the right correspond to 1/2, 1/4, 1/8, 1/16, 1/32, etc.

This fixed point system could represent positive numbers as small as 1/4294967296 and, if unsigned, as large as 4294967296 − 1/4294967296. Sounds like quite a wide range.
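Here is a minimal sketch of that 32/32 split in Python (the scale constant and helper names are just for illustration, not a real library): a number is stored as a whole number of 1/2³² units.

```python
FRACTION_BITS = 32
SCALE = 1 << FRACTION_BITS  # 2**32 = 4294967296

def to_fixed(x):
    """Store x as a whole number of 1/2**32 units."""
    return round(x * SCALE)

def from_fixed(raw):
    """Turn the raw integer back into an ordinary number."""
    return raw / SCALE

raw = to_fixed(7.25)
print(raw)              # 31138512896, i.e. 7.25 * 2**32
print(from_fixed(raw))  # 7.25

# The smallest positive value is one unit, 1/4294967296.
print(from_fixed(1))    # 2.3283064365386963e-10

# The largest unsigned value is 4294967296 - 1/4294967296
# (it prints as 4294967296.0 after rounding to a float).
print(from_fixed((1 << 64) - 1))
```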

The problem is that scientific calculations require numbers much smaller and much larger than the minimum and maximum of such a fixed point system.

Scientists estimate that the universe contains 4 × 10⁸⁰ particles. The Planck length is roughly 1.6 meters divided by 10³⁵. Both of these numbers are outside of the range of the fixed point system described above.

What if we could say that for some numbers, the “binary point” is placed very close to the left and for some numbers very close to the right? Or in some cases outside of the bit pattern?

And so, floating point was born. Part of the bit pattern represents a fixed point number, like 1.7500152587890625, and part of the bit pattern represents an exponent, like 14. Then we have 1.7500152587890625 × 2¹⁴ = 28672.25.
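Python can show you this split (a quick illustration; math.frexp returns the significand scaled to lie between 0.5 and 1, so it needs a small adjustment to match the 1.xxx form above):

```python
import math

x = 28672.25

# frexp() gives m and e with x == m * 2**e and 0.5 <= m < 1;
# doubling m and lowering e by one gives the 1.xxx * 2**e form.
m, e = math.frexp(x)
print(m * 2, e - 1)  # 1.7500152587890625 14

# float.hex() shows the same significand/exponent split in hexadecimal.
print(x.hex())       # 0x1.c001000000000p+14
```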

The system of floating point numbers almost all computers use today was standardized by the Institute of Electrical and Electronics Engineers (IEEE) in 1985, “reaffirmed” in 1990 and revised in 2008.

The document detailing the standard is best known as IEEE 754. If you need to be more specific, you can refer to “IEEE 754–1985” or “IEEE 754–2008.”

A graphics expert might say that the integer formats are “lossless” (like PNG) and the floating point formats are “lossy” (like JPEG, though lossless variants of JPEG do exist, so the analogy only goes so far).
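To put a number on the “lossy” side of that analogy, here is a small Python illustration: a 64-bit double carries 53 significand bits, so above 2⁵³ it can no longer tell every whole number apart.

```python
# 2**53 and 2**53 + 1 round to the same double, because a
# 64-bit float only has 53 bits of significand.
big = 2 ** 53
print(float(big) == float(big + 1))  # True
print(int(float(big + 1)))           # 9007199254740992, i.e. 2**53
```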

