Thursday, September 19, 2024

Understanding Floating Point Formats

Under ordinary circumstances, you don’t have to know or care how numbers are represented within your programs. However, when you are transferring data files that contain numbers, you will have to convert them if the storage formats are not identical. If the numbers are just integers, that’s fairly easy, because the only differences will be the length and the byte order: how many bytes the number takes up, and whether it is stored lsb first or msb first (least significant byte or most significant byte first). Once you know that, conversion is trivial.
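For instance, if you know a file holds four-byte integers stored msb first, reading them correctly on any machine is just a matter of asking for that order. A quick sketch (the 666 here is just a made-up example value):

#!/usr/bin/perl
use strict;
use warnings;

# Four bytes representing the integer 666, stored msb first.
my $bytes = pack("C4", 0x00, 0x00, 0x02, 0x9A);

printf "read msb first: %d\n", unpack("N", $bytes);   # 666
printf "read lsb first: %d\n", unpack("V", $bytes);   # nonsense if the file was really msb first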

Floating point numbers are a whole other game. For example, in December of 1983, I had to convert some Tandy Basic programs and data files to Xenix MBASIC. The Basic programs themselves were fairly challenging, but the data files were even more so. Tandy stored floating point numbers in what they called “XS128 notation” (Excess 128 is what they really meant) and MBASIC used packed BCD. At the time, I had never given a single thought to how floating point numbers are stored. As you surely realize, this was long before you could ask Google to find you something like MAD 3401 IEEE Floating-Point Notes, and the availability of computer-oriented books was not anything like it is today. I was on my own, with only “od -cx”, my wits, and pure stubbornness to go on. There was an explanation in the manuals, but it was typical geek-babble and it made my head hurt. It took me several hours of painful work to understand what I needed to do, and a few hours more to write programs to do it, but the project got done. I haven’t had to do anything like that since then, and you may never have had to at all, but that doesn’t mean that neither of us ever will. So rather than you getting a headache from trying to puzzle it out (because there’s still a lot of techno-babble out there), I’ll get you started.

The first thing you need to know is that your machine may give different results than mine. It probably won’t unless you are using something odd, but if it does, don’t panic: the theory is still the same; you just have a slightly different implementation. Here’s a Perl program that is going to show us what’s going on (you do not need to understand this script):
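This one is just a sketch, and the pack “f>” template it uses needs Perl 5.10 or newer:

#!/usr/bin/perl
use strict;
use warnings;

# Print how single precision floats are stored: the hex form, then the
# sign / exponent / mantissa bits, then the original value.
foreach my $n (0, 1, 2, 4, 8, 2048, 8192, 5.75, -.1) {
    my $raw  = pack("f>", $n);               # big-endian IEEE single precision
    my $bits = unpack("B32", $raw);          # the 32 bits as a string of 0s and 1s
    my ($sign, $exp, $mant) = unpack("A1 A8 A23", $bits);
    printf "%08X  %s %s %s  %s\n", unpack("N", $raw), $sign, $exp, $mant, $n;
}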

We’re looking at single precision floating point numbers here. Double precision uses the same scheme, just more bits. Here’s what the output looks like:
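Something along these lines, anyway; this is what the sketch above prints, so the exact column layout is up to you:

00000000  0 00000000 00000000000000000000000  0
3F800000  0 01111111 00000000000000000000000  1
40000000  0 10000000 00000000000000000000000  2
40800000  0 10000001 00000000000000000000000  4
41000000  0 10000010 00000000000000000000000  8
45000000  0 10001010 00000000000000000000000  2048
46000000  0 10001100 00000000000000000000000  8192
40B80000  0 10000001 01110000000000000000000  5.75
BDCCCCCD  1 01111011 10011001100110011001101  -0.1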

The first column is what the stored format looks like in hex. After that come the actual bits; I’ve separated them in this odd way for a very good reason (which will become clear later). The value “5.75” is stored as “01000000101110000000000000000000” or “40B80000” (hex).

You might easily guess that the first bit is the sign bit. I think that’s what I first grokked back in 1983 too. The next 8 bits are used for the exponent, and the last 23 are the value. As you will no doubt notice, the value bits for everything from 0 up through 8192 are all empty, so I must be crazy and there’s no point in reading this trash any further.

Well, actually there is. There’s a hidden bit there that isn’t stored but is always assumed. If you are really compulsive and counted the bits, you’ll see that only 23 are there. The hidden bit makes it 24, and it is always 1. So, if we add the hidden bit, the bits would look like:
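Taking 5.75 as the example: 0 10000001 101110000000000000000000 (24 value bits now instead of 23).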

But remember, it’s what I showed above that is really there.

One more thing: there’s an implied decimal point after that hidden bit. To get the value of the bits after the decimal point, start dividing by two: the first bit after the (implied) decimal point is worth .5, the next .25, and so on. We don’t have to worry about any of that for the powers of two, because those are whole numbers and the bits after the point will all be 0. But down at 5.75 we see it at work:

First, looking at the exponent for 5.75, we see that it is 129. Subtracting 127 gives us 2. So 1.0111 times 2^2 becomes 101.11 (simply move the point 2 places to the right to multiply by 4). So now we have 101 binary, which is 5, plus .5 plus .25 (the .11), or 5.75 in total. Too quick?

Taking it in detail:

Exponent: 10000001, which is 129 (use the Javascript Bit Twiddler if you like). Subtracting 127 leaves us with 2.

Mantissa: 01110000000000000000000

Add in the implied bit and we have 101110000000000000000000; with the implied decimal point, that’s 1.01110000000000000000000

Multiply that by 2^2 to get 101.110000000000000000000

That is 4 + 1 + .5 + .25 or 5.75
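If you’d rather let the machine grind through that arithmetic, here’s the same method spelled out as a small script (just a sketch of the steps above):

#!/usr/bin/perl
use strict;
use warnings;

# Decode 0x40B80000 (5.75) by hand, following the same steps.
my $word = 0x40B80000;
my $sign = ($word >> 31) & 1;        # 0, so positive
my $exp  = ($word >> 23) & 0xFF;     # 10000001, which is 129
my $mant = $word & 0x7FFFFF;         # 01110000000000000000000

my $value = 1;                        # start with the hidden bit
for my $i (1 .. 23) {
    $value += (($mant >> (23 - $i)) & 1) / 2 ** $i;   # .5, .25, .125 ...
}
$value *= 2 ** ($exp - 127);          # shift the point: times 2^2 here
$value = -$value if $sign;

print "$value\n";                     # prints 5.75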

Look at 2048. The exponent is 128 + 8 + 2, or 138; subtract 127 and we get 11. Use the Bit Twiddler if you don’t see that. The mantissa is all 0’s, which with the implied bit makes this 1.00000000000000000000000 times 2^11. What’s 2^11? It’s 2048, of course.

Now the -.1. This actually can’t be stored precisely, but the method is still the same. The exponent is 64 + 32 + 16 + 8 + 2 + 1, or 123. Subtract 127 and we get -4, which means the decimal point moves 4 places to the left, making our value .000110011001100110011001101. Now you understand why the exponent is stored after adding 127: it’s so we can end up with negative exponents. If we calculate out the binary, that’s .0625 + .03125 + .00390625 and on to ever smaller numbers, which gets us very, very close to .1 (but off slightly). The sign bit was set, so it’s -.1.
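You can see that slight inaccuracy for yourself with another quick sketch that pushes .1 through single precision storage and prints it back with extra digits:

#!/usr/bin/perl
# Store .1 as a single precision float and look at what comes back.
my $stored = unpack("f>", pack("f>", 0.1));
printf "%.12f\n", $stored;    # 0.100000001490, close but not exact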

The Tandy (and DEC VAX, by the way) “excess 128” exponent storage simply changes the range of positive versus negative exponents; other than that, it works just like this.

Finally, there are reserved patterns: all 0’s means zero, and an all-1’s exponent marks the special values. With the mantissa all 0’s, that’s infinity, which is what you get when a result is too large for the format to hold or when you divide by zero; with anything else in the mantissa, it’s NaN (Not A Number), which comes from operations that have no sensible answer, such as 0/0.
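If you want to see those patterns in action, you can build them by hand and let Perl tell you what they mean (the same sort of sketch as before; the exact spelling of the output varies by platform):

#!/usr/bin/perl
# All-1's exponent: a zero mantissa means infinity, anything else is NaN.
printf "%s\n", unpack("f>", pack("B32", "0" . "1" x 8 . "0" x 23));        # prints Inf
printf "%s\n", unpack("f>", pack("B32", "0" . "1" x 8 . "1" . "0" x 22));  # prints NaN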

That’s it. Take a look at the link at the beginning if you want to go a little deeper, but this is probably all you need to get started.

A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com
