
Level 2

Харьков

28 February 2021

What's inside a floating point number and how does it work?

Image: http://pikabu.ru/

Introduction

In my very first days of learning Java, I came across a curious kind of primitive: floating point numbers. I was immediately interested in their features and, even more, in how they are written in binary (the two are closely related). Unlike integers, of which any range holds only finitely many, there are infinitely many real numbers even in a very small range (for example, from 1 to 2). With a finite amount of memory it is impossible to express this whole set. So how are they expressed in binary, and how do they work? Alas, the explanations on the wiki and a rather cool article on Habr here did not give me a complete understanding, although they laid the foundation. The realization came only the morning after I read this analysis article.

Excursion into history

(Taken from this article on Habr.) In the 1960s and 70s, when computers were large and programs were small, there was still no single standard for floating-point calculations, nor for representing the floating-point number itself. Each computer did it differently, and each had its own errors. But in the mid-70s, Intel decided to make new processors with "improved" arithmetic support and to standardize it at the same time. Professors William Kahan and John Palmer (no, not the author of books about beer) were brought in to develop it. There was some drama, but a new standard was developed. Today this standard is called IEEE 754.

Floating point number format

Even in school textbooks, everyone has come across an unusual way of writing very large or very small numbers, such as 1.2 × 10³ or 1.2E3, which equals 1.2 × 1000 = 1200. This is called exponential (scientific) notation. Here the number is expressed by the formula N = M × n^p, where

N = 1200 - the resulting number
M = 1.2 - the mantissa: the significant digits, without the order of magnitude
n = 10 - the base. Here, and whenever we are not talking about computers, the base is 10
p = 3 - the exponent: the power to which the base is raised

Quite often the base is assumed to be 10, and only the mantissa and the exponent are written, separated by the letter E. In our example, 1.2 × 10³ and 1.2E3 are equivalent entries. If everything is clear and we have finished the nostalgic excursion into the school curriculum, I now recommend forgetting it: a floating point number is built from powers of two, not ten, i.e. n = 2. The whole harmonious formula 1.2E3 breaks down, and it really broke my brain.
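Before switching to powers of two, the decimal formula can be checked directly in Java. This is only a small sketch: the class name ScientificNotationDemo and the helper fromParts are made up for illustration and are not from any library.

```java
class ScientificNotationDemo {
    // Hypothetical helper: evaluates N = M × n^p literally.
    static double fromParts(double mantissa, int base, int power) {
        return mantissa * Math.pow(base, power);
    }

    public static void main(String[] args) {
        double literal = 1.2E3;                   // Java's built-in E notation, base 10
        double computed = fromParts(1.2, 10, 3);  // the same number built from the formula
        System.out.println(literal);
        System.out.println(computed);
    }
}
```

The E in a Java literal always means "times ten to the power of", which is exactly why this notation stops being helpful once the base becomes 2.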

Sign and exponent

So what do we have? The result is again a binary number, made up of a mantissa, which gets multiplied by a power of two, and the power itself. In addition, just as with integer types, floating-point numbers have a bit that determines the sign: whether the number is positive or negative. As an example, I propose to consider the type float, which consists of 32 bits. For the double-precision type double the logic is the same, only there are twice as many bits (64: 1 for the sign, 11 for the exponent and 52 for the mantissa). Of the 32 bits, the first, most significant one is allocated to the sign, the next 8 bits to the exponent (the power of two by which we multiply the mantissa), and the remaining 23 bits to the mantissa. To demonstrate, let's look at an example:

What's inside a floating point number and how it works - 1
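The 1 + 8 + 23 layout can be observed from Java itself. The sketch below uses the standard Float.floatToIntBits method to get the raw 32 bits and then cuts out the three fields with shifts and masks. The class and helper names are invented for illustration, and 7.5f is used because it is the value this article's example decodes to.

```java
class FloatFieldsDemo {
    // Highest bit: the sign (0 = positive, 1 = negative).
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Next 8 bits: the stored exponent.
    static int exponentBits(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Lowest 23 bits: the stored mantissa.
    static int mantissaBits(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Binary string left-padded with zeros to a fixed width.
    static String toBinary(int value, int width) {
        String s = Integer.toBinaryString(value);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        float f = 7.5f;
        System.out.println("sign     = " + toBinary(signBit(f), 1));
        System.out.println("exponent = " + toBinary(exponentBits(f), 8));
        System.out.println("mantissa = " + toBinary(mantissaBits(f), 23));
    }
}
```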

The first bit is very simple. If its value is 0, the number is positive. If it is 1, the number is negative. The next block of 8 bits is the exponent block. The exponent is written as a regular eight-bit unsigned number, and to get the actual power we need to subtract 127 from it. In our case, the eight bits of the exponent are 10000001, which corresponds to the number 129. If you are wondering how to calculate this, the picture shows a quick answer. An expanded version can be found in any course on binary numbers.

What's inside a floating point number and how it works - 2

1×2⁷ + 0×2⁶ + 0×2⁵ + 0×2⁴ + 0×2³ + 0×2² + 0×2¹ + 1×2⁰ = 1×128 + 1×1 = 128 + 1 = 129

It is not difficult to calculate that the maximum number we can get from these 8 bits is 11111111₂ = 255₁₀ (the subscripts 2 and 10 denote the binary and decimal number systems). However, if we used only positive exponent values (from 0 to 255), the resulting numbers could have many digits before the point, but none after it. To obtain negative powers as well, we subtract 127 from the stored exponent. Thus the range of powers is from -127 to 128. (In IEEE 754 the two extreme stored values, 0 and 255, are reserved for special cases such as zero, subnormal numbers, infinity and NaN, so ordinary numbers use powers from -126 to 127.) In our example, the required power is 129 - 127 = 2. Let's remember this number for now.
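The subtraction of 127 can be reproduced in a few lines. In this sketch the class name and the unbiasedExponent helper are illustrative; Integer.parseInt with radix 2, Math.getExponent and the constants Float.MIN_EXPONENT and Float.MAX_EXPONENT are standard Java.

```java
class ExponentBiasDemo {
    static final int BIAS = 127;

    // Converts the stored 8-bit value into the actual power of two.
    static int unbiasedExponent(int storedExponent) {
        return storedExponent - BIAS;
    }

    public static void main(String[] args) {
        int stored = Integer.parseInt("10000001", 2);      // the exponent bits from the example
        System.out.println(stored);                         // the stored value
        System.out.println(unbiasedExponent(stored));       // the power after removing the bias
        System.out.println(Math.getExponent(7.5f));         // the JDK's own answer for 7.5f
        System.out.println(Float.MIN_EXPONENT + " .. " + Float.MAX_EXPONENT); // range for normal floats
    }
}
```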

Mantissa

Now about the mantissa. It consists of 23 bits, but an extra leading 1 is always implied at the front, and no bit is allocated for it. This is done for reasons of expediency and economy. The same number can be expressed with different exponents by adding zeros to the mantissa before or after the point. The easiest way to see this is with a decimal exponent: 120,000 = 1.2×10⁵ = 0.12×10⁶ = 0.012×10⁷ = 0.0012×10⁸, etc. By fixing the leading digit of the mantissa, however, every bit pattern gives a different number. So let's take it for granted that before our 23 bits there is one more bit, equal to 1. Usually this bit is separated from the rest by a dot, which is simply the binary point of the mantissa and makes it more convenient to read: 1.11100000000000000000000

What's inside a floating point number and how does it work - 3
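A possible way to see the implied leading 1 in code: treat the 23 stored bits as a fraction of 2²³ and add 1 in front. The class and method names here are illustrative, and the sketch assumes an ordinary (normal) number, where the implied bit really is 1.

```java
class ImplicitOneDemo {
    // Value of "1.<23 mantissa bits>" as a plain number between 1 and 2.
    // Assumes a normal float; subnormal numbers have no implied leading 1.
    static double significand(int mantissaBits) {
        return 1.0 + mantissaBits / (double) (1 << 23);
    }

    public static void main(String[] args) {
        int mantissa = Float.floatToIntBits(7.5f) & 0x7FFFFF; // stored bits 111000...0
        double m = significand(mantissa);                     // binary 1.111
        System.out.println(m);
        System.out.println(m * Math.pow(2, 2));               // scaled by the power from the example
    }
}
```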

Now each bit of the resulting mantissa is multiplied by a power of two, going from left to right and decreasing the power by one at each step. We start from the power obtained in the calculation above, i.e. 2. (I deliberately chose a simple example so as not to write out every power of two, and I did not calculate them in the table where the corresponding bit is zero.)

What's inside a floating point number and how does it work - 4

1×2² + 1×2¹ + 1×2⁰ + 1×2⁻¹ = 1×4 + 1×2 + 1×1 + 1×0.5 = 4 + 2 + 1 + 0.5 = 7.5

We got the result 7.5. Its correctness can be checked, for example, at this link.
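The same bit-by-bit summation can be written as a loop. This is a sketch of the manual procedure described here, not how the JVM actually converts numbers. The names are illustrative, and it only handles normal numbers (stored exponent between 1 and 254).

```java
class BitByBitDecodeDemo {
    // Decodes the raw bits of a normal float by summing powers of two,
    // starting from the implied leading 1 and moving left to right.
    static double decode(int bits) {
        int sign = bits >>> 31;
        int power = ((bits >>> 23) & 0xFF) - 127;   // remove the bias
        int mantissa = bits & 0x7FFFFF;

        double sum = Math.pow(2, power);            // contribution of the implied leading 1
        for (int i = 22; i >= 0; i--) {             // stored bits, most significant first
            power--;                                // each next bit is worth half as much
            if (((mantissa >>> i) & 1) == 1) {
                sum += Math.pow(2, power);
            }
        }
        return sign == 0 ? sum : -sum;
    }

    public static void main(String[] args) {
        int bits = Float.floatToIntBits(7.5f);
        System.out.println(decode(bits));                 // the manual sum
        System.out.println(Float.intBitsToFloat(bits));   // the JDK's conversion of the same bits
    }
}
```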

Results

A standard floating-point number of type float consists of 32 bits: the first bit is the sign (+ or -), the next eight are the exponent, and the last 23 are the mantissa.

Sign - if the bit is 0, the number is positive. If the bit is 1, it is negative.
Exponent - convert the bits into a decimal number (from the left, the first bit is worth 128, the second 64, the third 32, the fourth 16, the fifth 8, the sixth 4, the seventh 2, the eighth 1) and subtract 127 from the result. This gives the power we start with.
Mantissa - put one more bit with the value 1 in front of the existing 23 bits and, starting from it, multiply each bit by two raised to the current power, decrementing the power with each subsequent bit, then add up the results.

That's all folks, kids!

What's inside a floating point number and how it works - 5
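To check the summary in the opposite direction, the three fields can be glued back together and handed to the JDK. The sketch below assumes the fields are given as strings of 0s and 1s. The class and method names are illustrative, while Integer.parseUnsignedInt and Float.intBitsToFloat are standard Java.

```java
class FloatFromBitsDemo {
    // Builds a float from its three fields written as binary strings.
    static float fromBitStrings(String sign, String exponent, String mantissa) {
        if (sign.length() != 1 || exponent.length() != 8 || mantissa.length() != 23) {
            throw new IllegalArgumentException("expected 1 + 8 + 23 bits");
        }
        // parseUnsignedInt accepts all 32 bits, including a leading 1 for negative numbers.
        int bits = Integer.parseUnsignedInt(sign + exponent + mantissa, 2);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Mantissa from the example: 111 followed by twenty zeros.
        String mantissa = "111" + "0000000000" + "0000000000";
        System.out.println(fromBitStrings("0", "10000001", mantissa));
        System.out.println(fromBitStrings("1", "10000001", mantissa));
    }
}
```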

P.S. As homework, use this article to work out why precision errors accumulate over a large number of arithmetic operations on floating point numbers, and leave your versions in the comments.
