Subtraction of single precision IEEE 754 numbers

Subtraction of single precision IEEE 754 numbers - math

The problem is (-1.100 x 2^5) + (1.1001 x 2^7).
After shifting to get them both to the same magnitude you would get
1.10010 x 2^7
-0.01100 x 2^7
My problem is with carrying. I'm not sure if I am doing it right.
The answer I got was 0.01110 x 2^7, is this correct? Also, when subtracting how do I know if I would end up with a negative value? If the answer I have above is correct, would the correct representation in single precision IEEE be
0 10000110 011100000000000000000000

Check your work by adding the difference (the result) to the subtrahend (the number after the minus sign). If you get the minuend (the number before the minus sign), you did it correctly.
11 // the carries from the addition
0.01100 // the difference you computed
+ 0.01110 // the subtrahend
---------
0.11010 // should be the minuend, if you computed the difference correctly
That is not the minuend (1.10010), so you subtracted incorrectly.

Related

Binary 2's Complement

I'm facing a problem. when we want to subtract a number from another using 2's complement we can do that. I don't know how to subtract fractional number using 2's complement.
5 is in binary form 101 and 2 is 10. if we want to subtract 2 from 5 we need to find out 2's complement of 2
2's complement of 2-> 11111110
so if we now add with binary of 5 we can get the subtraction result. If I want to get the result of 5.5-2.125. what would be the procedure.

Fixed point numbers can be used and it is still common to find them in embedded code or hardware.
Their use is identical to integers, but you need to specify where your "point" is. For instance, assume that you want 3 bits after after the point and that your data is 8 bits, bits 7..3 are the integer part (left of "point") and bits 2..0 the fractional part. The interpretation of integer part is as usual the binary decomposition of this integer: bits 3 correspond to 20, bits 4 to 21, etc.
For the fractional part, the decomposition is in negative powers or two. bits 2 correspond to 2-1, bits 1 to 2-2 and bit 0 to 2-3.
So for you problem, 5.5=4+1+1/2=22+20+2-1 and its code is 00101(.)100. Similarly 2.125=2+1/8 and its code is 00010(.)001 (note (.) is just an help to understand the coding).
Indeed they are just integers, but you must take into account that all your numbers are multiplied by 2-3. This will have no impact for addition, but results of multiplication and division must be adjusted. Taking into account the position of point and managing over and underflows is the difficulty of arithmetic with fixed point, but it allows to do fractional computations even if your hardware does not provide floating point support (for instance with low end microcontrollers or FPGA systems).
Two complement is similar to integers and its computation is identical. If code of 2.125 is 00010(.)001, than -2.125==11101(.)111. Operations are as usual.
+5 00101(.)100
-2.125 11101(.)111
00011(.)011
and 00011(.)011=2+1+1/4+1/8=3,375
For the record, two complement first use was for fixed point fractional numbers and two complement name comes from that. If a fractional number if represented by, say 0(.)1100000 (0.75), its negative counter part will be 1(.)0100000 (-0.75 or 1.25 if interpreted as unsigned) and we always have x+(unsigned)-x=2. For this coding, the negative value of a fractional number x is the number y that must be added to x to get a 2, hence the name that y is 2's complement of x.

Why does 0.1 + 0.4 = 0.5?

We know that floating point is broken, because decimal numbers can't always be perfectly represented in binary. They're rounded to a number that can be represented in binary; sometimes that number is higher, and sometimes it's lower. In this case using the ubiquitous IEEE 754 double format both 0.1 and 0.4 round higher:
0.1 = 0.1000000000000000055511151231257827021181583404541015625
0.4 = 0.40000000000000002220446049250313080847263336181640625
Since both of these numbers are high, you'd expect their sum to be high as well. Perfect addition should give you 0.5000000000000000277555756156289135105907917022705078125, but instead you get a nice exact 0.5. Why?
The question Is floating point math broken? was already identified above, but this question is different. It is asking for a further level of detail on a non-intuitive result when taking the answers of that question into consideration.

This calculation behaves this way because the addition pushes the result into another (binary) order of magnitude. This adds a significant bit to the left (most-significant side) and therefore has to drop a bit on the right (least-significant side).
The first number, 0.1, is stored in binary as a number between 2^-4 == 1/16 and 2^-3 == 1/8. The second number, 0.4, is stored in binary as a number between 2^-2 == 1/4 and 2^-1 == 1/2. The sum, 0.5, is the number 2^-1 == 1/2 or a little larger. This is a mis-match in magnitudes and can cause loss of digits.
Here is an example, easier to understand. Let's say we are working on a decimal computer that can store four decimal digits in floating point. Let's also say we want to add the numbers 10/3 and 20/3. These may end up stored as
3.334
and
6.667
both of which are a little high. When we get those numbers, we expect the sum to be also a little high, namely
10.001
but notice that we have moved into a new order of magnitude with our result. The full result has five decimal digits, which will not fit. So the computer rounds the result to just four decimal digits and gets the sum
10.00
which surprisingly is the correct exact answer to 10/3 + 20/3.
I get the same kind of thing often in my U.S. high-school Chemistry and Physics classes. When a calculation moves to a new order of magnitude, strange things happen with precision and significant digits.

Although most decimal numbers need to be rounded to fit into binary, some don't. 0.5 can be exactly represented in binary, since it's 2-1.
Floating point isn't just binary, it also has limited precision. Here is the exact sum and the two closest IEEE 754 double representable numbers on either side:
0.5000000000000000277555756156289135105907917022705078125
0.5000000000000000000000000000000000000000000000000000000
0.5000000000000001110223024625156540423631668090820312500
It's clear that the exact 0.5 is closest to the true sum. IEEE 754 has rules regarding simple math operations that dictate how result rounding will take place, and you can generally rely on the closest result to be taken.

fixed point multiplication for normal multiplication

I need to multiply X with a floating point number in floating point as i don't have floating point operations in my processor. I understand the method but don't know why that method exists?
Suppose we want to multiply 2*4.5 in decimal I do the below:
2 * 4.5 (100.1)
So i multiply 2*1001 = 2*9 = 18 and then right shift by 1.
so 18>>1 = 9
Is it that we represent 2 in fixed point and represent 4.5 in fixed point and as we multiply Q1.1 and Q1.1 format so we get Q2.2 format and we do right shifting causing Q1.1 format result.Is this right?

In decimal, your fixed-point example is actually:
2 * 4.5
2 * 45 (after multiplying by 10) = 90
90 / 10 = 9 (after dividing the 10 back out)
In binary, the same thing is being done, but just with powers of 2 instead of powers of 10 (as the factors / divisors). Fixed point operations occur in purely integral space after appropriate multiplications. And multiplying or dividing by a power of 2 is just a left shift or right shift respectively on the binary number (very fast for the CPU). In fixed-point the number of bits to the left (integer) and right (fractional) of the decimal point are fixed (predetermined), which means that some numbers cannot be represented on the scale without loss of precision.
Floating-point further extends the concept by allowing the number of bits assigned to the left and right of the decimal point to be flexible. In floating point, every number is represented as an integral "significand" (or mantissa) to a specified power (for example, a power of 2). This representation allows the same number of significant digits to be maintained over a greater dynamic range (for very small or very large magnitude numbers). For floating point, most of the bits will be assigned to the significant digits of the mantissa, and fewer of the bits assigned to the digits of the power. Floating-point calculations are more expensive (time-wise) than fixed-point, which is why fixed-point remains popular in microcontrollers and embedded systems.
If I didn't answer your question, please elaborate and I will edit this answer to include the information you desire.

decimal to floating point system.

i've been asked to work on the following question with the following specification/ rules...
Numbers are held in 16 bits split from left to right as follows:
1 bit sign flag that should be set for negative numbers and otherwise clear.
7 bit exponent held in Excess 63
8 bit significand, normalised to 1.x with only the fractional part stored – as in IEEE 754
Giving your answers in hexadecimal, how would the number -18 be represented in this system?
the answer is got is: 11000011 00100000 (or C320 in hexadecimal)
using the following method:
-18 decimal is a negative number so we have the sign bit set to 1.
18 in binary would be 0010010. This we could note down as 10010. We know work on what’s on the right side of the decimal point but in this case we don’t have any decimal point or fractions so we note down 0000 0000 since there are no fractions. We now write down the binary of 18 and the remainder zeroes (which are not necessarily required) and separate them with a decimal point as shown below:
10010.00000000
We now normalise this into the form 1.x by moving the decimal point and placing it between the first and second number (counting the amount of times we move the decimal point until it reaches that area). The result now is 1.001000000000 x 2^4 and we also know that the decimal point has been moved 4 times which for now we will consider to be our exponent value. The floating point system we are using has 7 bit exponent and uses excess 63. The exponent is 4 in excess 63 which would equal to 63 + 4 = 67 and this in 7 bit binary is shown as 1000011.
The sign bit is: 1 (-ve)
Exponent is: 1000011
Significand is 00100…
The binary representation is: 11000011 00100000 (or C320 in hexadecimal)
please let me know if it's correct or if i've done it wrong and what changes could be applied. thank you guy :)

Since you seem to have been assigned a lot of questions of this type, it may be useful to write an automated answer checker to validate your work. I've put together a quick converter in Python:
def convert_from_system(x):
#retrieve the first eight bits, and add a ninth bit to the left. This bit is the 1 in "1.x".
significand = (x & 0b11111111) | 0b100000000
#retrieve the next seven bits
exponent = (x >> 8) & 0b1111111
#retrieve the final bit, and determine the sign
sign = -1 if x >> 15 else 1
#add the excess exponent
exponent = exponent - 63
#multiply the significand by 2^8 to turn it from 1.xxxxxxxx into 1xxxxxxxx, then divide by 2^exponent to get back the decimal value.
result = sign * (significand / float(2**(8-exponent)))
return result
for value in [0x4268, 0xC320]:
print "The decimal value of {} is {}".format(hex(value), convert_from_system(value))
Result:
The decimal value of 0x4268 is 11.25
The decimal value of 0xc320 is -18.0
This confirms that -18 does convert into 0xC320.

Can a IEEE 754 real number "cover" all integers within its range?

The original question was edited (shortened) to focus on a problem of precision, not range.
Single, or double precision, every representation of real number is limited to (-range,+range). Within this range lie some integer numbers (1, 2, 3, 4..., and so on; the same goes with negative numbers).
Is there a guarantee that a IEEE 754 real number (float, double, etc) can "cover" all integers within its range? By "cover" I mean the real number will represent the integer number exactly, not as (for example) "5.000001".
Just as reminder: http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html nice explanation of various number representation formats.
Update:
Because the question is for "can" I am also looking for the fact this cannot be done -- for it quoting a number is enough. For example "no it cannot be done, for example number 1748574 is not represented exactly by float number" (this number is taken out of thin air of course).
For curious reader
If you would like to play with IEEE 754 representation -- on-line calculator: http://www.ajdesigner.com/fl_ieee_754_word/ieee_32_bit_word.php

No, not all, but there exists a range within which you can represent all integers accurately.
Structure of 32bit floating point numbers
The 32bit floating point type uses
1 bit for the sign
8 bits for the exponent
23 bits for the fraction (leading 1 implied)
Representing numbers
Basically, you have a number in the form
(-)1.xxxx_xxxx_xxxx_xxxx_xxxx_xxx (binary)
which you then shift left/right with the (unbiased) exponent.
To have it represent an integer requiring n bits, you need to shift it by n-1 bits to the left. (All xes beyond the floating point are simply zero)
Representing integers with 24 bits
It is easy to see, that we can represent all integers requiring 24 bits (and less)
1xxx_xxxx_xxxx_xxxx_xxxx_xxxx.0 (unbiased exponent = 23)
since we can set the xes at will to either 1 or 0.
The highest number we can represent in this fashion is:
1111_1111_1111_1111_1111_1111.0
or 2^24 - 1 = 16777215
The next higher integer is 1_0000_0000_0000_0000_0000_0000. Thus, we need 25 bits.
Representing integers with 25 bits
If you try to represent a 25 bit integer (unbiased exponent = 24), the numbers have the following form:
1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0
The twenty-three digits that are available to you have all been shifted past the floating point. The leading digit is always a 1. In total, we have 24 digits. But since we need 25, a zero is appended.
A maximum is found
We can represent ``1_0000_0000_0000_0000_0000_0000with the form1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0, by simply assigning 1to allxes. The next higher integer from that is: 1_0000_0000_0000_0000_0000_0001. It's easy to see that this number cannot be represented accurately, because the form does not allow us to set the last digit to 1: It is always 0`.
It follows, that the 1 followed by 24 zeroes is an upper bound for the integers we can accurately represent.
The lower bound simply has its sign bit flipped.
Range within which all integers can be represented (including boundaries)
224 as an upper bound
-224 as a lower bound
Structure of 64bit floating point numbers
1 bit for the sign
11 exponent bits
52 fraction bits
Range within which all integers can be represented (including boundaries)
253 as an upper bound
-253 as a lower bound
This easily follows by applying the same argumentation to the structure of 64bit floating point numbers.
Note: That is not to say these are all integers we can represent, but it gives you a range within which you can represent all integers. Beyond that range, we can only represent a power of two multiplied with an integer from said range.
Combinatorial argument
Simply convincing ourselves that it is impossible for 32bit floating point numbers to represent all integers a 32bit integer can represent, we need not even look at the structure of floating point numbers.
With 32 bits, there are 232 different things we can represent. No more, no less.
A 32bit integer uses all of these "things" to represent numbers (pairwise different).
A 32bit floating point number can represent at least one number with a fractional part.
Thus, it is impossible for the 32bit floating point number to be able to represent this fractional number in addition to all 232 integers.

macias, to add to the already excellent answer by phant0m (upvoted; I suggest you accept it), I'll use your own words.
"No it cannot be done, for example number 16777217 is not represented exactly by float number."
Also, "for example number 9223372036854775809 is not represented exactly by double number".
This is assuming your computer is using the IEEE floating point format, which is a pretty strong bet.

No.
For example, on my system, the type float can represent values up to approximately 3.40282e+38. As an integer, that would be approximately 340282000000000000000000000000000000000, or about 2128.
The size of float is 32 bits, so it can exactly represent at most 232 distinct numbers.
An integer object generally uses all of its bits to represent values (with 1 bit dedicated as a sign bit for signed types). A floating-point object uses some of its bits to represent an exponent (8 bits for IEEE 32-bit float); this increases its range at the cost of losing precision.
A concrete example (1267650600228229401496703205376.0 is 2100, and is exactly representable as a float):
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(void) {
float x = 1267650600228229401496703205376.0;
float y = nextafterf(x, FLT_MAX);
printf("x = %.1f\n", x);
printf("y = %.1f\n", y);
return 0;
}
The output on my system is:
x = 1267650600228229401496703205376.0
y = 1267650751343956853325350043648.0
Another way to look at it:
A 32-bit object can represent at most 232 distinct values.
A 32-bit signed integer can represent all integer values in the range -2147483648 .. 2147483647 (-231 .. +231-1).
A 32-bit float can represent many values that a 32-bit signed integer can't, either because they're fractional (0.5) or because they're too big (2.0100). Since there are values that can be represented by a 32-bit float but not by a 32-bit int, there must be other values that can be represented by a 32-bit int but not by a 32-bit float. Those values are integers that have more significant digits than a float can handle, because the int has 31 value bits but the float has only about 24.

Apparently you are asking whether a Real data type can represent all of the integer values in its range (absolute values up to FLT_MAX or DBL_MAX, in C, or similar constants in other languages).
The largest numbers representable by floating point numbers stored in K bits typically are much larger than the 2^K number of integers that K bits can represent, so typically the answer is no. 32-bit C floats exceed 10^37, 32-bit C integers are less than 10^10. To find out the next representable number after some number, use nextafter() or nextafterf(). For example, the code
printf ("%20.4f %20.4f\n", nextafterf(1e5,1e9), nextafterf(1e6,1e9));
printf ("%20.4f %20.4f\n", nextafterf(1e7,1e9), nextafterf(1e8,1e9));
prints out
100000.0078 1000000.0625
10000001.0000 100000008.0000
You might be interested in whether an integer J that is between two nearby fractional floating values R and S can be represented exactly, supposing S-R < 1 and R < J < S. Yes, such a J can be represented exactly. Every float value is the ratio of some integer and some power of 2. (Or is the product of some integer and some power of 2.) Let the power of 2 be P, and suppose R = U/P, S = V/P. Now U/P < J < V/P so U < J*P < V. More of J*P's low-order bits are zero than are those of U, V (because V-U < P, due to S-R < 1), so J can be represented exactly.
I haven't filled in all the details to show that J*P-U < P and V-J*P < P, but under the assumption S-R < 1 that's straightforward. Here is an example of R,J,S,P,U,V value computations: Let R=99999.9921875 = 12799999/128, (ie P=128); let S=100000.0078125 = 12800001/128; we have U=0xc34fff and V=0xc35001 and there is a number between them that has more low-order zeroes than either; to wit, J = 0xc35000/128 = 12800000/128 = 100000.0. For the numbers in this example, note that U and V require 24 bits for their exact representations (6 ea. 4-bit hex digits). Note that 24 bits is the number of bits of precision in IEEE 754 single-precision floating point numbers. (See table in wikipedia article.)
That each floating point number is a product or ratio of some integer and some power of 2 (as mentioned two paragraphs above) also is discussed in that floating point article, in a paragraph that begins:
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, ... a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex