IEEE 754 Floating Point Add/Rounding

IEEE 754 Floating Point Add/Rounding - math

I don't understand how I can add in IEEE 754 Floating Point (mainly the "re-alignment" of the exponent)
Also for rounding, how does the Guard, Round & Sticky come into play? How to do rounding in general (base 2 floats that is)
eg. Suppose the qn: Add IEEE 754 Float represented in hex 0x383FFBAD
and 0x3FD0ECCD, then give answers in Round to 0, \$\pm \infty\$,
nearest
So I have
0x383FFBAD 0 | 0111 0000 | 0111 1111 1111 0111 0101 1010
0x3FD0ECCD 0 | 0111 1111 | 1010 0001 1101 1001 1001 1010
Then how should I continue? Feel free to use other examples if you wish

If I understood your "re-alignment" of the exponent correctly...
Here's an explanation of how the format relates to the actual value.
1.b(22)b(21)...b(0) * 2e-127 can be interpreted as a binary integer shifted left by e-127 bit positions. Of course, the shift amount can be negative, which is how we get fractions (values between 0 and 1).
In order to add 2 floating-point numbers of the same sign you need to first have their exponent part equal, or, in other words, denormalize one of the addends (the one with the smaller exponent).
The reason is very simple. When you add, for example, 1 thousand and 1 you want to add tens with tens, hundreds with hundreds, etc. So, you have 1.000*103 + 1.000*100 = 1.000*103 + 0.001*103(<--denormalized) = 1.001*103. This can, of course, result in truncation/rounding, if the format cannot represent the result exactly (e.g. if it could only have 2 significant digits, you'd end up with the same 1.0*103 for the sum).
So, just like in the above example with 1000 and 1, you may need to shift to the right one of the addends before adding their mantissas. You need to remember that there's an implict 1. bit in the format, which isn't stored in the float, which you have to account for when shifting and adding. After adding the mantissas, you most likely will run into a mantissa overflow and will have to denormalize again to get rid of the overflow.
That's the basics. There're special cases to consider as well.

Related

Struggle with binary substraction

Last week I learned about arithmetic with binary numbers especially substraction with two's complement, it was pretty easy so far but something bothers me a little bit. Why is 0 - 1 = 1 with a borrow of 1?
Sure its -1 but should we get some result like 1001 (4 bits for size)? Can someone please explain?

0-1 is not 1 in binary
0-1 does not equal 1 in two's complement with signed numbers. But its representation does have alot of ones in it: you should get 1111.
Negative numbers in two's complement:
In two complement's format (which computers use) with signed magnitude, one bit is reserved to represent the sign of the number. By convention this is the leftmost bit, where 0 indicates a positive value and 1 indicates a negative value.
To get the representation of a negative number we follow two steps:
Flip all the bits
Then increment by one
Additive Inverse of 7:
7=0111
1000 // Flip the bits
-7=1001 // Add +1
So why do we get 1111 when we subtract 0-1?
How do we substract in two's complement? We subtract A-B by taking the additive inverse (the opposite) of B and adding the numbers.
So with A=0 and B=1:
Take the additive inverse by inverting and then increment by 1.
Additive Inverse of B:
B= 0001
-B= 1110 // Invert
-B= 1111 // Increment +1
Now sum A + (-B) :
A=0000
(+) -B=1111
-----------
A+(-B)=1111
Note that if we are only allowed one bit. We can't represent negative numbers since there's no room for the value.

How to represent a negative hexadecimal number?

If I have a negative decimal number, say -5 for example, and converted it to hexadecimal format. Would I be able to simply put a negative sign in front of the hexadecimal number? Or is there another way of doing it like with 2's compliment in binary?

There is no "one way" to represent a negative number. That said there are a number of standard ways to represent a negative number. To keep the math simple, I'm going to assume that all number use 4 bits.
Use a sign bit (in binary) 0111 is 7, but 1111 is -7. (also can be done in reverse 0111 is -7 1111 is 7.
Use the 1's complement. 0111 is 7, but 1000 is -7 (all bits flipped. This has the odd property of 0000 being a natural 0, but 1111 being a (negative zero) -0.
Use the 2's complement. Negation is 1's complement plus one, so 0111 is 7, but 1000 + 0001 or 1001 is -7. This leverages integer overflow to maintain no negative zero. 0000 negated is 1111 + 0001 which overflows to 0000. It also has a few nice properties such that adding some number plus its negative resolves to zero, provided that both numbers can be written (there is one more negative number than a positive one). 7 + (-7) is 0111 + 1001 which overflows to 0 0000.
You may hear the saying that the "bits mean whatever you want them to mean". This means that you can come up with any number of ways to represent anything, you just build a "map" of the bits to the values you desire. For example, here is an odd, whimsical way of representing prime numbers.
(bits) => value
0001 => 2
0010 => 3
0011 => 5
0100 => 7
0101 => 11
0110 => 13
0111 => 17
(and so on)
Such a system would be hard to do math with, but it is an example that you don't have to be constrained to a specific way of doing anything. As long as you build routines to produce the expected output from the expected input, you can make the mapping of bits to values mean what ever you want it to mean.
This idea of meaning being a thing you impose upon the bits is important. When you start to deal with text, the "encoding" is the meaning imposed upon the bits storing text, with the same bits sometimes encoding different letters in different encoding schemes.

Whats the highest and the lowest integer for representing signed numbers in two's complement in 5 bits?

I understand how binary works and I can calculate binary to decimal, but I'm lost around signed numbers.
I have found a calculator that does the conversion. But I'm not sure how to find the maximum and the minumum number or convert if a binary number is not given, and question in StackO seems to be about converting specific numbers or doesn't include signed numbers to a specific bit.
The specific question is:
We have only 5 bits for representing signed numbers in two's complement:
What is the highest signed integer?
Write its decimal value (including the sign only if negative).
What is the lowest signed integer?
Write its decimal value (including the sign only if negative).
Seems like I'll have to go heavier on binary concepts, I just have 2 months in programming and I thought i knew about binary conversion.

From a logical point of view:
Bounds in signed representation
You have 5 bits, so there are 32 different combinations. It means that you can make 32 different numbers with 5 bits. On unsigned integers, it makes sense to store integers from 0 to 31 (inclusive) on 5 bits.
However, this is about unsigned integers. Meaning: we have to find a way to represent negative numbers too. Meaning: we have to store the number's value, but also its sign (+ or -). The representation used is 2's complement, and it is the one that's learned everywhere (maybe other exist but I don't know them). In this representation, the sign is given by the first bit. That is, in 2's complement representation a positive number starts with a 0 and a negative number starts with an 1.
And here the problem rises: Is 0 a positive number or a negative number ? It can't be both, because it would mean that 0 can be represented in two manners for a given number a bits (for 5: 00000 and 10000), that is we lose the space to put one more number. I have no idea how they decided, but fact is 0 is a positive number. For any number of bits, signed or unsigned, a 0 is represented with only 0.
Great. This gives us the answer to the first question: what is the upper bound for a decimal number represented in 2's complement ? We know that the first bit is for the sign, so all of the numbers we can represent must be composed of 4 bits. We can have 16 different values of 4-bits strings, and 0 is one of them, so the upper bound is 15.
Now, for the negative numbers, this becomes easy. We have already filled 16 values out of the 32 we can make on 5 bits. 16 left. We also know that 0 has already been represented, so we don't need to include it. Then we start at the number right before 0: -1. As we have 16 numbers to represent, starting from -1, the lowest signed integer we can represent on 5 bits is -16.
More generally, with n bits we can represent 2^n numbers. With signed integers, half of them are positive, and half of them are negative. That is, we have 2^(n-1) positive numbers and 2^(n-1) negative numbers. As we know 0 is considered as positive, the greatest signed integer we can represent on n bits is 2^(n-1) - 1 and the lowest is -2^(n-1)
2's complement representation
Now that we know which numbers can be represented on 5 bits, the question is to know how we represent them.
We already saw the sign is represented on the first bit, and that 0 is considered as positive. For positive numbers, it works the same way as it does for unsigned integers: 00000 is 0, 00001 is 1, 00010 is 2, etc until 01111 which is 15. This is where we stop for positive signed integers because we have occupied all the 16 values we had.
For negative signed integers, this is different. If we keep the same representation (10001 is -1, 10010 is -2, ...) then we end up with 11111 being -15 and 10000 not being attributed. We could decide to say it's -16 but we would have to check for this particular case each time we work with negative integers. Plus, this messes up all of the binary operations. We could also decide that 10000 is -1, 10001 is -2, 10010 is -3 etc. But it also messes up all of the binary operations.
2's complement works the following way. Let's say you have the signed integer 10011, you want to know what decimal is is.
Flip all the bits: 10011 --> 01100
Add 1: 01100 --> 01101
Read it as an unsigned integer: 01101 = 0*2^4 + 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 13.
10011 represents -13. This representation is very handy because it works both ways. How to represent -7 as a binary signed integer ? Start with the binary representation of 7 which is 00111.
Flip all the bits: 00111 --> 11000
Add 1: 11000 --> 11001
And that's it ! On 5 bits, -7 is represented by 11001.
I won't cover it, but another great advantage with 2's complement is that the addition works the same way. That is, When adding two binary numbers you do not have to care if they are signed or unsigned, this is the same algorithm behind.
With this, you should be able to answer the questions, but more importantly to understand the answers.
This topic is great for understanding 2's complement: Why is two's complement used to represent negative numbers?

hexadecimal U2 substraction

I want to substract 92H-64H in two's complementary and state whether carry flag bit C and overflow flag bit V are 1 or 0.
So far no problem to convert and check in decimal that it is
146-100=46=2EH
But I get lost in performing substraction to check bits bit by bit. I can imagine it's done in binary, but how? Appreciate any help!!

You have to operate in binary. Hexadecimal is no more than an easy (=less digits) way to show numbers that are binary internally.
That said, your two numbers:
92h - 64h . I assume you work with 8 bits. Translating them to binary: 1001 0010 - 0110 0100
To substract using c2, take the second number, 0110 0100
Invert its bits: 1001 1011
Add one: 1001 1011 + 1 = 1001 1100
Add this new number to the previous first number:
1001 0010
1001 1100
---------
10010 1110
The carry is the 9-th bit of this addition. In this case is 1.
The overflow bit is calculated as follows: take the 8-th bit of each number in the addition: they are 1, 1, and 0. These are the sign bits of each number.
There is overflow if both signs of operands 1 and 2 are the same, bit the sign of the result is different. In any other case, there's no overflow.
In this addition, signs of both operands are the same (1) but sign of the result is not (it's 0), so there is overflow here.
By the way, the result of this addition (taking its lower 8 bits, discarding the carry bit) is the result of the original substraction.
The result of the addition is 0010 1110, which is 2E in hexadecimal.
So 92h - 64h = 2Eh, Carry is 1, Overflow is 1.

Subtracting a large unsigned binary number from a smaller one

I'm taking a computer organization and assembly language course. The written part of our lab this week has a question on it that has me stumped. The question reads...
Subtract the following unsigned binary numbers (show the borrow and overflow bits). Do not convert to two's complement.
0101 0111 1101
-1110 1011 0110
--------------
I realize that the answer is -1001 0011 1001 but I'm having a hard time trying to figure out how to borrow to actually perform this subtraction by taking the larger number and subtracting it from the smaller number and show my work. My whole life when subtracting a large number from a small number I have reversed the problem and instead subtracted the smaller number from the larger number and added a negative sign in front of the result. I asked the professor and he says that he wants the problem solved the way that it is written. I am not allowed to solve this by subtracting the smaller number from the larger number and negating like I normally would. I haven't been able to find any examples online of subtracting a larger unsigned binary number from a smaller one.
I would really appreciate it if someone could describe to me how to perform subtraction in this scenario.
Update:
#Alex is correct. The professor was looking for
0110 1100 0111 (1735)
Thanks everyone.

You do it the same way irrespective of which number is bigger and which is smaller.
bb b bb <- borrows
0101 0111 1101 (1405)
-1110 1011 0110 (3766)
--------------
0110 1100 0111 (1735?)
Now, if you want a proper difference, you need to take into account the overflow since the above result doesn't include the sign bit:
b bb b bb <- borrows
0 0101 0111 1101 (1405)
-0 1110 1011 0110 (3766)
----------------
1 0110 1100 0111 (-2361 signed 2's complement)
Really, the CPU doesn't care what gets subtracted from what. It uses the same algorithm for integer addition/subtraction, moreover, this algorithm is the same for signed and unsigned integers. You only have to correctly interpret the result and the carry and overflow flags. That's all.

simply subtract the two binary numbers as they are, then take the 2's complement of the result. voila!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex