Converting a number to IEEE 754 - math

Can someone help me with this question:
“Convert the decimal number 10/32 to the 32-bit IEEE 754 floating point and
express your answer in hexadecimal. (Reminder: the 32 bits are used as
follows: Bit 1: sign of mantissa, bits 2-9: 8-bits of exponent in excess 127, bits 10-32: 23 bits for magnitude of mantissa.)”
I understand how to convert a decimal number to IEEE 754. But I am confused about how to answer this, since it only gives me a quotient. I am not allowed to use a calculator, so I am unsure how to work this out. Should I convert them both to binary first and divide?

10/32 = 5/16 = 5 • 2^−4 = 1.25 • 2^−2 = 1.01₂ • 2^−2.
The sign is +, the exponent is −2, and the significand is 1.01₂.
A positive sign is encoded as 0.
Exponent −2 is encoded as −2 + 127 = 125 = 01111101₂.
Significand 1.01₂ is 1.01000000000000000000000₂, and it is encoded using the last 23 bits, 01000000000000000000000₂.
Putting these together, the IEEE-754 encoding is 0 01111101 01000000000000000000000. To convert to hexadecimal, first organize into groups of four bits: 0011 1110 1010 0000 0000 0000 0000 0000. Then the hexadecimal can be easily read: 3EA00000₁₆.
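If you want to double-check a hand conversion like this, a short C program can print a float's bit pattern. A minimal sketch, using memcpy to reinterpret the bits:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = 10.0f / 32.0f;        /* the value from the exercise */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's bits */
    printf("0x%08X\n", bits);       /* prints 0x3EA00000 */
    return 0;
}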

I see it like this:
10/32 = // input
10/2^5 = // convert division by power of 2 to bitshift
1010b >> 5 =
.01010b // fractional result
--^-------------------------------------------------------------
  |
  first nonzero bit is the exponent position and start of mantissa
----------------------------------------------------------------
man = (1)010b // first one is implicit
exp = -2 + 127 = 125 // position from decimal point + bias
sign = 0 // non negative
----------------------------------------------------------------
0 01111101 01000000000000000000000 b
^ ^        ^
| |        mantissa + zero padding
| exp
sign
----------------------------------------------------------------
0011 1110 1010 0000 0000 0000 0000 0000 b
3 E A 0 0 0 0 0 h
----------------------------------------------------------------
3EA00000h
Yes, the answer of Eric Postpischil uses the same approach (+1 btw), but I didn't like the formatting, as it was not clear at first glance what to do without properly reading the text.

Giving the conversion of 10/32 without a calculator as an exercise is pure sadism.
There is a general method doable without tools, but it may be tedious.
n is the number to code. We assume n < 1.
exp = 0
mantissa = 0
repeat
    n *= 2
    exp++
    if n >= 1
        n = n - 1
        mantissa = mantissa << 1 | 1
    else
        mantissa = mantissa << 1
until mantissa is a 1 followed by 23 bits
Then you just have to encode the mantissa and the exponent (23 - exp) in IEEE format.
Note that this kind of computation frequently leads to a loop: whenever you find the same n again, you know the bit sequence will repeat.
As an example, assume we have to code 3/14
3/14 -> 6/14 e=1 m=0
6/14 -> 12/14 e=2 m=00
12/14 -> 24/14-14/14=10/14 e=3 m=001
10/14 -> 20/14-14/14=6/14 e=4 m=0011
6/14 -> 12/14 e=5 m=00110
Great, we found a loop!
6/14 -> 12/14 -> 10/14 -> 6/14.
So the mantissa will be 110 iterated as required: 110110110...
If we fill the mantissa to 24 bits, we need 26 iterations and the exponent is 23-26 = -3 (another way to get it is to note that n became ≥ 1 for the first time at iteration 3, and the exponent is -3 since 1 ≤ 3/14 * 2^3 < 2).
And we can do the IEEE-754 coding with exponent = 127 - 3 = 124 and mantissa = 1.1011011011011...
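The loop above translates directly into C. A rough sketch for a fraction p/q, kept entirely in integer arithmetic so nothing is rounded along the way (the function name and the truncating behavior are my own, not part of the answer):

#include <stdio.h>
#include <stdint.h>

/* Encode p/q (0 < p < q, q < 2^31) as an IEEE-754 single by the
   doubling method above. Sketch only: truncates instead of rounding,
   no subnormal handling. */
uint32_t encode_fraction(uint32_t p, uint32_t q)
{
    int exp = 0;                /* iteration count */
    uint32_t mantissa = 0;
    int bits = 0;               /* significant bits collected so far */
    while (bits < 24) {         /* stop at a 1 followed by 23 bits */
        p *= 2;                 /* n *= 2 */
        exp++;
        if (p >= q) {           /* n >= 1 */
            p -= q;             /* n = n - 1 */
            mantissa = mantissa << 1 | 1;
            bits++;
        } else {
            mantissa = mantissa << 1;
            if (bits > 0)       /* leading zeros don't count */
                bits++;
        }
    }
    /* unbiased exponent is 23 - exp; add the bias 127 and drop the
       implicit leading 1 of the mantissa */
    return (uint32_t)(23 - exp + 127) << 23 | (mantissa & 0x7FFFFF);
}

int main(void)
{
    printf("0x%08X\n", encode_fraction(10, 32)); /* 0x3EA00000 */
    printf("0x%08X\n", encode_fraction(3, 14));  /* 0x3E5B6DB6 truncated;
                                                    correct rounding gives
                                                    0x3E5B6DB7 */
    return 0;
}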

Related

Sum of 2 two's complement binary number

I am familiar with two's complement when performing addition in 4 bits, but I am confused when I face the question below:
"Find the sum of the 2 two's complement binary numbers 010111 and 110101 in 8-bit output"
Below is my attempt, but I am in a dilemma. Should I
(1) discard the carry, then pad with two 0s, so the answer is 00001100, which is 12 in decimal, or
(2) just add a 1 at the beginning, so the answer is 11001100, which is 204 in decimal?
Thank you!
For the two's complement in 8 bits you have to invert ALL 8 bits of the number.
As far as I know, the ones' complement and two's complement operate on the absolute value, so:
The binary number 010111 is represented in 8 bits as 00010111
C_1(00010111) = 00010111 xor 11111111 = 11101000
C_2(00010111) = C_1 + 1 = 11101000 + 1 = 11101001
The binary number 110101 is represented in 8 bits as 00110101
C_1(00110101) = 00110101 xor 11111111 = 11001010
C_2(00110101) = C_1 + 1 = 11001010 + 1 = 11001011
Now add the two twos-complements:
C_2(00010111) + C_2(00110101) = 11101001 + 11001011 = (1)10110100 -> 10110100
Please correct me if I messed something up with the sign bits (I just took the binary numbers as given, in 8 bits)...
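For what it's worth, the sum the question actually asks for comes from sign-extending each 6-bit value to 8 bits and adding, which gives 00001100 = 12, i.e. option (1). A small C sketch of that check (my own, not from the question):

#include <stdio.h>
#include <stdint.h>

/* Sign-extend a 6-bit two's complement value to 8 bits:
   copy bit 5 into bits 6 and 7. */
static uint8_t sext6(uint8_t v)
{
    return (v & 0x20) ? (v | 0xC0) : (v & 0x3F);
}

int main(void)
{
    uint8_t a = 0x17;   /* 010111 = +23 */
    uint8_t b = 0x35;   /* 110101 = -11 in 6-bit two's complement */
    uint8_t sum = (uint8_t)(sext6(a) + sext6(b)); /* carry out of bit 7 is discarded */
    printf("%02X = %d\n", sum, (int8_t)sum);      /* prints: 0C = 12 */
    return 0;
}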

What do these hex values mean?

I have a table of hex values (I think they are hex bytes?) and I'm trying to figure out what they mean in decimal.
On the website the author states that the highlighted values 43 7A 00 00 mean 250 in decimal, but when I input them into a hex-to-decimal converter I get 1132068864.
For the life of me I don't understand what's going on. I know that the labels above the highlighted values, 1 2 3 4 5 6 7 8 9 A B C D E F, are the hex digits, but I don't understand how to read the values inside the table.
Help would be appreciated!
What's happening here is that the bytes 43 7A 00 00 are not being treated as an integer. They are being treated as an IEEE-format 32-bit floating-point number. This is why the Type column in the Inspector window in the image says Float. When those bytes are interpreted in that way they do indeed represent the value 250.0.
You can read about the details of the format at https://en.wikipedia.org/wiki/Single-precision_floating-point_format
In this particular case the bytes would be decoded as:
a "sign" bit of 0, meaning that the value is a positive number
an "exponent" field containing bits 1000 0110 (or hex 86, decimal 134), meaning that the exponent has the value 7 (calculated by subtracting 127 from the raw value of the field, 134)
a "significand" field containing bits 1111 1010 0000 0000 0000 0000 (or hex FA0000, decimal 16384000)
The significand and exponent are combined according to the formula:
value = ( significand * (2 ^ exponent) ) / (2 ^ 23)
where a ^ b means "a raised to the power b" .
In this case that gives:
value = ( 16384000 * 2^7 ) / 2^23
= ( 16384000 * 128 ) / 8388608
= 250.0
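You can watch this happen in C. The sketch below assumes a little-endian machine, so the four bytes are stored in reverse order before being reinterpreted:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 43 7A 00 00 as shown in the table, stored reversed for a
       little-endian CPU */
    unsigned char bytes[4] = { 0x00, 0x00, 0x7A, 0x43 };
    float f;
    memcpy(&f, bytes, sizeof f);  /* reinterpret the bytes as a float */
    printf("%f\n", f);            /* prints 250.000000 */
    return 0;
}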

BCD to Decimal And Decimal to BCD

I am writing a library for an RTC module for Arduino where the data is stored in BCD. I know how a decimal number is converted into BCD, but I am having some problems writing it programmatically. After searching the internet I found two formulas, which are as follows and work perfectly, but I cannot understand how they calculate.
1. Formula1
DEC to BCD
(value / 10 * 16) + (value % 10)
Example
Decimal 40 converts to 0100 0000 in BCD, which read as a plain binary number equals 64.
So if I put it into the formula I get the same result:
(40/10 * 16) + (40%10)
= (4*16) + 0
= 64
BCD to DEC
(value / 16 * 10) + (value % 16)
2. Formula2
DEC to BCD
val + 6 * (val / 10)
BCD to DEC
val - 6 * (val >> 4)
If anyone can explain it in detail, it will be helpful.
Thanks to all in advance.
The correct conversion functions are:
byte bcdToDec(byte val)
{
    return (val / 16 * 10) + (val % 16);
}

byte decToBcd(byte val)
{
    return (val / 10 * 16) + (val % 10);
}
Why does this work? Let's take a single digit 5. In binary, it's
0101 = 5
Now let's take that same digit, shift it four places to the left by adding four zeroes to the right:
0101 0000 = 50 BCD
That's how BCD works. Since it takes four binary digits to represent the decimal digits 0 through 9, each single decimal digit is represented by four bits. The key is that shifting four places in binary multiplies or divides by 16, so that's the reason for the 16 values in the formulas.
So let's take 96:
0110 = 6
1001 = 9
1001 0110 = 96 BCD
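Formula 2 is the same idea written as an offset. If the value is 10*t + u (tens and units), BCD stores it as 16*t + u, which is exactly 6*t more than the decimal value; so DEC to BCD adds the correction, val + 6*(val/10), and BCD to DEC subtracts it, val - 6*(val >> 4), where val >> 4 extracts the tens digit. A quick C check that the two formulas agree over 0..99:

#include <stdio.h>

int main(void)
{
    for (int dec = 0; dec <= 99; dec++) {
        int bcd1 = (dec / 10 * 16) + (dec % 10); /* formula 1, DEC to BCD */
        int bcd2 = dec + 6 * (dec / 10);         /* formula 2, DEC to BCD */
        int back = bcd1 - 6 * (bcd1 >> 4);       /* formula 2, BCD to DEC */
        if (bcd1 != bcd2 || back != dec)
            printf("mismatch at %d\n", dec);
    }
    printf("done\n"); /* prints only "done": the formulas agree */
    return 0;
}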

Floating Point Multiplication

I got this problem I have to solve where I have to multiply two floating point numbers (16 bit), but I have no way of double-checking it. Any help is immensely appreciated.
Floating Point A: 0 11000 0111000000
Floating Point B: 0 11010 1011000000
I calculate the exponents:
A: 24-15=9
B: 26-15=11
Calculate mantissas (a & b):
(2^9 * a) * (2^11 * b) = 2^(9+11) * (a*b) = 2^20 * (a*b)
Overflow, so I increase the exponent of A to the same as B (11).
Then I shift the mantissa of A in accordance with the calculation:
1.0111 -> 0.10111 -> 0.010111.
Then I multiply to get mantissa.
      0.010111
    * 1.101100
--------------
       0000000
      0000000
     0010111
    0010111
   0000000
  0010111
 0010111
--------------
0.100110110100
I shift again.
0.100110110100 -> 1.00110110100
Decrease the exponent by 1, so it's 10.
Sign is 0, so it's a positive number.
Answer: 0 01010 00110110100.
Is it correct?
Thanks in advance!
Looks like binary16
Is it correct?: No.
Both "mantissas" or fraction need to include an implied bit as its not the minimum exponent. So .1011000000 becomes 1.1011000000.
Instead 1.01 1100 0000 * 1.10 1100 0000 = 10.0110 1101 0000 0000 0000
The change in exponent is something to consider after the multiplication, in addition to rounding.
Shift the product right to 1.0011 0110 1000 0000 0000 0 and add 1 to the unbiased exponent: 9 + 11 + 1 --> 21.
Round the product to 10 fraction bits
1.0011 0110 1000 0000 0000 0
1.0011 0110 10
Rebuild result
sign 0^0 = 0
biased exponent 21 + 15 = 36 = 100100, which overflows the 5-bit exponent field. @Yves Daoust
fraction = 0011 0110 10.
Overflow is typically set to INF 0 11111 0000000000 = infinity
I have not double-checked my work yet.
I see two problems:
you don't account for the implicit 1 as the leftmost bit of the mantissas: 1.0111000000 * 1.1011000000,
the exponent overflows and the result just cannot be represented [had the exponent underflowed, you could have denormalized].
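One way to double-check work like this without binary16 hardware is to decode the fields by hand and do the arithmetic in double precision. A rough C sketch (the decoder is my own and handles normal numbers only):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a binary16 bit pattern (normal numbers only) to double. */
static double half_to_double(uint16_t h)
{
    int sign = (h >> 15) & 1;
    int exp  = ((h >> 10) & 0x1F) - 15;       /* remove the bias of 15 */
    double frac = 1.0 + (h & 0x3FF) / 1024.0; /* restore the implied 1 */
    return (sign ? -1.0 : 1.0) * ldexp(frac, exp);
}

int main(void)
{
    double a = half_to_double(0x61C0); /* 0 11000 0111000000 = 736 */
    double b = half_to_double(0x6AC0); /* 0 11010 1011000000 = 3456 */
    double p = a * b;                  /* 2543616 */
    printf("%g * %g = %g\n", a, b, p);
    /* binary16's largest finite value is 65504, so this product
       overflows and the IEEE result is +infinity */
    printf("overflows binary16: %s\n", p > 65504.0 ? "yes" : "no");
    return 0;
}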

How to subtract IEEE 754 numbers?

How do I subtract IEEE 754 numbers?
For example: 0,546875 - 32.875...
-> 0,546875 is 0 01111110 10001100000000000000000 in IEEE-754
-> -32.875 is 1 10000111 01000101111000000000000 in IEEE-754
So how do I do the subtraction? I know I have to make both exponents equal, but what do I do after that? Take the two's complement of the -32.875 mantissa and add it to the 0.546875 mantissa?
Really not any different than how you do it with pencil and paper. Okay, a little different:
123400 - 5432 = 1.234*10^5 - 5.432*10^3
the bigger number dominates, shift the smaller number's mantissa off into the bit bucket until the exponents match
1.234*10^5 - 0.05432*10^5
then perform the subtraction with the mantissas
1.234 - 0.05432 = 1.17968
1.17968 * 10^5
Then normalize (which in this case it already is).
That was with base 10 numbers.
In IEEE float, single precision
123400 = 0x1E208 = 0b11110001000001000
11110001000001000.000...
to normalize that, we have to shift the decimal place 16 places to the left, so
1.1110001000001000 * 2^16
The exponent is biased, so we add 127 to 16 and get 143 = 0x8F. It is a positive number, so the sign bit is 0. We start to build the IEEE floating point number: the leading 1 before the decimal point is implied and not stored in single precision, so we get rid of it and keep the fraction.
sign bit, exponent, mantissa
0 10001111 1110001000001000...
0100011111110001000001000...
0100 0111 1111 0001 0000 0100 0...
0x47F10400
And if you write a program to see what a computer thinks 123400 is, you get the same thing:
0x47F10400 123400.000000
So we know the exponent and mantissa for the first operand.
Now the second operand
5432 = 0x1538 = 0b0001010100111000
Normalize, shift decimal 12 bits left
1010100111000.000
1.010100111000000 * 2^12
The exponent is biased add 127 and get 139 = 0x8B = 0b10001011
Put it all together
0 10001011 010100111000000
010001011010100111000000
0100 0101 1010 1001 1100 0000...
0x45A9C000
And a computer program/compiler gives the same
0x45A9C000 5432.000000
Now to answer your question. Using the component parts of the floating point numbers (I have restored the implied 1 here because we need it):
0 10001111 111100010000010000000000 - 0 10001011 101010011100000000000000
We have to line up our decimal places, just like in grade school, before we can subtract. In this context you shift the number with the smaller exponent right, tossing mantissa bits off the end, until the exponents match:
0 10001111 111100010000010000000000 - 0 10001011 101010011100000000000000
0 10001111 111100010000010000000000 - 0 10001100 010101001110000000000000
0 10001111 111100010000010000000000 - 0 10001101 001010100111000000000000
0 10001111 111100010000010000000000 - 0 10001110 000101010011100000000000
0 10001111 111100010000010000000000 - 0 10001111 000010101001110000000000
Now we can subtract the mantissas. If the sign bits match then we actually subtract; if they don't match then we add. They match, so this will be a subtraction.
Computers perform a subtraction using addition logic, inverting the second operand on the way into the adder and asserting the carry-in bit, like this:
                           1
    111100010000010000000000
  + 111101010110001111111111
  ==========================
And now, just like with paper and pencil, let's perform the add:
   1111000100000111111111111
    111100010000010000000000
  + 111101010110001111111111
  ==========================
    111001100110100000000000
or do it with hex on your calculator
111100010000010000000000 = 1111 0001 0000 0100 0000 0000 = 0xF10400
111101010110001111111111 = 1111 0101 0110 0011 1111 1111 = 0xF563FF
0xF10400 + 0xF563FF + 1 = 0x1E66800
1111001100110100000000000 = 1 1110 0110 0110 1000 0000 0000 = 0x1E66800
A little bit about how the hardware works: since this was really a subtract using the adder, we also invert the carry-out bit (or on some computers they leave it as is); inverted, the carry out is a borrow. So that carry out of 1 is a good thing, and we basically discard it. Had it been a carry out of 0 (a borrow), we would have needed more work. We don't have a borrow, so our answer is really 0xE66800.
Very quickly lets see that another way, instead of inverting and adding one lets just use a calculator
111100010000010000000000 - 000010101001110000000000 =
0xF10400 - 0x0A9C00 =
0xE66800
By trying to visualize it I perhaps made it worse. The result of subtracting the mantissas is 111001100110100000000000 (0xE66800); there was no movement in the most significant bit, so we end up with a 24-bit number with an msbit of 1, and no normalization is needed. To normalize, you shift the mantissa left or right until the most significant 1 lines up in the leftmost of the 24 bits, adjusting the exponent for each bit shifted.
Now, stripping the leading 1. off the answer, we put the parts together:
0 10001111 11001100110100000000000
01000111111001100110100000000000
0100 0111 1110 0110 0110 1000 0000 0000
0x47E66800
If you have been following along by writing a program to do this, I did as well. This program violates the C standard by using a union in an improper way. I got away with it with my compiler on my computer; don't expect it to work all the time.
#include <stdio.h>

union
{
    float f;
    unsigned int u;
} myun;

int main ( void )
{
    float a,b,c;
    a=123400;
    b=5432;
    c=a-b;
    myun.f=a; printf("0x%08X %f\n",myun.u,myun.f);
    myun.f=b; printf("0x%08X %f\n",myun.u,myun.f);
    myun.f=c; printf("0x%08X %f\n",myun.u,myun.f);
    return(0);
}
And our result matches the output of the above program; we got 0x47E66800 doing it by hand:
0x47F10400 123400.000000
0x45A9C000 5432.000000
0x47E66800 117968.000000
If you are writing a program to synthesize the floating point math, your program can perform the subtract directly; you don't have to do the invert-and-add-one thing, which overcomplicates it, as we saw above. If you get a negative result, though, you need to fix up the sign bit, invert your result, then normalize.
So:
1) Extract the parts: sign, exponent, mantissa.
2) Align your decimal places by sacrificing mantissa bits from the number with the smaller exponent: shift that mantissa to the right until the exponents match.
3) This being a subtract operation, if the sign bits are the same then you perform a subtract; if the sign bits are different, you perform an add of the mantissas.
4) If the result is zero, then your answer is zero; encode the IEEE value for zero as the result. Otherwise:
5) Normalize the number: shift the answer right or left (the result of a 24-bit add/subtract can be 25 bits, and a subtract can need a dramatic shift to normalize, either one bit right or many bits left) until you have a 24-bit number with the most significant 1 left-justified. 24 bits is for single precision. The more correct way to define normalizing is to shift left or right until the number resembles 1.something: if you had 0.001 you would shift left 3; if you had 11.10 you would shift right 1. A shift left decreases your exponent; a shift right increases it. No different than when we converted from integer to float above.
6) For single precision, remove the leading 1. from the mantissa. If the exponent has overflowed, the result is encoded as infinity. If the sign bits were different and you performed an add, then you have to deal with figuring out the result's sign bit. If, as above, everything is fine, you just place the sign bit, exponent, and mantissa into the result.
Multiply and divide are different; you asked about subtract, so that is all I covered.
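Here is a rough C sketch of those six steps for the easy case: both operands positive, normal, and a >= b, truncating instead of rounding. It mirrors the recipe above rather than being a complete IEEE implementation:

#include <stdio.h>
#include <stdint.h>

/* Subtract two positive, normal single-precision values (a >= b),
   given as raw bit patterns. Sketch only: no rounding, no underflow
   or overflow handling, assumes the exponent difference is < 32. */
uint32_t f32_sub(uint32_t a, uint32_t b)
{
    /* 1) extract the parts: exponent and mantissa (signs assumed 0) */
    int ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    uint32_t ma = (a & 0x7FFFFF) | 0x800000;   /* restore the implied 1 */
    uint32_t mb = (b & 0x7FFFFF) | 0x800000;

    /* 2) align: shift the smaller-exponent mantissa right */
    mb >>= (ea - eb);

    /* 3) same signs, so subtract the mantissas */
    uint32_t m = ma - mb;

    /* 4) exact zero result */
    if (m == 0)
        return 0;

    /* 5) normalize: shift left until the leading 1 is in bit 23,
          decreasing the exponent for each shift */
    while ((m & 0x800000) == 0) {
        m <<= 1;
        ea--;
    }

    /* 6) strip the implied 1 and reassemble */
    return (uint32_t)ea << 23 | (m & 0x7FFFFF);
}

int main(void)
{
    /* 123400.0f = 0x47F10400, 5432.0f = 0x45A9C000 */
    printf("0x%08X\n", f32_sub(0x47F10400, 0x45A9C000)); /* 0x47E66800 */
    return 0;
}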
I'm presuming 0,546875 means 0.546875.
Firstly, to correct/clarify:
0 01111110 10001100000000000000000 = 0011 1111 0100 0110 0000 0000 0000 0000 =
0x3F460000 in IEEE-754 is 0.77343750, not 0.546875.
0.546875 in IEEE-754 is 0x3F0C0000 = 0011 1111 0000 1100 0000 0000 0000 0000 =
0 01111110 00011000000000000000000 = 1 x 1.00011 x 2^(01111110 - 127) =
1.00011 x 2^(126 - 127) = 1.00011 x 2^-1 = (1 + 1/16 + 1/32) x 1/2.
1 10000111 01000101111000000000000 = 1100 0011 1010 0010 1111 0000 0000 0000 =
0xc3a2f000 in IEEE-754 is -325.87500, not -32.875.
-32.875 in IEEE-754 is 0xC2038000 = 1100 0010 0000 0011 1000 0000 0000 0000 =
1 10000100 00000111000000000000000 = -1 x 1.00000111 x 2^(10000100 - 127) =
-1.00000111 x 2^(132 - 127) = -1.00000111 x 2^5 = (1 + 1/64 + 1/128 + 1/256) x -32.
32.875 in IEEE-754 is 0x42038000 = 0100 0010 0000 0011 1000 0000 0000 0000 =
0 10000100 00000111000000000000000 = 1 x 1.00000111 x 2^(10000100 - 127) =
1.00000111 x 2^(132 - 127) = 1.00000111 x 2^5 = (1 + 1/64 + 1/128 + 1/256) x 32.
The subtraction is carried out as follows:
1.00011000 x 1/2
- 1.00000111 x 32
------------------
==>
0.00000100011 x 32
- 1.00000111000 x 32
---------------
==>
-1 x (
1.00000111000 x 32
- 0.00000100011 x 32
---------------
)
==>
-1 x (
1.00000110112 x 32 // borrow (the trailing 2 is a digit two, from borrowing)
- 0.00000100011 x 32
---------------
)
==>
-1 x (
1.00000110112 x 32
- 0.00000100011 x 32
---------------
1.00000010101 x 32
)
==>
-1.00000010101 x 32 =
-1.00000010101000000000000 x 32 =
-1.00000010101000000000000 x 2^5 =
-1.00000010101000000000000 x 2^(132 - 127) =
-1.00000010101000000000000 x 2^(10000100 - 127)
==>
1 10000100 00000010101000000000000 =
1100 0010 0000 0001 0101 0000 0000 0000 =
0xc2015000
Note that in this example we did not need to handle underflow, which is more complicated.
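As a cross-check, you can let the hardware do the same subtraction and print the bits; memcpy avoids the union trick used in the earlier answer:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float c = 0.546875f - 32.875f;  /* -32.328125 */
    uint32_t bits;
    memcpy(&bits, &c, sizeof bits); /* reinterpret the float's bits */
    printf("0x%08X %f\n", bits, c); /* prints: 0xC2015000 -32.328125 */
    return 0;
}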
