Converting Binary Float To Decimal On Paper. Mantissa begins with 0 - math

Good afternoon all. I wasn't exactly sure where to post this question, so I apologize if this is the wrong thread. I am currently taking a Discrete Mathematics course, and I initially thought I understood binary float to decimal conversion rather well from a previous course. However, today while doing some practice work using arbitrary sizes, I came across a problem that I must understand.
For the sake of easy math, I am going to use a 1-bit sign, a 3-bit exponent (with a bias of 4 instead of 127) and a 4-bit mantissa.
I have this number: 0 010 0100. Seems easy enough, and it probably is to all you experts.
I know the first bit 0 is the sign bit, this number is positive.
I also know that the next 3 bits are the exponent bits. 010 represents 2. For this problem I am using a bias of 4 instead of 127, so I do 2 - 4 = -2. I will shift the invisible decimal point two places to the left on the mantissa.
Here is my question. This mantissa starts with 0 instead of a 1. So is the "invisible" decimal point before or after that 1?
Basically what I am asking is: before shifting the decimal point, is the mantissa 0.100 or 1.00? Oddly enough, with all the floating point questions asked on exams in my previous classes, I don't believe I ever came across this problem. Perhaps the professors were being kind to us by giving us easy scenarios.
I always thought that the mantissa is "normalized", so I should see this mantissa as 1.000 before shifting to the left twice to get .01000, which becomes .25 in decimal. But now I am not so sure.
Thanks for your time all!

For normal float formats, there is an implied leading one which is not encoded in the mantissa bits. So your mantissa would actually be 1.0100 in binary.
For more info, see IEEE_754-1985.
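To make the implied leading 1 concrete, here is a small Python sketch (the function name is my own) that decodes the question's 1-3-4 format; it assumes a normalized value, since the tiny format as described has no subnormals:

```python
def decode_minifloat(bits):
    """Decode an "s eee mmmm" string: 1 sign bit, 3-bit exponent with a
    bias of 4, 4-bit mantissa with an implied leading 1."""
    s, e, m = bits.split()
    sign = -1 if s == "1" else 1
    exponent = int(e, 2) - 4          # remove the bias of 4
    significand = 1 + int(m, 2) / 16  # implied "1." plus four fraction bits
    return sign * significand * 2 ** exponent

print(decode_minifloat("0 010 0100"))  # 1.0100 * 2^-2 = 0.3125
```

So the asker's number works out to +0.3125 decimal, not 0.25.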

Related

Basic Floating Point Questions?

I read the following example in a book:
I think the last representation should be { 0 | 1 0 0 0 1 | 1 1 0 1 1 1 0 }, because 11.10111010 in normalized form is 1.110111010. Is anything wrong here?
These are all the same number: 0.111*2^2, 1.11*2^1, 11.1*2^0. It looks like the example format doesn't have a hidden bit and places the binary point at the left.
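A quick Python check (my own addition) that the three forms really are the same value:

```python
# Same bits, different scalings: 0.111 * 2^2, 1.11 * 2^1 and 11.1 * 2^0.
a = 0b111 / 2**3 * 2**2
b = 0b111 / 2**2 * 2**1
c = 0b111 / 2**1 * 2**0
print(a, b, c)  # all three print 3.5
```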
You have to take the same convention forwards as backwards. The example does not use the implicit leading 1/hidden bit, and is entirely consistent about that.
To demonstrate the contrary option, if your normalization includes an implicit leading 1/hidden bit, then the addend/augend and result should be
111.001000
+ 1.10011010
--------------
1000.10111010
leading to the binary encoded result
0|10011|00010111
Looking at the answers and comments, I think there may be a basic misunderstanding of the term "normalization". Normalization does not imply a hidden bit.
It only means that the most significant non-zero digit will have a specific position relative to the radix point. For example, in decimal, 1 might be represented as 100*10^-2, 10*10^-1, 1*10^0, 0.1*10^1, etc. A normalized system might require the use of, e.g., 0.1*10^1, putting the "1" digit immediately to the right of the decimal point.
In a binary normalized system, one of the bits is known to be one. Not storing that bit is a common choice, but is not required by being a normalized system.
In the case of the example, it is clear from how the inputs were expressed in the summation that there is no hidden bit, the normalized form has the most significant bit immediately to the right of the binary point, and unbiased exponent 0 is represented as 10000.
Binary 11.10111010 is equal to binary 0.1110111010 with unbiased exponent decimal 2, binary 10. That makes the biased exponent 10010, and the significand the leftmost bits of 1110111010.
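As a sketch of that encoding step in Python (the function name and string layout are my own; it assumes a value of at least 1, written in binary with a point):

```python
def encode(value_bits):
    """Encode per the book's convention: no hidden bit, most significant
    bit immediately to the right of the binary point, 5-bit exponent
    biased by 16, 7-bit significand. E.g. "11.10111010"."""
    intpart, frac = value_bits.split(".")
    digits = intpart + frac                # all bits, point removed
    exponent = len(intpart)                # 11.10111010 = 0.1110111010 * 2^2
    biased = format(exponent + 16, "05b")  # bias 16: unbiased 0 -> 10000
    significand = digits[:7]               # keep only the leftmost 7 bits
    return "0|" + biased + "|" + significand

print(encode("11.10111010"))  # 0|10010|1110111
```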

What determines which system is used to translate a base 10 number to binary and vice-versa?

There are a lot of ways to store a given number in a computer. This site lists 5:
unsigned
sign magnitude
one's complement
two's complement
biased (not commonly known)
I can think of another: encode everything in ASCII and write the number with the negative sign (45) and period (46) if needed.
I'm not sure if I'm mixing apples and oranges, but today I heard how computers store numbers using single and double precision floating point format. In this, everything is written as a power of 2 multiplied by a fraction. This means numbers that aren't powers of 2, like 9, are written as a power of 2 multiplied by a fraction, e.g. 9 ➞ 16*9/16. Is that correct?
Who decides which system is used? Is it up to the hardware of the computer or the program? How do computer algebra systems handle transcendental numbers like π on a finite machine? It seems like things would be a lot easier if everything were coded in ASCII with the negative sign and the decimal point placed accordingly, e.g. -15.2 would be 45 49 53 46 50 (in base 10)
➞
101101 110001 110101 101110 110010
Well, there are many questions here.
The main reason the system you imagined is bad is the lack of entropy. An ASCII character is 8 bits, so instead of 2^32 possible integers, you could represent only 4 characters in 32 bits, i.e. 10^4 integer values (plus 10^3 negative ones if you want). Even if you reduce to 12 codes (0-9, -, .), you still need 4 bits to store each one, giving 10^8 + 10^7 integer values, still much less than 2^32 (remember, 2^10 ~ 10^3). Using binary is optimal because our bits have only 2 values. Any base that is a power of 2 also makes sense, hence octal and hex -- but ultimately they're just binary with bits packed in groups of 3 or 4 for readability. If you forget about the sign (just use one bit) and the decimal separator, you get BCD: Binary Coded Decimal, usually coded as 4 bits per digit, though an 8-bits-per-digit version called uncompressed BCD also exists. With a bit of research you can find fixed- or floating-point formats built on BCD.
Putting the sign in front is exactly sign magnitude (without the entropy problem, since it has a constant size of 1 bit).
You're roughly right about the fraction in floating point numbers. These numbers are written with a mantissa m and an exponent e, and their value is m * 2^e. If you represent an integer that way, say 8, it would be 1*2^3, so the fraction is 1 = 8/2^3. With 9 that fraction is not exactly representable, so instead of 1 we write the closest number we can with the available bits. That is also what we do with irrational (including transcendental) numbers like π: we approximate.
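Python's math.frexp makes this concrete; it returns exactly the fraction-and-exponent split described above (with the fraction in [0.5, 1)):

```python
import math
from fractions import Fraction

# 9 as a fraction times a power of 2:
m, e = math.frexp(9.0)
print(m, e)          # 0.5625 4, i.e. 9 = (9/16) * 2^4
print(Fraction(m))   # 9/16, the exact fraction from the question

# Pi cannot be represented exactly, so the stored double is an approximation:
print(Fraction(math.pi))  # an exact fraction close to, but not equal to, pi
```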
You're not solving anything with this system, even for floating point values. The denominator is going to be a power of 10 instead of a power of 2, which seems more natural to you because it is the usual way we write rounded numbers, but it is not in any way more valid* or more accurate.** Take 1/6 for example: you cannot represent it with a finite number of digits in the form a/10^b.
The most popular representation for negative numbers is 2's complement, because of its nice properties when adding negative and positive numbers.
Standards committees (which argue a lot internally and eventually) decide what complex number formats like floating point look like, and how to treat corner cases consistently. E.g. should dividing by 0 yield NaN? Infinity? An exception? You should check out the IEEE: www.ieee.org . Some committees have not reached agreement yet, for example on how to represent intervals for interval arithmetic. Ultimately it's the people who make the processors who get the final word on how bits are interpreted as a number, but sticking to standards allows portability and compatibility between different processors (or coprocessors -- what if your GPU used a different number format? You'd have more to do than just copy data around).
Many alternatives to floating point values exist, like fixed point or arbitrary precision numbers, logarithmic number systems, rational arithmetic...
* Since 2 divides 10, you might argue that every number representable as a/2^b can be written as a*5^b/10^b, so fewer numbers need to be approximated. But that covers only a minuscule family (an ideal, really) of the rationals, which are an infinite set. So it still doesn't remove the need to approximate many rationals, as well as all irrationals (such as π).
** In fact, because we use powers of 2, we pack more significant digits after the decimal separator than we would with powers of 10 (for the same number of bits). That is, 2^-(53+e), the smallest bit of the mantissa of a double with exponent e, is much smaller than what you can reach with 53 bits of ASCII or 4-bit base-10 digits: at best 10^-4 * 2^-e.

How do you perform floating point arithmetic on two floating point numbers?

Suppose I wanted to add, subtract, and/or multiply the following two floating point numbers that follow the format:
1 bit sign
3 bit exponent (bias 3)
6 bit mantissa
Can someone briefly explain how I would do that? I've tried searching online for helpful resources, but I haven't been able to find anything too intuitive. However, I know the procedure is generally supposed to be very simple. As an example, here are two numbers that I'd like to perform the three operations on:
0 110 010001
1 010 010000
To start, take the significand encoding and prefix it with a “1.”, and write the result with the sign determined by the sign bit. So, for your example numbers, we have:
+1.010001
-1.010000
However, these have different scales, because they have different exponents. The exponent of the second one is four less than the first (binary 010 compared to 110). So shift its significand right by four bits:
+1.010001
- .0001010000
Now both significands have the same scale (exponent binary 110), so we can perform normal arithmetic, in binary:
+1.010001
- .0001010000
_____________
+1.0011000000
Next, round the significand to the available bits (seven). In this case, the trailing bits are zero, so the rounding does not change anything:
+1.001100
At this point, we could have a significand that needed more shifting, if it were greater than 2 (binary 10) or less than 1. However, this significand is just where we want it, between 1 and 2, so we can keep the exponent as is (binary 110).
Convert the sign back to a bit, take the leading “1.” off the significand, and put the bits together:
0 110 001100
Exceptions would arise if the number overflowed or underflowed the normal exponent range, but those did not happen here.
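The whole walk-through can be replayed exactly with Python's Fraction type (a sketch of my own that only handles normal, exactly-representable results, so no rounding is implemented):

```python
from fractions import Fraction

def decode(bits):
    # "s eee mmmmmm": 1 sign bit, 3-bit exponent (bias 3), 6-bit mantissa
    # with an implied leading 1.
    s, e, m = bits.split()
    sign = -1 if s == "1" else 1
    return sign * (1 + Fraction(int(m, 2), 64)) * Fraction(2) ** (int(e, 2) - 3)

def encode(x):
    sign = "1" if x < 0 else "0"
    x, e = abs(x), 0
    while x >= 2:                      # normalize into [1, 2)
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    m = int((x - 1) * 64)              # 6 mantissa bits (exact here)
    return f"{sign} {e + 3:03b} {m:06b}"

a, b = decode("0 110 010001"), decode("1 010 010000")
print(a, b, a + b)   # 81/8 -5/8 19/2, i.e. 10.125 - 0.625 = 9.5
print(encode(a + b)) # 0 110 001100, matching the worked answer
```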

Where did "leading 1 means negative number" in signed int arise from?

I have read a number of articles saying that 2's complement is mostly used to represent negative numbers in a signed integer, and that it is the best method.
However, for some reason I have this (below) stuck in my head, and I can't get rid of it without knowing its history:
"Use the leading bit as 1 to denote negative numbers when using signed int."
I have read many posts online and on Stack Overflow saying that 2's complement is the best way to represent negative numbers. But my question is not about the best way; it is about the history: where did the "leading bit" concept arise, and where did it go?
P.S: Also it is just not me, a bunch of other folks were also getting confused with this.
Edit - 1
The so-called leading-1 method I mentioned is described with an example in this post:
Why is two's complement used to represent negative numbers?
Now I understand: an MSB of 1 signifies negative numbers. This falls out of the nature of 2's complement, not any special scheme.
E.g. without looking at the 1st bit, we can't say whether 1011 represents -5 or +11.
Thanks to:
jamesdlin, Oli Charlesworth, Mr Lister for asking probing questions that made me realize the correct answer.
Rant:
I think there are a bunch of groups/folks who have been taught or led to think (incorrectly) that 1011 evaluates to -3, with 1 denoting "-" and 011 denoting 3.
The folks who ask "what my question was..." were probably taught the correct 2's complement way from the first time they learnt it, and weren't exposed to these wrong answers.
There are several advantages to the two's-complement representation for signed integers.
Let's assume 16 bits for now.
Non-negative numbers in the range 0 to 32,767 have the same representation in both signed and unsigned types. (Two's-complement shares this feature with ones'-complement and sign-and-magnitude.)
Two's-complement is easy to implement in hardware. For many operations, you can use the same instructions for signed and unsigned arithmetic (if you don't mind ignoring overflow). For example, -1 is represented as 1111 1111 1111 1111, and +1 as 0000 0000 0000 0001. If you add them, ignoring the fact that the high-order bit is a sign bit, the mathematical result is 1 0000 0000 0000 0000; dropping all but the low-order 16 bits gives you 0000 0000 0000 0000, which is the correct signed result. Interpreting the same operation as unsigned, you're adding 65535 + 1 and getting 0, which is the correct unsigned result (with wraparound modulo 65536).
You can think of the leading bit, not as a "sign bit", but as just another value bit. In an unsigned binary representation, each bit represents 0 or 1 multiplied by the place value, and the total value is the sum of those products. The lowest bit's place value is 1, the next bit's is 2, then 4, etc. In a 16-bit unsigned representation, the high-order bit's place value is 32768. In a 16-bit signed two's-complement representation, the high-order bit's place value is -32768. Try a few examples, and you'll see that everything adds up nicely.
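That place-value reading can be checked directly (a small sketch of my own):

```python
def from_bits16(bits):
    # 16-bit two's complement: the top bit weighs -32768, the rest as usual.
    weights = [-32768] + [2 ** i for i in range(14, -1, -1)]
    return sum(w * int(b) for w, b in zip(weights, bits))

print(from_bits16("1111111111111111"))  # -1
print(from_bits16("1000000000000000"))  # -32768
print(from_bits16("0111111111111111"))  # 32767
```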
See Wikipedia for more information.
It's not just about the leading bit. It's about all the bits.
Starting with addition
First let's look at how addition is done in 4-bit binary for 2 + 7:
  10 +
 111
____
1001
It's the same as long addition in decimal: bit by bit, right to left.
In the rightmost place we add 0 and 1, it makes 1, no carry.
In the second place from the right, we add 1 and 1, that makes 2 in decimal or 10 in binary - we write the 0, carry the 1.
In the third place from the right, we add the 1 we carried to the 1 already there, it makes binary 10. We write the 0, carry the 1.
The 1 that just got carried gets written in the fourth place from the right.
Long subtraction
Now we know that binary 10 + 111 = 1001, we should be able to work backwards and prove that 1001 - 10 = 111. Again, this is exactly the same as in decimal long subtraction.
1001 -
  10
____
 111
Here's what we did, working right to left again:
In the rightmost place, 1 - 0 = 1, we write that down.
In the second place, we have 0 - 1, so we need to borrow an extra bit. We now do binary 10 - 1, which leaves 1. We write this down.
In the third place, remember we borrowed an extra bit - so again we have 0 - 1. We use the same trick to borrow an extra bit, giving us 10 - 1 = 1, which we put in the third place of the result.
In the fourth place, we again have a borrowed bit to deal with. Subtract the borrowed bit from the 1 already there: 1 - 1 = 0. We could write this down in front of the result, but since it's the end of the subtraction there's no need.
There's a number less than zero?!
Do you remember how you learnt about negative numbers? Part of the idea is that you can subtract any number from any other number and still get a number. So 7 - 5 is 2; 6 - 5 is 1; 5 - 5 is 0; What is 4 - 5? Well, one way to reason about such numbers is simply to apply the same method as above to do the subtraction.
As an example, let's try 2 - 7 in binary:
     10 -
    111
_______
...1011
I started in the same way as before:
In the rightmost place, subtract 1 from 0, which requires a borrowed bit. 10 - 1 = 1, so the last bit of the result is 1.
In the second-rightmost place, we have 1 - 1 with an extra borrow bit, so we have to subtract another 1. This means we need to borrow our own bit, giving 11 - 1 - 1 = 1. We write 1 in the second-rightmost spot.
In the third place, there are no more bits in the top number! But we know we can pretend there's a 0 in front, just like we would do if the bottom number ran out of bits. So we have 0 - 1 - 1 because of the borrow bit from second place. We have to borrow a bit again! Anyway we have 10 - 1 - 1 = 0, which we write down in the third place from the right.
Now something very interesting has happened - both operands of the subtraction have no more digits, but we still have a borrow bit to take care of! Oh well, let's just carry on as we have been doing. We have 0 - 0, since neither the top nor the bottom operand has any bits here, but because of the borrow bit it's actually 0 - 1.
(We have to borrow again! If we keep borrowing like this we'll have to declare bankruptcy soon.)
Anyway, we borrow the bit, and we get 10 - 1 = 1, which we write in the fourth place from the right.
Now anyone with half a mind is about to see that we are going to keep borrowing bits until the cows come home, because there ain't no more bits to go around! We ran out of them two places ago if you forgot. But if you tried to keep going it'd look like this:
...00000010
...00000111
___________
...11111011
In the fifth place we get 0 - 0 - 1, and we borrow a bit to get 10 - 0 - 1 = 1.
In the sixth place we get 0 - 0 - 1, and we borrow a bit to get 10 - 0 - 1 = 1.
In the seventh place we get 0 - 0 - 1, and we borrow a bit to get 10 - 0 - 1 = 1.
...And so it goes on for as many places as you like. By the way, we just derived the two's complement binary form of -5.
You could try this for any pair of numbers you like, and generate the two's complement form of any negative number. If you try to do 0 - 1, you'll see why -1 is represented as ...11111111. You'll also realise why all two's complement negative numbers have a 1 as their most significant bit (the "leading bit" in the original question).
In practice, your computer doesn't have infinitely many bits to store negative numbers in, so it usually stops after some more reasonable number, like 32. What do we do with the extra borrow bit in position 33? Eh, we just quietly ignore it and hope no one notices. When someone does notice that our new number system doesn't always work, we call it integer overflow.
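The "stop after 4 bits and drop the final borrow" idea looks like this in Python (my own illustration):

```python
raw = (2 - 7) % 16          # 2 - 7 with 4-bit wraparound
print(format(raw, "04b"))   # 1011, the bit pattern derived above

# Reading 1011 back as a signed 4-bit value (top bit set means negative):
signed = raw - 16 if raw >= 8 else raw
print(signed)               # -5
```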
Final notes
This isn't the only way to make our number system work, of course. After all, if I owe you $5, I wouldn't say that your current balance with me was $...999999995.
But there are some cool things about the system we just derived, like the fact that subtraction gives you the right result in this system even if you ignore the fact that one of the numbers is negative. Normally, we have to think about subtractions with conditional steps: to calculate 2 - 7, we first have to figure out that 2 is less than 7, so instead we calculate 7 - 2 = 5, and then stick a minus sign in front to get 2 - 7 = -5. But with two's complement we just go ahead and do the subtraction without caring which number is bigger, and the right result comes out by itself. And others have mentioned that addition works nicely, and so does multiplication.
You don't use the leading bit, per se. For instance, in an 8-bit signed char,
11111111
represents -1. You can test the leading bit to determine if it is a negative number.
There are a number of reasons to use 2's complement, but the first and greatest is convenience. Take the above number and add 2. What do we end up with?
00000001
You can add and subtract 2's complement numbers basically for free. This was a big deal historically, because the logic is very simple; you don't need dedicated hardware to handle signed numbers. You use fewer transistors, need a less complicated design, etc. It goes back to before 8-bit microprocessors, which didn't even have built-in multiply instructions (even many 16-bit ones didn't, such as the 65C816 used in the Apple IIGS and Super NES).
With that said, multiplication is relatively trivial with 2's complement also, so that's no big deal.
Complements (including things like nines' complement in decimal mechanical calculators, adding machines, and cash registers) have been around forever. In nines' complement with four decimal digits, for instance, values in the range 0000..4999 are positive while values in 5000..9999 are negative. See http://en.wikipedia.org/wiki/Method_of_complements for details.
This directly gives rise to 1s complement in binary, and in both 1s and 2s complement, the topmost bit acts as a "sign bit". This does not explain exactly how computers moved from ones' complement to two's complement (I use Knuth's apostrophe convention when spelling these out as words with apostrophes, by the way). I think it was a combination of luck, irritation at "negative zero", and the way ones' complement requires end-around carry (vs two's complement, not requiring it).
In a logical sense, it does not matter which bit you use to represent signs, but for practical purposes, using the top bit, and two's complement, simplifies the hardware. Back when transistors were expensive, this was pretty important. (Or even tubes, although I think most if not all vacuum-tube computers used ones' complement. In any case they predated the C language by rather a lot.)
In summary, the history goes back way before electronic computers and the C language, and there was no reason to change from a good way of implementing this mechanically, when converting from mechanical calculators to vacuum-tube ENIACs to transistorized computers and then on to "chips", MSI, LSI, VLSI, and onward.
Well, it had to work such that 2 plus -2 gives zero. Early CPUs had hardware addition and subtraction and someone noticed that by complementing all the bits (one's complement, the original system), to change the "sign" of the value, it allowed the existing addition hardware to work properly—except that sometimes the result was negative zero. (What is the difference between -0 and 0? On such machines, it was indeterminate.)
Someone soon realized that by using two's complement (convert a number between negative and positive by inverting the bits and adding one), the negative-zero problem was avoided.
So really, it is not just the sign bit which is affected by negatives, but all of the bits except the LSB. However, by examining the MSB, one can immediately determine whether the signed value there is negative.
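Both points (testing the MSB, and addition working with no special signed hardware) are easy to check for 8-bit values in Python (my own sketch):

```python
def is_negative_8bit(byte):
    # The MSB alone tells you whether the 8-bit pattern reads as negative.
    return (byte & 0x80) != 0

print(is_negative_8bit(0b11111111))  # True: this pattern is -1
print(is_negative_8bit(0b00000001))  # False: +1

# Addition works unchanged: -1 + 2 wraps around to +1 in 8 bits.
print((0b11111111 + 2) & 0xFF)       # 1
```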

Floating Point Algorithms in C

I have been thinking recently about how floating point math works on computers, and it is hard for me to understand all the technical details behind the formulas. I need to understand the basics of addition, subtraction, multiplication, division and remainder. With these I will be able to build trig functions and formulas.
I can guess something about it, but it's a bit unclear. I know that a fixed point number can be made by splitting a 4-byte integer into a sign flag, a radix and a mantissa. With this we have a 1-bit flag, a 5-bit radix and a 10-bit mantissa. A word of 32 bits is perfect for a floating point value :)
To add two floats, can I simply add the two mantissas and add the carry to the 5-bit radix? Is this a way to do floating point math (or fixed point math, to be honest), or am I completely wrong?
All the explanations I have seen use formulas, multiplications, etc., and they look very complex for something I guess should be simpler. I need an explanation directed more at beginning programmers and less at mathematicians.
See Anatomy of a floating point number
The radix depends on the representation: if you use radix r=2, you can never change it; the number doesn't even contain any data telling you which radix it has. I think you're wrong and you mean the exponent.
To add two numbers in floating point you must make one exponent equal to the other by shifting the mantissa. One bit right means exponent +1, and one bit left means exponent -1; when the numbers have the same exponent, you can add them.
Value(x) = mantissa * radix ^ exponent
adding these two numbers
101011 * 2 ^ 13
001011 * 2 ^ 12
would be the same as adding:
101011 * 2 ^ 13
000101 * 2 ^ 13
After making the exponents equal, you can operate.
You also have to know whether the representation has an implicit bit; I mean that the most significant bit must be a 1, so usually, as in the IEEE standard, it is known to be there but isn't represented, although it is used when operating.
I know this can be a bit confusing and I'm not the best teacher, so if you have any doubts, just ask.
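The alignment step described above can be sketched as follows (the function name and tuple representation are my own; shifted-out bits are simply dropped, with no rounding):

```python
def align_and_add(m1, e1, m2, e2):
    # Each number is mantissa * 2^exponent; shift the mantissa of the
    # smaller-exponent operand right until the exponents match, then add.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= e1 - e2          # each right shift raises the exponent by one
    return m1 + m2, e1

m, e = align_and_add(0b101011, 13, 0b001011, 12)
print(bin(m), e)            # 0b110000 13: 101011 + 000101 at exponent 13
```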
Run, don't walk, to get Knuth's Seminumerical Algorithms, which contains wonderful intuition and algorithms for multiprecision and floating point arithmetic.
