decimal to floating point system. - math

i've been asked to work on the following question with the following specification/ rules...
Numbers are held in 16 bits split from left to right as follows:
1 bit sign flag that should be set for negative numbers and otherwise clear.
7 bit exponent held in Excess 63
8 bit significand, normalised to 1.x with only the fractional part stored – as in IEEE 754
Giving your answers in hexadecimal, how would the number -18 be represented in this system?
the answer is got is: 11000011 00100000 (or C320 in hexadecimal)
using the following method:
-18 decimal is a negative number so we have the sign bit set to 1.
18 in binary would be 0010010. This we could note down as 10010. We know work on what’s on the right side of the decimal point but in this case we don’t have any decimal point or fractions so we note down 0000 0000 since there are no fractions. We now write down the binary of 18 and the remainder zeroes (which are not necessarily required) and separate them with a decimal point as shown below:
10010.00000000
We now normalise this into the form 1.x by moving the decimal point and placing it between the first and second number (counting the amount of times we move the decimal point until it reaches that area). The result now is 1.001000000000 x 2^4 and we also know that the decimal point has been moved 4 times which for now we will consider to be our exponent value. The floating point system we are using has 7 bit exponent and uses excess 63. The exponent is 4 in excess 63 which would equal to 63 + 4 = 67 and this in 7 bit binary is shown as 1000011.
The sign bit is: 1 (-ve)
Exponent is: 1000011
Significand is 00100…
The binary representation is: 11000011 00100000 (or C320 in hexadecimal)
please let me know if it's correct or if i've done it wrong and what changes could be applied. thank you guy :)

Since you seem to have been assigned a lot of questions of this type, it may be useful to write an automated answer checker to validate your work. I've put together a quick converter in Python:
def convert_from_system(x):
#retrieve the first eight bits, and add a ninth bit to the left. This bit is the 1 in "1.x".
significand = (x & 0b11111111) | 0b100000000
#retrieve the next seven bits
exponent = (x >> 8) & 0b1111111
#retrieve the final bit, and determine the sign
sign = -1 if x >> 15 else 1
#add the excess exponent
exponent = exponent - 63
#multiply the significand by 2^8 to turn it from 1.xxxxxxxx into 1xxxxxxxx, then divide by 2^exponent to get back the decimal value.
result = sign * (significand / float(2**(8-exponent)))
return result
for value in [0x4268, 0xC320]:
print "The decimal value of {} is {}".format(hex(value), convert_from_system(value))
Result:
The decimal value of 0x4268 is 11.25
The decimal value of 0xc320 is -18.0
This confirms that -18 does convert into 0xC320.

Related

Converting a decimal number to floating point notation and IEEE 754 format

I'm doing an assignment for one of my classes and I'm stuck on those two questions:
Express the decimal -412.8 using binary floating point notation using 11 fraction bits for
the significand and 3 digits for the exponent without bias
I think I managed to solve it, but my exponent has 4 bits not 3. I don't really understand how you can convert -412.8 to floating point notation using only 3 bit exponents. Here is how I tried to solve it:
First of all, the floating point notation has three parts. The sign part, 0 for positive numbers and 1 for negative numbers, the exponent part and finally the mantissa. The mantissa in this case includes the leading 1. Since the number is negative, the sign bit is going to be 1. For the mantissa, I first converted 412.8 to binary, which gave me 110011100.11 and then I shifted the decimal point to the left 8 times, which gives me 1.1001110011. The mantissa is therefore 1100 1110 011 (11 bits as the teacher asked). Finally, the exponent is going to be 2^8, since I shifted the decimal 8 times to the right. 8 is 1000 in binary. So am I correct to assume that my floating point notation should be 1 1000 11001110011?
Represent the decimal number 16.1875×2-134 in single-precision IEEE 754 format.
I'm completely stuck on this one. I don't know how to convert that number. When I enter it in wolfram, the decimal number is way beyond the limit of the single precision format. I do know that the sign bit is going to be 0 since the number is positive. I don't know what the mantissa is though, nor how to find it. I also don't know how to find the exponent. Can someone guide me through this problem? Thanks.
For 1, you appear to be correct -- there's no way to reperesent the exponent unbiased in 3 bits. Of course, the problem says "3 digits" and doesn't define a base for the digits...
2 is relatively straight-forward -- convert the value to binary gives you 10000.0011 then normalize, giving 1.00000011×2-130. Now -130 is too small for a single-precision exponent (minimum is -126), so we have to denormalize (continue shifting the point to get an exponent of -126), which gives us 0.000100000011×2-126. That's then our mantissa (with the 0 dropped) and an exponent field of 0: 0|00000000|00010000001100000000000 (vertical bars separating the sign/exponent/mantissa fields) or 0x00081800

Whats the highest and the lowest integer for representing signed numbers in two's complement in 5 bits?

I understand how binary works and I can calculate binary to decimal, but I'm lost around signed numbers.
I have found a calculator that does the conversion. But I'm not sure how to find the maximum and the minumum number or convert if a binary number is not given, and question in StackO seems to be about converting specific numbers or doesn't include signed numbers to a specific bit.
The specific question is:
We have only 5 bits for representing signed numbers in two's complement:
What is the highest signed integer?
Write its decimal value (including the sign only if negative).
What is the lowest signed integer?
Write its decimal value (including the sign only if negative).
Seems like I'll have to go heavier on binary concepts, I just have 2 months in programming and I thought i knew about binary conversion.
From a logical point of view:
Bounds in signed representation
You have 5 bits, so there are 32 different combinations. It means that you can make 32 different numbers with 5 bits. On unsigned integers, it makes sense to store integers from 0 to 31 (inclusive) on 5 bits.
However, this is about unsigned integers. Meaning: we have to find a way to represent negative numbers too. Meaning: we have to store the number's value, but also its sign (+ or -). The representation used is 2's complement, and it is the one that's learned everywhere (maybe other exist but I don't know them). In this representation, the sign is given by the first bit. That is, in 2's complement representation a positive number starts with a 0 and a negative number starts with an 1.
And here the problem rises: Is 0 a positive number or a negative number ? It can't be both, because it would mean that 0 can be represented in two manners for a given number a bits (for 5: 00000 and 10000), that is we lose the space to put one more number. I have no idea how they decided, but fact is 0 is a positive number. For any number of bits, signed or unsigned, a 0 is represented with only 0.
Great. This gives us the answer to the first question: what is the upper bound for a decimal number represented in 2's complement ? We know that the first bit is for the sign, so all of the numbers we can represent must be composed of 4 bits. We can have 16 different values of 4-bits strings, and 0 is one of them, so the upper bound is 15.
Now, for the negative numbers, this becomes easy. We have already filled 16 values out of the 32 we can make on 5 bits. 16 left. We also know that 0 has already been represented, so we don't need to include it. Then we start at the number right before 0: -1. As we have 16 numbers to represent, starting from -1, the lowest signed integer we can represent on 5 bits is -16.
More generally, with n bits we can represent 2^n numbers. With signed integers, half of them are positive, and half of them are negative. That is, we have 2^(n-1) positive numbers and 2^(n-1) negative numbers. As we know 0 is considered as positive, the greatest signed integer we can represent on n bits is 2^(n-1) - 1 and the lowest is -2^(n-1)
2's complement representation
Now that we know which numbers can be represented on 5 bits, the question is to know how we represent them.
We already saw the sign is represented on the first bit, and that 0 is considered as positive. For positive numbers, it works the same way as it does for unsigned integers: 00000 is 0, 00001 is 1, 00010 is 2, etc until 01111 which is 15. This is where we stop for positive signed integers because we have occupied all the 16 values we had.
For negative signed integers, this is different. If we keep the same representation (10001 is -1, 10010 is -2, ...) then we end up with 11111 being -15 and 10000 not being attributed. We could decide to say it's -16 but we would have to check for this particular case each time we work with negative integers. Plus, this messes up all of the binary operations. We could also decide that 10000 is -1, 10001 is -2, 10010 is -3 etc. But it also messes up all of the binary operations.
2's complement works the following way. Let's say you have the signed integer 10011, you want to know what decimal is is.
Flip all the bits: 10011 --> 01100
Add 1: 01100 --> 01101
Read it as an unsigned integer: 01101 = 0*2^4 + 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 13.
10011 represents -13. This representation is very handy because it works both ways. How to represent -7 as a binary signed integer ? Start with the binary representation of 7 which is 00111.
Flip all the bits: 00111 --> 11000
Add 1: 11000 --> 11001
And that's it ! On 5 bits, -7 is represented by 11001.
I won't cover it, but another great advantage with 2's complement is that the addition works the same way. That is, When adding two binary numbers you do not have to care if they are signed or unsigned, this is the same algorithm behind.
With this, you should be able to answer the questions, but more importantly to understand the answers.
This topic is great for understanding 2's complement: Why is two's complement used to represent negative numbers?

How to do Division of two fixed point 64 bits variables in Synthesizable Verilog?

I'm implementing an Math equation in verilog, in a combinational scheme (assigns = ...) to the moment Synthesis tool (Quartus II) has been able to do add, sub and mul easly 32 bit unsigned absolute numbers by using the operators "+,- and *" respectively.
However, one of the final steps of the equation is to divide two 64 bits unsigned fixed point variables, the reason why is such of large 64 bit capacity is because I'm destinating 16 bits for integers and 48 bits for fractions (although, computer does everything in binary and doesn't care about fractions, I would be able to check the number to separate fraction from integer in the end).
Problem is that the operator "/" is useless since it auto-invokes a so-called "LPM_divide" library which output only gives me the integer, disregarding fractions, plus in a wrong position (the less significant bit).
For example:
b1000111010000001_000000000000000000000000000000000000000000000000 / b1000111010000001_000000000000000000000000000000000000000000000000
should be 1, it gives me
b0000000000000000_000000000000000000000000000000000000000000000001
So, how can I make this division for synthesizable verilog? What methods or algorithms should I follow, I'd like it to be faster, maybe a full combinational?
I'd like it to keep the 16 integers - 24 fractions user point of view. Thanks in advance.
First assume you multiply two fixed-point numbers.
Let's call them X and Y, first containing Xf fractional bits, and second Yf fractional bits accordingly.
If you multiply those numbers as integers, the LSB Xf+Yf bits of the integer result could be treated as fractional bits of resulting fixed-point number (and you still multiply them as integers).
Similarly, if you divide number of Sf fractional bits by number of Df fractional bits, the resulting integer could be treated as fixed-point number having Sf-Df fractional bits -- therefore your example with resulting integer 1.
Thus, if you need to get 48 fractional bits from your division of 16.48 number by another 16.48 number, append divident with another 48 zeroed fractional bits, then divide the resulting 64+48=112-bit number by another 64-bit number, treating both as integers (and using LPM_divide). The result's LSB 48 bits will then be what you need -- the resulting fixed-point number's 48 fractional bits.

decimal to binary conversion of a number which is having fractional part too

How to represent 500.2 in binary number system. I want to know the conversion method. I know how to convert numbers without points but if point comes in any number I don't know how to convert it.
Quoting the conversion description from Modern Digital Electronics 4E
Decimal to Binary conversion :
Any decimal number can be converted into its equivalent binary number.
For integers, the conversion is obtained by continuous division by 2
and keeping track of the remainders, while for fractional parts, the
conversion is affected by continuous multiplication by 2 and keeping
track of the integers generated.
The conversion process in your case is illustrated below :-
500/2 = 250 Remainder = 0
250/2 = 125 Reaminder = 0
125/2 = 62 Remainder = 1
62/2 = 31 Remainder = 0
31/2 = 15 Remainder = 1
15/2 = 7 Remainder = 1
7/2 = 3 Remainder = 1
3/2 = 1 Remainder = 1
1/2 = 0 Remainder = 1
So, the order of evaluation is that topmost remainder will go to LSB, the bottom-most remainder would go to MSB.
Therefore, (500)2 = 111110100.
Now, talking about the fractional part, we would go as follows :-
// separate the integer generated(0 or 1) on the left hand side of the fraction/dot,
// and ensure only fractional part between 0 and 1 are allowed in the next step
0.2 * 2 = 0.4 , so, keep 0 in the bag
0.4 * 2 = 0.8 , so, keep 0 in the bag
0.8 * 2 = 1.6 , so, keep 1 in the bag, and next put 0.6 to the next step
0.6 * 2 = 1.2, so, keep 1 in the bag, and next put 0.2 to the next step
0.2 * 2 = 0.4, so, keep 0 in the bag...
// and so on as we see that it would continue(repeating) the same pattern.
As we find that the series would go on infinitely, we can consider only the precision upto certain decimal places.
So, if I assume that the required precision is 4 digits after the dot, then the answer would be the sequence in which the digits are being placed in the bag, i.e.,
(0.2)2 = 0.00110011...
= 0.0011....
= 0.0011.
Now, combinedly, (500.2)2 = 111110100.0011 .
Here is a good webpage about it. I don't know if it answers your question :
http://www.h-schmidt.net/FloatConverter/IEEE754.html
Edit
from that link
Usage: You can either convert a number by choosing its binary representation in the button-bar, the other fields will be updated immediately. Or you can enter a binary number, a hexnumber or the decimal representation into the corresponding textfield and press return to update the other fields. To make it easier to spot eventual rounding errors, the selected float number is displayed after conversion to double precision.
Special Values: You can enter the words "Infinity", "-Infinity" or "NaN" to get the corresponding special values for IEEE-754. Please note there are two kinds of zero: +0 and -0.
Conversion: The value of a IEEE-754 number is computed as:
sign * 2exponent * mantissa
The sign is stored in bit 32. The exponent can be computed from bits 24-31 by subtracting 127. The mantissa (also known as significand or fraction) is stored in bits 1-23. An invisible leading bit (i.e. it is not actually stored) with value 1.0 is placed in front, then bit 23 has a value of 1/2, bit 22 has value 1/4 etc. As a result, the mantissa has a value between 1.0 and 2. If the exponent reaches -127 (binary 00000000), the leading 1 is no longer used to enable gradual underflow.
Underflow: If the exponent has minimum value (all zero), special rules for denormalized values are followed. The exponent value is set to 2-126 while the "invisible" leading bit for the mantissa is no longer used. The range of the mantissa is now [0:1).
Note: The converter used to show denormalized exponents as 2-127 and a denormalized mantissa range [0:2). This is effectively identical to the values above, with a factor of two shifted between exponent and mantissa. However this confused people and was therefore changed (2015-09-26).
Rounding errors: Not every decimal number can be expressed exactly as a floating point number. This can be seen when entering "0.1" and examining its binary representation which is either slightly smaller or larger, depending on the last bit.
Other representations: The hex representation is just the integer value of the bitstring printed as hex. Don't confuse this with true hexadecimal floating point values in the style of 0xab.12ef.

Can a IEEE 754 real number "cover" all integers within its range?

The original question was edited (shortened) to focus on a problem of precision, not range.
Single, or double precision, every representation of real number is limited to (-range,+range). Within this range lie some integer numbers (1, 2, 3, 4..., and so on; the same goes with negative numbers).
Is there a guarantee that a IEEE 754 real number (float, double, etc) can "cover" all integers within its range? By "cover" I mean the real number will represent the integer number exactly, not as (for example) "5.000001".
Just as reminder: http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html nice explanation of various number representation formats.
Update:
Because the question is for "can" I am also looking for the fact this cannot be done -- for it quoting a number is enough. For example "no it cannot be done, for example number 1748574 is not represented exactly by float number" (this number is taken out of thin air of course).
For curious reader
If you would like to play with IEEE 754 representation -- on-line calculator: http://www.ajdesigner.com/fl_ieee_754_word/ieee_32_bit_word.php
No, not all, but there exists a range within which you can represent all integers accurately.
Structure of 32bit floating point numbers
The 32bit floating point type uses
1 bit for the sign
8 bits for the exponent
23 bits for the fraction (leading 1 implied)
Representing numbers
Basically, you have a number in the form
(-)1.xxxx_xxxx_xxxx_xxxx_xxxx_xxx (binary)
which you then shift left/right with the (unbiased) exponent.
To have it represent an integer requiring n bits, you need to shift it by n-1 bits to the left. (All xes beyond the floating point are simply zero)
Representing integers with 24 bits
It is easy to see, that we can represent all integers requiring 24 bits (and less)
1xxx_xxxx_xxxx_xxxx_xxxx_xxxx.0 (unbiased exponent = 23)
since we can set the xes at will to either 1 or 0.
The highest number we can represent in this fashion is:
1111_1111_1111_1111_1111_1111.0
or 2^24 - 1 = 16777215
The next higher integer is 1_0000_0000_0000_0000_0000_0000. Thus, we need 25 bits.
Representing integers with 25 bits
If you try to represent a 25 bit integer (unbiased exponent = 24), the numbers have the following form:
1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0
The twenty-three digits that are available to you have all been shifted past the floating point. The leading digit is always a 1. In total, we have 24 digits. But since we need 25, a zero is appended.
A maximum is found
We can represent ``1_0000_0000_0000_0000_0000_0000with the form1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0, by simply assigning 1to allxes. The next higher integer from that is: 1_0000_0000_0000_0000_0000_0001. It's easy to see that this number cannot be represented accurately, because the form does not allow us to set the last digit to 1: It is always 0`.
It follows, that the 1 followed by 24 zeroes is an upper bound for the integers we can accurately represent.
The lower bound simply has its sign bit flipped.
Range within which all integers can be represented (including boundaries)
224 as an upper bound
-224 as a lower bound
Structure of 64bit floating point numbers
1 bit for the sign
11 exponent bits
52 fraction bits
Range within which all integers can be represented (including boundaries)
253 as an upper bound
-253 as a lower bound
This easily follows by applying the same argumentation to the structure of 64bit floating point numbers.
Note: That is not to say these are all integers we can represent, but it gives you a range within which you can represent all integers. Beyond that range, we can only represent a power of two multiplied with an integer from said range.
Combinatorial argument
Simply convincing ourselves that it is impossible for 32bit floating point numbers to represent all integers a 32bit integer can represent, we need not even look at the structure of floating point numbers.
With 32 bits, there are 232 different things we can represent. No more, no less.
A 32bit integer uses all of these "things" to represent numbers (pairwise different).
A 32bit floating point number can represent at least one number with a fractional part.
Thus, it is impossible for the 32bit floating point number to be able to represent this fractional number in addition to all 232 integers.
macias, to add to the already excellent answer by phant0m (upvoted; I suggest you accept it), I'll use your own words.
"No it cannot be done, for example number 16777217 is not represented exactly by float number."
Also, "for example number 9223372036854775809 is not represented exactly by double number".
This is assuming your computer is using the IEEE floating point format, which is a pretty strong bet.
No.
For example, on my system, the type float can represent values up to approximately 3.40282e+38. As an integer, that would be approximately 340282000000000000000000000000000000000, or about 2128.
The size of float is 32 bits, so it can exactly represent at most 232 distinct numbers.
An integer object generally uses all of its bits to represent values (with 1 bit dedicated as a sign bit for signed types). A floating-point object uses some of its bits to represent an exponent (8 bits for IEEE 32-bit float); this increases its range at the cost of losing precision.
A concrete example (1267650600228229401496703205376.0 is 2100, and is exactly representable as a float):
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(void) {
float x = 1267650600228229401496703205376.0;
float y = nextafterf(x, FLT_MAX);
printf("x = %.1f\n", x);
printf("y = %.1f\n", y);
return 0;
}
The output on my system is:
x = 1267650600228229401496703205376.0
y = 1267650751343956853325350043648.0
Another way to look at it:
A 32-bit object can represent at most 232 distinct values.
A 32-bit signed integer can represent all integer values in the range -2147483648 .. 2147483647 (-231 .. +231-1).
A 32-bit float can represent many values that a 32-bit signed integer can't, either because they're fractional (0.5) or because they're too big (2.0100). Since there are values that can be represented by a 32-bit float but not by a 32-bit int, there must be other values that can be represented by a 32-bit int but not by a 32-bit float. Those values are integers that have more significant digits than a float can handle, because the int has 31 value bits but the float has only about 24.
Apparently you are asking whether a Real data type can represent all of the integer values in its range (absolute values up to FLT_MAX or DBL_MAX, in C, or similar constants in other languages).
The largest numbers representable by floating point numbers stored in K bits typically are much larger than the 2^K number of integers that K bits can represent, so typically the answer is no. 32-bit C floats exceed 10^37, 32-bit C integers are less than 10^10. To find out the next representable number after some number, use nextafter() or nextafterf(). For example, the code
printf ("%20.4f %20.4f\n", nextafterf(1e5,1e9), nextafterf(1e6,1e9));
printf ("%20.4f %20.4f\n", nextafterf(1e7,1e9), nextafterf(1e8,1e9));
prints out
100000.0078 1000000.0625
10000001.0000 100000008.0000
You might be interested in whether an integer J that is between two nearby fractional floating values R and S can be represented exactly, supposing S-R < 1 and R < J < S. Yes, such a J can be represented exactly. Every float value is the ratio of some integer and some power of 2. (Or is the product of some integer and some power of 2.) Let the power of 2 be P, and suppose R = U/P, S = V/P. Now U/P < J < V/P so U < J*P < V. More of J*P's low-order bits are zero than are those of U, V (because V-U < P, due to S-R < 1), so J can be represented exactly.
I haven't filled in all the details to show that J*P-U < P and V-J*P < P, but under the assumption S-R < 1 that's straightforward. Here is an example of R,J,S,P,U,V value computations: Let R=99999.9921875 = 12799999/128, (ie P=128); let S=100000.0078125 = 12800001/128; we have U=0xc34fff and V=0xc35001 and there is a number between them that has more low-order zeroes than either; to wit, J = 0xc35000/128 = 12800000/128 = 100000.0. For the numbers in this example, note that U and V require 24 bits for their exact representations (6 ea. 4-bit hex digits). Note that 24 bits is the number of bits of precision in IEEE 754 single-precision floating point numbers. (See table in wikipedia article.)
That each floating point number is a product or ratio of some integer and some power of 2 (as mentioned two paragraphs above) also is discussed in that floating point article, in a paragraph that begins:
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, ... a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly.

Resources