Rounding to the nearest integer in floating point - math

How can I round a floating point number to the nearest integer? I am looking for the algorithm in terms of binary since I have to implement the code in assembly.

UPDATED with method for proper rounding to even.
Basic Algorithm:
Store the (exponent+1)'th mantissa bit (counting from the first bit after the binary point). Next, zero out the (23-exponent) least significant bits. Then use the stored bit to round. If the stored bit is 1, add one to the LSB of the non-truncated part and normalize if necessary. If the stored bit is 0, do nothing.
For results matching the IEEE-754 standard (round half to even):
Before zeroing out the (23-exponent) least significant bits, OR together the (22-exponent) least significant bits. Call the result of that OR the sticky bit.
The stored (exponent+1)'th bit (after the binary point) will be called the guard bit.
Then zero out the (23-exponent) least significant bits.
If the guard bit is 0, do nothing.
If the guard bit is 1 and the sticky bit is 0, add one to the LSB only if the LSB is 1.
If the guard bit is 1 and the sticky bit is 1, add one to the LSB.
Here are some examples using the basic algorithm:
x = 62.3
sign exponent mantissa
x = 0 5 (1).11110010011001100110011
Step 1: Store the exponent+1'th bit (after the decimal point)
exponent+1 = 6th bit
savedbit = 0
Step 2: Zero out 23-exponent least significant bits
23-exponent = 18, so we zero out the 18 LSBs
sign exponent mantissa
x = 0 5 (1).11110000000000000000000
Step 3: Use the next bit to round
Since the stored bit is 0, we do nothing, and the floating point number has been rounded to 62.
Another example:
x = 21.9
sign exponent mantissa
x = 0 4 (1).01011110011001100110011
Step 1: Store the exponent+1'th bit (after the decimal point)
exponent+1 = 5th bit
savedbit = 1
Step 2: Zero out 23-exponent least significant bits
23-exponent = 19, so we zero out the 19 LSBs
sign exponent mantissa
x = 0 4 (1).01010000000000000000000
Step 3: Use the next bit to round
Since the stored bit is 1, we add one to the LSB of the truncated part and get 22, which is the correct number:
We start with:
sign exponent mantissa
x = 0 4 (1).01010000000000000000000
Add one at the 4th mantissa bit (the LSB of the kept part):
    +       0.00010000000000000000000
And we get 22:
sign exponent mantissa
x = 0 4 (1).01100000000000000000000
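The guard/sticky procedure above can be sketched in Python for experimenting before porting it to assembly. This is only a minimal sketch, not the answerer's code: the function name round_float32_half_even is made up for illustration, it assumes a finite IEEE-754 single-precision input, and it returns the rounded value as a Python int instead of re-normalizing the float's bit pattern.

```python
import struct

def round_float32_half_even(x):
    """Round a finite float32 value to the nearest integer, ties to even."""
    # Reinterpret the value as its 32-bit IEEE-754 single-precision pattern.
    packed = struct.pack(">f", x)
    bits = struct.unpack(">I", packed)[0]
    sign = -1 if bits >> 31 else 1
    exponent = ((bits >> 23) & 0xFF) - 127   # unbiased exponent
    mantissa = bits & 0x7FFFFF               # 23 stored fraction bits

    if exponent >= 23:
        # No fractional bits remain, so the value is already an integer.
        return int(struct.unpack(">f", packed)[0])
    if exponent < -1:
        # Magnitude below 0.5 (including zero and subnormals) rounds to 0.
        return 0
    if exponent == -1:
        # Magnitude in [0.5, 1): exactly 0.5 is a tie and goes to even (0).
        return sign if mantissa else 0

    drop = 23 - exponent                                 # fractional bits to discard
    guard = (mantissa >> (drop - 1)) & 1                 # the (exponent+1)'th bit
    sticky = (mantissa & ((1 << (drop - 1)) - 1)) != 0   # OR of the bits below it
    kept = ((1 << 23) | mantissa) >> drop                # integer part, implicit 1 included
    if guard and (sticky or (kept & 1)):
        kept += 1                                        # on a tie, round up only if LSB is odd
    return sign * kept

print(round_float32_half_even(62.3))  # 62
print(round_float32_half_even(21.9))  # 22
print(round_float32_half_even(2.5))   # 2 (tie, rounds to even)
```

Returning an integer sidesteps the "normalize if necessary" step; an assembly version that keeps the result as a float would need to handle the carry out of the mantissa.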

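The x87 FPU can do this for you: fistp converts to an integer using the current rounding mode, which is round-to-nearest (ties to even) by default. Note that these are x87 instructions rather than SSE. Source: http://www.musicdsp.org/showone.php?id=246
inline int float2int(float x) {
    int i;
    __asm {       // MSVC inline assembly, 32-bit x86 only
        fld x     // push x onto the FPU stack
        fistp i   // store as integer using the current rounding mode, then pop
    }
    return i;
}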

Increase the exponent by 1 (doubling the value), add 1, decrease the exponent by 1, truncate. Or just add 0.5 and truncate. Whichever floats your boat.
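As a quick illustration of the "add 0.5 and truncate" shortcut, here is a small Python sketch (the helper name is made up for this example). It agrees with the bit-level method for positive non-tie values, but it rounds ties upward rather than to even, and plain truncation gives the wrong answer for negative inputs unless the sign is handled separately.

```python
import math

def add_half_and_truncate(x):
    # Truncate toward zero, as a float-to-int conversion in C usually does.
    return math.trunc(x + 0.5)

print(add_half_and_truncate(21.9))   # 22
print(add_half_and_truncate(2.5))    # 3, whereas round-half-even gives 2
print(add_half_and_truncate(-21.9))  # -21, although the nearest integer is -22
```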

Related

decimal to binary conversion of a number which is having fractional part too

How to represent 500.2 in binary number system. I want to know the conversion method. I know how to convert numbers without points but if point comes in any number I don't know how to convert it.
Quoting the conversion description from Modern Digital Electronics 4E
Decimal to Binary conversion :
Any decimal number can be converted into its equivalent binary number.
For integers, the conversion is obtained by continuous division by 2
and keeping track of the remainders, while for fractional parts, the
conversion is effected by continuous multiplication by 2 and keeping
track of the integers generated.
The conversion process in your case is illustrated below :-
500/2 = 250 Remainder = 0
250/2 = 125 Remainder = 0
125/2 = 62 Remainder = 1
62/2 = 31 Remainder = 0
31/2 = 15 Remainder = 1
15/2 = 7 Remainder = 1
7/2 = 3 Remainder = 1
3/2 = 1 Remainder = 1
1/2 = 0 Remainder = 1
So the topmost remainder goes to the LSB and the bottom-most remainder goes to the MSB.
Therefore, (500)10 = (111110100)2.
Now, talking about the fractional part, we would go as follows :-
// separate the integer generated(0 or 1) on the left hand side of the fraction/dot,
// and ensure only fractional part between 0 and 1 are allowed in the next step
0.2 * 2 = 0.4 , so, keep 0 in the bag
0.4 * 2 = 0.8 , so, keep 0 in the bag
0.8 * 2 = 1.6 , so, keep 1 in the bag, and next put 0.6 to the next step
0.6 * 2 = 1.2, so, keep 1 in the bag, and next put 0.2 to the next step
0.2 * 2 = 0.4, so, keep 0 in the bag...
// and so on as we see that it would continue(repeating) the same pattern.
As we find that the series goes on infinitely, we can keep only a certain number of digits after the binary point.
So, if I assume that the required precision is 4 digits after the point, then the answer is the sequence in which the digits were placed in the bag, i.e.,
(0.2)10 = (0.00110011...)2
= (0.0011...)2
= (0.0011)2 approximately.
Now, combining both parts, (500.2)10 = (111110100.0011)2, truncated to 4 fractional bits.
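The two procedures above (repeated division for the integer part, repeated multiplication for the fractional part) can be sketched in Python. This is an illustrative sketch: the function name and the fraction_digits parameter are assumptions, the input is taken as a non-negative decimal string, and Fraction is used so that 0.2 is handled exactly rather than as a binary float.

```python
from fractions import Fraction

def decimal_to_binary(text, fraction_digits=4):
    """Convert a non-negative decimal string such as "500.2" to a binary string."""
    value = Fraction(text)
    integer_part = int(value)
    fraction = value - integer_part

    # Integer part: divide by 2 repeatedly and collect the remainders (LSB first).
    remainders = []
    n = integer_part
    while n > 0:
        n, r = divmod(n, 2)
        remainders.append(str(r))
    integer_bits = "".join(reversed(remainders)) or "0"

    # Fractional part: multiply by 2 repeatedly and collect the integer digits.
    bag = []
    for _ in range(fraction_digits):
        fraction *= 2
        bit = int(fraction)      # 0 or 1
        bag.append(str(bit))
        fraction -= bit          # keep only the part between 0 and 1
    return integer_bits + "." + "".join(bag)

print(decimal_to_binary("500.2"))      # 111110100.0011
print(decimal_to_binary("500.2", 8))   # 111110100.00110011
```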
Here is a good webpage about it. I don't know if it answers your question :
http://www.h-schmidt.net/FloatConverter/IEEE754.html
Edit
from that link
Usage: You can either convert a number by choosing its binary representation in the button-bar, the other fields will be updated immediately. Or you can enter a binary number, a hexnumber or the decimal representation into the corresponding textfield and press return to update the other fields. To make it easier to spot eventual rounding errors, the selected float number is displayed after conversion to double precision.
Special Values: You can enter the words "Infinity", "-Infinity" or "NaN" to get the corresponding special values for IEEE-754. Please note there are two kinds of zero: +0 and -0.
Conversion: The value of a IEEE-754 number is computed as:
sign * 2^exponent * mantissa
The sign is stored in bit 32. The exponent can be computed from bits 24-31 by subtracting 127. The mantissa (also known as significand or fraction) is stored in bits 1-23. An invisible leading bit (i.e. it is not actually stored) with value 1.0 is placed in front, then bit 23 has a value of 1/2, bit 22 has value 1/4 etc. As a result, the mantissa has a value between 1.0 and 2. If the exponent reaches -127 (binary 00000000), the leading 1 is no longer used to enable gradual underflow.
Underflow: If the exponent has minimum value (all zero), special rules for denormalized values are followed. The exponent value is set to 2^-126 while the "invisible" leading bit for the mantissa is no longer used. The range of the mantissa is now [0:1).
Note: The converter used to show denormalized exponents as 2^-127 and a denormalized mantissa range [0:2). This is effectively identical to the values above, with a factor of two shifted between exponent and mantissa. However this confused people and was therefore changed (2015-09-26).
Rounding errors: Not every decimal number can be expressed exactly as a floating point number. This can be seen when entering "0.1" and examining its binary representation which is either slightly smaller or larger, depending on the last bit.
Other representations: The hex representation is just the integer value of the bitstring printed as hex. Don't confuse this with true hexadecimal floating point values in the style of 0xab.12ef.
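To make the quoted formula concrete, here is a minimal Python sketch that decodes a 32-bit pattern by hand using sign * 2^exponent * mantissa. The function name is hypothetical, infinities and NaNs are rejected rather than decoded, and the shifts count bits from 0 whereas the quoted text counts them from 1.

```python
import struct

def decode_float32(bits):
    """Decode a 32-bit IEEE-754 single-precision pattern into a Python float."""
    sign = -1 if bits >> 31 else 1
    raw_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if raw_exponent == 255:
        raise ValueError("infinity and NaN are not handled in this sketch")
    if raw_exponent == 0:
        # Denormalized: exponent fixed at -126, no invisible leading 1.
        exponent = -126
        mantissa = fraction / 2**23
    else:
        exponent = raw_exponent - 127
        mantissa = 1 + fraction / 2**23
    return sign * 2.0**exponent * mantissa

print(decode_float32(0x3F800000))  # 1.0
print(decode_float32(0xC1900000))  # -18.0

# Cross-check against the platform's own decoding of the same bit pattern.
pattern = struct.unpack(">I", struct.pack(">f", 0.1))[0]
print(decode_float32(pattern) == struct.unpack(">f", struct.pack(">f", 0.1))[0])  # True
```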

decimal to floating point system.

I've been asked to work on the following question with the following specification/rules...
Numbers are held in 16 bits split from left to right as follows:
1 bit sign flag that should be set for negative numbers and otherwise clear.
7 bit exponent held in Excess 63
8 bit significand, normalised to 1.x with only the fractional part stored – as in IEEE 754
Giving your answers in hexadecimal, how would the number -18 be represented in this system?
The answer I got is: 11000011 00100000 (or C320 in hexadecimal)
using the following method:
-18 decimal is a negative number so we have the sign bit set to 1.
18 in binary would be 0010010. This we could note down as 10010. We now work on what's on the right side of the binary point, but in this case there is no fractional part, so we note down 0000 0000. We now write down the binary of 18 and the trailing zeroes (which are not strictly required), separated by a binary point as shown below:
10010.00000000
We now normalise this into the form 1.x by moving the binary point so that it sits between the first and second digits, counting how many places we move it. The result is 1.001000000000 x 2^4; the point moved 4 places, so the exponent is 4. The floating point system we are using has a 7 bit exponent held in excess 63, so the stored exponent is 63 + 4 = 67, which in 7 bit binary is 1000011.
The sign bit is: 1 (-ve)
Exponent is: 1000011
Significand is 00100…
The binary representation is: 11000011 00100000 (or C320 in hexadecimal)
Please let me know if it's correct, or if I've done it wrong and what changes should be applied. Thank you guys :)
Since you seem to have been assigned a lot of questions of this type, it may be useful to write an automated answer checker to validate your work. I've put together a quick converter in Python:
    def convert_from_system(x):
        # Retrieve the low eight bits (the stored fraction) and add a ninth bit
        # on the left. This bit is the 1 in "1.x".
        significand = (x & 0b11111111) | 0b100000000
        # Retrieve the next seven bits.
        exponent = (x >> 8) & 0b1111111
        # Retrieve the top bit, and determine the sign.
        sign = -1 if x >> 15 else 1
        # Remove the excess-63 bias from the exponent.
        exponent = exponent - 63
        # The significand is held as the integer 1xxxxxxxx, i.e. 1.xxxxxxxx * 2^8,
        # so dividing by 2^(8-exponent) yields 1.xxxxxxxx * 2^exponent.
        result = sign * (significand / float(2**(8-exponent)))
        return result

    for value in [0x4268, 0xC320]:
        print("The decimal value of {} is {}".format(hex(value), convert_from_system(value)))
Result:
The decimal value of 0x4268 is 11.25
The decimal value of 0xc320 is -18.0
This confirms that -18 does convert into 0xC320.
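The manual method in the question can also be sketched in the encoding direction. The function below is a hypothetical companion to the checker, not part of the original answer: it assumes a nonzero input whose exponent fits in the 7-bit excess-63 field, and it rounds the fraction to 8 bits using Python's round (ties to even), which the question's specification does not state.

```python
import math

def encode_to_system(value):
    """Encode a nonzero number as sign (1 bit) | exponent (7 bits, excess 63) | fraction (8 bits)."""
    sign = 1 if value < 0 else 0
    # frexp returns m in [0.5, 1) with abs(value) == m * 2**e,
    # so the 1.x form is 2*m and its exponent is e - 1.
    m, e = math.frexp(abs(value))
    exponent = e - 1
    fraction = round((2 * m - 1) * 256)   # the eight stored fraction bits
    if fraction == 256:                   # rounding carried into the leading 1
        fraction = 0
        exponent += 1
    return (sign << 15) | ((exponent + 63) << 8) | fraction

print(hex(encode_to_system(-18)))    # 0xc320
print(hex(encode_to_system(11.25)))  # 0x4268
```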

Convert floating point number from binary to a decimal number

I have to convert floating point number from binary to usable decimal number.
Of course my floating point number has been separated into bytes, so 4 bytes total.
1 2 3 4
[xxxxxxxx][xxxxxxxx][xxxxxxxx][xxxxxxxx]
These 4 bytes are already converted to decimal, so I have e.g.
1 2 3 4
[0][10][104][79]
Now Mantissa is held in three parts, two rightmost bytes (3 & 4) and in byte 2 but without the MSB bit (that one is easy to mask out, so let's assume we have a nice decimal number there as well). So three decimal numbers.
Is there a straightforward mathematical conversion to a floating point mantissa for these three decimal numbers?
This is along the lines: if I needed to get an integer, the formula would be
10 * 65536 + 104 * 256 + 79.
Call these bytes a, b, and c. I assume a has already been masked, so it contains only the bits of the significand and none of the exponent, and that the number is IEEE-754 32-bit binary floating-point, with bytes taken with the appropriate endianness.
If the raw exponent field is 1 to 254 (thus, not 0 or 255), then the significand is:
1 + a*0x1p-7 + b*0x1p-15 + c*0x1p-23
or, equivalently:
(65536*a + 256*b + c) * 0x1p-23 + 1.
If the raw exponent field is 0, then remove the 1 from the sum (the number is subnormal or zero). If the raw exponent field is 255, then the floating-point value is infinity (if a, b, and c are all 0) or a NaN (otherwise).
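A short Python sketch of the formula above, under the same assumptions as the answer (a is already masked, IEEE-754 32-bit layout). The function name and the raw_exponent parameter are made up for illustration. With the bytes from the question, byte 1 is 0 and the top bit of byte 2 is 0, so the raw exponent field works out to 0 and the subnormal branch applies.

```python
def significand_from_bytes(a, b, c, raw_exponent):
    """a: masked byte 2, b and c: bytes 3 and 4, raw_exponent: the 8-bit exponent field."""
    if raw_exponent == 255:
        raise ValueError("infinity or NaN has no finite significand")
    fraction = (65536 * a + 256 * b + c) * float.fromhex("0x1p-23")
    # Normal numbers carry an implicit leading 1; subnormals and zero do not.
    return fraction if raw_exponent == 0 else 1 + fraction

# Bytes [0][10][104][79] from the question: raw exponent field is 0.
print(significand_from_bytes(10, 104, 79, raw_exponent=0))
# The same fraction bits with a normal exponent field would add the leading 1.
print(significand_from_bytes(10, 104, 79, raw_exponent=1))
```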
I cannot be of much help, since it has been a while since I did conversions, but I hope you find this tutorial useful.

Convert decimal to hex/binary

I have small math question.
Is there any way to convert a decimal number (for example 3.14) to hex or binary? If it's possible, can anybody post some links to tutorials or explanations? (I don't want it for a specific language; I need it generally, in math.) Please help.
EDIT:
Input passed in code:
0.1
Output in ASM code:
415740h
Another input:
0.058
Another output by compiler:
00415748h
But how was this done? How can it be converted?
I do not recognize your output samples as encodings of floating-point numbers or other common representations of .1 and .058. I suspect these numbers are addresses where the assembler or compiler has stored the floating-point encoding.
In other words, you wrote some text that included a floating-point literal, and the assembler or compiler converted that literal to a floating-point encoding, stored it at some address, and then put the address into an instruction that loads the floating-point encoding from memory.
This hypothesis is consistent with the fact that the two numbers differ by eight. Since double-precision floating-point is commonly eight bytes, the second address (0x415748) is eight bytes beyond the first address (0x415740).
The process for encoding a number in floating-point is roughly this:
Let x be the number to be encoded.
Set s (a sign bit) to 0 if x is positive and to 1 if x is negative. Set x to the absolute value of x.
Set e (an exponent) to 0. Repeat whichever of the following is appropriate:
If x is 2 or greater, add 1 to e and divide x by 2. Repeat until x is less than 2.
If x is less than 1, add -1 to e and multiply x by 2. Repeat until x is at least 1.
When you are done with the above, x is at least 1 and is less than 2. Also, the original number equals (-1)^s·2^e·x. That is, we have represented the number with a sign bit (s), an exponent of two (e), and a significand (x) that is in [1, 2) (includes 1, excludes 2).
Set f = (x-1)·2^52. Round f to the nearest integer (if it is a tie between two integers, round to the even integer). If f is now 2^52, set f to 0 and add 1 to e. (This step finds the 52 bits of x that are immediately after the "decimal point" when x is represented as a binary numeral, with rounding after the 52nd digit, and it adjusts the exponent if rounding at that position rounds x up to 2, which is outside the interval where we want it.)
Add 1023 to e. This has no numerical significance with regard to x; it is simply part of the floating-point encoding. When decoding, 1023 gets subtracted.
Now, convert s, e, and f to binary numerals, using exactly one digit for s, 11 digits for e, and 52 digits for f. If necessary, including leading zeroes so that e is represented with exactly 11 binary digits and f is represented with exactly 52 binary digits. Concatenate those digits, and you have 64 bits. That is the common IEEE 754 encoding for a double-precision floating-point number.
There are some special cases: If the original number is zero, use zero for s, e, and f. (s can also be 1, to represent a special "negative zero".) If, before adding 1023, e is less than -1022, then some adjustments have to be made to get a "denormal" result or zero, which I do not describe further at the moment. If, before adding 1023, e is more than 1023, then the magnitude of the number is too large to be represented in floating point. It can be encoded as infinity instead, by setting e (after adding 1023) to 2047 and f to zero.
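The encoding steps above can be sketched in Python for the normal range. This is an illustrative sketch, not a full implementation: the function name is made up, the input is a decimal string converted with Fraction so the arithmetic is exact, zero is encoded as all zero bits, and the denormal and overflow cases are rejected rather than handled.

```python
import struct
from fractions import Fraction

def encode_double(text):
    """Return the 64-bit IEEE-754 double encoding of a decimal string, as an int."""
    x = Fraction(text)
    if x == 0:
        return 0
    s = 1 if x < 0 else 0
    x = abs(x)
    e = 0
    while x >= 2:            # scale down until x < 2
        x /= 2
        e += 1
    while x < 1:             # scale up until x >= 1
        x *= 2
        e -= 1
    f = round((x - 1) * 2**52)   # round() on a Fraction rounds ties to even
    if f == 2**52:               # rounding pushed x up to 2
        f = 0
        e += 1
    if e < -1022 or e > 1023:
        raise ValueError("denormal or overflow case, not handled in this sketch")
    return (s << 63) | ((e + 1023) << 52) | f

print(hex(encode_double("0.1")))   # 0x3fb999999999999a
# Cross-check against the encoding Python itself uses for the literal 3.14.
print(encode_double("3.14") == struct.unpack(">Q", struct.pack(">d", 3.14))[0])  # True
```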
Decimal to Floating-point:
http://sandbox.mc.edu/~bennet/cs110/flt/dtof.html

Adding negative and positive binary?

X = 01001001 and Y = 10101010
If I want to add them together how do I do that? They are "Two's Complement"...
I have tried a lot of things, but I am not quite sure I am getting the right answer, since there seem to be different types of rules.
Just want to make sure it is correct:
1. Add them as they are do not convert the negative
2. Convert the negative number you get and that's the sum.
e.g.
01001001+10101010 = 11110011 => 00001100 => 1101 => -13
Or?
1. Convert the negative
2. Add them together and convert the negative
e.g.
01001001+10101010 => 01001001 + 01010110 => 10011111 => 01100001 => -97
So basically what I want to do is to take: X-Y, and X+Y
Can someone tell me how to do that?
Some resource sites:
student-binary
celtickane
swarthmore
The beauty of two's complement is that at the binary level it's a matter of interpretation rather than algorithm - the hardware for adding two signed numbers is the same as for unsigned numbers (ignoring flag bits).
Your first example - "just add them" - is exactly the right answer. Your example numbers
01001001 = 73
10101010 = -86
So, the correct answer is indeed -13.
Subtracting is just the same, in that no special processing is required for two's complement numbers: you "just subtract them".
Note that where things get interesting is the handling of overflow/underflow bits. You can't represent the result of 73 - (-86) as an 8-bit two's complement number...
Adding in two's complement doesn't require any special processing when the signs of the two arguments are opposite. You just add them as you normally would in binary, and the sign of the result is the sign you keep.
And just to make sure you understand two's complement, to convert from a positive to a negative number (or vice versa): invert each bit, then add 1 to the result.
For example, your positive number X = 01001001 becomes 10110110+1=10110111 as a negative number; your negative number Y = 10101010 becomes 01010101+1=01010110 as a positive number.
To subtract Y from X, negate Y and add, i.e. 01001001 + 01010110. (Here the true result, 159, does not fit in 8 signed bits, so the 8-bit sum overflows.)
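The rules above (add the bit patterns as they are; negate by inverting and adding 1) can be tried out with a small Python sketch. The helper names and the fixed 8-bit width are assumptions made for this example; real hardware reports signed overflow through a flag, which is modelled here as a boolean.

```python
BITS = 8
MASK = (1 << BITS) - 1

def to_signed(pattern):
    """Interpret an 8-bit pattern as a two's complement value."""
    pattern &= MASK
    return pattern - (1 << BITS) if pattern & (1 << (BITS - 1)) else pattern

def negate(pattern):
    """Invert each bit, then add 1 (kept to 8 bits)."""
    return ((pattern ^ MASK) + 1) & MASK

def add(x, y):
    """Add two 8-bit patterns; return (result pattern, signed overflow flag)."""
    result = (x + y) & MASK
    # Signed overflow: both operands share a sign that the result does not.
    overflow = ((x ^ result) & (y ^ result) & (1 << (BITS - 1))) != 0
    return result, overflow

X, Y = 0b01001001, 0b10101010
total, overflow = add(X, Y)
print(format(total, "08b"), to_signed(total), overflow)   # 11110011 -13 False
diff, overflow = add(X, negate(Y))
print(format(diff, "08b"), to_signed(diff), overflow)     # 10011111 -97 True
```

The second line shows why X - Y appears to give -97: the true result, 159, is outside the 8-bit signed range, so the overflow flag is set and the bit pattern cannot be read as the answer.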
Your confusion might be because of the widths of the numbers involved. To get a better feel for this you could try creating a signed integer out of your unsigned integer.
If the MSB of your unsigned integer is already 0, then you can read it as signed and get the same result.
If the MSB is 1 then you can prepend a 0 on the left to get a signed number. You should sign-extend (that is, add 0s if the MSB is 0, add 1s if the MSB is 1) all the signed numbers to get a number of the same width so you can do the arithmetic "normally".
For instance, using your numbers:
X = 01001001: Unsigned, MSB is 0, do nothing.
Y = 10101010: Signed, did nothing with X, still do nothing.
But if we change the MSB of X to 1:
X = 11001001: Unsigned, MSB is 1, Add a 0 --> 011001001
Y = 10101010: Signed, extended X, so sign-extend Y --> 110101010
Now you have two signed numbers that you can add or subtract the way you already know.
01001001 + 10101010 = 11110011 => 00001100 => 1101 => -13
The first addend is 73. The second addend is -86: 86 = 01010110 in 8 bits, and inverting the bits and adding 1 gives -86 = 10101010.
Both addends are in 8-bit two's complement representation.
Their sum is 1 1 1 1 0 0 1 1. The leading 1 marks it as negative, so it is an encoded value (the magnitude has been inverted, as in one's complement, and then had 1 added, as in two's complement).
To get the decimal number, do the reverse. First subtract 1 to undo the two's complement step: 1 1 1 1 0 0 1 1 - 1
= 1 1 1 1 0 0 1 0, then invert as in one's complement = 0 0 0 0 1 1 0 1, which is equal to 13. Because the value had to be decoded this way, the answer is negative. So affix the negative sign = -13
