Convert decimal to hex/binary - math

I have a small math question.
Is there any way to convert a decimal number (for example 3.14) to hex or binary? If it's possible, can anybody post links to tutorials or explanations? (I don't want it for a particular language; I need it in general, mathematically.) Please help.
EDIT:
Input passed in code:
0.1
Output in ASM code:
415740h
Another input:
0.058
Another output by compiler:
00415748h
But how was this done? How is it converted?

I do not recognize your output samples as encodings of floating-point numbers or other common representations of .1 and .058. I suspect these numbers are addresses where the assembler or compiler has stored the floating-point encoding.
In other words, you wrote some text that included a floating-point literal, and the assembler or compiler converted that literal to a floating-point encoding, stored it at some address, and then put the address into an instruction that loads the floating-point encoding from memory.
This hypothesis is consistent with the fact that the two numbers differ by eight. Since double-precision floating-point is commonly eight bytes, the second address (0x415748) was eight bytes beyond the first address (0x415740).
The process for encoding a number in floating-point is roughly this:
Let x be the number to be encoded.
Set s (a sign bit) to 0 if x is positive and to 1 if x is negative. Set x to the absolute value of x.
Set e (an exponent) to 0. Repeat whichever of the following is appropriate:
If x is 2 or greater, add 1 to e and divide x by 2. Repeat until x is less than 2.
If x is less than 1, add -1 to e and multiply x by 2. Repeat until x is at least 1.
When you are done with the above, x is at least 1 and is less than 2. Also, the original number equals (-1)^s·2^e·x. That is, we have represented the number with a sign bit (s), an exponent of two (e), and a significand (x) that is in [1, 2) (includes 1, excludes 2).
Set f = (x-1)·2^52. Round f to the nearest integer (if it is a tie between two integers, round to the even integer). If f is now 2^52, set f to 0 and add 1 to e. (This step finds the 52 bits of x that are immediately after the “decimal point” when x is represented as a binary numeral, with rounding after the 52nd digit, and it adjusts the exponent if rounding at that position rounds x up to 2, which is outside the interval where we want it.)
Add 1023 to e. This has no numerical significance with regard to x; it is simply part of the floating-point encoding. When decoding, 1023 gets subtracted.
Now, convert s, e, and f to binary numerals, using exactly one digit for s, 11 digits for e, and 52 digits for f. If necessary, include leading zeroes so that e is represented with exactly 11 binary digits and f with exactly 52. Concatenate those digits, and you have 64 bits. That is the common IEEE 754 encoding for a double-precision floating-point number.
There are some special cases: If the original number is zero, use zero for s, e, and f. (s can also be 1, to represent a special “negative zero”.) If, before adding 1023, e is less than -1022, then some adjustments have to be made to get a “denormal” result or zero, which I do not describe further at the moment. If, before adding 1023, e is more than 1023, then the magnitude of the number is too large to be represented in floating point. It can be encoded as infinity instead, by setting e (after adding 1023) to 2047 and f to zero.
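For concreteness, here is a minimal C sketch of these steps. It handles only normal, nonzero, finite values (no zeros, subnormals, infinities, or NaNs), and the function name encode_double is made up for this example.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Follow the steps above for a normal, nonzero, finite value. */
uint64_t encode_double(double x) {
    uint64_t s = 0;
    if (x < 0) { s = 1; x = -x; }           /* sign bit */
    int64_t e = 0;
    while (x >= 2.0) { x /= 2.0; e++; }     /* normalize x into [1, 2) */
    while (x < 1.0)  { x *= 2.0; e--; }
    uint64_t f = (uint64_t) llround((x - 1.0) * 4503599627370496.0); /* (x-1)*2^52 */
    if (f == 4503599627370496ULL) { f = 0; e++; }  /* rounding carried x up to 2 */
    e += 1023;                              /* bias the exponent */
    return (s << 63) | ((uint64_t) e << 52) | f;   /* concatenate s, e, f */
}

int main(void) {
    printf("0.1 -> %016llX\n", (unsigned long long) encode_double(0.1));
    return 0;
}

On an IEEE 754 system this prints 0.1 -> 3FB999999999999A, which is the 64-bit pattern the compiler stores in memory at an address such as 415740h.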

Decimal to Floating-point:
http://sandbox.mc.edu/~bennet/cs110/flt/dtof.html

Related

R largest/smallest representable numbers

I'm trying to get the largest/smallest representable number in R.
After typing ".Machine"
I got:
$double.xmin
[1] 2.225074e-308
$double.xmax
[1] 1.797693e+308
However, even if I type 2.225074e-309 at the R command prompt, I get 2.225074e-309 back instead of the expected 0.
How can I find the largest/smallest number for which adding or subtracting 1 would lead to either Inf (adding 1 to the largest number) or 0 (subtracting 1 from the smallest number)?
.Machine$double.xmin gives the value of the smallest positive number whose representation meets the requirements of the IEEE 754 technical standard for floating-point computation. As is mentioned in the Wikipedia article on double-precision floating-point numbers, that standard requires that:
If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original. If an IEEE 754 double precision is converted to a decimal string with at least 17 significant digits and then converted back to double, then the final number must match the original.
The same article goes on to note that, by compromising precision, even smaller positive numbers (which do not meet the standard's precision requirements) can be represented:
The 11 bit width of the exponent allows the representation of numbers between 10^-308 and 10^308, with full 15–17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10^-324.
R's doubles behave in exactly this way, as is noted in the Details section of ?.Machine:
Note that on most platforms smaller positive values than
‘.Machine$double.xmin’ can occur. On a typical R platform the
smallest positive double is about ‘5e-324’.
To confirm that that is the smallest positive value that can be represented using R's doubles and to see the cost in loss of precision, try out a few operations like this:
5e-324
# [1] 4.940656e-324
2e-324
# [1] 0
1.4 * 5e-324
# [1] 4.940656e-324
1.6 * 5e-324
# [1] 9.881313e-324
Here are some representations using SAS (IEEE 754, big-endian):
data _null_;
  y = constant('big');
  put y hex16.;
  put y E21.3;
run; quit;
Biggest
7FEFFFFFFFFFFFFF
1.79769313486230E+308
data _null_;
  y = constant('small');
  put y hex16.;
  put y E21.3;
run; quit;
Smallest
0010000000000000
2.22507385850720E-308
I am not sure about the smallest, because SAS may set aside some values for missing values.

What's the highest and the lowest integer for representing signed numbers in two's complement in 5 bits?

I understand how binary works and I can calculate binary to decimal, but I'm lost around signed numbers.
I have found a calculator that does the conversion, but I'm not sure how to find the maximum and the minimum number, or how to convert when a binary number is not given; the questions on Stack Overflow seem to be about converting specific numbers, or don't cover signed numbers with a specific bit width.
The specific question is:
We have only 5 bits for representing signed numbers in two's complement:
What is the highest signed integer?
Write its decimal value (including the sign only if negative).
What is the lowest signed integer?
Write its decimal value (including the sign only if negative).
It seems like I'll have to go deeper into binary concepts; I have only two months of programming experience, and I thought I knew about binary conversion.
From a logical point of view:
Bounds in signed representation
You have 5 bits, so there are 32 different combinations. It means that you can make 32 different numbers with 5 bits. With unsigned integers, it makes sense to store the integers from 0 to 31 (inclusive) in 5 bits.
However, that covers only unsigned integers. Meaning: we have to find a way to represent negative numbers too. Meaning: we have to store the number's value, but also its sign (+ or -). The representation used is 2's complement, and it is the one that's taught everywhere (other representations exist, but this is by far the most common). In this representation, the sign is given by the first bit. That is, in 2's complement representation a positive number starts with a 0 and a negative number starts with a 1.
And here the problem arises: is 0 a positive number or a negative number? It can't be both, because that would mean 0 could be represented in two ways for a given number of bits (for 5: 00000 and 10000), that is, we would lose the space to put one more number. The convention is that 0 is positive. For any number of bits, signed or unsigned, 0 is represented with only zeroes.
Great. This gives us the answer to the first question: what is the upper bound for a decimal number represented in 2's complement? We know that the first bit is for the sign, so all of the numbers we can represent must be composed of 4 bits. We can have 16 different values of 4-bit strings, and 0 is one of them, so the upper bound is 15.
Now, for the negative numbers, this becomes easy. We have already filled 16 values out of the 32 we can make on 5 bits. 16 left. We also know that 0 has already been represented, so we don't need to include it. Then we start at the number right before 0: -1. As we have 16 numbers to represent, starting from -1, the lowest signed integer we can represent on 5 bits is -16.
More generally, with n bits we can represent 2^n numbers. With signed integers, half of them are positive and half of them are negative. That is, we have 2^(n-1) positive numbers and 2^(n-1) negative numbers. As we know 0 is considered positive, the greatest signed integer we can represent on n bits is 2^(n-1) - 1 and the lowest is -2^(n-1).
2's complement representation
Now that we know which numbers can be represented on 5 bits, the question is to know how we represent them.
We already saw that the sign is represented on the first bit, and that 0 is considered positive. For positive numbers, it works the same way as it does for unsigned integers: 00000 is 0, 00001 is 1, 00010 is 2, etc., until 01111, which is 15. This is where we stop for positive signed integers, because we have occupied all the 16 values we had.
For negative signed integers, this is different. If we keep the same representation (10001 is -1, 10010 is -2, ...) then we end up with 11111 being -15 and 10000 not being assigned. We could decide to say it's -16, but we would have to check for this particular case each time we work with negative integers. Plus, this messes up all of the binary operations. We could also decide that 10000 is -1, 10001 is -2, 10010 is -3, etc. But that also messes up all of the binary operations.
2's complement works the following way. Let's say you have the signed integer 10011 and you want to know what decimal it is.
Flip all the bits: 10011 --> 01100
Add 1: 01100 --> 01101
Read it as an unsigned integer: 01101 = 0*2^4 + 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 13.
So 10011 represents -13. This representation is very handy because it works both ways. How do you represent -7 as a binary signed integer? Start with the binary representation of 7, which is 00111.
Flip all the bits: 00111 --> 11000
Add 1: 11000 --> 11001
And that's it! On 5 bits, -7 is represented by 11001.
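As a quick illustration, here is a minimal C sketch of the flip-and-add-one rule for a 5-bit word; the helper names decode5 and encode5 and the fixed width are assumptions made for this example.

#include <stdio.h>

enum { WIDTH = 5, MASK = (1 << WIDTH) - 1 };

/* Decode a 5-bit two's-complement pattern into an ordinary int. */
int decode5(unsigned bits) {
    bits &= MASK;
    if (bits & (1u << (WIDTH - 1)))             /* sign bit set: negative */
        return -(int)((~bits + 1u) & MASK);     /* flip the bits, add 1 */
    return (int)bits;
}

/* Encode an int in [-16, 15] as a 5-bit two's-complement pattern. */
unsigned encode5(int value) {
    return (unsigned)value & MASK;              /* the same rule, both ways */
}

int main(void) {
    printf("10011 decodes to %d\n", decode5(0x13));        /* prints -13 */
    printf("-7 encodes to 0x%02X (11001)\n", encode5(-7)); /* prints 0x19 */
    return 0;
}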
I won't cover it here, but another great advantage of 2's complement is that addition works the same way: when adding two binary numbers, you do not have to care whether they are signed or unsigned; the same algorithm works behind the scenes.
With this, you should be able to answer the questions, but more importantly to understand the answers.
This topic is great for understanding 2's complement: Why is two's complement used to represent negative numbers?

Checking if floating point number is completely convertible to binary [duplicate]

This question already has answers here:
What types of numbers are representable in binary floating-point?
(4 answers)
Closed 8 years ago.
I know how to convert integers and floating-point numbers to binary, but some floating-point numbers don't have an exact binary representation. For example, 0.5 can be written exactly as 0.1 and 0.25 as 0.01. But how can I be sure whether a number is exactly convertible to binary?
To be representable in a binary floating-point format, a number must be a multiple of a power of two, including negative powers. For example, .375 is representable, and it is a multiple of 1/8. (1/8 is .125, and .375 is three times that.) There are additional requirements because the parts of the number must fit into the floating-point format:
A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M·2^e for some integers M and e such that -2^53 < M < 2^53 and -1074 ≤ e ≤ 971.
For single precision, -2^24 < M < 2^24 and -149 ≤ e ≤ 104.
These values come from the parameters that specify the floating-point formats, such as how many bits are available for the fraction and exponent fields and how much the exponent is biased.
The following is a method to test whether a number meets the above criteria.
First, if the number has a fractional part, try multiplying the number by two until there is no fractional part. If you multiply more than 149 times (for single precision) or 1074 times (for double precision), the number is not representable. If the number has no fractional part but is even, divide it by two until it is odd. Stop after 104 (for single precision) or 971 (for double precision) divisions. When you are done multiplying or dividing, look at the absolute value of the remaining number. If it is greater than or equal to 16,777,216 (for single precision) or 9,007,199,254,740,992 (for double precision), the number is not representable. Otherwise, it is.
(Tip: When doing the multiplication step with a number in decimal, if the fraction part ever ends in a digit other than 5, the number is not representable. E.g., .4 and .24 are not representable. .5, .25, and .625 are, although .525 is not.)
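A minimal C sketch of this test for double precision follows. It assumes the value is supplied as an exact fraction p/q (for instance, 0.375 as 375/1000) so the arithmetic stays exact; using 64-bit integers limits how large p and q can be.

#include <stdio.h>
#include <stdint.h>

/* Test whether the exact fraction p/q is representable as an IEEE 754
   double, following the multiply/divide procedure above. */
int representable_double(uint64_t p, uint64_t q) {
    if (p == 0) return 1;             /* zero is representable */
    int e = 0;
    while (p % q != 0) {              /* "multiply by two": halve q while even, */
        if (q % 2 != 0) return 0;     /* which doubles the value without overflow; */
        q /= 2;                       /* an odd factor left in q can never be */
        if (--e < -1074) return 0;    /* cleared by doubling */
    }
    p /= q;                           /* no fractional part remains */
    while (p % 2 == 0 && e < 971) { p /= 2; ++e; }  /* "divide by two" step */
    return p < (1ULL << 53);          /* the remaining integer must fit in 53 bits */
}

int main(void) {
    printf("0.375 -> %d\n", representable_double(375, 1000)); /* prints 1 */
    printf("0.4   -> %d\n", representable_double(4, 10));     /* prints 0 */
    return 0;
}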
If you have large-integer arithmetic available, rewrite xxx.yyy as a fraction xxxyyy/10^n, where n is the length (number of digits) of the fraction part yyy.
You convert the decimal representations of the numerator and the denominator of the above fraction to binary, then reduce the fraction:
compute gcd(xxxyyy,10^n),
reducedNumerator=xxxyyy/gcd(xxxyyy,10^n),
reducedDenominator=10^n/gcd(xxxyyy,10^n)
if reducedDenominator is divisible by 5 (remainder is zero), then the number is not representable in binary; else it is, and the float can be represented in the form sign * integerSignificand * 2^biasedExponent
But in most languages, the number of bits of significand and the range of exponent are limited.
Let's take a look at the numerator's absolute value first (I presume it is not 0) and extract:
nh = rank of highest bit set to 1,
nl = rank of lowest bit set to 1
nh+1-nl gives you the number of bits (binary digits) required to represent the significand; it must be under the limit.
For example, in IEEE 754 double precision: nh+1-nl<=53 or nh-nl<53
Now let's look at the denominator: if it is not divisible by 5, it is a power of two (10....0 in binary), so we can also rewrite the divisibility test: take
dh = rank of highest bit set to 1
dl = rank of lowest bit set to 1
If dh is not equal to dl, then the number is not representable in binary.
If dh==dl, then nh-dh must lie in a certain range, and so must nl-dh.
For example in IEEE 754 double precision: -1074 <= nl-dh and nh-dh <= 1023
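Here is a minimal C sketch of this fraction-based test for double precision; the function name exactly_representable is made up, and using 64-bit integers restricts xxxyyy to about 19 digits.

#include <stdio.h>
#include <stdint.h>

static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* Test whether xxx.yyy, passed as the digits xxxyyy and the fraction
   length n, is representable as an IEEE 754 double. */
int exactly_representable(uint64_t xxxyyy, unsigned n) {
    uint64_t den = 1;
    while (n--) den *= 10;                           /* 10^n */
    uint64_t g = gcd64(xxxyyy, den);
    uint64_t num = xxxyyy / g, red = den / g;        /* reduced fraction */
    if (num == 0) return 1;                          /* zero is representable */
    if (red % 5 == 0) return 0;                      /* a factor of 5 remains */
    int nh = 63; while (!(num >> nh)) nh--;          /* highest set bit of numerator */
    int nl = 0;  while (!(num & (1ULL << nl))) nl++; /* lowest set bit */
    int dh = 0;  while (!(red & (1ULL << dh))) dh++; /* red == 2^dh */
    if (nh - nl >= 53) return 0;                     /* needs more than 53 significand bits */
    return nl - dh >= -1074 && nh - dh <= 1023;      /* exponent range check */
}

int main(void) {
    printf("0.375 -> %d\n", exactly_representable(375, 3)); /* prints 1 */
    printf("0.1   -> %d\n", exactly_representable(1, 1));   /* prints 0 */
    return 0;
}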
What would be interesting, and simpler if you are using IEEE 754 formats, is to know whether the conversion function atod or atof as provided by libc, or the equivalent provided by your language, will correctly raise the IEEE 754 inexact flag, and how you can access this flag in your language...
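In C, such a check looks roughly like the sketch below; whether strtod actually raises FE_INEXACT for an inexact conversion is implementation-dependent (the C standard permits it but does not require it), so treat this as an experiment to run on your own platform.

#include <fenv.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *inputs[] = { "0.375", "0.1", "0.525" };
    for (int i = 0; i < 3; i++) {
        feclearexcept(FE_INEXACT);           /* clear the sticky flag */
        double d = strtod(inputs[i], NULL);  /* convert the decimal string */
        printf("%s -> %s (%.17g)\n", inputs[i],
               fetestexcept(FE_INEXACT) ? "inexact" : "exact", d);
    }
    return 0;
}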

Can a IEEE 754 real number "cover" all integers within its range?

The original question was edited (shortened) to focus on a problem of precision, not range.
Whether single or double precision, every representation of a real number is limited to (-range, +range). Within this range lie some integer numbers (1, 2, 3, 4, and so on; the same goes for negative numbers).
Is there a guarantee that an IEEE 754 real number (float, double, etc.) can "cover" all integers within its range? By "cover" I mean the real number will represent the integer exactly, not as (for example) "5.000001".
Just as a reminder, http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html gives a nice explanation of various number representation formats.
Update:
Because the question asks "can", I am also looking for a demonstration that this cannot be done; for that, quoting one number is enough. For example: "no, it cannot be done, because for example the number 1748574 is not represented exactly by a float" (this number is taken out of thin air, of course).
For curious reader
If you would like to play with IEEE 754 representation -- on-line calculator: http://www.ajdesigner.com/fl_ieee_754_word/ieee_32_bit_word.php
No, not all, but there exists a range within which you can represent all integers accurately.
Structure of 32-bit floating-point numbers
The 32-bit floating-point type uses
1 bit for the sign
8 bits for the exponent
23 bits for the fraction (leading 1 implied)
Representing numbers
Basically, you have a number in the form
(-)1.xxxx_xxxx_xxxx_xxxx_xxxx_xxx (binary)
which you then shift left/right with the (unbiased) exponent.
To have it represent an integer requiring n bits, you need to shift it by n-1 bits to the left. (All xes beyond the binary point are simply zero.)
Representing integers with 24 bits
It is easy to see that we can represent all integers requiring 24 bits (or fewer):
1xxx_xxxx_xxxx_xxxx_xxxx_xxxx.0 (unbiased exponent = 23)
since we can set the xes at will to either 1 or 0.
The highest number we can represent in this fashion is:
1111_1111_1111_1111_1111_1111.0
or 2^24 - 1 = 16777215
The next higher integer is 1_0000_0000_0000_0000_0000_0000. Thus, we need 25 bits.
Representing integers with 25 bits
If you try to represent a 25 bit integer (unbiased exponent = 24), the numbers have the following form:
1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0
The twenty-three digits that are available to you have all been shifted past the floating point. The leading digit is always a 1. In total, we have 24 digits. But since we need 25, a zero is appended.
A maximum is found
We can represent 1_0000_0000_0000_0000_0000_0000 with the form 1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0, by simply assigning 1 to all xes. The next higher integer from that is 1_0000_0000_0000_0000_0000_0001. It's easy to see that this number cannot be represented accurately, because the form does not allow us to set the last digit to 1: it is always 0.
It follows, that the 1 followed by 24 zeroes is an upper bound for the integers we can accurately represent.
The lower bound simply has its sign bit flipped.
Range within which all integers can be represented (including boundaries)
2^24 as an upper bound
-2^24 as a lower bound
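A quick C check of that boundary, assuming float is IEEE 754 single precision:

#include <stdio.h>

int main(void) {
    float f = 16777216.0f;        /* 2^24, the upper bound above */
    printf("%.1f\n", f + 1.0f);   /* prints 16777216.0: 2^24 + 1 rounds back down */
    printf("%.1f\n", f + 2.0f);   /* prints 16777218.0: this even integer is exact */
    return 0;
}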
Structure of 64-bit floating-point numbers
1 bit for the sign
11 exponent bits
52 fraction bits
Range within which all integers can be represented (including boundaries)
2^53 as an upper bound
-2^53 as a lower bound
This easily follows by applying the same argumentation to the structure of 64-bit floating-point numbers.
Note: that is not to say these are the only integers we can represent, but it gives you a range within which you can represent all integers. Beyond that range, we can only represent a power of two multiplied with an integer from said range.
Combinatorial argument
To convince ourselves that it is impossible for 32-bit floating-point numbers to represent all integers a 32-bit integer can represent, we need not even look at the structure of floating-point numbers.
With 32 bits, there are 2^32 different things we can represent. No more, no less.
A 32-bit integer uses all of these "things" to represent numbers (pairwise different).
A 32-bit floating-point number can represent at least one number with a fractional part.
Thus, it is impossible for the 32-bit floating-point number to represent this fractional number in addition to all 2^32 integers.
macias, to add to the already excellent answer by phant0m (upvoted; I suggest you accept it), I'll use your own words.
"No it cannot be done, for example number 16777217 is not represented exactly by float number."
Also, "for example number 9223372036854775809 is not represented exactly by double number".
This is assuming your computer is using the IEEE floating point format, which is a pretty strong bet.
No.
For example, on my system, the type float can represent values up to approximately 3.40282e+38. As an integer, that would be approximately 340282000000000000000000000000000000000, or about 2^128.
The size of float is 32 bits, so it can exactly represent at most 2^32 distinct numbers.
An integer object generally uses all of its bits to represent values (with 1 bit dedicated as a sign bit for signed types). A floating-point object uses some of its bits to represent an exponent (8 bits for IEEE 32-bit float); this increases its range at the cost of losing precision.
A concrete example (1267650600228229401496703205376.0 is 2^100, and is exactly representable as a float):
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    float x = 1267650600228229401496703205376.0;  /* 2^100, exactly representable */
    float y = nextafterf(x, FLT_MAX);             /* the next representable float up */
    printf("x = %.1f\n", x);
    printf("y = %.1f\n", y);
    return 0;
}
The output on my system is:
x = 1267650600228229401496703205376.0
y = 1267650751343956853325350043648.0
Another way to look at it:
A 32-bit object can represent at most 2^32 distinct values.
A 32-bit signed integer can represent all integer values in the range -2147483648 .. 2147483647 (-2^31 .. +2^31-1).
A 32-bit float can represent many values that a 32-bit signed integer can't, either because they're fractional (0.5) or because they're too big (2.0^100). Since there are values that can be represented by a 32-bit float but not by a 32-bit int, there must be other values that can be represented by a 32-bit int but not by a 32-bit float. Those values are integers that have more significant digits than a float can handle, because the int has 31 value bits but the float has only about 24.
Apparently you are asking whether a Real data type can represent all of the integer values in its range (absolute values up to FLT_MAX or DBL_MAX, in C, or similar constants in other languages).
The largest numbers representable by floating-point numbers stored in K bits typically are much larger than 2^K, the number of distinct values that K bits can represent, so typically the answer is no. 32-bit C floats exceed 10^37, while 32-bit C integers are less than 10^10. To find out the next representable number after some number, use nextafter() or nextafterf(). For example, the code
printf ("%20.4f %20.4f\n", nextafterf(1e5,1e9), nextafterf(1e6,1e9));
printf ("%20.4f %20.4f\n", nextafterf(1e7,1e9), nextafterf(1e8,1e9));
prints out
100000.0078 1000000.0625
10000001.0000 100000008.0000
You might be interested in whether an integer J that is between two nearby fractional floating values R and S can be represented exactly, supposing S-R < 1 and R < J < S. Yes, such a J can be represented exactly. Every float value is the ratio of some integer and some power of 2. (Or is the product of some integer and some power of 2.) Let the power of 2 be P, and suppose R = U/P, S = V/P. Now U/P < J < V/P so U < J*P < V. More of J*P's low-order bits are zero than are those of U, V (because V-U < P, due to S-R < 1), so J can be represented exactly.
I haven't filled in all the details to show that J*P-U < P and V-J*P < P, but under the assumption S-R < 1 that's straightforward. Here is an example of R,J,S,P,U,V value computations: Let R=99999.9921875 = 12799999/128, (ie P=128); let S=100000.0078125 = 12800001/128; we have U=0xc34fff and V=0xc35001 and there is a number between them that has more low-order zeroes than either; to wit, J = 0xc35000/128 = 12800000/128 = 100000.0. For the numbers in this example, note that U and V require 24 bits for their exact representations (6 ea. 4-bit hex digits). Note that 24 bits is the number of bits of precision in IEEE 754 single-precision floating point numbers. (See table in wikipedia article.)
That each floating point number is a product or ratio of some integer and some power of 2 (as mentioned two paragraphs above) also is discussed in that floating point article, in a paragraph that begins:
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, ... a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly.

What types of numbers are representable in binary floating-point?

I've read a lot about floats, but it's all unnecessarily involved. I think I've got it pretty much understood, but there's just one thing I'd like to know for sure:
I know that fractions of the form 1/pow(2,n), with n an integer, can be represented exactly in floating-point numbers. This means that if I add 1/32 to itself 32 million times, I would get exactly 1,000,000.
What about something like 1/(32+16)? It's one over the sum of two powers of two; does this work? Or is it 1/32+1/16 that works? This is where I'm confused, so if anyone could clarify that for me, I would appreciate it.
The rule can be summed up as this:
A number can be represented exactly in binary if the prime factorization of the denominator contains only 2. (i.e. the denominator is a power-of-two)
So 1/(32 + 16) is not representable in binary because it has a factor of 3 in the denominator. But 1/32 + 1/16 = 3/32 is.
That said, there are more restrictions to be representable in a floating-point type. For example, you only have 53 bits of mantissa in an IEEE double so 1/2 + 1/2^500 is not representable.
So you can do sum of powers-of-two as long as the range of the exponents doesn't span more than 53 powers.
To generalize this to other bases:
A number can be exactly represented in base 10 if the prime factorization of the denominator consists of only 2's and 5's.
A rational number X can be exactly represented in base N if the prime factorization of the denominator of X contains only primes found in the factorization of N.
A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M·2^e for some integers M and e such that -2^53 < M < 2^53 and -1074 ≤ e ≤ 971.
For single precision, -2^24 < M < 2^24 and -149 ≤ e ≤ 104.
For double-precision, these are consequences of the facts that the double-precision format uses 52 bits to store a significand (which normally has 53 bits due to an implicit 1) and uses 11 bits to store an exponent. 11 bits encodes numbers from 0 to 2047, but 0 and 2047 are excluded for special purposes, and the encoded number is biased by 1023, so it represents unbiased exponents from -1022 to 1023. However, these unbiased exponents are for significands in the interval [1, 2), and those significands have fractions. To express the significand as an integer, I adjusted the exponent range by 52. Single-precision is similar, with 23 bits to store a 24-bit significand, 8 bits for the exponent, and a bias of 127.
Expressing the representable numbers using an integer times a power of two rather than the more common fractional significand simplifies some number theory and other reasoning about floating-point properties. I used it in this answer because it allows the set of representable values to be expressed concisely.
Floating-point numbers are literally represented using the form:
1.m * 2^e
Where 1.m is a binary fraction and e is a positive or negative integer.
As such, you can represent 1/32 + 1/16 exactly, as:
1.1000000 * 2^-4
(1.10 being the binary fraction equivalent to 1.5.) 1/48, however, is not representable in this format.
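A small C check of this decomposition, using the standard frexp function (which returns a significand in [0.5, 1), renormalized here to [1, 2)):

#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 1.0/32 + 1.0/16;   /* 3/32, exactly representable */
    int e;
    double m = frexp(x, &e);      /* x = m * 2^e with m in [0.5, 1) */
    printf("%g = %g * 2^%d\n", x, 2 * m, e - 1);  /* prints 0.09375 = 1.5 * 2^-4 */
    return 0;
}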
One point not yet mentioned is that semantically, a floating-point number may be best regarded as representing a range of values. The range of values has a very precisely-defined center point, and the IEEE spec generally requires that the result of a floating-point computation be the number whose range contains the point one would get operating upon the center-points of the original numbers, but in the sequence:
double N1 = 0.1;
float N2 = (float)N1;
double N3 = N2;
N2 is the unambiguous correct single-precision representation of the value that had been represented in N1, despite the language's silly requirement to use an explicit cast. N3 will represent one of the values that N2 could represent (the language spec happens to choose the double value whose range is centered upon the middle of the range of the float). Note that while N2 represents the value of its type whose range contains the correct value, N3 does not.
Incidentally, conversion of a number from a string to a float in .NET languages seems to go through an intermediate conversion to double, which may sometimes alter the value. For example, even though the value 13571357 is representable as a single-precision float, the value 13571357.499999999069f gets rounded to 13571358 (even though it's obviously closer to 13571357).
