R largest/smallest representable numbers - r

I'm trying to get the largest/smallest representable number in R.
After typing ".Machine"
I got:
$double.xmin
[1] 2.225074e-308
$double.xmax
[1] 1.797693e+308
However even if I type 2.225074e-309 in R command prompt I get 2.225074e-309 instead of the expected 0
How can I find the largest/smallest number for which adding or subtracting 1 would lead to either Inf(Adding 1 to largest number) or 0(subtracting 1 from smallest number) ?

.Machine$double.xmin gives the value of the smallest positive number whose representation meets the requirements of IEEE 754 technical standard for floating point computation. As is mentioned in the Wikipedia article on double-precision floating point numbers, that standard requires that:
If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original. If an IEEE 754 double precision is converted to a decimal string with at least 17 significant digits and then converted back to double, then the final number must match the original.
The same article goes on to note that, by compromising precision, even smaller positive numbers (which do not meet the standards' precision requirements) can be represented:
The 11 bit width of the exponent allows the representation of numbers between 10-308 and 10308, with full 15–17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10-324.
R's doubles behave in exactly this way, as is noted in the Details section of ?.Machine:
Note that on most platforms smaller positive values than
‘.Machine$double.xmin’ can occur. On a typical R platform the
smallest positive double is about ‘5e-324’.
To confirm that that is the smallest positive value that can be represented using R's doubles and to see the cost in loss of precision, try out a few operations like this:
5e-324
# [1] 4.940656e-324
2e-324
# [1] 0
1.4 * 5e-324
# [1] 4.940656e-324
1.6 * 5e-324
# [1] 9.881313e-324

Here are some representations using SAS, IEEE 754 Big Endian?
data _null_;
y=constant('big');
put y hex16.;
put y E21.3;
run;quit;
Biggest
7FEFFFFFFFFFFFFF
1.79769313486230E+308
data _null_;
y=constant('small');
put y hex16.;
put y E21.3;
run;quit;
Smallest
0010000000000000
2.22507385850720E-308
I am not sure the smallest because SAS may set aside some values for missings.

Related

Mathematical precision at 19th decimal place and beyond

I have the same set of data and am running the same code, but sometimes I get different results at the 19th decimal place and beyond. Although this is not a great concern to me for numbers less than 0.0001, it makes me wonder whether 19th decimal place is Raku's limit of precision?
Word 104 differ:
0.04948872986571077 19 chars
0.04948872986571079 19 chars
Word 105 differ:
0.004052062278212545 20 chars
0.0040520622782125445 21 chars
TL;DR See the doc's outstanding Numerics page.
(I had forgotten about that page before I wrote the following answer. Consider this answer at best a brief summary of a few aspects of that page.)
There are two aspects to this. Internal precision and printing precision.
100% internal precision until RAM is exhausted
Raku supports arbitrary precision number types. Quoting Wikipedia's relevant page:
digits of precision are limited only by the available memory of the host system
You can direct Raku to use one of its arbitrary precision types.[1] If you do so it will retain 100% precision until it runs out of RAM.
Arbitrary precision type
Corresponding type checking[2]
Example of value of that type
Int
my Int $foo ...
66174449004242214902112876935633591964790957800362273
FatRat
my FatRat $foo ...
66174449004242214902112876935633591964790957800362273 / 13234889800848443102075932929798260216894990083844716
Thus you can get arbitrary internal precision for integers and fractions (including arbitrary precision decimals).
Limited internal precision
If you do not direct Raku to use an arbitrary precision number type it will do its best but may ultimately switch to limited precision. For example, Raku will give up on 100% precision if a formula you use calculates a Rat and the number's denominator exceeds 64 bits.[1]
Raku's fall back limited precision number type is Num:
On most platforms, [a Num is] an IEEE 754 64-bit floating point numbers, aka "double precision".
Quoting the Wikipedia page for that standard:
Floating point is used ... when a wider range is needed ... even if at the cost of precision.
The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2−53 ≈ 1.11 × 10−16).
Printing precision
Separate from internal precision is stringification of numbers.
(It was at this stage that I remembered the doc page on Numerics linked at the start of this answer.)
Quoting Printing rationals:
Keep in mind that output routines like say or put ... may choose to display a Num as an Int or a Rat number. For a more definitive string to output, use the raku method or [for a rational number] .nude
Footnotes
[1] You control the type of a numeric expression via the types of individual numbers in the expression, and the types of the results of numeric operations, which in turn depend on the types of the numbers. Examples:
1 + 2 is 3, an Int, because both 1 and 2 are Ints, and a + b is an Int if both a and b are Ints;
1 / 2 is not an Int even though both 1 and 2 are individually Ints, but is instead 1/2 aka 0.5, a Rat.
1 + 4 / 2 will print out as 3, but the 3 is internally a Rat, not an Int, due to Numeric infectiousness.
[2] All that enforcement does is generate a run-time error if you try to assign or bind a value that is not of the numeric type you've specified as the variable's type constraint. Enforcement doesn't mean that Raku will convert numbers for you. You have to write your formulae to ensure the result you get is what you want.[1] You can use coercion -- but coercion cannot regain precision that's already been lost.

Converting a decimal number to floating point notation and IEEE 754 format

I'm doing an assignment for one of my classes and I'm stuck on those two questions:
Express the decimal -412.8 using binary floating point notation using 11 fraction bits for
the significand and 3 digits for the exponent without bias
I think I managed to solve it, but my exponent has 4 bits not 3. I don't really understand how you can convert -412.8 to floating point notation using only 3 bit exponents. Here is how I tried to solve it:
First of all, the floating point notation has three parts. The sign part, 0 for positive numbers and 1 for negative numbers, the exponent part and finally the mantissa. The mantissa in this case includes the leading 1. Since the number is negative, the sign bit is going to be 1. For the mantissa, I first converted 412.8 to binary, which gave me 110011100.11 and then I shifted the decimal point to the left 8 times, which gives me 1.1001110011. The mantissa is therefore 1100 1110 011 (11 bits as the teacher asked). Finally, the exponent is going to be 2^8, since I shifted the decimal 8 times to the right. 8 is 1000 in binary. So am I correct to assume that my floating point notation should be 1 1000 11001110011?
Represent the decimal number 16.1875×2-134 in single-precision IEEE 754 format.
I'm completely stuck on this one. I don't know how to convert that number. When I enter it in wolfram, the decimal number is way beyond the limit of the single precision format. I do know that the sign bit is going to be 0 since the number is positive. I don't know what the mantissa is though, nor how to find it. I also don't know how to find the exponent. Can someone guide me through this problem? Thanks.
For 1, you appear to be correct -- there's no way to reperesent the exponent unbiased in 3 bits. Of course, the problem says "3 digits" and doesn't define a base for the digits...
2 is relatively straight-forward -- convert the value to binary gives you 10000.0011 then normalize, giving 1.00000011×2-130. Now -130 is too small for a single-precision exponent (minimum is -126), so we have to denormalize (continue shifting the point to get an exponent of -126), which gives us 0.000100000011×2-126. That's then our mantissa (with the 0 dropped) and an exponent field of 0: 0|00000000|00010000001100000000000 (vertical bars separating the sign/exponent/mantissa fields) or 0x00081800

Checking if floating point number is completely convertible to binary [duplicate]

This question already has answers here:
What types of numbers are representable in binary floating-point?
(4 answers)
Closed 8 years ago.
I know how to convert Integers and floating point numbers to binary. But some floating point numbers didn't have exact binary format. Like 0.5 can be exactly written as 0.1 and 0.25 can be written as 0.01 . But how can be sure if a number is completely convertible to binary?
To be representable a binary floating-point format, a number must be a multiple of a power of two, including negative powers. For example, .375 is representable, and it is a multiple of 1/8. (1/8 is .125, and .375 is three times that.) There are additional requirements because the parts of the number must fit into the floating-point format:
A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M•2e for some integers M and e such that -253 < M < 253 and -1074 ≤ e ≤ 971.
For single precision, -224 < M < 224 and -149 ≤ e ≤ 104.
These values come from the parameters that specify the floating-point formats, such as how many bits are available for the fraction and exponent fields and how much the exponent is biased.
The following is a method to test whether a number meets the above criteria.
First, if the number has a fractional part, try multiplying the number by two until there is no fractional part. If you multiply more than 149 (for single precision) or 1074 times (for double precision), the number is not representable. If the number has no fractional part but is even, divide it by two until it is odd. Stop after 104 (for single precision) or 971 (for double precision) divisions. When you are done multiplying or dividing, look at the absolute value of the remaining number. If it is greater than or equal to 16,777,216 (for single precision) or 9,007,199,254,740,992 (for double precision), the number is not representable. Otherwise, it is.
(Tip: When doing the multiplication step with a number in decimal, if the fraction part ever ends in a digit other than 5, the number is not representable. E.g., .4 and .24 are not representable. .5, .25, and .625 are, although .525 is not.)
If you have large integer arithmetic available, rewrite xxx.yyy as a fraction xxxyyy/10^n where n is length (number of digits) of fraction part yyy.
You convert decimal representation of numerator and denominator of above fraction to binary, then reduce the fraction:
compute gcd(xxxyyy,10^n),
reducedNumerator=xxxyyy/gcd(xxxyyy,10^n),
reducedDenominator=10^n/gcd(xxxyyy,10^n)
if reducedDenominator is divisible by 5 (remainder is zero) then the number is not representable in binary, else it is, and the float can be represented in the formsign*integerSignificand*2^biasedExponent
But in most languages, the number of bits of significand and the range of exponent are limited.
Let's take a look at numerator absolute value first, (I presume not 0) and extract
nh = rank of highest bit set to 1,
nl = rank of lowest bit set to 1
nh+1-nl gives you the number of bits (binary digits) required to represent the significand, it must be under the limit
For example, in IEEE 754 double precision: nh+1-nl<=53 or nh-nl<53
Now let's look at denominator, if not divisible by 5, it is a power of two 10....0, so we can also rewrite the divisibility test: take
dh = rank of highest bit set to 1
dl = rank of lowest bit set to 1
If dh is not equal to dl, then the number is not representable in binary.
If dh==dl, you must have nh-dh remaining in certain range, but also nl-dh.
For example in IEEE 754 double precision: -1074 <= nl-dh and nh-dh <= 1023
What would be interesting and more simple if you are using IEEE 754 formats, is to know whether the conversion function atod or atof as provided by libc or the equivalent provided by your language will correctly raise the IEEE 754 inexact flag, and how you can access this flag in your language...

Calculations precision level in R

I am working in R with very small numbers which reflect probabilities in an Maximum Likelihood Estimation algorithm. Some of these numbers are as small as 1e-155 ( or smaller). However, when there is something as simple as summation taking place, the precision level gets truncated to the least precise one and thus ruins the precisions of my calculations and produces meaningless results.
Example:
> sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
[1] 4.445838e-05
As is seen from the example, the base for this calculation is 1e-5 , which in a very rude manner rounds up sensitive calculation.
Is there a way around this? Why is R choosing such a strange automatic behavior? Perhaps it is not really doing this, I just see the result in the truncated form? In this case, is the actual number with correct precision stored in the variable?
There is no precision loss in your sum. But if you're worried about it, you should use a multiple-precision library:
library("Rmpfr")
x <- c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58)
sum(mpfr(x, 1024))
# 1 'mpfr' number of precision 1024 bits
# [1] 4.445837981118120898327314579322617633703674840117902103769961398533293289165193843930280422747754618577451267010103975610356319174778512980120125435961577770470993217990999166176083700886405875414277348471907198346293122011042229843450802884152750493740313686430454254150390625000000000000000000000000000000000e-5
Your results are only truncated in the display.
Try:
x <- sum(c(7.831908e-70,6.002923e-26,6.372573e-36,5.025015e-38,5.603268e-38,1.118121e-14, 4.512098e-07,4.400717e-05,2.300423e-26,1.317602e-58))
print(x, digits=22)
[1] 4.445837981118121081878e-05
You can read more about the behaviour of print at ?print.default
You can also set an option - this will affext all calls to print
options(digits=22)
have you ever heard about Floating point numbers?
there is no loss of precision (significant figures) in multiplication or division as far as the result stay between
1.7976931348623157·10^308 to 4.9·10^−324 (see the link for detail)
so if you do 1.0e-30 * 1.0e-10 result will be 1.0e-40
but if you do 1.0e-30 + 1.0e-10 result will be 1.0e-10
Why?
-> finite set of number rapresentable with computer works. (64 bits max 2^64 different representation of numbers with 64 bits)
instead of using a direct conversion like for integer numbers (they represent from ~ -2^62 to +2^62, every INTEGER number -> about from -10^16 to +10*16)
or there exist a clever way like floating point? from 1.7976931348623157·10^308 to - 4.9·10^−324 and it can represent /approximate rational numbers?
So in floating point, to achieve a wider range, precision in sums is sacrified, There is loss of precision during sums or subtractions as the significant figures that could be represented by (the 52 bits of) the fraction part (of a floating point number of 64 bits) are less than log10(2^52) ~ 16.
if you look for a basic everyday example, summary(lm), when the p-value of parameter is near zero, summary() output <2.2e-16 (what a coincidence).
why limited to 64 bits? CPU have the execution units specifically to 64bits floating point arithmetic (64 bit IEEE 754 standard), if you use higher precision like 128 bits floating point, the performances will be lowered by 10 times or more, as CPU need to split the data and operation in multiple 64 bits data and operations.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format

Can a IEEE 754 real number "cover" all integers within its range?

The original question was edited (shortened) to focus on a problem of precision, not range.
Single, or double precision, every representation of real number is limited to (-range,+range). Within this range lie some integer numbers (1, 2, 3, 4..., and so on; the same goes with negative numbers).
Is there a guarantee that a IEEE 754 real number (float, double, etc) can "cover" all integers within its range? By "cover" I mean the real number will represent the integer number exactly, not as (for example) "5.000001".
Just as reminder: http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html nice explanation of various number representation formats.
Update:
Because the question is for "can" I am also looking for the fact this cannot be done -- for it quoting a number is enough. For example "no it cannot be done, for example number 1748574 is not represented exactly by float number" (this number is taken out of thin air of course).
For curious reader
If you would like to play with IEEE 754 representation -- on-line calculator: http://www.ajdesigner.com/fl_ieee_754_word/ieee_32_bit_word.php
No, not all, but there exists a range within which you can represent all integers accurately.
Structure of 32bit floating point numbers
The 32bit floating point type uses
1 bit for the sign
8 bits for the exponent
23 bits for the fraction (leading 1 implied)
Representing numbers
Basically, you have a number in the form
(-)1.xxxx_xxxx_xxxx_xxxx_xxxx_xxx (binary)
which you then shift left/right with the (unbiased) exponent.
To have it represent an integer requiring n bits, you need to shift it by n-1 bits to the left. (All xes beyond the floating point are simply zero)
Representing integers with 24 bits
It is easy to see, that we can represent all integers requiring 24 bits (and less)
1xxx_xxxx_xxxx_xxxx_xxxx_xxxx.0 (unbiased exponent = 23)
since we can set the xes at will to either 1 or 0.
The highest number we can represent in this fashion is:
1111_1111_1111_1111_1111_1111.0
or 2^24 - 1 = 16777215
The next higher integer is 1_0000_0000_0000_0000_0000_0000. Thus, we need 25 bits.
Representing integers with 25 bits
If you try to represent a 25 bit integer (unbiased exponent = 24), the numbers have the following form:
1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0
The twenty-three digits that are available to you have all been shifted past the floating point. The leading digit is always a 1. In total, we have 24 digits. But since we need 25, a zero is appended.
A maximum is found
We can represent ``1_0000_0000_0000_0000_0000_0000with the form1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0, by simply assigning 1to allxes. The next higher integer from that is: 1_0000_0000_0000_0000_0000_0001. It's easy to see that this number cannot be represented accurately, because the form does not allow us to set the last digit to 1: It is always 0`.
It follows, that the 1 followed by 24 zeroes is an upper bound for the integers we can accurately represent.
The lower bound simply has its sign bit flipped.
Range within which all integers can be represented (including boundaries)
224 as an upper bound
-224 as a lower bound
Structure of 64bit floating point numbers
1 bit for the sign
11 exponent bits
52 fraction bits
Range within which all integers can be represented (including boundaries)
253 as an upper bound
-253 as a lower bound
This easily follows by applying the same argumentation to the structure of 64bit floating point numbers.
Note: That is not to say these are all integers we can represent, but it gives you a range within which you can represent all integers. Beyond that range, we can only represent a power of two multiplied with an integer from said range.
Combinatorial argument
Simply convincing ourselves that it is impossible for 32bit floating point numbers to represent all integers a 32bit integer can represent, we need not even look at the structure of floating point numbers.
With 32 bits, there are 232 different things we can represent. No more, no less.
A 32bit integer uses all of these "things" to represent numbers (pairwise different).
A 32bit floating point number can represent at least one number with a fractional part.
Thus, it is impossible for the 32bit floating point number to be able to represent this fractional number in addition to all 232 integers.
macias, to add to the already excellent answer by phant0m (upvoted; I suggest you accept it), I'll use your own words.
"No it cannot be done, for example number 16777217 is not represented exactly by float number."
Also, "for example number 9223372036854775809 is not represented exactly by double number".
This is assuming your computer is using the IEEE floating point format, which is a pretty strong bet.
No.
For example, on my system, the type float can represent values up to approximately 3.40282e+38. As an integer, that would be approximately 340282000000000000000000000000000000000, or about 2128.
The size of float is 32 bits, so it can exactly represent at most 232 distinct numbers.
An integer object generally uses all of its bits to represent values (with 1 bit dedicated as a sign bit for signed types). A floating-point object uses some of its bits to represent an exponent (8 bits for IEEE 32-bit float); this increases its range at the cost of losing precision.
A concrete example (1267650600228229401496703205376.0 is 2100, and is exactly representable as a float):
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(void) {
float x = 1267650600228229401496703205376.0;
float y = nextafterf(x, FLT_MAX);
printf("x = %.1f\n", x);
printf("y = %.1f\n", y);
return 0;
}
The output on my system is:
x = 1267650600228229401496703205376.0
y = 1267650751343956853325350043648.0
Another way to look at it:
A 32-bit object can represent at most 232 distinct values.
A 32-bit signed integer can represent all integer values in the range -2147483648 .. 2147483647 (-231 .. +231-1).
A 32-bit float can represent many values that a 32-bit signed integer can't, either because they're fractional (0.5) or because they're too big (2.0100). Since there are values that can be represented by a 32-bit float but not by a 32-bit int, there must be other values that can be represented by a 32-bit int but not by a 32-bit float. Those values are integers that have more significant digits than a float can handle, because the int has 31 value bits but the float has only about 24.
Apparently you are asking whether a Real data type can represent all of the integer values in its range (absolute values up to FLT_MAX or DBL_MAX, in C, or similar constants in other languages).
The largest numbers representable by floating point numbers stored in K bits typically are much larger than the 2^K number of integers that K bits can represent, so typically the answer is no. 32-bit C floats exceed 10^37, 32-bit C integers are less than 10^10. To find out the next representable number after some number, use nextafter() or nextafterf(). For example, the code
printf ("%20.4f %20.4f\n", nextafterf(1e5,1e9), nextafterf(1e6,1e9));
printf ("%20.4f %20.4f\n", nextafterf(1e7,1e9), nextafterf(1e8,1e9));
prints out
100000.0078 1000000.0625
10000001.0000 100000008.0000
You might be interested in whether an integer J that is between two nearby fractional floating values R and S can be represented exactly, supposing S-R < 1 and R < J < S. Yes, such a J can be represented exactly. Every float value is the ratio of some integer and some power of 2. (Or is the product of some integer and some power of 2.) Let the power of 2 be P, and suppose R = U/P, S = V/P. Now U/P < J < V/P so U < J*P < V. More of J*P's low-order bits are zero than are those of U, V (because V-U < P, due to S-R < 1), so J can be represented exactly.
I haven't filled in all the details to show that J*P-U < P and V-J*P < P, but under the assumption S-R < 1 that's straightforward. Here is an example of R,J,S,P,U,V value computations: Let R=99999.9921875 = 12799999/128, (ie P=128); let S=100000.0078125 = 12800001/128; we have U=0xc34fff and V=0xc35001 and there is a number between them that has more low-order zeroes than either; to wit, J = 0xc35000/128 = 12800000/128 = 100000.0. For the numbers in this example, note that U and V require 24 bits for their exact representations (6 ea. 4-bit hex digits). Note that 24 bits is the number of bits of precision in IEEE 754 single-precision floating point numbers. (See table in wikipedia article.)
That each floating point number is a product or ratio of some integer and some power of 2 (as mentioned two paragraphs above) also is discussed in that floating point article, in a paragraph that begins:
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, ... a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly.

Resources