Managing floating point accuracy in R

I'm struggling with issues re. floating point accuracy, and could not find a solution.
Here is a short example:
> aa <- c(99.93029, 0.0697122)
> aa
[1] 99.9302900 0.0697122
> aa[1]
[1] 99.93029
> print(aa[1], digits=20)
[1] 99.930289999999999
It would appear that, upon storing the vector, R converted the numbers to something with a slightly different internal representation (yes, I have read circle 1 of the "R inferno" and similar material).
How can I force R to store the input values exactly "as is", with no modification?
In my case, the problem is that the values are processed in such a way that the small errors grow very quickly:
> aa[2]/(100-aa[1])*100
[1] 100.0032 ## Should be 100, of course!
> print(aa[2]/(100-aa[1])*100, digits=20)
[1] 100.00315593171625
So I need to find a way to get my normalization right.
Thanks
PS- There are many questions on this site and elsewhere, discussing the issue of apparent loss of precision, i.e. numbers displayed incorrectly (but stored right). Here, for instance:
How to stop read.table from rounding numbers with different degrees of precision in R?
This is a distinct issue, as the number is stored incorrectly (but displayed right).
(R version 3.2.1 (2015-06-18), win 7 x64)

Floating point precision has always generated lots of confusion. The crucial idea to remember is: when you work with doubles, there is no way to store each real number "as is", or "exactly right" -- the best you can store is the closest available approximation. So when you type (in R or any other modern language) something like x <- 99.93029, you get the nearest representable double, which is roughly 99.930289999999999.
Now when you expect a + b to be "exactly 100", the expectation itself is imprecise. The best you can get is "100 up to N digits after the decimal point" and hope that N is big enough. In your case it would be correct to say 99.9302900 + 0.0697122 is 100 to five decimal places of accuracy. Naturally, multiplying that equality by 10^k costs you a further k digits of accuracy.
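When all you need is to check such an identity, compare with a tolerance rather than testing for exact equality. A small sketch (a suggestion, not part of the original answer), using the questioner's vector:
aa <- c(99.93029, 0.0697122)
sum(aa) == 100                                     # FALSE: the exact comparison fails
isTRUE(all.equal(sum(aa), 100, tolerance = 1e-5))  # TRUE: equal to about 5 decimal places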
So, there are two solutions here:
a. To get more precision in the output, provide more precision in the input.
bb <- c(99.93029, 0.06971)
print(bb[2]/(100-bb[1])*100, digits = 20)
[1] 99.999999999999119
b. If double precision is not enough (which can happen in complex algorithms), use packages that provide extra-precision arithmetic, for instance the package gmp.
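As a sketch of option (b), assuming the gmp package is installed: enter the decimal inputs as exact rationals so no rounding occurs at any step.
library(gmp)
a <- as.bigq(9993029, 100000)   # 99.93029 as an exact fraction
b <- as.bigq(6971, 100000)      # 0.06971 as an exact fraction
b / (as.bigq(100) - a) * as.bigq(100)
# prints exactly 100 as a big rational ('bigq')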

I think you have misunderstood here. It's the same case: R is storing the value as accurately as a double allows, and the value is displayed according to the digits option chosen when printing it.
For example:
# the output of below will be:
> print(99.930289999999999,digits=20)
[1] 99.930289999999999395
But
# the output of:
> print(1,digits=20)
[1] 1
Also
> print(1.1,digits=20)
[1] 1.1000000000000000888

In addition to the previous answers, I think good reading on the subject would be
The R Inferno, by P. Burns
http://www.burns-stat.com/documents/books/the-r-inferno/

Related

Why does the addition of precise numbers give an imprecise result in R?

I would like to add precise numbers together, and get a precise result. I know that I need to set my desired digit length in the options. For example:
> options(digits=20)
> x <- 42.616999999999997 + 42.405999999999999 + 42.869
I expect to get an answer with the level of precision set. In the above case:
> x
[1] 127.89199999999999591
This result is duly produced on my personal computer, but I cannot replicate it on my work computer, which - with the exact same code and the exact same environment (no difference in packages) - produces the following result:
> x
[1] 127.892
I am unclear about what else could differ between my personal and work computer. It might be useful to know that my personal computer is running R version 3.6.2, while my work computer is running 3.4.2 (I cannot update it because of installation restrictions).
Thank you in advance for your help!
In help("options") we read:
digits: controls the number of significant (see signif) digits to
print when printing numeric values. It is a suggestion only. Valid
values are 1...22 with default 7. See the note in print.default about
values greater than 15.
Your value is greater than 15, so let's check the note in help("print.default"):
Large number of digits
Note that for large values of digits, currently
for digits >= 16, the calculation of the number of significant digits
will depend on the platform's internal (C library) implementation of
sprintf() functionality.
This explains why you observe different results on different computers. Indeed, on my system:
options(digits=20)
42.616999999999997 + 42.405999999999999 + 42.869
#[1] 127.892
However, you can enforce a specific number of digits after the decimal point by using the R function sprintf:
sprintf("%.17f", 42.616999999999997 + 42.405999999999999 + 42.869)
#[1] "127.89199999999999591"
Note that all of this is only about how numbers are printed. Internally, R always uses double precision (please read this post to understand the consequences).
If you need more precise results you'd need to use arbitrary precision numbers, e.g., by using the R package Rmpfr.
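A minimal sketch of that route (assuming Rmpfr is installed): repeat the sum at 100 bits of precision, roughly 30 significant decimal digits, instead of the 53-bit double mantissa.
library(Rmpfr)
x1 <- mpfr("42.616999999999997", precBits = 100)
x2 <- mpfr("42.405999999999999", precBits = 100)
x3 <- mpfr("42.869", precBits = 100)
x1 + x2 + x3   # an 'mpfr' number carrying far more significant digits than a double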

Factorial(x) for x>170 using Rmpfr/gmp library

The problem that I would like to solve is the infinite sum over the following function:
For the sum I use an FTOL termination criterion. This whole term doesn't create any problems until z becomes very large. I expect the maximum value of z to be around 220. As you can see, the first term peaks around Factorial(221) and therefore has to go up to around Factorial(500) before the termination criterion is reached. After spotting this problem I didn't want to change the whole code (as this is only one small part of it), so I tried library('Rmpfr') and library('gmp'). The problem is that I do not get what I want: while multiplication normally works, subtraction fails for higher values:
This works
> factorialZ(22)-factorial(22)
Big Integer ('bigz') :
[1] 0
but this fails:
> factorialZ(50)-factorial(50)
Big Integer ('bigz') :
[1] 359073645150499628823711419759505502520867983196160
another way I tried:
> gamma(as(10,"mpfr"))-factorial(9)
1 'mpfr' number of precision 128 bits
[1] 0
> gamma(as(40,"mpfr"))-factorial(39)
1 'mpfr' number of precision 128 bits
[1] 1770811808798664813196481658880
There has to be something that I don't really understand. Does someone have a even better solution for the problem or can someone help me out with the issue above?
I think you have misunderstood the order of evaluation in factorialZ(x) - factorial(x). The second term, factorial(x), is calculated in double precision before it is converted to a bigz to be combined with the first term.
You must create any integer outside the 2^64 (or whatever, depending on your machine) range using a bigz-compatible function.
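For example (assuming the gmp package), keep both terms in exact bigz arithmetic instead of mixing in a double-precision factorial():
library(gmp)
factorialZ(50) - factorialZ(50)
# Big Integer ('bigz') :
# [1] 0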
50! is between 2^214 and 2^215 so the closest representable numbers are 2^(214-52) apart. factorial in R is based on a Lanczos approximation whereas factorialZ is calculating it exactly. The answers are within the machine precision:
> all.equal(as.numeric(factorialZ(50)), factorial(50))
[1] TRUE
The part that you're not understanding is floating point and its limitations. You're only getting ~15 digits of precision in floating point. factorialZ(50) has a LOT more precision than that, so you shouldn't expect them to be the same.
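A quick way to see the mismatch (not part of the original answer):
library(gmp)
.Machine$double.eps                   # about 2.2e-16, i.e. roughly 15-16 significant digits
nchar(as.character(factorialZ(50)))   # 65 decimal digits, far more than a double can hold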

Why does the number 1e9999... (31 9s) cause problems in R?

When entering 1e9999999999999999999999999999999 into R, R hangs and will not respond - requiring it to be terminated.
It seems to happen across 3 different computers and OSes (Windows 7 and Ubuntu). It happens in RStudio, RGui and Rscript.
Here's some code to generate the number more easily:
boom <- paste(c("1e", rep(9, 31)), collapse="")
eval(parse(text=boom))
Now clearly this isn't a practical problem. I have no need to use numbers of this magnitude. It's just a question of curiosity.
Curiously, if you try 1e9999999999999999999999999999998 or 1e10000000000000000000000000000000 (add or subtract one from the power), you get Inf and 0 respectively. This number is clearly some kind of boundary, but between what and why here?
I considered that it might be:
A floating point problem, but I think they max out at 1.7977e308, long before the number in question.
An issue with 32-bit integers, but 2^32 is 4294967296, much smaller than the number in question.
Really weird. This is my dominant theory.
EDIT: As of 2015-09-15 at the latest, this no longer causes R to hang. They must have patched it.
This looks like an extreme case in the parser. The XeY format is described in Section 10.3.1: Literal Constants of the R Language Definition and points to ?NumericConstants for "up-to-date information on the currently accepted formats".
The problem seems to be how the parser handles the exponent. The numeric constant is handled by NumericValue (line 4361 of main/gram.c), which calls mkFloat (line 4124 of main/gram.c), which calls R_atof (line 1584 of main/util.c), which calls R_strtod4 (line 1461 of main/util.c). (All as of revision 60052.)
Line 1464 of main/util.c shows expn declared as int, and it will overflow at line 1551 if the exponent is too large. The signed integer overflow causes undefined behavior.
For example, the code below produces values for exponents < 308 or so and Inf for exponents > 308.
const <- paste0("1e",2^(1:31)-2)
for(n in const) print(eval(parse(text=n)))
You can see the undefined behavior for exponents > 2^31 (R hangs for an exponent = 2^31):
const <- paste0("1e",2^(31:61)+1)
for(n in const) print(eval(parse(text=n)))
I doubt this will get any attention from R-core because R can only store numeric values between about 2e-308 to 2e+308 (see ?double) and this number is way beyond that.
This is interesting, but I think R has systemic problems with parsing numbers that have very large exponents:
> 1e10000000000000000000000000000000
[1] 0
> 1e1000000000000000000000000000000
[1] Inf
> 1e100000000000000000000
[1] Inf
> 1e10000000000000000000
[1] 0
> 1e1000
[1] Inf
> 1e100
[1] 1e+100
There we go, finally something reasonable. According to this output and Joshua Ulrich's comment below, R can represent numbers only up to about 2e308, but its parser accepts exponents up to about +2*10^9, returning Inf or 0 for values it cannot represent. Beyond that, there is undefined behavior, apparently due to overflow of the exponent.
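Two limits worth keeping in mind here (a quick check, not part of the original answer):
.Machine$double.xmax   # largest finite double, about 1.797693e+308
.Machine$integer.max   # 2147483647, roughly the +2*10^9 point where a 32-bit exponent counter overflows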
R might sometimes use bignums. Perhaps 1e9999999999999999999999999999999 is some threshold, or perhaps the parsing routines have a limited buffer for reading the exponent. Your observation would be consistent with a 32 char (null-terminated) buffer for the exponent.
I'd rather ask that question on forums or mailing lists specific to R, which are rumored to be friendly.
Alternatively, since R is free software, you could investigate its source code.

What is integer overflow in R and how can it happen?

I have some calculation going on and get the following warning (i.e. not an error):
Warning messages:
1: In sum(myvar, na.rm = T) :
Integer overflow - use sum(as.numeric(.))
In this thread people state that integer overflows simply don't happen. Either R isn't overly modern or they are not right. However, what am I supposed to do here? If I use as.numeric as the warning suggests, I might not account for the fact that information was lost well before that point. myvar is read from a .csv file, so shouldn't R figure out that some bigger field is needed? Does it already cut something off?
What's the max length of integer or numeric? Would you suggest any other field type / mode?
EDIT: I run:
R version 2.13.2 (2011-09-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) within R Studio
You can answer many of your questions by reading the help page ?integer. It says:
R uses 32-bit integers for integer vectors, so the range of
representable integers is restricted to about +/-2*10^9.
Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.
If you want a "bignum" capacity then install Martin Maechler's Rmpfr package [PDF]. I recommend the 'Rmpfr' package because of its author's reputation. Martin Maechler is also heavily involved with the Matrix package development, and in R Core as well. There are alternatives, including arithmetic packages such as 'gmp', 'Brobdingnag' and 'Ryacas' (the latter also offers a symbolic math interface).
Next, to respond to the critical comments in the answer you linked to, and how to assess the relevance to your work, consider this: If there were the same statistical functionality available in one of those "modern" languages as there is in R, you would probably see a user migration in that direction. But I would say that migration, and certainly growth, is in the R direction at the moment. R was built by statisticians for statistics.
There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R-Core. On the other hand one of the earliest R developers, Ross Ihaka, suggests working toward development in a Lisp-like language [PDF]. There is a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental interface, Rincanter.
Update:
The new versions of R (3.0.+) have 53-bit integers of a sort (using the numeric mantissa). When an "integer" vector element is assigned a value in excess of .Machine$integer.max, the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for integers remains as it was; however, there may be coercion of integer vectors to doubles to preserve accuracy in cases that would formerly have generated overflow. Unfortunately, the length of lists, matrix and array dimensions, and vectors is still set at integer.max.
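A small illustration of that coercion (a sketch, not from the original answer):
x <- 1:3                           # an integer vector
x[1] <- .Machine$integer.max + 1   # a value beyond the integer range (computed as a double)
class(x)                           # "numeric": the whole vector is now stored as doubles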
When reading in large values from files, it is probably safer to use character-class as the target and then manipulate. If there is coercion to NA values, there will be a warning.
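For example (a hedged sketch; the file and column names are made up), the problem column can be read as character and converted explicitly so nothing is silently truncated:
dat <- read.csv("myfile.csv", colClasses = c(myvar = "character"))
dat$myvar <- as.numeric(dat$myvar)   # or gmp::as.bigz(dat$myvar) for exact big integers
any(is.na(dat$myvar))                # check whether any values failed to convert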
In short, integer is an exact type with a limited range, and numeric is a floating-point type that can represent a much wider range of values but is inexact. See the help pages (?integer and ?numeric) for further details.
As to the overflow, here is an explanation by Brian D. Ripley:
It means that you are taking the mean [in your case, the sum -- #aix] of some very large integers, and
the calculation is overflowing. It is just a warning.
This will not happen in the next release of R.
You can specify that a number is an integer by giving it the suffix L, for example, 1L is the integer one, as opposed to 1 which is a floating point one, with class "numeric".
The largest integer that you can create on your machine is given by .Machine$integer.max.
> .Machine$integer.max
[1] 2147483647
> class(.Machine$integer.max)
[1] "integer"
Adding a positive integer to this causes an overflow, returning NA.
> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow
> class(.Machine$integer.max + 1L)
[1] "integer"
You can get round this limit by adding floating point values instead.
> .Machine$integer.max + 1
[1] 2147483648
> class(.Machine$integer.max + 1)
[1] "numeric"
Since in your case the warning is issued by sum, this indicates that the overflow happens when the numbers are added together. The suggested workaround sum(as.numeric(.)) should do the trick.
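A small reproduction (myvar here is a made-up example vector):
myvar <- rep(.Machine$integer.max, 10L)
sum(myvar)              # NA, with an integer overflow warning
sum(as.numeric(myvar))  # 21474836470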
What's the max length of integer or numeric?
Vectors are currently indexed with an integer, so the max length is given by .Machine$integer.max. As DWin noted, all versions of R currently use 32-bit integers, so this will be 2^31 - 1, or a little over 2 billion.
Unless you are packing some serious hardware (or you are reading this in the future; hello from 2012) you won't have enough memory to allocate vectors that long.
I remember a discussion where R-core (Brian Ripley, I think) suggested that the next step could be to index vectors with the mantissa of doubles, or something clever like that, effectively giving 48-bits of index. Sadly, I can't find that discussion.
In addition to the Rmpfr package, if you are suffering integer overflow, you might want to try the int64 package.
If c <- a - b overflows because a and b are integers, convert them to doubles before subtracting so the arithmetic is done in floating point:
c <- as.numeric(a) - as.numeric(b)

Adding floating point precision to qnorm/pnorm?

I would be interested in increasing the floating point precision used when calculating qnorm/pnorm beyond its current level, for example:
x <- pnorm(10) # 1
qnorm(x) # Inf
qnorm(.9999999999999999444) # The highest value I've found that still returns a finite (non-Inf) number
Is that (under a reasonable amount of time) possible to do? If so, how?
If the argument is way in the upper tail, you should be able to get better precision by calculating 1-p. Like this:
> x = pnorm(10, lower.tail=F)
> qnorm(x, lower.tail=F)
[1] 10
I would expect (though I don't know for sure) that the pnorm() function is referring to a C or Fortran routine that is stuck on whatever floating point size the hardware supports. Probably better to rearrange your problem so the precision isn't needed.
Then, if you're dealing with really really big z-values, you can use log.p=T:
> qnorm(pnorm(100, low=F, log=T), low=F, log=T)
[1] 100
Sorry this isn't exactly what you're looking for. But I think it will be more scalable -- pnorm approaches 1 so rapidly at high z-values (the tail falls off like e^(-x^2/2), after all) that even if you add more bits they will run out fast.
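A quick check of why the upper-tail form helps (not from the original answer): the upper-tail probability stays representable long after 1 - p rounds to exactly 1.
pnorm(10)                       # 1: indistinguishable from 1 in double precision
pnorm(10, lower.tail = FALSE)   # about 7.6e-24, still perfectly representable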
