What is integer overflow in R and how can it happen?

I have some calculation going on and get the following warning (i.e. not an error):
Warning messages:
1: In sum(myvar, na.rm = T) :
Integer overflow - use sum(as.numeric(.))
In this thread people state that integer overflows simply don't happen. Either R isn't overly modern, or they are not right. However, what am I supposed to do here? If I use as.numeric, as the warning suggests, I might not account for the fact that information was lost well before that point. myvar is read from a .csv file, so shouldn't R figure out that some bigger field is needed? Does it already cut something off?
What's the max length of integer or numeric? Would you suggest any other field type / mode?
EDIT: I run:
R version 2.13.2 (2011-09-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) within R Studio

You can answer many of your questions by reading the help page ?integer. It says:
R uses 32-bit integers for integer vectors, so the range of
representable integers is restricted to about +/-2*10^9.
Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.
If you want "bignum" capacity then install Martin Maechler's Rmpfr package [PDF]. I recommend 'Rmpfr' because of its author's reputation: Martin Maechler is also heavily involved in developing the Matrix package, and in R Core as well. There are alternatives, including the arithmetic packages 'gmp' and 'Brobdingnag', and the 'Ryacas' package (which also offers a symbolic math interface).
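For a feel of what such a package gives you, here is a minimal sketch using 'gmp' (the calls are mine, not part of the original recommendation, and it assumes the package is installed):
library(gmp)
big <- as.bigz(.Machine$integer.max)  # arbitrary-precision integer
big + 1                               # 2147483648, no integer overflow
as.bigz(2)^100                        # exact: 1267650600228229401496703205376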
Next, to respond to the critical comments in the answer you linked to, and to help you assess their relevance to your work, consider this: if the same statistical functionality were available in one of those "modern" languages as there is in R, you would probably see users migrating in that direction. But I would say that the migration, and certainly the growth, is in R's direction at the moment. R was built by statisticians for statistics.
There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R-Core. On the other hand one of the earliest R developers, Ross Ihaka, suggests working toward development in a Lisp-like language [PDF]. There is a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental interface, Rincanter.
Update:
The newer versions of R (3.0.0+) have 53-bit integers of a sort (whole numbers stored exactly in the mantissa of a double). When an "integer" vector element is assigned a value in excess of .Machine$integer.max, the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for integers remains as it was, but integer vectors may now be coerced to doubles to preserve accuracy in cases that would formerly have generated an overflow. Matrix and array dimensions are still limited to integer.max, although atomic vectors and lists can now be longer than that ("long vectors").
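A minimal sketch of that coercion behaviour (illustrative only, not from the original answer):
x <- 1:3                           # integer vector
typeof(x)                          # "integer"
x[1] <- .Machine$integer.max + 1   # value exceeds integer.max (and is a double)
typeof(x)                          # "double" -- the whole vector was coerced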
When reading large values in from files, it is probably safer to read them as character and convert afterwards; if the conversion produces NA values, you will get a warning.
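A minimal sketch of that approach (the file and column names are hypothetical):
dat <- read.csv("big_values.csv", colClasses = "character")
myvar <- as.numeric(dat$myvar)   # or hand the strings to a bignum package for exact values;
                                 # values that fail to parse become NA, with a warning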

In short, integer is an exact type with limited range, and numeric is a floating-point type that can represent a much wider range of values but is inexact. See the help pages (?integer and ?numeric) for further details.
As to the overflow, here is an explanation by Brian D. Ripley:
It means that you are taking the mean [in your case, the sum -- #aix] of some very large integers, and
the calculation is overflowing. It is just a warning.
This will not happen in the next release of R.
You can specify that a number is an integer by giving it the suffix L; for example, 1L is the integer one, as opposed to 1, which is a floating-point one with class "numeric".
The largest integer that you can create on your machine is given by .Machine$integer.max.
> .Machine$integer.max
[1] 2147483647
> class(.Machine$integer.max)
[1] "integer"
Adding a positive integer to this causes an overflow, returning NA.
> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow
> class(.Machine$integer.max + 1L)
[1] "integer"
You can get round this limit by adding floating point values instead.
> .Machine$integer.max + 1
[1] 2147483648
> class(.Machine$integer.max + 1)
[1] "numeric"
Since in your case the warning is issued by sum, this indicates that the overflow happens when the numbers are added together. The suggested workaround sum(as.numeric(.)) should do the trick.
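Applied to the call in your warning, that looks something like this (myvar is the vector from your question):
sum(as.numeric(myvar), na.rm = TRUE)   # accumulate in double precision instead of 32-bit integers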

What's the max length of integer or numeric?
Vectors are currently indexed with an integer, so the max length is given by .Machine$integer.max. As DWin noted, all versions of R currently use 32-bit integers, so this will be 2^31 - 1, or a little over 2 billion.
Unless you are packing some serious hardware (or you are reading this in the future; hello from 2012) you won't have enough memory to allocate vectors that long.
I remember a discussion where R-core (Brian Ripley, I think) suggested that the next step could be to index vectors with the mantissa of doubles, or something clever like that, effectively giving 48 bits of index. Sadly, I can't find that discussion.
In addition to the Rmpfr package, if you are suffering integer overflow, you might want to try the int64 package.

If c <- a - b overflows because a and b are integers, convert to double before subtracting rather than after (wrapping the already-overflowed result in as.double() does not help):
c <- as.numeric(a) - as.numeric(b)

Related

Does the documentation or language definition for R remark on the intended usage of R's integers?

A common comment that I see made about R's integer type is that it's only really intended for communication with C code. Does any statement like this appear in any official part of R's documentation? I often catch myself creating vectors like integer(10) under the impression that they'll be more efficient for my purposes, only to remember this folklore and reconsider whether I should ever be using integers for code that never tries to communicate with C code.
I don't think so. This folklore probably comes from the fact that R is pretty loose about typing and coercion, so it's easy to end up with a floating-point variable by accident.
Integer types certainly save memory:
> object.size(seq(1e8))
400000048 bytes
> object.size(seq(1e8)+0.1)
800000048 bytes
I haven't tried benchmarking to see if R uses faster routines for integer vs floating-point arithmetic, but you could.
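If you do want to check, a rough benchmark sketch (not from the original answer; timings will vary by machine) could look like this:
n  <- 1e7
xi <- rep(1L, n)   # integer vector
xd <- rep(1,  n)   # double vector
system.time(for (i in 1:100) sum(xi))   # integer arithmetic
system.time(for (i in 1:100) sum(xd))   # double arithmetic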
I haven't looked carefully through all of R's documentation, but the only slightly relevant comment that turns up in a full-text search for "integer" in the R language definition is:
In most cases, the difference between an integer and a numeric value will be unimportant as R will do the right thing when using the numbers. There are, however, times when we would like to explicitly create an integer value for a constant. We can do this by calling the function as.integer or using various other techniques ...
I did a grep integer *.texi in the doc/manual directory of the R source tree and didn't (in a quick skim) notice anything else that looked relevant.
Following Ben Bolker's advice, I checked the seven R manuals. In addition to Ben's answer, I found the following:
For most purposes the user will not be concerned if the “numbers” in a numeric vector are integers, reals or even complex. Internally calculations are done as double precision real numbers, or double precision complex numbers if the input data are complex.
An Introduction to R Section 2.2
Writing R Extensions gives lots of guidance on making R communicate with C and Fortran, but it doesn't say anything about the intent behind the integer type.
The last place to check is the full Reference Manual. You would have to be mad to read it all - the word "integer" occurs over 1000 times. However, a quick look at the index reveals the documentation for the integer class. This gives us the answer in such plain English that I should not be forgiven for having missed it:
Integer vectors exist so that data can be passed to C or Fortran code which expects them, and so that (small) integer data can be represented exactly and compactly.

Why does the addition of precise numbers give an imprecise result in R?

I would like to add precise numbers together, and get a precise result. I know that I need to set my desired digit length in the options. For example:
> options(digits=20)
> x <- 42.616999999999997 + 42.405999999999999 + 42.869
I expect to get an answer with the level of precision set. In the above case:
> x
[1] 127.89199999999999591
This result is duly produced on my personal computer, but I cannot replicate it on my work computer, which - with the exact same code and the exact same environment (no difference in packages) - produces the following result:
> x
[1] 127.892
I am unclear about what else could differ between my personal and work computer. It might be useful to know that my personal computer is running R version 3.6.2, while my work computer is running 3.4.2 (I cannot update it because of installation restrictions).
Thank you in advance for your help!
In help("options") we read:
digits: controls the number of significant (see signif) digits to
print when printing numeric values. It is a suggestion only. Valid
values are 1...22 with default 7. See the note in print.default about
values greater than 15.
Your value is greater than 15, so let's check the note in help("print.default"):
Large number of digits
Note that for large values of digits, currently
for digits >= 16, the calculation of the number of significant digits
will depend on the platform's internal (C library) implementation of
sprintf() functionality.
This explains why you observe different results on different computers. Indeed, on my system:
options(digits=20)
42.616999999999997 + 42.405999999999999 + 42.869
#[1] 127.892
However, you can enforce a specific number of digits after the decimal point by using the R function sprintf:
sprintf("%.17f", 42.616999999999997 + 42.405999999999999 + 42.869)
#[1] "127.89199999999999591"
Note that all of this is only about how numbers are printed. Internally, R always uses double precision (please read this post to understand the consequences).
If you need more precise results you'd need to use arbitrary precision numbers, e.g., by using the R package Rmpfr.
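A minimal sketch with Rmpfr (assuming the package is installed; 120 bits of precision is an arbitrary choice, and the numbers are passed as strings so they are never rounded to doubles first):
library(Rmpfr)
x <- mpfr("42.616999999999997", precBits = 120) +
     mpfr("42.405999999999999", precBits = 120) +
     mpfr("42.869",             precBits = 120)
x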

Managing floating point accuracy

I'm struggling with issues re. floating point accuracy, and could not find a solution.
Here is a short example:
aa<-c(99.93029, 0.0697122)
aa
[1] 99.9302900 0.0697122
aa[1]
99.93029
print(aa[1],digits=20)
99.930289999999999
It would appear that, upon storing the vector, R converted the numbers to something with a slightly different internal representation (yes, I have read circle 1 of the "R inferno" and similar material).
How can I force R to store the input values exactly "as is", with no modification?
In my case, my problem is that the values are processed in such a way that the small errors very quickly grow:
aa[2]/(100-aa[1])*100
[1] 100.0032 ## Should be 100, of course !
print(aa[2]/(100-aa[1])*100,digits=20)
[1] 100.00315593171625
So I need to find a way to get my normalization right.
Thanks
PS- There are many questions on this site and elsewhere, discussing the issue of apparent loss of precision, i.e. numbers displayed incorrectly (but stored right). Here, for instance:
How to stop read.table from rounding numbers with different degrees of precision in R?
This is a distinct issue, as the number is stored incorrectly (but displayed right).
(R version 3.2.1 (2015-06-18), win 7 x64)
Floating point precision has always generated lots of confusion. The crucial idea to remember is: when you work with doubles, there is no way to store each real number "as is", or "exactly right" -- the best you can store is the closest available approximation. So when you type (in R or any other modern language) something like x = 99.93029, you'll get this number represented by 99.930289999999999.
Now when you expect a + b to be "exactly 100", you are using imprecise terms. The best you can get is "100 up to N digits after the decimal point", and you have to hope that N is big enough. In your case it would be correct to say 99.9302900 + 0.0697122 is 100 to 5 decimal places of accuracy. Naturally, by multiplying that equality by 10^k you lose another k digits of accuracy.
So, there are two solutions here:
a. To get more precision in the output, provide more precision in the input.
bb <- c(99.93029, 0.06971)
print(bb[2]/(100-bb[1])*100, digits = 20)
[1] 99.999999999999119
b. If double precision is not enough (which can happen in complex algorithms), use packages that provide extra-precision arithmetic, for instance the package gmp.
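As an aside, if all you need is to check that a result is "close enough" to a target, a tolerance-based comparison in base R is often sufficient (a sketch, not part of the original answer):
aa  <- c(99.93029, 0.0697122)
res <- aa[2] / (100 - aa[1]) * 100
res == 100                                      # FALSE: exact comparison of doubles
isTRUE(all.equal(res, 100, tolerance = 1e-4))   # TRUE: equal up to the given tolerance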
I think you have misunderstood here. It's the same situation: R is storing the value as accurately as a double allows, but the value is displayed according to the digits option chosen when it is printed.
For example:
# the output below will be:
> print(99.930289999999999,digits=20)
[1] 99.930289999999999395
But
# the output of:
> print(1,digits=20)
[1] 1
Also
> print(1.1,digits=20)
[1] 1.1000000000000000888
In addition to the previous answers, I think a good read on the subject is
The R Inferno, by Patrick Burns
http://www.burns-stat.com/documents/books/the-r-inferno/

R's Format function returning an odd result

Why does
format(4444444444444444444,scientific=FALSE)
return "4444444444444444672"?
I thought it might be an integer precision thing, but this number is relatively small. Thanks!
I'm running R version 3.0.0 on Ubuntu Linux.
If you give R an integer greater than:
> .Machine$integer.max
[1] 2147483647
It converts it to a double. The R FAQ covers this in question 7.31. Questions related to floating-point accuracy come up frequently on SO; see, for example: Controlling number of decimal digits in print output in R
(The size of vector indices was increased in the latest version of R (3.0.0), and it seems possible that the maximum size of an integer may be widened in the future. I don't quite understand how we could keep the limit on the size of integers and still access vector indices that are larger.)
That's not a 32-bit integer.
R> as.integer(4444444444444444444)
[1] NA
Warning message:
NAs introduced by coercion
It's a double-precision floating point number, which only has 15-16 significant digits of precision. The coercion above fails because the number is greater than .Machine$integer.max; the ...672 is rounding error. If you need to use large numbers exactly, consider packages like gmp.
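For example, a sketch with 'gmp' (assuming the package is installed); note that the value has to be passed as a string so it is never rounded to a double in the first place:
format(4444444444444444444, scientific = FALSE)   # "4444444444444444672" -- already rounded at parse time
library(gmp)
as.bigz("4444444444444444444")                    # exact arbitrary-precision integer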

Why does the number 1e9999... (31 9s) cause problems in R?

When entering 1e9999999999999999999999999999999 into R, R hangs and will not respond - requiring it to be terminated.
It seems to happen on three different computers and OSes (Windows 7 and Ubuntu). It happens in RStudio, RGui and Rscript.
Here's some code to generate the number more easily:
boom <- paste(c("1e", rep(9, 31)), collapse="")
eval(parse(text=boom))
Now clearly this isn't a practical problem. I have no need to use numbers of this magnitude. It's just a question of curiosity.
Curiously, if you try 1e9999999999999999999999999999998 or 1e10000000000000000000000000000000 (add or subtract one from the power), you get Inf and 0 respectively. This number is clearly some kind of boundary, but between what and why here?
I considered that it might be:
A floating point problem, but I think they max out at 1.7977e308, long before the number in question.
An issue with 32-bit integers, but 2^32 is 4294967296, much smaller than the number in question.
Really weird. This is my dominant theory.
EDIT: As of 2015-09-15 at the latest, this no longer causes R to hang. They must have patched it.
This looks like an extreme case in the parser. The XeY format is described in Section 10.3.1: Literal Constants of the R Language Definition and points to ?NumericConstants for "up-to-date information on the currently accepted formats".
The problem seems to be how the parser handles the exponent. The numeric constant is handled by NumericValue (line 4361 of main/gram.c), which calls mkFloat (line 4124 of main/gram.c), which calls R_atof (line 1584 of main/util.c), which calls R_strtod4 (line 1461 of main/util.c). (All as of revision 60052.)
Line 1464 of main/util.c shows expn declared as an int, and it will overflow at line 1551 if the exponent is too large. Signed integer overflow causes undefined behavior.
For example, the code below produces values for exponents < 308 or so and Inf for exponents > 308.
const <- paste0("1e",2^(1:31)-2)
for(n in const) print(eval(parse(text=n)))
You can see the undefined behavior for exponents > 2^31 (R hangs for an exponent = 2^31):
const <- paste0("1e",2^(31:61)+1)
for(n in const) print(eval(parse(text=n)))
I doubt this will get any attention from R-core, because R can only store numeric values between about 2e-308 and 2e+308 (see ?double) and this number is way beyond that.
This is interesting, but I think R has systemic problems with parsing numbers that have very large exponents:
> 1e10000000000000000000000000000000
[1] 0
> 1e1000000000000000000000000000000
[1] Inf
> 1e100000000000000000000
[1] Inf
> 1e10000000000000000000
[1] 0
> 1e1000
[1] Inf
> 1e100
[1] 1e+100
There we go, finally something reasonable. According to this output and Joshua Ulrich's comment below, R appears to be able to represent numbers up to about 2e308, and to parse exponents up to about +2*10^9 even though it cannot represent the resulting numbers. Beyond that, there is undefined behavior, apparently due to overflow.
R might sometimes use bignums. Perhaps 1e9999999999999999999999999999999 is some threshold, or perhaps the parsing routines have a limited buffer for reading the exponent; your observation would be consistent with a 32-character (null-terminated) buffer for the exponent.
I would rather ask that question on forums or mailing lists specific to R, which are rumored to be friendly.
Alternatively, since R is free software, you could investigate its source code.
