Why does the number 1e9999... (31 9s) cause problems in R?

When entering 1e9999999999999999999999999999999 into R, R hangs and will not respond, requiring it to be terminated.
This happens on three different computers and across operating systems (Windows 7 and Ubuntu), and in RStudio, RGui and Rscript.
Here's some code to generate the number more easily:
boom <- paste(c("1e", rep(9, 31)), collapse="")
eval(parse(text=boom))
Now clearly this isn't a practical problem. I have no need to use numbers of this magnitude. It's just a question of curiosity.
Curiously, if you try 1e9999999999999999999999999999998 or 1e10000000000000000000000000000000 (add or subtract one from the power), you get Inf and 0 respectively. This number is clearly some kind of boundary, but between what and why here?
I considered that it might be:
A floating point problem, but I think they max out at 1.7977e308, long before the number in question.
An issue with 32-bit integers, but 2^32 is 4294967296, much smaller than the number in question.
Really weird. This is my dominant theory.
EDIT: As of 2015-09-15 at the latest, this no longer causes R to hang. They must have patched it.

This looks like an extreme case in the parser. The XeY format is described in Section 10.3.1: Literal Constants of the R Language Definition and points to ?NumericConstants for "up-to-date information on the currently accepted formats".
The problem seems to be how the parser handles the exponent. The numeric constant is handled by NumericValue (line 4361 of main/gram.c), which calls mkFloat (line 4124 of main/gram.c), which calls R_atof (line 1584 of main/util.c), which calls R_strtod4 (line 1461 of main/util.c). (All as of revision 60052.)
Line 1464 of main/util.c shows expn declared as an int, and it will overflow at line 1551 if the exponent is too large. That signed integer overflow is undefined behavior.
For example, the code below produces finite values for exponents up to about 308 and Inf for larger exponents.
const <- paste0("1e",2^(1:31)-2)
for(n in const) print(eval(parse(text=n)))
You can see the undefined behavior for exponents > 2^31 (R hangs for an exponent = 2^31):
const <- paste0("1e",2^(31:61)+1)
for(n in const) print(eval(parse(text=n)))
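To make the wraparound concrete, here is a rough R simulation of accumulating the exponent digits into a 32-bit signed integer with two's-complement wrapping. This is only an illustration; wrap32 and parse_exponent are made-up names, not R's actual C code in main/util.c:

# Reinterpret a value modulo 2^32 as a signed 32-bit integer
wrap32 <- function(x) {
  x <- x %% 2^32
  if (x >= 2^31) x - 2^32 else x
}
# Mimic the digit-by-digit accumulation expn = expn * 10 + digit
parse_exponent <- function(digits) {
  expn <- 0
  for (d in digits) expn <- wrap32(expn * 10 + d)
  expn
}
parse_exponent(rep(9, 31))        #  2147483647 (INT_MAX): the exponent for the input that hangs
parse_exponent(c(rep(9, 30), 8))  #  2147483646: huge positive exponent, hence Inf
parse_exponent(c(1, rep(0, 31)))  # -2147483648: huge negative exponent, hence 0

Under this simulation, 31 nines wraps to exactly INT_MAX, while the neighbouring inputs from the question wrap to a large positive and a large negative exponent, matching the observed Inf and 0.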
I doubt this will get any attention from R-core, because R can only store numeric values between about 2e-308 and 2e+308 (see ?double) and this number is far beyond that range.

This is interesting, but I think R has systemic problems with parsing numbers that have very large exponents:
> 1e10000000000000000000000000000000
[1] 0
> 1e1000000000000000000000000000000
[1] Inf
> 1e100000000000000000000
[1] Inf
> 1e10000000000000000000
[1] 0
> 1e1000
[1] Inf
> 1e100
[1] 1e+100
There we go, finally something reasonable. According to this output and Joshua Ulrich's comment below, R can represent numbers up to about 2e+308 and can parse exponents up to about +2*10^9 (even though it cannot represent the resulting values); beyond that, the behavior is undefined, apparently due to integer overflow.

R might sometimes use bignums. Perhaps 1e9999999999999999999999999999999 is some threshold, or perhaps the parsing routines have a limited buffer for reading the exponent. Your observation would be consistent with a 32-character (null-terminated) buffer for the exponent.
I would rather ask that question on forums or mailing lists specific to R, which are rumored to be friendly.
Alternatively, since R is free software, you could investigate its source code.

Related

Managing floating point accuracy

I'm struggling with issues re. floating point accuracy, and could not find a solution.
Here is a short example:
> aa <- c(99.93029, 0.0697122)
> aa
[1] 99.9302900 0.0697122
> aa[1]
[1] 99.93029
> print(aa[1], digits=20)
[1] 99.930289999999999
It would appear that, upon storing the vector, R converted the numbers to something with a slightly different internal representation (yes, I have read circle 1 of the "R inferno" and similar material).
How can I force R to store the input values exactly "as is", with no modification?
In my case, my problem is that the values are processed in such a way that the small errors very quickly grow:
> aa[2]/(100-aa[1])*100
[1] 100.0032   ## should be 100, of course!
> print(aa[2]/(100-aa[1])*100, digits=20)
[1] 100.00315593171625
So I need to find a way to get my normalization right.
Thanks
PS- There are many questions on this site and elsewhere, discussing the issue of apparent loss of precision, i.e. numbers displayed incorrectly (but stored right). Here, for instance:
How to stop read.table from rounding numbers with different degrees of precision in R?
This is a distinct issue, as the number is stored incorrectly (but displayed right).
(R version 3.2.1 (2015-06-18), win 7 x64)
Floating point precision has always generated lots of confusion. The crucial idea to remember is: when you work with doubles, there is no way to store each real number "as is", or "exactly right" -- the best you can store is the closest available approximation. So when you type (in R or any other modern language) something like x = 99.93029, you'll get this number represented by 99.930289999999999.
Now when you expect a + b to be "exactly 100", you're being imprecise in your terms. The best you can get is "100 up to N digits after the decimal point", and you can hope that N is big enough. In your case it would be correct to say that 99.9302900 + 0.0697122 equals 100 to 5 decimal places. Naturally, multiplying that equality by 10^k loses a further k digits of accuracy.
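A minimal illustration of the same point with the classic 0.1 + 0.2 example (not taken from the question); all.equal is the usual way to compare doubles up to a tolerance:

0.1 + 0.2 == 0.3                   # FALSE: exact comparison of two doubles
0.1 + 0.2 - 0.3                    # about 5.55e-17, the rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: equal up to a small tolerance (~1.5e-8)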
So, there are two solutions here:
a. To get more precision in the output, provide more precision in the input.
bb <- c(99.93029, 0.06971)
print(bb[2]/(100-bb[1])*100, digits = 20)
[1] 99.999999999999119
b. If double precision is not enough (which can happen in complex algorithms), use packages that provide extra-precision arithmetic, for instance the gmp package.
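As a sketch of option (b), the gmp package can carry the question's inputs as exact rational numbers (the fractions below are just 99.93029 and 0.0697122 written out exactly; gmp is assumed to be installed):

library(gmp)
a1 <- as.bigq(9993029, 100000)    # 99.93029 as an exact fraction
a2 <- as.bigq(697122, 10000000)   # 0.0697122 as an exact fraction
a2 / (100 - a1) * 100             # exactly 697122/6971, about 100.00316

Note that the exact answer is still not 100: the residual comes from the inputs themselves (option a), not from floating point.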
I think you have misunderstood here. This is the same case where R is storing the value correctly (to double precision) and the displayed value simply depends on the number of digits chosen when printing it.
For example, the output below will be:
> print(99.930289999999999,digits=20)
[1] 99.930289999999999395
But the output of:
> print(1,digits=20)
[1] 1
Also
> print(1.1,digits=20)
[1] 1.1000000000000000888
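Other ways to inspect the stored double with more digits (sprintf and format are base R; the trailing digits shown are the 17-significant-digit rendering of the nearest double to 1.1):

sprintf("%.17g", 1.1)     # "1.1000000000000001"
format(1.1, digits = 17)  # same idea via format()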
In addition to the previous answers, I think a good read on the subject is
The R Inferno, by P. Burns:
http://www.burns-stat.com/documents/books/the-r-inferno/

R expression results in NaN for no obvious reason [duplicate]

This question already has answers here:
How to calculate any negative number to the power of some fraction in R?
(2 answers)
Closed 6 years ago.
How can it be that the expression
> (exp(17.118708 + 4.491715 * -2)/-67.421587)^(-67.421587)
results in
[1] NaN
while
> -50.61828^(-67.421587)
which should basically have the same outcome, gives me
[1] -1.238487e-115
This is driving me crazy; I spent hours searching for the error. The "-2" in this case is a parameter of the function. I really can't think of a solution. Thanks for your help!
EDIT:
I see that when I add brackets
> (-50.61828)^(-67.421587)
it also results in
[1] NaN
...but that does not solve my problem.
It is because of the implementation of pow under the C99 standard.
Setting aside the OP's example (-50.61828)^(-67.421587), even the mathematically justified (-8)^(1/3) = -2 does not work in R:
(-8)^(1/3)
# [1] NaN
Quoted from ?"^":
Users are sometimes surprised by the value returned, for example
why ‘(-8)^(1/3)’ is ‘NaN’. For double inputs, R makes use of IEC
60559 arithmetic on all platforms, together with the C system
function ‘pow’ for the ‘^’ operator. The relevant standards
define the result in many corner cases. In particular, the result
in the example above is mandated by the C99 standard. On many
Unix-alike systems the command ‘man pow’ gives details of the
values in a large number of corner cases.
I am on Ubuntu Linux, so I can get the relevant part of man pow printed here:
If x is a finite value less than 0, and y is a finite noninteger, a
domain error occurs, and a NaN is returned.
From what I can tell, -50.61828^(-67.421587) evaluates as -(50.61828^(-67.421587)), because ^ binds more tightly than unary minus. (-50.61828)^(-67.421587), where the negative base really is raised to a non-integer power, also results in NaN.
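A short R illustration of the precedence point, plus one common real-valued workaround for odd roots (the sign/abs trick is a standard idiom, not something from the linked answer):

-50.61828^(-67.421587)    # parsed as -(50.61828^(-67.421587)): a small negative number
(-50.61828)^(-67.421587)  # negative base, non-integer exponent: NaN
(-8)^(1/3)                # NaN for the same reason
sign(-8) * abs(-8)^(1/3)  # -2: take the real cube root of the magnitude, restore the sign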

Factorial(x) for x>170 using Rmpfr/gmp library

The problem that I would like to solve is the infinite sum over the following function:
For the sum I use an FTOL termination criterion. This whole term doesn't create any problems until z becomes very large. I expect the maximum value of z to be around 220. As you can see, the first term has its maximum around Factorial(221), and the sum therefore has to run to around Factorial(500) before the termination criterion is reached. After spotting this problem I didn't want to change the whole code (as it is only one small part), so I tried library('Rmpfr') and library('gmp'). The problem is that I do not get what I want: while multiplication normally works, subtraction fails for higher values.
This works
> factorialZ(22)-factorial(22)
Big Integer ('bigz') :
[1] 0
but this fails:
> factorialZ(50)-factorial(50)
Big Integer ('bigz') :
[1] 359073645150499628823711419759505502520867983196160
another way I tried:
> gamma(as(10,"mpfr"))-factorial(9)
1 'mpfr' number of precision 128 bits
[1] 0
> gamma(as(40,"mpfr"))-factorial(39)
1 'mpfr' number of precision 128 bits
[1] 1770811808798664813196481658880
There has to be something that I don't really understand. Does someone have an even better solution for the problem, or can someone help me out with the issue above?
I think you are misunderstanding the order of evaluation in factorialZ(x) - factorial(x). The second term, factorial(x), is calculated in double precision before it is converted to a bigz to be combined with the first term.
You must create any integer outside the exactly representable double range (about 2^53, or whatever it is on your machine) using a bigz-compatible function, as sketched below.
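A minimal sketch of that point, using only functions already named in the question (gmp's factorialZ and base factorial):

library(gmp)
factorialZ(50) - factorialZ(50)  # bigz 0: both terms computed exactly
factorialZ(50) - factorial(50)   # large nonzero: factorial(50) was already rounded
                                 # to ~53 bits of precision before the subtraction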
50! is between 2^214 and 2^215 so the closest representable numbers are 2^(214-52) apart. factorial in R is based on a Lanczos approximation whereas factorialZ is calculating it exactly. The answers are within the machine precision:
> all.equal(as.numeric(factorialZ(50)), factorial(50))
[1] TRUE
The part that you're not understanding is floating point and its limitations. You're only getting ~15 digits of precision in floating point. factorialZ(50) has a LOT more precision than that, so you shouldn't expect them to be the same.

What is integer overflow in R and how can it happen?

I have some calculation going on and get the following warning (i.e. not an error):
Warning messages:
1: In sum(myvar, na.rm = T) :
Integer overflow - use sum(as.numeric(.))
In this thread people state that integer overflows simply don't happen. Either R isn't overly modern or they are not right. However, what am I supposed to do here? If I use as.numeric as the warning suggests, I might not account for the fact that information was lost well before that point. myvar is read from a .csv file, so shouldn't R figure out that some bigger field is needed? Does it already cut something off?
What's the max length of integer or numeric? Would you suggest any other field type / mode?
EDIT: I run:
R version 2.13.2 (2011-09-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) within R Studio
You can answer many of your questions by reading the help page ?integer. It says:
R uses 32-bit integers for integer vectors, so the range of
representable integers is restricted to about +/-2*10^9.
Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.
If you want "bignum" capacity then install Martin Maechler's Rmpfr package [PDF]. I recommend the Rmpfr package because of its author's reputation. Martin Maechler is also heavily involved with the Matrix package development, and in R Core as well. There are alternatives, including arithmetic packages such as 'gmp', 'Brobdingnag' and 'Ryacas' (the last of which also offers a symbolic math interface).
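For a flavour of what those packages give you (both assumed installed; the value shown is 2^100, well beyond the integer and exact-double ranges):

library(gmp)
as.bigz(2)^100                # exact big integer: 1267650600228229401496703205376
library(Rmpfr)
mpfr(2, precBits = 200)^100   # the same value, as a 200-bit floating-point number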
Next, to respond to the critical comments in the answer you linked to, and how to assess the relevance to your work, consider this: If there were the same statistical functionality available in one of those "modern" languages as there is in R, you would probably see a user migration in that direction. But I would say that migration, and certainly growth, is in the R direction at the moment. R was built by statisticians for statistics.
There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R-Core. On the other hand one of the earliest R developers, Ross Ihaka, suggests working toward development in a Lisp-like language [PDF]. There is a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental interface, Rincanter.
Update:
Newer versions of R (3.0.0+) have 53-bit integers of a sort (using the numeric mantissa). When an "integer" vector element is assigned a value in excess of .Machine$integer.max, the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for integers remains as it was; however, integer vectors may be coerced to doubles to preserve accuracy in cases that would formerly have generated an overflow. Unfortunately, the length of lists, matrix and array dimensions, and vectors is still set at integer.max.
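For example, assigning a value beyond the integer range into an integer vector silently promotes the whole vector:

x <- c(1L, 2L)
class(x)                          # "integer"
x[1] <- .Machine$integer.max + 1  # the right-hand side is a double
class(x)                          # "numeric": the whole vector was coerced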
When reading in large values from files, it is probably safer to use character-class as the target and then manipulate. If there is coercion to NA values, there will be a warning.
In short, integer is an exact type with a limited range, and numeric is a floating-point type that can represent a much wider range of values but is inexact. See the help pages (?integer and ?numeric) for further details.
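A quick way to see the "wide range but inexact" side of numeric (doubles represent integers exactly only up to 2^53):

.Machine$double.xmax   # about 1.8e308, the largest finite double
2^53 == 2^53 + 1       # TRUE: above 2^53, consecutive integers are no longer distinguishable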
As to the overflow, here is an explanation by Brian D. Ripley:
It means that you are taking the mean [in your case, the sum -- #aix] of some very large integers, and
the calculation is overflowing. It is just a warning.
This will not happen in the next release of R.
You can specify that a number is an integer by giving it the suffix L, for example, 1L is the integer one, as opposed to 1 which is a floating point one, with class "numeric".
The largest integer that you can create on your machine is given by .Machine$integer.max.
> .Machine$integer.max
[1] 2147483647
> class(.Machine$integer.max)
[1] "integer"
Adding a positive integer to this causes an overflow, returning NA.
> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow
> class(.Machine$integer.max + 1L)
[1] "integer"
You can get round this limit by adding floating point values instead.
> .Machine$integer.max + 1
[1] 2147483648
> class(.Machine$integer.max + 1)
[1] "numeric"
Since in your case the warning is issued by sum, this indicates that the overflow happens when the numbers are added together. The suggested workaround sum(as.numeric(.)) should do the trick.
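To reproduce the warning and the fix (myvar below is just a made-up stand-in for the column read from the .csv):

myvar <- rep(.Machine$integer.max, 10L)  # an integer vector of large values
sum(myvar)              # NA, with the integer-overflow warning from the question
sum(as.numeric(myvar))  # 21474836470, accumulated in double precision instead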
What's the max length of integer or numeric?
Vectors are currently indexed with an integer, so the max length is given by .Machine$integer.max. As DWin noted, all versions of R currently use 32-bit integers, so this will be 2^31 - 1, or a little over 2 billion.
Unless you are packing some serious hardware (or you are reading this in the future; hello from 2012) you won't have enough memory to allocate vectors that long.
I remember a discussion where R-core (Brian Ripley, I think) suggested that the next step could be to index vectors with the mantissa of doubles, or something clever like that, effectively giving 48 bits of index. Sadly, I can't find that discussion.
In addition to the Rmpfr package, if you are suffering from integer overflow, you might want to try the int64 package.
If c = a - b overflows because a and b are integers, convert them before subtracting:
c = as.double(a) - as.double(b)
(Converting only the result, as in as.double(a - b), is too late: the overflow has already happened.)

Negative Exponents throwing NaN in Fortran

Very basic Fortran question. The following function returns a NaN and I can't seem to figure out why:
F_diameter = 1. - (2.71828**(-1.0*((-1. / 30.)**1.4)))
I've fed 2.71... in rather than using exp(), but they both fail the same way. I've noticed that I only get a NaN when the fraction (-1. / 30.) is negative. Positive values evaluate OK.
Thanks a lot
The problem is that you are taking a root of a negative number, which would give you a complex answer. This is more obvious if you imagine e.g.
(-1) ** (3/2)
which is equivalent to
(sqrt(-1))**3
In other words, your fractional exponent can't trivially operate on a negative number.
There is another interesting point here that I learned today and want to add to ire_and_curses' answer: the Fortran compiler seems to compute integer powers by successive multiplication.
For example
PROGRAM Test
PRINT *, (-23) ** 6
END PROGRAM
works fine and gives 148035889 as the answer.
But for REAL exponents, the compiler uses logarithms: y**x = exp(x * log(y)) (maybe compilers today do it differently, but my book says so). Since the logarithm of a negative number is not defined in real arithmetic (it is complex), this does not work:
PROGRAM Test
PRINT *, (-23) ** 6.1
END PROGRAM
and even gives a compiler error:
Error: Raising a negative REAL at (1) to a REAL power is prohibited
From a mathematical point of view, this problem is also quite interesting: https://math.stackexchange.com/questions/1211/non-integer-powers-of-negative-numbers
