NAs produced by integer overflow + R on Linux

I'm running an R script on a Unix-based system. The script contains multiplications of large numbers, so the results were NAs produced by integer overflow; but when I run the same script on Windows, this problem does not appear.
However, I need to keep the script running overnight on the desktop machine, which runs Unix.
Is there any solution for this problem?
Thanks
for (ol in seq(1, nrow(yi), by = 25)) {
  for (oh in seq(1, nrow(yi), by = 25)) {
    A = (N*(ol^2)) + ((N*(N+1)*(2*N+1))/6) - (2*ol*((N*N+1)/2)) +
      (2*N*ol*(N-oh+1)) + ((N-oh+1)*N^2) + (2*N*(oh-N-1)*(oh+N))
  }
}
with:
N = nrow(yi) = 16569
but even the first iteration is not being calculated on Unix.

Can you cast your integers to floating-point numbers in order to use floating-point math for the computations?
For example:
> x=as.integer(1000000)
> x*x
[1] NA
Warning message:
In x * x : NAs produced by integer overflow
> x=as.numeric(1000000)
> x*x
[1] 1e+12
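Applied to the loop in your question, coercing N to double once up front is enough to make all of the arithmetic floating-point (a sketch, assuming yi as in your script):
N <- as.numeric(nrow(yi))  # double, so every product below stays double
for (ol in seq(1, N, by = 25)) {
  for (oh in seq(1, N, by = 25)) {
    A = (N*(ol^2)) + ((N*(N+1)*(2*N+1))/6) - (2*ol*((N*N+1)/2)) +
      (2*N*ol*(N-oh+1)) + ((N-oh+1)*N^2) + (2*N*(oh-N-1)*(oh+N))
  }
}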
As an aside, it is not entirely clear why the warning would appear in one environment but not the other. I first thought that 32-bit and 64-bit builds of R might be using 32-bit and 64-bit integers respectively, but that doesn't appear to be the case. Are both your environments configured identically in terms of how warnings are displayed?

As the other answers have pointed out, there is something a bit non-reproducible/strange about your results so far. Nevertheless, if you really must do exact calculations on large integers, you probably need an interface between R and some other system.
Some of your choices are:
the gmp package (see this page and scroll down to R)
an interface to the bc calculator on Google Code
there is a high-precision arithmetic page on the R wiki which compares interfaces to Yacas, bc, and MPFR/GMP
there is a limited interface to the PARI/GP package in the elliptic package, but this is probably (much) less immediately useful than the preceding three choices
Most Unix or Cygwin systems should have bc installed already. GMP and Yacas are easy to install on modern Linux systems ...
Here's an extended example, with a function that can choose among numeric, integer, or bigz computation.
library(gmp)  ## for as.bigz()

f1 <- function(ol=1L, oh=1L, N=16569L, type=c("num","int","bigz")) {
  type <- match.arg(type)
  ## convert all values to the appropriate type
  if (type=="int") {
    ol <- as.integer(ol)
    oh <- as.integer(oh)
    N <- as.integer(N)
    one <- 1L
    two <- 2L
    six <- 6L
    cc <- as.integer
  } else if (type=="bigz") {
    one <- as.bigz(1)
    two <- as.bigz(2)
    six <- as.bigz(6)
    N <- as.bigz(N)
    ol <- as.bigz(ol)
    oh <- as.bigz(oh)
    cc <- as.bigz
  } else {
    one <- 1
    two <- 2
    six <- 6
    N <- as.numeric(N)
    oh <- as.numeric(oh)
    ol <- as.numeric(ol)
    cc <- as.numeric
  }
  ## if using bigz mode, the ratio needs to be converted back to bigz;
  ## defining cc() as above seemed to be the most transparent way to do it
  N*ol^two + cc(N*(N+one)*(two*N+one)/six) -
    ol*(N*N+one) + two*N*ol*(N-oh+one) +
    (N-oh+one)*N^two + two*N*(oh-N-one)*(oh+N)
}
I removed a lot of unnecessary parentheses, which were actually making it harder to see what was going on. It is indeed true that for the (1,1) case the final result is not bigger than .Machine$integer.max, but some of the intermediate steps are ... (for the (1,1) case the whole expression actually reduces to $$-\frac{1}{6}(N+2)(4N^2-5N+3)$$ ...)
f1() ## -3.032615e+12
f1() > .Machine$integer.max ## FALSE
N <- 16569L
N*(N+1)*(2*N+1) > .Machine$integer.max ## TRUE
N*(N+1L)*(2L*N+1L) ## integer overflow (NA)
f1(type="int") ## integer overflow
f1(type="bigz") ## "-3032615078557"
print(f1(),digits=20) ## -3032615078557: no actual loss of precision in this case
PS: you have a (N*N+1) term in your equation. Should that really be N*(N+1), or did you really mean N^2+1?

Given your comments, I guess that you seriously misunderstand the "correctness" of numbers in R. You say the outcome you get on Windows is something like -30598395869593930593. Now, on both 32-bit and 64-bit systems that precision is not even possible using a double, let alone an integer:
> x <- -30598395869593930593
> format(x,scientific=F)
[1] "-30598395869593931776"
> all.equal(x,as.numeric(format(x,scientific=F)))
[1] TRUE
> as.integer(x)
[1] NA
You have 16 digits you can trust; all the rest is bollocks. Then again, an accuracy of 16 digits is already pretty strong. Most measurement tools don't even come close to that.
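A quick way to see exactly where doubles stop being exact (53 mantissa bits, roughly 15.95 decimal digits):
.Machine$double.digits   # 53
2^53 == 2^53 + 1         # TRUE: above 2^53, consecutive integers are no longer distinguishable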

Related

Bitwise operations with bigz in gmp

I'm translating some cryptography scripts from Python to R. Python seems to handle very large integers much better than R can natively:
10593080468914978578954316149578855170502344604886137564370015851276669104055 >> 1
# 5296540234457489289477158074789427585251172302443068782185007925638334552027
But I'm aware of the gmp library for R, which handles them well (mostly):
as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
For context, to translate these scripts I need to use bitwise operations. The problem is that these bigz objects are encoded as raw values, and so I can't use the base bitwise functions for them as they are incompatible.
Finding a workaround for shifting bits to the left and right is straightforward, but I need something that will:
Perform the equivalent of bitwAnd and bitwOr
On bigz values
WITHOUT losing precision.
Any ideas?
Bonus: if you can provide an interpretation of bitwAnd and bitwOr in terms of base 10 then that could work. Preferably with some example code in R, if not I can work around it.
I'm sure there must be a slicker and faster way, but one option would be something like this...
library(gmp)
z <- as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
w <- as.bigz("1234874454654321549879876546351546654456432132321654987584654321321")
#express each number as a numeric vector of 0s and 1s
#(charToRaw yields ASCII codes; subtracting 48, the code for "0", maps "0"/"1" to 0/1)
z1 <- as.numeric(charToRaw(as.character(z, b=2)))-48
w1 <- as.numeric(charToRaw(as.character(w, b=2)))-48
#normalise the lengths
mx <- max(length(z1), length(w1))
z1 <- c(rep(0, mx-length(z1)), z1)
w1 <- c(rep(0, mx-length(w1)), w1)
#then do & or | and convert back to bigz
zandw <- as.bigz(paste0("0b", rawToChar(as.raw(1*(z1 & w1) + 48))))
zorw <- as.bigz(paste0("0b", rawToChar(as.raw(1*(z1 | w1) + 48))))
zandw
Big Integer ('bigz') :
[1] 905773543034890641004226585015137324621885921615658881499355162273
zorw
Big Integer ('bigz') :
[1] 10593080469244079490573747058454505131838753934720683775076011957361968263103
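Following up on the base-10 interpretation asked for in the bonus: AND and OR can also be built bit by bit from ordinary bigz arithmetic, since a %% 2 extracts the lowest bit and a %/% 2 shifts right. A minimal sketch, assuming z and w as above (bigzBitwise is just an illustrative name; it is slower than the string version but avoids the encoding round-trip):
bigzBitwise <- function(a, b, op = `&`) {
  result <- as.bigz(0)
  place <- as.bigz(1)              # current power of two
  while (a > 0 || b > 0) {
    bit_a <- (a %% 2) == 1         # lowest bit of a
    bit_b <- (b %% 2) == 1         # lowest bit of b
    if (op(bit_a, bit_b)) result <- result + place
    a <- a %/% 2                   # shift right one bit
    b <- b %/% 2
    place <- place * 2
  }
  result
}
bigzBitwise(z, w, `&`)  # should match zandw above
bigzBitwise(z, w, `|`)  # should match zorw above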

For Loops in Base R - How do the double brackets function (and lead to the boost in processing speed)?

I have been trying to construct a base R for loop that I could speed-test alongside other R package functions.
Unfortunately, I seem to have a gap in my understanding of 1) how for loops must be constructed, 2) how small syntactical differences lead to significant changes in processing speed, and 3) whether those first two matters are somehow related. My education never touched on these subjects, and the resources I can find don't explain the empirical results.
For example, let's take a seeded random dataframe populated with values between 1 and 5, where all values greater than 4 are to be replaced with zeros. I use four permutations of for loops with differing bracketing structures to find all the values greater than 4 and replace them.
Three of these are fully functioning for loops, and there is an interesting relationship between how many double brackets are used and the resulting increase in speed.
This is not about silent integer coercion as far as I can tell.
The speed differences between bracket counts are fully independent of whether the values start out as integers or doubles, get tested as integers or doubles, or get replaced as integers or doubles, in any combination of the above. I have tested all of these and found that they all contribute significantly less to the final speed than the bracketing does.
Why is the number of brackets used causing these variations in speed?
Also, a probably much simpler question: why does the version with the double bracket in the 'testing/filtering' vector not function, even though leaving it out entirely, or adding another double bracket in front, causes it to work?
Please find my example code below:
set.seed(24)
# Numeric Dataframe
df_numeric_orig <- as.data.frame(matrix(sample(c(1:5), 1e6 * 12, replace = TRUE),
                                        dimnames = list(NULL, paste0("var", 1:12)),
                                        ncol = 12))

## No Double Brackets
df_num <- df_numeric_orig
NumDF_NoSetsDbleBrackets <-
  system.time({
    for (row in 1:ncol(df_num)) {
      df_num[row][(df_num[row] > 4)] <- 0
    }
  })

# 1 set Double Brackets - Front
df_num <- df_numeric_orig
NumDF_1SetDbleBrackets_front <-
  system.time({
    for (row in 1:ncol(df_num)) {
      df_num[[row]][(df_num[row] > 4)] <- 0
    }
  })

# 1 set Double Brackets - Back
df_num <- df_numeric_orig
NumDF_1SetDbleBrackets_back <-
  system.time({
    for (row in 1:ncol(df_num)) {
      df_num[row][(df_num[[row]] > 4)] <- 0
    }
  })

# 2 sets Double Brackets
df_num <- df_numeric_orig
NumDF_2SetDbleBrackets <-
  system.time({
    for (row in 1:ncol(df_num)) {
      df_num[[row]][(df_num[[row]] > 4)] <- 0
    }
  })
NumDF_NoSetsDbleBrackets
NumDF_1SetDbleBrackets_front
NumDF_1SetDbleBrackets_back
NumDF_2SetDbleBrackets
I know that I can wrap these loops in local functions and thus avoid having to reset the dataframe each time; however, I want to remove as many obstacles as possible so that reviewers can observe directly that these for loops do in fact function effectively, i.e. that syntactic convention alone is not a sufficient explanation.
The speed results on my machine typically come out about as below:
> NumDF_NoSetsDbleBrackets
user system elapsed
0.41 0.13 0.55
> NumDF_1SetDbleBrackets_front
user system elapsed
0.25 0.10 0.35
> NumDF_1SetDbleBrackets_back
Error: object 'NumDF_1SetDbleBrackets_back' not found
> NumDF_2SetDbleBrackets
user system elapsed
0.23 0.05 0.29
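For reference, and without claiming a full answer: the two indexing forms return different structures, which is presumably where both the speed gap and the error originate:
df <- data.frame(a = 1:3)
class(df[1])    # "data.frame": single brackets keep the data-frame wrapper
class(df[[1]])  # "integer": double brackets extract the underlying column vector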

Integer overflow from many-leveled factor with class.ind()?

I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:
FLN <- data.frame(nnet::class.ind(FinelineNumber))
where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).
I keep getting this concerning-looking warning:
In n * (unclass(cl) - 1L) : NAs produced by integer overflow
Memory available to the system is essentially unlimited. I'm not sure what the problem is.
The source code of nnet::class.ind is:
function (cl) {
  n <- length(cl)
  cl <- as.factor(cl)
  x <- matrix(0, n, length(levels(cl)))
  x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}
.Machine$integer.max is 2147483647. If n * (nlevels - 1L) is greater than this value, it will overflow and produce your warning. Solving for n:
imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
You'll encounter this problem if you have 429583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse) if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck, i.e. index by rows and columns rather than by absolute location. (@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to
x[(1:n) + n * (unclass(cl) - 1)] <- 1
possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...)
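For instance, here is a hedged sketch of that row/column-indexing idea (class_ind2 is just an illustrative name): a two-column index matrix never forms the n * (level - 1) product, so no single index can overflow:
class_ind2 <- function(cl) {
  cl <- as.factor(cl)
  n <- length(cl)
  x <- matrix(0, n, nlevels(cl))
  x[cbind(seq_len(n), unclass(cl))] <- 1  # index by (row, column) pairs
  dimnames(x) <- list(names(cl), levels(cl))
  x
}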
Even if you were able to complete this step, you'd end up with a 5000 x 650,000 matrix; it looks like that will be about 12 Gb:
print(650*object.size(matrix(1L,5000,1000)),units="Gb")
I guess if you've got 100Gb free that could be OK ...
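And a minimal sketch of the sparse route, assuming FinelineNumber is the factor from the question:
library(Matrix)
## fac2sparse gives a levels-x-observations sparse indicator matrix;
## transpose it if you want observations as rows
FLN_sparse <- t(fac2sparse(FinelineNumber))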

Creating a vector in a for loop

I am ashamed that I need assistance on such a simple task. I want to generate 20 normally distributed numbers, sum them, and then repeat this x times, then plot a histogram of the sums. This is an exercise in Gelman and Hill's text "Data Analysis Using Regression and Multilevel/Hierarchical Models".
I thought this would be simple, but I have been at it for about 10 hours now. Web searches and looking through "The Art of R Programming" by Norman Matloff and "R for Everyone" by Jared Lander have not helped. I suspect the answer is so simple that no one would suspect a problem. The syntax in R is something I am having difficulty with.
> # chapter 2 exercise 3
> n.sim <- 10 # number of simulations
>
> sumNumbers <- rep(NA, n.sim) # generate vector of NA's
> for (i in 1:n.sim) # begin for loop
+{
+ numbers <- rnorm(20,0,1)
+ sumNumbers(i) <- sum(numbers) # defined as a vector but R
+ # thinks it's a function
+ }
Error in sumNumbers(i) <- sum(numbers) :
could not find function "sumNumbers<-"
>
> hist(sumNumbers)
Error in hist.default(sumNumbers) : 'x' must be numeric
3 stop("'x' must be numeric")
2 hist.default(sumNumbers)
1 hist(sumNumbers)
>
A few things:
When you put parentheses after a variable name, the R interpreter assumes that it's a function. In your case, you want to reference an index of a variable, so it should be sumNumbers[i] <- sum(numbers), which uses square brackets instead. This will solve your problem (see the corrected loop after this list).
You can initialize sumNumbers as sumNumbers <- numeric(n.sim). It's a bit easier to read in a simple case like this.
By default, rnorm(n) is the same as rnorm(n,0,1). This can save you some time typing.
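Putting those points together, a corrected version of the loop would be:
n.sim <- 10
sumNumbers <- numeric(n.sim)
for (i in 1:n.sim) {
  sumNumbers[i] <- sum(rnorm(20))  # square brackets index the vector
}
hist(sumNumbers)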
You can replicate an operation a specified number of times with the replicate function:
set.seed(144) # For consistent results
(simulations <- replicate(10, sum(rnorm(20))))
# [1] -9.3535884 1.4321598 -1.7812790 -1.1851263 -1.9325988 2.9652475 2.9559994
# [8] 0.7164233 -8.1364348 -7.3428464
After simulating the proper number of samples, you can plot with hist(simulations).

intToBin with large numbers

I'm using the intToBin() function from the "R.utils" package and am having trouble using it to convert large decimal numbers to binary.
I get this error: NAs introduced by coercion.
Is there another function out there that can handle big numbers, or an algorithm/code to implement such a function?
Thanks
If you read the help page for intToBin, it quite explicitly says it takes "integer" inputs. These are not mathematical "integers" but rather the computer-language-defined ints, which in R are 32-bit signed values (nothing bigger than .Machine$integer.max, i.e. 2^31 - 1, fits).
You'll need to find (or write :-( ) a function that converts floating-point numbers to binary floats; or, if you're lucky, the Rmpfr or gmp packages, which do arbitrary-precision "big number" math, may have a float-to-binary tool.
By the time this gets posted, someone will have exposed my ignorance by posting an existing function, with my luck.
Edit: like maybe the pack package.
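If the inputs are whole numbers that are merely too large for R's int type, the gmp package can already render them in base 2; this is the same as.character(, b = 2) trick used in the bigz answer above (a minimal sketch):
library(gmp)
as.character(as.bigz("12345678901234567890"), b = 2)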
I needed a converter between doubles and hex numbers, so I wrote these. They might be helpful to others:
doubleToHex <- function(x) {
  if (x < 16)
    return(sprintf("%X", x))
  remainders <- c()
  while (x > 15) {
    remainders <- append(remainders, x %% 16)
    x <- floor(x / 16)
  }
  remainders <- paste(sprintf("%X", rev(remainders)), collapse = "")
  # the leading digit must be hex-formatted too (e.g. 255 -> "FF", not "15F")
  return(paste(sprintf("%X", x), remainders, sep = ""))
}
hexToDouble <- function(x) {
  x <- strsplit(x, "")[[1]]
  output <- as.double(0)
  for (i in rev(seq_along(x))) {
    # value of each hex digit times its place value (16^position)
    output <- output + as.numeric(as.hexmode(x[i])) * 16^(length(x) - i)
  }
  return(output)
}
doubleToHex(x = 8356723)
hexToDouble(x = "7F8373")
Hasn't been extensively tested yet, let me know if you detect a problem with it.
