Integer overflow from many-leveled factor with class.ind()?

I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:
FLN <- data.frame(nnet::class.ind(FinelineNumber))
where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).
I keep getting this concerning-looking warning:
In n * (unclass(cl) - 1L) : NAs produced by integer overflow
Memory available to the system is essentially unlimited. I'm not sure what the problem is.

The source code of nnet::class.ind is:
function (cl) {
  n <- length(cl)
  cl <- as.factor(cl)
  x <- matrix(0, n, length(levels(cl)))
  x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}
.Machine$integer.max is 2147483647. If n * (nlevels - 1L) exceeds this value, that will produce your warning. Solving for n:
imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
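With an n in the range of the Walmart training data (taken here to be about 650,000 rows, as in the size estimate further down), the overflow shows up immediately:
n <- 650000L
(1L:2L) + n * (5000L - 1L)
## Warning: NAs produced by integer overflow
## [1] NA NA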
You'll encounter this problem if you have 429583 or more rows (not particularly big in a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse), if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck, i.e. indexing by rows and columns rather than by absolute location. [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to
x[(1:n) + n * (unclass(cl) - 1)] <- 1
possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
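For what it's worth, the row-and-column indexing rewrite could look something like this (a sketch, not the nnet code; class_ind2 is a made-up name):
## sketch: index by (row, column) pairs instead of by absolute position,
## so the n * (level - 1) product is never formed
class_ind2 <- function(cl) {
  cl <- as.factor(cl)
  n <- length(cl)
  x <- matrix(0, n, nlevels(cl))
  x[cbind(seq_len(n), unclass(cl))] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}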
Even if you were able to complete this step, you'd end up with a 650,000-by-5,000 matrix; it looks like that will be about 12 Gb:
print(650*object.size(matrix(1L,5000,1000)),units="Gb")
I guess if you've got 100Gb free that could be OK ...
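If your modeling framework does accept sparse matrices, the Matrix route sidesteps the dense allocation entirely; a small illustration on a synthetic factor (a sketch, nowhere near Walmart scale):
library(Matrix)
set.seed(1)
f <- factor(sample(letters[1:5], 20, replace = TRUE))  ## stand-in for FinelineNumber
X1 <- sparse.model.matrix(~ f - 1)  ## one row per observation, one column per level
X2 <- fac2sparse(f)                 ## same information, transposed (one row per level)
dim(X1)  ## 20 x 5
dim(X2)  ## 5 x 20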

Related

R combn fails when asked for large number of combinations (even though memory should suffice)

I wanted to get all the combinations of 73,000 choose 2, so I tried to use combn to calculate them.
combn(73000,2)
I received the following error:
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
I figured that the number of combinations is 2,664,463,500, so multiplied by 8 bytes that should be around 22 GB, which I had free on my machine.
So even though it's a lot of combinations, it shouldn't fail.
Any alternative way to calculate the number of combinations or explanations of why combn fails?
I dug into the code and apparently when constructing the output matrix the dimensions are converted to integer:
count <- as.integer(round(choose(n, m)))
out <- matrix(r, nrow = len.r, ncol = count) # matrix for now
Removing the as.integer() around count increases its range, and in my case it no longer overflows.
That wasn't enough, though; I kept receiving the same error.
I couldn't find a way to initialize the matrix as type integer, so I created two vectors instead (for m = 2), like this:
count <- choose(n, 2)  # number of pairs; stays a double, so no integer overflow
col1 <- vector(mode = "integer", length = count)
col2 <- vector(mode = "integer", length = count)
With a few more adjustments it now runs.
I hope this helps others as well.
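For reference, here is one way to build those two index vectors directly for m = 2, shown on a deliberately tiny n (a sketch; for n = 73000 each vector still has ~2.7 billion elements, so it relies on R's long-vector support and plenty of RAM):
n <- 5L
col1 <- rep(1:(n - 1L), times = (n - 1L):1L)   # first member of each pair
col2 <- sequence((n - 1L):1L) + col1           # second member of each pair
cbind(col1, col2)
stopifnot(all(cbind(col1, col2) == t(combn(n, 2))))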

How can I avoid replacement has length zero error

I am trying to generate the term frequency matrix of a document and subsequently look up the frequency of a certain word in a given query in that matrix. In the end I want to sum the frequencies found of the words in the query.
However, I keep getting the error message: Error in feature[i] <- x : replacement has length zero
I do not have a lot of coding experience in general, and this is my first time working with R, so I am having difficulty solving this error. I presume it has something to do with a NULL value. I already tried to replace the nested for-loop with an apply function because I thought that might help (not sure, though), but I could not quite work out how to convert the for-loop into an apply call.
termfreqname <- function(queries, docs){
  n <- length(queries)
  feature <- vector(length = n)
  for(i in 1:n){
    query <- queries[i]
    documentcorpus <- c(docs[i])
    tdm <- TermDocumentMatrix(tm_corpus) #creates the term frequency matrix per document
    m <- sapply(strsplit(query, " "), length) #length of the query in words
    totalfreq <- list(0) #initialize list
    freq_counter <- rowSums(as.matrix(tdm)) #counts the occurrence of a given word in the tdm matrix
    for(j in 1:m){
      freq <- freq_counter[word(query,j)] #finds frequency of each word in the given query, in the term frequency matrix
      totalfreq[[j]] <- freq #adds this frequency to position j in the list
    }
    x <- reduce(totalfreq,'+') #sums all the numbers in the list
    feature[i] <- x #adds this number to feature list
    feature
  }
}
It depends on your needs, but bottom line you need to add some if statement. How you use it depends on whether you want the default value of the vector to persist. In your code, while feature starts as a logical vector, it is likely coerced to integer or numeric once you overwrite its first value with a number. In that case, the default value in all positions of the vector will be 0 (or 0L, if integer). That's going to influence your decision on how to use the if statement.
if (length(x)) feature[i] <- x
This will only attempt to overwrite the ith value of feature if the x object has length (that's equivalent to if (length(x) > 0)). In this case, since the default value in the vector will be zero, when you are done you will not be able to distinguish between an element known to be 0 and an element that failed to find anything.
The alternative (and my preference/recommendation):
feature[i] <- if (length(x)) x else NA
In this case, when you are done, you can clearly distinguish between known-zero (0) and uncertain/unknown values (NA). When doing math operations on that vector, you might want/need na.rm=TRUE ... but it all depends on your use.
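A tiny self-contained illustration of the difference (the results list is a made-up stand-in for whatever your inner loop produces):
results <- list(3, NULL, 1)   # x is sometimes a number, sometimes NULL/empty
feature <- vector(length = length(results))
for (i in seq_along(results)) {
  x <- results[[i]]
  feature[i] <- if (length(x)) x else NA   # NA marks "nothing found"
}
feature
# [1]  3 NA  1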
BTW, as MartinGal noted, your use of reduce(totalfreq, '+') is a little flawed: '+' (as a quoted string) may not be (is not?) recognized as a function. The first fix for this is to use backticks around the function, so
totalfreq <- 5:7
reduce(totalfreq, '+')
# NULL
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# [1] 18
Of these, the last is the much-preferred method. Why? With a vector of length 4, for instance, reduce takes the first two elements and adds them, then takes that result and adds it to the third, then takes that result and adds it to the fourth: three operations. When you have 100 elements, it will make 99 individual additions. sum does it in one call, and this does have an effect on performance (asymptotically).
However, if totalfreq is instead a list, then this changes slightly:
totalfreq <- as.list(5:7)
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# Error in sum(totalfreq) : invalid 'type' (list) of argument
sum(unlist(totalfreq))
# [1] 18
The reduce code still works, and sum by itself fails, but we can unlist the list first, effectively creating a vector, and then call sum on that. Much, much faster asymptotically, and perhaps clearer and more declarative.
(I'm assuming purrr::reduce, btw ...)

Combine list of matrices into a big.matrix

I have a list of large (35000 x 3) matrices in R and I want to combine them into a single matrix, but the result would be about 1 billion rows long and would exceed the maximum object size in R.
The bigmemory package allows for larger matrices but doesn't appear to support rbind to put multiple matrices together.
Is there some other package or technique that supports the creation of a very large matrix from smaller matrices?
Also, before you ask: this is not a RAM issue, simply an R limitation, even on 64-bit R.
You could implement it with a loop:
library(bigmemory)
## Reproducible example
mat <- matrix(1, 50e3, 3)
l <- list(mat)
for (i in 2:100) {
  l[[i]] <- mat
}
## Solution
m <- ncol(l[[1]]) ## assuming that all have the same number of columns
n <- sum(sapply(l, nrow))
bm <- big.matrix(n, m)
offset <- 0
for (i in seq_along(l)) {
  mat_i <- l[[i]]
  n_i <- nrow(mat_i)
  ind_i <- seq_len(n_i) + offset
  bm[ind_i, ] <- mat_i
  offset <- offset + n_i
}
## Verif
stopifnot(offset == n, all(bm[, 1] == 1))
Not quite an answer, but a little more than a comment: are you sure that you can't do it by brute force? R now has long vectors (since version 3.0.0; the question you link to refers to R version 2.14.1): from this page,
Arrays (including matrices) can be based on long vectors provided each of their dimensions is at most 2^31 - 1: thus there are no 1-dimensional long arrays.
while the underlying atomic vector can go up to 2^52 - 1 elements ("in theory .. address space limits of current CPUs and OSes will be much smaller"). That means you should in principle be able to create a matrix with up to 2^31 - 1 (about 2.1 billion) rows; since the maximum "long" vector length is about 10^15 elements (literally millions of billions), a matrix of 1 billion rows and 3 columns should (theoretically) not be a problem.
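If you do try brute force, the whole combine step is one call (a sketch, reusing the list l from the reproducible example above and assuming the combined matrix fits in memory):
big <- do.call(rbind, l)   ## plain base-R rbind; fine while nrow stays below 2^31 - 1
dim(big)                   ## 5,000,000 x 3 for that example list
stopifnot(nrow(big) == sum(sapply(l, nrow)))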

R: Big integer matrices

I have some big integer matrices (1000 x 1000000) that I have to multiply and do rowmax on.
They contain 0 and 1 (approx 99% 1 and 1% 0 and no other values).
My problem is memory consumption: currently R uses 8 bytes per entry.
I have looked at SparseMatrix, but it seems I cannot set the default value to 1 instead of 0.
How can I represent these matrices in a memory efficient way, but so I can still multiply them as matrices and use rowmax?
Preferably it should work with R-2.15 and not require additional libraries.
Second idea: If you have a couple of these matrices, call them X_1 and X_2, let Y_1 = 1*1' - X_1 and Y_2 = 1*1' - X_2; the Y's can be sparse because they are 99% zero. So their product is
X_1 * X_2 = ( 1*1' - Y_1) * (1*1' - Y_2) = 1*1'*1*1' - Y_1*1*1' - 1*1'*Y_2 + Y_1 * Y_2
which you can simplify even further.
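Here is a small sketch of that complement trick using the Matrix package on made-up matrices (sizes and names are illustrative only):
library(Matrix)
set.seed(1)
X1 <- matrix(rbinom(200 * 300, 1, 0.99), 200, 300)  ## ~99% ones
X2 <- matrix(rbinom(300 * 100, 1, 0.99), 300, 100)
Y1 <- Matrix(1 - X1, sparse = TRUE)  ## complements are ~99% zero, so store them sparsely
Y2 <- Matrix(1 - X2, sparse = TRUE)
## X1 %*% X2 = m*J - Y1 %*% J - J %*% Y2 + Y1 %*% Y2, with m = ncol(X1) and J all ones
m <- ncol(X1)
XX <- m - outer(rowSums(Y1), rep(1, ncol(X2))) -
  outer(rep(1, nrow(X1)), colSums(Y2)) + as.matrix(Y1 %*% Y2)
stopifnot(all(XX == X1 %*% X2))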
There are several sparse matrix packages (slam, SparseM, Matrix, ...) but I doubt any will do a bitwise representation, or even single char, as you'd need here. You may have to code that up yourself.
Alternatively, packages like ff allow more compact storage but AFAIK will not do matrix ops for you. Maybe you could build that on top of them?
Off the top of my head, I can't think of a packaged solution...
It seems like you could represent this type of data extremely efficiently with run-length encoding by row. From there, you could implement a matrix-vector multiply method for rle objects (which might be hard) and row-max (which should be trivial).
Since only 1% of the entries are 0, it would not be difficult to compress them. One trivial example:
pseudo.matrix <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  zeroes.cells <- which(x == 0)   ## column-major positions of the rare zeros
  list(nrow = nrow, ncol = ncol, zeroes.cells = zeroes.cells)
}
This alone would reduce their memory size significantly. And it would be easy to recover the original matrix:
recover.matrix <- function(x) {
  m <- matrix(1, x$nrow, x$ncol)
  m[x$zeroes.cells] <- 0
  m
}
I guess it is possible to work out a way to efficiently multiply these pseudo-matrices, since the result for each cell would be something like the number of columns of the first matrix minus an adjustment for the number of zeros involved in the operation, but I am not sure how easy that would be to do.
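Row-max, at least, falls out of this representation almost for free for 0/1 data; a sketch (pseudo.rowmax is a made-up name, operating on the list returned by pseudo.matrix above):
pseudo.rowmax <- function(p) {
  ## which() records zeros by column-major position; recover each zero's row index
  zero.rows <- ((p$zeroes.cells - 1) %% p$nrow) + 1
  zeros.per.row <- tabulate(zero.rows, nbins = p$nrow)
  ## a row's max is 0 only if all of its ncol entries are zeros; otherwise it is 1
  ifelse(zeros.per.row == p$ncol, 0, 1)
}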

NAs produced by integer overflow + R on linux

I'm running an R script on a UNIX-based system. The script contains multiplications of large numbers, and the results were NAs from integer overflow, but when I run the same script on Windows this problem does not appear.
I need to keep the script running overnight on the desktop machine (which is Unix).
Is there any solution for this problem?
Thanks.
for (ol in seq(1, nrow(yi), by = 25))
{
  for (oh in seq(1, nrow(yi), by = 25))
  {
    A = (N*(ol^2)) + ((N*(N+1)*(2*N+1))/6) - (2*ol*((N*N+1)/2)) + (2*N*ol*(N-oh+1)) + ((N-oh+1)*N^2) + (2*N*(oh-N-1)*(oh+N))
  }
}
with:
N = 16569 (= nrow(yi))
But the first round is not being calculated on Unix.
Can you cast your integers to floating-point numbers in order to use floating-point math for the computations?
For example:
> x=as.integer(1000000)
> x*x
[1] NA
Warning message:
In x * x : NAs produced by integer overflow
> x=as.numeric(1000000)
> x*x
[1] 1e+12
As an aside, it is not entirely clear why the warning would appear in one environment but not the other. I first thought that 32-bit and 64-bit builds of R might be using 32-bit and 64-bit integers respectively, but that doesn't appear to be the case. Are both your environments configured identically in terms of how warnings are displayed?
As the other answers have pointed out, there is something a bit non-reproducible/strange about your results so far. Nevertheless, if you really must do exact calculations on large integers, you probably need an interface between R and some other system.
Some of your choices are:
the gmp package (see this page and scroll down to R)
an interface to the bc calculator on googlecode
there is a high precision arithmetic page on the R wiki which compares interfaces to Yacas, bc, and MPFR/GMP
there is a limited interface to the PARI/GP package in the elliptic package, but this is probably (much) less immediately useful than the preceding three choices
Most Unix or Cygwin systems should have bc installed already. GMP and Yacas are easy to install on modern Linux systems ...
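For example, the gmp route handles the kind of term that overflows above in one line (a minimal illustration):
library(gmp)
N <- as.bigz(16569)
N * (N + 1) * (2 * N + 1)   ## exact bigz result, about 9.1e12, well past .Machine$integer.max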
Here's an extended example, with a function that can choose among numeric, integer, or bigz computation.
## (the "bigz" branch assumes library(gmp) is loaded, for as.bigz())
f1 <- function(ol = 1L, oh = 1L, N = 16569L, type = c("num", "int", "bigz")) {
  type <- match.arg(type)
  ## convert all values to the appropriate type
  if (type == "int") {
    ol <- as.integer(ol)
    oh <- as.integer(oh)
    N <- as.integer(N)
    one <- 1L
    two <- 2L
    six <- 6L
    cc <- as.integer
  } else if (type == "bigz") {
    one <- as.bigz(1)
    two <- as.bigz(2)
    six <- as.bigz(6)
    N <- as.bigz(N)
    ol <- as.bigz(ol)
    oh <- as.bigz(oh)
    cc <- as.bigz
  } else {
    one <- 1
    two <- 2
    six <- 6
    N <- as.numeric(N)
    oh <- as.numeric(oh)
    ol <- as.numeric(ol)
    cc <- as.numeric
  }
  ## if using bigz mode, the ratio needs to be converted back to bigz;
  ## defining cc() as above seemed to be the most transparent way to do it
  N*ol^two + cc(N*(N + one)*(two*N + one)/six) -
    ol*(N*N + one) + two*N*ol*(N - oh + one) +
    (N - oh + one)*N^two + two*N*(oh - N - one)*(oh + N)
}
I removed a lot of unnecessary parentheses, which were actually making it harder to see what was going on. It is indeed true that for the (1,1) case the final result is not bigger than .Machine$integer.max, but some of the intermediate steps are ... (for the (1,1) case this actually reduces to -1/6*(N+2)*(4*N^2-5*N+3) ...)
f1() ## -3.032615e+12
f1() > .Machine$integer.max ## FALSE
N <- 16569L
N*(N+1)*(2*N+1) > .Machine$integer.max ## TRUE
N*(N+1L)*(2L*N+1L) ## integer overflow (NA)
f1(type="int") ## integer overflow
f1(type="bigz") ## "-3032615078557"
print(f1(),digits=20) ## -3032615078557: no actual loss of precision in this case
PS: you have a (N*N+1) term in your equation. Should that really be N*(N+1), or did you really mean N^2+1?
Given your comments, I guess that you seriously misunderstand the "correctness" of numbers in R. You say the outcome you get on Windows is something like -30598395869593930593. Now, on both 32-bit and 64-bit builds, that precision is not even possible using a double, let alone an integer:
> x <- -30598395869593930593
> format(x,scientific=F)
[1] "-30598395869593931776"
> all.equal(x,as.numeric(format(x,scientific=F)))
[1] TRUE
> as.integer(x)
[1] NA
You have 16 digits you can trust, all the rest is bollocks. Then again, an accuracy of 16 digits is already pretty strong. Most measurement tools don't even come close to that.
