R: Big integer matrices

I have some big integer matrices (1000 x 1000000) that I have to multiply and compute row maxima on.
They contain only 0s and 1s (approximately 99% ones and 1% zeros, no other values).
My problem is memory consumption: currently R uses 8 bytes per entry.
I have looked at sparse matrix classes (e.g. Matrix::sparseMatrix), but it seems I cannot set the default value to 1 instead of 0.
How can I represent these matrices in a memory-efficient way, while still being able to multiply them as matrices and take row maxima?
Preferably it should work with R 2.15 and not require additional libraries.

Second idea: if you have a couple of these matrices, call them X_1 and X_2, let Y_1 = 1*1' - X_1 and Y_2 = 1*1' - X_2; the Y's can be sparse because they are 99% zero. Their product is then
X_1 * X_2 = (1*1' - Y_1) * (1*1' - Y_2) = 1*1'*1*1' - Y_1*1*1' - 1*1'*Y_2 + Y_1*Y_2,
which you can simplify even further: 1*1'*1*1' is just the inner dimension times 1*1', and the two middle terms reduce to row sums of Y_1 and column sums of Y_2.
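A small sketch of this identity using the Matrix package (the package choice and the toy sizes are my own; the question asked for base R, but sparse storage is the whole point of the trick):
library(Matrix)
set.seed(1)
m <- 50; n <- 60; p <- 40 # toy sizes standing in for 1000 x 1000000
X1 <- matrix(rbinom(m * n, 1, 0.99), m, n)
X2 <- matrix(rbinom(n * p, 1, 0.99), n, p)
Y1 <- Matrix(1 - X1, sparse = TRUE) # ~1% nonzero, so sparse storage pays off
Y2 <- Matrix(1 - X2, sparse = TRUE)
## X1 %*% X2 = n*1*1' - Y1*1*1' - 1*1'*Y2 + Y1*Y2, and the middle terms are just
## row sums of Y1 and column sums of Y2 broadcast over the result
prod_via_Y <- n - outer(rowSums(Y1), rep(1, p)) - outer(rep(1, m), colSums(Y2)) + as.matrix(Y1 %*% Y2)
stopifnot(all(prod_via_Y == X1 %*% X2))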

There are several sparse matrix packages (slam, SparseM, Matrix, ...), but I doubt any of them offers a bitwise representation, or even single-char storage, as you would need here. You may have to code that up yourself.
Alternatively, packages like ff allow more compact storage, but AFAIK they will not do matrix operations for you. Maybe you could build that on top of them?

Off the top of my head, I can't think of a packaged solution...
It seems like you could represent this type of data extremely efficiently with run-length encoding by row. From there, you could implement a matrix-vector multiply method for rle objects (which might be hard) and row-max (which should be trivial); see the sketch below.
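For the row-max part, a minimal sketch with base R's rle(), using a small stand-in matrix (the matrix-vector multiply would still need custom code):
set.seed(1)
X <- matrix(rbinom(100 * 200, 1, 0.99), 100, 200)
row_rle <- apply(X, 1, rle) # one rle object per row
row_max <- sapply(row_rle, function(r) max(r$values)) # row maxima straight from the runs
stopifnot(all(row_max == apply(X, 1, max)))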

Since there are only 1% zeros, it would not be difficult to compress. One trivial example:
pseudo.matrix <- function(x){
## store only the dimensions and the positions of the (rare) zeros
list(nrow = nrow(x), ncol = ncol(x), zeroes.cells = which(x == 0))
}
This alone would reduce their memory size significantly. And it would be easy to recover the original matrix:
recover.matrix <- function(x) {
m <- matrix(1, x$nrow, x$ncol)
m[x$zeroes.cells] <- 0 # linear indexing puts the zeros back in place
m
}
I guess it is possible to figure out a way to multiply these pseudo matrices efficiently: each cell of the product equals the inner dimension (the number of columns of the first matrix) minus the number of positions where either factor contributes a zero to that inner product, but I am not sure how easy that would be to implement.


Combine list of matrices into a big.matrix

I have a list of large (35000 x 3) matrices in R, and I want to combine them into a single matrix, but the result would be about 1 billion rows long and would exceed the maximum object size in R.
The bigmemory package allows for larger matrices, but doesn't appear to support rbind to put multiple matrices together.
Is there some other package or technique that supports the creation of a very large matrix from smaller matrices?
Also, before you ask: this is not a RAM issue, simply an R limitation, even on 64-bit R.
You could implement it with a loop:
library(bigmemory)
## Reproducible example
mat <- matrix(1, 50e3, 3)
l <- list(mat)
for (i in 2:100) {
l[[i]] <- mat
}
## Solution
m <- ncol(l[[1]]) ## assuming that all have the same number of columns
n <- sum(sapply(l, nrow))
bm <- big.matrix(n, m)
offset <- 0
for (i in seq_along(l)) {
mat_i <- l[[i]]
n_i <- nrow(mat_i)
ind_i <- seq_len(n_i) + offset
bm[ind_i, ] <- mat_i
offset <- offset + n_i
}
## Verif
stopifnot(offset == n, all(bm[, 1] == 1))
Not quite an answer, but a little more than a comment: are you sure that you can't do it by brute force? R now has long vectors (since version 3.0.0; the question you link to refers to R version 2.14.1): from this page,
Arrays (including matrices) can be based on long vectors provided each of their dimensions is at most 2^31 - 1: thus there are no 1-dimensional long arrays.
while the underlying atomic vector can go up to 2^52 - 1 elements ("in theory ... address space limits of current CPUs and OSes will be much smaller"). That means you should in principle be able to create a matrix with up to (2^31)-1, i.e. about 2.1 billion, rows; since the maximum length of a "long" vector is on the order of 10^15 elements (literally millions of billions), a matrix of 1 billion rows and 3 columns should (theoretically) not be a problem.
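A sketch of the brute-force route with a smaller stand-in list (the same do.call(rbind, ...) call scales to the full problem if you have the RAM; a 10^9 x 3 double matrix needs roughly 24 GB):
l <- replicate(100, matrix(1, 50e3, 3), simplify = FALSE)
big <- do.call(rbind, l) # plain rbind over the whole list
dim(big) # 5,000,000 x 3 here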

Implementing fast numerical calculations in R

I was trying to do an extensive computation in R. Eighteen hours have passed, but RStudio still seems to be working. I'm not sure whether I could have written the script differently to make it faster. I was trying to implement a Crank–Nicolson type scheme over a 50000 by 350 matrix, as shown below:
#defining the discretization of cells
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,length(m)]<-0
#implement the calculation
for(j in 2:(length(m[1,])-1)){
for(i in 2:length(m[[1]])){
m[i,][2:length(m)-1][[j]]<-m[i-1,][[j]]+
D*dt*(m[i-1,][[j+1]]-2*m[i-1,][[j]]+m[i-1,][[j-1]])/(dz^2)-
v*dt*(m[i-1,][[j+1]]-m[i-1,][[j-1]])/(2*dz)
}}
Is there a way to know how long it would take R to finish this? Is there a better way of constructing the numerical calculation? At this point, I feel like Excel could have been faster!!
Just making a few simple optimisations really helps here. The original version of your code would take ~5 days on my laptop. Using a matrix instead of a data frame, and calculating just once the values that are reused in the loop, brings this down to around 7 minutes.
Also think about messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which is faster (as well as much easier to understand). Making this change reduces the runtime by another factor of more than 2, to around 3 minutes.
Putting this together we have
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- matrix(0, t/dt+1, z/dz+2) #extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC = NCOL(m)
NR = NROW(m)
C1 = D*dt / dz^2
C2 = v*dt / (2*dz)
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,NC]<-0
#implement the calculation
for(j in 2:(NC-1)){
for(i in 2:NR){
ma = m[i-1,]
ma.1 = ma[[j+1]]
ma.2 = ma[[j-1]]
m[[i,j]] <- ma[[j]] + C1*(ma.1 - 2*ma[[j]] + ma.2) - C2*(ma.1 - ma.2)
}
}
If you need to go even faster than this, you can try out some more optimisations. For example, see here for how different ways of indexing the same element can have very different execution times. In general it is better to refer to the column first, then the row; you can check this on your own machine with the sketch below.
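A minimal benchmarking sketch (it assumes the microbenchmark package is installed; it is not required for the solution above, and results will vary by machine):
library(microbenchmark)
m <- matrix(runif(1e6), 1000, 1000)
i <- 500; j <- 500
microbenchmark(
row_then_col = m[i, ][j], # extracts the whole (strided) row first
col_then_row = m[, j][i], # extracts a contiguous column first
direct = m[[i, j]] # single-element access
)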
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop in Rcpp instead.

Element-wise operation with two vectors of a data frame in R

My first question here: how can I apply an efficient routine that iterates pairwise over the values of two columns of a given data frame?
To be more specific, consider the following example using the following data frame:
df0 <- data.frame(matrix(c(1,2,2,3,1,3,0.4,0.2,0.2,0.1,0.4,0.1),nrow=6,ncol=2))
colnames(df0) <- c("value","frequency")
The first column is a real value and the second column is a frequency (or weight). NOTE: the weights have to be strictly positive, they might be repeated, and they do not necessarily add up to one (because of the repetition).
I am performing the following LOOP to calculate my function P. This P is supposed to be a number between 0 and 1.
# Define two parameters
K = 1/2
alpha = 0
# LOOP
mattemp <- matrix(,nrow=length(df0$value), ncol=length(df0$value))
for(i in 1:length(df0$value)) {
for(j in 1:length(df0$value)) {
mattemp[i,j] <- df0$frequency[i]^(1+alpha) * df0$frequency[j] * abs(df0$value[i]-df0$value[j])
P <- K * sum(mattemp)
}
}
Basically, my function P is calculating:
P = K * (0.4^(1+alpha) * 0.2 * |1-2| + 0.4^(1+alpha) * 0.1 * |1-3| + ...
This code works perfectly well as long as the matrix is small.
However, I am trying to run this routine on a big matrix (5400 x 5400), and the LOOP does not seem to end.
I already tried to run it with a foreach command (using %dopar%), but that did not work either.
Is there a smart and concise routine that R can handle? It does not need to follow the above structure, as long as it is efficient.
Thank you very much.
Try:
df0$nval <- (df0$value - mean(df0$value)) / sd(df0$value)
ij <- combn(nrow(df0), 2)
foo <- sum(df0$frequency[ij[1, ]] ^ (1 + alpha) * df0$frequency[ij[2, ]] * abs(df0$nval[ij[1, ]] - df0$nval[ij[2, ]]))
P <- K * 2 * foo
Reasoning: basically you are summing over every ordered pair of frequencies and normalized values. We use combn to create each unordered pair once and vectorize the whole thing. Since combn only gives unique combinations, we multiply by 2 to account for the swapped pairs (the diagonal terms can be dropped, as abs(df0$value[i] - df0$value[i]) is 0); note that this symmetry argument relies on alpha = 0, as in your example. We then multiply by K and get P.
It's not clear how you want to normalize, so I just subtracted the mean and divided by the standard deviation. If you meant something else, you can change it accordingly.
Edit1: Big thanks to @alexis_laz for finding a mistake and suggesting improvements that almost double the speed!
Edit2: Adjusted script to fit changed requirements.
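For reference, the question's double loop (without the normalization above) can also be vectorized directly with outer(); this sketch builds two 5400 x 5400 matrices (about 230 MB each), so it trades memory for simplicity:
K <- 1/2
alpha <- 0
## weight matrix: frequency[i]^(1+alpha) * frequency[j]; distance matrix: |value[i] - value[j]|
P <- K * sum(outer(df0$frequency^(1 + alpha), df0$frequency) * abs(outer(df0$value, df0$value, "-")))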

Integer overflow from many-leveled factor with class.ind()?

I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:
FLN <- data.frame(nnet::class.ind(FinelineNumber))
where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).
I keep getting this concerning-looking warning:
In n * (unclass(cl) - 1L) : NAs produced by integer overflow
Memory available to the system is essentially unlimited. I'm not sure what the problem is.
The source code of nnet::class.ind is:
function (cl) {
n <- length(cl)
cl <- as.factor(cl)
x <- matrix(0, n, length(levels(cl)))
x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
dimnames(x) <- list(names(cl), levels(cl))
x
}
.Machine$integer.max is 2147483647. If n*(nlevels - 1L) is greater than this value, you get exactly this warning. Solving for n:
imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
You'll encounter this problem if you have 429583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse), if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e. indexing by rows and columns rather than by absolute location). [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to
x[(1:n) + n * (unclass(cl) - 1)] <- 1
possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
Even if you were able to complete this step, you'd end up with a roughly 650000 x 5000 matrix; stored as integer that is about 12 Gb, and twice that as double:
print(650*object.size(matrix(1L,5000,1000)),units="Gb")
I guess if you've got 100 Gb free that could be OK ...
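A sketch of the sparse route mentioned above (it assumes FinelineNumber is available as a factor in your workspace; the object names are placeholders):
library(Matrix)
## one row per level, one column per observation (the transpose of class.ind), storing only the 1s
FLN_sparse_t <- Matrix::fac2sparse(FinelineNumber)
## or via the model-matrix interface: one row per observation, one column per level
FLN_sparse <- Matrix::sparse.model.matrix(~ FinelineNumber - 1)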

Populating Large Matrix and Computations

I am trying to populate a 25000 x 25000 matrix in a for loop, but R locks up on me. The data has many zero entries, so would a sparse matrix be suitable?
Here is some sample data and code.
x<-c(1,3,0,4,1,0,4,1,1,4)
y<-x
z<-matrix(NA,nrow=10,ncol=10)
for(i in 1:10){
if(x[i]==0){
z[i,]=0
} else{
for(j in 1:10){
if(x[i]==y[j]){
z[i,j]=1
} else{z[i,j]=0
}
}
}
}
One other question: is it possible to do computations on matrices this large? When I perform some calculations on sample matrices of this size, I get an output of NA with an integer-overflow warning, or R completely locks up.
You could vectorize this, and that should help you. Also, if your data is indeed sparse and you can conduct your analysis on a sparse matrix, it is definitely something to consider.
library(Matrix)
# set up all pairs
pairs <- expand.grid(x,x)
# get matrix indices
idx <- which(pairs[,1] == pairs[,2] & pairs[,1] != 0)
# create a matrix filled with zeros instead
z<-matrix(0,nrow=10,ncol=10)
z[idx] = 1
# create empty sparse matrix
z2 <-Matrix(0,nrow=10,ncol=10, sparse=TRUE)
z2[idx] = 1
all(z == z2)
[1] TRUE
The comment by @alexis_lax would make this even simpler and faster. I had completely forgotten about the outer function.
# normal matrix
z = outer(x, x, "==") * (x!=0)
# sparse matrix
z2 = Matrix(outer(x, x, "==") * (x!=0), sparse=TRUE)
To answer your second question: yes, computations can be done on such a big matrix. You just need to approach it more cautiously and use the appropriate tools. Sparse matrices are nice, many typical matrix functions are available for them, and some other packages are compatible. Here is a link to a page with some examples.
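As a small illustration of both points (sparse storage, plus double-precision arithmetic so sums don't hit the integer overflow you saw), rebuilding the toy example from above:
library(Matrix)
x <- c(1,3,0,4,1,0,4,1,1,4)
z2 <- Matrix(outer(x, x, "==") * (x != 0), sparse = TRUE)
z2 %*% z2 # the matrix product is again a sparse Matrix object
rowSums(z2) # row/column summaries work as usual
sum(z2) # stored and summed as doubles, so no integer overflow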
Another thought: if you are working with really large matrices, you may want to look into other packages, like bigmemory, which are designed to deal with R's large overhead.
