Getting "node stack overflow" when cbind multiple sparse matrices - r

I have 100,000 sparse matrices ("dgCMatrix") stored in a list object. Every matrix has the same number of rows (8,000,000), and the list is approximately 25 Gb in size. Now when I do:
do.call(cbind, theListofMatrices)
to combine all matrices into one big sparse matrix, I got "node stack overflow". Actually, I can't even do this with only 500 elements out of that list, which should output a sparse matrix with a size of only 100 Mb.
My guess is that the cbind() function converted the sparse matrices to normal dense matrices and thereby caused the stack overflow?
Actually, I have tried something like this:
tmp = do.call(cbind, theListofMatrices[1:400])
this works fine, and tmp is still a sparse matrix with a size of 95 Mb, and then I tried:
> tmp = do.call(cbind, theListofMatrices[1:410])
Error in stopifnot(0 <= deparse.level, deparse.level <= 2) :
node stack overflow
and then the error occurred. However, I am having no trouble doing something like:
cbind(tmp, tmp, tmp, tmp)
thus, I believe it has something to do with do.call()
Reduce() seems to solve my problem, though I still don't know why do.call() crashes.

The problem is not in do.call() but due to the way cbind from the Matrix package is implemented. It uses recursion to bind the individual arguments together. For instance, Matrix::cbind(mat1, mat2, mat3) is translated to something along the lines of Matrix::cbind(mat1, Matrix::cbind(mat2, mat3)).
Since do.call(cbind, theListofMatrices) is basically cbind(theListofMatrices[[1]], theListofMatrices[[2]], ...), cbind receives one argument per matrix. With that many arguments the recursion is nested too deeply and fails with "node stack overflow".
Thus, Ben's comment to use Reduce() is a good way to work around that issue since it avoids the recursion and replaces it with an iteration:
tmp <- Reduce(cbind, theListofMatrices[-1], theListofMatrices[[1]])
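A middle ground between one huge do.call() and a fully pairwise Reduce() is to bind the list in chunks, so that no single cbind call sees more than a handful of arguments. The following is only a sketch on small random matrices: the chunk size of 10, the matrix dimensions, and the names mats, chunks and partial are all illustrative assumptions, and it assumes the Matrix package is installed.

```r
library(Matrix)

set.seed(1)
# 50 small sparse matrices with the same number of rows (illustrative data)
mats <- replicate(50, rsparsematrix(20, 3, density = 0.1), simplify = FALSE)

# Split the list into chunks of at most 10 matrices each
chunks <- split(mats, ceiling(seq_along(mats) / 10))

# Bind each chunk on its own, then bind the chunk results together
partial  <- lapply(chunks, function(ch) do.call(cbind, ch))
combined <- do.call(cbind, unname(partial))

# Same result as the pairwise approach
reduced <- Reduce(cbind, mats[-1], mats[[1]])

dim(combined)
```

Each call now has at most 10 arguments, so the nesting depth stays small no matter how long the full list is.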

In R, a 2-column matrix can have up to 2^30-1 = 1,073,741,823 rows. So I would check the number of rows and the available RAM to make sure the combined matrix can actually fit.

Related

Block-diagonal matrix from array in R

How can we construct a block-diagonal matrix from a three-dimensional array in R? There are several possibilities when starting from a list of matrices (e.g., Reduce(magic::adiag, list_of_matrices)) or individual matrices (e.g., magic::adiag(matrix1, matrix2)). However, I could not find anything when we start with an array:
matrices <- array(NA, c(3,3,2))
matrices[,,1] <- diag(1,3)
matrices[,,2] <- matrix(rnorm(9), 3, 3)
Are there any efficient solutions for constructing the corresponding 9x9 block matrix or is it a better idea to just convert to a list and use magic::adiag? The latter seems relatively inefficient, especially when the number of matrices is large.
I guess converting to a list and using magic::adiag is the fastest way. Try the following lines of code, which are rather short and which I use frequently:
library(magic)
arr <- array(1:8, c(2,2,3))
do.call("adiag", lapply(seq(dim(arr)[3]), function(x) arr[ , , x]))
This essentially reduces to a one-liner but uses lists.
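If you would rather avoid the intermediate list, one alternative is to preallocate the result and copy each slice into its diagonal block. This is a minimal base-R sketch and not what magic::adiag does internally; the names k, p, q and out are illustrative.

```r
arr <- array(1:12, c(2, 2, 3))   # three 2 x 2 blocks (illustrative data)

p <- dim(arr)[1]   # rows per block
q <- dim(arr)[2]   # columns per block
k <- dim(arr)[3]   # number of blocks

# Preallocate the block-diagonal result, filled with zeros
out <- matrix(0, nrow = p * k, ncol = q * k)

# Copy slice i into the i-th diagonal block
for (i in seq_len(k)) {
  rows <- (i - 1) * p + seq_len(p)
  cols <- (i - 1) * q + seq_len(q)
  out[rows, cols] <- arr[, , i]
}

dim(out)
```

The loop runs once per block rather than once per element, so it should stay cheap even when the number of matrices is large.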

How to assign submatrices in elements of a list

For example:
Let M be some m x n matrix where n is large enough to make manual entry impossible.
tmp_list[1] <- M[,1:10]
tmp_list[2] <- M[,11:20]
.
.
.
tmp_list[last] <- M[,(end - 9):end]
The problem I'm working on is a sort of Monte Carlo, repeating an experiment involving a random m x n matrix 100K times. I'm still pretty new to R. I've done it using a for loop, but it obviously took a very long time. So I'm hoping to assign each "experiment" to an element of a list and use lapply.
Let's take the easy case, and you can expand it from there.
Say n=100; build your start indices:
n<-100
byParam<-10
starts<-seq(1, n-(byParam-1), by=byParam)
then lapply
tmp_list<-lapply(starts, function(startIndex) M[, startIndex:(startIndex+(byParam-1))])
This is just one way to do it. It becomes a bit more complicated if n is not a nice multiple of 10 (or whatever you set "byParam" equal to). If that is the case, you can build both your start and end indices and then use mapply instead:
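#given start and end indices (vectors starts and ends of equal length)
tmp_list<-mapply(function(startInd, endInd){
M[, startInd:endInd, drop=FALSE]},
startInd=starts, endInd=ends, SIMPLIFY=FALSE) #SIMPLIFY=FALSE keeps the result a list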
Now lapply and mapply are still iterative, so I wouldn't expect massive improvement on time efficiency
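For the case where n is not a multiple of the chunk width, the end indices can be derived from the start indices by capping them at n. The sketch below assumes a small 3 x 25 matrix M purely for illustration; the ends vector is not defined in the answer above, so the pmin() construction is one possible way to build it.

```r
M <- matrix(seq_len(3 * 25), nrow = 3)   # 3 x 25 example matrix (assumed data)

n       <- ncol(M)
byParam <- 10

starts <- seq(1, n, by = byParam)        # first column of each chunk
ends   <- pmin(starts + byParam - 1, n)  # last column, capped at n

# Map() is mapply() with SIMPLIFY = FALSE, so the result is always a list
tmp_list <- Map(function(s, e) M[, s:e, drop = FALSE], starts, ends)

sapply(tmp_list, ncol)
```

The last chunk simply ends up narrower than the others (5 columns here instead of 10).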
EDIT
After discussion in the comments, here is a solution for the entire set up, not just the above question
tmp_list<-lapply(1:1000, function(i){
vect<-sample(c(0,1), 10*1000, replace=TRUE)
dim(vect)<-c(10, 1000)
vect
})
Let's break this down, it makes everything very simple.
We first create a random sample of 1's and 0's of length 10*1000 (the number of elements in each sub-matrix). We can then neatly convert that vector to a matrix by assigning its dim attribute to be c(10, 1000), which reshapes it to 10 rows and 1000 columns. That matrix is the function's return value, so lapply stores it in the list at index i. We lapply over 1:1000, i.e. iterate 1000 times.
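To see that setting the dim attribute is just a reshape, it can be compared against a plain matrix() call on the same vector. A small sketch, with m1 and m2 as illustrative names:

```r
set.seed(42)
vect <- sample(c(0, 1), 10 * 1000, replace = TRUE)

# Reshape by assigning the dim attribute (fills column by column)
m1 <- vect
dim(m1) <- c(10, 1000)

# The same reshape via matrix(), which also fills column by column by default
m2 <- matrix(vect, nrow = 10, ncol = 1000)

identical(m1, m2)
```

Both forms produce the same 10 x 1000 matrix, so which one to use is mostly a matter of taste.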

mapply - passing row and column of element as argument

I'm new to R programming and I know I could write a loop to do this, but everything I read says that for simplicity it's best to avoid loops and use apply instead.
I have a matrix and I would like to run this function on each element in the matrix.
cellresidue <- function(i,j){
result <- (cluster[i,j] - cluster.I[i,] - cluster.J[j,] - cluster.IJ)/(cluster.N*cluster.M)
return (result)
}
i= element row
j= element column
cluster.J is a matrix of column means
cluster.I is a matrix of row means
cluster.IJ is the mean of the entire matrix named cluster
What I can't figure out is how to get the row and column of the element that mapply is working on (I think I should use the row() and col() functions), and how to pass those as arguments to mapply or apply.
There is no need for loops or *apply functions. You can just use plain matrix operations:
nI <- nrow(cluster)
nJ <- ncol(cluster)
cluster.I <- matrix(rowMeans(cluster), nI, nJ, byrow = FALSE)
cluster.J <- matrix(colMeans(cluster), nI, nJ, byrow = TRUE)
cluster.IJ <- matrix( mean(cluster), nI, nJ)
residue.mat <- (cluster - cluster.I - cluster.J - cluster.IJ) /
(cluster.N * cluster.M)
(You did not explain what cluster.N and cluster.M are but I assume they are scalars)
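Here is a small self-contained check of the vectorized approach on mock data, using row() and col() to look up the matching row and column mean for every element. It keeps the formula exactly as written in the question. Setting cluster.N and cluster.M to the number of rows and columns is purely an assumption for the sake of a runnable example, and the names row.means, col.means and grand.mean are illustrative.

```r
set.seed(1)
cluster <- matrix(rnorm(12), nrow = 3, ncol = 4)   # mock data

# Assumption: cluster.N and cluster.M are scalars (here the matrix dimensions)
cluster.N <- nrow(cluster)
cluster.M <- ncol(cluster)

row.means  <- rowMeans(cluster)
col.means  <- colMeans(cluster)
grand.mean <- mean(cluster)

# row(cluster) and col(cluster) give each element's row and column index,
# so indexing the mean vectors with them lines the means up element by element
residue.mat <- (cluster - row.means[row(cluster)] - col.means[col(cluster)] -
                grand.mean) / (cluster.N * cluster.M)

# Spot check one element against the element-wise formula from the question
i <- 2; j <- 3
by.hand <- (cluster[i, j] - row.means[i] - col.means[j] - grand.mean) /
           (cluster.N * cluster.M)
all.equal(residue.mat[i, j], by.hand)
```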
It is not clear from your question what you are trying to do. It is best on this site to provide some mock data (preferably generated by the code, not pasted), and then show what form the end result should look like. It seems that the apply family is not what you seek.
Quick disambiguation between apply, sapply and mapply:
#providing data for examples
X=matrix(rnorm(9),3,3)
apply: apply a function to either columns (2) or rows (1) of a matrix or array
#here, sum by columns, same as colSums(X)
apply(X, 2, sum)
sapply: apply a function against (usually) a list of objects
#create a list with three vectors
mylist=list(1:4, 5:10, c(1,1,1))
#get the mean of each vector
sapply(mylist, mean)
#subtract 2 from each element of X, same as c(X-2)
sapply(X, FUN=function(x) x-2)
mapply: a multivariate version of sapply, taking an arbitrary number of arguments. Never had much use of it… Some rock-bottom examples:
#same as c(1,2,3,4) + c(15,16,17,18)
mapply(sum, 1:4, 15:18)
#same as c(X+X), the vectorized matrix sum
mapply(sum, X, X)
Side note: It's perfectly ok to use loops in R; use whichever best suits the way you think. The issue is that with a "really big" number of iterations you could hit bottlenecks, depending on your patience. There are two solutions to this: rewrite your function in C/FORTRAN (and boost speed), or use built-in functions if applicable (which are, by the way, often written in C or FORTRAN).

R assign several list elements the same object

I currently have a loop - well actually a loop in loop, in a simulation model which gets slow with larger numbers of individuals. I've vectorised most of it and made it a heck of a lot faster. But there's a part where I assign multiple elements of a list as the same thing, simplifying a big loop to just the task I want to achieve:
new.matrices[[length(new.matrices)+1]]<-old.matrix
With each iteration of the loop the line above is called, and the same matrix object is assigned to the next new element of a list.
I'm trying to vectorize this - if possible, or make it faster than a loop or apply statement.
So far I've tried stuff along the lines of:
indices <- seq(from = length(new.matrices) + 1, to = length(new.matrices) + reps)
new.matrices[indices] <- old.matrix
However this results in the message:
Warning message:
In new.effectors[effectorlength] <- matrix :
number of items to replace is not a multiple of replacement length
It also tries to assign one value of the old.matrix to one element of new.matrices like so:
[[1]]
[1] 8687
[[2]]
[1] 1
[[3]]
[1] 5486
[[4]]
[1] 0
When the desired result is one list element = one whole matrix, a copy of old.matrix
Is there a way I can vectorize sticking a matrix in list elements without looping? With loops how it is currently implemented we are talking many thousands of repetitions which slows things down considerably, hence my desire to vectorize this if possible.
Probably you already solved your problem, anyway, the issue in your code
new.matrices[indices] <- old.matrix
was caused by trying to replace some objects (the NULL elements in your new.matrices list) with something different, a matrix. R therefore coerces old.matrix into a vector and tries to assign each single value to a different list element. That's why you got this result, and when, say, reps is 4 or 8 and old.matrix is NOT a 2 x 2 matrix, you also get the warning. Doing
new.matrices[indices] <- list(old.matrix)
will work, and R will replicate the single element list list(old.matrix) "reps" times automatically.
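A minimal runnable illustration of that fix, using a small made-up old.matrix and reps = 3 (both assumptions for the example):

```r
old.matrix   <- matrix(1:6, nrow = 2)   # illustrative matrix
new.matrices <- list()
reps         <- 3

indices <- seq(from = length(new.matrices) + 1,
               to   = length(new.matrices) + reps)

# Wrapping the matrix in list() makes it ONE replacement item,
# which R recycles across all of the selected list positions
new.matrices[indices] <- list(old.matrix)

length(new.matrices)
identical(new.matrices[[2]], old.matrix)
```

If the list is being built from scratch, rep(list(old.matrix), reps) gives the same result in one step.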

R colon operator on list of matrices

I've created a list of matrices in R. In all matrices in the list, I'd like to "pull out" the collection of matrix elements of a particular index. I was thinking that the colon operator might allow me to implement this in one line. For example, here's an attempt to access the [1,1] elements of all matrices in a list:
myList = list() #list of matrices
myList[[1]] = matrix(1:9, nrow=3, ncol=3, byrow=TRUE) #arbitrary data
myList[[2]] = matrix(2:10, nrow=3, ncol=3, byrow=TRUE)
#I expected the following line to output myList[[1]][1,1], myList[[2]][1,1]
slice = myList[[1:2]][1,1] #prints error: "incorrect number of dimensions"
The final line of the above code throws the error "incorrect number of dimensions."
For reference, here's a working (but less elegant) implementation of what I'm trying to do:
#assume myList has already been created (see the code snippet above)
slice = c()
for(x in 1:2) {
slice = c(slice, myList[[x]][1,1])
}
#this works. slice = [1 2]
Does anyone know how to do the above operation in one line?
Note that my "list of matrices" could be replaced with something else. If someone can suggest an alternative "collection of matrices" data structure that allows me to perform the above operation, then this will be solved.
Perhaps this question is silly...I really would like to have a clean one-line implementation though.
Two things. First, the difference between [ and [[. The relevant sentence from ?'[':
The most important distinction between [, [[ and $ is that the [ can
select more than one element whereas the other two select a single
element.
So you probably want to do myList[1:2]. Second, you can't combine subsetting operations in the way you describe. Once you do myList[1:2] you will get a list of two matrices. A list typically has only one dimension, so doing myList[1:2][1,1] is nonsensical in your case. (See comments for exceptions.)
You might try lapply instead: lapply(myList,'[',1,1).
If your matrices will all have same dimension, you could store them in a 3-dimensional array. That would certainly make indexing and extracting elements easier ...
## One way to get your data into an array
a <- array(c(myList[[1]], myList[[2]]), dim=c(3,3,2))
## Extract the slice containing the upper left element of each matrix
a[1,1,]
# [1] 1 2
This works:
> sapply(myList,"[",1,1)
[1] 1 2
edit: oh, sorry, I see almost the same idea toward the end of an earlier answer. But sapply probably comes closer to what you want, anyway
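For completeness, here is a sketch of two closely related one-liners: vapply(), which works like sapply() but checks the type and length of each result, and simplify2array(), which stacks equally sized matrices into a 3-dimensional array. The variable names are illustrative.

```r
myList <- list(matrix(1:9,  nrow = 3, ncol = 3, byrow = TRUE),
               matrix(2:10, nrow = 3, ncol = 3, byrow = TRUE))

# vapply: like sapply, but each result must be a single integer
slice <- vapply(myList, function(m) m[1, 1], integer(1))

# Stack the matrices into a 3 x 3 x 2 array, then index the third dimension
a <- simplify2array(myList)
slice2 <- a[1, 1, ]

slice
```

vapply() fails loudly if one of the list elements is not what you expected, which makes it a bit safer than sapply() inside functions.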
