Combine list of matrices into a big.matrix

I have a list of large (35000 x 3) matrices in R, and I want to combine them into a single matrix, but the result would be about 1 billion rows long and would exceed the maximum object size in R.
The bigmemory package allows for larger matrices but doesn't appear to support rbind to put multiple matrices together.
Is there some other package or technique that supports the creation of a very large matrix from smaller matrices?
Also, before you ask: this is not a RAM issue, simply an R limitation, even on 64-bit R.

You could implement it with a loop:
library(bigmemory)
## Reproducible example
mat <- matrix(1, 50e3, 3)
l <- list(mat)
for (i in 2:100) {
  l[[i]] <- mat
}
## Solution
m <- ncol(l[[1]]) ## assuming that all have the same number of columns
n <- sum(sapply(l, nrow))
bm <- big.matrix(n, m)
offset <- 0
for (i in seq_along(l)) {
  mat_i <- l[[i]]
  n_i <- nrow(mat_i)
  ind_i <- seq_len(n_i) + offset
  bm[ind_i, ] <- mat_i
  offset <- offset + n_i
}
## Verif
stopifnot(offset == n, all(bm[, 1] == 1))
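If the combined matrix should not even have to fit in RAM, the same filling loop works with a file-backed big.matrix; a minimal sketch (the file names here are placeholders):
bm <- filebacked.big.matrix(n, m,
                            backingfile = "combined.bin",
                            descriptorfile = "combined.desc")
## then fill it block by block exactly as in the loop above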

Not quite an answer, but a little more than a comment: are you sure that you can't do it by brute force? R now has long vectors (since version 3.0.0; the question you link to refers to R version 2.14.1): from this page,
Arrays (including matrices) can be based on long vectors provided each of their dimensions is at most 2^31 - 1: thus there are no 1-dimensional long arrays.
while the underlying atomic vector can go up to 2^52 - 1 elements ("in theory ... address space limits of current CPUs and OSes will be much smaller"). That means you should in principle be able to create a matrix with as many as 2^31 - 1, i.e. about 2.1 billion, rows; and since the maximum length of a "long" vector is on the order of 10^15 elements (literally millions of billions), a matrix of 1 billion rows and 3 columns should (theoretically) not be a problem.
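For the sizes in the question (1e9 rows, 3 columns of doubles), a quick back-of-the-envelope check along those lines:
n_rows <- 1e9; n_cols <- 3
n_rows < 2^31 - 1            # TRUE: within the per-dimension limit for matrices
n_rows * n_cols * 8 / 2^30   # ~22.4 GiB of RAM for a dense double matrix
## so, given enough RAM, x <- matrix(0, nrow = n_rows, ncol = n_cols) should work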

Related

subtract strings with specific width in for recycle

I am trying to use a for loop to extract multiple substrings, in order, from a FASTA sequence.
Here is an example (of course the real one is more than 10 thousand bases long):
eg <- "ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
And here is my code
forsubseq <- function(dna){
  sta <- for (i in 1:floor(width(dna)/100)) {
    seqGC <- Biostrings::subseq(dna, start = 100*i - 99, width = 100) %>%
      Biostrings::letterFrequency(letters = "GC", as.prob = TRUE)
  }
  return(sta)
}
forsubseq(eg)
However, nothing happened after running it, which really confused me. What I want is the GC content of each 100 bp window.
Could anyone kindly offer advice? Thanks.
The Biostrings library is not available for the most recent version of R, but one simplified approach would be to split eg at every n-th character and then use lapply to analyze each block. In this example I counted the number of "GC" pairs using str_count, since I don't have the Biostrings library, but you can swap in the Biostrings::letterFrequency function:
eg <- "ACGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
n <- 10 # you would change to 100
blocks <- seq(1, nchar(eg), n) # prep to separate every n base pairs
splits <- substring(eg, blocks, blocks + n - 1) # separate every n base pairs
lapply(splits,
       function(x) stringr::str_count(x, "GC")) # replace with Biostrings::letterFrequency
The output is a list counting the number of "GC" pairs for each block of n characters (here, 10). If you want a vector of integers representing these data, simply wrap the lapply call in unlist(lapply(...)).
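If what you actually want is GC content (the fraction of G and C characters per window) rather than counts of the literal "GC" pair, a small variation of the same idea works; a sketch reusing the splits object from above:
gc_content <- sapply(splits, function(x) stringr::str_count(x, "[GC]") / nchar(x))
gc_content  # one proportion of G/C characters per block of n characters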

What is causing my "vector memory exhausted" error?

For context: I am running a large simulation that takes many hours and 30+ iterations. Every time, seemingly at random between 16 and 23 iterations in, I get the following error:
I believe I've narrowed down the issue to the following code block (which is not surprising given that this is one of the slowest steps of the algorithm):
adj_mat <- dists_mat
a <- (adj_mat <= cutoff)
b <- (adj_mat > cutoff)
adj_mat[a] <- TRUE
adj_mat[b] <- FALSE
adj_mat is a very large 2,000 x 40,000 distance matrix. In the code block, I am trying to binarize all the cells of this matrix using a cutoff value. Step 1 is to assign the coordinates of all the 1 and 0 cells to the variables a and b, respectively. Step 2 is then to assign 1s and 0s to those coordinates in the matrix.
I just don't understand why this code works for 10+ iterations and then exhausts the vector memory. I'm not saving very much data within the function after each iteration.
Is there perhaps a more memory efficient/less computationally demanding way to binarize that matrix?
I am not sure if this will help, but you can reduce this to only two lines. This avoids creating the temporary index variables and removes the extra assignment steps.
adj_mat <- dists_mat
adj_mat <- adj_mat <= cutoff
We can use split and create a list of output in a single line
lst1 <- split(adj_mat, adj_mat <= cutoff)
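For what it's worth, a leaner sketch that skips the extra index matrices entirely (assuming dists_mat and cutoff as in the question):
adj_mat <- dists_mat <= cutoff       # logical matrix: 4 bytes per cell instead of 8
storage.mode(adj_mat) <- "integer"   # optional: turns TRUE/FALSE into 1/0, still 4 bytes per cell
print(object.size(dists_mat), units = "MB")  # compare the footprints
print(object.size(adj_mat), units = "MB")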

Improving the computational time of R for loops

I am trying to run a spline function over the rows (690075 of them) of a dataframe (Camera1) with 4096 columns (each column represents a position on the x axis), where the input variable to the function is a column of another dataset of the same length (test$vr), using a for loop; but I am having serious computational time issues.
I have tried converting the dataframe to a matrix and storing the output in a list, amongst other things, but to no avail. I have to do the same for two other dataframes (camera2, camera3) of the same size.
Code
# Note camera1 and test$vr are of the same length
# Initialize
final.data1 <- data.frame()
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096),3)
system.time({
  for (i in 1:690075) {
    w1 <- (as.numeric(colnames(camera1[-1]))) * (1.0 + test$vr[i]/299792.458)
    my.data1 <- as.data.frame(t(splinefun(x = w1, y = camera1[i, ][-1])(y1)))
    colnames(my.data1) <- y1
    final.data1 <- bind_rows(final.data1, my.data1)
  }
})
Running on an Ubuntu box with 344 GB RAM and a 30-core Intel(R) Xeon(R) CPU E5-2695 @ 2.30GHz.
Any suggestions would be greatly appreciated.
Thank you.
Without seeing the data it's not easy to optimize your code, but I would start with something along the lines of the following.
final.data1 <- matrix(nrow = 690075, ncol = 4096)
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096), 3)
system.time({
  ## shifted wavelengths, one row per spectrum (690075 x 4095), computed once outside the loop
  w1 <- outer(1.0 + test$vr/299792.458, as.numeric(colnames(camera1[-1])))
  for (i in 1:690075) {
    final.data1[i, ] <- splinefun(x = w1[i, ], y = camera1[i, ][-1])(y1)
  }
})
final.data1 <- as.data.frame(final.data1)
colnames(final.data1) <- y1
Explanation:
I start by defining an object of class matrix to hold the results. I believe I got the dimensions of your final data.frame right. This reduces the running time because
Matrices are much faster than data frames: they are just vectors with a dim attribute, so indexing is fast. Data frames, on the contrary, are lists that can hold all types of data (numeric, character, logical, other lists, etc.), and therefore accessing their members is slow.
Reserving the result's full memory in one operation saves R's memory management routines a lot of work; extending final.data1 at every iteration of the loop is very time consuming.
w1 is computed outside the loop (here with outer(), one row of shifted wavelengths per spectrum), taking advantage of R's vectorized nature. Besides, you were repeating the computation of as.numeric(colnames(camera1[-1])) 690k times!
Test this code and if it doesn't produce the same final result, just say so and I will see if I can do something to debug it.
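One quick way to spot-check a single row against the original approach (a sketch, reusing the names from the question):
i <- 1
w1_i  <- as.numeric(colnames(camera1[-1])) * (1.0 + test$vr[i]/299792.458)
row_i <- splinefun(x = w1_i, y = camera1[i, ][-1])(y1)
all.equal(row_i, unname(unlist(final.data1[i, ])))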
First, remove all instructions that only need to be done once and put them outside the for loop; for example, the colnames and as.numeric calls.
Second, try to vectorize. The w1 calculation can be done once for all rows outside the for loop (e.g. with outer(), one row of wavelengths per value of test$vr) instead of once per iteration.
Third, initialize final.data1 with its final dimensions. Each time a row is added to this data.frame, R creates a new data.frame with one more row and discards the previous one, which takes a long time. Thus, final.data1 <- matrix(NA, ncol = length(y1), nrow = NROW), where NROW is the number of rows (690075 here).
And finally, if you want to use more than one core, try replacing the for loop with a parallelized foreach loop. This is possible because all rows are independent:
require(foreach)
require(doSNOW)
require(iterators)  # provides icount()
cl <- makeCluster(25, type = "FORK")  # FORK not usable on Windows
registerDoSNOW(cl)  # register the cluster
clusterExport(cl, c("objects", "needed", "by", "each", "iteration"), envir = environment())  # for example y1, w1 and camera1
final.data1 <- foreach(i = icount(NROW), .combine = rbind, .inorder = TRUE) %dopar% {
  # your R code
}
stopCluster(cl)
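As a rough alternative sketch, the same row-wise independence can be exploited with parallel::mclapply (Unix only; this assumes w1 has been precomputed with one row of shifted wavelengths per spectrum, as in the previous answer):
library(parallel)
rows <- mclapply(seq_len(nrow(camera1)), function(i) {
  splinefun(x = w1[i, ], y = camera1[i, ][-1])(y1)
}, mc.cores = 25)
final.data1 <- do.call(rbind, rows)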

Populating Large Matrix and Computations

I am trying to populate a 25000 x 25000 matrix in a for loop, but R locks up on me. The data has many zero entries, so would a sparse matrix be suitable?
Here is some sample data and code.
x <- c(1, 3, 0, 4, 1, 0, 4, 1, 1, 4)
y <- x
z <- matrix(NA, nrow = 10, ncol = 10)
for (i in 1:10) {
  if (x[i] == 0) {
    z[i, ] <- 0
  } else {
    for (j in 1:10) {
      if (x[i] == y[j]) {
        z[i, j] <- 1
      } else {
        z[i, j] <- 0
      }
    }
  }
}
One other question: is it possible to do computations on matrices this large? When I perform some calculations on sample matrices of this size, I get NA as output with an integer overflow warning, or R locks up completely.
You could vectorize this and that should help you. Also, if your data is indeed sparse and you can conduct your analysis on a sparse matrix it definitely is something to consider.
library(Matrix)
# set up all pairs
pairs <- expand.grid(x,x)
# get matrix indices
idx <- which(pairs[,1] == pairs[,2] & pairs[,1] != 0)
# create the matrix filled with zeros instead of NA
z <- matrix(0, nrow = 10, ncol = 10)
z[idx] <- 1
# create an empty sparse matrix
z2 <- Matrix(0, nrow = 10, ncol = 10, sparse = TRUE)
z2[idx] <- 1
all(z == z2)
[1] TRUE
The comment by @alexis_lax would make this even simpler and faster. I had completely forgotten about the outer function.
# normal matrix
z = outer(x, x, "==") * (x!=0)
# sparse matrix
z2 = Matrix(outer(x, x, "==") * (x!=0), sparse=TRUE)
To answer your second question, whether computations can be done on such a big matrix: the answer is yes. You just need to approach it more cautiously and use the appropriate tools. Sparse matrices are nice, many typical matrix functions are available for them, and some other packages are compatible. Here is a link to a page with some examples.
Another thought: if you are working with really large matrices, you may want to look into other packages like bigmemory, which are designed to deal with R's large memory overhead.
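To give a sense of scale for the 25000 x 25000 case in the question, a rough size comparison (a sketch with random data, not the real matrix):
25000^2 * 8 / 2^30                                    # ~4.7 GiB as a dense double matrix
library(Matrix)
z_big <- rsparsematrix(25000, 25000, density = 0.01)  # ~1% non-zero entries
print(object.size(z_big), units = "MB")               # tens of MB instead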

R: Big integer matrices

I have some big integer matrices (1000 x 1000000) that I have to multiply and do rowmax on.
They contain 0 and 1 (approx 99% 1 and 1% 0 and no other values).
My problem is memory consumption: Currently R eats 8 bytes per integer.
I have looked at SparseMatrix, but it seems I cannot set the default value to 1 instead of 0.
How can I represent these matrices in a memory efficient way, but so I can still multiply them as matrices and use rowmax?
Preferably it should work with R-2.15 and not require additional libraries.
Second idea: If you have a couple of these matrices, call them X_1 and X_2, let Y_1 = 1*1' - X_1 and Y_2 = 1*1' - X_2; the Y's can be sparse because they are 99% zero. So their product is
X_1 * X_2 = ( 1*1' - Y_1) * (1*1' - Y_2) = 1*1'*1*1' - Y_1*1*1' - 1*1'*Y_2 + Y_1 * Y_2
which you can simplify even further.
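A small numeric check of that identity on toy matrices (a sketch; X1 and X2 play the roles of X_1 and X_2, mostly-one 0/1 matrices as in the question):
set.seed(1)
X1 <- matrix(rbinom(20, 1, 0.99), 4, 5)
X2 <- matrix(rbinom(30, 1, 0.99), 5, 6)
Y1 <- 1 - X1   # the sparse complements
Y2 <- 1 - X2
J <- function(n, m) matrix(1, n, m)   # the 1*1' all-ones matrix
lhs <- X1 %*% X2
rhs <- J(4, 5) %*% J(5, 6) - Y1 %*% J(5, 6) - J(4, 5) %*% Y2 + Y1 %*% Y2
all.equal(lhs, rhs)   # TRUE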
There are several sparse matrix packages (slam, SparseM, Matrix, ...) but I doubt any will do a bitwise representation, or even single char, as you'd need here. You may have to code that up yourself.
Alternatively, packages like ff allow more compact storage but AFAIK will not do matrix ops for you. Maybe you could build that on top of them?
Off the top of my head, I can't think of a packaged solution...
It seems like you could represent this type of data extremely efficiently with run length encoding by row. From there, you could implement a matrix-vector multiply method for rle objects (which might be hard) and row-max (which should be trivial).
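A tiny illustration of the run length encoding idea on a single mostly-one row (a sketch using base R's rle; the multiply method would still have to be written):
row <- c(rep(1, 970), rep(0, 5), rep(1, 25))
r <- rle(row)
str(r)           # a handful of (length, value) pairs instead of 1000 numbers
max(r$values)    # row-max is trivial on the encoded form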
Since there are only 1% 0's it would not be difficult to compress. One trivial example:
pseudo.matrix <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  zeroes.cells <- which(x == 0)
  list(nrow = nrow, ncol = ncol, zeroes.cells = zeroes.cells)
}
This alone would reduce their memory size significantly. And it would be easy to recover the original matrix:
recover.matrix <- function(x) {
  m <- matrix(1, x$nrow, x$ncol)
  m[x$zeroes.cells] <- 0
  m
}
I guess it is possible to figure out a way to efficiently multiply these pseudo matrices, since the result for each cell would be something like the number of columns of the first matrix minus an adjustment regarding the number of zeros in the operation, but I am not sure how easy it would be to do this.
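Row-max, at least, is cheap in this representation; a sketch with a hypothetical helper pseudo.rowmax (not part of the code above):
pseudo.rowmax <- function(p) {
  ## count the zero cells in each row (column-major linear indices mapped to row numbers)
  zeros.per.row <- tabulate((p$zeroes.cells - 1) %% p$nrow + 1, nbins = p$nrow)
  as.integer(zeros.per.row < p$ncol)   # 1 unless the whole row is zero
}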
