As a trivial case, let's say I'm interested in calculating:
r = v*M^t
Where r and v are vectors and M is an extremely large sparse matrix.
I can solve it one of two ways:
r = v*(M*M*M*M...)
r = ((((v*M)M)M)M)...
Where the first approach results in intermediate dense matrices that are impractical to store in RAM (I would need at least tens of terabytes for my target minimum use case). The second approach, instead, always produces intermediate vectors, and it does in fact work in practice.
The problem is that at larger values of t, memory allocations are causing a substantial performance bottleneck.
library(pryr)
n = 20000
v = 1:n
M = matrix(1:(n*n), n) # Not a sparse matrix like my use case, but easier to start with
for (i in 1:10) {
  v = v %*% M
  print(address(v))
}
As the address() function shows, v is being reallocated every iteration. It is not being modified in place. Not only are the memory allocations slowing things down, but according to profvis, the garbage collector is constantly being called as well and taking up a large portion of the time.
So my question is: is there a way to perform this calculation (and potentially others like it) in R without the excess memory allocations and gc() calls happening under the hood?
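For what it's worth, storing M in a genuinely sparse format keeps the per-iteration work proportional to the non-zero entries rather than to n^2. It does not make v update in place, but the only fresh allocation per step is then a single length-n result vector. A minimal sketch with the Matrix package (the density and sizes below are illustrative, not taken from the original problem):

library(Matrix)

n <- 20000
t <- 10

# Illustrative sparse matrix with ~0.1% non-zero entries.
M <- rsparsematrix(n, n, density = 0.001)
v <- matrix(1, nrow = 1, ncol = n)   # row vector

for (i in seq_len(t)) {
  # Each step still allocates a fresh 1 x n result, but with a sparse M
  # the multiplication itself only touches the stored non-zeros.
  v <- v %*% M
}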
I’ve been wracking my brain around this problem all week and could really use an outside perspective. Basically, I’ve built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I’ve generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack is that each call of tree_search doesn't actually return anything; it just appends results to a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
  # Terminate on the fifth loop.
  message(paste("Started loop", loop))
  if (loop == 5) {return(TRUE)}
  # Create a large matrix and do some operation on it.
  bigmx <- matrix(rnorm(10), 50000, 250)
  bigmx <- sin(bigmx^2)
  # Aggregate the matrix and save the output.
  agg <- colMeans(bigmx)
  append <- file.exists("output.csv")
  fwrite(t(agg), file = "output.csv", append = append, row.names = F)
  # Split the matrix into submatrices with 100 columns each.
  ind <- ceiling(seq_len(ncol(bigmx)) / 100)
  lapply(unique(ind), function(i) {
    submx <- bigmx[, ind == i]
    # Pass each submatrix to a subsequent call.
    loop <- loop + 1
    tree_search(submx, loop) # submatrix is used to generate the big matrix in the subsequent call (not shown)
  })
}
# Initiate tree search.
tree_search()
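For reference, the parallel variant being described is presumably the same function with the inner lapply swapped for mclapply; this is my reconstruction of that change, not the original code:

library(data.table)
library(parallel)

# Reconstruction of the parallelised variant described in the question:
# identical to tree_search() above, except the inner lapply is replaced
# by mclapply. Because every recursive call forks its own set of workers,
# forks multiply down the tree and the overhead outweighs any gain.
tree_search_par <- function(submx = NA, loop = 0, cores = 2) {
  message(paste("Started loop", loop))
  if (loop == 5) {return(TRUE)}
  bigmx <- sin(matrix(rnorm(10), 50000, 250)^2)
  agg <- colMeans(bigmx)
  fwrite(t(agg), file = "output.csv", append = file.exists("output.csv"), row.names = F)
  ind <- ceiling(seq_len(ncol(bigmx)) / 100)
  mclapply(unique(ind), function(i) {
    tree_search_par(bigmx[, ind == i], loop + 1, cores)
  }, mc.cores = cores)
}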
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, the anonymous function you pass it gets its own environment on every call, so an assignment like x <- x + 1 inside it creates a local copy of x rather than modifying the x in your calling environment. That’s why you can do this:
x <- 5
lapply(1:10, function(i) {
  x <- x + 1
  x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end x is still 5, which means that each call of lapply was manipulating a separate copy of x. That’s not good if, say, x was actually a large data frame with 10,000 variables. for loops, on the other hand, let you overwrite the same variable on each iteration.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
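As a rough sketch of that pattern (the split into 16 parts comes from the description above, but starting_points and tree_search_from() are placeholders I invented for illustration):

library(parallel)

# Parallelize exactly once, at the top level: each worker runs a whole
# sequential tree search over its chunk of starting points.
starting_points <- split(1:160, rep(1:16, each = 10))    # placeholder data

results <- mclapply(starting_points, function(chunk) {
  lapply(chunk, function(start) tree_search_from(start)) # hypothetical worker
}, mc.cores = 16)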
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule = FALSE, each core only holds one part (10% of the job) at a time. If mc.cores were set to two, for example, the remaining 8 parts would wait until one of the running parts finished before a new one started. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
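A sketch of that throttling pattern (job_parts and process_part() are placeholders, not from the original code):

library(parallel)

job_parts <- split(seq_len(100), rep(1:10, each = 10))  # 10 parts, placeholder data

results <- mclapply(job_parts, function(part) {
  process_part(part)  # placeholder for the real per-part work
}, mc.cores = 2, mc.preschedule = FALSE)
# With mc.preschedule = FALSE, only mc.cores parts are being worked on (and
# held in memory) at any moment; the rest wait until a core frees up.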
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.
I need to make quick calculations (+, *, >) with large 3D arrays (tensors) in R (around 1500 x 150 x 30000). Since these arrays are very sparse (only 0.03% of the entries are non-zero), I first use the as_sptensor function from the 'tensorr' library to convert my tensors to sparse ones, like:
x <- array(data = c(1,0,0,0,0,0,0,1,1,1,1,1) , dim = c(3,2,2))
s <- as_dtensor(x)
s1 <- as_sptensor(s)
And then I do some arithmetic operations, e.g. multiplication:
s1*s1
I am also limited to 8 GB of memory in total, so the result has to fit within that limit as well.
The problem is that when I deal with large tensors like:
A <- some_index_matrix[1:3, 1:1000000]
A2 <- sptensor(A, rep(1,ncol(A)), dims=c(max(A[1,]),max(A[2,]),max(A[3,])))
A2*A2
I fail to get the product within a reasonable time. How can I optimize my code so that such calculations can be carried out within seconds?
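One direction that may be worth trying (a sketch under my own assumptions, not tied to tensorr's internals): for element-wise operations only the stored non-zero entries matter, so you can operate on a coordinate (subscript/value) representation directly, e.g. with data.table. The sizes and column names below are illustrative:

library(data.table)

# Coordinate (COO) representation of a sparse tensor: one row per non-zero.
nz <- 3e5  # illustrative number of non-zeros
tens <- unique(data.table(
  i = sample(1500, nz, replace = TRUE),
  j = sample(150, nz, replace = TRUE),
  k = sample(30000, nz, replace = TRUE),
  val = runif(nz)
), by = c("i", "j", "k"))

# Element-wise square (s1 * s1): only the stored values need to be touched.
tens[, val_sq := val^2]

# Element-wise product of two different sparse tensors: join on the subscripts;
# entries present in only one tensor multiply to zero and simply drop out.
prod_tensor <- merge(tens, tens, by = c("i", "j", "k"))[, .(i, j, k, val = val.x * val.y)]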
On my machine,
m1 = matrix( runif(5*10^7), ncol=10000, nrow=5000 )
uses up about 380 MB. I need to work with many such matrices in memory at the same time (e.g. add or multiply them, or apply functions to them). All in all, my code uses up 4 GB of RAM due to the multiple matrices stored in memory. I am contemplating options to store the data more efficiently, i.e. in a way that uses less RAM.
I have seen the R package bigmemory being recommended. However:
library(bigmemory)
m2 = big.matrix( init = 0, ncol=10000, nrow=5000 )
m2[1:5000,1:10000] <- runif( 5*10^7 )
makes R use about the same amount of memory, as I verified using Windows Task Manager. So I anticipate no big gain. Or am I wrong, and should I be using big.matrix in a different way?
The solution is to work with matrices stored in files, i.e. to set backingfile to something other than NULL in the call to big.matrix().
Working with filebacked big.matrix from package bigmemory is a good solution.
However, assigning to the whole matrix with runif(5*10^7) creates that large temporary vector in memory first. If you then run gc(reset = TRUE), you will see that this memory usage disappears.
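One way to avoid that temporary entirely, before reaching for another package, is to fill the filebacked matrix in column blocks so that only a small random vector exists at any time. A sketch (the file names are my own choices):

library(bigmemory)

# Filebacked matrix: the data live on disk, not in RAM.
m2 <- big.matrix(nrow = 5000, ncol = 10000, init = 0,
                 backingfile = "m2.bin", descriptorfile = "m2.desc")

# Fill 500 columns at a time: the temporary runif() vector is only
# 5000 * 500 doubles (~19 MB) instead of 5 * 10^7 doubles (~380 MB).
for (start in seq(1, 10000, by = 500)) {
  cols <- start:(start + 499)
  m2[, cols] <- runif(5000 * 500)
}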
If you want to initialize your matrix by block (say blocks of 500 columns), you could use the package bigstatsr. It uses objects similar to a filebacked big.matrix (called FBM) and stores them in your temporary directory by default. You could do:
library(bigstatsr)
m1 <- FBM(1e4, 5e3)
big_apply(m1, a.FUN = function(X, ind) {
  X[, ind] <- runif(nrow(X) * length(ind))
  NULL
}, a.combine = 'c', block.size = 500)
Depending on the makeup of your dataset, a sparse matrix could be your best way forward. This is a common and extremely useful way to boost space and time efficiency. In fact, a lot of R packages require that you use sparse matrices.
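If much of your data really is zero, a quick check with the Matrix package shows the kind of saving involved (the 1% density here is purely illustrative):

library(Matrix)

# Dense: 5000 x 10000 doubles is roughly 380 MB.
dense <- matrix(0, nrow = 5000, ncol = 10000)
dense[sample(length(dense), 5e5)] <- runif(5e5)  # ~1% non-zero, illustrative
print(object.size(dense), units = "MB")

# Sparse: only the non-zero values and their indices are stored.
sparse <- Matrix(dense, sparse = TRUE)
print(object.size(sparse), units = "MB")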
I need to solve Ax = b, where A is a symmetric positive semi-definite matrix. This can be implemented efficiently using the Cholesky decomposition. Because the matrix A will have dimensions of at least 25000 x 25000, I cannot waste memory. Therefore I want to use the in-place version of Julia's cholfact:
cholfact!(A, :U, pivot = true)
Compared to
F = cholfact(A, :U, pivot = true)
this would save Gigabytes of memory.
However, after the computation, A is of type Matrix{Float64}, while F has the type CholeskyPivoted{Float64}. As far as I understand, the in-place version loses essential information, such as the pivot vector F.piv. How can I compute the Cholesky decomposition correctly without wasting memory?
You want to combine these two:
F = cholfact!(A, :U, pivot = true)
This returns a CholeskyPivoted, which is indeed what you want. But by using cholfact!, you're saying that you don't care whether A gets destroyed in the process. Consequently, it will use the memory allocated for A for storing the factorization (thus destroying A).
Afterwards, you should only use F, not A, because A has been destroyed. Internally, F will contain a reference to A, since it's storing the factorization in A. This may be clearer if you examine how a CholeskyPivoted is represented; A will be used for that UL field.
I am trying to assign values to a "10000000*6" logical matrix. The process would be: 1) create the matrix; 2) assign a value to each element of the matrix. To simplify my question, I just show how one value gets assigned to one element of the matrix.
Here are the codes:
m <- matrix(data = NA, ncol= 6, nrow= 10000000)
m[1,1] <- 1
Error: cannot allocate vector of size 228.9 Mb
There is no error when creating the "10000000*6" logical matrix, but there is one when assigning the value.
I also tried the same task with a smaller matrix (100*6), and things worked well.
Here are the codes:
m <- matrix(data = NA, ncol= 6, nrow= 100)
m[1,1] <- 1
Could anyone help me to deal with the bigger matrix?
It may come as a bit of a surprise but R is a bit of a procrastinator. It is possible that a command to "create" an object may not actually do so until there is a real demand for action, such as populating a matrix with a "real" value. The term to describe this is "pass-by-promise". Furthermore, assignment to an existing object may construct duplicates or even triplicates of the object which will occupy space until they are garbage collected.
So here's what you do. Exit out of R. Power down. Restart your system with a minimum of other applications, since they all take memory. Restart R and run your commands. I predict success if you have the typical 4GB memory available before system load. 228.9 Mb is not really very big, but in your case it was the straw that broke the camel's back. R needs to be able to find contiguous memory for each object and garbage collection will typically not defragment memory.
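To make the "duplicates on assignment" point above concrete: the matrix created with data = NA is stored as logical, and assigning the numeric value 1 forces R to duplicate (and coerce) the entire matrix, which is exactly the kind of extra full-size allocation that tips a tight system over the edge. Keeping the type consistent avoids the coercion (a small sketch, assuming the values really are logical):

m <- matrix(data = NA, ncol = 6, nrow = 10000000)  # logical storage, ~229 MB
typeof(m)        # "logical"

m[1, 1] <- 1     # numeric value: the whole matrix is duplicated and coerced to double
typeof(m)        # "double"

m2 <- matrix(data = NA, ncol = 6, nrow = 10000000)
m2[1, 1] <- TRUE # logical value: no coercion needed
typeof(m2)       # "logical"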