What is causing my "vector memory exhausted" error?

For context: I am running a large simulation that takes many hours and 30+ iterations. Every time, seemingly at random somewhere between iteration 16 and 23, I hit the "vector memory exhausted" error from the title.
I believe I've narrowed down the issue to the following code block (which is not surprising given that this is one of the slowest steps of the algorithm):
adj_mat <- dists_mat
a <- (adj_mat <= cutoff)
b <- (adj_mat > cutoff)
adj_mat[a] <- TRUE
adj_mat[b] <- FALSE
adj_mat is a very large 2,000 x 40,000 distance matrix. In the code block, I am trying to binarize all the cells of this matrix using a cutoff value. Step 1 assigns the positions of the cells that should become 1 and those that should become 0 to the variables a and b, respectively. Step 2 then writes 1s and 0s into those positions of the matrix.
I just don't understand why this code works for 10+ iterations and then exhausts the vector memory. I'm not saving very much data within the function after each iteration.
Is there perhaps a more memory efficient/less computationally demanding way to binarize that matrix?

I am not sure if this will help, but you can reduce this to just two lines. This avoids creating the temporary index variables a and b and removes the extra assignment steps.
adj_mat <- dists_mat
adj_mat <- adj_mat <= cutoff
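To see where the memory goes, here is a hedged illustration on a smaller stand-in matrix (the real dists_mat is not shown in the question; the sizes in the comment refer to the full 2,000 x 40,000 case): the single comparison produces one logical matrix, whereas the original block also materialises the index matrices a and b alongside the numeric copy.
# Rough illustration with smaller stand-in data: the full 2,000 x 40,000 double matrix
# is ~640 MB, and each logical matrix such as a or b adds roughly another 320 MB.
dists_mat <- matrix(runif(2000 * 4000), nrow = 2000)  # stand-in for the real data
cutoff <- 0.5
adj_mat <- dists_mat <= cutoff   # a single logical matrix, no index copies
print(object.size(adj_mat), units = "MB")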

We can also use split() to create a list of the values on either side of the cutoff in a single line:
lst1 <- split(adj_mat, adj_mat <= cutoff)

Related

The for loop shows NA for every iteration except for the last one

I wanted to run t.test() on every row of the matrix I have constructed. Then I tried to pull out the values of the confidence intervals and save them in separate vectors, with one output per iteration. However, after I run the code I get L1=NA NA NA....8.155677. I would be grateful if you could point out the mistakes.
(I understand there are numerous ways to write this code more cleanly, but I tried to write it step by step.)
set.seed(1234)
n = 24   # sample size, i.e. the number of RVs
N = 100  # number of extractions, i.e. the number of sums for each RV
X = rnorm(N*n, 9, 1.5) # generate the RVs
XMat = matrix(X, nrow = N)
#Problem Part:
L1=c()
L2=c()
for (i in N)
{
  s = XMat[i, 1:n]
  K = t.test(s, conf.level = 0.95)
  M = K$conf.int
  l1 = M[1]
  l2 = M[2]
  L1[i] = l1
  L2[i] = l2
}
Change the loop control to:
for (i in seq(N))
As written, for(i in N) runs the loop body only once, with i equal to the single value N, so only L1[N] and L2[N] are filled in and every other position stays NA.
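For reference, a sketch of the corrected loop, assuming n, N and XMat from the question's setup, with L1 and L2 pre-allocated instead of grown inside the loop:
# Corrected loop using the variables defined in the question.
L1 <- numeric(N)
L2 <- numeric(N)
for (i in seq(N)) {
  K <- t.test(XMat[i, 1:n], conf.level = 0.95)
  L1[i] <- K$conf.int[1]
  L2[i] <- K$conf.int[2]
}
head(L1)  # now every iteration fills in a value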

Implementing fast numerical calculations in R

I was trying to do an extensive computation in R. Eighteen hours have passed and my RStudio session still appears to be working. I'm not sure if I could have written the script in a different way to make it faster. I was trying to implement a Crank–Nicolson type method over a 50000 by 350 matrix, as shown below:
#defining the discretization of cells
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,length(m)]<-0
#implement the calculation
for(j in 2:(length(m[1,])-1)){
  for(i in 2:length(m[[1]])){
    m[i,][2:length(m)-1][[j]] <- m[i-1,][[j]] +
      D*dt*(m[i-1,][[j+1]] - 2*m[i-1,][[j]] + m[i-1,][[j-1]])/(dz^2) -
      v*dt*(m[i-1,][[j+1]] - m[i-1,][[j-1]])/(2*dz)
  }
}
Is there a way to know how long it would take R to finish? Is there a better way of constructing the numerical calculation? At this point, I feel like Excel could have been faster!
Just making a few simple optimisations really helps here. The original version of your code would take roughly 5 days on my laptop. Using a matrix instead of a data.frame, and calculating just once the values that are reused in the loop, brings this down to around 7 minutes.
Also, look out for messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which is faster (as well as much easier to understand). Making this change further reduces the runtime by a factor of more than two, to around 3 minutes.
Putting this together we have
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- matrix(0, t/dt+1, z/dz+2) #extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC = NCOL(m)
NR = NROW(m)
C1 = D*dt / dz^2
C2 = v*dt / (2*dz)
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,NC]<-0
#implement the calculation
for (j in 2:(NC-1)) {
  for (i in 2:NR) {
    ma = m[i-1, ]
    ma.1 = ma[[j+1]]
    ma.2 = ma[[j-1]]
    m[[i, j]] <- ma[[j]] + C1*(ma.1 - 2*ma[[j]] + ma.2) - C2*(ma.1 - ma.2)
  }
}
If you need to go even faster than this, you can try out some more optimisations. For example, see here for how different ways of indexing the same element can have very different execution times. In general, it is better to refer to the column first, then the row.
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop in Rcpp instead.
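For example, here is a rough, untimed sketch using Rcpp::cppFunction, keeping the same loop order and update rule as the R version above; the function name fillMatrix is just illustrative, and m, C1, C2 are as defined earlier.
# Hedged sketch: same double loop, moved to compiled C++ code.
library(Rcpp)
cppFunction('
NumericMatrix fillMatrix(NumericMatrix m, double C1, double C2) {
  int NR = m.nrow(), NC = m.ncol();
  for (int j = 1; j < NC - 1; ++j) {    // columns 2..(NC-1) in R terms
    for (int i = 1; i < NR; ++i) {      // rows 2..NR in R terms
      double up = m(i - 1, j + 1), down = m(i - 1, j - 1), mid = m(i - 1, j);
      m(i, j) = mid + C1 * (up - 2 * mid + down) - C2 * (up - down);
    }
  }
  return m;
}')
m <- fillMatrix(m, C1, C2)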

How to efficiently do cross-validation with big.matrix in R?

I have a function, as follows, that takes a design matrix X of class big.matrix as input and predicts the responses.
NOTE: the size of matrix X is over 10 GB. So I cannot load it into memory. I used read.big.matrix() to generate backing files X.bin and X.desc.
myfun <- function(X) {
## do something with X. class(X) == 'big.matrix'
}
My question is: how can I do cross-validation efficiently with this huge big.matrix?
My attempt (it works, but is time-consuming):
Step 1: for each fold, get indices for training idx.train and test idx.test;
Step 2: divide X into X.train and X.test. Since X.train and X.test are also very large, I have to store them as big.matrix, and create associated backing files (.bin, .desc) for the training and test sets for each fold.
Step 3: feed the X.train to build the model, and predict responses for X.test.
The time-consuming part is Step 2, where I have to create backing files for the training and test sets (which is almost like copying the original big matrix) many times. For example, suppose I do 10-fold cross-validation: Step 2 takes over 30 minutes to create the backing files for all 10 folds!
To solve the issue in Step 2, I think maybe I can divide the original matrix into 10 sub-matrices (of class big.matrix) just once. Then, for each fold, I would use one portion for testing and combine the remaining 9 portions into one big matrix for training. But the new issue is that there is no way to combine smaller big.matrix objects into a larger one efficiently without copying.
Of course I can do distributed computing for this cross validation procedure. But I am just wondering whether there is a better way to speed up the procedure if just using a single core.
Any ideas? Thanks in advance.
UPDATE:
It turns out that @cdeterman's answer doesn't work when X is very large. The reason is that the mpermute() function permutes the rows by essentially doing a copy/paste. mpermute() calls ReorderRNumericMatrix() in C++, which in turn calls the reorder_matrix() function. That function reorders the matrix by looping over all columns and rows and copying values. See the source code here.
Are there any better ideas for solving my problem? Thanks.
END UPDATE
You will want to use the sub.big.matrix function. This avoids any further copies and points to the same original data. However, it can currently only subset contiguous rows, so you will want to permute your rows first.
# Step 1 - generate random indices and permute the rows in place
idx <- sample(nrow(X), nrow(X))
mpermute(X, idx)
# Step 2 - create your folds
max <- nrow(X)/10 # assuming 10 folds
idx_list <- split(seq(nrow(X)), ceiling(seq(nrow(X))/max))
# Step 3 - list of sub.big.matrix objects (one view per fold)
sm_list <- lapply(idx_list, function(x) sub.big.matrix(X, firstRow = x[1], lastRow = x[length(x)]))
You now have the original big.matrix split into 10 different matrices that you can use as you like.
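A minimal usage sketch, assuming myfun() and the rest of the modelling code from the question; combining the remaining folds for training without copying is the part that, per the update above, still needs a better answer.
# Hedged sketch: each sub.big.matrix is a view into the original backing file,
# so iterating over the folds copies no data.
for (k in seq_along(sm_list)) {
  X.test <- sm_list[[k]]
  ## fit on the other folds, then predict with e.g. myfun(X.test)
}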

Optimize variance calculation, for loop too slow

Here is the next step of the question answered at this link: Apply function too slow in r.
I have to calculate a specific formula per row for a lot of species. The formula corresponds to a variance calculation and therefore needs the result obtained in the link above.
My current script uses a for loop, which is naturally very slow. I simplified the problem in the following script, using a simple data frame called az.
az=data.frame(c(1,2,10),c(2,4,20),c(3,6,30))
colnames(az)=c("a","b","c")
# Necessary number calculated in step 1 (see link above)
m <- as.matrix(az)
m[is.na(m)] <- 0 #remove NA from sums
step1 = as.vector(m %*% m[nrow(m),])
# Initial for loop
prov=0 # prov for provisional number
for (i in 1:nrow(az)) {
  for (j in 1:ncol(az)) {
    prov = prov + az[i,j]*az[nrow(az),j]
    prov = prov + az[i,j]*(az[nrow(az),j] - step1[i])^2
  }
  print(prov)
  prov = 0
}
As I have to repeat the operation for a huge number of species, I was wondering if anyone has a more efficient solution, maybe using vectorized expressions.
Kind regards.
This code will return the same values that your code prints out, but more efficiently.
> n<-nrow(m)
> mm<-t(m)
> prov<-mm*mm[,n]
> prov<-prov+mm*(mm[,n]-step1[col(mm)])^2
> colSums(prov)
[1] 82140 791480 113717400
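Since the same calculation has to be repeated for many species, it may help to wrap the vectorized version in a small function and call it per species (for example with lapply over a list of matrices). The name species_var below is illustrative; m and step1 are assumed to be built per species exactly as in the question.
# Hedged wrapper around the vectorized calculation above.
species_var <- function(m, step1) {
  n <- nrow(m)
  mm <- t(m)
  colSums(mm * mm[, n] + mm * (mm[, n] - step1[col(mm)])^2)
}
species_var(m, step1)  # returns the same values as printed above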

Populating Large Matrix and Computations

I am trying to populate a 25000 x 25000 matrix in a for loop, but R locks up on me. The data has many zero entries, so would a sparse matrix be suitable?
Here is some sample data and code.
x<-c(1,3,0,4,1,0,4,1,1,4)
y<-x
z<-matrix(NA,nrow=10,ncol=10)
for (i in 1:10) {
  if (x[i] == 0) {
    z[i,] = 0
  } else {
    for (j in 1:10) {
      if (x[i] == y[j]) {
        z[i,j] = 1
      } else {
        z[i,j] = 0
      }
    }
  }
}
One other question: is it possible to do computations on matrices this large? When I perform some calculations on sample matrices of this size, I get an output of NA with an integer-overflow warning, or R completely locks up.
You could vectorize this, and that should help you. Also, if your data is indeed sparse and you can conduct your analysis on a sparse matrix, it is definitely something to consider.
library(Matrix)
# set up all pairs
pairs <- expand.grid(x,x)
# get matrix indices
idx <- which(pairs[,1] == pairs[,2] & pairs[,1] != 0)
# create an empty matrix filled with zeros instead of NA
z <- matrix(0, nrow = 10, ncol = 10)
z[idx] = 1
# create empty sparse matrix
z2 <-Matrix(0,nrow=10,ncol=10, sparse=TRUE)
z2[idx] = 1
all(z == z2)
[1] TRUE
The comment by @alexis_lax would make this even simpler and faster. I had completely forgotten about the outer() function.
# normal matrix
z = outer(x, x, "==") * (x!=0)
# sparse matrix
z2 = Matrix(outer(x, x, "==") * (x!=0), sparse=TRUE)
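One hedged caveat: outer(x, x, "==") still allocates the full dense matrix before anything is made sparse, which at 25,000 x 25,000 is several gigabytes. If the matches really are rare, a sketch like the following builds the sparse matrix directly from the matching index pairs (the variable names nz, groups, ij and z3 are illustrative):
library(Matrix)
x <- c(1,3,0,4,1,0,4,1,1,4)
nz <- which(x != 0)                      # positions of nonzero values
groups <- split(nz, x[nz])               # positions grouped by value
ij <- do.call(rbind, lapply(groups, function(g) expand.grid(i = g, j = g)))
z3 <- sparseMatrix(i = ij$i, j = ij$j, x = 1, dims = c(length(x), length(x)))
all(as.matrix(z3) == outer(x, x, "==") * (x != 0))  # TRUE on this small example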
To answer your second question, whether computations can be done on such a big matrix: the answer is yes. You just need to approach it more cautiously and use the appropriate tools. Sparse matrices are nice, many typical matrix functions are available for them, and some other packages are compatible. Here is a link to a page with some examples.
Another thought: if you are working with really large matrices, you may want to look into other packages like bigmemory, which are designed to deal with R's memory overhead.
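As a rough sketch of that route (the file names are placeholders): a file-backed big.matrix keeps the 25,000 x 25,000 data on disk rather than in RAM.
library(bigmemory)
# File-backed matrix: ~5 GB of doubles live in z.bin on disk, not in memory.
zb <- filebacked.big.matrix(25000, 25000, type = "double",
                            backingfile = "z.bin", descriptorfile = "z.desc")
zb[1, 1] <- 1   # individual cells can be read and written like a normal matrix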
