How to efficiently do cross-validation with big.matrix in R?

I have a function, as follows, that takes a design matrix X with class type big.matrix as input and predicts the responses.
NOTE: the size of matrix X is over 10 GB. So I cannot load it into memory. I used read.big.matrix() to generate backing files X.bin and X.desc.
myfun <- function(X) {
## do something with X. class(X) == 'big.matrix'
}
My question is that, how I can do cross validation efficiently with this huge big.matrix?
My attempt: (It works, but is time consuming.)
Step 1: for each fold, get indices for training idx.train and test idx.test;
Step 2: divide X into X.train and X.test. Since X.train and X.test are also very large, I have to store them as big.matrix, and create associated backing files (.bin, .desc) for the training and test sets for each fold.
Step 3: feed the X.train to build the model, and predict responses for X.test.
The time-consuming part is Step 2, where I have to create backing files for training and test (almost like copy/paste the original big matrix) many times. For example, suppose I do 10-fold cross validation. Step 2 would take over 30 minutes for creating backing files for all 10 folds!
To solve this issue in Step 2, I think maybe I can divide the original matrix into 10 sub-matrices (of class type big.matrix) just once. Then for each fold, I use one portion for testing, and combine the remaining 9 portions as one big matrix for training. But the new issue is that there is no way to combine small big.matrix objects into a larger one efficiently without copy/paste.
Of course I can do distributed computing for this cross validation procedure. But I am just wondering whether there is a better way to speed up the procedure if just using a single core.
Any ideas? Thanks in advance.
UPDATE:
It turns out that @cdeterman's answer doesn't work when X is very large. The reason is that the mpermute() function permutes the rows by essentially doing copy/paste. mpermute() calls ReorderRNumericMatrix() in C++, which then calls the reorder_matrix() function. This function reorders the matrix by looping over all columns and rows and doing copy/paste. See the source code here.
Are there any better ideas for solving my problem?? Thanks.
END UPDATE

You will want to use the sub.big.matrix function. This avoids any further copies and points to the same original data. However, it can currently only subset contiguous rows. So you will want to permute your rows first.
# Step 1 - generate random indices and permute the rows of X
library(bigmemory)
idx <- sample(nrow(X), nrow(X))
mpermute(X, order = idx)
# Step 2 - create your folds
fold_size <- nrow(X)/10 # assuming 10 folds
idx_list <- split(seq(nrow(X)), ceiling(seq(nrow(X))/fold_size))
# Step 3 - list of sub.big.matrix objects
sm_list <- lapply(idx_list, function(x) sub.big.matrix(X, firstRow = x[1], lastRow = x[length(x)]))
You now have the original big.matrix split into 10 different matrices that you can use as you like.
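If the number of rows is not an exact multiple of the number of folds, the first and last row of each contiguous fold can also be computed with integer arithmetic only. This is just a sketch in base R; fold_bounds() is a hypothetical helper, not part of bigmemory, and its firstRow/lastRow columns are simply named after the sub.big.matrix arguments they would be passed to.

```r
# Hypothetical helper: contiguous fold boundaries for n rows and k folds.
# The first (n %% k) folds get one extra row so that all n rows are covered.
fold_bounds <- function(n, k) {
  sizes <- rep(n %/% k, k) + (seq_len(k) <= n %% k)
  last <- cumsum(sizes)
  first <- last - sizes + 1
  data.frame(fold = seq_len(k), firstRow = first, lastRow = last)
}

bounds <- fold_bounds(105, 10)
print(bounds)
```

Each row of bounds could then be handed to sub.big.matrix(X, firstRow = ..., lastRow = ...) as in Step 3.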

Related

Parallel recursive function in R?

I’ve been wracking my brain around this problem all week and could really use an outside perspective. Basically I’ve built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I’ve generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack is that each call of tree_search doesn't actually return anything, it just appends results onto a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
# Terminate on fifth loop.
message(paste("Started loop", loop))
if(loop == 5) {return(TRUE)}
# Create large matrix and do some operation.
bigmx <- matrix(rnorm(10), 50000, 250)
bigmx <- sin(bigmx^2)
# Aggregate matrix and save output.
agg <- colMeans(bigmx)
append <- file.exists("output.csv")
fwrite(t(agg), file = "output.csv", append = append, row.names = F)
# Split matrix in submatrices with 100 columns each.
ind <- ceiling(seq_along(1:ncol(bigmx)) / 100)
lapply(unique(ind), function(i) {
submx <- bigmx[, ind == i]
# Pass each submatrix to subsequent call.
loop <- loop + 1
tree_search(submx, loop) # sub matrix is used to generate big matrix in subsequent call (not shown)
})
}
# Initiate tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, each call of the function runs in its own environment, so an assignment inside it creates a local variable rather than changing the one outside. That’s why you can do this:
x <- 5
lapply(1:10, function(i) {
x <- x + 1
x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end x is still 5, which means that each call was working on its own local copy of x. That’s not good if, say, x was actually a large data frame with 10,000 variables. for loops, on the other hand, let you overwrite the variable on each iteration.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule=F, each core will only process 10% of your job at a time. If mc.cores was set to two, for example, the other 8 "nodes" would wait until one part finished before starting a new one. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
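Points 2 and 3 can be sketched together as a single top-level mclapply call with prescheduling turned off. This is only an illustration: run_search() is a hypothetical stand-in for one independent tree search, and the core count is an assumption (mclapply cannot fork on Windows, so it falls back to one core there).

```r
library(parallel)

# Hypothetical stand-in for one independent tree search.
run_search <- function(seed) {
  set.seed(seed)
  mean(sin(rnorm(1000)^2))
}

# Forking is not available on Windows, so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One parallel call at the top level. With mc.preschedule = FALSE each
# part is started only when a core becomes free.
results <- mclapply(1:8, run_search,
                    mc.cores = n_cores, mc.preschedule = FALSE)
length(results)
```

With mc.preschedule = FALSE a new process is forked for every part, so this suits a small number of long-running parts better than many tiny ones.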
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.

Implementing fast numerical calculations in R

I was trying to do an extensive computation in R. Eighteen hours have passed but my RStudio seems to continue to work. I'm not sure if I could have written the script in a different way to make it faster. I was trying to implement a Crank–Nicolson type method over a 50000 by 350 matrix as shown below:
#defining the discretization of cells
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,length(m)]<-0
#implement the calculation
for(j in 2:(length(m[1,])-1)){
for(i in 2:length(m[[1]])){
m[i,][2:length(m)-1][[j]]<-m[i-1,][[j]]+
D*dt*(m[i-1,][[j+1]]-2*m[i-1,][[j]]+m[i-1,][[j-1]])/(dz^2)-
v*dt*(m[i-1,][[j+1]]-m[i-1,][[j-1]])/(2*dz)
}}
Is there a way to know how long would it take for R to implement it? Is there a better way of constructing the numerical calculation? At this point, I feel like excel could have been faster!!
Just making a few simple optimisations really helps here. The original version of your code would take ~5 days on my laptop. Using a matrix and calculating just once the values that are reused in the loop brings this down to around 7 minutes.
And think about messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which would be faster (as well as much easier to understand). Making this change reduces the runtime by another factor of over 2, to around 3 minutes.
Putting this together we have
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- (matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC = NCOL(m)
NR = NROW(m)
C1 = D*dt / dz^2
C2 = v*dt / (2*dz)
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,NC]<-0
#implement the calculation
for(j in 2:(NC-1)){
for(i in 2:NR){
ma = m[i-1,]
ma.1 = ma[[j+1]]
ma.2 = ma[[j-1]]
m[[i,j]] <- ma[[j]] + C1*(ma.1 - 2*ma[[j]] + ma.2) - C2*(ma.1 - ma.2)
}
}
If you need to go even faster than this, you can try out some more optimisations. For example see here for how different ways of indexing the same element can have very different execution times. In general it is better to refer to column first, then row.
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop in Rcpp instead.
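Another option within plain R is to vectorise the update over the interior cells, so that only the loop over time steps remains. The sketch below rests on an assumption about the intended scheme and is not a drop-in replacement: it finishes every row (time step) before starting the next one, whereas the loops above fill column j while column j+1 of the previous row still holds its initial value, so the numbers will differ. The names n_steps, n_cells and interior are mine, and n_steps is reduced to keep the example quick.

```r
dt <- 1
dz <- 0.0075
v <- 2 / (24 * 60 * 60)          # velocity
D <- 0.02475 / (24 * 60 * 60)    # diffusion

n_steps <- 1000L   # assumption: fewer steps than the original 50000
n_cells <- 350L

m <- matrix(0, n_steps + 1L, n_cells + 2L)  # extra row/columns for boundaries
NC <- ncol(m)
NR <- nrow(m)
C1 <- D * dt / dz^2
C2 <- v * dt / (2 * dz)

# constant boundary values in the first and last columns
m[, 1] <- 400
m[, NC] <- 0

interior <- 2:(NC - 1)
for (i in 2:NR) {
  prev <- m[i - 1, ]
  # update all interior cells of time step i from the completed step i - 1
  m[i, interior] <- prev[interior] +
    C1 * (prev[interior + 1] - 2 * prev[interior] + prev[interior - 1]) -
    C2 * (prev[interior + 1] - prev[interior - 1])
}
```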

How to assign submatrices in elements of a list

For example:
Let M be some m×n matrix where n is large enough to make manual entry impossible.
tmp_list[1] <- M[,1:10]
tmp_list[2] <- M[,11:20]
.
.
.
tmp_list[last] <- M[,(end - 9):end]
The problem I'm working on is sort of monte carlo, repeating an experiment involving a random mXn matrix 100K times. I'm still pretty new to R, I've done it using a for loop, but it obviously took a very long time. So I'm hoping to assign each "experiment" to an element of a list and use lapply.
let's take the easy case, and you can expand it from there
say n=100, develop your start indices
n<-100
byParam<-10
starts<-seq(1, n-(byParam-1), by=byParam)
then lapply
tmp_list<-lapply(starts, function(startIndex) M[, startIndex:(startIndex+(byParam-1))])
That's just one way to do it. It becomes a bit more complicated if n is not a nice multiple of 10 (or whatever you set "byParam" to). In that case you can build vectors of start and end indices and use mapply instead
#given start and end indices
tmp_list<-mapply(function(startInd, endInd){
M[, startInd:endInd]},
startInd=starts, endInd=ends, SIMPLIFY=FALSE)
Now lapply and mapply are still iterative, so I wouldn't expect massive improvement on time efficiency
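For the case where n is not a multiple of the block width, here is one possible way to build the start and end indices. It is a sketch with made-up sizes (n = 95 and a 3-row M), and it uses Map, which always returns a list, in place of mapply.

```r
n <- 95          # deliberately not a multiple of byParam
byParam <- 10
M <- matrix(seq_len(3 * n), nrow = 3)   # small example matrix

starts <- seq(1, n, by = byParam)
ends <- pmin(starts + byParam - 1, n)   # the last block is cut off at n

# drop = FALSE keeps every piece a matrix, even a single-column one
tmp_list <- Map(function(s, e) M[, s:e, drop = FALSE], starts, ends)

sapply(tmp_list, ncol)
```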
EDIT
After discussion in the comments, here is a solution for the entire set up, not just the above question
tmp_list<-lapply(1:1000, function(i){
vect<-sample(c(0,1), 10*1000, replace=TRUE)
dim(vect)<-c(10, 1000)
vect
})
Let's break this down, it makes everything very simple.
We first create a random sample of 1's and 0's, of length 10*1000 (the number of elements in each sub-matrix). We can then neatly convert that vector to a matrix by setting its dim attribute to c(10, 1000), which gives it 10 rows and 1000 columns. Then we return that into a list at the index i. We lapply over 1:1000, or iterate 1000 times.

lapply with growing data.table function in R

I come from a Java/Python comp sci theory background so I am still getting used to the various R packages and how they can save run time in functions.
Basically, I am working on a few projects and all of them involve taking individual factors in a long-list data set (15,000 to 200,000 factors) and performing calculations on individual factors in an equally-large data set, and concurrently storing the results of those calculations in an exponentially-longer data frame.
So far I have been using nested while loops and concatenating into a growing list, but that is taking days. I've recently learned about 'lapply' and the 'data.frame' options in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:
Corr<-function(miRdf, mRNAdf)
{
j=1
k=1
m=1
n=1
c=0
corrList=NULL
while(n<=71521)
{
while(m<=1477)
{
corr=cor(as.numeric(miRdf[k,2:13]), as.numeric(mRNAdf[j,2:13]), use ="complete.obs")
corrList<-c(corrList, corr)
j=j+1
c=c+1
print(c) #just a counter to see how far the function has run
m=m+1
}
k=k+1
n=n+1
j=1
m=1 #to reset the inner while loop
}
corrList<-matrix(unlist(corrList), ncol=1477, byrow=FALSE)
colnames(corrList)<-miRdf[,1]
rownames(corrList)<-mRNAdf[,1]
write.csv(corrList, "testCorrWhole.csv")
}
As you can see, the nested while loop results in 105,636,517 (71521x1477) miRNA vs mRNA expression-value correlation scores that need to be performed and stored in a data frame that is 1477 cols x 71521 rows in order to generate a scoring matrix.
My question is, can anyone shed light on how to turn the above monstrosity into an efficient function that utilizes 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating a list during every pass through the loop?
Thank you in advance!
Your names end with 'df', which makes it seem like your data are a data.frame. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and generally matrix operations are much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,])) then people would be in a better position to help you; this is what they're asking for.
In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include sub-setting to columns 2:13, and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with
mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])
From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So
result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")
This is fast enough for your purposes
> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
user system elapsed
9.124 0.200 9.340
> dim(ans)
[1] 71521 1477
result is the same as your corrList -- it's not a list, but a matrix; probably the row and column names have been carried forward. You'd write this to a file as you do above, write.csv(result, "testCorrWhole.csv")
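To convince yourself that the single cor() call matches the row-by-row loop, you can compare the two on small random matrices. This is only a sanity check on made-up data without missing values; the names a, b, loop_result and fast_result are mine.

```r
set.seed(1)
a <- matrix(rnorm(6 * 12), nrow = 6)   # 6 rows, 12 samples each
b <- matrix(rnorm(4 * 12), nrow = 4)   # 4 rows, 12 samples each

# Row-by-row version: one cor() call per pair of rows
loop_result <- matrix(NA_real_, nrow(a), nrow(b))
for (i in seq_len(nrow(a))) {
  for (j in seq_len(nrow(b))) {
    loop_result[i, j] <- cor(a[i, ], b[j, ])
  }
}

# Matrix version: cor() correlates columns, so transpose first
fast_result <- cor(t(a), t(b))

all.equal(loop_result, fast_result, check.attributes = FALSE)
```

One caveat if your data do contain NAs: with matrix arguments, use="complete.obs" drops a sample as soon as any row has a missing value in it, while the loop only drops samples missing in the pair being compared. use="pairwise.complete.obs" is closer to the loop's behaviour.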
UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING
Using apply() might not be quick enough for you. Here's how to do it, though. Will have a think about performance since this example (1M output correlations in 1000x1000 grid) takes over a minute on laptop.
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user system elapsed
# 72.94 0.00 73.39
And with parallel::parApply on a 4 core Win64 laptop
require(parallel) ## Library to allow parallel processing
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
# Make a cluster from all available cores
cl=makeCluster(detectCores())
# Use clusterExport() to distribute the function and data.frames needed in the apply() call
clusterExport(cl,c("corRow","miRdf","mRNAdf"))
# time the call
system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))
# Stop the cluster
stopCluster(cl)
# time the call without clustering
system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))
## WITH CLUSTER (4)
user system elapsed
0.04 0.03 29.94
## WITHOUT CLUSTER
user system elapsed
73.96 0.00 74.46

Performing statistics on multiple columns of data

I'm trying to conduct certain statistics such as t-tests on a table of data containing hundreds to thousands of columns. The data is formatted in a way that the two groups of values I'm comparing are in the same column.
So, basically my first attempt was to cut and paste like the following;
NN <-read.delim("E:/output.txt")
View(NN)
attach(NN)
#output p-values of 100 t-tests
sink(file="E:/ttest.txt", append=TRUE, split=FALSE)
t.test(Tree1[1:13],Tree1[14:34])$p.value
t.test(Tree2[1:13],Tree2[14:34])$p.value
t.test(Tree3[1:13],Tree3[14:34])$p.value
....
...
..
.
As my data grows, this is becoming more and more impractical. Is there a way to loop these t-tests through each column sequentially and save the output to file?
Thanks in advance.
lapply will get you there I think with an anonymous function:
> test <- data.frame(a=1:100,b=101:200)
> lapply(test,function(x) t.test(x[1:50],x[51:100])$p.value)
$a
[1] 2.876776e-31
$b
[1] 2.876776e-31
I should do my part for good practice and also note that running 100 t-tests in a single go is fraught with the potential for type-1 errors and other badness.
Extracting the p-value in isolation is also probably a really bad move.
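If you do go ahead with many tests, one common way to account for the multiple comparisons is to adjust the p-values with p.adjust() from base R. The sketch below uses random data and the Benjamini-Hochberg method purely as an example; which correction is appropriate depends on your analysis.

```r
set.seed(42)
# Made-up data: 20 columns, rows 1:50 are one group and 51:100 the other
test <- as.data.frame(matrix(rnorm(100 * 20), nrow = 100))

# One raw p-value per column
p_raw <- vapply(test, function(x) t.test(x[1:50], x[51:100])$p.value,
                numeric(1))

# Benjamini-Hochberg adjustment (method chosen only for illustration)
p_adj <- p.adjust(p_raw, method = "BH")

head(cbind(p_raw, p_adj))
```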
Not sure if this is a wise approach or if it even works correctly but try mapply with the indexed parts as in:
test <- data.frame(a=1:100,b=101:200)
testa <- test[1:50, ]
testb <- test[51:100, ]
t.test2 <- function(x, y) t.test(x, y)[["p.value"]]
mapply(t.test2, testa, testb)
EDIT: I used thelatemail's data so it's comparable. His warning is right on.
Thanks for all the input. Just a few clarifications: while I AM running hundreds of t-tests at once, they are comparing independent sets of data each time. So for example, the values in column 1 (Tree1), rows 1:50 would only be compared once to rows 51:100 in the same column, and never used again. The same for column 2 (Tree2), and so on. Would type-1 error still be a problem? The way I see it, I'm basically doing t-tests on separate data sets one at a time.
That being said, I've come up with a way to do this with a for-loop, and the results correspond to those when t-testing each column individually.
for (i in 1:100) {
print(t.test(mydata[1:50, i], mydata[51:100, i])$p.value)
}
The only problem being that my output always has a [1] in front of it.
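The [1] is the index prefix that print() adds to any vector. One way around it, sketched here with made-up data and a temporary output file (mydata and out_file are placeholders for your own table and path), is to collect the p-values first and write them out with cat().

```r
set.seed(1)
# Made-up stand-in for the real table: 5 columns, 100 rows
mydata <- as.data.frame(matrix(rnorm(100 * 5), nrow = 100))

# Collect one p-value per column
p_values <- vapply(mydata,
                   function(col) t.test(col[1:50], col[51:100])$p.value,
                   numeric(1))

# cat() writes the bare numbers, one per line, without the [1] prefix
out_file <- tempfile(fileext = ".txt")
cat(p_values, sep = "\n", file = out_file)

readLines(out_file)
```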
