I’ve been wracking my brain around this problem all week and could really use an outside perspective. Basically I’ve built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I’ve generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack is that each call of tree_search doesn't actually return anything, it just appends results onto a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
# Terminate on fifth loop.
message(paste("Started loop", loop))
if(loop == 5) {return(TRUE)}
# Create large matrix and do some operation.
bigmx <- matrix(rnorm(10), 50000, 250)
bigmx <- sin(bigmx^2)
# Aggregate matrix and save output.
agg <- colMeans(bigmx)
append <- file.exists("output.csv")
fwrite(t(agg), file = "output.csv", append = append, row.names = F)
# Split matrix in submatrices with 100 columns each.
ind <- ceiling(seq_along(1:ncol(bigmx)) / 100)
lapply(unique(ind), function(i) {
submx <- bigmx[, ind == i]
# Pass each submatrix to subsequent call.
loop <- loop + 1
tree_search(submx, loop) # sub matrix is used to generate big matrix in subsequent call (not shown)
})
}
# Initiate tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, each call creates a copy of your current environment. That’s why you can do this:
x <- 5
lapply(1:10, function(i) {
x <- x + 1
x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end x is still 5, which means that each call of lapply was manipulating a separate copy of x. That’s not good if, say, x was actually a large dataframe with 10,000 variables. for loops, on the other hand, allow you to override the variables on each loop.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule=F, each core will only process 10% of your job at a time. If mc.cores was set to two, for example, the other 8 "nodes" would wait until one part finished before starting a new one. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.
Related
I have a function, as follows, that takes a design matrix X with class type big.matrix as input and predicts the responses.
NOTE: the size of matrix X is over 10 GB. So I cannot load it into memory. I used read.big.matrix() to generate backing files X.bin and X.desc.
myfun <- function(X) {
## do something with X. class(X) == 'big.matrix'
}
My question is that, how I can do cross validation efficiently with this huge big.matrix?
My attempt: (It works, but is time consuming.)
Step 1: for each fold, get indices for training idx.train and test idx.test;
Step 2: divide X into X.train and X.test. Since X.train and X.test are also very large, I have to store them as big.matrix, and create associated backing files (.bin, .desc) for the training and test sets for each fold.
Step 3: feed the X.train to build the model, and predict responses for X.test.
The time-consuming part is Step 2, where I have to create backing files for training and test (almost like copy/paste the original big matrix) many times. For example, suppose I do 10-fold cross validation. Step 2 would take over 30 minutes for creating backing files for all 10 folds!
To solving this issue in Step 2, I think maybe I can divide the original matrix into 10 sub matrices (of class type big.matrix) just once. Then for each fold, I use one portion for testing, and combine the remaining 9 portions as one big matrix for training. But the new issue is, there is no way to combine small big.matrix into a larger one efficiently without copy/paste.
Of course I can do distributed computing for this cross validation procedure. But I am just wondering whether there is a better way to speed up the procedure if just using a single core.
Any ideas? Thanks in advance.
UPDATE:
It turns out that #cdeterman's answer doesn't work when X is very large. The reason is that the mpermute() function permutes the rows by essentially doing copy/paste. mpermute() calls ReorderRNumericMatrix() in C++, which then calls reorder_matrix() function. This function reorders the matrix by looping over all columns and rows and doing copy/paste. See the source code here.
Are there any better ideas for solving my problem?? Thanks.
END UPDATE
You will want to use the sub.big.matrix function. This avoids any further copies and points the same original data. However, it can currently only subset contiguous rows. So you will want to permute your rows first.
# Step 1 - generate random indices
idx <- sample(nrow(X), nrow(X))
mpermute(X, idx)
# Step 2 - create your folds
max <- nrow(bm)/10 # assuming 10 folds
idx_list <- split(seq(nrow(bm)), ceiling(seq(nrow(bm))/max))
# Step 3 - list of sub.big.matrix objects
sm_list <- lapply(idx_list, function(x) sub.big.matrix(bm, firstRow = x[1], lastRow = x[length(x)]))
You now have the original big.matrix split into 10 different matrices that you can use as you like.
I am working with R and my script is taking a very long time. I was thinking I can stop it and then start it again by changing my counters.
My code is this
NC <- MLOA
for (i in 1:313578){
len_mods <- length(MLOA[[i]])
for (j in 1:2090){
for(k in 1:len_mods){
temp_match <- matchv[j]
temp_rep <- replacev[j]
temp_mod <- MLOA[[i]][k]
is_found <- match(temp_mod,temp_match, nomatch = 0, incomparables = 0)
if(is_found[1] == 1) NC[[i]][k] <- temp_rep
rm(temp_match,temp_rep,temp_mod)
}
}
}
I am thinking that I can stop my script, then re-start it by checking what values of i,j and k are and changing the counts to start at their current values. So instead of counting "for (i in 1:313578)" if i is up to 100,000 I could do (i in 100000:313578).
I don't want to stop my script though before checking that my logic about restarting it is solid.
Thanks in anticipation
I'm a bit confused what you are doing. Generally on this forum it is a good idea to greatly simplify your code, and only present the core of the problem in a very simple example. That withstanding, this might help. Put your for loop in a function whose parameters are the first elements of the sequence of numbers you loop over. For example:
myloop <- function(x,...){
for (i in seq(x,313578,1)){
...
This way you can easily manipulate were your loop starts.
The more important question is, however, why are you using for loops in the first place? In R, for loops should be avoided at all costs. By vectorizing your code you can greatly increase its speed. I have realized speed increases of a factor of 500!
In general, the only reason you use a for loop in R is if current iterations of the for loop depend on previous iterations. If this is the case then you are likely bound to the slow for loop.
Depending on your computer skills, however, even for loops can be made faster in R. If you know C, or are willing to learn a bit, interfacing with C can dramatically increase the speed of your code.
An easier way to increase the speed of your code, which unfortunately will not yield the same speed up as interfacing with C, is using R's Byte Complier. Check out the cmpfun function.
One final thing on speeding up code: The following line of codetemp_match <- matchv[j] looks innocuous enough, however, this can really slow things down. This is because every time you assign matchv[j] to temp_match you make a copy of temp_match. That means that your computer needs to find some were to store this copy in RAM. R is smart, as you make more and more copies, it will clean up after you and throw away those copies you are no longer using with the garbage collect function. However finding places to store your copies as well as calling the garbage collect function take time. Read this if you want to learn more: http://adv-r.had.co.nz/memory.html.
You could also use while loops for your 3 loops to maintain a counter. In the following, you can stop the script at any time (and view the intermediate results) and restart by changing continue=TRUE or simply running the loop part of the script:
n <- 6
res <- array(NaN, dim=rep(n,3))
continue = FALSE
if(!continue){
i <- 1
j <- 1
k <- 1
}
while(k <= n){
while(j <= n){
while(i <= n){
res[i,j,k] <- as.numeric(paste0(i,j,k))
Sys.sleep(0.1)
i <- i+1
}
j <- j+1
i <- 1
}
k <- k+1
j <- 1
}
i;j;k
res
This is what I got to....
for(i in 1:313578)
{
mp<-match(MLOA[[i]],matchv,nomatch = 0, incomparables=0)
lgic<- which(as.logical(mp),arr.ind = FALSE, useNames = TRUE)
NC[[i]][lgic]<-replacev[mp]}
Thanks to those who responded, Jacob H, you are right, I am definitely a newby with R, your response was useful. Frank - your pointers helped.
My solution probably still isn't an optimal one. All I wanted to do was a find and replace. Matchv was the vector in which I was searching for a match for each MLOA[i], with replacev being the vector of replacement information.
I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 cols. I'm splitting/partitioning by two columns, both character type. Looks like there are ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code are matched. The resulting number of partitions is ~250K. Here is the split() line:
system.time(pop_part <- split(pop, list(pop$ID, pop$code)))
The partitions will then be fed into parLapply() as follows:
cl <- makeCluster(detectCores())
system.time(par_pop <- parLapply(cl, pop_part, func))
stopCluster(cl)
I've let the split() code alone run almost an hour and it doesn't complete. I can split by the ID alone, which takes ~10 mins. Additionally, R studio and the worker threads are consuming ~6GB of RAM.
The reason I know the resulting number of partitions is I have equivalent code in Pentaho Data Integration (PDI) that runs in 30 seconds (for the entire program, not just the "split" code). I'm not hoping for that type of performance with R, but something that perhaps completes in 10 - 15 mins worst case.
The main question: Is there a better alternative to split? I've also tried ddply() with .parallel = TRUE, but it also ran over an hour and never completed.
Split indexes into pop
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))
Split is not slow, e.g.,
> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
user system elapsed
1.056 0.000 1.058
so if yours is I guess there's some aspect of your data that slows things down, e.g., ID and code are both factors with many levels and so their complete interaction, rather than the level combinations appearing in your data set, are calculated
> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10
or perhaps you're running out of memory.
Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case since you ask for detectCores()).
par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i,]), pop, func)
Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
Also, and really as the initial step, consider identifying the bottle necks in func; the data is large but not that big so perhaps parallel evaluation is not needed -- maybe you've written PDI code instead of R code? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to get a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
Split(x,f) is slow if x is a factor AND f contains a lot of different elements
So, this code if fast:
system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
But, this is very slow:
system.time(split(factor(seq_len(1300000)), sample(250000, 1300000, TRUE)))
And this is fast again because there are only 25 groups
system.time(split(factor(seq_len(1300000)), sample(25, 1300000, TRUE)))
For several efforts I'm involved in at the moment, I am running large datasets with numerous parameter combinations through a series of functions. The functions have a wrapper (so I can mclapply) for ease of operation on a cluster. However, I run into two major challenges.
a) My parameter combinations are large (think 20k to 100k). Sometimes particular combinations will fail (e.g. survival is too high and mortality is too low so the model never converges as a hypothetical scenario). It's difficult for me to suss out ahead of time exactly which combinations will fail (life would be easier if I could do that). But for now I have this type of setup:
failsafe <- failwith(NULL, my_wrapper_function)
# This is what I run
# Note that input_variables contains a list of variables in each list item
results <- mclapply(input_variables, failsafe, mc.cores = 72)
# On my local dual core mac, I can't do this so the equivalent would be:
results <- llply(input_variables, failsafe, .progress = 'text')
The skeleton for my wrapper function looks like this:
my_wrapper_function <- function(tlist) {
run <- tryCatch(my_model(tlist$a, tlist$b, tlist$sA, tlist$Fec, m = NULL) , error=function(e) NULL)
...
return(run)
}
Is this the most efficient approach? If for some reason a particular combination of variables crashes the model, I need it to return a NULL and carry on with the rest. However, I still have issues that this fails less than gracefully.
b) Sometimes a certain combination of inputs does not crash the model but takes too long to converge. I set a limit on the computation time on my cluster (say 6 hours) so I don't waste my resources on something that is stuck. How can I include a timeout such that if a function call takes more than x time on a single list item, it should move on? Calculating the time spent is trivial but a function mid simulation can't be interrupted to check the time, right?
Any ideas, solutions or tricks are appreciated!
You may well be able to manage graceful-exits-upon-timout using a combination of tryCatch() and evalWithTimeout() from the R.utils package.
See also this post, which presents similar code and unpacks it in a bit more detail.
require(R.utils)
myFun <- function(x) {Sys.sleep(x); x^2}
## evalWithTimeout() times out evaluation after 3.1 seconds, and then
## tryCatch() handles the resulting error (of class "TimeoutException") with
## grace and aplomb.
myWrapperFunction <- function(i) {
tryCatch(expr = evalWithTimeout(myFun(i), timeout = 3.1),
TimeoutException = function(ex) "TimedOut")
}
sapply(1:5, myWrapperFunction)
# [1] "1" "4" "9" "TimedOut" "TimedOut"
A recurring analysis paradigm I encounter in my research is the need to subset based on all different group id values, performing statistical analysis on each group in turn, and putting the results in an output matrix for further processing/summarizing.
How I typically do this in R is something like the following:
data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups
results <- matrix(rep("NA",300),ncol=3,nrow=100)
for(i in 1:100) {
tempmat <- subset(data.mat,ID==groupids[i])
# Run various stats on tempmat (correlations, regressions, etc), checking to
# make sure this specific group doesn't have NAs in the variables I'm using
# and assign results to x, y, and z, for example.
results[i,1] <- x
results[i,2] <- y
results[i,3] <- z
}
This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.
Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
Edit:
Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain which surprised me. I have some code I used and output below using the same object names as above:
system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})
user system elapsed
361.41 92.62 458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})
user system elapsed
378.44 102.03 485.94
Update:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):
data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)
results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))
for(i in 1:length(groupids)) {
tempmat <- data.mat[data.mat$ID==groupids[i],]
results[i,1] <- groupids[i]
results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")
results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")
}
I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in the mid-afternoon. Figure 5-7 hours.
Restructuring my code to use data.table....
data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)
testfunc <- function(x,y,z) {
temp1 <- cor(x,y,use="pairwise.complete.obs")
temp2 <- cor(x,z,use="pairwise.complete.obs")
res <- list(temp1,temp2)
res
}
system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])
user system elapsed
16.41 0.05 17.44
Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to have given me the same answers(though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
Update 2:
Running the code using subsets finally finished up again:
user system elapsed
17575.79 4247.41 23477.00
Update 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.
Using the same variables and setup as before...
data.mat <- read.csv("//home....")
system.time(hmm <- ddply(data.mat,"ID",function(df)c(cor(df$Var1,df$Var2, use="pairwise.complete.obs"),cor(df$Var1,df$Var3,use="pairwise.complete.obs"))))
user system elapsed
250.25 7.35 272.09
This is pretty much exactly what the plyr package is designed to make easier. However it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.
Besides plyr, you can try to use foreach package to exclude explicit loop counter, but I don't know if it will give you any performance benefits.
Foreach, neverless, gives you a quite simple interface to parallel chunk processing if you have multicore workstation (with doMC/multicore packages) (check Getting Started with doMC and foreach for details), if you exclude parallel processing only because it is not very easy to understand for students. If it is not the only reason, plyr is very good solution IMHO.
Personally, I find plyr not very easy to understand. I prefer data.table which is also faster. For instance you want to do the standard deviation of colum my_column for each ID.
dt <- datab.table[df] # one time operation...changing format of df to table
result.sd <- dt[,sd(my_column),by="ID"] # result with each ID and SD in second column
Three statements of this kind and a cbind at the end - that is all you need.
You can also use dt do some action for only one ID without a subset command in an new syntax:
result.sd.oneiD<- dt[ID="oneID",sd(my_column)]
The first statment refers to rows (i), the second to columns (j).
If find it easier to read then player and it is more flexible, as you can also do sub domains within a "subset"...
The documentation describes that it uses SQL-like methods. For instance, the by is pretty much "group by" in SQL. Well, if you know SQL, you can probably do much more, but it is not necessary to make use of the package.
Finally, it is extremely fast, as each operation is not only parallel, but also data.table grabs the data needed for calculation. Subset, however, maintain the levels of the whole matrix and drag it trough the memory.
You have already suggested vectorizing and avoiding making unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what i did and just assume that vectorizing will always give you a performance boost (like it does in other languages, e.g., Python + NumPy, MATLAB).
An example:
# small function to time the results:
time_this = function(...) {
start.time = Sys.time(); eval(..., sys.frame(sys.parent(sys.parent())));
end.time = Sys.time(); print(end.time - start.time)
}
# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)
# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )
print(x) # returns => 0.5312099
print(y) # returns => 0.661242
The 'apply' version is actually slower than the 'for' version. (According to the Inferno author, if you are doing this you are not vectorizing, you are 'loop hiding'.)
But where you can get a performance boost is by using built-ins. Below, i've timed the same operation as the two above, just using the built-in function, 'rowMeans':
z = time_this(rowMeans(a))
print(z) # returns => 0.03679609
An order of magnitude improvement versus the 'for' loop (and the vectorized version).
The other members of the apply family are not just wrappers over a native 'for' loop.
a = abs(floor(10*rnorm(1e6)))
time_this(sapply(a, sqrt))
# returns => 6.64 secs
time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs
'sapply' is about 5x slower compared with a 'for' loop.
Finally, w/r/t vectorized versus 'for' loops, i don't think i ever use a loop if i can use a vectorized function--the latter is usually less keystrokes and and it's a more natural way (for me) to code, which is a different kind of performance boost, i suppose.