I am running a parallel process in a foreach loop that returns a 7x30 matrix at the end of each iteration. When I run the loop using this command, it finishes in 11.5 minutes:
myData<-foreach(i=1:270000, .packages='quadprog')%dopar%{
Unfortunately, myData is a list and I want to plot the last two columns of every matrix within that list. So, I use this command to convert it to a data frame for ggplot2:
myData<-Reduce(rbind.data.frame, myData[1:length(myData)])
This command works well for a small myData but myData is 270,000 matrices long. It is either hanging up or taking a really long time to convert.
So, I try to run the loop using this command so that the output is a data frame in the first place:
myData<-foreach(i=1:270000, .combine=rbind.data.frame, .packages='quadprog')%dopar%{
This has been running for the last two hours (way longer than 11 minutes).
Is there a way to efficiently get the output from these loops and put it into a format where I can graph it?
Interestingly, when I look at the Windows Task Manager, the first call to the loop immediately pushes all of my CPU cores to 100% usage. The second one stays closer to 10%, even though I set up the same number of workers under doSNOW.
Use rbindlist() instead of rbind() to combine your results, and set .multicombine = TRUE to speed it up. Note that rbindlist() takes a single list, so the combine function should wrap its arguments with list():
comb = function(...) rbindlist(list(...))
myData = foreach(i = 1:270000, .combine = comb, .packages = 'quadprog', .multicombine = TRUE) %dopar% {
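For reference, here's a minimal self-contained sketch of the whole pattern (doParallel stands in for whatever backend you registered, and small random matrices stand in for the real 7x30 results; since rbindlist() only accepts data.frame/data.table/list items, each matrix is converted with as.data.table() inside the combiner):
library(foreach)
library(doParallel)
library(data.table)
cl <- makeCluster(2)
registerDoParallel(cl)
# Convert each incoming matrix before stacking; the already-combined
# data.table passes through as.data.table() unchanged.
comb <- function(...) rbindlist(lapply(list(...), as.data.table))
myData <- foreach(i = 1:100, .combine = comb, .multicombine = TRUE) %dopar% {
  matrix(rnorm(7 * 30), nrow = 7, ncol = 30)  # stand-in for the real result
}
stopCluster(cl)
# myData is now a single data.table, ready for ggplot2.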
I’ve been wracking my brain around this problem all week and could really use an outside perspective. Basically I’ve built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I’ve generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack Overflow is that each call of tree_search doesn't actually return anything; it just appends results to a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
  # Terminate on fifth loop.
  message(paste("Started loop", loop))
  if (loop == 5) {return(TRUE)}
  # Create a large matrix (rnorm(10) is recycled to fill it) and do some operation.
  bigmx <- matrix(rnorm(10), 50000, 250)
  bigmx <- sin(bigmx^2)
  # Aggregate matrix and save output.
  agg <- colMeans(bigmx)
  append <- file.exists("output.csv")
  fwrite(t(agg), file = "output.csv", append = append, row.names = FALSE)
  # Split matrix into submatrices with 100 columns each.
  ind <- ceiling(seq_len(ncol(bigmx)) / 100)
  lapply(unique(ind), function(i) {
    submx <- bigmx[, ind == i]
    # Pass each submatrix to the subsequent call.
    loop <- loop + 1
    tree_search(submx, loop) # submatrix is used to generate the big matrix in the subsequent call (not shown)
  })
}
# Initiate tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) for loops are more memory-efficient than lapply and recursive functions
When you use lapply, each call of the function runs in its own local environment, so an assignment like x <- x + 1 creates a local copy of x instead of modifying the original. That's why you can do this:
x <- 5
lapply(1:10, function(i) {
  x <- x + 1
  x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end x is still 5, which means that each lapply call was manipulating its own copy of x. That's not good if, say, x is actually a large data frame with 10,000 variables. for loops, on the other hand, let you overwrite variables in place on each iteration.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
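As a sketch, assuming a hypothetical run_one_tree() worker that wraps a full serial tree search:
library(parallel)
# Hypothetical worker: one complete, serial tree search per starting point.
run_one_tree <- function(start) {
  # ... full recursive search here, no mclapply inside ...
  start
}
starting_points <- as.list(1:16)
# Parallelize exactly once, at the top level.
results <- mclapply(starting_points, run_one_tree, mc.cores = 16)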
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply with mc.preschedule = FALSE, each core only takes on one part at a time. With mc.cores = 2, for example, the other 8 parts wait until a core finishes its current part before a new one is forked. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
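A minimal sketch of that throttling pattern (process_part() and the ten parts are hypothetical stand-ins):
library(parallel)
process_part <- function(part) sum(part)      # stand-in for real work
parts <- split(1:100, rep(1:10, each = 10))   # job split into 10 parts
# With mc.preschedule = FALSE, a new part is only forked when a core
# frees up, so at most mc.cores parts are in flight at once.
results <- mclapply(parts, process_part, mc.cores = 2, mc.preschedule = FALSE)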
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.
I am looking for an elegant way of timing the execution of an R chunk, preferably with the chunk run multiple times automatically in the background (the %%timeit magic in a Python notebook does exactly that).
I know there are several ways of timing an R function or a chunk of R code, and there are a few SO questions on that as well. All the methods are described in this article too.
However, most of them do not replicate the R code, and the ones that do have an option to replicate (like system.time or microbenchmark) are aimed at a function rather than a chunk of code (or maybe I do not understand them right).
tictoc works pretty well for me, except that it gives the run time for only a single execution and has no option to run the chunk, say, 1000 times and average the run times (again, what %%timeit does).
With tictoc it is possible to record the timing in a loop and average the results.
library(tictoc)
tic.clearlog()
for (x in 1:10) {
  # Passing x to tic() makes its value the label reported
  # by the matching toc() call.
  tic(x)
  Sys.sleep(1)
  # With log = TRUE, toc() pushes the measured timing onto a list;
  # quiet = TRUE suppresses printing of the timing.
  toc(log = TRUE, quiet = TRUE)
}
Fetch the results of toc() as formatted text for printing.
log.txt <- tic.log(format = TRUE)
Extract the list containing measurements in raw format.
log.lst <- tic.log(format = FALSE)
Since the data is already extracted, clear the tictoc log.
tic.clearlog()
Convert the list elements to timings.
Each element of the list has a start (tic) and end (toc) timestamp.
timings <- unlist(lapply(log.lst, function(x) x$toc - x$tic))
Compute the average loop time.
mean(timings)
# [1] 1.001
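As an aside on the microbenchmark doubt in the question: it does accept an arbitrary braced chunk, not just a function, and handles the replication and summary itself:
library(microbenchmark)
res <- microbenchmark({
  x <- rnorm(1e4)  # any chunk of code can go between the braces
  mean(x)
}, times = 100)
summary(res)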
So I've got this function meant to group measurements from multiple probes that fall into defined regions.
HMkit.dmr <- function(Mat, Classes, method.fdr = c("BH", "bonferroni"), probe.features) {
  # Annotate first...
  require(plyr)
  require(dplyr)
  # Filter matrix for testing...
  message("Setting up merged table")
  Mat2 <- Mat[match(probe.features$probe, rownames(Mat)), ]
  # Split by classes
  if (!is.factor(Classes)) {
    Classes <- as.factor(Classes)
  }
  Class.1 <- levels(Classes)[[1]]
  Class.2 <- levels(Classes)[[2]]
  C1.Mat <- Mat2[, Classes == Class.1]
  C2.Mat <- Mat2[, Classes == Class.2]
  # Summarise and run a Wilcoxon test for each DMR...
  unique.regions <- unique(as.character(probe.features$region.id))
  num.regions <- length(unique.regions)
  message(num.regions)
  pvals.vec <- numeric(length = num.regions)
  Meds.1 <- numeric(length = num.regions)
  Meds.2 <- numeric(length = num.regions)
  for (i in 1:num.regions) {
    region <- probe.features %>% filter(region.id %in% unique.regions[[i]])
    Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
    Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
    pvals.vec[[i]] <- wilcox.test(Set1.Mat, Set2.Mat)$p.value
    Meds.1[[i]] <- median(Set1.Mat)
    Meds.2[[i]] <- median(Set2.Mat)
    message(i)
  }
  # Output frame
  dmrs.frame <- data.frame(region = unique.regions, pval = pvals.vec,
                           G1 = Meds.1, G2 = Meds.2, dB = Meds.1 - Meds.2)
  dmrs.frame$q.val <- p.adjust(dmrs.frame$pval, method = method.fdr)
  groups.ids <- levels(Classes)
  return(list(dmrs = dmrs.frame, groups = groups.ids))
}
The code basically splits a matrix into two groups by sample and then, for each region, pulls in the values of all probes defined as being in that region, calls wilcox.test(), computes group medians, and uses the results to populate vectors created beforehand.
I have tried to replace the for loop with a %dopar% loop from the foreach package (backed by doParallel), but have not been able to get it to populate the vectors with the correct outcomes. I want to know how to correctly parallelise the function above: either by modifying the for loop, or by modifying the function call so that regions are broken down into chunks that are processed in parallel.
Example objects follow below...
Mat<-matrix(runif(200,0,1), ncol=10,nrow=20)
rownames(Mat)<-paste0("p",1:20)
colnames(Mat)<-paste0("S",1:10)
Classes<-as.character(c(rep("G1",6),rep("G2",4)))
probe.features<-data.frame(probe=paste0("p",1:20),region.id=c(rep("R1",5),rep("R2",3),rep("R3",4),rep("R5",4),rep("R6",4)))
and the function is run using
x<-HMkit.dmr(Mat,Classes,method.fdr=c("BH"),probe.features=probe.features)
In practice, there are 30,000 regions I am looking at, and I want to parallelise the function across multiple cores on Windows because serial execution can take up to 40 minutes. How do I do this?
Addendum - I tried to do this with
library(doParallel)
ncores<-2
Cl<-makeCluster(2)
registerDoParallel(Cl)
x<-foreach(i=1:length(unique(probe.features$region.id)),packages=c("plyr","dplyr"))%dopar%HMkit.dmr(Mat,Classes,probe.features=probe.features,method.fdr="BH")
However, doing that just returned two copies of the same results as the serial function. What I want it to do is break the regions in probe.features$region.id into chunks that go to different cores.
It appears to me that your "for" loop can be easily parallelized. It's just building up three vectors, one element per iteration, where each vector will become a column of "dmrs.frame". So each iteration is computing one row of the result.
To use "foreach", you can simply concatenate those three values into a vector. The .combine option is used to combine all of those the vectors into a matrix with "rbind":
m <- foreach(uregion=unique.regions, .combine='rbind',
             .packages=c('plyr', 'dplyr')) %dopar% {
  region <- probe.features %>% filter(region.id %in% uregion)
  Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
  Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
  c(wilcox.test(Set1.Mat, Set2.Mat)$p.value,
    median(Set1.Mat), median(Set2.Mat))
}
I got rid of the "i" variable since I think it's more readable to simply iterate over the elements of "unique.regions".
Now you can create "dmrs.frame" using the columns of matrix "m":
dmrs.frame <- data.frame(region=unique.regions,
                         pval=m[,1], G1=m[,2], G2=m[,3], dB=m[,2]-m[,3])
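As in your addendum, a parallel backend still needs to be registered once before the %dopar% loop runs, e.g.:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
x <- HMkit.dmr(Mat, Classes, method.fdr = "BH", probe.features = probe.features)
stopCluster(cl)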
I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 columns. I'm splitting/partitioning by two columns, both character type. It looks like there are ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code is matched. The resulting number of partitions is ~250K. Here is the split() line:
system.time(pop_part <- split(pop, list(pop$ID, pop$code)))
The partitions will then be fed into parLapply() as follows:
cl <- makeCluster(detectCores())
system.time(par_pop <- parLapply(cl, pop_part, func))
stopCluster(cl)
I've let the split() line alone run for almost an hour and it doesn't complete. Splitting by ID alone takes ~10 minutes. Additionally, RStudio and the worker threads are consuming ~6 GB of RAM.
The reason I know the resulting number of partitions is that I have equivalent code in Pentaho Data Integration (PDI) that runs in 30 seconds (for the entire program, not just the "split" code). I'm not hoping for that type of performance with R, but for something that perhaps completes in 10-15 minutes worst case.
The main question: Is there a better alternative to split()? I've also tried ddply() with .parallel = TRUE, but it also ran for over an hour and never completed.
Split indexes into pop rather than splitting the data frame itself:
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))
Split is not slow, e.g.,
> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
user system elapsed
1.056 0.000 1.058
so if yours is slow, I guess there's some aspect of your data that slows things down, e.g., ID and code are both factors with many levels, so that their complete interaction, rather than just the level combinations appearing in your data set, is calculated:
> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10
or perhaps you're running out of memory.
Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case since you ask for detectCores()).
par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i,]), pop, func)
Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
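For instance (a toy sketch; sqrt stands in for your vectorized computation):
library(parallel)
x <- runif(1e7)
# pvec() splits x across cores and applies the vectorized function
# to each chunk (non-Windows only, since it forks).
y <- pvec(x, sqrt, mc.cores = 2)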
Also, and really as the initial step, consider identifying the bottlenecks in func; the data is large but not that big, so perhaps parallel evaluation is not needed; maybe you've written PDI-style code instead of idiomatic R code? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to get a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
split(x, f) is slow if x is a factor AND f contains a lot of different elements.
So, this code is fast:
system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
But, this is very slow:
system.time(split(factor(seq_len(1300000)), sample(250000, 1300000, TRUE)))
And this is fast again because there are only 25 groups
system.time(split(factor(seq_len(1300000)), sample(25, 1300000, TRUE)))
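If you're stuck with a many-level factor, one workaround (in the spirit of the index-splitting answer above) is to split positions rather than the factor itself:
x <- factor(seq_len(1300000))
f <- sample(250000, 1300000, TRUE)
idx <- split(seq_along(x), f)  # fast: only an integer vector is split
g1 <- x[idx[[1]]]              # subset x one group at a time, as needed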
For several efforts I'm involved in at the moment, I am running large datasets with numerous parameter combinations through a series of functions. The functions have a wrapper (so I can mclapply) for ease of operation on a cluster. However, I run into two major challenges.
a) My parameter combinations are large (think 20k to 100k). Sometimes particular combinations will fail (e.g. survival is too high and mortality is too low so the model never converges as a hypothetical scenario). It's difficult for me to suss out ahead of time exactly which combinations will fail (life would be easier if I could do that). But for now I have this type of setup:
failsafe <- failwith(NULL, my_wrapper_function)
# This is what I run
# Note that input_variables contains a list of variables in each list item
results <- mclapply(input_variables, failsafe, mc.cores = 72)
# On my local dual core mac, I can't do this so the equivalent would be:
results <- llply(input_variables, failsafe, .progress = 'text')
The skeleton for my wrapper function looks like this:
my_wrapper_function <- function(tlist) {
  run <- tryCatch(my_model(tlist$a, tlist$b, tlist$sA, tlist$Fec, m = NULL),
                  error = function(e) NULL)
  ...
  return(run)
}
Is this the most efficient approach? If for some reason a particular combination of variables crashes the model, I need it to return NULL and carry on with the rest. However, this still fails less than gracefully.
b) Sometimes a certain combination of inputs does not crash the model but takes too long to converge. I set a limit on computation time on my cluster (say 6 hours) so I don't waste resources on something that is stuck. How can I include a timeout such that if a function call takes more than x time on a single list item, it moves on? Calculating the time spent is trivial, but a function mid-simulation can't be interrupted to check the time, right?
Any ideas, solutions or tricks are appreciated!
You may well be able to manage graceful exits upon timeout using a combination of tryCatch() and evalWithTimeout() (renamed withTimeout() in more recent versions) from the R.utils package.
See also this post, which presents similar code and unpacks it in a bit more detail.
require(R.utils)
myFun <- function(x) {Sys.sleep(x); x^2}
## evalWithTimeout() times out evaluation after 3.1 seconds, and then
## tryCatch() handles the resulting error (of class "TimeoutException") with
## grace and aplomb.
myWrapperFunction <- function(i) {
  tryCatch(expr = evalWithTimeout(myFun(i), timeout = 3.1),
           TimeoutException = function(ex) "TimedOut")
}
sapply(1:5, myWrapperFunction)
# [1] "1" "4" "9" "TimedOut" "TimedOut"