I'm having some trouble getting some nested foreach loops to run in parallel. Here's the situation:
This program essentially performs hypothesis tests using different numbers of observations and different levels of statistical significance. I have four nested foreach loops. The item data.structures is a list of matrices on which the tests are performed. There are two different lists I'm using for data.structures. One list contains 243 matrices (the small list), and the other contains 19,683 (the large list).
observations = c(50,100,250,500,1000)
significance.levels = c(.001,.01,.05,.1,.15)

require(foreach)
require(doParallel)
cl = makeCluster(detectCores())
registerDoParallel(cl)

# 'data.structures' (the list of matrices) and 'iterations' are defined earlier
results = foreach(data=data.structures, .inorder=FALSE, .combine='rbind') %:%
  foreach(iter=1:iterations, .inorder=FALSE, .combine='rbind') %:%
    foreach(number.observations=observations, .inorder=FALSE, .combine='rbind') %:%
      foreach(alpha=significance.levels, .inorder=FALSE, .combine='rbind') %dopar% {
        #SOME FUNCTIONS HERE
      }
When I use the small list of matrices for data.structures, I can see all of the cores being fully utilized (100 percent CPU usage) in Windows' Resource Monitor, with six threads each for eight processes, and the job completes as expected in a much shorter amount of time. When I switch to the larger list of matrices, however, the processes are initiated and can be seen in the Processes section of the Resource Monitor, but each of the eight processes shows only three threads and no CPU activity. The total CPU usage is approximately 12 percent.
I'm new to parallelization with R. Even if I simplify the problem and the functions, I'm still only able to get the program to run in parallel with the small list. From my own reading, I'm wondering if this is an issue with workload distribution. I've included the .inorder = FALSE option to try to work around this, to no avail. I'm fairly certain that this program is a good candidate for parallelization because it performs the same task hundreds of thousands of times and the loops don't depend on previous values.
Any help is tremendously appreciated!
A similar issue happened in my code too.
y = foreach(a = seq(1,500,1), .combine='rbind') %:%
  foreach(b = seq(1,10,1), .combine='rbind') %:%
    foreach(c = seq(1,20,1), .combine='rbind') %:%
      foreach(d = seq(1,50,1), .combine='rbind') %do% {
        data.frame(a,b,c,d)
      }
These are very simple nested foreach loops; they execute fine, just not in parallel.
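For comparison, here is a minimal sketch of what the parallel form of that loop would look like, assuming a local doParallel backend (the detectCores() - 1 worker count is just an illustrative choice). The key differences are that a backend is registered and the innermost operator is %dopar% rather than %do%, which, as one of the answers further down notes, always runs on a single core. For a body this trivial the parallel overhead will likely dominate, but the structure is the point:

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)  # leave one core free; illustrative choice
registerDoParallel(cl)

y <- foreach(a = seq(1, 500, 1), .combine = 'rbind') %:%
  foreach(b = seq(1, 10, 1), .combine = 'rbind') %:%
    foreach(c = seq(1, 20, 1), .combine = 'rbind') %:%
      foreach(d = seq(1, 50, 1), .combine = 'rbind') %dopar% {
        data.frame(a, b, c, d)   # same body as above, now evaluated on the workers
      }

stopCluster(cl)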
I ran into a problem recently while doing my research: at first I defined a function, myfunction, which contains two for loops, and then called lapply(datalist, myfunction), but the processing was too slow.
Then I learned about two parallelization packages, 'foreach' and 'parallel', and changed both levels to their parallel versions.
But when I run my code, the foreach inside my function doesn't seem to work.
myfunction <- function(data) {
  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      ## *****
      ## process
      ## *****
    }
  data <- df[1,1]
  return(data)
}
library(parallel)    # detectCores(), makeCluster(), parSapply()
library(doParallel)  # registerDoParallel()

system.time({
  cl <- detectCores()
  cl <- makeCluster(cl)
  registerDoParallel(cl)
  mat <- t(parSapply(cl, datalist, myfunction))
  stopCluster(cl)
})
I suspect this is because parSapply occupies all of the cores, so foreach has no additional cores to compute on. Is there a good way to fix this? Basically, I want both levels to run in their parallel versions.
Another question: suppose we can only choose one of the two levels to parallelize; which one should I choose, the for loop or the apply family?
Much appreciated :)
I suspect this is because parSapply occupies all of the cores, so foreach has no additional cores to compute on. Is there a good way to fix this? Basically, I want both levels to run in their parallel versions.
Nah, that's not a good idea. You're basically trying to over-parallelize here (but that doesn't actually happen in your code, as explained below).
Another question: suppose we can only choose one of the two levels to parallelize; which one should I choose, the for loop or the apply family?
There is no one right answer to that. I recommend that you profile your *** process *** code to figure out how much it gains from parallelization.
So, I found your parSapply(cl, ...) on top of a foreach() %dopar% { ... } using the same cluster cl interesting. It's the first time I've seen this asked/proposed in that way. You certainly don't want to do this, but the question/attempt is not crazy. Your intuition that all workers would be occupied when foreach() %dopar% { ... } attempts to use them is partly correct. However, what is really happening is also that the foreach() %dopar% { ... } statement is evaluated in the workers, not in the main R session where the cluster cl was defined. On the workers, there are no foreach adaptors registered, so those calls will default to sequential processing (== foreach::registerDoSEQ()). To achieve nested parallelization, you'd have to set up and register a cluster within each worker, e.g. inside the myfunction() function.
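For completeness only, since the answer advises against this and recommends the future-based rewrite below instead, here is a minimal sketch of what "set up and register a cluster within each worker" could look like. The inner_cl name, the choice of 2 inner workers, and the i * j stand-in for the real *** process *** body are illustrative assumptions:

myfunction <- function(data) {
  # These packages must be loaded on the worker itself, since this function
  # runs inside a parSapply() worker process.
  library(foreach)
  library(doParallel)

  # Register a small cluster *inside* the worker so that %dopar% has a
  # backend to use there (2 inner workers, purely as an example).
  inner_cl <- makeCluster(2)
  registerDoParallel(inner_cl)
  on.exit(stopCluster(inner_cl), add = TRUE)

  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      i * j   # stand-in for the real *** process ***
    }
  df[1, 1]
}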
As the author of the future framework, I'd like to propose you make use of that. It'll protect you against the above mistakes and it will not over-parallelize either (though you can do it if you really, really want to). Here is how I would rewrite your code example:
library(foreach) ## foreach() and %dopar%

myfunction <- function(data) {
  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      ## *****
      ## process
      ## *****
    }
  data <- df[1,1]
  return(data)
}
## Tell foreach to parallelize via the future framework
doFuture::registerDoFuture()
## Have the future framework parallelize using a cluster of
## local workers (similar to makeCluster(detectCores()))
library(future)
plan(multisession)
library(future.apply) ## future_sapply()
system.time({
  mat <- t(future_sapply(datalist, myfunction))
})
Now, what is important to understand is that the outer future_sapply() parallelization will operate on the 'multisession' cluster. When you get to the inner foreach() %dopar% { ... } parallelization, all that foreach sees is a sequential worker, so that inner layer will be processed sequentially. This is what I mean when I say that the future framework automatically protects you from over-parallelization.
If you'd like to have the inner layer parallelize on a 'multisession' cluster and the outer to be sequential, you can set that up as:
plan(list(sequential, multisession))
If you really want to do nested parallelization, say, two outer-level workers and 4 inner-level workers, you can use:
plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))
This will run 2*4 = 8 parallel R processes at the same time.
What is more useful is when you have multiple machines available, then you can use those for the outer level and then use a multisession cluster on each of them. Something like:
plan(list(tweak(cluster, workers = c("machine1", "machine2")), multisession))
You can read more about this in future vignettes.
Is there a way to modify how an R foreach loop does load balancing with the doParallel backend? When parallelizing tasks that have very different execution times, it can happen that all nodes but one have finished their tasks while the last one still has several tasks to do. Here is a toy example:
library(foreach)
library(doParallel)
registerDoParallel(4)
waittime = c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)
w = iter(waittime)
foreach(i=w) %dopar% {
  message(paste("waiting", i, "on", Sys.getpid()))
  Sys.sleep(i)
}
Basically, the code registers 4 cores. For each iteration i, the task is to wait waittime[i] seconds. However, the default load balancing in the foreach loop seems to split the total set of tasks into chunks, one per registered core, so the first core receives all of the tasks with waittime = 10 while the other three receive only tasks with waittime = 1; those three cores therefore finish all of their tasks before the first one has finished its first.
Is there a way to make foreach() distribute tasks one at a time? i.e., in the above case, I'd like the first 4 tasks to be distributed among the 4 cores, and then each subsequent task to be distributed to the next available core.
Thanks.
I haven't tested it myself, but the doParallel backend provides a preschedule option akin to the mc.preschedule argument in mclapply(). (See section 7 of the doParallel vignette.)
You might try:
mcoptions <- list(preschedule = FALSE)
foreach(i = w, .options.multicore = mcoptions) %dopar% {
  message(paste("waiting", i, "on", Sys.getpid()))
  Sys.sleep(i)
}
Apologies for posting as an answer but I have insufficient rep to comment. Is it possible that you could rewrite your code to make use of parLapplyLB or parSapplyLB?
parLapplyLB, parSapplyLB are load-balancing versions, intended for use when applying FUN to different elements of X takes quite variable amounts of time, and either the function is deterministic or reproducible results are not required.
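For illustration, a minimal sketch of the toy wait-time example above rewritten with parallel::parLapplyLB, the load-balancing version described in the quoted documentation. The 4-worker cluster mirrors the registerDoParallel(4) call in the question; outfile = "" is only there so worker messages are visible:

library(parallel)

cl <- makeCluster(4, outfile = "")  # 4 workers; print worker messages to the console
waittime <- c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)

# parLapplyLB is the load-balancing version of parLapply, intended for tasks
# with quite variable run times such as these.
res <- parLapplyLB(cl, waittime, function(i) {
  message(paste("waiting", i, "on", Sys.getpid()))
  Sys.sleep(i)
  i
})

stopCluster(cl)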
I am having trouble understanding how to make my code parallel. I want to find the 3 vectors out of a matrix of 20 that produce the closest linear regression to my measured variable (which means there are a total of 1140 different combinations). Currently I use 3 nested foreach loops that return the best vectors. However, I would like the outer loop (or all of them?) to run in parallel. Any help would be appreciated!
Here is my code:
NIR = matrix(rexp(80, rate=0.01), ncol = 20, nrow = 4) # creating the matrix with 20 column vectors (4 x 20 = 80 values)
colnames(NIR) = c(1:20)
S.measured = c(7,9,11,13) # measured variable
bestvectors <- matrix(data = NA, ncol = 3+1, nrow = 1) # matrix to save the best result in
numcols <- ncol(NIR) # 20

###### Parallel stuff
library(foreach)
library(doParallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)

# nested foreach loop to exhaustively find the best vectors
foreach(i=1:numcols) %:%
  foreach(j=i:numcols) %:%
    foreach(k=j:numcols) %do% {
      if(i==j | i==k | j==k){ # to prevent the same vector from being used twice
      }
      else{
        lm <- lm(S.measured ~ NIR[,c(i,j,k)] - 1) # linear regression without intercept
        S.pred <- as.matrix(lm$fitted.values) # predicted vector to be compared with the measured one
        error <- sqrt(sum(((S.pred - S.measured)/S.measured)^2)) # the 'error' we want to minimize
        # if the error is smaller than the best one so far, replace it; otherwise nothing changes
        if(error < as.numeric(bestvectors[1,3+1]) | is.na(bestvectors[1,3+1])){
          bestvectors[1,] <- c(colnames(NIR)[i], colnames(NIR)[j], colnames(NIR)[k], as.numeric(error))
          bestvectors[,3+1] <- as.numeric(bestvectors[,3+1])
        }
      }
    }
General advice for using foreach:
Use foreach(i=1:numcols) %dopar% { ... } if you would like your code to run on multiple cores; the %do% operator evaluates the loop sequentially on a single core.
Processes spawned by %dopar% cannot communicate with each other while the loop is running, so set up your code to output an R object, like a data.frame or vector, and do the comparison afterwards. In your case, the logic in the if(error<as.numeric ... line should be executed sequentially (not in parallel) after your main foreach loop.
Behavior of nested %dopar% loops is inconsistent across operating systems and it is unclear how they spawn processes across cores. For best performance and portability, use a single %dopar% foreach for the outermost loop and vanilla for loops within it. A sketch of this restructuring applied to your code follows below.
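A minimal sketch of that restructuring, reusing NIR and S.measured from the question and a doParallel cluster; the names all.errors and bestvector are illustrative. Each parallel task returns a small data.frame of errors, and the best combination is picked sequentially afterwards:

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

numcols <- ncol(NIR)

# Outer loop in parallel, plain for loops inside; each task returns its results
# as a data.frame instead of updating shared state.
all.errors <- foreach(i = 1:numcols, .combine = 'rbind') %dopar% {
  out <- data.frame(i = integer(0), j = integer(0), k = integer(0), error = numeric(0))
  for (j in i:numcols) {
    for (k in j:numcols) {
      if (i == j || i == k || j == k) next  # skip repeated vectors
      fit <- lm(S.measured ~ NIR[, c(i, j, k)] - 1)
      S.pred <- as.matrix(fit$fitted.values)
      error <- sqrt(sum(((S.pred - S.measured) / S.measured)^2))
      out <- rbind(out, data.frame(i = i, j = j, k = k, error = error))
    }
  }
  out
}

stopCluster(cl)

# Sequential comparison after the parallel loop: pick the smallest error.
bestvector <- all.errors[which.min(all.errors$error), ]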
I want to run a function 100 times. The function itself contains a for loop that runs 4000 times. I put my code on EC2 to run it on multiple cores, but I am not sure I am doing it correctly, as I can't tell whether it's actually utilizing all the cores. Does the code below make sense?
#arbitrary function:
x = function() {
  y = c()
  for(i in 1:4000){
    y = c(y, i)
  }
  return(y)
}

#helper function
loop.helper <- function(n.times){
  results = list()
  for(i in 1:n.times){
    results[[i]] = x()
  }
  return(results)
}
#Parallel
require(foreach)
require(parallel)
require(doParallel)
cores = detectCores() #32
cl<-makeCluster(cores) #create a cluster with 32 workers
registerDoParallel(cl, cores = cores)
This is my problem: I am not sure whether it should be this:
out <- foreach(i=1:cores) %dopar% {
  loop.helper(n.times = 100)
}
or should it be this:
out <- foreach(i=1:100) %dopar% {
  x()
}
Both of them work, but I am not sure if the first one will distribute the task to the 32 cores I have or does it automatically do it in the second foreach loop implementation.
thanks
out <- foreach(i=1:100) %dopar% {
  x()
}
Is the correct way to do it. The foreach package will automatically distribute the 100 tasks among the registered cores (32 cores, in your case).
If you read the package documentation and work through some of the examples, it should become much clearer.
EDIT:
To respond to @user1234440's comment:
Some considerations:
There is some time required to set up and manage the parallel tasks (e.g. setting up the multiple jobs to run concurrently, and then combining the results at the end). For some trivial tasks or small jobs, sometimes running a parallel process takes longer than the simple sequential loop simply because setting up the parallel processes takes up more time than it saves. However, for most tasks that require some non-trivial computations, you will likely experience speed improvements.
Also, from what I have read, you will see diminishing returns as you use more cores (e.g. using 8 cores may not necessarily be 2x faster than using 4 cores, but may only be 1.5x faster). In addition, from my personal experience, using ALL the available cores on my system resulted in some performance degradation. I think this was because I was dedicating all of my system resources to the parallel job and it was slowing down my other system processes.
That being said, I have almost always experienced speed improvements when using the parallel processing power offered by the foreach function. For your example of running 100 jobs with 32 cores, 4 cores will receive 4 jobs, and the other 28 cores will receive 3 jobs. Now it will be as if 32 computers are running mini for loops, iterating through the 4 or 3 jobs that were distributed to each of the cores. After each loop is completed, the results are combined and returned to you.
If the 100 tasks complete faster with a simple for loop than with a parallel foreach loop, then running those 100 tasks 4000 times in a regular for loop will also be faster than running them 4000 times in a parallelized foreach loop.
Since you want to execute the function "x" 100 times, you can do that with:
out <- foreach(i=1:100) %dopar% {
  x()
}
This correctly returns a list of 100 vectors. Your other solution is wrong because it will execute the function "x" cores * 100 times, returning a list of cores lists of 100 vectors.
You may be confused because it is common to write parallel loops that use one iteration for each worker. For instance, with 4 registered workers (so that 4 * 25 = 100), you could also execute "x" 100 times like this:
out <- foreach(i=1:cores, .combine='c') %dopar% {
  results <- vector('list', 25)
  for (j in 1:25) {
    results[[j]] <- x()
  }
  results
}
This also returns a list of 100 vectors, and it will be somewhat more efficient. This technique is called "task chunking", and it can give significantly better performance when the tasks are short. Your first solution (the one using the helper function) is almost like this, except the helper function should execute fewer iterations and the resulting lists should be combined, which I do by using c as the combine function.
It's important to realize that you can't control the number of cores that are used via the iteration variable in the foreach loop: that is controlled via the registerDoParallel function. But most parallel backends, including doParallel, will map "cores" tasks onto "cores" workers, one task per worker. It's also important to realize that you don't truly control the number of CPU cores that will be used by those "cores" worker processes. You control the number of processes that are created when you call makeCluster, but ultimately it is up to the operating system to schedule those processes on the cores of the CPU, so the "cores" argument is something of a misnomer.
Also note that for your example, you should call registerDoParallel as:
registerDoParallel(cl)
Since you specified a value for the cl argument, the cores argument is ignored; however, the documentation doesn't make that clear.
I have a huge string vector and would like to do parallel computing using the foreach and doSNOW packages. I noticed that foreach makes copies of the vector for each process, thus exhausting system memory quickly. I tried to break the vector into smaller pieces in a list object, but still do not see any memory usage reduction. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)

# register a doSNOW backend (4 workers here, as an example)
cl <- makeCluster(4, type = 'SOCK')
registerDoSNOW(cl)

x <- rep('some string', 200000000)

# split x into smaller pieces in a list object
# (getsplits() and somefun() are my own helper functions)
splits <- getsplits(x, mode = 'bysize', size = 1000000)
tt <- vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]] <- x[splits$start[i]:splits$end[i]]

ret <- foreach(i = 1:length(splits$start), .export = c('somefun'), .combine = c) %dopar% {
  somefun(tt[[i]])
}
The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
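For reference, a minimal sketch of that suggestion (iterating directly over tt rather than over indices), so each task is sent only its own chunk instead of the whole list; tt and somefun are the objects from the question, and the 4-worker cluster is an illustrative choice:

library(foreach)
library(doSNOW)

cl <- makeCluster(4, type = 'SOCK')
registerDoSNOW(cl)

# Iterating over 'tt' itself means each task receives only its own chunk;
# the full list is not auto-exported because it is never referenced in the body.
ret <- foreach(chunk = tt, .export = c('somefun'), .combine = c) %dopar% {
  somefun(chunk)
}

stopCluster(cl)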
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
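A minimal sketch of that file-based variant, reusing the doMPI setup and the somefun placeholder from the example above; 'strings.txt' is the hypothetical file named in the suggestion, and ireadLines() hands out chunkSize lines at a time so the full vector never has to be loaded at once:

suppressMessages(library(doMPI))
library(iterators)   # ireadLines()

cl <- startMPIcluster()
registerDoMPI(cl)

chunkSize <- 1000000
somefun <- function(s) toupper(s)

# The master reads 'chunkSize' lines at a time and sends each chunk to a
# worker, so the whole file is never held in memory at once.
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}

print(length(ret))
closeCluster(cl)
mpi.quit()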