Specify number of cores for R doParallel - r

I'm trying to specify a parallel process in R to use 3 out of 4 possible cores on my computer to leave a bit of CPU power for other processes while this runs in the background. My code looks something like this:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
results <- foreach(i = 1:10) %dopar% {
...some processes to be parallelized...
}
stopCluster(cl)
When I run this and look in task manager, all cores are running at 100%. Is there a way to only use 3 cores, or is this not possible?
Thanks!

I'm sure this has been answered elsewhere, but ...
cl <- makePSOCKcluster(detectCores() * .875)
OR
cl <- makePSOCKcluster(detectCores() - 1)
Will work for this.
Check out the help page for detectCores(). One final warning: I once put detectCores() inside a loop thinking it was fast ... it's not, so if you need the value more than a few times, assign it to a variable.
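Putting those two tips together, here is a minimal sketch, assuming you round the fractional worker count explicitly and call detectCores() only once (the sqrt() body is just a placeholder workload):
library(doParallel)
## Compute the worker count once and reuse it; floor() avoids passing a
## fractional value such as detectCores() * .875 to makePSOCKcluster()
n_workers <- max(1, floor(parallel::detectCores() * 0.875))
cl <- makePSOCKcluster(n_workers)
registerDoParallel(cl)
results <- foreach(i = 1:10) %dopar% sqrt(i)   # placeholder workload
stopCluster(cl)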
Finally, I very much favor parallelization using furrr (future_map, etc.) instead of foreach() %dopar% these days.
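For comparison, a minimal furrr sketch of the same pattern, assuming you want to cap the pool at 3 background R sessions (the squaring body is just a placeholder):
library(future)
library(furrr)
plan(multisession, workers = 3)       # at most 3 background R sessions
results <- future_map(1:10, ~ .x^2)   # placeholder workload
plan(sequential)                      # shut the workers down when done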

By calling mcaffinity before running the parallel process, you can limit the number of cores used.
parallel::mcaffinity(1:3)
This allows your R session to use only the first 3 cores.
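Note that, as far as I know, mcaffinity() is only supported on some platforms (notably Linux; not Windows or macOS). A small sketch combining it with forked workers, which inherit the affinity mask (the sqrt() body is a placeholder):
library(parallel)
mcaffinity(1:3)   # pin this R process to the first 3 cores (Linux only)
res <- mclapply(1:10, function(i) sqrt(i), mc.cores = 3)   # forked children inherit the mask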

Related

Can I use the parallel version of for loop and apply family together?

I ran into a problem recently while doing my research: at first I defined a function, myfunction, which contains two for loops, and then used lapply(datalist, myfunction), but the processing was too slow.
Then I learned about two parallel packages, 'foreach' and 'parallel', for parallel computation. So I changed both processes to their parallel versions.
But when I run my code, it seems that the foreach inside my function doesn't work.
myfunction <- function(data) {
df <- foreach (i = 1:200, .combine = "rbind") %:%
foreach(j = 1:200, .combine = "rbind") %dopar% {
*****
process
*****
}
data <- df[1,1]
return(data)
}
system.time({
cl <- detectCores()
cl <- makeCluster(cl)
registerDoParallel(cl)
mat <- t(parSapply(cl, datalist, myfunction))
stopCluster(cl)
})
I feel like it's because parSapply occupies all the cores, so the foreach doesn't have any additional cores to compute on. Is there any good way to fix this? Basically I want both processes to run in their parallel versions.
Another question: suppose we can only choose one process to parallelize, which one should I choose, the for loop or the apply family?
Much appreciated :)
I feel like it's because parSapply occupies all the cores, so the foreach doesn't have any additional cores to compute on. Is there any good way to fix this? Basically I want both processes to run in their parallel versions.
No, that's not a good idea. You're basically trying to over-parallelize here (though, as explained below, that doesn't actually happen in your code).
Another question: suppose we can only choose one process to parallelize, which one should I choose, the for loop or the apply family?
There is no one right answer to that. I recommend that you profile your *** process *** code to figure out how much it gains from parallelization.
So, I found your parSapply(cl, ...) on top of a foreach() %dopar% { ... } using the same cluster cl interesting. It's the first time I've seen it asked/proposed in that way. You certainly don't want to do this, but the question/attempt is not crazy. Your intuition that all workers would be occupied when foreach() %dopar% { ... } attempts to use them is partly correct. However, what is really happening is that the foreach() %dopar% { ... } statement is evaluated in the workers, not in the main R session where the cluster cl was defined. On the workers there are no foreach adaptors registered, so those calls default to sequential processing (== foreach::registerDoSEQ()). To achieve nested parallelization, you'd have to set up and register a cluster within each worker, e.g. inside the myfunction() function.
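To make that last point concrete, here is a rough sketch of what such manual nested parallelization could look like; it is not recommended, and the inner cluster size and the i * j body are placeholders, not the asker's actual workload:
myfunction <- function(data) {
  ## Load the packages on whichever worker ends up running this function
  library(foreach)
  library(doParallel)
  ## Each outer worker registers its own small inner cluster
  inner_cl <- parallel::makeCluster(2)
  registerDoParallel(inner_cl)
  on.exit(parallel::stopCluster(inner_cl), add = TRUE)
  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      i * j   # placeholder for the real work
    }
  df[1, 1]
}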
As the author of the future framework, I'd like to propose that you make use of it. It will protect you against the above mistakes, and it will not over-parallelize either (though you can do that if you really, really want to). Here is how I would rewrite your code example:
library(foreach) ## foreach() and %dopar%
myfunction <- function(data) {
df <- foreach(i = 1:200, .combine = "rbind") %:%
foreach(j = 1:200, .combine = "rbind") %dopar% {
*****
process
*****
}
data <- df[1,1]
return(data)
}
## Tell foreach to parallelize via the future framework
doFuture::registerDoFuture()
## Have the future framework parallelize using a cluster of
## local workers (similar to makeCluster(detectCores()))
library(future)
plan(multisession)
library(future.apply) ## future_sapply()
system.time({
mat <- t(future_sapply(datalist, myfunction))
})
Now, what is important to understand is that the outer future_sapply() parallelization will operate on the 'multisession' cluster. When you get to the inner foreach() %dopar% { ... } parallelization, all that foreach sees is a sequential worker, so that inner layer will be processed sequentially. This is what I mean when I say that the future framework automatically protects you from over-parallelization.
If you'd like to have the inner layer parallelize on a 'multisession' cluster and the outer to be sequential, you can set that up as:
plan(list(sequential, multisession))
If you really want to do nested parallelization, say, two outer-level workers and 4 inner-level workers, you can use:
plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))
This will run 2*4 = 8 parallel R processes at the same time.
What is more useful is when you have multiple machines available: you can use those for the outer level and a multisession cluster on each of them. Something like:
plan(list(tweak(cluster, workers = c("machine1", "machine2")), multisession))
You can read more about this in the future package's vignettes.
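If you want to double-check what a given plan resolves to, the future package's nbrOfWorkers() reports the number of workers at the current (outer) level; a quick sketch using the nested plan from above:
library(future)
plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))
nbrOfWorkers()   # 2, i.e. the outer level of the nested plan
plan(sequential) # reset to sequential processing when done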

what exactly does the first argument in makeCluster function do?

I am new to R programming, as you can tell from the nature of my question. I am trying to take advantage of the parallel computing ability of the train function.
library(parallel)
#detects number of cores available to use for parallel package
nCores <- detectCores(logical = FALSE)
cat(nCores, " cores detected.")
# detect threads with parallel()
nThreads<- detectCores(logical = TRUE)
cat(nThreads, " threads detected.")
# Create doSNOW compute cluster (try 64)
# One can increase up to 128 nodes
# Each node requires 44 Mbyte RAM under WINDOWS.
cluster <- makeCluster(128, type = "SOCK")
class(cluster);
I need someone to help me interpret this code. Originally the first argument of makeCluster() was nThreads, but after running
nCores <- detectCores(logical = FALSE)
I learned that I have 4 threads available. I changed the value based on the comment provided in the guide. Will this enable me to run 128 iterations of the train function at once? If so, what is the point of getting the number of threads and cores my computer has in the first place?
What you want to do first is detect the number of cores you have.
nCores <- detectCores() - 1
Most of the time people subtract 1 so that one core is left free for other work.
cluster <- makeCluster(nCores)
This creates the cluster of workers your code will run on. There are several parallel approaches (doParallel, parApply, parLapply, foreach, ...).
Whichever approach you choose, it will run on the specific cluster you've created.
A small example I used in my own code:
no_cores <- detectCores() - 1
cluster <- makeCluster(no_cores)
result <- parLapply(cluster, docs$text, preProcessChunk)
stopCluster(cluster)
I also see that you're making use of SOCK. I'm not sure whether type = "SOCK" works.
I always use type = "PSOCK". FORK also exists, but whether you can use it depends on your OS.
FORK: "to divide in branches and go separate ways"
Systems: Unix/Mac (not Windows)
Environment: inherited from the master session (via fork)
PSOCK: Parallel Socket Cluster
Systems: All (including Windows)
Environment: empty (objects must be exported explicitly)
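To illustrate the practical difference, a small sketch using only base parallel (big_object and the worker counts are made up for the example):
library(parallel)
n <- max(1, detectCores() - 1)
big_object <- 1:10   # stands in for whatever data the workers need

## PSOCK: works on every OS; workers start empty, so objects must be exported
cl <- makeCluster(n, type = "PSOCK")
clusterExport(cl, "big_object")
res_psock <- parLapply(cl, 1:5, function(i) sum(big_object) + i)
stopCluster(cl)

## FORK: Unix/macOS only; workers inherit the master's environment via fork()
if (.Platform$OS.type == "unix") {
  cl <- makeCluster(n, type = "FORK")
  res_fork <- parLapply(cl, 1:5, function(i) sum(big_object) + i)
  stopCluster(cl)
}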
I am not entirely convinced that the spec argument of parallel::makeCluster() is explicitly the maximum number of cores (actually, logical processors) to use. I've passed detectCores() - 1 and detectCores() - 2 as the spec argument for some computationally expensive processes, and the number of cores in use still equaled detectCores(), despite specifying that a little room be left (here, one logical processor free for other processes).
The example below is crude, as I've not captured any quantitative output of core usage; suggestions for a better test are welcome.
You can visualize core usage by monitoring it (e.g. in Task Manager) while running a simple example:
no_cores <- 5
cl<-makeCluster(no_cores)#, outfile = "debug.txt")
parallel::clusterEvalQ(cl,{
library(foreach)
foreach(i = 1:1e5) %do% {
print(sqrt(i))
}
})
stopCluster(cl)
#browseURL("debug.txt")
Then rerun using, e.g., detectCores() - 1 cores:
no_cores <- parallel::detectCores()-1
cl<-makeCluster(no_cores)#, outfile = "debug.txt")
parallel::clusterEvalQ(cl,{
library(foreach)
foreach(i = 1:1e5) %do% {
print(sqrt(i))
}
})
stopCluster(cl)
All 16 cores appear to engage despite no_cores being specified as 15.
Based on the above example and my very crude (visual-only) analysis, it looks like the spec argument may set the number of worker processes used throughout the job, but it does not appear to cap how many cores are engaged at once. Being a novice parallelizer, perhaps a more appropriate example is needed to reject or support this?
The package documentation suggests spec is "A specification appropriate to the type of cluster."
I've dug into the relevant parallel documentation and cannot determine what, exactly, spec is doing, but I am not convinced the argument necessarily controls the maximum number of cores (logical processors) to engage.
Here is where I think I could be wrong in my assumptions: if we specify spec as less than the number of the machine's cores (logical processors), then, assuming no other large processes are running, the machine should never reach full CPU usage across all cores (i.e., the 1600% maximum with 16 logical processors).
However, when I monitor the CPUs on Windows using Resource Monitor, it does appear that there are, in fact, no_cores images of Rscript.exe running.
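One way to see what makeCluster() actually creates, independent of Task Manager, is to ask each worker for its process ID: spec fixes the number of worker processes, while the operating system decides which logical processors they run on. A minimal sketch:
library(parallel)
cl <- makeCluster(5)
unlist(clusterCall(cl, Sys.getpid))   # five distinct worker process IDs
stopCluster(cl)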

Parallel computing taking same or more time

I am trying to set up parallel computing in R for a large simulation, but I noticed that there is no improvement in time.
I tried a simple example:
library(foreach)
library(doParallel)
stime<-system.time(for (i in 1:10000) rnorm(10000))[3]
print(stime)
10.823
cl<-makeCluster(2)
registerDoParallel(cores=2)
stime<-system.time(ls<-foreach(s = 1:10000) %dopar% rnorm(10000))[3]
stopCluster(cl)
print(stime)
29.526
The system time is more than twice as much as it was in the original case without parallel computing.
Obviously I am doing something wrong but I cannot figure out what it is.
Performing many tiny tasks in parallel can be very inefficient. The standard solution is to use chunking:
ls <- foreach(s=1:2) %dopar% {
for (i in 1:5000) rnorm(10000)
}
Instead of executing 10,000 tiny tasks in parallel, this loop executes two larger tasks, and runs almost twice as fast as the sequential version on my Linux machine.
Also note that your "foreach" example is actually sending a lot of data from the workers to the master. My "foreach" example throws that data away just like your sequential example, so I think it's a better comparison.
If you need to return a large amount of data then a fair comparison would be:
ls <- lapply(rep(10000, 10000), rnorm)
versus:
ls <- foreach(s=1:2, .combine='c') %dopar% {
lapply(rep(10000, 5000), rnorm)
}
On my Linux machine the times are 8.6 seconds versus 7.0 seconds. That's not impressive due to the large communication to computation ratio, but it would have been much worse if I hadn't used chunking.
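If you would rather not hard-code the chunk boundaries, here is a sketch of the same chunking idea that splits the 10,000 iterations into one chunk per registered worker using itertools::isplitIndices() (the two-worker cluster and the rnorm workload mirror the example above):
library(doParallel)   # also attaches foreach
library(itertools)

cl <- makeCluster(2)
registerDoParallel(cl)
ls <- foreach(idx = isplitIndices(10000, chunks = getDoParWorkers()),
              .combine = 'c') %dopar% {
  lapply(rep(10000, length(idx)), rnorm)
}
stopCluster(cl)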

R: Foreach Parallelized

I want to run a function 100 times. The function itself contains a for loop that runs 4000 times. I put my code on EC2 to run it on multiple cores, but I am not sure I am doing it correctly, as it doesn't reveal whether it's actually utilizing all the cores. Does the code below make sense?
#arbitrary function:
x = function() {
y=c()
for(i in 1:4000){
y=c(y,i)
}
return(y)
}
#helper Function
loop.helper<-function(n.times){
results = list()
for(i in 1:n.times){
results[[i]] = x()
}
return(results)
}
#Parallel
require(foreach)
require(parallel)
require(doParallel)
cores = detectCores() #32
cl<-makeCluster(cores) #register cores
registerDoParallel(cl, cores = cores)
This is my problem: I am not sure whether it should be this:
out <- foreach(i=1:cores) %dopar% {
loop.helper(n.times = 100)
}
or should it be this:
out <- foreach(i=1:100) %dopar% {
x()
}
Both of them work, but I am not sure whether the first one will distribute the tasks across the 32 cores I have, or whether the second foreach implementation does that automatically.
thanks
out <- foreach(i=1:100) %dopar% {
x()
}
Is the correct way to do it. The foreach package will automatically distribute the 100 tasks among the registered cores (32 cores, in your case).
If you read the package documentation, you can read some of the examples and it should become extra clear to you.
EDIT:
To respond to @user1234440's comment:
Some considerations:
There is some time required to set up and manage the parallel tasks (e.g. setting up the multiple jobs to run concurrently, and then combining the results at the end). For some trivial tasks or small jobs, sometimes running a parallel process takes longer than the simple sequential loop simply because setting up the parallel processes takes up more time than it saves. However, for most tasks that require some non-trivial computations, you will likely experience speed improvements.
Also, from what I have read, you will see diminishing returns as you use more cores (e.g. using 8 cores may not necessarily be 2x faster than using 4 cores, but may only be 1.5x faster). In addition, from my personal experience, using ALL the available cores on my system resulted in some performance degradation. I think this was because I was dedicating all of my system resources to the parallel job and it was slowing down my other system processes.
That being said, I have almost always experienced speed improvements when using the parallel processing power offered by the foreach function. For your example of running 100 jobs with 32 cores, 4 cores will receive 4 jobs, and the other 28 cores will receive 3 jobs. Now it will be as if 32 computers are running mini for loops, iterating through the 4 or 3 jobs that were distributed to each of the cores. After each loop is completed, the results are combined and returned to you.
If running the 100 tasks completes faster with a simple for loop than with a parallel foreach loop, then running those 100 tasks 4000 times in a regular for loop will also be faster than running them 4000 times in a parallelized foreach loop.
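If in doubt, it is easy to time both versions of your own workload before committing to one; a rough sketch using the x() function from the question (the 4-worker cluster is just an example):
library(doParallel)   # also attaches foreach

cl <- makeCluster(4)
registerDoParallel(cl)
t_seq <- system.time(lapply(1:100, function(i) x()))[["elapsed"]]
t_par <- system.time(foreach(i = 1:100) %dopar% x())[["elapsed"]]   # x() is auto-exported to the workers
stopCluster(cl)
c(sequential = t_seq, parallel = t_par)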
Since you want to execute the function "x" 100 times, you can do that with:
out <- foreach(i=1:100) %dopar% {
x()
}
This correctly returns a list of 100 vectors. Your other solution is wrong because it will execute the function "x" cores * 100 times, returning a list of cores lists of 100 vectors.
You may be confused because it is common to write parallel loops that use one iteration for each core. For instance, you could also execute "x" 100 times like this:
out <- foreach(i=1:cores, .combine='c') %dopar% {
results <- vector('list', 25)
for (j in 1:25) {
results[[j]] <- x()
}
results
}
This also returns a list of 100 vectors when cores is 4 (4 * 25 = 100), and it will be somewhat more efficient. This technique is called "task chunking", and it can give significantly better performance when the tasks are short. Your helper-based solution is almost like this, except the helper function should execute fewer iterations, and the resulting lists should be combined, which I do by using c as the combine function.
It's important to realize that you can't control the number of cores that are used via the iteration variable in the foreach loop: that is controlled via the registerDoParallel function. But most parallel backends, including doParallel, will map "cores" tasks to "cores" workers. It's also important to realize that you don't truly control the number of cores that will be used by the "cores" worker processes. You control the number of processes that will be created to execute tasks when you call makeCluster, but ultimately it is up to the operating system to schedule those processes on the cores of the CPU, so the "cores" argument is something of a misnomer.
Also note that for your example, you should call registerDoParallel as:
registerDoParallel(cl)
Since you specified a value for the cl argument, the cores argument is ignored; however, the documentation doesn't make that clear.

How to avoid duplicating objects with foreach

I have a huge string vector and would like to do parallel computing using the foreach and doSNOW packages. I noticed that foreach makes copies of the vector for each process, which exhausts system memory quickly. I tried to break the vector into smaller pieces in a list object, but I still do not see any reduction in memory usage. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)
x<-rep('some string', 200000000)
# split x into smaller pieces in a list object
splits<-getsplits(x, mode='bysize', size=1000000)
tt<-vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]]<-x[splits$start[i]: splits$end[i]]
ret<-foreach(i = 1:length(splits$start), .export=c('somefun'), .combine=c) %dopar% somefun(tt[[i]])
The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt through the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they each only need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
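For completeness, a sketch of that file-based variant, reusing chunkSize and somefun from the example above and assuming a hypothetical one-string-per-line file named 'strings.txt':
library(foreach)
library(itertools)

## Requires a registered foreach backend, e.g. the doMPI cluster from above
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}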
