What exactly does the first argument of the makeCluster function do? - R

I am new to R programming, as you can tell from the nature of my question. I am trying to take advantage of the parallel computing ability of the train function.
library(parallel)
#detects number of cores available to use for parallel package
nCores <- detectCores(logical = FALSE)
cat(nCores, " cores detected.")
# detect threads with parallel()
nThreads<- detectCores(logical = TRUE)
cat(nThreads, " threads detected.")
# Create doSNOW compute cluster (try 64)
# One can increase up to 128 nodes
# Each node requires 44 Mbyte RAM under WINDOWS.
cluster <- makeCluster(128, type = "SOCK")
class(cluster);
I need someone to help me interpret this code. Originally the first argument of makeCluster() was nThreads, but after running
nCores <- detectCores(logical = FALSE)
I learned that I have 4 threads available. I changed the value based on the message provided in the guide. Will this enable me to run 128 iterations of the train function simultaneously? If so, what is the point of getting the number of threads and cores that my computer has in the first place?

What you want to do first is detect the number of cores you have:
nCores <- detectCores() - 1
Most of the time people subtract 1 so that one core stays free for other work.
cluster <- makeCluster(nCores)
This sets the number of worker processes your code will run on. There are several parallel methods (doParallel, parApply, parLapply, foreach, ...).
Whichever parallel method you choose, it will run on the cluster you've created.
A small example I used in my own code:
no_cores <- detectCores() - 1
cluster <- makeCluster(no_cores)
result <- parLapply(cluster, docs$text, preProcessChunk)
stopCluster(cluster)
I also see that you're making use of SOCK. I'm not sure whether type = "SOCK" works; I always use type = "PSOCK". FORK also exists, but whether you can use it depends on which OS you're running:
FORK: "to divide in branches and go separate ways"
Systems: Unix/Mac (not Windows)
Environment: workers are forked copies of the master process, so they share its loaded packages and objects
PSOCK: Parallel Socket Cluster
Systems: All (including Windows)
Environment: workers start as empty R sessions, so packages and objects must be exported to them
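To tie this back to the question about train(): caret's train() will use whatever foreach backend is registered, so the usual pattern is to size a PSOCK cluster from detectCores() and register it with doParallel. The following is only a minimal sketch of that pattern (the lm model and 5-fold CV are placeholders of mine, not anything from the original guide):
library(parallel)
library(doParallel)
library(caret)

nCores  <- detectCores(logical = FALSE)            # physical cores
cluster <- makeCluster(max(1, nCores - 1), type = "PSOCK")
registerDoParallel(cluster)                        # train() picks this backend up

# train() farms its resampling fits out to the registered workers; the number
# of simultaneous fits is capped by the cluster size, not by a value like 128.
fit <- train(Sepal.Length ~ ., data = iris, method = "lm",
             trControl = trainControl(method = "cv", number = 5))

stopCluster(cluster)
registerDoSEQ()                                    # return foreach to sequential mode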

I am not entirely convinced that the spec argument of parallel::makeCluster is strictly the maximum number of cores (more precisely, logical processors) to use. I've passed detectCores()-1 and detectCores()-2 as spec on some computationally expensive processes, and the number of cores in use still equalled detectCores(), despite my intention to leave a little room (one or two logical processors free for other processes).
The example below is crude, as I haven't captured any quantitative measure of core usage.
You can visualize core usage by monitoring, e.g., Task Manager while running a simple example:
library(parallel)
no_cores <- 5
cl <- makeCluster(no_cores)  # , outfile = "debug.txt")
parallel::clusterEvalQ(cl, {
  library(foreach)
  foreach(i = 1:1e5) %do% {
    print(sqrt(i))
  }
})
stopCluster(cl)
# browseURL("debug.txt")
Then rerun using, e.g., detectCores() - 1:
no_cores <- parallel::detectCores() - 1
cl <- makeCluster(no_cores)  # , outfile = "debug.txt")
parallel::clusterEvalQ(cl, {
  library(foreach)
  foreach(i = 1:1e5) %do% {
    print(sqrt(i))
  }
})
stopCluster(cl)
All 16 cores appear to engage despite no_cores being specified as 15.
Based on the above example and my very crude (visual-only) analysis, it looks possible that the spec argument sets the maximum number of cores to use over the course of the process, but the work does not appear to be confined to that many cores at any one time. Being a novice parallelizer, perhaps a more appropriate example is necessary to reject or support this?
The package documentation suggests spec is "A specification appropriate to the type of cluster."
I've dug into the relevant parallel documentation and cannot determine what, exactly, spec is doing, but I am not convinced the argument necessarily controls the maximum number of cores (logical processors) to engage.
Here is where I think I could be wrong in my assumptions: if we specify spec as less than the number of the machine's cores (logical processors), then, assuming no other large processes are running, the machine should never reach no_cores times 100% CPU usage (i.e., it should stay below the 1600% maximum possible with 16 cores).
However, when I monitor the CPUs on Windows using Resource Monitor, it does appear that there are, in fact, no_cores Rscript.exe images running.
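One way to make this less visual (my own addition, not part of the original post): ask each worker for its process ID. This shows that spec controls how many worker R processes get launched; which of the machine's cores those processes land on is then up to the operating system scheduler, which is why activity can appear across all 16 logical processors even with only 15 workers.
library(parallel)

no_cores <- 5
cl <- makeCluster(no_cores)

worker_pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
length(unique(worker_pids))  # 5 distinct R processes, one per unit of spec

stopCluster(cl)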

Related

Specify number of cores for R doParallel

I'm trying to specify a parallel process in R to use 3 out of 4 possible cores on my computer to leave a bit of CPU power for other processes while this runs in the background. My code looks something like this:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
results <- foreach(i = 1:10) %dopar% {
  ...some processes to be parallelized...
}
stopCluster(cl)
When I run this and look in task manager, all cores are running at 100%. Is there a way to only use 3 cores, or is this not possible?
Thanks!
I'm sure this has been answered elsewhere, but ...
cl <- makePSOCKcluster(detectCores() * .875)
OR
cl <- makePSOCKcluster(detectCores() - 1)
Either will work for this.
Check out the help page for detectCores(). One final warning: I once put detectCores() inside a loop thinking it was fast... it's not, so if you need it more than a few times, assign it to a variable.
Finally, I very much favor parallelization using furrr (future_map, etc.) instead of foreach() %dopar% these days.
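For completeness, here is a rough sketch of that furrr pattern (my own toy payload, not the asker's code): plan() fixes the number of workers, and future_map() then behaves like purrr::map() but runs across them.
library(furrr)

plan(multisession, workers = 3)   # leave one of the 4 cores free

results <- future_map(1:10, function(i) {
  mean(rnorm(1e6, mean = i))      # placeholder for the real work
})

plan(sequential)                  # shut the workers down when finished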
Using mcaffinity before running the parallel process, you can limit the number of cores used.
parallel::mcaffinity(1:3)
This allows your R process to run only on the first 3 cores. (Note that mcaffinity is only supported on some platforms, such as Linux, and returns NULL where affinity control is unavailable.)

R snowfall parallel, Rscript.exe goes inactive one by one with time

I am using sfApply from the R snowfall package for parallel computing. There are 32000 tests to run. The code works fine when the computation starts: it creates 46 Rscript.exe processes, each with about 2% CPU usage, the overall CPU usage is about 100%, and the results are continually written to disk. The computation usually takes tens of hours. The strange thing is that the Rscript.exe processes gradually become inactive (CPU usage = 0) one by one, and the corresponding CPUs go idle too. After two days, only half of the Rscript.exe processes are still active judging by CPU usage, and overall CPU usage drops to 50%. However, the work is far from finished. As time goes by, more and more Rscript.exe processes go inactive, which makes the job take very, very long. I am wondering what makes the processes and CPU cores go inactive?
My computer has 46 logical cores. I am using R 3.4.0 from RStudio on 64-bit Windows 7. The 'test' variable below is a 32000 x 2 matrix, and myfunction solves several differential equations.
Thanks.
library(snowfall)
sfInit(parallel=TRUE, cpus=46)
Sys.time()
sfLibrary(deSolve)
sfExport("myfunction","test")
res<-sfApply(test,1,function(x){myfunction(x[1],x[2])})
sfStop()
Sys.time()
What you're describing sounds reasonable since snowfall::sfApply() uses snow::parApply() internally, which chunks up your data (test) into (here) 46 chunks and sends each chunk out to one of the 46 R workers. When a worker finishes its chunk, there is no more work for it and it'll just sit idle while the remaining chunks are processed by the other workers.
What you want to do is split up your data into smaller chunks so that each worker processes more than one chunk on average. I don't know if that is possible with snowfall (I don't think it is). The parallel package, which is part of R itself and which replaces the snow package (that snowfall relies on), provides parApply() and parApplyLB(), where the latter splits your data into minimal chunks, i.e. one per data element (one row of test). See help("parApply", package = "parallel") for details.
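As a hedged sketch of that load-balanced route with plain parallel (myfunction and test are the asker's own objects, so this is untested): the structure mirrors the snowfall code, but parApplyLB() hands out one row at a time, so no worker sits idle while others still have rows left.
library(parallel)

cl <- makeCluster(46)
clusterEvalQ(cl, library(deSolve))          # replaces sfLibrary(deSolve)
clusterExport(cl, c("myfunction", "test"))  # replaces sfExport(...)

res <- parApplyLB(cl, test, 1, function(x) myfunction(x[1], x[2]))

stopCluster(cl)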
The future.apply package (I'm the author) provides you with the option to scale how much you want to split up the data. It doesn't provide an apply() version, but it does provide a lapply() version that you can use (and that is how parApply() works internally). For instance, your example using one chunk per worker would be:
library(future.apply)
plan(multisession, workers = 46L)
## Coerce matrix into list with one element per matrix row
test_rows <- lapply(seq_len(nrow(test)), FUN = function(row) test[row,])
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
})
which defaults to
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
}, future.scheduling = 1.0)
If you want to split up the data so that each worker processes one row at a time (cf. parallel::parApplyLB()), you do that as:
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
}, future.scheduling = Inf)
By setting future.scheduling in [1, Inf], you can control how big the average chunk size is. For instance, future.scheduling = 2.0 will have each worker process on average two chunks of data before future_lapply() returns.
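For example, the two-chunks-per-worker case described above would be written as (a trivial variation of the calls shown earlier):
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
}, future.scheduling = 2.0)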
EDIT 2021-11-08: future_lapply() and friends are now in the future.apply package (they were originally in the future package).

multinode processing in R

I am trying to run R code on an HPC, but not sure how to take advantage of multiple nodes. The specific HPC I am using has 100 nodes and 36 cores per node.
Here is an example of the code.
n = 3600 ### This would be my ideal. Set to 3 on my laptop
cl = makeCluster(n, "SOCK")
foreach(i = 1:length(files), .packages = c("raster", "dismo")) %dopar%
Myfunction(files=files[i],template=comm.path, out = outdir)
This code works on my laptop and on the login node of the HPC, but it only uses 1 node. I just want to make sure I am taking advantage of all the cores that I can.
How specifically do I take advantage of multiple nodes, or is it done "behind the scenes"?
If you are serious about HPC clusters, use an MPI cluster, not SOCK. MPI is the standard for non-shared-memory computing, and most clusters are optimized for MPI.
On an HPC system you also need a job script to start R. There are several ways to start it: you may use mpirun, or invoke the workers directly from R. The scheduler will set up the MPI environment and R will figure out which nodes to use. Start small, with say 4 workers, and increase the number until you have reached the optimal level. Most tasks cannot efficiently use 3600 CPUs.
Finally, if you are using tens of CPUs over MPI, I strongly recommend using the Rhpc package instead of Rmpi. It uses more efficient MPI communication and gives you quite a noticeable speed boost.
On a TORQUE-controlled system I am using something along these lines:
library(Rhpc)
Rhpc_initialize()
nodefile <- Sys.getenv("PBS_NODEFILE")
nodes <- readLines(nodefile)
commSize <- length(nodes)
cl <- Rhpc_getHandle(commSize)
Rhpc_Export(cl, c("data"))
...
result <- Rhpc_lapply(cl, 1:1000, runMySimulation)
...
Rhpc_finalize()
The TORQUE-specific part is the nodefile; that is how I know how many workers to create. In the job script I start R simply as Rscript >>output.txt myScript.R.
As a side note: are you sure myfun(files, ...) is correct? Perhaps you mean myfun(files[i], ...)?
Let us know how it goes, I am happy to help :-)

Foreach & SNOW do not work on Windows

I want to use a foreach loop on a Windows machine to make use of multiple cores in a CPU-heavy computation.
Here is a minimal example of what I think should work, but doesn't:
library(snow)
library(doSNOW)
library(foreach)
cl <- makeSOCKcluster(4)
registerDoSNOW(cl)
pois <- rpois(1e6, 1500) # 1e6 draws from a Poisson with mean 1500
x <- foreach(i = 1:1e6) %dopar% {
  runif(pois[i]) # pois[i] draws from a uniform distribution
}
stopCluster(cl)
SNOW does create the 4 "slave" processes, but they don't do any work.
I hope this isn't a duplicate, but I cannot find anything with the search terms I can come up with.
It's probably working (at least it does on my Mac). However, one call to runif takes such a small amount of time that almost all the time is spent on overhead, and the child processes spend negligible CPU time on the actual tasks.
x <- foreach(i = 1:20) %dopar% {
  system.time(runif(pois[i]))
}
x[[1]]
#   user  system elapsed
#      0       0       0
Parallelization makes sense if you have some heavy computations that cannot be optimized. That's not the case in your example. You don't need 1e6 calls to runif, one would be sufficient (e.g., runif(sum(pois)) and then split the result).
PS: Always test with a smaller example.
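To make that last suggestion concrete (my own sketch of it): draw everything in one call and then split the result back into a list with pois[i] values per element.
pois  <- rpois(1e4, 1500)                          # smaller example, as advised
draws <- runif(sum(pois))                          # one call instead of length(pois) calls
x     <- split(draws, rep(seq_along(pois), pois))  # element i holds pois[i] draws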
Although this particular example isn't worth executing in parallel, it's worth noting that since it uses doSNOW, the entire pois vector is auto-exported to all of the workers even though each worker only needs a fraction of it. However, you can avoid auto-exporting any data to the workers by iterating over pois itself:
x <- foreach(p = pois) %dopar% {
  runif(p)
}
Now the elements of pois are sent to the workers in the tasks, so each worker only receives the data that's actually needed to perform its tasks. This technique isn't important when using doMC, since the doMC workers get pois for free.
You can also often improve performance enormously by processing pois in larger chunks using an iterator function such as "isplitVector" from the itertools package.
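A rough sketch of that chunking idea (mine, not from the original answer): hand each task a slice of pois produced by isplitVector(), so a worker makes many runif() calls per task and the per-task communication overhead is paid far fewer times.
library(snow)
library(doSNOW)
library(foreach)
library(itertools)

cl <- makeSOCKcluster(4)
registerDoSNOW(cl)

pois <- rpois(1e6, 1500)

x <- foreach(chunk = isplitVector(pois, chunks = 4), .combine = "c") %dopar% {
  lapply(chunk, runif)   # one vector of draws per element of the chunk
}

stopCluster(cl)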

R: Foreach Parallelized

I want to run a function 100 times. The function itself contains a for loop that runs 4000 times. I put my code on EC2 to run it on multiple cores, but I am not sure I am doing it correctly, since it isn't obvious whether it is actually using all the cores. Does the code below make sense?
# arbitrary function:
x = function() {
  y = c()
  for (i in 1:4000) {
    y = c(y, i)
  }
  return(y)
}
# helper function
loop.helper <- function(n.times) {
  results = list()
  for (i in 1:n.times) {
    results[[i]] = x()
  }
  return(results)
}
#Parallel
require(foreach)
require(parallel)
require(doParallel)
cores = detectCores()  # 32
cl <- makeCluster(cores)  # create a cluster with that many workers
registerDoParallel(cl, cores = cores)
Here is my problem: I am not sure whether it should be this:
out <- foreach(i = 1:cores) %dopar% {
  loop.helper(n.times = 100)
}
or should it be this:
out <- foreach(i = 1:100) %dopar% {
  x()
}
Both of them work, but I am not sure whether the first one will distribute the tasks across the 32 cores I have, or whether that happens automatically in the second foreach implementation.
thanks
out <- foreach(i = 1:100) %dopar% {
  x()
}
This is the correct way to do it. The foreach package will automatically distribute the 100 tasks among the registered cores (32, in your case).
If you read the package documentation, you can read some of the examples and it should become extra clear to you.
EDIT:
To respond to @user1234440's comment:
Some considerations:
There is some time required to set up and manage the parallel tasks (e.g. setting up the multiple jobs to run concurrently, and then combining the results at the end). For some trivial tasks or small jobs, sometimes running a parallel process takes longer than the simple sequential loop simply because setting up the parallel processes takes up more time than it saves. However, for most tasks that require some non-trivial computations, you will likely experience speed improvements.
Also, from what I have read, you will see diminishing returns as you use more cores (e.g. using 8 cores may not necessarily be 2x faster than using 4 cores, but may only be 1.5x faster). In addition, from my personal experience, using ALL the available cores on my system resulted in some performance degradation. I think this was because I was dedicating all of my system resources to the parallel job and it was slowing down my other system processes.
That being said, I have almost always experienced speed improvements when using the parallel processing power offered by the foreach function. For your example of running 100 jobs with 32 cores, 4 cores will receive 4 jobs, and the other 28 cores will receive 3 jobs. Now it will be as if 32 computers are running mini for loops, iterating through the 4 or 3 jobs that were distributed to each of the cores. After each loop is completed, the results are combined and returned to you.
If the 100 tasks complete faster with a simple for loop than with a parallel foreach loop, then running those 100 tasks in a regular for loop 4000 times will also be faster than running them in a parallelized foreach loop 4000 times.
Since you want to execute the function "x" 100 times, you can do that with:
out <- foreach(i = 1:100) %dopar% {
  x()
}
This correctly returns a list of 100 vectors. Your other solution is wrong because it will execute the function "x" cores * 100 times, returning a list of cores lists of 100 vectors.
You may be confused because it is common to write parallel loops that use one iteration for each core. For instance, you could also execute "x" 100 times like this:
out <- foreach(i = 1:cores, .combine = 'c') %dopar% {
  # note: this assumes 4 workers, so that 4 * 25 = 100 calls to x()
  results <- vector('list', 25)
  for (j in 1:25) {
    results[[j]] <- x()
  }
  results
}
This also returns a list of 100 vectors (given four workers running 25 tasks each), and it will be somewhat more efficient. This technique is called "task chunking", and it can give significantly better performance when the tasks are short. Your other solution (the one using the helper function) is almost like this, except the helper function should execute fewer iterations, and the resulting lists should be combined, which I do by using c as the combine function.
It's important to realize that you can't control the number of cores that are used via the iteration variable in the foreach loop: that is controlled via the registerDoParallel function. But most parallel backends, including doParallel, will map "cores" tasks onto "cores" workers. It's also important to realize that you don't truly control the number of CPU cores that will be used by the "cores" worker processes. You control the number of processes that will be created to execute tasks when you call makeCluster, but ultimately it is up to the operating system to schedule those processes on the cores of the CPU, so the "cores" argument is something of a misnomer.
Also note that for your example, you should call registerDoParallel as:
registerDoParallel(cl)
Since you specified a value for the cl argument, the cores argument is ignored; however, the documentation doesn't make that clear.
