R: makePSOCKcluster hyperthreads 50% of CPU cores

I am trying to run an R script on a single Linux machine with two CPUs, each containing 8 physical cores.
The R code automatically detects the number of cores via detectCores(), reduces this number by one, and passes it to the makePSOCKcluster command. According to the performance monitor, R only utilizes one of the CPUs and hyperthreads its cores; no workload is distributed to the second CPU.
If I specify detectCores(logical = FALSE), the observed load on the first CPU becomes smaller, but the second one is still idle.
How do I fix this? Since the entire infrastructure is located in a single machine, Rmpi should not be necessary in this case.
FYI: the R script consists of foreach loops that rely on the doSNOW package.

Try using makeCluster() and define the cluster type and size with a task/worker list.
It works for me and runs each task on a different core/process.
Consider (if possible) defining each task separately rather than just using foreach.
Here is an example of what I'm using;
out will be a list of all the results from each worker, in the order of the task list.
library(parallel)
tasks <- list(task1, task2, ...)  # one function per task
cl <- makeCluster(length(tasks), type = "PSOCK")
clusterEvalQ(cl, { library(dplyr); library(httr) })  # load packages on every worker
clusterExport(cl, list("varname1", "varname2"), envir = environment())
out <- clusterApply(cl, tasks, function(f) f())  # run each task function on its own worker
stopCluster(cl)

In my case the solution was not to rely on snow at all. Instead, I launch the R script with mpirun and let that command manage the parallel environment from outside R; doSNOW is replaced with doMPI accordingly.
With this setup both CPUs are adequately utilized.
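A minimal sketch of that setup, assuming the script is called run.R (the worker count and loop body are placeholders, not the original code):

# Shell: start a single R master process and let MPI manage the rest:
#   mpirun -n 1 Rscript run.R
library(doMPI)
cl <- startMPIcluster()  # workers are spawned and managed by MPI
registerDoMPI(cl)
results <- foreach(i = 1:100) %dopar% {
  sqrt(i)  # placeholder workload
}
closeCluster(cl)
mpi.quit()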

Related

R foreach: from single-machine to cluster

The following (simplified) script works fine on the master node of a unix cluster (4 virtual cores).
library(foreach)
library(doParallel)
nc <- detectCores()
cl <- makeCluster(nc)
registerDoParallel(cl)
foreach(i = 1:nrow(data_frame_1), .packages = c("package_1", "package_2"),
        .export = c("variable_1", "variable_2")) %dopar% {
  row_temp <- data_frame_1[i, ]
  some_function(argument_1 = row_temp, argument_2 = variable_1, argument_3 = variable_2)  # some_function stands in for the real call
}
stopCluster(cl)
I would like to take advantage of the 16 nodes in the cluster (16 * 4 virtual cores in total).
I guess all I need to do is change the parallel backend specified by makeCluster, but how should I do that? The documentation is not very clear.
Based on this quite old (2013) post, http://www.r-bloggers.com/the-wonders-of-foreach/, it seems that I should change the default cluster type (SOCK or MPI, but which one, and would that work on Unix?).
EDIT
From this vignette by the authors of foreach:
By default, doParallel uses multicore functionality on Unix-like systems and snow functionality on Windows. Note that the multicore functionality only runs tasks on a single computer, not a cluster of computers. However, you can use the snow functionality to execute on a cluster, using Unix-like operating systems, Windows, or even a combination.
What does "you can use the snow functionality" mean? How should I do that?
The parallel package is a merger of the multicore and snow packages, but if you want to run on multiple nodes, you have to make use of the "snow functionality" in parallel (that is, the part of parallel that was derived from snow). Practically speaking, that means you need to call makeCluster with the "type" argument set to either "PSOCK", "SOCK", "MPI" or "NWS" because those are the only cluster types supported by the current version of parallel that support execution on multiple nodes. If you're using a cluster that is managed by knowledgeable HPC sysadmins, you should use "MPI", otherwise it may be easier to use "PSOCK" (or "SOCK" if you have a particular reason to use the "snow" package).
If you choose to create an "MPI" cluster, you should execute the script via R using the mpirun command with the "-n 1" option, and the first argument to makeCluster set to the number of workers that should be spawned. (If you don't know what that means, you may not want to use this approach.)
If you choose to create a "PSOCK" or "SOCK" cluster, the first argument to makeCluster must be a vector of hostnames, and makeCluster will start workers on those nodes via the "ssh" command when makeCluster is executed. That means you must have ssh daemons running on all of the specified hosts.
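For illustration, a hedged sketch of the "PSOCK" route (the hostnames are placeholders; each host must accept passwordless ssh from the master):

library(doParallel)
hosts <- rep(c("node01", "node02", "node03", "node04"), each = 4)  # 4 workers per node
cl <- makeCluster(hosts, type = "PSOCK")
registerDoParallel(cl)
# ... run the foreach loop as before ...
stopCluster(cl)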
I've written much more on this subject elsewhere, but hopefully this will help you get started.
Here's a partial answer that may send you in the right direction
Based on this quite old (2013) post http://www.r-bloggers.com/the-wonders-of-foreach/ it seems that I should change the default type (fork to MPI but why? would that work on unix?)
fork is a way of spawning background processes on POSIX systems. On a single node with n cores, you can spawn n processes in parallel and do work. This doesn't work across multiple machines because they don't share memory; you need a way to get data between them.
MPI is a portable way to communicate between clusters of nodes. An MPI cluster can work across nodes.
What does "you can use the snow functionality" mean? How should I do that?
snow is a separate package. To make a 16-node MPI cluster with snow, do cl <- makeCluster(16, type = "MPI"), but you need to be running R in the right environment, as described in Steve Weston's answer above and in his answer to a similar question. (Once you get it running, you may also need to modify your loop to use the 4 cores on each node.)
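For illustration, a hedged sketch of how that could plug into the foreach code from the question (assuming Rmpi and an MPI runtime are installed; the variable names are those from the question):

library(doSNOW)
cl <- snow::makeCluster(16, type = "MPI")  # one worker per node, spawned via MPI
registerDoSNOW(cl)
foreach(i = 1:nrow(data_frame_1), .packages = c("package_1", "package_2"),
        .export = c("variable_1", "variable_2")) %dopar% {
  # same loop body as before
}
snow::stopCluster(cl)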

parallel r with foreach and mclapply at the same time

I am implementing a parallel processing system which will eventually be deployed on a cluster, but I'm having trouble working out how the various methods of parallel processing interact.
I need to use a for loop to run a big block of code, which contains several operations on large lists of matrices. To speed this up, I want to parallelise the for loop with foreach() and parallelise the list operations with mclapply.
Example pseudocode:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
outputs <- foreach(k = 1:2, .packages = "various packages") %dopar% {
  l_output1 <- mclapply(l_input1, some_function, mc.cores = 2)
  l_output2 <- mclapply(l_input2, some_function, mc.cores = 2)
  mapply(cbind, l_output1, l_output2, SIMPLIFY = FALSE)  # the last expression is the iteration's return value
}
stopCluster(cl)
This seems to work. My questions are:
1) Is this a reasonable approach? The two mechanisms seem to work together on my small-scale tests, but it feels a bit kludgy.
2) How many cores/processes will it use at any given time? When I upscale it to a cluster, I will need to understand how much I can push this (the foreach only loops 7 times, but the mclapply lists contain up to 70 or so big matrices). It appears to create 6 "cores" as written (presumably 2 for the foreach and 2 for each mclapply).
I think it's a very reasonable approach on a cluster because it allows you to use multiple nodes while still using the more efficient mclapply across the cores of the individual nodes. It also allows you to do some of the post-processing on the workers (calling cbind in this case) which can significantly improve performance.
On a single machine, your example will create a total of 10 additional processes: two by makeCluster, each of which calls mclapply twice (2 + 2 × (2 + 2)). However, only four of them should use any significant CPU time at once. You could reduce that to eight processes by restructuring the functions called by mclapply so that you only need to call mclapply once in the foreach loop, which may be more efficient (a sketch follows the next paragraph).
On multiple machines, you will create the same number of processes, but only two processes per node will use much CPU time at a time. Since they are spread out across multiple machines it should scale well.
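A hedged sketch of that restructuring, assuming l_input1 and l_input2 have equal length and some_function stands in for the real work:

outputs <- foreach(k = 1:2, .packages = "parallel") %dopar% {
  # One mclapply call per iteration: each index processes its element
  # from both lists and cbinds them immediately.
  mclapply(seq_along(l_input1), function(i) {
    cbind(some_function(l_input1[[i]]), some_function(l_input2[[i]]))
  }, mc.cores = 2)
}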
Be aware that mclapply may not play nicely if you use an MPI cluster. MPI doesn't like you to fork processes, as mclapply does. It may just issue some stern warnings, but I've also seen other problems, so I'd suggest using a PSOCK cluster which uses ssh to launch the workers on the remote nodes rather than using MPI.
Update
It looks like there is a problem calling mclapply from cluster workers created by the "parallel" and "snow" packages. For more information, see my answer to a problem report.

Whether to use the detectCores function in R to specify the number of cores for parallel processing?

In the help for detectCores() it says:
This is not suitable for use directly for the mc.cores argument of mclapply nor specifying the number of cores in makeCluster. First because it may return NA, and second because it does not give the number of allowed cores.
However, I've seen quite a bit of sample code like the following:
library(parallel)
k <- 1000
m <- lapply(1:7, function(X) matrix(rnorm(k^2), nrow=k))
cl <- makeCluster(detectCores() - 1, type = "FORK")
test <- parLapply(cl, m, solve)
stopCluster(cl)
where detectCores() is used to specify the number of cores in makeCluster.
My use cases involve running parallel processing both on my own multicore laptop (OS X) and on various multicore servers (Linux). So I wasn't sure whether there is a better way to specify the number of cores, or whether that advice about not using detectCores is aimed more at package developers, whose code is meant to run over a wide range of hardware and OS environments.
So in summary:
Should you use the detectCores function in R to specify the number of cores for parallel processing?
What does the distinction between detected and allowed cores mean, and when is it relevant?
I think it's perfectly reasonable to use detectCores as a starting point for the number of workers/processes when calling mclapply or makeCluster. However, there are many reasons that you may want or need to start fewer workers, and even some cases where you can reasonably start more.
On some hyperthreaded machines it may not be a good idea to set mc.cores=detectCores(), for example. Or if your script is running on an HPC cluster, you shouldn't use any more resources than the job scheduler has allocated to your job. You also have to be careful in nested parallel situations, as when your code may be executed in parallel by a calling function, or you're executing a multithreaded function in parallel. In general, it's a good idea to run some preliminary benchmarks before starting a long job to determine the best number of workers. I usually monitor the benchmark with top to see if the number of processes and threads makes sense, and to verify that the memory usage is reasonable.
The advice that you quoted is particularly appropriate for package developers. It's certainly a bad idea for a package developer to always start detectCores() workers when calling mclapply or makeCluster, so it's best to leave the decision up to the end user. At least the package should allow the user to specify the number of workers to start, but arguably detectCores() isn't even a good default value. That's why the default value for mc.cores changed from detectCores() to getOption("mc.cores", 2L) when mclapply was included in the parallel package.
I think the real point of the warning that you quoted is that R functions should not assume that they own the whole machine, or that they are the only function in your script that is using multiple cores. If you call mclapply with mc.cores=detectCores() in a package that you submit to CRAN, I expect your package will be rejected until you change it. But if you're the end user, running a parallel script on your own machine, then it's up to you to decide how many cores the script is allowed to use.
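For illustration, a sketch of that package-friendly pattern (my_par_fun is a hypothetical function, not from any real package):

# Let the caller choose the worker count; default to the conservative
# option-based value rather than grabbing every detected core.
my_par_fun <- function(x, cores = getOption("mc.cores", 2L)) {
  parallel::mclapply(x, sqrt, mc.cores = cores)
}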
Author of the parallelly package here: The parallelly::availableCores() function acknowledges various HPC environment variables (e.g. NSLOTS, PBS_NUM_PPN, and SLURM_CPUS_PER_TASK) and system and R settings that are used to specify the number of cores available to the process, and if not specified, it'll fall back to parallel::detectCores(). As I, or others, become aware of more settings, I'll be happy to add automatic support also for those; there is an always open GitHub issue for this over at https://github.com/HenrikBengtsson/parallelly/issues/17 (there are some open requests for help).
Also, if the sysadm sets environment variable R_PARALLELLY_AVAILABLECORES_FALLBACK=1 sitewide, then parallelly::availableCores() will return 1, unless explicitly overridden by other means (by the job scheduler, by the user settings, ...). This further protects against software tools taking over all cores by default.
In other words, if you use parallelly::availableCores() rather than parallel::detectCores() you can be fairly sure that your code plays nice in multi-tenant environments (if it turns out it's not enough, please let us know in the above GitHub issue) and that any end user can still control the number of cores without you having to change your code.
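A minimal usage sketch, as a drop-in for detectCores() in the earlier examples:

library(parallelly)
# Respects scheduler limits (e.g. SLURM_CPUS_PER_TASK) and R options,
# falling back to parallel::detectCores() if nothing else is set.
n <- availableCores()
cl <- parallel::makeCluster(n)
# ... use cl, then parallel::stopCluster(cl) ...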
EDIT 2021-07-26: availableCores() was moved from future to parallelly in October 2020. For now, and for backward-compatibility reasons, availableCores() is re-exported by the future package.
Better in my case (I use a Mac) is future::availableCores(), because detectCores() reports 160 cores, which is obviously wrong.

Does multicore computing using R's doParallel package use more memory?

I just tested an elastic net with and without a parallel backend. The call is:
library(caret)
enetGrid <- data.frame(.lambda = 0, .fraction = c(.005))
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
enetTune <- train(x, y, method = "enet", tuneGrid = enetGrid, trControl = ctrl, preProc = NULL)
I ran it without a parallel backend registered (and got the warning message from %dopar% when the train call finished), and then again with one registered for 7 cores (of 8). The first run took 529 seconds, the second 313. But the first run took 3.3 GB of memory at peak (reported by the Sun cluster system), and the second took 22.9 GB. I've got 30 GB of RAM, and the task only gets more complicated from here.
Questions:
1) Is this a general property of parallel computation? I thought they shared memory....
2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar%? No, right?
Because I am interested in whether this is the expected result, this is closely related to, but not exactly the same as, this question. I'd be fine closing this and merging my question into that one (or marking that as a duplicate and pointing to this one, since this has more detail) if that's the consensus:
Extremely high memory consumption of new doParallel package
In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.
However, memory is shared between the processes that are forked by mclapply (as long as the processes don't modify it, which triggers a copy of the memory region in the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
In other words, using:
registerDoParallel(7)
may be much more memory efficient than using:
cl <- makeCluster(7)
registerDoParallel(cl)
since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
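To see the fork-based sharing in action, a small sketch (Linux and Mac OS X only; the sizes are arbitrary):

library(parallel)
big <- matrix(rnorm(4e6), nrow = 2000)  # ~30 MB, allocated once in the master
# Forked workers see `big` through copy-on-write: no per-worker copy is
# made as long as the workers only read it.
res <- mclapply(1:4, function(i) sum(big[, i]), mc.cores = 4)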
However, the "snow" API allows you to use multiple machines, and that means that your memory size increases with the number of CPUs. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.
So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
There is a minimum number of characters here, otherwise I would simply have typed: 1) Yes. 2) No... er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested packages don't use it.
http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf
http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf

When does foreach call .combine?

I have written some code using foreach which processes and combines a large number of CSV files. I am running it on a 32 core machine, using %dopar% and registering 32 cores with doMC. I have set .inorder=FALSE, .multicombine=TRUE, verbose=TRUE, and have a custom combine function.
I notice that if I run this on a sufficiently large set of files, R appears to attempt to process EVERY file before calling .combine for the first time. My evidence is that when monitoring my server with htop, I initially see all cores maxed out, and then for the remainder of the job only one or two cores are used while it does the combines in batches of ~100 (.maxcombine's default), as seen in the verbose console output. What's really telling is that the more jobs I give to foreach, the longer it takes to see "first call to combine"!
This seems counter-intuitive to me; I naively expected foreach to process .maxcombine files, combine them, then move on to the next batch, combining those with the output of the last call to .combine. I suppose for most uses of .combine it wouldn't matter as the output would be roughly the same size as the sum of the sizes of inputs to it; however my combine function pares down the size a bit. My job is large enough that I could not possibly hold all 4200+ individual foreach job outputs in RAM simultaneously, so I was counting on my space-saving .combine and separate batching to see me through.
Am I right that .combine doesn't get called until ALL my foreach jobs are individually complete? If so, why is that, and how can I optimize for that (other than making the output of each job smaller) or change that behavior?
The short answer is to use either doMPI or doRedis as your parallel backend. They work more as you expect.
The doMC, doSNOW and doParallel backends are relatively simple wrappers around functions such as mclapply and clusterApplyLB, and don't call the combine function until all of the results have been computed, as you've observed. The doMPI, doRedis, and (now defunct) doSMP backends are more complex, and get inputs from the iterators as needed and call the combine function on-the-fly, as you have assumed they would. These backends have a number of advantages in my opinion, and allow you to handle an arbitrary number of tasks if you have appropriate iterators and combine function. It surprises me that so many people get along just fine with the simpler backends, but if you have a lot of tasks, the fancy ones are essential, allowing you to do things that are quite difficult with packages such as parallel.
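For instance, a hedged sketch with doMPI in which a reducing .combine runs as results arrive, so only a bounded number of task outputs is ever held in memory (the loop body is a placeholder):

library(doMPI)  # launch with: mpirun -n 1 Rscript script.R
cl <- startMPIcluster()
registerDoMPI(cl)
# `+` is applied on-the-fly as task results come in, so the 4200+
# individual outputs never need to be resident at the same time.
total <- foreach(i = 1:4200, .combine = `+`, .inorder = FALSE) %dopar% {
  sum(rnorm(1e5))  # placeholder for processing one CSV file
}
closeCluster(cl)
mpi.quit()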
I've been thinking about writing a more sophisticated backend based on the parallel package that would handle results on the fly like my doMPI package, but there hasn't been any call for it to my knowledge. In fact, yours has been the only question of this sort that I've seen.
Update
The doSNOW backend now supports on-the-fly result handling. Unfortunately, this can't be done with doParallel because the parallel package doesn't export the necessary functions.
