I'm using the package ecole within R, with the function found here (https://rdrr.io/github/phytomosaic/ecole/man/ord_nms.html), which may not be relevant.
I specify the number of cores and ask it to run in parallel, but judging by Task Manager it seems to run on a single core, even though I have 24. It's very slow.
(p <- parallel::detectCores() - 1) # number of cores on your machine, minus one
m1 <- ord_nms(glt_WWTPXCMSspe, autopilot='medium', method='bray', weakties=F, parallel=p)
What's going wrong here? How can I specify to run on available cores? p returns 23.
I expected to see usage on all cores, but only cores 8 and 13 show any activity, the rest completely empty, while the function runs for an hour or so.
Related
I am new to R programming, as you can tell from the nature of my question. I am trying to take advantage of the parallel computing ability of the train function.
library(parallel)
#detects number of cores available to use for parallel package
nCores <- detectCores(logical = FALSE)
cat(nCores, " cores detected.")
# detect threads with parallel()
nThreads<- detectCores(logical = TRUE)
cat(nThreads, " threads detected.")
# Create doSNOW compute cluster (try 64)
# One can increase up to 128 nodes
# Each node requires 44 Mbyte RAM under WINDOWS.
cluster <- makeCluster(128, type = "SOCK")
class(cluster);
I need someone to help me interpret this code. Originally the first argument of makeCluster() was nThreads, but after running
nCores <- detectCores(logical = FALSE)
I learned that I have 4 threads available. I changed the value based on the message provided in the guide. Will this enable me to run 128 iterations of the train function simultaneously? If so, what is the point of getting the number of threads and cores that my computer has in the first place?
What you want to do first is detect the number of cores you have.
nCores <- detectCores() - 1
Most of the time people subtract 1 so that one core is left free for other work.
cluster <- makeCluster(nCores)
This sets the number of worker processes (cluster nodes) your code will run on. There are several parallel methods (doParallel, parApply, parLapply, foreach, ...).
Depending on the parallel method you choose, each piece of work will run on one of the workers in the cluster you created.
Small example I used in code of mine
no_cores <- detectCores() - 1
cluster <- makeCluster(no_cores)
result <- parLapply(cluster, docs$text, preProcessChunk)
stopCluster(cluster)
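The example above depends on docs$text and preProcessChunk, which are not shown. A self-contained sketch of the same pattern might look like the following. The input vector and the slow_square() function are made up for illustration, and the cap of 2 workers is only there to keep the example light.

```r
library(parallel)

# Hypothetical stand-in for preProcessChunk(): any function of one element.
slow_square <- function(x) {
  Sys.sleep(0.01)  # pretend this is expensive
  x^2
}

# Leave one core free, but use at least 1 and (for this demo) at most 2 workers.
# detectCores() may return NA on some platforms, hence na.rm = TRUE.
no_cores <- max(1L, min(2L, detectCores() - 1L, na.rm = TRUE))

cluster <- makeCluster(no_cores)   # starts no_cores separate R worker processes
result <- parLapply(cluster, 1:20, slow_square)
stopCluster(cluster)               # always shut the workers down

squares <- unlist(result)
```

parLapply() returns a list in the same order as its input, so unlist(result) lines up with 1:20.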
I also see that you're using "SOCK". I'm not sure whether type = "SOCK" works.
I always use type = "PSOCK". FORK also exists, but whether you can use it depends on your OS.
FORK: "to divide in branches and go separate ways"
  Systems: Unix/Mac (not Windows)
  Environment: Link all
PSOCK: Parallel Socket Cluster
  Systems: All (including Windows)
  Environment: Empty
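To illustrate the "Environment: Empty" point for PSOCK, here is a minimal sketch. The variable name my_offset is made up for the example. Each PSOCK worker starts as a fresh R session, so objects from the master session have to be sent over explicitly, for instance with clusterExport().

```r
library(parallel)

my_offset <- 100   # hypothetical object that exists only in the master session

cl <- makeCluster(2, type = "PSOCK")

# Fresh workers do not know about 'my_offset' yet.
seen_before <- unlist(clusterEvalQ(cl, exists("my_offset")))

# Copy 'my_offset' from the current environment to every worker.
clusterExport(cl, "my_offset", envir = environment())
seen_after <- unlist(clusterEvalQ(cl, exists("my_offset")))

result <- unlist(parLapply(cl, 1:3, function(x) x + my_offset))

stopCluster(cl)
```

With a FORK cluster (Unix/Mac only) the export step would normally not be needed, because the forked workers start as copies of the master process.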
I am not entirely convinced that the spec argument of parallel::makeCluster is explicitly the maximum number of cores (actually, logical processors) to use. I've used detectCores()-1 and detectCores()-2 as the spec argument for some computationally expensive processes, and the number of cores in use still appeared to equal detectCores(), despite my asking to leave a little room (here, leaving 1 logical processor free for other processes).
The example below is crude, as I've not captured any quantitative output of the core usage. Please suggest edits.
You can visualize core usage by monitoring via e.g., task manager whilst running a simple example:
no_cores <- 5
cl<-makeCluster(no_cores)#, outfile = "debug.txt")
parallel::clusterEvalQ(cl,{
library(foreach)
foreach(i = 1:1e5) %do% {
print(sqrt(i))
}
})
stopCluster(cl)
#browseURL("debug.txt")
Then, rerun using e.g., ncores-1:
no_cores <- parallel::detectCores()-1
cl<-makeCluster(no_cores)#, outfile = "debug.txt")
parallel::clusterEvalQ(cl,{
library(foreach)
foreach(i = 1:1e5) %do% {
print(sqrt(i))
}
})
stopCluster(cl)
All 16 cores appear to engage despite no_cores being specified as 15.
Based on the above example and my very crude (visual only) analysis...it looks like it is possible that the spec argument tells the max number of cores to use throughout the process, but it doesn't appear the process is running on multiple cores simultaneously. Being a novice parallelizer, perhaps a more appropriate example is necessary to reject/support this?
The package documentation suggests spec is "A specification appropriate to the type of cluster."
I've dug into the relevant parallel documentation and cannot determine what, exactly, spec is doing. But I am not convinced the argument necessarily controls the max number of cores (logical processors) to engage.
Here is where I think I could be wrong in my assumptions: If we specify spec as less than the number of the machine's cores (logical processors) then, assuming no other large processes are running, the machine should never achieve no_cores times 100% CPU usage (i.e., 1600% CPU usage max with 16 cores).
However, when I monitor the CPUs on a Windows OS (using Resource Monitor), it does appear that there are, in fact, no_cores images of Rscript.exe running.
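A less visual way to check the observation above is to ask the workers for their process IDs. This is only a sketch, with n_workers chosen arbitrarily. It shows that a numeric spec sets the number of worker processes; it says nothing about which cores the operating system schedules those processes on, and the master R session is an additional process on top of the workers.

```r
library(parallel)

n_workers <- 3                      # arbitrary value for the demonstration
cl <- makeCluster(n_workers)

# Ask every worker for its process ID; each worker is its own R process.
worker_pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
master_pid <- Sys.getpid()

stopCluster(cl)

length(unique(worker_pids))         # number of distinct worker processes
master_pid %in% worker_pids         # is the master one of the workers?
```

Because the operating system is free to move processes between cores unless an affinity is set, seeing activity on more cores than there are workers does not by itself contradict the spec value.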
I have a piece of R code which never freezes on my laptop, but sometimes (~once per 100 times) it freezes immediately after "registerDoParallel(cl)" while running on the computational cluster. So, ages may pass - it will not move further. Nothing complicated with the computational cluster - just LINUX machine with 32 cores and a lot of RAM. It can freeze even when I try to register cluster with 1 core.
I have tried both FORK and PSOCK clusters; it makes no difference, and it still fails once every so often. It is pretty difficult to get anything meaningful from the log files. It gets stuck immediately after this command:
library(foreach)
library(doParallel)
numberOfThreads = 4 # may be any number from 1 to e.g. 5
no_cores <- min(detectCores() - 1, numberOfThreads)
cl<-makeCluster(no_cores)#, type="FORK")
registerDoParallel(cl)
print("This message will never be printed if it freezes")
Does somebody have any ideas on why it behaves like that?
UPD: I forgot to add that there are plenty of free cores and plenty of RAM when this error occurs. An obvious workaround is https://www.rdocumentation.org/packages/R.utils/versions/2.8.0/topics/withTimeout combined with repeated attempts to register the cluster, but this is very ugly.
I am using sfApply from the R snowfall package for parallel computing. There are 32000 tests to run. The code works fine at the start of the computation: it creates 46 Rscript.exe processes, each with about 2% CPU usage. The overall CPU usage is about 100% and the results are continually written to disk. The computation usually takes tens of hours. The strange thing is that the Rscript.exe processes gradually become inactive (CPU usage = 0) one by one, and the corresponding CPU goes idle too. After two days, judging by CPU usage, only half of the Rscript.exe processes are still active, and overall CPU usage drops to 50%. However, the work is still far from finished. As time goes by, more and more Rscript.exe processes go inactive, which makes the work take a very long time. I am wondering what makes the processes and CPU cores go inactive?
My computer has 46 logical cores. I am using R-3.4.0 from RStudio on 64-bit Windows 7. The 'test' variable below is a 32000 x 2 matrix. myfunction solves several differential equations.
Thanks.
library(snowfall)
sfInit(parallel=TRUE, cpus=46)
Sys.time()
sfLibrary(deSolve)
sfExport("myfunction","test")
res<-sfApply(test,1,function(x){myfunction(x[1],x[2])})
sfStop()
Sys.time()
What you're describing sounds reasonable since snowfall::sfApply() uses snow::parApply() internally, which chunks up your data (test) into (here) 46 chunks and sends each chunk out to one of the 46 R workers. When a worker finishes its chunk, there is no more work for it and it'll just sit idle while the remaining chunks are processed by the other workers.
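The static chunking described above can be made concrete with parallel::splitIndices(), which divides a set of indices into one contiguous block per worker. The sizes below are small stand-ins for the 32000 rows and 46 workers in the question, and it is an assumption that snowfall/snow splits the data in exactly the same way.

```r
library(parallel)

n_rows <- 10      # stand-in for the 32000 rows of 'test'
n_workers <- 3    # stand-in for the 46 workers

# One block of row indices per worker, decided before any work starts.
chunks <- splitIndices(n_rows, n_workers)
chunk_sizes <- lengths(chunks)

print(chunks)
```

Each worker receives exactly one of these blocks, so a worker whose rows happen to be quick finishes early and then has nothing left to do.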
What you want to do is split your data into smaller chunks, so that each worker processes more than one chunk on average. I don't know (or think) that this is possible with snowfall. The parallel package, which is part of R itself and which replaces the snow package (that snowfall relies on), provides parLapply() and the load-balancing parLapplyLB(), where the latter hands out work in smaller pieces, down to one per data element (of test). See help("parApply", package = "parallel") for details.
The future.apply package (I'm the author) lets you control how finely the data is split up. It doesn't provide an apply() version, but it has an lapply() version that you can use (which is also how parApply() works internally). For instance, your example that uses one chunk per worker would be:
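As a rough sketch of load balancing with the parallel package alone, clusterApplyLB() hands out one task at a time and gives the next task to whichever worker becomes idle first. The small test matrix and the trivial myfunction below are made-up stand-ins for the objects in the question.

```r
library(parallel)

# Hypothetical stand-ins for 'test' and 'myfunction' from the question.
test <- cbind(1:8, 8:1)
myfunction <- function(a, b) a * b

cl <- makeCluster(2)
clusterExport(cl, c("test", "myfunction"), envir = environment())

# One task per row: a worker gets the next row index as soon as it is idle.
res <- clusterApplyLB(cl, seq_len(nrow(test)), function(i) {
  myfunction(test[i, 1], test[i, 2])
})
stopCluster(cl)

res <- unlist(res)   # results come back in the order of the row indices
```

The trade-off is more communication between master and workers, which only pays off when individual tasks are reasonably expensive, as solving differential equations is likely to be.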
library(future.apply)
plan(multisession, workers = 46L)
## Coerce matrix into list with one element per matrix row
test_rows <- lapply(seq_len(nrow(test)), FUN = function(row) test[row,])
res <- future_lapply(test_rows, FUN = function(x) {
myfunction(x[1],x[2])
})
which defaults to
res <- future_lapply(test_rows, FUN = function(x) {
myfunction(x[1],x[2])
}, future.scheduling = 1.0)
If you want to split up the data so that each worker processes one row at a time (cf. parallel::parLapplyLB()), you do that as:
res <- future_lapply(test_rows, FUN = function(x) {
myfunction(x[1],x[2])
}, future.scheduling = Inf)
By setting future.scheduling in [1, Inf], you can control how big the average chunk size is. For instance, future.scheduling = 2.0 will have each worker process on average two chunks of data before future_lapply() returns.
EDIT 2021-11-08: future_lapply() and friends are now in the future.apply package (they were originally in future).
I'm experiencing slowness when creating clusters using the parallel package.
Here is a function that just creates and then stops a PSOCK cluster, with n nodes.
library(parallel)
library(microbenchmark)
f <- function(n)
{
cl <- makeCluster(n)
on.exit(stopCluster(cl))
}
microbenchmark(f(2), f(4), times = 10)
## Unit: seconds
## expr min lq median uq max neval
## f(2) 4.095315 4.103224 4.206586 5.080307 5.991463 10
## f(4) 8.150088 8.179489 8.391088 8.822470 9.226745 10
My machine (a reasonably modern 4-core workstation running Win 7 Pro) is taking about 4 seconds to create a two node cluster and 8 seconds to create a four node cluster. This struck me as too slow, so I tried the same profiling on a colleague's identically specced machine, and it took one/two seconds for the two tests respectively.
This suggested I may have some odd configuration set up on my machine, or that there is some other problem. I read the ?makeCluster and socketConnection help pages, but did not see anything related to improving performance.
I had a look in the Windows Task Manager while the code was running: there was no obvious interference with anti-virus or other software, just an Rscript process running at ~17% (less than one core).
I don't know where to look to find the source of the problem. Are there any known causes of slowness with PSOCK cluster creation under Windows?
Is 8 seconds to create a 4-node cluster actually slow (by 2014 standards), or are my expectations too high?
To monitor what was happening, I installed and opened Process Monitor (HT #qethanm). I also exited most of the things in my system tray like Dropbox, in order to generate less noise. (Though in the end, this didn't make a difference.)
I then re-ran a simplified version of the R code in the question, directly from R GUI (instead of an IDE).
microbenchmark(f(4), times = 5)
After some digging, I noticed that R GUI spawns an Rscript process for each cluster node that it creates.
After many dead ends and wild goose chases, it occurred to me that perhaps these Rscript instances weren't vanilla R. I renamed my Rprofile.site file to hide it and repeated the benchmark.
This time, a 4 node cluster was created, on average, in just under a second.
For a four node cluster, the Rprofile.site file (and presumably the personal startup file, ~/.Rprofile, if it exists) gets read four times, which can slow things down considerably. Pass rscript_args = c("--no-init-file", "--no-site-file", "--no-environ") to makeCluster to avoid this behaviour.
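A minimal sketch of that call, with a timing around it so the effect can be compared against a plain makeCluster(2) on the same machine:

```r
library(parallel)

# Start workers without reading Rprofile.site, ~/.Rprofile, or Renviron files.
plain_args <- c("--no-init-file", "--no-site-file", "--no-environ")

timing <- system.time({
  cl <- makeCluster(2, rscript_args = plain_args)
})
n_nodes <- length(cl)
stopCluster(cl)

timing[["elapsed"]]   # seconds spent creating the cluster
```

Keep in mind that workers started this way will not see anything the startup files normally set up, such as extra library paths or environment variables, so those may need to be configured on the workers explicitly.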
In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather start N tasks on N workers and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
You have at least 2 possibilities:
As mentioned above, you can use the "mc.cores" parameter of mclapply() or the "mc.affinity" parameter of mcparallel().
On AMD platforms "mc.affinity" is preferred since two cores share same clock.
For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task for 2 cores only, it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" does that. The price is losing load balancing.
"mc.affinity" is present in recent versions of the package. See changelog to find when introduced.
Also you can use OS's tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
Here you make your script to run on cores 0 and 1 only.
Keep in mind Linux numbers its cores starting from "0". Package parallel conforms to R's indexing and first core is core number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of processes you want to run at a time. With mc.preschedule = FALSE, one child process is forked per task, so each time a task finishes and its process exits, a new one is forked for the next available task. (With mc.preschedule = TRUE, the tasks are instead divided among the cores up front.)
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
# wrapper() applies whichever function it is handed to the shared input inx
wrapper <- function(f,inx){f(inx)}
output <- mclapply(params,FUN=wrapper,mc.preschedule=FALSE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.
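For that simpler case, a minimal sketch could look like this. The durations vector and run_task() are made up to imitate tasks of different lengths. It uses mc.preschedule = FALSE, which forks one job per element, so that a new task starts as soon as a running one finishes rather than all tasks being assigned to cores up front.

```r
library(parallel)

# Forking is not available on Windows, where mclapply() only accepts mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

durations <- c(0.2, 0.05, 0.1, 0.05)   # made-up task lengths in seconds

run_task <- function(d) {
  Sys.sleep(d)   # stand-in for real work of varying length
  d * 10
}

# At most n_cores tasks run at once; results keep the order of 'durations'.
output <- mclapply(durations, run_task,
                   mc.preschedule = FALSE, mc.cores = n_cores)
output <- unlist(output)
```

Forking a process per task has some overhead, so for very many short tasks the default mc.preschedule = TRUE may still be the faster choice.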