mclapply not using multiple cores - r

I'm trying to process a bunch of csv files and return data frames in R, in parallel using mclapply(). I have a 64 core machine, and I can't seem to get anymore that 1 core utilized at the moment using mclapply(). In fact, it is a bit quicker to run lapply() rather than mclapply() at the moment. Here is an example that shows that mclapply() is not utilizing more the cores available:
library(parallel)
test <- lapply(1:100,function(x) rnorm(10000))
system.time(x <- lapply(test,function(x) loess.smooth(x,x)))
system.time(x <- mclapply(test,function(x) loess.smooth(x,x), mc.cores=32))
user system elapsed
0.000 0.000 7.234
user system elapsed
0.000 0.000 8.612
Is there some trick to getting this working? I had to compile R from source on this machine (v3.0.1), are there some compile flags that I missed to allow forking? detectCores() tells me that I indeed do have 64 cores to play with...
Any tips appreciated!

I get similar results to you, but if I change rnorm(10000) to rnorm(100000), I get significant speed up. I would guess that the additional overhead is canceling out any performance benefit for such a small scale problem.

Related

Why is FASTR (ie GraalVM version of R) 10x *slower* compared to normal R despite Oracle's claim of 40x *faster*?

Oracle claims that its graalvm implementaion of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran this super simple (but realistic) 4 line test program and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x,y,spar=0.6))
t1
In FASTR, t1 returns this value:
user system elapsed
0.870 0.012 0.901
While in the original normal R, I get this result:
user system elapsed
0.112 0.000 0.113
As you can see, FAST R is super slow even for this simple (ie 4 lines of code, no extra/special library imported etc). I tested this on a 16 core VM on Google Cloud. Thoughts? (FYI: I did a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)
====================================
EDIT:
Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that the code ran for much longer and I had time to monitor CPU usage. The modified code is below:
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
forloopfunction <- function(xTrain, yTrain) {
for (x in 1:100) {
smooth.spline(xTrain, yTrain, spar=0.6)
}
}
t1 <- system.time(fit1 <-forloopfunction(x,y))
t1
Now, the normal R returns this for t1:
user system elapsed
19.665 0.008 19.667
while FastR returns this:
user system elapsed
76.570 0.210 77.918
So, now, FastR is only 4x slower, but that's still considerably slower. (I would be ok with 5% to even 10% difference, but that's 400% difference.) Moreoever, I checked the cpu usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% of CPU usage (ie between 1 full core and 3 full cores) during the ~78 seconds. So, I think it fairly reasonably to conclude that at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming ~1x to 3x more CPU cores. Particularly given that I'm not importing any special libraries which the FASTR team may not have time to properly analyze (ie I'm using just vanilla R code that ships with R), I think that there's something not quite right with the FASTR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's now moot I think.) Has anyone else experienced anything similar or does anyone know of any "magic" configuration that one needs to do to FASTR to get its claimed speeds (or at least similar, ie +- 5% speeds to normal R)? (Or maybe there's some known limitation to FASTR that I may be able to work around, ie don't use normal fortran binaries etc, but use these special ones etc.)
TL;DR: your example is indeed not the best use-case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may be still useful for your application overall or just for some selected algorithms that run slowly on GNU-R, but would be a good fit for FastR (loopy, "scalar" code, see FastRCluster package).
As others have mentioned, when it comes to micro benchmarks one needs to repeat the benchmark multiple times to allow the system to warm-up. This is important in any case, but more so for systems that rely on dynamic compilation, like FastR.
Dynamic just-in-time compilation works by first interpreting the program while recording the profile of the execution, i.e., learning how the program executes, and only then compiling the program using this knowledge to optimize it better(*). In case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard if not impossible to statically determine without actually running the program.
It should be now clear why FastR needs few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not inherent limitation of the technology that FastR is based on, but tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., after a sufficient warm-up for micro benchmarks or for applications that run for long enough time.
To your concrete example. I could also reproduce the issue and I analyzed it by running the program with builtin CPU sampler:
$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
Name || Total Time || Self Time || Location
-----------------------------------------------------------------------------------------------------------
order || 2190ms 81.4% || 2190ms 81.4% || order.r~1-42:0-1567
which || 70ms 2.6% || 70ms 2.6% || which.r~1-6:0-194
ifelse || 140ms 5.2% || 70ms 2.6% || ifelse.r~1-34:0-1109
...
--cpusampler.Delay=20000 delays the start of sampling by 20 seconds
--engine.TraceCompilation prints basic info about the JIT compilation
when the program finishes, it prints the table from CPU sampler
(example.R runs the micro benchmark in a loop)
One observation is that the Fotran routine called from smooth.spline is not to blame here. It makes sense because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably small cost compared to the computation itself. Also the transition between native and R code is in general more expensive on FastR, but here it does not play a role.
So the problem here seems to be a builtin function order. In GNU-R builtin functions are implemented in C, they basically do a big switch on the type of the input (integer/real/...) and then just execute highly optimized C code doing the work on plain C integer/double/... array. That is already the most effective thing one can do and FastR cannot beat that, but there is no reason for it to not be as fast. Indeed it turns out that there is a performance bug in FastR and the fix is on its way to master. Thank you for bringing our attention to it.
Other points raised:
but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code
YMMV. That concrete benchmark presented at our website does spend considerable amount of time in R code, so the overhead of R<->native transition does not skew the result as much. The best results are when translating the Fortran code to R, so making the whole thing just a pure R program. This shows that FastR can run the same algorithm in R as fast as or quite close to Fortran and that is, performance wise, the main benefit of FastR. There is no free lunch. Warm-up time and the costs of R<->native transition is currently the price to pay.
FastR used between 100% and 300% of CPU usage
This is due to JIT compilations going on on background threads. Again, no free lunch.
To summarize:
FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions or possibly multiple functions inlined into one compilation unit) to the point that it can get close or even match equivalent native code, i.e., significantly faster than GNU-R. This matters on "scalar" R code, i.e., code with loops. For code that spends majority of time in builtin R functions, like, e.g., sum((x - mean(x))^2) for large x, this doesn't gain that much, because that code already spends much of the time in optimized native code even on GNU-R.
What FastR cannot do is to beat GNU-R on execution of a single R builtin function, which is likely to be already highly optimized C code in GNU-R. For individual builtins we may beat GNU-R, because we happen to choose slightly better algorithm or GNU-R has some performance bug somewhere, or it can be the other way around like in this case.
What FastR also cannot do is speeding up native code, like Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs doing this transition too often may end up being slower on FastR.
Note: what FastR can do and is a work-in-progress is to run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and even gives more power to the compiler to optimize across this boundary.
Note: you can use FastR via the cluster package interface to execute only parts of you application.
(*) the first profiling tier may be also compiled, which gives different tradeoffs

What happens if registerDoMC() is used with a number greater than the number of cores?

I'm using a laptop with a 4-core CPU, but somehow using registerDoMC(20) seems to work as if I had 20 cores available:
library(tictoc)
library(doMC)
detectCores()
#> [1] 4
registerDoMC(20)
tic()
a <- foreach(i = 1:20) %dopar% {
Sys.sleep(1)
i
}
toc()
#> 1.084 sec elapsed
Created on 2019-07-22 by the reprex package (v0.3.0)
How is this explained? If I only have 4 cores available, how were the 20 jobs ran in 1 second + some overhead?
At first, I thought I registerDoMC() would return an error for anything greater than detectCores(), but since it didn't and I had this surprising result, I think I misunderstood what happens under the hood in foreach.
Because the underlying OS is a multi-tasking operating system, it already has way more processes running than you have cores available. It just means you have more instances and will not be able to benefit from having all R sub-processes running in a core simultaneously. Typically the improvement in overall performance with increasing the number of simultaneous cores goes down when you exceed the number of cores, so your overall speed-gain above ncores is generally not worth the effort. Further, if there is "large-ish data", the time to xfer data between processes is non-trivial.
My practice has typically been to do "ncores minus 1", leaving 1 core for basic OS-admin stuff ... though even then I've occasionally maxed out and still seen benefits. So while you will likely not break anything by using more processes than you have cores available, I suggest never exceeding it, you are unlikely (in R) to see any performance gains from doing so (and likely a performance penalty).

foreach very slow with large number of values

I'm trying to use foreach to do parallel computations. It works fine if there are a small number of values to iterate over, but at some point it becomes incredibly slow. Here's a simple example:
library(foreach)
library(doParallel)
registerDoParallel(8)
out1 <- foreach(idx=1:1e6) %do%
{
1+1
}
out2 <- foreach(idx=1:1e6) %dopar%
{
1+1
}
out3 <- mclapply(1:1e6,
function(x) 1+1,
mc.cores=20)
out1 and out2 take an incredibly long time to run. Neither of them even spawns multiple threads for as long as I keep them running. out3 spawns the threads almost immediately and runs very quickly. Is foreach doing some sort of initial processing that doesn't scale well? If so, is there is a simple fix? I really prefer the syntax of foreach.
I should also note that the actual code that I'm trying to parallelize is substantially more complicated than 1+1. I only show this as an example because even with this simple code foreach seems to be doing some pre-processing that is incredibly slow.
the forach/doParallel vignette says (to a code much smaller than yours):
Note well that this is not a practical use of doParallel. This is our
“Hello, world” program for parallel computing. It tests that
everything is installed and set up properly, but don’t expect it to
run faster than a sequential for loop, because it won’t! sqrt executes
far too quickly to be worth executing in parallel, even with a large
number of iterations. With small tasks, the overhead of scheduling the
task and returning the result can be greater than the time to execute
the task itself, resulting in poor performance. In addition, this
example doesn’t make use of the vector capabilities of sqrt, which it
must to get decent performance. This is just a test and a pedagogical
example, not a benchmark.
So it might be in the nature of your setting that it is not faster.
Instead try without parallelization but using vectorization:
q <- sapply(1:1e6, function(x) 1 + 1 )
It does exactly the same like your example loops and is done in a second.
And now try this (it does still exactly the same thing exaclty the same times:
x <- rep(1, n=1e6)
r <- x + 1
It adds to 1e6 1s a 1 instantly. (The power of vectorization ...)
The combination of foreach with doParallel is from my personal experience much slower than if you use the bioinformatics BiocParallel package from the repository Bioconda. (I am a bioinformatician and in bioinformatics, we have very often calculation-heavy stuff, since we have single data files of several gigabytes to process - and many of them).
I tried your function using BiocParallel and it uses all assigned CPUs by 100% (tested by running htop during job execution) the entire thing took 17 seconds.
For sure - with your lightweight example, this applies:
the overhead of scheduling the task and returning the result
can be greater than the time to execute the task itself
Anyway, it seems to use the CPUs more thoroughly than doParallel. So use this, if you have calculation-heavy tasks to be get done.
Here the code how I did it:
# For bioconductor packages, the best is to install this:
install.packages("BiocManager")
# Then activate the installer
require(BiocManager)
# Now, with the `install()` function in this package, you can install
# conveniently Bioconductor packages like `BiocParallel`
install("BiocParallel")
# then, activate it
require(BiocParallel)
# initiate cores:
bpparam <- bpparam <- SnowParam(workers=4, type="SOCK") # 4 or take more CPUs
# prepare the function you want to parallelize
FUN <- function(x) { 1 + 1 }
# and now you can call the function using `bplapply()`
# the loop parallelizing function in BiocParallel.
s <- bplapply(1:1e6, FUN, BPPARAM=bpparam) # each value of 1:1e6 is given to
# FUN, note you have to pass the SOCK cluster (bpparam) for the
# parallelization
For more info, go to the vignette of the BiocParallel package.
Look at bioconductor how many packages it provides and all well documented.
I hope this helps you for your future parallel computing stuff.

system.time command - diffrence?

When I run this command:
system.time(fread('x.csv', header = T))
I receive this output:
user system elapsed
4.740 0.048 4.785
In simple terms, what does each of those means, besides "elapsed," which the time that has passed since running the command? What do User and System mean?
From http://www.ats.ucla.edu/stat/r/faq/timing_code.htm
The values presented (user, system, and elapsed) will be defined by your operating system, but generally, the user time relates to the execution of the code, the system time relates to your CPU, and the elapsed time is the difference in times since you started the stopwatch (and will be equal to the sum of user and system times if the chunk of code was run altogether). While the difference of .42 seconds may not seem like much, this gain in efficiency is huge!

Why is foreach() %do% sometimes slower than for?

I'm playing around with parallellization in R for the first time. As a first toy example, I tried
library(doMC)
registerDoMC()
B<-10000
myFunc<-function()
{
for(i in 1:B) sqrt(i)
}
myFunc2<-function()
{
foreach(i = 1:B) %do% sqrt(i)
}
myParFunc<-function()
{
foreach(i = 1:B) %dopar% sqrt(i)
}
I know that sqrt() executes too fast for parallellization to matter, but what I didn't expect was that foreach() %do% would be slower than for():
> system.time(myFunc())
user system elapsed
0.004 0.000 0.005
> system.time(myFunc2())
user system elapsed
6.756 0.000 6.759
> system.time(myParFunc())
user system elapsed
6.140 0.524 6.096
In most examples that I've seen, foreach() %dopar% is compared to foreach() %do% rather than for(). Since foreach() %do% was much slower than for() in my toy example, I'm now a bit confused. Somehow, I thought that these were equivalent ways of constructing for-loops. What is the difference? Are they ever equivalent? Is foreach() %do% always slower?
UPDATE: Following #Peter Fines answer, I update myFunc as follows:
a<-rep(NA,B)
myFunc<-function()
{
for(i in 1:B) a[i]<-sqrt(i)
}
This makes for() a bit slower, but not much:
> system.time(myFunc())
user system elapsed
0.036 0.000 0.035
> system.time(myFunc2())
user system elapsed
6.380 0.000 6.385
for will run sqrt B times, presumably discarding the answer each time. foreach, however, returns a list containing the result of each execution of the loop body. This would contribute considerable extra overhead, regardless of whether it's running in parallel or sequential mode (%dopar% or %do%).
I based my answer by running the following code, which appears to be confirmed by the foreach vignette, which states "foreach differs from a for loop in that its return is a list of values, whereas a for loop has no value and uses side effects to convey its result."
> print(for(i in 1:10) sqrt(i))
NULL
> print(foreach(i = 1:10) %do% sqrt(i))
[[1]]
[1] 1
[[2]]
[1] 1.414214
[[3]]
... etc
UPDATE: I see from your updated question that the above answer isn't nearly sufficient to account for the performance difference. So I looked at the source code for foreach and can see that there is a LOT going on! I haven't tried to understand exactly how it works, however do.R and foreach.R show that even when %do% is run, large parts of the foreach configuration is still run, which would make sense if perhaps the %do% option is largely provided to allow you to test foreach code without having to have a parallel backend configured and loaded. It also needs to support the more advanced nesting and iteration facilities that foreach provides.
There are references in the code to results caching, error checking, debugging and the creation of local environment variables for the arguments of each iteration (see the function doSEQ in do.R for example). I'd imagine this is what creates the difference that you've observed. Of course, if you were running much more complicated code inside your loop (that would actually benefit from a parallelisation framework like foreach), this overhead would become irrelevant compared with the benefits it provides.

Resources