I want to solve 10 linear systems (Ax = b) in each iteration of an algorithm.
The A of each system is about 10 x 11 (over-determined).
The CPU has 8 cores.
If I assign one system to each core, 6 cores sit idle while the last 2 systems are solved.
If I solve each system one by one with multi-threaded solvers, would the performance be quite bad? I worry about false-sharing because the matrices A are small.
Does Eigen have a multi-threaded solver for this situation?
Thanks again.
Trying to leverage multi-threading within such tiny problems (10 x 11) will only slow things down. If you want to do better than running the 10 solves in parallel, try to find more parallel tasks within your pipeline.
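To put a number on the idle-core concern from the question, here is a rough sketch, using R purely as a calculator. The helper name `rounds_and_utilization` is made up for illustration, and the sketch assumes every solve costs the same and each core handles one system at a time.

```r
# Hypothetical helper: how many rounds do n_tasks equal-cost tasks need on
# n_cores cores, and what fraction of the available core-time does real work?
rounds_and_utilization <- function(n_tasks, n_cores) {
  rounds <- ceiling(n_tasks / n_cores)
  utilization <- n_tasks / (rounds * n_cores)
  list(rounds = rounds, utilization = utilization)
}

res <- rounds_and_utilization(n_tasks = 10, n_cores = 8)
res$rounds       # 2
res$utilization  # 0.625
```

Under those assumptions the second round keeps only 2 of the 8 cores busy, which is why adding more independent work to the pipeline helps more than threading each tiny solve.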
Common sense suggests that a computation should get faster the more cores or threads we use; if it scales badly, the computation time simply fails to improve as threads are added. So how come increasing the number of threads considerably increases the computation time when fitting a GAM with the R package mgcv, as shown by this example?
library(mgcv) # provides gam() and tw()
library(boot) # loads data "amis"
t1 <- Sys.time()
mod <- gam(speed ~ s(period, warning, pair, k = 12), data = amis, family = tw(link = "log"), method = "REML", control = list(nthreads = 1))
t2<-Sys.time()
print("Model fitted in:")
print(t2-t1)
If you increase the number of threads in this example to 2, 4, etc, the fitting procedure will take longer and longer, instead of being faster as we would expect. In my particular case:
1 thread: 32.85333 secs
2 threads: 50.63166 secs
3 threads: 1.2635 mins
Why is this? If I am doing something wrong, what can I do to obtain the desired behavior (i.e., increasing performance with increasing number of threads)?
Some notes:
1) The model, family and solving method shown here make no particular sense; this is only an example. However, I ran into this problem with real data and a reasonable model, and I use this small code only to illustrate it. The data, functional form of the model, family and solving method all seem to be irrelevant: after many tests I always get the same behaviour, i.e., increasing the number of threads decreases performance (increases computation time).
2) Operating system: Linux Ubuntu 18.04.
3) Architecture: Dell PowerEdge with two physical Intel Xeon X5660 CPUs, each with 6 cores @ 2800 MHz and each core capable of handling 2 threads (i.e., 24 threads in total). 80 GB RAM.
4) The OpenMP libraries (which are needed for the multi-threading capability of the gam function) were installed with
sudo apt-get install libomp-dev
5) I am aware of the help page for multi-core use of gam (https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/mgcv-parallel.html). The only thing written there pointing to a decrease of performance with increasing number of threads is "Because the computational burden in mgcv is all in the linear algebra, then parallel computation may provide reduced (...) benefit with a tuned BLAS".
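To make the comparison across thread counts repeatable, a small timing helper along these lines may be useful. This is only a sketch: `time_by_threads` and `dummy_fit` are hypothetical names, and the stand-in workload ignores the thread count so that the block runs without mgcv installed.

```r
# Hypothetical helper: time a fitting function for several thread counts.
# `fit_fun` must accept a single argument, the number of threads to use.
time_by_threads <- function(fit_fun, thread_counts) {
  elapsed <- vapply(thread_counts, function(n) {
    unname(system.time(fit_fun(n))["elapsed"])
  }, numeric(1))
  data.frame(nthreads = thread_counts, elapsed_secs = elapsed)
}

# Stand-in workload so the sketch runs anywhere; it ignores `n`.
dummy_fit <- function(n) sum(sqrt(seq_len(1e5)))
timings <- time_by_threads(dummy_fit, c(1, 2, 4))
print(timings)

# With mgcv and boot loaded, the model from the question could be passed as:
# fit_amis <- function(n) gam(speed ~ s(period, warning, pair, k = 12),
#                             data = amis, family = tw(link = "log"),
#                             method = "REML", control = list(nthreads = n))
# time_by_threads(fit_amis, c(1, 2, 3))
```

Using the elapsed component of `system.time()` rather than differences of `Sys.time()` keeps the measurement in one place and returns plain seconds for every run.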
I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (number of columns ~ 80000, number of rows between 40 and 180). The analyses involve several feature selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine, with 2 Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB RAM and the Windows 7 x64 operating system.
To speed up the running time of my analysis I thought of using the doParallel package in combination with foreach. I would set up the cluster as follows:
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_cores set to various numbers between 2 and 10 (detectCores() tells me that I have 56 cores in total).
My problem is that even with number_of_cores set to only 2, I get a "protection stack overflow" error message. I monitored RAM usage while the script was running, and not even 20 GB of my 192 GB of RAM were in use.
If I run the script in a sequential way it takes its sweet time (~ 3 hours with 42 rows and ~ 80000 columns), but it does run until the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait more than 3 hours every time I run the analyses? And more generally: how is it possible to get a stack overflow when there is a lot of free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work on the same machine: since I am running a 10-fold cross-validation scheme, I open 5 different instances of Rgui and run 2 folds in each instance. Proceeding this way, everything runs smoothly, and the process indeed takes 10 times less time than running it in a single instance of R. What puzzles me is that if 10 instances of Rgui can run at the same time and get the job done, the machine must have the computational resources needed. Hence I cannot get my head around the fact that %dopar% with 10 clusters does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is, too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it does not matter how much physical memory the machine has. Please do not call gc() explicitly at all. Even if there were a problem with heap usage, it would only make the program run slower without helping: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability and readability, and it also indirectly reduces the scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
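Because each PSOCK worker is a separate R process, the option would have to reach the workers as well as the master session. One way this might be done is through the `rscript_args` option of `makeCluster()`, which passes extra arguments to the Rscript call that starts each worker. The sketch below is an assumption about where the overflow occurs, not a confirmed fix; `ppsize_arg` and `make_cluster_with_ppsize` are hypothetical names, and the value 100000 is only illustrative.

```r
library(parallel)

# Build the extra Rscript argument for a given protection stack size.
# --max-ppsize is the startup option named above; the value is illustrative.
ppsize_arg <- function(ppsize) {
  paste0("--max-ppsize=", format(ppsize, scientific = FALSE))
}

# Hypothetical wrapper: start PSOCK workers with a larger protection stack.
make_cluster_with_ppsize <- function(n_workers, ppsize = 100000) {
  makeCluster(n_workers, type = "PSOCK", rscript_args = ppsize_arg(ppsize))
}

ppsize_arg(100000)

# Usage (starts worker processes, so it is left commented out here):
# cl <- make_cluster_with_ppsize(2)
# doParallel::registerDoParallel(cl)
# # run the foreach / %dopar% loop here
# stopCluster(cl)
```

If the overflow turns out to happen in the master session instead, the same option would need to be given when starting R itself.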
I've been using this code:
library(parallel)
cl <- makeCluster( detectCores() - 1)
clusterCall(cl, function(){library(imager)})
then I have a wrapper function looking something like this:
d <- matrix #Loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere
I tested on my notebook, with 8 cores (4 cores, hyperthreading). When I ran it on a 50,000 row, 800 column, matrix, it took 177.5s to complete, and for most of the time the 7 cores were kept at near 100% (according to top), then it sat there for the last 15 or so seconds, which I guess was combining results. According to system.time(), user time was 14s, so that matches.
Now I'm running on EC2, a 36-core c4.8xlarge, and I'm seeing it spending almost all of its time with just one core at 100%. More precisely: There is an approx 10-20 secs burst where all cores are being used, then about 90 secs of just one core at 100% (being used by R), then about 45 secs of other stuff (where I save results and load the next batch of data). I'm doing batches of 40,000 rows, 800 columns.
The long-term load average, according to top, is hovering around 5.00.
Does this seem reasonable? Or is there a point where R parallelism spends more time on communication overhead, so that I should limit it to e.g. 16 cores? Any rules of thumb here?
Ref: CPU spec I'm using "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)". R version 3.2.2 (2015-08-14)
UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:
With 16 cores:
   user  system elapsed
 30.424   0.580  60.034
With 35 cores:
   user  system elapsed
 30.220   0.604  54.395
I.e. the overhead part took the same amount of time, but the parallel bit had fewer cores so took longer, and so it took longer overall.
I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330s vs. 360s on the particular test data I tried it on), but that was on my notebook, where other processes, or over-heating, could affect the results. So, I'm not drawing any conclusions on that yet.
There are no useful rules of thumb — the number of cores that a parallel task is optimal for is entirely determined by said task. For a more general discussion see Gustafson’s law.
The high single-core portion that you’re seeing in your code probably comes from the end phase of the algorithm (the “join” phase), where the parallel results are collated into a single data structure. Since this far surpasses the parallel computation phase, this may indeed be an indication that fewer cores could be beneficial.
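The two timings in the question's update allow a rough estimate of how large that serial portion is. The sketch below assumes a deliberately simple model, elapsed = serial + parallel_work / cores, which ignores any communication cost that grows with the number of workers; the variable names are illustrative.

```r
# Elapsed times (seconds) reported in the question's update.
cores   <- c(16, 35)
elapsed <- c(60.034, 54.395)

# Assumed model: elapsed = serial + parallel_work / cores.
# Two measurements give two equations in two unknowns.
parallel_work <- (elapsed[1] - elapsed[2]) / (1 / cores[1] - 1 / cores[2])
serial        <- elapsed[1] - parallel_work / cores[1]

# Predicted elapsed time for other core counts under the same model.
predict_elapsed <- function(n) serial + parallel_work / n

round(c(serial = serial, parallel_work = parallel_work), 1)
round(predict_elapsed(c(8, 16, 35)), 1)
```

Under this model the serial part comes out at roughly 50 seconds of each run, so going from 16 to 35 cores can only shave a few seconds off; the join phase, not the core count, is the thing worth optimizing.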
I'd add that in case you are not aware of this wonderful resource for parallel computing in R, you may find reading Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA a very helpful read. I'd highly recommend it (I learnt a lot, not coming from a CS background).
The book answers your question in depth (Chapter 2 specifically). The book gives a high level overview of the causes of overhead that lead to bottlenecks to parallel programs.
Quoting section 2.1, which implicitly partially answers your question:
There are two main performance issues in parallel programming:
Communications overhead: Typically data must be transferred back and
forth between processes. This takes time, which can take quite a toll
on performance. In addition, the processes can get in each other’s way
if they all try to access the same data at once. They can collide when
trying to access the same communications channel, the same memory
module, and so on. This is another sap on speed. The term granularity
is used to refer, roughly, to the ratio of computation to overhead.
Large-grained or coarse-grained algorithms involve large enough chunks
of computation that the overhead isn’t much of a problem. In
fine-grained algorithms, we really need to avoid overhead as much as
possible.
^ When overhead is high, fewer cores for the problem at hand can give a shorter total computation time.
Load balance: As noted in the last chapter, if we are not
careful in the way in which we assign work to processes, we risk
assigning much more work to some than to others. This compromises
performance, as it leaves some processes unproductive at the end of
the run, while there is still work to be done.
When, if ever, should you not use all cores? One example from my personal experience: I run daily cron jobs in R on data that amounts to 100-200 GB in RAM, with multiple cores crunching blocks of data, and I have indeed found running with, say, 6 of the 32 available cores to be faster than using 20-30 of them. A major reason was the memory requirements of the child processes: once a certain number of child processes were running, memory usage got high and things slowed down considerably.
Is it possible to use OpenCL but with R code? I still don't have a good understanding of OpenCL and GPU programming. For example, suppose I have the following R code:
aaa <- function(x) mean(rnorm(1000000))
sapply(1:10, aaa)
I like that I can kind of use mclapply as a dropin replacement for lapply. Is there a way to do that for OpenCL? Or to use OpenCL as a backend for mclapply? I'm guessing this is not possible because I have not been able to find an example, so I have two questions:
Is this possible and if so can you give a complete example using my function aaa above?
If this is not possible, can you please explain why? I do not know much about GPU programming. I view GPUs just like CPUs, so why can't I run R code in parallel?
I would start by looking at the High Performance Computing CRAN task view, in particular the Parallel computing: GPUs section.
There are a number of packages listed there which take advantage of GPGPU for specific tasks that lend themselves to massive parallelisation (e.g. gputools, HiPLARM). Most of these use NVIDIA's own CUDA rather than OpenCL.
There is also a more generic OpenCL package, but it requires you to learn how to write OpenCL code yourself, and merely provides an interface to that code from R.
It isn't possible because GPUs work differently than CPUs which means you can't give them the same instructions that you'd give a CPU.
Nvidia puts on a good show in this video describing the difference between CPU and GPU processing. Essentially the difference is that GPUs typically have, by orders of magnitude, more cores than CPUs.
Your example is one that can be extended to GPU code because it is highly parallel.
Here's some code to create random numbers (although they aren't normally distributed) http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Once you have created the random numbers, you could break them into chunks, sum each of the chunks in parallel, and then add the chunk sums to get the overall sum (see "Is it possible to run the sum computation in parallel in OpenCL?").
I realize that your code would generate the random number vector and compute its sum serially, and parallelize that operation 10 times, but with GPU processing, having a mere 10 tasks isn't very efficient since you'd leave so many cores idle.
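The chunk-and-combine idea itself can be sketched in plain R on the CPU. This is not OpenCL code, only an illustration of the reduction pattern; `chunked_sum` is a hypothetical name and the uniform draws merely stand in for the output of a GPU generator.

```r
# Chunked reduction on the CPU: split, sum each chunk, then combine.
chunked_sum <- function(x, n_chunks) {
  # Assign each element to one of n_chunks roughly equal, contiguous chunks.
  chunk_id <- ceiling(seq_along(x) * n_chunks / length(x))
  # The per-chunk sums are independent of each other.
  partial <- vapply(split(x, chunk_id), sum, numeric(1))
  # Final combine step.
  sum(partial)
}

set.seed(1)
x <- runif(1e6)  # stand-in data
all.equal(chunked_sum(x, n_chunks = 8), sum(x))
```

Since the partial sums do not depend on one another, they are the part a GPU (or, on the CPU, something like mclapply) could compute concurrently; only the last `sum(partial)` has to wait for all of them.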
I have been successfully running some moderate sized lasso simulations in R (5k*5k*100 tables). And I was able to run all 8 threads of an i7 by breaking 100 target regressions into 13 lists of 5k*5k*8 tables each. I noticed when I ran one standalone simulation, it would take about eight minutes per simulation of 1 table, but when I ran a loop over several (size 8 tasks), it would take hours (11 hours all night) to complete.
I finally decided to write out the data in tasks of equal size proportions to a csv file as they were processed. The first few took about 8min each as expected, but when I came back home, there was a single task still running for two hours. I had thought it could be due to the data (each data table has identical regressors but different targets). But then I realized it might be due to the computer going to sleep mode. As soon as I awakened the computer, the two hour simulation quickly finished and the remaining tasks took 8min each as expected.
So does sleep (hibernate) mode dramatically slow down overnight tasks? Is it normal, in that case, to disable hibernate until the full simulation is complete?
Build:
Intel i7 3.2 GHz quad core
16 GB RAM
R Revolution 64 bit
Windows 7 Pro 64 bit
It appears the answer is yes: hibernate does dramatically slow down R parallel simulations on a multicore computer (Windows 7). I suspect it occurs for other (non-R) overnight simulations as well.
Notice that in the first run, tasks pred.6 and pred.7 took about 2 hours. In the 2nd set of simulations (pred1.n), none took more than 11 minutes per simulation.
The 2nd set was run with sleep/hibernate set to never in control panel power options.