I have a large bioinformatics project in which I want to run a small function on about a million markers. The function takes a small tibble (22 rows, 2 columns) and an integer as input. The returned object is about 80 KB each time, and no large amount of data is created within the function, just some formatting and statistical testing. I've tried various approaches using the parallel, doParallel and doMC packages, all pretty canonical stuff (foreach, %dopar%, etc.), on a machine with 182 cores, of which I am using 60.
However, no matter what I do, the memory requirement gets into the terabytes quickly and crashes the machine. The parent process holds many gigabytes of data in memory though, which makes me suspicious: Does all the memory content of the parent process get copied to the parallelized processes, even when it is not needed? If so, how can I prevent this?
Note: I'm not necessarily interested in a solution to my specific problem, hence no code example or the like. I'm having trouble understanding the details of how memory works in R parallelization.
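Since the question is really about how memory is handled, here is a small illustrative sketch (toy objects, not the poster's data) of the difference between a PSOCK cluster, where only what you export is copied to the workers, and fork-based parallelism, which shares the parent's memory copy-on-write on Unix:
library(parallel)
# Illustrative sketch only. PSOCK workers start as fresh R sessions, so only
# what you explicitly export (plus the function you pass) is serialized to them.
big_object <- rnorm(1e7)                          # stays in the parent only
small_tbl  <- data.frame(a = 1:22, b = runif(22))
cl <- makeCluster(4)                              # PSOCK workers: empty environments
clusterExport(cl, "small_tbl")                    # ship only the small input
res <- parLapply(cl, 1:100, function(i) nrow(small_tbl) + i)
stopCluster(cl)
# A FORK cluster (makeForkCluster) or mclapply() instead shares the parent's
# memory copy-on-write on Unix: objects are not physically duplicated unless
# a child (or the garbage collector) writes to them.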
I am working with a large dataset of 8 GB (the HIGGS dataset). While looking at the vignette for the dbplyr package (see vignette('dbplyr')), I came across this line:
(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.)
The HIGGS dataset does fit in memory on my machine, so my questions are:
Is this always true? And if not, when is it not true?
More generally are there any performance benefits to keeping the data out of memory, even if it does fit, and why?
Edit: after looking at the link provided by @Waldi (RAM is ~100x faster than HDD), an additional question is: how does this change for an SSD?
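For context, a minimal sketch of the workflow the dbplyr vignette describes, with a toy table standing in for HIGGS and an in-memory SQLite database standing in for a real backend (names are placeholders, and RSQLite is assumed to be installed):
library(dplyr)
library(dbplyr)
# Toy stand-in for the HIGGS data in an in-memory SQLite database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(label = sample(0:1, 1e5, replace = TRUE), x = rnorm(1e5))
copy_to(con, toy, "higgs", temporary = FALSE)
higgs <- tbl(con, "higgs")            # lazy reference; nothing is loaded into R yet
higgs %>%
  group_by(label) %>%
  summarise(n = n(), mean_x = mean(x, na.rm = TRUE)) %>%
  collect()                           # only the small summary is pulled into RAM
DBI::dbDisconnect(con)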
R is memory intensive, so it's best to get as much RAM as possible; the amount of RAM you have can limit the size of the data sets you can analyse.
Adding a solid state drive (SSD) typically won't have much impact on the speed of your R code, since R loads objects into RAM. However, the reduction in boot time and the overall increase in productivity, since I/O is much faster, make an SSD a wonderful purchase.
The benchmarkme package lets you assess your CPU's number-crunching ability. The number of CPU cores is another area worth exploring for big-data performance: the more cores the better, provided your workload can actually use them.
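For example (a quick sketch, assuming benchmarkme is installed; the exact function set may differ between versions):
library(benchmarkme)
get_cpu()                      # CPU model and number of cores reported
get_ram()                      # total RAM on the machine
res <- benchmark_std(runs = 1) # standard CPU benchmark suite
plot(res)                      # compare against results from other machines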
The multidplyr package is a backend for dplyr that partitions a data frame across multiple cores. This minimizes time spent moving data around and maximizes parallel performance.
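A minimal sketch of that workflow (toy data; the API shown follows recent multidplyr releases and may differ in older versions):
library(dplyr)
library(multidplyr)
cluster <- new_cluster(4)                     # start 4 worker processes
flights <- data.frame(dest  = sample(letters[1:5], 1e5, replace = TRUE),
                      delay = rnorm(1e5))
flights %>%
  group_by(dest) %>%
  partition(cluster) %>%                      # each group is sent to a worker
  summarise(mean_delay = mean(delay)) %>%     # computed in parallel on the workers
  collect()                                   # results come back to the parent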
I've been using this code:
library(parallel)
cl <- makeCluster(detectCores() - 1)           # leave one core free
clusterCall(cl, function() { library(imager) } )  # load imager on every worker
then I have a wrapper function looking something like this:
d <- matrix #Loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere
I tested on my notebook, which has 8 logical cores (4 physical cores with hyperthreading). When I ran it on a 50,000-row, 800-column matrix, it took 177.5 s to complete, and for most of that time the 7 cores were kept at near 100% (according to top); then it sat there for the last 15 or so seconds, which I guess was combining the results. According to system.time(), user time was 14 s, so that matches.
Now I'm running on EC2, a 36-core c4.8xlarge, and I'm seeing it spending almost all of its time with just one core at 100%. More precisely: There is an approx 10-20 secs burst where all cores are being used, then about 90 secs of just one core at 100% (being used by R), then about 45 secs of other stuff (where I save results and load the next batch of data). I'm doing batches of 40,000 rows, 800 columns.
The long-term load average, according to top, is hovering around 5.00.
Does this seem reasonable? Or is there a point at which R parallelism spends more time on communication overhead, such that I should limit myself to, e.g., 16 cores? Any rules of thumb here?
Ref: CPU spec. I'm using "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)", R version 3.2.2 (2015-08-14).
UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:
With 16 cores:
   user  system elapsed
 30.424   0.580  60.034

With 35 cores:
   user  system elapsed
 30.220   0.604  54.395
I.e. the overhead part took the same amount of time, but with 16 cores the parallel part had fewer cores to work with, so it took longer, and the run took longer overall.
I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330s vs. 360s on the particular test data I tried it on), but that was on my notebook, where other processes, or over-heating, could affect the results. So, I'm not drawing any conclusions on that yet.
There are no useful rules of thumb: the optimal number of cores for a parallel task is entirely determined by the task itself. For a more general discussion, see Gustafson's law.
The high single-core portion that you're seeing probably comes from the final phase of the algorithm (the "join" phase), where the parallel results are collated into a single data structure. Since this phase far surpasses the parallel computation phase in duration, it may indeed be an indication that fewer cores could be beneficial.
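In practice the only way to find the sweet spot is to measure it for your own workload. A toy sketch (stand-in data and per-row function, not the poster's code):
library(parallel)
# Time the same job at several core counts and pick the fastest empirically.
time_with_cores <- function(n_cores, d, f) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl))
  system.time(parApply(cl, d, 1, f))[["elapsed"]]
}
d <- matrix(rnorm(20000 * 100), nrow = 20000)   # stand-in data
f <- function(row) sum(sort(row)^2)             # stand-in per-row computation
sapply(c(2, 4, 8, 16), time_with_cores, d = d, f = f)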
I'd add that, in case you are not aware of this wonderful resource on parallel computing in R, you may find Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA very helpful. I'd highly recommend it (I learnt a lot, not coming from a CS background).
The book answers your question in depth (Chapter 2 specifically), giving a high-level overview of the causes of the overhead that leads to bottlenecks in parallel programs. Quoting section 2.1, which implicitly and partially answers your question:
There are two main performance issues in parallel programming:
Communications overhead: Typically data must be transferred back and forth between processes. This takes time, which can take quite a toll on performance. In addition, the processes can get in each other's way if they all try to access the same data at once. They can collide when trying to access the same communications channel, the same memory module, and so on. This is another sap on speed. The term granularity is used to refer, roughly, to the ratio of computation to overhead. Large-grained or coarse-grained algorithms involve large enough chunks of computation that the overhead isn't much of a problem. In fine-grained algorithms, we really need to avoid overhead as much as possible.
^ When overhead is high, fewer cores for the problem at hand can give a shorter total computation time (see the chunking sketch at the end of this answer).
Load balance: As noted in the last chapter, if we are not careful in the way in which we assign work to processes, we risk assigning much more work to some than to others. This compromises performance, as it leaves some processes unproductive at the end of the run, while there is still work to be done.
When, if ever, should you not use all cores? One example from my personal experience: in running daily cron jobs in R on data amounting to 100-200 GB in RAM, where multiple cores are used to crunch blocks of data, I've found that running with, say, 6 of the 32 available cores can be faster than using 20-30 of them. A major reason is the memory requirement of the child processes: once a certain number of child processes were running, memory usage got high and things slowed down considerably.
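The chunking sketch referred to above: a coarse-grained split gives each worker one big chunk instead of one tiny task per element, so the per-task overhead described in section 2.1 is paid only a handful of times (toy task; assumes a Unix-like OS, since mclapply relies on forking):
library(parallel)
x <- 1:1e6
n_workers <- 6                                   # deliberately fewer than the available cores
chunks <- split(x, cut(seq_along(x), n_workers, labels = FALSE))
res <- mclapply(chunks, function(chunk) {
  vapply(chunk, function(i) sqrt(i) + log(i), numeric(1))  # stand-in work
}, mc.cores = n_workers)
out <- unlist(res, use.names = FALSE)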
I tried running the Apriori algorithm on a 30 GB CSV file in which each row is a basket with up to 34 items (columns). RStudio died just after I started execution. I want to know what the minimum system requirements are, i.e. how much RAM and what CPU configuration I need to run algorithms on large data sets?
This question cannot be answered as such. It highly depends on what you want to do with the data.
Example
If you are able to process the lines one by one, you just need a tiny bit of RAM (for example if you want to count them; I believe this also holds for the most trivial use of Apriori). See the streaming sketch after these examples.
If you want to calculate the distance between all points efficiently, you will want a ton of RAM, plus another few GB to store the output (I believe this is still less intense than the most extreme use of Apriori).
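The streaming sketch referred to in the first example: reading and processing the file in fixed-size chunks so only a small buffer is ever held in RAM (the file name is a placeholder):
con <- file("baskets.csv", open = "r")
n_baskets <- 0
repeat {
  lines <- readLines(con, n = 10000)       # read 10,000 rows at a time
  if (length(lines) == 0) break
  n_baskets <- n_baskets + length(lines)   # stand-in "processing": just count
}
close(con)
n_baskets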
Conclusion
As such I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage as you increase the data size (or vary other parameters), and extrapolate the results to estimate what you will probably need.
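A rough sketch of that advice, using a toy workload in place of Apriori; swap in your own reading and processing step:
sizes <- c(1e4, 5e4, 1e5)
stats <- do.call(rbind, lapply(sizes, function(n) {
  df   <- data.frame(matrix(runif(n * 34), ncol = 34))   # stand-in for n baskets
  secs <- system.time(apply(df, 1, function(r) sum(r > 0.5)))[["elapsed"]]
  data.frame(rows = n, seconds = secs,
             mem_mb = as.numeric(object.size(df)) / 2^20)
}))
stats   # extrapolate seconds and MB to the full data size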
This is a somewhat generic question, for which I apologize, but I can't produce a code example that reproduces the behavior. My question is this: I'm scoring a largish data set (~11 million rows with 274 dimensions) by subdividing it into a list of data frames and then running a scoring function on 16 cores of a 24-core Linux server using mclapply. Each data frame in the list is allocated to a spawned instance and scored, returning a list of data frames of predictions. While mclapply is running, the various R instances spend a lot of time in uninterruptible sleep, more than they spend running. Has anyone else experienced this with mclapply? I'm a Linux neophyte; from an OS perspective, does this make any sense? Thanks.
You need to be careful when using mclapply to operate on large data sets. It's easy to create too many workers for the amount of memory on your computer and the amount of memory used by your computation. It's hard to predict the memory requirements due to the complexity of R's memory management, so it's best to monitor memory usage carefully using a tool such as "top" or "htop".
You may be able to decrease the memory usage by splitting your work into more but smaller tasks, since that may reduce the memory needed by the computation. I don't think the choice of prescheduling affects memory usage much, since mclapply will never fork more than mc.cores workers at a time, regardless of the value of mc.preschedule.
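A minimal sketch of the "more but smaller tasks, fewer workers" idea (toy data; score_batch is a hypothetical stand-in for your scoring function; assumes a Unix-like OS for mclapply):
library(parallel)
big_df      <- as.data.frame(matrix(rnorm(1e5 * 10), ncol = 10))  # toy data
score_batch <- function(df) data.frame(score = rowSums(df))       # hypothetical scorer
# 64 small chunks, but only 4 forked workers alive at any one time,
# which caps the peak memory held by the children.
chunks <- split(big_df, cut(seq_len(nrow(big_df)), breaks = 64, labels = FALSE))
res    <- mclapply(chunks, score_batch, mc.cores = 4)
predictions <- do.call(rbind, res)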