I have encountered a problem with the scalability of an MPI solver on a single machine. I want to do twenty matrix summations. This work is done in two ways:
1. 20 MPI tasks are launched; each task does one matrix summation in a single round.
2. 2 MPI tasks are launched; each task does one matrix summation per round, for ten rounds of calculation. The tasks are synchronized between rounds.
The CPU time of the matrix summation in each round is recorded for both approaches. The CPU time does NOT include memory allocation and release, nor matrix creation.
I expect the CPU time per round to be the same in both cases, since each task only calculates one matrix summation per round and there is no communication between tasks, i.e. the calculations of the tasks are independent. That means the scalability of this solver should be perfect.
But the CPU time of the first approach (20 MPI tasks) is more than ten times that of the second (2 MPI tasks). What is the possible reason for this? Is it related to memory caching?
I am using a naive Bayesian classifier to predict some test data in R. The test data has >1,000,000,000 records and takes far too long to process with one processor. The computer I am using has (only) four processors in total, three of which I can free up to run my task (I could use all four, but prefer to keep one for other work I need to do).
Using the foreach and doSNOW packages, and following this tutorial, I have things set up and running. My question is:
I have the dataset split into three parts, one part per processor. Is there a benefit to splitting the dataset into, say, 6, 9, or 12 parts? In other words, what is the trade-off between more splits vs. just having one big block of records for each processor core to run?
I haven't provided any data here because I think this question is more theoretical. But if data are needed, please let me know.
Broadly speaking, the advantage of splitting it up into more parts is that you can optimize your processor use.
If the dataset is split into 3 parts, one per processor, and they take the following time:
Split A - 10 min
Split B - 20 min
Split C - 12 min
You can see immediately that two of your processors are going to sit idle for a significant portion of the time needed to do the full analysis.
Instead, if you have 12 splits, each one taking between 3 and 6 minutes to run, then processor A can pick up another chunk of the job after it finishes with the first one instead of idling until the longest-running split finishes.
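As a rough sketch of that idea with foreach and doSNOW (as in the tutorial you mention): here `test_data`, `score_chunk`, and the chunk count are placeholders I'm assuming, not names from your setup.
library(foreach)
library(doSNOW)
cl <- makeCluster(3)   # one worker per free processor
registerDoSNOW(cl)
n_chunks <- 12         # more chunks than workers, so a finished worker picks up the next chunk
chunks <- split(test_data, cut(seq_len(nrow(test_data)), n_chunks, labels = FALSE))
predictions <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  score_chunk(chunk)   # placeholder: run predict() for your classifier on one chunk
}
stopCluster(cl)
The point is simply that with 12 chunks of unequal cost, no worker sits idle while another grinds through the longest split.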
I've been using this code:
library(parallel)
cl <- makeCluster( detectCores() - 1)
clusterCall(cl, function(){library(imager)})
then I have a wrapper function looking something like this:
d <- matrix #Loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere
I tested on my notebook, which has 8 logical cores (4 physical cores with hyperthreading). When I ran it on a 50,000-row, 800-column matrix, it took 177.5 s to complete, and for most of that time the 7 workers were kept at near 100% (according to top); then it sat there for the last 15 or so seconds, which I guess was combining the results. According to system.time(), user time was 14 s, so that matches.
Now I'm running on EC2, a 36-core c4.8xlarge, and I'm seeing it spending almost all of its time with just one core at 100%. More precisely: There is an approx 10-20 secs burst where all cores are being used, then about 90 secs of just one core at 100% (being used by R), then about 45 secs of other stuff (where I save results and load the next batch of data). I'm doing batches of 40,000 rows, 800 columns.
The long-term load average, according to top, is hovering around 5.00.
Does this seem reasonable? Or is there a point where R's parallelism spends more time on communication overhead, meaning I should limit it to, e.g., 16 cores? Any rules of thumb here?
Ref: CPU spec. I'm using "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)" and R version 3.2.2 (2015-08-14).
UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:
With 16 cores:
user system elapsed
30.424 0.580 60.034
With 35 cores:
user system elapsed
30.220 0.604 54.395
I.e. the overhead part took the same amount of time, but with fewer cores the parallel part took longer, so the run took longer overall.
I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330s vs. 360s on the particular test data I tried it on), but that was on my notebook, where other processes, or over-heating, could affect the results. So, I'm not drawing any conclusions on that yet.
There are no useful rules of thumb: the optimal number of cores for a parallel task is entirely determined by that task. For a more general discussion, see Gustafson's law.
The high single-core portion that you’re seeing in your code probably comes from the end phase of the algorithm (the “join” phase), where the parallel results are collated into a single data structure. Since this far surpasses the parallel computation phase, this may indeed be an indication that fewer cores could be beneficial.
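If you want to confirm this, one rough sketch (assuming the cluster `cl`, matrix `d`, and row-wise function `FUN` from your code, and that `FUN` returns a vector per row) is to time the parallel phase and the single-core join phase separately:
library(parallel)
rows <- lapply(seq_len(nrow(d)), function(i) d[i, ])          # split rows up front
t_parallel <- system.time(parts <- parLapply(cl, rows, FUN))  # all workers busy here
t_join     <- system.time(res   <- do.call(rbind, parts))     # single-core collation of results
print(t_parallel)
print(t_join)
If t_join dominates t_parallel, the join phase is your bottleneck rather than the worker count.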
I'd add that, in case you are not aware of this wonderful resource for parallel computing in R, you may find Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA very helpful. I'd highly recommend it (I learnt a lot, not coming from a CS background).
The book answers your question in depth (Chapter 2 specifically), giving a high-level overview of the sources of overhead that create bottlenecks in parallel programs.
Quoting section 2.1, which implicitly partially answers your question:
There are two main performance issues in parallel programming:
Communications overhead: Typically data must be transferred back and forth between processes. This takes time, which can take quite a toll on performance. In addition, the processes can get in each other's way if they all try to access the same data at once. They can collide when trying to access the same communications channel, the same memory module, and so on. This is another sap on speed. The term granularity is used to refer, roughly, to the ratio of computation to overhead. Large-grained or coarse-grained algorithms involve large enough chunks of computation that the overhead isn't much of a problem. In fine-grained algorithms, we really need to avoid overhead as much as possible.
^ When overhead is high, using fewer cores for the problem at hand can give a shorter total computation time.
Load balance: As noted in the last chapter, if we are not careful in the way in which we assign work to processes, we risk assigning much more work to some than to others. This compromises performance, as it leaves some processes unproductive at the end of the run, while there is still work to be done.
When, if ever, should you not use all cores? One example from my personal experience: running daily cron jobs in R on data that amounts to 100-200 GB in RAM, with multiple cores crunching blocks of data, I have indeed found that running with, say, 6 out of 32 available cores is faster than using 20-30 of them. A major reason was the memory requirements of the child processes (after a certain number of child processes were active, memory usage got high and things slowed down considerably).
This is a somewhat generic question, for which I apologize, but I can't generate a code example that reproduces the behavior. My question is this: I'm scoring a largish data set (~11 million rows with 274 dimensions) by subdividing it into a list of data frames and then running a scoring function on 16 cores of a 24-core Linux server using mclapply. Each data frame in the list is allocated to a spawned instance and scored, returning a list of data frames of predictions. While mclapply is running, the various R instances spend more time in uninterruptible sleep than they spend running. Has anyone else experienced this with mclapply? I'm a Linux neophyte; from an OS perspective, does this make any sense? Thanks.
You need to be careful when using mclapply to operate on large data sets. It's easy to create too many workers for the amount of memory on your computer and the amount of memory used by your computation. It's hard to predict the memory requirements due to the complexity of R's memory management, so it's best to monitor memory usage carefully using a tool such as "top" or "htop".
You may be able to decrease the memory usage by splitting your work into more but smaller tasks, since that may reduce the memory needed by the computation. I don't think the choice of prescheduling affects memory usage much, since mclapply will never fork more than mc.cores workers at a time, regardless of the value of mc.preschedule.
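For example, here is a minimal sketch of the "more but smaller tasks" idea; `list_of_dfs` and `score_df` are placeholders for your list of data frames and your scoring function, and the chunk count and mc.cores value are arbitrary illustrative choices to leave memory headroom rather than recommendations.
library(parallel)
smaller_tasks <- unlist(
  lapply(list_of_dfs, function(df) split(df, cut(seq_len(nrow(df)), 4, labels = FALSE))),
  recursive = FALSE
)                                                          # each data frame becomes 4 smaller ones
predictions <- mclapply(smaller_tasks, score_df, mc.cores = 8)   # fewer workers than cores
While this runs, watch memory in top or htop and adjust mc.cores downward if the machine starts swapping.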
I have been successfully running some moderate-sized lasso simulations in R (5k*5k*100 tables). I was able to keep all 8 threads of an i7 busy by breaking the 100 target regressions into 13 lists of 5k*5k*8 tables each. I noticed that when I ran one standalone simulation it took about eight minutes per table, but when I ran a loop over several (size-8) tasks, it took hours (11 hours, all night) to complete.
I finally decided to write the data out to a CSV file in equal-sized tasks as they were processed. The first few took about 8 minutes each, as expected, but when I came back home there was a single task that had been running for two hours. I had thought it could be due to the data (each data table has identical regressors but different targets), but then I realized it might be due to the computer going into sleep mode. As soon as I woke the computer, the two-hour simulation quickly finished and the remaining tasks took 8 minutes each, as expected.
So does sleep (hibernate) mode dramatically slow down overnight tasks? Is it normal, in that case, to disable hibernation until the full simulation is complete?
Build:
Intel i7 3.2 GHz quad core
16 GB RAM
Revolution R 64-bit
Windows 7 Pro 64-bit
It appears the answer is yes: hibernation does dramatically slow down R parallel simulations on a multicore computer (Windows 7); I suspect it affects other (non-R) overnight simulations as well.
Notice that in the first run, tasks pred.6 and pred.7 took about 2 hours. In the 2nd set of simulations (pred1.n), no task took more than 11 minutes.
The 2nd set was run with sleep/hibernate set to Never in the Control Panel power options.
I'm using R to convert some shapefiles. R does this using just one core of my processor, and I want to speed it up using parallel processing. So I've parallelized the process like this.
Given files which is a list of files to convert:
library(doMC)
registerDoMC()
foreach(f=files) %dopar% {
# Code to do the conversion
}
This works just fine and it uses 2 cores. According to the documentation for registerDoMC(), by default that function uses half the cores detected by the parallel package.
My question is why should I use half of the cores instead of all the cores? (In this case, 4 cores.) By using the function registerDoMC(detectCores()) I can use all the cores on my system. What, if any, are the downsides to doing this?
Besides the question of scalability, there is a simple rule: Intel Hyperthreading cores do not help, at least under Windows. So I get 8 with detectCores(), but I have never found an improvement when going beyond 4 cores, even with parallel MCMC threads, which in general scale perfectly.
If someone has a case (under Windows) where there is such an improvement from Hyperthreading, please post it.
Any time you do parallel processing there is some overhead (which can be nontrivial, especially with locking data structures and blocking calls). For small batch jobs, running on one or two cores is much faster because you are not paying that overhead.
I don't know the size of your job, but you should probably run some scaling experiments where you time your job on 1 processor, 2 processors, 4 processors, 8 processors, until you hit the max core count for your system (typically, you always double the processor count). EDIT: It looks like you're only using 4 cores, so time with 1, 2, and 4.
Run ~32 timing trials for each core count and compute a confidence interval; then you can say with some certainty whether running on all cores is right for you. If your job takes a long time, reduce the number of trials, down to 5 or so, but remember that more trials will give you a higher degree of confidence. A sketch of such an experiment is below.
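Here is a sketch of such a scaling experiment for your shapefile case, assuming the `files` vector and the conversion code from your question; the trial count is kept small here purely for illustration.
library(foreach)
library(doMC)
core_counts <- c(1, 2, 4)
n_trials <- 5                       # ~32 is better if you can afford the run time
timings <- lapply(core_counts, function(nc) {
  registerDoMC(nc)                  # register nc workers for this batch of trials
  replicate(n_trials, system.time(
    foreach(f = files) %dopar% {
      # code to do the conversion, as in the question
    }
  )["elapsed"])                     # keep only wall-clock time per trial
})
names(timings) <- paste0("cores_", core_counts)
Each element of `timings` is then a vector of elapsed times for one core count, ready for the significance test described next.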
To elaborate:
Student's t-test:
The Student's t-test essentially says: "you calculated an average time for this core count, but that's not the true average. We could only get the true average if we had an infinite number of data points. The true average actually lies in some interval around your computed average."
The t-test for significance then basically compares the intervals around the true averages of two sets of measurements and says whether they are significantly different or not. So you may have one average time be less than another, but because the standard deviation is sufficiently high, we can't say for certain that it's actually less; the true averages may be identical.
So, to compute this test for significance:
1. Run your timing experiments.
2. For each core count, compute the mean and the standard deviation. The standard deviation should be the population standard deviation, which is the square root of the population variance: (1/N) * summation_over_all_data_points((datapoint_i - mean)^2). You will now have a mean and a standard deviation for each core count: (m_1, s_1), (m_2, s_2), etc.
3. For every pair of core counts, compute a t-value: t = (m_1 - m_2)/(s_1/ sqrt(#dataPoints))
The example t-value shown above tests whether the mean timing result for a core count of 1 is significantly different from that for a core count of 2. You could test the other way around with:
t = (m_2 - m_1)/(s_2/ sqrt(#dataPoints))
After you computed these t-values, you can tell whether they're significant by looking at the critical value table. Now, before you click that, you need to know about 2 more things:
Degrees of Freedom
This is related to the number of datapoints you have. The more datapoints you have, the smaller the interval around mean probably is. Degrees of freedom kind of measures your computed mean's ability to move about, and it is #dataPoints - 1 (v in the link I provided).
Alpha
Alpha is a probability threshold. In the Gaussian (Normal, bell-curved) distribution, alpha cuts off the bell curve on both the left and the right. Any probability between the cutoffs falls inside the threshold and is an insignificant result. A lower alpha makes it harder to get a significant result; that is, alpha = 0.01 means only the most extreme 1% of probabilities are significant, and alpha = 0.05 means the most extreme 5%. Most people use alpha = 0.05.
In the table I link to, 1-alpha determines the column you will go down looking for a critical value. (so alpha = 0.05 gives 0.95, or a 95% confidence interval), and v is your degrees of freedom, or row to look at.
If your computed t (absolute value) is greater than the critical value, then your result is statistically significant. If it is less than the critical value, then it is NOT significant.
Edit: The Student's t-test assumes that variances and standard deviations are the same between the two means being compared. That is, it assumes the distribution of data points around the true mean is equal. If you DON'T want to make this assumption, then you're looking for Welch's t-test, which is slightly different. The wiki page has a good formula for computing t-values for this test.
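In R, both tests are available through t.test(); here is a short sketch using hypothetical vectors of elapsed times for two core counts, for example two elements of the `timings` list sketched earlier (both names are mine, not from the question).
times_a <- timings[["cores_2"]]              # elapsed times measured with 2 cores
times_b <- timings[["cores_4"]]              # elapsed times measured with 4 cores
t.test(times_a, times_b, var.equal = TRUE)   # Student's t-test (equal-variance assumption)
t.test(times_a, times_b)                     # Welch's t-test (no equal-variance assumption)
A p-value below your chosen alpha (e.g. 0.05) means the difference in mean run time is statistically significant.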
There is one situation you want to avoid:
- spreading a task over all N cores, and
- having each core work on the task using something like OpenBLAS or MKL with all cores,
because now you have N-by-N contention: each of the N tasks wants to farm its linear algebra work out to all N cores. One common workaround is sketched below.
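The sketch below assumes a multi-threaded BLAS is in use and that the RhpcBLASctl package is available for controlling BLAS threads; `tasks` and `heavy_linear_algebra` are placeholders for the work list and per-task function.
library(parallel)
library(RhpcBLASctl)                # assumption: package available for controlling BLAS threads
res <- mclapply(tasks, function(x) {
  blas_set_num_threads(1)           # pin this worker's BLAS to a single thread
  heavy_linear_algebra(x)           # placeholder for the per-task linear algebra work
}, mc.cores = detectCores())
This way parallelism happens at exactly one level (the tasks), instead of N tasks each spawning N BLAS threads.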
Another (trivial) counterexample arises in a multi-user environment, where not all M users on a machine can simultaneously farm work out to all N cores.
Another reason not to use all the available cores is if your tasks use a lot of memory and you don't have enough memory to support that number of workers. Note that it can be tricky to determine how many workers a given amount of memory can support, because doMC uses mclapply, which forks the workers, so memory can be shared between the workers unless it is modified by one of them.
From the answers to this question, it's pretty clear that it's not always easy to figure out the right number of workers to use. One could argue that there shouldn't be a default value, and the user should be forced to specify the number, but I'm not sure if I'd go that far. At any rate, there isn't anything very magical about using half the number of cores.
Hm. I'm not a parallel processing expert, but I always thought the downside of using all your cores was that it makes your machine sluggish when you try to do anything else. I've had this happen to me personally when I've used all the cores, so my practice now is to use 7 of my 8 cores when I'm doing something parallel, leaving one core free for other things.