I've written this little MPI_Allreduce benchmark: bench_mpi.cxx.
It works well with Open MPI 1.8.4 and MPICH 1.4.1.
The results (one column for the number of processors, and one column for the corresponding wall clock time) are here or here.
With MPICH 3.1.4, the wall clock time increases for 7, 8 or more processes: results are here.
In a real code (edit: a Computational Fluid Dynamics software), however, I observe the same problem with all 3 of the above MPI implementations for 7, 8 or more processes, while I expect my code to be scalable to at least 8 or 16 processes.
So I'm trying to understand what could happen with the little benchmark and MPICH 3.1.4?
edit
Here is a zoomed-in view of the figure Rob Latham gave in his answer.
What does the code do during the green rectangle? The MPI_Allreduce operation starts too late.
edit
I've posted another question on much simpler code (just the time to execute MPI_Barrier).
It's interesting you don't see this with OpenMPI or with earlier versions of MPICH, but the way your code is set up seems guaranteed to cause problems for any MPI collective.
You've given each process a variable amount of work to do. The problem with that is the introduction of "pseudo-synchronization" -- the time other MPI processes spend waiting for the laggard to catch up and participate in the collective.
With point-to-point messaging the costs are clear, and probably follow a LogP model.
Collectives have an additional cost: sometimes a process is blocked waiting for a participating process to send it some needed information. In Allgather, well, all the processes have a data dependency on one another.
When you have variable-sized work units, no process can make progress until the largest/slowest processor finishes.
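A minimal sketch of that effect, under a simplifying assumption that is mine and not taken from the benchmark: the collective itself costs nothing and completes the instant the slowest rank arrives. The per-rank work times are hypothetical values chosen only for illustration.

```python
def collective_wait_times(work_times):
    """Idle time of each rank at a blocking collective.

    Assumption: the collective is free and completes as soon as the
    slowest rank reaches it, so every other rank idles until then.
    """
    finish = max(work_times)
    return [finish - t for t in work_times]

# Hypothetical per-rank work times in seconds (illustration only).
work = [0.9, 0.4, 0.1, 0.1, 0.2, 0.1, 1.0, 0.8]
waits = collective_wait_times(work)

for rank, (w, idle) in enumerate(zip(work, waits)):
    print(f"rank {rank}: works {w:.1f}s, waits {idle:.1f}s in the collective")

# Fraction of the total rank-time that was spent on real work.
useful = sum(work) / (len(work) * max(work))
print(f"useful fraction of rank-time: {useful:.2f}")
```

The rank with the most work waits for nobody, which is the pattern a trace shows as one rank spending almost no time inside the collective while the others sit in it.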
If you instrument with MPE and display the trace in Jumpshot, it's easy to see this effect:
I've added (see https://gist.github.com/roblatham00/b981fc875d3852c6b63f) red boxes for work, and the purple boxes are the default allgather color. The second iteration shows this most clearly: rank 0 spends almost no time in allgather. Ranks 2, 3, 4, and 5 have to wait for the slowpokes to catch up.
The question comes from a RISCV implementation, but I think it may also apply to many other architectures.
From a code with two completely independent instructions in sequence (generic ISA notation):
REG1 = REG2 + REG3
REG4 = REG5 + REG6
In a pipelined implementation, assuming there are no other hazards (simultaneous r/w access to the registers is possible and there are two independent adders), is it a violation of the ISA if the two instructions are executed completely in parallel?
In other words, at the same clock edge, can the 3 registers (REG1, REG4 and PC) be updated at once (PC+8 for the RISCV-32 example)?
No, clearly there's no problem, since real CPUs do this all the time. (e.g. Intel since Haswell can run 4 independent add instructions per clock: https://www.realworldtech.com/haswell-cpu/4/ https://uops.info/ https://agner.org/optimize/).
It only has to maintain the illusion of having run instructions one at a time, following the ISA's sequential execution model. The same concept as the C "as-if" rule applies.
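A toy sketch of that "as-if" idea, not a model of any real core: the instruction encoding `(dst, (src1, src2))`, the function names and the register values are all made up for illustration. Two adjacent instructions are issued in the same cycle only when a conservative hazard check passes, and the final architectural state is compared against one-at-a-time execution.

```python
def independent(first, second):
    """Conservative check: no RAW, WAW or WAR hazard between two instructions."""
    a_dst, a_srcs = first
    b_dst, b_srcs = second
    return a_dst not in b_srcs and a_dst != b_dst and b_dst not in a_srcs

def run_sequential(regs, program):
    """The ISA's model: one instruction at a time, PC += 4 after each."""
    regs = dict(regs)
    for dst, (s1, s2) in program:
        regs[dst] = regs[s1] + regs[s2]
        regs["PC"] += 4
    return regs

def run_dual_issue(regs, program):
    """Issue up to two independent instructions per cycle; returns (state, cycles)."""
    regs = dict(regs)
    i = cycles = 0
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program) and independent(program[i], program[i + 1]):
            group.append(program[i + 1])
        # Read phase: every instruction in the group sees the pre-cycle state.
        results = [(dst, regs[s1] + regs[s2]) for dst, (s1, s2) in group]
        # Write phase: all destinations and the PC update at the same "clock edge".
        for dst, value in results:
            regs[dst] = value
        regs["PC"] += 4 * len(group)
        i += len(group)
        cycles += 1
    return regs, cycles

start = {"REG1": 0, "REG2": 2, "REG3": 3, "REG4": 0, "REG5": 5, "REG6": 6, "PC": 0}
program = [("REG1", ("REG2", "REG3")), ("REG4", ("REG5", "REG6"))]

seq_state = run_sequential(start, program)
par_state, par_cycles = run_dual_issue(start, program)
print(seq_state == par_state, par_cycles)
```

Software can only observe the final register state, so the two executions are indistinguishable even though the second one updated REG1, REG4 and PC together.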
If the ISA doesn't guarantee anything about timing, like that you can delay N clock cycles with N nop or other instructions, nothing stops a specific implementation from doing as much work as possible in a clock cycle. (Some microcontrollers do have specific timing guarantees or specifications, so code can delay for N cycles with delay loops. Or at least specific implementations of some ISAs have such guarantees.)
It's 100% normal for modern CPUs to average more than 1 instruction per clock, despite stalling sometimes on cache misses and branch mispredicts, so that clearly means fetching, decoding, and executing multiple instructions per clock cycle in other cycles. See also Modern Microprocessors: A 90-Minute Guide! for some basics of superscalar in-order and out-of-order pipelines.
I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
Remember that the cycles event has to pick an instruction to blame, even if both mov-load and the macro-fused cmp-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical-sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
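As a quick arithmetic check of that guess (it only shows the numbers are consistent with the idea, not that the idea is right): assume a 5-cycle load-use latency per iteration, with the sample landing on the mov for 4 of those cycles and on the jne for the remaining 1. The percentages are copied from the perf output in the question; the 4-of-5 attribution rule is the speculation being tested.

```python
# Percentages from the perf annotate output in the question.
load_pct = 80.09     # mov (%rdx,%rbx,1),%ebx
branch_pct = 19.84   # jne 20

observed_ratio = load_pct / branch_pct

# Speculative model: 5-cycle latency, mov blamed for 4 cycles, jne for 1.
latency_cycles = 5
expected_load_pct = 100 * (latency_cycles - 1) / latency_cycles
expected_branch_pct = 100 * 1 / latency_cycles

print(f"observed ratio mov:jne = {observed_ratio:.2f}")
print(f"model predicts {expected_load_pct:.0f}% / {expected_branch_pct:.0f}%")
```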
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.
For a university project I'm currently working on a slurm supercomputer cluster and have written a number of C programs using MPI.
While profiling one of them I have observed that the time elapsed between an MPI_Send and an accompanying MPI_Recv operation is a mostly linear function of the message length. However, at around 32 MiB there is a sudden jump in latency from around 10 ms to around 20 ms. This happens both for two processes on the same node and two processes on separate nodes.
Now I would like to find out why this happens. I'm aware that this is not a phenomenon intrinsic to MPI but must be related to the underlying hardware setup, but I'm not sure where to begin looking for an explanation. What are some possible explanations for this and how could I check whether they apply in my case?
I have a C program (acoustic wave solver) that is parallelized with MPI. However, I've been testing the speedup on various numbers of cores and I've noticed something strange. If I use N processes, where N is the number of available cores in the machine, then I do not see a performance improvement over the next step down.
So on my 8-core machine I see speedup from 1 process to 2 processes to 4 processes, but not from 4 to 8. Similarly, on my 4-core laptop I see speedup from 1 to 2, but not from 2 to 4.
Any idea what could be causing this?
Many modern (Intel) CPUs run two hyperthreads on a single physical core. The number of cores you are referring to is actually the number of hardware threads that are available, not the number of physical cores.
As long as you are using a number of processes that is smaller than or equal to the number of physical cores, the processes will (or at least should) be distributed to use all of the available cores. But as soon as all physical cores are taken, additional processes will share a physical core with another process.
It is not possible to give a definitive answer on whether using all threads will increase your performance at all, or by how much. That strongly depends on the code you are running. A very nice answer to a similar question is given on superuser.com. Essentially, if your process is memory-bound or uses different parts of your CPU (integer/floating-point arithmetic, video encoding, vector processing, ...) and communication overhead is small, you might even get perfect scaling. Code that is CPU-bound and only does one type of computation might not give any improvement, or might even take longer due to communication overhead.
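One way to check which of the two numbers you are looking at, sketched for Linux only: count `processor` entries (hardware threads) and distinct `(physical id, core id)` pairs (physical cores) in `/proc/cpuinfo`. The function name and the sample text are made up for illustration, and the sketch assumes the x86-style field names; other architectures may not report them.

```python
def count_cores(cpuinfo_text):
    """Return (hardware_threads, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The extra "" guarantees the last processor entry gets closed.
    for line in cpuinfo_text.splitlines() + [""]:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
        elif not line.strip():  # a blank line ends one processor entry
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
    return logical, len(cores)

# Made-up sample: one socket, 2 physical cores, 2 hyperthreads each.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

threads, physical = count_cores(SAMPLE)
print(f"{threads} hardware threads on {physical} physical cores")
# On a real Linux machine: count_cores(open("/proc/cpuinfo").read())
```

If the second number is half the first, the speedup stalling at half the advertised core count fits the explanation above.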
Imagine I have M independent jobs, each job has N steps. Jobs are independent from each other but the steps of each job must run serially. In other words J(i,j) should be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with a width of M and a height of N blocks.
Each block of job should be executed only once. The time that it takes to do one block of work using one CPU (also the same order) is different for different blocks and is not known in advance.
The simple way of doing this using MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can ensure that priorities are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean that when a processor finishes its job, it could decide which block of work to do next using some kind of environment variables or shared memory, without waiting for other processors to finish their jobs and make a collective decision using communications.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current step is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
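The rounds described above can be sketched as a single-process simulation. This is not MPI code, and several details are my own assumptions for illustration: step durations drawn uniformly from 0.2 to 2.0, a greedy balance by remaining steps, T re-estimated as the mean of all observed step times, and free communication at the end of each round.

```python
import random

def simulate(M=12, N=20, W=3, K=10, T_guess=1.0, seed=0):
    """Return (completed steps per job, wall time, parallel efficiency)."""
    rng = random.Random(seed)
    # Durations are hidden from the scheduler until a step has actually run.
    durations = [[rng.uniform(0.2, 2.0) for _ in range(N)] for _ in range(M)]
    done = [0] * M
    T, observed, wall = T_guess, [], 0.0
    while any(d < N for d in done):
        timeout = T * N / K
        # Greedy balance: jobs with the most remaining steps go first,
        # each one to the worker with the least remaining work so far.
        loads = [0] * W
        assigned = [[] for _ in range(W)]
        unfinished = [j for j in range(M) if done[j] < N]
        for job in sorted(unfinished, key=lambda j: done[j]):
            w = loads.index(min(loads))
            assigned[w].append(job)
            loads[w] += N - done[job]
        round_time = 0.0
        for jobs in assigned:
            elapsed = 0.0
            for job in jobs:
                # Steps of one job run strictly in order; the last step of
                # a round may overrun the timeout to keep making progress.
                while done[job] < N and elapsed < timeout:
                    step = durations[job][done[job]]
                    elapsed += step
                    observed.append(step)
                    done[job] += 1
            round_time = max(round_time, elapsed)
        wall += round_time  # workers exchange progress once per round
        T = sum(observed) / len(observed)  # refine the estimate of T
    total_work = sum(map(sum, durations))
    return done, wall, total_work / (W * wall)

done, wall, efficiency = simulate()
print(f"all jobs finished: {all(d == 20 for d in done)}")
print(f"wall time {wall:.1f}, efficiency {efficiency:.2f}")
```

Trying a few values of K in the simulation is a cheap way to see the trade-off: small K means long rounds and more idle time at the end of each, large K means more rounds and, in a real system, more communication.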
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the job it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
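A rough sketch of the shared-memory variant with threads, using one lock to protect a priority queue of jobs that nobody is currently working on. All names are made up for illustration, the real work for a block is left as a comment, and one simplification is mine: a worker exits as soon as it finds the queue empty, so parallelism can drop near the end.

```python
import heapq
import threading

def run_scheduler(M=6, N=10, workers=3, window=4):
    """Run M jobs of N steps each; returns (execution log, progress per job)."""
    lock = threading.Lock()
    progress = [0] * M                 # completed steps per job
    idle = [(0, j) for j in range(M)]  # heap of (completed steps, job) with no owner
    heapq.heapify(idle)
    executed = []                      # (job, step) in completion order

    def worker():
        while True:
            with lock:
                if not idle:
                    return
                _, job = heapq.heappop(idle)  # farthest-behind unowned job
            while True:
                step = progress[job]  # safe: only the owner advances this job
                # ... the real work for block (job, step) would run here,
                # outside the lock ...
                with lock:
                    executed.append((job, step))
                    progress[job] = step + 1
                    if progress[job] == N:
                        break  # job finished, fetch another one
                    n = min(progress)  # progress of the farthest-behind job
                    if n < progress[job] - window and idle:
                        # Too far ahead: hand the job back and switch.
                        heapq.heappush(idle, (progress[job], job))
                        break

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed, progress

log, final_progress = run_scheduler()
print(f"{len(log)} blocks executed, progress per job: {final_progress}")
```

Because a job is either in the queue or owned by exactly one worker, and both transitions happen under the lock, no two threads can grab the same block, and the steps of each job still run in order.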