Modeling communication costs in MPI

Does anyone know of any papers that discuss communication costs in MPI programs? I am trying to predict the time taken by (say) the communication step in two-phase I/O. That would depend on the number of processes, the size and number of messages sent/received, the network interconnect and architecture, etc. It would be helpful for us to come up with a formula to assess the time taken by communication alone. I have read some papers, but none of them handle the case where multiple processes are communicating at the same time.

The most critical elements in any time estimate will be the total data to be sent, and the speed of the interconnect. That should give you an effective "minimum" time for the message transfers.
After that, you can measure the actual time taken and compare it with that minimum to determine a rough efficiency rating for the MPI implementation. As the amount of data scales up, the time required will scale up with it, adjusted by that efficiency rating as a scale factor. This is a very rough way to get an estimate. Keep in mind that as the data size crosses certain interesting thresholds (e.g. page size, cache size, and so on) the scale factor will likely need to be revised.
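As a rough sketch of the estimate described above, the lower bound and the efficiency rating can be computed as follows. The function names and the example figures in the usage note are illustrative assumptions, not measurements from any real system.

```python
def min_transfer_time(total_bytes, link_bytes_per_s):
    """Lower bound: all data pushed through the interconnect at full speed."""
    return total_bytes / link_bytes_per_s


def efficiency(total_bytes, link_bytes_per_s, measured_s):
    """Fraction of the ideal rate achieved in one measured run."""
    return min_transfer_time(total_bytes, link_bytes_per_s) / measured_s


def predict_time(total_bytes, link_bytes_per_s, eff):
    """Scale the lower bound by a previously measured efficiency."""
    return min_transfer_time(total_bytes, link_bytes_per_s) / eff


if __name__ == "__main__":
    link = 10e9 / 8  # hypothetical 10 Gbit/s link, in bytes per second
    eff = efficiency(1e9, link, measured_s=1.0)  # hypothetical measurement
    print("efficiency:", eff)
    print("predicted time for 2e9 bytes:", predict_time(2e9, link, eff))
```

With these made-up numbers the ideal time for 1e9 bytes is 0.8 s, so a measured 1.0 s corresponds to an efficiency of about 0.8. The efficiency would need to be re-measured once the message size crosses one of the thresholds mentioned above.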

Related

Why pipelining cannot operate at its maximum theoretical speed?

First of all, what is the maximum theoretical speed/speed up?
Can anyone explain why pipelining cannot operate at its maximum theoretical speed?
The maximum theoretical speedup is equal to the increase in pipeline depth. In a scalar (one instruction wide execution) design, the ideal instructions per cycle is one. Ideally, the clock frequency could increase by a factor equal to the increase in pipeline depth.
The actual frequency increase will be less than this ideal due to latching overheads, clock skew, and imbalanced division of work/latency. (While one can theoretically place latches at any point, the amount of state latched, its position, and other factors make certain points more friendly for stage divisions.)
Manufacturing variation also means that work designed to take an equal amount of time will not do so for all stages in the pipeline. A designer can provide more slack so that more chips will meet minimal timing in all stages. Another technique to handle such variation is to accept that not all chips will meet the target frequency (whether one exclusively uses the "golden samples" or uses lower frequency chips as well is a marketing decision).
As one might expect, with shallow pipelines variation in a stage is spread out over more logic and so is less likely to affect frequency.
Wave pipelining, where multiple signal waves (corresponding to pipeline stages) can be passing through a block of logic at the same time, provides a limited method to avoid latch overhead. However, besides other design issues, it is more sensitive to variation both from manufacturing and from run-time conditions such as temperature and voltage (which one might wish to intentionally vary to target different power/performance behaviors).
Even if one did have incredible hardware that provided a perfect frequency increase, hazards (as mentioned in Peter Cordes' comment) would prevent the perfect utilization of available execution resources.
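One common simplified way to make the latch-overhead argument concrete is to model the cycle time of a k-stage pipeline as the logic delay divided by k, plus a fixed per-stage latch overhead. The sketch below uses that model and assumes the logic divides perfectly evenly (which, as noted above, it usually does not); the delay values in the usage note are made-up numbers.

```python
def cycle_time(logic_delay, stages, latch_overhead):
    """Cycle time when the logic is split evenly across `stages` stages."""
    return logic_delay / stages + latch_overhead


def speedup(logic_delay, stages, latch_overhead):
    """Throughput gain over a single-stage design with the same latch overhead."""
    base = cycle_time(logic_delay, 1, latch_overhead)
    return base / cycle_time(logic_delay, stages, latch_overhead)


if __name__ == "__main__":
    # Hypothetical 10 ns of logic and 0.5 ns of latch overhead per stage.
    for k in (1, 5, 10):
        print(k, "stages ->", round(speedup(10.0, k, 0.5), 2), "x")
```

With zero latch overhead the speedup equals the number of stages; with the hypothetical 0.5 ns overhead, 5 stages give roughly 4.2x and 10 stages roughly 7x, before hazards are taken into account.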

How does OpenMPI's gather work?

I'm new to MPI and I'm trying to understand how MPI (and specifically OpenMPI) work in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start to send its data?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
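To illustrate the difference between the basic linear and binomial patterns described above, here is a small single-process sketch that only computes who sends to whom. It uses a generic textbook binomial tree rooted at rank 0; it is not taken from the Open MPI sources, and the tree layout that Open MPI actually builds may differ. The function names are hypothetical.

```python
def linear_gather_schedule(size):
    """Basic linear gather: the root receives one chunk per rank, in order."""
    return [[(rank, 0, 1)] for rank in range(1, size)]


def binomial_gather_schedule(size):
    """Return rounds of (sender, receiver, chunks) for a gather to rank 0."""
    rounds = []
    mask = 1
    while mask < size:
        transfers = []
        for rank in range(size):
            # A rank sends in the round that matches its lowest set bit.
            if rank & mask and rank % mask == 0:
                # It forwards its own chunk plus everything collected so far.
                chunks = min(mask, size - rank)
                transfers.append((rank, rank - mask, chunks))
        rounds.append(transfers)
        mask *= 2
    return rounds


if __name__ == "__main__":
    for i, transfers in enumerate(binomial_gather_schedule(8)):
        print("round", i, transfers)
```

For 8 ranks the linear schedule needs 7 sequential receives at the root, while the binomial tree finishes in 3 rounds, with the later rounds carrying larger aggregated messages.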
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of the gather operation: The total network transfer to the root as well as the memory requirements grow linearly with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should use an MPI performance analysis tool. If you have identified that your Gather has a significant impact on performance, you can go ahead and try to optimize it. But to do so, first consider whether you can structure your communication conceptually better, e.g. by somehow removing the computation altogether or using a clever reduction instead. If you still need to stick to the gather, go ahead and tune your MPI implementation. Afterwards verify that your optimization did indeed improve performance on your specific system.

How to improve the speed of MPI_Scatter/MPI_Gather?

I found that the time used for MPI_Scatter/MPI_Gather increases steadily (roughly linearly) as the number of workers increases, especially when the workers are spread across different nodes.
I thought that MPI_Scatter/MPI_Gather is a parallel process, so I wonder what causes this increase. Is there any trick to make it faster, especially for workers distributed across CPU nodes?
The root rank has to push a fixed amount of data to the other ranks. As long as all ranks reside on the same compute node, the process is limited by the memory bandwidth available. Once more nodes become involved, the network bandwidth, usually much lower than the memory bandwidth, becomes the limiting factor.
Also the time to send a message is roughly divided in two parts - initial (network setup and MPI protocol handshake) latency and then the time it takes to physically transfer the actual data bits. As the amount of data is fixed, the total physical transfer time remains the same (as long as the transport type and therefore the bandwidth stays the same) but more setup/latency overhead is being added with each new rank that data is scattered to or gathered from, therefore the linear increase in the time it takes to complete the operation.
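The argument above amounts to a simple latency-plus-bandwidth model, sketched below. It assumes the root handles the other ranks one after another and, like the explanation above, treats the transferred volume as fixed. The latency and bandwidth figures in the usage note are placeholders that would have to be measured on the target system.

```python
def scatter_time(num_ranks, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for the root to scatter a fixed total amount of data."""
    setup = (num_ranks - 1) * latency_s             # one setup cost per destination
    transfer = total_bytes / bandwidth_bytes_per_s  # does not depend on rank count
    return setup + transfer


if __name__ == "__main__":
    # Hypothetical values: 50 microseconds per message, 1 GB/s, 100 MB in total.
    for p in (2, 16, 128, 1024):
        print(p, "ranks:", scatter_time(p, 100e6, 50e-6, 1e9), "s")
```

In this model each additional rank adds exactly one latency term, which matches the roughly linear growth described in the question.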
How an MPI_Scatter/Gather will work varies between implementations. Some MPI implementations may choose to use a series of MPI_Send as an underlying mechanism.
The parameters that may affect how MPI_Scatter works are:
1. Number of processes
2. Size of data
3. Interconnect
For example, an implementation may avoid using a broadcast for a very small number of ranks sending/receiving very large data.

Estimating the heat generated by a process or job

Is it possible to estimate the heat generated by an individual process at runtime?
Temperature readings of the processor are easily accessible, but what I need is process-specific information.
Is it possible to map information such as cpu utilization, io, running time, memory usage etc to get some kind of an estimate?
I'm gonna say no. Because the overall temperature of your system components isn't a simple mathematical equation with everything that's moving and switching either.
Heat generated by and inside a computer is dependent on many external factors like hardware setup, ambient temperature of the room, possibly the age of the components, whether there is dust on them or in the fans, whether the cooling paste was correctly applied on the CPU or elsewhere, where heat sinks are present, how heat is being dissipated, etc. In short, again, no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensory data for individual components (like you can see to some extent in the BIOS) then interpolating one single process' generated temperature in regard to the total is, well, impossible.
At the lowest levels (gate networks, control signalling etc.), an external individual no longer has any means to observe or measure what's going on but there as well, things are in a changing state, a variable amount of electricity is being used and thus a variable amount of heat generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step in what you're looking for. Take a look at How to detect a process start & end using c# in windows? and be sure to follow up on duplicates like the one mentioned by Hans.
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems, but a power estimate should provide at least some relative information about the heat generated, assuming the processes you are comparing run in similar ways on the same hardware. In reality there are just too many factors to predict power, much less heat, effectively, but you may be able to get an idea of the usage.

Distributed physics simulation help/advice

I'm working in a distributed memory environment. My task is to simulate big 3D objects, modeled as particles tied by springs, by dividing them into smaller pieces, with each piece simulated by a different computer. I'm using a 3rd party physics engine to achieve the simulation. The problem I am facing is how to transmit the particle information at the extremities where the object is divided. This information is needed to compute the forces between interacting particles. The line in the image shows where the cut has been made. Because the number of particles is big, the communication overhead will be big as well. Is there a good way to transmit such information, or is there a way to transmit another value which helps me determine the information I need? Any help is much appreciated. Thank you.
PS: by particle information I mean the new positions from which to compute a resulting force to be applied on the particles simulated on the local machine.
"Big" means lots of things. Here the number of points with data being communicated may be "big" in that it's much more than one, but if you have say a million particles in a lattice, and are dividing it between 4 processors (say) by cutting it into squares, you're only communicating 500 particles across each boundary; big compared to one but very small compared to 1,000,000.
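The arithmetic in that example can be checked with a few lines. This sketch assumes, as the example does, a square 2D lattice cut into equally sized square tiles, with both the particle count and the processor count being perfect squares; the function name is hypothetical.

```python
import math


def boundary_particles(total_particles, num_procs):
    """Particles along one shared tile edge of a square 2D lattice."""
    side = math.isqrt(total_particles)      # the lattice is side x side
    tiles_per_side = math.isqrt(num_procs)  # tiles form a square grid
    return side // tiles_per_side


if __name__ == "__main__":
    n = boundary_particles(1_000_000, 4)
    print(n, "particles per boundary,", n / 1_000_000, "of the total")
```

For a million particles and 4 processors this gives a 1000 x 1000 lattice split into 500 x 500 tiles, so 500 particles lie along each shared edge.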
A library very commonly used for these sorts of distributed-memory computations (which is somewhat different than distributed computing, which suggests nodes scattered all over the internet; this sort of computation, involving tightly-coupled elements, is usually best done with a series of nearby computers in a lab or in a cluster) is MPI. This pattern of communication is very common, and is called "halo exchange" or "guardcell exchange" or "ghostzone exchange" or some combination; you should be able to find lots of examples of such things by searching for those terms. (There are a few questions on this site on the topic, but they're typically focussed on very specific implementation questions).
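To show what a halo exchange moves, here is a minimal single-process sketch in one dimension. Each subdomain keeps one ghost cell at each end, and the exchange copies the neighbour's edge value into it. In a real MPI program each copy would be a message between neighbouring ranks (for instance with MPI_Sendrecv); the list layout and the function name here are assumptions made for illustration only.

```python
def exchange_halos(subdomains):
    """Fill each subdomain's ghost cells from its neighbours' edge values.

    Each subdomain is a list laid out as [left_ghost, *interior, right_ghost].
    Only ghost cells are written, so updating in place is safe.
    """
    for i, sub in enumerate(subdomains):
        if i > 0:
            sub[0] = subdomains[i - 1][-2]   # left neighbour's last interior value
        if i < len(subdomains) - 1:
            sub[-1] = subdomains[i + 1][1]   # right neighbour's first interior value
    return subdomains


if __name__ == "__main__":
    parts = [[None, 1, 2, 3, None], [None, 4, 5, 6, None]]
    print(exchange_halos(parts))
```

After the exchange, each piece can compute the forces on its own edge particles locally, because the neighbouring positions it needs are sitting in its ghost cells. Only the edge values are communicated, never the whole subdomain.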
