How to improve the speed of MPI_Scatter/MPI_Gather?

I found that the time taken by MPI_Scatter/MPI_Gather increases steadily (roughly linearly) as the number of workers increases, especially when the workers are spread across different nodes.
I thought MPI_Scatter/MPI_Gather ran in parallel, so what causes this increase? Is there any trick to make it faster, especially when the workers are distributed across CPU nodes?

The root rank has to push a fixed amount of data to the other ranks. As long as all ranks reside on the same compute node, the process is limited by the memory bandwidth available. Once more nodes become involved, the network bandwidth, usually much lower than the memory bandwidth, becomes the limiting factor.
Also, the time to send a message splits roughly into two parts: the initial latency (network setup and MPI protocol handshake) and the time it takes to physically transfer the actual data bits. As the amount of data is fixed, the total physical transfer time remains the same (as long as the transport type, and therefore the bandwidth, stays the same), but more setup/latency overhead is added with each new rank that data is scattered to or gathered from. Hence the linear increase in the time it takes to complete the operation.
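To make the latency argument concrete, here is a minimal sketch of that cost model. It assumes the root sends one message per non-root rank, one after another, which is only one possible implementation strategy. The function name and the latency and bandwidth figures are made up for illustration, not measured values.

```python
def linear_scatter_time(num_ranks, payload_bytes, latency_s, bandwidth_bytes_per_s):
    """Rough time for the root to scatter a fixed payload to num_ranks ranks.

    Assumption: one message per non-root rank, sent sequentially, so every
    extra rank adds one latency term while the transfer term stays fixed.
    """
    messages = num_ranks - 1  # the root does not send to itself
    transfer = payload_bytes / bandwidth_bytes_per_s  # fixed: same total data
    return messages * latency_s + transfer


# Hypothetical numbers: 100 MB payload, 50 microsecond latency, 1 GB/s link.
for ranks in (2, 4, 8, 16):
    t = linear_scatter_time(ranks, 100e6, 50e-6, 1e9)
    print(f"{ranks:3d} ranks: {t:.6f} s")
```

Under this model each additional rank costs exactly one more latency term, which is the linear growth described above.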

How an MPI_Scatter/Gather will work varies between implementations. Some MPI implementations may choose to use a series of MPI_Send as an underlying mechanism.
The parameters that may affect how MPI_Scatter works are:
1. Number of processes
2. Size of data
3. Interconnect
For example, an implementation may avoid using a broadcast when a very small number of ranks send or receive very large amounts of data.


divide workload on different hardware using MPI

I have a small network of computers with different hardware. Is it possible to optimize how the workload is divided across this hardware using MPI, i.e. give nodes with more RAM and better CPUs more data to compute, minimizing the time nodes spend waiting on each other before the final reduction?
In my program data are divided into equal-sized batches. Each node in the network will process some of them. The result of each batch will be summed up after all batches are processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
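To illustrate why handing out batches on demand helps, here is a small simulation rather than real MPI code. The function names, batch costs and node speeds are all hypothetical; it only compares the finish time of "next batch goes to whichever node is free" against dealing batches out evenly in advance.

```python
import heapq


def dynamic_makespan(batch_costs, node_speeds):
    """Finish time when each batch goes to whichever node becomes free first."""
    # Heap of (time the node becomes free, node index); all nodes start idle.
    free_at = [(0.0, i) for i in range(len(node_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for cost in batch_costs:
        t, i = heapq.heappop(free_at)   # earliest available node
        t += cost / node_speeds[i]      # faster nodes finish a batch sooner
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


def static_makespan(batch_costs, node_speeds):
    """Finish time when batches are dealt out round-robin, ignoring speed."""
    busy = [0.0] * len(node_speeds)
    for k, cost in enumerate(batch_costs):
        i = k % len(node_speeds)
        busy[i] += cost / node_speeds[i]
    return max(busy)


batches = [1.0] * 12        # twelve equally sized batches
speeds = [1.0, 2.0, 4.0]    # relative speeds of three hypothetical nodes
print(static_makespan(batches, speeds), dynamic_makespan(batches, speeds))
```

With these made-up numbers the static split finishes at 4.0 time units because everyone waits for the slowest node, while the on-demand scheme finishes at 2.0.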
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
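As a rough sketch of that second approach, the master could turn the measured test times into batch sizes like this. The helper name and the timings are hypothetical, and it assumes runtime scales linearly with batch size, which (as noted above) may not hold for your data.

```python
def proportional_batch_sizes(total_items, test_times):
    """Split total_items so each node's share is proportional to its speed.

    test_times[i] is how long node i took on the same small test batch.
    """
    speeds = [1.0 / t for t in test_times]   # shorter test time => faster node
    total_speed = sum(speeds)
    sizes = [int(total_items * s / total_speed) for s in speeds]
    # Rounding down can leave a few items unassigned; give them to the
    # fastest nodes first so the sizes always add up to total_items.
    leftover = total_items - sum(sizes)
    fastest_first = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in fastest_first[:leftover]:
        sizes[i] += 1
    return sizes


# Three nodes that took 4 s, 2 s and 1 s on the test batch.
print(proportional_batch_sizes(700, [4.0, 2.0, 1.0]))
```

The node that was four times faster on the test batch ends up with four times as many items as the slowest one.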
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.

Modeling communication costs in MPI

Does anyone know of any papers that discuss communication costs in MPI programs? I am trying to predict the time taken by (say) the communication step in two-phase I/O. That would depend on the number of processes, the size and number of messages sent/received, the network interconnect and architecture, etc. It would be helpful for us to come up with a formula to assess the time taken by communication alone. I have read some papers, but none of them handle the case where multiple processes are communicating at the same time.
The most critical elements in any time estimate will be the total data to be sent, and the speed of the interconnect. That should give you an effective "minimum" time for the message transfers.
After that, you can measure the actual time taken and compare it with that minimum to determine a rough efficiency rating for the MPI implementation. As the amount of data scales up, the time required should scale up by roughly the same factor. This is a very rough way to get an estimate. Keep in mind that as the data size crosses certain interesting thresholds (e.g. page size, cache size, and so on) the scale factor will likely need to be revised.
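In code, that back-of-the-envelope estimate might look like the following. This is only a sketch of the idea above: the function names are mine, and the link speed and measured time are invented numbers.

```python
def min_transfer_time(total_bytes, link_bytes_per_s):
    """Lower bound: every byte crosses the link at full speed, no overhead."""
    return total_bytes / link_bytes_per_s


def efficiency(measured_s, total_bytes, link_bytes_per_s):
    """Fraction of the ideal rate actually achieved in one measured run."""
    return min_transfer_time(total_bytes, link_bytes_per_s) / measured_s


def predict_time(total_bytes, link_bytes_per_s, eff):
    """Scale the ideal time for a new data size by the measured efficiency."""
    return min_transfer_time(total_bytes, link_bytes_per_s) / eff


# Hypothetical calibration run: 100 MB took 0.25 s on a 1 GB/s link.
eff = efficiency(0.25, 100e6, 1e9)
print(f"efficiency: {eff:.2f}")
print(f"predicted time for 400 MB: {predict_time(400e6, 1e9, eff):.2f} s")
```

The efficiency figure should be re-measured whenever the data size moves past one of the thresholds mentioned above.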

Asynchronous programs showing locality of reference?

I was reading this excellent article which gives an introduction to Asynchronous programming here and I came across the following line which I find hard to understand.
Since there is no actual parallelism (in async), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into picture here?
Locality of reference, as the Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, wherever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a simple example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in contiguous memory locations, and for this example, once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a block of on-CPU memory that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use that instead of going out to RAM. Following the principle of locality of reference, when a CPU accesses some value in main memory, it will bring that value and all values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to get it from RAM, but subsequent accesses (close in proximity) will be faster since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
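If it helps, here is a toy model of that eviction effect. It is not how a real CPU cache works in detail: it assumes a tiny fully associative cache with LRU replacement, and it switches "threads" after every single access, which is a deliberately exaggerated worst case. All sizes are made up.

```python
from collections import OrderedDict


def count_misses(addresses, line_size=8, num_lines=4):
    """Count misses in a tiny fully associative LRU cache (toy model)."""
    cache = OrderedDict()              # cache line number -> present
    misses = 0
    for addr in addresses:
        line = addr // line_size       # neighbouring addresses share a line
        if line in cache:
            cache.move_to_end(line)    # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)   # evict the least recently used line
    return misses


n = 1024
sequential = list(range(n))

# Eight "threads", each owning one slice of the array, switched after
# every access. The same addresses are touched, only the order differs.
threads = 8
slice_len = n // threads
interleaved = [t * slice_len + i for i in range(slice_len) for t in range(threads)]

print(count_misses(sequential), count_misses(interleaved))
```

In this model the sequential walk misses once per cache line (128 times), while the interleaved order misses on every one of the 1024 accesses, because each thread's line has been evicted by the time that thread runs again.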
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.

How do we determine queue and stack size while working on network processor?

While working on a network processor, how can we determine the size of the queue and the stack?
I have mainly used a network processor as a router while working on BTS development.
The most important things when determining the size are the speed of the processor and the rate at which packets enter and exit the NP.
Another important factor is the degree of parallelism that needs to be maintained.
In the case of a BTS, for example, the KPIs give a good idea of the total number of voice/GPRS calls that need to run under load conditions. Based on that, calculate the total number of queues and decide how many memory LWs each queue will need in order to be properly identified.
Note that my queue concept does not cover the actual data; the actual voice data would be stored in DRAM buffers referenced by handles, and the remaining info about them would be handled by a queue.
I don't have much idea about stack size on an NP, and I am not sure whether it is a configurable parameter; it has more to do with the lifetime of the variables, along with their size, during function calls.

Tibco Rendezvous - size constraints

I am attempting to put a potentially large string into a Rendezvous message and was curious about size constraints. I understand there is a physical limit (64 MB?) on the message as a whole, but I'm curious about how some other variables could affect it. Specifically:
How big the keys are?
How the string is stored (in one field vs. multiple fields)
Any advice on any of the above topics or anything else that could be relevant would be greatly appreciated.
Note: I would like to keep the message as a raw string (as opposed to bytecode, etc).
From the Tibco docs on Very Large Messages:
Rendezvous software can transport very large messages; it divides them into small packets, and places them on the network as quickly as the network can accept them. In some situations, this behavior can overwhelm network capacity; applications can achieve higher throughput by dividing large messages into smaller chunks and regulating the rate at which it sends those chunks. You can use the performance tool to evaluate chunk sizes and send rates for optimal throughput.
This example sends one message consisting of ten million bytes. Rendezvous software automatically divides the message into packets and sends them. However, this burst of packets might exceed network capacity, resulting in poor throughput:
sender> rvperfm -size 10000000 -messages 1
In this second example, the application divides the ten million bytes into one thousand smaller messages of ten thousand bytes each, and automatically determines the batch size and interval to regulate the flow for optimal throughput:
sender> rvperfm -size 10000 -messages 1000 -auto
By varying the -messages and -size parameters, you can determine the optimal message size for your applications in a specific network. Application developers can use this information to regulate sending rates for improved performance.
As to actual limits, the AddString function takes a C-style ANSI string, so it is theoretically unbounded. However, given the signature of AddOpaque:
tibrv_status tibrvMsg_AddOpaque(
    tibrvMsg message,
    const char* fieldName,
    const void* value,
    tibrv_u32 size);
which takes a u32 size, it would seem sensible to state that the limit is likely to be 4GB rather than 64MB.
That said, using Tib to transfer such large packets is likely to be a serious performance bottleneck, as it may have to buffer significant amounts of traffic while it tries to get these sorts of messages to all consumers. By default the rvd buffer only holds 60 seconds of traffic, so you may find yourself suffering message loss if these messages make up a significant share of your traffic.
Message overhead within tibco is largely as simple as:
the fixed cost associated with each message (the header)
All the fields (type info and the field id)
Plus the cost of all variable length aspects including:
the send and receive subjects (effectively limited to 256 bytes each)
the field names. I can find no limit on the length of field names in the docs, but the smaller they are the better; better still, don't use them at all and use the numerical identifiers instead
the array/string/opaque/user defined variable length fields in the message
Note: If you use nested messages simply recurse the above.
In your case the payload will be so vast in comparison to the names (so long as they are reasonable and simple) that there is little point attempting to optimize these at all.
You may find you can gain considerable efficiency on the wire and in buffers if you transmit the strings in compressed form, either by using an rvrd with compression enabled or by changing your producer/consumer to use something fast but effective like deflate (or, if you're feeling esoteric, things like QuickLZ, FastLZ, LZO, etc., especially ones with fixed-memory-footprint compress/decompress engines).
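As a sketch of the producer/consumer side of that idea, here is what deflate compression looks like using Python's standard zlib module. This shows only the compression step: no Rendezvous API is involved, the function names are mine, and the compressed bytes would have to travel in an opaque field rather than a string field.

```python
import zlib


def pack(text):
    """Producer side: encode the string and deflate it into opaque bytes."""
    return zlib.compress(text.encode("utf-8"))


def unpack(blob):
    """Consumer side: inflate the bytes and decode them back to a string."""
    return zlib.decompress(blob).decode("utf-8")


# Hypothetical payload; repetitive text compresses especially well.
original = "price=101.25;qty=500;venue=XYZ;" * 10000
blob = pack(original)
print(len(original.encode("utf-8")), "bytes ->", len(blob), "bytes")
```

How much you gain depends entirely on how compressible your strings are, so measure with representative data before committing to it.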
You don't say which platform API you are targeting (.NET/Java/C++/C, for example), and this will colour things a little. On the wire all string data will be 1 byte per character, regardless of Java/.NET using UTF-16 by default. However, you will incur a significant translation cost placing these into and reading them out of the message, because the underlying buffer cannot be reused in those cases and a copy (with compaction and expansion respectively) must be performed.
If you stick to opaque byte sequences you will still have the copy overhead in the naive implementations possible through the managed wrapper APIs, but this will at least be less overhead if you have no need to work with the data as a native string.
The overall maximum size of a message is 64MB as was speculated in the OP. From the "Tibco Rendezvous Concepts" document:
Although the ability to exchange large data buffers is a feature of Rendezvous
software, it is best not to make messages too large. For example, to exchange data
up to 10,000 bytes, a single message is efficient. But to send files that could be
many megabytes in length, we recommend using multiple send calls, perhaps one
for each record, block or track. Empirically determine the most efficient size for
the prevailing network conditions. (The actual size limit is 64 MB, which is rarely
an appropriate size.)
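For what it's worth, the "multiple send calls" advice boils down to something like the sketch below. It shows only the splitting and reassembly logic. The chunk size, the (index, total, chunk) layout and the function names are my own assumptions, not anything defined by Rendezvous, and each tuple would be sent as its own message.

```python
def split_into_chunks(payload, chunk_size):
    """Yield (index, total, chunk) tuples covering the whole payload."""
    total = max(1, -(-len(payload) // chunk_size))   # ceiling division
    for i in range(total):
        yield i, total, payload[i * chunk_size:(i + 1) * chunk_size]


def reassemble(chunks):
    """Rebuild the payload from chunks that may have arrived out of order."""
    ordered = sorted(chunks, key=lambda c: c[0])
    return b"".join(c[2] for c in ordered)


# Hypothetical 25,600-byte payload sent as chunks of at most 10,000 bytes.
payload = bytes(range(256)) * 100
chunks = list(split_into_chunks(payload, 10000))
print(len(chunks), "chunks")
print(reassemble(reversed(chunks)) == payload)
```

The index and total let the receiver detect missing pieces; the right chunk size is something to determine empirically, as the quoted document says.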
