Low performance with MPI communication within a single node

I have a program that uses the Open MPI implementation of MPI to exchange data between processes. Right now I run it on a single node, where one process has to share data with all the others. The master process sends 130 Gb in total, split across 6-8 client processes, but the transfer takes far too long (1 hour).
Since everything runs on the same node, I would expect the transfer to be faster with the right settings when launching mpirun. Do you know which settings could speed up the data transfer in this scenario? Right now the only optional components I pass are "--mca btl vader,self".
The code uses MPI_Send() calls, each of which transfers close to the maximum amount of data that a single call allows. Once one client process has received its data through multiple MPI_Send() calls, the master process sends data to the next pending client process.
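For reference, the per-call limit comes from the count argument of MPI_Send() being a C int, so with a byte datatype one call can carry at most 2^31 - 1 bytes. Below is a minimal sketch of the chunking arithmetic, written in Python for brevity. plan_chunks is a hypothetical helper, and reading "130 Gb" as 130 * 10^9 bytes split evenly over 8 clients is an assumption, not something stated in the question.

```python
INT_MAX = 2**31 - 1  # largest count a C int can express in one MPI_Send()


def plan_chunks(total_bytes, max_chunk=INT_MAX):
    """Yield (offset, length) pairs covering total_bytes, each at most max_chunk long."""
    if total_bytes < 0 or max_chunk <= 0:
        raise ValueError("total_bytes must be >= 0 and max_chunk must be > 0")
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk, total_bytes - offset)
        yield offset, length
        offset += length


if __name__ == "__main__":
    # Assumption: 130 * 10**9 bytes shared evenly by 8 clients.
    per_client = 130 * 10**9 // 8
    chunks = list(plan_chunks(per_client))
    print(len(chunks), "sends per client")
```

Each (offset, length) pair would correspond to one MPI_Send() on the master and one matching receive on the client. Sending with a larger datatype (for example a contiguous derived type) is one common way to move more bytes per call.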

Related

Deeply control orchestration throttling and dispatching in BizTalk based on message batch size

I have a BizTalk orchestration that processes a single message. These messages are actually batches of messages. Most of the time the batch size n is small (<1.000), but once in a while there are very large batches (>50.000). We have a high throughput of messages as well.
The orchestration takes a linear O(n) amount of system memory depending on the batch size and I know by observation that a single server can process up to an accumulated batch size of ~250k in parallel before it runs out of system memory and only returns OutOfMemoryExceptions. (Which will kill the BizTalk host instance and the orchestrations will startup on another host which will ultimately break again leaving our BizTalk group in a broken state which can currently only be recovered by manual intervention)
Small batches are common, large batches are rare but kind of deadly if there is more than one at the same time.
I know the batch size in advance, so I could tell BizTalk about it. But I see no way to interact with throttling. By the time throttling detects a lack of system memory, it is already too late.
Do I have to build my own queueing and dispatching on top of BizTalk to achieve my goals?
Our current solution is a semaphore with a value of 8: every large message (n>1000) has to get a semaphore slot before it is allowed to start processing. We had an edge case the other day where even this was too much. We reduced 8 to 4 to resolve this, but that has noticeably reduced overall throughput.
Any idea or hint is welcomed!
Don't use XmlDocument within your processing. It will further exacerbate your memory issues. Prefer XmlReader for sure here. However, I'd still try to move processing outside of your orchestration. Even if you can get the streaming working in a .NET component called from the orchestration, you can still end up with an orchestration instance that runs for a long time and consumes lots of memory, which should be avoided whenever possible. Therefore...
Avoid letting the orchestration get messages that large to begin with. It may be possible to debatch the message using the OOB XmlDisassembler if you can mark the schema as an envelope schema; if not, you may need to create a custom disassembler component to do your debatching (just remember to promote/write the proper context properties to the newly created messages from the original). If you use some streaming techniques (see https://www.microsoft.com/en-us/download/details.aspx?id=20375) in the pipeline, you can greatly reduce the memory footprint and have much greater control there. Again, use XmlReader to actually parse and debatch the message (it shouldn't be super difficult; look into ReadToFollowing and ReadSubtree, as in the question "Splitting large xml files in to sub files without memory contention"). You might get away with doing this in an orchestration rather than a pipeline component, but in a pipeline component it should be easier to control memory usage. You may also look into promoting things like a batch ID if you need to correlate the messages back together.
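To illustrate the streaming idea outside BizTalk, here is a minimal sketch in Python, where xml.etree.ElementTree.iterparse plays roughly the role that XmlReader plays in .NET. The element names Batch and Item and the helper debatch are hypothetical; a real pipeline component would also have to copy the context properties mentioned above.

```python
import io
import xml.etree.ElementTree as ET


def debatch(stream, item_tag):
    """Yield every item_tag element as a standalone XML string, one at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == item_tag:
            yield ET.tostring(elem, encoding="unicode")
            # Release the item's children and text once it has been emitted.
            # The emptied element itself stays attached to its parent.
            elem.clear()


if __name__ == "__main__":
    batch = b'<Batch><Item id="1">a</Item><Item id="2">b</Item></Batch>'
    for message in debatch(io.BytesIO(batch), "Item"):
        print(message)
```

The point of the sketch is that only one item is fully materialized at a time, so memory use depends on the size of a single item rather than on the batch size n.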
If you get a large batch, you will still need to throttle the number of concurrent orchestrations; you could do so as Richard Seroter suggests here, which uses multiple convoys that correlate on instance IDs to prevent too many from running at once. Alternatively, you could use ordered delivery on the receive shape (see MSDN), which would probably be my preferred option as it takes significantly less work and won't face the concerns around zombie messages that are possible with convoys.
Basically: try to think small and lean as much as possible and BizTalk will be happier. BizTalk would much rather process 1000 small messages in a second than 1 very large message in a minute.
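If some throttling of large batches is still needed, one variation on the semaphore from the question is to weight each slot by batch size instead of counting messages, so that the limit tracks the observed memory ceiling. This is only a sketch of that idea in Python, not a BizTalk feature. SizeBudget is a hypothetical class, and the 250,000 capacity is taken from the accumulated batch size the question reports as the breaking point.

```python
import threading


class SizeBudget:
    """Admit work as long as the sum of admitted batch sizes stays within capacity."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._in_use = 0
        self._cond = threading.Condition()

    def try_acquire(self, size):
        """Reserve size units if they fit right now; return True on success."""
        with self._cond:
            if self._in_use + size > self._capacity:
                return False
            self._in_use += size
            return True

    def acquire(self, size):
        """Block until size units fit, then reserve them."""
        if size > self._capacity:
            raise ValueError("batch is larger than the whole budget")
        with self._cond:
            while self._in_use + size > self._capacity:
                self._cond.wait()
            self._in_use += size

    def release(self, size):
        """Return size units and wake up any waiting callers."""
        with self._cond:
            self._in_use -= size
            self._cond.notify_all()
```

With this scheme many small batches can run next to one large batch, while two very large batches cannot start together. Note that a very large batch can wait a long time if small batches keep arriving, so a production version would probably need some fairness rule.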

Dynamically Creating Communicators

I have a small communication problem that has consumed hours of searching. I am using MPICH2 to communicate between different workers. At some points in my program a process needs to multicast a message to a fraction of the workers (2 or 3 out of a total of 20). Therefore, I temporarily need to create a group that includes the ranks of all those workers and then use MPI_Bcast. However, this seems to be impossible!
I have tried MPI_Comm_create, but the program simply hangs because it requires every worker to call MPI_Comm_create. I also cannot use MPI_Comm_split, because I do not know the ranks of the recipient workers in advance and hence cannot color-code them.
Could you please help me?
Why do you need to create a new communicator at all?
Your description of what you actually want to achieve and what the constraints are is a little lacking, but here are some hints that might be applicable to your problem.
Sticking to classical two-sided communication, I guess you need at some point a communication that involves all processes in order to identify the recipients. You could, for example, broadcast to everybody who the recipients are, and subsequently send the actual message to those ranks with point-to-point communication. (If this relation is going to change over time, I would not bother with creating a new communicator each time.)
You could use MPI's one-sided communication concepts and simply write messages from the broadcasting rank into dedicated memory areas of the receiving ranks. However, one-sided communication is often considered awkward to use and not particularly good for performance.
With MPI-3 you could make use of a non-blocking barrier (MPI_Ibarrier). All processes enter the barrier. Every process other than the broadcasting rank immediately starts testing for completion of the barrier, posts a non-blocking receive from any source and regularly tests that as well, and otherwise proceeds as usual. The broadcasting rank sends its message to the actual recipients and, once that has completed, waits for the non-blocking barrier to complete. At that point all processes will see the barrier complete and can stop listening for the message; those that didn't get one can simply send a message to themselves to properly close the pending receive, and then proceed with their computation.
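The first hint (tell everybody who the recipients are, then send point-to-point) can be sketched without MPI by letting threads play the ranks and queues play the message channels. This is only a simulation in Python to show the message flow. The function multicast and its parameters are hypothetical, and the comments name the MPI calls that each step stands in for.

```python
import queue
import threading


def multicast(world_size, root, recipients, payload):
    """Simulate: root announces the recipient set to all ranks, then sends to them."""
    inboxes = [queue.Queue() for _ in range(world_size)]
    received = {}
    lock = threading.Lock()

    def rank_main(rank):
        if rank == root:
            flags = [r in recipients for r in range(world_size)]
            for r in range(world_size):
                if r != root:
                    inboxes[r].put(flags)      # stands in for MPI_Bcast of the flags
            for r in recipients:
                if r != root:
                    inboxes[r].put(payload)    # stands in for MPI_Send to one recipient
            return
        flags = inboxes[rank].get()            # every rank takes part in this step
        if flags[rank]:
            message = inboxes[rank].get()      # stands in for MPI_Recv from root
            with lock:
                received[rank] = message

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(multicast(20, 0, {3, 7, 11}, "hello"))
```

Only the small flags message involves all 20 ranks; the actual payload travels to the 2 or 3 recipients alone, and no communicator has to be created.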

MPI alternate order of execution of master and slaves

I have two programs, a master and a slave. The master does the data decomposition and the slaves do the computation on their part of the decomposed data. MPI_Scatterv is used to distribute the work. I execute the master program first; it then dynamically spawns the child (slave) processes, which execute different code, i.e. the computation. The master then has to collect the results from the slaves and perform the next level of decomposition. How do I do that using MPI? What I actually want is to execute my master and slave code alternately. How can I implement this?
Thank you in advance..
MPI-2 (if I remember correctly) introduced mechanisms for dynamic process management, you might care to search for mpi_comm_spawn to start learning about those mechanisms. So it is certainly possible to write an MPI program which alternates between one process running the master task and multiple processes running the worker tasks (the term slave is deprecated). It's even possible to design your computation so that one program runs the master task and another program runs the (multiple) worker tasks and to use MPI for passing messages between the two.
BUT (that's a big but) I don't think that many resource managers (either the humans who manage parallel computer systems or the operating system and systems software such as job managers) support such dynamic process management. Imagine the complexities of scheduling, and managing, two or more programs with the basic design that you propose. Just as program A tries to fire up 2^10 worker processes so too does program B, and program C, while program D tries to drop 2^8 worker processes; all this on a cluster with only 2^10 processors (or cores). It's probably not too difficult to construct scenarios where the throughput of jobs on the cluster falls towards zero as multiple jobs contend for scarce resources.
If your platform supports dynamic process management, go right ahead. In the far more likely case that it does not, you have at least two choices; which one you choose depends on the ratio of master:worker time and probably other factors too. You could:
Do what most of us have always done and continue to do and request a total number of processors for the entire job, leaving all but one of them idle during the master-only phases. Wasteful perhaps but easy for the resource managers to cope with. Relatively easy to program too.
If the master does a lot of work between worker phases you could modify your program so that the master and worker are separate programs. First have the master execute on one process and, as it finishes, submit a request to the job management system to initiate the first phase of the worker computation. Have that, in turn, initiate the execution of the next master phase, and so on and so on.

MPI_Isend /Irecv: Is it possible to access the sendbuffer on unused memory-locations in the meanwhile

I would like to speed up my MPI program by using asynchronous communication, but the elapsed time remains the same. The workflow is as follows.
before:
1. MPI_Send/MPI_Recv Halo (ca. 10 Seconds)
2. process the whole Array (ca. 12 Seconds)
after:
1. MPI_Isend/MPI_Irecv Halo (ca. 0.1 Seconds)
2. process the Array (without Halo) (ca. 10 Seconds)
3. MPI_Wait (ca. 10 Seconds) (should be ca. 0 Seconds)
4. process the Halo only (ca. 2 Seconds)
Measurements showed that the communication and processing the Array-core nearly take the same time for common workloads. So asynchronism should nearly hide the communication time.
But it doesn't.
One fact, and I think this could be the problem, is that the send buffer is also the array the calculations are made on. Is it possible that MPI serializes the memory access even though the communication ONLY accesses the halo (with a derived datatype) and the computation ONLY accesses the core of the array (read-only)?
Does anybody know for sure whether this is the reason?
Is it maybe implementation-dependent (I'm using OpenMPI)?
Thanks in advance.
It isn't the case that MPI serializes the memory accesses in the user code (that's beyond the library's power to do, in general), and it is true that what exactly does happen is implementation specific.
But as a practical matter, MPI libraries don't do as much communication "in the background" as you might hope, and this is particularly true when using transports and networks like tcp + ethernet, where there's no meaningful way to hand off communication to another set of hardware.
You can only be sure that the MPI library is actually doing something when you're running MPI library code, e.g. in an MPI function call. Often, a call to any of a number of MPI functions will nudge the implementation's "progress engine", which keeps track of in-flight messages and ushers them along. So, for instance, one thing you can quickly do is make calls to MPI_Test() on the requests within the compute loop to make sure things start happening well before the MPI_Wait(). There is of course overhead to this, but it is easy to try and to measure.
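The shape of such a compute loop can be sketched as follows. This is a toy model in Python, not MPI code: FakeRequest is a hypothetical stand-in whose transfer only advances when test() is called, which mimics a library that makes progress only inside its own calls. In a real program the test() call would be MPI_Test() on the request returned by MPI_Isend() or MPI_Irecv().

```python
class FakeRequest:
    """Hypothetical stand-in for an MPI request.

    Models a transfer that only advances while the library is being called:
    each test() call moves it one step closer to completion.
    """

    def __init__(self, steps_needed):
        self.remaining = steps_needed

    def test(self):
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def compute_with_polling(n_items, request, poll_every):
    """Process n_items and poll the request once every poll_every items."""
    total = 0
    polls = 0
    done = False
    for i in range(n_items):
        total += i * i                      # placeholder for the core computation
        if not done and i % poll_every == 0:
            polls += 1
            done = request.test()           # nudge the (simulated) progress engine
    return total, polls, done


if __name__ == "__main__":
    print(compute_with_polling(1000, FakeRequest(5), 100))
```

The value of poll_every is the knob to measure: polling too rarely leaves most of the transfer for the final wait, while polling too often adds call overhead to the computation.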
Of course you could imagine the MPI library would use some other mechanism to run things behind the scenes. Both MPICH2 and OpenMPI have played with separate "progress threads" which execute separately from the user code and do this ushering along in the background; but getting that to work well, and without tying up a processor while you're trying to run your computation, is a genuinely difficult problem. OpenMPI's progress threads implementation has long been experimental, and in fact is temporarily out of the current (1.6.x) release, although work continues. I'm not sure about MPICH2's support.
If you are using InfiniBand, where the network hardware has a lot of intelligence to it, then prospects brighten a bit. If you are willing to leave memory pinned (for OpenFabrics), and/or you can use a vendor-specific module (mxm for Mellanox, psm for QLogic), then things can progress somewhat more rapidly. If you're using shared memory, then the knem kernel module can also help with intranode transport.
One other implementation-specific approach you can take, if memory isn't a big issue, is to try to use eager protocols for sending the data directly, or to send more data per chunk so fewer nudges of the progress engine are needed. An eager protocol here means that the data is sent immediately at send time, rather than the send just initiating a set of handshakes which will eventually lead to the message being sent. The bad news is that this generally requires extra buffer memory for the library, but if that's not a problem and you know the number of incoming messages is bounded (e.g. by the number of halo neighbours you have), this can help a great deal. How to do this for (e.g.) shared memory transport for OpenMPI is described on the OpenMPI page for tuning for shared memory, but similar parameters exist for other transports and often for other implementations. One nice tool that Intel MPI has is "mpitune", which automatically runs through a number of such parameters to find the best performance.
The MPI specification states:
A nonblocking send call indicates that the system may start copying
data out of the send buffer. The sender should not modify any part of the
send buffer after a nonblocking send operation is called, until the
send completes.
So yes, you should copy your data to a dedicated send buffer first.
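A minimal sketch of what copying to a dedicated send buffer could look like, in Python with a small 2D list standing in for the array. pack_halo is a hypothetical helper, and treating the outermost rows and columns as the data to send is an assumption about the layout.

```python
def pack_halo(grid):
    """Copy the outermost rows and columns of a 2D list into separate buffers."""
    return {
        "top": list(grid[0]),
        "bottom": list(grid[-1]),
        "left": [row[0] for row in grid],
        "right": [row[-1] for row in grid],
    }


if __name__ == "__main__":
    grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    send_buffers = pack_halo(grid)
    grid[0][0] = 99  # later writes to the array no longer touch the send buffers
    print(send_buffers["top"])
```

The copies are what would be handed to the non-blocking send, so the computation is free to write to the original array while the transfer is in flight.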

MPI programming to implement large data gathering from many workers

I have an application that is composed of a single master and many workers. The requirement is very simple: the workers finish some jobs and send data to the master, and the master stores the data from each worker in a separate file. I can simply use MPI_Send on the worker side to send data to the master, but the master does not know the order in which the data will arrive. Some workers are fast while others are slow. More specifically, suppose there are 5 workers; then the sending order may be 1,3,4,5,2 or 2,5,4,1,3. If I just write a loop like for(i=1 to 5) on the master side with MPI_Recv to get the data, the master and some of the faster workers have to wait for a long time. I know MPI_Gather can implement this, but I am not sure whether MPI_Gather works in parallel or is just a sequence of MPI_Recv calls. Another issue is that my data is extremely large: more than 1GB needs to be sent to the master. If I divide the data into chunks, that may make things more complex, so I do not think MPI_Gather can work. I also thought about raw socket programming, but I do not think it is good practice. Could you give me some suggestions, please?
If I understand your question correctly, you want to receive the data back at the master, but since each task takes a different amount of time to finish, you don't want to loop over all the processors in order so that the receive for process 5 (if it's finished) isn't waiting for the receive from process 3 (which is still running).
If you want to receive out of order, it's possible to use MPI_Recv with the MPI_ANY_SOURCE constant as the rank of the processor sending the message. You should then be able to inspect the returned status to determine which processor sent the message, so that you can send it more work. Rather than looping over all processors, just have a single receive statement in your work loop.
Could the workers write out the files instead of sending the data back to the master? When a worker finishes, it could send an "I'm done" message to the master. The master, in turn, could send the next chunk of work to that worker. When there is no work left to hand out, have the master send a "no more work" message to the worker, which could then call MPI_Finalize.
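Both suggestions lead to the same loop on the master: receive from whoever finishes first, then hand that worker either the next task or a "no more work" message. The following Python sketch simulates the pattern with threads and queues, so it runs without MPI. run_master, STOP and work_fn are hypothetical names, and the shared results queue stands in for MPI_Recv with MPI_ANY_SOURCE.

```python
import queue
import threading

STOP = object()  # hypothetical "no more work" message


def run_master(tasks, n_workers, work_fn):
    """Hand out tasks to workers in completion order and collect the results."""
    results_q = queue.Queue()                       # one inbox for all workers
    task_qs = [queue.Queue() for _ in range(n_workers)]

    def worker(rank):
        while True:
            task = task_qs[rank].get()
            if task is STOP:
                return                              # a real worker would finalize here
            results_q.put((rank, task, work_fn(task)))  # the "I'm done" message

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()

    pending = list(tasks)
    results = {}
    active = 0
    for rank in range(n_workers):                   # give every worker a first task
        if pending:
            task_qs[rank].put(pending.pop(0))
            active += 1
        else:
            task_qs[rank].put(STOP)
    while active:
        rank, task, value = results_q.get()         # whoever finishes first
        results[task] = value
        if pending:
            task_qs[rank].put(pending.pop(0))
        else:
            task_qs[rank].put(STOP)
            active -= 1
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_master(range(10), 3, lambda x: x * x))
```

In the sketch the result value travels back through the queue; with the second suggestion the worker would write its own file and only the small completion message would go to the master. Tasks are used as dictionary keys here, so they are assumed to be hashable and unique.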

Resources