MPI programming to implement large data gathering from many workers - mpi

Now, I have a application that composed of single master and many workers. The application requirement is very simple: workers finish some jobs and send data to master and master store these data into files separately. I can simply use MPI_Send on worker side to send data to master. But master does not know the data sending sequence. Some workers go fast while some are slow. More specifically, suppose there are 5 workers, then the data sending sequence may be 1,3,4,5,2 or 2,5,4,1,3. If I just write a for loop like for(i=1 to 5) on master side with MPI_Recv to get data, the master and some faster worker have to wait for a long time. I know MPI_Gather can implement this. But I am not sure is MPI_Gather works parallelly or just some sequential calls of MPI_Recv? Another issue is my data is extremely large, more than 1GB data needed to be sent to master. If I divide the data into trunks, it may make it more complex. I do not think MPI_Gather can work. I also tried to think about raw socket programming, but I do not think it is a good practice. Would you give me some suggestion please?

If I understand your question correctly, you want to receive the data back at the master, but since each task takes a different amount of time to finish, you don't want to loop over all the processors in order so that the receive for process 5 (if it's finished) isn't waiting for the receive from process 3 (which is still running).
If want to receive out-of-order, it's possible to use mpi_recv with the MPI_ANY_SOURCE constant as the rank of the processor sending the message. You should then be able to inspect the returned status to determine which processor sent the message to send more work. Rather than looping over all processors, just have a single receive statement in your work loop.

could the workers write out the files instead of sending the data back to the master? when a worker finishes, it could send a "i'm done" message to the master. the master, in turn could send the next chunk of work to that worker. when there is no work left to hand out, have the master send a "no more work" message to the worker, who could then call MPI Finalize.

Related

Low performance with MPI communication within a single node

I have a program that is using the openMPI implementation of MPI for data-exchange between processes. Right now I am using this program on only one node, where the data has to be shared from one process to all the others. The total amount of data that the master process is sending is 130 Gb, which is split and sent to 6-8 client processes, but this data-transfer takes an awful amount of time (1 hour).
Knowing that the code is running on the very same node, I would expect that the data-transfer could use some speed-up, through the settings that I could describe when I launch the mpirun program - Do you know which settings could help me to get a faster data-transfer in this scenario? Right now I am using only "--mca btl vader,self" as optional components.
The actual code use MPI_Send() functions that share an amount of data that is near to the maximum amount of data that is possible to transfer with this call. After the data has been transferred to a client-process after multiple MPI_Send() calls, the master process send data to the other pending client-processes.

Deeply control orchestration throttling and dispatching in BizTalk based on message batch size

I have a biztalk orchestration which processes a single message. This messages are actually batches of messages. Most of the time, the batch size n is small (<1.000) but once in a while there are very large batches (>50.000). We have a high throughput of messages as well.
The orchestration takes a linear O(n) amount of system memory depending on the batch size and I know by observation that a single server can process up to an accumulated batch size of ~250k in parallel before it runs out of system memory and only returns OutOfMemoryExceptions. (Which will kill the BizTalk host instance and the orchestrations will startup on another host which will ultimately break again leaving our BizTalk group in a broken state which can currently only be recovered by manual intervention)
Small batches are common, large batches are rare but kind of deadly if there is more than one at the same time.
I know the batch size in advance so I could tell biztalk about it. But I see no way to interact with throttling. When throttling detects a lack of system memory it is already too late.
Do I have to build my own queueing and dispatching on top of biztalk to achieve my goals?
Our current solution is to use a semaphore with a value of 8 and every large message n>1000 needs to get a semaphore slot before it is allowed to start processing. We had an edge case the other day where even this was too much. We reduced 8 to 4 to resolve this but now, we impacted the general throughput noticeably.
Any idea or hint is welcomed!
Don't use XmlDocument within your processing. It will further exacerbate your memory issues. Prefer XmlReader for sure here. However, I'd still try to move processing outside of your orchestration. Even if you can get the streaming working in a .NET component called from the orchestration, you can still end up with an orchestration instance that runs for a long time and consumes lots of memory, which should be avoided whenever possible. Therefore...
Avoid letting the orchestration get messages that large to begin with. It may be possible to debatch the message using the OOB XmlDisassembler if you can mark the schema as an envelope schema; if not, you may need to create a custom disassembler component to do your debatching (just remember to promote/write the proper context properties to the newly created messages from the original). If you use some streaming techniques (see https://www.microsoft.com/en-us/download/details.aspx?id=20375) in the pipeline, you can greatly reduce the memory footprint and have much greater control there. Again, use XmlReader to actually parse and debatch the message (it shouldn't be super difficult - look into the ReadToFollowing and ReadSubTree, as in this Splitting large xml files in to sub files without memory contention). You might get away with doing this in an orchestration rather than a pipeline component, but in a pipeline component it should be easier to control memory usage. You may also look into promoting things like a batch ID if you need to correlate the messages back together.
If you get a large batch, you will still need to throttle the number of concurrent orchestrations; you could do so as Richard Seroter suggests here, which uses multiple convoys that correlate on instance IDs to prevent too many from running at once. Alternatively, you could use ordered delivery on the receive shape (see MSDN), which would probably be my preferred option as it takes significantly less work and won't face the concerns around zombie messages that are possible with convoys.
Basically: try to think small and lean as much as possible and BizTalk will be happier. BizTalk would much rather process 1000 small messages in a second than 1 very large message in a minute.

Dynamically Creating Communicators

I have a small communication problem that has consumed hours of search. I am using MPICH2 to communicate between different workers. At some points in my program a process needs to multi-cast a message to a fraction of the workers (2 or 3 out of a total of 20). Therefore, I temporarily need to create a group that includes the ranks of all those workers and then use MPI_BCast. However, this seems to be impossible!
I have tried MPI_Comm_Create but the program simply hangs because it required "every" worker call MPI_Comm_Create. I can not also use MPI_Comm_Split because I do not know the ranks of the recipient workers in advance and hence can not color code them.
Could you please help me.
Why do you need to create a new communicator at all?
Your description, of what you actually want to achieve and what the constraints are is a little lacking, but here are some hints, that might be applicable for your problem.
Sticking to classical two-sided communication, you need at some point a communication that involves all processes to identify the recipients, I guess. You could for example broadcast to everybody who is to be a recipient, and subsequently send the actual message to those with peer-to-peer communication (If this relation is going to change over time, I would not bother with creating a new communicator each time).
You could use MPI's one-sided communication concepts, and simply write messages from the broadcasting rank into dedicated memory areas of the receiving ranks. However, one-sided is often considered somewhat bad and not so good on the performance side.
With MPI-3 you could make use of an non-blocking barrier: All processes open the barrier, and those, which are not the broadcasting rank start immediately testing for the completion of this barrier, open a non-blocking receive for any source and regularly test for that as well, otherwise they proceed as usual. The broadcasting rank however, starts sending out its message to the actual recipients and when it completed that, it waits for the non-blocking barrier to complete. Now, all processes will find the barrier to complete, and now they can stop listening for the receives, those who didn't get a message can simply send a message to themselves to properly close the communication and proceed in their computation.

Advice/experience on testing MPI code with CppUnit

I've got a codebase where I have been using CppUnit for unit testing. I'm now adding some MPI code to the project and I'd like to unit test some abstractions I'm building on top of MPI. For example, I've written some code to manage a single-producer/multiple-consumer relationship where consumers ask for work and the producer serializes the next bit of work to send to the consumer, and I'd like to test just that interaction with a test that generates some fake work items in a producer, distributes them to consumers, which then send some kind of checksum back to the consumer to make sure everything got distributed and nothing deadlocked etc.
Does anyone have experience of what works best here? Some things I've been thinking about:
Is it reasonable to have all processes execute the test runner so that they all execute the test functions in the same order? Or is it better to have only the master run the test runner and have it send broadcasts to the slaves to tell them what to do next (presumably with some kind of lookup table to map commands to test functions)?
Is it sane in any way to use CPPUNIT_ASSERT inside the slaves, or should all information be sent back to the master for assertions? If slaves can assert, how should all the results be combined to get a single output log?
How should one handle test failures, such that the exception thrown in one process doesn't cause synchronization problems e.g. another process is waiting for an MPI_Recv for which the matching MPI_Send will now never happen?

TCP Socket Piping

Suppose that you have 2 sockets(each will be listened by other TCP peers) each resides on the same process, how these sockets could be bound, meaning input stream of each other will be bound to output stream of other. Sockets will continuously carry data, no waiting will happen. Normally thread can solve this problem but, rather than creating threads is there more efficient way of piping sockets?
If you need to connect both ends of the socket to the same process, use the pipe() function instead. This function returns two file descriptors, one used for writing and the other used for reading. There isn't really any need to involve TCP for this purpose.
Update: Based on your clarification of your use case, no, there isn't any way to tell the OS to connect the ends of two different sockets together. You will have to write code to read from one socket and write the same data to the other. Depending on the architecture of your process, you may or may not need an additional thread to do this work. For example, if your application is based on a select() loop, then creating another thread is not necessary.
You can avoid threads with an event queue within the process. The WP Message queue article assumes you want interprocess message passing, but if you are using sockets, you kind of are doing interprocess message passing over the same process.

Resources