Advice/experience on testing MPI code with CppUnit

I've got a codebase where I have been using CppUnit for unit testing. I'm now adding some MPI code to the project and I'd like to unit test some abstractions I'm building on top of MPI. For example, I've written some code to manage a single-producer/multiple-consumer relationship, where consumers ask for work and the producer serializes the next bit of work to send to the requesting consumer. I'd like to test just that interaction with a test that generates some fake work items in the producer, distributes them to the consumers, and has the consumers send some kind of checksum back to the producer to make sure everything got distributed and nothing deadlocked, etc.
Does anyone have experience of what works best here? Some things I've been thinking about:
Is it reasonable to have all processes execute the test runner so that they all execute the test functions in the same order? Or is it better to have only the master run the test runner and have it send broadcasts to the slaves to tell them what to do next (presumably with some kind of lookup table to map commands to test functions)?
Is it sane in any way to use CPPUNIT_ASSERT inside the slaves, or should all information be sent back to the master for assertions? If slaves can assert, how should all the results be combined to get a single output log?
How should one handle test failures, so that an exception thrown in one process doesn't cause synchronization problems (e.g. another process waiting on an MPI_Recv for which the matching MPI_Send will now never happen)?
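One pattern that fits these questions is to run the same test on every rank and have the ranks agree on the outcome collectively before any assertion fires, so all ranks pass or fail together instead of one rank throwing while another blocks in MPI_Recv. A minimal sketch; the helper allRanksPassed and the test body runProducerConsumerExchange are illustrative names, not part of the question:

    #include <mpi.h>

    // Returns true on every rank iff localOk was true on all ranks.
    // MPI_Allreduce is collective, so every rank reaches the same verdict.
    bool allRanksPassed(bool localOk)
    {
        int local = localOk ? 1 : 0;
        int global = 0;
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
        return global == 1;
    }

    // Inside a CppUnit test that every rank executes in the same order:
    //   bool ok = runProducerConsumerExchange();   // exercise the producer/consumer code
    //   CPPUNIT_ASSERT(allRanksPassed(ok));        // all ranks pass or fail together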

Related

How to evenly balance processing many simultaneous tasks?

PROBLEM
Our PROCESSING SERVICE is serving UI, API, and internal clients and listening for commands from Kafka.
A few API clients might create a lot of generation tasks (one task is N messages) in a short time. With Kafka, we can't control how commands are distributed, because each command goes to a partition that is consumed by exactly one processing instance (aka worker). Thus, UI requests could wait too long while API requests are being processed.
In an ideal implementation, we should handle all tasks evenly, regardless of their size. The capacity of the processing service is distributed among all active tasks, and even if the cluster is heavily loaded, we always know that a newly arrived task will be able to start processing almost immediately, at least before the processing of all other tasks ends.
SOLUTION
Instead, we want an architecture with separate queues per combination of customer and endpoint. This architecture gives us much better isolation, as well as the ability to dynamically adjust throughput on a per-customer basis.
On the producer side:
a task comes in from the client
immediately create a queue for this task
send all of the task's messages to this queue
On the consumer side:
in one process, constantly update the list of queues
in the other processes, iterate over this list and consume, for example, one message from each queue (a sketch of this round-robin consumption follows below)
scale the consumers
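A rough sketch of that consumer-side round-robin, with in-memory queues standing in for the real broker queues (FairDispatcher, Message, and taskId are illustrative names, not an existing API):

    #include <deque>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    struct Message { std::string payload; };

    class FairDispatcher {
    public:
        // "Create a queue for this task" happens implicitly on the first message.
        void enqueue(const std::string& taskId, Message m) {
            queues_[taskId].push_back(std::move(m));
        }

        // Take at most one message per task per sweep, so a huge task
        // cannot starve the small ones.
        std::optional<Message> next() {
            while (!queues_.empty()) {
                if (cursor_ == queues_.end()) cursor_ = queues_.begin();
                if (cursor_->second.empty()) {          // drained task: drop its queue
                    cursor_ = queues_.erase(cursor_);
                    continue;
                }
                Message m = std::move(cursor_->second.front());
                cursor_->second.pop_front();
                ++cursor_;                              // move on to the next task
                return m;
            }
            return std::nullopt;                        // nothing to consume right now
        }

    private:
        std::map<std::string, std::deque<Message>> queues_;
        std::map<std::string, std::deque<Message>>::iterator cursor_ = queues_.end();
    };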
QUESTION
Is there any common solution to such a problem, using RabbitMQ or any other tooling? Historically, we use Kafka on the project, so an approach using it would be amazing, but we can use any technology for the solution.
Why not use Spark to execute the messages within the task? What I'm thinking is that each worker creates a Spark context that then parallelizes the messages. The function that is mapped can be based on which Kafka topic the user is consuming. I suspect, however, that your queues might have tasks containing a mixture of messages (UI, API calls, etc.), which will result in a more complex mapping function. If you're not using a standalone cluster and are using YARN or something similar, you can change the queueing method that the Spark master is using.
As I understand the problem, you want to isolate requests per customer using dynamically allocated queues, which would allow each customer's tasks to be executed independently. The problem looks similar to the head-of-line blocking issue in networking.
Dynamically allocating queues is difficult. It can also lead to an explosion in the number of queues, which can be a burden on the infrastructure. Also, some queues could be empty or very lightly loaded. RabbitMQ won't help here; it is a queue with a different protocol than Kafka.
One alternative is to use a custom partitioner in Kafka that looks at partition load and balances the tasks based on it. This works if the tasks are independent in nature and no state store is maintained in the worker.
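For illustration only, and not Kafka's actual Partitioner interface: the selection logic of such a load-aware partitioner boils down to picking the partition with the least outstanding work:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Given an estimate of outstanding work per partition (e.g. consumer lag),
    // return the index of the least-loaded partition. Assumes the vector is non-empty.
    std::size_t pickLeastLoadedPartition(const std::vector<long>& outstanding)
    {
        auto least = std::min_element(outstanding.begin(), outstanding.end());
        return static_cast<std::size_t>(least - outstanding.begin());
    }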
The other alternative would be to load balance at the customer level. In this case you select a dedicated set of predefined queues for a set of customers: customers with certain IDs are served by a particular set of queues. The downside is that some queues may carry less load than others. This solution is similar to Virtual Output Queuing in networking.
My understanding is that the partitioning of the messages does not ensure an even load balance. I think you should avoid over-engineering custom machinery on top of the Kafka partitioner and instead think about a good partitioning key that will allow you to use Kafka efficiently.

Deeply control orchestration throttling and dispatching in BizTalk based on message batch size

I have a BizTalk orchestration which processes a single message. These messages are actually batches of messages. Most of the time the batch size n is small (<1,000), but once in a while there are very large batches (>50,000). We have a high throughput of messages as well.
The orchestration takes a linear O(n) amount of system memory depending on the batch size, and I know by observation that a single server can process up to an accumulated batch size of ~250k in parallel before it runs out of system memory and only returns OutOfMemoryExceptions. (These kill the BizTalk host instance, and the orchestrations start up on another host, which ultimately breaks again, leaving our BizTalk group in a broken state that can currently only be recovered by manual intervention.)
Small batches are common, large batches are rare but kind of deadly if there is more than one at the same time.
I know the batch size in advance, so I could tell BizTalk about it, but I see no way to interact with throttling. By the time throttling detects a lack of system memory, it is already too late.
Do I have to build my own queueing and dispatching on top of BizTalk to achieve my goals?
Our current solution is to use a semaphore with a value of 8, and every large message (n>1000) needs to get a semaphore slot before it is allowed to start processing. We had an edge case the other day where even this was too much. We reduced 8 to 4 to resolve it, but that noticeably impacted general throughput.
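The semaphore throttle from the question, sketched in plain C++ rather than BizTalk, just to make the mechanism explicit (the capacities 8 and 4 and the n>1000 threshold come from the question; everything else is illustrative):

    #include <semaphore>   // requires C++20

    // Only a fixed number of large batches may be processed at once.
    // Maximum capacity 8, currently initialized to 4 as described above.
    std::counting_semaphore<8> largeBatchSlots(4);

    void processBatch(int batchSize)
    {
        const bool isLarge = batchSize > 1000;
        if (isLarge) largeBatchSlots.acquire();    // block until a slot is free
        // ... run the orchestration / actual processing for this batch ...
        if (isLarge) largeBatchSlots.release();
    }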
Any idea or hint is welcome!
Don't use XmlDocument within your processing. It will further exacerbate your memory issues. Prefer XmlReader for sure here. However, I'd still try to move processing outside of your orchestration. Even if you can get the streaming working in a .NET component called from the orchestration, you can still end up with an orchestration instance that runs for a long time and consumes lots of memory, which should be avoided whenever possible. Therefore...
Avoid letting the orchestration get messages that large to begin with. It may be possible to debatch the message using the OOB XmlDisassembler if you can mark the schema as an envelope schema; if not, you may need to create a custom disassembler component to do your debatching (just remember to promote/write the proper context properties to the newly created messages from the original). If you use some streaming techniques (see https://www.microsoft.com/en-us/download/details.aspx?id=20375) in the pipeline, you can greatly reduce the memory footprint and have much greater control there. Again, use XmlReader to actually parse and debatch the message (it shouldn't be super difficult; look into ReadToFollowing and ReadSubtree, as in this question: Splitting large xml files in to sub files without memory contention). You might get away with doing this in an orchestration rather than a pipeline component, but in a pipeline component it should be easier to control memory usage. You may also look into promoting things like a batch ID if you need to correlate the messages back together.
If you get a large batch, you will still need to throttle the number of concurrent orchestrations; you could do so as Richard Seroter suggests here, which uses multiple convoys that correlate on instance IDs to prevent too many from running at once. Alternatively, you could use ordered delivery on the receive shape (see MSDN), which would probably be my preferred option as it takes significantly less work and won't face the concerns around zombie messages that are possible with convoys.
Basically: try to think small and lean as much as possible and BizTalk will be happier. BizTalk would much rather process 1000 small messages in a second than 1 very large message in a minute.

Dynamically Creating Communicators

I have a small communication problem that has consumed hours of searching. I am using MPICH2 to communicate between different workers. At some points in my program a process needs to multicast a message to a fraction of the workers (2 or 3 out of a total of 20). Therefore, I temporarily need to create a group that includes the ranks of all those workers and then use MPI_Bcast. However, this seems to be impossible!
I have tried MPI_Comm_create, but the program simply hangs because it requires "every" worker to call MPI_Comm_create. I also cannot use MPI_Comm_split, because I do not know the ranks of the recipient workers in advance and hence cannot color-code them.
Could you please help me.
Why do you need to create a new communicator at all?
Your description of what you actually want to achieve and what the constraints are is a little lacking, but here are some hints that might be applicable to your problem.
Sticking to classical two-sided communication, at some point you need a communication that involves all processes in order to identify the recipients, I guess. You could, for example, broadcast the recipient list to everybody and subsequently send the actual message to those recipients with point-to-point communication (if this relation is going to change over time, I would not bother with creating a new communicator each time).
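A minimal sketch of that two-sided variant, assuming the payload buffers are already sized consistently on all ranks (the function name multicast is illustrative):

    #include <mpi.h>
    #include <algorithm>
    #include <vector>

    // Root broadcasts the recipient list so everyone knows whether to expect
    // a message, then sends the payload point-to-point to each recipient.
    void multicast(int root, std::vector<int> recipients,
                   std::vector<double>& payload, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        int nrecip = static_cast<int>(recipients.size());   // meaningful on root only
        MPI_Bcast(&nrecip, 1, MPI_INT, root, comm);          // collective: all ranks
        recipients.resize(nrecip);
        MPI_Bcast(recipients.data(), nrecip, MPI_INT, root, comm);

        if (rank == root) {
            for (int dest : recipients)
                if (dest != root)
                    MPI_Send(payload.data(), (int)payload.size(), MPI_DOUBLE,
                             dest, 0, comm);
        } else if (std::find(recipients.begin(), recipients.end(), rank)
                   != recipients.end()) {
            MPI_Recv(payload.data(), (int)payload.size(), MPI_DOUBLE,
                     root, 0, comm, MPI_STATUS_IGNORE);
        }
    }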
You could use MPI's one-sided communication concepts and simply write messages from the broadcasting rank into dedicated memory areas of the receiving ranks. However, one-sided communication is often considered somewhat awkward and not so good on the performance side.
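A sketch of that one-sided variant using fence synchronization (which is itself collective; passive-target lock/unlock epochs would avoid that, but then the targets need extra signaling to know when data has arrived):

    #include <mpi.h>
    #include <vector>

    // Every rank exposes its buffer in a window; the root MPI_Puts the
    // message directly into the recipients' windows.
    void oneSidedMulticast(int root, const std::vector<int>& recipients,
                           std::vector<double>& buf, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Win win;
        MPI_Win_create(buf.data(), buf.size() * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);                       // open the epoch (collective)
        if (rank == root)
            for (int dest : recipients)
                if (dest != root)
                    MPI_Put(buf.data(), (int)buf.size(), MPI_DOUBLE,
                            dest, 0, (int)buf.size(), MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                       // close the epoch; data is visible
        MPI_Win_free(&win);
    }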
With MPI-3 you could make use of a non-blocking barrier: all processes open the barrier. Those that are not the broadcasting rank immediately start testing for the completion of this barrier, open a non-blocking receive for any source, and regularly test that as well; otherwise they proceed as usual. The broadcasting rank, however, starts sending its message to the actual recipients, and once that is complete, it waits for the non-blocking barrier to complete. Now all processes will find the barrier complete and can stop listening for receives; those that did not get a message can simply send a message to themselves to properly close the communication, and then proceed with their computation.
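A sketch of that MPI-3 scheme, using MPI_Iprobe instead of a pre-posted wildcard receive (which avoids the send-to-self cleanup step), and synchronous sends so the barrier cannot complete before every recipient has matched its receive:

    #include <mpi.h>

    // Root multicasts to an arbitrary subset of ranks; nobody else knows in
    // advance whether it is a recipient. TAG and the argument names are illustrative.
    void ibarrierMulticast(int root, const int* recipients, int nrecip,
                           int* buf, int count, MPI_Comm comm)
    {
        const int TAG = 42;
        int rank;
        MPI_Comm_rank(comm, &rank);

        if (rank == root)
            for (int i = 0; i < nrecip; ++i)
                if (recipients[i] != root)
                    // MPI_Ssend completes only once the receive has been matched.
                    MPI_Ssend(buf, count, MPI_INT, recipients[i], TAG, comm);

        MPI_Request barrier;
        MPI_Ibarrier(comm, &barrier);

        int done = 0;
        while (!done) {
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, TAG, comm, &flag, &st);
            if (flag)
                MPI_Recv(buf, count, MPI_INT, st.MPI_SOURCE, TAG, comm,
                         MPI_STATUS_IGNORE);
            MPI_Test(&barrier, &done, MPI_STATUS_IGNORE);
        }
    }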

Abstract implementation of non-blocking MPI calls

Non-blocking sends/recvs return immediately in MPI, and the operation is completed in the background. The only way I see that happening is that the current process/thread invokes/creates another process/thread, loads an image of the send/recv code into it, and itself returns. Then this new process/thread completes the operation and sets a flag somewhere that Wait/Test checks. Am I correct?
There are two ways that progress can happen:
In a separate thread. This is usually an option in most MPI implementations (usually at configure/compile time). In this version, as you speculated, the MPI implementation has another thread that runs a separate progress engine. That thread manages all of the MPI messages and sending/receiving data. This way works well if you're not using all of the cores on your machine as it makes progress in the background without adding overhead to your other MPI calls.
Inside other MPI calls. This is the more common way of doing things and is the default for most implementations I believe. In this version, non-blocking calls are started when you initiate the call (MPI_I<something>) and are essentially added to an internal queue. Nothing (probably) happens on that call until you make another call to MPI later that actually does some blocking communication (or waits for the completion of previous non-blocking calls). When you enter that future MPI call, in addition to doing whatever you asked it to do, it will run the progress engine (the same thing that's running in a thread in version #1). Depending on what the MPI call that's supposed to be happening is doing, the progress engine may run for a while or may just run through once. For instance, if you called MPI_WAIT on an MPI_IRECV, you'll stay inside the progress engine until you receive the message that you're waiting for. If you are just doing an MPI_TEST, it might just cycle through the progress engine once and then jump back out.
More exotic methods. As Jeff mentions in his post, there are more exotic methods that depend on the hardware on which you're running. You may have a NIC that will do some magic for you in terms of moving your messages in the background or some other way to speed up your MPI calls. In general, these are very specific to the implementation and hardware on which you're running, so if you want to know more about them, you'll need to be more specific in your question.
All of this is specific to your implementation, but most of them work in some way similar to this.
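To make the second mechanism above concrete, here is a tiny (hedged) example: after MPI_Irecv returns, the transfer typically only progresses while the program is inside later MPI calls such as MPI_Test or MPI_Wait.

    #include <mpi.h>
    #include <cstdio>

    // Run with at least two ranks.
    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 0;
        MPI_Request req = MPI_REQUEST_NULL;
        if (rank == 0) {
            value = 7;
            MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        } else if (rank == 1) {
            MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        }

        int done = 0;
        while (!done) {
            // Each MPI_Test gives the library a chance to run its progress engine;
            // a null request (ranks > 1) completes immediately.
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            // ... useful computation could overlap here ...
        }

        if (rank == 1)
            std::printf("rank 1 received %d\n", value);
        MPI_Finalize();
        return 0;
    }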
Are you asking whether a separate thread for message processing is the only solution for non-blocking operations?
If so, the answer is no; I think many setups even use a different strategy. Usually, progress on message processing is made during all MPI calls. I'd recommend having a look at this blog entry by Jeff Squyres.
See the answer by Wesley Bland for a more complete answer.

MPI programming to implement large data gathering from many workers

I have an application composed of a single master and many workers. The application requirement is very simple: workers finish some jobs and send data to the master, and the master stores these data in separate files. I can simply use MPI_Send on the worker side to send data to the master, but the master does not know the order in which data will arrive. Some workers are fast while some are slow. More specifically, suppose there are 5 workers; then the sending sequence may be 1,3,4,5,2 or 2,5,4,1,3. If I just write a for loop like for(i=1 to 5) on the master side with MPI_Recv to get the data, the master and some of the faster workers have to wait a long time. I know MPI_Gather can implement this, but I am not sure whether MPI_Gather works in parallel or is just a sequence of MPI_Recv calls. Another issue is that my data is extremely large: more than 1 GB needs to be sent to the master. If I divide the data into chunks, it may make things more complex. I do not think MPI_Gather can work here. I also thought about raw socket programming, but I do not think it is good practice. Would you give me some suggestions, please?
If I understand your question correctly, you want to receive the data back at the master, but since each task takes a different amount of time to finish, you don't want to loop over all the processors in order so that the receive for process 5 (if it's finished) isn't waiting for the receive from process 3 (which is still running).
If you want to receive out of order, you can use MPI_Recv with the MPI_ANY_SOURCE constant as the rank of the sending processor. You can then inspect the returned status to determine which processor sent the message, and send that processor more work. Rather than looping over all processors, just have a single receive statement in your work loop.
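A sketch of that receive loop (buffer size, tag, and the bookkeeping are illustrative; very large payloads would normally be chunked or sized via MPI_Probe/MPI_Get_count first):

    #include <mpi.h>
    #include <vector>

    // Master receives one result from whichever worker finishes first.
    void masterReceiveLoop(int nworkers, int chunkSize)
    {
        std::vector<double> buf(chunkSize);
        for (int received = 0; received < nworkers; ++received) {
            MPI_Status st;
            MPI_Recv(buf.data(), chunkSize, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            int worker = st.MPI_SOURCE;   // which worker this data came from
            // ... write buf to the file for 'worker', or hand that worker more work ...
            (void)worker;
        }
    }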
Could the workers write out the files instead of sending the data back to the master? When a worker finishes, it could send an "I'm done" message to the master. The master, in turn, could send the next chunk of work to that worker. When there is no work left to hand out, have the master send a "no more work" message to the worker, which could then call MPI_Finalize.

Resources