MPI - Shared queue across processes

I have an MPI program where the master node waits until a certain number of tasks (say 1000) are completed by slave nodes. The slave nodes are in a while(True) loop and keep generating output from their tasks. The runtime of these tasks can vary across tasks and nodes, so if there are 2 slave nodes and the master needs to wait for, say, 1000 tasks, then slave-node-1 could have completed 450 tasks and slave-node-2 the other 550.
What is the best way for the slave nodes to "tell" the master node that 1000 tasks have been completed in total? It looks to me like I need some sort of shared queue across processes, where slaves can push data once their task is completed and the master just polls this queue's size until it hits 1000. Subsequently, the master can drain data from this queue to reset the queue size so that slaves can fill in more data.

There are two solutions I would recommend.
The first, as Gilles points out, is to use MPI_ANY_SOURCE to receive 1000 completion messages, which can be sent from any of the workers.
The second is to use MPI_ACCUMULATE. In this case, the master node exposes a window which is initialized to 0, and each worker uses MPI_ACCUMULATE to increment the value in the window after each task is completed. The master polls its own local window until it reaches 1000.
In this case I'd stick to MPI_ANY_SOURCE rather than mess with creating and destroying windows. I don't think there is a compelling reason to add that complexity here.
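The counting loop behind the first suggestion can be sketched without an MPI installation. The following is a minimal Python sketch, not MPI code: threads stand in for the worker ranks, and a `queue.Queue` stands in for the master's receive side. In a real MPI program the blocking `get()` would be an `MPI_Recv` with the source set to `MPI_ANY_SOURCE`, and the sender's rank would be read from the returned status. The names `run_master` and `worker`, the stop flag, and the random sleep that imitates variable task length are all assumptions made for illustration.

```python
import queue
import random
import threading
import time
from collections import Counter


def run_master(num_workers: int, target: int) -> Counter:
    """Count completion messages from any worker until `target` have arrived."""
    completions = queue.Queue()   # stand-in for messages arriving at the master
    stop = threading.Event()      # hypothetical way to end the workers' loop

    def worker(rank: int) -> None:
        task_id = 0
        while not stop.is_set():                      # the question's while(True) loop
            time.sleep(random.uniform(0, 0.0005))     # a task of variable length
            task_id += 1
            completions.put((rank, task_id))          # stand-in for a send to the master

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(1, num_workers + 1)]
    for t in threads:
        t.start()

    per_worker = Counter()
    for _ in range(target):
        # Blocks until the next message arrives, whichever worker sent it.
        rank, _task_id = completions.get()
        per_worker[rank] += 1

    stop.set()                    # target reached; let the workers exit
    for t in threads:
        t.join()
    return per_worker


counts = run_master(num_workers=2, target=1000)
print(sum(counts.values()), dict(counts))
```

The master never needs to know in advance how the 1000 tasks will be split between the workers; it only counts arrivals, and the per-worker split falls out of the sender recorded with each message.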

Related

Low performance with MPI communication within a single node

I have a program that uses the Open MPI implementation of MPI for data exchange between processes. Right now I am running this program on only one node, where the data has to be shared from one process to all the others. The total amount of data that the master process sends is 130 Gb, which is split and sent to 6-8 client processes, but this data transfer takes an awful amount of time (1 hour).
Since the code is running on a single node, I would expect that the data transfer could be sped up through the settings I pass when I launch mpirun. Do you know which settings could help me get a faster data transfer in this scenario? Right now I am using only "--mca btl vader,self" as optional components.
The actual code uses MPI_Send() calls, each of which transfers an amount of data close to the maximum that a single call can transfer. Once the data for one client process has been transferred through multiple MPI_Send() calls, the master process sends data to the remaining client processes.

In order queue: two kernels - one waiting for event and one not

Suppose I enqueue two kernels in an in-order queue.
The first kernel is set to only run when it receives a completion event,
while the second kernel is not waiting for an event.
Will the runtime execute the second kernel first in this case?

An in-order queue will execute the items in the order you queue them, essentially giving each operation its predecessor as a wait event. Your second kernel should not be executed until after the first one in your example.
Out of order queues require you to manage the wait lists yourself, but have the advantage that tasks can be executed as soon as their prerequisites have been met. Just make sure your platform supports out-of-order queues before you end up troubleshooting a dead-end. See answers to this SO question.

Synchronous vs Asynchronous Clustering

I was reading the MariaDB knowledge base on Galera Cluster and I came across this:
The basic difference between synchronous and asynchronous replication is that "synchronous" guarantees that if changes happened on one node of the cluster, they happened on other nodes "synchronously", or at the same time. "Asynchronous" gives no guarantees about the delay between applying changes on "master" node and the propagation of changes to "slave" nodes. The delay can be short or long. This also implies that if master node crashes, some of the latest changes may be lost.
Regarding the last sentence, I have always understood that even though the updates on the slave in an asynchronous cluster setup are not performed at the same time, the master logs these updates to a bin log file as they are made. So if the master crashes before all the data is passed on to the slave, the updates will still go ahead when the master is restored, since the bin log file has recorded them. Can somebody please tell me if my understanding is wrong and clarify the matter for me? Thanks.

In your example of a normal replication pair, the slave would catch up after the master comes back. Assuming the master does come back, you wouldn't really lose the data but if the master is permanently dead, the data is lost. The knowledge base article you mention is talking about the replication delay and not the overall integrity of the replication stream.
With normal replication, if the slave io thread (the part that gets the replication events from the master) is able to keep up with the master, then the slave may only lose a couple seconds if the master crashes. However, if it cannot keep up and is for example 1 hour behind, the slave would lose access to 1 hour of data. Another way you could lose access to data on the slave is if you have a max relay log size set and that is reached.
Galera makes sure that the write is sent to every node in the cluster before it is actually committed on any of the nodes, so once the node that received the write commits it, all of the other nodes will commit the same write. With Galera, all writes basically happen at the same time on every node. Losing any node at any time during normal operation will not cause any data loss.

MPI alternate order of execution of master and slaves

I have two programs, master and slave. My master does data decomposition and the slaves do computation on their part of the decomposed data. MPI_Scatterv is used to distribute the work. I execute my master program first; it then dynamically spawns child (slave) processes, and the slaves execute different code, i.e. the computation. The master then has to collect the results from the slaves and execute the next level of decomposition. How do I do that using MPI? I want to execute my master and slave code alternately. How can I implement this?
Thank you in advance.

MPI-2 (if I remember correctly) introduced mechanisms for dynamic process management; you might care to search for MPI_Comm_spawn to start learning about those mechanisms. So it is certainly possible to write an MPI program which alternates between one process running the master task and multiple processes running the worker tasks (the term slave is deprecated). It's even possible to design your computation so that one program runs the master task and another program runs the (multiple) worker tasks, and to use MPI for passing messages between the two.
BUT (that's a big but) I don't think that many resource managers (either the humans who manage parallel computer systems or the operating system and systems software such as job managers) support such dynamic process management. Imagine the complexities of scheduling, and managing, two or more programs with the basic design that you propose. Just as program A tries to fire up 2^10 worker processes so too does program B, and program C, while program D tries to drop 2^8 worker processes; all this on a cluster with only 2^10 processors (or cores). It's probably not too difficult to construct scenarios where the throughput of jobs on the cluster falls towards zero as multiple jobs contend for scarce resources.
If your platform supports dynamic process management, go right ahead. In the far more likely case that your platform does not, you have at least two choices; which one you choose depends on the ratio of master:worker time and probably other factors too. You could:
Do what most of us have always done and continue to do and request a total number of processors for the entire job, leaving all but one of them idle during the master-only phases. Wasteful perhaps but easy for the resource managers to cope with. Relatively easy to program too.
If the master does a lot of work between worker phases you could modify your program so that the master and worker are separate programs. First have the master execute on one process and, as it finishes, submit a request to the job management system to initiate the first phase of the worker computation. Have that, in turn, initiate the execution of the next master phase, and so on and so on.
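The control flow of the first option (one fixed allocation, with the workers idle during master-only phases) can be sketched as follows. This is a minimal Python sketch, not MPI code: a thread pool that is created once stands in for the fixed set of worker ranks, and each `pool.map` call plays the role that an MPI_Scatterv, the workers' computation, and an MPI_Gatherv would play in an MPI program. The functions `decompose`, `compute`, and `run_levels`, the partial-sum computation, and the rule for choosing the number of chunks are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, parts):
    """Master phase: split data into `parts` contiguous, nearly equal chunks."""
    base, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def compute(chunk):
    """Worker phase: a hypothetical computation, here a partial sum."""
    return sum(chunk)


def run_levels(data, workers):
    """Alternate master and worker phases until a single value remains."""
    if not data:
        raise ValueError("data must not be empty")
    levels = 0
    # The pool is allocated once for the whole job, like a fixed set of ranks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(data) > 1:
            # Master-only phase: workers are idle while the data is decomposed.
            # Using at most len(data) // 2 chunks guarantees each level shrinks.
            parts = min(workers, len(data) // 2)
            chunks = decompose(data, parts)
            # Worker phase: results come back in chunk order.
            data = list(pool.map(compute, chunks))
            levels += 1
    return data[0], levels


total, levels = run_levels(list(range(1, 101)), workers=4)
print(total, levels)  # prints "5050 3"
```

The point of the sketch is only the shape of the loop: the workers are requested once, and the master's decomposition and the workers' computation alternate inside a single job.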

How to architect a multi-step process using a message queue?

Say I have a multi-step, asynchronous process with these restrictions:
Individual steps can be performed by any worker
Steps must be performed in-order
The approach I'm considering:
Insert a db row that represents the entire process, with a "Steps completed" column to keep track of the progress.
Subscribe to a queue that will receive a message when the entire process is done.
Upon completion of each step, update the db row and queue the next step in the process.
After the last step is completed, queue the "process is complete" message.
Delete the db row.
Thoughts? Pitfalls? Smarter ways to do it?

I've built a system very similar to what you've described in a large, task-intensive document processing system, and have had to live with both the pros and the cons for the last 7 years now. Your approach is solid and workable, but I see some drawbacks:
Potentially vulnerable to state change (i.e., what if process inputs change before all steps are queued, then the later steps could have inputs inconsistent with earlier steps)
More infrastructural than you'd like, involving both a DB and a queue = more points of failure, harder to set up, more documentation required = doesn't quite feel right
How do you keep multiple workers from acting on the same step concurrently? In other words, the DB row says 4 steps are completed, how does a worker process know if it can take #5 or not? Doesn't it need to know whether another process is already working on this? One way or another (DB or MQ) you need to include additional state for locking.
Your example is robust to failure, but doesn't address concurrency. When you add state to address concurrency, then failure handling becomes a serious problem. For example, a process takes step 5, and then puts the DB row into "Working" state. Then when that process fails, step 5 is stuck in "Working" state.
Your orchestrator is a bit heavy, as it is doing a lot of synchronous DB operations, and I would worry that it might not scale as well as the rest of the architecture, as there can be only one of those...this would depend on how long-running your steps were compared to a database transaction--this would probably only become an issue at very massive scale.
If I had it to do over again, I would definitely push even more of the orchestration onto the worker processes. So, the orchestration code is common and could be called by any worker process, but I would keep the central, controlling process as light as possible. I would also use only message queues and not any database to keep the architecture simple and less synchronous.
I would create an exchange with 2 queues: IN and WIP (work in progress)
The central process is responsible for subscribing to process requests, and checking the WIP queue for timed out steps.
1) When the central process receives a request for a given processing (X), it invokes the orchestration code, which loads the first task (X1) into the IN queue
2) The first available worker process (P1) transactionally dequeues X1, and enqueues it into the WIP queue, with a conservative time-to-live (TTL) timeout value. This dequeueing is atomic, and there are no other X tasks in IN, so no second process can work on an X task.
3) If P1 terminates suddenly, no architecture on earth can save this process except for a timeout. At the end of the timeout period, the central process will find the timed out X1 in WIP, and will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications.
4) If P1 terminates abnormally but gracefully, then the worker process will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications. Depending on the exception, the worker process could also choose to reset the TTL and retry the step.
5) If P1 hangs indefinitely, or exceeds its TTL, same result as #3. The central process handles it, and presumably the worker process will at some point be recycled--or the rule could be to recycle the worker process anytime there's a timeout.
6) If P1 succeeds, then the worker process will determine the next step, either X2 or X-done. If the next step is X2, then the worker process will transactionally dequeue X1 from WIP, and enqueue X2 into IN. If the next step is X-done, then the processing is complete, and the appropriate action can be taken; perhaps this would be enqueueing X-done into IN for subsequent processing by the orchestrator.
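The steps above can be sketched with an in-memory stand-in for the two queues. This is a minimal Python sketch under stated assumptions, not RabbitMQ code: the class `StepBroker` and its methods are hypothetical, a lock plays the role of the broker's transaction, and the clock is passed in explicitly so that timeouts are easy to follow. It also does not guard against a timed-out worker reporting success after the step has been claimed again by another worker; a real system would need something like a claim token for that.

```python
import threading
from collections import deque


class StepBroker:
    """In-memory stand-in for the IN and WIP queues (hypothetical API)."""

    def __init__(self, total_steps, ttl):
        self.total_steps = total_steps
        self.ttl = ttl
        self.in_queue = deque()        # steps waiting for a worker
        self.wip = {}                  # (process, step) -> deadline
        self.done = []                 # processes that reached "X-done"
        self._lock = threading.Lock()  # plays the role of the broker transaction

    def submit(self, process):
        """Step 1: the central process loads the first step into IN."""
        with self._lock:
            self.in_queue.append((process, 1))

    def claim(self, now):
        """Step 2: move one step from IN to WIP as a single atomic action."""
        with self._lock:
            if not self.in_queue:
                return None
            task = self.in_queue.popleft()
            self.wip[task] = now + self.ttl
            return task

    def succeed(self, task):
        """Step 6: remove the step from WIP, then enqueue its successor or finish."""
        with self._lock:
            if self.wip.pop(task, None) is None:
                return False           # the claim had already timed out
            process, step = task
            if step < self.total_steps:
                self.in_queue.append((process, step + 1))
            else:
                self.done.append(process)
            return True

    def release(self, task):
        """Step 4: a worker that failed gracefully hands the step back to IN."""
        with self._lock:
            if self.wip.pop(task, None) is not None:
                self.in_queue.append(task)

    def requeue_expired(self, now):
        """Steps 3 and 5: the central process returns timed-out steps to IN."""
        with self._lock:
            expired = [t for t, deadline in self.wip.items() if deadline <= now]
            for task in expired:
                del self.wip[task]
                self.in_queue.append(task)
            return expired


broker = StepBroker(total_steps=2, ttl=10)
broker.submit("X")
first = broker.claim(now=0)                # worker P1 takes X1
nothing = broker.claim(now=0)              # no second worker can take an X step
expired = broker.requeue_expired(now=11)   # P1 crashed; X1 goes back to IN
retry = broker.claim(now=11)               # worker P2 takes X1
broker.succeed(retry)                      # X2 is enqueued into IN
second = broker.claim(now=12)
broker.succeed(second)                     # last step: X is done
print(broker.done)  # prints "['X']"
```

Because only one step of a given process is ever in IN or WIP at a time, the in-order requirement holds without any extra locking state.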
The benefits of my suggested approach are:
Contention between worker processes is specified
All possible failure scenarios (crash, exception, hang, and success) are handled
Simple architecture can be completely implemented with RabbitMQ and no database, which makes it more scalable
Since workers handle determining and enqueueing the next step, there is a more lightweight orchestrator, leading to a more scalable system
The only real drawback is that it is potentially vulnerable to state change, but often this is not a cause for concern. Only you can know whether this would be an issue in your system.
My final thought on this is: you should have a good reason for this orchestration. After all, if process P1 finishes task X1 and now it is time for some process to work on next task X2, it seems P1 would be a very good candidate, as it just finished X1 and is now available. By that logic, a process should just gun through all the steps until completion--why mix and match processes if the tasks need to be done serially? The only async boundary really would be between the client and the worker process. But I will assume that you have a good reason to do this, for example, the processes can run on different and/or resource-specialized machines.
