How to architect a multi-step process using a message queue? - asynchronous

Say I have a multi-step, asynchronous process with these restrictions:
Individual steps can be performed by any worker
Steps must be performed in-order
The approach I'm considering:
Insert a db row that represents the entire process, with a "Steps completed" column to keep track of the progress.
Subscribe to a queue that will receive a message when the entire process is done.
Upon completion of each step, update the db row and queue the next step in the process.
After the last step is completed, queue the "process is complete" message.
Delete the db row.
Thoughts? Pitfalls? Smarter ways to do it?

I've built a system very similar to what you've described in a large, task-intensive document processing system, and have had to live with both the pros and the cons for the last 7 years now. Your approach is solid and workable, but I see some drawbacks:
Potentially vulnerable to state change (i.e., what if process inputs change before all steps are queued, then the later steps could have inputs inconsistent with earlier steps)
More infrastructural than you'd like, involving both a DB and a queue = more points of failure, harder to set up, more documentation required = doesn't quite feel right
How do you keep multiple workers from acting on the same step concurrently? In other words, the DB row says 4 steps are completed, how does a worker process know if it can take #5 or not? Doesn't it need to know whether another process is already working on this? One way or another (DB or MQ) you need to include additional state for locking.
Your example is robust to failure, but doesn't address concurrency. When you add state to address concurrency, then failure handling becomes a serious problem. For example, a process takes step 5, and then puts the DB row into "Working" state. Then when that process fails, step 5 is stuck in "Working" state.
Your orchestrator is a bit heavy, as it is doing a lot of synchronous DB operations, and I would worry that it might not scale as well as the rest of the architecture, as there can be only one of those...this would depend on how long-running your steps were compared to a database transaction--this would probably only become an issue at very massive scale.
If I had it to do over again, I would definitely push even more of the orchestration onto the worker processes. So, the orchestration code is common and could be called by any worker process, but I would keep the central, controlling process as light as possible. I would also use only message queues and not any database to keep the architecture simple and less synchronous.
I would create an exchange with 2 queues: IN and WIP (work in progress)
The central process is responsible for subscribing to process requests, and checking the WIP queue for timed out steps.
1) When the central process received a request for a given processing (X), it invokes the orchestration code, and it loads the first task (X1) into the IN queue
2) The first available worker process (P1) transactionally dequeues X1, and enqueues it into the WIP queue, with a conservative time-to-live (TTL) timeout value. This dequeueing is atomic, and there are no other X tasks in IN, so no second process can work on an X task.
3) If P1 terminates suddenly, no architecture on earth can save this process except for a timeout. At the end of the timeout period, the central process will find the timed out X1 in WIP, and will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications.
4) If P1 terminates abnormally but gracefully, then the worker process will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications. Depending on the exception, the worker process could also choose to reset the TTL and retry the step.
5) If P1 hangs indefinitely, or exceeds its TTL, same result as #3. The central process handles it, and presumably the worker process will at some point be recycled--or the rule could be to recycle the worker process anytime there's a timeout.
6) If P1 succeeds, then the worker process will determine the next step, either X2 or X-done. If the next step is X2, then the worker process will transactionally dequeue X1 from WIP, and enqueue X2 into IN. If the next step is X-done, then the processing is complete, and the appopriate action can be taken, perhaps this would be enqueueing X-done into IN for subsequent processing by the orchestrator.
The benefits of my suggested approach are:
Contention between worker processes is specified
All possible failure scenarios (crash, exception, hang, and success) are handled
Simple architecture can be completely implemented with RabbitMQ and no database, which makes it more scalable
Since workers handle determining and enqueueing the next step, there is a more lightweight orchestrator, leading to a more scalable system
The only real drawback is that it is potentially vulnerable to state change, but often this is not a cause for concern. Only you can know whether this would be an issue in your system.
My final thought on this is: you should have a good reason for this orchestration. After all, if process P1 finishes task X1 and now it is time for some process to work on next task X2, it seems P1 would be a very good candidate, as it just finished X1 and is now available. By that logic, a process should just gun through all the steps until completion--why mix and match processes if the tasks need to be done serially? The only async boundary really would be between the client and the worker process. But I will assume that you have a good reason to do this, for example, the processes can run on different and/or resource-specialized machines.

Related

Queue system recommendation approach

We have a bus reservation system running in GKE in which we are handling the creation of such reservations with different threads. Due to that, CRUD java methods can sometimes run simultaneously referring to the same bus, resulting in the save in our DB of the LAST simultaneous update only (so the other simultaneous updates are lost).
Even if the probabilities are low (the simultaneous updates need to be really close, 1-2 seconds), we need to avoid this. My question is about how to address the solution:
Lock the bus object and return error to the other simultaneous requests
In-memory map or Redis caché to track the bus requests
Use GCP Pub/Sub, Kafka or RabbitMQ as a queue system.
Try to focus the efforts on reducing the simultaneous time window (reduce from 1-2 seconds up to milliseconds)
Others?
Also, we are worried if in the future the GKE requests handling scalability may be an issue. If we manage a relatively higher number of buses, should we need to implement a queue system between the client and the server? Or GKE load balancer & ambassador will already manages it for us? In case we need a queue system in the future, could it be used also for the collision problem we are facing now?
Last, the reservation requests from the client often takes a while. Therefore, we are changing the requests to be handled asynchronously with a long polling approach from the client to know the task status. Could we link this solution to the current problem? For example, using the Redis caché or the queue system to know the task status? Or should we try to keep the requests synchronous and focus on reducing the processing time (it may be quite difficult).

Synchronous vs Asynchronous Clustering

I was reading the mariaDD knowledge base on Galera Cluster and i came across this:
The basic difference between synchronous and asynchronous replication is that "synchronous" guarantees that if changes happened on one node of the cluster, they happened on other nodes "synchronously", or at the same time. "Asynchronous" gives no guarantees about the delay between applying changes on "master" node and the propagation of changes to "slave" nodes. The delay can be short or long. This also implies that if master node crashes, some of the latest changes may be lost
With the last sentence, i have always understood that even though the updates on the slave in the asynchronous cluster setup is not performed at the same time, it logs these updates to a bin log file as the updates are being made on the master. So in the case that the master crashes before all the data is passed on to the slave, the updates will still go ahead when the master is restored since the bin log file logged the updates. Can somebody please tell me if my understanding is wrong and clarify on the matter for me please. Thanks.
In your example of a normal replication pair, the slave would catch up after the master comes back. Assuming the master does come back, you wouldn't really lose the data but if the master is permanently dead, the data is lost. The knowledge base article you mention is talking about the replication delay and not the overall integrity of the replication stream.
With normal replication, if the slave io thread (the part that gets the replication events from the master) is able to keep up with the master, then the slave may only lose a couple seconds if the master crashes. However, if it cannot keep up and is for example 1 hour behind, the slave would lose access to 1 hour of data. Another way you could lose access to data on the slave is if you have a max relay log size set and that is reached.
Galera makes sure that the write is sent to every node in the cluster before it is actually committed on any of the nodes so once the node that the write is done on commits the write, all of the other nodes will commit the same write. With galera, all writes basically happen at the same time on every node. Losing any node at any time during normal operation will not cause any data loss.

Cooperative Multitasking system

I'm trying to get around the concept of cooperative multitasking system and exactly how it works in a single threaded application.
My understanding is that this is a "form of multitasking in which multiple tasks execute by voluntarily ceding control to other tasks at programmer-defined points within each task."
So if you have a list of tasks and one task is executing, how do you determine to pass execution to another task? And when you give execution back to a previous task, how do resume from where you were previously?
I find this a bit confusing because I don't understand how this can be achieve without a multithreaded application.
Any advice would be very helpeful :)
Thanks
In your specific scenario where a single process (or thread of execution) uses cooperative multitasking, you can use something like Windows' fibers or POSIX setcontext family of functions. I will use the term fiber here.
Basically when one fiber is finished executing a chunk of work and wants to voluntarily allow other fibers to run (hence the "cooperative" term), it either manually switches to the other fiber's context or more typically it performs some kind of yield() or scheduler() call that jumps into the scheduler's context, then the scheduler finds a new fiber to run and switches to that fiber's context.
What do we mean by context here? Basically the stack and registers. There is nothing magic about the stack, it's just a block of memory the stack pointer happens to point to. There is also nothing magic about the program counter, it just points to the next instruction to execute. Switching contexts simply saves the current registers somewhere, changes the stack pointer to a different chunk of memory, updates the program counter to a different stream of instructions, copies that context's saved registers into the CPU, then does a jump. Bam, you're now executing different instructions with a different stack. Often the context switch code is written in assembly that is invoked in a way that doesn't modify the current stack or it backs out the changes, in either case it leaves no traces on the stack or in registers so when code resumes execution it has no idea anything happened. (Again, the theme: we assume that method calls fiddle with registers, push arguments to the stack, move the stack pointer, etc but that is just the C calling convention. Nothing requires you to maintain a stack at all or to have any particular method call leave any traces of itself on the stack).
Since each stack is separate, you don't have some continuous chain of seemingly random method calls eventually overflowing the stack (which might be the result if you naively tried to implement this scheme using standard C methods that continuously called each other). You could implement this manually with a state machine where each fiber kept a state machine of where it was in its work, periodically returning to the calling dispatcher's method, but why bother when actual fiber/co-routine support is widely available?
Also remember that cooperative multitasking is orthogonal to processes, protected memory, address spaces, etc. Witness Mac OS 9 or Windows 3.x. They supported the idea of separate processes. But when you yielded, the context was changed to the OS context, allowing the OS scheduler to run, which then potentially selected another process to switch to. In theory you could have a full protected virtual memory OS that still used cooperative multitasking. In those systems, if a errant process never yielded, the OS scheduler never ran, so all other processes in the system were frozen. **
The next natural question is what makes something pre-emptive... The answer is that the OS schedules an interrupt timer with the CPU to stop the currently executing task and switch back to the OS scheduler's context regardless of whether the current task cares to release the CPU or not, thus "pre-empting" it.
If the OS uses CPU privilege levels, the (kernel configured) timer is not cancelable by lower level (user mode) code, though in theory if the OS didn't use such protections an errant task could mask off or cancel the interrupt timer and hijack the CPU. There are some other scenarios like IO calls where the scheduler can be invoked outside the timer, and the scheduler may decide no other process has higher priority and return control to the same process without a switch... And in reality most OSes don't do a real context switch here because that's expensive, the scheduler code runs inside the context of whatever process was executing, so it has to be very careful not to step on the stack, to save register states, etc.
** You might ask why not just fire a timer if yield isn't called within a certain period of time. The answer lies in multi-threaded synchronization. In a cooperative system, you don't have to bother taking locks, worry about re-entrance, etc because you only yield when things are in a known good state. If this mythical timer fires, you have now potentially corrupted the state of the program that was interrupted. If programs have to be written to handle this, congrats... You now have a half-assed pre-emptive multitasking system. Might as well just do it right! And if you are changing things anyway, may as well add threads, protected memory, etc. That's pretty much the history of the major OSes right there.
The basic idea behind cooperative multitasking is trust - that each subtask will relinquish control, of its own accord, in a timely fashion, to avoid starving other tasks of processor time. This is why tasks in a cooperative multitasking system need to be tested extremely thoroughly, and in some cases certified for use.
I don't claim to be an expert, but I imagine cooperative tasks could be implemented as state machines, where passing control to the task would cause it to run for the absolute minimal amount of time it needs to make any kind of progress. For example, a file reader might read the next few bytes of a file, a parser might parse the next line of a document, or a sensor controller might take a single reading, before returning control back to a cooperative scheduler, which would check for task completion.
Each task would have to keep its internal state on the heap (at object level), rather than on the stack frame (at function level) like a conventional blocking function or thread.
And unlike conventional multitasking, which relies on a hardware timer to trigger a context switch, cooperative multitasking relies on the code to be written in such a way that each step of each long-running task is guaranteed to finish in an acceptably small amount of time.
The tasks will execute an explicit wait or pause or yield operation which makes the call to the dispatcher. There may be different operations for waiting on IO to complete or explicitly yielding in a heavy computation. In an application task's main loop, it could have a *wait_for_event* call instead of busy polling. This would suspend the task until it has input to process.
There may also be a time-out mechanism for catching runaway tasks, but it is not the primary means of switching (or else it wouldn't be cooperative).
One way to think of cooperative multitasking is to split a task into steps (or states). Each task keeps track of the next step it needs to execute. When it's the task's turn, it executes only that one step and returns. That way, in the main loop of your program you are simply calling each task in order, and because each task only takes up a small amount of time to complete a single step, we end up with a system which allows all of the tasks to share cpu time (ie. cooperate).

MPI programming to implement large data gathering from many workers

Now, I have a application that composed of single master and many workers. The application requirement is very simple: workers finish some jobs and send data to master and master store these data into files separately. I can simply use MPI_Send on worker side to send data to master. But master does not know the data sending sequence. Some workers go fast while some are slow. More specifically, suppose there are 5 workers, then the data sending sequence may be 1,3,4,5,2 or 2,5,4,1,3. If I just write a for loop like for(i=1 to 5) on master side with MPI_Recv to get data, the master and some faster worker have to wait for a long time. I know MPI_Gather can implement this. But I am not sure is MPI_Gather works parallelly or just some sequential calls of MPI_Recv? Another issue is my data is extremely large, more than 1GB data needed to be sent to master. If I divide the data into trunks, it may make it more complex. I do not think MPI_Gather can work. I also tried to think about raw socket programming, but I do not think it is a good practice. Would you give me some suggestion please?
If I understand your question correctly, you want to receive the data back at the master, but since each task takes a different amount of time to finish, you don't want to loop over all the processors in order so that the receive for process 5 (if it's finished) isn't waiting for the receive from process 3 (which is still running).
If want to receive out-of-order, it's possible to use mpi_recv with the MPI_ANY_SOURCE constant as the rank of the processor sending the message. You should then be able to inspect the returned status to determine which processor sent the message to send more work. Rather than looping over all processors, just have a single receive statement in your work loop.
could the workers write out the files instead of sending the data back to the master? when a worker finishes, it could send a "i'm done" message to the master. the master, in turn could send the next chunk of work to that worker. when there is no work left to hand out, have the master send a "no more work" message to the worker, who could then call MPI Finalize.

Detecting Blocked Threads

I have a theory regarding trouble shooting a Asynchronous Application (I'm using the CCR) and I wonder if someone can confirm my logic.
If a CCR based multi-threaded application using the default number of threads (i.e. one per core) is slower than the same application with double the threads specified - does this means that threads are being blocked somewhere in the code
What do think? Is this a quick and valid way to detect if threads are being inadvertantly being blocked?
What do you mean by "slower"?
If you want to automatically detect blocked threads, perhaps those threads should send a heartbeat, which are then observed by a monitor of some sort, but your options are limited.
A cheap way to tell if threads are being blocked is to get the current system time before doing any potentially blocking operation, then after the operation, and see how much time has elapsed. For example, while waiting for a message to arrive, measure to see how much time the thread was blocked waiting for a message to arrive.
Unless there are always more than enough messages to be processed, threads will block waiting for a message. If you have more threads, then you have more potential message generators (depending on your design) and thus threads waiting to receive messages will be more likely to have one ready.
Exactly one thread to CPU is too few unless you can guarantee that there will always be enough messages so no thread will have to wait.
If this is the case, that means that your threadpool is being exhausted (i.e. you have 2 threads but you've async pended 4 IOs or something) - if your work is heavily IO bound, the rule of "one thread per core" doesn't really apply.
I've found that to keep the system fluid with minimal threads, I keep the tasks dealing with I/O as concise as possible. They simply post the data from the I/O into another Port and do no further processing. The data is therefore queued elsewhere for processing in a controlled manner without interfering with the task of grabbing data as fast as possible. This processing might happen in the ExclusiveGroup of an Interleave if there's shared state to think about... and a handy side-effect is that exclusive tasks will never tie up all the threads in a Dispatcher (however, I suspect that there's probably nattier ways of managing this in the CCR API)

Resources