How to implement polling in Airflow? - airflow

I want to use Airflow to implement data flows that periodically poll external systems (ftp servers, etc), check for new files matching certain conditions, and then run a bunch of tasks for those files. Now, I'm a newbie to Airflow and read that Sensors are something you would use for this kind of a case, and I actually managed to write a sensor that works ok when I run "airflow test" for it. But I'm a bit confused regarding the relation of poke_interval for the sensor and the DAG scheduling. How should I define those settings for my use case? Or should I use some other approach? I just want Airflow to run the tasks when those files become available, and not flood the dashboard with failures when no new files were available for a while.

Your understanding is correct, using a sensor is the way to go when you want to poll, either by using an existing sensor or by implementing your own.
They are, however, always part of a DAG and they do not execute outside of its boundaries. DAG execution depends on the start_date and schedule_interval, but you can leverage this and a sensor to implement some sort of DAG depending on the status of an external server: one possible approach would be starting the whole DAG with a sensor which checks for a condition to occur and decide to skip the whole DAG if the condition is not met (you can make sure that sensors mark downstream tasks as skipped and not failed by setting their soft_fail parameter to True). You can have a polling interval of one minute by using the most frequent scheduling option (* * * * *). If you really need a shortest polling time you can tweak the sensor's poke_interval and timeout parameters.
Keep in mind, however, that execution times are not probably guaranteed by Airflow itself, so for very short polling times you may want to investigate alternatives (or at least consider different approaches to the one I've just shared).

Related

How to write deferrable SqlSensor in Airflow?

I have a number of DAGs that wait for EOD settlement, and limited worker slot. The settlement ends at varying times. So while waiting the settlement, I want to run a different DAG on the worker slot. From Airflow documentation, deferable operator looks fit for this kind of purpose. I'm new to python and airflow. Can somebody explain how to write deferable sql sensor ?
I have looked at deferable time sensor examples, but can't make it to work with sql sensors.

How to evenly balance processing many simultaneous tasks?

PROBLEM
Our PROCESSING SERVICE is serving UI, API, and internal clients and listening for commands from Kafka.
Few API clients might create a lot of generation tasks (one task is N messages) in a short time. With Kafka, we can't control commands distribution, because each command comes to the partition which is consumed by one processing instance (aka worker). Thus, UI requests could be waiting too long while API requests are processing.
In an ideal implementation, we should handle all tasks evenly, regardless of its size. The capacity of the processing service is distributed among all active tasks. And even if the cluster is heavily loaded, we always understand that the new task that has arrived will be able to start processing almost immediately, at least before the processing of all other tasks ends.
SOLUTION
Instead, we want an architecture that looks more like the following diagram, where we have separate queues per combination of customer and endpoint. This architecture gives us much better isolation, as well as the ability to dynamically adjust throughput on a per-customer basis.
On the side of the producer
the task comes from the client
immediately create a queue for this task
send all messages to this queue
On the side of the consumer
in one process, you constantly update the list of queues
in other processes, you follow this list and consume for example 1 message from each queue
scale consumers
QUESTION
Is there any common solution to such a problem? Using RabbitMQ or any other tooling. Š¯istorically, we use Kafka on the project, so if there is any approach using - it is amazing, but we can use any technology for the solution.
Why not use spark to execute the messages within the task? What I'm thinking is that each worker creates a spark context that then parallelizes the messages. The function that is mapped can be based on which kafka topic the user is consuming. I suspect however your queues might have tasks that contained a mixture of messages, UI, API calls, etc. This will result in a more complex mapping function. If you're not using a standalone cluster and are using YARN or something similar you can change the queueing method that the spark master is using.
As I understood the problem, you want to create request isolation from the customer using dynamically allocated queues which will allow each customer tasks to be executed independently. The problem looks like similar to Head of line blocking issue in networking
The dynamically allocating queues is difficult. This can also lead to explosion of number of queues that can be a burden to the infrastructure. Also, some queues could be empty or very less load. RabbitMQ won't help here, it is a queue with different protocol than kafka.
One alternative is to use custom partitioner in kafka that can look at the partition load and based on that load balance the tasks. This works if the tasks are independent in nature and there is no state store maintains in the worker.
The other alternative would be to load balance at the customer level. In this case you select a dedicated set of predefined queues for a set of customers. Customers with certain Ids will be getting served by a set of queues. The downside of this is some queues can have less load than others. This solution is similar to Virtual Output Queuing in networking,
My understanding is that the partitioning of the messages it's not ensuring a evenly load-balance. I think that you should avoid create overengineering and so some custom stuff that will come on top of the Kafka partitioner and instead think at a good partitioning key that will allows you to use Kafka in an efficiently manner.

Python: Prioritizing tasks and Running asynchronous tasks without a lock

Right now I'm using Gevent, and I wanted to ask two questions:
Is there a way to execute specific tasks that will never execute asynchronously (instead of using a Lock in each of these tasks)
Is there's a way to prioritize spawned tasks in Gevent? Like a group of tasks that will be generated with low priority that will be executed when all of the other tasks are done. For example, two tasks that listen to different socket when each of these tasks handles the socket requests in various priority
If it's not possible in Gevent, is there any other library that it can be done?
Edit
Maybe Celery can help me here?
If you want to manage computing resources, Python async libraries can't help here, because, AFAIK, neither has priority scheduler. All greenthreads are equal.
Task queues generally have a notion of priority, so Celery or Beanstalk is one way to do it.
If your problem does not require task (re)execution guarantees, persistence, multi-machine work distribution, then I would just start few worker processes, assign them CPU, IO, disk priorities using OS and send work/results via UNIX socket DGRAM. Kind of ad-hoc simpler version of task queue. If you go this way, please share your work as open source project, I believe there's demand for this kind of solution.

How to architect a multi-step process using a message queue?

Say I have a multi-step, asynchronous process with these restrictions:
Individual steps can be performed by any worker
Steps must be performed in-order
The approach I'm considering:
Insert a db row that represents the entire process, with a "Steps completed" column to keep track of the progress.
Subscribe to a queue that will receive a message when the entire process is done.
Upon completion of each step, update the db row and queue the next step in the process.
After the last step is completed, queue the "process is complete" message.
Delete the db row.
Thoughts? Pitfalls? Smarter ways to do it?
I've built a system very similar to what you've described in a large, task-intensive document processing system, and have had to live with both the pros and the cons for the last 7 years now. Your approach is solid and workable, but I see some drawbacks:
Potentially vulnerable to state change (i.e., what if process inputs change before all steps are queued, then the later steps could have inputs inconsistent with earlier steps)
More infrastructural than you'd like, involving both a DB and a queue = more points of failure, harder to set up, more documentation required = doesn't quite feel right
How do you keep multiple workers from acting on the same step concurrently? In other words, the DB row says 4 steps are completed, how does a worker process know if it can take #5 or not? Doesn't it need to know whether another process is already working on this? One way or another (DB or MQ) you need to include additional state for locking.
Your example is robust to failure, but doesn't address concurrency. When you add state to address concurrency, then failure handling becomes a serious problem. For example, a process takes step 5, and then puts the DB row into "Working" state. Then when that process fails, step 5 is stuck in "Working" state.
Your orchestrator is a bit heavy, as it is doing a lot of synchronous DB operations, and I would worry that it might not scale as well as the rest of the architecture, as there can be only one of those...this would depend on how long-running your steps were compared to a database transaction--this would probably only become an issue at very massive scale.
If I had it to do over again, I would definitely push even more of the orchestration onto the worker processes. So, the orchestration code is common and could be called by any worker process, but I would keep the central, controlling process as light as possible. I would also use only message queues and not any database to keep the architecture simple and less synchronous.
I would create an exchange with 2 queues: IN and WIP (work in progress)
The central process is responsible for subscribing to process requests, and checking the WIP queue for timed out steps.
1) When the central process received a request for a given processing (X), it invokes the orchestration code, and it loads the first task (X1) into the IN queue
2) The first available worker process (P1) transactionally dequeues X1, and enqueues it into the WIP queue, with a conservative time-to-live (TTL) timeout value. This dequeueing is atomic, and there are no other X tasks in IN, so no second process can work on an X task.
3) If P1 terminates suddenly, no architecture on earth can save this process except for a timeout. At the end of the timeout period, the central process will find the timed out X1 in WIP, and will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications.
4) If P1 terminates abnormally but gracefully, then the worker process will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications. Depending on the exception, the worker process could also choose to reset the TTL and retry the step.
5) If P1 hangs indefinitely, or exceeds its TTL, same result as #3. The central process handles it, and presumably the worker process will at some point be recycled--or the rule could be to recycle the worker process anytime there's a timeout.
6) If P1 succeeds, then the worker process will determine the next step, either X2 or X-done. If the next step is X2, then the worker process will transactionally dequeue X1 from WIP, and enqueue X2 into IN. If the next step is X-done, then the processing is complete, and the appopriate action can be taken, perhaps this would be enqueueing X-done into IN for subsequent processing by the orchestrator.
The benefits of my suggested approach are:
Contention between worker processes is specified
All possible failure scenarios (crash, exception, hang, and success) are handled
Simple architecture can be completely implemented with RabbitMQ and no database, which makes it more scalable
Since workers handle determining and enqueueing the next step, there is a more lightweight orchestrator, leading to a more scalable system
The only real drawback is that it is potentially vulnerable to state change, but often this is not a cause for concern. Only you can know whether this would be an issue in your system.
My final thought on this is: you should have a good reason for this orchestration. After all, if process P1 finishes task X1 and now it is time for some process to work on next task X2, it seems P1 would be a very good candidate, as it just finished X1 and is now available. By that logic, a process should just gun through all the steps until completion--why mix and match processes if the tasks need to be done serially? The only async boundary really would be between the client and the worker process. But I will assume that you have a good reason to do this, for example, the processes can run on different and/or resource-specialized machines.

Advice/experience on testing MPI code with CppUnit

I've got a codebase where I have been using CppUnit for unit testing. I'm now adding some MPI code to the project and I'd like to unit test some abstractions I'm building on top of MPI. For example, I've written some code to manage a single-producer/multiple-consumer relationship where consumers ask for work and the producer serializes the next bit of work to send to the consumer, and I'd like to test just that interaction with a test that generates some fake work items in a producer, distributes them to consumers, which then send some kind of checksum back to the consumer to make sure everything got distributed and nothing deadlocked etc.
Does anyone have experience of what works best here? Some things I've been thinking about:
Is it reasonable to have all processes execute the test runner so that they all execute the test functions in the same order? Or is it better to have only the master run the test runner and have it send broadcasts to the slaves to tell them what to do next (presumably with some kind of lookup table to map commands to test functions)?
Is it sane in any way to use CPPUNIT_ASSERT inside the slaves, or should all information be sent back to the master for assertions? If slaves can assert, how should all the results be combined to get a single output log?
How should one handle test failures, such that the exception thrown in one process doesn't cause synchronization problems e.g. another process is waiting for an MPI_Recv for which the matching MPI_Send will now never happen?

Resources