Airflow - Using an upstream task for multiple downstream tasks

Airflow - Using an upstream task for multiple downstream tasks - airflow

Okay I apologize if this is a dumb question, because it seems so obvious that it should work. But I can't find it documented, and as are examining our options as we look to build a new data pipeline, I really want this to be a feature...
Can multiple downstream processes be dependent on a single upstream process, where the upstream process only runs once. In other words, can I extract a table one time, and then load it to my data warehouse, and have multiple aggregations that are dependent on that load being complete?
For a bit more information, we are attempting to go to an asynchronous extract-load-transform where the extract is started, and then the loads and transforms finish as soon as they have the subset of tables they need from the extract.

This seems to me like a usual DAG with unusual wording. I understand the required structure like this:
extract_table_task \
|- task1_do_stuff
|- task2_do_some_other_stuff
|- task3_...
Or in Airflow code:
extract_table_task.set_downstream(task1_do_stuff)
extract_table_task.set_downstream(task2_do_some_other_stuff)
extract_table_task.set_downstream(task3_...)
Then make sure to select the correct trigger rules for your workflow, e.g. if some tasks should run even if something went wrong: https://airflow.apache.org/concepts.html#trigger-rules

If I'm understanding the question, yes, you can set downstream tasks to be dependent on the success of an upstream task.
We use dummyOperators in a lot of cases similar to this sample DAG:
In the case we want the dummyOperator to kick off first and do something, before the downstream tasks kick off. This makes clearing out failed runs easier as well as we can simply clear the dummy operator and downstream tasks at the same time.
You can use the depends_on_past=True parameter to require upstream tasks run before the downstream tasks are queued, otherwise they can be skipped based on logic in the upstream task.

Related

Finding out whether a DAG execution is a catchup or a regularly scheduled one

I have an Airflow pipeline that starts with a FileSensor that may perform a number of retries (which makes sense because the producing process sometimes takes longer, and sometimes simply fails).
However when I restart the pipeline, as it runs in catchup mode, the retries in the file_sensor become spurious: if the file isn't there for a previous day, it wont materialize anymore.
Therefore my question: is it possible to make the behavior of a DAG-run contingent on whether that is currently running in a catch up or in a regularly scheduled run?
My apologies if this is a duplicated question: it seems a rather basic problem, but I couldn't find previous questions or documentation.

The solution is rather simple.
Set a LatestOnlyOperator upstream from the FileSensor
Set an operator of any type you may need downstream from the FileSensor with its trigger rule set to TriggerRule.ALL_DONE.
Both skipped and success states count as "done" states, while an error state doesn't. Hence, in a "non-catch-up" run the FileSensor will need to succeed to give way to the downstream task, while in a catch-up run the downstream will right away start after skipping the FileSensor.

Custom Operator States (queued, success, etc.) in Apache Airflow?

In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
"Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to defining custom operator states? Since I am running airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,

The task states are tightly integrated with Airflow. There's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep log files for "WARNING" or set up a log aggregation service e.g. Elasticsearch to make log files searchable.
For #2, sensors have no knowledge about why a sensor timed out. After timeout or execution_timeout is reached, they simply raise an Exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for an exception into account.
If you want more control over this, I would implement your own Sensor which takes an argument e.g. data_not_ready_timeout (which is smaller than timeout and execution_timeout). In the poke() method, check if data_not_ready_timeout has been reached, and raise an AirflowSkipException if so. This will skip downstream tasks. Once timeout or execution_timeout are reached, the task is failed. Look at BaseSensorOperator.execute() for some inspiration to get the initial starting date of a sensor.

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
|-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
for filepath in Path(WATCH_DIR).glob(*):
sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either Move or TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore..
This gives rise to two problems:
After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows that the tasks' run and state, but clicking on the "square" and picking any details leads to an Airflow error.)
The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?

While it isn't clear, i'm assuming that downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and stay there always
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with slight modification to your DAGs structure, you can get rid of your troubles (which also are quite advanced IMO)
Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
Make the upstream orchestrator DAG do two things
Sense / wait for files to appear
For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it either ways
have a single task that does file sensing + triggering downstream DAGs for each file
have two tasks (I'd prefer this)
first task senses files and when they appear, publishes their list in an XCOM
second task reads that XCOM and foreach file, triggers it's corresponding DAG
but whatever way you choose, you'll have to replicate the relevant bits of code from
FileSensor (to be able to sense file and then publish their names in XCOM) and
TriggerDagRunOperator (so as to be able to trigger multiple DAGs with single task)
here's a diagram depicting the two tasks approach

The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible". The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does shows up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

How to launch a Dataflow job with Apache Airflow and not block other tasks?

Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by it (correct?).
When we run more of these tasks, that means we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the job and airflow job is successful
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code)
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.

Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly yes. Airflow has parallelism option in the configuration which define the number of tasks that should execute at a time across the system. Having a task block this slot might slow down the execution in the system but this issue is bound to happen as you increase the number of tasks and DAGs. You can increase this in the configuration depending on your needs
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use PythonOperator and in the python_callable you can use the dataflow hook to launch the job in async mode (launch and don't wait).
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
A: When you say reschedule, I'm assuming that you are going to retry the task that looks for job that checks if the job finished correctly. If I'm right, you can set the task on retry mode and the delay at which you want the retry to happen.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.

Determining if a DAG is executing

I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that poll an SFTP site to find new files. If any are found, then I create custom task id's for the dynamically created task and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
for file_path in directory_list:
... SFTP code that GET's the remote files
That part works fine. It seems both the airflow webserver and airflow scheduler are iterating through all the DAGs once a second and actually running the code that retrieves the directory_list. This means I'm hitting the SFTP site ~2 x a second to authenticate and pull a list of files. I'd like to have some conditional code that only executes if the DAG is actually being run.
When an SFTP site uses password authentication, the # of times I connect really isn't an issue. One site requires key authentication and if there are too many authentication failures in a short timespan, the account is locked. During my testing, this seems to happen occasionally for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or executing manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?

You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG is executed with each scheduler loop. Or put another way, every time the scheduler loop processes the files in your DAG directory it is interpreting all the code in your DAG files. Anything not in a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler as well as any external systems you are making calls to.
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs through the UI well. This is mostly the result of the Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exist in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
When adding tasks for a DAG run will make it appear (in the UI) that all DAG
runs before when that task never ran that task all. The will have a None state
and the DAG run will be set to success or failed depending on the outcome
of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks you will lose the ability to review history of the DAG. For example, if you run a DAG with task_x in the first 20 DAG runs but remove it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when the DAG runs are idempotent. This means that re-running any DAG Run should have the same affect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding and removing tasks to previous DAG runs so that the results of re-running are not the same.
Solution Options
You have at least two options moving forward
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if not using distributed executor) or an Airflow Variable (this will result in more reads to the Airflow DB) and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files so that every file that exist is processed within a single task run. This will make the DAGs idempotent and you will maintain accurate history through the logs.
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex