Airflow - preserve the state in retry

My Airflow dag sends an HTTP PUT request every hour with the hour in the body.
In case of failure, I want the retried request to contain the original hour in the body (even if the retry happens several days later).
How can I achieve that?

There are a few ways you could achieve this, but I'd suggest having a look at Airflow's XCom: https://airflow.apache.org/concepts.html?highlight=xcom#concepts-xcom
A simple example to suit your case would be to create a DAG with 2 nodes - NodeA and NodeB.
NodeA runs and stores the current time in XCOM
NodeB runs, retrieves the time from the XCOM of NodeA, and makes a PUT request with that value in the body.
If you want to re-trigger the PUT request in the future, you only need to clear NodeB in the DAG. When it runs again, it will retrieve the same time that was originally stored in NodeA's XCom.
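A minimal sketch of that two-task pattern (untested; the endpoint URL, DAG/task ids and the use of the requests library are illustrative assumptions, not taken from the question):

from datetime import datetime
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def store_hour(ti, **_):
    # NodeA: push the current hour into XCom once; retries of NodeB reuse it.
    ti.xcom_push(key="hour", value=datetime.utcnow().strftime("%Y-%m-%dT%H:00:00"))

def send_put(ti, **_):
    # NodeB: pull the hour stored by NodeA and send it, even days later.
    hour = ti.xcom_pull(task_ids="node_a", key="hour")
    requests.put("https://example.com/endpoint", json={"hour": hour}).raise_for_status()

with DAG(dag_id="hourly_put", start_date=datetime(2023, 1, 1), schedule_interval="@hourly") as dag:
    node_a = PythonOperator(task_id="node_a", python_callable=store_hour)
    node_b = PythonOperator(task_id="node_b", python_callable=send_put)
    node_a >> node_b

Clearing node_b (and only node_b) later re-runs the PUT with the value node_a originally stored.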

Related

Proper way to let airflow sensor continuous triggering?

Is it possible to have an Airflow sensor trigger continuously? By continuous triggering I mean that, for example, the sensor listens to a Kafka topic and triggers different DAGs depending on the received message, and this keeps running, possibly forever.
The sensor doesn't trigger the DAG run; it's part of the run, but it can block it by staying in the running state (or up-for-reschedule) while waiting for a certain condition, and then all the downstream tasks stay waiting (None state).
To achieve what you want, you can create a subclass of TriggerDagRunOperator that reads the Kafka topic and then triggers runs in other DAGs based on your needs.
Or you can create a stream application outside Airflow and use the Airflow API to trigger the runs.
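For the external-application route, triggering a run through the stable REST API might look roughly like this (a sketch; the host, credentials and dag_id are placeholders, and basic auth assumes that auth backend is enabled):

import requests

def trigger_dag_run(message: dict) -> None:
    # POST to the stable REST API; pass the consumed message as the run conf.
    response = requests.post(
        "http://localhost:8080/api/v1/dags/target_dag/dagRuns",  # hypothetical host and dag_id
        auth=("airflow", "airflow"),  # assumes the basic-auth API backend is enabled
        json={"conf": message},
    )
    response.raise_for_status()

Your stream consumer would call trigger_dag_run for each message it receives.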
If you're using deferrable operators, you should be able to re-defer them after re-entering; untested pseudo-code would be something like:
def execute(self, context) -> Any:
    # Wait (deferred) for the first message from the trigger.
    self.defer(trigger=AwaitMessageTrigger(), method_name="execute_complete")

def execute_complete(self, context, event=None):
    # Trigger a run in the target DAG with the received event, then defer again
    # so the operator keeps listening (operator/trigger arguments omitted here).
    TriggerDagRunOperator(conf=event)
    self.defer(trigger=AwaitMessageTrigger(), method_name="execute_complete")
This pattern has been implemented as part of the provider package here.
https://github.com/astronomer/airflow-provider-kafka/blob/main/airflow_provider_kafka/operators/event_triggers_function.py
Example here
https://github.com/astronomer/airflow-provider-kafka/blob/main/example_dags/listener_dag_function.py

Airflow Deferrable Operator Pattern for Event-driven DAGs

I'm looking for examples of patterns in place for event-driven DAGs, specifically those with dependencies on other DAGs. Let's start with a simple example:
dag_a -> dag_b
dag_b depends on dag_a. I understand that at the end of dag_a I can add a trigger to launch dag_b. However, this philosophically feels misaligned from an abstraction standpoint: dag_a does not need to understand or know that dag_b exists, yet this pattern would enforce the responsibility of calling dag_b on dag_a.
Let's consider a slightly more complex example (pardon my poor ASCII drawing skills):
dag_a ------> dag_c
            /
dag_b -----/
In this case, dag_c depends on both dag_a and dag_b. I understand that we could set up a sensor for the output of each of dag_a and dag_b, but with the advent of deferrable operators, it doesn't seem that this remains a best practice. I suppose I'm wondering how to set up a DAG of DAGs in an async fashion.
The potential of deferrable operators for event-driven DAGs is introduced in Astronomer's guide here: https://www.astronomer.io/guides/deferrable-operators, but it's unclear how they would best be applied in light of the above examples.
More concretely, I'm envisioning a use case where multiple DAGs run every day (so they share the same run date), and the output of each DAG is a date partition in a table somewhere. Downstream DAGs consume the partitions of the upstream tables, so we want to schedule them such that downstream DAGs don't attempt to run before the upstream ones complete.
Right now I'm using a "fail fast and often" approach in downstream DAGs: they start running at the scheduled date, but first check whether the data they need exists upstream, and if not, the task fails. I have these tasks set to retry every x interval, with a high number of retries (e.g. retry every hour for 24 hours; if the data still isn't there, something is wrong and the DAG fails). This is fine since 1) it works for the most part and 2) I don't believe the failed tasks continue to occupy a worker slot between retries, so it actually is somewhat async (I could be wrong). It's just a little crude, so I imagine there is a better way.
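For reference, roughly what that retry-based check might look like (a sketch with made-up names; the actual partition lookup is left as a placeholder):

from datetime import timedelta
from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator

def check_upstream_partition(ds, **_):
    # Placeholder check; replace with a real lookup of the upstream partition for `ds`.
    partition_exists = False
    if not partition_exists:
        raise AirflowException(f"Upstream partition for {ds} is not ready yet")

wait_for_upstream = PythonOperator(
    task_id="wait_for_upstream",
    python_callable=check_upstream_partition,
    retries=24,                      # keep "failing fast" for up to a day
    retry_delay=timedelta(hours=1),  # no worker slot is held between retries
)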
Any tactical advice for how to set this relationship up to be more event driven while still benefitting from the async nature of deferrable operators is welcome.
We are using an event bus to connect the DAGs: the final task of a DAG sends an event out, and the downstream DAG is triggered in the orchestrator according to the event type.
Starting from Airflow 2.4 you can use data-aware scheduling. It would look like this:
from airflow import DAG, Dataset
from airflow.operators.bash import BashOperator

dag_a_finished = Dataset("dag_a_finished")

with DAG(dag_id="dag_a", ...):
    # Last task in dag_a
    BashOperator(task_id="final", outlets=[dag_a_finished], ...)

with DAG(dag_id="dag_b", schedule=[dag_a_finished], ...):
    ...

with DAG(dag_id="dag_c", schedule=[dag_a_finished], ...):
    ...
In theory a Dataset should represent some piece of data, but technically nothing prevents you from making it just a string identifier used for setting up the DAG dependency, just like in the example above.

How to run Airflow DAG for specific number of times?

How to run airflow dag for specified number of times?
I tried using TriggerDagRunOperator, and this operator works for me.
In the callable function we can check states and decide whether to continue or not.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat DAG runs.
I'd like an expert opinion: is there any other, more robust way to run an Airflow DAG for X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time-based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times, that a run takes 2 hours, and that you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, why it must be 10 runs, or why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
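A hedged sketch of that fixed-window idea (the dates and dag_id are placeholders; the exact run count depends on how end_date is interpreted in your Airflow version):

from datetime import datetime
from airflow import DAG

# Fixed (not dynamic) start and end dates 11 days apart, a daily schedule and
# max_active_runs=1 give back-to-back, non-overlapping runs over that window.
dag = DAG(
    dag_id="run_n_times",             # hypothetical name
    start_date=datetime(2023, 1, 1),  # fixed date, "11 days ago"
    end_date=datetime(2023, 1, 11),   # fixed date, "today"
    schedule_interval="@daily",
    max_active_runs=1,
    catchup=True,
)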
Do you mean that you want it to run every 2 hours continuously, but sometimes it will run longer and you don't want runs to overlap? Well, you can definitely schedule it to run every 2 hours (0 */2 * * *) and set max_active_runs to 1, so that if the prior run hasn't finished, the next run will wait and then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
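For example (illustrative names; the pool would need to be created with exactly one slot in the UI or CLI beforehand):

from airflow.operators.bash import BashOperator

# A task that must never overlap with its own prior run, even when the
# surrounding DAG runs are allowed to overlap.
rebuild_temp_table = BashOperator(
    task_id="rebuild_temp_table",
    bash_command="echo 'drop and rebuild the temp table here'",
    depends_on_past=True,    # wait for the same task in the previous run
    pool="temp_table_pool",  # pool configured with a single slot
)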
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also, you could create manual dag_runs for an unscheduled DAG, creating 10 at a time whenever you feel like it (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow,
but by exploiting the meta-db, we can cook up this functionality ourselves:
we can write a custom operator / PythonOperator that,
before running the actual computation, checks whether 'n' runs of the task (TaskInstance table) already exist in the meta-db (refer to task_command.py for help),
and if they do, just skips the task (raise AirflowSkipException, reference); a sketch follows the article reference below.
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
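A minimal, untested sketch of that meta-db check (assuming Airflow 2.x; the run limit and callable name are hypothetical, and the callable would be wired up via a PythonOperator):

from airflow.exceptions import AirflowSkipException
from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State

MAX_RUNS = 10  # hypothetical limit on the number of successful runs

@provide_session
def run_at_most_n_times(ti, session=None, **_):
    # Count how many times this task has already succeeded in the meta-db.
    successes = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == ti.dag_id,
            TaskInstance.task_id == ti.task_id,
            TaskInstance.state == State.SUCCESS,
        )
        .count()
    )
    if successes >= MAX_RUNS:
        # Limit reached: skip instead of running the actual computation.
        raise AirflowSkipException(f"Task already succeeded {successes} times")
    # ... the actual computation goes here ...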
Note
The downside of this approach is that it assumes historical runs of the task (TaskInstances) will be preserved forever (and correctly);
in practice, though, I've often found task_instances to be missing (we have catchup set to False).
Furthermore, on large Airflow deployments one might need to set up routine cleanup of the meta-db, which would make this approach impossible.

How to setup multi-operator dag so that another instance would not get instantiated until all tasks of the running instance are completed?

We have multi-operator dags in our airflow implementation.
Let's say dag-a has operators t1, t2, t3 which are set up to run sequentially (i.e. t2 is dependent on t1, and t3 is dependent on t2):
task_2.set_upstream(task_1)
task_3.set_upstream(task_2)
We need to ensure that when dag-a is instantiated, all of its tasks complete successfully before another instance of the same dag is instantiated (or before the first task of the next dag instance is triggered).
We have set the following in our dags:
da['depends_on_past'] = True
What is happening right now is that if the instantiated dag does not have any errors, we see the desired effect.
However, let's say dag-a is scheduled to run hourly. On the hour, the dag-a-i1 instance is triggered as scheduled. Then dag-a-i1's task t1 runs successfully, and t2 starts running and fails. In that scenario, we see the dag-a-i1 instance stop as expected. When the next hour comes, the dag-a-i2 instance is triggered, and we see task t1 for that dag instance (i2) start running and, let's say, complete; then dag-a-i2 stops, since its t2 cannot run because the previous instance of t2 (for dag-a-i1) has a failed status.
The behavior we need is for the second instance not to get triggered, or if it does get triggered, we do not want task t1 for that second instance to run. This is causing problems for us.
Any help is appreciated.
Before I begin to answer, I'm going to lay out a naming convention that differs from the one you presented in your question.
DagA.TimeA.T1 will refer to an instance of a DAG A executing task T1 at time A.
Moving on, I see two potential solutions here.
The first:
Although not particularly pretty, you could add a sensor task to the beginning of your DAG. This sensor should wait for the execution of the final task of the same DAG in the previous run. Something like the following should work:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from datetime import timedelta

dag = DAG(dag_id="ETL", schedule_interval="@hourly")

ensure_prior_success = ExternalTaskSensor(
    task_id="ensure_prior_success", external_dag_id="ETL",
    external_task_id="final_task", execution_delta=timedelta(hours=1), dag=dag)

# ... the DAG's other tasks go between the sensor and final_task ...
final_task = DummyOperator(task_id="final_task", dag=dag)
Written this way, if any of the non-sensor tasks fail during a DagA.TimeA run, DagA.TimeB will begin executing its sensor task but will eventually time out.
If you choose to write your DAG in this way, there are a couple things you should be aware of.
If you are planning on performing backfills of this DAG (or if you think you ever may), you should set your DAG's max_active_runs to a low number. The reason is that a large enough backfill could fill the global task queue with sensor tasks and create a situation where new tasks are unable to be queued.
The first run of this DAG will require human intervention. The human will need to mark the initial sensor task as success (because no previous runs exist, the sensor cannot complete successfully).
The second:
I'm not sure what work your tasks are performing, but for sake of example let's say they involve writes to a database. Create an operator that looks at your database for evidence that DagA.TimeA.T3 completed successfully.
As I said, without knowing what your tasks are doing, it is tough to offer concrete advice on what this operator would look like. If your use case involves a constant number of database writes, you could perform a query to count the rows that exist in the target table WHERE TIME <= NOW - 1 HOUR.
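As a rough illustration only (it assumes the tasks write rows to a Postgres table; the connection id, table, column and expected row count are all made up, and the imports follow the same older Airflow layout as the example above):

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.sensors import BaseSensorOperator

EXPECTED_WRITES = 100  # hypothetical constant number of rows written per run

class PriorRunCompleteSensor(BaseSensorOperator):
    def poke(self, context):
        hook = PostgresHook(postgres_conn_id="my_db")  # hypothetical connection id
        count = hook.get_first(
            "SELECT COUNT(*) FROM target_table "
            "WHERE time <= NOW() - INTERVAL '1 hour'")[0]
        # Succeed only once the previous run's writes are all visible.
        return count >= EXPECTED_WRITES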

Out of order race condition

When a workflow has a receive activity that occurs after another receive activity, and the second receive activity is called first, the workflow holds the caller by blocking for 1 minute before timing out.
I want the workflow to return immediately when there are no matching workflow instances.
I do not want to change the timeout on the client as some calls may take a while.
This is a known issue in WF4, at least I am not aware of it being fixed yet.
