How to get the DAG chain execution time in Airflow?

Let's say I have two DAGs, where dag2 executes dag1 as part of its flow using TriggerDagRunOperator as follows:
dag1: task1 > task2 > task3
dag2: task4 > dag1 > task5
Now let's say dag2 is scheduled to run once a day at 5PM.
Is there a way for me to get the execution timestamp for dag2 (the parent DAG) while I'm running dag1?
Is there any built-in parameter that holds that value?
And if something happened and dag2 was triggered later than usual, let's say 6PM the same day, I still want to get the original scheduled time - that is, 5PM - while I'm in dag1.

Pass a function to the python_callable argument of TriggerDagRunOperator that injects the execution_date into the triggered DAG:
def inject_execution_date(context, dag_run_obj):
    dag_run_obj.payload = {"parent_execution_date": context["execution_date"]}
    return dag_run_obj
[...]
trigger_dro = TriggerDagRunOperator(python_callable=inject_execution_date, [...])
You can access this in the child DAG with context["conf"]["parent_execution_date"]
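A hedged sketch (not part of the original answer) of the child (dag1) side: the payload set above arrives in the triggering DagRun's conf, which you can also read via context["dag_run"].conf. The task and function names below are illustrative, and dag1 is assumed to be defined elsewhere in the file.
from airflow.operators.python_operator import PythonOperator

def read_parent_execution_date(**context):
    # Timestamp injected by the parent's TriggerDagRunOperator callable
    parent_date = context["dag_run"].conf["parent_execution_date"]
    print("Parent dag2 was scheduled at:", parent_date)

read_parent = PythonOperator(
    task_id="read_parent_execution_date",
    python_callable=read_parent_execution_date,
    provide_context=True,
    dag=dag1,
)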

Related

How to get Airflow's previous execution date regardless of how the DAG is triggered?

When I trigger a DAG manually, prev_execution_date and execution_date are the same.
echo_exec_date = BashOperator(
    task_id='bash_script',
    bash_command='echo "prev_exec_date={{ prev_execution_date }} execution_date={{ execution_date }}"',
    dag=dag)
results in:
prev_exec_date=2022-06-29T08:50:37.506898+00:00 execution_date=2022-06-29T08:50:37.506898+00:00
They are different if the DAG is triggered automatically by the scheduler.
I would like to have prev_execution_date regardless of triggering it manually or automatically.
When manually triggering a DAG, the schedule is ignored, and prev_execution_date == next_execution_date == execution_date.
This is explained in the Airflow docs:
This is because previous / next of a manual run is not something that is well defined. Consider that you have a daily schedule (say at 00:00) and you invoke a manual run at 13:00. What is the expected next schedule? Should it be daily from 00:00 or daily from 13:00? A DagRun can have only 1 prev and only 1 next. In your scenario it seems like you are interested in a case where there can be more than 1, or where the manual run "comes between" the two scheduled runs. This is not something that Airflow supports - it would really overcomplicate things.
If you want to work around it, you can create a custom macro that checks the run_type, searches for the specific DagRun that you consider the previous one, and returns its execution_date (a hedged sketch follows). Be aware that it might create side effects (overlapping data-interval processing, etc.); you need to verify that the logic you implement makes sense for your specific use case.
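A minimal sketch of such a macro, assuming Airflow 2.x (where DagRun has a run_type column); the macro name prev_scheduled_execution_date and the way it is registered are illustrative:
from airflow.models import DagRun
from airflow.utils.session import provide_session
from airflow.utils.types import DagRunType

@provide_session
def prev_scheduled_execution_date(dag_run, session=None):
    # Find the latest *scheduled* run strictly before the current run,
    # ignoring manual runs entirely.
    last_scheduled = (
        session.query(DagRun)
        .filter(
            DagRun.dag_id == dag_run.dag_id,
            DagRun.run_type == DagRunType.SCHEDULED,
            DagRun.execution_date < dag_run.execution_date,
        )
        .order_by(DagRun.execution_date.desc())
        .first()
    )
    return last_scheduled.execution_date if last_scheduled else None

# Register it on the DAG and call it from templates, e.g.:
# dag = DAG(..., user_defined_macros={'prev_scheduled_execution_date': prev_scheduled_execution_date})
# bash_command='echo "{{ prev_scheduled_execution_date(dag_run) }}"'
As noted above, verify that the resulting data intervals make sense for your use case.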

Airflow: Master Dag with ExternalTaskSensor gets stuck forever

The requirement is to have the DAGs run one after the other, each starting only on success of the previous one.
I have a master DAG in which I am calling all the DAGs so they get executed one after the other in sequence.
Also, in each of dag_A, dag_B, and dag_C I have set schedule_interval = None and turned them ON manually in the GUI.
I am using ExternalTaskSensor because otherwise, even before all the tasks in the first dag_A are completed, the second dag_B gets kicked off; the sensor is there to avoid that. If there is a better implementation, please let me know.
I don't know what I am missing here.
Code: master_dag.py
import datetime
import os
from datetime import timedelta
from airflow.models import DAG, Variable
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.sensors import ExternalTaskSensor

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2020, 1, 7),
    'provide_context': True,
    'execution_timeout': None,
    'retries': 0,
    'retry_delay': timedelta(minutes=3),
    'retry_exponential_backoff': True,
    'email_on_retry': False,
}

dag = DAG(
    dag_id='master_dag',
    schedule_interval='7 3 * * *',
    default_args=default_args,
    max_active_runs=1,
    catchup=False,
)

trigger_dag_A = TriggerDagRunOperator(
    task_id='trigger_dag_A',
    trigger_dag_id='dag_A',
    dag=dag,
)

wait_for_dag_A = ExternalTaskSensor(
    task_id='wait_for_dag_A',
    external_dag_id='dag_A',
    external_task_id='proc_success',
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)

trigger_dag_B = TriggerDagRunOperator(
    task_id='trigger_dag_B',
    trigger_dag_id='dag_B',
    dag=dag,
)

wait_for_dag_B = ExternalTaskSensor(
    task_id='wait_for_dag_B',
    external_dag_id='dag_B',
    external_task_id='proc_success',
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)

trigger_dag_C = TriggerDagRunOperator(
    task_id='trigger_dag_C',
    trigger_dag_id='dag_C',
    dag=dag,
)

trigger_dag_A >> wait_for_dag_A >> trigger_dag_B >> wait_for_dag_B >> trigger_dag_C
Each of the DAGs has multiple tasks, with the last task being proc_success.
Background
ExternalTaskSensor works by polling the state of DagRun / TaskInstance of the external DAG or task respectively (based on whether or not external_task_id is passed)
Now since a single DAG can have multiple active DagRuns, the sensor must be told which of these runs / instances it is supposed to sense.
For that, it uses execution_date as the distinguishing criterion. This can be expressed in (only) one of the following two ways:
:param execution_delta: time difference with the previous execution to
look at, the default is the same execution_date as the current task or DAG.
For yesterday, use [positive!] datetime.timedelta(days=1). Either
execution_delta or execution_date_fn can be passed to
ExternalTaskSensor, but not both.
:type execution_delta: datetime.timedelta
:param execution_date_fn: function that receives the current execution date
and returns the desired execution dates to query. Either execution_delta
or execution_date_fn can be passed to ExternalTaskSensor, but not both.
:type execution_date_fn: callable
The problem in your implementation
In your ExternalTaskSensors, you are not passing either the execution_date_fn or the execution_delta param.
As a result, the sensor picks up its own execution_date to poll for DagRuns of the child DAGs, thereby getting stuck (clearly the execution_date of your parent / orchestrator DAG would be different from that of the child DAGs).
@provide_session
def poke(self, context, session=None):
    if self.execution_delta:
        dttm = context['execution_date'] - self.execution_delta
    elif self.execution_date_fn:
        dttm = self.execution_date_fn(context['execution_date'])
    else:
        # if neither of the above is passed, use the current DAG's execution date
        dttm = context['execution_date']
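For illustration only, here is a hedged sketch of how either parameter would be passed to the sensor above if the child run's execution_date were shifted relative to the orchestrator's (the timedelta value is a placeholder; see EDIT-1 below before assuming a shift is actually needed):
from datetime import timedelta
from airflow.operators.sensors import ExternalTaskSensor

wait_for_dag_A = ExternalTaskSensor(
    task_id='wait_for_dag_A',
    external_dag_id='dag_A',
    external_task_id='proc_success',
    # look at the run whose execution_date is 1 hour before this DAG's
    execution_delta=timedelta(hours=1),
    # or, equivalently: execution_date_fn=lambda dt: dt - timedelta(hours=1),
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)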
Further tips
You can skip passing external_task_id; when you do that, the ExternalTaskSensor, in effect, becomes an ExternalDagSensor. This is particularly helpful when your child DAGs (A, B & C) have more than one end task (so that completion of any one of those end tasks doesn't guarantee the completion of the entire DAG).
Also have a look at this discussion: Wiring top-level DAGs together
EDIT-1
On reflection, my initial judgement appears to be wrong; in particular, the following statement doesn't hold true.
clearly the execution_date of your parent / orchestrator DAG would be
different from child DAGs
Looking at the source, it becomes clear that TriggerDagRunOperator passes its own execution_date to the child DagRun, meaning that the ExternalTaskSensor should then be able to sense that DAG or its task.
trigger_dag(
    dag_id=self.trigger_dag_id,
    run_id=run_id,
    conf=self.conf,
    # own execution date passed to child DAG
    execution_date=self.execution_date,
    replace_microseconds=False,
)
So the explanation above does not hold.
I would suggest that you
check the execution_date of your triggered child DAGs (or of the tasks whose external_task_id you are passing) in the UI or by querying the meta-db,
and compare it with the execution_date of your orchestrator DAG;
that should clarify certain bits. A hedged sketch of the meta-db check follows.
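A minimal sketch of that comparison using DagRun.find (a classmethod available in Airflow 1.10.x and later), run from a Python session with access to the metadata DB; the dag ids match the question's:
from airflow.models import DagRun

for dag_id in ('master_dag', 'dag_A'):
    for run in DagRun.find(dag_id=dag_id):
        # Compare execution_date of parent and child runs side by side
        print(run.dag_id, run.execution_date, run.run_id, run.state)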

Can I add a delay to a schedule in airflow?

I have a pipeline I want to run everyday, but I would like the execution date to lag. That is, on day X I want the execution date to be X-3. Is something like that possible?
It looks like you are using execution_date as a variable in your pipeline logic, for example to process the data that is 3 days older than the execution_date. So, instead of making execution_date lag by 3 days, you can subtract the lag from execution_date and use the result in your pipeline logic. Airflow provides a number of ways to do it:
Templates: {{ execution_date - macros.timedelta(days=3) }}. So, for example, the bash_command parameter of BashOperator can be bash_command='echo Processing date: {{ execution_date - macros.timedelta(days=3) }} '
The PythonOperator's python callable: Define the callable something like def func(execution_date, **kwargs): ... and set the PythonOperator's parameter provide_context=True. The execution_date parameter of func() will be set to the current execution date (a datetime object) on call. So, inside func() you can do processing_date = execution_date - timedelta(days=3) (a sketch follows this list).
The Sensors' context parameter: The poke() and execute() methods of any sensor have the context parameter, which is a dict with all macros including execution_date. So, in these methods you can do processing_date = context['execution_date'] - timedelta(days=3).
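A hedged sketch of the PythonOperator option above (Airflow 1.x style with provide_context=True; the task and function names are illustrative, and dag is assumed to be defined elsewhere):
from datetime import timedelta
from airflow.operators.python_operator import PythonOperator

def process_with_lag(execution_date, **kwargs):
    # Shift the date used by the pipeline logic, not the schedule itself
    processing_date = execution_date - timedelta(days=3)
    print("Processing data for", processing_date)

lagged_processing = PythonOperator(
    task_id='lagged_processing',
    python_callable=process_with_lag,
    provide_context=True,
    dag=dag,
)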
Forcing the execution date to have a lag simply does not feel right, because according to Airflow's logic the execution date of the currently running DAG can normally lag only if it is catching up (backfilling).
You can use a TimeSensor to delay the execution of tasks in a DAG. I don't think you can change the actual execution_date unless you can describe the behavior as a cron.
If you want to apply this delay only for a subset of scheduled DAG runs, you could use a BranchPythonOperator to first check whether execution_date is one of those days you want the lag on. If it is, take the branch with the sensor; otherwise, move along without it.
Alternatively, especially if you plan to have this behavior in more than one DAG, you can write a modified version of the sensor. It might look something like this:
def poke(self, context):
    if should_delay(context['execution_date']):
        self.log.info('Checking if the time (%s) has come', self.target_time)
        return timezone.utcnow().time() > self.target_time
    else:
        self.log.info('Not one of those days, just run')
        return True
You can reference the code for the existing time sensor in https://github.com/apache/incubator-airflow/blob/1.10.1/airflow/sensors/time_sensor.py#L38-L40.

Finding out what triggered a task run programmatically

Is there a way to programmatically determine what triggered the current task run of the PythonOperator from inside of the operator?
I want to differentiate between the task runs triggered on schedule, those catching up, and those triggered by the backfill CLI command.
The template context contains two variables: dag_run and run_id that you can use to determine whether the run was scheduled, a backfill, or externally triggered.
from airflow import jobs

def python_target(**context):
    is_backfill = context["dag_run"].is_backfill
    is_external = context["dag_run"].external_trigger
    is_latest = context["execution_date"] == context["dag"].latest_execution_date
    # More code...
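A hedged usage sketch wiring the callable above into a task (Airflow 1.x style; the task_id is illustrative and dag is assumed to be defined elsewhere):
from airflow.operators.python_operator import PythonOperator

inspect_trigger = PythonOperator(
    task_id='inspect_trigger',
    python_callable=python_target,
    provide_context=True,
    dag=dag,
)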

Airflow: Dynamic SubDag creation

I have a use case where I have a list of clients. Clients can be added to or removed from the list, and they can have different start dates and different initial parameters.
I want to use Airflow to backfill all data for each client based on their initial start date, and to rerun if something fails. I am thinking about creating a SubDag for each client. Will this address my problem?
How can I dynamically create SubDags based on the client_id?
You can definitely create DAG objects dynamically:
def make_client_dag(parent_dag, client):
    return DAG(
        '%s.client_%s' % (parent_dag.dag_id, client.name),
        start_date=client.start_date
    )
You could then use that method in a SubDagOperator from your main dag:
for client in clients:
    SubDagOperator(
        task_id='client_%s' % client.name,
        dag=main_dag,
        subdag=make_client_dag(main_dag, client)
    )
This will create a subdag specific to each member of the collection clients, and each will run for the next invocation of the main dag. I'm not sure if you'll get the backfill behavior you want.
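For completeness, a hedged sketch of what the clients collection might look like; the Client shape and the values are illustrative, and in practice the list could come from an Airflow Variable or a config file:
from collections import namedtuple
from datetime import datetime

Client = namedtuple('Client', ['name', 'start_date'])

clients = [
    Client(name='acme', start_date=datetime(2019, 1, 1)),
    Client(name='globex', start_date=datetime(2019, 6, 1)),
]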