How does Airflow parse and store schedule_interval - airflow

I am working on a feature that requires schedule_intervals of Airflow jobs. Instead of writing the code to parse cron expressions in the DAG files myself, I have been trying to find parsed schedule_interval values in Airflow metadata DB, but to no avail.
Can someone give me a pointer to how Airflow parses schedule_interval expressions (e.g. a file at https://github.com/apache/incubator-airflow), and where it stores the parsed values (if the values are stored)?
Edit:
The schedule_interval expression above is the DAG argument schedule_interval, as in:
dag = DAG(
    'tutorial', default_args=default_args, schedule_interval='@daily')
According to this documentation page, schedule_interval can be a cron expression, a datetime.timedelta object, or one of the 'presets' like '@daily'. Because schedule_interval can take multiple forms, I don't want to reinvent the wheel and write code to parse schedule_interval arguments if Airflow has already parsed and stored these values.
I am building a system to periodically check all Airflow jobs and summarize their status, through querying the Airflow metadata db. Although not absolutely necessary, it would be useful to know schedule_interval, because it reveals information such as for each Airflow job, how many dag runs are expected in the last 24 hours, and when the next dag run would be.

The schedule_interval value isn't stored anywhere except in the running process itself. Airflow determines when it's time to create a new run by checking, more or less, NOW() >= MAX(execution_date, start_date) + schedule_interval.
You can programmatically calculate Airflow's execution_date values with the airflow.models.DAG.following_schedule and airflow.models.DAG.previous_schedule methods if you'd like.
Note: Airflow uses the croniter package to evaluate cron expressions.
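As a rough illustration of the rule of thumb above, here is a minimal standard-library sketch. It is not Airflow's own code: the helper names (normalize_schedule, is_run_due, expected_runs_last_24h) are made up for this example, the preset table only lists the usual cron shorthand expansions, and the arithmetic only covers timedelta intervals (cron strings would still need something like croniter).

```python
from datetime import datetime, timedelta

# Common cron shorthand and the expressions these aliases expand to.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def normalize_schedule(schedule_interval):
    """Return a timedelta unchanged; expand a preset alias into a cron string."""
    if isinstance(schedule_interval, timedelta):
        return schedule_interval
    # Anything that is not a known alias (a raw cron string, None) passes through.
    return CRON_PRESETS.get(schedule_interval, schedule_interval)


def is_run_due(now, last_execution_date, start_date, interval):
    """Hypothetical check: NOW >= MAX(execution_date, start_date) + interval."""
    if last_execution_date is None:
        reference = start_date
    else:
        reference = max(last_execution_date, start_date)
    return now >= reference + interval


def expected_runs_last_24h(interval):
    """How many runs a fixed timedelta interval yields in 24 hours."""
    return timedelta(hours=24) // interval
```

With a summary job like the one described in the question, expected_runs_last_24h(timedelta(hours=6)) gives the number of dag runs to expect per day for a six-hour interval.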

I couldn't find where Airflow stores the parsed value of schedule_interval; however, I did find the code that parses schedule_interval expressions. It's in the utils module (https://github.com/apache/incubator-airflow/blob/master/airflow/utils/dates.py).

Related

How to get Airflow's previous execution date regardless of how the DAG is triggered?

When I trigger a DAG manually, prev_execution_date and execution_date are the same.
echo_exec_date = BashOperator(
    task_id='bash_script',
    bash_command='echo "prev_exec_date={{ prev_execution_date }} execution_date={{ execution_date }}"',
    dag=dag)
results in:
prev_exec_date=2022-06-29T08:50:37.506898+00:00 execution_date=2022-06-29T08:50:37.506898+00:00
They are different if the DAG is triggered automatically by the scheduler.
I would like to have prev_execution_date regardless of triggering it manually or automatically.
When a DAG is triggered manually, the schedule is ignored, and prev_execution_date == next_execution_date == execution_date.
This is explained in the Airflow docs.
This is because the previous/next run of a manual run is not well defined. Suppose you have a daily schedule (say at 00:00) and you trigger a manual run at 13:00. What is the expected next schedule? Should it be daily from 00:00 or daily from 13:00? A DagRun can have only one previous and only one next run. In your scenario it seems you are interested in a case where there can be more than one, or where the manual run "comes between" two scheduled runs. Airflow does not support this; it would really overcomplicate things.
If you want to work around it, you can create a custom macro that checks the run_type, searches for the specific DagRun that you consider the previous one, and returns its execution_date. Note that this might create side effects (overlapping data interval processing, etc.), so you need to verify that the logic you implement makes sense for your specific use case.
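The selection logic of such a macro could look roughly like the sketch below. It is only an illustration: plain (run_type, execution_date) tuples stand in for real DagRun records, the function name previous_scheduled_date is hypothetical, and "previous" is assumed to mean the latest scheduled run strictly before the current one.

```python
from datetime import datetime


def previous_scheduled_date(runs, current_date):
    """Pick the latest scheduled execution date strictly before current_date.

    `runs` is an iterable of (run_type, execution_date) tuples that stands in
    for the DagRun rows a real macro would have to look up.
    """
    candidates = [
        execution_date
        for run_type, execution_date in runs
        if run_type == "scheduled" and execution_date < current_date
    ]
    # No earlier scheduled run exists (for example, the very first run).
    return max(candidates) if candidates else None
```

Whether a manual run should count the latest scheduled run as its predecessor is exactly the judgment call described above, so adjust the filter to your own definition.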

How to pass Yesterday's date as parameter to Airflow Task

I need to pass yesterday's date as parameter to my airflow task. I tried using the following.
{{ prev_execution_date.strftime('%Y-%m-%d') }}. This block is taking today's date when I manually trigger the DAG. Can someone help here.
This is expected.
When a DAG is triggered manually, the schedule is ignored, and prev_ds == next_ds == ds.
You can read more about it in the documentation.
However, for scheduled runs the execution_date is always one cycle behind (see "Problem with start date and scheduled date in Apache Airflow" for more information).
You will need to look in the macros page to find the right macro for your use case.
Use yesterday_ds variable from the task instance template context which is computed based on the execution_date of the DagRun.
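In plain Python terms, "yesterday relative to the execution_date" amounts to the sketch below. The helper name yesterday_of is made up for this example; it only mirrors the idea of subtracting one day from the run's execution_date and formatting the result as YYYY-MM-DD.

```python
from datetime import datetime, timedelta


def yesterday_of(execution_date):
    """Format the day before execution_date as YYYY-MM-DD."""
    return (execution_date - timedelta(days=1)).strftime("%Y-%m-%d")
```

Because this is computed from the run's own execution_date rather than from the previous run, it does not collapse to today's date when the DAG is triggered manually.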

Macros in YYYYMMDDHHMISS format

Requirement:
Get the date value in the format of YYYYMMDDHHMMSS
Code:
TS_HOURS_NODASH = "{{ execution_date.strftime('%Y%m%d%H%M%S') }}"
Output
20200721000000
Expected: Actual hour/minute/seconds
It depends on what you need:
execution_date - the time at which your DAG is expected to run. If your DAG runs on a @daily basis, the time will be exactly 00:00:00.
ti.start_date - the time at which your task instance actually started.
I achieved this using pendulum:
pendulum.now().strftime('%Y%m%d%H%M%S')
execution_date is calculated from the schedule interval. It is the same for all task instances in a DAG run, and it is not the actual datetime at which a task runs.
If you just want the actual start time of the task, why not read the system time at the beginning of your task? It is slightly later than the Airflow task's start_time, but it is much easier.
If you insist on using Airflow's start_time, you need to make some changes to the operator, and that is another story.
Usually it is better to use execution_date as the suffix of a file: it is stable and will not change after the task instance is generated. The actual start time of a task depends on the upstream tasks, changes on retry, and also changes if you clear some task instances and re-run them.
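The difference between the two timestamps can be seen with the standard library alone. In this sketch the execution_date value is hard-coded to match the output in the question; in a real task it would come from Airflow.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"

# A daily schedule produces a midnight execution_date, hence the trailing zeros.
execution_date = datetime(2020, 7, 21)
scheduled_stamp = execution_date.strftime(FMT)

# The wall-clock time at the moment the task body runs; different on every run.
wall_clock_stamp = datetime.now().strftime(FMT)

print(scheduled_stamp)   # stable across retries and re-runs
print(wall_clock_stamp)  # changes with every attempt
```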

Can I add a delay to a schedule in airflow?

I have a pipeline I want to run everyday, but I would like the execution date to lag. That is, on day X I want the execution date to be X-3. Is something like that possible?
It looks like you are using execution_date as a variable in your pipeline logic, for example to process data that is 3 days older than the execution_date. So, instead of making execution_date lag by 3 days, you can subtract the lag from execution_date and use the result in your pipeline logic. Airflow provides a number of ways to do it:
Templates: {{ execution_date - macros.timedelta(days=3) }}. So, for example, the bash_command parameter of BashOperator can be bash_command='echo Processing date: {{ execution_date - macros.timedelta(days=3) }} '
The PythonOperator's python callable: Define the callable something like def func(execution_date, **kwargs): ... and set the PythonOperator's parameter provide_context=True. The execution_date parameter of func() will be set to the current execution date (datetime object) on call. So, inside func() you can do processing_date = execution_date - timedelta(days=3).
The Sensors' context parameter: The poke() and execute() methods of any sensor have the context parameter, a dict with all macros including execution_date. So, in these methods you can do processing_date = context['execution_date'] - timedelta(days=3).
Forcing the execution date to lag simply does not feel right because, according to Airflow's logic, the execution date of the currently running DAG normally lags only when it is catching up (backfilling).
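The PythonOperator variant can be tried outside Airflow, since the callable is an ordinary function. This is a minimal sketch: LAG and the returned string format are choices made for the example, and the call at the bottom simulates what Airflow would pass in.

```python
from datetime import datetime, timedelta

LAG = timedelta(days=3)  # the lag from the question: on day X, process X-3


def func(execution_date, **kwargs):
    """Callable for a PythonOperator; other context entries land in kwargs."""
    processing_date = execution_date - LAG
    return processing_date.strftime("%Y-%m-%d")


# Simulate the call Airflow would make for a run with this execution date.
print(func(execution_date=datetime(2020, 7, 21)))
```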
You can use a TimeSensor to delay the execution of tasks in a DAG. I don't think you can change the actual execution_date unless you can describe the behavior as a cron.
If you want to apply this delay only to a subset of scheduled DAG runs, you could use a BranchPythonOperator to first check whether execution_date is one of the days that need the lag. If it is, take the branch with the sensor. Otherwise, move along without it.
Alternatively, especially if you plan to have this behavior in more than one DAG, you can write a modified version of the sensor. It might look something like this:
def poke(self, context):
    if should_delay(context['execution_date']):
        self.log.info('Checking if the time (%s) has come', self.target_time)
        return timezone.utcnow().time() > self.target_time
    else:
        self.log.info('Not one of those days, just run')
        return True
You can reference the code for the existing time sensor in https://github.com/apache/incubator-airflow/blob/1.10.1/airflow/sensors/time_sensor.py#L38-L40.
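The should_delay helper is left undefined above. One possible shape for it, together with the decision the modified poke() makes, is sketched below as plain functions. The "delay only on Mondays" rule and the name poke_result are assumptions for illustration only.

```python
from datetime import datetime, time


def should_delay(execution_date):
    """Hypothetical rule: only delay runs whose execution date is a Monday."""
    return execution_date.weekday() == 0


def poke_result(execution_date, now_time, target_time):
    """Same decision as the modified poke(), without the sensor machinery."""
    if should_delay(execution_date):
        # Keep poking until the current time has passed the target time.
        return now_time > target_time
    # Not one of the delayed days: let the run proceed immediately.
    return True
```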

Apache Airflow - How to retrieve dag_run data outside an operator in a flow triggered with TriggerDagRunOperator

I set up two DAGs; let's call the first one orchestrator and the second one worker. The orchestrator's job is to retrieve a list from an API and, for each element in this list, trigger the worker DAG with some parameters.
The reason I separated the two workflows is that I want to be able to replay only the "worker" workflows that fail (if one fails, I don't want to replay all the worker instances).
I was able to make things work, but now I see how hard it is to monitor because the task_id values are the same for every run, so I decided to use dynamic task_id values based on a value retrieved from the API by the "orchestrator" workflow.
However, I am not able to retrieve the value from the dag_run object outside an operator. Basically, I would like this to work:
with models.DAG('specific_workflow', schedule_interval=None, default_args=default_dag_args) as dag:
    name = context['dag_run'].name
    hello_world = BashOperator(task_id='hello_{}'.format(name), bash_command="echo Hello {{ dag_run.conf.name }}", dag=dag)
    bye = BashOperator(task_id='bye_{}'.format(name), bash_command="echo Goodbye {{ dag_run.conf.name }}", dag=dag)
    hello_world >> bye
But I am not able to define this "context" object. I can, however, access it from inside an operator (PythonOperator and BashOperator, for instance).
Is it possible to retrieve the dag_run object outside an operator?
Yes, it is possible. Here is what I tried and what worked for me.
The following code block shows the different ways to pass the configuration values directly to operators:
pyspark_task = DataprocSubmitJobOperator(
    task_id="task_0001",
    job=PYSPARK_JOB,
    location=f"{{{{dag_run.conf.get('dataproc_region','{config_data['cluster_location']}')}}}}",
    project_id="{{dag_run.conf['dataproc_project_id']}}",
    gcp_conn_id="{{dag_run.conf.gcp_conn_id}}"
)
So you can use either form:
"{{dag_run.conf.field_name}}" or "{{dag_run.conf['field_name']}}"
Or, if the configuration field is optional and you want to fall back to a default value:
f"{{{{dag_run.conf.get('field_name', '{local_variable['field_name_0002']}')}}}}"
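The quadruple braces can look odd, so here is a small sketch of what Python does with them before Airflow ever sees the string. The config_data dict and its value are hypothetical placeholders; only the f-string escaping is being demonstrated.

```python
# Hypothetical local defaults, standing in for whatever config the DAG file loads.
config_data = {"cluster_location": "us-central1"}

# In an f-string, "{{" and "}}" are escapes for literal braces, so four braces
# collapse to the two that Jinja expects, while the single-braced part is
# filled in by Python when the DAG file is parsed.
location_template = (
    f"{{{{dag_run.conf.get('dataproc_region','{config_data['cluster_location']}')}}}}"
)

print(location_template)
```

The resulting string is an ordinary Jinja template with the default baked in, which is then rendered at run time against dag_run.conf.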
I don't think it's easily possible currently. For example, as part of the worker run process, the DAG is retrieved without any TaskInstance context provided besides where to find the DAG: https://github.com/apache/incubator-airflow/blob/f18e2550543e455c9701af0995bc393ee6a97b47/airflow/bin/cli.py#L353
The context is injected later: https://github.com/apache/incubator-airflow/blob/c5f1c6a31b20bbb80b4020b753e88cc283aaf197/airflow/models.py#L1479
The run_id of the DAG run would be a good place to store this information.
