Given a DAG with a start_date, which is run on a specific date, how is the execution_date of the corresponding DagRun defined?
I have read the documentation but one example is confusing me:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval': '#hourly',
}
dag = DAG('tutorial', catchup=False, default_args=default_args)
Assuming that the DAG is run on 2016-01-02 at 6 AM, the first DAGRun will have an execution_date of 2016-01-01 and, as said in the documentation
the next one will be created just after midnight on the morning of
2016-01-03 with an execution date of 2016-01-02
Here is how I would have set the execution_date:
since the DAG has its schedule_interval set to every hour and is run on 2016-01-02 at 6 AM, the execution_date of the first DAGRun would have been set to 2016-01-02 at 7 AM, the second to 2016-01-02 at 8 AM, etc.
This is just how scheduling works in Airflow. I think it makes sense to do it the way that Airflow does when you think about how normal ETL batch processes run and how you use the execution_date to pick up delta records that have changed.
Let's say that we want to schedule a batch job to run every night to extract new records from some source database. We want all records that were changed from 1/1/2018 onwards (we want all records changed on the 1st too). To do this you would set the start_date of the DAG to 1/1/2018; the scheduler will run a bunch of times, but when it gets to 2/1/2018 (or very shortly after) it will run our DAG with an execution_date of 1/1/2018.
Now we can send an SQL statement to the source database which uses the execution_date as part of the SQL using JINJA templating. The SQL would look something like:
SELECT row1, row2, row3
FROM table_name
WHERE timestamp_col >= '{{ execution_date }}' AND timestamp_col < '{{ next_execution_date }}'
I think when you look at it this way it makes more sense although I admit I had trouble trying to understand this at the beginning.
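To make the templating concrete, here is a minimal sketch (the dag_id, task_id and table name are all made up for illustration) of dropping those macros into a templated field. A BashOperator is used only because its bash_command is templated and easy to inspect in the rendered-template view; in practice you would hand the rendered SQL to your database operator or hook.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG('delta_extract_example', start_date=datetime(2018, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    # The rendered command shows exactly which interval this run covers.
    extract = BashOperator(
        task_id='show_extract_window',
        bash_command="""
            echo "SELECT row1, row2, row3 FROM table_name
            WHERE timestamp_col >= '{{ execution_date }}'
            AND timestamp_col < '{{ next_execution_date }}'"
        """,
    )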
Here is a quote from the documentation https://airflow.apache.org/scheduler.html:
The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Also, it's worth noting that the example you're looking at from the documentation describes the behaviour of the schedule when backfilling is disabled. If backfilling were enabled, there would be a DAG run created for every 1-hour interval between 1/12/2015 and the current date if the DAG had never been run before.
We get this question a lot from analysts writing airflow dags.
Each dag run covers a period of time with a start & end.
The start = execution_date
The end = when the dag run is created and executed (next_execution_date)
An example that should help:
Schedule interval: '0 0 * * *' (run daily at 00:00:00 UTC)
Start date: 2019-10-01 00:00:00
10/1 00:00                     10/2 00:00
*<---------------------------->*
<        your 1st dag run     >
^ execution_date
            next_execution_date^
                               ^ when this 1st dag run is actually
                                 created by the scheduler
As @simond pointed out in a comment, "execution_date" is a poor name for this variable. It is neither a date nor does it represent when the run was executed. Alas, we're stuck with what the creators of Airflow gave us... I find it helpful to just use next_execution_date if I want the datetime at which the dag run will actually execute my code.
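For example, a small sketch (the dag_id and task name are made up) of pulling both ends of the interval out of the task context in Python:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_for_interval(**context):
    # execution_date marks the START of the interval; next_execution_date is
    # when the scheduler actually fires this run (the END of the interval).
    start = context['execution_date']
    end = context['next_execution_date']
    print("processing data from {} to {}".format(start, end))

with DAG('interval_demo', start_date=datetime(2019, 10, 1),
         schedule_interval='0 0 * * *', catchup=False) as dag:
    PythonOperator(
        task_id='process_interval',
        python_callable=run_for_interval,
        provide_context=True,  # only needed on Airflow 1.x
    )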
I have a DAG that I need to backfill for many dates. Due to limited resources, I think the best option is to create a list of dates between a start and end date and run the task in a for loop.
Should I have a function that returns all the dates (after formatting) to a variable and run the DAG task in a for loop, or should the list of dates be part of a function that runs as a task and then somehow uses XCom to send the list of dates? How can this be assigned to a variable with xcom_pull without requiring a task?
Let me focus on the first part of the question, and not on the way you are suggesting to solve it:
I have a DAG that I need to backfill for many dates.
But you don't want to run all the dates at the same time because of limited resources!
I would look into utilizing the max_active_runs_per_dag configuration variable that airflow provides out of the box. You don't need to create any additional logic.
This way you can limit the backfill process to a certain number of DAG runs in parallel. For example, if I want to only have 2 dag runs at a time:
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2; airflow backfill -s {start_date} -e {end_date} --reset-dagruns {dag_id}
Run with --dry-run if you want to get a feel of what's happening before you execute the actual backfill command.
Hope that helps.
1) You don't want your DAG to exhaust limited resources:
Limit your dag with max_active_runs or max_active_tasks
DAG(
    "MyDAG",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    tags=["benchmark"],
    max_active_runs=1,
    max_active_tasks=1
)
Limit your resource utilisation using pools
MyOperator(
    task_id="MyTask",
    ...
    pool="MyPool"
)
In Admin → Pools, set the capacity to 1 or more.
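If you prefer the CLI over the Admin UI, a sketch on the Airflow 2.x CLI (the description text is just an example):
airflow pools set MyPool 1 "serialize MyTask runs"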
2) You want to programmatically control your DAG:
Create a second DAG.
In this new DAG add a TriggerDagRunOperator
When using the TriggerDagRunOperator use the conf argument to carry information to your main DAG (in your case, you would carry dates)
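A minimal sketch of that controller pattern (the dag ids, dates and conf key are all assumptions for illustration):
from datetime import datetime
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG('MyDAG_controller', start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # One trigger task per date we want to (re)process in MyDAG.
    for date_str in ['2021-01-01', '2021-01-02', '2021-01-03']:
        TriggerDagRunOperator(
            task_id='trigger_{}'.format(date_str.replace('-', '_')),
            trigger_dag_id='MyDAG',
            conf={'processing_date': date_str},  # read in MyDAG via dag_run.conf
        )
Limiting max_active_runs on MyDAG (as in section 1) then still caps how many of those triggered runs execute at once.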
I've read multiple examples about schedule_interval and start_date, and the Airflow docs multiple times as well, and I still can't wrap my head around:
How do I get my DAG to execute at a specific time each day? E.g. say it's now 9:30 AM, I deploy my DAG and I want it to get executed at 10:30.
I have tried
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=days_ago(0),
    tags=["goodie"]) as dag:
but for some reason it wasn't run today. I have tried different start_dates, also start_date = datetime.datetime(2021, 6, 23), but it does not get executed.
If I replace days_ago(0) with days_ago(1), it is behind one day all the time, i.e. it does not get run today but did run yesterday.
Isn't there an easy way to say "I deploy my DAG now, and I want to get it executed with this cron-syntax" (which I assume is what most people want) instead of calculating an execution time, based on start_date, schedule_interval and figuring out, how to interpret it?
If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time
It's not behind. You are simply confusing Airflow's scheduling mechanism with cron jobs. With cron you just provide a cron expression and it schedules accordingly - this is not how it works in Airflow.
In Airflow the schedule is calculated as start_date + schedule_interval, and Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: today you are processing yesterday's data, so at the end of this day you want to start a process that will go over yesterday's records.
As a rule - NEVER use dynamic start date.
Setting:
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=datetime(2021, 6, 23, 10, 0),  # 2021-06-23 10:00
    tags=["goodie"]) as dag:
This means that the first run will start on 2021-06-24 10:00, and that run's execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00, and that run's execution_date will be 2021-06-24 10:00.
Since this is a source of confusion for many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN to run from WHAT interval to consider for that run. It will be available in Airflow 2.3.0.
UPDATE for Airflow>=2.3.0:
AIP-39 Richer scheduler_interval has been completed and released
It added Timetable support, so you can customize DAG scheduling with Timetables (see the "Customizing DAG Scheduling with Timetables" guide in the docs).
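For example, a sketch of the "run at 10:00" case from above using a timetable (assuming Airflow 2.3+, where CronTriggerTimetable is available):
from datetime import datetime
from airflow import DAG
from airflow.timetables.trigger import CronTriggerTimetable

with DAG(
    "test",
    start_date=datetime(2021, 6, 23),
    # Fires AT 10:00 UTC instead of one interval after it, so there is no
    # start_date + schedule_interval arithmetic to reason about.
    timetable=CronTriggerTimetable("0 10 * * *", timezone="UTC"),
    catchup=False,
) as dag:
    ...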
# define the instance for processing the training data
dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
I have code like this, so in my opinion this DAG should execute once every hour.
But in the Airflow web UI, I see many run dates in the Schedule part; runs are being executed all the time.
In particular, in the Tree View I can see that all the blocks were filled within one hour!
I am confused about schedule_interval. Any ideas on how to fix this?
The FIRST DAG run will start on the date you define as start_date. From that point on, the scheduler creates new DagRuns based on your schedule_interval, and the corresponding task instances run as your dependencies are met.
You can read more about it here.
I see now: the problem comes from the inconsistency between the real (current) time and start_date. If the start_date is behind the real time, the system will backfill the past intervals.
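If that historical backfill is not wanted, a small sketch (reusing the DAG from the question; dag_id and default_args are assumed to be defined as before) is to turn catch-up off:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
    catchup=False,  # only schedule the most recent hourly interval, not the whole backlog
)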
I'm new to Airflow.
My goal is to run a dag, on a daily basis, starting 1 hour from now.
I'm truly misunderstanding the airflow schedule "end-of-interval invoke" rules.
From the docs (Airflow Docs):
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
I set schedule_interval as followed:
schedule_interval="00 15 * * *"
and start_date as followed:
start_date=datetime(year=2019, month=8, day=7)
My assumption was that if it's now 14:00 (UTC) and today's date is 07-08-2019, then my dag would be executed in exactly one hour.
However, my dag is not starting at all.
There is a whole FAQ page about Airflow jobs not being scheduled: https://airflow.apache.org/faq.html
The key thing to notice here is:
The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
To my understanding, you want to trigger a task with start_date=datetime(year=2019, month=8, day=7) at 15:00 UTC daily. schedule_interval="00 15 * * *" means you run the task every day at 15:00 UTC. According to the docs, the scheduler triggers your task after start_date + schedule_interval, so Airflow won't trigger it until the next day, which is August 8th 2019 15:00:00 UTC. Alternatively, you can change the day to the 6th. It might be easier to understand this the ETL way: you can only process the data for a given period after that period has passed. So August 7th 2019 15:00:00 UTC is your start point, and you need to wait until August 8th 2019 15:00:00 UTC to run the task covering that period.
Also, note that Airflow has both execution_date and start_date; you can find more here.
schedule_interval="00 15 * * *"
start_date=07-08-2019
The 1st run will be on 08-08-2019 at 15:00,
if you created this dag before 15:00 on 07-08-2019.
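So, for the original goal (the first run firing today at 15:00 UTC), a sketch is to push start_date back by one interval; the dag_id here is made up:
from datetime import datetime
from airflow import DAG

with DAG(
    'daily_at_15',
    schedule_interval='00 15 * * *',
    start_date=datetime(2019, 8, 6),  # one interval before the desired first run
    catchup=False,
) as dag:
    # First interval: 2019-08-06 15:00 -> 2019-08-07 15:00, so the scheduler
    # triggers the first run soon after 2019-08-07 15:00 UTC.
    ...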
I'm working with: EXEC_DATE = '{{ macros.ds_add(ds, 1) }}' This gives me the execution date but not the hour.
I want to be able to save this value as YYYY-MM-DD HH:MM into variable called process_last_run.
Basically, I want to read the variable at the beginning of the run and write to it at the end of the DAG. This variable indicates what the running time of the last DAG run was.
How can I do that?
You can do this with the macro execution_date. However, be advised that this is a poorly named concept in Airflow: it represents the beginning of a scheduled interval period. It will not change within the same dag-run, even if the task is re-run manually. It's there to support idempotent data updates, which frankly is the optimal way to approach data pipelines. In your case, though, you said elsewhere that your data-fetching API takes a start date and provides all the data up to the current moment, which isn't conducive to being processed idempotently, though you could throw away data after a specified cut-off.
So instead you might just take the date after your processing of data has completed, and store that for later. You can store it in an Airflow Variable. Note, though, that the time you get out of the date command shown below is going to be later than the last timestamp of the data you actually got from within your process_data API call for all data from a start date. So it might be better if your processing step outputs the actual last date and time of the data processed as the last line of stdout (which is captured by BashOperator for xcom).
E.g.
from airflow.models import Variable, DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

def pyop_fun(**context):
    # You could have used execution_date here and in the next operator
    # to make the operator rerun safe.
    # date_string = context['execution_date'].strftime('%Y-%m-%d %H:%M')
    # But elsewhere you said your api is always giving you the up-to-the-minute data.
    # So maybe getting the date from the prior task would work better for you.
    Variable.set(
        'process_last_run',
        context['task_instance'].xcom_pull(task_ids='process_data'))

with DAG(…) as dag:
    pyop = PythonOperator(
        task_id='set_process_last_run',
        python_callable=pyop_fun,
        provide_context=True, …)
    shop = BashOperator(
        task_id='process_data',
        bash_command='''
            process_data "{{var.value.process_last_run}}";
            date -u +%Y-%m-%d\ %H:%M''',
        xcom_push=True, …)
    shop >> pyop

# Because the last output line of a BashOperator is pushed into xcom for that
# task id with the default key, it can be pulled by the PythonOperator and
# stored in a variable.
There's an {{ execution_date }} variable in Jinja you can use to get the execution date for the current DAG run.
More info: Airflow - Macros
If you're looking to track something like start time or end time of execution or duration (in seconds) of a particular task instance, that info is stored in the TaskInstance model.
class TaskInstance(Base, LoggingMixin):
    ...
    start_date = Column(UtcDateTime)
    end_date = Column(UtcDateTime)
    duration = Column(Float)
https://github.com/apache/incubator-airflow/blob/4c30d402c4cd57dc56a5d9dcfe642eadc77ec3ba/airflow/models.py#L877-L879
Also, if you wanted to compute the running time of the entire DAG, you could get that from querying the Airflow metadata database around these fields for a particular DAG run.
If you're doing this in your Python code already, you can access the execution_date field on the task instance itself as well instead of using the template layer.
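For example, a sketch (the task name is made up and a dag object is assumed to exist) of reading those fields from inside a PythonOperator instead of through the template layer:
from airflow.operators.python_operator import PythonOperator

def log_timing(**context):
    ti = context['task_instance']
    # execution_date = start of the scheduled interval for this DAG run.
    print("execution_date:", context['execution_date'])
    # start_date on the TaskInstance = when this task instance actually began;
    # end_date and duration are only filled in once the task finishes.
    print("task instance started at:", ti.start_date)

log_task = PythonOperator(
    task_id='log_timing',
    python_callable=log_timing,
    provide_context=True,  # not needed on Airflow >= 2.0
    dag=dag,               # assumes a `dag` object defined elsewhere
)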
Variables
You can write to and read from Airflow Variables like so:
from airflow.models import Variable

Variable.set('my_key', 'some value')
my_val = Variable.get('my_key')
You can also perform CRUD operations on variables with the CLI.
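For example, on the Airflow 2.x CLI (older 1.10 releases used flag-style options instead of sub-commands):
airflow variables set my_key "some value"    # create or update
airflow variables get my_key                 # read
airflow variables list                       # list keys
airflow variables delete my_key              # delete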
Stats
Another thing you might keep in mind if you find yourself working with stats like task duration a lot is Airflow's StatsD integration which gathers metrics on Airflow itself at execution time. You can have these metrics go into a push-based system like StatsD itself, or into a pull-based system like Prometheus / Grafana by using statsd_exporter.
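A sketch of what turning that on can look like in airflow.cfg on Airflow 2.x (the equivalent keys lived under [scheduler] in older 1.10 releases; the host and port below are placeholders for your StatsD or statsd_exporter endpoint):
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow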