Airflow schedule_interval and the active DAG runs - airflow

# define the DAG instance for processing the training data
dag = DAG(
    dag_id,
    start_date=datetime(2019, 11, 14),
    description='Reading training logs from the corresponding location',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)
I have code like this, so in my opinion this DAG should execute once every hour.
But in the Airflow web UI I see many runs listed under the Schedule part; the DAG keeps executing all the time.
In particular, in the Tree View I could see that all the blocks were filled within one hour!
I am confused about how schedule_interval works. Any ideas on how to fix this?

The FIRST DAG run starts on the date you define in start_date. From that point on, the scheduler creates new DagRuns based on your schedule_interval, and the corresponding task instances run as your dependencies are met.
You can read more about it here.

I see now that the problem comes from the inconsistency between the real (current) time and start_date. If start_date is behind the current time, Airflow will backfill the past intervals.
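If those backfilled runs are not wanted, a common way to suppress them is to disable catch-up on the DAG. A minimal sketch, keeping the same start_date and interval (the dag_id string here is a placeholder for the original variable):

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'training_logs',                      # hypothetical dag_id
    start_date=datetime(2019, 11, 14),
    schedule_interval=timedelta(hours=1),
    catchup=False,                        # skip backfilling the intervals between start_date and now
)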

Related

DAG backfilling for many dates

I have a DAG that I need to backfill for many dates. Due to limited resources, I think the best option is to create a list of dates between a start and end date and run the task in a for loop.
Should I have a function that returns all the dates (after formatting) to a variable and run the DAG task in a for loop, or should the list of dates be part of a function that runs as a task and then somehow uses XCom to send the list of dates? How can this be assigned to a variable with xcom_pull without requiring a task?
Let me focus on the first part of the question, and not on the way you are suggesting to solve it:
I have a DAG that I need to backfill for many dates.
But you don't want to run all the dates at the same time because of limited resources!
I would look into utilizing the max_active_runs_per_dag configuration variable that Airflow provides out of the box. You don't need to create any additional logic.
This way you can limit the backfill process to a certain number of DAG runs in parallel. For example, to allow only 2 DAG runs at a time:
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2; airflow backfill -s {start_date} -e {end_date} --reset-dagruns {dag_id}
Run with --dry-run if you want to get a feel of what's happening before you execute the actual backfill command.
Hope that helps.
1) You don't want your DAG to exhaust limited resources:
Limit your DAG with max_active_runs or max_active_tasks:
DAG(
    "MyDAG",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    tags=["benchmark"],
    max_active_runs=1,
    max_active_tasks=1
)
Limit your resource utilisation using pools
MyOperator(
    task_id="MyTask",
    ...
    pool="MyPool"
)
In Admin -> Pools, set the pool's capacity to 1 or more.
2) You want to programmatically control your DAG:
Create a second DAG.
In this new DAG add a TriggerDagRunOperator.
When using the TriggerDagRunOperator, use the conf argument to carry information to your main DAG (in your case, you would carry dates); see the sketch below.
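A minimal sketch of that controller pattern, assuming Airflow 2.x; the controller DAG id, the target DAG id "my_main_dag" and the list of dates are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    "backfill_controller",               # hypothetical controller DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,              # triggered manually, not on a schedule
    catchup=False,
) as dag:
    for target_date in ["2021-01-01", "2021-01-02", "2021-01-03"]:   # placeholder date list
        TriggerDagRunOperator(
            task_id=f"trigger_{target_date}",
            trigger_dag_id="my_main_dag",        # hypothetical id of the DAG being backfilled
            conf={"target_date": target_date},   # read in the main DAG via dag_run.conf
        )

Combined with max_active_runs on the main DAG, the triggered runs queue up instead of all executing at once.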

Run DAG at specific time each day

I've read multiple examples about schedule_interval and start_date, and the Airflow docs, multiple times as well, and I still can't wrap my head around it:
How do I get my DAG to execute at a specific time each day? E.g. say it's now 9:30 (AM), I deploy my DAG and I want it to get executed at 10:30.
I have tried
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=days_ago(0),
    tags=["goodie"]) as dag:
but for some reason it wasn't run today. I have tried different start_dates, also start_date = datetime.datetime(2021, 6, 23), but it does not get executed.
If I replace days_ago(0) with days_ago(1) it is behind by one day all the time, i.e. it does not get run today but did run yesterday.
Isn't there an easy way to say "I deploy my DAG now, and I want it to be executed with this cron syntax" (which I assume is what most people want), instead of calculating an execution time based on start_date and schedule_interval and figuring out how to interpret it?
If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time
It's not behind. You are simply confusing Airflow's scheduling mechanism with cron jobs. With cron you just provide a cron expression and it schedules accordingly; that is not how it works in Airflow.
In Airflow the schedule is calculated from start_date + schedule_interval, and Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: today you are processing yesterday's data, so at the end of the day you start a process that goes over yesterday's records.
As a rule: NEVER use a dynamic start_date.
Setting:
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=datetime(2021, 6, 23, 10, 0),  # 2021-06-23 10:00
    tags=["goodie"]) as dag:
This means that the first run will start on 2021-06-24 10:00, and that run's execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00, with an execution_date of 2021-06-24 10:00.
Since this is a source of confusion for many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN to run from WHAT interval to consider with that run. It will be available in Airflow 2.3.0.
UPDATE for Airflow >= 2.3.0:
AIP-39 Richer scheduler_interval has been completed and released.
It added Timetable support, so you can customize DAG scheduling with Timetables.
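For example, a minimal sketch assuming a recent 2.x release that ships the built-in CronTriggerTimetable (check your version), which triggers the run at the cron time itself rather than one interval later:

from datetime import datetime

from airflow import DAG
from airflow.timetables.trigger import CronTriggerTimetable

with DAG(
    "test",
    start_date=datetime(2021, 6, 23),
    timetable=CronTriggerTimetable("0 10 * * *", timezone="UTC"),  # run at 10:00 UTC each day
    catchup=False,
) as dag:
    ...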

How can I author a DAG which has a dynamic set of tasks relative to the execution date?

We have a DAG which pulls in some data from an ad platform. These ads are organized into campaigns. Our goal is to pull in the high-level metrics for these campaigns. To do so we first need to get the list of active campaigns for the given execution date--fortunately the ad platform's API makes this trivial, provided we know the time range we'd like to inquire about.
Currently our DAG is structured to go and fetch these campaigns and then to store them in S3 and finally Redshift. We then query Redshift before setting up the subsequent tasks which pull the data for each campaign. This is the gross part. We could also look in S3, but the trouble is the keys are templated with the value of the ds macro. There doesn't seem to be a way to know that value when constructing the DAG itself.
Our current approach also isn't aware of the execution date so it always queries all campaigns even if those campaigns aren't active for the time period we're interested in.
To make this a little more concrete, here's what that DAG looks like today:
Another approach would be to roll this all up into a single operator that encapsulates getting the set of campaigns for the current execution date and then getting the metrics for each of those campaigns. We avoided this because that seems to preclude pulling the data in parallel via separate tasks per campaign.
How can we author this DAG such that we maintain the parallelization offered by dynamically querying the Redshift tables for campaigns but the campaigns are correctly constrained to the execution date?
I don't believe this is possible. The DAG can only render in one configuration defined by the DAG's python definition. You won't be able to control which version of the DAG renders as a function of execution date, so you won't be able to look back at how a DAG should render in the past, for instance. If you want the current DAG to render based on execution date then you can possibly write some logic in your DAG's python definition.
Depending on how you orchestrate your Airflow jobs, you may be able to have a single operator as you described, but have that single operator then kick off parallel queries on Redshift and terminate when all queries are complete.
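As a rough sketch of that single-operator approach, the fan-out can happen inside one task with a thread pool; get_active_campaigns and fetch_campaign_metrics below are hypothetical placeholders for your Redshift query and ad-platform call:

from concurrent.futures import ThreadPoolExecutor

def pull_all_campaign_metrics(**context):
    ds = context["ds"]                                   # execution date supplied by Airflow
    campaigns = get_active_campaigns(ds)                 # hypothetical: campaigns active on ds, from Redshift
    with ThreadPoolExecutor(max_workers=8) as pool:
        metrics = list(pool.map(lambda c: fetch_campaign_metrics(c, ds), campaigns))
    return metrics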
A caveat: in the interest of time I am going to piece together ideas and code examples from third-party sources. I will give credit to those sources so you can take a look at the context and documentation. An additional caveat: I have not been able to test this, but I am 99% certain this will work.
The tricky part of this whole operation will be figuring out how to handle campaigns that might have ended and then started back up. Airflow is not going to like a DAG with a moving start or stop date. Moving the stop date might work a little better; moving the start date of the DAG does not work at all. That said, if there are campaigns that get extended, you should be able to move the end date as long as there are no gaps in continuity. If you have a campaign that lapses and then gets extended with a couple of non-active days in between, you will probably want to figure out how to make those two look like unique campaigns to Airflow.
First step
You will want to create a Python script that calls your database and returns the relevant details from your campaigns. Assuming it is in MySQL, it will look something like this example connection from the PyMySQL package documentation:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        # Create a new record
        sql = "INSERT INTO `users` (`email`, `password`) VALUES (%s, %s)"
        cursor.execute(sql, ('webmaster@python.org', 'very-secret'))

    # connection is not autocommit by default. So you must commit to save
    # your changes.
    connection.commit()

    with connection.cursor() as cursor:
        # Read a single record
        sql = "SELECT `id`, `password` FROM `users` WHERE `email`=%s"
        cursor.execute(sql, ('webmaster@python.org',))
        result = cursor.fetchall()
finally:
    connection.close()
Second step
You will want to iterate through that result set and create your DAGs dynamically, similar to this example from Astronomer.io:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


# build a DAG for each campaign returned by the query above
for campaign in result:  # result is the PyMySQL query result from the first step
    dag_id = 'campaign_{}'.format(str(campaign['id']))  # assuming each row exposes an 'id' column
    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)}
    schedule = '@daily'
    dag_number = campaign['id']
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
If you house all of this code in a single file, it will need to go in your dags folder. When a new campaign shows up in your database, a DAG will be created for it, and you can use your subdag architecture to run exactly the same set of steps/tasks with parameters pulled from that MySQL database. To be safe, and to keep recent campaigns in your DAG list, I would write the MySQL query with a date buffer; that way DAGs for campaigns that ended recently are still in your list. The day one of these DAGs ends you should populate the end_date argument of the DAG, as sketched below.
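A minimal sketch of that retirement step, with placeholder DAG id and dates; once end_date has passed, the scheduler stops creating new runs for the DAG:

from datetime import datetime
from airflow import DAG

dag = DAG(
    'campaign_123',                     # hypothetical campaign DAG id
    start_date=datetime(2018, 1, 1),
    end_date=datetime(2018, 3, 31),     # set when the campaign ends
    schedule_interval='@daily',
)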

How to write/read time stamp from a variable in airflow?

I'm working with: EXEC_DATE = '{{ macros.ds_add(ds, 1) }}'. This gives me the execution date but not the hour.
I want to be able to save this value as YYYY-MM-DD HH:MM into a variable called process_last_run.
Basically, read the variable at the beginning of the run and write to it at the end of the DAG. This variable indicates what the running time of the last DAG run was.
How can I do that?
You can do this with the macro execution_date. However, be advised that this is a poorly named concept in Airflow: it represents the beginning of the scheduled interval period. It will not change within the same DAG run, even if the task is re-run manually. It's there to support idempotent data updates, which frankly is the optimal way to approach data pipelines. In your case, though, you said elsewhere that your data-fetching API takes a start date and provides all the data up to the current time, which isn't conducive to being processed idempotently, though you could throw away data after a specified cut-off.
So instead you might just take the date after your processing of data has completed, and store that for later. You can store it in an Airflow Variable. Note, though, that the time you get out of the date command shown below will be later than the last timestamp of the data you actually received from your process_data API call for all data from a start date. So it might be better if your processing step outputs the actual last date and time of the data processed as the last line of stdout (which is captured by the BashOperator for XCom).
E.G.
from airflow.models import Variable, DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime


def pyop_fun(**context):
    # You could have used execution_date here and in the next operator
    # to make the operator rerun safe.
    # date_string = context['execution_date'].strftime('%Y-%m-%d %H:%M')
    # But elsewhere you said your api is always giving you the up-to-the-minute data.
    # So maybe getting the date from the prior task would work better for you.
    Variable.set(
        'process_last_run',
        context['task_instance'].xcom_pull(task_ids='process_data'))


with DAG(…) as dag:
    pyop = PythonOperator(
        task_id='set_process_last_run',
        python_callable=pyop_fun,
        provide_context=True, …)
    shop = BashOperator(
        task_id='process_data',
        bash_command='''
            process_data "{{var.value.process_last_run}}";
            date -u +%Y-%m-%d\ %H:%M''',
        xcom_push=True, …)
    shop >> pyop

# Because the last output line of a BashOperator is pushed into xcom for that
# task id with the default key, it can be pulled by the PythonOperator and
# stored in a variable.
There's an {{ execution_date }} variable in Jinja you can use to get the execution date for the current DAG run.
More info: Airflow - Macros
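For instance, a tiny sketch of rendering the macro into a templated field; the DAG and task names here are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG('macro_demo', start_date=datetime(2021, 1, 1), schedule_interval='@daily') as dag:
    log_exec_date = BashOperator(
        task_id='log_execution_date',
        bash_command="echo 'execution_date is {{ execution_date }}'",  # rendered by Jinja at run time
    )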
If you're looking to track something like start time or end time of execution or duration (in seconds) of a particular task instance, that info is stored in the TaskInstance model.
class TaskInstance(Base, LoggingMixin):
    ...
    start_date = Column(UtcDateTime)
    end_date = Column(UtcDateTime)
    duration = Column(Float)
https://github.com/apache/incubator-airflow/blob/4c30d402c4cd57dc56a5d9dcfe642eadc77ec3ba/airflow/models.py#L877-L879
Also, if you wanted to compute the running time of the entire DAG, you could get that from querying the Airflow metadata database around these fields for a particular DAG run.
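As a rough sketch of such a query (assuming a recent Airflow where the ORM session helper lives at airflow.utils.session; the dag_id is a placeholder):

from airflow.models import DagRun
from airflow.utils.session import create_session

with create_session() as session:
    runs = session.query(DagRun).filter(DagRun.dag_id == 'my_dag').all()  # 'my_dag' is hypothetical
    for run in runs:
        # start_date and end_date of the DagRun give the wall-clock running time of the whole run
        print(run.execution_date, run.start_date, run.end_date)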
If you're doing this in your Python code already, you can access the execution_date field on the task instance itself as well instead of using the template layer.
Variables
You can write to and read from Airflow Variables like so:
Variable.set('my_key', 'some value')
my_val = Variable.get('my_key')
You can also perform CRUD operations on variables with the CLI.
Stats
Another thing you might keep in mind if you find yourself working with stats like task duration a lot is Airflow's StatsD integration which gathers metrics on Airflow itself at execution time. You can have these metrics go into a push-based system like StatsD itself, or into a pull-based system like Prometheus / Grafana by using statsd_exporter.

How is the execution_date of a DagRun set?

Given a DAG having a start_date, which is run at a specific date, how is the execution_date of the corresponding DagRun defined?
I have read the documentation but one example is confusing me:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval': '#hourly',
}
dag = DAG('tutorial', catchup=False, default_args=default_args)
Assuming that the DAG is run on 2016-01-02 at 6 AM, the first DAGRun will have an execution_date of 2016-01-01 and, as said in the documentation
the next one will be created just after midnight on the morning of
2016-01-03 with an execution date of 2016-01-02
Here is how I would have expected the execution_date to be set:
with the DAG having its schedule_interval set to every hour and being run on 2016-01-02 at 6 AM, the execution_date of the first DagRun would have been set to 2016-01-02 at 7 AM, the second to 2016-01-02 at 8 AM, etc.
This is just how scheduling works in Airflow. I think it makes sense to do it the way that Airflow does when you think about how normal ETL batch processes run and how you use the execution_date to pick up delta records that have changed.
Let's say that we want to schedule a batch job to run every night to extract new records from some source database. We want all records that were changed from 1/1/2018 onwards (including records changed on the 1st). To do this you would set the start_date of the DAG to 1/1/2018; the scheduler will run a bunch of times, but when it gets to 2/1/2018 (or very shortly after) it will run our DAG with an execution_date of 1/1/2018.
Now we can send a SQL statement to the source database which uses the execution_date as part of the SQL, via Jinja templating. The SQL would look something like:
SELECT row1, row2, row3
FROM table_name
WHERE timestamp_col >= '{{ execution_date }}' AND timestamp_col < '{{ next_execution_date }}'
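As a rough sketch of wiring that templated query into a task (assuming a Postgres source; the connection id is a placeholder, and the import path differs slightly between Airflow 1.x and the 2.x provider package):

from airflow.operators.postgres_operator import PostgresOperator

extract = PostgresOperator(
    task_id='extract_changed_rows',
    postgres_conn_id='source_db',        # hypothetical connection id
    sql="""
        SELECT row1, row2, row3
        FROM table_name
        WHERE timestamp_col >= '{{ execution_date }}'
          AND timestamp_col < '{{ next_execution_date }}'
    """,
    dag=dag,
)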
I think when you look at it this way it makes more sense although I admit I had trouble trying to understand this at the beginning.
Here is a quote from the documentation https://airflow.apache.org/scheduler.html:
The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
It's also worth noting that the example you're looking at from the documentation describes the behaviour of the scheduler when backfilling is disabled. If backfilling were enabled, there would be a DAG run created for every 1-hour interval between 1/12/2015 and the current date, assuming the DAG had never been run before.
We get this question a lot from analysts writing Airflow DAGs.
Each DAG run covers a period of time with a start and an end.
The start = execution_date
The end = when the DAG run is created and executed (next_execution_date)
An example that should help:
Schedule interval: '0 0 * * *' (run daily at 00:00:00 UTC)
Start date: 2019-10-01 00:00:00
10/1 00:00 10/2 00:00
*<------------------>*
< your 1st dag run >
^ execution_date
next_execution_date^
^when this 1st dag run is actually created by the scheduler
As @simond pointed out in a comment, "execution_date" is a poor name for this variable. It is neither a date nor does it represent when the run was executed. Alas, we're stuck with what the creators of Airflow gave us... I find it helpful to just use next_execution_date if I want the datetime at which the DAG run will execute my code.
