Schedule airflow dag with delay - airflow

I’m trying to create an airflow dag that runs an sql query to get all of yesterday’s data, but I want the execution date to be delayed from the data_interval_end.
The data interval ends at midnight, but it takes a few hours for the data itself to be ready for querying. This is why I want the dag to run only 4 hours after that.
For example:
data_interval_start = 2022-01-01 00:00:00
data_interval_end = 2022-01-02 00:00:00
wanted dag execution = 2022-01-02 04:00:00
How can I achieve this?
Thanks!
So far I just adjusted the sql query itself with date_trunc, but I hope there is a solution to keep the query without this function.
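The arithmetic behind the requested delay can be sketched in plain Python. This is only an illustration: `earliest_start` is a hypothetical helper, not an Airflow API. Inside a DAG, a sensor task that waits for an offset after the data interval (TimeDeltaSensor is one candidate, to be checked against your Airflow version) would play this role.

```python
from datetime import datetime, timedelta, timezone


def earliest_start(data_interval_end: datetime, delay: timedelta) -> datetime:
    """Return the earliest moment the query should run (hypothetical helper)."""
    return data_interval_end + delay


# Values taken from the example in the question, assumed to be in UTC.
data_interval_start = datetime(2022, 1, 1, tzinfo=timezone.utc)
data_interval_end = datetime(2022, 1, 2, tzinfo=timezone.utc)

run_at = earliest_start(data_interval_end, timedelta(hours=4))
print(run_at.isoformat())
```

With this approach the data interval itself stays midnight to midnight, so the query would not need date_trunc.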

Instead of delaying by a fixed time, you can use BranchSQLOperator, which takes follow_task_ids_if_true and follow_task_ids_if_false. With a fixed time window, the DAG might run even when your data is not ready.
operator = BranchSQLOperator(
    task_id="check_data_presence_task",
    conn_id="sql_connection_id",
    sql="SELECT count(1) FROM my_table where date>=today_date",
    follow_task_ids_if_true="success_task_id",
    follow_task_ids_if_false="fail_task_id",
    dag=dag
)
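The branching decision can be tried outside Airflow with a small stand-in. The sketch below uses an in-memory SQLite database and a hypothetical `choose_branch` helper that only mimics the idea (a truthy first cell selects the "true" branch). It is not the operator's actual implementation, and the literal date in the query is an assumption made for the demo.

```python
import sqlite3


def choose_branch(conn, sql, if_true, if_false):
    """Run a check query and pick a task id from its first cell (hypothetical helper)."""
    row = conn.execute(sql).fetchone()
    has_data = bool(row and row[0])
    return if_true if has_data else if_false


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date TEXT)")
check_sql = "SELECT count(1) FROM my_table WHERE date >= '2022-01-01'"

# No rows yet, so the check fails.
before = choose_branch(conn, check_sql, "success_task_id", "fail_task_id")

# Once data arrives, the same check selects the success branch.
conn.execute("INSERT INTO my_table VALUES ('2022-01-01')")
after = choose_branch(conn, check_sql, "success_task_id", "fail_task_id")
print(before, after)
```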

Related

Airflow - How to properly define the time my DAG will execute every day?

I have two dags. The first one extracts data from one database to another. I want it to run every day at 4 AM, so this is how I defined my params:
Note: The code has 7am instead of 4 because Airflow is in UTC time and my time is GMT-3.
    start_date=datetime(2022, 9, 6),
    schedule_interval="0 7 * * *",
    catchup=False
) as dag:
But, when I check Airflow's UI, the DAG time is shown like this:
screenshot
I have never seen this Data Interval before and don't understand it. Why does it start today at 21:00 (9 PM), and why is this the next run for this DAG?
How do I set my DAG to run the next day (Sep 6, as I'm posting on Sep 5) at 4 AM?
Thank you!
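As a rough illustration of how a data interval relates to the trigger time, here is the arithmetic in plain Python. It assumes the Airflow 2 behaviour that a scheduled run covers one interval and is triggered at the end of that interval, and it assumes the first interval starts at the first 07:00 UTC on or after start_date. The fixed UTC-3 offset comes from the question.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=-3))  # fixed UTC-3 offset, as in the question

# One daily interval of the cron schedule "0 7 * * *" (UTC).
interval_start = datetime(2022, 9, 6, 7, 0, tzinfo=timezone.utc)
interval_end = interval_start + timedelta(days=1)  # the run is triggered here

local_trigger = interval_end.astimezone(LOCAL)
print(local_trigger.isoformat())
```

Under these assumptions the run for the interval starting on Sep 6 is triggered at 04:00 local time on Sep 7, one full interval after the interval start.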

How to get airflow time zone information from macros?

Background
I am trying to run a DAG at 10pm America/New_York once every day from Monday to Friday. The script which the DAG runs takes as input the day it runs on for its time zone (10pm Mon-Fri). When I run this script as an Airflow DAG, the date is derived from the macro {{ ds_nodash }}.
The problem
When Airflow runs, by the time it's 10pm NY time, it's already the next day on UTC time. Since Airflow uses UTC, the execution date is one day ahead, so when my DAG uses the macro {{ ds_nodash }}, it is one day ahead.
Question:
Is there a way to get the time-zone adjusted date as a macro on airflow or is the only solution to my problem to adjust the macro myself?
According to the airflow documentation, the default variables (such as {{ ds_nodash }}) are in UTC. So, we need to convert them ourselves. It can go something like this:
# ...
local_ds_nodash = '{{ dag.timezone.convert(execution_date).strftime("%Y%m%d") }}'
# ...
create_file = BashOperator(
    task_id='create_file',
    bash_command=f'touch {local_ds_nodash}.txt'
)
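The effect of that conversion can be checked without Airflow. The sketch below uses the standard-library zoneinfo module (Python 3.9+); `local_ds_nodash` here is a hypothetical function that mirrors the template above, and the run time is an example value.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_ds_nodash(utc_dt: datetime, tz_name: str) -> str:
    """Format a UTC timestamp as YYYYMMDD in the given time zone."""
    return utc_dt.astimezone(ZoneInfo(tz_name)).strftime("%Y%m%d")


# 10pm in New York on 2019-03-25 is already 02:00 on 2019-03-26 in UTC.
utc_run = datetime(2019, 3, 26, 2, 0, tzinfo=timezone.utc)

utc_stamp = utc_run.strftime("%Y%m%d")
ny_stamp = local_ds_nodash(utc_run, "America/New_York")
print(utc_stamp, ny_stamp)
```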
I suppose you may be mixing up two different concepts in airflow.
Actually, 'ds' is not the date on which the tasks run; it is the start of the preceding schedule period. For example, a run with ds of 3/25/2019 actually executes on 3/26/2019 rather than on 3/25. So if you want your tasks to run exactly on Mon-Fri, you need to set the schedule_interval to '0 22 * * 1-5'. The weekday setting should be '1-5' instead of '2-6'.
For the timezone, kaxil's answer explains it very well. But if for some reason you cannot change the airflow server configuration, adjust schedule_interval to '0 2 * * 2-6' instead. The tasks will then run as you expect.
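To see where a UTC expression like '0 2 * * 2-6' comes from, the conversion of 10pm New York time to UTC can be sketched with the standard library. The two dates are arbitrary Mondays chosen for illustration, one in summer and one in winter.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def to_utc(local_dt: datetime) -> datetime:
    """Interpret a naive datetime as New York wall-clock time and convert it to UTC."""
    return local_dt.replace(tzinfo=NY).astimezone(timezone.utc)


summer = to_utc(datetime(2019, 7, 1, 22, 0))  # a Monday during daylight saving time
winter = to_utc(datetime(2019, 1, 7, 22, 0))  # a Monday during standard time
print(summer.isoformat(), winter.isoformat())
```

Monday 10pm in New York lands on Tuesday in UTC, which is why the weekday range shifts to 2-6. The hour is 02:00 UTC only while daylight saving time is in effect, and 03:00 UTC otherwise, so a fixed UTC cron expression drifts by an hour for part of the year.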
Timezone feature is now available in Airflow. Have a look at https://airflow.readthedocs.io/en/1.10.2/timezone.html and adjust your config in airflow.cfg accordingly.
By default it is
[core]
default_timezone = utc
adjust it to
[core]
default_timezone = America/New_York
The execution_date will then contain TZ info as well which you would be able to extract. Try it in a test environment before you roll out to your production environment.

How to consider daylight savings time when using cron schedule in Airflow

In Airflow, I'd like a job to run at specific time each day in a non-UTC timezone. How can I go about scheduling this?
The problem is that once daylight savings time is triggered, my job will either be running an hour too soon or an hour too late. In the Airflow docs, it seems like this is a known issue:
In case you set a cron schedule, Airflow assumes you will always want
to run at the exact same time. It will then ignore day light savings
time. Thus, if you have a schedule that says run at end of interval
every day at 08:00 GMT+1 it will always run end of interval 08:00
GMT+1, regardless if day light savings time is in place.
Has anyone else run into this issue? Is there a work around? Surely the best practice cannot be to alter all the scheduled times after Daylight Savings Time occurs?
Thanks.
Starting with Airflow 1.10, time-zone aware DAGs can be defined using time-zone aware datetime objects to specify start_date. For Airflow to schedule DAG runs always at the same time (regardless of a possible daylight-saving-time switch), use cron expressions to specify schedule_interval. To make Airflow schedule DAG runs with fixed intervals (regardless of a possible daylight-saving-time switch), use datetime.timedelta() to specify schedule_interval.
For example, consider the following code that, first, uses a cron expression to schedule two consecutive DAG runs, and then uses a fixed interval to do the same.
import pendulum
from airflow import DAG
from datetime import datetime, timedelta

START_DATE = datetime(
    year=2019,
    month=10,
    day=25,
    hour=8,
    minute=0,
    tzinfo=pendulum.timezone('Europe/Kiev'),
)


def gen_execution_dates(start_date, schedule_interval):
    dag = DAG(
        dag_id='id', start_date=start_date, schedule_interval=schedule_interval
    )
    execution_date = dag.start_date
    for i in range(1, 3):
        execution_date = dag.following_schedule(execution_date)
        print(
            f'[Run {i}: Execution Date for "{schedule_interval}"]:',
            dag.timezone.convert(execution_date),
        )


gen_execution_dates(START_DATE, '0 8 * * *')
gen_execution_dates(START_DATE, timedelta(days=1))
Running the code produces the following output:
[Run 1: Execution Date for "0 8 * * *"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "0 8 * * *"]: 2019-10-27 08:00:00+02:00
[Run 1: Execution Date for "1 day, 0:00:00"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "1 day, 0:00:00"]: 2019-10-27 07:00:00+02:00
For the zone [Europe/Kiev], the daylight saving time of 2019 ends on 2019-10-27 at 03:00:00+03:00. That is, between Run 1 and Run 2 in our example.
The first two output lines show that for the DAG runs scheduled with a cron expression the first run and second run are both scheduled for 08:00 (although, in different timezones: Eastern European Summer Time (EEST) and Eastern European Time (EET) respectively).
The last two output lines show that for the DAG runs scheduled with a fixed interval the first run is scheduled for 08:00 (EEST), and the second run is scheduled exactly 1 day (24 hours) later, which is at 07:00 (EET) due to the daylight-saving-time switch.
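The same two behaviours can be reproduced without Airflow, using only the standard library (zoneinfo, Python 3.9+). This is a sketch of the underlying date arithmetic, not of how Airflow computes its schedule internally.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

KIEV = ZoneInfo("Europe/Kiev")
run1 = datetime(2019, 10, 26, 8, 0, tzinfo=KIEV)

# Cron-like: same wall-clock time on the next day.
# Adding a timedelta to an aware datetime keeps the local clock reading.
cron_like = run1 + timedelta(days=1)

# Fixed interval: exactly 24 elapsed hours, computed in UTC and converted back.
fixed = (run1.astimezone(timezone.utc) + timedelta(hours=24)).astimezone(KIEV)

print(cron_like.isoformat(), fixed.isoformat())
```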
The following figure illustrates the example:

What's the eloquent way to use the run date for a weekly Airflow job?

The problem: Airflow's execution_date is defined as the beginning of the period between runs. For example, a DAG run on a weekly schedule would run on 2018-01-08 T11:00:00, but the execution_date would be 2018-01-01 T11:00:00.
The objective: I receive a file once a week, with the file date in the file's name. To identify the file, I'd like to use Airflow's execution_date. But I cannot seem to find a way to use the date of the run, versus using the earliest possible execution_date for a period.
Possible solutions:
Modify the execution_date on the fly. Something like: context['execution_date'] + timedelta(days=7). This seems hacky.
Run the DAG daily, insert a ShortCircuitOperator at the beginning of the DAG execution graph, exit if execution_date is not the expected date.
All suggestions or recommendations are welcomed. It's a nuanced problem, but causing some issues with my ETL pipeline.
Another possible solution?
Have the DAG run once a week just after you "think" the file will arrive. Parse the names of the files in the landing area, which will give you a bunch of dates. Check which of these dates fall between execution_date and execution_date + schedule_interval (or next_execution_date if you're using airflow version >= 1.8). Then ingest the file/s which match.
I think using execution_date + timedelta(days=7) is a bit hacky. Instead, use execution_date + schedule_interval; that way, if the interval changes, there shouldn't be any issues (I do this for one of my DAGs). If you're using a newer airflow version, you can use next_execution_date, which is better.
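A minimal sketch of the file-matching step described above, in plain Python. The `report_YYYYMMDD.csv` naming scheme and the `files_in_window` helper are assumptions made for illustration; the real landing area and file names will differ.

```python
from datetime import datetime, timedelta


def files_in_window(filenames, execution_date, schedule_interval):
    """Keep files dated inside [execution_date, execution_date + schedule_interval)."""
    window_end = execution_date + schedule_interval
    matches = []
    for name in filenames:
        # Hypothetical naming scheme: report_YYYYMMDD.csv
        file_date = datetime.strptime(name[len("report_"):-len(".csv")], "%Y%m%d")
        if execution_date <= file_date < window_end:
            matches.append(name)
    return matches


landing = ["report_20171229.csv", "report_20180105.csv", "report_20180112.csv"]
picked = files_in_window(landing, datetime(2018, 1, 1, 11), timedelta(days=7))
print(picked)
```

Because the window is derived from the schedule interval rather than a hard-coded seven days, the same function keeps working if the schedule changes.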
I use a macro for this.
This function (for the macro) can handle manual triggers, too.
import pendulum


def weekly_today(execution_date, run_id, years=0, months=0, days=0, fmt="%Y%m%d"):
    d = pendulum.instance(execution_date)
    # Scheduled runs carry the start of the period, so shift them by one week.
    if run_id.startswith('scheduled_'):
        d = d.add(days=7)
    return d.add(years=years, months=months, days=days).strftime(fmt)
This function should be added to the DAG through user_defined_macros:
from datetime import timedelta

from airflow import DAG
from airflow.utils import timezone

dag = DAG(
    dag_id='test',
    start_date=timezone.datetime(2019, 6, 24, 6),
    schedule_interval=timedelta(days=7),
    user_defined_macros={
        'weekly_today': weekly_today
    },
)
I also needed a date range from one year ago to today.
Here is a sample macro usage:
from_macro = '{{ weekly_today(execution_date, run_id, years=-1) }}'
to_macro = '{{ weekly_today(execution_date, run_id) }}'
The naming is bad, but it works.

Airflow : dag run with execution_date = trigger_date = fixed_schedule

In airflow, I would like to run a dag each Monday at 8am (the execution_date should of course be "current day Monday 8 am"). The relevant parameters to set up for this workflow are:
start_date : "2018-03-19"
schedule_interval : "0 8 * * MON"
I expect to see a dag run every Monday at 8am, the first one being run on 19-03-2018 at 8 am with execution_date = 2018-03-19-08-00-00, and so on each Monday.
However, that is not what happens: the dag is not started on 19/03/18 at 8 am. The real behaviour is explained here, for example: https://stackoverflow.com/a/39620901/1510109 or https://stackoverflow.com/a/48213964/1510109
The behaviour is: at the end of each interval (weekly in my case) the dag is run with execution_date = beginning of the interval (i.e. the previous week). This behaviour is apparently motivated by an "ETL way of thinking" (see the links above). But it's absolutely not what I want.
How can I run my dag each Monday at 08:00am with execution_date = trigger_date = now (= current Monday 8am)?
Thanks
Take a quick look at my answer with start times and execution_date examples.
You want to run every Monday at 8am.
So this part is going to stay the same:
schedule_interval: '0 8 * * MON',
You want its first run to happen on 2018-03-19. Since the first run occurs at the end of the first full schedule period after the start date, you should change your start date to:
start_date: datetime(2018, 3, 12),
You will have to live with the fact that Airflow will name your DagRuns with the start of each period and pass in macros based on the execution_date set to the start of the interval period. Adjust your logic accordingly.
Your first run will start after 2018-03-19T08:00:00.0Z and the execution_date, every other macro that depends on it, and name of the DagRun will be 2018-03-12T08:00:00.0Z
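The relationship between start date, execution_date and actual trigger time for this weekly schedule can be sketched in plain Python. `first_weekly_run` is a hypothetical helper that imitates the rule described above for a "weekday at a given hour" schedule; it is not Airflow's scheduler code.

```python
from datetime import datetime, timedelta


def first_weekly_run(start_date, weekday, hour):
    """Return (execution_date, trigger_time) of the first run (hypothetical helper)."""
    # First schedule point at or after start_date: the given weekday (Mon=0) at `hour`.
    candidate = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    while candidate.weekday() != weekday or candidate < start_date:
        candidate += timedelta(days=1)
    # The run is triggered once the full period after that point has elapsed.
    return candidate, candidate + timedelta(days=7)


execution_date, trigger_time = first_weekly_run(datetime(2018, 3, 12), weekday=0, hour=8)
print(execution_date, trigger_time)
```

With the original start date of 2018-03-19, the same helper gives a first trigger on 2018-03-26, which matches the behaviour the question ran into.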
So long as you understand what to expect from the execution_date and you don't try to base your time off of datetime.now() your DAGs will be able to be idempotent in operation. Feel free to make a new variable like my_execution_date = execution_date + datetime.timedelta(7) within any PythonOperator or custom operator (you get execution_date from the context of the task), use template statements like {{ (execution_date + macros.timedelta(7)).strftime('%Y%m%d') }} or {{ macros.ds_add(ds, 7) }}, or use the next_execution_date.
You can even add a dag level user_defined_macros like {'dt':lambda d: d+datetime.timedelta(days=7)} to enable {{ dt(execution_date) }}. And recently user_defined_filters were added like {'dt':lambda d: d+datetime.timedelta(days=7)} enabling {{ execution_date | dt }}. The next_ds and next_execution_date would be easier for your purposes.
While thinking about templating, you may as well read up on the built-in stuff out there: http://jinja.pocoo.org/docs/2.10/templates/#builtin-filters
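The date shifts suggested in this answer are ordinary datetime arithmetic, which can be tried outside a template. In the sketch below, `ds_add` is a plain-Python stand-in written for illustration; it imitates what macros.ds_add is described as doing (shifting a YYYY-MM-DD string by a number of days) and is not the Airflow function itself.

```python
from datetime import datetime, timedelta


def ds_add(ds: str, days: int) -> str:
    """Shift a YYYY-MM-DD date string by a number of days (illustrative stand-in)."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


# Start of the weekly period, as Airflow would pass it to the task.
execution_date = datetime(2018, 3, 12, 8, 0)

# The date the run actually happens on, one week later.
my_execution_date = execution_date + timedelta(7)

stamp = my_execution_date.strftime("%Y%m%d")
shifted_ds = ds_add("2018-03-12", 7)
print(stamp, shifted_ds)
```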
That is how airflow behaves: it always runs once the interval has completed. Detailed behavior here and airflow faq.
To make it work on the current week anyway, we can manipulate the execution_date of the DAG, either by adding 7 days to the datetime object (for a weekly schedule) or by using the {{ next_execution_date }} macro.
Admittedly, this only helps if dates are used in your DAG or dependencies are triggered by them.
Just to be clear again, the DAG still runs as per its normal behavior. The only thing we are doing is manipulating the date inside the program/DAG.
args = { ....
    'start_date': datetime.datetime(2018, 3, 18)
}
dag = DAG(...
    schedule_interval="@weekly"
)
# DAG would run on 3/25/2018 for week of 18th March
# but let's say we manipulate here
# {{ next_execution_date }} macro
# or add 7 days
# So basically we are running with date 3/25/2018 instead of 3/18/2018 for the week of 18th March
I solved it this way:
{{ ds if dag_run.external_trigger or dag_run.is_backfill else macros.ds_add(ds, 1) }}
If DAG was run by external trigger we shouldn't change ds.
If DAG was run by backfilling we shouldn't change ds.
If DAG was scheduled we use macros to increment it by one day.
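The three rules above can be written as an ordinary function, which makes them easy to test outside a template. `effective_ds` and its boolean parameters are hypothetical names that stand in for the dag_run attributes used in the template.

```python
from datetime import date, timedelta


def effective_ds(ds: str, external_trigger: bool = False, is_backfill: bool = False) -> str:
    """Return ds unchanged for manual or backfill runs, otherwise ds plus one day."""
    if external_trigger or is_backfill:
        return ds
    return (date.fromisoformat(ds) + timedelta(days=1)).isoformat()


scheduled = effective_ds("2019-03-25")
manual = effective_ds("2019-03-25", external_trigger=True)
backfill = effective_ds("2019-03-25", is_backfill=True)
print(scheduled, manual, backfill)
```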
