How to consider daylight savings time when using cron schedule in Airflow - airflow

In Airflow, I'd like a job to run at a specific time each day in a non-UTC timezone. How can I go about scheduling this?
The problem is that once daylight savings time is triggered, my job will either be running an hour too soon or an hour too late. In the Airflow docs, it seems like this is a known issue:
In case you set a cron schedule, Airflow assumes you will always want
to run at the exact same time. It will then ignore day light savings
time. Thus, if you have a schedule that says run at end of interval
every day at 08:00 GMT+1 it will always run end of interval 08:00
GMT+1, regardless if day light savings time is in place.
Has anyone else run into this issue? Is there a workaround? Surely the best practice cannot be to alter all the scheduled times every time Daylight Savings Time starts or ends?
Thanks.

Starting with Airflow 1.10, time-zone aware DAGs can be defined using time-zone aware datetime objects to specify start_date. For Airflow to schedule DAG runs always at the same time (regardless of a possible daylight-saving-time switch), use cron expressions to specify schedule_interval. To make Airflow schedule DAG runs with fixed intervals (regardless of a possible daylight-saving-time switch), use datetime.timedelta() to specify schedule_interval.
For example, consider the following code that, first, uses a cron expression to schedule two consecutive DAG runs, and then uses a fixed interval to do the same.
import pendulum
from airflow import DAG
from datetime import datetime, timedelta
START_DATE = datetime(
    year=2019,
    month=10,
    day=25,
    hour=8,
    minute=0,
    tzinfo=pendulum.timezone('Europe/Kiev'),
)
def gen_execution_dates(start_date, schedule_interval):
    dag = DAG(
        dag_id='id', start_date=start_date, schedule_interval=schedule_interval
    )
    execution_date = dag.start_date
    for i in range(1, 3):
        execution_date = dag.following_schedule(execution_date)
        print(
            f'[Run {i}: Execution Date for "{schedule_interval}"]:',
            dag.timezone.convert(execution_date),
        )
gen_execution_dates(START_DATE, '0 8 * * *')
gen_execution_dates(START_DATE, timedelta(days=1))
Running the code produces the following output:
[Run 1: Execution Date for "0 8 * * *"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "0 8 * * *"]: 2019-10-27 08:00:00+02:00
[Run 1: Execution Date for "1 day, 0:00:00"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "1 day, 0:00:00"]: 2019-10-27 07:00:00+02:00
For the zone [Europe/Kiev], daylight saving time in 2019 ended on 2019-10-27 at 04:00 EEST (01:00 UTC), when clocks were set back to 03:00 EET. That is, between Run 1 and Run 2 in our example.
The first two output lines show that for the DAG runs scheduled with a cron expression the first run and second run are both scheduled for 08:00 (although, in different timezones: Eastern European Summer Time (EEST) and Eastern European Time (EET) respectively).
The last two output lines show that for the DAG runs scheduled with a fixed interval the first run is scheduled for 08:00 (EEST), and the second run is scheduled exactly 1 day (24 hours) later, which is at 07:00 (EET) due to the daylight-saving-time switch.
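The same distinction can be reproduced without Airflow. The sketch below is not Airflow's scheduling code; it assumes Python 3.9+ with the standard-library zoneinfo module and an installed tz database, and the helper names are made up for illustration. It contrasts "same wall-clock time tomorrow" (what the cron expression gives you) with "24 elapsed hours later" (what timedelta(days=1) gives you), using the run on 2019-10-26 as the starting point.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

KIEV = ZoneInfo("Europe/Kiev")

def next_same_wall_clock(run: datetime) -> datetime:
    """Next day at the same local wall-clock time (cron-like behaviour)."""
    # Adding a timedelta to an aware datetime keeps the wall-clock fields
    # and re-evaluates the UTC offset for the resulting date.
    return run + timedelta(days=1)

def next_after_24_hours(run: datetime) -> datetime:
    """Exactly 24 elapsed hours later (fixed-interval behaviour)."""
    # Do the arithmetic in UTC, then convert back to the local zone.
    in_utc = run.astimezone(timezone.utc) + timedelta(hours=24)
    return in_utc.astimezone(run.tzinfo)

run_1 = datetime(2019, 10, 26, 8, 0, tzinfo=KIEV)
cron_like = next_same_wall_clock(run_1)
fixed = next_after_24_hours(run_1)
print(cron_like.isoformat())
print(fixed.isoformat())
```

The first value stays at 08:00 local time with a changed offset, while the second lands an hour earlier on the local clock, which matches Run 2 in the two outputs above.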
The following figure illustrates the example:

Related

Airflow - How to properly define the time my DAG will execute every day?

I have two DAGs. The first one extracts data from one database to another. I want it to run every day at 4 AM, so this is how I defined my params:
Note: The code has 7 AM instead of 4 AM because Airflow is in UTC and my local time is GMT-3.
    start_date=datetime(2022, 9, 6),
    schedule_interval="0 7 * * *",
    catchup=False
) as dag:
But, when I check Airflow's UI, the DAG time is shown like this:
screenshot
I have never seen this Data Interval before and have no idea what it means. Why does it start today at 21:00 (9 PM), and why is this the next run for this DAG?
How do I set my DAG to run the next day (Sep 6, as I'm posting on Sep 5) at 4 AM?
Thank you!

schedule a monthly DAG to run on the next weekday

I have to schedule a DAG that should run on 15th of every month. However, if 15th falls on a Sunday/Saturday then the DAG should skip weekends and run on coming Monday.
For example, May 15 2021 falls on a Saturday. So, instead of running on 15th of May, the DAG should run on 17th, which is Monday.
Can you please help to schedule it in airflow?
Thanks in advance!
The scheduling logic is limited to what you can express with a single cron expression. If you can't say it in a cron expression, you can't get that schedule from Airflow directly. For that reason there is an open Airflow improvement proposal, AIP-39 Richer scheduler_interval, to give more scheduling capabilities.
That said, you can still get the desired functionality by writing some code.
You can set your DAG to start on the 15th of each month and then place a sensor that verifies that the day is Mon-Fri (if not, it will wait):
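from airflow import DAG
from airflow.sensors.weekday import DayOfWeekSensor
# default_args is assumed to be defined elsewhere in the DAG file
dag = DAG(
    dag_id='work',
    schedule_interval='0 0 15 * *',
    default_args=default_args,
    description='Schedule a Job on 15 of each month',
)
# Waits (in reschedule mode) until the current day is a weekday
weekend_check = DayOfWeekSensor(
    task_id='weekday_check_task',
    week_day={'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'},
    mode='reschedule',
    dag=dag)
# YourOperator is a placeholder for the operator that does the real work
op_1 = YourOperator(task_id='op1_task', dag=dag)
weekend_check >> op_1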
Note: If you are running airflow<2.0.0 you will need to change the import to:
from airflow.contrib.sensors.weekday_sensor import DayOfWeekSensor
The answer posted by Elad works pretty well. I came up with another solution that works as well.
I scheduled the job to run on the 15th, 16th and 17th of the month. However, I added a condition so that the job does its work on the 15th only if it's a weekday, and on the 16th or 17th only if that day is a Monday.
To achieve that, I added a BranchPythonOperator:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
def _conditional_task_initiator(**kwargs):
    # Run on the 15th if it is a weekday, or on the 16th/17th if that day is a Monday
    execution_date = kwargs['execution_date']
    if execution_date.day == 15 and execution_date.weekday() < 5:
        return 'dummy_task_run_cmo_longit'
    elif execution_date.day in (16, 17) and execution_date.weekday() == 0:
        return 'dummy_task_run_cmo_longit'
    else:
        return 'dummy_task_skip_cmo_longit'
# default_args is assumed to be defined elsewhere in the DAG file
with DAG(dag_id='NXS_FM_LOAD_CMO_CHOICE_LONGIT', default_args=default_args, schedule_interval="0 8 15-17 * *", catchup=False) as dag:
    conditional_task_initiator = BranchPythonOperator(
        task_id='cond_task_check_day',
        provide_context=True,
        python_callable=_conditional_task_initiator,
        do_xcom_push=False)
    dummy_task_run_cmo_longit = DummyOperator(
        task_id='dummy_task_run_cmo_longit')
    dummy_task_skip_cmo_longit = DummyOperator(
        task_id='dummy_task_skip_cmo_longit')
    conditional_task_initiator >> [dummy_task_run_cmo_longit, dummy_task_skip_cmo_longit]
    # dummy_task_run_cmo_longit >> <main tasks for execution>
With this, the job will run on the 15th, 16th and 17th of every month. However, it will run the functional tasks only once per month.
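The date rule itself can be checked in isolation before wiring it into a branch task. The following is a minimal sketch using only the standard library; should_run is a hypothetical helper name, not part of Airflow, and it takes a plain calendar date rather than Airflow's execution_date.

```python
from datetime import date

def should_run(day: date) -> bool:
    """True on the 15th if it is a weekday, else on the Monday right after it."""
    is_weekday = day.weekday() < 5   # Monday=0 ... Friday=4
    is_monday = day.weekday() == 0
    if day.day == 15:
        return is_weekday
    if day.day in (16, 17):
        return is_monday
    return False

# May 15 2021 was a Saturday, so the work should happen on Monday, May 17.
for d in (date(2021, 5, 15), date(2021, 5, 16), date(2021, 5, 17)):
    print(d.isoformat(), should_run(d))
```

Because the 16th or 17th can only be a Monday when the 15th fell on a weekend, exactly one of the three candidate days qualifies in any month.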

Airflow: How to schedule dag multiple times a day on specific times

I'm trying to run an airflow dag at specific times on a day.
I'm aware that the airflow scheduler runs at the end of a period.
But this is becoming a scheduling nightmare and I need some guidance.
In essence I want to run the DAG at 1:30, 7:45 and somewhere in the afternoon. Let's make it 14:00 so there is exactly 6h 15m between each run.
It's also important that it's UK time: it needs to switch with UK summer/winter time.
This is what I came up with:
dag_timezone = pendulum.timezone("Europe/London")
dt_now = datetime.now(tz=dag_timezone)
schedule_interval = timedelta(hours=6, minutes=15)
start_date = datetime(dt_now.year, dt_now.month, dt_now.day, 1, 30, 0, 0, dag_timezone) - schedule_interval
I expected it to immediately start running for today (1:30 & 7:45 run at least) since catchup=True
Alas, no success.
In the interface the start_date is 2020-07-30 6:30:00
It almost looks as if the schedule_interval is added to the start_date instead of subtracted.
I would expect 2020-07-30 01:30:00 - 6h15m => 2020-07-29 19:15:00 =UTC> 2020-07-29 18:15:00
Also: Is there a debug mode for the scheduler to see the 'reasoning' ?
In Apache Airflow, the first run of a scheduled DAG actually starts at start_date + schedule_interval. For example, if I pass start_date=datetime(2020, 7, 30) and schedule_interval='@daily', the first task will start on July 31, not on July 30.
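As a rough illustration of that rule, here is a simplified model (not Airflow's actual scheduler code) for the case where schedule_interval is a timedelta. The helper name first_run_time is made up for this sketch.

```python
from datetime import datetime, timedelta

def first_run_time(start_date: datetime, schedule_interval: timedelta) -> datetime:
    """Moment the first run is triggered: the end of the first interval."""
    return start_date + schedule_interval

# Daily schedule starting July 30: the first run is triggered on July 31.
print(first_run_time(datetime(2020, 7, 30), timedelta(days=1)))

# The 6h15m schedule from the question, with the start date pushed back by one interval.
print(first_run_time(datetime(2020, 7, 29, 19, 15), timedelta(hours=6, minutes=15)))
```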

How to get airflow time zone information from macros?

Background
I am trying to run a DAG at 10pm America/New_York once every day from Monday to Friday. The script which the DAG runs takes as input the day it runs on in its time zone (10pm Mon-Fri). When I run this script as an Airflow DAG, the date is derived from the macro {{ ds_nodash }}.
The problem
When Airflow runs, by the time it's 10pm NY time, it's already the next day on UTC time. Since Airflow uses UTC, the execution date is one day ahead, so when my DAG uses the macro {{ ds_nodash }}, it is one day ahead.
Question:
Is there a way to get the time-zone adjusted date as a macro on airflow or is the only solution to my problem to adjust the macro myself?
According to the airflow documentation, the default variables (such as {{ ds_nodash }}) are in UTC. So, we need to convert them ourselves. It can go something like this:
# ...
local_ds_nodash = '{{ dag.timezone.convert(execution_date).strftime("%Y%m%d") }}'
# ...
create_file = BashOperator(
    task_id='create_file',
    bash_command=f'touch {local_ds_nodash}.txt'
)
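To see why the conversion matters, the date shift can be reproduced outside Airflow. This is a minimal sketch assuming Python 3.9+ with the standard-library zoneinfo module and an installed tz database; the function local_ds_nodash here is a plain-Python stand-in for the template above, not an Airflow macro.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_ds_nodash(execution_date_utc: datetime, tz_name: str) -> str:
    """Format a UTC timestamp as YYYYMMDD in the given time zone."""
    return execution_date_utc.astimezone(ZoneInfo(tz_name)).strftime("%Y%m%d")

# 10pm on Monday 2019-03-25 in New York (EDT, UTC-4) is 02:00 UTC on Tuesday.
execution_date = datetime(2019, 3, 26, 2, 0, tzinfo=timezone.utc)
print(execution_date.strftime("%Y%m%d"))                    # date as seen in UTC
print(local_ds_nodash(execution_date, "America/New_York"))  # date as seen in New York
```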
I suspect you are mixing up two different concepts in Airflow.
Actually, 'ds' is not the date on which the tasks run; it marks the schedule period before the one in which they run. For example, a run with a ds of 3/25/2019 actually executes on 3/26/2019 rather than on 3/25. So if you want your tasks to run exactly on Mon-Fri, you need to set the schedule_interval to '0 22 * * 1-5'. The weekday setting should be '1-5' instead of '2-6'.
As for the timezone, kaxil's answer explains it very well. But if for some reason you cannot change the Airflow server configuration, adjust the schedule_interval to '0 2 * * 2-6' instead. Then the tasks will run as you expect.
Timezone feature is now available in Airflow. Have a look at https://airflow.readthedocs.io/en/1.10.2/timezone.html and adjust your config in airflow.cfg accordingly.
By default it is
[core]
default_timezone = utc
adjust it to
[core]
default_timezone = America/New_York
The execution_date will then contain TZ info as well which you would be able to extract. Try it in a test environment before you roll out to your production environment.

Does Apache Airflow 1.10+ scheduler support running 2 DAGs in different DST aware time-zones at specific times?

Apache Airflow 1.10+ introduced native support for DST aware timezones.
This leads me to think (perhaps incorrectly) it should be possible to create 2 DAGs on the same Airflow scheduler that are scheduled like so:
Starts every day at 06:00 Pacific/Auckland time
Starts every day at 21:00 America/New_York time
Without the need to introduce tasks that "sleep" until the required start time. The documentation explicitly rules out the cron scheduler for DST aware scheduling but only explains how to set the DAGs to run every day in that timezone, which by default is midnight.
Previous questions on this topic have considered only using the cron scheduler or are based on pre-1.10 airflow which did not have the introduced native support for DST aware timezones.
In the "airflow.cfg" I updated the default_timezone to the system timezone. And then I tried to schedule the DAGs like so:
DAG('NZ_SOD',
description='New Zealand Start of Day',
start_date=datetime(2018, 12, 11, 06, 00, tzinfo=pendulum.timezone('Pacific/Auckland')),
catchup=False)
And:
DAG('NAM_EOD',
description='North Americas End of Day',
start_date=datetime(2018, 12, 11, 21, 00, tzinfo=pendulum.timezone('America/New_York')),
catchup=False)
But it seems that the "Time" part of the datetime object that is passed to start_date is not explicitly considered in Apache Airflow and creates unexpected behavior.
Does Airflow have any in built option to produce desired behavior or am I trying to use the wrong tool for the job?
The answer is yes, the cron schedule supports having DAGs run in DST aware timezones.
But there are a number of caveats so I have to assume the maintainers of Airflow do not have this as a supported use case. Firstly the documentation, as of the time of writing, is explicitly wrong when it states:
Cron schedules
In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then ignore day light savings time. Thus, if you have a schedule that says run at end of interval every day at 08:00 GMT+1 it will always run end of interval 08:00 GMT+1, regardless if day light savings time is in place.
I've written this somewhat hacky code which lets you see how a schedule will work without the need for a running Airflow instance (make sure you have Pendulum 1.x installed and are using the matching documentation if you run or edit this code):
import pendulum
from airflow import DAG
from datetime import timedelta
# Set-up DAG
test_dag = DAG(
    dag_id='foo',
    start_date=pendulum.datetime(year=2019, month=4, day=4, tz='Pacific/Auckland'),
    schedule_interval='00 03 * * *',
    catchup=False
)
# Check initial schedule
execution_date = test_dag.start_date
for _ in range(7):
    next_execution_date = test_dag.following_schedule(execution_date)
    if next_execution_date <= execution_date:
        execution_date = test_dag.following_schedule(execution_date + timedelta(hours=2))
    else:
        execution_date = next_execution_date
    print('Execution Date:', execution_date)
This gives us a 7 day period over which New Zealand experiences DST:
Execution Date: 2019-04-03 14:00:00+00:00
Execution Date: 2019-04-04 14:00:00+00:00
Execution Date: 2019-04-05 14:00:00+00:00
Execution Date: 2019-04-06 14:00:00+00:00
Execution Date: 2019-04-07 15:00:00+00:00
Execution Date: 2019-04-08 15:00:00+00:00
Execution Date: 2019-04-09 15:00:00+00:00
As we can see, DST is observed when using the cron schedule. Further, if you edit my code to remove the cron schedule, you can see that DST is not observed.
But be warned: even with the cron schedule observing DST, you may still have an off-by-one-day error on the day of the DST change, because Airflow provides the previous date and not the current date (e.g. Sunday on the calendar, but in Airflow the execution date is Saturday). It doesn't look to me like this is accounted for in the following_schedule logic.
Finally, as @dlamblin points out, the variables that Airflow provides to the jobs, either via templated strings or via provide_context=True for Python callables, will be wrong if the local execution date for the DAG is not the same as the UTC execution date. This can be observed in TaskInstance.get_template_context, which uses self.execution_date without converting it to local time. And we can see in TaskInstance.__init__ that self.execution_date is converted to UTC.
The way I handle this is to derive a variable I call local_cal_date by doing what @dlamblin suggests and using the convert method from Pendulum. Edit this code to fit your specific needs (I actually use it in a wrapper around all my Python callables so that they all receive local_cal_date):
import datetime
def foo(*args, dag, execution_date, **kwargs):
    # Derive local execution datetime from dag and execution_date that
    # airflow passes to python callables where provide_context is set to True
    airflow_timezone = dag.timezone
    local_execution_datetime = airflow_timezone.convert(execution_date)
    # I then add 1 day to make it the calendar day
    # and not the execution date which Airflow provides
    local_cal_datetime = local_execution_datetime + datetime.timedelta(days=1)
Update: For templated strings, I found the best approach for me was to create custom operators that inject the custom variables into the context before the template is rendered. The problem I found with using custom macros is that they don't expand other macros automatically, which means you have to do a bunch of extra work to render them in a useful way. So in a custom operators module I have something similar to this code:
# Standard Library
import datetime
# Third Party Libraries
import airflow.operators.email_operator
import airflow.operators.python_operator
import airflow.operators.bash_operator
class CustomTemplateVarsMixin:
    def render_template(self, attr, content, context):
        # Do Calculations
        airflow_execution_datetime = context['execution_date']
        airflow_timezone = context['dag'].timezone
        local_execution_datetime = airflow_timezone.convert(airflow_execution_datetime)
        local_cal_datetime = local_execution_datetime + datetime.timedelta(days=1)
        # Add to contexts
        context['local_cal_datetime'] = local_cal_datetime
        # Run normal Method (super() already binds self, so it is not passed again)
        return super().render_template(attr, content, context)
class BashOperator(CustomTemplateVarsMixin, airflow.operators.bash_operator.BashOperator):
    pass
class EmailOperator(CustomTemplateVarsMixin, airflow.operators.email_operator.EmailOperator):
    pass
class PythonOperator(CustomTemplateVarsMixin, airflow.operators.python_operator.PythonOperator):
    pass
class BranchPythonOperator(CustomTemplateVarsMixin, airflow.operators.python_operator.BranchPythonOperator):
    pass
First a few nits:
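Don't specify datetimes with a leading 0 like 06 for 6 am, because if you edit it to 09 for 9 am in a rush, you're going to find out that that's not a valid octal number and the whole DAG file will stop parsing. (That is the Python 2 behavior; in Python 3 any leading zero on a nonzero integer literal, 06 included, is already a syntax error.)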
You might as well use the pendulum notation: start_date=pendulum.datetime(2018, 12, 11, 6, 0, tz='Pacific/Auckland'),
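A quick way to see the leading-zero problem without touching a DAG file is to ask Python whether an argument list even parses. This is a small sketch assuming Python 3; parses is a hypothetical helper written for this illustration.

```python
def parses(expression: str) -> bool:
    """Return True if `expression` is syntactically valid Python."""
    try:
        compile(expression, "<dag file>", "eval")
        return True
    except SyntaxError:
        return False

print(parses("(2018, 12, 11, 9, 0)"))    # plain integers
print(parses("(2018, 12, 11, 09, 00)"))  # leading zero on a nonzero literal
```

Since the failure is a SyntaxError, it happens when the file is parsed, so every DAG defined in that file is affected and not just the one with the bad literal.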
Yeah, timezones in Airflow got a little confusing. The docs say that a cron schedule is always in that timezone's offset. This isn't as clear as it should be because offsets vary. Let's assume you set the default config timezone like this:
[core]
default_timezone = America/New_York
With a start_date like:
start_date = datetime(2018, 12, 11, 6, 0),
you get the offset with UTC of -18000 or -5h.
start_date = datetime(2018, 4, 11, 6, 0),
you get the offset with UTC of -14400 or -4h.
Whereas the one in the second bullet point gives an offset of 46800 or 13h, while in April in Auckland it is 43200 or 12h. These get applied to the schedule_interval for the DAG, if I recall correctly.
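These offsets can be checked directly. The snippet below is a sketch that assumes Python 3.9+ with the standard-library zoneinfo module and an installed tz database; it does not go through Airflow or Pendulum, and utc_offset_seconds is a made-up helper name.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_offset_seconds(naive_local: datetime, tz_name: str) -> int:
    """UTC offset, in seconds, of a local wall-clock time in the given zone."""
    aware = naive_local.replace(tzinfo=ZoneInfo(tz_name))
    return int(aware.utcoffset().total_seconds())

for tz_name in ("America/New_York", "Pacific/Auckland"):
    for local in (datetime(2018, 12, 11, 6, 0), datetime(2018, 4, 11, 6, 0)):
        print(tz_name, local.date(), utc_offset_seconds(local, tz_name))
```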
What the docs seem to say is your schedule_interval crontab string will be interpreted forever in that same offset. So, a 0 5 * * * is going to run at 5 or 6 am if you started in December in NYC OR 5 or 4 am if you started in April in NYC. Uh. I think that's right. I am also confused by this.
This isn't avoided by leaving the default at utc. No, not if you use the start_date as you've shown and picked zones with varying offsets to utc.
Now… the second issue, time of day. The start date is used as the earliest start interval that's valid. A time of day being in there is great, but the schedule defaults to timedelta(days=1). I thought it was @daily, which also means 0 0 * * *, and gives you fun results: with a start date of 6am on the 11th of December, your first full midnight-to-midnight interval will close at midnight on the 13th of December, so the first run gets passed midnight of the 12th of December as the execution_date. But I would expect that with a timedelta being applied to the start_date it would instead start at 6am on the 12th of December, with the same time yesterday passed in as the execution_date. However, I've not seen it work out that way, which does make me think that it might be using only the date part of the datetime for start_date somewhere.
As documented, this passed-in execution_date (and all macro dates) is going to be in UTC (so midnight or 6am in your start_date timezone offset, converted to UTC). At least they have the tz attached so you can use convert on them if you must.

Resources