How to schedule execution without delay? - airflow

I have one problem, I need to launch one DAG every first day of month, but I have one problem, the DAG started on 1 October but executed that day on 1 November, I need that 1 October execute 1 October and 1 November execute 1 November, and not delay the execution one month.
My scheduler was: '0 10 1 * *'
Thanks

This is how Airflow works.
Airflow schedule DAGs at the end of the interval.
So if you have:
DAG(
dag_id='tutorial',
schedule_interval='0 10 1 * *',
start_date=datetime(2021, 10, 1),
)
The first run will start on 2021-11-01 - this run will have execution date of 2021-10-01. This behavior is consistent with how data pipelines work. In November you will want to process October data. Or in the terminology that I mentioned before - Your monthly interval starts on beginning of October it ends in November so at the beginning of November you can run the job that process October data.
That said - In the job itself you can process any interval you wish. For that you can use Airflow macros.
In simple words if you want your first DAG run to start on 2021-10-01 you should set start_date=datetime(2021, 9, 1)
Starting from Airflow 2.2.0 there was enhancement in that area.
Airflow decoupled the "When to run" from the "What interval to process" with the completion of AIP- 39 Richer Scheduler. You can read about the concept of Timetables in this doc.

Related

Airflow - How to properly define the time my DAG will execute every day?

I have two dags: The first one extracts data from one database to another. I want it to run everyday at 4 AM and then, that's how I defined my params:
Note: The code has 7am instead of 4 because Airflow is in UTC time and my time is GMT-3.
start_date=datetime(2022, 9, 6),
schedule_interval="0 7 * * *",
catchup=False
) as dag:
But, when I check Airflow's UI, the DAG time is shown like this:
screenshot
I have no idea about or have seen this Data Interval, why is it starting at today 21:00 (9 PM) and why this is the next run for this DAG?
How do I set my DAG to run at the next day (Sep 6, as I'm posting on 5) at 4 AM?
Thank you!

schedule a monthly DAG to run on the next weekday

I have to schedule a DAG that should run on 15th of every month. However, if 15th falls on a Sunday/Saturday then the DAG should skip weekends and run on coming Monday.
For example, May 15 2021 falls on a Saturday. So, instead of running on 15th of May, the DAG should run on 17th, which is Monday.
Can you please help to schedule it in airflow?
Thanks in advance!
The logic of scheduling is limited by what you can do with single cron expression. So if you can't say it in cron expression you can't provide such scheduling in Airflow. For that reason there is an open airflow improvement proposal AIP-39 Richer scheduler_interval to give more scheduling capabilities.
That said, you can still get the desired functionality by writing some code.
You can set your dag to start on the 15th of each month and then place a sensor that verify that the date is Mon-Fri (if not it will wait):
from airflow.sensors.weekday import DayOfWeekSensor
dag = DAG(
dag_id='work',
schedule_interval='0 0 15 * *',
default_args=default_args,
description='Schedule a Job on 15 of each month',
)
weekend_check = DayOfWeekSensor(
task_id='weekday_check_task',
week_day={'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'},
mode='reschedule',
dag=dag)
op_1 = YourOperator(task_id='op1_task',dag=dag)
weekend_check >> op_1
Note: If you are running airflow<2.0.0 you will need to change the import to:
from airflow.contrib.sensors.weekday_sensor import DayOfWeekSensor
The answer posted by Elad works pretty well. I came up with another solution that works as well.
I scheduled the job to run on 15,16 and 17 of the month. However, I added a condition so that the job runs on the 15th if its a weekday. The job runs on 16th and 17th if its a Monday.
To achieve that, I added a BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
def _conditinal_task_initiator(**kwargs):
execution_date=kwargs['execution_date']
if int(datetime.strftime(execution_date,'%d'))==15 and (execution_date.weekday()<5):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==16 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==17 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
else:
return 'dummy_task_skip_cmo_longit'
with DAG(dag_id='NXS_FM_LOAD_CMO_CHOICE_LONGIT',default_args = default_args, schedule_interval = "0 8 15-17 * *", catchup=False) as dag:
conditinal_task_initiator=BranchPythonOperator(
task_id='cond_task_check_day',
provide_context=True,
python_callable=_conditinal_task_initiator,
do_xcom_push=False)
dummy_task_run_cmo_longit=DummyOperator(
task_id='dummy_task_run_cmo_longit')
dummy_task_skip_cmo_longit=DummyOperator(
task_id='dummy_task_skip_cmo_longit')
conditinal_task_initiator >> [dummy_task_run_cmo_longit,dummy_task_skip_cmo_longit]
dummy_task_run_cmo_longit >> <main tasks for execution>
Using this, the job'll run on 15,16 and 17 of every month. However, it'll run the functional tasks only once every month.

Airflow: How to schedule dag multiple times a day on specific times

I'm trying to run an airflow dag at specific times on a day.
I'm aware that the airflow scheduler runs at the end of a period.
But this becoming a time scheduler nightmare and I need some guidance.
In essence I want to run the dag on 1:30, 7:45 and say somewhere in the afternoon. Let's make it 14:00 so there is exactly 6h 15m between each run.
It's also important that it's UK time. It needs to switch with UK summer/winter time
This is what I came up with:
dag_timezone = pendulum.timezone("Europe/London")
dt_now = datetime.now(tz=dag_timezone)
schedule_interval = timedelta(hours=6, minutes=15)
start_date = datetime(dt_now.year, dt_now.month, dt_now.day, 1, 30, 0, 0, dag_timezone) - schedule_interval
I expected it to immediately start running for today (1:30 & 7:45 run at least) since catchup=True
Alas, no success.
In the interface the start_date is 2020-07-30 6:30:00
It almost looks that the schedule_interval is added to the start_date instead of subtracted
I would expect 2020-07-30 01:30:00 - 6h15m => 2020-07-29 19:15:00 =UTC> 2020-07-29 18:15:00
Also: Is there a debug mode for the scheduler to see the 'reasoning' ?
In apache airflow when schedule your DAG then it is actually start_data + schedule_interval. For example, suppose I have passed start_date=datetime(2020, 7, 30) and my schedule_interval=#daily then actually my first task will run/start at 31st July not on 30 July

Problem with Airflow DAG and its schedule

I am having issues with a DAG created, I am trying to create a DAG to run weekly on Mondays.
This is the current setting of the DAG:
dag = DAG(
"NAME_OF_THE_DAG",
description="description",
default_args=default_args,
start_date="Fri, 01 May 2020 00:00:00 GMT",
end_date=None,
schedule_interval="30 3 * * 1",
catchup=True,
)
The DAG runs the first task on the first Monday of the May (4th of May 2020), but then is not executing the following week.
Does anyone what could be? I thought it was something with qeued tasks, but after cleaning the tasks, the scheduled didn't triggered the expected next execution.
Airflow schedules each task at the end of each interval. So your runs will be...
Monday, May 3 # 3:30 will execute on Monday, May 10 # 3:30
Monday, May 10 # 3:30 will execute on Monday, May 17 # 3:30
Monday, May 17 # 3:30 will execute on Monday, May 24 # 3:30
etc.
Why is this? Since Airflow is a tool built for data workflows where each task operates on a slice of time, and the data for a slice of time is not ready until that slice of time has ended. So for instance with your schedule, you define a window for May 3 to May 10, and that data won't be ready until May 10.
I've found pages 11-14 of this PDF to do the best job of explaining: https://drive.google.com/file/d/1DVN4HXtOC-HXvv00sEkoB90mxLDnCIKc/view

How to consider daylight savings time when using cron schedule in Airflow

In Airflow, I'd like a job to run at specific time each day in a non-UTC timezone. How can I go about scheduling this?
The problem is that once daylight savings time is triggered, my job will either be running an hour too soon or an hour too late. In the Airflow docs, it seems like this is a known issue:
In case you set a cron schedule, Airflow assumes you will always want
to run at the exact same time. It will then ignore day light savings
time. Thus, if you have a schedule that says run at end of interval
every day at 08:00 GMT+1 it will always run end of interval 08:00
GMT+1, regardless if day light savings time is in place.
Has anyone else run into this issue? Is there a work around? Surely the best practice cannot be to alter all the scheduled times after Daylight Savings Time occurs?
Thanks.
Starting with Airflow 1.10, time-zone aware DAGs can be defined using time-zone aware datetime objects to specify start_date. For Airflow to schedule DAG runs always at the same time (regardless of a possible daylight-saving-time switch), use cron expressions to specify schedule_interval. To make Airflow schedule DAG runs with fixed intervals (regardless of a possible daylight-saving-time switch), use datetime.timedelta() to specify schedule_interval.
For example, consider the following code that, first, uses a cron expression to schedule two consecutive DAG runs, and then uses a fixed interval to do the same.
import pendulum
from airflow import DAG
from datetime import datetime, timedelta
START_DATE = datetime(
year=2019,
month=10,
day=25,
hour=8,
minute=0,
tzinfo=pendulum.timezone('Europe/Kiev'),
)
def gen_execution_dates(start_date, schedule_interval):
dag = DAG(
dag_id='id', start_date=start_date, schedule_interval=schedule_interval
)
execution_date = dag.start_date
for i in range(1, 3):
execution_date = dag.following_schedule(execution_date)
print(
f'[Run {i}: Execution Date for "{schedule_interval}"]:',
dag.timezone.convert(execution_date),
)
gen_execution_dates(START_DATE, '0 8 * * *')
gen_execution_dates(START_DATE, timedelta(days=1))
Running the code produces the following output:
[Run 1: Execution Date for "0 8 * * *"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "0 8 * * *"]: 2019-10-27 08:00:00+02:00
[Run 1: Execution Date for "1 day, 0:00:00"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "1 day, 0:00:00"]: 2019-10-27 07:00:00+02:00
For the zone [Europe/Kiev], the daylight saving time of 2019 ends on 2019-10-27 at 03:00:00+03:00. That is, between Run 1 and Run 2 in our example.
The first two output lines show that for the DAG runs scheduled with a cron expression the first run and second run are both scheduled for 08:00 (although, in different timezones: Eastern European Summer Time (EEST) and Eastern European Time (EET) respectively).
The last two output lines show that for the DAG runs scheduled with a fixed interval the first run is scheduled for 08:00 (EEST), and the second run is scheduled exactly 1 day (24 hours) later, which is at 07:00 (EET) due to the daylight-saving-time switch.
The following figure illustrates the example:

Resources