schedule a monthly DAG to run on the next weekday - airflow

I have to schedule a DAG that should run on 15th of every month. However, if 15th falls on a Sunday/Saturday then the DAG should skip weekends and run on coming Monday.
For example, May 15 2021 falls on a Saturday. So, instead of running on 15th of May, the DAG should run on 17th, which is Monday.
Can you please help to schedule it in airflow?
Thanks in advance!

The logic of scheduling is limited by what you can do with single cron expression. So if you can't say it in cron expression you can't provide such scheduling in Airflow. For that reason there is an open airflow improvement proposal AIP-39 Richer scheduler_interval to give more scheduling capabilities.
That said, you can still get the desired functionality by writing some code.
You can set your dag to start on the 15th of each month and then place a sensor that verify that the date is Mon-Fri (if not it will wait):
from airflow.sensors.weekday import DayOfWeekSensor
dag = DAG(
dag_id='work',
schedule_interval='0 0 15 * *',
default_args=default_args,
description='Schedule a Job on 15 of each month',
)
weekend_check = DayOfWeekSensor(
task_id='weekday_check_task',
week_day={'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'},
mode='reschedule',
dag=dag)
op_1 = YourOperator(task_id='op1_task',dag=dag)
weekend_check >> op_1
Note: If you are running airflow<2.0.0 you will need to change the import to:
from airflow.contrib.sensors.weekday_sensor import DayOfWeekSensor

The answer posted by Elad works pretty well. I came up with another solution that works as well.
I scheduled the job to run on 15,16 and 17 of the month. However, I added a condition so that the job runs on the 15th if its a weekday. The job runs on 16th and 17th if its a Monday.
To achieve that, I added a BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
def _conditinal_task_initiator(**kwargs):
execution_date=kwargs['execution_date']
if int(datetime.strftime(execution_date,'%d'))==15 and (execution_date.weekday()<5):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==16 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==17 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
else:
return 'dummy_task_skip_cmo_longit'
with DAG(dag_id='NXS_FM_LOAD_CMO_CHOICE_LONGIT',default_args = default_args, schedule_interval = "0 8 15-17 * *", catchup=False) as dag:
conditinal_task_initiator=BranchPythonOperator(
task_id='cond_task_check_day',
provide_context=True,
python_callable=_conditinal_task_initiator,
do_xcom_push=False)
dummy_task_run_cmo_longit=DummyOperator(
task_id='dummy_task_run_cmo_longit')
dummy_task_skip_cmo_longit=DummyOperator(
task_id='dummy_task_skip_cmo_longit')
conditinal_task_initiator >> [dummy_task_run_cmo_longit,dummy_task_skip_cmo_longit]
dummy_task_run_cmo_longit >> <main tasks for execution>
Using this, the job'll run on 15,16 and 17 of every month. However, it'll run the functional tasks only once every month.

Related

Airflow - How to properly define the time my DAG will execute every day?

I have two dags: The first one extracts data from one database to another. I want it to run everyday at 4 AM and then, that's how I defined my params:
Note: The code has 7am instead of 4 because Airflow is in UTC time and my time is GMT-3.
start_date=datetime(2022, 9, 6),
schedule_interval="0 7 * * *",
catchup=False
) as dag:
But, when I check Airflow's UI, the DAG time is shown like this:
screenshot
I have no idea about or have seen this Data Interval, why is it starting at today 21:00 (9 PM) and why this is the next run for this DAG?
How do I set my DAG to run at the next day (Sep 6, as I'm posting on 5) at 4 AM?
Thank you!

Schedule airflow job bi-weekly

I have a requirement that I want to schedule an airflow job every alternate Friday. However, the problem is I am not able to figure out how to write a schedule for this.
I don't want to have multiple jobs for this.
I tried this
'0 0 1-7,15-21 * 5
However it's not working it's running from 1 to 7 and 15 to 21 everyday.
from shubham's answer I realize that we can have a PythonOperator which can skip the task for us. An I tried to implement the solution. However doesn't seem to work.
As testing this on 2 week period would be too difficult. This is what I did.
I schedule the DAG to run every 5 mins
However, I am writing python operator the skip althernate task (pretty similar to what I am trying to do, alternate friday).
DAG:
args = {
'owner': 'Gaurang Shah',
'retries': 0,
'start_date':airflow.utils.dates.days_ago(1),
}
dag = DAG(
dag_id='test_dag',
default_args=args,
catchup=False,
schedule_interval='*/5 * * * *',
max_active_runs=1
)
dummy_op = DummyOperator(task_id='dummy', dag=dag)
def _check_date(execution_date, **context):
min_date = datetime.now() - relativedelta(minutes=10)
print(context)
print(context.get("prev_execution_date"))
print(execution_date)
print(datetime.now())
print(min_date)
if execution_date > min_date:
raise AirflowSkipException(f"No data available on this execution_date ({execution_date}).")
check_date = PythonOperator(
task_id="check_if_min_date",
python_callable=_check_date,
provide_context=True,
dag=dag,
)
I doubt that a single crontab expression can solve this
Using airflow's tricks, solution is much more straightforward:
schedule your DAG every Friday 0 0 * * FRI and
on alternate Fridays (based on your business logic), skip the DAG by raising AirflowSkipException
Here you'll have to let your DAG begin with a dedicated skip_decider task that will let your DAG run / skip every alternate Friday by
conditionally raising AirflowSkipException (to skip the DAG)
not doing anything to let the DAG run
You can also leverage
ShortCircuitOperator
BranchPythonOperator
but IMO, AirflowSkipException is the cleanest solution
Reference: How to define a DAG that scheduler a monthly job together with a daily job?
Depending on your implementation you can use the hash. Worked in my airflow schedules using version 1.10:
Hash (#)
'#' is allowed for the day-of-week field, and must be followed by a number between one and five. It allows specifying constructs such as "the second Friday" of a given month.[19] For example, entering "5#3" in the day-of-week field corresponds to the third Friday of every month. Reference
you can use timedelta as mentioned below, combine it with start_date to schedule your job bi_weekly.
dag = DAG(
dag_id='test_dag',
default_args=args,
catchup=False,
start_date=datetime(2021, 3, 26),
schedule_interval=timedelta(days=14),
max_active_runs=1
)

Airflow DAG Runs Daily Excluding Weekends?

Is it possible to create an Airflow DAG that runs every day except for Saturday and Sunday? It doesn't seem like this is possible since you only have a start_date and a schedule_interval.
I'm setting up a workflow that will process a batch of files every morning. The files won't be present on weekends though only Monday through Friday. I could simply use a timeout setting of 24 hours which essentially makes Saturday and Sunday timeout because the file would never appear on those days but this would mark the DAG as failed for those two days and that would be very pleasant.
'schedule_interval': '0 0 * * 1-5' runs at 00:00 on every day-of-week from Monday through Friday.
The answer from Zack already has a weekdays cron schedule that'll do what you want (0 0 * * 1-5), but I wanted to add an answer with a site for examples of common cron schedules, ahem, crontab expressions.
I use this with Airflow a lot to come up with a DAG's schedule_interval.
The main app to help you design a cron schedule interactively is at crontab.guru.
An example only on weekdays schedule - https://crontab.guru/every-weekday
More common examples (e.g., every half hour, every quarter, etc) - https://crontab.guru/examples.html
I had a similar need and ended up putting this at the beginning of my dags -- it's similar to the ShortCircuitOperator.
import logging
from airflow.models import SkipMixin, BaseOperator
from airflow.utils.decorators import apply_defaults
class pull_file_decision_operator(BaseOperator, SkipMixin):
template_fields = ('execution_date',)
#apply_defaults
def __init__(self,
day_of_week,
hour_of_day,
execution_date,
**kwargs):
self.day_of_week = day_of_week
self.hour_of_week = hour_of_day
self.execution_date = execution_date
def execute(self, context):
# https://docs.python.org/3/library/datetime.html#datetime.date.weekday
run_dt = self.execution_date
dow = self.day_of_week
hod = self.hour_of_day
if run_dt.weekday() == dow and run_dt.hour == hod:
return True
else:
downstream_tasks = context['task'].get_flat_relatives(upstream=False)
logging.info('Skipping downstream tasks...')
logging.info("Downstream task_ids %s", downstream_tasks)
if downstream_tasks:
self.skip(context['dag_run'],
context['ti'].execution_date,
downstream_tasks)

Airflow : dag run with execution_date = trigger_date = fixed_schedule

in airflow, I would like to run a dag each monday at 8am (the execution_date should be of course "current day monday 8 am"). The relevant parameters to set up for this workflow are :
start_date : "2018-03-19"
schedule_interval : "0 8 * * MON"
I expect to see a dag run every monday at 8am . The first one being run the 19-03-2018 at 8 am with execution_date = 2018-03-19-08-00-00 and so on each monday.
However it's not what happens : the dag is not started on 19/03/18 at 8 am. The real behaviour is explained here for exemple : https://stackoverflow.com/a/39620901/1510109 or https://stackoverflow.com/a/48213964/1510109
The behaviour is : at each end of the interval ( weekly in my case) the dag is run with execution_date = beginning of the interval (i.e the previous week). This behavour is apparently motivated by an "ETL way of thinking" (see the link above). But it's absolutely not what I want.
How what can I achieve to run my dag each monday at 08:00am with execution_date = trigger_date = now ( = current monday 8am) ?
Thanks
Take a quick look at my answer with start times and execution_date examples.
You want to run every Monday at 8am.
So this part is going to stay the same:
schedule_interval: '0 8 * * MON',
You want it to run it's first run on 2018-03-19, since the first run occurs at the end of the first full schedule period after the start date, you should change your start date to:
start_date: datetime(2018,03,12),
You will have to live with the fact that Airflow will name your DagRuns with the start of each period and pass in macros based on the execution_date set to the start of the interval period. Adjust your logic accordingly.
Your first run will start after 2018-03-19T08:00:00.0Z and the execution_date, every other macro that depends on it, and name of the DagRun will be 2018-03-12T08:00:00.0Z
So long as you understand what to expect from the execution_date and you don't try to base your time off of datetime.now() your DAGs will be able to be idempotent in operation. Feel free to make a new variable like my_execution_date = execution_date + datetime.timedelta(7) within any PythonOperator or custom operator (you get execution_date from the context of the task), use template statements like {{ (execution_date + macros.timedelta(7)).strftime('%Y%m%d') }} or {{ macros.ds_add(ds, 7) }}, or use the next_execution_date.
You can even add a dag level user_defined_macros like {'dt':lambda d: d+datetime.timedelta(days=7)} to enable {{ dt(execution_date) }}. And recently user_defined_filters were added like {'dt':lambda d: d+datetime.timedelta(days=7)} enabling {{ execution_date | dt }}. The next_ds and next_execution_date would be easier for your purposes.
While thinking about templating, you may as well read up on the built-in stuff out there: http://jinja.pocoo.org/docs/2.10/templates/#builtin-filters
That is how airflow behaves, it always runs when the duration is completed. Detailed behavior here and airflow faq.
But in order to somehow make it run for current week, what we can do is manipulate execution_date of DAG. That may be in form of adding 7 days to a datetime object (if weekly schedule) or may use {{ next_execution_date }} macro.
Agreed that this is only possible if somehow dates are used in your DAG or dependencies are triggered by it.
Just to be clear again, DAG is still running as per its normal behavior. Only thing what we trying to do is manipulate date in program/DAG.
args = { ....
'start_date': datetime.datetime(2018,3,18)
}
dag = DAG(...
schedule_interval = "#weekly"
)
# DAG would run on 3/25/2018 for week of 18th March
# but lets say we manipulate here
# {{ next_execution_date }} macro
# or add 7 days
# So basically we are running with date 3/25/2018 instead of 3/18/2018 for the week of 18th March
For me I solved it in this way:
{{ ds if dag_run.external_trigger or dag_run.is_backfill else macros.ds_add(ds, 1) }}
If DAG was run by external trigger we shouldn't change ds.
If DAG was run by backfilling we shouldn't change ds.
If DAG was scheduled we use macros to increment it by one day.

Airflow: re execute the jobs of a DAG for the past n days on a daily basis

I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However each day I would like to re-execute not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, in the next execution scheduled to run on "2018-01-30" I would like Airflow not only to run the DAG using as execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can generate dynamically tasks in a loop and pass the offset to your operator.
Here is an example with the Python one.
import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import timedelta
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
'schedule_interval': '0 10 * * *'
}
def check_trigger(execution_date, day_offset, **kwargs):
target_date = execution_date - timedelta(days=day_offset)
# use target_date
for day_offset in xrange(1, 8):
PythonOperator(
task_id='task_offset_' + i,
python_callable=check_trigger,
provide_context=True,
dag=dag,
op_kwargs={'day_offset' : day_offset}
)
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.

Resources