Airflow Scheduler is slow by 1 interval - airflow

I have a couple of schedules that are slow by one interval. My configuration looks like
args = {
'owner' : 'test',
'start_date' : datetime.now(),
'email' : ['a#b.com'],
'email_on_failure': True,
'email_on_retry' : True,
'retries' : 3,
'retry_delay' : timedelta(seconds=30)
}
dag = DAG(
dag_id='feed_response', default_args=args,
concurrency=4,
schedule_interval='0 2 * * 6',
dagrun_timeout=timedelta(minutes=20)
)
This schedule should have run an instance for last Saturday. It ran for the previous Saturday. I've noticed this behavior in a couple of our jobs. Is there a reason why the scheduler seems to lag by one interval behind?

This behavior is described on the airflow wiki in the "Common Pitfalls" section (https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls):
Understanding the execution date: Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.

Airflow documentation doesn't recommend using dynamic values for start_date specially datetime.now()
https://airflow.incubator.apache.org/faq.html#what-s-the-deal-with-start-date

Related

schedule a monthly DAG to run on the next weekday

I have to schedule a DAG that should run on 15th of every month. However, if 15th falls on a Sunday/Saturday then the DAG should skip weekends and run on coming Monday.
For example, May 15 2021 falls on a Saturday. So, instead of running on 15th of May, the DAG should run on 17th, which is Monday.
Can you please help to schedule it in airflow?
Thanks in advance!
The logic of scheduling is limited by what you can do with single cron expression. So if you can't say it in cron expression you can't provide such scheduling in Airflow. For that reason there is an open airflow improvement proposal AIP-39 Richer scheduler_interval to give more scheduling capabilities.
That said, you can still get the desired functionality by writing some code.
You can set your dag to start on the 15th of each month and then place a sensor that verify that the date is Mon-Fri (if not it will wait):
from airflow.sensors.weekday import DayOfWeekSensor
dag = DAG(
dag_id='work',
schedule_interval='0 0 15 * *',
default_args=default_args,
description='Schedule a Job on 15 of each month',
)
weekend_check = DayOfWeekSensor(
task_id='weekday_check_task',
week_day={'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'},
mode='reschedule',
dag=dag)
op_1 = YourOperator(task_id='op1_task',dag=dag)
weekend_check >> op_1
Note: If you are running airflow<2.0.0 you will need to change the import to:
from airflow.contrib.sensors.weekday_sensor import DayOfWeekSensor
The answer posted by Elad works pretty well. I came up with another solution that works as well.
I scheduled the job to run on 15,16 and 17 of the month. However, I added a condition so that the job runs on the 15th if its a weekday. The job runs on 16th and 17th if its a Monday.
To achieve that, I added a BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
def _conditinal_task_initiator(**kwargs):
execution_date=kwargs['execution_date']
if int(datetime.strftime(execution_date,'%d'))==15 and (execution_date.weekday()<5):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==16 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==17 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
else:
return 'dummy_task_skip_cmo_longit'
with DAG(dag_id='NXS_FM_LOAD_CMO_CHOICE_LONGIT',default_args = default_args, schedule_interval = "0 8 15-17 * *", catchup=False) as dag:
conditinal_task_initiator=BranchPythonOperator(
task_id='cond_task_check_day',
provide_context=True,
python_callable=_conditinal_task_initiator,
do_xcom_push=False)
dummy_task_run_cmo_longit=DummyOperator(
task_id='dummy_task_run_cmo_longit')
dummy_task_skip_cmo_longit=DummyOperator(
task_id='dummy_task_skip_cmo_longit')
conditinal_task_initiator >> [dummy_task_run_cmo_longit,dummy_task_skip_cmo_longit]
dummy_task_run_cmo_longit >> <main tasks for execution>
Using this, the job'll run on 15,16 and 17 of every month. However, it'll run the functional tasks only once every month.

Schedule airflow job bi-weekly

I have a requirement that I want to schedule an airflow job every alternate Friday. However, the problem is I am not able to figure out how to write a schedule for this.
I don't want to have multiple jobs for this.
I tried this
'0 0 1-7,15-21 * 5
However it's not working it's running from 1 to 7 and 15 to 21 everyday.
from shubham's answer I realize that we can have a PythonOperator which can skip the task for us. An I tried to implement the solution. However doesn't seem to work.
As testing this on 2 week period would be too difficult. This is what I did.
I schedule the DAG to run every 5 mins
However, I am writing python operator the skip althernate task (pretty similar to what I am trying to do, alternate friday).
DAG:
args = {
'owner': 'Gaurang Shah',
'retries': 0,
'start_date':airflow.utils.dates.days_ago(1),
}
dag = DAG(
dag_id='test_dag',
default_args=args,
catchup=False,
schedule_interval='*/5 * * * *',
max_active_runs=1
)
dummy_op = DummyOperator(task_id='dummy', dag=dag)
def _check_date(execution_date, **context):
min_date = datetime.now() - relativedelta(minutes=10)
print(context)
print(context.get("prev_execution_date"))
print(execution_date)
print(datetime.now())
print(min_date)
if execution_date > min_date:
raise AirflowSkipException(f"No data available on this execution_date ({execution_date}).")
check_date = PythonOperator(
task_id="check_if_min_date",
python_callable=_check_date,
provide_context=True,
dag=dag,
)
I doubt that a single crontab expression can solve this
Using airflow's tricks, solution is much more straightforward:
schedule your DAG every Friday 0 0 * * FRI and
on alternate Fridays (based on your business logic), skip the DAG by raising AirflowSkipException
Here you'll have to let your DAG begin with a dedicated skip_decider task that will let your DAG run / skip every alternate Friday by
conditionally raising AirflowSkipException (to skip the DAG)
not doing anything to let the DAG run
You can also leverage
ShortCircuitOperator
BranchPythonOperator
but IMO, AirflowSkipException is the cleanest solution
Reference: How to define a DAG that scheduler a monthly job together with a daily job?
Depending on your implementation you can use the hash. Worked in my airflow schedules using version 1.10:
Hash (#)
'#' is allowed for the day-of-week field, and must be followed by a number between one and five. It allows specifying constructs such as "the second Friday" of a given month.[19] For example, entering "5#3" in the day-of-week field corresponds to the third Friday of every month. Reference
you can use timedelta as mentioned below, combine it with start_date to schedule your job bi_weekly.
dag = DAG(
dag_id='test_dag',
default_args=args,
catchup=False,
start_date=datetime(2021, 3, 26),
schedule_interval=timedelta(days=14),
max_active_runs=1
)

Airflow: Rerunning DAG can't load XCOMs from previous run

Is there a way to persist an XCOM value during re-runs of a DAG step (after clearing the status)?
Below is a simplified version of what I'm trying to accomplish, namely when a DAG step status is cleared and the step re-run, I would like to be able to load the XCOM value pushed on the previous run. However, even though I can see the value in the XCOM interface, the value does not get pulled. I've looked through the source code for the pull_xcom() method but can't figure out where it is being filtered out.
The functionality I'm trying to achieve is to maintain some amount of state between failed runs of a DAG. In the example, this would mean that 1 is added to the stored value every time the DAG step is cleared and rerun.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
def test_step(**kwargs):
ti = kwargs.get('task_instance')
value = ti.xcom_pull(key='key', include_prior_dates=True)
if value is None:
value = 0
print(f'BEFORE VALUE: {value}')
value += 1
print(f'AFTER VALUE: {value}')
ti.xcom_push(key='key', value=value)
# Simulating a failure
raise Exception
default_args = {
'owner': 'Testing',
'depends_on_past': False,
'email': ['test#test.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
}
dag = DAG(
'test_dag',
default_args=default_args,
schedule_interval=None,
start_date=datetime(2020, 4, 9),
)
t1 = PythonOperator(
task_id='test_step',
provide_context=True,
python_callable=test_step,
dag=dag,
)
t1
Anytime a task is about to run, its XCom is cleared for the current execution date (https://github.com/apache/airflow/blob/1.10.10/airflow/models/taskinstance.py#L960). This is why you won't ever pull values from previous task tries. Use of include_prior_dates=True only pulls from previous execution dates, but not previous runs of the same execution date.
One possible solution is to put a DummyOperator task upstream of your test_step task, called say xcom_store.test_step. Then use airflow.models.XCom.set() directly in test_step to your XCom values into the xcom_store.test_step task (reference xcom_push() as an example). When you need to pull, just pull as you usually would with but from the dummy task instead, i.e. ti.xcom_pull(task_ids='xcom_store.test_step', key='key'). Definitely not ideal and could lead to some confusion, but if you standardize it and build some helpers around it, it could be alright?

Airflow: re execute the jobs of a DAG for the past n days on a daily basis

I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However each day I would like to re-execute not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, in the next execution scheduled to run on "2018-01-30" I would like Airflow not only to run the DAG using as execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can generate dynamically tasks in a loop and pass the offset to your operator.
Here is an example with the Python one.
import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import timedelta
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
'schedule_interval': '0 10 * * *'
}
def check_trigger(execution_date, day_offset, **kwargs):
target_date = execution_date - timedelta(days=day_offset)
# use target_date
for day_offset in xrange(1, 8):
PythonOperator(
task_id='task_offset_' + i,
python_callable=check_trigger,
provide_context=True,
dag=dag,
op_kwargs={'day_offset' : day_offset}
)
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.

How can I retrieve the 'scheduled time' for catchup jobs in Airflow?

When building an Airflow dag, I typically specify a simple schedule to run periodically - I expect this is the most common use.
dag = DAG('my_dag',
description='this is what it does',
schedule_interval='0 12 * * *',
start_date=datetime(2017, 10, 1),
catchup=False)
I then need to use the 'date' as a parameter in my actual process, so I just check the current date.
date = datetime.date.today()
# do some date-sensitive stuff
operator = MyOperator(..., params=[date, ...])
My understanding is that setting catchup=True will have Airflow schedule my dag for every schedule interval between start_date and now (or end_date); e.g. every day.
How do I get the scheduled_date for use within my dag instance?
I think you mean execution date here, You can use Macros inside your operators, more detail can be found here: https://airflow.apache.org/code.html#macros. So airflow will respect it so you don't need to have your date been generated dynamically
Inside of Operator, you can call {{ ds }} in a str directly
Outside of Operator, for example PythonOperator, you will need provide_context=True first then to pass **kwargs as last arguments to your function then you can call kwargs['ds']

Resources