How to use External Task Sensor with negative timedelta - airflow

I have two dags (both scheduled):
DAG A runs at 1 AM
DAG B runs at 3 AM
How can I sense DAG B from DAG A? I have the following ExternalTaskSensor:
sensor = ExternalTaskSensor(
    task_id='sensor',
    external_dag_id='dag',
    external_task_id='task',
    retries=10,
    execution_delta=timedelta(minutes=180),
    poke_interval=300,
    timeout=600,
    check_existence=True
)
I already use this sensor in the inverse situation, when DAG A runs after DAG B.
Basically, I have two questions:
Should I use a negative timedelta, since DAG B runs at 3 AM (after DAG A)?
Which date does Airflow apply the timedelta to? I have a dag that runs on a 2-hour interval (start_interval 2 AM, end_interval 4 AM). Does Airflow use start_interval, end_interval, or scheduled_id_run?
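(For illustration only, not a confirmed answer: execution_delta is documented as the time difference with the previous execution to look at, roughly this dag's logical date minus the external dag's logical date, so a sensor in a 1 AM dag waiting for a same-day 3 AM dag would use a negative value. The ids below are hypothetical, and the import path assumes Airflow 2.)
from datetime import timedelta
from airflow.sensors.external_task import ExternalTaskSensor

# hedged sketch: DAG A (1 AM) waits for DAG B (3 AM the same day);
# 1 AM minus 3 AM is negative two hours
sensor = ExternalTaskSensor(
    task_id='sensor',
    external_dag_id='dag_b',              # hypothetical id
    external_task_id='task',
    execution_delta=timedelta(hours=-2),  # negative: the external run is later
    poke_interval=300,
    timeout=600,
    check_existence=True,
)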

Related

modify dag runs when triggered

I was curious if there's a way to customise the dag runs.
I'm currently checking for updates to another table, which gets updated manually by someone; once it has been updated, I run my dag for the month.
At the moment I have a branch operator that compares the dates of the 2 tables, but is there a way to run the dag (compare the two dates) every day until there is a change, and then not run it for the remainder of the month?
For example,
Table A (which is updated manually) has YYYYMM 202209, and Table B also has YYYYMM 202209.
At the moment, my branch operator compares the two YYYYMM values and points to a dummy end operator when they are the same. However, when Table A is updated to 202210, the two YYYYMM values differ, so another task runs and overwrites Table B.
It all works, but the dag runs every day even though Table A only gets updated once a month, at a random point within the month. So is there a way to stop the dag from being triggered for the remaining days of the month after the update task has run?
Hope this is clear.
If your data were stored on S3, there would be an easy solution starting from version 2.4: Data-aware scheduling.
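For completeness, a minimal sketch of what that 2.4+ feature looks like (the dag ids and the dataset URI below are hypothetical):
from datetime import datetime
from airflow import DAG, Dataset
from airflow.operators.empty import EmptyOperator

table_a = Dataset("s3://bucket/table_a")  # hypothetical URI

# producer: a task that lists the dataset in outlets marks it as updated
with DAG(dag_id="producer", start_date=datetime(2022, 10, 1), schedule="@daily") as producer:
    EmptyOperator(task_id="update_table_a", outlets=[table_a])

# consumer: scheduled on the dataset instead of on time
with DAG(dag_id="consumer", start_date=datetime(2022, 10, 1), schedule=[table_a]) as consumer:
    EmptyOperator(task_id="overwrite_table_b")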
But you're probably not, so here is another option.
A dag in Airflow is a DAG object assigned to the global scope of a module. This allows dags to be created dynamically, because each dag file is re-parsed on a certain interval. A very good description with examples is here.
The second thing you need is Airflow Variables.
So the concept is as follows:
Create a variable in Airflow named dag_run that will hold the month in which the dag last ran successfully.
Create a python file with a function that creates a DAG object based on input parameters.
In the same file, use conditional statements that set the 'schedule' param differently depending on whether the dag has already run for the current month.
In your dag, in the branch that executes when the data has changed, set the dag_run variable to the current month's value, like so: Variable.set(key='dag_run', value=datetime.now().month) (a sketch of this step follows the code below).
Python code:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from airflow.models import Variable

# function that creates a dag based on input
def create_dag(dag_id, schedule, default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_id)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag

# run some checks
current_month = datetime.now().month
dag_run_month = int(Variable.get('dag_run'))

if current_month == dag_run_month:
    # keep the schedule off
    schedule = None
    dag_id = "Database_insync"
elif current_month != dag_run_month:
    # keep the schedule on
    schedule = "30 * * * *"
    dag_id = "Database_notsynced"

# watch out for start_date: if you leave it in the past,
# airflow will execute the missed past schedules
default_args = {'owner': 'airflow',
                'start_date': datetime.now() - timedelta(minutes=15)}

globals()[dag_id] = create_dag(dag_id, schedule, default_args)
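And for step 4, a minimal sketch (the function name is hypothetical) of what the branch that runs when the data has changed should do, so that the next parse of the file above turns the schedule off:
from datetime import datetime
from airflow.models import Variable

def mark_month_done():
    # call this from the task in the changed-data branch of your dag,
    # e.g. via a PythonOperator
    Variable.set(key='dag_run', value=datetime.now().month)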

Airflow - How to properly define the time my DAG will execute every day?

I have two dags: the first one extracts data from one database to another. I want it to run every day at 4 AM, so this is how I defined my params:
Note: the code has 7 AM instead of 4 because Airflow is in UTC and my timezone is GMT-3.
    start_date=datetime(2022, 9, 6),
    schedule_interval="0 7 * * *",
    catchup=False
) as dag:
But when I check Airflow's UI, the DAG time is shown like this:
[screenshot]
I had never seen this Data Interval before. Why does it start today at 21:00 (9 PM), and why is that the next run for this DAG?
How do I set my DAG to run the next day (Sep 6, as I'm posting on the 5th) at 4 AM?
Thank you!

Schedule airflow job bi-weekly

I have a requirement to schedule an airflow job every other Friday. However, I am not able to figure out how to write a schedule for this.
I don't want to have multiple jobs for this.
I tried this:
'0 0 1-7,15-21 * 5'
However, it's not working: it runs every day from the 1st to the 7th and from the 15th to the 21st.
From shubham's answer I realized that we can have a PythonOperator that skips the task for us, and I tried to implement that solution. However, it doesn't seem to work.
Since testing this over a 2-week period would be too difficult, this is what I did:
I scheduled the DAG to run every 5 mins,
and wrote a python operator that skips every alternate task (pretty similar to what I am trying to do with alternate Fridays).
DAG:
from airflow.models import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from datetime import datetime
from dateutil.relativedelta import relativedelta

args = {
    'owner': 'Gaurang Shah',
    'retries': 0,
    'start_date': days_ago(1),
}

dag = DAG(
    dag_id='test_dag',
    default_args=args,
    catchup=False,
    schedule_interval='*/5 * * * *',
    max_active_runs=1
)

dummy_op = DummyOperator(task_id='dummy', dag=dag)

def _check_date(execution_date, **context):
    min_date = datetime.now() - relativedelta(minutes=10)
    print(context)
    print(context.get("prev_execution_date"))
    print(execution_date)
    print(datetime.now())
    print(min_date)
    if execution_date > min_date:
        raise AirflowSkipException(f"No data available on this execution_date ({execution_date}).")

check_date = PythonOperator(
    task_id="check_if_min_date",
    python_callable=_check_date,
    provide_context=True,
    dag=dag,
)
I doubt that a single crontab expression can solve this.
Using Airflow's tricks, the solution is much more straightforward:
schedule your DAG every Friday 0 0 * * FRI, and
on alternate Fridays (based on your business logic), skip the DAG by raising AirflowSkipException
Here you'll have to let your DAG begin with a dedicated skip_decider task that will make your DAG run or skip on alternate Fridays by
conditionally raising AirflowSkipException (to skip the DAG)
or doing nothing, to let the DAG run (a sketch follows below)
You can also leverage
ShortCircuitOperator
BranchPythonOperator
but IMO, AirflowSkipException is the cleanest solution
Reference: How to define a DAG that scheduler a monthly job together with a daily job?
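A minimal sketch of that pattern (the dag id and the even/odd ISO-week rule are assumptions; substitute your own business logic):
from datetime import datetime
from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def _skip_decider(execution_date, **_):
    # run only on even ISO weeks; skip the whole run otherwise
    if execution_date.isocalendar()[1] % 2 != 0:
        raise AirflowSkipException("Odd ISO week: skipping this Friday.")

with DAG(
    dag_id="biweekly_dag",            # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 0 * * FRI",  # every Friday at midnight
    catchup=False,
) as dag:
    skip_decider = PythonOperator(
        task_id="skip_decider",
        python_callable=_skip_decider,
    )
    main_task = DummyOperator(task_id="main_task")
    # with the default trigger rule, main_task is skipped when the decider skips
    skip_decider >> main_task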
Depending on your implementation, you can use the hash. This worked in my airflow schedules using version 1.10:
Hash (#)
'#' is allowed for the day-of-week field, and must be followed by a number between one and five. It allows specifying constructs such as "the second Friday" of a given month.[19] For example, entering "5#3" in the day-of-week field corresponds to the third Friday of every month. Reference
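A short sketch of that schedule (the dag id is hypothetical; this assumes the installed cron parser supports the '#' extension, as the answer reports for 1.10):
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="second_friday_dag",       # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 0 * * 5#2",  # second Friday of every month
    catchup=False,
)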
You can use timedelta as shown below; combine it with start_date to schedule your job bi-weekly.
dag = DAG(
    dag_id='test_dag',
    default_args=args,
    catchup=False,
    start_date=datetime(2021, 3, 26),
    schedule_interval=timedelta(days=14),
    max_active_runs=1
)

How to consider daylight savings time when using cron schedule in Airflow

In Airflow, I'd like a job to run at a specific time each day in a non-UTC timezone. How can I go about scheduling this?
The problem is that once daylight savings time is triggered, my job will either be running an hour too soon or an hour too late. In the Airflow docs, it seems like this is a known issue:
In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then ignore day light savings time. Thus, if you have a schedule that says run at end of interval every day at 08:00 GMT+1 it will always run end of interval 08:00 GMT+1, regardless if day light savings time is in place.
Has anyone else run into this issue? Is there a workaround? Surely the best practice cannot be to alter all the scheduled times after daylight savings time occurs?
Thanks.
Starting with Airflow 1.10, time-zone aware DAGs can be defined using time-zone aware datetime objects to specify start_date. For Airflow to schedule DAG runs always at the same time (regardless of a possible daylight-saving-time switch), use cron expressions to specify schedule_interval. To make Airflow schedule DAG runs with fixed intervals (regardless of a possible daylight-saving-time switch), use datetime.timedelta() to specify schedule_interval.
For example, consider the following code that, first, uses a cron expression to schedule two consecutive DAG runs, and then uses a fixed interval to do the same.
import pendulum
from airflow import DAG
from datetime import datetime, timedelta

START_DATE = datetime(
    year=2019,
    month=10,
    day=25,
    hour=8,
    minute=0,
    tzinfo=pendulum.timezone('Europe/Kiev'),
)

def gen_execution_dates(start_date, schedule_interval):
    dag = DAG(
        dag_id='id', start_date=start_date, schedule_interval=schedule_interval
    )
    execution_date = dag.start_date
    for i in range(1, 3):
        execution_date = dag.following_schedule(execution_date)
        print(
            f'[Run {i}: Execution Date for "{schedule_interval}"]:',
            dag.timezone.convert(execution_date),
        )

gen_execution_dates(START_DATE, '0 8 * * *')
gen_execution_dates(START_DATE, timedelta(days=1))
Running the code produces the following output:
[Run 1: Execution Date for "0 8 * * *"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "0 8 * * *"]: 2019-10-27 08:00:00+02:00
[Run 1: Execution Date for "1 day, 0:00:00"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "1 day, 0:00:00"]: 2019-10-27 07:00:00+02:00
For the zone [Europe/Kiev], the daylight saving time of 2019 ends on 2019-10-27 at 03:00:00+03:00. That is, between Run 1 and Run 2 in our example.
The first two output lines show that for the DAG runs scheduled with a cron expression the first run and second run are both scheduled for 08:00 (although, in different timezones: Eastern European Summer Time (EEST) and Eastern European Time (EET) respectively).
The last two output lines show that for the DAG runs scheduled with a fixed interval the first run is scheduled for 08:00 (EEST), and the second run is scheduled exactly 1 day (24 hours) later, which is at 07:00 (EET) due to the daylight-saving-time switch.
The following figure illustrates the example:

Airflow: re execute the jobs of a DAG for the past n days on a daily basis

I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However, each day I would like to re-execute it not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, for the next execution scheduled on "2018-01-30", I would like Airflow not only to run the DAG with execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can dynamically generate tasks in a loop and pass the offset to your operator.
Here is an example with the Python one.
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

args = {
    'owner': 'airflow',
    'start_date': days_ago(2),
}

# the DAG object was implicit in the original snippet; the dag_id is illustrative
dag = DAG(
    dag_id='rerun_past_days',
    default_args=args,
    schedule_interval='0 10 * * *',  # a DAG-level argument, not a task default
)

def check_trigger(execution_date, day_offset, **kwargs):
    target_date = execution_date - timedelta(days=day_offset)
    # use target_date to (re)process that day's data

# one task per offset, covering the previous 7 days
for day_offset in range(1, 8):
    PythonOperator(
        task_id='task_offset_' + str(day_offset),
        python_callable=check_trigger,
        provide_context=True,
        dag=dag,
        op_kwargs={'day_offset': day_offset},
    )
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and the history cleaner. I believe trying to backfill already-executed tasks would involve deleting task instances or setting their states all to NONE, and then you'd still have to trigger a backfill on those dag runs. It'll be harder to track when things fail and just seems a bit messier.
