Airflow schedule not updating - airflow

I created a DAG that will run on a weekly basis. Below is what I tried and it's working as expected.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)
default_args = {
    'depends_on_past': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=2),
    'wait_for_downstream': True,
    'provide_context': True,
    'start_date': datetime(2020, 12, 20, hour=00, minute=00, second=00)
}
with DAG("DAG", default_args=default_args, schedule_interval=SCHEDULE_INTERVAL, catchup=True) as dag:
    t1 = BashOperator(
        task_id='dag_schedule',
        bash_command='echo DAG',
        dag=dag)
As per the schedule, it ran on the 27th (start date 20 in the script). Because of a change in requirements, I updated the start date so that the first run happens on the 30th instead of the 27th (start date 23 in the script); my idea is to start the schedule on the 30th and run every week from then on. After I changed the start date, the DAG did not pick up the new start date, and I'm not sure why. When I deleted the DAG (it is a test DAG, so I could delete it; in prod I can't) and created a new DAG with the same name and the new start date, it ran as per the schedule.

As per the Airflow docs:
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
So if we change the start date, we need to either change the DAG name or delete the existing DAG so that it is recreated under the same name (deleting it removes the previous DAG's records from the metadata database).
Source

Your DAG as you defined it will be triggered on 6-Jan-2021.
Airflow schedules tasks at the END of the interval (see doc reference).
So per your settings:
SCHEDULE_INTERVAL = timedelta(weeks=1, seconds=00, minutes=00, hours=00)
and
'start_date': datetime(2020, 12 , 30, hour=00, minute=00, second=00)
This means the first run will be on 6-Jan-2021, because 30-Dec-2020 + 1 week = 6-Jan-2021. Note that the execution_date of this run will be 2020-12-30.
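The arithmetic above can be checked with plain Python. This is only a minimal sketch of the "run at the end of the interval" rule, not Airflow code; the helper name first_run is hypothetical.

```python
from datetime import datetime, timedelta


def first_run(start_date, interval):
    """Return (execution_date, trigger_time) for the first scheduled run.

    Simplified model: the first run is stamped with start_date and is
    triggered once one full interval has passed.
    """
    execution_date = start_date
    trigger_time = start_date + interval
    return execution_date, trigger_time


execution_date, trigger_time = first_run(datetime(2020, 12, 30), timedelta(weeks=1))
print("execution_date:", execution_date.date())
print("triggered on:  ", trigger_time.date())
```

The same helper can be called with any other start date and interval to see when the first run would be triggered under this model.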

Related

How to configure Apache Airflow start_date and schedule_interval to run daily at 7am?

I'm using Airflow 2.3.3 (through GCP Composer).
I pass this yaml configuration when deploying my DAG:
dag_args:
dag_id: FTP_DAILY
default_args:
owner: 'Dev team'
start_date: "00:00:00"
max_active_runs: 1
retries: 2
schedule_interval: "0 7 * * *"
ftp_conn_id: 'ftp_dev'
I want this DAG to run at 7am UTC every morning, but it's not running. In the UI it says next run: 2022-11-22, 07:00:00 (as of Nov 22nd) and it never runs. How should I configure my start_date and schedule_interval so that the DAG runs at 7am UTC every day, starting from the nearest 7am after the deployment?
You can pass default args directly in the Python DAG code and calculate yesterday's date, for example:
from datetime import timedelta
import airflow
from airflow.utils.dates import days_ago
dag_default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'start_date': days_ago(1)
}
Then in the DAG:
with airflow.DAG(
        "dag_name",
        default_args=dag_default_args,
        schedule_interval="0 7 * * *") as dag:
    ......
In this case the schedule_interval cron expression will work correctly; Airflow bases the cron schedule on the start date.
The main concept of Airflow is that the execution of a DAG run starts after the interval it covers has passed. If you schedule a DAG with the above setup, Airflow will parse
interval_start_date as 2022-11-22 07:00:00
and interval_end_date as 2022-11-23 07:00:00
As you are asking Airflow to process data from this interval, it will wait for the interval to pass, and so it starts execution on 23rd November at 7am.
If you want it to trigger immediately after you deploy the DAG, you need to move the start date back by one day. You might also need to set the catchup flag to True.
import pendulum
from airflow import DAG
with DAG(
    dag_id='new_workflow4',
    schedule_interval="0 7 * * *",
    start_date=pendulum.datetime(2022, 11, 21, hour=0, tz="UTC"),
    catchup=True
) as dag:
    ...  # tasks go here
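To see why moving the start date back by one day helps, here is a small sketch in plain Python. It is a simplified model of a daily cron schedule such as "0 7 * * *" and does not call Airflow; the function name first_daily_interval is hypothetical.

```python
from datetime import datetime, timedelta


def first_daily_interval(start_date, hour):
    """Return (interval_start, interval_end) of the first daily interval.

    Simplified model: the first interval starts at the first hh:00 tick at
    or after start_date, and the run is triggered at interval_end.
    """
    interval_start = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if interval_start < start_date:
        interval_start += timedelta(days=1)
    return interval_start, interval_start + timedelta(days=1)


# start_date on the deployment day: the first run waits until the next day
late = first_daily_interval(datetime(2022, 11, 22), hour=7)
# start_date one day earlier: the first run is due on the deployment day at 07:00
early = first_daily_interval(datetime(2022, 11, 21), hour=7)
print("start 11-22 -> triggered at", late[1])
print("start 11-21 -> triggered at", early[1])
```

Under this model the run is always triggered at the end of its interval, so a start date of 2022-11-21 makes the first run due at 07:00 on 2022-11-22.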

Airflow not running on scheduled interval

My Airflow webserver is up and running, and other jobs are running as scheduled.
I added a new DAG to be executed every 5 minutes.
Once it was added, I ran it manually the first time and it completed. However, after that it is not running again
every 5 minutes.
The DAG code is below:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
current_date = datetime.now()
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 6, 11, current_date.hour, current_date.minute),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}
dag = DAG("Incremental", default_args=default_args, schedule_interval='*/5 * * * *')
Suggestion please
Based on the Airflow docs: note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
In your case, suppose you have a start date of 2019-01-01 00:00:00 and a 5-minute interval. You might expect the run stamped 2019-01-01 00:05:00 to start at that time, but it starts only after 2019-01-01 00:10:00, because Airflow waits for the 5-minute interval to finish (this is how I imagine it). Hope this helps. :)
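The relation between a run's stamp and the time it can actually start can be listed with a few lines of plain Python. This is a sketch of the rule quoted from the docs, assuming a start date aligned to the schedule; the generator name run_schedule is hypothetical.

```python
from datetime import datetime, timedelta


def run_schedule(start_date, interval, count):
    """Yield (run_stamp, earliest_start) pairs for the first `count` runs.

    Simplified model: each run starts only once the interval it covers
    has ended, i.e. at run_stamp + interval.
    """
    for i in range(count):
        stamp = start_date + i * interval
        yield stamp, stamp + interval


runs = list(run_schedule(datetime(2019, 1, 1), timedelta(minutes=5), 3))
for stamp, earliest_start in runs:
    print("run stamped", stamp.time(), "starts after", earliest_start.time())
```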

Run only the latest Airflow DAG

Let's say I would like to run a pretty simple ETL DAG with Airflow:
it checks the last insert time in DB2, and it loads newer rows from DB1 to DB2 if any.
There are some understandable requirements:
It is scheduled hourly, and the first few runs will last more than 1 hour,
e.g. the first run should process a month of data, and it lasts for 72 hours,
so the second run should process the last 72 hours, and it lasts 7.2 hours,
the third processes 7.2 hours and finishes within an hour,
and from then on it runs hourly.
While the DAG is running, don't start the next one; skip it instead.
If the time has passed the trigger event and the DAG didn't start, don't start it subsequently.
There are other DAGs as well; the DAGs should be executed independently.
I've found these parameters and operators a little confusing. What are the distinctions between them?
depends_on_past
catchup
backfill
LatestOnlyOperator
Which one should I use, and which LocalExecutor?
P.S. There's already a very similar thread, but it isn't exhaustive.
DAG max_active_runs = 1 combined with catchup = False would solve this.
This one satisfies my requirements. The DAG runs every minute, and my "main" task lasts for 90 seconds, so it should skip every second run.
I've used a ShortCircuitOperator to check whether the current run is the only one at the moment (by querying the dag_run table of the Airflow DB), and catchup=False to disable backfilling.
However, I could not get the LatestOnlyOperator, which should do something similar, to work properly.
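The expected skipping pattern can be checked with a toy simulation in plain Python. It is not Airflow code and only models the "skip a tick while the previous run is still active" rule; the function name simulate_skips and the use of seconds as the time unit are assumptions made for illustration.

```python
def simulate_skips(ticks, duration):
    """Return the ticks at which a run actually starts.

    A tick is skipped when the run started at an earlier tick is still
    active, i.e. when the tick falls before that run's finish time.
    """
    started = []
    busy_until = None
    for tick in ticks:
        if busy_until is not None and tick < busy_until:
            continue  # previous run still active: skip this tick
        started.append(tick)
        busy_until = tick + duration
    return started


# one tick per minute (in seconds), each run lasting 90 seconds
started = simulate_skips(range(0, 360, 60), duration=90)
print(started)
```

With a 60-second schedule and a 90-second task, only every second tick starts a run, which is the behaviour described above.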
DAG file
import os
import sys
from datetime import datetime
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator, ShortCircuitOperator
import foo
import util
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 2, 13),  # or any date in the past
    'email': ['services@mydomain.com'],
    'email_on_failure': True}
dag = DAG(
    'test90_dag',
    default_args=default_args,
    schedule_interval='* * * * *',
    catchup=False)
condition_task = ShortCircuitOperator(
    task_id='skip_check',
    python_callable=util.is_latest_active_dagrun,
    provide_context=True,
    dag=dag)
py_task = PythonOperator(
    task_id="test90_task",
    python_callable=foo.bar,
    provide_context=True,
    dag=dag)
airflow.utils.helpers.chain(condition_task, py_task)
util.py
import logging
from datetime import datetime
from airflow.hooks.postgres_hook import PostgresHook
def get_num_active_dagruns(dag_id, conn_id='airflow_db'):
    # for this you have to set this value in the airflow db
    airflow_db = PostgresHook(postgres_conn_id=conn_id)
    conn = airflow_db.get_conn()
    cursor = conn.cursor()
    sql = "select count(*) from public.dag_run where dag_id = '{dag_id}' and state in ('running', 'queued', 'up_for_retry')".format(dag_id=dag_id)
    cursor.execute(sql)
    num_active_dagruns = cursor.fetchone()[0]
    return num_active_dagruns
def is_latest_active_dagrun(**kwargs):
    num_active_dagruns = get_num_active_dagruns(dag_id=kwargs['dag'].dag_id)
    return (num_active_dagruns == 1)
foo.py
import datetime
import time
def bar(*args, **kwargs):
    t = datetime.datetime.now()
    execution_date = str(kwargs['execution_date'])
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + '\n')
    time.sleep(90)
    with open("/home/airflow/test.log", "a") as myfile:
        myfile.write(execution_date + ' - ' + str(t) + ' +90\n')
    return 'bar: ok'
Acknowledgement: this answer is based on this blog post.
Use DAG max_active_runs = 1 combined with catchup = False, and add a dummy task right at the beginning (a sort of START task) with wait_for_downstream=True.
As for LatestOnlyOperator: it helps avoid rerunning a task if the previous execution is not yet finished.
Or create the "START" task as a LatestOnlyOperator and make sure all tasks in the first processing layer are connected to it. But pay attention: as per the docs, "Note that downstream tasks are never skipped if the given DAG_Run is marked as externally triggered."

How to avoid Airflow to backfill when using trigger_dag?

I want to create a DAG that will only run upon an external trigger (i.e., using the 'airflow trigger_dag ' command). However, when I do this, I see multiple 'scheduled_xxx' DagRuns in addition to the 'manual_xxx' that I want. I am assuming the scheduled_xxx DagRuns are created to backfill?
Is there a way to only have the 'manual_xxx' DagRun created and no 'scheduled_xxx' DagRuns?
I tried different values for start_date (past, datetime.now() and future) but got the same result. Here's my toy DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'start_date': datetime.now(),  # also tried past and future dates
    'schedule_interval': None,
    'depends_on_past': False,
}
dag = DAG('my_test_dag', default_args=default_args)
date_task = BashOperator(
    task_id='date',
    bash_command='date',
    dag=dag)
This is how I am issuing the trigger_dag command ...
airflow trigger_dag my_test_dag
It appears that "schedule_interval" is not a task-level parameter. If I move it to the DAG object, then I get the expected result. That is, I only get the manually triggered DagRun when I add the 'schedule_interval' argument to the DAG constructor:
default_args = {
    'start_date': datetime.now(),  # also tried past and future dates
    'depends_on_past': False,
}
dag = DAG('my_test_dag', default_args=default_args, schedule_interval=None)
This results in only a single DagRun getting created for each external/manual trigger (trigger_dag).

How to work correctly airflow schedule_interval

I want to try to use Airflow instead of cron.
But schedule_interval doesn't work as I expected.
I wrote the Python code below.
In my understanding, Airflow should have run at "2016/03/30 8:15:00", but it didn't run at that time.
If I change it to "'schedule_interval': timedelta(minutes = 5)", it works correctly, I think.
The "notice_slack.sh" script just calls the Slack API to post to my channels.
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import os
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 3, 29, 8, 15),
}
dag = DAG(
    dag_id='notice_slack',
    default_args=args,
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=1))
# cmd file name
CMD = '/tmp/notice_slack.sh'
run_this = BashOperator(
    task_id='run_transport', bash_command=CMD, dag=dag)
I want to run some of my scripts at a specific time every day, like this cron setting:
15 08 * * * bash /tmp/notice_slack.sh
I have read the document Scheduling & Triggers, and I know it's a little bit different from cron.
So I am trying to adjust the "start_date" and "schedule_interval" settings.
Does anyone know what I should do?
airflow version
INFO - Using executor LocalExecutor
v1.7.0
amazon-linux-ami/2015.09-release-notes
Try this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import os
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 3, 29),
}
dag = DAG(
    dag_id='notice_slack',
    default_args=args,
    schedule_interval="15 08 * * *",
    dagrun_timeout=timedelta(minutes=1))
# cmd file name
CMD = 'bash /tmp/notice_slack.sh'
run_this = BashOperator(
    task_id='run_transport', bash_command=CMD, dag=dag)
start_date (datetime) – The start_date for the task, determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG’s schedule_interval.
schedule_interval (datetime.timedelta or dateutil.relativedelta.relativedelta or str that acts as a cron expression) – Defines how often that DAG runs, this timedelta object gets added to your latest task instance’s execution_date to figure out the next schedule.
Simply setting the schedule_interval and bash_command to match your cron setting is enough.
Airflow starts the run stamped 2016/03/30 8:15:00 once that time plus the schedule interval (one day) has passed, so that run starts on 2016/03/31 8:15:00.
You can check the Airflow FAQ
First, your start date should be in the past:
instead of 'start_date': datetime(2016, 3, 29, 8, 15),
try 'start_date': datetime(2016, 2, 29, 8, 15),
and apply 'catchup': False to prevent backfills, unless backfilling is something you want.
From Airflow documentation -
The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
The schedule interval can be supplied as a cron expression:
If you want to run it every day at 8:15 AM, the expression would be '15 8 * * *'
If you want to run it only on Oct 31st at 8:15 AM, the expression would be '15 8 31 10 *'
To supply this, set 'schedule_interval': '15 8 * * *' in your DAG properties.
You can figure this out more from https://crontab.guru/
Alternatively, there are Airflow presets.
If any of these meets your requirements, it would simply be 'schedule_interval': '@hourly'
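For reference, the common presets correspond to the cron expressions below. The mapping reflects the values listed in the Airflow scheduling documentation; the dictionary name PRESET_TO_CRON and the helper to_cron are hypothetical and only illustrate the equivalence.

```python
# Common Airflow schedule presets and their cron equivalents.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def to_cron(schedule):
    """Return the cron expression for a preset, or the input unchanged."""
    return PRESET_TO_CRON.get(schedule, schedule)


print(to_cron("@daily"))
print(to_cron("15 8 * * *"))
```

Note that "@daily" means midnight, so it is not a substitute for a cron expression that names a specific time of day.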
Lastly, you can also supply the schedule as a Python timedelta object, e.g. to run every 12 hours:
'schedule_interval': timedelta(hours=12)
With the example you've given, @daily will run your job after it passes midnight. You might try changing it to timedelta(days=1), which is relative to your fixed start_date that includes 08:15.
Or you could use a cron spec, schedule_interval='15 08 * * *', in which case any start date prior to 8:15 on the day BEFORE the day you wanted the first run would work.
Note that depends_on_past: False is already the default, and you may have confused its behavior with catchup=False in the DAG parameters, which avoids creating past runs for the time between the start date and now when the DAG schedule interval would have run.
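The difference between the two schedule types can be sketched in plain Python. This is a simplified model, not Airflow's scheduler: it assumes a timedelta schedule counts from start_date, while "@daily" aligns intervals to midnight. Both function names are hypothetical.

```python
from datetime import datetime, timedelta

start_date = datetime(2016, 3, 29, 8, 15)


def timedelta_run_times(start, interval, count):
    """Trigger times for a timedelta schedule: start + interval, start + 2 * interval, ..."""
    return [start + interval * i for i in range(1, count + 1)]


def midnight_run_times(start, count):
    """Trigger times for a midnight-aligned daily schedule (model of "@daily")."""
    first_midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_midnight < start:
        first_midnight += timedelta(days=1)
    # the interval that starts at first_midnight is triggered one day later
    return [first_midnight + timedelta(days=i) for i in range(1, count + 1)]


by_timedelta = timedelta_run_times(start_date, timedelta(days=1), 2)
by_midnight = midnight_run_times(start_date, 2)
print("timedelta(days=1):", by_timedelta)
print("@daily (model):   ", by_midnight)
```

In this model only the timedelta schedule keeps the 08:15 time of day from start_date; the midnight-aligned schedule never triggers at 08:15.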
