I have the following in my dag.py file. This is a newly pushed-to-prod DAG; it should have run at 14 UTC (9 EST) a few hours ago, but it still hasn't run, even though the UI still says it will run at 14 UTC.
from datetime import datetime
from airflow import DAG

DAG_NAME = "revenue_monthly"
START_DATE = datetime(2023, 1, 12)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md = "documentation"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False,
}

dag = DAG(DAG_NAME,
          default_args=default_args,
          schedule_interval=SCHEDULE_INTERVAL,
          doc_md=doc_md,
          max_active_runs=1,
          catchup=False,
          )
See the picture of the UI below:
[screenshot of the Airflow UI showing the Next Run time]
The date and time you are seeing as Next Run is the logical_date, which is the start of the data interval. With the current configuration, the first DAG run will cover data from 2023-02-03 to 2023-03-03, so the DAG will only actually run on 2023-03-03 (the Run After date; you can see it when viewing the DAG by hovering over the schedule in the upper right corner).
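If it helps to see this from inside a task, the data interval is available in the task context (a minimal sketch, assuming Airflow 2.2+ where these context fields exist; the task name is just an illustration):

from airflow.decorators import task

@task
def show_interval(data_interval_start=None, data_interval_end=None):
    # Airflow injects matching context variables into decorated task arguments
    print(f"This run covers {data_interval_start} to {data_interval_end}")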
Assuming you want the DAG to do the run it would have done on 2023-02-03 (today), you can achieve that by backfilling one run: either manually (see the CLI sketch below), or by using catchup=True with a start_date before 2023-01-03:
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator

DAG_NAME = "revenue_monthly_1"
START_DATE = datetime(2023, 1, 1)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md = "documentation"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False,
}

with DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=SCHEDULE_INTERVAL,
    doc_md=doc_md,
    max_active_runs=1,
    catchup=True,  # backfills the 2023-01-03 interval, which runs on 2023-02-03
) as dag:
    t1 = EmptyOperator(task_id="t1")
This gave me one run with the run id scheduled__2023-01-03T14:00:00+00:00, and the next run's data interval is 2023-02-03 to 2023-03-03, which will run after 2023-03-03.
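If you prefer the manual backfill mentioned above, the Airflow CLI can create that single run (a sketch; the dates target the 2023-01-03 logical date of the original revenue_monthly DAG):

airflow dags backfill -s 2023-01-03 -e 2023-01-04 revenue_monthly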
This guide might help with the terminology Airflow uses around schedules.
I want to run my DAG on New York time, because the data arrives on New York time. The DAG fails for the initial runs and also skips the last runs' data, since UTC is 4 hours ahead and the day completes 4 hours earlier in UTC than in New York.
I have done:

import pendulum
from airflow import DAG

timezone = pendulum.timezone("America/New_York")

default_args = {
    'depends_on_past': False,
    'start_date': pendulum.datetime(2022, 5, 13, tzinfo=timezone),
    ...
}

test_dag = DAG(
    'test_data_import_tz', default_args=default_args,
    schedule_interval="0 * * * *", catchup=False)
The DAG code works with:

yesterday_report_date = context.get('yesterday_ds_nodash')
report_date = context.get('ds_nodash')

But the DAG runs on UTC only, and the initial run fails because the data is not present yet.
The DAG run times are:
2022-05-13T00:00:00+00:00
2022-05-13T01:00:00+00:00
and so on, whereas I want them to start from:
2022-05-13T04:00:00+00:00
2022-05-13T05:00:00+00:00
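For reference, a quick pendulum check shows that New York midnight corresponds to the UTC run times you expect (note that pendulum 2's datetime() takes tz=, not tzinfo=):

import pendulum

start = pendulum.datetime(2022, 5, 13, tz="America/New_York")
print(start.in_timezone("UTC"))  # 2022-05-13T04:00:00+00:00, since EDT is UTC-4 in May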
I have a job that I had set to run at 9:00 UTC on Wednesday. It didn't run as planned by the end of the delay interval, which I thought was curious, because I believe I have everything defined properly.
import airflow.utils.dates
from datetime import timedelta
from airflow import DAG

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'noncomp_trial',
    default_args=default_args,
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20))
If anyone has any advice here, it would be greatly appreciated!
The Airflow Scheduler runs a DAG once start_date + one schedule_interval has passed. In your example, with start_date set to days_ago(0) (midnight today) and a weekly schedule, the DAG won't run until 9:00 AM on Wednesday of the following week.
See more information about the relationship between the start_date and schedule_interval here.
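One thing worth checking first: days_ago(0) is re-evaluated every time the scheduler parses the file, so the start_date keeps moving forward (a quick illustration; days_ago lives in airflow.utils.dates and is deprecated in newer Airflow releases):

from airflow.utils.dates import days_ago

# Returns midnight UTC of "today" at parse time, so the value changes every day
# and start_date + one schedule_interval may never fully elapse.
print(days_ago(0))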
You could try setting your start_date to a static date a week or two in the past to see if that works. And to make sure the Scheduler doesn't try to execute every start_date + schedule_interval occurrence between that new start_date and now, you can set catchup=False on the DAG. For example:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'noncomp_trial',
    default_args={
        'start_date': datetime(2021, 7, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20),
    catchup=False,
)
Here is the DAG, which I want to execute on a fixed date every month; for now I have set it to the 18th of every month. But the task is getting triggered daily by the scheduler, even though catchup_by_default = False is set in the airflow.cfg file.
default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',
}
[Screenshots of the DAG runs in the UI]
You have mentioned schedule_interval in your default_args, which is not the way to schedule a DAG. default_args are actually applied to the tasks: they are passed to Operators, not to the DAG itself.
You can fix this by removing schedule_interval from default_args and passing it to the DAG instance instead; you can also set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args,
          schedule_interval='0 0 18 * *', catchup=False)
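Put together, a corrected version might look like this (a minimal sketch; the dag_id and the EmptyOperator task are placeholders, and EmptyOperator requires Airflow 2.3+, so use DummyOperator on older versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
}

with DAG(
    'monthly_on_the_18th',           # placeholder dag_id
    default_args=default_args,
    schedule_interval='0 0 18 * *',  # midnight on the 18th of every month
    catchup=False,
) as dag:
    t1 = EmptyOperator(task_id='t1')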
Hi everyone,
From the Airflow UI, we are trying to understand how to start a DAG run in the future at a specific time, but we always get 2 additional runs in catch-up mode (even though catch-up is disabled)
Example:
Create a DAG with the parameters below:
start_date: 10:30
execution_date: not defined
interval = 3 minutes (from the .py file)
catchup_by_default = False
We turn the ON switch at current time 10:28. What we get is that Airflow triggers 2 DAG runs, with execution_date at:
10:24
10:27
and these 2 DAG runs are run in catch-up mode one after the other, and that's not what we want :-(
What are we doing wrong?
We can maybe understand the 10:27 run (the ETL interval concept), but we do not get the 10:24 one :-(
Thank you for the help :-)
DETAILS:
OS: RedHat 7
Python: 2.7
Airflow: v1.8.0
DAG python file:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 7, 10, 30),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, schedule_interval=timedelta(minutes=3))
dag.catchup = False

create_command = "/script.sh "

t1 = BashOperator(
    task_id='task',
    bash_command='date',
    dag=dag)
I tried with Airflow v1.8.0, Python 3.5, and a SQLite DB. The following DAG, unpaused at 10:28, is quite similar to yours, and works as it should (only one run, at 10:33, for 10:30).
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello_3min():
    return 'Hello world! %s' % datetime.now()

dag = DAG('hello_world_3min', description='Simple tutorial DAG 3min',
          schedule_interval='*/3 * * * *',
          start_date=datetime(2017, 9, 18, 10, 30),
          catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task_3min', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task_3min',
                                python_callable=print_hello_3min, dag=dag)

dummy_operator >> hello_operator
I'm not sure whether my solution is good enough, but I'd like to present my understanding.
There are 2 things to consider together:
schedule_interval mode, such as hourly, daily, or monthly:
hourly = (0 * * * *) = “At minute 0 of every hour.”
daily = (0 1 * * *) = “At 01:00.”
monthly = (0 1 1 * *) = “At 01:00 on day-of-month 1.”
start_date:
hourly = datetime(2019, 4, 5, 0, 0)
daily = datetime(2019, 4, 4)
monthly = datetime(2019, 4, 1)
My strategy is to set start_date to the expected first run date & time minus one unit of your interval mode.
Example:
To start the first job at 2019-04-05 01:00 with an hourly interval:
schedule_interval mode = hourly
expected first run = 2019-04-05 01:00
so, start_date = 2019-04-05 00:00
minus one hour
CRON = (0 * * * *), which means “At minute 0 of every hour.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 5, 0, 0),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 * * * *')
To start the first job at 2019-04-05 01:00 with a daily interval:
schedule_interval mode = daily
expected first run = 2019-04-05 01:00
so, start_date = 2019-04-04
minus one day
CRON = (0 1 * * *), which means “At 01:00.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 4),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 * * *')
To start the first job at 2019-05-01 01:00 with a monthly interval:
schedule_interval mode = monthly
expected first run = 2019-05-01 01:00
so, start_date = 2019-04-01
minus one month
CRON = (0 1 1 * *), which means “At 01:00 on day-of-month 1.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 1),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 1 * *')
So far this strategy has been useful for me; the helper below sums it up. If anyone has a better approach, please kindly share.
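If it helps, the whole strategy fits in a small helper (a sketch; dateutil's relativedelta handles the month arithmetic, and the function name is my own):

from datetime import datetime
from dateutil.relativedelta import relativedelta

# start_date = expected first run time minus one unit of the interval mode
INTERVAL_UNITS = {
    'hourly': relativedelta(hours=1),
    'daily': relativedelta(days=1),
    'monthly': relativedelta(months=1),
}

def start_date_for(expected_first_run, mode):
    return expected_first_run - INTERVAL_UNITS[mode]

print(start_date_for(datetime(2019, 4, 5, 1, 0), 'daily'))    # 2019-04-04 01:00:00
print(start_date_for(datetime(2019, 5, 1, 1, 0), 'monthly'))  # 2019-04-01 01:00:00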
PS: I'm using https://crontab.guru to generate and check my cron schedules.
This appears to happen exclusively when providing a timedelta as a schedule. Switch your schedule_interval to a cron expression and it won't run twice anymore.
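For example, the 3-minute timedelta from the question maps to a cron expression like this (a sketch against the original dag3; note the cron fires on minutes divisible by 3 rather than at offsets from start_date):

from datetime import datetime
from airflow import DAG

dag = DAG(
    'dag3',
    start_date=datetime(2017, 9, 7, 10, 30),
    schedule_interval='*/3 * * * *',  # every 3 minutes, instead of timedelta(minutes=3)
    catchup=False,
)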