I have the following in my dag.py file. This is a DAG newly pushed to prod. It should have run at 14:00 UTC (09:00 EST), a few hours ago, but it still hasn't run, even though the UI still says it will run at 14:00 UTC.
DAG_NAME = "revenue_monthly"
START_DATE = datetime(2023, 1, 12)
SCHEDULE_INTERVAL = "0 14 3 * *"
default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False
}
dag = DAG(DAG_NAME,
          default_args=default_args,
          schedule_interval=SCHEDULE_INTERVAL,
          doc_md=doc_md,
          max_active_runs=1,
          catchup=False,
          )
See picture below of the UI:
The date and time you are seeing as Next Run is the logical_date, which is the start of the data interval. With the current configuration the first DAG run will cover the data interval from 2023-02-03 to 2023-03-03, so the DAG will only actually run on 2023-03-03. That is the Run After date, which you can see when you view the DAG and hover over the schedule in the upper right corner.
Assuming you want the DAG to do the run it would have done on 2023-02-03 (today), you can achieve that by backfilling one run, either manually or by using catchup=True with a start_date before 2023-01-03:
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator
DAG_NAME = "revenue_monthly_1"
START_DATE = datetime(2023, 1, 1)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md="documentation"
default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False
}
with DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=SCHEDULE_INTERVAL,
    doc_md=doc_md,
    max_active_runs=1,
    catchup=True,
) as dag:
    # Tasks must be indented inside the `with` block to be attached to the DAG.
    t1 = EmptyOperator(task_id="t1")
This gave me one run with the run id scheduled__2023-01-03T14:00:00+00:00, and a next run with the data interval 2023-02-03 to 2023-03-03, which will run after 2023-03-03.
This guide might help with the terminology Airflow uses around schedules.
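The interval arithmetic described in this answer can be sketched with the standard library alone. This is a minimal illustration, not Airflow's own code: the helper names `next_month` and `first_tick_on_or_after` are hypothetical, and the sketch assumes a monthly schedule on a day of the month that exists in every month (day 3 here).

```python
from datetime import datetime, timezone


def next_month(tick):
    """Return the same day and time in the following month."""
    if tick.month == 12:
        return tick.replace(year=tick.year + 1, month=1)
    return tick.replace(month=tick.month + 1)


def first_tick_on_or_after(moment, day=3, hour=14):
    """First monthly tick (day-of-month `day` at `hour`:00) at or after `moment`."""
    tick = moment.replace(day=day, hour=hour, minute=0, second=0, microsecond=0)
    return tick if tick >= moment else next_month(tick)


# start_date from the question; the cron "0 14 3 * *" ticks on day 3 at 14:00 UTC.
start_date = datetime(2023, 1, 12, tzinfo=timezone.utc)

# Start of the first data interval: the logical date shown as "Next Run".
interval_start = first_tick_on_or_after(start_date)
# End of the first data interval: the "Run After" time, when the run actually starts.
interval_end = next_month(interval_start)

print(interval_start.isoformat())  # 2023-02-03T14:00:00+00:00
print(interval_end.isoformat())    # 2023-03-03T14:00:00+00:00
```

The run is only created once the interval has ended, which is why nothing happened on 2023-02-03 itself.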
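I have a DAG that needs to run on working days (Mon to Fri) between 9 AM and 4 PM, every 10 minutes. How do I do this in Airflow?
Set your DAG's schedule with the cron expression */10 9-16 * * 1-5, which means: at every 10th minute past every hour from 9 through 16, on every day-of-week from Monday through Friday. See crontab guru. Note that the hour range 9-16 also covers 16:10 through 16:50; if runs after 16:00 are not wanted, narrow the hour range to 9-15, which makes 15:50 the last run of the day.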
dag = DAG(
    dag_id='my_dag',
    schedule_interval='*/10 9-16 * * 1-5',
    start_date=datetime(2021, 1, 1),
)
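To check which moments this expression actually covers, the three constrained fields can be mirrored in plain Python. This is only a sketch for reasoning about the schedule: `matches_schedule` is a hypothetical helper, not an Airflow or cron API, and it ignores seconds.

```python
from datetime import datetime


def matches_schedule(moment):
    """True if `moment` falls on a tick of the cron expression */10 9-16 * * 1-5."""
    return (
        moment.minute % 10 == 0      # */10 -> minutes 0, 10, 20, 30, 40, 50
        and 9 <= moment.hour <= 16   # 9-16 -> hours 9 through 16 inclusive
        and moment.weekday() <= 4    # 1-5  -> Monday (0 in Python) through Friday (4)
    )


print(matches_schedule(datetime(2021, 1, 4, 9, 0)))    # Monday 09:00 -> True
print(matches_schedule(datetime(2021, 1, 4, 16, 50)))  # Monday 16:50 -> True
print(matches_schedule(datetime(2021, 1, 4, 17, 0)))   # Monday 17:00 -> False
print(matches_schedule(datetime(2021, 1, 2, 10, 0)))   # Saturday 10:00 -> False
```

The second call shows that the last tick of the day is 16:50, not 16:00.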
When I deploy a new DAG on Airflow, let's say today (28 April), with the cron expression 0 3 * * *, I expect the first run to be on 29 April at 3 AM. However, I get a run as soon as I deploy, with this run id: 2021-04-27, 03:00:00.
Dag code:
DAG(
    dag_id="namexx",
    schedule_interval='0 3 * * *',
    max_active_runs=1,
    is_paused_upon_creation=False,
    dagrun_timeout=timedelta(hours=1),
    catchup=False,
    default_args={
        "start_date": datetime(2021, 1, 1),
        "retries": 0,
        "retry_delay": timedelta(minutes=1)
    }
)
Any idea why that is?
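This is expected.
Airflow schedules DAG runs at the end of their interval, not at the start.
For example, if start_date is 2021-01-01 and the interval is hourly, a run will be triggered as soon as the DAG is deployed, because at least one interval has already elapsed. In your case, assuming you deployed after 03:00 on 28 April, the most recent completed daily interval started at 2021-04-27 03:00, which is the date in the run id.
See also previous answer 1 and answer 2 on this subject.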
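The run id from the question can be reproduced with a few lines of date arithmetic. This is a sketch of the idea only, not the scheduler's implementation: `latest_completed_interval` is a hypothetical helper, and the deploy time of 10:00 is an assumption, since the question only gives the date.

```python
from datetime import datetime, timedelta


def latest_completed_interval(now, hour=3):
    """Most recent fully elapsed daily interval for a schedule ticking at `hour`:00."""
    last_tick = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if last_tick > now:
        # Today's tick has not happened yet, so the last tick was yesterday.
        last_tick -= timedelta(days=1)
    return last_tick - timedelta(days=1), last_tick


deploy_time = datetime(2021, 4, 28, 10, 0)  # assumed time of day
interval_start, interval_end = latest_completed_interval(deploy_time)

print(interval_start)  # 2021-04-27 03:00:00 (the date that appears in the run id)
print(interval_end)    # 2021-04-28 03:00:00 (already in the past, so the run starts at once)
```

The next run then covers 2021-04-28 03:00 to 2021-04-29 03:00 and starts on 29 April at 03:00, which is the run the question was expecting to be the first.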
Here is the DAG, which I want to execute on a fixed date every month; for now I have set it to the 18th of every month.
But the task is getting triggered daily by the scheduler. catchup_by_default is set to False in the airflow.cfg file.
default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',
}
You have put schedule_interval in your default_args, which is not the way to schedule the DAG. default_args are applied to the tasks: they are passed to the operators, not to the DAG itself. Since the DAG receives no schedule_interval of its own, it falls back to the default of one run per day, which is why the task is triggered daily.
You can fix your code by removing schedule_interval from default_args and passing it to the DAG instance as follows. You can also set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args, schedule_interval = '0 0 18 * *', catchup=False)
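A small guard can catch this mistake before deployment. The sketch below is a hypothetical lint helper, not part of Airflow; the set of DAG-level keys is illustrative rather than exhaustive.

```python
from datetime import datetime

# Arguments that belong on the DAG constructor and are not read by the DAG
# when they are placed in default_args. Illustrative, not exhaustive.
DAG_LEVEL_KEYS = {"schedule_interval", "catchup", "max_active_runs"}


def misplaced_dag_args(default_args):
    """Return the keys in `default_args` that should be passed to DAG(...) instead."""
    return sorted(DAG_LEVEL_KEYS.intersection(default_args))


default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',  # misplaced: this is a DAG-level argument
}

print(misplaced_dag_args(default_args))  # ['schedule_interval']
```

Running a check like this in a unit test for the DAG file would flag the configuration from the question.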
Hi Everyone,
From the Airflow UI, we are trying to understand how to start a DAG run in the future at a specific time, but we always get 2 additional runs in catch-up mode (even though catch-up is disabled)
Example
Create a DAG run with the below parameters
start_date: 10:30
execution_date: not defined
interval = 3 minutes (from the .py file)
catchup_by_default = False
Turn the ON switch at Current time: 10:28. What we get is Airflow triggers 2 DAG runs with execution_date at:
10:24
10:27
and these 2 DAG runs are run in catch-up mode one after the other, and that's not what we want :-(
What are we doing wrong?
We maybe understand the 10:27 run (ETL concept), but we do not get the 10:24 one :-(
Thank you for the help :-)
DETAILS:
OS: RedHat 7
Python: 2.7
Airflow: v1.8.0
DAG python file:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 7, 10, 30),
    'run_as_user': 'aa'
}
dag = DAG(
    'dag3', default_args=default_args, schedule_interval=timedelta(minutes=3))
dag.catchup = False
create_command = "/script.sh "
t1 = BashOperator(
    task_id='task',
    bash_command='date',
    dag=dag)
I tried with Airflow v.1.8.0, python v.3.5, db on SQLite. The following DAG, unpaused at 10:28, is quite similar to yours, and works as it should (only one run, at 10:33, for 10:30).
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
def print_hello_3min():
    # Return a greeting with the current timestamp; Airflow logs the return value.
    return ('Hello world! %s' % datetime.now())
dag = DAG('hello_world_3min', description='Simple tutorial DAG 3min',
          schedule_interval='*/3 * * * *',
          start_date=datetime(2017, 9, 18, 10, 30),
          catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task_3min', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task_3min',
                                python_callable=print_hello_3min, dag=dag)
dummy_operator >> hello_operator
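The "one run, at 10:33, for 10:30" result follows from the rule that a run is triggered once its interval has ended. A minimal sketch of that rule, using only the standard library (`run_schedule` is a hypothetical helper, and the sketch assumes the first execution date coincides with start_date):

```python
from datetime import datetime, timedelta


def run_schedule(start, step, count):
    """Yield (execution_date, earliest trigger time) pairs for the first `count` runs."""
    for i in range(count):
        execution_date = start + i * step
        # A run covering [execution_date, execution_date + step) starts when that interval ends.
        yield execution_date, execution_date + step


start_date = datetime(2017, 9, 18, 10, 30)
interval = timedelta(minutes=3)

for execution_date, trigger_time in run_schedule(start_date, interval, 3):
    print(execution_date.time(), "->", trigger_time.time())
# 10:30:00 -> 10:33:00
# 10:33:00 -> 10:36:00
# 10:36:00 -> 10:39:00
```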
I'm not sure whether my solution is good enough, but I'd like to present my understanding.
There are 2 things to consider together:
schedule_interval mode, such as 'hourly', 'daily', 'weekly', 'annually'.
hourly = (0 * * * *) = “At minute 0 of every hour.”
daily = (0 1 * * *) = “At 01:00.”
monthly = (0 1 1 * *) = “At 01:00 on day-of-month 1.”
start_date
hourly = datetime(2019, 4, 5, 1, 30)
daily = datetime(2019, 4, 5)
monthly = datetime(2019, 4, 1)
My strategy is to set start_date to the expected first run date and time minus one unit of your interval mode.
Example:
To start the first job at 2019-4-5 01:00 with an hourly interval:
schedule_interval mode = hourly
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-4-5 00:00
(the expected start minus 1 hour)
CRON = ( 0 * * * * ) which means “At minute 0 of every hour.” (The expression * 1 * * * is not hourly: it fires at every minute past hour 1.)
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 5, 0, 0),
    'run_as_user': 'aa'
}
dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 * * * *')
To start the first job at 2019-4-5 01:00 with a daily interval:
schedule_interval mode = daily
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-4-4
(the expected start date minus 1 day)
CRON = ( 0 1 * * * ) which means “At 01:00.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 4),
    'run_as_user': 'aa'
}
dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 * * *')
To start the first job at 2019-4-5 01:00 with a monthly interval:
schedule_interval mode = monthly
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-3-5
(the expected start date minus 1 month)
CRON = ( 0 1 5 * * ) which means “At 01:00 on day-of-month 5.” (The day-of-month field must match the expected start day; 0 1 1 * * would only fire on the 1st.)
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 3, 5),
    'run_as_user': 'aa'
}
dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 5 * *')
So far this strategy has been useful for me, but if anyone has a better one, please share.
PS. I'm using [https://crontab.guru] to build and check the cron schedule.
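The "subtract one unit of the interval" strategy can be written down as a small helper. This is a sketch under stated assumptions: `start_date_for` is a hypothetical function, it returns the schedule tick immediately before the desired first run, and the monthly branch assumes a day of the month that exists in every month (28 or earlier).

```python
from datetime import datetime, timedelta


def start_date_for(first_run, unit):
    """Return a start_date one schedule unit before the desired first run."""
    if unit == "hourly":
        return first_run - timedelta(hours=1)
    if unit == "daily":
        return first_run - timedelta(days=1)
    if unit == "monthly":
        # Step back one calendar month; assumes first_run.day <= 28.
        if first_run.month == 1:
            return first_run.replace(year=first_run.year - 1, month=12)
        return first_run.replace(month=first_run.month - 1)
    raise ValueError("unsupported unit: %s" % unit)


first_run = datetime(2019, 4, 5, 1, 0)
print(start_date_for(first_run, "hourly"))   # 2019-04-05 00:00:00
print(start_date_for(first_run, "daily"))    # 2019-04-04 01:00:00
print(start_date_for(first_run, "monthly"))  # 2019-03-05 01:00:00
```

The cron expression still has to tick at the desired first run time for this to line up, since the helper only shifts the start_date.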
This appears to happen exclusively when providing a timedelta as a schedule. Switch your schedule interval to be cron formatted and it won't run twice anymore.