Run Airflow DAG 2 times per day - airflow

How can I schedule my DAG 2 times per day with such settings?
I tried
'start_date': datetime(2022, 4, 6)
with
schedule_interval='0 1,4 * * *'
So if today is 2022-04-08, around 5 pm, I would expect 4 (or 6?) runs, but I see 5 (the last one today at 01:00 UTC). How does this work?
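One way to see where the 5 comes from is to simulate the schedule with the standard library only. This is a minimal sketch, not Airflow's scheduler: it assumes that a run labelled with a cron tick only starts once the next tick arrives, and that catchup is enabled so every past interval gets a run. The helper names `schedule_points` and `started_runs` are made up for illustration.

```python
from datetime import datetime, timedelta


def schedule_points(start, end, hours=(1, 4)):
    """Yield every tick of the cron expression '0 1,4 * * *' within [start, end]."""
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= end:
        for hour in hours:
            tick = day + timedelta(hours=hour)
            if start <= tick <= end:
                yield tick
        day += timedelta(days=1)


def started_runs(start_date, now):
    """Return the logical dates of the runs that have started by `now`.

    Assumption: the run labelled with tick T starts when the following
    tick arrives, so the most recent tick has no run yet.
    """
    ticks = list(schedule_points(start_date, now))
    return ticks[:-1]


runs = started_runs(datetime(2022, 4, 6), datetime(2022, 4, 8, 17, 0))
for run in runs:
    print(run.isoformat())
```

Under these assumptions there are 6 ticks between the start date and 5 pm on 2022-04-08, but only the first 5 have a run, and the last one is labelled 2022-04-08 01:00.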

Related

DAG run as per timezone

I want to run my DAG according to the New York time zone. The data arrives on New York time, so the DAG fails on its initial runs and also skips the data for the last runs: UTC is 4 hours ahead, so the UTC day ends 4 hours earlier than the New York day.
I have done
timezone = pendulum.timezone("America/New_York")
default_args = {
'depends_on_past': False,
'start_date': pendulum.datetime(2022, 5, 13, tzinfo=timezone),
...
}
test_dag = DAG(
'test_data_import_tz', default_args=default_args,
schedule_interval="0 * * * *", catchup=False)
The DAG code works with:
yesterday_report_date = context.get('yesterday_ds_nodash')
report_date = context.get('ds_nodash')
But the DAG runs on UTC only, and the initial run fails because the data is not present yet.
Dag Run times are:
2022-05-13T00:00:00+00:00
2022-05-13T01:00:00+00:00
and so on
whereas I want this to start from
2022-05-13T04:00:00+00:00
2022-05-13T05:00:00+00:00
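The expected run times above can be checked with a small standard-library sketch. It assumes a fixed UTC-4 offset as a stand-in for America/New_York in May (daylight saving time); a real DAG would use a named time zone, as the question does with pendulum. The function name `hourly_ticks_in_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for America/New_York in May (EDT).
NEW_YORK_EDT = timezone(timedelta(hours=-4))


def hourly_ticks_in_utc(local_start, count):
    """Return the first `count` hourly ticks from `local_start`, converted to UTC."""
    return [
        (local_start + timedelta(hours=i)).astimezone(timezone.utc)
        for i in range(count)
    ]


ticks = hourly_ticks_in_utc(datetime(2022, 5, 13, tzinfo=NEW_YORK_EDT), 2)
for tick in ticks:
    print(tick.isoformat())
```

Midnight in New York on 2022-05-13 corresponds to 04:00 UTC, which is the first run time the question is asking for.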

How to run airflow dag on working days between 9am to 4pm in every 10 minutes

I have a DAG that needs to be scheduled to run on working days (Mon to Fri) between 9 AM and 4 PM, every 10 minutes. How do I do this in Airflow?
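Set your DAG with the cron expression */10 9-16 * * 1-5, which means: at every 10th minute past every hour from 9 through 16, on every day of the week from Monday through Friday. Note that the hour range 9-16 includes 16:00 through 16:50; use 9-15 if the last tick should be at 15:50. See crontab guru.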
dag = DAG(
dag_id='my_dag',
schedule_interval='*/10 9-16 * * 1-5',
start_date=datetime(2021, 1, 1),
)
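To check what the expression matches without installing a cron library, the fields can be tested by hand. This is only a sketch of the matching rule for this one expression, with hypothetical helper names `matches` and `ticks_on`; it does not parse cron syntax in general.

```python
from datetime import datetime, timedelta


def matches(ts):
    """True if `ts` matches the cron expression '*/10 9-16 * * 1-5'."""
    return (
        ts.minute % 10 == 0        # */10: minutes 0, 10, ..., 50
        and 9 <= ts.hour <= 16     # 9-16: hours 9 through 16 inclusive
        and ts.weekday() <= 4      # 1-5: Monday (0 in Python) to Friday (4)
    )


def ticks_on(day):
    """Return every matching minute of the given calendar day."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes = (start + timedelta(minutes=m) for m in range(24 * 60))
    return [ts for ts in minutes if matches(ts)]


monday = ticks_on(datetime(2021, 1, 4))    # 2021-01-04 is a Monday
saturday = ticks_on(datetime(2021, 1, 2))  # 2021-01-02 is a Saturday
print(len(monday), monday[0].time(), monday[-1].time())
print(len(saturday))
```

On a weekday this gives 48 ticks, from 09:00 to 16:50, and none on a Saturday.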

Airflow start Dag on yesterday

Say I deploy a new DAG on Airflow today (28 April).
The cron expression is 0 3 * * *, so I expect the first run to be on 29 April at 3 am. However, I get a run as soon as I deploy, with this run id: 2021-04-27, 03:00:00.
Dag code:
DAG(
dag_id="namexx",
schedule_interval='0 3 * * *',
max_active_runs=1,
is_paused_upon_creation=False,
dagrun_timeout=timedelta(hours=1),
catchup=False,
default_args={
"start_date": datetime(2021, 1, 1),
"retries": 0,
"retry_delay": timedelta(minutes=1)
}
)
Any idea why that is?
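This is expected.
Airflow schedules a DAG run at the end of its interval.
If start_date is 2021-01-01 and the interval is hourly, a run will be triggered as soon as the DAG is deployed, because at least one full interval has already elapsed. The same holds for your daily schedule: the run labelled 2021-04-27 03:00 covers the interval that ended on 28 April at 03:00.
See also previous answer 1, answer 2 on this subject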
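The "end of the interval" rule can be illustrated with a short sketch. It assumes the DAG was deployed at 10:00 on 28 April (the question does not give the exact time) and that, with catchup=False, only the most recent completed interval gets a run. The function name `latest_complete_daily_interval` is made up for this example.

```python
from datetime import datetime, timedelta


def latest_complete_daily_interval(now, hour=3):
    """Return (start, end) of the most recent finished interval of '0 3 * * *'."""
    last_tick = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if last_tick > now:
        # Today's tick has not happened yet, so step back one day.
        last_tick -= timedelta(days=1)
    return last_tick - timedelta(days=1), last_tick


# Assumption: deployment time of 10:00 on 28 April is hypothetical.
deployed_at = datetime(2021, 4, 28, 10, 0)
interval_start, interval_end = latest_complete_daily_interval(deployed_at)
print("run labelled:", interval_start.isoformat())
print("interval ended:", interval_end.isoformat())
```

The run is labelled with the start of the interval (2021-04-27 03:00), and it becomes due as soon as that interval has ended (2021-04-28 03:00), which is already in the past at deployment time.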

Airflow DAG Scheduling for end of month

I want to run a schedule on Airflow (v1.9.0).
My DAG needs to run at every end of month, but I don't know how to write the settings.
my_dag = DAG(dag_id=DAG_ID,
catchup=False,
default_args=default_args,
schedule_interval='30 0 31 * *',
start_date=datetime(2019, 7, 1))
But this won't work in a month where there's no 31st, right?
How can I write a schedule_interval to run at every end of the month?
You can do this by putting L in the day-of-month position (the third field) of your schedule_interval cron expression, provided the cron parser used by your Airflow version supports L.
schedule_interval='59 23 L * *' # 23:59 on the last day of the month
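To see which date "the last day of the month" resolves to, including months without a 31st, the standard library's calendar module is enough. This sketch only computes the date; it is not how Airflow or its cron parser evaluates L, and `last_day_run` is a hypothetical helper.

```python
import calendar
from datetime import datetime


def last_day_run(year, month, hour=23, minute=59):
    """Return the datetime of 23:59 on the last day of the given month."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return datetime(year, month, last_day, hour, minute)


for year, month in [(2019, 7), (2019, 2), (2020, 2), (2019, 9)]:
    print(last_day_run(year, month).isoformat())
```

A fixed day such as 31 in the cron expression would skip February, April, June, September and November, which is the problem the question describes.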

Apache Airflow: DAG executed twice before start_date

Hi Everyone,
From the Airflow UI, we are trying to understand how to start a DAG run in the future at a specific time, but we always get 2 additional runs in catch-up mode (even though catch-up is disabled)
Example
Create a DAG run with the below parameters
start_date: 10:30
execution_date: not defined
interval = 3 minutes (from the .py file)
catchup_by_default = False
Turn the ON switch at Current time: 10:28. What we get is Airflow triggers 2 DAG runs with execution_date at:
10:24
10:27
and these 2 DAG runs are run in catch-up mode one after the other, and that's not what we want :-(
What are we doing wrong?
We maybe understand the 10:27 run (ETL concept), but we do not get the 10:24 one :-(
Thank you for the help :-)
DETAILS:
OS: RedHat 7
Python: 2.7
Airflow: v1.8.0
DAG python file:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'aa',
'depends_on_past': False,
'start_date': datetime(2017, 9, 7, 10, 30),
'run_as_user': 'aa'
}
dag = DAG(
'dag3', default_args=default_args, schedule_interval=timedelta(minutes=3))
dag.catchup = False
create_command = "/script.sh "
t1 = BashOperator(
task_id='task',
bash_command='date',
dag=dag)
I tried with Airflow v.1.8.0, python v.3.5, db on SQLite. The following DAG, unpaused at 10:28, is quite similar to yours, and works as it should (only one run, at 10:33, for 10:30).
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
def print_hello_3min():
return ('Hello world! %s' % datetime.now())
dag = DAG('hello_world_3min', description='Simple tutorial DAG 3min',
schedule_interval='*/3 * * * *',
start_date=datetime(2017, 9, 18, 10, 30),
catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task_3min', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task_3min',
python_callable=print_hello_3min, dag=dag)
dummy_operator >> hello_operator
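The expected behaviour described here (one run, at 10:33, for 10:30) can be written down as a small model. This is a sketch of the intended rule only, assuming each run is triggered once its interval has fully elapsed and nothing is scheduled before start_date; it does not reproduce the scheduler internals of Airflow 1.8.0. The function name `triggered_by` is hypothetical.

```python
from datetime import datetime, timedelta


def triggered_by(start_date, interval, now):
    """Return the execution dates of all runs that should have been triggered by `now`."""
    runs = []
    execution_date = start_date
    # A run for `execution_date` becomes due at execution_date + interval.
    while execution_date + interval <= now:
        runs.append(execution_date)
        execution_date += interval
    return runs


start = datetime(2017, 9, 18, 10, 30)
step = timedelta(minutes=3)

print(triggered_by(start, step, datetime(2017, 9, 18, 10, 28)))  # before start_date
print(triggered_by(start, step, datetime(2017, 9, 18, 10, 33)))  # first interval done
```

Under this model, unpausing the DAG at 10:28 triggers nothing, and the first run (execution date 10:30) is triggered at 10:33.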
I'm not sure whether my solution is good enough, but I'd like to present my understanding.
There are 2 things to consider together:
schedule_interval mode, such as 'hourly', 'daily', 'weekly', 'annually'.
hourly = (0 * * * *) = “At minute 0 of every hour.”
daily = (0 1 * * *) = “At 01:00.”
monthly = (0 1 1 * *) = “At 01:00 on day-of-month 1.”
start_date
hourly = datetime(2019, 4, 5, 1, 30)
daily = datetime(2019, 4, 5)
monthly = datetime(2019, 4, 1)
My strategy is to set start_date to the expected first run date and time minus 1 unit of the interval mode.
Example:
To start the first job at 2019-4-5 01:00 when the interval is hourly:
schedule_interval mode = hourly
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-4-5 00:00
(the expected start minus 1 hour)
CRON = ( 0 * * * * ) which means “At minute 0 of every hour.”
default_args = {
'owner': 'aa',
'depends_on_past': False,
'start_date': datetime(2019, 4, 5, 0, 0),
'run_as_user': 'aa'
}
dag = DAG(
'dag3', default_args=default_args, catchup=False, schedule_interval='0 * * * *')
To start the first job at 2019-4-5 01:00 when the interval is daily:
schedule_interval mode = daily
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-4-4
(the expected start date minus 1 day)
CRON = ( 0 1 * * * ) which means “At 01:00.”
default_args = {
'owner': 'aa',
'depends_on_past': False,
'start_date': datetime(2019, 4, 4),
'run_as_user': 'aa'
}
dag = DAG(
'dag3', default_args=default_args, catchup = False, schedule_interval='0 1 * * *')
To start the first job at 2019-4-5 01:00 and the interval are monthly.
schedule_interval mode = monthly
expecting start datetime date = 2019-4-5 01:00
so, start_date = 2019-4-4
minus day by 1 day
CRON = ( 0 1 1 * * ) which means “At 01:00 on day-of-month 1.”
default_args = {
'owner': 'aa',
'depends_on_past': False,
'start_date': datetime(2019, 4, 4),
'run_as_user': 'aa'
}
dag = DAG(
'dag3', default_args=default_args, catchup = False, schedule_interval='0 1 1 * *')
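The "subtract one interval" strategy can be expressed as a one-line helper. This is a sketch for fixed-length intervals only (hourly, daily); a month is not a fixed timedelta, so month-based schedules are not covered. The name `start_date_for` is made up for illustration.

```python
from datetime import datetime, timedelta


def start_date_for(first_trigger, interval):
    """Return a start_date such that the first run is triggered at `first_trigger`.

    Assumption: a run is triggered one full interval after its start_date-aligned tick.
    """
    return first_trigger - interval


wanted = datetime(2019, 4, 5, 1, 0)
hourly_start = start_date_for(wanted, timedelta(hours=1))
daily_start = start_date_for(wanted, timedelta(days=1))
print("hourly start_date:", hourly_start.isoformat())
print("daily start_date:", daily_start.isoformat())
```

For the daily case this gives 2019-04-04 01:00. Using midnight on 2019-04-04, as in the daily example above, has the same effect with the cron expression 0 1 * * *, because the first tick after midnight is still 2019-04-04 01:00.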
So far, this strategy has been useful for me, but if anyone has a better one, please kindly share.
PS. I'm using https://crontab.guru to build and check my cron schedules.
This appears to happen exclusively when providing a timedelta as a schedule. Switch your schedule interval to be cron formatted and it won't run twice anymore.
