I created an Apache Airflow DAG with the following default args. I want this DAG to run every day at 10PM UTC, but it's always running at 12AM UTC and ignoring the datetime I set in start_date. Is this not the right way? Thanks.
default_args = {
    'owner': config.OWNER,
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 14, 22, 0, 0),
    'email': [config.ALERT_EMAIL],
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1),
    'retries': 2,
}
# DAG
dag = DAG('Test',
          default_args=default_args,
          description='Initial setup',
          schedule_interval='@daily')
The '@daily' preset expands to '0 0 * * *', i.e. midnight UTC, regardless of the time component of start_date, which is why the DAG keeps running at 12AM. You can use cron format in your schedule_interval argument instead, like this:
# DAG
dag = DAG('Test',
          default_args=default_args,
          description='Initial setup',
          schedule_interval='0 22 * * *')
Regarding the schedule_interval you have at least three options:
datetime.timedelta
dateutil.relativedelta
cron style string
The schedule_interval defines how often the DAG runs: the interval is added to your latest task instance's execution_date to figure out the next schedule. Keep in mind that the start_date determines the execution_date of the first task instance.
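For illustration, here is a minimal sketch of the three forms side by side (the DAG ids and dates are made up, not from the question):

from datetime import datetime, timedelta

from dateutil.relativedelta import relativedelta

from airflow import DAG

# cron string: every day at 22:00 UTC
dag_cron = DAG('example_cron',
               start_date=datetime(2018, 10, 14),
               schedule_interval='0 22 * * *')

# datetime.timedelta: every 24 hours, counted from the start_date
dag_timedelta = DAG('example_timedelta',
                    start_date=datetime(2018, 10, 14, 22, 0, 0),
                    schedule_interval=timedelta(days=1))

# dateutil.relativedelta: e.g. once per calendar month
dag_relativedelta = DAG('example_relativedelta',
                        start_date=datetime(2018, 10, 14),
                        schedule_interval=relativedelta(months=1))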
All of the above is correct.
I have encountered an issue where, in Airflow 2.0, schedule_interval is ignored when put in the default_args. When I removed it and put it in the DAG declaration, all worked. I could test this by looking at the DAG details in the UI.
Example:
default_args = {
    'owner': 'Hector Hoffman',
    'depends_on_past': False,
    'start_date': start_date,
    'schedule_interval': '0 5 * * *',
    'email': ['hector@email.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'on_failure_callback': task_fail_slack_alert
}
Results in the schedule_interval being ignored (screenshot of the DAG details omitted).
Whereas, when I put it in the DAG:
with models.DAG(
    "dealstampede_workflow",
    default_args=default_args,
    catchup=False,
    schedule_interval='0 5 * * *'
) as dag:
Results in the DAG being scheduled at '0 5 * * *' as expected (screenshot omitted).
If anyone has any insight as to why the schedule_interval doesn't work in the default_args I'd appreciate feedback. Thanks.
I've seen a couple of responses to similar questions but I can't seem to wrap my head around this one. We have a process that we want to run only on set days, Tuesday to Saturday at 8am.
default_args = {
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 27),
    'retries': 1,
    'retry_delay': timedelta(minutes=10),
}
with DAG(
    dag_id='dag_name',
    schedule_interval='0 8 * * Tue-Sat',
    catchup=False,
    default_args=default_args
) as dag:
However, the DAG is still running every day. Have I missed something?
I have a DAG with the following configuration:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(0, 0, minute=1),
    'email': ['francisco.salazar.12@sansano.usm.cl'],
    'email_on_failure': False,
    'email_on_retry': False,
    'max_active_runs': 1,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'provide_context': True
}
dag = DAG(
    'terralink_environmetal_darksky',
    default_args=default_args,
    description='Extract Data from Darksky API',
    catchup=False,
    schedule_interval='31 * * * *',
)
The issue is that the scheduler works correctly and triggers a DAG run at the time I defined in schedule_interval (minute 31 of every hour), BUT the midnight run, i.e. the last execution of the day (scheduled at 00:31:00 of the next day), is never triggered.
I think the problem is related to start_date, but I don't know yet how to define this parameter in order to avoid it.
Airflow recommends setting a fixed start_date for your DAG. start_date mainly specifies when you want your DAG to start running for the very first time. After that (or if you do not need to backfill or reset your DAG), schedule_interval is what matters.
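A minimal sketch of that suggestion applied to the DAG above (the exact date is only an example):

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # fixed, static start_date instead of airflow.utils.dates.days_ago(...)
    'start_date': datetime(2019, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'terralink_environmetal_darksky',
    default_args=default_args,
    catchup=False,                   # skip backfilling the runs before "now"
    schedule_interval='31 * * * *',  # minute 31 of every hour, as before
)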
Context: I successfully installed Airflow on EC2, changed things like executor to LocalExecutor; sql_alchemy_conn to postgresql+psycopg2://postgres@localhost:5432/airflow; max_threads to 10.
My problem is that when I create a dag which is supposed to run every day, everything is fine, but when I create a dag that should run, say, at 10am on Monday and Wednesday, Airflow does not run it. Does anybody know what I could be doing wrong and what I should do to fix this issue?
Dag for script which runs fine and properly:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG(
    'daily_script',
    default_args=args,
    description='daily_script',
    schedule_interval="0 10 * * *",
)
t1 = BashOperator(
    task_id='daily',
    bash_command='cd /root/ && python3 DAILY_WORK.py',
    dag=dag)
t1
Dag for script which should run on Monday and Wednesday, but it does not run at all:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
args = {
    'owner': 'arseniyy123',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG(
    'monday_wednesday',
    default_args=args,
    description='monday_wednesday',
    schedule_interval="0 10 * * 1,3",
)
t1 = BashOperator(
    task_id='monday_wednesday',
    bash_command='cd /root/ && python3 not_daily_work.py',
    dag=dag)
t1
I also have some problems with the scheduler: it tends to die after running for more than 10 hours. Does anybody know why that happens?
Thank you in advance!
Can you try changing the start_date to a static datetime, e.g. datetime(2020, 3, 20), instead of using airflow.utils.dates.days_ago(1)?
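Something along these lines (the date itself is just an example):

from datetime import datetime, timedelta

args = {
    'owner': 'arseniyy123',
    # static start_date instead of airflow.utils.dates.days_ago(1)
    'start_date': datetime(2020, 3, 20),
    'depends_on_past': False,
    'email': ['exam@exam.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}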
Maybe read through the scheduling examples here, to understand why your code didn't work. From that documentation:
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
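As a rough illustration of that rule applied to the Monday/Wednesday DAG above (the dates are made up):

# start_date        = 2020-03-20 (a Friday)
# schedule_interval = '0 10 * * 1,3'  (Mondays and Wednesdays at 10:00)
#
# The first data interval starts on Monday 2020-03-23 10:00, and the run for
# that interval is only triggered at the END of the period, i.e. on
# Wednesday 2020-03-25 10:00 (with execution_date = 2020-03-23 10:00).
# Until one full interval has passed after the start_date, the DAG can
# therefore appear to "never run".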
In about 2 out of 10 runs, the DAG status is automatically set to success even though no tasks inside of it ran. Following are the DAG args that were passed (the tree view screenshot is not reproduced here).
args = {
    'owner': 'xyz',
    'depends_on_past': False,
    'catchup': False,
    'start_date': datetime(2019, 7, 8),
    'email': ['a@b.c'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'provide_context': True,
    'retry_delay': timedelta(minutes=2)
}
And I am creating the DAG as a context manager like this:
with DAG(PARENT_DAG_NAME, default_args=args, schedule_interval='30 * * * *') as main_dag:
    task1 = DummyOperator(
        task_id='Raw_Data_Ingestion_Started',
    )
    task2 = DummyOperator(
        task_id='Raw_Data_Ingestion_Completed',
    )
    task1 >> task2
Any idea what could be the issue? Is it something I need to change in the config file? And this behaviour is not periodic.
According to the official airflow documentation on DummyOperator:
Operator that does literally nothing. It can be used to group tasks in a DAG.
I configured my DAG like this:
default_args = {
    'owner': 'Aviv',
    'depends_on_past': False,
    'start_date': datetime(2017, 1, 1),
    'email': ['aviv@oron.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}
dag = DAG(
    'MyDAG',
    schedule_interval=timedelta(minutes=3),
    default_args=default_args,
    catchup=False
)
and for some reason, when I un-pause the DAG, it is executed twice immediately.
Any idea why? And is there any rule I can apply to tell this DAG to never run more than once at the same time?
You can specify max_active_runs like this:
dag = airflow.DAG(
    'customer_staging',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    template_searchpath=tmpl_search_path,
    default_args=args,
    max_active_runs=1)
I've never seen it happening, are you sure that those runs are not backfills, see: https://stackoverflow.com/a/47953439/9132848
I think it's because you have missed the scheduled time and Airflow is backfilling it automatically when you turn it ON again. You can disable this by setting
catchup_by_default = False in the airflow.cfg.
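That setting lives in the [scheduler] section of airflow.cfg; a minimal sketch of the change (assuming a stock config file):

# airflow.cfg
[scheduler]
catchup_by_default = False

The per-DAG equivalent is the catchup=False argument that already appears in the DAG definition above.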