Airflow backfill DAG not running until given end date

I have a backfill DAG scheduled to run yearly from 01-01-2012 to 01-01-2018, but it only runs from 01-01-2012 until 01-01-2017. Why does it not run until 01-01-2018, and how can I make it run until 2018?
Here is the code that I have used in the DAG:
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2012, 1, 1),
    'end_date': datetime(2018, 1, 1),
    'email': ['sef12@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='SAMPLE_LOAD',
    schedule_interval='@yearly',
    default_args=default_args,
    catchup=True,
    max_active_runs=1,
    concurrency=1
)

This is due to how Airflow handles scheduling. From the docs:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's repeat that: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Your run for 2018 will start once 2018 is over, since that's the end of the interval.
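To make the interval arithmetic concrete, here is how the yearly schedule above maps execution dates to actual trigger times (a sketch based only on the start_date, end_date and schedule_interval from the question):

execution_date 2012-01-01 -> triggered at 2013-01-01
...
execution_date 2016-01-01 -> triggered at 2017-01-01
execution_date 2017-01-01 -> triggered at 2018-01-01 (the last run you currently see)
execution_date 2018-01-01 -> would only be triggered at 2019-01-01, once 2018 is over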

Related

How to configure Apache Airflow start_date and schedule_interval to run daily at 7am?

I'm using Airflow 2.3.3 (through GCP Composer).
I pass this yaml configuration when deploying my DAG:
dag_args:
  dag_id: FTP_DAILY
  default_args:
    owner: 'Dev team'
    start_date: "00:00:00"
    max_active_runs: 1
    retries: 2
  schedule_interval: "0 7 * * *"
  ftp_conn_id: 'ftp_dev'
I want this DAG to run at 7am UTC every morning, but it's not running. In the UI it says next run: 2022-11-22, 07:00:00 (as of Nov 22nd) and it never runs. How should I configure my start_date and schedule_interval so that the DAG runs at 7am UTC every day, starting from the nearest 7am after the deployment?
You can pass default args directly in the Python DAG code and calculate yesterday's date, for example:
from datetime import timedelta

from airflow.utils.dates import days_ago

dag_default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'start_date': days_ago(1)
}
Then in the DAG:
import airflow

with airflow.DAG(
        "dag_name",
        default_args=dag_default_args,
        schedule_interval="0 7 * * *") as dag:
    ...
In this case the schedule_interval and cron will work correctly: Airflow will base the cron schedule on the start date.
The main concept of Airflow is that the execution of a DAG starts after the required interval has passed. If you schedule a DAG with the above setup, Airflow will parse
interval_start_date as 2022-11-22 07:00:00
and interval_end_date as 2022-11-23 07:00:00
Since you are asking Airflow to fetch data for this interval, it will wait for the interval to pass, thus starting execution on November 23rd at 7am.
If you want it to trigger immediately after you deploy the DAG, you need to move the start date back by one day. You might also need to set the catchup flag to True.
import pendulum

from airflow import DAG

with DAG(
        dag_id='new_workflow4',
        schedule_interval="0 7 * * *",
        start_date=pendulum.datetime(2022, 11, 21, hour=0, tz="UTC"),
        catchup=True
) as dag:
    ...

Airflow doesn't execute DAGs at midnight

I created a DAG with the following configuration:
from datetime import timedelta

import airflow
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(0, 0, minute=1),
    'email': ['francisco.salazar.12@sansano.usm.cl'],
    'email_on_failure': False,
    'email_on_retry': False,
    'max_active_runs': 1,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'provide_context': True
}

dag = DAG(
    'terralink_environmetal_darksky',
    default_args=default_args,
    description='Extract Data from Darksky API',
    catchup=False,
    schedule_interval='31 * * * *',
)
The issue is that the scheduler works correctly and executes a DAG run at every hour I defined in schedule_interval (at minute 31 of every hour), BUT at midnight, i.e. the last execution of the day (scheduled at 00:31:00 of the next day), the DAG run is not triggered.
I think the problem is related to start_date, but I don't know yet how to define this parameter in order to avoid the problem.
Airflow recommends using a fixed (static) start_date for your DAG. start_date mainly specifies when you want your DAG to start running for the very first time. After that (or if you do not need to backfill or reset your DAG), schedule_interval is what matters.
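As a minimal sketch of that recommendation, using the hourly schedule from the question (the fixed date below is only illustrative):

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Fixed start_date instead of a moving days_ago(...) value (illustrative date).
    'start_date': datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'terralink_environmetal_darksky',
    default_args=default_args,
    catchup=False,                   # do not backfill every hour since the start date
    schedule_interval='31 * * * *',  # minute 31 of every hour, including 00:31
)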

Skip run if DAG is already running

I have a DAG that should only ever have one instance running at a time. To achieve this I am using max_active_runs=1, which works fine:
from datetime import datetime, timedelta

from airflow import DAG

dag_args = {
    'owner': 'Owner',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}

sched = timedelta(hours=1)
# job_id is defined elsewhere in the poster's code.
dag = DAG(job_id, default_args=dag_args, schedule_interval=sched, max_active_runs=1)
The problem is:
When the DAG is due to be triggered and there is already an instance running, Airflow waits for that run to finish and then triggers the DAG again.
My question is:
Is there any way to skip this run, so that the DAG does not run again right after the current execution in this case?
Thanks!
This is just from checking the docs, but it looks like you only need to add another parameter:
catchup=False
catchup (bool) – Perform scheduler catchup (or only run latest)?
Defaults to True
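Applied to the DAG from the question, a minimal sketch (reusing the same dag_args, sched and job_id from above) would be:

dag = DAG(
    job_id,
    default_args=dag_args,
    schedule_interval=sched,
    max_active_runs=1,
    catchup=False,  # only schedule the latest run instead of catching up on missed ones
)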

Apache Airflow DAG schedules at midnight UTC

I created an Apache Airflow DAG with the following default args. I want this DAG to run every day at 10PM UTC, but it always runs at 12AM UTC, ignoring the time I set in start_date. Is this not the right way? Thanks.
from datetime import datetime, timedelta

from airflow import DAG

# config is the poster's own settings module.
default_args = {
    'owner': config.OWNER,
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 14, 22, 0, 0),
    'email': [config.ALERT_EMAIL],
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1),
    'retries': 2,
}

# DAG
dag = DAG('Test',
          default_args=default_args,
          description='Initial setup',
          schedule_interval='@daily')
You can also use a cron expression in your schedule_interval argument, like this:
# DAG
dag = DAG('Test',
          default_args=default_args,
          description='Initial setup',
          schedule_interval='0 22 * * *')
Regarding the schedule_interval you have at least three options:
datetime.timedelta
dateutil.relativedelta
cron style string
The schedule_interval defines how often the DAG runs: the interval is added to your latest task instance's execution_date to figure out the next schedule. And keep in mind that the start_date of the task determines the execution_date of the first task instance.
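A quick sketch of the three forms (the DAG ids are illustrative; DAG and default_args are assumed from the snippets above):

from datetime import timedelta

from dateutil.relativedelta import relativedelta

# datetime.timedelta
dag1 = DAG('test_timedelta', default_args=default_args,
           schedule_interval=timedelta(hours=24))

# dateutil.relativedelta
dag2 = DAG('test_relativedelta', default_args=default_args,
           schedule_interval=relativedelta(days=1))

# cron style string
dag3 = DAG('test_cron', default_args=default_args,
           schedule_interval='0 22 * * *')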
All of the above is correct.
I have encountered an issue where, in Airflow 2.0, schedule_interval is ignored when put in the default_args. When I removed it and put it in the DAG declaration, all worked. I could test this by looking at the DAG details in the UI.
Example:
default_args = {
    'owner': 'Hector Hoffman',
    'depends_on_past': False,
    'start_date': start_date,
    'schedule_interval': '0 5 * * *',
    'email': ['hector@email.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'on_failure_callback': task_fail_slack_alert
}
Results in the schedule_interval being ignored (visible in the DAG details in the UI).
Whereas, when I put it in the DAG:
from airflow import models

with models.DAG(
        "dealstampede_workflow",
        default_args=default_args,
        catchup=False,
        schedule_interval='0 5 * * *'
) as dag:
    ...
Results in the schedule_interval being picked up correctly.
If anyone has any insight as to why the schedule_interval doesn't work in the default_args I'd appreciate feedback. Thanks.

How to limit Airflow to run only one instance of a DAG run at a time?

I want the tasks in the DAG to all finish before the 1st task of the next run gets executed.
I have max_active_runs = 1, but this still happens.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'depends_on_past': True,
    'wait_for_downstream': True,
    'max_active_runs': 1,
    'start_date': datetime(2018, 3, 4),
    'owner': 't.n',
    'email': ['t.n@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=4)
}

# schedule_interval is defined elsewhere in the poster's code.
dag = DAG('example', default_args=default_args, schedule_interval=schedule_interval)
(All of my tasks are dependent on the previous task. Airflow version is 1.8.0)
Thank you
I changed it so that max_active_runs is passed as an argument of DAG() instead of being put in default_args, and it worked.
Thanks SimonD for giving me the idea, even though your answer didn't point to it directly.
You've put the 'max_active_runs': 1 into the default_args parameter and not into the correct spot.
max_active_runs is a constructor argument for a DAG and should not be put into the default_args dictionary.
Here is an example DAG that shows where you need to move it to:
dag_args = {
    'owner': 'Owner',
    # 'max_active_runs': 1,  # <--- Here is where you had it.
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}

sched = timedelta(hours=1)

dag = DAG(
    job_id,
    default_args=dag_args,
    schedule_interval=sched,
    max_active_runs=1  # <---- Here is where it is supposed to be
)
If the tasks that your DAG is running are actually sub-DAGs, then you may need to pass max_active_runs into the sub-DAGs too, but I'm not 100% sure on this.
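For completeness, a hedged sketch of what that might look like with a SubDagOperator (the factory function and ids are purely illustrative, and dag, dag_args and sched come from the example above; as noted, it's not certain this is required):

from airflow.operators.subdag_operator import SubDagOperator

def build_subdag(parent_dag_id, child_id, args):
    # The sub-DAG's dag_id must be "<parent_dag_id>.<task_id>".
    return DAG(
        dag_id='{}.{}'.format(parent_dag_id, child_id),
        default_args=args,
        schedule_interval=sched,
        max_active_runs=1,  # mirror the parent DAG's setting
    )

subdag_task = SubDagOperator(
    task_id='child',
    subdag=build_subdag(dag.dag_id, 'child', dag_args),
    dag=dag,
)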
You can use XComs to do it. First add two PythonOperators, 'start' and 'end', to the DAG. Set the flow as:
start ---> ALL TASKS ----> end
'end' will always push a variable last_success = context['execution_date'] to XCom (xcom_push). (This requires provide_context=True in the PythonOperators.)
And 'start' will always check XCom (xcom_pull) to see whether there is a last_success variable whose value equals either the previous DagRun's execution_date or the DAG's start_date (to let the process start).
I followed this answer.
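A minimal sketch of that guard, assuming Airflow 1.x-style PythonOperators (the task ids, the last_success key and the use of prev_execution_date are illustrative and may need adjusting for your Airflow version; the start_date branch is approximated by simply letting the very first run through):

from airflow.exceptions import AirflowSkipException
from airflow.operators.python_operator import PythonOperator

def start_check(**context):
    # Pull the execution_date that the previous run's 'end' task recorded.
    last_success = context['ti'].xcom_pull(
        task_ids='end', key='last_success', include_prior_dates=True)
    prev_exec = context.get('prev_execution_date')
    # First run ever, or the previous run finished cleanly: let this run proceed.
    if prev_exec is None or last_success == prev_exec:
        return
    raise AirflowSkipException('Previous DagRun did not finish; skipping this run.')

def end_mark(**context):
    # Record that this run completed, so the next 'start' check can see it.
    context['ti'].xcom_push(key='last_success', value=context['execution_date'])

start = PythonOperator(task_id='start', python_callable=start_check,
                       provide_context=True, dag=dag)
end = PythonOperator(task_id='end', python_callable=end_mark,
                     provide_context=True, dag=dag)

# start >> ALL TASKS >> end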
Actually you should set dag_concurrency to 1 (for example via the AIRFLOW__CORE__DAG_CONCURRENCY environment variable). Worked for me.
