Airflow ExternalTaskSensor does not trigger task

I'm trying to add a cross-DAG dependency using ExternalTaskSensor but haven't been able to get it to work. DAG A has schedule_interval=None, as it doesn't have a fixed schedule and is triggered externally by a file-creation event. DAG B should execute once DAG A has completed. Here is the code for dag_a and dag_b.
DAG A
default_args = {
    'depends_on_past': False,
    'start_date': datetime.today() - timedelta(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'queue': 'default'
}

dag = DAG('dag_a', default_args=default_args, schedule_interval=None)

dag_a = AWSBatchOperator(
    task_id='dag_a',
    job_name='dag_a',
    job_definition='dag_a',
    job_queue='MyAWSJobQueue',
    max_retries=10,
    aws_conn_id='aws_conn',
    region_name='us-east-1',
    dag=dag,
    parameters={},
    overrides={})
DAG B
default_args = {
    'depends_on_past': False,
    'start_date': datetime.today() - timedelta(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'queue': 'default'
}

dag = DAG('dag_b', default_args=default_args, schedule_interval=None)

dag_b = AWSBatchOperator(
    task_id='dag_b',
    job_name='dag_b',
    job_definition='dag_b',
    job_queue='MyAWSJobQueue',
    max_retries=10,
    aws_conn_id='aws_conn',
    region_name='us-east-1',
    dag=dag,
    parameters={},
    overrides={})

wait_for_dag_a = ExternalTaskSensor(
    task_id='wait_for_irr',
    external_dag_id='dag_a',
    external_task_id=None,
    execution_delta=timedelta(hours=1),
    timeout=300,
    dag=dag)

dag_b.set_upstream(wait_for_dag_a)
I set both DAGs with schedule_interval=None and the same start_date. I even added execution_delta=timedelta(hours=1) for dag_b, but dag_b hasn't triggered so far, even though dag_a is complete. Any help is appreciated.
I have tried using TriggerDagRunOperator, which works, but it is not suitable for my use case since dag_b will eventually depend on multiple parent DAGs.

I've run into a similar problem before, so there are two things to check. First, I don't see any time delta between DAG A and DAG B; both use the same default args, so you should not give the waiting task an execution_delta. Second, the sensor somehow cannot detect the DAG-finished signal when there are multiple parent DAGs, so I tried giving external_task_id a value such as 'dag_a-done' instead of the default None, and that worked. One more thing to mention: the task_id normally should not contain underscores.
Here is the source code of the external task sensor:
https://airflow.apache.org/docs/stable/_modules/airflow/sensors/external_task_sensor.html
And here is an article that describes how ExternalTaskSensor works:
https://medium.com/@fninsiima/sensing-the-completion-of-external-airflow-tasks-827344d03142
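Putting that advice together, a minimal sketch of the waiting task in dag_b could look like the following. It assumes Airflow 1.10-style imports and that dag_a has been given a final task with task_id 'dag_a-done' (for example a DummyOperator added as its last step); that task id is illustrative, not from the question.

from airflow.sensors.external_task_sensor import ExternalTaskSensor

wait_for_dag_a = ExternalTaskSensor(
    task_id='wait-for-dag-a',           # no underscores, per the answer
    external_dag_id='dag_a',
    external_task_id='dag_a-done',      # concrete task id instead of None
    # no execution_delta: both DAGs share the same start_date and schedule,
    # so the sensor looks for a dag_a run with the same execution date
    timeout=300,
    dag=dag,
)

dag_b.set_upstream(wait_for_dag_a)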

Related

Apache-Airflow - Task is in the none state when running DAG

I just started with Airflow and wanted to run a simple DAG with a BashOperator that outputs 'Hello' to the console.
I noticed that the status is stuck indefinitely in 'Running'.
When I go to the task details, I get this:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
Any suggestions or hints are much appreciated.
Dag:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'dude_whose_doors_open_like_this_-W-',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['yessure@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'Test',
    default_args=default_args,
    description='Test',
    schedule_interval=timedelta(days=1)
)

t1 = BashOperator(
    task_id='ECHO',
    bash_command='echo "Hello"',
    dag=dag
)

t1
I managed to solve it by adding 'start_date': dt(1970, 1, 1) to the default_args object and adding schedule_interval=None to my DAG object.
Could you remove the last line, the bare t1? It isn't necessary. Also, start_date shouldn't be set dynamically; this can lead to problems with the scheduling.
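For reference, a minimal sketch with both fixes applied (the static start_date and schedule_interval=None from the answer, and the bare trailing t1 removed, as the comment suggests):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'dude_whose_doors_open_like_this_-W-',
    'depends_on_past': False,
    'start_date': datetime(1970, 1, 1),  # static date instead of days_ago(2)
    'email': ['yessure@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'Test',
    default_args=default_args,
    description='Test',
    schedule_interval=None,  # run only when triggered manually
)

t1 = BashOperator(
    task_id='ECHO',
    bash_command='echo "Hello"',
    dag=dag,
)
# note: the bare trailing t1 line is dropped, per the comment above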

Skip run if DAG is already running

I have a DAG for which I need to run only one instance at a time. To solve this I am using max_active_runs=1, which works fine:
dag_args = {
    'owner': 'Owner',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}

sched = timedelta(hours=1)
dag = DAG(job_id, default_args=dag_args, schedule_interval=sched, max_active_runs=1)
The problem is:
When the DAG is about to be triggered and there is already an instance running, Airflow waits for that run to finish and then triggers the DAG again.
My question is:
Is there any way to skip this run, so the DAG does not run again right after the current execution in this case?
Thanks!
This is just from checking the docs, but it looks like you only need to add another parameter:
catchup=False
catchup (bool) – Perform scheduler catchup (or only run latest)?
Defaults to True
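A minimal sketch of how that looks, reusing the dag_args, sched, and job_id from the question:

dag = DAG(
    job_id,
    default_args=dag_args,
    schedule_interval=sched,
    max_active_runs=1,
    catchup=False,  # only schedule the latest interval; don't queue up missed runs
)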

Airflow ExternalTaskSensor execution timeout

I'm using airflow.operators.sensors.ExternalTaskSensor to make one Dag wait for another.
dag = DAG(
    'dag2',
    default_args={
        'owner': 'Me',
        'depends_on_past': False,
        'start_date': start_datetime,
        'email': ['me@example.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 2,
        'retry_delay': timedelta(minutes=10),
    },
    template_searchpath="%s/me/resources/" % DAGS_FOLDER,
    schedule_interval="{} {} * * *".format(minute, hour),
    max_active_runs=1
)

wait_for_dag1 = ExternalTaskSensor(
    task_id='wait_for_dag1',
    external_dag_id='dag1',
    external_task_id='dag1_task1',
    dag=dag
)
If something goes seriously wrong with the upstream DAG and it fails to complete within the given time period, I want the waiting task (the ExternalTaskSensor) to fail as well, instead of hanging forever.
How can I add a timeout to ExternalTaskSensor?
I'm looking at the documentation, but it does not seem to have a timeout parameter or anything similar. What should I do?
https://airflow.readthedocs.io/en/stable/_modules/airflow/sensors/external_task_sensor.html
The ExternalTaskSensor does take a timeout argument in seconds. It inherits the argument from BaseSensorOperator (https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/sensors/base/index.html). If you pass it timeout=60 on instantiation, it will fail after 60 seconds.
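For instance, a minimal sketch of the sensor from the question with a timeout added (the one-hour value and the poke_interval are arbitrary examples):

wait_for_dag1 = ExternalTaskSensor(
    task_id='wait_for_dag1',
    external_dag_id='dag1',
    external_task_id='dag1_task1',
    timeout=60 * 60,   # give up and fail after one hour of waiting
    poke_interval=60,  # check dag1_task1's state once a minute
    dag=dag
)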

How to limit Airflow to run only one instance of a DAG run at a time?

I want the tasks in the DAG to all finish before the 1st task of the next run gets executed.
I have max_active_runs = 1, but this still happens.
default_args = {
    'depends_on_past': True,
    'wait_for_downstream': True,
    'max_active_runs': 1,
    'start_date': datetime(2018, 3, 4),
    'owner': 't.n',
    'email': ['t.n@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=4)
}

dag = DAG('example', default_args=default_args, schedule_interval=schedule_interval)
(All of my tasks are dependent on the previous task. Airflow version is 1.8.0)
Thank you
I changed my code to pass max_active_runs as an argument of DAG() instead of putting it in default_args, and it worked.
Thanks SimonD for giving me the idea, even though your answer didn't point to it directly.
You've put the 'max_active_runs': 1 into the default_args parameter and not into the correct spot.
max_active_runs is a constructor argument for a DAG and should not be put into the default_args dictionary.
Here is an example DAG that shows where you need to move it to:
dag_args = {
    'owner': 'Owner',
    # 'max_active_runs': 1,  # <--- Here is where you had it.
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1, 12, 0),
    'email_on_failure': False
}

sched = timedelta(hours=1)

dag = DAG(
    job_id,
    default_args=dag_args,
    schedule_interval=sched,
    max_active_runs=1  # <---- Here is where it is supposed to be
)
If the tasks that your DAG runs are actually sub-DAGs, then you may need to pass max_active_runs into the sub-DAGs too, but I'm not 100% sure about this.
You can use XComs to do it. First, add two PythonOperators, 'start' and 'end', to the DAG and set the flow as:
start ---> ALL TASKS ----> end
'end' always pushes a variable
last_success = context['execution_date'] to XCom (xcom_push). (This requires provide_context=True in the PythonOperators.)
'start' always checks XCom (xcom_pull) to see whether there is a last_success variable whose value equals the previous DagRun's execution_date, or the DAG's start_date (to let the process start). A sketch of this pattern is shown below.
I followed this answer.
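A rough sketch of that start/end guard, assuming an Airflow 1.x-style PythonOperator with provide_context=True; the callable names, the use of prev_execution_date from the context, and the choice to raise AirflowSkipException when the previous run has not finished are illustrative assumptions, not from the answer.

from airflow.exceptions import AirflowSkipException
from airflow.operators.python_operator import PythonOperator


def check_previous_success(**context):
    # 'start': pull the marker pushed by the previous run's 'end' task
    last_success = context['ti'].xcom_pull(
        task_ids='end', key='last_success', include_prior_dates=True)
    prev_execution_date = context.get('prev_execution_date')  # assumed to be in the context
    if prev_execution_date is not None and str(prev_execution_date) != str(last_success):
        # the previous DagRun has not pushed its marker yet, i.e. it is still running
        raise AirflowSkipException('previous DagRun has not completed')


def record_success(**context):
    # 'end': push this run's execution_date so the next run can check it
    context['ti'].xcom_push(key='last_success', value=str(context['execution_date']))


start = PythonOperator(task_id='start', python_callable=check_previous_success,
                       provide_context=True, dag=dag)
end = PythonOperator(task_id='end', python_callable=record_success,
                     provide_context=True, dag=dag)

# start >> ALL TASKS >> end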
Actually, you should set DAG_CONCURRENCY=1 as an environment variable. That worked for me.

unwanted DAG runs in Airflow

I configured my DAG like this:
default_args = {
    'owner': 'Aviv',
    'depends_on_past': False,
    'start_date': datetime(2017, 1, 1),
    'email': ['aviv@oron.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'MyDAG',
    schedule_interval=timedelta(minutes=3),
    default_args=default_args,
    catchup=False
)
and for some reason, when I un-pause the DAG, it is executed twice immediately.
Any idea why? And is there any rule I can apply to tell this DAG to never run more than one instance at the same time?
You can specify max_active_runs like this:
dag = airflow.DAG(
    'customer_staging',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    template_searchpath=tmpl_search_path,
    default_args=args,
    max_active_runs=1)
I've never seen this happen. Are you sure that those runs are not backfills? See: https://stackoverflow.com/a/47953439/9132848
I think it's because you have missed the scheduled time and Airflow is backfilling it automatically when you turn the DAG on again. You can disable this with
catchup_by_default = False in airflow.cfg.
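If you want to change that default globally rather than per DAG, the setting goes in airflow.cfg; to my understanding it lives under the [scheduler] section (note that the catchup=False already set on the DAG in the question overrides it for that DAG):

[scheduler]
# global default for DAGs that don't set catchup themselves
catchup_by_default = False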

Resources