I'm using Airflow 1.10.2, but it seems to ignore the timeout I've set for the DAG.
I'm setting a timeout for the DAG with the dagrun_timeout parameter (e.g. 20 seconds), and I've got a task that takes 2 minutes to run, yet Airflow marks the DAG as successful!
import time
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

args = {
    'owner': 'me',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True,
}

dag = DAG(
    'test_timeout',
    schedule_interval=None,
    default_args=args,
    dagrun_timeout=timedelta(seconds=20),
)

def this_passes(**kwargs):
    return

def this_passes_with_delay(**kwargs):
    time.sleep(120)
    return

would_succeed = PythonOperator(
    task_id='would_succeed',
    dag=dag,
    python_callable=this_passes,
    email=to,
)

would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    python_callable=this_passes_with_delay,
    email=to,
)

would_succeed >> would_succeed_with_delay
No error messages are thrown. Am I using an incorrect parameter?
As stated in the source code:
:param dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns, and only once the
# of active DagRuns == max_active_runs.
so this might be expected behavior, since you set schedule_interval=None. The idea here is rather to make sure a scheduled DAG doesn't run forever and block subsequent run instances.
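For illustration, here is a minimal sketch (a hypothetical DAG, assuming Airflow 1.10.x-style imports) of a setup where dagrun_timeout would actually be enforced: the DAG is scheduled and max_active_runs=1, so a run that exceeds 20 seconds is failed by the scheduler instead of blocking the next scheduled run.
import time
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

def sleep_two_minutes(**kwargs):
    time.sleep(120)

# Hypothetical DAG: dagrun_timeout is only enforced for scheduled runs,
# and only once the number of active runs reaches max_active_runs.
dag = DAG(
    'test_timeout_scheduled',             # illustrative dag_id
    schedule_interval='*/5 * * * *',      # scheduled, so dagrun_timeout applies
    start_date=days_ago(2),
    max_active_runs=1,                    # timeout enforced once this limit is reached
    dagrun_timeout=timedelta(seconds=20),
)

slow_task = PythonOperator(
    task_id='slow_task',
    python_callable=sleep_two_minutes,
    provide_context=True,
    dag=dag,
)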
Now, you may be interested in the execution_timeout available in all operators.
For example, you could set a 60s timeout on your PythonOperator like this:
would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    execution_timeout=timedelta(seconds=60),
    python_callable=this_passes_with_delay,
    email=to,
)
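If the callable is still running after 60 seconds, the task should be killed and fail with an AirflowTaskTimeout, so the task (and with it the DAG run) ends up marked as failed rather than successful.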
Related
I'm currently working on setting up alerts for long-running tasks in Airflow. To cancel/fail the Airflow DAG I've put dagrun_timeout in the default_args, and it does what I need: it fails/errors the DAG when it has been running for too long (usually because it is stuck). The only problem is that the function in on_failure_callback doesn't get called when the dagrun_timeout is exceeded, because on_failure_callback is at the task level (I think) while dagrun_timeout is at the DAG level.
How can I execute the on_failure_callback when the dagrun_timeout is exceeded, or how can I specify a function to be called when a DAG fails? Or should I rethink my approach?
Try setting on_failure_callback during DAG declaration:
with DAG(
    dag_id="failure_callback_example",
    on_failure_callback=_on_dag_run_fail,
    ...
) as dag:
    ...
The explanation is that an on_failure_callback defined in default_args only gets passed to the tasks being created, not to the DAG object.
Here is an example to try this behaviour:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.bash import BashOperator

def _on_dag_run_fail(context):
    print("***DAG failed!! do something***")
    print(f"The DAG failed because: {context['reason']}")
    print(context)

def _alarm(context):
    print("** Alarm Alarm!! **")
    task_instance: TaskInstance = context.get("task_instance")
    print(f"Task Instance: {task_instance} failed!")

default_args = {
    "owner": "mi_empresa",
    "email_on_failure": False,
    "on_failure_callback": _alarm,
}

with DAG(
    dag_id="failure_callback_example",
    start_date=datetime(2021, 9, 7),
    schedule_interval=None,
    default_args=default_args,
    catchup=False,
    on_failure_callback=_on_dag_run_fail,
    dagrun_timeout=timedelta(seconds=45),
) as dag:
    delayed = BashOperator(
        task_id="delayed",
        bash_command='echo "waiting..";sleep 60; echo "Done!!"',
    )
    will_fail = BashOperator(
        task_id="will_fail",
        bash_command="exit 1",
        # on_failure_callback=_alarm,
    )
    delayed >> will_fail
You can find the logs of the callback execution in the scheduler logs under AIRFLOW_HOME/logs/scheduler/<date>/failure_callback_example:
[2021-09-24 13:12:34,285] {logging_mixin.py:104} INFO - [2021-09-24 13:12:34,285] {dag.py:862} INFO - Executing dag callback function: <function _on_dag_run_fail at 0x7f83102e8670>
[2021-09-24 13:12:34,336] {logging_mixin.py:104} INFO - ***DAG failed!! do something***
[2021-09-24 13:12:34,345] {logging_mixin.py:104} INFO - The DAG failed because: timed_out
Edit:
Within the context dict, the key reason is passed to specify the cause of the DAG run failure. Some values are 'reason': 'timed_out' or 'reason': 'task_failure'. This can be used to perform specific behaviour in the callback based on the reason for the DAG run failure.
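As a hedged sketch of how the reason key could drive different behaviour in the DAG-level callback (the alert actions here are just prints for illustration):
def _on_dag_run_fail(context):
    """DAG-level failure callback that branches on the failure reason."""
    reason = context.get("reason")
    if reason == "timed_out":
        # The DAG run exceeded dagrun_timeout
        print("DAG run timed out, raise a long-running-DAG alert")
    elif reason == "task_failure":
        # A task inside the run failed
        print("DAG run failed because a task failed")
    else:
        print(f"DAG run failed for another reason: {reason}")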
I have two DAGs: DAG_A and DAG_B.
DAG_A triggers DAG_B through TriggerDagRunOperator.
My tasks in DAG_B:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:

    start = DummyOperator(
        task_id='start')

    delete_xcom_task = PostgresOperator(
        task_id='clean_up_xcom',
        postgres_conn_id='postgres_default',
        sql="delete from xcom where dag_id='DAG_A' and task_id='TASK_A' ")

    end = DummyOperator(
        task_id='end')
        # trigger_rule='none_failed')

    # num_table is set by DAG_A. Will have an empty list initially.
    iterable_string = Variable.get("num_table", default_var="[]")
    iterable_list = ast.literal_eval(iterable_string)

    for index, table in enumerate(iterable_list):
        table = table.strip()

        read_src1 = PythonOperator(
            task_id=f'Read_Source_data_{table}',
            python_callable=read_src,
            op_kwargs={'index': index}
        )

        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'ADLS_Loading_{table}',
            python_callable=upload_file_to_directory_bulk,
            op_kwargs={'index': index}
        )

        write_Snowflake1 = PythonOperator(
            task_id=f'Snowflake_Staging_{table}',
            python_callable=write_Snowflake,
            op_kwargs={'index': index}
        )

        task_sf_storedproc1 = DummyOperator(
            task_id=f'Snowflake_Processing_{table}'
        )

        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> task_sf_storedproc1 >> delete_xcom_task >> end
After executing airflow db init and bringing the webserver and scheduler up, DAG_B fails with a failure in the task delete_xcom_task.
2021-06-22 08:04:43,647] {taskinstance.py:871} INFO - Dependencies not met for <TaskInstance: Target_DIF.clean_up_xcom 2021-06-22T08:04:27.861718+00:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 2 non-success(es). upstream_tasks_state={'total': 2, 'successes': 0, 'skipped': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0}, upstream_task_ids={'Snowflake_Processing_products', 'Snowflake_Processing_inventories'}
[2021-06-22 08:04:43,651] {local_task_job.py:93} INFO - Task is not able to be run
But both DAGs become successful from the second run onwards.
Can anyone explain what is happening internally?
How can I avoid the failure during the first run?
Thanks.
I suspect that the problem is in schedule_interval='@once' for DAG_B: when you add the DAG for the first time, the schedule_interval tells the scheduler to run the DAG once. So DAG_B is triggered once by the scheduler and not by DAG_A. Any preparations that need to be done by DAG_A for DAG_B to run successfully have not been done yet, therefore DAG_B fails.
Later on, DAG_A runs as scheduled and triggers DAG_B as expected. Both succeed.
To avoid DAG_B being triggered by the scheduler, set schedule_interval=None.
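A minimal sketch of that setup (illustrative start_date and task, assuming Airflow 2.x import paths): DAG_B has schedule_interval=None, so only the TriggerDagRunOperator in DAG_A ever starts it.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# DAG_B is only ever started by DAG_A, never by the scheduler
with DAG(
    dag_id='DAG_B',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,          # no scheduler-initiated runs
) as dag_b:
    start = DummyOperator(task_id='start')

# DAG_A runs on its own schedule and triggers DAG_B when it is ready
with DAG(
    dag_id='DAG_A',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
) as dag_a:
    trigger_b = TriggerDagRunOperator(
        task_id='trigger_dag_b',
        trigger_dag_id='DAG_B',
    )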
I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run every hour, at 10 minutes past the hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the schedules that fall while the previous run is still in progress are executed.
For example, say the job at 01:10 takes 10 hours to finish. The schedules at 02:10, 03:10, 04:10, ... 11:10 that are supposed to run get skipped, and only the one at 12:10 is executed.
I am using the local executor. I am running the Airflow webserver & scheduler using the scripts below.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}

DAG_ID = 'reconciliation_job_pipeline'

MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'

recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")

dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__

spark_job_end_point = conf['sip_da']['spark_job_end_point']

fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']

fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={
        "Content-Type": "application/json"}
)

job_endpoint = conf['sip_da']['job_resource_endpoint']

fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)

fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator & JobStatusSensor are my custom classes extending SimpleHttpOperator & HttpSensor.
If I set depends_on_past to true, will it work as expected? Another problem I have with that option is that sometimes the status check job will fail, but the next schedule should still get triggered. How can I achieve this behavior?
I think the main point here is that you set catchup=False (more detail can be found here), so the Airflow scheduler skips those missed executions and you see the behavior you described.
It sounds like you do want catch-up to happen when the previous run takes longer than expected, so you can try changing it to catchup=True, as in the sketch below.
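A hedged sketch of that change, reusing DAG_ID, default_args and recon_schedule from your DAG file and keeping max_active_runs=1 so the backfilled runs still execute one at a time:
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=True)  # missed intervals are queued and run once the long run finishes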
I am using the TriggerDagRunOperator so that one controller DAG may trigger a target DAG. However, once the controller DAG triggers the target DAG, the target DAG switches to "running", but none of its tasks are scheduled. I would like for the target DAG's tasks to be scheduled as soon as the target DAG is triggered by the controller DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.python_operator import PythonOperator

# Controller DAG's callable
def conditionally_trigger(context, dag_run_obj):
    condition_param = context['params']['condition_param']
    if condition_param:
        return dag_run_obj
    return None

# Target DAG's callable
def say_hello():
    print("Hello")

# Controller DAG
controller_dag = DAG(
    dag_id="controller",
    default_args={
        "owner": "Patrick Stump",
        "start_date": datetime.utcnow(),
    },
    schedule_interval='@once',
)

# Target DAG
target_dag = DAG(
    dag_id="target",
    default_args={
        "owner": "Patrick Stump",
        "start_date": datetime.utcnow(),
    },
    schedule_interval=None,
)

# Controller DAG's task
controller_task = TriggerDagRunOperator(
    task_id="trigger_dag",
    trigger_dag_id="target",
    python_callable=conditionally_trigger,
    params={'condition_param': True},
    dag=controller_dag,
)

# Target DAG's task -- never scheduled!
target_task = PythonOperator(
    task_id="print_hello",
    python_callable=say_hello,
    dag=target_dag,
)
Thanks in advance :)
The problem may be using a dynamic start date like this: "start_date":datetime.utcnow(),
I would rename the dags, and give them a start date like 2019-01-01, and then try again.
The scheduler reads DAGs repeatedly, and when the start date changes every time the DAG is parsed (utcnow() will evaluate to a new value every time), unexpected things can happen.
Here is some further reading on start_date.
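For illustration, a minimal sketch of the suggested fix (the new dag_ids and the 2019-01-01 date are just examples): give both DAGs a fixed start_date so every parse of the file sees the same value.
from datetime import datetime

from airflow import DAG

# Static start_date: evaluates to the same value every time the scheduler parses the file
controller_dag = DAG(
    dag_id="controller_v2",   # renamed, as suggested above
    default_args={
        "owner": "Patrick Stump",
        "start_date": datetime(2019, 1, 1),
    },
    schedule_interval='@once',
)

target_dag = DAG(
    dag_id="target_v2",
    default_args={
        "owner": "Patrick Stump",
        "start_date": datetime(2019, 1, 1),
    },
    schedule_interval=None,
)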
I have a task in an Airflow DAG with three child tasks. Unfortunately, there are cases where this parent task will succeed but two of the three children will fail (and a retry on the children won't fix them).
Fixing them requires the parent to be retried (even though it didn't fail).
So I dutifully go into the graph view of the DAG run and 'clear' this parent task and all downstream tasks (+recursive).
Is there a way I can do this within the DAG itself?
If your tasks are part of a subdag, calling dag.clear() in the on_retry_callback of a SubDagOperator should do the trick:
SubDagOperator(
    subdag=subdag,
    task_id="...",
    on_retry_callback=lambda context: subdag.clear(
        start_date=context['execution_date'],
        end_date=context['execution_date']),
    dag=dag
)
We had a similar problem, which we resolved by putting the tasks with dependencies that we want to repeat into a sub-DAG. Then, when the sub-DAG retries, we clear the sub-DAG tasks' state using the on_retry_callback so that they all run again.
sub_dag = SubDagOperator(
    retry_delay=timedelta(seconds=30),
    subdag=create_sub_dag(),
    on_retry_callback=callback_subdag_clear,
    task_id=sub_dag_name,
    dag=dag,
)

def callback_subdag_clear(context):
    """Clears a sub-dag's tasks on retry."""
    dag_id = "{}.{}".format(
        context['dag'].dag_id,
        context['ti'].task_id,
    )
    execution_date = context['execution_date']
    sub_dag = DagBag().get_dag(dag_id)
    sub_dag.clear(
        start_date=execution_date,
        end_date=execution_date,
        only_failed=False,
        only_running=False,
        confirm_prompt=False,
        include_subdags=False
    )
(originally taken from here https://gist.github.com/nathairtras/6ce0b0294be8c27d672e2ad52e8f2117)
We opted for using the clear_task_instances method of the taskinstance:
@provide_session
def clear_tasks_fn(tis, session=None, activate_dag_runs=False, dag=None) -> None:
    """
    Wrapper for `clear_task_instances` to be used in callback function
    (that accepts only `context`)
    """
    taskinstance.clear_task_instances(tis=tis,
                                      session=session,
                                      activate_dag_runs=activate_dag_runs,
                                      dag=dag)

def clear_tasks_callback(context) -> None:
    """
    Clears tasks based on list passed as `task_ids_to_clear` parameter
    To be used as `on_retry_callback`
    """
    all_tasks = context["dag_run"].get_task_instances()
    dag = context["dag"]
    task_ids_to_clear = context["params"].get("task_ids_to_clear", [])
    tasks_to_clear = [ti for ti in all_tasks if ti.task_id in task_ids_to_clear]
    clear_tasks_fn(tasks_to_clear, dag=dag)
You would need to provide the list of tasks you want cleared on the callback, e.g on any child task:
DummyOperator('some_child',
              on_retry_callback=clear_tasks_callback,
              params=dict(
                  task_ids_to_clear=['some_child', 'parent']
              ),
              retries=1
)
It does not directly answer your question but I can suggest a better workaround:
default_args = {
    'start_date': datetime(2017, 12, 16),
    'depends_on_past': True,
    'retries': 100,
    'retry_delay': timedelta(seconds=120),
}

dag = DAG(
    dag_id='main_dag',
    schedule_interval='@daily',
    default_args=default_args,
    max_active_runs=1,
)
Set depends_on_past to True in the DAG's default_args.
Then, in the tasks of this DAG, limit the retries using retries:
DummyOperator(
    task_id='bar',
    retries=0,
    dag=dag,
)
This way the DAG is marked as failed when any task fails. Then the DAG will be retried.