DAG failure during the initial run - airflow

I have two DAGs, DAG_A and DAG_B.
DAG_A triggers DAG_B through TriggerDagRunOperator.
The tasks in DAG_B:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:
    start = DummyOperator(
        task_id='start')
    delete_xcom_task = PostgresOperator(
        task_id='clean_up_xcom',
        postgres_conn_id='postgres_default',
        sql="delete from xcom where dag_id='DAG_A' and task_id='TASK_A' ")
    end = DummyOperator(
        task_id='end')
        # trigger_rule='none_failed')
    # num_table is set by DAG_A. Will have an empty list initially.
    iterable_string = Variable.get("num_table", default_var="[]")
    iterable_list = ast.literal_eval(iterable_string)
    for index, table in enumerate(iterable_list):
        table = table.strip()
        read_src1 = PythonOperator(
            task_id=f'Read_Source_data_{table}',
            python_callable=read_src,
            op_kwargs={'index': index}
        )
        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'ADLS_Loading_{table}',
            python_callable=upload_file_to_directory_bulk,
            op_kwargs={'index': index}
        )
        write_Snowflake1 = PythonOperator(
            task_id=f'Snowflake_Staging_{table}',
            python_callable=write_Snowflake,
            op_kwargs={'index': index}
        )
        task_sf_storedproc1 = DummyOperator(
            task_id=f'Snowflake_Processing_{table}'
        )
        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> task_sf_storedproc1 >> delete_xcom_task >> end
After running airflow db init and starting the webserver and scheduler, DAG_B fails in the task delete_xcom_task:
[2021-06-22 08:04:43,647] {taskinstance.py:871} INFO - Dependencies not met for <TaskInstance: Target_DIF.clean_up_xcom 2021-06-22T08:04:27.861718+00:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 2 non-success(es). upstream_tasks_state={'total': 2, 'successes': 0, 'skipped': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0}, upstream_task_ids={'Snowflake_Processing_products', 'Snowflake_Processing_inventories'}
[2021-06-22 08:04:43,651] {local_task_job.py:93} INFO - Task is not able to be run
But both DAGs succeed from the second run onward.
Can anyone explain what is happening internally?
How can I avoid the failure during the first run?
Thanks.

I suspect the problem is schedule_interval='@once' on DAG_B. When a DAG is added for the first time, that setting tells the scheduler to run it once on its own. So the first run of DAG_B is triggered by the scheduler, not by DAG_A. Any preparation that DAG_A must do for DAG_B to run successfully has not been done yet, so DAG_B fails.
Later on, DAG_A runs as scheduled and triggers DAG_B as expected, and both succeed.
To keep the scheduler from triggering DAG_B, set schedule_interval=None.

Related

How to run a specific dag first in airflow?

I'm using Apache Airflow (2.3.1) to load data into a database. I have more than 150 DAGs and I need some of them to run first. How can I do this?
All the DAGs are scheduled to start at 3 am, and they are queued and run in a random order.
I read about priority_weight and weight_rule, but these apply only to tasks, not to a DAG as a whole.
As I said, the DAG queue is built randomly, and I would like to control it and hard-code which DAG should be executed first.
You can use an ExternalTaskSensor to define cross-DAG dependencies.
In particular, it lets you wait for an external task or DAG (that is, one in a different DAG) to complete before proceeding. You can configure the dag_id and task_id to wait for, and a time delta for the execution_date (by default, it expects the external DAG run to have the same execution date as the current one).
Full details and possible configurations are available in the official documentation: Cross-DAG Dependencies.
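To make the time-delta point concrete, here is a small Airflow-free sketch of which external run the sensor looks for. It assumes the sensor's execution_delta argument is subtracted from the current run's execution date; the helper name external_logical_date is hypothetical and only illustrates the arithmetic.

```python
from datetime import datetime, timedelta


def external_logical_date(current_logical_date, execution_delta=None):
    """Execution date of the external DAG run the sensor would wait for.

    Hypothetical helper: with no delta the dates must match exactly,
    otherwise the delta is subtracted from the current run's date.
    """
    if execution_delta is None:
        return current_logical_date
    return current_logical_date - execution_delta


# Current DAG run scheduled for 03:00.
current = datetime(2022, 1, 1, 3, 0)

# Same schedule in both DAGs: the sensor waits for the 03:00 run.
same_schedule = external_logical_date(current)

# External DAG scheduled one hour earlier: pass a one-hour delta.
one_hour_earlier = external_logical_date(current, timedelta(hours=1))
```

If the two DAGs do not share a schedule and no delta is given, the sensor waits for a run that never exists and eventually times out.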
Example usage
DAG to be executed first:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id='first_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *'
) as first_dag:
    first_task = DummyOperator(task_id='first_task')
DAG to be executed later:
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id='second_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *'
) as second_dag:
    # Wait for first_task in first_dag (same execution date) to succeed.
    first_task_sensor = ExternalTaskSensor(
        task_id='first_task_sensor',
        external_dag_id='first_dag',
        external_task_id='first_task',
        timeout=600,
        allowed_states=['success'],
        failed_states=['failed', 'skipped'],
        mode='reschedule'
    )
    second_task = DummyOperator(task_id='second_task')
    first_task_sensor >> second_task

Airflow - Task-Group with Dynamic task - Can't trigger Downstream if one upstream is failed/skipped

I have an Airflow DAG with a task_group containing a loop that generates two dynamic flows. After the task_group I need to perform other actions. My problem is:
Inside the task_group I have a branching operator that checks whether the last task should run or not. If at least one of the two flows completes successfully, I want to continue my process. For that I'm using the trigger rule one_success. My code:
with DAG(
    dag_id='hello_world',
    schedule_interval=None,
    start_date=datetime(2022, 8, 25),
    default_args=default_args,
    max_active_runs=1,
    catchup=False,
    concurrency=1,
) as dag:
    task_a = DummyOperator(task_id="task_a")
    with TaskGroup(group_id='task_group') as my_group:
        my_list = ['a', 'b']
        for i in my_list:
            task_b = PythonOperator(
                task_id="task_b_{}".format(i),
                python_callable=p_task_1)
            var_to_continue = check_status(i)
            is_running = ShortCircuitOperator(
                task_id="is_{}_running".format(i),
                python_callable=lambda x: x in [True],
                op_args=[var_to_continue])
            task_c = PythonOperator(
                task_id="task_c_{}".format(i),
                python_callable=p_task_2)
            task_b >> is_running >> task_c
    task_d = DummyOperator(task_id="task_d", trigger_rule=TriggerRule.ONE_SUCCESS)
    task_a >> my_group >> task_d
My problem is: if one of the iterations ends up skipped, task_d is always skipped, even if the other flow succeeds.
Do you know how to resolve this?
Thanks!
After a deep search, I found the problem.
By default, ShortCircuitOperator ignores the trigger rules of all downstream tasks. If its callable returns False, it cuts the circuit: it skips all downstream tasks (its direct downstream tasks, their downstream tasks, and so on).
In Airflow 2.3.0, in this PR, a new argument ignore_downstream_trigger_rules was added, with default value True, which ignores the downstream trigger rules. If you set it to False, only the direct downstream tasks are skipped and the trigger rules of the remaining downstream tasks are respected.
If you are using a version older than 2.3.0, you should replace the ShortCircuitOperator with another solution, for example:
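from airflow.exceptions import AirflowSkipException

def check_condition():
    if not condition:  # add your logic
        # Marks only this task as skipped; downstream trigger rules still apply.
        raise AirflowSkipException()

is_running = PythonOperator(..., python_callable=check_condition)
is_running >> task_c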
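The difference between the two settings of ignore_downstream_trigger_rules can be pictured with a small Airflow-free model. This is only a sketch of the behavior described above, not Airflow's implementation; the graph, the task names, and both helper functions are hypothetical.

```python
from collections import deque


def all_downstream(graph, start):
    """Return every task reachable from `start` (excluding `start` itself)."""
    seen, queue = set(), deque(graph.get(start, []))
    while queue:
        task = queue.popleft()
        if task not in seen:
            seen.add(task)
            queue.extend(graph.get(task, []))
    return seen


def skipped_by_short_circuit(graph, task, ignore_downstream_trigger_rules=True):
    """Tasks force-skipped when the short-circuit condition is False."""
    if ignore_downstream_trigger_rules:
        # Default: the whole downstream subtree is skipped outright.
        return all_downstream(graph, task)
    # Otherwise only direct children are skipped; tasks further
    # downstream are left to their own trigger rules.
    return set(graph.get(task, []))


# Hypothetical graph mirroring the question: two flows joining in task_d.
graph = {
    "is_a_running": ["task_c_a"],
    "is_b_running": ["task_c_b"],
    "task_c_a": ["task_d"],
    "task_c_b": ["task_d"],
}

skipped_default = skipped_by_short_circuit(graph, "is_a_running")
skipped_respecting = skipped_by_short_circuit(
    graph, "is_a_running", ignore_downstream_trigger_rules=False
)
```

In this model the default skips task_d whatever its trigger rule says, while with False only task_c_a is skipped and task_d's one_success rule decides whether it runs.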

Programmatically clear the state of airflow task instances

I want to clear the tasks in DAG B when DAG A completes execution. Both A and B are scheduled DAGs.
Is there any operator/way to clear the state of tasks and re-run DAG B programmatically?
I'm aware of the CLI option and Web UI option to clear the tasks.
I would recommend staying away from the CLI here!
The Airflow functionality of DAGs and tasks is much better exposed when referencing the objects directly than when going through a BashOperator and/or the CLI module.
Add a Python task to DAG A named "clear_dag_b" that imports dag_b from the dags folder (module) and does this:
from dags.dag_b import dag as dag_b

def clear_dag_b(**context):
    # The original answer left the context key unspecified; 'execution_date' is assumed here.
    exec_date = context['execution_date']
    dag_b.clear(start_date=exec_date, end_date=exec_date)
Important! If the start_date/end_date window does not match or overlap DAG B's schedule time, the clear() operation will miss the DAG runs. This example assumes DAGs A and B are scheduled identically, and that you only want to clear day X from B when A executes day X.
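The date-window caveat can be illustrated without Airflow. The sketch below assumes clear() selects runs whose execution date falls inside the inclusive start_date/end_date window; the helper name and the schedules are hypothetical.

```python
from datetime import datetime


def runs_matched_by_clear(run_dates, start_date, end_date):
    """Return the run dates inside the inclusive [start_date, end_date] window."""
    return [d for d in run_dates if start_date <= d <= end_date]


# Hypothetical schedule: DAG B runs daily at midnight.
dag_b_runs = [datetime(2021, 6, 21), datetime(2021, 6, 22)]

# DAG A scheduled identically: its execution date matches a DAG B run.
matched = runs_matched_by_clear(
    dag_b_runs, datetime(2021, 6, 22), datetime(2021, 6, 22)
)

# DAG A scheduled at 03:00 instead: the window contains no DAG B run.
missed = runs_matched_by_clear(
    dag_b_runs, datetime(2021, 6, 22, 3), datetime(2021, 6, 22, 3)
)
```

When the schedules differ, widening the window (for example, from the start of the day to DAG A's execution date) is one way to catch the intended run.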
It might make sense to include a check for whether the dag_b has already run or not, before clearing:
dag_b_run = dag_b.get_dagrun(exec_date)  # returns None or a dag_run object
cli.py is an incredibly useful place to peep into SQLAlchemy magic of Airflow.
The clear command is implemented here
@cli_utils.action_logging
def clear(args):
    logging.basicConfig(
        level=settings.LOGGING_LEVEL,
        format=settings.SIMPLE_LOG_FORMAT)
    dags = get_dags(args)

    if args.task_regex:
        for idx, dag in enumerate(dags):
            dags[idx] = dag.sub_dag(
                task_regex=args.task_regex,
                include_downstream=args.downstream,
                include_upstream=args.upstream)

    DAG.clear_dags(
        dags,
        start_date=args.start_date,
        end_date=args.end_date,
        only_failed=args.only_failed,
        only_running=args.only_running,
        confirm_prompt=not args.no_confirm,
        include_subdags=not args.exclude_subdags,
        include_parentdag=not args.exclude_parentdag,
    )
Looking at the source, you can either
replicate it (assuming you also want to modify the functionality a bit)
or maybe just do from airflow.bin import cli and invoke the required functions directly
Since my objective was to re-run DAG B whenever DAG A completes execution, I ended up clearing DAG B using a BashOperator:
# Clear the tasks in another dag
last_task = BashOperator(
    task_id='last_task',
    bash_command='airflow clear example_target_dag -c ',
    dag=dag)

first_task >> last_task
It is possible but I would be careful about getting into an endless loop of retries if the task never succeeds. You can call a bash command within the on_retry_callback where you can specify which tasks/dag runs you want to clear.
This works in 2.0 as the clear commands have changed
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#clear
In this example, I am clearing from t2 & downstream tasks when t3 eventually fails:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t2 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:
    t0 = DummyOperator(
        task_id='t0'
    )
    t1 = DummyOperator(
        task_id='t1'
    )
    t2 = DummyOperator(
        task_id='t2'
    )
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        # retries=1,
        on_failure_callback=clear_upstream_task
    )
    t0 >> t1 >> t2 >> t3
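The command string in the callback above is assembled with a bare f-string. As a hedged sketch, the same `airflow tasks clear` command can be built with the date in ISO format and each part shell-quoted, so a date or DAG id containing spaces cannot break the command. The helper name build_clear_command is hypothetical; the flags are the ones already used in the example (-s, -t, -d, -y).

```python
import shlex
from datetime import datetime


def build_clear_command(dag_id, task_regex, start_date, downstream=True):
    """Build an `airflow tasks clear` command line as a shell-safe string."""
    parts = [
        "airflow", "tasks", "clear",
        "-s", start_date.isoformat(),  # start date without embedded spaces
        "-t", task_regex,              # task id regex to clear
    ]
    if downstream:
        parts.append("-d")             # also clear downstream tasks
    parts += ["-y", dag_id]            # -y skips the confirmation prompt
    return " ".join(shlex.quote(p) for p in parts)


cmd = build_clear_command("clear_upstream_task", "t2", datetime(2021, 1, 1, 0, 5))
```

The resulting string can be passed as bash_command to the BashOperator inside the callback.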

How to define a timeout for Apache Airflow DAGs?

I'm using Airflow 1.10.2 but Airflow seems to ignore the timeout I've set for the DAG.
I'm setting a timeout period for the DAG using the dagrun_timeout parameter (e.g. 20 seconds) and I've got a task which takes 2 mins to run, but Airflow marks the DAG as successful!
import time
from datetime import timedelta

import airflow.utils.dates
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

args = {
    'owner': 'me',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True,
}

dag = DAG(
    'test_timeout',
    schedule_interval=None,
    default_args=args,
    dagrun_timeout=timedelta(seconds=20),
)

def this_passes(**kwargs):
    return

def this_passes_with_delay(**kwargs):
    time.sleep(120)
    return

would_succeed = PythonOperator(
    task_id='would_succeed',
    dag=dag,
    python_callable=this_passes,
    email=to,
)

would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    python_callable=this_passes_with_delay,
    email=to,
)

would_succeed >> would_succeed_with_delay
No error messages are thrown. Am I using an incorrect parameter?
As stated in the source code:
:param dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns, and only once the
# of active DagRuns == max_active_runs.
So this might be expected behavior, since you set schedule_interval=None. The idea is rather to make sure a scheduled DAG won't run forever and block subsequent run instances.
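Read literally, the quoted docstring gives two conditions that must both hold before dagrun_timeout has any effect. The sketch below only restates that docstring as a predicate; it is not the scheduler's code, and the function name and the limit of 16 are hypothetical.

```python
def dagrun_timeout_is_enforced(is_scheduled_run, active_runs, max_active_runs):
    """Restates the docstring: scheduled runs only, and only at the run limit."""
    return is_scheduled_run and active_runs == max_active_runs


# schedule_interval=None means every run is triggered manually.
manual_run = dagrun_timeout_is_enforced(False, 1, 16)

# A scheduled run below the limit is not timed out either.
scheduled_with_room = dagrun_timeout_is_enforced(True, 1, 16)

# Only a scheduled run with the limit reached is subject to the timeout.
scheduled_at_limit = dagrun_timeout_is_enforced(True, 16, 16)
```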
Now, you may be interested in the execution_timeout available in all operators.
For example, you could set a 60s timeout on your PythonOperator like this:
would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    execution_timeout=timedelta(seconds=60),
    python_callable=this_passes_with_delay,
    email=to)

clear an upstream task in airflow within the dag

I have a task in an Airflow DAG. It has three child tasks. Unfortunately, there are cases where this parent task will succeed, but two of the three children will fail (and a retry on the children won't fix them).
It requires the parent to retry (even though it didn't fail).
So I dutifully go into the graph view of the DAG run and 'clear' this parent task and all downstream tasks (+recursive).
Is there a way I can do this within the DAG itself?
If your tasks are part of a subdag, calling dag.clear() in the on_retry_callback of a SubDagOperator should do the trick:
SubDagOperator(
    subdag=subdag,
    task_id="...",
    on_retry_callback=lambda context: subdag.clear(
        start_date=context['execution_date'],
        end_date=context['execution_date']),
    dag=dag
)
We had a similar problem which we resolved by putting the task with dependencies that we want to repeat into a sub dag. Then when the sub dag retries we clear the sub dag tasks state using the on_retry_callback so that they all run again.
sub_dag = SubDagOperator(
    retry_delay=timedelta(seconds=30),
    subdag=create_sub_dag(),
    on_retry_callback=callback_subdag_clear,
    task_id=sub_dag_name,
    dag=dag,
)

def callback_subdag_clear(context):
    """Clears a sub-dag's tasks on retry."""
    dag_id = "{}.{}".format(
        context['dag'].dag_id,
        context['ti'].task_id,
    )
    execution_date = context['execution_date']
    sub_dag = DagBag().get_dag(dag_id)
    sub_dag.clear(
        start_date=execution_date,
        end_date=execution_date,
        only_failed=False,
        only_running=False,
        confirm_prompt=False,
        include_subdags=False
    )
(originally taken from here https://gist.github.com/nathairtras/6ce0b0294be8c27d672e2ad52e8f2117)
We opted for using the clear_task_instances method of the taskinstance:
@provide_session
def clear_tasks_fn(tis, session=None, activate_dag_runs=False, dag=None) -> None:
    """
    Wrapper for `clear_task_instances` to be used in callback function
    (that accepts only `context`)
    """
    taskinstance.clear_task_instances(tis=tis,
                                      session=session,
                                      activate_dag_runs=activate_dag_runs,
                                      dag=dag)

def clear_tasks_callback(context) -> None:
    """
    Clears tasks based on list passed as `task_ids_to_clear` parameter
    To be used as `on_retry_callback`
    """
    all_tasks = context["dag_run"].get_task_instances()
    dag = context["dag"]
    task_ids_to_clear = context["params"].get("task_ids_to_clear", [])
    tasks_to_clear = [ti for ti in all_tasks if ti.task_id in task_ids_to_clear]
    clear_tasks_fn(tasks_to_clear,
                   dag=dag)
You would need to provide the list of tasks you want cleared to the callback, e.g. on any child task:
DummyOperator(
    task_id='some_child',
    on_retry_callback=clear_tasks_callback,
    params=dict(
        task_ids_to_clear=['some_child', 'parent']
    ),
    retries=1
)
It does not directly answer your question, but I can suggest a better workaround:
default_args = {
    'start_date': datetime(2017, 12, 16),
    'depends_on_past': True,
}
dag = DAG(
    dag_id='main_dag',
    schedule_interval='@daily',
    default_args=default_args,
    max_active_runs=1,
    retries=100,
    retry_delay=timedelta(seconds=120)
)
Set depends_on_past to True in the DAG.
Then, in the tasks of this DAG, limit the retries using retries:
DummyOperator(
    task_id='bar',
    retries=0,
    dag=child)
This way the DAG is marked as failed when any task fails. Then the DAG will be retried.

Resources