Airflow tree view inversion bug? [closed]

I am running a couple of active DAGs on Google Cloud Composer (managed Airflow), and it seems the tree view of all my DAGs renders inverted.
Note the following DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 4, 26),
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "email": "youremail@host.com",
    "retries": 1,
    "retry_delay": timedelta(minutes=5)
}

with DAG(dag_id="Dummy_test", schedule_interval="@daily", default_args=default_args, catchup=True) as dag:
    op1 = DummyOperator(task_id='op1', dag=dag)
    op2 = DummyOperator(task_id='op2', dag=dag)
    op3 = DummyOperator(task_id='op3', dag=dag)
    op4 = DummyOperator(task_id='op4', dag=dag)
    op5 = DummyOperator(task_id='op5', dag=dag)
    op6 = DummyOperator(task_id='op6', dag=dag)
    op7 = DummyOperator(task_id='op7', dag=dag)
    op8 = DummyOperator(task_id='op8', dag=dag)
    op9 = DummyOperator(task_id='op9', dag=dag)

    op1 >> op2
    op2 >> op3
    op3 >> op4
    op4 >> op5
    op5 >> op6
    op6 >> [op7, op8, op9]
It renders correctly in the Graph view, but the Tree view appears inverted.
Is this a UI bug, or am I defining my dependencies incorrectly? This is happening on all my DAGs, not just this simple dummy example.

Thanks to itroulli and his comments, I now see this is the expected behavior: the Tree view is rooted at the DAG's final (leaf) tasks and branches upstream from there, so it appears mirrored relative to the Graph view.

Related

Airflow Debugging: How to skip backfill job execution when running DAG in vscode

I have set up Airflow and am running a DAG using the following VS Code debug configuration:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "env": {
                "AIRFLOW__CORE__EXECUTOR": "DebugExecutor",
                "AIRFLOW__DEBUG__FAIL_FAST": "True",
                "LC_ALL": "en_US.UTF-8",
                "LANG": "en_US.UTF-8"
            }
        }
    ]
}
It runs the file, and my breakpoints in the DAG definition break as expected. Then, at the end of the file, dag.run() executes, and I wait forever for the DAG to backfill; my breakpoints inside the tasks' python_callable functions never trigger.
What Airflow secret am I not seeing?
Here is my DAG:

from pprint import pprint

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.s3_key_sensor import S3KeySensor
from airflow.utils.dates import days_ago

# scheduled to run every minute, poke for a new file every ten seconds
dag = DAG(
    dag_id='download-from-s3',
    start_date=days_ago(2),
    catchup=False,
    schedule_interval='*/1 * * * *',
    is_paused_upon_creation=False
)

def new_file_detection(**context):
    print("File found...")  # a breakpoint here never lands
    pprint(context)

init = BashOperator(
    task_id='init',
    bash_command='echo "My DAG initiated at $(date)"',
    dag=dag,
)

file_sensor = S3KeySensor(
    task_id='file_sensor',
    poke_interval=10,  # every 10 seconds
    timeout=60,
    bucket_key="s3://inbox/new/*",
    bucket_name=None,
    wildcard_match=True,
    soft_fail=True,
    dag=dag
)

file_found_message = PythonOperator(
    task_id='file_found_message',
    provide_context=True,
    python_callable=new_file_detection,
    dag=dag
)

init >> file_sensor >> file_found_message

if __name__ == '__main__':
    dag.clear(reset_dag_runs=True)
    dag.run()  # this triggers a backfill job
This is working for me as expected. I can set breakpoints at the DAG level or inside the python_callable definitions and step through them with the VS Code debugger.
I'm using the same debug settings you provided, but I changed the parameter reset_dag_runs=True to dag_run_state=State.NONE in the dag.clear() call, as specified on the DebugExecutor docs page. I believe this changed in one of the recent releases.
Regarding backfills, I'm setting catchup=False in the DAG arguments (it works both ways). An important note: I'm running Airflow version 2.0.0.
Here is an example using the same code as example_xcom.py, which ships with the default installation:
Debug settings:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "internalConsole",
            "justMyCode": false,
            "env": {
                "AIRFLOW__CORE__EXECUTOR": "DebugExecutor",
                "AIRFLOW__DEBUG__FAIL_FAST": "True"
            }
        }
    ]
}
Example DAG:
import logging

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

dag = DAG(
    'xcom_example',
    schedule_interval="@once",
    start_date=days_ago(2),
    default_args={'owner': 'airflow'},
    tags=['example'],
    catchup=False
)

value_1 = [1, 2, 3]
value_2 = {'a': 'b'}

def push(**kwargs):
    """Pushes an XCom without a specific target"""
    logging.info("log before PUSH")  # <<< before landing on breakpoint
    kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)

def push_by_returning(**kwargs):
    """Pushes an XCom without a specific target, just by returning it"""
    return value_2

def puller(**kwargs):
    """Pull all previously pushed XComs and
    check if the pushed values match the pulled values."""
    ti = kwargs['ti']
    # get value_1
    pulled_value_1 = ti.xcom_pull(key=None, task_ids='push')
    print("PRINT line after breakpoint")  # <<< after landing on breakpoint
    if pulled_value_1 != value_1:
        raise ValueError("The two values differ "
                         f"{pulled_value_1} and {value_1}")
    # get value_2
    pulled_value_2 = ti.xcom_pull(task_ids='push_by_returning')
    if pulled_value_2 != value_2:
        raise ValueError(
            f'The two values differ {pulled_value_2} and {value_2}')
    # get both value_1 and value_2
    pulled_value_1, pulled_value_2 = ti.xcom_pull(
        key=None, task_ids=['push', 'push_by_returning'])
    if pulled_value_1 != value_1:
        raise ValueError(
            f'The two values differ {pulled_value_1} and {value_1}')
    if pulled_value_2 != value_2:
        raise ValueError(
            f'The two values differ {pulled_value_2} and {value_2}')

push1 = PythonOperator(
    task_id='push',
    dag=dag,
    python_callable=push,
)

push2 = PythonOperator(
    task_id='push_by_returning',
    dag=dag,
    python_callable=push_by_returning,
)

pull = PythonOperator(
    task_id='puller',
    dag=dag,
    python_callable=puller,
)

pull << [push1, push2]

if __name__ == '__main__':
    from airflow.utils.state import State
    dag.clear(dag_run_state=State.NONE)
    dag.run()

Programmatically clear the state of airflow task instances

I want to clear the tasks in DAG B when DAG A completes execution. Both A and B are scheduled DAGs.
Is there any operator/way to clear the state of tasks and re-run DAG B programmatically?
I'm aware of the CLI option and Web UI option to clear the tasks.
I would recommend staying away from the CLI here!
DAG and task functionality is much better exposed by referencing the objects directly, as compared to going through BashOperator and/or the CLI module.
Add a Python operation to DAG A named "clear_dag_b" that imports dag_b from the dags folder (module) and does this:
from dags.dag_b import dag as dag_b

def clear_dag_b(**context):
    exec_date = context['execution_date']  # the date object from the task context
    dag_b.clear(start_date=exec_date, end_date=exec_date)

Important! If you for some reason do not match or overlap dag_b's schedule time with start_date/end_date, the clear() operation will miss the DAG executions. This example assumes DAGs A and B are scheduled identically, and that you only want to clear day X from B when A executes day X.
It might make sense to include a check for whether dag_b has already run before clearing:

dag_b_run = dag_b.get_dagrun(exec_date)  # returns None or a DagRun object
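
Putting the pieces together, here is a minimal sketch of how this could be wired into DAG A (dag_a and the task id are illustrative, and provide_context=True is needed on Airflow 1.x for the context argument):

from airflow.operators.python_operator import PythonOperator
from dags.dag_b import dag as dag_b

def clear_dag_b(**context):
    exec_date = context['execution_date']
    # Skip the clear if B has not run for this execution date yet
    if dag_b.get_dagrun(exec_date) is not None:
        dag_b.clear(start_date=exec_date, end_date=exec_date)

clear_b = PythonOperator(
    task_id='clear_dag_b',
    python_callable=clear_dag_b,
    provide_context=True,  # Airflow 1.x
    dag=dag_a,  # hypothetical handle to DAG A
)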
cli.py is an incredibly useful place to peep into the SQLAlchemy magic of Airflow.
The clear command is implemented here:
@cli_utils.action_logging
def clear(args):
    logging.basicConfig(
        level=settings.LOGGING_LEVEL,
        format=settings.SIMPLE_LOG_FORMAT)
    dags = get_dags(args)
    if args.task_regex:
        for idx, dag in enumerate(dags):
            dags[idx] = dag.sub_dag(
                task_regex=args.task_regex,
                include_downstream=args.downstream,
                include_upstream=args.upstream)
    DAG.clear_dags(
        dags,
        start_date=args.start_date,
        end_date=args.end_date,
        only_failed=args.only_failed,
        only_running=args.only_running,
        confirm_prompt=not args.no_confirm,
        include_subdags=not args.exclude_subdags,
        include_parentdag=not args.exclude_parentdag,
    )
Looking at the source, you can either replicate it (assuming you also want to modify the functionality a bit), or just do from airflow.bin import cli and invoke the required functions directly.
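
As a hedged alternative to importing the DAG module directly, you can also load it through DagBag, which sticks to the object API (the dag_id 'dag_b' here is illustrative):

from airflow.models import DagBag

def clear_dag_b(**context):
    exec_date = context['execution_date']
    dag_b = DagBag().get_dag('dag_b')  # parses the configured dags folder
    dag_b.clear(start_date=exec_date, end_date=exec_date)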
Since my objective was to re-run DAG B whenever DAG A completes execution, I ended up clearing DAG B using a BashOperator:

# Clear the tasks in another dag
last_task = BashOperator(
    task_id='last_task',
    bash_command='airflow clear example_target_dag -c ',
    dag=dag)

first_task >> last_task
It is possible, but I would be careful about getting into an endless loop of retries if the task never succeeds. You can call a bash command within the on_retry_callback, where you can specify which tasks/DAG runs you want to clear.
This works in Airflow 2.0, as the clear commands have changed:
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#clear
In this example, I am clearing t2 and its downstream tasks when t3 eventually fails:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t2 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:
    t0 = DummyOperator(
        task_id='t0'
    )
    t1 = DummyOperator(
        task_id='t1'
    )
    t2 = DummyOperator(
        task_id='t2'
    )
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        # retries=1,
        on_failure_callback=clear_upstream_task
    )
    t0 >> t1 >> t2 >> t3

Is there a way to configure different 'retries' for tasks in the same DAG

I have a DAG with many sub-tasks in it. In the middle of the DAG there is a validation task, and based on the result/return code from it, I want to take two different paths. On success, one route (a sequence of tasks) should be followed; on failure, we would like to execute a different set of tasks. There are two problems with the current approach: first, the validation task executes many times (as per the retries configured) if the exit code is 1; second, there is no way to take different branches of execution.
To solve problem number 1, we can use the retry number that is available from the task instance, exposed via the macro {{ task_instance }}; a rough sketch of what we mean is below. I'd appreciate it if someone could point us to a cleaner approach; problem number 2, taking different paths, remains unsolved.
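
A rough sketch of reading the retry number from the task instance (the callable name is illustrative, and context['ti'] requires provide_context=True on Airflow 1.x):

def validate(**context):
    # try_number starts at 1 and increments on each retry
    attempt = context['ti'].try_number
    if attempt > 1:
        print(f"validation retry #{attempt - 1}")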
You can have retries at the task level.
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    retries=3,
    dag=dag,
)

run_this_last = DummyOperator(
    task_id='run_this_last',
    retries=1,
    dag=dag,
)
Regarding your second problem, there is the concept of branching.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id (or list of task_ids). The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Example DAG:
import random
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
}
dag = DAG(
dag_id='example_branch_operator',
default_args=args,
schedule_interval="#daily",
)
run_this_first = DummyOperator(
task_id='run_this_first',
dag=dag,
)
options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
branching = BranchPythonOperator(
task_id='branching',
python_callable=lambda: random.choice(options),
dag=dag,
)
run_this_first >> branching
join = DummyOperator(
task_id='join',
trigger_rule='one_success',
dag=dag,
)
for option in options:
t = DummyOperator(
task_id=option,
dag=dag,
)
dummy_follow = DummyOperator(
task_id='follow_' + option,
dag=dag,
)
branching >> t >> dummy_follow >> join
Regarding your first problem, you can set task/operator-specific retry options quite easily. Reference: baseoperator.py#L77.
For problem two, you can branch within a DAG easily with BranchPythonOperator (example usage: example_branch_operator.py). You will want to nest your validation task/logic within the BranchPythonOperator (you can define and execute operators within operators), as in the sketch below.
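
A hedged sketch of that nesting (the task ids and run_validation are placeholders, not from the question):

from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

def validate_and_branch(**context):
    # run_validation() is a placeholder for your actual check logic
    return 'success_path' if run_validation() else 'failure_path'

success_path = DummyOperator(task_id='success_path', dag=dag)
failure_path = DummyOperator(task_id='failure_path', dag=dag)

branching = BranchPythonOperator(
    task_id='validate_and_branch',
    python_callable=validate_and_branch,
    dag=dag,
)

# the returned task_id must be directly downstream of the branch task
branching >> [success_path, failure_path]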

DAG marked as "success" if one task fails, because of trigger rule ALL_DONE

I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)

start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)

special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1',  # force failure
    dag=dag
)

end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)

end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}

dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)

start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)

@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states
    print("Do logic here...")
    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")

finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)

successful_task = DummyOperator(task_id="successful_task", dag=dag)

start >> [failing_task, successful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add the argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result: the finally task fails whenever an upstream task failed, so the DAG run is marked as FAILED.
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:

dummy = DummyOperator(
    task_id='test',
    dag=dag
)

and bind it downstream to special_task:

special_task.set_downstream(dummy)

Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
I hope there is an out-of-the-box solution, but until then, this one does the job.
To expand on Bas Harenslak's answer, a simpler _finally function, which checks the state of all tasks (not only the upstream ones), can be:

def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception(
                "Task {} failed. Failing this DAG run".format(task_instance.task_id))
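
This variant is wired up the same way as in Bas Harenslak's answer; note that kwargs['dag_run'] is only present when the context is passed to the callable (provide_context=True on Airflow 1.x; automatic in 2.x):

finally_ = PythonOperator(
    task_id='finally',
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,  # Airflow 1.x: makes dag_run available in kwargs
    dag=dag,
)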

How to check if task 1 fails and then run task 2 in Airflow?

How can I check for a failed task, e.g. if task 1 failed then run task 2, like an if/else condition? I want to run the dependent task.
If task 1 failed, how can I catch that in a condition like: if task1 == failed then run task2, else task3? I tried SSHHook, but I am looking for a simple solution.
with DAG(
    'airflow',
    catchup=False,
    default_args={
        'owner': 'abc',
        'start_date': datetime(2018, 4, 17),
        'schedule_interval': None,
        'depends_on_past': False,
    },
) as dag:
    task_1 = PythonOperator(
        task_id='task_1',
        python_callable=do(),
    )
    task_2 = PythonOperator(
        task_id='task_2',
        python_callable=do(),
    )
    task_3 = PythonOperator(
        task_id='task_3',
        python_callable=do(),
    )
    task_3.set_upstream(task_2)
    task_2.set_upstream(task_1)
Since there were no code examples, I have to assume what your DAG might look like and what you want to do. Also, I didn't understand why you wanted to use SSHHook but, again, no code examples. So here we go:
Create the error task:
def t2_error_task(context):
    instance = context['task_instance']
    do_stuff()
Create tasks
t1_task = PythonOperator(
    task_id='my_operator_t1',
    python_callable=do_python_stuff,
    on_failure_callback=t2_error_task,
    dag=dag
)

t3_task_success = PythonOperator(
    task_id='my_operator_t3',
    python_callable=do_python_stuff_success,
    dag=dag
)
Then set t3 downstream of t1:
t1_task >> t3_task_success
One solution that would be explicit in your DAG topology is to make task_1 write an XCom to mark its success or failure, then create a BranchPythonOperator that reads that XCom and decides, based on it, whether to execute task_2 or task_3.
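A minimal sketch of that idea (task_1/task_2/task_3 and do() follow the question; the branch task and the XCom key are illustrative):

from airflow.operators.python_operator import BranchPythonOperator

def task_1_callable(**context):
    try:
        do()  # the question's actual work
        context['ti'].xcom_push(key='task_1_status', value='success')
    except Exception:
        context['ti'].xcom_push(key='task_1_status', value='failed')
        raise

def choose_next(**context):
    status = context['ti'].xcom_pull(task_ids='task_1', key='task_1_status')
    return 'task_3' if status == 'success' else 'task_2'

branch = BranchPythonOperator(
    task_id='branch_on_task_1',
    python_callable=choose_next,
    provide_context=True,
    trigger_rule='all_done',  # run even if task_1 failed
    dag=dag,
)

task_1 >> branch >> [task_2, task_3]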
