How to not display a task in a DAG based on a condition - airflow

I have a DAG with 3 tasks in it. I would like to not display the 2nd task (middle_name) in the DAG run based on a condition. For example, if middle_name_var == 'false', I don't want to display the middle_name task in the DAG. Is there a way to achieve this elegantly?
from airflow import DAG
from airflow.operators import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta
from airflow.models import Variable

middle_name_var = Variable.get('middle_name')

default_args = {
    'owner': 'test',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 18),
    'email': ['tes@abc.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'name',
    default_args=default_args,
    schedule_interval="@once")

def first_name():
    print('John')

def middle_name():
    print('Smith')

def last_name():
    print('Doe')

first_name_task = PythonOperator(
    task_id='first_name',
    provide_context=False,
    python_callable=first_name,
    dag=dag
)

middle_name_task = PythonOperator(
    task_id='middle_name',
    provide_context=False,
    python_callable=middle_name,
    dag=dag
)

last_name_task = PythonOperator(
    task_id='last_name',
    provide_context=False,
    python_callable=last_name,
    dag=dag
)

if middle_name_var == 'true':
    first_name_task >> middle_name_task >> last_name_task
else:
    first_name_task >> last_name_task
My DAG currently looks like this, with the middle_name task included. But I would like to not have the middle_name task at all, based on middle_name_var, which is set to false in this case.

With reference to that last set of chaining statements:
# by the way, I believe the comparison expression should be
# middle_name_var == True (a boolean rather than a string), but let's ignore that for now
if middle_name_var == 'true':
    first_name_task >> middle_name_task >> last_name_task
else:
    first_name_task >> last_name_task
Let me ask you: what would happen if you removed these chaining statements? Would the tasks disappear from the DAG?
Not quite.
Chaining merely establishes a dependency relationship between tasks. Even without chaining, your task would remain part of your DAG (like in the screenshot that you've posted).
Here's the secret bit: a task becomes part of your DAG as soon as you declare it:
middle_name_task = PythonOperator(
    task_id='middle_name',
    provide_context=False,
    python_callable=middle_name,
    dag=dag
)
And whether or not you set that task upstream or downstream of some other tasks, it will continue to 'appear' in your DAG. Quoting the docs in this regard
Operators do not have to be assigned to DAGs immediately (previously
dag was a required argument). However, once an operator is assigned to
a DAG, it can not be transferred or unassigned. DAG assignment can be
done explicitly when the operator is created, through deferred
assignment, or even inferred from other operators.
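To illustrate that quote, here is a minimal sketch (the DAG and task ids below are made up for illustration) of the three ways a task can end up attached to a DAG:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

demo_dag = DAG('assignment_demo', start_date=datetime(2018, 6, 18), schedule_interval=None)

# explicit assignment when the operator is created
op1 = DummyOperator(task_id='op1', dag=demo_dag)

# deferred assignment: create the operator first, assign the DAG afterwards
op2 = DummyOperator(task_id='op2')
op2.dag = demo_dag

# inferred assignment: setting a dependency on op1 pulls op3 into op1's DAG
op3 = DummyOperator(task_id='op3')
op1 >> op3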
Q: So what should you do to 'not display' the task?
A: Simply don't declare (instantiate) it.
Q: And how would you go about doing that?
A: Just move the task declaration inside your if-else clause:
if middle_name_var == 'true':
    middle_name_task = PythonOperator(
        task_id='middle_name',
        provide_context=False,
        python_callable=middle_name,
        dag=dag
    )
    first_name_task >> middle_name_task >> last_name_task
else:
    first_name_task >> last_name_task

Related

Using dag_run variables in an Airflow DAG

I am trying to use airflow variables to determine whether to execute a task or not. I have tried this and it's not working:
if '{{ params.year }}' == '{{ params.message }}':
    run_this = DummyOperator(
        task_id='dummy_dag'
    )
I was hoping to get some help making it work. Also is there a better way of doing something like this in airflow?
I think a good way to solve this is with BranchPythonOperator, branching dynamically based on the provided DAG parameters. Consider this example:
Use params to provide the parameters to the DAG (this could also be done from the UI); in this example: {"enabled": True}
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.operators.python import get_current_context, BranchPythonOperator

default_args = {"owner": "airflow"}  # minimal default_args for the example

@dag(
    default_args=default_args,
    schedule_interval=None,
    start_date=days_ago(1),
    catchup=False,
    tags=["example"],
    params={"enabled": True},
)
def branch_from_dag_params():
    def _print_enabled():
        context = get_current_context()
        enabled = context["params"].get("enabled", False)
        print(f"Task id: {context['ti'].task_id}")
        print(f"Enabled is: {enabled}")

    @task
    def task_a():
        _print_enabled()

    @task
    def task_b():
        _print_enabled()
Define a callable for the BranchPythonOperator in which you perform your conditionals and return the next task to be executed. You can access the execution context variables from **kwargs. Also keep in mind that this operator should return a single task_id or a list of task_ids to follow downstream; those tasks should always be directly downstream from it. This part continues inside the branch_from_dag_params function:
    def _get_task_run(ti, **kwargs):
        custom_param = kwargs["params"].get("enabled", False)
        if custom_param:
            return "task_a"
        else:
            return "task_b"

    branch_task = BranchPythonOperator(
        task_id="branch_task",
        python_callable=_get_task_run,
    )

    task_a_exec = task_a()
    task_b_exec = task_b()

    branch_task >> [task_a_exec, task_b_exec]

dag = branch_from_dag_params()  # instantiate the DAG so Airflow picks it up
The result is that task_a gets executed and task_b is skipped:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=branch_from_dag_params
AIRFLOW_CTX_TASK_ID=task_a
Task id: task_a
Enabled is: True
Let me know if that worked for you.

task got unexpected argument 'dag' in airflow

I have created a task with PythonOperator as the operator. It calls a function in another folder and passes it an argument. But the operator does not accept the argument dag=dag, even though it is required, since it attaches the operator to its DAG context.
dags/
- my_dag.py
- sub_folder/
  - __init__.py
  - my_functions.py
My DAG contains task1 and task2. They call functions from the sub folder and pass an argument to print.
my_dag.py
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from sub_folder.my_functions import task1, task2

args = {
    'owner': 'hello',
    'start_date': dt.datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=2)
}

dag = DAG(
    'try',
    default_args=args,
    schedule_interval=dt.timedelta(minutes=2))

task1 = PythonOperator(
    task_id='task1',
    python_callable=task1,
    provide_context=True,
    op_kwargs={'idx': "Hello "},
    dag=dag,
)

task2 = PythonOperator(
    task_id='task2',
    python_callable=task2,
    provide_context=True,
    op_kwargs={'idx': "World!"},
    dag=dag,
)

task1 >> task2
The callable functions are just simple functions that prints the argument passed in them.
my_functions.py
def task1(idx):
    print(f"Task 1! {idx}")

def task2(idx):
    print(f"Task 2! {idx}")
My task1 keeps retrying and eventually fails. I looked into the logs to find out what's going on, and found that it gets a
TypeError: task1() got an unexpected keyword argument 'dag'
I don't know what is happening here. Obviously I have to pass dag=dag, since that argument is what attaches the operator to the DAG it belongs to.
There is a conflict between my_functions.task1 and the PythonOperator task instance that is also named task1.
Try:
import sub_folder.my_functions as mf  # changed

task1 = PythonOperator(
    task_id='task1',
    python_callable=mf.task1,  # changed
    provide_context=True,
    op_kwargs={'idx': "Hello "},
    dag=dag,
)

DAG marked as "success" if one task fails, because of trigger rule ALL_DONE

I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)

start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)

special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1',  # force failure
    dag=dag
)

end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)

end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}

dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)

start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)

@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states

    print("Do logic here...")

    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")

finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)

succesful_task = DummyOperator(task_id="succesful_task", dag=dag)

start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add the argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result:
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:
dummy = DummyOperator(
    task_id='test',
    dag=dag
)
and bind it downstream of the failing task:
failing_task.set_downstream(dummy)
Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
I hope an out-of-the-box solution appears, but in the meantime this solution does the job.
To expand on Bas Harenslak's answer, a simpler _finally function, which checks the state of all tasks (not only the upstream ones), can be:
def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))
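For reference, this simpler _finally can be wired up with the same finally_ operator shown in the answer above (a sketch, assuming the imports and dag object from that answer):
finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,  # needed so kwargs contains dag_run and task_instance
    dag=dag,
)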

How to add a task at run time if task 1 fails

I want to execute task 2 if task 1 succeeds; if task 1 fails, I want to run task 3 and assign another flow if required.
Basically, I want to run conditional tasks in Airflow without SSH operators.
from airflow import DAG
from airflow.operators import PythonOperator, BranchPythonOperator
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from airflow.models import Variable

def t2_error_task(context):
    instance = context['task_instance']
    if instance.task_id == "performExtract":
        print("Please implement something over this")
        task_3 = PythonOperator(
            task_id='performJoin1',
            python_callable=performJoin1,  # maybe main?
            dag=dag
        )
        dag.add_task(task_3)

with DAG(
    'manageWorkFlow',
    catchup=False,
    default_args={
        'owner': 'Mannu',
        'start_date': datetime(2018, 4, 13),
        'schedule_interval': None,
        'depends_on_past': False,
    },
) as dag:

    task_1 = PythonOperator(
        task_id='performExtract',
        python_callable=performExtract,
        on_failure_callback=t2_error_task,
        depends_on_past=True
    )

    task_2 = PythonOperator(
        task_id='printSchemas',
        depends_on_past=True,
        python_callable=printSchemaAll,  # maybe main?
    )

    task_2.set_upstream(task_1)
Adding tasks dynamically based on execution-time statuses is not something Airflow supports. In order to get the desired behaviour, you should add task_3 to your dag but change its trigger_rule to all_failed. In this case, the task will get marked as skipped when task_1 succeeds, but it will get executed when it fails.
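A minimal sketch of that suggestion, reusing the names from the question (performExtract, printSchemaAll and performJoin1 are assumed to be defined elsewhere):
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

with DAG(
    'manageWorkFlow',
    catchup=False,
    default_args={
        'owner': 'Mannu',
        'start_date': datetime(2018, 4, 13),
        'depends_on_past': False,
    },
) as dag:

    task_1 = PythonOperator(
        task_id='performExtract',
        python_callable=performExtract,
    )

    # runs only when performExtract succeeds (default trigger rule: all_success)
    task_2 = PythonOperator(
        task_id='printSchemas',
        python_callable=printSchemaAll,
    )

    # declared up front, skipped when performExtract succeeds, executed when it fails
    task_3 = PythonOperator(
        task_id='performJoin1',
        python_callable=performJoin1,
        trigger_rule='all_failed',
    )

    task_1 >> [task_2, task_3]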

Triggering A SubDag

EDITED
I have edited this question by considering the inputs from @tobi6
I copied the subdag operator from Airflow source code
Source code: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/subdag_operator.py
I modified a few things in the execute method. The changes were made to trigger the SubDag and wait until the SubDag completes execution. The trigger is working great but the tasks are not being executed (DAG is in the running/Green state while the tasks are in the null/White state).
Please refer below for the changes I made:
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator, Pool
from airflow.utils.decorators import apply_defaults
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.executors import GetDefaultExecutor
from time import sleep
import logging
from datetime import datetime


class SubDagOperator(BaseOperator):
    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=GetDefaultExecutor(),
            *args, **kwargs):
        """
        Yo dawg. This runs a sub dag. By convention, a sub dag's dag_id
        should be prefixed by its parent and a dot. As in `parent.child`.

        :param subdag: the DAG object to run as a subdag of the current DAG.
        :type subdag: airflow.DAG
        :param dag: the parent DAG
        :type subdag: airflow.DAG
        """
        import airflow.models
        dag = kwargs.get('dag') or airflow.models._CONTEXT_MANAGER_DAG
        if not dag:
            raise AirflowException('Please pass in the `dag` param or call '
                                   'within a DAG context manager')
        session = kwargs.pop('session')
        super(SubDagOperator, self).__init__(*args, **kwargs)

        # validate subdag name
        if dag.dag_id + '.' + kwargs['task_id'] != subdag.dag_id:
            raise AirflowException(
                "The subdag's dag_id should have the form "
                "'{{parent_dag_id}}.{{this_task_id}}'. Expected "
                "'{d}.{t}'; received '{rcvd}'.".format(
                    d=dag.dag_id, t=kwargs['task_id'], rcvd=subdag.dag_id))

        # validate that subdag operator and subdag tasks don't have a
        # pool conflict
        if self.pool:
            conflicts = [t for t in subdag.tasks if t.pool == self.pool]
            if conflicts:
                # only query for pool conflicts if one may exist
                pool = (
                    session
                    .query(Pool)
                    .filter(Pool.slots == 1)
                    .filter(Pool.pool == self.pool)
                    .first()
                )
                if pool and any(t.pool == self.pool for t in subdag.tasks):
                    raise AirflowException(
                        'SubDagOperator {sd} and subdag task{plural} {t} both '
                        'use pool {p}, but the pool only has 1 slot. The '
                        'subdag tasks will never run.'.format(
                            sd=self.task_id,
                            plural=len(conflicts) > 1,
                            t=', '.join(t.task_id for t in conflicts),
                            p=self.pool
                        )
                    )

        self.subdag = subdag
        self.executor = executor

    def execute(self, context):
        dag_run = self.subdag.create_dagrun(
            conf=context['dag_run'].conf,
            state=State.RUNNING,
            execution_date=context['execution_date'],
            run_id='trig__' + str(datetime.utcnow()),
            external_trigger=True
        )

        while True:
            if dag_run.get_state() == State.FAILED or dag_run.get_state() == State.SUCCESS:
                break
            else:
                sleep(10)
                continue
Below is the code that shows how I'm using it:
from airflow import DAG
from operators.sd_operator import SubDagOperator  # My SubDag Operator
from airflow.operators.python_operator import PythonOperator
import logging
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 7, 17),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
}

def print_dag_details(**kwargs):
    logging.info(str(kwargs['dag_run'].conf))

with DAG('example_dag', schedule_interval=None, catchup=False, default_args=default_args) as dag:
    task_1 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_1'),
        task_id='sub_dag_1'
    )

    task_2 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_2'),
        task_id='sub_dag_2',
    )

    print_kwargs = PythonOperator(
        task_id='print_kwargs',
        python_callable=print_dag_details,
        provide_context=True
    )

    print_kwargs >> task_1 >> task_2
Any information you provide would be helpful. Thanks in advance.
It is a bit hard to understand your question without context.
"I copied the subdag operator and modified a few things in the execute method."
From where was this copied?
"The trigger is working great ..."
What does this look like?
There are a few things I saw in the code:
It might be helpful to add assigned fields to the function call of sub_dag_func, e.g. sub_dag_func(subdag='parent_dag'...).
In the binary shift definition used to set upstream/downstream, there are tasks (df_job_1, df_job_2) that I cannot find defined in the DAG. This might be connected to SubDAGs (I haven't looked into them yet).
The names of the sub DAGs seem inconsistent with the comment in the code saying "By convention, a sub dag's dag_id should be prefixed by its parent and a dot", but they are sub_dag_1, sub_dag_2 (see the sketch below).
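To make that convention concrete, here is a hypothetical sketch of what sub_dag_func could look like; the function is not shown in the question, so its arguments and body here are assumptions, but the child dag_id must be '<parent_dag_id>.<task_id>' for the dag_id validation in the SubDagOperator above to pass:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

def sub_dag_func(parent_dag_id, task_id):
    # hypothetical factory: the child dag_id is prefixed by the parent dag_id and a dot
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_id, task_id),
        schedule_interval=None,
        start_date=datetime(2017, 7, 17),
    )
    DummyOperator(task_id='do_something', dag=subdag)
    return subdag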
