Using dag_run variables in an Airflow DAG

I am trying to use airflow variables to determine whether to execute a task or not. I have tried this and it's not working:
if '{{ params.year }}' == '{{ params.message }}':
    run_this = DummyOperator(
        task_id='dummy_dag'
    )
I was hoping to get some help making it work. Also, is there a better way of doing something like this in Airflow?

I think a good way to solve this is with a BranchPythonOperator, which branches dynamically based on the provided DAG parameters. Consider this example:
Use params to provide the parameters to the DAG (this could also be done from the UI); in this example: {"enabled": True}
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.operators.python import get_current_context, BranchPythonOperator

default_args = {"owner": "airflow"}  # minimal definition; not shown in the original snippet

@dag(
    default_args=default_args,
    schedule_interval=None,
    start_date=days_ago(1),
    catchup=False,
    tags=["example"],
    params={"enabled": True},
)
def branch_from_dag_params():
    def _print_enabled():
        context = get_current_context()
        enabled = context["params"].get("enabled", False)
        print(f"Task id: {context['ti'].task_id}")
        print(f"Enabled is: {enabled}")

    @task
    def task_a():
        _print_enabled()

    @task
    def task_b():
        _print_enabled()
Define a callable for the BranchPythonOperator in which you perform your conditionals and return the next task to be executed. You can access the execution context variables from **kwargs. Also keep in mind that this operator should return a single task_id or a list of task_ids to follow downstream, and those tasks should always be directly downstream from it.
    def _get_task_run(ti, **kwargs):
        custom_param = kwargs["params"].get("enabled", False)
        if custom_param:
            return "task_a"
        else:
            return "task_b"

    branch_task = BranchPythonOperator(
        task_id="branch_task",
        python_callable=_get_task_run,
    )

    task_a_exec = task_a()
    task_b_exec = task_b()

    branch_task >> [task_a_exec, task_b_exec]

dag = branch_from_dag_params()  # instantiate the DAG so it gets registered
The result is that task_a gets executed and task_b is skipped:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=branch_from_dag_params
AIRFLOW_CTX_TASK_ID=task_a
Task id: task_a
Enabled is: True
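As an alternative, if you only need to skip a single task rather than choose between two branches, a ShortCircuitOperator can do the same check. This is a minimal sketch, not part of the original answer, reusing the same enabled param:
from airflow.operators.python import ShortCircuitOperator

def _enabled(**kwargs):
    # A falsy return value makes ShortCircuitOperator skip everything downstream.
    return kwargs["params"].get("enabled", False)

# These lines would live inside the @dag-decorated function above.
check_enabled = ShortCircuitOperator(
    task_id="check_enabled",
    python_callable=_enabled,
)
check_enabled >> task_a()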
Let me know if that worked for you.

Related

Unable to pull xcom value

I have a very simple DAG which includes a very simple task (PythonOperator) which gets some trivial json data from SWAPI API and returns an int. In this case the value of the int is 202.
I'm fairly sure that this int is being correctly pushed as an XCom value, because when I run the DAG, select that task, and look at the logs and the XCom view in the UI, I can see the return value there.
Furthermore, when I add the line:
ti.xcom_push(key = 'height', value = height)
into the Python function that gets the value from the API, I'm then able to see in the XCom view for that task that a key of 'height' is added and the value is indeed 202.
The problem is that for love nor money I can't pull that value out again and use it in another task. For example, the task that needs to use that value is a PythonOperator whose function looks like:
def check_height(ti):
    height = ti.xcom_pull(key='height', task_ids=['get_data_darth_vader'])
    print(f"Height is: {height}")
I've also tried it with no key and with the key set to 'return_value', but nothing works; the value is None:
[2021-09-30 21:00:35,044] {logging_mixin.py:109} INFO - Height is: [None]
[2021-09-30 21:00:35,047] {python.py:151} INFO - Done. Returned value was: None
I must be doing something wrong, but I cannot see what. I've watched a number of tutorials and read several blog posts on the subject, and can't see how what I am doing differs from the working examples.
Help!
UPDATE: Here is the whole dag
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
import json
import requests

def get_darth_vader_height(ti):
    """
    Get Darth Vader info from SWAPI
    """
    response = requests.get('https://swapi.dev/api/people/4')
    data = json.loads(response.text)
    height = data['height']
    print(f"DEBUG: {height}")
    ti.xcom_push(key="height", value=height)
    return height

def check_height(ti):
    height = ti.xcom_pull(task_ids='task_one', key="height")
    print(f"Height is: {height}")
    print(str(height))

with DAG(
    'my_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    get_darth_vader_height = PythonOperator(
        task_id='task_one',
        python_callable=get_darth_vader_height
    )
    check_darth_vader_height = PythonOperator(
        task_id='task_two',
        python_callable=check_height
    )
    is_tall = BashOperator(
        task_id='task_three',
        bash_command="echo 'is tall!'"
    )
    is_short = BashOperator(
        task_id='task_four',
        bash_command="echo 'is short!'"
    )
UPDATE WITH WORKING VERSION:
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
import json
import requests

def get_darth_vader_height(ti):
    """
    Get Darth Vader info from SWAPI
    """
    response = requests.get('https://swapi.dev/api/people/4')
    data = json.loads(response.text)
    height = data['height']
    print(f"DEBUG: {height}")
    ti.xcom_push(key="height", value=height)
    return height

def check_height(ti):
    height = ti.xcom_pull(task_ids='task_one', key="height")
    print(f"Height is: {height}")
    if int(height) > 200:
        print('height is greater than 200')
        return 'is_tall'
    print('height is less than 200')
    return 'is_short'

with DAG(
    'my_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    get_darth_vader_height = PythonOperator(
        task_id='task_one',
        python_callable=get_darth_vader_height
    )
    check_darth_vader_height = BranchPythonOperator(
        task_id='task_two',
        python_callable=check_height
    )
    # task_ids match the values returned by check_height so the branch can resolve them
    is_tall = BashOperator(
        task_id='is_tall',
        bash_command="echo 'is tall!'"
    )
    is_short = BashOperator(
        task_id='is_short',
        bash_command="echo 'is short!'"
    )

    get_darth_vader_height >> check_darth_vader_height >> [is_tall, is_short]
Adding the chain between the tasks fixed this issue.
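A side note not in the original post: when xcom_pull is given a list of task_ids it returns a list of values, which is why the earlier log line shows [None]; a single task_id string returns the bare value. A minimal sketch:
def check_height(ti):
    # a list of task_ids returns a list of values (hence "[None]" in the log above)
    heights = ti.xcom_pull(task_ids=['task_one'], key='height')
    # a single task_id string returns just the value
    height = ti.xcom_pull(task_ids='task_one', key='height')
    print(f"As list: {heights}, as scalar: {height}")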

Airflow | How DAG got started

Does anyone know how to find out how a DAG run got started (whether it was scheduled or triggered manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times when I run it manually to test something. I want to capture how the DAG got started and pass that value to a column in a table where I'm saving some data. This will allow me to filter based on scheduled or manual starts and filter out test information.
Thanks!
From an execution context, such as a python_callable provided to a PythonOperator, you can access the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type would be: "manual", "scheduled" or "backfill". (not sure if there are others)
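If you would rather compare against constants than raw strings, Airflow 2 also ships an enum with these values. A small sketch, assuming the airflow.utils.types location as of Airflow 2.x:
from airflow.utils.types import DagRunType

def _is_scheduled(dag_run):
    # DagRunType is a str-based enum, so it compares equal to the raw strings above
    return dag_run.run_type == DagRunType.SCHEDULED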
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
You could also use Jinja to access the default variables in templated fields; there is a variable representing the dag_run object:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}

def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")

dag = DAG(
    dag_id="example_dagRun_info",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@once",
    tags=["example_dags", "params"],
    catchup=False,
)

with dag:
    python_task = PythonOperator(
        task_id="python_task",
        python_callable=_print_dag_run,
    )
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo dag_run type is: {{ dag_run.run_type }}",
    )
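To get this into a table column, as the question describes, one option is to include dag_run.run_type in the row you write. This is a hypothetical sketch; the row layout and the insert step are made up for illustration, only run_type and the context access come from the answer above:
def _save_with_run_type(**kwargs):
    dag_run = kwargs["dag_run"]
    row = {
        "loaded_at": kwargs["ts"],     # templated timestamp from the context
        "run_type": dag_run.run_type,  # "manual", "scheduled" or "backfill"
    }
    # replace the print with the actual insert into your table
    print(f"Would insert: {row}")

# inside the `with dag:` block above
save_task = PythonOperator(
    task_id="save_with_run_type",
    python_callable=_save_with_run_type,
)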

DAG marked as "success" if one task fails, because of trigger rule ALL_DONE

I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)

start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)

special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1',  # force failure
    dag=dag
)

end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)

end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}

dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)

start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)

@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states

    print("Do logic here...")

    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")

finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)

succesful_task = DummyOperator(task_id="succesful_task", dag=dag)

start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add the argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result is that if any upstream task failed, the finally task fails as well and the DAG run is marked as failed.
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:
dummy = DummyOperator(
    task_id='test',
    dag=dag
)
and bind it downstream to special_task:
special_task.set_downstream(dummy)
Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
Hope there is an out-of-the-box solution, but waiting for that, this solution does the job.
To expand on Bas Harenslak's answer, a simpler _finally function, which checks the state of all tasks (not only the upstream ones), can be:
def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))
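To use this variant, the wiring is the same as in the answer above; a short sketch restating it, with the import the State check relies on:
from airflow.utils.state import State

finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,  # Airflow 1.x: makes dag_run / task_instance available in kwargs
    dag=dag,
)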

airflow trigger_rule using ONE_FAILED causes DAG failure

What I wanted to achieve is to create a task that will send a notification if any one of the tasks in the DAG fails. I am applying a trigger rule to the task:
batch11 = BashOperator(
    task_id='Error_Buzz',
    trigger_rule=TriggerRule.ONE_FAILED,
    bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py',
    dag=dag,
    catchup=False,
)
batch >> batch11
batch1 >> batch11
The problem is that when no other task fails, the batch11 task will not execute due to the trigger rule, which is what I wanted, but it results in a DAG failure since the default trigger rule for the DAG is ALL_SUCCESS. Is there a way to close this loophole so that the DAG runs successfully?
We do something similar in our Airflow deployment. The idea is to notify Slack when a task in a DAG fails. You can set a DAG-level configuration on_failure_callback, as documented at https://airflow.apache.org/code.html#airflow.models.BaseOperator
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it. If any of the tasks fails or succeeds, Airflow calls the notify function and I can get a notification wherever I want.
import sys
import os
from datetime import datetime, timedelta

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago

from util.airflow_utils import AirflowUtils

schedule = timedelta(minutes=5)

args = {
    'owner': 'user',
    'start_date': days_ago(1),
    'depends_on_past': False,
    'on_failure_callback': AirflowUtils.notify_job_failure,
    'on_success_callback': AirflowUtils.notify_job_success
}

dag = DAG(
    dag_id='demo_dag',
    schedule_interval=schedule, default_args=args)

def task1():
    return 'Whatever you return gets printed in the logs!'

def task2():
    return 'cont'

task1 = PythonOperator(task_id='task1',
                       python_callable=task1,
                       dag=dag)

task2 = PythonOperator(task_id='task2',
                       python_callable=task2,
                       dag=dag)

task1 >> task2
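AirflowUtils here is the answerer's own helper, so here is a hypothetical sketch of what such callbacks might look like; the only thing taken from the docs quote above is that the callback receives the context dictionary as its single argument:
def notify_job_failure(context):
    # context holds references to the failed task instance and the run
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}.")
    # post to Slack / send an email / etc. here

def notify_job_success(context):
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} succeeded on {context['ds']}.")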

Triggering A SubDag

EDITED
I have edited this question by considering the inputs from @tobi6.
I copied the SubDagOperator from the Airflow source code.
Source code: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/subdag_operator.py
I modified a few things in the execute method. The changes were made to trigger the SubDag and wait until the SubDag completes execution. The trigger is working great, but the tasks are not being executed (the DAG is in the running/green state while the tasks are in the null/white state).
Please refer below for the changes I made:
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator, Pool
from airflow.utils.decorators import apply_defaults
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.executors import GetDefaultExecutor
from time import sleep
import logging
from datetime import datetime


class SubDagOperator(BaseOperator):
    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=GetDefaultExecutor(),
            *args, **kwargs):
        """
        Yo dawg. This runs a sub dag. By convention, a sub dag's dag_id
        should be prefixed by its parent and a dot. As in `parent.child`.

        :param subdag: the DAG object to run as a subdag of the current DAG.
        :type subdag: airflow.DAG
        :param dag: the parent DAG
        :type subdag: airflow.DAG
        """
        import airflow.models
        dag = kwargs.get('dag') or airflow.models._CONTEXT_MANAGER_DAG
        if not dag:
            raise AirflowException('Please pass in the `dag` param or call '
                                   'within a DAG context manager')
        session = kwargs.pop('session')
        super(SubDagOperator, self).__init__(*args, **kwargs)

        # validate subdag name
        if dag.dag_id + '.' + kwargs['task_id'] != subdag.dag_id:
            raise AirflowException(
                "The subdag's dag_id should have the form "
                "'{{parent_dag_id}}.{{this_task_id}}'. Expected "
                "'{d}.{t}'; received '{rcvd}'.".format(
                    d=dag.dag_id, t=kwargs['task_id'], rcvd=subdag.dag_id))

        # validate that subdag operator and subdag tasks don't have a
        # pool conflict
        if self.pool:
            conflicts = [t for t in subdag.tasks if t.pool == self.pool]
            if conflicts:
                # only query for pool conflicts if one may exist
                pool = (
                    session
                    .query(Pool)
                    .filter(Pool.slots == 1)
                    .filter(Pool.pool == self.pool)
                    .first()
                )
                if pool and any(t.pool == self.pool for t in subdag.tasks):
                    raise AirflowException(
                        'SubDagOperator {sd} and subdag task{plural} {t} both '
                        'use pool {p}, but the pool only has 1 slot. The '
                        'subdag tasks will never run.'.format(
                            sd=self.task_id,
                            plural=len(conflicts) > 1,
                            t=', '.join(t.task_id for t in conflicts),
                            p=self.pool
                        )
                    )

        self.subdag = subdag
        self.executor = executor

    def execute(self, context):
        dag_run = self.subdag.create_dagrun(
            conf=context['dag_run'].conf,
            state=State.RUNNING,
            execution_date=context['execution_date'],
            run_id='trig__' + str(datetime.utcnow()),
            external_trigger=True
        )
        while True:
            if dag_run.get_state() == State.FAILED or dag_run.get_state() == State.SUCCESS:
                break
            else:
                sleep(10)
                continue
Below is the code that shows how I'm using it:
from airflow import DAG
from operators.sd_operator import SubDagOperator  # My SubDag Operator
from airflow.operators.python_operator import PythonOperator
import logging
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 7, 17),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
}


def print_dag_details(**kwargs):
    logging.info(str(kwargs['dag_run'].conf))


with DAG('example_dag', schedule_interval=None, catchup=False, default_args=default_args) as dag:
    task_1 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_1'),
        task_id='sub_dag_1'
    )

    task_2 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_2'),
        task_id='sub_dag_2',
    )

    print_kwargs = PythonOperator(
        task_id='print_kwargs',
        python_callable=print_dag_details,
        provide_context=True
    )

    print_kwargs >> task_1 >> task_2
Any information you provide would be helpful. Thanks in advance.
It is a bit hard to understand your question without context.
"I copied the subdag operator and modified a few things in the execute method."
From where was this copied?
"The trigger is working great ..."
What does this look like?
There are a few things I saw in the code:
It might be helpful to use keyword arguments in the call to sub_dag_func, e.g. sub_dag_func(subdag='parent_dag', ...).
In the bit-shift expressions used to set upstream/downstream dependencies, there are tasks I cannot find in the DAG (df_job_1, df_job_2). This might be connected to SubDAGs (I haven't looked into them yet).
The names of the sub DAGs seem inconsistent with the comment in the code saying "By convention, a sub dag's dag_id should be prefixed by its parent and a dot", but they are sub_dag_1 and sub_dag_2.
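On that last point, here is a hypothetical sketch of what sub_dag_func could look like so that the dag_id follows the parent.child convention; its body is not shown in the question, so this is purely illustrative:
from datetime import datetime
from airflow import DAG

def sub_dag_func(parent_dag_id, task_id):
    # The subdag's dag_id must be "<parent_dag_id>.<task_id>",
    # e.g. "example_dag.sub_dag_1", to pass the check in SubDagOperator.__init__.
    return DAG(
        dag_id='{}.{}'.format(parent_dag_id, task_id),
        schedule_interval=None,
        start_date=datetime(2017, 7, 17),
    )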
