Get dag_run context in Airflow TaskFlow task

My dag is started with configuration JSON:
{"foo" : "bar"}
I have a Python operator which uses this value:
my_task = PythonOperator(
    task_id="my_task",
    op_kwargs={"foo": "{{ dag_run.conf['foo'] }}"},
    python_callable=lambda foo: print(foo),
)
I’d like to replace it with a TaskFlow task…
@task
def my_task():
    # how to get foo??
How can I get a reference to context, dag_run, or otherwise get to the configuration JSON from here?

There are several ways to do this using the TaskFlow API:
import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(start_date=datetime.datetime(2023, 1, 1), schedule=None)
def so_75303816():
    @task
    def example_1(**context):
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_2(dag_run=None):
        foo = dag_run.conf["foo"]
        print(foo)

    @task
    def example_3():
        context = get_current_context()
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_4(params=None):
        foo = params["foo"]
        print(foo)

    example_1()
    example_2()
    example_3()
    example_4()


so_75303816()
Depending on your needs/preference, you can use one of the following examples:
example_1: You get all task instance context variables and have to extract "foo".
example_2: You explicitly state via arguments that you want only dag_run from the task instance context variables. Note that such arguments must be given a default value of None.
example_3: You can also fetch the task instance context variables from inside a task using airflow.operators.python.get_current_context().
example_4: DAG run context is also available via a variable named "params".
For more information, see https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html#accessing-context-variables-in-decorated-tasks and https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
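If you want to exercise these tasks locally with a configuration JSON, one option is a quick local-run sketch, assuming Airflow 2.5+ where DAG.test() accepts a run_conf argument (from the CLI, the equivalent is airflow dags trigger so_75303816 --conf '{"foo": "bar"}'):

# Local run sketch: dag.test() executes the DAG in-process and passes
# run_conf through as dag_run.conf (assumes Airflow 2.5+).
if __name__ == "__main__":
    test_dag = so_75303816()  # @dag-decorated functions return a DAG object
    test_dag.test(run_conf={"foo": "bar"})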

Related

Airflow - ValueError: cannot map over XCom with custom key while trying to use expand function in Dynamic Task Mapping

I retrieve a dictionary from a task and one of the dictionary values is a list.
I am trying to perform Dynamic Task Mapping with expand() on that list.
However, I get this error:
Broken DAG: [/opt/airflow/dags/AA_taskflowApi.py] Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/mappedoperator.py", line 123, in ensure_xcomarg_return_value
    ensure_xcomarg_return_value(v)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/mappedoperator.py", line 118, in ensure_xcomarg_return_value
    raise ValueError(f"cannot map over XCom with custom key {key!r} from {operator}")
ValueError: cannot map over XCom with custom key 'nicks' from <Task(_PythonDecoratedOperator): get_name>
Here is my simple DAG:
@dag(dag_id='AA_TaskflowApi',
     default_args=default_args,
     start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
     schedule=None)
def hello_world_etl():
    @task(multiple_outputs=True)
    def get_name():
        nicknames = ["Nickname1", "Nickname2"]
        return {
            'name': 'Jerry',
            'nicks': nicknames
        }

    @task
    def print_nicks(nick):
        print(f"Processing nickname {nick}")

    @task
    def print_all_nicks(nicks):
        print(f"Processing nicknames {nicks}")

    name_dict = get_name()
    nicks = name_dict['nicks']

    # This does NOT work - causes a DAG import error
    print_nicks.expand(nick=nicks)
    # BUT this works
    # print_all_nicks(nicks)


greet_dag = hello_world_etl()
How can I use the list nicks to do dynamic task mapping successfully?
The variable name_dict is not available at DAG-parsing time; it only exists at runtime, after the task has executed. In addition, multiple_outputs=True stores each dictionary key as a separate XCom with a custom key, and dynamic task mapping cannot expand over custom-keyed XComs, which is exactly what the error says. If the output of get_name has a fixed format (coming from an API, for example), you should transform the output and prepare the list in a new task:
import pendulum

from airflow.decorators import dag, task

default_args = {}


@dag(dag_id='AA_TaskflowApi',
     default_args=default_args,
     start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
     schedule=None)
def hello_world_etl():
    @task(multiple_outputs=True)
    def get_name():
        nicknames = ["Nickname1", "Nickname2"]
        return {
            'name': 'Jerry',
            'nicks': nicknames
        }

    @task
    def print_nicks(nick):
        print(f"Processing nickname {nick}")

    @task
    def get_nicks(name_dict):
        return name_dict['nicks']

    name_dict = get_name()
    nicks = get_nicks(name_dict)
    print_nicks.expand(nick=nicks)


greet_dag = hello_world_etl()
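If the dict wrapper isn't actually needed, a simpler variant (my suggestion, not from the original answer) is to skip multiple_outputs and return the list directly, since dynamic task mapping can expand over the default return_value XCom. Inside hello_world_etl() this would look like (get_nick_list is a hypothetical helper name):

    @task
    def get_nick_list():
        # A plain return value is stored under the default XCom key,
        # which expand() can map over.
        return ["Nickname1", "Nickname2"]

    print_nicks.expand(nick=get_nick_list())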

Airflow - Sequential runs for Dynamic task mapping

I have a use case where I want to run dynamic tasks.
The expectation is
Task1 (output = list of dicts) -> Task2(a) -> Task3(a)
                               |
                               +-> Task2(b) -> Task3(b)
Task 2 and Task 3 need to be run for every object in the list, and each chain needs to be sequential.
You can connect multiple dynamically mapped tasks. For example:
import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="so_74848271", schedule_interval=None, start_date=datetime.datetime(2022, 1, 1)):
    @task
    def start():
        return [{"donald": "duck"}, {"bugs": "bunny"}, {"mickey": "mouse"}]

    @task
    def create_name(cartoon):
        first_name = list(cartoon.keys())[0]
        last_name = list(cartoon.values())[0]
        return f"{first_name} {last_name}"

    @task
    def print_name(full_name):
        print(f"Hello {full_name}")

    print_name.expand(full_name=create_name.expand(cartoon=start()))
The task create_name will generate one task for each dict in the list returned by start. And the print_name task will generate one task for each result of create_name.
The graph view of this DAG looks as follows:
[graph view: start → create_name (3 mapped instances) → print_name (3 mapped instances)]
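One more note on the "sequential" requirement: mapped instances of a task run in parallel by default. If each mapped instance must run one at a time, a possible tweak (my addition, not part of the original answer) is to cap per-task concurrency with the standard max_active_tis_per_dag argument:

    # Variant of create_name that allows only one active mapped instance
    # at a time (max_active_tis_per_dag is a standard BaseOperator argument):
    @task(max_active_tis_per_dag=1)
    def create_name(cartoon):
        first_name = list(cartoon.keys())[0]
        last_name = list(cartoon.values())[0]
        return f"{first_name} {last_name}"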

Access dag_run.conf in a custom pythonoperator on Airflow

I extended the existing PythonOperator on Airflow as follows:
class myPythonOperator(PythonOperator):
    def __init__(self, **kwargs) -> None:
        self.name = kwargs.get("name", "name is not provided")

    def execute(self, context, **kwargs):
        print(self.name)
        super(myPythonOperator, self).execute(context)
And my task was defined as:
def task1(**kwargs):
    name = kwargs.get("name", "name is not provided")
    print(name)
And with the following DAG:
myTask = myPythonOperator(
    task_id='myTask',
    python_callable=task1,
    op_kwargs={"name": "{{ dag_run.conf['name'] }}"},
    provide_context=True
)
When triggering the DAG, I provided a configuration JSON from the Airflow web UI: {"name": "foo"}
But the problem is that the name specified in the JSON can only be accessed from task1; in execute() it will always print name is not provided.
Does anyone know the trick to access this dag_run.conf from the __init__() function of the operator?
Any help will be appreciated. Thanks
The way to access dag_run.conf from an inherited class is by using template_fields in Airflow (see the Airflow documentation on templating).
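A minimal sketch of that approach, keeping the names from the question (the key pieces are the super().__init__() call and the template_fields entry; note that templated fields are rendered shortly before execution, so the resolved value is available in execute() but never in __init__()):

from airflow.operators.python import PythonOperator


class myPythonOperator(PythonOperator):
    # Attributes listed in template_fields are rendered with Jinja before
    # execute() runs, so self.name holds the resolved conf value there.
    template_fields = (*PythonOperator.template_fields, "name")

    def __init__(self, name="name is not provided", **kwargs) -> None:
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        print(self.name)  # e.g. "foo" when triggered with {"name": "foo"}
        return super().execute(context)


myTask = myPythonOperator(
    task_id='myTask',
    python_callable=task1,
    op_kwargs={"name": "{{ dag_run.conf['name'] }}"},
    name="{{ dag_run.conf['name'] }}",
)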

Using dag_run variables in airflow Dag

I am trying to use airflow variables to determine whether to execute a task or not. I have tried this and it's not working:
if '{{ params.year }}' == '{{ params.message }}':
    run_this = DummyOperator(
        task_id='dummy_dag'
    )
I was hoping to get some help making it work. Also is there a better way of doing something like this in airflow?
I think a good way to solve this is with the BranchPythonOperator, branching dynamically based on the provided DAG parameters. Consider this example:
Use params to provide the parameters to the DAG (this could also be done from the UI); in this example: {"enabled": True}
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.operators.python import get_current_context, BranchPythonOperator

default_args = {}


@dag(
    default_args=default_args,
    schedule_interval=None,
    start_date=days_ago(1),
    catchup=False,
    tags=["example"],
    params={"enabled": True},
)
def branch_from_dag_params():
    def _print_enabled():
        context = get_current_context()
        enabled = context["params"].get("enabled", False)
        print(f"Task id: {context['ti'].task_id}")
        print(f"Enabled is: {enabled}")

    @task
    def task_a():
        _print_enabled()

    @task
    def task_b():
        _print_enabled()
Define a callable for the BranchPythonOperator in which you perform your conditional logic and return the next task to be executed. You can access the execution context variables from **kwargs. Keep in mind that this operator must return a single task_id or a list of task_ids to follow downstream, and those tasks must be directly downstream of it.
    def _get_task_run(ti, **kwargs):
        custom_param = kwargs["params"].get("enabled", False)
        if custom_param:
            return "task_a"
        else:
            return "task_b"

    branch_task = BranchPythonOperator(
        task_id="branch_task",
        python_callable=_get_task_run,
    )

    task_a_exec = task_a()
    task_b_exec = task_b()

    branch_task >> [task_a_exec, task_b_exec]


branch_from_dag_params()
The result is that task_a gets executed and task_b is skipped:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=branch_from_dag_params
AIRFLOW_CTX_TASK_ID=task_a
Task id: task_a
Enabled is: True
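To exercise the other branch, the params default can be overridden at trigger time with a run configuration. A local sketch of this (assuming Airflow 2.5+ for DAG.test(), and that dag_run conf is allowed to override params, i.e. core.dag_run_conf_overrides_params is enabled):

# Quick local check of the skipped branch: enabled=False should make
# _get_task_run return "task_b", so task_a is skipped instead.
if __name__ == "__main__":
    branch_from_dag_params().test(run_conf={"enabled": False})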
Let me know if that worked for you.

Assign airflow task to several DAGs

I am trying to reuse an existing airflow task by assigning it to different dags.
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag"""
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task

print_datetime_task = python_operator.PythonOperator(
    task_id='print_datetime', python_callable=_print_datetime)

# define a new dag ...

# add to the new dag
create_new_task_for_dag(print_datetime_task, new_dag)
Then it gives the error Task is missing the start_date parameter.
If I define the dag when creating the operator, print_datetime_task = PythonOperator(task_id='print_datetime', python_callable=_print_datetime, dag=new_dag), then it is OK.
I have searched around, and this seems to be the root cause: https://github.com/apache/airflow/pull/5598, but the PR has been marked as stale.
I wonder if there is any other approach to reuse an existing airflow task and assign it to a different dag.
I am using apache-airflow[docker,kubernetes]==1.10.10
While I don't know of a solution to your problem with the current design (code layout), it can be made to work by tweaking the design slightly (note that the following code snippets have NOT been tested).
Instead of copying a task from a DAG,
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag"""
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task
you can move the instantiation of the task (as well as its assignment to the DAG) into a separate utility function.
from datetime import datetime
from typing import Dict, Any

from airflow.models.dag import DAG
from airflow.operators.python_operator import PythonOperator


def add_new_print_datetime_task(my_dag: DAG,
                                kwargs: Dict[str, Any]) -> PythonOperator:
    """
    Creates and adds a new 'print_datetime' (PythonOperator) task in 'my_dag'
    and returns its reference

    :param my_dag: reference to DAG object in which to add the task
    :type my_dag: DAG
    :param kwargs: dictionary of args for PythonOperator / BaseOperator;
                   'task_id' is mandatory
    :type kwargs: Dict[str, Any]
    :return: PythonOperator
    """

    def my_callable() -> None:
        print(datetime.now())

    return PythonOperator(dag=my_dag, python_callable=my_callable, **kwargs)
Thereafter you can call that function every time you want to instantiate that same task (and assign it to any DAG):
with DAG(dag_id="my_dag_id",
         start_date=datetime(year=2020, month=8, day=22, hour=16, minute=30)) as my_dag:
    print_datetime_task_kwargs: Dict[str, Any] = {
        "task_id": "my_task_id",
        "depends_on_past": True
    }
    print_datetime_task: PythonOperator = add_new_print_datetime_task(
        my_dag=my_dag, kwargs=print_datetime_task_kwargs)
    # ... other tasks and their wiring
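Since the point is reuse across DAGs, the same helper can then be called from a second DAG. A sketch, reusing the imports above (the dag_id here is made up for illustration):

# Reusing the same factory in another DAG; each call builds a fresh task
# bound to its own DAG, so no copying is needed.
with DAG(dag_id="my_other_dag_id",
         start_date=datetime(year=2020, month=8, day=22)) as my_other_dag:
    add_new_print_datetime_task(my_dag=my_other_dag,
                                kwargs={"task_id": "my_task_id"})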
References / good reads
Astronomer.io: Dynamically Generating DAGs in Airflow
Apache Airflow | With Statement and DAG