Airflow 2.0: Encapsulating DAG in class using Taskflow API

I have pipelines where the mechanics are always the same, a sequence of two tasks.
So I try to abstract the construction of it through a parent abstract class (using TaskFlow API):
from abc import ABC, abstractmethod
from airflow.decorators import dag, task
from datetime import datetime


class AbstractDag(ABC):

    @abstractmethod
    def task_1(self):
        """task 1"""

    @abstractmethod
    def task_2(self, data):
        """task 2"""

    def dag_wrapper(self):
        @dag(schedule_interval=None, start_date=datetime(2022, 1, 1))
        def dag():
            @task(task_id='task_1')
            def task_1():
                return self.task_1()

            @task(task_id='task_2')
            def task_2(data):
                return self.task_2(data)

            task_2(task_1())

        return dag
But when I try to inherit this class, I can't see my dag in the interface:
class MyCustomDag(AbstractDag):

    def task_1(self):
        return 2

    @abstractmethod
    def task_2(self, data):
        print(data)


custom_dag = MyCustomDag()
dag_object = custom_dag.dag_wrapper()
Do you have any idea how to do this, or better ideas for abstracting it?
Thanks a lot!
Nicolas

I was able to get your example DAG to render in the UI with just a couple small tweaks:
The MyCustomDag.task_2 method doesn't need to be decorated as an abstractmethod.
Using dag() as the wrapped DAG object function name has its issues since it's also a decorator name.
In the AbstractDag.dag_wrapper method you do need to call the @dag-decorated function.
Here is the code I used:
from abc import ABC, abstractmethod
from airflow.decorators import dag, task
from datetime import datetime


class AbstractDag(ABC):

    @abstractmethod
    def task_1(self):
        """task 1"""

    @abstractmethod
    def task_2(self, data):
        """task 2"""

    def dag_wrapper(self):
        @dag(schedule_interval=None, start_date=datetime(2022, 1, 1))
        def _dag():
            @task(task_id='task_1')
            def task_1():
                return self.task_1()

            @task(task_id='task_2')
            def task_2(data):
                return self.task_2(data)

            task_2(task_1())

        return _dag()


class MyCustomDag(AbstractDag):

    def task_1(self):
        return 2

    def task_2(self, data):
        print(data)


custom_dag = MyCustomDag()
dag_object = custom_dag.dag_wrapper()

It's worth noting the following from the Airflow docs:
When searching for DAGs inside the DAG_FOLDER, Airflow only considers Python files that contain the strings airflow and dag (case-insensitively) as an optimization. To consider all Python files instead, disable the DAG_DISCOVERY_SAFE_MODE configuration flag.
If you're inheriting from AbstractDag in a different file, make sure airflow and dag are in that file. You can simply add a comment with those words.
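For example, a DAG file that only imports the subclass could look like the sketch below (the abstract_dag module path and file name are placeholders, not part of the original answer):
# dags/my_custom_dag.py
# airflow dag  <- this comment keeps the file visible to safe-mode DAG discovery

from abstract_dag import AbstractDag  # hypothetical module holding the base class


class MyCustomDag(AbstractDag):

    def task_1(self):
        return 2

    def task_2(self, data):
        print(data)


# the DAG object must end up at module level so Airflow can pick it up
dag_object = MyCustomDag().dag_wrapper()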

Related

Get dag_run context in Airflow TaskFlow task

My dag is started with configuration JSON:
{"foo" : "bar"}
I have a Python operator which uses this value:
my_task = PythonOperator(
    task_id="my_task",
    op_kwargs={"foo": "{{ dag_run.conf['foo'] }}"},
    python_callable=lambda foo: print(foo))
I’d like to replace it with a TaskFlow task…
@task
def my_task():
    # how to get foo??
How can I get a reference to context, dag_run, or otherwise get to the configuration JSON from here?
There are several ways to do this using the TaskFlow API:
import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(start_date=datetime.datetime(2023, 1, 1), schedule=None)
def so_75303816():
    @task
    def example_1(**context):
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_2(dag_run=None):
        foo = dag_run.conf["foo"]
        print(foo)

    @task
    def example_3():
        context = get_current_context()
        foo = context["dag_run"].conf["foo"]
        print(foo)

    @task
    def example_4(params=None):
        foo = params["foo"]
        print(foo)

    example_1()
    example_2()
    example_3()
    example_4()


so_75303816()
Depending on your needs/preference, you can use one of the following examples:
example_1: You get all task instance context variables and have to extract "foo".
example_2: You explicitly state via the arguments that you only want dag_run from the task instance context variables. Note that these arguments must default to None.
example_3: You can also fetch the task instance context variables from inside a task using airflow.operators.python.get_current_context().
example_4: DAG run context is also available via a variable named "params".
For more information, see https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html#accessing-context-variables-in-decorated-tasks and https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
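For reference, the DAG above can be triggered with a conf payload the same way as shown in the question, for example:
airflow dags trigger --conf '{"foo": "bar"}' so_75303816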

Airflow - Sequential runs for Dynamic task mapping

I have a use case where I want to run dynamic tasks.
The expectation is
Task1 (output = list of dicts) -> Task2(a) -> Task3(a)
                               |
                               -> Task2(b) -> Task3(b)
Task 2 and Task 3 need to be run for every object in the list, and each chain needs to be sequential.
You can connect multiple dynamically mapped tasks. For example:
import datetime

from airflow import DAG
from airflow.decorators import task


with DAG(dag_id="so_74848271", schedule_interval=None, start_date=datetime.datetime(2022, 1, 1)):

    @task
    def start():
        return [{"donald": "duck"}, {"bugs": "bunny"}, {"mickey": "mouse"}]

    @task
    def create_name(cartoon):
        first_name = list(cartoon.keys())[0]
        last_name = list(cartoon.values())[0]
        return f"{first_name} {last_name}"

    @task
    def print_name(full_name):
        print(f"Hello {full_name}")

    print_name.expand(full_name=create_name.expand(cartoon=start()))
The task create_name will generate one task for each dict in the list returned by start. And the print_name task will generate one task for each result of create_name.
In the graph view, this DAG renders as start followed by the mapped create_name and print_name tasks.

Naming Airflow dags other than the python callable when using taskflow api

I'm trying to create multiple dags using the TaskFlow API that have a variable passed into them which can be used by the tasks within the dag.
For example, I am trying to have this code:
from airflow.decorators import dag, task
from datetime import datetime


@dag(schedule_interval=None, start_date=datetime(2021, 1, 1))
def dag_template(input_var):
    @task
    def printer_task(x):
        print(x)

    output_input_var = printer_task(input_var)


dag_1 = dag_template("string1")
dag_2 = dag_template(6)
Ideally this would create two dags with the IDs dag_1 and dag_2: one dag would print the string "string1", the other 6. This almost works, but the code creates a single dag with the ID dag_template that prints 6.
The documentation says the dag will be named after the Python callable; is it possible to override this?
I don't feel it's a very elegant solution, but it does do what I'm after.
from airflow.decorators import dag, task
from datetime import datetime

config = [("dag_1", "string1"), ("dag_2", 6)]

for dag_name, dag_input in config:

    @dag(dag_id=dag_name, schedule_interval=None, start_date=datetime(2021, 1, 1))
    def dag_template(input_var):
        @task
        def printer_task(x):
            print(x)

        output_input_var = printer_task(input_var)

    globals()[dag_name] = dag_template(dag_input)
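A variant of the same idea, sketched here without testing, defines the template once and applies the dag decorator manually inside the loop; dag_id is the argument that overrides the name taken from the Python callable:
from airflow.decorators import dag, task
from datetime import datetime


def _template(input_var):
    @task
    def printer_task(x):
        print(x)

    printer_task(input_var)


config = [("dag_1", "string1"), ("dag_2", 6)]

for dag_name, dag_input in config:
    # dag(...) returns a decorator, so it can also be applied as a plain function call;
    # dag_id overrides the default name taken from the callable
    dag_factory = dag(dag_id=dag_name, schedule_interval=None,
                      start_date=datetime(2021, 1, 1))(_template)
    globals()[dag_name] = dag_factory(dag_input)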

Airflow dag and task decorator in 2.0: how to pass config params to task?

I'm struggling to understand how to read DAG config parameters inside a task using Airflow 2.0 dag and task decorators.
Consider this simple DAG definition file:
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago


@dag()
def lovely_dag():
    @task(start_date=days_ago(1))
    def task1():
        return 1

    something = task1()


my_dag = lovely_dag()
I can trigger the dag using the UI or the console and pass to it some (key,value) config, for example:
airflow dags trigger --conf '{"hello":"there"}' lovely_dag
How can I access {"hello":"there"} inside the task1 function?
My use case is I want to pass 2 parameters to dag and want task1 to see them.
You can access the context as follows:
from airflow.operators.python import task, get_current_context


@task
def my_task():
    context = get_current_context()
    dag_run = context["dag_run"]
    dagrun_conf = dag_run.conf
where dagrun_conf will be the variable containing the DAG config parameters.
Source: http://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#accessing-current-context
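Applied to the question's lovely_dag, a minimal sketch could look like the following (the second conf key is made up for illustration):
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.utils.dates import days_ago


@dag(schedule_interval=None, start_date=days_ago(1))
def lovely_dag():
    @task
    def task1():
        # assuming the DAG was triggered with something like:
        #   airflow dags trigger --conf '{"hello": "there", "general": "kenobi"}' lovely_dag
        conf = get_current_context()["dag_run"].conf
        print(conf.get("hello"), conf.get("general"))

    task1()


my_dag = lovely_dag()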

Assign airflow task to several DAGs

I am trying to reuse an existing airflow task by assigning it to different dags.
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag
    """
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task


print_datetime_task = python_operator.PythonOperator(
    task_id='print_datetime', python_callable=_print_datetime)

# define a new dag ...

# add to the new dag
create_new_task_for_dag(print_datetime_task, new_dag)
Then it gives the error Task is missing the start_date parameter.
If I define the dag when creating the operator, print_datetime_task = PythonOperator(task_id='print_datetime', python_callable=_print_datetime, dag=new_dag), then it is OK.
I have searched around, and this seems to be the root cause: https://github.com/apache/airflow/pull/5598, but the PR has been marked as stale.
I wonder if there is any other approach to reuse an existing airflow task and assign it to a different dag.
I am using apache-airflow[docker,kubernetes]==1.10.10
While I don't know the solution to your problem with the current design (code layout), it can be made to work by tweaking the design slightly (note that the following code snippets have NOT been tested).
Instead of copying a task from a DAG,
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag
    """
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task
you can move the instantiation of the task (as well as its assignment to the DAG) to a separate utility function.
from datetime import datetime
from typing import Dict, Any

from airflow.models.dag import DAG
from airflow.operators.python_operator import PythonOperator


def add_new_print_datetime_task(my_dag: DAG,
                                kwargs: Dict[str, Any]) -> PythonOperator:
    """
    Creates and adds a new 'print_datetime' (PythonOperator) task in 'my_dag'
    and returns its reference

    :param my_dag: reference to DAG object in which to add the task
    :type my_dag: DAG
    :param kwargs: dictionary of args for PythonOperator / BaseOperator
                   'task_id' is mandatory
    :type kwargs: Dict[str, Any]
    :return: PythonOperator
    """

    def my_callable() -> None:
        print(datetime.now())

    return PythonOperator(dag=my_dag, python_callable=my_callable, **kwargs)
Thereafter you can call that function every time you want to instantiate that same task (and assign it to any DAG):
with DAG(dag_id="my_dag_id",
         start_date=datetime(year=2020, month=8, day=22, hour=16, minute=30)) as my_dag:
    print_datetime_task_kwargs: Dict[str, Any] = {
        "task_id": "my_task_id",
        "depends_on_past": True
    }
    print_datetime_task: PythonOperator = add_new_print_datetime_task(
        my_dag=my_dag, kwargs=print_datetime_task_kwargs)
    # ... other tasks and their wiring
References / good reads
Astronomer.io: Dynamically Generating DAGs in Airflow
Apache Airflow | With Statement and DAG
