I trying to create multiple dags using the taskflow API that have a variable passed into them which can be used by tasks within the dag
For example I am trying to have this code
from airflow.decorators import dag, task
from datetime import datetime
#dag(schedule_interval=None, start_date=datetime(2021, 1, 1))
def dag_template(input_var):
#task
def printer_task(x):
print(x)
output_input_var = printer_task(input_var)
dag_1 = dag_template("string1")
dag_2 = dag_template(6)
Which ideally would create two dags with the IDs of dag_1 and dag_2. One dag would print the string "string1" the other 6. This almost works with the code creating 1 dag with an ID of dag_template printing 6.
The documentation has that the dag will be called the python callable, is it possible to override this.
I don't feel its a very elegant solution, but it does do what I'm after.
from airflow.decorators import dag, task
from datetime import datetime
config = [("dag_1", "string1"), ("dag_2", 6)]
for dag_name, dag_input in config:
#dag(dag_id = dag_name ,schedule_interval=None, start_date=datetime(2021, 1, 1))
def dag_template(input_var):
#task
def printer_task(x):
print(x)
output_input_var = printer_task(input_var)
globals()[dag_name] = dag_template(dag_input)
Related
I have a use case where I want to run dynamic tasks.
The expectation is
Task1 (output = list of dicts)-> Task2(a) - > Task3(a)
|
----> Task 2(b) -> Task3(b)
Task 2 and Task 3 needs to be run for every object in list and needs to be sequential.
You can connect multiple dynamically mapped tasks. For example:
import datetime
from airflow import DAG
from airflow.decorators import task
with DAG(dag_id="so_74848271", schedule_interval=None, start_date=datetime.datetime(2022, 1, 1)):
#task
def start():
return [{"donald": "duck"}, {"bugs": "bunny"}, {"mickey": "mouse"}]
#task
def create_name(cartoon):
first_name = list(cartoon.keys())[0]
last_name = list(cartoon.values())[0]
return f"{first_name} {last_name}"
#task
def print_name(full_name):
print(f"Hello {full_name}")
print_name.expand(full_name=create_name.expand(cartoon=start()))
The task create_name will generate one task for each dict in the list returned by start. And the print_name task will generate one task for each result of create_name.
The graph view of this DAG looks as follows:
I have pipelines where the mechanics are always the same, a sequence of two tasks.
So I try to abstract the construction of it through a parent abstract class (using TaskFlow API):
from abc import ABC, abstractmethod
from airflow.decorators import dag, task
from datetime import datetime
def AbstractDag(ABC):
#abstractmethod
def task_1(self):
"""task 1"""
#abstractmethod
def task_2(self, data):
"""task 2"""
def dag_wrapper(self):
#dag(schedule_interval=None, start_date=datetime(2022, 1, 1))
def dag():
#task(task_id='task_1')
def task_1():
return self.task_1()
#task(task_id='task_2')
def task_2(data):
return self.task_2(data)
task_2(task_1())
return dag
But when I try to inherit this class, I can't see my dag in the interface:
class MyCustomDag(AbstractDag):
def task_1(self):
return 2
#abstractmethod
def task_2(self, data):
print(data)
custom_dag = MyCustomDag()
dag_object = custom_dag.dag_wrapper()
Do you have any idea how to do this? or better ideas to abstract this?
Thanks a lot!
Nicolas
I was able to get your example DAG to render in the UI with just a couple small tweaks:
The MyCustomDag.task_2 method doesn't need to be decorated as an abstractmethod.
Using dag() as the wrapped DAG object function name has its issues since it's also a decorator name.
In the AbstractDag.dag_wrapper method you do need to call the #dag-decorated function.
Here is the code I used:
from abc import ABC, abstractmethod
from airflow.decorators import dag, task
from datetime import datetime
class AbstractDag(ABC):
#abstractmethod
def task_1(self):
"""task 1"""
#abstractmethod
def task_2(self, data):
"""task 2"""
def dag_wrapper(self):
#dag(schedule_interval=None, start_date=datetime(2022, 1, 1))
def _dag():
#task(task_id='task_1')
def task_1():
return self.task_1()
#task(task_id='task_2')
def task_2(data):
return self.task_2(data)
task_2(task_1())
return _dag()
class MyCustomDag(AbstractDag):
def task_1(self):
return 2
def task_2(self, data):
print(data)
custom_dag = MyCustomDag()
dag_object = custom_dag.dag_wrapper()
It's worth noting the following from the Airflow docs :
When searching for DAGs inside the DAG_FOLDER, Airflow only considers Python files that contain the strings airflow and dag (case-insensitively) as an optimization.
To consider all Python files instead, disable the
DAG_DISCOVERY_SAFE_MODE configuration flag.
If you're inheriting from AbstractDag in a different file, make sure airflow and dag are in that file. You can simply add a comment with those words.
I'm struggling to understand how to read DAG config parameters inside a task using Airflow 2.0 dag and task decorators.
Consider this simple DAG definition file:
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
#dag()
def lovely_dag():
#task(start_date=days_ago(1))
def task1():
return 1
something = task1()
my_dag = lovely_dag()
I can trigger the dag using the UI or the console and pass to it some (key,value) config, for example:
airflow dags trigger --conf '{"hello":"there"}' lovely_dag
How can I access {"hello":"there"} inside the task1 function?
My use case is I want to pass 2 parameters to dag and want task1 to see them.
You can access the context as follows:
from airflow.operators.python import task, get_current_context
#task
def my_task():
context = get_current_context()
dag_run = context["dag_run"]
dagrun_conf = dag_run.conf
where dagrun_conf will be the variable containing the DAG config parameters
Source: http://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#accessing-current-context
I am trying to generate airflow dags using a template in a python code, and using globals() as defined here
To define dag object and saving it. Below is my code :
import datetime as dt
import sys
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
argumentList = sys.argv
owner = argumentList[1]
dag_name = argumentList[2]
taskID = argumentList[3]
bashCommand = argumentList[4]
default_args = {
'owner': owner,
'start_date': dt.datetime(2019, 6, 1),
'retries': 1,
'retry_delay': dt.timedelta(minutes=5),
}
def dagCreate():
with DAG(dag_name,
default_args=default_args,
schedule_interval=None,
) as dag:
print_hello = BashOperator(task_id=taskID, bash_command=bashCommand)
return dag
globals()[dag_name] = dagCreate()
I have kept this python code outside dag_folder, and executing it as follows :
python bash-dag-generator.py Airflow test_bash_generate auto_bash_task ls
But I don't see any DAG generated in the airflow webserver UI. I am not sure where I am going wrong.
As per the official documentation:
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
So unless your code is actually inside the DAG_FOLDER, it will not be registered as a DAG.
The way I have been able to implement Dynamic DAGs is by using Airflow Variable.
In the below example I have a csv file that has list of Bash command like ls, echo etc. As part of the read_file task I am updating the file location to the Airflow Variable. The part where we read the csv file and loop through the commands is where the dynamic DAGs get created.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import csv
'''
Orchestrate the Dynamic Tasks
'''
def read_file_task():
print('I am reading a File and setting variables ')
Variable.set('dynamic-dag-sample','/home/bashoperator.csv')
with DAG('dynamic-dag-sample',
start_date=datetime(2018, 11, 1)) as dag:
read_file_task = PythonOperator(task_id='read_file_task',
python_callable=read_file_task, provide_context=True,
dag=dag)
dynamic_dag_sample_file_path = Variable.get("dynamic-dag-sample")
if dynamic_dag_sample_file_path != None:
with open(dynamic_dag_sample_file_path) as csv_file:
reader = csv.DictReader(csv_file)
line_count = 0
for row in reader:
bash_task = BashOperator(task_id=row['Taskname'], bash_command=row['Command'])
read_file_task.set_downstream(bash_task)
I mostly see Airflow being used for ETL/Bid data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic dag ids. So, I created a simple python script that takes DAG id and task id as command line parameters. However, I'm running into problems making it work. It gives dag_id not found error. Has anyone tried this? Here's the code for the script (call it tmp.py) which I execute on command line as python (python tmp.py 820 2016-08-24T22:50:00 ):
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
execution = '2016-08-24T22:20:00'
if len(sys.argv) > 2 :
dagid = sys.argv[1]
taskid = 'Activate' + sys.argv[1]
execution = sys.argv[2]
else:
dagid = 'DAGObjectId'
taskid = 'Activate'
default_args = {'owner' : 'airflow', 'depends_on_past': False, 'start_date':date.today(), 'email': ['fake#fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}
dag = DAG(dag_id = dagid,
default_args=default_args,
schedule_interval='#once',
)
globals()[dagid] = dag
task1 = BashOperator(
task_id = taskid,
bash_command='ls -l',
dag=dag)
fakeTask = BashOperator(
task_id = 'fakeTask',
bash_command='sleep 5',
retries = 3,
dag=dag)
task1.set_upstream(fakeTask)
airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After numerous trials and errors, I was able to figure this out. Hopefully, it will help someone. Here's how it works: You need to have an iterator or an external source (file/database table) to generate dags/task dynamically through a template. You can keep the dag and task names static, just assign them ids dynamically in order to differentiate one dag from the other. You put this python script in the dags folder. When you start the airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a dag (unique dag id) has already been written, it will simply skip it. The scheduler also look at the schedule of individual DAGs to determine which one is ready for execution. If a DAG is ready for execution, it executes it and updates its status.
Here's a sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time
dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))
input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'
def my_sleeping_function(random_base):
'''This is a function that will run within the DAG execution'''
time.sleep(random_base)
def_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.now(), 'email_on_failure': False,
'retries': 1, 'retry_delay': timedelta(minutes=2)
}
with open(input_file,'r') as f:
for line in f:
args = line.strip().split(',')
if len(args) < 6:
continue
dagid = 'DAA' + args[0]
taskid = 'TAA' + args[0]
yyyy = int(args[1])
mm = int(args[2])
dd = int(args[3])
hh = int(args[4])
mins = int(args[5])
ss = int(args[6])
dag = DAG(
dag_id=dagid, default_args=def_args,
schedule_interval='#once', start_date=datetime(yyyy,mm,dd,hh,mins,ss)
)
myBashTask = BashOperator(
task_id=taskid,
bash_command='python /home/directory/airflow/sendemail.py',
dag=dag)
task2id = taskid + '-X'
task_sleep = PythonOperator(
task_id=task2id,
python_callable=my_sleeping_function,
op_kwargs={'random_base': 10},
dag=dag)
task_sleep.set_upstream(myBashTask)
f.close()
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
dag_id = 'foo_{}'.format(i)
globals()[dag_id] = DAG(dag_id)
# or better, call a function that returns a DAG object!
copying my answer from this question. Only for v2.3 and above:
This feature is achieved using Dynamic Task Mapping, only for Airflow versions 2.3 and higher
More documentation and example here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
#task
def make_list():
# This can also be from an API call, checking a database, -- almost anything you like, as long as the
# resulting list/dictionary can be stored in the current XCom backend.
return [1, 2, {"a": "b"}, "str"]
#task
def consumer(arg):
print(list(arg))
with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
consumer.expand(arg=make_list())
example 2:
from airflow import XComArg
task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated:
Relevant issues here:
https://github.com/apache/airflow/projects/12