Airflow's dynamic task generation feature seems to mainly support generation of parallel tasks. Is dynamic generation of tasks that are executed in series also possible?
Specifically, I would like to do the following: A task (call it task_1) will create a list of variable length (the length will be determined at runtime). Then, for each member of this list, a PythonOperator and a PythonSensor will be generated and connected in series.
For example, the control flow would look something like this:
task_1 >> dynamic_operator_1 >> dynamic_sensor_1 >> ... >> dynamic_operator_n >> dynamic_sensor_n
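For illustration, here is a rough sketch of the shape I'm after, written as if the list were already known at parse time (which is exactly the limitation I'm trying to get around); the DAG name, callables and item list are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

with DAG("serial_chain_sketch", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    task_1 = PythonOperator(task_id="task_1",
                            python_callable=lambda: print("build the list"))

    previous = task_1
    for i, item in enumerate(["a", "b", "c"]):  # placeholder for the runtime list
        dynamic_operator = PythonOperator(
            task_id=f"dynamic_operator_{i + 1}",
            python_callable=lambda item=item: print(f"process {item}"))
        dynamic_sensor = PythonSensor(
            task_id=f"dynamic_sensor_{i + 1}",
            python_callable=lambda: True)  # placeholder poke condition
        previous >> dynamic_operator >> dynamic_sensor
        previous = dynamic_sensor  # the next pair is chained after this sensor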
I am trying to accomplish the following with Airflow 2.3 (see the pseudocode below):
I am reading a list of items using an Operator (operator_1)
For every item of the list I want to schedule a task group of two Operators (operator_2_1 and operator_2_2) that are supposed to run one after another
If there is more than one item in the list, then only one task group must be executed at a time (no parallel execution)
with DAG(
    ...
) as dag:
    # Must be executed once
    operator_1 = SomeOperator(...)  # returns a list of items

    # Must be executed "once per item" and only one at a time
    task_group(items):
        operator_2_1 = SomeOtherOperator(...)
        operator_2_2 = SomeOtherOperator(...)

    task_group.expand(operator_1)
What I tried so far:
Iterate over the list of items and schedule a task group per item (without using dynamic tasks): Works, but the unwanted parallel execution is a problem
Using dynamic tasks, which seems to work only for tasks (but not for task groups)
I would appreciate any input on this!
Thank you in advance!
Using dynamic tasks:
Mapping a task group is not possible yet, but this is a feature that will be available soon in the next versions.
Control the parallelism of your task groups:
You can create a new pool task_groups_pool with 1 slot and use it for the tasks of the task groups; in this case you will not have more than one task across all the task groups running at the same time.
And to make sure that operator_2_2 is executed right after operator_2_1 of the same group, rather than an operator_2_1 from another task group, you can set the priority of operator_2_2 to the priority of operator_2_1 + 1, or use upstream as the weight_rule for the task groups' tasks. (Here you can find more info about priority weight.)
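A minimal sketch of that setup (the DAG name, item list and callables are placeholders of mine; it assumes a pool named task_groups_pool with 1 slot has already been created via the UI or CLI):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

with DAG("pooled_task_groups_sketch", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    for i, item in enumerate(["a", "b", "c"]):  # placeholder item list
        with TaskGroup(group_id=f"group_{i}"):
            operator_2_1 = PythonOperator(
                task_id="operator_2_1",
                python_callable=lambda item=item: print(f"step 1: {item}"),
                pool="task_groups_pool",  # 1-slot pool -> only one task runs at a time
                weight_rule="upstream")   # downstream tasks get higher effective priority
            operator_2_2 = PythonOperator(
                task_id="operator_2_2",
                python_callable=lambda item=item: print(f"step 2: {item}"),
                pool="task_groups_pool",
                weight_rule="upstream")
            operator_2_1 >> operator_2_2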
Currently, I create multiple tasks using a Variable like this, and it works fine:
with DAG(....) as dag:
    body = Variable.get("config_table", deserialize_json=True)
    for i in range(len(body.keys())):
        simple_task = Operator(
            task_id='task_' + str(i),
            .....
But I need to use an XCom value for some reason instead of a Variable.
Is it possible to dynamically create tasks from an XCom pull value?
I tried to set the value like this, but it's not working:
body = "{{ ti.xcom_pull(key='config_table', task_ids='get_config_table') }}"
It's possible to dynamically create tasks from XComs generated by a previous task; there are more extensive discussions on this topic, for example in this question. One of the suggested approaches follows this structure; here is a working example I made:
sample_file.json:
{
    "cities": [ "London", "Paris", "BA", "NY" ]
}
Get your data from an API or file or any source. Push it as XCom.
def _process_obtained_data(ti):
    list_of_cities = ti.xcom_pull(task_ids='get_data')
    Variable.set(key='list_of_cities',
                 value=list_of_cities['cities'], serialize_json=True)


def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        # push to XCom using return
        return data


def _print_greeting(city_name, greeting):
    # Callable used by the dynamic tasks further down; its body was not shown
    # in the original example, so this is a minimal stand-in.
    print(f'{greeting} from {city_name}')
with DAG('dynamic_tasks_example', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    get_data = PythonOperator(
        task_id='get_data',
        python_callable=_read_file)
Add a second task which will pull from XCom and set a Variable with the data you will use to iterate later on.
    preparation_task = PythonOperator(
        task_id='preparation_task',
        python_callable=_process_obtained_data)
*Of course, if you want, you can merge both tasks into one. I prefer not to because usually I take a subset of the fetched data to create the Variable.
Read from that Variable and later iterate on it. It's critical to define default_var.
    end = DummyOperator(
        task_id='end',
        trigger_rule='none_failed')

    # Top-level code within DAG block
    iterable_list = Variable.get('list_of_cities',
                                 default_var=['default_city'],
                                 deserialize_json=True)
Declare the dynamic tasks and their dependencies within a loop. Make the task_ids unique. TaskGroup is optional, but it helps you keep the UI organized.
    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:
        if iterable_list:
            for index, city in enumerate(iterable_list):
                say_hello = PythonOperator(
                    task_id=f'say_hello_from_{city}',
                    python_callable=_print_greeting,
                    op_kwargs={'city_name': city, 'greeting': 'Hello'}
                )
                say_goodbye = PythonOperator(
                    task_id=f'say_goodbye_from_{city}',
                    python_callable=_print_greeting,
                    op_kwargs={'city_name': city, 'greeting': 'Goodbye'}
                )

                # TaskGroup level dependencies
                say_hello >> say_goodbye

    # DAG level dependencies
    get_data >> preparation_task >> dynamic_tasks_group >> end
DAG Graph View:
Imports:
import json
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup
Things to keep in mind:
If you have simultaneous dag_runs of this same DAG, all of them will use the same Variable, so you may need to make it 'unique', for example by differentiating its name per run.
You must set the default value while reading the Variable; otherwise, the first execution may fail because the Variable does not exist yet when the Scheduler parses the file.
The Airflow Graph View UI may not refresh the changes immediately. This happens especially in the first run after adding or removing items from the iterable on which the dynamic task generation is based.
If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article); a minimal sketch follows.
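For that last point, here is a minimal sketch of grouping several values into one JSON Variable (the key names are made up for illustration):
from airflow.models import Variable

# Written once, e.g. from a preparation task:
Variable.set("dag_config",
             {"list_of_cities": ["London", "Paris"], "batch_size": 10},
             serialize_json=True)

# Read once, then use the individual fields (a single metadata-DB query):
config = Variable.get("dag_config", default_var={}, deserialize_json=True)
cities = config.get("list_of_cities", [])
batch_size = config.get("batch_size", 1)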
Good luck!
Edit:
Another important point to take into consideration:
With this approach, the call to the Variable.get() method is top-level code, so it is executed by the scheduler every 30 seconds (the default of the min_file_process_interval setting). This means that a connection to the metadata DB is made each time the file is parsed.
Edit:
Added an if clause to handle the empty iterable_list case.
This is not possible, and in general dynamic tasks are not recommended:
The way the Airflow scheduler works is by reading the DAG file, loading the tasks into memory and then checking which DAGs and which tasks it needs to schedule. XComs, on the other hand, are runtime values that belong to a specific dag run, so the scheduler cannot rely on XCom values.
When using dynamic tasks you're making debugging much harder for yourself, as the values you use for creating the DAG can change, and you'll lose access to logs without even understanding why.
What you can do is use a branch operator, so that those tasks always exist and are simply skipped based on the XCom value.
For example:
def branch_func(**context):
    # 'key' is whatever XCom key holds the chosen task suffix
    return f"task_{context['ti'].xcom_pull(key=key)}"


branch = BranchPythonOperator(
    task_id="branch",
    python_callable=branch_func
)

tasks = [BaseOperator(task_id=f"task_{i}") for i in range(3)]
branch >> tasks
In some cases it's also not a good idea to use this method (for example, when there are 100 possible tasks); in those cases I'd recommend writing your own operator or using a single PythonOperator, as sketched below.
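A sketch of the single-PythonOperator alternative, where one task loops over the runtime values instead of fanning out into many tasks (the DAG/task ids and the XCom source are assumptions for illustration):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_all_items(ti):
    # Pull the runtime list once and handle every item inside a single task.
    items = ti.xcom_pull(task_ids="get_config_table", key="config_table") or []
    for item in items:
        print(f"processing {item}")  # whatever each dynamic task would have done


with DAG("single_task_loop_sketch", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    process_everything = PythonOperator(task_id="process_everything",
                                        python_callable=process_all_items)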
I have a manually triggered DAG. It takes parameters like:
{"id_list":"3,5,1"}
In the DAG, I create the operators dynamically based on this list of integers:
for id in id_list:
    task = create_task(id)
I need to initialize the id_list based on the parameter values of id_list.
How can I initialize that list, since I cannot reference the parameter directly outside of a templated field? This is how I want to see it in the Graph View, where the process tasks are based on the id_list params.
I have seen examples of dynamically created tasks, but they are not really dynamic in the sense that the list values are hard-coded. The tasks are created dynamically based on a list of hard-coded values, if that makes sense.
First, create a fixed number of tasks to execute. This example uses PythonOperator. In the python_callable, execute if the index is less than the length of the param_list, else raise AirflowSkipException:
from airflow.exceptions import AirflowSkipException


def execute(index, account_ids):
    param_list = account_ids.split(',')
    if index < len(param_list):
        print(f"execute task index {index}")
    else:
        raise AirflowSkipException


def create_task(task_id, index):
    return PythonOperator(task_id=task_id,
                          python_callable=execute,
                          op_kwargs={
                              "index": index,
                              "account_ids": "{{ dag_run.conf['account_ids'] }}"}
                          )
record_size_limit = 5
ACCOUNT_LIST = [None] * record_size_limit

for idx in range(record_size_limit):
    task = create_task(f"task_{idx}", idx)
Trigger DAG and pass this as parameters:
Graph View:
A DAG and its tasks must be resolved prior to being available for use; this includes the webserver, scheduler, everywhere. The webserver is actually a perfect example why: how would you render the process to the user?
The only dynamic components of a process are the parameters that are available during template rendering. In most cases I've seen people use a PythonOperator to loop over the input and perform some action N times to solve the same issue.
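Applied to the id_list question above, that could look like the following sketch, where a single task reads dag_run.conf and loops over the ids (the DAG and task ids are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_ids(dag_run):
    conf = dag_run.conf or {}
    id_list = [i for i in conf.get("id_list", "").split(",") if i]  # e.g. {"id_list": "3,5,1"}
    for item_id in id_list:
        print(f"processing id {item_id}")  # act on each id inside one task


with DAG("single_task_id_list_sketch", start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="process_ids", python_callable=process_ids)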
I have a list of HTTP endpoints, each performing a task on its own. We are trying to write an application which will orchestrate them by invoking these endpoints in a certain order. In this solution we also have to process the output of one HTTP endpoint and generate the input for the next HTTP endpoint. Also, the same workflow can get invoked simultaneously depending on the trigger.
What I have done until now,
1. Have defined a new operator deriving from the HttpOperator and introduced capabilities to write the output of the http endpoint to a file.
2. Have written a python operator which can transfer the output depending on the necessary logic.
Since I can have multiple instances of the same workflow in execution, I cannot hardcode the output file names. Is there a way to make the HTTP operator I wrote write to unique file names, and have the same file name available to the next task so that it can read and process the output?
Airflow does have a feature for operator cross-communication called XCom:
XComs can be “pushed” (sent) or “pulled” (received). When a task pushes an XCom, it makes it generally available to other tasks. Tasks can push XComs at any time by calling the xcom_push() method.
Tasks call xcom_pull() to retrieve XComs, optionally applying filters based on criteria like key, source task_ids, and source dag_id.
To push to XCom, use:
ti.xcom_push(key=<variable name>, value=<variable value>)
To pull an XCom value, use:
myxcom_val = ti.xcom_pull(key=<variable name>, task_ids='<task to pull from>')
With the BashOperator, you just set xcom_push=True and the last line written to stdout is set as the XCom value.
You can view the XCom values while your task is running by opening the task instance from the Airflow UI and clicking on the XCom tab.
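Applied to the question above, here is a sketch of passing a run-unique file name between two tasks via XCom (the task ids, file path pattern and callables are my assumptions, with plain PythonOperators standing in for the custom HTTP operator):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def call_endpoint(ti, run_id):
    path = f"/tmp/endpoint_output_{run_id}.json"  # unique per DAG run
    with open(path, "w") as f:
        f.write('{"status": "ok"}')               # stand-in for the HTTP response
    ti.xcom_push(key="output_path", value=path)   # hand the file name to the next task


def transform_output(ti):
    path = ti.xcom_pull(key="output_path", task_ids="call_endpoint")
    with open(path) as f:
        print(f.read())                           # build the input for the next endpoint


with DAG("xcom_filename_sketch", start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    call_task = PythonOperator(task_id="call_endpoint", python_callable=call_endpoint)
    transform_task = PythonOperator(task_id="transform", python_callable=transform_output)
    call_task >> transform_task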
Context: I've defined an Airflow DAG which performs an operation, compute_metrics, on some data for an entity based on a parameter called org. Underneath, something like myapi.compute_metrics(org) is called. This flow will mostly be run on an ad-hoc basis.
Problem: I'd like to be able to select the org to run the flow against when I manually trigger the DAG from the airflow UI.
The most straightforward solution I can think of is to generate n different DAGs, one for each org. The DAGs would have ids like: compute_metrics_1, compute_metrics_2, etc... and then when I need to trigger compute metrics for a single org, I can pick the DAG for that org. This doesn't scale as I add orgs and as I add more types of computation.
I've done some research and it seems that I can create a flask blueprint for airflow, which, to my understanding, extends the UI. In this extended UI I can add input components, like a text box, for picking an org and then pass that as a conf to a DagRun which is manually created by the blueprint. Is that correct? I'm imagining I could write something like:
session = settings.Session()
execution_date = datetime.now()
run_id = 'external_trigger_' + execution_date.isoformat()
trigger = DagRun(
    dag_id='general_compute_metrics_needs_org_id',
    run_id=run_id,
    state=State.RUNNING,
    execution_date=execution_date,
    external_trigger=True,
    conf=org_ui_component.text)  # pass the org id from a component in the blueprint
session.add(trigger)
session.commit()  # I don't know if this would actually be scheduled by the scheduler
Is my idea sound? Is there a better way to achieve what I want?
I've done some research and it seems that I can create a flask blueprint for airflow, which, to my understanding, extends the UI.
The blueprint extends the API. If you want some UI for it, you'll need to serve a template view. The most feature-complete way of achieving this is developing your own Airflow Plugin.
If you want to manually create DagRuns, you can use this trigger as reference. For simplicity, I'd trigger a Dag with the API.
And specifically about your problem: I would have a single DAG, compute_metrics, that reads the org from an Airflow Variable. Variables are global and can be set dynamically. You can prefix the variable name with something like the DagRun id to make it unique and thus safe for concurrent dag runs.
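A minimal sketch of that idea (the variable name and the print stand-in for myapi.compute_metrics(org) are assumptions): whoever triggers the run sets the Variable first, and the task reads it back at run time.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def compute_metrics():
    # Set dynamically before triggering the DAG; to be safe for concurrent runs
    # the name could be prefixed with the run id, e.g. f"{run_id}_org".
    org = Variable.get("compute_metrics_org", default_var=None)
    print(f"computing metrics for org={org}")  # stand-in for myapi.compute_metrics(org)


with DAG("compute_metrics", start_date=datetime(2020, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="compute_metrics", python_callable=compute_metrics)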