I am trying to accomplish the following with Airflow 2.3 (see the pseudo code below):
I am reading a list of items using an Operator (operator_1)
For every item of the list I want to schedule a task group of two Operators (operator_2_1 and operator_2_2) that are supposed to run one after another
If there is more than one item in the list, then only one task group must be executed at a time (no parallel execution)
with DAG(
    ...
) as dag:
    # Must be executed once
    operator_1 = SomeOperator(...)  # returns a list of items

    # Must be executed "once per item" and only one at a time
    task_group(items):
        operator_2_1 = SomeOtherOperator(...)
        operator_2_2 = SomeOtherOperator(...)
        operator_2_1 >> operator_2_2

    task_group.expand(operator_1)
What I tried so far:
Iterate over the list of items and schedule a task group per item (without using dynamic tasks): Works, but the unwanted parallel execution is a problem
Using dynamic tasks, which seems to work only for tasks (but not for task groups)
I would appreciate any input on this!
Thank you in advance!
Using dynamic tasks:
Mapping over a task group is not possible yet in Airflow 2.3; this feature is planned for an upcoming release.
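For reference, once task-group mapping is available (it arrived in later Airflow versions, 2.5+), a minimal sketch with the TaskFlow API looks roughly like this; the task and group names are placeholders standing in for operator_1 and the two downstream operators:

import pendulum

from airflow.decorators import dag, task, task_group


@dag(start_date=pendulum.datetime(2023, 1, 1), schedule=None, catchup=False)
def mapped_task_groups():

    @task
    def get_items():
        # stands in for operator_1; returns the list of items
        return ["a", "b", "c"]

    @task_group
    def process_item(item):
        @task
        def step_1(value):
            print("step 1 for", value)
            return value

        @task
        def step_2(value):
            print("step 2 for", value)

        step_2(step_1(item))

    # one task group instance per item returned by get_items
    process_item.expand(item=get_items())


mapped_task_groups()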
Control the parallelism of your task groups:
You can create a new pool task_groups_pool with 1 slot and assign it to the tasks of the task groups; that way no more than one task across all the task groups will run at the same time.
And to make sure that operator_2_2 is executed right after the operator_2_1 of the same group, and not before the operator_2_1 of another task group, you can set the priority of operator_2_2 to the priority of operator_2_1 + 1, or use upstream as the weight_rule for the task groups' tasks. (Here you can find more info about priority weight.)
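A minimal sketch of the idea, assuming the static-loop approach from the question (the item list is known at parse time) and a pool named task_groups_pool created beforehand with 1 slot (via the UI or the airflow pools CLI); the BashOperators are placeholders for the real operators:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

# Placeholder list; with Airflow 2.3 the items must be known at parse time.
ITEMS = ["a", "b", "c"]

with DAG("serialized_task_groups", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    operator_1 = BashOperator(task_id="operator_1", bash_command="echo read items")

    for i, item in enumerate(ITEMS):
        with TaskGroup(group_id=f"process_{i}") as tg:
            op_2_1 = BashOperator(
                task_id="operator_2_1",
                bash_command=f"echo 'step 1 for {item}'",
                pool="task_groups_pool",  # 1-slot pool: only one of these tasks runs at a time
                weight_rule="upstream",   # op_2_2 outranks the op_2_1 of other groups
            )
            op_2_2 = BashOperator(
                task_id="operator_2_2",
                bash_command=f"echo 'step 2 for {item}'",
                pool="task_groups_pool",
                weight_rule="upstream",
            )
            op_2_1 >> op_2_2
        operator_1 >> tg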
Related
Airflow's dynamic task generation feature seems to mainly support generation of parallel tasks. Is dynamic generation of tasks that are executed in series also possible?
Specifically, I would like to do the following: A task (call it task_1) will create a list of variable length (the length will be determined at runtime). Then, for each member of this list, a PythonOperator and a PythonSensor will be generated and connected in series.
For example, the control flow would look something like this:
task_1 >> dynamic_operator_1 >> dynamic_sensor_1 >> ... >> dynamic_operator_n >> dynamic_sensor_n
I have a manually triggered DAG. It takes parameters like:
{"id_list":"3,5,1"}
In the DAG, I create the operators dynamically based on this list of integers:
for id in id_list:
    task = create_task(id)
I need to initialize id_list from the parameter values passed to the DAG run.
How can I initialize that list, given that I cannot reference the parameter directly outside of a templated field? In the Graph View I would like to see one process task per entry in the id_list parameter.
I have seen examples of dynamically created tasks, but they are not really dynamic in the sense that the list values are hard-coded: the tasks are generated from a hard-coded list, if that makes sense.
First, create a fixed number of tasks to execute. This example uses PythonOperator. In the python_callable, if the index is less than the length of the parameter list, execute the work; otherwise raise AirflowSkipException.
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator


def execute(index, account_ids):
    param_list = account_ids.split(',')
    if index < len(param_list):
        print(f"execute task index {index}")
    else:
        raise AirflowSkipException


def create_task(task_id, index):
    return PythonOperator(task_id=task_id,
                          python_callable=execute,
                          op_kwargs={
                              "index": index,
                              "account_ids": "{{ dag_run.conf['account_ids'] }}"})


record_size_limit = 5
ACCOUNT_LIST = [None] * record_size_limit

for idx in range(record_size_limit):
    task = create_task(f"task_{idx}", idx)
Trigger the DAG and pass the account IDs in the run configuration, for example {"account_ids": "1,2,3"}.
Graph View: the tasks whose index is beyond the length of the passed list are skipped, so only the needed tasks actually run.
A DAG and its tasks must be resolved before they are available for use; this applies to the webserver, the scheduler, everywhere. The webserver is actually a perfect example of why: how would it render the process to the user?
The only dynamic components of a process are the parameters that are available during template rendering. In most cases I've seen people use a PythonOperator to loop over the input and perform some action N times to solve the same issue.
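A minimal sketch of that pattern, assuming the variable-length input is passed at trigger time in dag_run.conf under a hypothetical id_list key:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_all(**context):
    # Loop over the runtime input inside a single task instead of
    # generating one task per item at parse time.
    id_list = context["dag_run"].conf.get("id_list", "").split(",")
    for item_id in id_list:
        print(f"processing {item_id}")  # stand-in for the real per-item work


with DAG("loop_inside_task", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="process_all", python_callable=process_all)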
I am running 5 PythonOperator tasks in my Airflow DAG, and one of them is performing an ETL job that takes a long time, so all my resources are blocked. Is there a way I can set a maximum execution time per task, after which the task either fails or is marked successful (so that the DAG doesn't fail) with a message?
Every operator has an execution_timeout parameter, which takes a datetime.timedelta object.
As per the base operator code comments:
:param execution_timeout: max time allowed for the execution of
this task instance, if it goes beyond it will raise and fail.
:type execution_timeout: datetime.timedelta
Also bear in mind that the timeout fails a single task instance, which triggers its retries; the task (and with it the DAG run) is only declared failed once all retries have failed.
So depending on the number of automatic retries you have assigned, the potential maximum time is roughly (number of attempts) x (timeout), i.e. (retries + 1) x timeout, in case the code keeps taking too long.
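A minimal sketch, assuming the long-running ETL step is a PythonOperator (run_etl is a placeholder for the real callable):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_etl():
    ...  # stand-in for the long-running ETL job


with DAG("etl_with_timeout", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    PythonOperator(
        task_id="run_etl",
        python_callable=run_etl,
        execution_timeout=timedelta(hours=1),  # each attempt is killed after 1 hour
        retries=2,                             # worst case: 3 attempts x 1 hour
    )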
Check out this previous answer.
In short, using Airflow's built-in pools, or even specifying a start_date for a task (instead of an entire DAG), seem to be potential solutions.
From this documentation, you'd want to set the execution_timeout task parameter, which would look something like this:
from datetime import timedelta

from airflow.providers.sftp.sensors.sftp import SFTPSensor

sensor = SFTPSensor(
    task_id="sensor",
    path="/root/test",
    execution_timeout=timedelta(hours=2),
    timeout=3600,
    retries=2,
    mode="reschedule",
)
Context: I've defined an Airflow DAG which performs an operation, compute_metrics, on some data for an entity based on a parameter called org. Underneath, something like myapi.compute_metrics(org) is called. This flow will mostly be run on an ad-hoc basis.
Problem: I'd like to be able to select the org to run the flow against when I manually trigger the DAG from the airflow UI.
The most straightforward solution I can think of is to generate n different DAGs, one for each org. The DAGs would have ids like: compute_metrics_1, compute_metrics_2, etc... and then when I need to trigger compute metrics for a single org, I can pick the DAG for that org. This doesn't scale as I add orgs and as I add more types of computation.
I've done some research and it seems that I can create a Flask blueprint for Airflow, which, to my understanding, extends the UI. In this extended UI I can add input components, like a text box, for picking an org, and then pass that as a conf to a DagRun which is manually created by the blueprint. Is that correct? I'm imagining I could write something like:
from datetime import datetime

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()
execution_date = datetime.now()
run_id = 'external_trigger_' + execution_date.isoformat()
trigger = DagRun(
    dag_id='general_compute_metrics_needs_org_id',
    run_id=run_id,
    state=State.RUNNING,
    execution_date=execution_date,
    external_trigger=True,
    conf=org_ui_component.text)  # pass the org id from a component in the blueprint
session.add(trigger)
session.commit()  # I don't know if this would actually be scheduled by the scheduler
Is my idea sound? Is there a better way to achieve what I want?
I've done some research and it seems that I can create a flask blueprint for airflow, which to my understanding, extends the UI.
The blueprint extends the API. If you want some UI for it, you'll need to serve a template view. The most feature-complete way of achieving this is developing your own Airflow Plugin.
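A minimal skeleton of such a plugin, assuming a Flask blueprint named compute_metrics_ui that serves your template view (the blueprint contents and template folder are placeholders):

from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint

# Blueprint that will serve the custom form/template view.
compute_metrics_bp = Blueprint(
    "compute_metrics_ui",
    __name__,
    template_folder="templates",  # assumed folder holding your HTML templates
)


class ComputeMetricsPlugin(AirflowPlugin):
    name = "compute_metrics_ui"
    flask_blueprints = [compute_metrics_bp]

Dropping this module into the plugins folder registers the blueprint with the webserver.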
If you want to manually create DagRuns, you can use this trigger as a reference. For simplicity, I'd trigger a DAG with the API.
And specifically for your problem, I would have a single DAG compute_metrics that reads the org from an Airflow Variable. Variables are global and can be set dynamically. You can prefix the variable name with something like the DagRun id to make it unique and therefore safe for concurrent runs.
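A minimal sketch of that approach, assuming the trigger side stores the org under a Variable named compute_metrics_org_<run_id> before creating the DagRun (the variable-name scheme is a placeholder, and print stands in for myapi.compute_metrics):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def compute_metrics(**context):
    # Read the org that the trigger side stored under this run's id.
    org = Variable.get(f"compute_metrics_org_{context['run_id']}")
    print(f"computing metrics for org {org}")  # stand-in for myapi.compute_metrics(org)


with DAG("compute_metrics", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="compute_metrics", python_callable=compute_metrics)

In recent Airflow versions a simpler route is to read the org from dag_run.conf instead, e.g. after triggering with airflow dags trigger -c '{"org": "acme"}' compute_metrics.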
I'd like to prevent the following task from getting run multiple times when sbt is running:
val myTask = someSettings map { s => if (!s.isDone) doSomethingAndSetTheFlag() }
So what's expected would be when myTask is run for the first time, isDone is false and something gets done in the task, and then the task sets the flag to true. But when the task is run for the second time, since the isDone flag is true, it skips the actual execution block.
The expected behavior is similar to compile: once the source is compiled, the task doesn't compile the code again the next time it's triggered, until the watchSource task says the code has changed.
Is it possible? How?
This is already handled by sbt: a task is evaluated only once within a single run. If you want a value to be evaluated once, at project load time, you can change it to a SettingKey.
This is documented in the sbt documentation (highlighting is mine):
As mentioned in the introduction, a task is evaluated on demand. Each time sampleTask is invoked, for example, it will print the sum. If the username changes between runs, stringTask will take different values in those separate runs. (Within a run, each task is evaluated at most once.) In contrast, settings are evaluated once on project load and are fixed until the next reload.