I am trying to implement Airflow, and per DAG run I need to run a few tasks concurrently. The problem is that I can only find settings that limit concurrent tasks per DAG, not per DAG run.
t1 >> t2 >> [t3, t4, t5, t6] >> t7
e.g. I have this dag and I run it three times in parallel.
What I want is for every dag run to have its own concurrent task execution limit, not a per-dag limit.
Any help is appreciated. Thanks.
Maybe one of these three configurations in airflow.cfg would be helpful for your issue:
Try parallelism, max_active_tasks_per_dag, or max_active_runs_per_dag
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html?highlight=parallelism#parallelism
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html?highlight=parallelism#max-active-tasks-per-dag
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html?highlight=parallelism#max-active-runs-per-dag
If that doesn't work, try reading this detailed post about concurrency configuration.
https://www.astronomer.io/guides/airflow-scaling-workers/
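If a per-DAG (rather than global) limit is enough, the same knobs are also available as DAG arguments. Below is a minimal sketch, assuming a recent Airflow 2 release; the dag id, dates, and numbers are placeholders, and note that both limits are still per DAG (shared across its runs), not per DAG run:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_limits',            # hypothetical dag id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    max_active_runs=3,                  # at most 3 runs of this dag at once
    max_active_tasks=4,                 # at most 4 task instances across those runs
) as dag:
    t1 = BashOperator(task_id='t1', bash_command='echo 1')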
I have the following airflow setup
Executor : KubernetesExecutor
airflow version : 2.1.3
airflow config : parallelism = 256
I have the below scenario
I have a number of dags (e.g. 10) which depend on the success state of a task from another dag. The tasks kept failing, with 6 retries enabled.
All the dependent dags run hourly, and as a result they were put into the queued state by the scheduler. I could see around 800 dags in the queue and nothing was running, so I ended up manually changing their state to failed.
Below are my questions from this event.
Is there a limit on the number of dags that can run concurrently in an airflow setup?
Is there a limit on how many dags can be enqueued ?
When dags are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up priority among the queued dags?
How does airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?
Is there a way to control the parallelism for particular tasks in an airflow dag? Eg. say I have a dag definition like...
for dataset in list_of_datasets:
    # some simple operation
    task_1 = BashOperator(task_id=f'task_1_{dataset.name}', ...)
    # load intensive operation
    task_2 = BashOperator(task_id=f'task_2_{dataset.name}', ...)
    # another simple operation
    task_3 = BashOperator(task_id=f'task_3_{dataset.name}', ...)
    task_1 >> task_2 >> task_3
Is there a way to have something where task_1 can have, say, 5 of its kind running in a dag instance, while only 2 instances of task_2 may be running in a dag instance (also implying that if there are 2 instances of task_2 already running, then only 3 instances of task_1 can run)? Any other common ways to work around this kind of requirement (I imagine this must come up often for pipelines)?
From discussions on the apache airflow email list...
You can use pools (https://airflow.apache.org/docs/stable/concepts.html#pools) to limit how many tasks can be run in parallel for a given pool.
So you can create named "pools" with task count limits (via the webserver admin menu)
and then assign those tasks to those pools in a dag definition file when they are created
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

for dataset in datasets:
    high_load_task = BashOperator(
        task_id='high_load_task_%s' % dataset["id"],
        bash_command='some_command',
        pool='example_pool',
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
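Applied to the earlier question, a sketch could look like the following. It assumes two pools have already been created in the UI (Admin -> Pools), here called light_pool with 5 slots and heavy_pool with 2 slots (names, sizes, and commands are placeholders, and dag is assumed to be defined); keep in mind that pool limits are global, i.e. shared across all dag runs:
from airflow.operators.bash import BashOperator

for dataset in list_of_datasets:
    task_1 = BashOperator(
        task_id=f'task_1_{dataset.name}',
        bash_command='do_light_work.sh',   # hypothetical command
        pool='light_pool',                 # at most 5 running at once
        dag=dag)
    task_2 = BashOperator(
        task_id=f'task_2_{dataset.name}',
        bash_command='do_heavy_work.sh',   # hypothetical command
        pool='heavy_pool',                 # at most 2 running at once
        dag=dag)
    task_1 >> task_2
A task can only be assigned to one pool, so the combined cap described in the question (2 running task_2 instances leaving room for only 3 task_1 instances) is only approximated here; the pool_slots argument, which lets a heavy task consume several slots of one shared pool, is another lever to look at.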
Also...
There is another way to limit the parallelism of tasks - applicable for example in cases where you have different kinds of machines with different capabilities (for example with/without GPU). You can have affinities defined between tasks and the actual machines that execute them - with the Celery executor you can define queues in your celery configuration and assign your task to one of the queues. Then you can have a number of workers/slots defined for all machines in the queue, and as an effect you can also limit the parallelism of tasks this way: https://airflow.apache.org/docs/stable/concepts.html#queues
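As a minimal sketch of that idea (the task, command, and queue name are assumptions, and dag is assumed to be defined):
from airflow.operators.bash import BashOperator

# Hypothetical GPU-bound task routed to a dedicated Celery queue.
train = BashOperator(
    task_id='train_model',
    bash_command='python train.py',   # hypothetical command
    queue='gpu',                      # only workers listening on 'gpu' pick this up
    dag=dag)

# On the GPU machines only, start the worker listening on that queue
# (Airflow 2 CLI):  airflow celery worker --queues gpu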
I have 6 tasks (t1, t2, t3, t4, t5 and t6) all running at the same time.
These tasks run as t1 >> t2 >> t3 >> t4 >> t5 >> t6
At some point t3 has issues and then t4 is not executed; whenever any of the tasks has issues, the later tasks are not executed.
Can someone please let me know how I can avoid this problem? I do not want the other tasks to wait for an earlier task to finish. If the earlier task does not finish within 5 minutes, then it should be skipped.
The way you've wired your workflow doesn't suit your requirements.
Tasks that don't really depend on one another shouldn't be wired together. I've seen some people do it just to limit the load on an external system; but for such a use-case, one must employ Pools.
If you want a task to time out after a given interval (so that downstream tasks can continue), you can exploit the execution_timeout argument of BaseOperator.
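A minimal sketch of what that could look like for the t3/t4 pair (the commands are placeholders and dag is assumed to be defined): each task gets an execution_timeout of 5 minutes, and the downstream task uses trigger_rule=ALL_DONE so it still runs when the task before it fails or times out.
from datetime import timedelta

from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

t3 = BashOperator(
    task_id='t3',
    bash_command='run_step_3.sh',            # hypothetical command
    execution_timeout=timedelta(minutes=5),  # fail this task after 5 minutes
    dag=dag)

t4 = BashOperator(
    task_id='t4',
    bash_command='run_step_4.sh',            # hypothetical command
    execution_timeout=timedelta(minutes=5),
    trigger_rule=TriggerRule.ALL_DONE,       # run even if t3 failed or timed out
    dag=dag)

t3 >> t4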
I am just trying to figure out if there is a way to limit the duration of a DAG in airflow. For example, set the max time a DAG can run for to 30 minutes.
DAGs do have a dagrun_timeout parameter, but it only kicks in once max_active_runs for the DAG is reached (16 by default). For example, if you have 15 active DAG runs, the scheduler will just launch the 16th; but before launching the next one, the scheduler will wait until one of the previous ones finishes or exceeds the timeout.
But you can use execution_timeout for task instances. This parameter works unconditionally.
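A minimal sketch combining both options (the dag id, schedule, command, and the 30-minute limit from the question are placeholders):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='limited_duration_dag',               # hypothetical dag id
    start_date=datetime(2022, 1, 1),
    schedule_interval='@hourly',
    dagrun_timeout=timedelta(minutes=30),        # whole-run limit, with the caveat above
) as dag:
    step = BashOperator(
        task_id='step',
        bash_command='do_work.sh',               # hypothetical command
        execution_timeout=timedelta(minutes=30), # per-task limit, enforced unconditionally
    )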
disclaimer: I'm not (yet) a user of Airflow, just found out about it today, and I'm starting to explore whether it may fit my use cases.
I have one data processing workflow that is a sequential (not parallel) execution of multiple tasks. However, some of the tasks need to run on specific machines. Can Airflow manage this? What would be the advised implementation model for this use case?
Thanks.
Yes, you can achieve this in Airflow with queues. You can tie tasks to a specific queue. Then for each worker on a machine, you can set it to only pickup tasks from select queues.
In code, it would look like this:
task_1 = BashOperator(
    dag=dag,
    task_id='task_a',
    ...
)
task_2 = PythonOperator(
    dag=dag,
    task_id='task_b',
    queue='special',
    ...
)
Note that there is this setting in airflow.cfg:
# Default queue that tasks get assigned to and that worker listen on.
default_queue = default
So if you started your workers with this:
Server A> airflow worker
Server B> airflow worker --queues special
Server C> airflow worker --queues default,special
Then task_1 can be picked up by servers A+C and task_2 can be picked up by servers B+C.
In case you're running Airflow in Docker, then you should do the following:
Set the queue name in the DAG file:
with DAG(dag_id='dag_v1',
         default_args={
             'retries': 1,
             'retry_delay': timedelta(seconds=30),
             'queue': 'server-1',
             ...
         },
         schedule_interval=None,
         tags=['my_dags']) as dag:
...
Set the default queue in the docker-compose.yml file
AIRFLOW__OPERATORS__DEFAULT_QUEUE: 'server-1'
Restart the Airflow Webserver, Scheduler etc.
Note: You have to do this for each worker, but I assume that you have 1 worker per machine - meaning that each machine needs a different AIRFLOW__OPERATORS__DEFAULT_QUEUE value, and the corresponding DAGs you want to run on that machine need to use the same name for their queue (then you can indeed use ${HOSTNAME} as the name).