Specify parallelism per task? - airflow

I know in the cfg I can set the parallelism, but is there a way to do it per task, or at least per dag?
dag1=
task_id: 'download_sftp'
parallelism: 4 #I am fine with downloading multiple files at once
task_id: 'process_dimensions'
parallelism: 1 #I want to make sure the dimensions are processed one at a time to prevent conflicts with my 'serial' keys
task_id: 'process_facts'
parallelism: 4 #It is fine to have multiple tables processed at once since there will be no conflicts
dag2 (separate file)=
task_id: 'bcp_query'
parallelism: 6 #I can query separate BCP commands to download data quickly since it is very small amounts of data

You can create a task pool through the web UI and limit execution parallelism by assigning specific tasks to that pool.
Please see: https://airflow.apache.org/concepts.html#pools
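For the DAG in the question, a minimal sketch could look like the following, assuming pools named sftp_pool (4 slots), dimension_pool (1 slot), and fact_pool (4 slots) were created beforehand through the web UI (Admin -> Pools) or, on Airflow 2, with the "airflow pools set" CLI command; the commands themselves are placeholders:

# Sketch only: assumes the pools "sftp_pool", "dimension_pool" and "fact_pool"
# already exist with the slot counts mentioned above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag1",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    download_sftp = BashOperator(
        task_id="download_sftp",
        bash_command="echo download",
        pool="sftp_pool",        # up to 4 task instances from this pool run at once
    )
    process_dimensions = BashOperator(
        task_id="process_dimensions",
        bash_command="echo dimensions",
        pool="dimension_pool",   # 1 slot, so these run strictly one at a time
    )
    process_facts = BashOperator(
        task_id="process_facts",
        bash_command="echo facts",
        pool="fact_pool",        # 4 slots for processing fact tables in parallel
    )
    download_sftp >> process_dimensions >> process_facts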

The number of active DAG runs can be controlled with the parameter below (present in the airflow.cfg configuration file); it applies globally.
By default it is set to 16; changing it to 1 ensures only one instance of a DAG runs at a time and the rest get queued.
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
How to limit Airflow to run only 1 DAG run at a time? --> Suggests how to control concurrency per dag
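To cap this for a single DAG rather than globally, the DAG's max_active_runs argument can be used instead; a minimal sketch (the dag_id and schedule are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="one_run_at_a_time",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    max_active_runs=1,  # additional runs queue until the active one finishes
) as dag:
    BashOperator(task_id="do_work", bash_command="echo work")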

Related

Is it a good practice to use airflow metadatabase to control pipelines?

I am currently developing an Airflow pipeline that will run for multiple tenants. The DAG is triggered via API and separated into batches, which are controlled by a metadatabase in SQL following some business rules.
Each batch has a batch_id to keep track of it, which is passed to the DAG's conf via the API. The batch id combines the creation timestamp with the tenant, for example: tenant1_20221120123323 ... tenant2_20221120123323. These batches can contain two filetypes (for example purposes), and for each filetype a DAG is triggered (DAG1 for filetype 1 and DAG2 for filetype 2). From the file perspective, the batch id is combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline: process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse] and filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse, where filetype2_exists is a BranchPythonOperator that verifies whether DAG_2 was triggered; if it was, it merges the data resulting from DAG2 with DAG1's data before executing write_new_data_to_warehouse.
Based on this DAG model, there will be one DAG run for each tenant. So, the DAG can have multiple DAG runs running in parallel if we trigger more than one DAG run (one per tenant). Here is my first question:
Is it a good practice to work with multiple DAG runs in the same DAG instead of working with dynamic DAGs? In the dynamic case, I would end up with process_data_on_spark_tenant1,
process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach hundreds.
Now, considering that filetype2 may or may not be present in the batch, and considering that I would use the model mentioned above (one single DAG with multiple DAG runs running in parallel, one for each tenant), the only idea I have for checking whether DAG2 was triggered for the current batch (i.e., filetype2 was present in the batch) is to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadatabase and check whether there is a dag_run_id containing the current batch_id and filetype2 and, with a sensor, wait for that DAG run's status to be success. Then I could run the read_data_from_filetype2 task. Otherwise, if there is no dag_run_id with the batch_id and filetype2 registered in the Airflow metadatabase, I can proceed to write_new_data_to_warehouse directly.
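Purely as a sketch of the check described above (DAG_2 and the batch_id value are taken from the example; PythonSensor and DagRun.find are standard Airflow pieces, but the wiring here is illustrative):

# Hypothetical sensor, defined inside DAG_1: succeeds once a DAG_2 run whose
# run_id contains the batch id has finished successfully.
from airflow.models import DagRun
from airflow.sensors.python import PythonSensor


def filetype2_run_succeeded(batch_id: str) -> bool:
    runs = DagRun.find(dag_id="DAG_2")
    return any(batch_id in dr.run_id and dr.state == "success" for dr in runs)


wait_for_filetype2 = PythonSensor(
    task_id="wait_for_filetype2",
    python_callable=filetype2_run_succeeded,
    op_kwargs={"batch_id": "tenant1_20221120123323"},  # would come from dag_run.conf
    poke_interval=60,
)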
Here's the other question:
Is it a good practice to modify the dag_run_id and use it, combined with the Airflow metadatabase, to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that would result in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadatabase and keep parallel DAG runs in one single DAG?
Or is there a better approach to this problem?
Thank You.

Sharing information between DAGs in airflow

I have one dag that tells another dag what tasks to create in a specific order.
Dag 1 -> a file that has a task order
This runs every 5 minutes or so to keep this file fresh.
Dag 2 -> runs the task
This runs daily.
How can I pass this data between the two DAGs using Airflow?
Solutions and problems
The problem with using Airflow Variables is that I cannot set them at runtime.
The problem with using XComs is that they can only be accessed during task execution, and once the tasks are created in Dag 2, they're set and cannot be changed, correct?
The problem with pushing the file to s3 is that the airflow instance doesn't have permission to pull from s3 due to security reasons decided by a team that I have no control over.
So what can I do? What are some choices I have?
What is the file format of the output from the 1st DAG? I would recommend the following workflow:
Dag 1 -> Update the tasks order and store it in a yaml or json file inside the airflow environment.
Dag 2 -> Read the file to create the required tasks and run them daily.
Keep in mind that Airflow constantly re-reads your DAG files to pick up the latest configuration, so no extra step is required.
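A rough sketch of that pattern for Dag 2, assuming a path like /opt/airflow/dags/config/task_order.json that Dag 1 keeps up to date (the path and task names are illustrative):

# Sketch of Dag 2's file: the task order is read at parse time, so whenever
# Dag 1 rewrites task_order.json, the next parse of this file picks it up.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with open("/opt/airflow/dags/config/task_order.json") as f:
    task_order = json.load(f)   # e.g. ["extract", "transform", "load"]

with DAG(
    dag_id="dag_2",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    previous = None
    for name in task_order:
        task = BashOperator(task_id=name, bash_command=f"echo {name}")
        if previous:
            previous >> task
        previous = task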
I have had a similar issue in the past and it largely depends on your setup.
If you are running Airflow on Kubernetes this might work.
You create a PV (Persistent Volume) and a PVC (Persistent Volume Claim).
You start your application with a KubernetesPodOperator and mount the PVC to it.
You store the result on the PVC.
You mount the PVC to the other pod.
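A rough sketch of that setup, assuming a recent cncf.kubernetes provider, an existing PVC named shared-data, and an airflow namespace (the image, paths, and names are illustrative):

# Hypothetical example: two pods sharing state through the same PVC.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

volume = k8s.V1Volume(
    name="shared-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
        claim_name="shared-data"
    ),
)
mount = k8s.V1VolumeMount(name="shared-data", mount_path="/data")

with DAG("pvc_sharing", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    produce = KubernetesPodOperator(
        task_id="produce_result",
        name="produce-result",
        namespace="airflow",
        image="python:3.10-slim",
        cmds=["python", "-c", "open('/data/result.txt', 'w').write('done')"],
        volumes=[volume],
        volume_mounts=[mount],
    )
    consume = KubernetesPodOperator(
        task_id="consume_result",
        name="consume-result",
        namespace="airflow",
        image="python:3.10-slim",
        cmds=["python", "-c", "print(open('/data/result.txt').read())"],
        volumes=[volume],
        volume_mounts=[mount],
    )
    produce >> consume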

DAG's task initialization takes time

We have a composer environment which has below configuration details.
Composer Version:  composer-1.10.0-airflow-1.10.6
Machine Type : n1-standard-4
Disk size (GB): 100
Worker Nodes: 6
python version:3
worker_concurrency: 32
parallelism:128
We have a problem where the DAG's tasks take a long time to initialize. For example, a DAG has 3 tasks like Task1 -> Task2 -> Task3. Task1 takes at least 5 minutes to initialize, but once initialized it completes within seconds. Task2 again takes 5 minutes to initialize and then executes within seconds. So initialization of each task takes a long time while the task itself finishes quickly. The DAG is scheduled every 5 minutes, but completing it takes at least 10 minutes, which affects the functionality and execution of the process.
Here is what each of the three tasks does. Task1 gathers basic information, such as the storage location, from configuration files/variables. Task2 checks the storage for new files and, based on the file, triggers the relevant DAGs. Task3 sends a success email.
Also, I noticed that the worker nodes do not split the work among themselves; one worker node's CPU utilization is always high compared to the others, and I do not know the reason for it. Another interesting point is that even when no other DAGs are running at that time, this DAG still takes 10 minutes to execute.
I appreciate your help in solving this case.
This should be a comment but I don't have the reputation required.
My initial advice is to upgrade your Composer version, 1.10.0 has a few known bugs that are fixed in later versions. Right now the latest version is 1.10.4. This should correct the CPU that stays at 100% (it did in our case). Are there many other DAGs running on your instance?
As I mentioned in the comment, the reason behind the high CPU pressure on that particular GKE node might become clearer after troubleshooting on the Airflow workflow and GKE sides.
It regularly happens that on some Airflow runtime node the compute resources (memory/CPU) exceed the node's capacity, causing Airflow workloads (Pods) to be evicted and restarted, losing all process state data. Meanwhile the Celery executor, which is responsible for assigning tasks to the Airflow workers, may not even be aware of the worker's bad state or timeout and does not take any action to re-assign the task to another worker.
According to the GCP Composer release notes, the vendor has provided some essential fixes in the latest composer-1.10.* patches, improving Composer runtime performance and reliability, as #parakeet said in their answer.
You can also refer to the GCP Composer known issues knowledge base to keep track of the current problems and the workarounds that the vendor shares with the community.

Airflow: Only allow one instance of task

Is there a way specify that a task can only run once concurrently? So in the tree above where DAG concurrency is 4, Airflow will start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no order between the tasks. These tasks are independent but related in purpose and are therefore kept in one DAG so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks and only start a task in the next run if the same task in the previous run is done.
I may have misunderstood your question, but I believe you want all the tasks in a single DAG run to finish before the tasks begin in the next DAG run, so that a DAG will only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dotpoint): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
The Airflow operator documentation describes the task_concurrency argument. Just set it to one.
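A small sketch of that (task_concurrency was renamed max_active_tis_per_dag in Airflow 2.2; the DAG and task here are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("one_at_a_time", start_date=datetime(2023, 1, 1), schedule_interval="@hourly") as dag:
    task2 = BashOperator(
        task_id="task_2",
        bash_command="echo processing",
        task_concurrency=1,  # at most one running instance of this task across all runs
    )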
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So the future DAGs will wait for the previous ones to finish successfully before executing.
In airflow.cfg under [core], you will find:
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
You're free to change this to whatever you desire.

how many tasks can be scheduled in a single airflow dag?

I am completely new to Airflow and couldn't find anywhere how many tasks can be scheduled in a single Airflow DAG, nor what the maximum size of each task can be.
I want to schedule a task which should be able to handle millions of queries, identify each query's type, and schedule the next task according to the type of query.
I read the complete documentation but couldn't find the answer.
There are no limits to how many tasks can be part of a single DAG.
Through the Airflow config, you can set concurrency limitations for execution time, such as the maximum number of parallel tasks overall, the maximum number of concurrent DAG runs for a given DAG, etc. There are settings at the Airflow level, DAG level, and operator level for coarse- to fine-grained control.
Here are the high-level concurrency settings you can tweak:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: default_airflow.cfg
The parallelism settings are described in more detail in this answer.
As far as the maximum "size" of each task, I'm assuming you're referring to resource allocation, such as memory or CPU. This is user configurable depending upon which executor you choose to use:
In a simple setup with LocalExecutor, for instance, it will use any resources available on the host.
With the MesosExecutor, on the other hand, one can define the maximum amount of CPU and/or memory that will be allocated to a task instance, and through the DockerOperator you also have the option to define the maximum amount of CPU and memory a given task instance will use.
With the CeleryExecutor, you can set worker_concurrency to define the number of task instances each worker will take.
Another way to restrict execution is to use the Pools feature (example), for instance, you can set the max size of a pool of tasks talking to a database to 5 to prevent more than 5 tasks from hitting it at once (and potentially overloading the database/API/whatever resource you want to pool against).
Using the concurrency parameter lets you control how many running task instances a DAG is allowed to have, beyond which point things get queued.
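A small sketch of that parameter (it was renamed max_active_tasks in newer Airflow releases; the dag_id and schedule are illustrative):

from datetime import datetime

from airflow import DAG

# At most 4 task instances of this DAG run at once, regardless of how many
# of its tasks are otherwise ready to be scheduled.
dag = DAG(
    dag_id="throttled_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    concurrency=4,  # max_active_tasks=4 on newer Airflow versions
)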
This FAQ from the airflow site has really valuable information about task scheduling.
Lastly, about the size of the tasks, there is no limit from the Airflow side. The only soft requirement posed by Airflow is to create idempotent tasks. So basically as Taylor explained above the task size is limited by the executor - worker that you will select (Kubernetes, Celery, Dask or Local) and the resources that you will have available to your workers.
I think the maximum number of scheduled tasks depends on the Airflow DB. I used SQLite in my Airflow. I tried to create a lot of tasks and Airflow raised an error.
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: too many SQL variables
Thus, for SQLite, the maximum number of scheduled tasks is 996 (found experimentally).
# DAG for limit testing
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow_user',
    'start_date': days_ago(0),
}

with DAG(
    'Task_limit',
    default_args=default_args,
    description='Find task limit',
    schedule_interval=None,
) as dag:
    for i in range(996):
        task = BashOperator(
            task_id="try_port_" + str(i),
            bash_command='echo ' + str(i),
            dag=dag,
        )
    # if the range is increased, then an error occurs
Maybe for another database this number will be higher.
P.S. After a while, I will replace SQLite with PostgreSQL, so I will find a limit for the new DB.
