I want to list the Airflow DAGs that consume the most resources (CPU/RAM). If not at the DAG level, then at the task level.
Is it possible to get that usage, perhaps from historical runs?
Facing a scenario in Apache Airflow where the ask is to stop the DAG execution on the event of an SLA miss and proceed with the subsequent DAG interval's execution.
Is such functionality configurable in Airflow?
You can create a lightweight DAG that runs every X (the smallest interval within your Airflow cluster), checks for SLA misses, and pauses the affected DAG if there are any.
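Something like this minimal sketch, assuming Airflow 2.x and direct access to the metadata database through the ORM models; the DAG id, schedule, and five-minute look-back window are placeholders:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagModel, SlaMiss
from airflow.operators.python import PythonOperator
from airflow.utils import timezone
from airflow.utils.session import create_session


def pause_dags_with_sla_miss():
    """Pause any DAG that recorded an SLA miss within the last look-back window."""
    cutoff = timezone.utcnow() - timedelta(minutes=5)  # match the watchdog's schedule
    with create_session() as session:
        missed_dag_ids = (
            session.query(SlaMiss.dag_id)
            .filter(SlaMiss.timestamp >= cutoff)
            .distinct()
            .all()
        )
        for (dag_id,) in missed_dag_ids:
            # Flip the is_paused flag on the DAG that missed its SLA
            session.query(DagModel).filter(DagModel.dag_id == dag_id).update(
                {DagModel.is_paused: True}, synchronize_session=False
            )


with DAG(
    dag_id="sla_watchdog",            # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/5 * * * *",  # "every X": the smallest interval in the cluster
    catchup=False,
) as dag:
    PythonOperator(
        task_id="pause_dags_on_sla_miss",
        python_callable=pause_dags_with_sla_miss,
    )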
I have multiple DAGs that run on different cadences: some weekly, some daily, etc. I want to set it up so that while dag-a is running, dag-b waits until it has completed; likewise, if dag-b is running, dag-a should wait until dag-b completes, and so on. Is there a way to do this in Airflow out of the box?
What you are looking for is probably the ExternalTaskSensor.
Airflow's Cross-DAG Dependencies description is also pretty useful.
If you are using this, there is also the Airflow DAG dependencies plugin, which can be pretty useful for visualizing those dependencies.
You could use a sensor operator to sense DAG runs or a task in a DAG run; the ExternalTaskSensor is your best bet. Be careful how you set the execution_delta you pass: in general, the idea is to specify when the sensor should be able to find the DAG run.
E.g. if the main DAG is scheduled at 4:00 UTC, and the sensor is a task in that DAG like the one below:
from datetime import timedelta

from airflow.sensors.external_task import ExternalTaskSensor

# `key` here is the dag_id of the external DAG being waited on
ExternalTaskSensor(
    dag=dag,
    task_id='dag_sensor_{}'.format(key),
    external_dag_id=key,
    execution_delta=timedelta(days=1),
    external_task_id=None,
    mode='reschedule',
    check_existence=True,
)
Then the other DAG being sensed must trigger a run at 4:00 UTC. The one-day delta is set to offset the difference between the execution date and the current date.
Is there a way to specify that a task can only run once concurrently? So in the tree above, where DAG concurrency is 4, Airflow will start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no ordering between the tasks. The tasks are independent but related in purpose, and are therefore kept in one DAG so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks, and only start a task in the next run once the same task in the previous run is done.
I may have misunderstood your question, but I believe you want all the tasks in a single DAG run to finish before the tasks begin in the next DAG run, i.e. a DAG will only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the DAG to limit how many concurrently running instances of a DAG are allowed.
More information here (refer to the last bullet point): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many concurrently running instances of a DAG are allowed.
The Airflow operator documentation describes the task_concurrency argument. Just set it to one.
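A minimal sketch of the setup described in the question, assuming Airflow 2.x (on newer releases task_concurrency has been renamed max_active_tis_per_dag, and concurrency renamed max_active_tasks); DAG and task names are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="independent_tasks",       # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    max_active_runs=2,                # as in the question: two DAG runs may be active
    concurrency=4,                    # up to four task instances of this DAG at once
) as dag:
    for n in range(1, 5):
        BashOperator(
            task_id=f"task_{n}",
            bash_command=f"echo task {n}",
            task_concurrency=1,       # only one running instance of this task across runs
        )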
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So a task in a future DAG run will wait for the same task in the previous run to finish successfully before executing.
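For illustration, a minimal sketch assuming Airflow 2.x; putting depends_on_past in default_args applies it to every task in the DAG (names are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "depends_on_past": True,  # each task waits for its own previous-run instance to succeed
}

with DAG(
    dag_id="wait_for_previous_run",   # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    BashOperator(task_id="do_work", bash_command="echo work")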
In airflow.cfg under [core] you will find:
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
You're free to change this to whatever you desire.
I created 4 SubDAGs within the main DAG, which run on different schedule_intervals. I removed one SubDAG's operation, but it still appears in Airflow's database. Will that entry in the database execute? Is there a way to delete it from Airflow's database?
The record will persist in the database, however if the DAG isn't actually present on the scheduler (and workers depending on your executor), it can't be added to the DagBag and won't be run.
Having a look at this simplified version of what the scheduler does:
def _do_dags(self, dagbag, dags, tis_out):
    """
    Iterates over the dags and schedules and processes them
    """
    for dag in dags:
        self.logger.debug("Scheduling {}".format(dag.dag_id))
        dag = dagbag.get_dag(dag.dag_id)
        if not dag:
            continue
        try:
            self.schedule_dag(dag)
            self.process_dag(dag, tis_out)
            self.manage_slas(dag)
        except Exception as e:
            self.logger.exception(e)
The scheduler will check if the dag is contained in the DagBag before it does any processing on it. Entries for DAGs are kept in the database to maintain the historical record of what dates have been processed should you re-add it in the future. But for all intents and purposes, you can treat a missing DAG as a paused DAG.
I am completely new to Airflow, and couldn't find anywhere how many tasks can be scheduled in a single Airflow DAG, or what the maximum size of each task can be.
I want to schedule a task which should be able to handle millions of queries, identify each query's type, and schedule the next task according to that type.
I read the complete documentation but couldn't find it.
There are no limits to how many tasks can be part of a single DAG.
Through the Airflow config, you can set concurrency limitations for execution time, such as the maximum number of parallel tasks overall, the maximum number of concurrent DAG runs for a given DAG, etc. There are settings at the Airflow level, the DAG level, and the operator level, going from coarse- to fine-grained control.
Here are the high-level concurrency settings you can tweak:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: default_airflow.cfg
The parallelism settings are described in more detail in this answer.
As far as the maximum "size" of each task goes, I'm assuming you're referring to resource allocation, such as memory or CPU. This is user-configurable, depending upon which executor you choose to use:
In a simple setup with the LocalExecutor, for instance, tasks will use whatever resources are available on the host.
With the MesosExecutor, on the other hand, one can define the maximum amount of CPU and/or memory allocated to a task instance, and through the DockerOperator you also have the option to define the maximum amount of CPU and memory a given task instance will use (a hedged sketch follows below).
With the CeleryExecutor, you can set worker_concurrency to define the number of task instances each worker will take.
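Here is that hedged sketch of per-task resource caps with the DockerOperator; it assumes the apache-airflow-providers-docker package is installed, and the DAG name, image, command, and limits are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="docker_resource_limits",  # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    DockerOperator(
        task_id="resource_capped_task",
        image="python:3.9-slim",   # placeholder image
        command="echo hello",      # placeholder command
        cpus=1.0,                  # CPU share allocated to the container
        mem_limit="512m",          # memory cap for the container
    )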
Another way to restrict execution is to use the Pools feature (example): for instance, you can set the max size of a pool of tasks talking to a database to 5, to prevent more than 5 tasks from hitting it at once (and potentially overloading the database/API/whatever resource you want to pool against).
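A minimal sketch of that pattern; it assumes a pool named database_pool with 5 slots has already been created (Admin -> Pools in the UI, or via the airflow pools CLI), and the DAG and task names are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pooled_db_tasks",          # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    for i in range(20):
        BashOperator(
            task_id=f"db_query_{i}",
            bash_command=f"echo query {i}",
            pool="database_pool",      # at most 5 of these run at once (the pool size)
        )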
Using the concurrency parameter lets you control how many running task instances a DAG is allowed to have, beyond which point tasks get queued.
This FAQ from the airflow site has really valuable information about task scheduling.
Lastly, about the size of the tasks: there is no limit from the Airflow side. The only soft requirement posed by Airflow is to create idempotent tasks. So basically, as Taylor explained above, the task size is limited by the executor/worker that you select (Kubernetes, Celery, Dask, or Local) and the resources that you have available to your workers.
I think the maximum number of scheduled tasks depends on the Airflow database. I used SQLite in my Airflow installation, tried to create a lot of tasks, and Airflow raised an error.
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: too many SQL variables
Thus, for SQLite, the maximum number of scheduled tasks is 996 (found experimentally).
# DAG for limit testing
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow_user',
    'start_date': days_ago(0),
}

with DAG(
    'Task_limit',
    default_args=default_args,
    description='Find task limit',
    schedule_interval=None,
) as dag:
    for i in range(996):
        task = BashOperator(
            task_id="try_port_" + str(i),
            bash_command='echo ' + str(i),
        )
        # if the range is increased, then an error occurs
Maybe for another database this number will be higher.
P.S. After a while, I will replace SQLite with PostgreSQL, so I will find the limit for the new DB.