I am completely new to airflow, and couldn't find anywhere that how many tasks can be scheduled in a single airflow DAG. And what can be the maximum size of each task.
I want to schedule a task which should be able to handle millions of queries and identify its type and schedule the next task according to the type of query.
Read complete documentation but couldn't find it
There are no limits to how many tasks can be part of a single DAG.
Through the Airflow config, you can set concurrency limitations for execution time such as the maximum number of parallel tasks overall, maximum number of concurrent DAG runs for a given DAG, etc. There are settings at the Airflow level, DAG level, and operator level for more coarse to fine-grained control.
Here are the high-level concurrency settings you can tweak:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: default_airflow.cfg
The parallelism settings are described in more detail in this answer.
As far as the maximum "size" of each task, I'm assuming you're referring to resource allocation, such as memory or CPU. This is user configurable depending upon which executor you choose to use:
In a simple setup with LocalExecutor, for instance, it will use any resources available on the host.
In contrast, with the MesosExecutor on the other hand, one can define the max amount of CPU and/or memory that will be allocated to a task instance, and through DockerOperator you also have the option to define the maximum amount of CPU and memory a given task instance will use.
With the CeleryExecutor, you can set worker_concurrency to define the number of task instances each worker will take.
Another way to restrict execution is to use the Pools feature (example), for instance, you can set the max size of a pool of tasks talking to a database to 5 to prevent more than 5 tasks from hitting it at once (and potentially overloading the database/API/whatever resource you want to pool against).
Well using concurrency parameter can let you control how many running task instances a DAG is allowed to have, beyond which point things get queued.
This FAQ from the airflow site has really valuable information about task scheduling.
Lastly, about the size of the tasks, there is no limit from the Airflow side. The only soft requirement posed by Airflow is to create idempotent tasks. So basically as Taylor explained above the task size is limited by the executor - worker that you will select (Kubernetes, Celery, Dask or Local) and the resources that you will have available to your workers.
I think the maximum number of scheduled tasks depends on the airflow DB. I used SQLite in my airflow. I tried to create a lot of tasks and the airflow caused an error.
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: too many SQL variables
Thus, for SQLite, the maximum number of scheduled tasks is 996 (founded experementally).
# DAG for limit testing
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
default_args = {
'owner': 'airflow_user',
'start_date': days_ago(0),
}
with DAG(
'Task_limit',
default_args = default_args,
description = 'Find task limit',
schedule_interval = None,
) as dag:
for i in range(996):
task = BashOperator(
task_id = "try_port_" + str(i),
bash_command='echo ' + str(i),
dag = dag,
)
# if the range is increased, then an error occurs
MB for another database, this number will be higher.
P.S. After a while, I will replace SQLite with PostgreSQL, so I will find a limit for the new DB.
Related
I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00, they are scheduled but are not able to start until the first DAG finishes.
I reduced concurrency=6. The DAG is running only 6 parallel tasks, however this is not solving the issue that the other tasks in other DAGs don't start.
There is no other global configuration to limit the number of running tasks, other smaller dags usually run in parallel.
What can be the issue here?
Ariflow version: 2.1 with Local Executor with Postgres backend running on a 20core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. This could be related to Airflow using the mini-scheduler.
When a task is finished Task supervisor process perform a "mini scheduler" attempting to schedule more tasks of the same DAG. This means that the DAG will be finished quicker as the downstream tasks are set to Scheduled mode directly however one of it's side effect that it can cause starvation for other DAGs in some circumstances. A case like you present where you have one very big DAG that takes very long time to complete and starts before smaller DAGs may be the exact case where stravation can happen.
Try to set schedule_after_task_execution = False in airflow.cfg and it should solve your issue.
Why don't you use the option to invoke the task after the previous one is finished?
In the first DAG, insert the call to the next one as follows:
trigger_new_dag = TriggerDagRunOperator(
task_id=[task name],
trigger_dag_id=[trigered dag],
dag=dag
)
This operator will start a new DAG after the previous one is executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html
In our Big Data project there are ~3000 tables to load, all of these tables should be processed by a separate DAG in Airflow.
In our solution a single python file generates every type of table loaders so they can be triggered separately via REST API on event-based manner via Cloud Function.
Therefore we generate our DAGs by using:
Airflow variables used for the DAG generator logic
list of table names to generate
table type: insert append, truncate load, scd1, scd2
Airflow variables used by the specific operators of tables loader DAGs, e.g.:
RR_TableN = {} // python dict for operator handling RawToRaw
RC_TableN = {} // python dict for operator handling RawToCuration
user defined macros:
we try not to "static python codes" in between task definitions, because they would be executed during DAG-generation process
user defined macros are evaluated only in DAG-execution time
Unfortunately we are bound to version Airflow v1.x.x
Problem:
We have noticed that the Airflow/Cloud Composer is signigicantly slower between the task executions when multiple DAGs are generated.
When only 10-20 DAGs are generated the time between the Task executions is much faster then we have 100-200 DAGs.
When 1000 DAGs are generated it takes minutes to start a new task after finishing a preceeding task for a given DAG even when a no other DAGs are executed.
We don't understand why the Task execution times are affected that hard by the number of generated DAGs.
Shouldn't be near constant time for Airflow to search in it's metabase for the required parameters for the TaskInstances?
We are not sure if the Cloud Composer is configured/scaled/managed properly by Google.
Questions:
What's the reason behind this slowdown from Airflow's side?
How could we reduce the waiting times between the task executions and speed up the whole process?
Is this a "bad design pattern" what we are implementing (generator and user defined macros processing Airflow variables)?
If so, how could we do similar (table separated DAGs, single codebase etc.) in a more effective way?
This is a very simple example of the generator code what we use:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
def create_dag(dag_id, schedule, dag_number, default_args):
def example(*args):
print('Example DAG: {}'.format(str(dag_number)))
dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)
with dag:
t1 = PythonOperator(task_id='example', python_callable=example)
return dag
for dag_number in range(1, 5000):
dag_id = 'Example_{}'.format(str(dag_number))
default_args = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
globals()[dag_id] = create_dag(dag_id, '#daily', dag_number, default_args)
Yes. That is a known problem. It's been fixed in Airflow 2.
It's inherent to how processing of DAG files were done in Airflow 1 (mainly about the number of queries generated).
Other than migrating to Airflow 2, there is not much you can do. Fixing that required complete refactoring and half-rewriting of Airflow scheduler logic.
One way to mitigiate it - you could potentially, rather than generate all DAGs from single file, split it to many of those. For example rather than generating DAG objects in the single Python file, you could generate 3000 separate, dynamically generated small DAG files. This will scale much better.
However, the good news is that in Airflow 2 this is Many many times faster and scalable. And Airlfow 1.10 reached EOL and is not supported any more and will not receive any more updates. So rather than changing process I'd heartily recommend to migrate.
The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf. The problem with this is that this is not the intended way for Airflow to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 Dagruns per day.
Run a DAG every period for processing all documents created in the collection for that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if a document takes longer than other documents do to be processed by a task, those other documents are waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means running the same DAG at the exact same start time (in the case that we receive two documents at the same time is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a flow that we may take advantage of is the ease of parametriziation. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
import time
from prefect import task, Flow, Parameter
from threading import Thread
def stream():
for x in range(10):
yield x
time.sleep(1)
#task
def extract(x):
# If 'x' referenced a document, in this step we could load that document
return x
#task
def transform(x):
return x * 2
#task
def load(y):
print("Received y: {}".format(y))
with Flow("ETL") as flow:
x_param = Parameter('x')
e = extract(x_param)
t = transform(e)
l = load(t)
for x in stream():
thread = Thread(target=flow.run, kwargs={"x": x})
thread.start()
You could change trigger_rule from "all_success" to "all_done"
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
And also could create a branch that processes failed documents with trigger_rule set to "one_failed" to move processes those failed documents somehow differently (e.g. move to a "failed" folder and send a notification)
I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Choking points might be
executor - use Celery Executor instead of Local Executor for example
backend database - monitor and tune as necessary (indexes, proper storage etc)
webserver - well, for thousands of dagruns, tasks etc.. perhaps only use webeserver for dev/qa environments, and not for production where you have higher rate of task/dagruns submissions. You could use cli etc instead.
Another approach is scaling out by running multiple Airflow instances - partition documents let's say to ten buckets, and assign each partition's documents to just one Airflow instance.
I'd process the heavier tasks in parallel and feed successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish until moving downstream but, this would still be well more acceptable than spawning 1 dag for each record, something in these lines:
Task 1: read mongo filtering by some timestamp (remember idempotence) and feed tasks (i.e. via xcom);
Task 2: do stuff in paralell via PythonOperator, or even better via K8sPod, i.e:
def thread_fun(ret):
while not job_queue.empty():
job = job_queue.get()
try:
ret.append(stuff_done(job))
except:
pass
job_queue.task_done()
return ret
# Create workers and queue
threads = []
ret = [] # a mutable object
job_queue = Queue(maxsize=0)
for thr_nr in appropriate_thread_nr:
worker = threading.Thread(
target=thread_fun,
args=([ret])
)
worker.setDaemon(True)
threads.append(worker)
# Populate queue with jobs
for row in xcom_pull(task_ids=upstream_task):
job_queue.put(row)
# Start threads
for thr in threads:
thr.start()
# Wait to finish their jobs
for thr in threads:
thr.join()
xcom_push(ret)
Task 3: Do more stuff coming from previous task, and so on
We have built a system that queries MongoDB for a list, and generates a python file per item containing one DAG (note: having each dag have its own python file helps Airflow scheduler efficiency, with it's current design) - the generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.
Is there a way specify that a task can only run once concurrently? So in the tree above where DAG concurrency is 4, Airflow will start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no order between the tasks. These tasks are independent but related in purpose and therefore kept in one DAG so as to new create an excessive number of single task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it start all 4 tasks and only start a task in next if same task in previous run is done.
I may have mis-understood your question, but I believe you are wanting to have all the tasks in a single dagrun finish before the tasks begin in the next dagrun. So a DAG will only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dotpoint): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
Airflow operator documentation describes argument task_concurrency. Just set it to one.
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So the future DAGs will wait for the previous ones to finish successfully before executing.
On airflow.cfg under [core]. You will find
dag_concurrency = 16
//The number of task instances allowed to run concurrently by the scheduler
you're free to change this to what you desire.
I know in the cfg I can set the parallelism, but is there a way to do it per task, or at least per dag?
dag1=
task_id: 'download_sftp'
parallelism: 4 #I am fine with downloading multiple files at once
task_id: 'process_dimensions'
parallelism: 1 #I want to make sure the dimensions are processed one at a time to prevent conflicts with my 'serial' keys
task_id: 'process_facts'
parallelism: 4 #It is fine to have multiple tables processed at once since there will be no conflicts
dag2 (separate file)=
task_id: 'bcp_query'
parallelism: 6 #I can query separate BCP commands to download data quickly since it is very small amounts of data
You can create a task pool through the web gui and limit the execution parallelism by specifying the specific tasks to use that pool.
Please see: https://airflow.apache.org/concepts.html#pools
The number of active DAG runs can be controlled with the below parameter(present in airflow.cfg configuration file), its applicably globally.
By default, its set to 16, change it to 1 ensures only one instace of dag at a time and rest gets queued.
#The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
How to limit Airflow to run only 1 DAG run at a time? --> Suggests how to control concurrency per dag