I am using composer 2.0.2 (Airlfow 2.2.5) in GCP on a ‘small’ environment which uses the following resources:
Environment resources
The environment has 90 DAGs which are triggered by API calls as our architecture is event based. The other day, I ran a worst case scenario test where the 90 DAGs were triggered nearly at the same time. As a result, many of the DAGs’ tasks were in ‘skipped’ state even though its dependencies were in ‘success’, then the DAGs failed due to timeout.
Based on what I saw on my monitoring dashboard and what I found in documentation/internet, these anomalies happened because the workers (which can scale up to 1.5) hit their CPU limit:
CPU limit reached
In this context, I am looking for ways to reduce CPU usage even if it means having longer execution times. Multiple forums and blogs mention Airflow parameters that could be useful such as AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL but I am still not sure.
Please notice that scaling the environment resources is not an option due to cost restrictions.
Related
I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Other option would be to have one task that kicks off the 10k containers and monitors it from there.
I have no experience with Step Functions, but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Step Functions provide out of the box maintenance. It has high availability and scalability that is required for your use-case, for Airflow we'll have to do to it with auto-scaling/load balancing on servers or containers (kubernetes).*
Both Airflow and Step Functions have user friendly UI's. While Airflow supports multiple representations of the state machine, Step Functions only display state machine as DAG's.
As of version 2.0, Airflow's Rest API is now stable. AWS Step Functions are also supported by a range of production graded cli and SDK's.
Airflow has server costs while Step Functions have 4000/month free step executions (free tier) and $0.000025/step after that. e.g. if you use 10K steps for AWS Batch that run once daily, you will be priced $0.25 per day ($7.5 per month). The price for Airflow server (t2.large ec2 1 year reserved instance) is $41.98 per month. We will have to use AWS Batch for either case.**
AWS Batch can integrate to both Airflow and Step Functions.
You can clear and rerun a failed task in Apache Airflow, but in Step Functions you will have to create a custom implementation to handle that. You may handle automated retries with back-offs in Step Functions definition as well.
For failed task in Step Functions you will get a visual representation of failed state and the detailed message when you click it. You may also use aws cli or sdk to get the details.
Step Functions use easy to use JSON as state machine definition, while Airflow uses Python script.
Step Functions support async callbacks, i.e. state machine pauses until an external source notifies it to resume. While Airflow has yet to add this feature.
Overall, I see more advantages of using AWS Step Functions. You will have to consider maintenance cost and development cost for both services as per your use case.
UPDATES (AWS Managed Workflows for Apache Airflow Service):
*With AWS Managed Workflows for Apache Airflow service, you can offload deployment, maintenance, autoscaling/load balancing and security of your Airflow Service to AWS. But please consider the version number you're willing to settle for, as AWS managed services are mostly behind the latest version. (e.g. As of March 08, 2021, the latest version of open source airflow is 2.01, while MWAA allows version 1.10.12)
**MWAA costs on environment, instance and storage. More details here.
I have used both Airflow and Step Functions in my personal and work projects.
In general I liked step functions but the fact that you need to schedule the execution with Event Bridge is super annoying. Actually I think here Airflow could just act as a triggered for the step functions.
If Airflow would be cheaper to manage, I would always opt for it because I find managing Json based pipelines a hustle whenever you need to detour from the main use case. This always happen for me somehow.This becomes even a more complex issue when you need to have source control.
This one is a bit more subjective assessment but I find the monitoring capability of Airflow far greater than for step functions.
Also some information about the usage of Airflow vs Step functions
Aws currently has managed airflow which is priced per hour and you don’t need to have dedicated ec2. On the other hand step functions are aws lambdas that have an execution time limit of 15min which makes them not the best candidate for a long running pipelines
I ran the following test command:
airflow test events {task_name_redacted} 2018-12-12
...and got the following output:
Dependencies not met for <TaskInstance: events.{redacted} 2018-12-12T00:00:00+00:00 [None]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (16) for this task's DAG 'events' has been reached.
[2019-01-17 19:47:48,978] {models.py:1556} WARNING -
--------------------------------------------------------------------------------
FIXME: Rescheduling due to concurrency limits reached at task runtime. Attempt 1 of 6. State set to NONE.
--------------------------------------------------------------------------------
[2019-01-17 19:47:48,978] {models.py:1559} INFO - Queuing into pool None
My Airflow is configured with a maximum concurrency of 16. Does this mean that I cannot test a task when the DAG is currently running, and has used all of it's task slots?
Also, it was a little unclear from the docs, but does the airflow test actually execute the task, as in if it was a SparkSubmitOperator, it would actually submit the job?
While I am yet to reach that phase of deployment where concurrency will matter, the docs do give a fairly good indication of problem at hand
Since at any point of time just one scheduler is running (and you shouldn't be running multiple anyways), indeed it appears that irrespective of whether the DAG-runs are live-runs or test-runs, this limit will apply on them collectively. So that is certainly a hurdle.
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
But beware that merely increasing this number (assuming you have big-enough boxes for hefty workers / multiple workers), several other configurations will have to be tweaked as well to achieve the kind of parallelism I sense you want.
They are all listed under [core] section
# The amount of parallelism as a setting to the executor. This
defines the max number of task instances that should run
simultaneously on this airflow installation
parallelism = 32
# When not using pools, tasks are run in the "default pool", whose
size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
But we are still not there, because once you spawn so many tasks simultaneously, the backend metadata-db will start choking. While this is likely a minor problem (and might not be affecting unless you have some real huge DAGs / very large no of Variable interactions in your tasks), its still worth noting as a potential roadblock
# The SqlAlchemy pool size is the maximum number of database
connections in the pool. 0 indicates no limit.
sql_alchemy_pool_size = 5
# The SqlAlchemy pool recycle is the number of seconds a connection
can be idle in the pool before it is invalidated. This config does not
apply to sqlite. If the number of DB connections is ever exceeded, a
lower config value will allow the system to recover faster.
sql_alchemy_pool_recycle = 1800
# How many seconds to retry re-establishing a DB connection after
disconnects. Setting this to 0 disables retries.
sql_alchemy_reconnect_timeout = 300
Needless to say, all this is pretty much futile unless you pick the right executor; SequentialExecutor, in particular is only intended for testing
# The executor class that airflow should use. Choices include SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor,
KubernetesExecutor
executor = SequentialExecutor
But then params to BaseOperator like depends_on_past, wait_for_downstream are there to spoil the party as well
Finally I leave you with this link related to Airflow + Spark combination: How to submit Spark jobs to EMR cluster from Airflow?
(Pardon me if the answer confused you more than you already were, but..)
What is the practical limit on the number of DAGS in an airflow system?
We are seeing severe delays after a couple of hundred DAGS are created.
Is anyone running 1000's of DAGS?
It depends on how you run Airflow and how much resources you provide it. You can run Airflow in a distributed mode with Celery and Master and Worker having high memory and vCores, if you have huge number of tasks.
I am trying to diagnose an under-performing airflow pipeline and am wondering what kind of performance I should expect out of the airflow scheduler in terms similar to "tasks scheduled per second".
I have few queued jobs and many of my tasks finish in seconds so I suspect the scheduler is the limiting component and it is my fault for having many quick tasks. Still, I would rather not rewrite my DAGs if it can be avoided.
What can I do to increase the rate at which the scheduler queues tasks?
Pipeline Details
Here is what my current airflow.cfg looks like.
I only have two dags running. One is scheduled every 5 min and the other is rarely triggered by the first. I am currently trying to backfill several years at this frequency, but may need to change my approach:
As for worker nodes: I currently have 4 fairly powerful servers running at less than 10% resource usage in disk, network, cpu, RAM, swap. Toggling 3 of the workers off has no impact on my task throughput and the server left on barely even registers the change in workload.
There are a number of config values in your airflow.cfg that could be related to this.
Under [core]:
parallelism: Total number of task instances that can run at once.
dag_concurrency: Limit of task instances that can run per DAG run, may need to bump if you have many parallel tasks. Can override when defining a DAG.
non_pooled_task_slot_count: Limit of tasks without a pool configured that can run at once.
max_active_runs_per_dag: The maximum number of active DAG runs per DAG. If you're triggering runs manually or there's a backup of DAG runs scheduled with a short interval. Can override when defining a DAG.
Under [scheduler]:
schedule_heartbeat_sec: Defines how often the scheduler runs, try it out with lower values.
min_file_process_interval: Process each file at most once every N seconds. Set to 0 to never limit how often you process a file.
Under [worker]:
celeryd_concurrency: Number of workers celery will run with, so essentially number of task instances a worker can take at once. Matching the number of CPUs is a popular starting point, but can definitely go higher.
Last one is only if you're using the CeleryExecutor, which I'd definitely recommend if you're looking to increase your task throughput.
the Local Executor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates. I needed to change it. I need to know what is the difference between scheduler's "max_threads" and
"parallelism" in airflow.cfg ?
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with default value of
2. User should increase this value to a larger value(e.g numbers of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion. Not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.