Minimum hardware requirements for Apache Airflow cluster - airflow

What are the minimum hardware requirements for setting up an Apache Airflow cluster.
Eg. RAM, CPU, Disk etc for different types of nodes in the cluster.

I have had no issues using very small instances in pseudo-distributed mode (32 parallel workers; Postgres backend):
RAM 4096 MB
CPU 1000 MHz
VCPUs 2 VCPUs
Disk 40 GB
If you want distributed mode, you should be more than fine with that if you keep it homogenous. Airflow shouldn't really do heavy lifting anyways; push the workload out to other things (Spark, EMR, BigQuery, etc).
You will also have to run some kind of messaging queue, like RabbitMQ. I think they take Redis too. However, this doesn't really dramatically impact how you size.

We are running the airflow in AWS with below config
t2.small --> airflow scheduler and webserver
db.t2.small --> postgres for metastore
The parallelism parameter in airflow.cfg is set to 10 and there are around 10 users who access airflow UI
All we do from airflow is ssh to other instances and run the code from there

Related

How to define number of slots in a Airflow pool

I'm using Airflow through Cloud Composer (Image: composer-2.0.29-airflow-2.3.3). I have defined 5 DAGS that run concurrently with 22 tasks run concurrently (max) distributed among the 5 DAGS. These DAGS are in the default-pool with default number of slots set to 128.
My composer instance has:
1 Scheduler: 0.5 vCPUs, 1.875 GB memory, 1 GB storage
Worker: 0.5 vCPUs, 1.875 GB memory, 1 GB storage
Autoscaling worker: from 1 to 3.
I would like to create different pools to separate my 5 systems. How do I define the number of slots in each pool? Suppose a pool has 1 DAG with 10 tasks (with 5/10 concurrent tasks). How many slots should I assign to each task?
DAG example:
task1.x is ingestion of JDBC table; while task2.x is update of the corresponding BigQuery table.
Thank you all!
Airflow pools are designed to avoid overwhelmed on external systems used by a group of tasks. For example, if you have some tasks in different dags which use a machine learning model API, a RDBMS, an API with quotas or any other system with limited scaling, you can use an Airflow pool to limit the number of parallel tasks which interact with this system.
In your case, you have two systems, JDBC database and BigQuery. You need to create just two pools, jdbc_pool and bigquery_pool, and assign all the tasks (form all the dags) which interact with the jdbc table to the first one and assign all the tasks which interact with biquery to the second one. For the slots, you can define them based on the performance of each system, and the computational weight of each task.
If you have a monitoring tool (prometheus, datadog, ...), you can run one of the tasks and watch the resources usage on your db, lets assume that it uses 10% of the resources, in this case you can create a pool with 8 slots to attend 80% of resources usage (you should avoid using 100% of the resources to avoid the problems when there is unexpected load). Then for the pool slots of each task:
if all the tasks are similar, you can use pool_slots=1 for all the tasks: max 8 parallel tasks with 80% of resources usage
if you have some tasks which are more complicated than the task you have tested (they use more than 10% of the db resources), you can use a higher value for pool_slots for these tasks based on the resources usage: assume there is a task which consumes 20% of the resources, you can use pool_slots=2 only for this tasks and keep 1 for the others, in this case you can have 8 parallel simple tasks or 6 parallel simple tasks with this heavy task with 80% of resources usage in the two cases.
For bigquery_pool, you need to check what are the quotas, but I think you can use a high value without any problem where it is a very scalable serverless DWH.
If you just want to limit the number of executed tasks in each worker to avoid OOM problem for ex, you can set the worker concurrency conf.
And if you want to limit the number of executed tasks in the whole Airflow server, you can set the parallelism conf.

How can I configure yarn cluster for parallel execution of Applications?

When I run spark job on yarn cluster, Applications are running in queue. So how can I run in parallel number of Applications?.
I suppose your YARN scheduler option is set to FIFO. Please change it to FAIR or capacity scheduler.Fair Scheduler attempts to allocate resources so that all running applications get the same share of resources.
The Capacity Scheduler allows sharing of a Hadoop cluster along
organizational lines, whereby each organization is allocated a certain
capacity of the overall cluster. Each organization is set up with a
dedicated queue that is configured to use a given fraction of the
cluster capacity. Queues may be further divided in hierarchical
fashion, allowing each organization to share its cluster allowance
between different groups of users within the organization. Within a
queue, applications are scheduled using FIFO scheduling.
If you are using capacity scheduler then
In spark submit mention your queue --queue queueName
Please try to change this capacity scheduler property
yarn.scheduler.capacity.maximum-applications = any number
it will decide how many application will run parallely
By default, Spark will acquire all available resources when it launches a job.
You can limit the amount of resources consumed for each job via the spark-submit command.
Add the option "--conf spark.cores.max=1" to spark-submit. You can change the number of cores to suite your environment. For example if you have 100 total cores, you might limit a single job to 25 cores or 5 cores, etc.
You can also limit the amount of memory consumed: --conf spark.executor.memory=4g
You can change settings via spark-submit or in the file conf/spark-defaults.conf. Here is a link with documentation:
Spark Configuration

How do I setup an Airflow of 2 servers?

Trying to split out Airflow processes onto 2 servers. Server A, which has been already running in standalone mode with everything on it, has the DAGs and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server which would host the metadata database on MySQL.
Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? Would airflow scheduler has to run on the server that has the DAGs right? Or does it have to run on every server in a cluster? Confused as to what dependencies there are between the processes
This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilizaiton of data across the cluster and provide faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon incase there are too many HTTP requests coming for one machine to handle or if you want to provide Higher Availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it is also setup to be a cluster for High Availability. Setup a Load Balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow
All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.

AWS OpsWorks - Slow commands

My OpWorks commands are taking anywhere between 8-15 minutes for a single instance. This is extremely painful for deployments which should really only take 2-3 minutes.
Are these timings usual for a PHP application with no extra deployment recipes?
Check that you are not using a t1.* or t2.* instance. Those instances can get really slow if you have depleted cpu credits (their CPU capacity are throttled).
Setup tasks performed by OpsWorks can deplete CPU capacity in those instances, before they are ready for service.

How does deis scheduler work?

I'm looking at the documentation of deis and I'm not sure how the scheduler works.
Essentially I want to deploy small apps. My idea is to have different size of apps based on memory (64M, 128M, 256M and 512M).
Then I would have a cluster of small machines (1 CPU, ~3GB) and I want to deploy/undeploy any number of apps, where most of them will only have one instance.
So in this case I need a scheduler that looks at the free memory on each node and deploy the app to the node with more resources available (in this case memory based).
For example, if I have 2GB available for apps I could have the following balancing:
Node1: App1 (256M), App2 (256M), App3 (512M) => Total 1.5 GB
Node2: App4 (512M), App5 (128M), App6 (128M), App7 (256M), App8 (512M), App9 (256M) => Total 1.75 GB
Then if I need to deploy an app that will consume 512M, the scheduler should deploy the app in Node1.
So I wanted to understand if deis could be userful for this scenario.
Under the hood, Deis uses fleet as a scheduler. Currently fleet awards a job to whichever machine in the cluster responds first, and has no understanding of machine load. Smarter scheduling is a priority of the fleet project, and as it improves, Deis improves.

Resources