We currently have a bunch of independent jobs running on different servers & being scheduled with crontab. The goal would be to have a single view of all the jobs across the servers and whether they've run successfully etc.
Airflow is one of the tools we are considering using to achieve this. But our servers are configured very differently. Is it possible to set up airflow so that DAG1 (and the airflow scheduler & webserver) runs on server1 and DAG2 runs on server2 without RabbitMQ.
Essentially I'd like to achieve something like the first answer given here (or just at a DAG level): Airflow DAG tasks parallelism on different worker nodes
in the quickest & simplest way possible!
Thanks
You can checkout Running Apache-Airflow with Celery Executor in Docker.
To use celery, you can instantiate a redis node as a pod and proceed with managing tasks across multiple hosts.
The link above will also give you a starter docker-compose yaml to help you get started quickly with Apache Airflow on celery executor.
Is it possible to set up airflow so that DAG1 (and the airflow
scheduler & webserver) runs on server1 and DAG2 runs on server2
without RabbitMQ.
Airflow by default will try to use multiple hosts on Celery Executor and the division will always be on task level and not on DAG level.
This post might help you with spawning specific tasks on a specific worker node.
Related
Every time our team puts another requirements.txt file for our MWAA environment, it requires a restart.
Regardless of the environment being in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGS. I expect something to at least be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process and why is applied to the entire so-called MWAA environment?
Airflow DAG processor, Airflow workers and Airflow scheduler are reboot
but not Airflow web server
This can be confirmed checking their respective logs.
Beware, some long-running task can fail during a reboot.
We are setting up airflow for scheduling/orchestration , currently we have Spark python loads, and non-spark loads in different server and push files to gcp available in another server. Is there an option to decide to which worker nodes the airflow task are submitted? Currently we are using ssh connection to run all work loads. Our processing is mostly on-perm
Usage is celery executor model, How to we make sure that a specific task is run on its appropriate node.
task run a non spark server ( no spark binaries available)
task 2 executes PySpark submit - (This has spark binaries)
Task Push the files created from task 2 from another server/nodes ( Only this has the gcp utilities installed to push the files due to security reason ) .
If create a dag, is it possible to mention the task to execute on set of worker nodes ?
Currently we are having wrapper shell script for each task and making 3 ssh runs to complete these process. We would like to avoid such wrapper shell script rather use the inbuild have pythonOperator , SparkSubmitOperator, SparkJdbcOperator and SFTPToGCSOperator and make sure the specific task runs in specific server or worknodes .
In short , can we have 3 worker node groups and make the task to execute on a group of nodes based on the operations?
We can assign a queue to each worker node like
Start the airflow worker with mentioning the queue
airflow worker -q sparkload
airflow worker -q non-sparkload
airflow worker -q gcpload
The start each task with queue mentioned. Similar thread found as well.
How can Airflow be used to run distinct tasks of one workflow in separate machines?
New to Airflow, so apologies if this question doesn't really make sense. Is there a command or place in the webserver UI where one can see a list of all running workers? Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Reguarding this part of your question:
Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Take a look at the docs on executors.
Using celery requires some configuration changes. If you haven't configured airflow to use celery, then even if you start a celery worker, the worker won't pick up any tasks.
Conversely, if you have configured airflow to use celery, and you have not started any celery workers, then your cluster will not execute a single task.
If you are using SequentialExecutor (the default) or LocalExecutor (requires configuration), with both of these executors, tasks are executed by the scheduler, and no celery workers are used (and if you spun some up, then they wouldn't execute any tasks).
Regarding this:
Is there a way to view a list of all airflow workers?
If you have configured airflow to use celery, then you can run flower to see monitoring of celery workers. In airflow >= 2.0.0 flower is launched with airflow celery flower.
We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: No matter how you setup airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (current is airflow 1.9.0 at time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
Local executor executes the task on the same machine as the scheduler.
Celery Executor just puts tasks in a queue to be worked on the celery workers.
However, the question you are asking does apply to Celery workers. If you use Celery Executor you will probably have multiple celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage, and write the output file name to xcom. Then when you need to input the output from a previous task, you would read the file name from that task's Xcom and process that file.
Change datatype of column key in xcom table of airflow metastore.
Default datatype of key is: blob.
Change it to LONGBLOB. It will help you to store upto 4GB in between intermediate tasks.
Trying to split out Airflow processes onto 2 servers. Server A, which has been already running in standalone mode with everything on it, has the DAGs and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server which would host the metadata database on MySQL.
Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? Would airflow scheduler has to run on the server that has the DAGs right? Or does it have to run on every server in a cluster? Confused as to what dependencies there are between the processes
This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilizaiton of data across the cluster and provide faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon incase there are too many HTTP requests coming for one machine to handle or if you want to provide Higher Availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it is also setup to be a cluster for High Availability. Setup a Load Balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow
All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.