When deploying Airflow in cluster mode, the DAG files on all worker nodes need to be kept consistent. When I modify a DAG file in the Airflow web UI with the airflow-code-editor plugin, only the copy on the worker node where the webserver runs is changed; the DAG files on the other worker nodes are not updated.
How can I resolve this problem?
I'm a newbie at using Airflow. I went through many Airflow tutorials, and I can say that all are about development environments using a docker-compose file or files. I'm facing a problem at work setting up a production environment properly. My goal is to have a cluster composed of 3 EC2 virtual machines. Can anyone share best practices for installing Airflow on that cluster?
I went through many tutorials on the internet.
Airflow has 4 main components:
Webserver: a stateless service that exposes the Airflow UI and REST API
Scheduler: a stateless service that processes the DAGs and triggers their runs; it's the main component
Worker: a stateless service that executes the tasks
Metadata database: the Airflow database where the state is stored; it mediates the communication between the 3 other components
And Airflow has 4 main executors:
LocalExecutor: the scheduler runs the tasks itself, spawning a process for each one; it works on a single host -> not suitable for your need
CeleryExecutor: the most widely used executor; you can run one or multiple schedulers (for HA) and a group of Celery workers to run the tasks, and you can scale it across different nodes
DaskExecutor: similar to the CeleryExecutor but it uses Dask instead of Celery; not much used, and there aren't many resources around it
KubernetesExecutor: it runs each task in a K8S pod, and since it's based on Kubernetes, it's very scalable, but it has some drawbacks.
For your use case, I recommend using the CeleryExecutor.
If you can use EKS instead of EC2, you can use the helm chart to install and configure the cluster. And if not, you have other options:
run the services directly on the host:
pip install apache-airflow[celery]
# run the webserver
airflow webserver
# run the scheduler
airflow scheduler
# run the worker
airflow celery worker
You can decide how many schedulers, workers, and webservers you want to run, and distribute them across the 3 nodes, e.g. node1 (1 scheduler, 1 webserver, 1 worker), node2 (1 scheduler, 2 workers), node3 (1 webserver, 2 workers). You also need a DB: you can use Postgres from AWS RDS, or create it on one of the nodes (not recommended).
using Docker: same as the first solution, but you run containers instead of running the services directly on the host
using Docker Swarm: you can connect the 3 nodes to create a swarm cluster and manage the config from one of the nodes; this gives you some features that are not provided by the first 2 solutions, and it's similar to K8S.
For all 3 solutions, you need to create an airflow.cfg file containing the configuration and the DB creds, and you should set the executor conf to CeleryExecutor.
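As a minimal sketch, the relevant settings can also be supplied as environment variables, which Airflow maps onto airflow.cfg sections (the hostnames and credentials below are placeholders, and depending on your Airflow version the metadata DB key may live under [core] or [database]):
# executor and metadata DB ([core] section)
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@db.example.com:5432/airflow
# Celery broker and result backend ([celery] section)
export AIRFLOW__CELERY__BROKER_URL=redis://broker.example.com:6379/0
export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@db.example.com:5432/airflow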
I was wondering if Airflow's scheduler and webserver daemons could be launched on different server instances?
And if it's possible, why not use a serverless architecture for the Flask webserver?
There are a lot of resources about multi-node clusters for workers, but I found nothing about splitting the scheduler and the webserver.
Has anyone already done this? And what difficulties might I be facing?
I would say the minimum requirement would be that both instances have
Read(-write) access to the same AIRFLOW_HOME directory (for accessing DAG scripts and the shared config file)
Access to the same database backend (for accessing shared metadata)
Exactly the same Airflow version (to prevent any potential incompatibilities)
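As a rough sketch under those assumptions (the shared mount path and DB connection string are placeholders), both instances point at the same AIRFLOW_HOME and metadata database, and each one starts only its own daemon:
# on both instances: same Airflow version, shared AIRFLOW_HOME (e.g. an NFS mount)
export AIRFLOW_HOME=/mnt/shared/airflow
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@db.example.com:5432/airflow
# instance 1
airflow webserver
# instance 2
airflow scheduler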
Then just try it out and report back (I am really curious ;) ).
Is it possible to execute an airflow DAG remotely via command line?
I know there is an airflow command line tool but it seems to allow executing from the server's terminal rather than from any external client.
I can think of two options here
Experimental REST API (preferable): use good-old GET / POST requests to trigger / cancel DAGRuns (see the curl sketch after this list)
Airflow CLI: SSH into the machine running Airflow and trigger the DAG via command-line
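For the first option, here is a minimal curl sketch against the experimental endpoint; the host and DAG id are placeholders, your auth setup may require credentials, and in Airflow 2.x the stable /api/v1/dags/<dag_id>/dagRuns endpoint replaces it:
# trigger a DagRun of a hypothetical DAG called my_dag
curl -X POST \
  http://airflow.example.com:8080/api/experimental/dags/my_dag/dag_runs \
  -H 'Content-Type: application/json' \
  -d '{"conf": {}}'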
I have currently set up the Airflow scheduler on Linux server A and the Airflow webserver on Linux server B. Neither server has Internet access. I have run the initdb on server A and keep all the DAGs on server A.
However, when I refresh the webserver UI, it keeps showing the error message:
This DAG isn't available in the webserver DagBag object
How do I configure the DAG folder for the webserver (server B) to read the DAGs from the scheduler (server A)?
I am using the BashOperator. Is the CeleryExecutor a must?
Thanks in advance
The scheduler has found your dags_folder and the DAGs inside it, and is scheduling them accordingly. The webserver, however, can "see" these DAGs only through their entries in the database; it can't find the corresponding files in its own dags_folder path.
You need to ensure that the dags_folder on both servers contains the same files and that the two are kept in sync with one another. This is out of scope for Airflow, and it won't handle it on your behalf.
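One common approach (just a sketch; the hostname and paths are assumptions) is a periodic rsync from the scheduler host to the webserver host, e.g. from cron, or having both hosts pull the same git repository:
# on server B (webserver), mirror the dags_folder from server A (scheduler)
rsync -az --delete serverA:/usr/local/airflow/dags/ /usr/local/airflow/dags/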
I have an Airflow service running normally on a remote machine, which can be accessed through the browser at the URL http://airflow.xxx.com
Now I want to dynamically upload DAGs from another machine to the Airflow at airflow.xxx.com and have those DAGs run automatically.
After reading the Airflow documentation (http://airflow.incubator.apache.org/), I found a way to dynamically create DAGs and run them automatically, which can be done on the Airflow machine airflow.xxx.com.
But I want to do it from another machine. How can I accomplish this? Is there something like webhdfs that lets me directly send commands to the remote Airflow?
You should upload your new DAG into the Apache Airflow DAG directory.
If you didn't set Airflow up in a cluster environment, you should have web-server, scheduler and worker all running on the same machine.
On that machine, if you did not amend airflow.cfg, you should have your dag directory in dags_folder = /usr/local/airflow/dags
If you can access the Airflow machine from the other machine through SFTP (or FTP), you can simply put the file in that directory.
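For example (the user and file name are placeholders), copying a new DAG file into that directory with scp; the scheduler will pick it up on its next dags_folder scan:
# copy the DAG file from the other machine into the remote dags_folder
scp my_new_dag.py airflow@airflow.xxx.com:/usr/local/airflow/dags/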