I'm a newbie at using Airflow. I went through many Airflow tutorials, and I can say that all of them are about development environments using a docker-compose file or files. I'm facing a problem at work setting up a production environment properly. My goal is to have a cluster composed of 3 EC2 virtual machines. Can anyone share best practices for installing Airflow on such a cluster?
Airflow has 4 main components:
Webserver: stateless service which exposes the UI and the REST API of Airflow
Scheduler: stateless service which processes the DAGs and triggers their runs; it's the main component
Worker: stateless service which executes the tasks
Metadata database: Airflow's database, where the state is stored; it also handles the communication between the 3 other components
And Airflow has 4 main executors:
LocalExecutor: the scheduler runs the tasks itself by spawning a process for each task, and it works only on a single host -> not suitable for your needs
CeleryExecutor: the most used executor; you can create one or multiple schedulers (for HA) and a group of Celery workers to run the tasks, and you can scale them across different nodes
DaskExecutor: similar to CeleryExecutor but it uses Dask instead of Celery; it is not widely used, and there aren't many resources around it
KubernetesExecutor: it runs each task in a K8S pod, and since it's based on Kubernetes, it's very scalable, but it has some drawbacks.
For your use case, I recommend using CeleryExecutor.
If you can use EKS instead of EC2, you can use the official Helm chart to install and configure the cluster; a minimal sketch of the Helm commands is shown below. If EKS is not an option, you have the other options listed after that sketch:
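A rough sketch of the Helm route, assuming the official apache-airflow/airflow chart (the release name and namespace below are just illustrative placeholders):
# add the official Airflow chart repo
helm repo add apache-airflow https://airflow.apache.org
helm repo update
# install (or upgrade) the chart; the chart exposes a top-level "executor" value
helm upgrade --install airflow apache-airflow/airflow \
    --namespace airflow --create-namespace \
    --set executor=CeleryExecutor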
run the services directly on the host:
pip install "apache-airflow[celery]"
# run the webserver
airflow webserver
# run the scheduler
airflow scheduler
# run the worker
airflow celery worker
You can decide how many schedulers, workers and webservers you want to run, and you can distribute them across the 3 nodes, for example: node1 (1 scheduler, 1 webserver, 1 worker), node2 (1 scheduler, 2 workers), node3 (1 webserver, 2 workers). You also need a DB: you can use Postgres from AWS RDS or create it on one of the nodes (not recommended). CeleryExecutor additionally needs a message broker such as Redis or RabbitMQ.
using docker: same as the first solution, but you run containers instead of running the services directly on the host (see the sketch after this list)
using docker swarm: you can connect the 3 nodes to create a swarm cluster and manage the configuration from one of the nodes; this gives you some features which are not provided by the first 2 solutions, and it's similar to K8S (see the Docker Swarm documentation).
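As a rough sketch of the Docker option, using the official apache/airflow image (the image tag, hostnames and credentials below are placeholders, not a full setup; on Airflow versions before 2.3 the DB connection is configured under AIRFLOW__CORE__SQL_ALCHEMY_CONN instead):
# run the scheduler as a container on one of the nodes
docker run -d --name airflow-scheduler \
    -e AIRFLOW__CORE__EXECUTOR=CeleryExecutor \
    -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow_pass@rds-host:5432/airflow \
    -e AIRFLOW__CELERY__BROKER_URL=redis://redis-host:6379/0 \
    -e AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow_pass@rds-host:5432/airflow \
    apache/airflow:2.5.0 scheduler
# the webserver and the celery workers are started the same way on the other nodes,
# replacing "scheduler" with "webserver" or "celery worker"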
For all 3 solutions, you need to create an airflow.cfg file containing the configuration and the DB credentials, and you should set the executor setting to CeleryExecutor; a minimal example is below.
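A minimal airflow.cfg sketch for this setup (hosts and credentials are placeholders; on Airflow versions before 2.3 sql_alchemy_conn lives under [core] instead of [database]):
[core]
executor = CeleryExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow_pass@your-rds-host:5432/airflow

[celery]
broker_url = redis://your-redis-host:6379/0
result_backend = db+postgresql://airflow:airflow_pass@your-rds-host:5432/airflow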
Unfortunately I deleted an Airflow deployment on EKS which was deployed via the Helm chart. Connections and variables are encrypted with the Fernet key in the database. When I deployed a new Airflow via Helm again, everything seems to be running fine. The only issue is that Airflow is not able to read the connections, and it shows the error below in the Connections tab:
cryptography.fernet.InvalidToken
I also checked the Fernet key in the Kubernetes secret as well as by exec-ing into the webserver pod. Both are the same.
Can someone help me, please?
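For reference, Airflow supports listing multiple Fernet keys and re-encrypting the existing rows via its documented key-rotation command; a hedged sketch, assuming the Fernet key from the deleted release can still be recovered from somewhere (both key values below are placeholders):
# the current key goes first, the old key second; Airflow decrypts with any key in the
# list but encrypts with the first one
export AIRFLOW__CORE__FERNET_KEY="new_fernet_key,old_fernet_key"
# re-encrypt existing connections/variables with the first key
# (the command is rotate_fernet_key on Airflow 1.10)
airflow rotate-fernet-key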
We have a requirement to send Airflow metrics to Datadog. I tried to follow the steps mentioned here:
https://docs.datadoghq.com/integrations/airflow/?tab=host
Following that, I included StatsD in the Airflow installation and updated the Airflow configuration file (steps 1 and 2).
After this point, I am not able to figure out how to send my metrics to Datadog. Do I follow the host configuration or the containerized configuration? For the host configuration, we have to update the datadog.yaml file, which is not in our repo; for the containerized version, they have specified how to do it for Kubernetes, but we don't use Kubernetes.
We are using Airflow by creating a Docker build and running it on Amazon ECS. We also have a Datadog agent running in parallel in the same task (not part of our repo). However, I am not able to figure out what configuration I need in order to send the StatsD metrics to Datadog. Please let me know if anyone has an answer.
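For context, the Airflow side of that setup (step 2 of the Datadog doc) looks roughly like the snippet below; in Airflow 2 these settings live under [metrics], in 1.10 under [scheduler], and pointing statsd_host at localhost assumes the Airflow container and the Datadog agent share the task's network namespace (e.g. awsvpc mode):
[metrics]
statsd_on = True
statsd_host = localhost   # placeholder; wherever the Datadog agent's DogStatsD listener is reachable
statsd_port = 8125        # default DogStatsD port
statsd_prefix = airflow
# the agent container may also need DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true if the
# traffic does not arrive on its loopback interface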
Hi, based on the Airflow docs I was able to set up cloud/remote logging.
Remote logging is working for DAG and task logs, but it is not able to back up or remotely store the following logs:
scheduler
dag_processing_manager
I am using the Airflow Docker image from Docker Hub.
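For context, this is roughly the remote logging configuration described in the docs (the bucket URL and connection id below are placeholders; use gs:// or s3:// depending on your provider):
[logging]
remote_logging = True
remote_base_log_folder = gs://my-airflow-logs-bucket   # placeholder bucket
remote_log_conn_id = google_cloud_default              # placeholder connection id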
We have a Google Cloud project with several VM instances and also a Kubernetes cluster.
I am able to easily access Kubernetes services with kubefwd, and I can ping and curl them. The problem is that kubefwd works only for Kubernetes, not for the other VM instances.
Is there a way to mount the network locally, so I could ping and curl any instance without it having a public IP, and with the same DNS as inside the cluster?
I would highly recommend rolling out a VPN server like OpenVPN. You can also run it inside of the Kubernetes cluster.
I have a make install-ready repo for you to check out at https://github.com/mateothegreat/k8-byexamples-openvpn.
Basically, OpenVPN runs inside of a container (inside of a pod), and you can set the routes that you want the client(s) to be able to see.
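A rough sketch of getting it up, assuming the repo's Makefile exposes the install target as described above (client config retrieval and route values depend on the repo, so check its README):
# clone the example repo mentioned above and install it into the cluster
git clone https://github.com/mateothegreat/k8-byexamples-openvpn.git
cd k8-byexamples-openvpn
make install    # target name taken from the description above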
I would not rely on kubefwd as it isn't production grade and will give you issues with persistent connections.
Hope this helps you out. If you still have questions/concerns, please reach out.