Set up Airflow production environment - airflow

I'm a newbie at using Airflow. I went through many Airflow tutorials, and I can say that all are about development environments using a docker-compose file or files. I'm facing a problem at work setting up a production environment properly. My goal is to have a cluster composed of 3 EC2 virtual machines. Can anyone share best practices for installing Airflow on that cluster?
I went through many tutorials on the internet.

Airflow has 4 main components:
Webserver: a stateless service which exposes the UI and the REST API of Airflow
Scheduler: a stateless service which processes the DAGs and runs them; it's the main component
Worker: a stateless service which executes the tasks
Metadata: the database of Airflow where the state is stored, and it manages the communication between the 3 other components
And Airflow has 4 main executors:
LocalExecutor: the scheduler runs tasks by itself by spawning a process for each task, and it works in a single host -> not suitable for your need
CeleryExecutor: the most used scheduler, you can create one or multiple schedulers (for HA), and a group of Celery workers to run the tasks, you can scale it on different nodes
DaskExecutor: similar to CeleryExecutor but it uses Dask instead of Celery, not much used, and there are not many resources around it
KubernetesExecutor: it runs each task in a K8S pod, and since it's based on Kubernetes, it's very scalable, but it has some drawbacks.
For your use case, I recommend using CeleryExecutor.
If you can use EKS instead of EC2, you can use the helm chart to install and configure the cluster. And if not, you have other options:
run the services directly on the host:
pip install apache-airflow[celery]
# run the webserver
airflow webserver
# run the scheduler
airflow scheduler
# run the worker
airflow celery worker
You can decide how many schedulers, workers and webservers you want to execute, and you can distribute them on the 3 nodes, for ex: node1(1 scheduler, 1 webserver, 1 worker), node2(1 scheduler, 2 workers), node3(1 webserver, 2 workers), and you need a DB, you can use postgres from AWS RDS, or create it on one of the nodes (not recommended).
using docker: same as the first solution, but you run containers instead of running the services directly on the host
using docker swarm: you can connect the 3 nodes to create a swarm cluster, and manage the config from one of the nodes, this gives you some features which are not provided by the first 2 solutions, and it's similar to K8S. (doc)
For the 3 solutions, you need to create an airflow.cfg file that contains the configurations and the DB creds, and you should set the executor conf to CeleryExecutor.
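A preflight check for the airflow.cfg described in the answer above, as an added example that is not part of the original answer or of Airflow's own code: run on each node, it confirms the executor is CeleryExecutor, the metadata DB is not SQLite, the Celery broker and result backend are set, and the file holding the DB creds is readable only by its owner. The function names, the sample config and its hostnames and passwords are constructed for illustration.

```python
import configparser
import os
import stat
import tempfile

def audit_airflow_cfg(path):
    """Return a list of findings for one node's airflow.cfg (empty list = pass)."""
    findings = []
    # The file holds DB and broker credentials: only its owner should read it.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        findings.append(f"permissions {oct(mode)} expose credentials; use 0600")
    cfg = configparser.ConfigParser(interpolation=None)  # values may contain '%'
    cfg.read(path)
    executor = cfg.get("core", "executor", fallback="")
    if executor != "CeleryExecutor":
        findings.append(f"[core] executor is {executor!r}, expected 'CeleryExecutor'")
    # Airflow 2.3+ keeps the metadata DB URL in [database]; older releases in [core].
    db_url = cfg.get("database", "sql_alchemy_conn",
                     fallback=cfg.get("core", "sql_alchemy_conn", fallback=""))
    if not db_url or db_url.startswith("sqlite"):
        findings.append("metadata DB must be a shared server such as Postgres, not SQLite")
    for key in ("broker_url", "result_backend"):  # must match on every node
        if not cfg.get("celery", key, fallback=""):
            findings.append(f"[celery] {key} is not set")
    return findings

# Constructed sample config: hostnames and passwords are placeholders.
SAMPLE_CFG = """\
[core]
executor = CeleryExecutor
[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:CHANGE_ME@db.example.internal:5432/airflow
[celery]
broker_url = redis://:CHANGE_ME@broker.example.internal:6379/0
result_backend = db+postgresql://airflow:CHANGE_ME@db.example.internal:5432/airflow
"""

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "airflow.cfg")
        with open(path, "w") as fh:
            fh.write(SAMPLE_CFG)
        os.chmod(path, 0o644)  # loose: credentials readable by every local user
        loose = audit_airflow_cfg(path)
        os.chmod(path, 0o600)  # tight: owner only
        return loose, audit_airflow_cfg(path)

if __name__ == "__main__":
    print(demo())
```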


Migrating to ECS Fargate from EKS

I'm currently in the process of migrating 3 applications from Elastic Kubernetes Service (EKS) to ECS Fargate. Each application is built with Node.js. The current setup seems to be only 1 load balancer in front of one application and the other two applications are accessed through that one load balancer. This is currently how all three applications are accessed:
The front end of each application is being powered by an nginx proxy in EKS. I'm not entirely sure if I need nginx to be in ECS Fargate because the application load balancer I'm planning to use will have an SSL cert integrated with it for redirects to HTTPS from HTTP. I'm a little unclear how to approach moving these applications to Fargate. Additionally, the third app has 3 additional functions:
Apollo GraphQL (abstraction layer between the front end & back end)
File Manager
This functionality also needs to be implemented on the Fargate side.
Currently I have setup one ECS Fargate cluster, one ECS Service, and one task definition. The task definition currently has the following 7 ECR images:
nginx ??
All of these images are stored in ECR. However I don't believe I need nginx in this Fargate cluster.
I'm a little unsure how to approach the architecture for this set of applications. It seems I can only have one task definition running on a service, that's why all containers were implemented into one task definition. The service can then be associated with an application load balancer where I set path based routing to access each application.
Any advice on how to approach this migration would be appreciated.
Each Kubernetes Replica Set should be converted to an ECS Service. Each Kubernetes Pod would be converted to an ECS Task.
Kubernetes Replica Set == ECS Service
Kubernetes Pod == ECS Task
If you had multiple Replica Sets in Kubernetes, in order to scale your pods independently, then in order to have the same scalability in ECS you would configure them as separate services with independent scaling configurations.
You are correct in that you probably don't need the Nginx container in ECS.
It seems I can only have one task definition running on a service, that's why all containers were implemented into one task definition.
Services can communicate with each other. You would enable ECS Service Discovery to facilitate that. However it is fine to have them all in the same Task/Service if they don't need to be scaled out independently.
Also, multiple services can be associated with a single Application Load Balancer by creating different listener rules in the load balancer that map to different Target Groups, if that is something you need. You might need to have multiple Target Groups even if you only have a single ECS Service, because you will need to map different load balancer listeners to different containers in your task. That basically allows the Application Load Balancer to perform the job that Nginx was doing in Kubernetes.

Airflow doesn't "see" that the underlying Kubernetes pod completed

We are using a hosted Airflow 1.10.2 in Google Composer 1.7.5 to launch jobs via the KubernetesPodOperator (tasks that will be run in a Kubernetes pod inside a worker cluster).
There have been several occasions in which the Kubernetes pod itself successfully completes, but Airflow doesn't "see" that the pod has completed (it doesn't get the memo), so Airflow thinks the pod is still running and doesn't move on to the next task.
We are planning to move to Composer 2 with Airflow 2.1.4, which I'm fairly confident manages pods and communication with Kubernetes better, but...
... is there a "quick" tweak we can do? Even a link on how to start investigating would be helpful.
Thank you in advance.

Run Airflow Webserver and Scheduler on different servers

I was wondering if Airflow's scheduler and webserver daemons could be launched on different server instances?
And if it's possible, why not use a serverless architecture for the Flask web server?
There are a lot of resources about multi-node clusters for workers, but I found nothing about splitting the scheduler and webserver.
Has anyone already done this? And what may be the difficulties I will be facing?
I would say the minimum requirement would be that both instances should have:
Read(-write) access to the same AIRFLOW_HOME directory (for accessing DAG scripts and the shared config file)
Access to the same database backend (for accessing shared metadata)
Exactly the same Airflow version (to prevent any potential incompatibilities)
Then just try it out and report back (I am really curious ;) ).

Create Kubernetes Pod Network Map

I am looking to map out various network connections between pods in a namespace to understand which pod is talking to which other pods. Is there a way to query the etcd to get this information?
There are many tools to visualize k8s topology.
In order of GitHub stars:
Cockpit Project: Cockpit makes GNU/Linux discoverable. See your server in a web browser and perform system tasks with a mouse. It’s easy to start containers, administer storage, configure networks, and inspect logs.
Weave Scope (GitHub: weaveworks/scope) is a troubleshooting and monitoring tool for Docker and Kubernetes clusters. It can automatically generate applications and infrastructure topologies which can help you to identify application performance bottlenecks easily. You can deploy Weave Scope as a standalone application on your local server/laptop, or you can choose the Weave Scope Software as a Service (SaaS) solution on Weave Cloud. With Weave Scope, you can easily group, filter or search containers using names, labels, and/or resource consumption.
spekt8/spekt8: Visualize your Kubernetes cluster in real time
SPEKT8 is a new visualization tool for your Kubernetes clusters. It automatically builds logical topologies of your application and infrastructure, which enable your SRE and Ops team to intuitively understand, monitor, and control your containerized, microservices based application. Simply deploy our containerized application directly into your Kubernetes cluster.
KubeView (GitHub: benc-uk/kubeview: Kubernetes cluster visualiser and graphical explorer)
KubeView displays what is happening inside a Kubernetes cluster, it maps out the API objects and how they are interconnected. Data is fetched real-time from the Kubernetes API. The status of some objects (Pods, ReplicaSets, Deployments) is colour coded red/green to represent their status and health.
Kubernetes Topology Graph:
Provides a simple force directed topology graph for kubernetes items.
You can try to use Weave Scope to make a graphical map of your Kubernetes cluster.
It will generate a map of your processes, containers and hosts in real time. You can also get logs from containers and run some diagnostic commands via the web UI.
To install on Kubernetes you can run:
kubectl apply -f "$(kubectl version | base64 | tr -d '\n')"
After launch you don't need to configure anything, Scope will listen to your pods and network and make a map of your network.

Airflow and Docker Containers

I am running Airflow in containers on AWS ECS: 1 scheduler, 2 web servers, and multiple Celery workers.
From what I have seen the only thing that is affected when running them in containers is that the web servers are unable to access the workers' port on 8793 to retrieve logs from the workers.
Is that the only thing that is affected when running these in containers?
Yes because you can only map one port from docker container to host instance. I use a similar setup and logs are the only main issue I have faced. There are different ways to solve this though. You can use other logging services on the container which push the logs to CloudWatch or Fluentd etc.
