How to setup Airflow > 2.0 high availability cluster on centos 7 or above

How to setup Airflow > 2.0 high availability cluster on centos 7 or above - airflow

I want to setup HA for airflow(2.3.1) on centos7. Messaging queue - Rabbitmq and metadata db - postgres. Anybody knows how to setup it.

Your question is very large, because the high availability has multiple level and definition:
Airflow availability: multiple scheduler, multiple workers, auto scaling to avoid pressure, high storage volume, ...
The databases: a HA cluster for Rabbitmq and a HA cluster for postgres
Even if you have the first two levels, how many node you want to use? you cannot put everything in the same node, you need to run one service replica per node
Suppose you did that, and now you have 3 different nodes running in the same data center, what if there is a fire in the data center? So you need to use multiple nodes in different regions
After doing all of above, is there a risk for network problem? of course there is
If you just want to run airflow in HA mode, you have multiple option to do that on any OS:
docker compose: usually we use it for developing, but you can use it for production too, you can create multiple scheduler instances, with multiple workers, it can help you to improve the availability of your service
docker swarm: similar to docker compose with additional features (scaling, multi nodes, ...), you will not find much resources to install it, but you can use the compose files and just do some changes
kubernetes: the best solution, K8S can help you to ensure the availability of your services, easy install with helm
or just running the different services on your host: not recommended, because of manual tasks, and applying the HA is complicated

Related

How do i setup multiple mariadb in one single VM using Galera Cluster?

How do i setup multiple mariadb server in one single VM using Galera Cluster?
If configuration links are available please share me?
I have searched in galera website its says add the nodes into cluster and not the adding the multiple mariadb server into cluster

Analysis of having 3 Galera nodes in a single server,
All 3 in a single VM
One in each of 3 VMs
No VMs
Notes:
Galera provides crash protection -- if a node goes down due to hardware failure, the other nodes continue serving the database needs. Not so with all of them sharing the same server and disk(s).
By having multiple instances of MySQL (whether as Galera nodes or not), you can make better use of CPUs. But, since MySQL rarely needs all of the available CPU, I see no advantage in this configuration.
Each instance uses some RAM for static things -- 3 instances leads to 3 copies of such. Other things (eg, caches) scale with RAM size.
No advantage in networking.
(There may be other reasons why there is virtually no difference between a single instance and multiple instances.)

How can you scale Wordpress in Kubernetes cluster - using multiple pod replicas, - accessing a single PVC (persistent file storage)

I'm learning Kubernetes, and trying to setup a cluster that could handle a single Wordpress site with high traffic. From reading multiple examples online from both Google Cloud and Kubernetes.io - they all set the "accessMode" - "readWriteOnce" when creating the PVCs.
Does this mean if I scaled the Wordpress Deployment to use multiple replicas, they all use the same single PVC to store persistent data - read/write data. (Just like they use the single DB instance?)
The google example here only uses a single-replica, single-db instance - https://cloud.google.com/kubernetes-engine/docs/tutorials/persistent-disk
My question is how do you handle persistent storage on a multiple-replica instance?

ReadWriteOnce means all replicas will use the same volume and therefore they will all run on one node. This can be suboptimal.
You can set up ReadWriteMany volume (NFS, GlusterFS, CephFS and others) storage class that will allow multiple nodes to mount one volume.
Alternatively you can run your application as StatefulSet with volumeClaimTemplate which ensures that each replica will mount its own ReadWriteOnce volume.

If on AWS (and therefore blocked by the limitation of EBS volumes only mountable on a single instance at a time), another option here is setting up the Pod Affinity to schedule on the same node. Not ideal from an HA standpoint, but it is an option.
If after setting that up you start running into any wonky issues (e.g. unable to login to the admin, redirect loops, disappearing media), I wrote a guide on some of the more common issues people run into when running Wordpress on Kubernetes, might be worth having a look!

How do I setup an Airflow of 2 servers?

Trying to split out Airflow processes onto 2 servers. Server A, which has been already running in standalone mode with everything on it, has the DAGs and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server which would host the metadata database on MySQL.
Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? Would airflow scheduler has to run on the server that has the DAGs right? Or does it have to run on every server in a cluster? Confused as to what dependencies there are between the processes

This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilizaiton of data across the cluster and provide faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon incase there are too many HTTP requests coming for one machine to handle or if you want to provide Higher Availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it is also setup to be a cluster for High Availability. Setup a Load Balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow

All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.

Kubernetes statefulsets in a GCE multiple zone deployment

I'm working on a project to run a Kubernetes cluster on GCE. My goal is to run a cluster containing a WordPress site in multiple zones. I've been reading a lot of documentation, but I can't seem to find anything that is direct and to the point on persistent volumes and statefulsets in a multiple zone scenario. Is this not a supported configuration? I can get the cluster up and the statefulsets deployed, but I'm not getting the state replicated throughout the cluster. Any suggestions?
Thanks,
Darryl

Reading the docs, I see that the recommended configuration would be to create a MySQL cluster with replication: https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/. This way, you would have the data properly replicated between the instances of your cluster (if you are in a multi-zone deployment you may have to create an external endpoint).
Regarding the Wordpress data, my advice would be to go for an immutable deployment: https://engineering.bitnami.com/articles/why-your-next-web-service-should-be-immutable.html . This way, if you need to add a plugin or perform upgrades, you would create a new container image and re-deploy it. Regarding the media library assets and immutability, I think the best option would be to use an external storage service like S3 https://wordpress.org/plugins/amazon-s3-and-cloudfront/
So, to answer the original question: I think that statefulset synchronization is not available in K8s (at the moment). Maybe using a volume provider that allows ReadWriteMany access mode could fit your needs (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes), though I am quite unsure about the stability of it.

Kubernetes cluster best practice

I am working on a new project with Kubernetes and I need three environments: DEV,QA and PROD.
What is most recommended, create Multiple Clusters or create one big cluster separating environments by namespace.

Are you just going to have a single prod cluster or multiple prod clusters? One thing to consider is that updating the cluster management software (to a new k8s release) can impact your application. If you only plan to have a single prod cluster, I'd recommend running qa and dev separately so that you can upgrade those clusters first to shake out any issues. If you are going to have multiple prod clusters, then you can upgrade them one at a time to ensure application availability and sharing the clusters between environments makes a lot more sense.

Namespaces will not bring you isolation, at the moment it's just a different subdomain in dns. It's better to have namespace per application.
I highly recommend you to have two clusters for prod (in case of updating k8s) and one-two for dev/qa.

Take a look at this blog post: Checklist: pros and cons of using multiple Kubernetes clusters, and how to distribute workloads between them.
I'd like to highlight some of the pros/cons:
Reasons to have multiple clusters
Separation of production/development/test: especially for testing a new version of Kubernetes, of a service mesh, of other cluster software
Compliance: according to some regulations some applications must run in separate clusters/separate VPNs
Better isolation for security
Cloud/on-prem: to split the load between on-premise services
Reasons to have a single cluster
Reduce setup, maintenance and administration overhead
Improve utilization
Cost reduction
Considering a not too expensive environment, with average maintenance, and yet still ensuring security isolation for production applications, I would recommend:
1 cluster for DEV/QA (separated by namespaces, maybe even isolated, using Network Policies, like in Calico)
1 cluster for PROD

Definitely concur that you want multiple clusters:
anything critical to k8s that may fail during an upgrade or because you screw up somewhere will affect the whole cluster.
for example, I had an issue with DNS which wrecked havoc in my cluster; all namespaces were affected.
Upgrades are usually not a big deal but one day you might hit a roadblock; if kubelet fails for too long your pods will get killed.
So it's best to upgrade your test/dev environments and iron things out there before upgrading in prod.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex