I have a MariaDB Galera Cluster with 3 nodes in a multi-master setup.
My current State Snapshot Transfer (SST) method is rsync.
How can I change this method to xtrabackup-v2 without downtime or data loss?
Can I shut down and reconfigure the nodes one by one?
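For reference, the setting in question is wsrep_sst_method in each node's Galera configuration; a minimal sketch of the target value is below (the file path and the SST user credentials are placeholders, not my actual setup):
# /etc/mysql/conf.d/galera.cnf (path varies by distribution)
[galera]
wsrep_sst_method = xtrabackup-v2
# xtrabackup-v2 runs the SST through a backup user, so it also needs credentials
wsrep_sst_auth = sst_user:sst_password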
I'm a newbie at using Airflow. I went through many Airflow tutorials on the internet, and all of them are about development environments using a docker-compose file or files. I'm facing a problem at work setting up a production environment properly. My goal is to have a cluster composed of 3 EC2 virtual machines. Can anyone share best practices for installing Airflow on such a cluster?
Airflow has 4 main components:
Webserver: stateless service which exposes the UI and the REST API of Airflow
Scheduler: stateless service which parses the DAGs and schedules their tasks; it is the main component
Worker: stateless service which executes the tasks
Metadata database: the database of Airflow where the state is stored; it also mediates the communication between the 3 other components
And Airflow has 4 main executors:
LocalExecutor: the scheduler runs the tasks itself by spawning a process for each task, and it works on a single host -> not suitable for your needs
CeleryExecutor: the most used executor; you can create one or more schedulers (for HA) and a group of Celery workers to run the tasks, and you can scale them across different nodes
DaskExecutor: similar to CeleryExecutor but it uses Dask instead of Celery; it is not much used and there are not many resources around it
KubernetesExecutor: it runs each task in a K8s pod, and since it's based on Kubernetes, it's very scalable, but it has some drawbacks.
For your use case, I recommend using CeleryExecutor.
If you can use EKS instead of EC2, you can use the helm chart to install and configure the cluster. And if not, you have other options:
run the services directly on the host:
# install Airflow with the Celery extra
pip install apache-airflow[celery]
# run the webserver
airflow webserver
# run the scheduler
airflow scheduler
# run the worker
airflow celery worker
You can decide how many schedulers, workers, and webservers you want to run, and distribute them across the 3 nodes, for example: node1 (1 scheduler, 1 webserver, 1 worker), node2 (1 scheduler, 2 workers), node3 (1 webserver, 2 workers). You also need a database; you can use Postgres from AWS RDS, or create it on one of the nodes (not recommended).
using docker: same as the first solution, but you run containers instead of running the services directly on the host
using docker swarm: you can connect the 3 nodes to create a swarm cluster and manage the config from one of the nodes; this gives you some features which are not provided by the first 2 solutions, and it's similar to K8s. (doc)
For all 3 solutions, you need to create an airflow.cfg file containing the configuration and the DB credentials, and you should set the executor option to CeleryExecutor.
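A minimal sketch of the relevant airflow.cfg entries, assuming a Postgres metadata DB and a Redis broker for Celery (hostnames and credentials are placeholders; in older Airflow versions sql_alchemy_conn lives under [core] instead of [database]):
[core]
executor = CeleryExecutor

[database]
# connection string to the metadata DB
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host:5432/airflow

[celery]
# message broker and result backend used by the Celery workers
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow:airflow@db-host:5432/airflow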
I have a Kubernetes cluster, "Cluster1", with auto-scaling up to a maximum of 2 nodes and a minimum of 1 node. I am trying to understand DigitalOcean's behavior for automatically scaling nodes down in the following scenario.
The "Cluster1" cluster has Nginx as the ingress controller, which was added as part of the "1-click setup" during cluster provisioning.
This cluster has auto-scaling configured as 1 min node and 2 max nodes. Let's call them Node1 and Node2.
This cluster is behind a DigitalOcean load balancer, LB1, which talks to the ingress controller, i.e. the pod running Nginx.
Let's say there is a single-replica (replicas: 1) Deployment of "image1" which requires 80% of the node's CPU.
Initially, image1 is deployed and, since resources are available, it starts running on Node1.
Consider that image1 is updated to image2 upstream. The Deployment will see that there is no capacity on Node1, Node2 will be provisioned, and another pod running image2 will be created on Node2; the pod running image1 will start to terminate once image2 is up and running.
LB1 updates its routing to Node1 and Node2.
Now, after the image1 pod on Node1 is terminated (because replicas: 1 is set on the Deployment), Node1 is not running anything from the user's perspective.
Ideally, there should be automatic de-provisioning of that node, i.e. Node1.
I tried to manually remove Node1 from the cluster using the DO dashboard.
LB1 updates and shows a single node, but reports its status as down.
Upon investigating, I found that the "nginx-controller" pod was running only on Node1. When Node1 is terminated, it takes a while for a new "nginx-controller" pod to be provisioned on the then-available Node2, and there is downtime all that while.
My question is how best to use auto-scaling when scaling down. I have a few possible solutions in mind:
Is it possible to run "nginx-controller" on all nodes?
or
If I drain the node with kubectl, i.e. kubectl drain, and then manually delete the node from the dashboard, there shouldn't be any downtime? Or will just doing kubectl drain make DigitalOcean scale the node down automatically? (The drain step I have in mind is sketched below.)
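To be concrete, the drain-then-delete sequence I mean is something like the following (the node name is a placeholder):
# evict pods from the node while leaving DaemonSet pods in place
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# then remove the node, either via the DO dashboard or:
kubectl delete node <node-name>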
If the Nginx ingress controller is deployed as a DaemonSet, it is possible to run it on all nodes.
Please go through the official Kubernetes documentation on this topic:
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
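A minimal sketch of what that looks like, assuming the community ingress-nginx controller image (the names, namespace, and image tag here are illustrative, not the manifest used by the 1-click setup):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      containers:
        - name: controller
          # illustrative image/tag; use the version your setup ships with
          image: registry.k8s.io/ingress-nginx/controller:v1.9.4
          ports:
            - containerPort: 80
            - containerPort: 443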
I am currently working on a single RabbitMQ server which supports 20000 to 25672 TCP connections. Is it possible to expand the TCP connection limit using a RabbitMQ server cluster? If yes, how can I configure it, and what are the benefits?
With a cluster you can handle a lot more connections, since the connections can be spread across more machines.
To configure the cluster you can follow the official documentation:
https://www.rabbitmq.com/clustering.html
rabbit2$ rabbitmqctl stop_app
Stopping node rabbit@rabbit2 ...done.
rabbit2$ rabbitmqctl join_cluster rabbit@rabbit1
Clustering node rabbit@rabbit2 with [rabbit@rabbit1] ...done.
rabbit2$ rabbitmqctl start_app
Starting node rabbit@rabbit2 ...done.
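Once the node has joined, you can verify the membership from any node (a quick check, not part of the snippet above):
rabbit2$ rabbitmqctl cluster_status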
I'm trying to figure out a proper way to implement active/passive failover between replicas of a service with Docker swarm mode.
The service will hold valuable in-memory state that cannot be lost, which is why I need multiple replicas of it. The replicas will internally implement Raft so that only the replica which is active ("leader") at a given moment will accept requests from clients.
(If you're unfamiliar with Raft: simply put, it is a distributed consensus algorithm which helps implement an active/passive fault-tolerant cluster of replicas. According to Raft, the active replica - the leader - replicates changes in its data to the passive replicas - the followers. Only the leader accepts requests from clients. If the leader fails, a new leader is elected among the followers.)
As far as I understand, Docker will guarantee that a specified number of replicas are up and running, but it will balance incoming requests among all of the replicas, in an active/active manner.
How can I tell Docker to route requests only to the active replica, but still guarantee that all replicas are up?
One option is routing all requests through an additional NGINX container, and updating its rules each time a new leader is elected. But that will be an additional hop, which I'd like to avoid.
I'm also trying to avoid external/overlapping tools such as consul or kubernetes, in order to keep the solution as simple as possible. (HAProxy is not an option because I need a Linux/Windows portable solution). So currently I'm trying to understand if this can be done with Docker swarm mode alone.
Another approach I came across is returning a failing health check from passive replicas. It does the trick with kubernetes according to this answer, but I'm not sure it will work with Docker. How does the swarm manager interpret failing health checks from task containers?
I'd appreciate any thoughts.
An active/passive replica setup can be achieved with the deployment mode below:
mode: global
With this, the port of the corresponding service is open, i.e., the service is accessible via any of the nodes in the swarm, but the container will be running only on a particular node.
Ref: https://docs.docker.com/compose/compose-file/#mode
Example:
VAULT-HA with Consul Backend docker stack file:
https://raw.githubusercontent.com/gtanand1994/VaultHA/master/docker-compose.yml
Here, the Vault and Nginx containers will be seen on only one node in the swarm, but the Consul containers (which have mode: replicated) will be present on all nodes of the swarm.
But as I said before, the VAULT and NGINX services are available via 'any_node_ip:corresponding_port_number'.
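For illustration, a minimal stack-file fragment with the deploy mode set as described (the service name, image, and port are placeholders, not taken from the linked stack file):
version: "3.8"
services:
  myservice:
    image: myservice:latest    # placeholder image
    ports:
      - "8080:8080"            # published on the swarm routing mesh, reachable via any node IP
    deploy:
      mode: global             # the deployment mode referenced above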
I have to shut down a MariaDB machine which is part of a Galera cluster in which the secondaries have read-only access. How can I shut down MariaDB gracefully so that all in-flight read connections complete successfully?
Do you have a load-balancer or other proxy in front of the nodes? If so, turn that off, then wait for connections to terminate.
Do you have sufficient error checking in your clients? If so, simply "pull the plug". The clients will catch the error, and recover.
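As a small illustration of the first approach: once the node stops receiving new traffic, you can watch the remaining client connections drain before stopping the service (this assumes systemd and a local root login, neither of which is stated in the question):
# number of client connections still open on this node
mysql -e "SHOW STATUS LIKE 'Threads_connected';"
# once the count drops to ~1 (your own session), stop MariaDB gracefully
systemctl stop mariadb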