New node in galera cluster has SST loop

New node in galera cluster has SST loop - mariadb

We got a MariaDB 10.1 node on a Debian 8 server that we try to synchronize with 3-node galera cluster of MariaDB 10.0 on CentOS 7 using the Xtrabackup SST method.
Starting this node and setting a donor works, also works the SST: all data is transferred correctly. SST method and donor parameters are specified in /etc/mysql/my.cnf and we launch it as follow:
# service mysql start
(Note that we also have mariadb.service service present on this system)
After SST completion, mysql shuts down.
When we start it again, we can see in the logs innodb column type errors (column of type XXX that should be of type XXX), and it has again an UUID of 0000-0000-0000-0000 whereas after the SST it did set the laster UUID of the cluster.
It has forgotten the UUID and starts SST over and over.
We also tried a mysql_upgrade after the SST before starting mysql in the cluster, but it doesn't help.
One time out of 20 it started IST, which is expected, but it crashed because the port number couldn't be binded. But then we never got it as far as this attempt to IST.
Ports are open (3306,4444,4567,4568,9200) routes are set, we did also try rsync method without success, and after many tries sometimes multiple xtrabackup processes got hung and that is what caused the port number error. Afterwards we did a reboot after each time to be sure we got a clean system.
Also to mention, we did thouroughly read this datacenter migration guide: http://severalnines.com/blog/migrating-mysql-galera-cluster-new-data-center-without-downtime
So we want to set up a 6-nodes cluster out of a 3-nodes cluster that is only attached through 2 nodes.
Has anyone experienced such issues?

Related

MariaDB has stopped responding - [ERROR] mysqld got signal 6

MariaDB service was stopped responding all of a sudden. It was running for more than 5 months continuously without any issues. When we check the MariaDB service status at the time of the incident, it showed as active (running) ( service mariadb status ). But we could not log into the MariaDB server, each logging attempt was just hanged without any response. All our web applications were also failed to communicate with the MariaDB service. Also, we checked the max_used_connections, and it was below the maximum value.
When we going through the logs, we saw the below error (this had been triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using general stopping commands ( service MariaDB stop). But we were able to forcefully kill the MariaDB process and then we could get the MariaDB service back online.
What could be the reason for this failure. If you have already faced similar issues please share your experience, what actions you got to prevent such failures (in the future). Your feedback is much much appreciated.
Our Environment Details are as follows
Operating system: Red Hat Enterprise Linux 7
Mariadb version: 10.2.34-MariaDB-log MariaDB Server

I also face this issue on an aws instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It happened already 3 times occasionnaly. Like you, no possibility to stop the service but reboot the machine to get it working again.
Logs at restart suggest some tables crashed and should be repaired.

Mariadb Galera cluster does not come up after killing the mysql process

I have a Mariadb Galera cluster with 2 nodes and it is up and running.
Before moving to production, I want to make sure that if a node crashes abruptly, It should come up on its own.
I tried using systemd "restart", but after killing the mysql process the mariadb service does not come up, so, is there any tool or method, that I can use to automate bringing up the nodes after crashes?

Galera clusters needs to have quorum (3 nodes).
In order to avoid a split-brain condition, the minimum recommended number of nodes in a cluster is 3. Blocking state transfer is yet another reason to require a minimum of 3 nodes in order to enjoy service availability in case one of the members fails and needs to be restarted. While two of the members will be engaged in state transfer, the remaining member(s) will be able to keep on serving client requests.
You can read more here.

Starting all nodes in galera at once

I have a galera cluster of three nodes, If I shut down the the three virtual machines and started them all at once, systemd will automatically start mariadb on each of the virtual machines.
Some times it happens that all of the mariadb instances start at once, and this result of a broken cluster.
Which I have to reinitiate using galera_new_cluster
The question is, why does starting all the mariadb instances at once break the cluster ?
Thank you

Whenever you start a node, it either starts as the first node in the cluster (initiates a new cluster), or it attempts to connect to an existing nodes using wsrep_cluster_address. The behavior depends on the node options.
So, every time when you shut down or lose all nodes and start them again, there is nothing to connect to, and you need to start a new cluster. galera_new_cluster does that by starting a node with --wsrep-new-cluster option which overrides the current value of wsrep_cluster_address.
If sometimes it works for you automatically, it most likely means that one of your nodes is permanently configured as the "first node", either via wsrep_cluster_address=gcomm://, or via wsrep-new-cluster. It is a wrong setup in itself. If you lose or shut down only this node and have to restart it, it won't join the remaining nodes in the cluster, it will create a new one.
When you start all nodes at once, you create a race condition. If your "first node" comes up first and initializes quickly enough, it will create a new cluster, and other nodes will join it. If another node comes up first, it won't be able to join anything, thus you get a "broken cluster".
You can find more information on restarting the whole cluster here:
http://galeracluster.com/documentation-webpages/restartingcluster.html

Recommendate way of bootstrapping cluster is to start a advanced node first then second and third so for you need to check lsn no of all nodes or check grastate.date file where you can check all nodes value .
so follow these steps your cluster node will not be crash

Connet two apps to MariaDB Multi Master database

Suppose that we have two application servers(app1 and app2) and also we setup multi master MariaDB clustering with two nodes(node1 and node2) without any HAProxy.Can we connect app1 to node1 and app2 to node2 and also both of app1 and app2 write to node1 and node2?
Does it cause any conflict?

Galera solves most of the problems that occur with Master-Master:
If one of Master-Master dies, now what? Galera recovers from any of its 3 nodes failing.
If you INSERT the same UNIQUE key value in more than one Master, M-M hangs; Galera complains to the last client to COMMIT.
If a node dies and recovers, the data is automatically repaired.
You can add a node without manually doing the dump, etc.
etc.
However, there are a few things that need to be done differently to when using Galera: Tips

DR setup for MariaDB Galera Clusters

I have two MariaDB Galera Cluster with 3 nodes.
Cluster 1 : MDB-01,MDB-02,MDB-03
Cluster 2 : MDBDR-01,MDBDR-02,MDBDR-03
These two clusters are in two different data centers which are in two geographical regions.
Cluster 1 is PRODUCTION cluster and Cluster 2 is DR cluster
Asynchronous replication using GTID has been setup between MDB-01 to MDBDR-01
as per given configuration in the link :
http://www.severalnines.com/blog/deploy-asynchronous-replication-slave-mariadb-galera-cluster-gtid-clustercontrol
(Link is asynchronous replication between MariaDB Galera Cluster to Stand alone MariaDB instance.
However I have setup same configuration for asynchronous replication between MariaDB Galera Cluster to MariaDB Galera Cluster)
I am able to switch from current slave MDBDR-01 => MDB-01 to MDBDR-01 => MDB-02 with below command :
CHANGE MASTER TO master_host='MDB-02'
However I am getting challenge how to point MDBDR-02 => MDB-01 in case of MDBDR-01 is down.
Could you please provide inputs to achieve pointing MDBDR-02 => MDB-01 or MDBDR-03 => MDB-01.

One thing you need to understand, that article briefly mentions, is that each MariaDB's GTID implementation can cause problems in this situation. Since each node maintains its own list of GTIDs and galera transactions do not have their own id, it is possible that the same GTID does not point to the same place on each server (see this article).
Due to that problem, I wouldn't attempt what you're doing without MariaDB 10.1. MariaDB 10.1.8 was just released and is the first GA release of the 10.1 line. 10.1 changes the GTID implementation so galera transactions use their own server_id (set via a config variable). You can then filter replication on the slaves to only replicate the galera id.
To switch to a different slave server, you will need to get that last GTID executed on the old slave. The gtid_slave_pos is stored in mysql.gtid_slave_pos, but mysql.* tables are not replicated. I'm not completely sure and I don't have a way of testing if the original GTID of a transaction is passed to the other slave galera nodes (i.e. if the master cluster's galera server_id is 1 and the slave cluster's galera server_id is 2 and MDBDR-01 gets a slave event with GTID 1-1-123, will MDBDR-02 log it as 1-1-123 or 1-2-456). I'm guessing that it doesn't since the new GTID implementation should change the server_id, but you may be able to verify this. Since you probably can't get the last executed master GTID from a different slave galera node, you will probably need to get the GTID from the old slave which may not be possible unless you gracefully shut down the old slave. You may need to find the GTID from the last executed transaction in the binlog on the new slave and try to match that to a transaction in the master's binlog. Also, if you're not using sync_binlog = 1, the binlog is not reliable and might be a bit behind.
Since each galera slave node probably doesn't know about the executed GTIDs and can't skip previous GTID events, you may also have to play with SQL_SLAVE_SKIP_COUNTER to get to the correct position if the GTID you found is behind.
When you get the GTID (or a guess of it) you will then set up replication on the new slave the same way you set it up on the original slave. The following commands should do it:
SET GLOBAL gtid_slave_pos = "{Last Executed GTID}";
CHANGE MASTER TO master_host="{Master Address}", master_port={Master Port}, master_user="{Replication User}", master_password={Replication Password}, master_use_gtid=slave_pos;
START SLAVE;
You should also disable replication on the old slave before restarting it so the missed events don't get replicated twice.
Until the executed slave GTID is replicated through galera, which might never happen, failover like this will be a messy process.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex