Galera Cluster: mysterious desync every night at 5:30 - mariadb

We have a galera cluster with four servers at two locations with the following setting:
server1 (location 1) with weight 2
server2 (location 1) with weight 2
server3 (location 2) with weight 2
server4 (location 2) with weight 1
Versions:
-----------
galera-25.3.25
10.2.22-MariaDB
wsrep_patch_version: wsrep_25.24
Everything runs fine, except that every night at 5:30 (sometimes 5:31 or 5:32) all servers lose their connections to each other. They regain them quickly, but I would like to understand what is happening and how to prevent it.
I have already checked: there is no cron job running at this time that could cause this, and no other system shows any error.
The mysql error log shows warnings like:
WSREP: (75927761, 'ssl://0.0.0.0:4567') connection to peer e8e6dd8b with addr ssl://XXX.XXX.XXX.XXX:4567 timed out, no messages seen in PT3S
...
WSREP: discarding established (time wait) b69c4124 (ssl://XXX.XXX.XXX.XXX:4567)
...
WSREP: Quorum: No node with complete state
...
If you need more information, please let me know!
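For reference, the PT3S in the first warning corresponds to gmcast.peer_timeout (default PT3S); whether a node actually gets evicted from the group is governed by the EVS timeouts (evs.suspect_timeout, evs.inactive_timeout). If the nightly drop turns out to be a short network interruption at that time (backup traffic, a WAN re-route, a VPN rekey), relaxing these can keep the nodes from suspecting each other during it. A minimal sketch with illustrative values only; note that wsrep_provider_options is a single string, so merge this with any options you already set there:
---code_begin---
# Illustrative galera.cnf fragment -- values are examples, not a recommendation.
# Longer timeouts tolerate short nightly network hiccups at the cost of
# slower detection of genuinely dead nodes.
[mysqld]
wsrep_provider_options="gmcast.peer_timeout=PT10S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M"
---code_end---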

Related

Mariadb Galera 10.5.13-16 Node Crash

I have a cluster with 2 Galera nodes and 1 arbitrator.
My node 1 crashed and I don't understand why.
Here is the log of node 1.
It seems to be a problem with the pthread library.
Also, all requests are proxied by 2 HAProxy instances.
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:55 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2023-01-03 12:08:56 0 [Warning] WSREP: Handshake failed: http request
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
what(): remote_endpoint: Transport endpoint is not connected
230103 12:08:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.13-MariaDB-1:10.5.13+maria~focal
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=101
max_threads=102
thread_count=106
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760333 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55b1b67f7e42]
Printing to addr2line failed
mariadbd(handle_fatal_signal+0x485)[0x55b1b62479a5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff88ea983c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff88e59e18b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff88e57d859]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff88e939911]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff88e94538c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff88e9453f7]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff88e9456a9]
/usr/lib/galera/libgalera_smm.so(+0x448ad)[0x7ff884b5e8ad]
/usr/lib/galera/libgalera_smm.so(+0x1fc315)[0x7ff884d16315]
/usr/lib/galera/libgalera_smm.so(+0x1ff7eb)[0x7ff884d197eb]
/usr/lib/galera/libgalera_smm.so(+0x1ffc28)[0x7ff884d19c28]
/usr/lib/galera/libgalera_smm.so(+0x2065b6)[0x7ff884d205b6]
/usr/lib/galera/libgalera_smm.so(+0x1f81f3)[0x7ff884d121f3]
/usr/lib/galera/libgalera_smm.so(+0x1e6f04)[0x7ff884d00f04]
/usr/lib/galera/libgalera_smm.so(+0x103438)[0x7ff884c1d438]
/usr/lib/galera/libgalera_smm.so(+0xe8eea)[0x7ff884c02eea]
/usr/lib/galera/libgalera_smm.so(+0xe9a8d)[0x7ff884c03a8d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff88ea8c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff88e67a293]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing
PS: if you want more data ask me :)
OK, it seems that 2 simultaneous OpenVAS scans crash the node.
I tried with versions 10.5.13 and 10.5.16 -> both crash.
Solution: upgrade to at least 10.5.17.
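Until the upgrade is in place, it may also help to make sure only the other cluster nodes (and not a vulnerability scanner) can reach the Galera ports at all. A rough iptables sketch with placeholder peer addresses; adapt it to whatever firewall tooling you actually use:
---code_begin---
#!/bin/sh
# Allow Galera traffic (4567 group comms, 4568 IST, 4444 SST) only from
# the other cluster nodes and drop everything else on those ports.
# 10.0.0.1 and 10.0.0.2 are placeholder peer addresses.
for peer in 10.0.0.1 10.0.0.2; do
    iptables -A INPUT -p tcp -s "$peer" -m multiport --dports 4567,4568,4444 -j ACCEPT
done
iptables -A INPUT -p tcp -m multiport --dports 4567,4568,4444 -j DROP
---code_end---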

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". It is usual for me to get errors like "ORTE has lost communication with a remote daemon", but I usually get those at the beginning of the job. That is annoying, but it does not cause as much time loss as getting the error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
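The openib warnings at the top are separate from the abort after 12 hours (a peer process on barbun39 died and took the job down), but they can be dealt with by making the transport explicit on the mpirun line. A sketch of the two options the help text hints at; ./my_app stands in for the real binary, and UCX support being compiled into this 4.0.1 build is an assumption:
---code_begin---
# Option 1: disable the legacy openib BTL and use UCX for InfiniBand
mpirun --mca pml ucx --mca btl ^openib ./my_app

# Option 2: keep the openib BTL, as the warning text itself suggests
mpirun --mca btl_openib_allow_ib true ./my_app

# Optional: see every aggregated help message instead of the summary
mpirun --mca orte_base_help_aggregate 0 ./my_app
---code_end---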

OpenContrail Collector and ZooKeeper - Daemons stuck in initialization

I have a problem with OpenContrail. I do not understand why, but when I run:
---code_begin---
root# watch contrail-status
---code_end---
I get a lot of daemons showing that they are stuck in the initializing state. I try to stop and restart these daemons, but to no avail. I have also looked into the ZooKeeper status, but do not know how to get it back up. Any thoughts?
Here is the Contrail-Status Output:
Every 2.0s: contrail-status Mon Jul 30 20:19:24 2018
== Contrail Control ==
supervisor-control: active
contrail-control initializing (Number of connections:4, Expected:5 Missing: IFMap:IFMapServer)
contrail-control-nodemgr initializing (Collector connection down)
contrail-dns active
contrail-named active
== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen initializing (Collector, Zookeeper:Zookeeper[Connection time-out] connection down)
contrail-analytics-api initializing (Collector, UvePartitions:UVE-Aggregation[Partitions:0] connection down)
contrail-analytics-nodemgr initializing (Collector connection down)
contrail-collector initializing
contrail-query-engine initializing (Collector connection down)
contrail-snmp-collector initializing (Collector, Zookeeper:Zookeeper[Connection time-out] connection down)
contrail-topology initializing (Collector, Zookeeper:Zookeeper[Connection time-out] connection down)
== Contrail Config ==
supervisor-config: active
unix:///var/run/supervisord_config.sockno
== Contrail Database ==
contrail-database: inactive (disabled on boot)
== Contrail Supervisor Database ==
supervisor-database: active
contrail-database active
contrail-database-nodemgr active
kafka initializing
For the Cassandra and ZooKeeper services I get a blank return:
root#ntw02:~# contrail-cassandra-status
root#ntw02:~# service contrail-control status
contrail-control RUNNING pid 21499, uptime 5:05:39
root#ntw02:~# service contrail-collector status
contrail-collector RUNNING pid 1132, uptime 17 days, 5:45:34
Interesting: I restarted the contrail-collector, but the uptime does not match 07/28/2018.
I am going to try killing the process, restarting it, and providing the output.
After a long wait, I finally found the answer to this question. Apparently Contrail has a script that will clean up entries specifically related to the database.
Enter:
db_manage.py
You should be able to execute it with
python /usr/lib/python2.7/dist-packages/vnc_cfg_api_server/db_manage.py
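On the ZooKeeper side ("do not know how to get it back up"): ZooKeeper answers simple four-letter commands on its client port, which makes it easy to confirm whether it is actually serving before re-checking contrail-status. A sketch, assuming the default client port 2181 and a zookeeper init service (names may differ on a Contrail node):
---code_begin---
# Four-letter-word health checks against ZooKeeper's client port
echo ruok | nc localhost 2181   # prints "imok" if the server is serving
echo stat | nc localhost 2181   # shows mode (standalone/leader/follower) and connected clients
# If it does not answer, restart it and re-check:
service zookeeper restart
---code_end---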

Understanding Docker container resource usage

I have server running Ubuntu 16.04 with Docker 17.03.0-ce running an Nginx container. That server also has ConfigServer Security & Firewall installed. Shortly after starting the Nginx container I start receiving emails about "Excessive resource usage" with the following details:
Time: Fri Mar 24 00:06:02 2017 -0400
Account: systemd-timesync
Resource: Process Time
Exceeded: 1820 > 1800 (seconds)
Executable: /usr/sbin/nginx
Command Line: nginx: worker process
PID: 2302 (Parent PID:2077)
Killed: No
I fully understand that I can add exe:/usr/sbin/nginx to csf.pignore to stop these email alerts but I would like to understand a few things first.
Why is the "systemd-timesync" account being reported? That does not seem to have anything to do with Docker.
Why does the host machine seem to be reporting the excessive resource usage (the extended process time) when that is something running in the container?
Why do my other Docker containers, which are not running Nginx, not result in excessive resource usage emails?
I'm sure there are other questions but basically, why is this being reported the way it is being reported?
I can at least answer the first two questions:
Unlike real VMs, Docker containers are simply a collection of processes running under the host system's kernel. They just have a different view of certain system resources, including their own file hierarchy, their own PID namespace and their own /etc/passwd file. As a result, they will still show up if you run ps aux on the host machine.
The nginx container's /etc/passwd includes a user 'nginx' with UID 104 that runs the nginx worker process. However, in the host's /etc/passwd, UID 104 might belong to a completely different user, such as systemd-timesync.
As a result, if you run ps aux | grep nginx in the container, you might see
nginx 7 0.0 0.0 32152 2816 ? S 11:20 0:00 nginx: worker process
while on the host, you see
systemd-timesync 22004 0.0 0.0 32152 2816 ? S 13:20 0:00 nginx: worker process
even though both are the same process (also note the different PID namespaces; in containers, PIDs are counted from 1 again).
As a result, container processes will still be subject to ConfigServer's resource monitoring, but they might show up with random, or even non-existent user accounts.
As to why nginx triggers the emails and other containers don't, I can only assume that nginx is the only one of your containers that crosses ConfigServer's resource thresholds.
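If you want to see the UID mapping for yourself, a quick sketch ("web" is a placeholder container name, and getent being available in the image is an assumption):
---code_begin---
docker top web                       # host-side view of the container's processes (host UIDs/usernames)
docker exec web getent passwd 104    # inside the container, UID 104 resolves to "nginx"
getent passwd 104                    # on the host, the same UID may resolve to e.g. systemd-timesync
---code_end---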

Error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused

I installed Kubernetes on linux using the steps here.
Everything worked fine until I exited the terminal and opened a new terminal session.
I got a permission denied error and after restarting my machine I get the following error
> kubectl get pod
error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused
I am just getting started with Kubernetes any help would be appreciated.
Seems like a TCP problem. Try to isolate the problem by checking whether TCP/8080 is open by issuing
telnet 127.0.0.1 8080
If you get a 'connection refused', you should probably look at the firewall/security settings of your machine.
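If telnet is not installed, a couple of equivalent checks (same idea, different tools):
---code_begin---
curl -v http://localhost:8080/api   # the exact endpoint kubectl is failing to reach
ss -ltnp | grep 8080                # is anything listening on port 8080 at all?
---code_end---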
I tried replicating the problem by doing this:
Installed Kubernetes on a fresh ubuntu 15.04 machine from the instructions given in your link above.
stopped (with docker) all the containers as given in the instructions.
logged out and logged in again
Started etcd and then kubernetes master and then the service proxy to get them up again.
Then I immediately ran kubectl get nodes and got the same error as yours.
[anovil#ubuntu-anovil ~]$ kubectl get nodes
error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused
[anovil#ubuntu-anovil ~]$
Then I ran docker ps to check if they were all running, and it seems that was not the case.
[anovil#ubuntu-anovil ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
449b4751f0e4 gcr.io/google_containers/pause:0.8.0 "/pause" 3 seconds ago Up 2 seconds k8s_POD.e4cc795_k8s-master-127.0.0.1_default_f3ccbffbd75e3c5d2fb4ba69c8856c4a_b169f4ad
8c37ad726b71 gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube proxy --m" 55 seconds ago Up 55 seconds naughty_jennings
de9cf798bc2b gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube kubelet -" About a minute ago Up About a minute desperate_pike
6d969a37428e gcr.io/google_containers/etcd:2.0.12 "/usr/local/bin/etcd " About a minute ago Up About a minute jovial_jang
[anovil#ubuntu-anovil ~]$
As you can see, the controller-manager, apiserver and scheduler were missing.
If this is also your problem: I just waited for a while, say 1 minute, and they were all up again.
So it just took some time to resume, after which:
[anovil#ubuntu-anovil ~]$ kubectl get nodes
NAME LABELS STATUS
127.0.0.1 kubernetes.io/hostname=127.0.0.1 Ready
[anovil#ubuntu-anovil ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0b8b7aae8143 gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube scheduler" 8 seconds ago Up 8 seconds k8s_scheduler.2744e742_k8s-master-127.0.0.1_default_f3ccbffbd75e3c5d2fb4ba69c8856c4a_6928bc83
0e25d641079b gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube apiserver" 8 seconds ago Up 8 seconds k8s_apiserver.cfb70250_k8s-master-127.0.0.1_default_f3ccbffbd75e3c5d2fb4ba69c8856c4a_1f35ee04
d5170a4bcd58 gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube controlle" 8 seconds ago Up 8 seconds k8s_controller-manager.1598ee5c_k8s-master-127.0.0.1_default_f3ccbffbd75e3c5d2fb4ba69c8856c4a_e9c8eaa4
449b4751f0e4 gcr.io/google_containers/pause:0.8.0 "/pause" 18 seconds ago Up 18 seconds k8s_POD.e4cc795_k8s-master-127.0.0.1_default_f3ccbffbd75e3c5d2fb4ba69c8856c4a_b169f4ad
8c37ad726b71 gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube proxy --m" About a minute ago Up About a minute naughty_jennings
de9cf798bc2b gcr.io/google_containers/hyperkube:v1.0.1 "/hyperkube kubelet -" About a minute ago Up About a minute desperate_pike
6d969a37428e gcr.io/google_containers/etcd:2.0.12 "/usr/local/bin/etcd " About a minute ago Up About a minute jovial_jang
[anovil#ubuntu-anovil ~]$
The first thing you should do after starting etcd, the master and the service proxy is to check with docker ps whether they are all up.
Also, if you still have problems, could you post your Docker version and your host details (OS, version, etc.)?
Thanks. (I do not have enough reputation to comment on this question.)
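To avoid racing the slow-starting containers, a small, purely illustrative loop can poll until the apiserver answers before you run anything else:
---code_begin---
#!/bin/sh
# Poll for up to ~2 minutes until kubectl can reach the apiserver.
for i in $(seq 1 24); do
    if kubectl get nodes >/dev/null 2>&1; then
        echo "apiserver is up"
        break
    fi
    echo "still waiting for the apiserver ($i)..."
    sleep 5
done
---code_end---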
