MariaDB has stopped responding - [ERROR] mysqld got signal 6 - mariadb

The MariaDB service stopped responding all of a sudden. It had been running continuously for more than 5 months without any issues. When we checked the MariaDB service status at the time of the incident ( service mariadb status ), it showed as active (running). But we could not log into the MariaDB server; each login attempt just hung without any response. All our web applications also failed to communicate with the MariaDB service. We also checked max_used_connections, and it was below the maximum value.
When we went through the logs, we saw the error below (it was triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using the usual stop command ( service mariadb stop ). But we were able to forcefully kill the MariaDB process, and then we could bring the MariaDB service back online.
What could be the reason for this failure? If you have already faced similar issues, please share your experience and what actions you took to prevent such failures in the future. Your feedback is much appreciated.
Our environment details are as follows:
Operating system: Red Hat Enterprise Linux 7
MariaDB version: 10.2.34-MariaDB-log MariaDB Server
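For anyone comparing notes on these hangs, it can help to pull the "long semaphore wait" lines out of the error log and see which source file and line the threads are waiting on (btr0sea.cc in the excerpt above). Below is a minimal Python sketch that does only that. The regular expression is an assumption based on the wording in this log excerpt and may need adjusting for other versions; the function name find_long_waits is hypothetical.

```python
import re

# Matches lines such as:
#   --Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
# The wording is taken from the log excerpt in this thread (assumption: other
# server versions may phrase it differently).
WAIT_RE = re.compile(
    r"--Thread (?P<thread>\d+) has waited at (?P<file>\S+) line (?P<line>\d+) "
    r"for (?P<seconds>\d+(?:\.\d+)?) seconds"
)


def find_long_waits(log_text, min_seconds=0.0):
    """Return one dict per semaphore-wait line that waited at least min_seconds."""
    waits = []
    for match in WAIT_RE.finditer(log_text):
        seconds = float(match.group("seconds"))
        if seconds >= min_seconds:
            waits.append({
                "thread": int(match.group("thread")),
                "file": match.group("file"),
                "line": int(match.group("line")),
                "seconds": seconds,
            })
    return waits


# Two lines copied from the error log above.
sample = (
    "2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:\n"
    "--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:\n"
)

for wait in find_long_waits(sample, min_seconds=60):
    print(wait["thread"], wait["file"], wait["line"], wait["seconds"])
```

Grouping the results by file and line over a longer log would show whether every incident waits on the same latch, which is useful detail for a bug report.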

I also face this issue on an AWS instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It has already happened 3 times, at irregular intervals. Like you, I have no way to stop the service; I have to reboot the machine to get it working again.
Logs at restart suggest some tables crashed and should be repaired.

Related

Airflow 2 - error MySQL server has gone away

I am running Airflow with a MariaDB backend, and periodically, when a DAG task is being scheduled, I see the following error in the Airflow worker:
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away').
I am not sure whether the issue is due to a misconfiguration of Airflow or to the backend being MariaDB, which, as far as I can tell, is not a recommended database.
Also, in the MariaDB logs, I see the following warning repeated almost every minute:
[Warning] Aborted connection 305627 to db: 'airflow' user: 'airflow' host: 'hostname' (Got an error reading communication packets)
I've seen some similar issues mentioned, but nothing I have tried so far has helped.
The question is: should I change the database to MySQL, or does some configuration have to be done on the MariaDB end?
Airflow v2.0.1
MariaDB 10.5.5
SQLAlchemy 1.3.23
Hard to say. You need to look for the reason why your DB connection gets aborted. MariaDB might work for quick testing with a single scheduler, but there is some reason why your connection to the DB gets dropped.
There are a few things you can do:
Airflow has a db check command-line command that you can run to test whether the DB configuration is working. Maybe the errors will be obvious when you try it.
Airflow also has another useful command, db shell, which lets you connect to the DB and run SQL queries, for example. This might tell you whether your connection is "stable": you can connect, run some queries, and see whether your connection gets interrupted in the meantime.
You can look at more logs and at your network connectivity to see if there are problems.
Finally, check whether you have enough resources to run Airflow + DB. Things like that often happen when you do not have enough memory, for example. In my experience, Airflow + DB needs at least 4GB of RAM (depending on the DB configuration), and if you are on Mac or Windows and using Docker, the Docker VM by default has less memory than that available, so you need to increase it.
Look at other resources too: disk space, memory, number of connections, etc. Any of them can be your problem.
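The "is my connection stable" check above can be roughly automated by running the same check several times with a pause in between and recording which attempts fail. The following is a minimal Python sketch under stated assumptions: probe_connection and airflow_db_check are hypothetical helper names, and airflow_db_check assumes the airflow CLI is on PATH and that a failed check exits with a non-zero status.

```python
import subprocess
import time


def probe_connection(check, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call check() repeatedly and return the attempts that raised.

    check is any zero-argument callable that raises on failure. The result
    is a list of (attempt_number, error_text) tuples; an empty list means
    every attempt succeeded.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            check()
        except Exception as exc:  # record the failure and keep probing
            failures.append((attempt, repr(exc)))
        if attempt < attempts:
            sleep(delay_seconds)
    return failures


def airflow_db_check():
    # Assumption: the airflow CLI is installed and on PATH. With check=True,
    # subprocess.run raises CalledProcessError on a non-zero exit status.
    subprocess.run(["airflow", "db", "check"], check=True)
```

For example, probe_connection(airflow_db_check, attempts=10, delay_seconds=30) would spread ten checks over a few minutes. Failures that cluster at regular intervals tend to point at a timeout somewhere, while random ones point more toward network or resource trouble.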

mariadb 10.3.13 table_open_cache problems

Just upgraded MySQL 5.6 to MariaDB 10.3.13. Now when the server hits open_tables = 2000, my PHP queries stop working. If I do a flush tables, it starts working correctly again. This never happened when I was using MySQL; now I can't go a day without having to log in and do a flush tables to get things working again.
I use WHM / cPanel to administer my VPS, and on the last WHM release it started warning me that the version of MySQL I was running (I really can't remember what version it was; it was what was loaded when I got my VPS) was soon coming to an end and that I would need to upgrade to MySQL 5.7 or MariaDB xxx. I had been wanting to move to MariaDB for a while anyway, so that is what I did. WHM recommended the 10.3.13 version.
After some more watching and looking, it appears that what makes my open_tables hit the 2000 max is the automatic cPanel backup routines, which also back up all of my databases in one go. It doesn't crash anything; it just causes problems with my PHP application connections. I don't think the connections get rejected; they just don't return any data. I turned all of the automatic WHM/cPanel backups off and things have settled down a little.
table_definition_cache 400
table_open_cache 2000
I still do a mysqldump via cron for my database backups. There are only two live databases, and they still make open_tables grow to that 2000 max, just not as fast.
I now run a script every hour that shows me some of the variables, and here is what I am seeing:
After doing a flush tables command, both open_tables and open_table_definitions start increasing. Once open_table_definitions hits 400 it stops increasing, while open_tables keeps increasing through the day.
Then, when the mysqldumps happen in the early morning hours, open_tables hits 2000 (the max setting) and my PHP queries are not executed.
I do not get a PHP error.
I ran the following command so that I could see what was happening on the DB side:
SET GLOBAL general_log = 'ON'
Looking at the log, when everything is running OK I see my application connecting, preparing the statement, executing the statement and then disconnecting.
I did the same thing when it started acting up (i.e. when my PHP application stopped getting a result again).
Looking at the log, I see my application connecting and preparing the statement, but then, instead of executing the statement, it prepares the same statement 2 more times and then disconnects.
I logged into mysql and did a flush tables command, and everything went back to normal: the application connects, prepares a statement, executes it, disconnects.
But this never happened before I moved to MariaDB. I never messed with the MySQL server stuff at all. The only time MySQL was restarted was when I did a CentOS 6 system update and had to reboot the server. I would go months without doing a thing on the server.
Looks like the system setting was the culprit. I changed table_open_cache to 400 and my PHP application is no longer having any issues preparing statements, even after the nightly backups of the databases. Looking at older MySQL documentation, MySQL 5.6.7 had a table_open_cache setting of 400, so when I upgraded to MariaDB 10.3.13 that default setting changed to 2000, which is when I started having problems.
Not quite sure what the following is telling me, but might be of interest ....
su - mysql
-bash-4.1$ ulimit -Hn
100
-bash-4.1$ ulimit -Sn
100
-bash-4.1$ exit
logout
[~]# ulimit -Hn
4096
[~]# ulimit -Sn
4096
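One caveat about the ulimit output above: it shows the limits of a fresh login shell for each user, which is not necessarily what the running mysqld process has. On Linux, the limits a running process actually has are listed in /proc/<pid>/limits. Here is a minimal Python sketch that reads the "Max open files" row so it can be compared with table_open_cache. The helper names are hypothetical, and the row layout (name, soft limit, hard limit, units) is the standard Linux format.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    A limit reported as 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Row layout: Max open files <soft> <hard> files
            soft, hard = fields[3], fields[4]
            return tuple(
                None if value == "unlimited" else int(value)
                for value in (soft, hard)
            )
    raise ValueError("no 'Max open files' row found")


def read_process_limits(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_max_open_files(handle.read())


# Made-up sample in the /proc/<pid>/limits layout, using the value 100 from above.
sample_limits = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            100                  100                  files\n"
)
print(parse_max_open_files(sample_limits))
```

If the soft limit of the real mysqld process turns out to be well below table_open_cache, that would be consistent with the behaviour described here, although this sketch only reports the numbers and does not prove the cause.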

MariaDB 10.1.33 keeps crashing

I have a standard master/multiple slave set up on CentOS 7 running on EC2.
I actually have three identical slaves (all spawned off the same AMI), but only one is crashing about once a day. I've posted the dump from error.log below, as well as the query log referencing the connection ID in the error dump.
I've tried looking at the MariaDB docs, but all they point to is resolve_stack_dump, with no real help on how to figure it out from there.
The slave that is crashing is running many batch-like queries, but according to the dump logs, the last connection id is never one of the connections running queries.
For this slave, I have the system turn off slave updates (SQL_THREAD), run queries for 15 minutes, stop the queries, start the slave until caught up, stop the slave updates, and restart the queries. Repeat. This code has been working pretty much non-stop/crash-free for years now when I had a colo set up before I moved to AWS.
My other two cloned slaves only run the replication queries as a hot-spare of the master (which I've never needed to use). Those servers never crash.
thanks.
Error.log crash dump:
180618 13:12:46 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.1.33-MariaDB
key_buffer_size=268431360
read_buffer_size=268431360
max_used_connections=30
max_threads=42
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22282919 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4209f1c008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4348db90b0 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55c19a7be10e]
/usr/sbin/mysqld(handle_fatal_signal+0x305)[0x55c19a2e1295]
sigaction.c:0(__restore_rt)[0x7f4348a835e0]
sql/sql_class.h:3406(sql_set_variables(THD*, List, bool))[0x55c19a0d2ecd]
sql/sql_list.h:179(base_list::empty())[0x55c19a14bcb8]
sql/sql_parse.cc:2007(dispatch_command(enum_server_command, THD, char*, unsigned int))[0x55c19a15e85a]
sql/sql_parse.cc:1122(do_command(THD*))[0x55c19a160f37]
sql/sql_connect.cc:1330(do_handle_one_connection(THD*))[0x55c19a22d6da]
sql/sql_connect.cc:1244(handle_one_connection)[0x55c19a22d880]
pthread_create.c:0(start_thread)[0x7f4348a7be25]
/lib64/libc.so.6(clone+0x6d)[0x7f4346e1f34d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0):
Connection ID (thread ID): 15894
Status: NOT_KILLED
Optimizer switch:
index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off
Query Log for connection ID:
180618 13:11:01
15894 Connect ****#piper**** as anonymous on
15894 Query show status
15894 Prepare show full processlist /* m6clone1 */
15894 Execute show full processlist /* m6clone1 */
15894 Close stmt
15894 Query show slave status
15894 Query show variables
15894 Quit

website down with mariadb "too many connections" error

I am running a single high-traffic website on a high-end CentOS 7 VPS (16 vCores / 128 GB of RAM) with Plesk Onyx, on a CentOS 7 / MariaDB 10.1 / PHP-FPM 5.6 setup.
Everything is usually smooth and fast, but twice in a year the website has gone down with the message "Too Many Connections" from MariaDB.
Being in a hurry to restore the website, I ran " service mariadb restart " without first running a SHOW PROCESSLIST.
I checked the MariaDB logs and web server logs afterwards and didn't find anything useful for troubleshooting the issue.
Note that when it happened the first time, I raised the max_connections value to 300 in my.cnf and have constantly monitored the "max_used_connections" variable since. That value never went over 50, so I guessed it happened because of some DDoS attack or malicious attempt.
Questions :
Any advice on how to troubleshoot this?
How can I be alerted if the max_used_connections value is approaching the max_connections value? Is there any tool?
I am using the external Pingdom service to check website uptime, but it didn't detect this kind of problem (the web response is 200 OK). I also have a netdata instance on the server (https://netdata.io/), which didn't help either.
Troubleshoot it by turning on the slowlog, preferably with a low value for long_query_time (such as "1"). Probably some naughty query will show up there.
Yes, do SHOW FULL PROCESSLIST next time. (Note "FULL".) Instead of restarting mysqld, look for the offending query. It will have one of the highest values in Time and it probably won't be in Sleep mode. It may be something potentially long like ALTER or a dump. Killing that one process will probably uncork the problem, and the problem will vanish in, perhaps, seconds.
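To make the "highest Time, not in Sleep" rule concrete, here is a minimal Python sketch that ranks rows already fetched from SHOW FULL PROCESSLIST. It assumes each row is a dict keyed by the usual column names (Id, Command, Time, Info); the function name likely_offenders and the sample rows are made up for illustration.

```python
def likely_offenders(rows, min_time=10, top=5):
    """Rank processlist rows: longest-running first, sleeping threads excluded.

    rows is an iterable of dicts keyed by SHOW FULL PROCESSLIST column names.
    Threads that have been running for less than min_time seconds are ignored.
    """
    active = [
        row for row in rows
        if row.get("Command") != "Sleep" and (row.get("Time") or 0) >= min_time
    ]
    return sorted(active, key=lambda row: row.get("Time") or 0, reverse=True)[:top]


# Made-up sample rows, not taken from a real server.
sample_rows = [
    {"Id": 11, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 12, "Command": "Query", "Time": 2, "Info": "SELECT 1"},
    {"Id": 13, "Command": "Query", "Time": 640, "Info": "ALTER TABLE t ADD COLUMN c INT"},
    {"Id": 14, "Command": "Query", "Time": 35, "Info": "SELECT * FROM big_table"},
]

for row in likely_offenders(sample_rows):
    print(row["Id"], row["Time"], row["Info"])
```

The Id of the top row is what you would pass to KILL, after looking at the Info column to make sure it really is the statement you want to stop.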
Deleting a file that is "open" by a process (such as mysqld) will not help -- disk space is not recycled until all processes have closed the file. Killing the process closes any open files. Some logs can be handled with FLUSH LOGS; -- this should be harmless, though it may not help.
If your tables are MyISAM, switching to InnoDB will avoid many cases of table locks (if that is what you are experiencing).
What is the value of innodb_buffer_pool_size? For that sized RAM, about 80G is reasonable.
There might be some clues in the GLOBAL STATUS; see http://mysql.rjweb.org/doc.php/mysql_analysis#tuning for analyzing it. (Caution: It will be useless immediately after a reboot.)
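On the question of being alerted before max_connections is reached: whatever monitoring tool ends up being used, the check itself is just a ratio of two numbers the server already exposes. Here is a minimal Python sketch of that check. The function name and the 0.8 threshold are assumptions, and fetching the values (for example from SHOW GLOBAL STATUS and SHOW GLOBAL VARIABLES) is left out.

```python
def connection_headroom(max_used_connections, max_connections, warn_ratio=0.8):
    """Return (ratio, should_warn) for connection usage against the limit.

    warn_ratio is an arbitrary threshold chosen for illustration.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = max_used_connections / max_connections
    return ratio, ratio >= warn_ratio


# Values mentioned in the question: a peak of about 50 against a limit of 300.
ratio, should_warn = connection_headroom(50, 300)
print(round(ratio, 2), should_warn)
```

Note that Max_used_connections is a high-water mark since the last restart, so it only ever grows. For an alert that reflects the current load, the same function can be fed Threads_connected instead, polled from cron or from an existing monitoring agent.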

Biztalk Cluster Servers

We used to have one BizTalk 2006R2 32-bit server. We recently upgraded it to Enterprise. But because of our traffic volume, we didn't have enough power and memory with only one. So we also recently installed a second BizTalk server, a 2006R2 64-bit, and put them in a shared cluster. Since then a problem has arisen, actually two, but I'm guessing they are probably connected. One of our (19) host instances keeps going into the "stopped" status. This host instance mainly deals with TCP ports. We have a script which checks whether host instances are in the stopped state and starts them again, but this is obviously of very little use since the instance keeps going back to the stopped state. There is also an error in our event viewer, namely:
Faulting application btsntsvc.exe, version 3.6.1404.0, stamp 4674b0a4, faulting module kernel32.dll, version 5.2.3790.4480, stamp 49c51f0a, debug? 0, fault address 0x0000bef7.
Does anyone have any idea?
Thanks
Having automated scripts restart the host instance is not a good idea IMO; you need to get to the bottom of the problem. It looks like a known issue, and a hotfix is available. It is worth looking at this KB: http://support.microsoft.com/kb/978059

Resources