I have a standard master/multiple-slave setup on CentOS 7 running on EC2.
I actually have three identical slaves (all spawned off the same AMI), but only one is crashing about once a day. I've posted the dump from error.log below, as well as the query log referencing the connection ID in the error dump.
I've tried looking at the MariaDB docs, but all they point to is resolve_stack_dump, with no real help on figuring things out from there.
The slave that is crashing runs many batch-like queries, but according to the dump logs, the connection ID in the dump is never one of the connections running those queries.
For this slave, I have the system stop slave updates (the SQL thread), run queries for 15 minutes, stop the queries, start the slave until it has caught up, stop slave updates again, and restart the queries, repeating the cycle. This code ran essentially non-stop and crash-free for years on the colo setup I had before moving to AWS.
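Roughly, one cycle looks like this in SQL (a simplified sketch; the real code also handles the timing, polling, and error checks):
STOP SLAVE SQL_THREAD;   -- pause replication apply
-- ... run the batch queries for ~15 minutes, then stop them ...
START SLAVE SQL_THREAD;  -- resume replication
SHOW SLAVE STATUS;       -- poll until Seconds_Behind_Master reaches 0
STOP SLAVE SQL_THREAD;   -- pause again, restart the batch queries, repeat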
My other two cloned slaves only run the replication queries, acting as hot spares of the master (which I've never needed to use). Those servers never crash.
Thanks.
Error.log crash dump:
180618 13:12:46 [ERROR] mysqld got signal 11 ; This could be because
you hit a bug. It is also possible that this binary or one of the
libraries it was linked against is corrupt, improperly built, or
misconfigured. This error can also be caused by malfunctioning
hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, something is
definitely wrong and this may fail.
Server version: 10.1.33-MariaDB
key_buffer_size=268431360
read_buffer_size=268431360
max_used_connections=30
max_threads=42
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22282919 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4209f1c008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4348db90b0 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55c19a7be10e]
/usr/sbin/mysqld(handle_fatal_signal+0x305)[0x55c19a2e1295]
sigaction.c:0(__restore_rt)[0x7f4348a835e0]
sql/sql_class.h:3406(sql_set_variables(THD*, List<set_var_base>*, bool))[0x55c19a0d2ecd]
sql/sql_list.h:179(base_list::empty())[0x55c19a14bcb8]
sql/sql_parse.cc:2007(dispatch_command(enum_server_command, THD*, char*, unsigned int))[0x55c19a15e85a]
sql/sql_parse.cc:1122(do_command(THD*))[0x55c19a160f37]
sql/sql_connect.cc:1330(do_handle_one_connection(THD*))[0x55c19a22d6da]
sql/sql_connect.cc:1244(handle_one_connection)[0x55c19a22d880]
pthread_create.c:0(start_thread)[0x7f4348a7be25]
/lib64/libc.so.6(clone+0x6d)[0x7f4346e1f34d]
Trying to get some variables. Some pointers may be invalid and cause
the dump to abort.
Query (0x0):
Connection ID (thread ID): 15894
Status: NOT_KILLED
Optimizer switch:
index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off
Query Log for connection ID:
180618 13:11:01
15894 Connect ****#piper**** as anonymous on
15894 Query show status
15894 Prepare show full processlist /* m6clone1 */
15894 Execute show full processlist /* m6clone1 */
15894 Close stmt
15894 Query show slave status
15894 Query show variables
15894 Quit
Related
I am running Airflow with a MariaDB backend, and periodically, when a DAG task is being scheduled, I notice the following error in the Airflow worker:
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away').
I am not sure whether the issue occurs due to a misconfiguration of Airflow, or whether it is because the backend is MariaDB, which, as I saw, is not a recommended database.
Also, in the MariaDB logs, I see the following warning repeating almost every minute:
[Warning] Aborted connection 305627 to db: 'airflow' user: 'airflow' host: 'hostname' (Got an error reading communication packets)
I've seen some similar issues mentioned, but whatever I have tried so far hasn't helped.
The question is: should I change the database to MySQL, or does some configuration need to be done on MariaDB's end?
Airflow v2.0.1
MariaDB 10.5.5
SQLAlchemy 1.3.23
Hard to say - you need to look for the reason why your DB connection gets aborted. MariaDB might work for quick testing with a single scheduler, but something is causing your connection to the DB to be dropped.
There are a few things you can do:
airflow has a db check command-line subcommand; you can run it to test whether the DB configuration is working - the errors you see when you try may make the problem obvious (see the sketch after this list)
airflow also has another useful command, db shell - it allows you to connect to the DB and, for example, run SQL queries. This might tell you whether your connection is "stable": you can connect, run some queries, and see whether the connection gets interrupted in the meantime
you can look at more logs and at your network connectivity to see whether you have problems there
finally, check whether you have enough resources to run Airflow + DB. Often things like that happen when you do not have enough memory, for example. In my experience, Airflow + DB requires at least 4 GB of RAM (depending on the DB configuration), and if you are on Mac or Windows and using Docker, the Docker VM by default has less memory available than that, so you need to increase it
look at other resources - disk space, memory, number of connections, etc.; any of them can be your problem
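As a concrete illustration (a hedged sketch - these are standard MariaDB status and server variables, nothing Airflow-specific), once connected through airflow db shell or any other client, queries like these show whether timeouts or packet limits are behind the aborted connections:
SHOW GLOBAL VARIABLES LIKE 'wait_timeout';        -- server-side idle timeout
SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';  -- oversized packets abort connections
SHOW GLOBAL STATUS LIKE 'Aborted_clients';        -- connections dropped mid-session
SHOW GLOBAL STATUS LIKE 'Aborted_connects';       -- failed connection attempts
If wait_timeout is shorter than the interval at which SQLAlchemy's pool recycles idle connections, the server kills pooled connections and the next use fails with 'MySQL server has gone away'.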
The MariaDB service stopped responding all of a sudden. It had been running continuously for more than 5 months without any issues. When we checked the MariaDB service status at the time of the incident, it showed as active (running) (service mariadb status). But we could not log into the MariaDB server; each login attempt simply hung without any response. All our web applications also failed to communicate with the MariaDB service. We also checked max_used_connections, and it was below the maximum value.
When going through the logs, we saw the error below (it had been triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using the usual commands (service mariadb stop), but we were able to forcefully kill the MariaDB process, after which we could bring the service back online.
What could be the reason for this failure? If you have already faced similar issues, please share your experience and the actions you took to prevent such failures in the future. Your feedback is much appreciated.
Our environment details are as follows:
Operating system: Red Hat Enterprise Linux 7
MariaDB version: 10.2.34-MariaDB-log MariaDB Server
I also face this issue on an AWS instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It has already happened 3 times, occasionally. Like you, I found no way to stop the service and had to reboot the machine to get it working again.
Logs at restart suggest some tables crashed and should be repaired.
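For what it's worth, the btr0sea.cc frames in the semaphore-wait message belong to InnoDB's adaptive hash index. If the hang recurs, one hedged experiment (only a guess from the stack, not a confirmed fix) is to disable that feature; innodb_adaptive_hash_index is a standard dynamic InnoDB variable on these versions:
SET GLOBAL innodb_adaptive_hash_index = OFF;  -- can be turned back ON the same way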
I am running a single high-traffic website on a high-end CentOS 7 VPS (16 vCores / 128 GB of RAM) with Plesk Onyx on a CentOS 7 / MariaDB 10.1 / PHP-FPM 5.6 setup.
Everything is usually smooth and fast, but it has happened twice in a year that the website went down with the message "Too Many Connections" from MariaDB.
Being in a hurry to restore the website, I ran service mariadb restart without actually running SHOW PROCESSLIST first.
I checked the MariaDB logs and web server logs afterwards and haven't found anything useful for troubleshooting the issue.
Note that when it happened the first time, I raised the max_connections value to 300 in my.cnf and constantly monitored the max_used_connections variable, seeing that the value never went over 50, so I guessed it happened because of some DDoS attack or malicious attempt.
Questions:
Any advice on how to troubleshoot this?
How can I be alerted when the max_used_connections value approaches the max_connections value? Any tool?
I am using the external Pingdom service to check website uptime, but it didn't detect this kind of problem (the web response is 200 OK); I also run a netdata instance on the server (https://netdata.io/), but that didn't help either.
Troubleshoot it by turning on the slow log, preferably with a low value for long_query_time (such as 1). Some naughty query will probably show up there.
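For example (both are standard server variables that can be changed at runtime; add the same settings to my.cnf so they survive a restart):
SET GLOBAL slow_query_log = ON;   -- start logging slow queries
SET GLOBAL long_query_time = 1;   -- log anything taking longer than 1 second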
Yes, do SHOW FULL PROCESSLIST next time. (Note "FULL".) Instead of restarting mysqld, look for the offending query: it will have one of the highest values in Time, and it probably won't be in the Sleep state. It may be something potentially long-running, like an ALTER or a dump. Killing that one process will probably uncork the problem, and the backlog will drain in, perhaps, seconds.
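Something like this, where 12345 stands in for whatever hypothetical ID the offending query shows:
SHOW FULL PROCESSLIST;  -- find the row with a large Time value that is not in Sleep
KILL 12345;             -- kill that one connection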
Deleting a file that is "open" by a process (such as mysqld) will not help -- disk space is not recycled until all processes have closed the file. Killing the process closes any open files. Some logs can be handled with FLUSH LOGS; -- this should be harmless, though it may not help.
If your tables are MyISAM, switching to InnoDB will avoid many cases of table locks (if that is what you are experiencing).
What is the value of innodb_buffer_pool_size? For that much RAM, about 80G is reasonable.
There might be some clues in the GLOBAL STATUS; see http://mysql.rjweb.org/doc.php/mysql_analysis#tuning for analyzing it. (Caution: It will be useless immediately after a reboot.)
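To answer the alerting question: a monitoring tool only needs to compare a couple of counters. A minimal sketch (these information_schema tables are standard on MariaDB; the 80% threshold is an arbitrary example):
-- near_limit is 1 when the historical peak exceeds 80% of the limit
SELECT m.VARIABLE_VALUE AS max_used_connections,
       v.VARIABLE_VALUE AS max_connections,
       (m.VARIABLE_VALUE / v.VARIABLE_VALUE) > 0.8 AS near_limit
FROM information_schema.GLOBAL_STATUS AS m
CROSS JOIN information_schema.GLOBAL_VARIABLES AS v
WHERE m.VARIABLE_NAME = 'MAX_USED_CONNECTIONS'
  AND v.VARIABLE_NAME = 'MAX_CONNECTIONS';
-- and to check the buffer pool setting mentioned above:
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';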
We have a process (written in C++/managed) which receives network data via TCP/IP.
After running the process for a while while tracking network load, it seems that the network gets into a frozen state and the process stops receiving data; other processes on the system that use networking (the same NIC) operate normally.
The process gets out of this frozen state by itself after several minutes.
Any idea what is happening?
Any counter I can track to see whether my process is reaching some limit?
It is going to be very difficult to answer specifically,
-- without knowing what exactly your process/application is about,
-- whether it is a network chat application, a file server/client, or ......
-- without other details about your process: how it is implemented, what libraries it uses, if relevant to the problem.
Also, you haven't mentioned what OS and environment you are running this process under, so there is very little anyone can do to help. It could be anything: a busy-wait loop in your code, locking problems if it's multi-threaded code, ....
Nonetheless, here are some options to check.
If it's Linux, try the commands below to debug and monitor the behaviour of the process and see what the problem could be:
top
Check top to see how much of the resources (CPU, memory) your process is using and whether anything, such as CPU usage, shows abnormally high values.
pstack
This shows the stack frames the process is executing at the time of the problem.
netstat
Run this with the necessary options (tcp/udp) to check the state of the network sockets opened by your process.
gcore -s -c
This forces your process to dump core when the mentioned problem happens; then analyze that core file using gdb.
gdb
Then use the command where at the gdb prompt to get a full backtrace of the process (which function it was executing last, and the previous function calls).
I open the database file and obtain a database connection using the open() method of sqlite3, and the connection is not closed until the program exits. If an unexpected error occurs, such as the computer suddenly powering off or an OS crash, will the database file be damaged, or its handle lost? More specifically, will it remain writable after I reboot my computer? BTW, I don't care about data loss when errors occur.
Thank you very much!
SQLite is specifically designed to protect against this. From the official SQLite is Transactional page:
All changes within a single transaction in SQLite either occur completely or not at all, even if the act of writing the change out to the disk is interrupted by a program crash, an operating system crash, or a power failure.
The claim of the previous paragraph is extensively checked in the SQLite regression test suite using a special test harness that simulates the effects on a database file of operating system crashes and power failures.
You might also be interested in the SQLite article Atomic Commit in SQLite if you need to know the specific details on how they protect against crashes such as the above.
Regarding writing after a crash: (from File Locking and Concurrency)
A hot journal is created when a process is in the middle of a database update and a program or operating system crash or power failure prevents the update from completing. Hot journals are an exception condition. Hot journals exist to recover from crashes and power failures. If everything is working correctly (that is, if there are no crashes or power failures) you will never get a hot journal.
The worst that can happen will be that you need to delete the hot journal that is left over after a crash.
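If you want reassurance after a crash, simply reopening the file lets SQLite roll the hot journal back automatically, and you can then verify the database with a standard pragma:
PRAGMA integrity_check;  -- returns a single row reading 'ok' if the file is intact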
As SQLite is ACID-compliant, a power-off shouldn't be an issue.
http://en.wikipedia.org/wiki/ACID
Anything could potentially happen on a sudden power-off. However, I'd suggest a UPS to mitigate any risk.