Mariadb: MySQL server has gone away - mariadb

In my application, I have an issue where I receive the "MySQL server has gone away" error during a quite long-running transaction. I know this has already been asked a lot, but I tried my test to go through all possible causes.
The one thing that baffles me a lot, is this error message in the log of the MariaDB server:
[Warning] Aborted connection 6 to db: 'default' user: 'root' host: '10.0.0.18' (Got timeout reading communication packets)
This would explain why the client reports a broken connection, but this error occurs 10-15 minutes before the client reports the "MySQL server has gone away" error. In the meantime, the client is happily running insert statements without an issue. But as soon the client runs a select statement, the statement fails practically immediately.
I have already checked for these possible causes:
The server was running all the time
wait_timeout is set to 8 hours, which is way longer than the time the transaction needs to fail
max_allowed_packet is set to 512M which should be more than enough since the query is a very short select statement
The server does not run out of memory
I'm pretty sure the issue must be related to the "Got timeout reading communication packets" error from the MariaDB log. But I cannot wrap my head around why the client still can write data. And why this timeout occurs in the first place, since the wait_timeout is super high.
Some system information:
I'm running on MariaDB 10.5.1
The client uses python 3.6 and mysqlclient, which uses libmysql, is used for the database connection
I hope maybe some of you have an idea what I should look for, because this is really driving me nuts.

Related

How to recover with a retry from gremlin NoHostAvailableException

I am using Gremlin Java driver to connect to a local gremlin server.
Simple code flow
Creating client
Cluster cluster = Cluster.build().addContactPoint(<endp>).port(<port>).enableSsl(false).create()
Client client = cluster.connect();
Submit Script
client.submit("g.V().count()");
If when i submit the first time the Gremlin server is down, on subsequent retries after bringing back gremlin server, connection still fails to create.
Exception First attempt when Gremlin Server is down:
org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions. Check the error log to find the actual reason
Exception After Gremlin server is brought back up:
tinkerpop.gremlin.driver.exception.NoHostAvailableException: All hosts are considered unavailable due to previous exceptions
One thing to note is i do not create client on retry just do
Submit Script
client.submit("g.V().count()");
It is quite possible that Gremlin server may go down anytime, how to recover in such circumstances. Fundamentally is
NoHostAvailableExceptio
recoverable?
Normally, the Client should attempt to reconnect to a host that is previously marked unavailable. By default, it should be retrying the host at 1 second intervals as governed by this configuration: connectionPool.reconnectInterval. In your case, however I think you've uncovered a bug where the reconnect attempts are not started because the Client was never able to reach the host in the first place. As of 3.4.11, you can only remedy this by recreating the Client as you noted in your comments. I've created an issue to track this problem here: TINKERPOP-2569

Timer_ConnectionIdle IIS

I get Timer_ConnectionIdle message from error logs of httperror folder in system32/logfiles.
And sometimes the web page return service unavailable or connection refused.
What is the problem?
How can I solve that?
Two Different issues that you are talking about.
Timer connection idle is not something you need to be worried about. It is HTTP.SYS's way of telling you that the client with which it established a connection did not disconnect because there is always a chance that the client would want to establish the connection again. I think it usually waits for 2 minutes before terminating the connection and that is when you get this message in the HTTPERR logs.
Now coming to Service Unavailable and Connection Timeout errors, this is something that you need to take note of. Check for event logs during the time of issue and see if you find anything there.
If you are unable to find anything in the event logs, my next question would be to identify what is done in order to overcome the issue ? Do you recycle the application pool to get the application up and running ? Do you reset IIS ? If you do any of the above, then please capture a full user dump of w3wp process using debug diag during the time of issue (before performing an iisreset or application pool recycle). Analyzing the dump will tell you exactly whats going wrong.
Feel free to follow up with any questions you have.

How to reconnect and continue normaly on "Error: MySQL server has gone away" in nodesql and MariaDB

I get Error: MySQL server has gone away from time to time using MariaDB and mariasql. The application stops working logically. Now I wonder if there is a way to catch this error, reconnect and continue normal operation.
After some research, I found that mariasql reconnects automatically, but make sure you listen to the error event, or your application will stop due to uncaught exception error.

ODP.NET Connection Pooling Issues - Fault Tollerance After Database Goes Down

I have an WebAPI service using ODP.NET to make connections to several oracle databases. Normally the web service would be hit several times a second and will never have long periods on inactivity. In our test site however, we did not use it for 2-3 days. This morning, we hit the service and got "connection request timeout" exceptions from ODP.NET, suggesting that the connection pool was out of available connections. We are closing the connections after use. The service was working fine before the period, but today the very first query got the timeout exception. Our app pool in IIS is configured to never reset.
My question then is, what can cause the connection pool to fill with bad connections after a period of inactivity, where these connections are not cleaned up in the usual 3 minute cycle? It only happened to 2 out of the 3 of our databases, and Validate Connection=true is set for all of them.
EDIT
So after talking to the DBA, there is some different between a connection/session being killed manually or by timeout and the database server severing the TCP connections. In this case, the TCP connection was severed as part of a regular backup (why is not important for this). I guess this happens when the whole database server goes offline at once. The basis of the question still applies I think though: why is ODP.NET unable to cleanup severed connections overtime? There is a performance counter that refers to "Stasis" connections, could those connections be stuck in that state? I would think that it should be able to see that a connection is no longer active (Validate Connection=True), kill it and not return it to the pool.
Granted, this problem can be solved by just resetting the app pool everything the database goes down. I would still like to configure ODP.NET connection pooling to be more fault tolerant.
I have run into this same issue, and the only solution I have found is to use the Connection Lifetime connection string parameter in conjunction with Validate Connection.
In my particular case, the connection timeout was set at the server and the connections in the pool would timeout, but not be sniped out of the pool, resulting in errors.
Setting both the Connection Lifetime and the Validate Connection parameters has resolved the issue.
Make sure the Connection Lifetime value that you choose is less than the server connection inactivity timeout.
The recommended solution is to use ODP.NET Fast Connection Failover (FCF). FCF will automatically remove invalid connections from the pool such that you don't need to use Validate Connection, Connection Lifetime, nor clear the pool.
To use FCF, set "HA events=true", use connection pooling, and have your DBA set up Fast Application Notification (FAN) on the server side. FAN is what alerts the ODP.NET pool when a DB service or node goes down or rebooted. Upon receiving the message, ODP.NET knows which connections to remove from the pool and removes them, leaving all other valid connections untouched.
Something else is going on here. Min Pool Size and some of the other settings help when the connection is severed from things like DBA configured idle timeouts and firewall tcp idle timeouts, 'connection request timeout' occurs when created a new connection.
This could be simple network problem. There could be something interfering with dns resolution of the servers. Another case is not having fully qualified entries in tnsnames. I've been bit by the latter a couple of times.
The other issue is the one you've already recognized - full pool.
Double check that you don't have a connection leak somewhere. A missing .Close is one thing but if you're not using a 'using' statement, a try/finally is required as an unhandled exception could be thrown prior to the .Close.
I would use perfmon to monitor some of the connection statistics to start - NumberOfPooledConnections, NumberOfActiveConnections, etc:

network error (Tcp error)

I am inside a network where I need proxy settings to access the internet.
I have a weird problem.
The internet is working fine.
But it is one particular instance when i get this error:
Network Error (tcp_error)
A communication error occurred: "Operation timed out"
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.
For assistance, contact your network support team.
This happens when I use hadoop in local mode.
I can access the UI interface. I can see the jobs running. but when I try to see the logs of each task.. i am not able to access those logs.
UI--> job-->map--> task--> all <-- this is where the error is..
Any clues?
THanks
Not sure about exactly what your tcp action is, or about Hadoop or your proxy setup, but if you can reliably repeat the error, and the timeout error happens at approximately the same time each time you test, and that time is on the order of minutes, my guess would be that you've got a true processing delay (perhaps caused by blocking somewhere) at the server, but not necessarily.

Resources