Below are my current values dealing with tcp and open files on my linux system:
$cat /proc/sys/fs/file-max # outputs 1,624,164.
$cat /proc/sys/net/ipv4/tcp_max_syn_backlog #outputs 1,048,576
$cat /proc/sys/net/core/somaxconn # output 65535
$ulimit -a # open files = 1,024,000, max user processes = 10,240
Q2:
I also have have the timeout in redis set to 0, tcp-keepalive set to 60, and tcp-backlog set to 65535. I am using predis, and i have timeout there set to 0, and the read_write_timeout set to -1. However, we periodically keep getting the error below.
2015-10-28 11:24:14 406309 cron-web Error while reading line from the server. [tcp://10.0.0.1:6379]
2015-10-28 19:15:13 0 web-billing-3 Error while reading line from the server. [tcp://10.0.0.1:6379]
2015-10-28 19:56:58 0 web-billing-3 Operation timed out [tcp://10.0.0.1:6379]
2015-10-29 10:02:25 437257 web-billing-1 Error while reading line from the server. [tcp://10.0.0.1:6379]
2015-10-29 12:03:54 439897 cron-web Error while reading line from the server. [tcp://10.0.0.1:6379]
2015-10-29 15:06:23 443772 web-billing-3 Error while reading line from the server. [tcp://10.0.0.1:6379]
I have tried changing the timeout inroads to 300 and still does not work. The linux system params are also set as demonstrated in question 1. All this has not helped the situation. Any suggestions please?
It seems this scenario has happened before w/PRedis and it could be the way it uses connections leading to IP connection tracking on the server getting overloaded. Or your scripts are firing up "too many" connections. Either way you can figure out fairly easily if this is the case.
On the Redis server, run dmsg| grep conntrack. If you see messages like ip_conntrack: table full, dropping packet then this is the problem. You can follow the steps in this article to fix it via altering your /proc/sys/net/ipv4/ip_conntrack_max settings to match/exceed the peaks you are seeing in /proc/sys/net/ipv4/netfilter/ip_conntrack_count.
Ultimately though this probably belongs on Superuser as it is likely to be a system-level issue.
Edit:
For determining concurrent connections you need to look at info clients and look at the current connections count. You'll need to track that over time to determine what the concurrency profile is in order to see if it could be the issue. Clearly the finer grained resolution you run (ie. the more often you check-and-store) the better your chances of catching a concurrency spike will be.
I really suspect the problem is with Predis as it has known connection management issues. if you can, try using phpredis to see if it continues to occur.
Related
In my application, I have an issue where I receive the "MySQL server has gone away" error during a quite long-running transaction. I know this has already been asked a lot, but I tried my test to go through all possible causes.
The one thing that baffles me a lot, is this error message in the log of the MariaDB server:
[Warning] Aborted connection 6 to db: 'default' user: 'root' host: '10.0.0.18' (Got timeout reading communication packets)
This would explain why the client reports a broken connection, but this error occurs 10-15 minutes before the client reports the "MySQL server has gone away" error. In the meantime, the client is happily running insert statements without an issue. But as soon the client runs a select statement, the statement fails practically immediately.
I have already checked for these possible causes:
The server was running all the time
wait_timeout is set to 8 hours, which is way longer than the time the transaction needs to fail
max_allowed_packet is set to 512M which should be more than enough since the query is a very short select statement
The server does not run out of memory
I'm pretty sure the issue must be related to the "Got timeout reading communication packets" error from the MariaDB log. But I cannot wrap my head around why the client still can write data. And why this timeout occurs in the first place, since the wait_timeout is super high.
Some system information:
I'm running on MariaDB 10.5.1
The client uses python 3.6 and mysqlclient, which uses libmysql, is used for the database connection
I hope maybe some of you have an idea what I should look for, because this is really driving me nuts.
Overview
I have a Go echo http server running with version 1.13.
$ go version
go version go1.13.7 linux/amd64
I'm monitoring a number of different statistics about the server, including the number of goroutines. I periodically see brief spikes of thousands of goroutines, when high load shouldn't cause it to exceed maybe a few hundred. These spikes do not correlate to an increase in http requests as logged by the labstack echo middleware.
To better debug this situation, I added a periodic check in the program which sends me a pprof report on the goroutines if the number spikes.
The added goroutines surprised me, as when the server is in "normal" operating mode, I see 0 goroutines of the listed functions.
goroutine profile: total 1946
601 # 0x4435f0 0x4542e1 0x8f09dc 0x472c61
# 0x8f09db net/http.(*persistConn).readLoop+0xf0b /usr/local/go/src/net/http/transport.go:2027
601 # 0x4435f0 0x4542e1 0x8f2943 0x472c61
# 0x8f2942 net/http.(*persistConn).writeLoop+0x1c2 /usr/local/go/src/net/http/transport.go:2205
601 # 0x4435f0 0x4542e1 0x8f705a 0x472c61
# 0x8f7059 net/http.setRequestCancel.func3+0x129 /usr/local/go/src/net/http/client.go:321
What I'm struggling with, however, is where these are coming from, what they indicate, and at what point in an http request would I expect them.
To my untrained eye, it looks as if something is briefly attempting to open a connection the immediately tries to close it.
But it would be good to have confirmation of this. In what part of an http request do readLoop, writeLoop, and setRequestCancel goroutines get started? What do these goroutines indicate?
Notes
A few things I've looked at:
I tried adding middleware to capture requests frequencies from IP addresses as they came in, and report on those when the spikes happen. To total request number remains low, in the 30-40 range even as this spike is happening. No IP address is anomalous.
I've considered executing something like lsof to find open connections but that seems like a tenuous approach at best, and relies on my understanding of what these goroutines mean.
I've tried to cross-correlate the timing of seeing this with other things on the network, but without understanding what could cause this, I can't make much sense of where the potential culprit may lie.
If the number of goroutines exceeds 8192, the program crashes with the error: race: limit on 8192 simultaneously alive goroutines is exceeded, dying. A search for this error gets me to this github issue, which feels relevant because I am, in fact, using gorilla websockets in the program. However, the binary was compiled with -race and no race condition is spit out along with my error, which is entirely different from the aforementioned question.
I am working with IBM MQ. I managed to get a basic Handshake / Put Message(s) / Get Message(s) / Disconnect .net solution going on, a couple of days ago, but it only works on a local level, and I now need to update the solution so it works remotely as well.
After reading and experimenting for a while, I decided to follow IBM Knowledge Center's Point to Point scenario step by step. However, I can't start the Sender Channel as instructed by the guide's last step; the Sender Channel's status ping-pongs between Binding and Retrying, and the logs come up with the following error codes; AMQ9002, AMQ9202 and AMQ9999, meaning, as far as I can tell, there is some kind of trouble finding and/or connecting with the host, as explained by the error logs.
I have looked through a lot of questions regarding these errors in particular, but while I have followed most of the proposed solutions (I made sure the Receiver's listener is running, I tried turning off Firewalls, I tried with different ports, I have performed tests Telnet, I have stopped/restarted/resolved the Sender channel a few times, and I have tried setting this up from both, the command line and MQ Explorer), I have yet to get a successful communication going on between two different PCs.
I am aware the error could be either temporary, or the result of problems within the Network itself, but I have been trying to establish a successful connection for almost three days now, and before I pass this unto my bosses I would like to make sure I have exhausted every other possibility.
How can I complete IBM's Point To Point set up guide, or is there anything that could point me towards a different / better approach to get two PCs talking with each other via IBM MQ v9?
Although hastily translated from Japanese, you can find the detailed error logs below.
2017/09/19 17:34:09 - Process (234212.1) User (MUSR_MQADMIN) Program
(runmqchl.exe)
Host (DESKTOP - UP 4 D 363) Installation (Installation 1)
VRMF (9.0.3.0) QMgr (QM 1)
Time (2017-09-19T08: 34: 09.201 Z)
AMQ9002: Channel 'TO.QM2' is starting.
Description: Channel 'TO.QM2' is starting.
ACTION: None.
2017/09/19 17:34:30 - Process (234212.1) User (MUSR_MQADMIN) Program
(runmqchl.exe)
Host (DESKTOP - UP4D363) Installation (Installation 1)
VRMF (9.0.3.0) QMgr (QM 1)
Time (2017-09-19T08: 34: 30.824Z)
AMQ 9202: The remote host 'DESKTOP-1AV4LM3 (The correct ip address) (1415)' can not be used.Please try again later.
Description: Using TCP / IP to host 'DESKTOP-1AV4LM3 (The correct ip
address) of channel TO.QM2 (1415) 'trying to allocate a conversation,
but it did not succeed. However, It is temporary and there is also the
possibility that TCP / IP conversation can be allocated normally
later.
If the remote host can not be determined, '????' is displayed. .
ACTION: Please try the connection later. If the failure persists,
record the error value Please contact the stem administrator. The
return code from TCP / IP is 10060 (X'274C ').The cause of this
failure may be that the host can not reach the destination host.
Alternatively, There is a possibility that the host 'DESKTOP-1AV4LM3
(The correct ip address) (1415)' listener isn't running. If that is
the case, start the listener and try again.
2017/09/19 17:34:30 - Process (234212.1) User (MUSR_MQADMIN) Program (runmqchl.exe)
Host (DESKTOP - UP 4 D 363) Installation (Installation 1)
VRMF (9.0.3.0) QMgr (QM 1)
Time (2017-09-19T08: 34: 30.825Z)
AMQ9999: Channel 'TO.QM2' for host 'DESKTOP-1AV4LM3 (1415)' terminated abnormally
Description: The host 'DESKTOP-1AV4LM3 (1415)' cannot be determined.
ACTION: Check the error log for the preceding error message for
this channel program Please determine the cause of failure....
".
The 'interesting' bit of the error messages above is that the sender is attempting to start a channel to port 1415 on the destination and is getting a 10060 return code (WSAETIMEDOUT). This is different from an immediate rejection because the other end doesnt have a socket open, for example.
You will also note its timing out after about 21 seconds if your times are to be believed. The only time I've seen this kind of things is DNS resolution - There was an APAR for example showing that reverse DNS can cause delays in channel startup, and this could be for a successful or unsuccessful startup
http://www-01.ibm.com/support/docview.wss?uid=swg1IC96408
A new attribute was added to MQ to disable reverse DNS lookups if its the cause - See https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_8.0.0/com.ibm.mq.pro.doc/q113120_.htm#q113120___chlauth
If this is the case, on the receiving end (or both!) try runmqsc , 'ALTER QMGR REVDNS(DISABLED)'. You might have to restart the qmgr for it to be effective (I'm not sure, sorry)
I'd also echo the comment added to your question by JoshMc, to check the receiving end logs for messages (both global errors but more likely the qmgr specific AMQERR01.LOG files) when this occurs - I have a feeling that the timeout is only part of your problem.
I'm using syslog to log data to a file - the data is pretty intensive, the order of thousands of rows every few seconds. What I observe is that trace amounts of logs are being missed - less than 0.1 % most of the times - but they're still missing. I have no explanation for why this occurs.
It doesn't seem to correlate directly to the amount of data being written because increasing the amount of data being written did not increase the rate of missed logs.
I'm wondering of ways to debug this - how could we understand or confirm if it is indeed syslog which is dropping data and if so why?
If you look at the source code for syslogd, you will see that the syslogd program only uses datagram sockets (type SOCK_DGRAM). These are by definition connectionsless but also not completely reliable in the sense that stream sockets are.
This is by design. Using stream sockets would mean that the syslog() call would have to wait for a confirmation that the message that it sent was received properly. So if syslogd was busy, every application that calls syslog() would block.
Syslogd was simply not designed with the volume of data that you are subjecting it to in mind. You could try enlarging the value of the sysctl variable kern.ipc.maxsockbuf, giving the logging socket a larger buffer.
If you want to make sure you capture everything, write to a file instead.
Any thoughts on why I might be getting tons of "hangs" when trying to download a file via HTTP, based on the following?
Server is IIS 6
File being downloaded is a binary file, rather than a web page
Several clients hang, including TrueUpdate and FlexNet web updating packages, as well as custom .NET app that just does basic HttpWebRequest/HttpWebResponse logic and downloads using a response stream
IIS log file signature when success is 200 0 0 (sc-status sc-substatus sc-win32-status)
For failure, error signature is 200 0 64
sc-win32-status of 64 is "the specified network name is no longer available"
I can point firefox at the URL and download successfully every time (perhaps some retry logic is happening under the hood)
At this point, it seems like either there's something funky with my server that it's throwing these errors, or that this is just normal network behavior and I need to use (or write) a client that is more resilient to the failures.
Any thoughts?
Perhaps your issue was a low level networking issue with the ISP as you speculated in your reply comment. I am experiencing a similar problem with IIS and some mysterious 200 0 64 lines appearing in the log file, which is how I found this post. For the record, this is my understanding of sc-win32-status=64; I hope someone can correct me if I'm wrong.
sc-win32-status 64 means βThe specified network name is no longer available.β
After IIS has sent the final response to the client, it waits for an ACK message from the client.
Sometimes clients will reset the connection instead of sending the final ACK back to server. This is not a graceful connection close, so IIS logs the β64β code to indicate an interruption.
Many clients will reset the connection when they are done with it, to free up the socket instead of leaving it in TIME_WAIT/CLOSE_WAIT.
Proxies may have a tendancy to do this more often than individual clients.
I've spent two weeks investigating this issue. For me I had the scenario in which intermittent random requests were being prematurely terminated. This was resulting in IIS logs with status code 200, but with a win32-status of 64.
Our infrastructure includes two Windows IIS servers behind two NetScaler load balancers in HA mode.
In my particular case, the problem was that the NetScaler had a feature called "Intergrated Caching" turned on (http://support.citrix.com/proddocs/topic/ns-optimization-10-5-map/ns-IC-gen-wrapper-10-con.html).
After disabling this feature, the request interruptions ceased. And the site operated normally. I'm not sure how or why this was causing a problem, but there it is.
If you use a proxy or a load balancer, do some investigation of what features they have turned on. For me the cause was something between the client and the server interrupting the requests.
I hope that this explanation will at least save someone else's time.
Check the headers from the server, especially content-type, and content-length, it's possible that your clients don't recognize the format of the binary file and hang while waiting for bytes that never come, or maybe they close the underlying TCP connection, which may cause IIS to log the win32 status 64.
Spent three days on this.
It was the timeout that was set to 4 seconds (curl php request).
Solution was to increase the timeout setting:
//curl_setopt($ch, CURLOPT_TIMEOUT, 4); // times out after 4s
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // times out after 60s
You will have to use wireshare or network monitor to gather more data on this problem. Me think.
I suggest you put Fiddler in between your server and your download client. This should reveal the differences between Firefox and other cients.
Description of all sc-win32-status codes for reference
https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
ERROR_NETNAME_DELETED
64 (0x40)
The specified network name is no longer available.