What opens persistConn when running a Go server? - http

Overview
I have a Go echo http server running with version 1.13.
$ go version
go version go1.13.7 linux/amd64
I'm monitoring a number of different statistics about the server, including the number of goroutines. I periodically see brief spikes of thousands of goroutines, when high load shouldn't cause it to exceed maybe a few hundred. These spikes do not correlate to an increase in http requests as logged by the labstack echo middleware.
To better debug this situation, I added a periodic check in the program which sends me a pprof report on the goroutines if the number spikes.
The added goroutines surprised me, as when the server is in "normal" operating mode, I see 0 goroutines of the listed functions.
goroutine profile: total 1946
601 # 0x4435f0 0x4542e1 0x8f09dc 0x472c61
# 0x8f09db net/http.(*persistConn).readLoop+0xf0b /usr/local/go/src/net/http/transport.go:2027
601 # 0x4435f0 0x4542e1 0x8f2943 0x472c61
# 0x8f2942 net/http.(*persistConn).writeLoop+0x1c2 /usr/local/go/src/net/http/transport.go:2205
601 # 0x4435f0 0x4542e1 0x8f705a 0x472c61
# 0x8f7059 net/http.setRequestCancel.func3+0x129 /usr/local/go/src/net/http/client.go:321
What I'm struggling with, however, is where these are coming from, what they indicate, and at what point in an http request would I expect them.
To my untrained eye, it looks as if something is briefly attempting to open a connection, then immediately tries to close it.
But it would be good to have confirmation of this. In what part of an http request do readLoop, writeLoop, and setRequestCancel goroutines get started? What do these goroutines indicate?
Notes
A few things I've looked at:
I tried adding middleware to capture request frequencies from IP addresses as they came in, and report on those when the spikes happen. The total request number remains low, in the 30-40 range, even as this spike is happening. No IP address is anomalous.
I've considered executing something like lsof to find open connections but that seems like a tenuous approach at best, and relies on my understanding of what these goroutines mean.
I've tried to cross-correlate the timing of seeing this with other things on the network, but without understanding what could cause this, I can't make much sense of where the potential culprit may lie.
If the number of goroutines exceeds 8192, the program crashes with the error: race: limit on 8192 simultaneously alive goroutines is exceeded, dying. A search for this error gets me to this github issue, which feels relevant because I am, in fact, using gorilla websockets in the program. However, the binary was compiled with -race and no race condition is spit out along with my error, which is entirely different from the aforementioned question.

Related

How should a game server receive udp packets with a defined tick rate?

I currently have a game server with a customizable tick rate, but for this example let's suggest that the server is only going to tick once per second or 1hz. I'm wondering what's the best way to handle incoming packets if the client send rate is faster than the server's as my current setup doesn't seem to work.
I have my udp blocking receive with a timeout inside my tick function, and it works, however if the client tick rate is higher than the server, all of the packets are not received; only the one that is being read at the current time. So essentially the server is missing packets being sent by clients. The image below demonstrates my issue.
So my question is, how is this done correctly? Is there a separate thread where packets are read constantly, queued up and then the queue is processed when the server ticks or is there a better way?
Image was taken from a video https://www.youtube.com/watch?v=KA43TocEAWs&t=7s but demonstrates exactly what I'm explaining
There's a bunch going on with the scenario you describe, but here's my guidance.
If you are running a server at 1hz, having a blocking socket prevent your main logic loop from running is not a great idea. There is a chance you won't be receiving messages at the rate you expect (due to packet loss, a network lag spike, or the client app closing).
A) you certainly could create another thread, continue to make blocking recv/recvfrom calls, and then enqueue them onto a thread safe data structure
B) you could also just use a non-blocking socket, and keep reading packets until it returns -1. The OS will buffer (usually configurable) a certain number of incoming messages, until it starts dropping them if you aren't reading.
Either way is fine; however, for individual game clients, I prefer the second, simpler approach when I know I'm on a thread that is servicing the socket at a reasonable rate (5hz seems pretty low, but may be appropriate for your game). If there's a chance you are stalling the servicing thread (level loading, etc.), then go with the first approach, so you don't detect it as a disconnection if you miss sending/receiving a periodic keepalive message.
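As a rough illustration of approach A, here is a sketch in Go, where a goroutine and a buffered channel play the role of the reader thread and the thread-safe queue. The `packet` type, `readLoop`, `drain`, the queue size, and the 100 ms tick are all assumptions made for the example, not part of the question's code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// packet is a hypothetical container for one received datagram.
type packet struct {
	data []byte
	from net.Addr
}

// readLoop blocks on ReadFrom and queues every datagram it receives.
// It returns when the socket is closed.
func readLoop(conn net.PacketConn, queue chan<- packet) {
	buf := make([]byte, 1500)
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil {
			return
		}
		p := packet{data: append([]byte(nil), buf[:n]...), from: addr}
		select {
		case queue <- p:
		default: // queue full: drop the packet rather than stall the reader
		}
	}
}

// drain returns every packet queued so far without blocking.
func drain(queue <-chan packet) []packet {
	var batch []packet
	for {
		select {
		case p := <-queue:
			batch = append(batch, p)
		default:
			return batch
		}
	}
}

func main() {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	queue := make(chan packet, 1024)
	go readLoop(conn, queue)

	// Simulate a client that sends several packets within one server tick.
	client, err := net.Dial("udp", conn.LocalAddr().String())
	if err != nil {
		panic(err)
	}
	defer client.Close()
	for i := 0; i < 5; i++ {
		client.Write([]byte{byte(i)})
	}

	time.Sleep(100 * time.Millisecond) // one (short) server tick
	batch := drain(queue)
	fmt.Println("packets processed this tick:", len(batch))
}
```

The tick function only ever calls `drain`, so it never blocks on the socket, and everything that arrived since the previous tick is handled in one batch.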
On the server side, if I'm planning on a large number of clients/data, I go to great lengths to efficiently read from sockets - using IO Completion Ports on Windows, or epoll() on Linux.
Your server would need to have a thread to tick every 5 seconds just like the client to receive all the packets. Anything not received during that tick would be dropped, as the server was not listening for it. You can then pass the data over from the thread after 5 ticks to the server as one chunk. The more reliable option, though, is to set the server to 5hz just like the client and thread every packet that comes in from the client so that it does not lock up the main thread.
For example, if the client update rate is 20, and the server tick rate is 64, the client might as well be playing on a 20 tick server.

syslog drops logs silently

I'm using syslog to log data to a file - the data is pretty intensive, the order of thousands of rows every few seconds. What I observe is that trace amounts of logs are being missed - less than 0.1 % most of the times - but they're still missing. I have no explanation for why this occurs.
It doesn't seem to correlate directly to the amount of data being written because increasing the amount of data being written did not increase the rate of missed logs.
I'm wondering of ways to debug this - how could we understand or confirm if it is indeed syslog which is dropping data and if so why?
If you look at the source code for syslogd, you will see that the syslogd program only uses datagram sockets (type SOCK_DGRAM). These are by definition connectionless, and also not completely reliable in the way that stream sockets are.
This is by design. Using stream sockets would mean that the syslog() call would have to wait for a confirmation that the message that it sent was received properly. So if syslogd was busy, every application that calls syslog() would block.
Syslogd was simply not designed with the volume of data that you are subjecting it to in mind. You could try enlarging the value of the sysctl variable kern.ipc.maxsockbuf, giving the logging socket a larger buffer.
If you want to make sure you capture everything, write to a file instead.

HTTP: What is better: large timeout or multiple retries?

I'm doing HTTP GET-requests from mobile devices (so network connection usually is not reliable) and wondering what would be a better approach:
Try 1 request with a timeout of 60 sec or
Try 3 requests each with a timeout of 20 secs
Or any other combination of retries/timeouts. I don't know if an HTTP/TCP connection can actually stall, in which case a retry would be a good thing. I don't transfer a lot of data (< 1 kB) and am wondering which approach usually yields a faster response time.
As long as it's an idempotent operation, it should be OK to retry more frequently in theory.
(a GET should never have any side effect at all, to be honest.)
It might still put unnecessary load on the server, and delayed responses to multiple retransmits of the request may saturate the downlink and make the situation worse.
In interactive applications I find an honest "it's taking longer than normal" notification with a user-triggerable "retry" the best: The user has the option to press the "retry" button after exiting the tunnel or building that was causing a short network outage.
Conversely out in the woods with constantly low throughput, they will learn to ignore the notification and wait patiently.

Can someone interpret these apache bench results, is there something that stands out?

Below is an Apache Bench run for 10K requests with 50 concurrent threads.
I need help understanding the results, does anything stand out in the results that might be pointing to something blocking and restricting more requests per second?
I'm looking at the connection time section, and see 'waiting' and 'processing'. It shows the mean time for waiting is 208, the mean time to connect is 0, and processing is 208, yet the total is 208. Can someone explain this to me, as it doesn't make much sense to me?
Connect time is the time it took ab to establish a connection with your server. You are probably running it on the same server or within the LAN, so your connect time is 0.
Processing time is the total time the server took to process and send the complete response.
Wait time is the time between sending the request and receiving the first byte of the response.
Again, since you are running on the same server, and the file is small, your processing time == wait time.
For a real benchmark, try ab from multiple points near your target market to get a real idea of latency. Right now all the info you have is the wait time.
This question is getting old, but I've run into the same problem so I might as well contribute an answer.
You might benefit from disabling either TCP Nagle on the agent side, or ACK delay on the server side. They can interact badly and cause an unwanted delay. As in my case, that's probably why your minimum time is exactly 200ms.
I can't confirm, but my understanding is that the problem is cross-platform since it's part of the TCP spec. It might be just for quick connections with a small amount of data sent and received, though I've seen reports of issues for larger transfers too. Maybe somebody who knows TCP better can pitch in.
Reference:
http://en.wikipedia.org/wiki/TCP_delayed_acknowledgment#Problems
http://blogs.technet.com/b/nettracer/archive/2013/01/05/tcp-delayed-ack-combined-with-nagle-algorithm-can-badly-impact-communication-performance.aspx

TCP Connection Life

How long can I expect a client/server TCP connection to last in the wild?
I want it to stay permanently connected, but things happen, so the client will have to reconnect. At what point do I say that there's a problem in the code rather than there's a problem with some external equipment?
I agree with Zan Lynx. There's no guarantee, but you can keep a connection alive almost indefinitely by sending data over it, assuming there are no connectivity or bandwidth issues.
Generally I've gone for the application level keep-alive approach, although this has usually been because it was in the client spec, so I've had to do it. But just send some short piece of data every minute or two, to which you expect some sort of acknowledgement.
Whether you count one failure to acknowledge as the connection having failed is up to you. Generally this is what I have done in the past, although there was a case where I had to wait for three failed responses in a row before dropping the connection, because the app at the other end of the connection was extremely flaky about responding to "are you there?" requests.
If the connection fails, which at some point it probably will, even with machines on the same network, then just try to reestablish it. If that fails a set number of times then you have a problem. If your connection persistently fails after it's been connected for a while then again, you have a problem. Most likely in both cases it's probably some network issue, rather than your code, or maybe a problem with the TCP/IP stack on your machine (has been known: I encountered issues with this on an old version of QNX--it'd just randomly fall over). Having said that you might have a software problem, and the only way to know for sure is often to attach a debugger, or to get some logging in there. E.g. if you can always connect successfully, but after a time you stop getting ACKs, even after reconnect, then maybe your server is deadlocking, or getting stuck in a loop or something.
What's really useful is to set up a series of long-running tests under a variety of load conditions, from just sending the keep alive are you there?/ack requests and responses, to absolutely battering the server. This will generally give you more confidence about your software components, and can be really useful in shaking out some really weird problems which won't necessarily cause a problem with your connection, although they might result in problems with the transactions taking place. For example, I was once writing a telecoms application server that provided services such as number translation, and we'd just leave it running for days at a time. The thing was that when Saturday came round, for the whole day, it would reject every call request that came in, which amounted to millions of calls, and we had no idea why. It turned out to be because of a single typo in some date conversion code that only caused a problem on Saturdays.
Hope that helps.
I think the most important idea here is theory vs. practice.
The original theory was that the connections had no lifetimes. If you had a connection, it stayed open forever, even if there was no traffic, until an event caused it to close.
The new theory is that most OS releases have turned on the keep-alive timer. This means that connections will last forever, as long as the system on the other end responds to an occasional TCP-level exchange.
In reality, many connections will be terminated after time, with a variety of criteria and situations.
Two really good examples are: The remote client is using DHCP, the lease expires, and the IP address changes.
Another example is firewalls, which seem to be increasingly intelligent, and can identify keep-alive traffic vs. real data, and close connections based on any high level criteria, especially idle time.
How you want to implement reconnect logic depends a lot on your architecture, the working environment, and your performance goals.
It shouldn't really matter, you should design your code to automatically reconnect if that is the desired behavior.
There really is no way to tell. There is nothing inherent to TCP that would cause the connection to just drop after a certain amount of time. Someone on a reliable connection could have years of uptime, while someone on a different connection could have to reconnect every 5 minutes. There is no way to tell or even guess.
You will need some data going over the connection periodically to keep it alive - many OSes or firewalls will drop an inactive connection.
Pick a value. One drop every hour is probably fine. Ten unexpected connection drops in 5 minutes probably indicates a problem.
TCP connections will generally last about two hours without any traffic. Either end can send keep-alive packets, which are, I think, just an ACK on the last received packet. This can usually be set per socket or by default on every TCP connection.
An application level keep-alive is also possible. For a telnet-style protocol like FTP, SMTP, POP or IMAP, this can be something like sending a return/newline and getting back a command prompt.

Resources