What happens to a waiting WebSocket connection at the TCP level when the server is busy (blocked) - unix

I am load testing my WebSocket Tornado server, running on Ubuntu Server 14.04.
I am playing with a big client machine loading 60,000 users, 150 a second (that's what my small server can comfortably take). Client is a RedHat machine. When a load test suite finishes, I have to wait a few seconds to be able to rerun.
Within these few seconds, my WebSocket server is handling the closing of the 60,000 connections. I can see it in my Graphite dashboard (the server logs every connect and disconnect there).
I am also logging relevant output of the netstat -s and ss -s commands to my Graphite dashboard. When the test suite finishes, I can immediately see TCP established sockets dropping from 60,000 to ~0. Other socket states (closed, timewait, synrecv, orphaned) remain constant and very low. My client's sockets go to timewait for a short period, and then this number goes to 0 too. When I immediately rerun the suite, with all the TCP sockets on both ends free but the server not yet finished processing the previous batch of closes, I see no changes at the TCP socket level until the server finishes processing and starts accepting new connections again.
My question is - where is the information about the sockets waiting to be established stored (RedHat and Ubuntu)? No counter/queue length that I am tracking shows this.
Thanks in advance.
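For anyone wanting to reproduce this in miniature: a listening TCP socket has a kernel-side queue for connections whose handshake has completed but which the application has not yet accept()ed, sized by the listen() backlog. The sketch below is not from the question. It assumes a loopback listener that never calls accept(), standing in for a blocked server process, and the function name is hypothetical. Whether this queue explains the counters described above is not established here.

```python
import socket

def connect_while_server_blocked(backlog=8, clients=3):
    """Open `clients` connections to a listener that never calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(backlog)          # kernel may queue completed handshakes
    port = listener.getsockname()[1]

    connected = []
    for _ in range(clients):
        # connect() returns once the kernel finishes the handshake,
        # even though the "server" code never accepts the connection.
        connected.append(socket.create_connection(("127.0.0.1", port), timeout=2))

    count = len(connected)
    for c in connected:
        c.close()
    listener.close()
    return count
```

On Linux, the current length of this queue for a listening socket is reported in the Recv-Q column of ss -ltn, which may be worth tracking alongside the ss -s summary.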

Related

.NET ThreadPool tasks queued while pool not exhausted

Question
What can cause tasks to be queued in Thread Pool while there are plenty threads still available in pool?
Explanation
Our actual code is too big to post, but here is the best approximation:
long running loop
{
    create Task 1
    {
        HTTP Post request (async)
        Wait
    }
    create Task 2
    {
        HTTP Post request (async)
        Wait
    }
    Wait for Tasks 1 & 2
}
The issue is that these HTTP requests which usually take 110-120 ms sometimes take up to 800-1100 ms.
Before you ask:
Verified no delays on server side
Verified no delays on the network layer (tcpdump + Wireshark). When such delays occur, we see pauses between requests, while the TCP-level turnaround still fits in 100 ms
Important info:
We run it on Linux.
This happens only when we run the service in container on k8s or docker.
If we move it outside container it works just fine.
How do we know it's not ThreadPool starvation?
We have added logging values returned by ThreadPool.GetAvailableThreads and we have values of 32k and 4k for available threads.
How do we know the tasks are queued?
We run the dotnet-counters tool and see queue sizes of up to 5 in the same second that the issue occurs.
Side notes:
we control the network, and we are 99.999% sure it is not the cause (because you can never be sure...)
process is not CPU throttled
the process usually has 25-30 threads in total at any given time
when running on k8s/docker we tried both container and host network - no change.
HttpClient notes:
We are using this HTTP client: https://learn.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=net-6.0
Client instances are created before we launch the loop.
These are HTTP, not HTTPS requests
URLs are always the same per task, server is given as IP, like this http://1.2.3.4/targetfortaskX
Generally, using tcpdump and Wireshark we observe two TCP streams that are opened and live through the whole execution, and every request is assigned to one of these two streams with keep-alive. So there are no delays from DNS, TCP SYN, or source port exhaustion.

How does a websocket know the server was taken down?

I was playing with websocket a bit (using Sails.js with its built-in socket thing, which is based on Socket.io).
I noticed Chrome receives two frames every 25 seconds. I thought this was some kind of polling to tell the connection was still on.
But then, I cancelled the server and Chrome was notified immediately.
Also I closed the Node process by force with the kill command, and still Chrome was notified, so that means it wasn't Node sending a signal before shutting down the server.
How does this happen?
Normal TCP socket connections do this, so it'd be surprising if websockets didn't.
The server kernel is responsible for cleaning up when the server process dies/exits/is killed. This includes releasing memory, closing files, and shutting down sockets. Cleanly shutting down a TCP socket requires sending a message to tell the peer.
Interestingly, on some old versions of Windows with userspace winsock, this didn't happen if the server process crashed. On all OS with compliant TCP support, it should be guaranteed unless the kernel itself hangs, the machine loses power, or the network breaks.
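The behaviour described above can be checked with a small experiment. This is a sketch under stated assumptions, not the asker's setup: a child Python process (the CHILD script and the function name are hypothetical) accepts one loopback connection and then sleeps, the parent kills it without giving it a chance to run any cleanup, and the client side then reads from its socket.

```python
import socket
import subprocess
import sys

# Child: accept one connection, then sleep without ever closing it.
CHILD = r"""
import socket, time
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
print(srv.getsockname()[1], flush=True)   # tell the parent which port to use
conn, _ = srv.accept()
print("accepted", flush=True)
time.sleep(60)
"""

def peer_sees_close_after_kill():
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    try:
        port = int(proc.stdout.readline())
        client = socket.create_connection(("127.0.0.1", port), timeout=5)
        proc.stdout.readline()        # wait until the child has accepted
        proc.kill()                   # SIGKILL on POSIX: no user code runs
        proc.wait()
        try:
            data = client.recv(1)     # b"" means an orderly close (FIN) arrived
        except ConnectionResetError:
            data = b""                # an RST also tells us the peer is gone
        client.close()
        return data == b""
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
```

The client learns about the closure without any heartbeat because the kernel, not the killed process, tears the connection down.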

Golang how to handle graceful shutdown with keep-alives

I have built a proxy server that can balance between multiple nodes.
I also made it so that it can reload with zero downtime. The problem is that most of the nodes have keep-alive connections and I have no clue how to handle these. Sometimes the server can't shut down because of 1 or 2 open connections that won't close.
My first idea is to set a timeout on the shutdown, but that does not guarantee that every connection is terminated correctly. I am thinking of a download that takes some minutes to complete.
Can anyone give me some good advice on what to do in this case?
One option you have is to initially shut down just the listening sockets, and wait on the active connections before exiting.
Once you free up the listening sockets, your new process is free to start up and accept new connections. The old process can then continue running until all its connections are closed gracefully (this is how HAProxy does reloads), or until some far longer timeout if you choose.
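The two-phase idea can be sketched as follows. This is an illustration in Python rather than Go, and everything in it is an assumption for demonstration: the DrainingServer class, its method names, and the echo handler are hypothetical, and a real proxy would forward traffic instead of echoing it.

```python
import socket
import threading
import time

class DrainingServer:
    """Echo server that can stop accepting while open connections finish."""

    def __init__(self):
        self.listener = socket.create_server(("127.0.0.1", 0))
        self.listener.settimeout(0.1)     # lets the accept loop notice a stop request
        self.port = self.listener.getsockname()[1]
        self.stopping = threading.Event()
        self.workers = []
        self.acceptor = threading.Thread(target=self._accept_loop)
        self.acceptor.start()

    def _accept_loop(self):
        while not self.stopping.is_set():
            try:
                conn, _ = self.listener.accept()
            except socket.timeout:
                continue
            conn.settimeout(None)         # handlers use plain blocking I/O
            worker = threading.Thread(target=self._echo, args=(conn,))
            worker.start()
            self.workers.append(worker)
        self.listener.close()             # frees the port for a new process

    def _echo(self, conn):
        with conn:
            while True:
                data = conn.recv(1024)
                if not data:              # client closed: this connection is done
                    break
                conn.sendall(data)

    def stop_listening(self):
        """Phase 1: stop accepting; existing connections keep running."""
        self.stopping.set()
        self.acceptor.join()

    def wait_for_connections(self, timeout):
        """Phase 2: return True if all connections ended within `timeout` seconds."""
        deadline = time.monotonic() + timeout
        for worker in self.workers:
            worker.join(max(0.0, deadline - time.monotonic()))
        return not any(w.is_alive() for w in self.workers)
```

A reload would call stop_listening(), start the replacement process, and then call wait_for_connections() with a generous timeout so that long downloads can complete before the old process exits.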

Get rsyslog forwarding messages after remote server restart

I have syslog successfully forwarding logs to an upstream server like so:
$MainMsgQueueType LinkedList
$MainMsgQueueSize 10000
$MainMsgQueueDiscardMark 8000
$MainMsgQueueDiscardSeverity 1
$MainMsgQueueSaveOnShutdown off
$MainMsgQueueTimeoutEnqueue 0
$ActionQueueType LinkedList # in memory queue
$ActionQueueFileName fwdRule1 # unique name prefix for spool files
$ActionQueueSize 10000 # Only allow 10000 elements in the queue
$ActionQueueDiscardMark 8000 # Only allow 8000 elements in the queue before dropping msgs
$ActionQueueDiscardSeverity 1 # Discard Alert,Critical,Error,Warning,Notice,Info,Debug, NOT Emergency
$ActionQueueSaveOnShutdown off # do not save messages to disk on shutdown
$ActionQueueTimeoutEnqueue 0
$ActionResumeRetryCount -1 # infinite retries if host is down
$RepeatedMsgReduction off
*.* @@remoteserver.mynetwork.com:5544
On the remoteserver I have something that talks syslog and listens on that port. To test, I have a simple log client that logs 100 messages a second to syslog.
This all works fine, and I have configured the queues above so that in the event that the remoteserver is unavailable, the queues start filling up, and then eventually messages get discarded, thus safeguarding syslog from blocking its logging clients.
When I stop the remote log sink on remoteserver:5544, syslog is still stable (queues filling up / full up), and when I restart the remote log sink a while later, rsyslog detects the server again and reestablishes a TCP connection.
However, syslog only forwards 1 message to it, despite the queue having many thousands of messages in it and the logging client continuing to log 100 messages a second.
How can I make syslog start forwarding messages again once it has detected the remoteserver is back up? (Without restarting syslog).
Am using rsyslog 4.6.2-2
I am using, and want to use TCP
The problem, in case anybody comes across this, was that WorkDirectory was set to:
$WorkDirectory /var/spool/rsyslog
And the above config does this:
$ActionQueueFileName fwdRule1
Even though it's supposed to be an in-memory queue. Because of this, when the queue reached 800 (bizarrely, not 8000), disk-assisted mode was activated, and syslog attempted to write messages to /var/spool/rsyslog. This directory didn't exist. Seemingly at random (which suggests a race condition, and hence a bug, in rsyslog), after continually trying to open a queue file on disk in that directory, rsyslog got into a twisted state, gave up, and went on queueing messages until it hit the high mark of 10,000. Restarting the downstream log server failed to make it recover.
Taking out all references to ActionQueueFileName and making WorkDirectory exist fixed this issue.

Tcp Socket Closed

I always thought that if you didn't implement a heartbeat, there was no way to know if one side of a TCP connection died unexpectedly. If the process was just killed on one side and didn't exit gracefully, there was no way for the socket to send FIN or let the other side know that it was closed.
(See some of the comments here for example http://www.perlmonks.org/?node_id=566568 )
But there is a stock order server that I connect to that has a new "cancel all orders on disconnect" feature that cancels live orders if the client disconnects. It works even when I kill the process on my end, and there is definitely no heartbeat from my app to it.
So how is it able to detect when I've killed the process? My app is running on Windows Server 2003 and the order server is on SUSE Linux Enterprise Server 10. Does Windows detect that the process associated with the socket is no longer alive and send the FIN?
When a process exits - for whatever reason - the OS will close the TCP connections it had open.
There are numerous other ways a TCP connection can go dead undetected:
someone yanks out a network cable in between.
the computer at the other end gets nuked.
a NAT gateway in between silently drops the connection.
the OS at the other end crashes hard.
the FIN packets get lost.
Though by enabling TCP keepalive, you'll detect it eventually, at least within a couple of hours.
It could be using a TCP Keep Alive to check for dead peers:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
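As a rough illustration of turning keepalive on from application code, here is a minimal sketch. The enable_keepalive helper and its default values are hypothetical. SO_KEEPALIVE is widely available, while the three tuning options are platform-specific, so the sketch sets them only where Python exposes them. Without tuning, Linux waits two hours by default before sending the first probe, which is where the "couple of hours" figure comes from.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Enable TCP keepalive; tune probe timing where the platform allows it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds of idle time before the first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between unanswered probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # unanswered probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

With these example values, a silently dead peer would be noticed after roughly idle + interval * probes seconds on platforms that honour all three options.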
As far as I know, the OS detects the process termination and closes all the file descriptors/sockets/handles the process was using. So there is no difference between "killing" an application and "gracefully terminating" it. Of course, the kernel itself must be running (i.e. PC turned on, wire connected...). But sending the FIN and so on is the job of the OS.
Also, if a host becomes unreachable (turned off, disconnected...), an intermediate gateway (or the client itself) may detect the event (e.g. loss of carrier, DHCP lease not renewed...) and reply to packets sent to the dead host with an ICMP error (host/network unreachable). This causes the peer's TCP connection to die, but it happens only if the client has some packet to send to the host.
