nginx stalls after 180 KiB from uwsgi

I'm testing a new Flask (Python 3) app on a freshly created Debian WSL2 install (all packages at latest), using a system-installed nginx and a userspace uwsgi that pass data between them via a unix-domain socket.
The response generated by the app is 6.0 MiB in size. Chrome, reading from localhost:8080, receives (according to Wireshark) exactly 180 KiB of content plus a few bytes (84) of headers. Then it stalls and never receives anything else. When it times out, the nginx access log reports a transfer of the same number of bytes.
However, if I run killall -9 uwsgi before the timeout, another 160 KiB (exactly) of the page is immediately sent to Chrome, and this larger total is what the nginx log then reports.
If I run the app under the basic embedded Flask server directly, I get the full 6 MiB of content with no stalls.
Why is nginx not receiving the full response from uwsgi and/or not passing it to the browser?
Update 1: I changed the socket type from unix-domain to TCP. The same problem occurs, but the stall point is no longer consistent between runs: it has stopped after as little as 180 KiB and as much as 540 KiB. It does, however, always stall at one of a few fixed offsets, such as 184236, 405420 (exactly 216 KiB more), and 552876 (another 144 KiB more).
Update 2: I stopped uwsgi and ran nc -vlp 8079 >req.http instead, capturing a request from nginx. I then restarted uwsgi and replayed the request (nc -v 127.0.0.1 8079 <req.http >out.html) and received the full 6 MiB response. The problem therefore appears to be on the nginx side.
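For reference, here is roughly the kind of setup being described, just as a sketch; the socket path, module name, and uwsgi options are placeholders rather than the asker's actual values:
# run the app under uwsgi on a unix-domain socket (hypothetical paths/names)
uwsgi --socket /tmp/flaskapp.sock --module app:app --master --processes 2
# fetch the page without a browser and report how many bytes actually arrive
curl -s -o /dev/null -w 'received %{size_download} bytes in %{time_total}s\n' http://localhost:8080/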

Related

curl error 18, attempting to solve the problem using SO answer 1759956

I am trying to follow curl error 18 - transfer closed with outstanding read data remaining.
The top answer is to "...let curl set the length by itself."
I don't know how to do this. I have tried the following:
curl --ignore-content-length http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
However, I still get this error:
curl: (18) transfer closed with outstanding read data remaining
The connection is just getting closed by the server after 30 seconds.
You can try to increase the client's speed, but if the server doesn't deliver enough within the limited time, you get this message even with a fast connection.
In the case of the example http://corpus-db.org/api/author/Dickens,%20Charles/fulltext, I got a larger amount of content with direct output:
curl http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
while the amount was smaller when writing to a file (already ~47 MB within the 30 seconds):
curl -o Dickens,%20Charles http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
Resuming the file transfer can be tried, but the example server doesn't support it:
curl -C - -o Dickens,%20Charles http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.
So there may be options to optimize the request, increase the connection speed, or enlarge the cache, but if you have hit the server's limit and never get more data within the allowed time, there is nothing you can do on the client side.
The curl manual can be found here: https://curl.haxx.se/docs/manual.html
The following links won't help you but are perhaps interesting:
The repository for the data server can be found here: https://github.com/JonathanReeve/corpus-db
The documentation for the web server used can be found here: https://hackage.haskell.org/package/warp-3.2.13
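To see concretely how far a transfer gets before the server closes the connection, curl's --write-out variables can report the byte count and elapsed time; this only measures the problem described above rather than fixing it, and the output filename is arbitrary:
curl -o dickens.json -w 'received %{size_download} bytes in %{time_total}s\n' http://corpus-db.org/api/author/Dickens,%20Charles/fulltext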
It's a speed issue. The server at corpus-db.org will DISCONNECT YOU if you take longer than 35 seconds to download something, regardless of how much you've already downloaded.
To make matters worse, the server does not support byte ranges (Content-Range), so you can't download it in chunks and simply resume the download where you left off.
To make matters even worse, not only is Content-Range not supported, but it's SILENTLY IGNORED, which means it seems to work, until you actually inspect what you've downloaded.
If you need to download that page from a slower connection, I recommend renting a cheap VPS, setting it up as a mirror of whatever you need to download, and downloading from your mirror instead. Your mirror does not need to have the 35-second limit.
For example, this VPS¹ costs $1.25/month, has a 1 Gbps connection, and would be able to download that page. Rent one of those, install nginx on it, wget the file into nginx's www folder, and download it from your mirror; you'll then have 300 seconds (nginx's default timeout) to download it instead of 35. If 300 seconds is not enough, you can even change the timeout to whatever you want.
Or you could even get fancy and set up a caching proxy compatible with curl's --proxy parameter, so your command could become
curl --proxy=http://yourserver http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
If someone is interested in an example implementation of this, let me know.
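A rough sketch of the mirror idea, assuming nginx's web root is /var/www/html (the Debian default) and with yourserver as a placeholder hostname:
# on the fast VPS: fetch the file once into nginx's web root
wget -O /var/www/html/dickens-fulltext.json 'http://corpus-db.org/api/author/Dickens,%20Charles/fulltext'
# from the slow connection: pull it from the mirror instead; resuming works because nginx serves static files with byte-range support
curl -C - -o dickens-fulltext.json http://yourserver/dickens-fulltext.json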
You can't download that page over a 4 Mbit/s connection, because the server will kick you off before the download is complete (after 35 seconds): at 4 Mbit/s, roughly 0.5 MB/s, you can pull at most about 17 MB in 35 seconds, well short of the full response. Download it over a 1000 Mbit/s connection and the entire file arrives before the timeout kicks in.
(My home internet connection is 4 Mbit/s, and I can't download it from home, but I tried downloading it from a server with a 1000 Mbit/s connection, and that works fine.)
¹ PS: I'm not associated with ramnode in any way, except that I'm a (former) happy customer of theirs, and I recommend them to anyone looking for cheap, reliable VPSs.

What happens to a waiting WebSocket connection at the TCP level when the server is busy (blocked)

I am load testing my WebSocket Tornado server, running on Ubuntu Server 14.04.
I am driving it with a big client machine that loads 60,000 users at 150 per second (that's what my small server can comfortably take). The client is a Red Hat machine. When a load-test suite finishes, I have to wait a few seconds before I can rerun it.
Within these few seconds, my WebSocket server is handling the closing of the 60,000 connections. I can see this in my Graphite dashboard (the server logs every connect and disconnect there).
I am also logging relevant output of the netstat -s and ss -s commands to my Graphite dashboard. When the test suite finishes, I immediately see the count of established TCP connections drop from 60,000 to ~0. The other socket states (closed, timewait, synrecv, orphaned) remain constant and very low. My client's sockets go into timewait for a short period and then that number drops to 0 too. When I rerun the suite immediately, while all the TCP sockets on both ends are free but the server has not yet finished processing the previous batch of closes, I see no change at the TCP socket level until the server finishes and starts accepting new connections again.
My question is: where is the information about the connections waiting to be established stored (on Red Hat and Ubuntu)? No counter or queue length that I am tracking shows this.
Thanks in advance.
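For what it's worth, here are two commands that expose the queues in question; the port number is a placeholder, and the column interpretation is my understanding rather than anything from this thread:
# for a LISTEN socket, Recv-Q = connections completed by the kernel but not yet accept()ed, Send-Q = the configured backlog
ss -ltn 'sport = :8888'
# kernel-wide counters such as "times the listen queue of a socket overflowed" and "SYNs to LISTEN sockets dropped"
netstat -s | grep -i listen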

nginx <=> php-fpm: unix socket gives error, tcp connection is slow

I'm running nginx with php-fpm on a high-traffic site. I let nginx communicate with php-fpm over TCP/IP, with both nginx and the php-fpm pools running on the same server.
When I use TCP/IP for nginx and the php-fpm pools to communicate with each other, pages take a few (5-10) seconds before anything happens at all, and once they finally get going they finish in no time. Since the php-fpm status page shows that the listen backlog is full, I assume requests sit waiting for a while before they are handled.
Netstat shows a lot (20k+) of connections in the TIME_WAIT state; I don't know whether this is related, but it seemed worth mentioning.
When I let nginx and php-fpm communicate over a UNIX socket instead, the time before the page actually starts loading drops to almost nothing, and the time before the finished page is in my browser is 1000x less. The only problem with the UNIX socket is that it gives me a LOT of errors in the logs:
*3377 connect() to unix:/dev/shm/.php-fpm1.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 122.173.178.150, server: nottherealserver.fake, request: "GET somerandomphpfile HTTP/1.1", upstream: "fastcgi://unix:/dev/shm/.php-fpm1.sock:", host: "nottherealserver.fake", referrer: "nottherealserver.fake"
My two questions are:
Does anybody know why the TCP/IP method has such a long wait before it actually seems to connect to the php-fpm backend?
Why does the UNIX socket cause problems when used instead of TCP/IP?
What I tried (the corresponding commands are sketched right after this list):
set net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse to 1 when trying to decrease the number of TIME_WAIT connections (went down from 30k+ to 20k+)
increased the net.core.somaxconn value from the default 128 to 1024 (tried higher too but still the same error when using the UNIX sockets)
increased the max number of open files
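For reference, the commands behind those list items look roughly like this (run as root; the values are simply the ones mentioned above, not recommendations):
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.core.somaxconn=1024
ulimit -n 65536   # raise the open-file limit; the exact value here is an example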
Probably also quite relevant: I tried lighttpd + FastCGI, and it has the same problem with the long delay before a connection finally gets handled. MySQL is not too busy, so it shouldn't be the cause of the long waits. Disk wait time is 0% (SSD), so a busy disk doesn't seem to be the culprit either.
Hope that somebody found a fix for this problem, and is willing to share :)
Answering my own question since the problem is solved (not sure if this is the correct way to do it).
My problem was that APC caching didn't work at all. It was installed, configured, and enabled, but did not add anything to its cache. After switching from APC to XCache, there was a huge drop in load and load times. I still don't know why APC did nothing, but at the moment I'm just happy that the problem is solved :)
Thanks for all the input from you guys!

fsc.exe is very slow because it tries to access crl.microsoft.com

When I run the F# compiler, fsc.exe, on our build server, it takes ages (~20 s) to run even when there are no input files. After some investigation I found that this is because the application tries to access crl.microsoft.com (probably to check whether some certificates have been revoked). However, the account under which it runs doesn't have access to the Internet, and because our routers/firewalls/whatever just drop the SYN packets, fsc.exe retries several times before giving up.
The only solution that comes to mind is to point crl.microsoft.com to 127.0.0.1 in the hosts file, but that's a pretty nasty solution. Moreover, I'll need fsc.exe on our production box, where I can't do such things. Any other ideas?
Thanks
I've come across this myself; here are some links to better descriptions and some alternatives:
http://www.eggheadcafe.com/software/aspnet/29381925/code-signing-performance-problems-with-certificate-revocation-chec.aspx
I dug this up from an old MS KB for Exchange when we hit it... We just got the DNS server to reply as stated (this might be the solution for your production box):
MS Support KB
The CRL check is timing out because it never receives a response. If a router were to send a “no route to host” ICMP packet or similar error instead of just dropping the packets, the CRL check would fail right away, and the service would start. You can add an entry to crl.microsoft.com in the hosts file or on the DNS server and send the packets to a legitimate location on the network, such as 127.0.0.1, which will reject the connection..."
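For reference, the hosts-file workaround described in that excerpt amounts to a single line (on Windows the file lives at %SystemRoot%\System32\drivers\etc\hosts):
127.0.0.1    crl.microsoft.com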

IIS file download hangs/timeouts - sc-win32-status = 64

Any thoughts on why I might be getting tons of "hangs" when trying to download a file via HTTP, based on the following?
Server is IIS 6
File being downloaded is a binary file, rather than a web page
Several clients hang, including the TrueUpdate and FlexNet web-updating packages, as well as a custom .NET app that just does basic HttpWebRequest/HttpWebResponse logic and downloads via a response stream
The IIS log signature on success is 200 0 0 (sc-status sc-substatus sc-win32-status)
On failure, the signature is 200 0 64
An sc-win32-status of 64 is "the specified network name is no longer available"
I can point Firefox at the URL and download successfully every time (perhaps some retry logic is happening under the hood)
At this point, it seems like either there is something funky with my server that makes it throw these errors, or this is just normal network behavior and I need to use (or write) a client that is more resilient to the failures.
Any thoughts?
Perhaps your issue was a low-level networking issue with the ISP, as you speculated in your reply comment. I am experiencing a similar problem with IIS, with some mysterious 200 0 64 lines appearing in the log file, which is how I found this post. For the record, this is my understanding of sc-win32-status=64; I hope someone will correct me if I'm wrong.
sc-win32-status 64 means “The specified network name is no longer available.”
After IIS has sent the final response to the client, it waits for an ACK message from the client.
Sometimes clients will reset the connection instead of sending the final ACK back to the server. This is not a graceful connection close, so IIS logs the "64" code to indicate the interruption.
Many clients will reset the connection when they are done with it, to free up the socket instead of leaving it in TIME_WAIT/CLOSE_WAIT.
Proxies may have a tendency to do this more often than individual clients.
I've spent two weeks investigating this issue. In my case, intermittent, seemingly random requests were being prematurely terminated, which resulted in IIS log entries with status code 200 but a win32-status of 64.
Our infrastructure includes two Windows IIS servers behind two NetScaler load balancers in HA mode.
In my particular case, the problem was that the NetScaler had a feature called "Integrated Caching" turned on (http://support.citrix.com/proddocs/topic/ns-optimization-10-5-map/ns-IC-gen-wrapper-10-con.html).
After disabling this feature, the request interruptions ceased and the site operated normally. I'm not sure how or why the feature was causing the problem, but there it is.
If you use a proxy or a load balancer, investigate which features it has turned on. For me, the cause was something between the client and the server interrupting the requests.
I hope that this explanation will at least save someone else's time.
Check the headers sent by the server, especially Content-Type and Content-Length. It's possible that your clients don't recognize the format of the binary file and hang while waiting for bytes that never arrive, or perhaps they close the underlying TCP connection, which may cause IIS to log win32 status 64.
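A quick way to see exactly which headers the server sends for the file is a HEAD request with curl (the URL is a placeholder):
curl -sI http://yourserver/downloads/file.bin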
Spent three days on this.
It was the timeout, which was set to 4 seconds (a curl request made from PHP).
The solution was to increase the timeout setting:
//curl_setopt($ch, CURLOPT_TIMEOUT, 4); // times out after 4s
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // times out after 60s
You will have to use Wireshark or Network Monitor to gather more data on this problem, I think.
I suggest you put Fiddler between your server and your download client. This should reveal the differences between Firefox and the other clients.
Description of all sc-win32-status codes for reference
https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
ERROR_NETNAME_DELETED (64, 0x40): The specified network name is no longer available.

Resources