nginx <=> php-fpm: unix socket gives error, tcp connection is slow

I'm running nginx with php-fpm on a high-traffic site. I let nginx communicate with php-fpm over tcp/ip, with both nginx and the php-fpm pools running on the same server.
When I use tcp/ip to let nginx and the php-fpm pools communicate with each other, loading a page takes a few (5-10) seconds before anything happens at all, and when it finally gets going, it takes no time at all for the loading to finish. Since the php-fpm status page shows that the listen backlog is full, I assume it takes some time before the request is handled.
Netstat shows a lot (20k+) of connections in the TIME_WAIT state; I don't know if this is related, but it seemed relevant.
When I let nginx and php-fpm communicate over a UNIX socket instead, the time before the page actually starts loading is reduced to almost nothing, and the time before the finished page is in my browser is 1000x less. The only problem with the UNIX sockets is that they give me a LOT of errors in the logs:
*3377 connect() to unix:/dev/shm/.php-fpm1.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 122.173.178.150, server: nottherealserver.fake, request: "GET somerandomphpfile HTTP/1.1", upstream: "fastcgi://unix:/dev/shm/.php-fpm1.sock:", host: "nottherealserver.fake", referrer: "nottherealserver.fake"
My two questions are:
Does anybody know why the tcp/ip method has such a large wait before it actually seems to connect to the php-fpm backend?
Why do the UNIX sockets cause problems when used instead of tcp/ip?
What I tried:
set net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse to 1 to try to decrease the number of TIME_WAIT connections (it went down from 30k+ to 20k+)
increased the net.core.somaxconn value from the default 128 to 1024 (I tried higher too, but still got the same error when using the UNIX sockets)
increased the max number of open files
What is probably also quite relevant: I tried lighttpd + fastcgi, which has the same problem with the long time before a connection finally gets handled. MySQL is not too busy, so it shouldn't be the cause of the long waiting times. Disk wait time is 0% (SSD), so a busy disk doesn't seem to be the culprit either.
Hope that somebody has found a fix for this problem and is willing to share :)
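For reference, the "Resource temporarily unavailable" (EAGAIN) error on connect() to a UNIX socket is what you get when the socket's accept queue is full, and the effective queue depth is the smaller of php-fpm's listen.backlog and net.core.somaxconn, so raising only one of the two has no effect. A minimal sketch of the knobs involved (the paths and numbers below are illustrative, not my actual config):

; php-fpm pool config (e.g. a file under /etc/php-fpm.d/)
listen = /dev/shm/.php-fpm1.sock
listen.backlog = 1024        ; php-fpm's own backlog, still capped by somaxconn
pm.max_children = 100        ; enough workers to actually drain the queue

# /etc/sysctl.conf
net.core.somaxconn = 1024    # kernel-wide cap on any listen() backlog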

Answering my own question since the problem is solved (not sure if this is the correct way to do it).
My problem was that APC caching didn't work at all. It was installed, configured and enabled, but did not add anything to its cache. After switching from APC to XCache, there was a huge drop in load and load times. I still don't know why APC did nothing, but at the moment I'm just happy that the problem is solved :)
Thanks for all the input from you guys!

Related

How to find the root cause of a server shutting down frequently

I'm running Apache on Ubuntu. Lately it shuts down frequently and comes back up after a restart. I keep analyzing the apache2 error log to find the cause. Previously it was reporting a PHP code error, but after fixing that, it now throws different errors.
What can I conclude from these errors? Which of them probably caused the downtime, and how do I fix it?
AH: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
AH00687: Negotiation: discovered file(s) matching request: /opt/bitnami/apps/file_that_doesn't-exist
(70007)The timeout specified has expired: [client 148.251.79.134:60170] AH01075: Error dispatching request to : (polling)
AH00045: child process 5062 still did not exit, sending a SIGTERM
AH00046: child process 5299 still did not exit, sending a SIGKILL
AH01909: localhost:443:0 server certificate does NOT include an ID which matches the server name
I've done enough Google searching to understand each of these errors. Most importantly, I would like to know which of these errors would have caused the server to go down, and what is the way to fix it?
Bitnami Engineer here,
AH: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
It seems the server is reaching its limits, and that can be the reason for the issues you are running into. It can reach those limits either because the instance is really small and you need to increase its memory/CPU, or because you are being attacked.
You can check whether you are being attacked by running these commands:
cd /opt/bitnami/apache2/logs/
tail -n 10000 access_log | awk '{print $1}'| sort| uniq -c| sort -nr| head -n 10
Are those IPs familiar? Is there any IP that is requesting your site too many times?
You can find more information about how to block it here
You can also increase the MaxRequestWorkers parameter in Apache by editing the /opt/bitnami/apache2/conf/bitnami/httpd.conf file, or increase the instance type using the AWS console so the server has more resources from now on.
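As a rough sketch, assuming the prefork MPM (the numbers are placeholders; size them to the instance's memory), the change can look like the block below. With prefork, ServerLimit is the hard upper bound on MaxRequestWorkers, so raise both together, which is what the first log message is asking for; restart Apache afterwards.

# Example prefork MPM sizing (placeholder values)
<IfModule mpm_prefork_module>
    ServerLimit          512
    MaxRequestWorkers    512
    StartServers           5
    MinSpareServers        5
    MaxSpareServers       10
</IfModule>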

What consequences will it lead to if I set a very large value for the nginx backlog?

I don't know what consequences it will lead to if I set a very large value for the nginx backlog.
Who can tell me?
It is too large for php-fpm to set the listen backlog to 65535. It is really NOT a good idea to clog the accept queue, especially when the client or nginx has a timeout for the connection.
Assume that php-fpm handles 5000 requests per second: it would take 13 seconds to completely consume 65535 backlogged connections. By then the connection may already have been closed because of a timeout in nginx or the client, so when we accept the 65535th socket, we get a broken pipe.
Even worse, if hundreds of php-fpm processes pick up an already-closed connection, they just waste time and resources running a heavy task and finally get an error when writing to the closed connection (error: Broken pipe).
The real maximum accept-queue size will be backlog + 1 (i.e. 512 here). We take 511, which is the same value nginx and redis use.
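For context, the backlog is configured separately on each side; a minimal sketch with the 511 value discussed above (the addresses are placeholders):

# nginx.conf: backlog of the listening socket
server {
    listen 80 backlog=511;
}

; php-fpm pool config: backlog of the FastCGI socket
listen = 127.0.0.1:9000
listen.backlog = 511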

Enormous amount of connections stuck in CLOSE_WAIT state with Varnish

I'm running into a weird problem with Varnish: an enormous number of connections are stuck in the CLOSE_WAIT state, just as if Varnish weren't closing connections.
This leads me to think that the kernel is waiting for Varnish to close the connections; considering that, it could be either a bug in Varnish or in the kernel, from my point of view.
Before digging deeper into the Varnish code, though, I'd like to have your point of view, and to know whether this kind of symptom could be caused by any other parameters.
Obviously, if you have ever experienced this problem and have the solution, that would be even more helpful.
FYI:
# netstat -pan|grep varnish|awk '/tcp/ {print $6}'|sort|uniq -c
35902 CLOSE_WAIT
12148 ESTABLISHED
3 LISTEN
You should inspect whether they are on the client ⇄ varnish side or on the varnish ⇄ backend side; they are probably on the backend side, at least that's my case.
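A quick way to tell the two apart is to split the netstat output by port. Assuming varnish listens on :6081 and the backend on :8080 (adjust both to your setup), something like:

# CLOSE_WAIT sockets on the client side (local port = varnish's listen port)
netstat -pan | grep varnish | awk '$6 == "CLOSE_WAIT" && $4 ~ /:6081$/' | wc -l
# CLOSE_WAIT sockets on the backend side (remote port = the backend's port)
netstat -pan | grep varnish | awk '$6 == "CLOSE_WAIT" && $5 ~ /:8080$/' | wc -l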
According to Connections to backend not closing:
This is actually per design, varnish keeps backend connections around if they look like they can be reused, and only revisits them when it tries to reuse them, so they may linger for quite a while before varnish discovers they have been closed by the backend. Apart from the socket hanging around, it is harmless.
I would also check whether your backends are closing connections unnecessarily; keepalive (if you are able to use it) is of great help. Finally, check the output of varnishstat -1 |grep backend to see whether varnish is able to reuse backend connections (backend_reuse) and whether it has noticed that they are closed (backend_toolate). The values should be such that backend_reuse + backend_toolate ≅ backend_recycle.

Golang: how to handle graceful shutdown with keep-alives

I have built a proxy server that can balance between multiple nodes.
I also made it so that it can reload with zero downtime. The problem is that most of the nodes have keep-alive connections, and I have no clue how to handle these. Sometimes the server can't shut down because of 1 or 2 open connections that won't close.
My first thought is to set a timeout on the shutdown, but that does not assure me that every connection is terminated correctly; think of a download that takes several minutes to complete.
Can anyone give me some good advice on what to do in this case?
One option you have is to initially shutdown just the listening sockets, and wait on the active connections before exiting.
Once you free up the listening sockets, your new process is free to start up and accept new connections. The old process can then continue running until all its connections are closed gracefully (this is how HAProxy does reloads), or until some far longer timeout if you choose.
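If the proxy is built on net/http, http.Server.Shutdown already implements exactly this pattern: it closes the listeners first, drops idle keep-alive connections, and waits for in-flight requests to finish, bounded by whatever context you pass. A minimal sketch, with the address and timeout as placeholder values:

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Placeholder handler and address; in a real proxy this would be the
	// reverse-proxy handler balancing across the backend nodes.
	srv := &http.Server{Addr: ":8080", Handler: http.NotFoundHandler()}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for SIGTERM/SIGINT (e.g. sent by whatever triggers the reload).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Stop accepting new connections, close idle keep-alive connections,
	// and wait for in-flight requests to finish, up to a long timeout so
	// slow downloads get a chance to complete.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown timed out, closing remaining connections: %v", err)
	}
}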

fsc.exe is very slow because it tries to access crl.microsoft.com

When I run the F# compiler - fsc.exe - on our build server, it takes ages (~20 sec) to run even when there are no input files. After some investigation I found out that it's because the application tries to access crl.microsoft.com (probably to check whether some certificates have been revoked). However, the account under which it runs doesn't have access to the Internet. And because our routers/firewalls/whatever just drop the SYN packets, fsc.exe tries several times before giving up.
The only solution that comes to mind is to point crl.microsoft.com to 127.0.0.1 in the hosts file, but that's a pretty nasty solution. Moreover, I'll need fsc.exe on our production box, where I can't do such things. Any other ideas?
Thanks
I've come across this myself - here are some links to better descriptions and some alternatives:
http://www.eggheadcafe.com/software/aspnet/29381925/code-signing-performance-problems-with-certificate-revocation-chec.aspx
I dug this up from an old MS KB article for Exchange when we hit it... We just got the DNS server to reply as stated (that might be the solution for your production box):
MS Support KB
The CRL check is timing out because it never receives a response. If a router were to send a “no route to host” ICMP packet or similar error instead of just dropping the packets, the CRL check would fail right away, and the service would start. You can add an entry to crl.microsoft.com in the hosts file or on the DNS server and send the packets to a legitimate location on the network, such as 127.0.0.1, which will reject the connection..."
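In practice that boils down to a single hosts-file line (or the equivalent record on an internal DNS server for boxes where the hosts file can't be touched), so the connection is refused immediately instead of the SYN being silently dropped:

# %SystemRoot%\System32\drivers\etc\hosts
127.0.0.1    crl.microsoft.com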
