We are trying to performance-tune our Drupal site.
We are using Siege to measure performance (as a Drupal visitor).
Env:
Nginx + FastCGI + Memcache
Siege runs fine for a few seconds, and then we run into connection errors:
Example:
HTTP/1.1 200 29.18 secs: 5877 bytes ==> /
HTTP/1.1 200 29.39 secs: 5877 bytes ==> /
warning: socket: -1656235120 select timed out: Connection timed out
warning: socket: -1673020528 select timed out: Connection timed out
Using the same Siege test configuration, Nginx + FastCGI + Drupal cache seems to work fine.
Example:
HTTP/1.1 200 1.41 secs: 5868 bytes ==> /
HTTP/1.1 200 1.40 secs: 5868 bytes ==> /
As you can see, the response time is much higher with Memcache, in addition to the connection errors.
Any idea what could be wrong here, and why Drupal throws errors with Memcache under load?
Memcache runs on a separate instance, with 2 GB of memory allocated to it.
My guess is that you are running out of memcached connections. Please check your memcached installation with a simple script run every second, then start Siege. I suspect your memcached stops responding after a while.
Test memcache php script:
<?php
$memcache = new Memcache;
$memcache->connect('localhost', 11211) or die ('Unable to connect');
$version = $memcache->getVersion();
echo 'Server version: '.$version;
?>
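To run that check once per second, as suggested above, a rough variant of the same script (same host and port; purely illustrative) could be:

<?php
// Poll memcached once per second and log whether it still answers.
// Leave this running in a terminal while Siege is hammering the site.
while (true) {
    $memcache = new Memcache;
    if (@$memcache->connect('localhost', 11211)) {
        echo date('H:i:s') . ' OK, server version: ' . $memcache->getVersion() . "\n";
        $memcache->close();
    } else {
        echo date('H:i:s') . " FAILED to connect\n";
    }
    sleep(1);
}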
What I suspect is happening is that you have not disabled persistent connections in Memcache, so they hang around in the PHP threads. Memcached can serve ~1023 of them at a time, and that might not be enough while sieging.
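As a rough sketch of what I mean (illustrative only, not Drupal's actual memcache setup), the non-persistent variants with the pecl Memcache extension look like this:

<?php
// connect() opens a NON-persistent connection that is torn down at the end of the request.
// pconnect(), and addServer() with its default third argument of true, keep the connection
// alive across requests, which is what can pile up under load.
$memcache = new Memcache;
$memcache->connect('127.0.0.1', 11211) or die('Unable to connect');

// For a server pool, pass false as the third (persistent) argument:
$memcache->addServer('127.0.0.1', 11211, false);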
You might also try ab, the Apache benchmarking tool, with a close look at the -c switch. Play around with it and see how the results change with different values.
Finally, you should run tcpdump on your memcached port (usually 11211) on the PHP machine to find out what is happening to the connections. Does Drupal open them? Does the other host respond with an RST, or does it time out?
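For example (the interface name is a guess; pick whichever one carries the memcached traffic):

tcpdump -nn -i eth0 port 11211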
There was a bug in the memcached PHP API documentation that said connections are non-persistent by default. They are persistent by default (at least they were at the time I had the problem).
Feel free to comment on this answer; I'll read the comments and assist further if necessary.
I am using WordPress and deploy Varnish using Docker. This is my default.vcl; what's wrong with this config? I sometimes get random 503 errors. I exclude the WordPress search page using a regex, and I get random 503 errors on the search page too.
varnishlog
https://www.dropbox.com/s/ruczg2i3h/log.txt
I am using an NGINX backend.
Help appreciated, thanks.
Your log output contains the following lines:
- Error out of workspace (bo)
- LostHeader Date:
- BerespHeader Server: Varnish
- VCL_call BACKEND_ERROR
- LostHeader Content-Type:
- LostHeader Retry-After:
Apparently you ran out of workspace memory because of the size of the response.
The following parameters in your Docker config might cause that:
-p http_resp_hdr_len=65536 \
-p http_resp_size=98304 \
By increasing the maximum size of individual headers and of the total response, the total memory consumption can exceed the workspace_backend value, which defaults to 64k.
Here's the documentation for http_resp_size:
$ varnishadm param.show http_resp_size
http_resp_size
Value is: 32k [bytes] (default)
Minimum is: 0.25k
Maximum number of bytes of HTTP backend response we will deal
with. This is a limit on all bytes up to the double blank line
which ends the HTTP response.
The memory for the response is allocated from the backend
workspace (param: workspace_backend) and this parameter limits
how much of that the response is allowed to take up.
As you can see, it affects workspace_backend. So here's the documentation for that:
$ varnishadm param.show workspace_backend
workspace_backend
Value is: 64k [bytes] (default)
Minimum is: 1k
Bytes of HTTP protocol workspace for backend HTTP req/resp. If
larger than 4k, use a multiple of 4k for VM efficiency.
NB: This parameter may take quite some time to take (full)
effect.
The solution is to increase the workspace_backend through a -p runtime parameter.
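For example, in the same style as the parameters already in your Docker config (the exact figure is my guess; it just needs to leave comfortable headroom over the 98304-byte http_resp_size you configured):

-p workspace_backend=262144 \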
An Activator template project was created with:
activator new rest-benchmark simple-rest-scala
cd rest-benchmark
activator clean stage
target/universal/stage/bin/rest-benchmark -Dplay.crypto.secret=1234
The template can be found here.
Then I ran ApacheBench to get a rough idea of the Play Framework's throughput:
ab -n 20000 -c 5 http://127.0.0.1:9000/books
which always gives a similar result: a timeout at around the 16,300th request:
apr_socket_recv: Operation timed out (60)
Total of 16345 requests completed
However, if I run ab with -k (keep-alive):
ab -k -n 20000 -c 5 http://127.0.0.1:9000/books
The benchmark was able to complete.
I have several questions:
Why does it always time out when keep-alive is absent? Shouldn't Play be able to handle requests regardless of this header? Or is my OS keeping connections open, so no further requests can be processed?
Why around the 16,300th request? Is it related to ulimit?
If a missing keep-alive causes connection timeouts, what can I do in production?
Edit: Switched to Play 2.5.4.
Edit 2: Changed the app launch command as marcospereira suggested; observing the same result.
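Not an answer to the above, but one way to check whether the ~16k ceiling is an OS-side limit rather than Play itself (standard commands; the sysctl names assume a BSD/macOS host, which the errno 60 hints at):

ulimit -n                                                        # open file descriptor limit for the shell
sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last    # ephemeral port range (macOS/BSD)
netstat -an | grep TIME_WAIT | wc -l                             # sockets stuck in TIME_WAIT during the run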
I'm having a problem with Memcached pools. I will try to give all the context of the error so you can help me with this.
Context:
PHP 5.5.17 (cgi-fcgi) (built: Sep 24 2014 20:38:04)
php-pecl-memcache-3.0.8-2.fc17.remi.5.5.x86_64
nginx version:
nginx/1.0.15
My problem:
I am creating a connection with memcached and saving several keys, on just one server at first, something like this:
$_memcache = new Memcache;
$_memcache->addServer("127.0.0.1", "11211", true, 50, 3600, 45);
So, suppose I add several keys on that server; I can retrieve them without a problem. When I load my site and my code asks for the keys, it gets them.
Now the problem: with those keys already saved and working fine, I added another memcached server to the pool, like this:
$_memcache = new Memcache;
$_memcache->addServer("10.0.0.2", "11211", true, 50, 3600, 45);
$_memcache->addServer("10.0.0.3", "11211", true, 50, 3600, 45);
But before refreshing the site to run my code and fetch the keys stored on the first server, I stopped memcached on server number 1 (10.0.0.2). After that I refreshed my site and received a 502 error (Bad Gateway).
The error I am seeing in the NGINX log is:
[error] 9364#0: *329504 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: XX.XX.XXX.XXX, server: _, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php-fastcgi/sock:", host: "www.myhost.com"
So, why am I getting that error? The only theory I have is that, because the connection is persistent, it is not closed properly when I stop the memcached server, but that only happens when I have a pool of more than one server. If I use just the one server and stop it, I don't see the error.
Is there any chance PHP 5.5 has a bug with the FastCGI socket that I don't know about?
NOTE: This problem wasn't happening on the previous PHP version, 5.3; it started after changing versions.
NOTE: When I don't use persistent connections it seems to work, but this site has huge traffic and won't handle the number of open connections; I tested this in a dev environment.
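For reference, the non-persistent variant I tested is just the same pool with the third (persistent) argument to addServer set to false:

$_memcache = new Memcache;
$_memcache->addServer("10.0.0.2", "11211", false, 50, 3600, 45);
$_memcache->addServer("10.0.0.3", "11211", false, 50, 3600, 45);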
Any help or any suggestions are more than welcome.
Thanks in advance!
(I googled and searched this forum for hours, found some topics, but none of them worked for me)
I'm using Wordpress with: Varnish + Nginx + PHP-FPM + APC + W3 Total Cache + PageSpeed.
As I'm using Varnish, the first time I request www.mysite.com it uses just 10% of CPU; the second time, it is served from the cache. The problem is when passing a request parameter in the URL.
For just one request (www.mysite.com?1=1), top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7609 nginx 20 0 438m 41m 28m S 11.6 7.0 0:00.35 php-fpm
7606 nginx 20 0 437m 39m 26m S 10.3 6.7 0:00.31 php-fpm
After the page is fully loaded, the processes above are still active. After 2 seconds, they are replaced by another 2 php-fpm processes (below), which stay active for 3 seconds.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7665 nginx 20 0 444m 47m 28m S 20.9 7.9 0:00.69 php-fpm
7668 nginx 20 0 444m 46m 28m R 20.9 7.9 0:00.63 php-fpm
40% CPU usage for just one uncached request!
Strange things:
CPU usage is higher after the page has loaded
When I purge the cache (W3 and Varnish), it takes just 10% of CPU to load an uncached page
The high CPU usage only happens when passing a request parameter or in the WordPress admin
When I try to do 10 requests (pressing F5 10 times), the server stops serving and the php-fpm log shows:
WARNING: [pool www] server reached max_children setting (10), consider raising it
I raised that value to 20, same problem.
I'm using pm=ondemand (pm.max_children=10 and pm.max_requests=500).
Initially I was using pm=dynamic (pm.max_children=10, pm.start_servers=1, pm.min_spare_servers=1, pm.max_spare_servers=2, pm.max_requests=500) and the same problem happened.
Can anyone help? Any help would be appreciated!
PS:
APC is ON (98% Hits, 2% Misses)
Server is Amazon Micro (613MB RAM)
PHP 5.3.26 (fpm-fcgi)
Linux version 3.4.48-45.46.amzn1.x86_64 Red Hat 4.6.3-2 (I think it's based on CentOS 5)
First, reduce the stack of caches. Why use Varnish, which serves pages from memory, when you're already using W3 Total Cache, which serves from memory as well?
W3 Total Cache is CPU intensive! It does not just cache items; it also compresses, minifies and merges files on the fly.
You have only about 600 MB of memory on your machine, which is not a lot, and your CPU power is less than a modern smartphone's. Memory access is extremely slow compared to a dedicated root server because of the Xen virtualization layer; that's why less is more.
Make sure W3 Total Cache is properly set up so it actually caches items, then warm up your cache and you should be fine.
Have a look at Google's nginx PageSpeed module, https://github.com/pagespeed/ngx_pagespeed; it can do the same thing W3 Total Cache does, just much more efficiently, because it happens in the web server rather than in PHP.
Nginx can also serve directly from memcached: http://www.kingletas.com/2012/08/full-page-cache-with-nginx-and-memcache.html (example article, might need some more investigation).
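A minimal sketch of that approach (assuming the rendered HTML has been stored in memcached under the request URI as the key; the @php fallback location is a placeholder for your normal FastCGI handling):

location / {
    set            $memcached_key $uri;
    memcached_pass 127.0.0.1:11211;
    # fall back to PHP whenever the page is not in memcached
    error_page     404 502 504 = @php;
}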
Problem solved!
For those who are having the same problem:
Check your Varnish configuration;
Check your WordPress plugins;
1) In my case, TTL was not configured in Varnish, so nothing was being cached.
This config worked for me:
sub vcl_fetch {
  # For everything except wp-login/wp-admin, strip Set-Cookie so the
  # response is cacheable, and cache it for 48 hours.
  if (!(req.url ~ "wp-(login|admin)")) {
    unset beresp.http.set-cookie;
    set beresp.ttl = 48h;
  }
}
2) The high CPU usage AFTER the page loads was caused by a WordPress plugin called "Scroll Triggered Box".
It was doing some AJAX after the page had loaded. I disabled that plugin and the high load stopped.
There are two factors at play here:
You are using a micro instance, which has a burstable CPU profile. It can burst up to 2 ECUs and is then limited to much less than 1 (some estimates put this at around 0.1 to 0.2 ECUs).
While logged in as an admin, WordPress caching plugins often bypass or reduce caching. W3 Total Cache should allow you to change this if you want caching on all the time.
I use nginx along with fastcgi. I see a lot of the following errors in the error logs
readv() failed (104: Connection reset by peer) while reading upstream
recv() failed (104: Connection reset by peer) while reading response header from upstream
I don't see any problems when using the application. Are these errors serious, and how do I get rid of them?
In my case, php-fpm was running in the background and slow scripts were being killed after a configured timeout. Scripts taking longer than the specified time were killed, and nginx reported a recv or readv error because the connection was closed by the php-fpm engine/process.
Update:
Since nginx version 1.15.3 you can fix this by setting the keepalive_requests option of your upstream to the same number as your php-fpm's pm.max_requests:
upstream name {
...
keepalive_requests number;
...
}
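A concrete (hypothetical) example, assuming the php-fpm pool has pm.max_requests = 500 and is reached over a Unix socket:

upstream php_fpm {
    server unix:/run/php/php-fpm.sock;
    keepalive 8;              # keep a few idle connections open to the pool
    keepalive_requests 500;   # match pm.max_requests so nginx recycles the connection in step with php-fpm
}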
Original answer:
If you are using nginx to connect to php-fpm, one possible cause can also be having nginx' fastcgi_keep_conn parameter set to on (especially if you have a low pm.max_requests setting in php-fpm):
http|server|location {
...
fastcgi_keep_conn on;
...
}
This may cause the described error every time a child process of php-fpm restarts (due to pm.max_requests being reached) while nginx is still connected to it. To test this, set pm.max_requests to a really low number (like 1) and see if you get even more of the above errors.
The fix is quite simple - just deactivate fastcgi_keep_conn:
fastcgi_keep_conn off;
Or remove the parameter completely (since the default value is off). This does mean your nginx will reconnect to php-fpm on every request, but the performance impact is negligible if you have both nginx and php-fpm on the same machine and connect via unix socket.
Regarding this error:
readv() failed (104: Connection reset by peer) while reading upstream and recv() failed (104: Connection reset by peer) while reading response header from upstream
there was one more case where I could still see it.
Quick setup overview:
CentOS 5.5
PHP with PHP-FPM 5.3.8 (compiled from scratch with some 3rd-party modules)
Nginx 1.0.5
After also looking at the PHP-FPM error logs and enabling catch_workers_output = yes in the php-fpm pool config, I found that the root cause in this case was actually the amfext module (the PHP module for Flash).
There's a known bug in this module, with a fix that involves altering the amf.c file.
After fixing this PHP extension issue, the error above was no longer an issue.
This is a very vague error, as it can mean a few things. The key is to look at all the available logs and figure it out.
In my case, which is probably somewhat unique, I had a working nginx + PHP/FastCGI config. I wanted to compile a new, updated version of PHP with PHP-FPM, and I did so. The catch was that I was working on a live server that couldn't afford downtime, so I had to upgrade and move to PHP-FPM as seamlessly as possible.
Therefore I had two instances of PHP:
One talking directly with FastCGI over TCP on 127.0.0.1:9000 (PHP 5.3.4)
One configured with PHP-FPM, using a Unix socket at unix:/dir/to/socket-fpm (PHP 5.3.8)
Once I started up PHP-FPM (PHP 5.3.8) on an nginx vhost using a socket connection instead of TCP, I started getting this upstream error on any FastCGI page taking longer than x minutes, whether it was using FPM or not. Typically it was pages doing large SELECTs in MySQL that took ~2 minutes to load. Bad, I know, but that's down to the back-end DB design.
What I did to fix it was add this in my vhost configuration:
fastcgi_read_timeout 5m;
Now, this can be added to the global nginx fastcgi settings as well; it depends on your setup. http://wiki.nginx.org/HttpFcgiModule
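In my case it went into the vhost, roughly like this (the server name and PHP location are placeholders for your own config):

server {
    server_name example.com;
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_read_timeout 5m;   # give slow back-end queries time to finish
    }
}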
Answer #2.
Interestingly enough, fastcgi_read_timeout 5m; fixed one vhost for me.
However, I was still getting the error in another vhost, just by running phpinfo();
What fixed it for me was copying over a default production php.ini file and adding the config I needed into it.
What I had before was an old copy of my php.ini from the previous PHP install.
Once I put in the default php.ini from 'shared' and just added the extensions and config I needed, the problem was solved and I no longer had the nginx readv() and recv() errors.
I hope one of these two fixes helps someone.
It can also be a very simple problem: there is an infinite loop somewhere in your code, or an endless attempt to connect to an external host from your page.
Sometimes this problem happens because of a huge number of requests. By default, pm.max_requests in php5-fpm may be set to 100 or below.
To solve it, increase its value depending on your site's traffic, for example to 500.
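In the pool config that would look something like this (the path below is the Debian/Ubuntu default for php5-fpm; yours may differ):

; /etc/php5/fpm/pool.d/www.conf
pm.max_requests = 500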
After that, you have to restart the service:
sudo service php5-fpm restart
Others have mentioned the fastcgi_read_timeout parameter, which is located in the nginx.conf file:
http {
...
fastcgi_read_timeout 600s;
...
}
In addition to that, I also had to change the request_terminate_timeout setting in the file /etc/php5/fpm/pool.d/www.conf:
request_terminate_timeout = 0
Source of information (there are also a few other recommendations for changing php.ini parameters, which may be relevant in some cases): https://ma.ttias.be/nginx-and-php-fpm-upstream-timed-out-failed-110-connection-timed-out-or-reset-by-peer-while-reading/