WordPress perfect Varnish VCL random 503 error

I am using WordPress and deploy Varnish using Docker. This is my default.vcl. What's wrong with this config? I sometimes get random 503 errors. I exclude the WordPress search page using a regex, yet I also get random 503 errors on the search page.
varnishlog output:
https://www.dropbox.com/s/ruczg2i3h/log.txt
I am using an NGINX backend.
Help appreciated, thanks.

Your log output contains the following lines:
- Error out of workspace (bo)
- LostHeader Date:
- BerespHeader Server: Varnish
- VCL_call BACKEND_ERROR
- LostHeader Content-Type:
- LostHeader Retry-After:
Apparently you ran out of workspace memory because of the size of the response.
The following parameters in your Docker config might cause that:
-p http_resp_hdr_len=65536 \
-p http_resp_size=98304 \
These increase the maximum size of individual headers and of the total response, so the total memory consumption can exceed the workspace_backend value, which defaults to 64k.
Here's the documentation for http_resp_size:
$ varnishadm param.show http_resp_size
http_resp_size
Value is: 32k [bytes] (default)
Minimum is: 0.25k
Maximum number of bytes of HTTP backend response we will deal
with. This is a limit on all bytes up to the double blank line
which ends the HTTP response.
The memory for the response is allocated from the backend
workspace (param: workspace_backend) and this parameter limits
how much of that the response is allowed to take up.
As you can see, it affects workspace_backend. So here's the documentation for that:
$ varnishadm param.show workspace_backend
workspace_backend
Value is: 64k [bytes] (default)
Minimum is: 1k
Bytes of HTTP protocol workspace for backend HTTP req/resp. If
larger than 4k, use a multiple of 4k for VM efficiency.
NB: This parameter may take quite some time to take (full)
effect.
The solution is to increase the workspace_backend through a -p runtime parameter.
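For example, alongside the existing parameters in your Docker/varnishd invocation (256k is only a rough starting point, not a tuned value; it has to leave room for http_resp_size plus header overhead):
-p http_resp_hdr_len=65536 \
-p http_resp_size=98304 \
-p workspace_backend=256k \
You can verify the effective value afterwards with varnishadm param.show workspace_backend.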

Related

How to control vhost_shared_traffic memory K8s nginx ingress?

Background
We run a Kubernetes cluster that handles several PHP/Lumen microservices. We started seeing the app's php-fpm/nginx reporting a 499 status code in its logs, which seems to correspond with the client getting a blank response (curl returns curl: (52) Empty reply from server) while the application logs 499.
10.10.x.x - - [09/Mar/2020:18:26:46 +0000] "POST /some/path/ HTTP/1.1" 499 0 "-" "curl/7.65.3"
My understanding is that nginx returns the 499 code when the client socket is no longer open/available to return the content to. In this situation, that appears to mean something before the nginx/application layer is terminating the connection. Our configuration currently is:
ELB -> k8s nginx ingress -> application
So my suspicion falls on either the ELB or the ingress, since the application is the one with no socket left to return to. So I started digging into the ingress logs...
Potential core problem?
While looking through the ingress logs, I'm seeing quite a few of these:
2020/03/06 17:40:01 [crit] 11006#11006: ngx_slab_alloc() failed: no memory in vhost_traffic_status_zone "vhost_traffic_status"
Potential Solution
I imagine that if I gave vhost_traffic_status_zone some more memory, at least that error would go away and I could move on to the next one, but I can't seem to find any ConfigMap value or annotation that would allow me to control this. I've checked the docs:
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/
Thanks in advance for any insight / suggestions / documentation I might be missing!
Here is the standard way to look up how to modify the nginx.conf in the ingress controller. After that, I'll link some info with suggestions on how much memory you should give the zone.
First, get the ingress controller version by checking the image on the deployment:
kubectl -n <namespace> get deployment <deployment-name> -o yaml | grep 'image:'
From there, you can retrieve the code for your version from the following URL. In the following, I will be using version 0.10.2.
https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.10.2
The nginx.conf template can be found at rootfs/etc/nginx/template/nginx.tmpl in the code, or at /etc/nginx/template/nginx.tmpl on a pod. This can be grepped for the line of interest. In the example case, we find the following line in nginx.tmpl:
vhost_traffic_status_zone shared:vhost_traffic_status:{{ $cfg.VtsStatusZoneSize }};
This gives us the config variable to look up in the code. Our next grep for VtsStatusZoneSize leads us to the lines in internal/ingress/controller/config/config.go
// Description: Sets parameters for a shared memory zone that will keep states for various keys. The cache is shared between all worker processes
// https://github.com/vozlt/nginx-module-vts#vhost_traffic_status_zone
// Default value is 10m
VtsStatusZoneSize string `json:"vts-status-zone-size,omitempty"`
This gives us the key "vts-status-zone-size" to be added to the configmap "ingress-nginx-ingress-controller". The current value can be found in the rendered nginx.conf template on a pod at /etc/nginx/nginx.conf.
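Assuming the ConfigMap is named as above and lives in the ingress controller's namespace, a merge patch along these lines should set the key (20m is only a placeholder; size it from usedSize as discussed below):
kubectl -n <namespace> patch configmap ingress-nginx-ingress-controller \
  --type merge -p '{"data":{"vts-status-zone-size":"20m"}}'
The controller should pick up the change and re-render /etc/nginx/nginx.conf on the pods.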
When it comes to what size to set the zone to, the docs here suggest setting it to 2 * usedSize:
If the message("ngx_slab_alloc() failed: no memory in vhost_traffic_status_zone") printed in error_log, increase to more than (usedSize * 2).
https://github.com/vozlt/nginx-module-vts#vhost_traffic_status_zone
"usedSize" can be found by hitting the stats page for nginx or through the JSON endpoint. Here is the request to get the JSON version of the stats and if you have jq the path to the value: curl http://localhost:18080/nginx_status/format/json 2> /dev/null | jq .sharedZones.usedSize
Hope this helps.

Why does Varnish return content-length: 0?

I have the following setup:
url -> load balancer -> nginx[1-2] -> varnish[1-2] -> nginx[1-2] (+app)
where the first nginx uses the second nginx as a backup if Varnish fails.
When I perform curl -I http... I get a content-length: 0 response. However, if I stop both Varnishes (6.0.2), I get a real number instead of 0. My VCL does not manipulate content-length, and I see no other setting that would suggest it.
Moreover, with Varnish on, if I perform multiple curls (10678 to be exact) I get 14 responses with a content-length different from 0.
The two questions are:
Is content-length: 0 expected from varnish?
Is it possible that Varnish fails to set up a connection once in a while and traffic gets routed to nginx directly? No errors in the logs, though.
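For reference, a repeated-curl check like the one described above could be scripted roughly like this (the URL and iteration count are placeholders):
zero=0
for i in $(seq 1 10000); do
  len=$(curl -sI https://example.com/ | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
  if [ "${len:-0}" -eq 0 ]; then zero=$((zero + 1)); fi
done
echo "responses with content-length 0: $zero"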

Make wget retry original URL after 3XX Redirect

I have a service that redirects users to temporary pre-signed AWS downloads. These are large files, often 5-10 GB. To prevent download sharing, the links have a relatively short valid lifespan (30 seconds).
Everything is working except that on slow internet connections, they tend to fail or get interrupted. wget has a feature that automatically retries the download. However, instead of retrying the original URL (eg: http://service.com/download/file.zip), wget retries the redirected pre-signed URL (eg: http://service.s3.amazonaws.com/file.zip?AWSAccessKeyId=XXXX&Signature=XXXX&Expires=1468000000)
Since these are large files, and the pre-signed lifespan is so short, that temporary url is no longer valid and the user gets a 403 Forbidden result.
Originally, when we noticed the problem, we were using 302 Found temporary redirects. A little research seemed to indicate we SHOULD have been using 307 Temporary Redirect. However, that didn't resolve the problem with wget. For grins and giggles, we tried 303 See Other, but that didn't work either.
Does anyone have any idea how to get wget to retry the original URL instead of the redirected URL?
Below is an example wget log:
--2016-07-06 10:29:51-- https://service.com/download/file.zip
Connecting to service.com (service.com)|10.0.0.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location:
https://service.s3.amazonaws.com/file.zip?AWSAccessKeyId=XXXX&Signature=XXXX&Expires=1468000000
[following]
--2016-07-06 10:29:52-- https://service.s3.amazonaws.com/file.zip?AWSAccessKeyId=XXXX&Signature=XXXX&Expires=1468000000
Resolving service.s3.amazonaws.com (service.s3.amazonaws.com)...
54.231.12.129
Connecting to service.s3.amazonaws.com
(service.s3.amazonaws.com)|54.231.12.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2070666907 (1.9G) [application/zip]
Saving to: ‘file.zip’
file.zip 53%[=========> ] 1.03G --.-KB/s in 18m 7s
2016-07-06 10:47:59 (995 KB/s) - Read error at byte
1107205784/2070666907 (The specified session has been invalidated for
some reason.). Retrying.
--2016-07-06 10:48:00-- (try: 2) https://service.s3.amazonaws.com/file.zip?AWSAccessKeyId=XXXX&Signature=XXXX&Expires=1468000000
Connecting to service.s3.amazonaws.com
(service.s3.amazonaws.com)|54.231.12.129|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-07-06 10:48:01 ERROR 403: Forbidden.
I had a similar issue and a similar answer to #panzerito's, but broke it out into a script I called loopdone:
#!/bin/bash
# Re-run the given command until it exits successfully.
until $1; do sleep 1; echo restarting; done
Then I can just run loopdone "wget -c http://my.url/" (including the quotes) to force it to run again and again (resuming, unless the server does not support it) until the exit code is 0, meaning no error. Because each attempt starts from the original URL, every retry follows a fresh redirect and gets a newly signed link.
Bash one-liner (the leading false seeds $? with a non-zero status so the loop body runs at least once):
false; until [ "$?" -eq 0 ]; do wget https://example.com/download/file.zip -c; done

nginx limiting the total cache size

I am using nginx to cache requests to my uwsgi backend using
uwsgi_cache_path /var/cache/nginx/uwsgi keys_zone=cache:15M max_size=5G;
My back-end is setting a very long expires header (1 year+). However, as my system runs, I see the cache topping out at 15M. It gets up to that level, then prunes down to 10M.
This causes a lot of unnecessary calls to my back end. When I change the keys_zone size it seems to control the size of the entire cache. It seems to ignore the max_size and instead substitute the keys_zone size. (*)
Can anyone explain this behavior? Is there a known bug in this version? Am I missing the point? I don't want to allocate 5G to the cache manager.
# nginx -V
nginx version: nginx/1.2.0
built by gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
TLS SNI support enabled
configure arguments: --conf-path=/etc/nginx/nginx.conf --pid-path=/var/run/nginx.pid --user=www-data --group=www-data --with-http_ssl_module --with-http_stub_status_module
(*) Update: I guess this was my overactive imagination trying to find a pattern in the chaos.
Expires header (and some other headers) is honoured by nginx to determine if a response is cacheable, but it's not used to determine how long to cache it.
By default, inactive cache entries are deleted after 10 minutes. Could you increase that value to see if it makes a difference?
proxy_cache_path path [levels=levels] keys_zone=name:size
[inactive=time] [max_size=size] [loader_files=number]
[loader_sleep=time] [loader_threshold=time];
Cached data that are not accessed during the time specified by the
inactive parameter get removed from the cache regardless of their
freshness. By default, inactive is set to 10 minutes.
Reference: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path
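For example, applied to the uwsgi_cache_path directive from the question (one week is only an illustrative value, not a recommendation):
uwsgi_cache_path /var/cache/nginx/uwsgi keys_zone=cache:15M max_size=5G inactive=7d;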

Drupal site - Memcache Connection errors

We are trying to performance-tune our Drupal site.
We are using Siege to measure performance (as a Drupal visitor).
Env:
Nginx + FastCGI + Memcache
Siege runs fine for a few seconds, and then we run into connection errors:
Example:
HTTP/1.1 200 29.18 secs: 5877 bytes ==> /
HTTP/1.1 200 29.39 secs: 5877 bytes ==> /
warning: socket: -1656235120 select timed out: Connection timed out
warning: socket: -1673020528 select timed out: Connection timed out
Using the same Siege test configuration, Nginx + FastCGI + Drupal Cache works fine.
Example:
HTTP/1.1 200 1.41 secs: 5868 bytes ==> /
HTTP/1.1 200 1.40 secs: 5868 bytes ==> /
As you can see, response time is much higher with Memcache, in addition to the connection errors.
Any idea what could be wrong here, and why Drupal is throwing errors with Memcache under load?
Memcache runs on a separate instance, with 2 GB of memory allocated to it.
I guess that you are running out of memcached connections. Run a check of your memcached installation with a simple script every second, then start Siege. I suspect your memcached stops responding after a while.
Test Memcache PHP script:
<?php
$memcache = new Memcache;
$memcache->connect('localhost', 11211) or die ('Unable to connect');
$version = $memcache->getVersion();
echo 'Server version: '.$version;
?>
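To run that check every second, as suggested, something like this shell loop could be used (the script path is just a placeholder):
while true; do php /path/to/memcache_check.php; echo; sleep 1; done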
What I guess is happening is that you have not disabled persistent connections in Memcache and they hang around in the PHP threads. Memcached can serve ~1023 of them at a time, and that might not be enough while sieging.
You might also try ab, the Apache benchmarking tool, with a close look at the -c switch. Play around with it and see how the results change for different values.
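For instance (the URL and numbers are placeholders; raise -c step by step and watch where the errors start):
ab -n 1000 -c 50 http://your-drupal-site/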
Finally, you should run a tcpdump on your memcached port (usually 11211) on the PHP machine to find out what is happening to the connections. Does Drupal start them? Does the other host respond with an RST, or does it time out?
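A capture along these lines should show which side opens and closes the memcached connections (the interface name is an assumption; adjust it to your setup):
tcpdump -i eth0 -nn 'tcp port 11211'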
There was a bug in the memcache PHP API documentation that said connections are non-persistent by default. They are persistent by default (well, they were at the time I had this problem).
Feel free to comment on this answer; I'll read the comments and assist further if necessary.
