How can I stop nginx failling over when openresty throws runtime error deploying cert - nginx

We are using openresty and the lua-resty-auto-ssl package to generate certificates from Lets Encrypt but lately the server keeps falling over. Im guessing its triggered when a certificate trys to auto renew as generating a certificate for first time works fine ... the error we are seeing is
2019/05/12 08:25:24 [error] 2623#2623: *1024227 lua entry thread aborted: runtime error: ...sty/luajit/share/lua/5.1/resty/auto-ssl/servers/hook.lua:40: assertion failed!
stack traceback:
coroutine 0:
[C]: in function 'assert'
...sty/luajit/share/lua/5.1/resty/auto-ssl/servers/hook.lua:40: in function 'server'
.../local/openresty/luajit/share/lua/5.1/resty/auto-ssl.lua:99: in function 'hook_server'
content_by_lua(nginx.conf:194):2: in function <content_by_lua(nginx.conf:194):1>, client: 127.0.0.1, server: , request: "POST /deploy-cert HTTP/1.1", host: "127.0.0.1:8999"
From what I can see in the error it is failing to assert something when trying to deploy the cert which could be any of 4 things
assert(params["domain"])
assert(params["fullchain"])
assert(params["privkey"])
assert(params["expiry"])
Im a bit stuck to what I can do, its no good having the server dropping out on use. Thats the last error thats reported before the server goes offline so im guessing thats the cause? but not 100% sure.
Is there anywhere I can look to find out more information what causes the crash. Im new to nginx/openresty so fumbling my round a bit. Has anyone come across a similar issue?

Wrap it all in a function and call it with pcall or xpcall and add some logic to deal with the error.

Related

How to control vhost_shared_traffic memory K8s nginx ingress?

Background
We run a kubernetes cluster that handles several php/lumen microservices. We started seeing the app php-fpm/nginx reporting 499 status code in it's logs, and it seems to correspond with the client getting a blank response (curl returns curl: (52) Empty reply from server) while the applications log 499.
10.10.x.x - - [09/Mar/2020:18:26:46 +0000] "POST /some/path/ HTTP/1.1" 499 0 "-" "curl/7.65.3"
My understanding is nginx will return the 499 code when the client socket is no longer open/available to return the content to. In this situation that appears to mean something before the nginx/application layer is terminating this connection. Our configuration currently is:
ELB -> k8s nginx ingress -> application
So my thoughts are either ELB or ingress since the application is the one who has no socket left to return to. So i started hitting ingress logs...
Potential core problem?
While looking the the ingress logs i'm seeing quite a few of these:
2020/03/06 17:40:01 [crit] 11006#11006: ngx_slab_alloc() failed: no memory in vhost_traffic_status_zone "vhost_traffic_status"
Potential Solution
I imagine if i gave vhost_traffic_status_zone some more memory at least that error would go away and on to finding the next error.. but I can't seem to find any configmap value or annotation that would allow me to control this. I've checked the docs:
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/
Thanks in advance for any insight / suggestions / documentation I might be missing!
here is the standard way to look up how to modify the nginx.conf in the ingress controller. After that, I'll link in some info on suggestions on how much memory you should give the zone.
First start by getting the ingress controller version by checking the image version on the deploy
kubectl -n <namespace> get deployment <deployment-name> | grep 'image:'
From there, you can retrieve the code for your version from the following URL. In the following, I will be using version 0.10.2.
https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.10.2
The nginx.conf template can be found at rootfs/etc/nginx/template/nginx.tmpl in the code or /etc/nginx/template/nginx.tmpl on a pod. This can be grepped for the line of interest. I the example case, we find the following line in the nginx.tmpl
vhost_traffic_status_zone shared:vhost_traffic_status:{{ $cfg.VtsStatusZoneSize }};
This gives us the config variable to look up in the code. Our next grep for VtsStatusZoneSize leads us to the lines in internal/ingress/controller/config/config.go
// Description: Sets parameters for a shared memory zone that will keep states for various keys. The cache is shared between all worker processe
// https://github.com/vozlt/nginx-module-vts#vhost_traffic_status_zone
// Default value is 10m
VtsStatusZoneSize string `json:"vts-status-zone-size,omitempty"
This gives us the key "vts-status-zone-size" to be added to the configmap "ingress-nginx-ingress-controller". The current value can be found in the rendered nginx.conf template on a pod at /etc/nginx/nginx.conf.
When it comes to what size you may want to set the zone, there are the docs here that suggest setting it to 2*usedSize:
If the message("ngx_slab_alloc() failed: no memory in vhost_traffic_status_zone") printed in error_log, increase to more than (usedSize * 2).
https://github.com/vozlt/nginx-module-vts#vhost_traffic_status_zone
"usedSize" can be found by hitting the stats page for nginx or through the JSON endpoint. Here is the request to get the JSON version of the stats and if you have jq the path to the value: curl http://localhost:18080/nginx_status/format/json 2> /dev/null | jq .sharedZones.usedSize
Hope this helps.

Lua entry thread aborted: runtime error bad argument #2 to 'tonumber' (number expected, got string)

Im using lapis framework and torch, sometimes the website showing internal server error when trying to load page, and the error from lapis is:
[error] 3726#81246: *19 lua entry thread aborted: runtime error: ...s/MyUser/torch/install/share/lua/5.1/xlua/init.lua:227: bad argument #2 to 'tonumber' (number expected, got string)
stack traceback:
coroutine 0:
[C]: in function 'require'
.../MyUser/torch/install/share/lua/5.1/lapis/init.lua:15: in function 'serve'
content_by_lua(nginx.conf.compiled:22):2: in function <content_by_lua(nginx.conf.compiled:22):1>, client: 127.0.0.1, server: , request: "GET /recommend HTTP/1.1", host: "localhost:9999", referrer: "http://localhost:9999/home"
After looking arround, the problem is gone when i remark this part:
local lapis = require("lapis")
local config = require("lapis.config").get()
local inspect = require("inspect")
local json = require('cjson')
local kb_recommend = require("recommender.knowledge")
-- local cb_recommend = require("recommender.content") <---- remark this
local mysql = require "luasql.mysql"
local env = mysql.mysql()
local conn = env:connect("restoran", "root", "")
Thats link to file recommender/content.lua:
package.path = package.path .. ";../?.lua"
local predictor = require("Content Based.predictor")
return predictor
This file returning the class where i wrote my torch code. I suspect the problem is coming from the 'require' part, but dont know why? I've been looking on google for solution, and nothing found. This my current version:
Lua 5.1.5
Torch7
Lapis 1.6.0
nginx/1.13.9
openresty/1.13.6.1
anyone can help? im very newbie in Lua environment.
tonumber is a Lua function for an example like this: tonumber("10") or tonumber(10)
I'm sorry I couldn't be of more help.
If you could show us the line where the error is located, I can most likely determine a fix for you.

NGINX (Operation not permitted) while reading upstream

I have NGINX working as a cache engine and can confirm that pages are being cached as well as being served from the cache. But the error logs are getting filled with this error:
2018/01/19 15:47:19 [crit] 107040#107040: *26 chmod()
"/etc/nginx/cache/nginx3/c0/1d/61/ddd044c02503927401358a6d72611dc0.0000000007"
failed (1: Operation not permitted) while reading upstream, client:
xx.xx.xx.xx, server: *.---.com, request: "GET /support/applications/
HTTP/1.1", upstream: "http://xx.xx.xx.xx:80/support/applications/",
host: "---.com"
I'm not really sure what the source of this error could be since NGINX is working. Are these errors that can be safely ignored?
It looks like you are using nginx proxy caching, but nginx does not have the ability to manipulate files in it's cache directory. You will need to get the ownership/permissions correct on the cache directory.
Not explained in the original question is that the mounted storage is an Azure file share. So in the FSTAB I had to include the gid= and uid= for the desired owner. This then removed the need for chown and chmod also became unnecessary. This removed the chmod() error but introduced another.
Then I was getting errors on rename() without permission to perform this. At this point I scrapped what I was doing, moved to a different type of Azure storage (specifically a Disk attached to the VM) and all these problems went away.
So I'm offering this as an answer but realistically, the problem was not solved.
We noticed the same problem. Following the guide from Microsoft # https://learn.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-storage-class seems to have fixed it.
In our case the nginx process was using a different user for the worker threads, so we needed to find that user's uid and gid and use that in the StorageClass definition.

PHP 5.5, NGINX and Memcached - 502 Error

I'm having a problem with Memcached pools. I will try to add all the context of the error to see if you guys can help me with this.
Context:
PHP 5.5.17 (cgi-fcgi) (built: Sep 24 2014 20:38:04)
php-pecl-memcache-3.0.8-2.fc17.remi.5.5.x86_64
nginx version:
nginx/1.0.15
My problem:
I am creating a connection with memcached and saving a several keys, just in one server first, something like this:
$_memcache = new Memcache;
$_memcache->addServer("127.0.0.1", "11211", true, 50, 3600, 45);
So, let suppose that I add several keys, in that server and I can get those without problem, I actually can, when I see my site and my code is calling to get the keys, it's getting it.
Now the problem, let say with those keys already saved and working without problem, I added another memcached server to the pool, this way:
$_memcache = new Memcache;
$_memcache->addServer("10.0.0.2", "11211", true, 50, 3600, 45);
$_memcache->addServer("10.0.0.3", "11211", true, 50, 3600, 45);
But before I refreshed the site to run my code and get the keys that I have storage in the first server I stopped the memcached in that server number 1 (10.0.0.2), after that I refreshed my site and then I received a 502 error (Bad Gateway)
The error that I am seeing in the log of NGINX is:
[error] 9364#0: *329504 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: XX.XX.XXX.XXX, server: _, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php-fastcgi/sock:", host: "www.myhost.com"
So, why I am getting that error. The only theory that I have is that for some reason because the connections is persistent is not closing properly when I stop the memcached server, but that only happens when I have a pool of X > 1 servers. If I use just that one and stopped I won't see the error.
There is any way that PHP 5.5 has a bug with the fastcgi socket that I don't know.
NOTE: This problem that I am having wasn't happening on previous PHP version 5.3, after changing the version is this is happening.
NOTE: When I don't use persistent connection seems to work, but this site has a huge traffic and it won't handle the amount of connections open, I tested this in dev environment.
Any help or any suggestions are more than welcome.
Thanks in advance!

Unable to reach Sentry log server: EOF occurred in violation of protocol

I'm having trouble with setting up Sentry server in HTTPS mode. Every now and then, reasonably often while seemingly random, this error message gets written by Raven (Sentry client) into log files:
Unable to reach Sentry log server: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol> (url: https://$(valid_server)/)
Web UI works fine. Vast majority of the messages from Raven are received fine and Sentry processes them into usable output. However, due to these errors, something gets lost from time to time.
I have tried to figure this one out, but dead ends seem to follow another. Basically it seems a lot like this:
Python Requests requests.exceptions.SSLError: [Errno 8] _ssl.c:504: EOF occurred in violation of protocol
But when testing my Sentry server with similar s_client query using TLS 1.2, it leads to a valid session unlike with the example there.
It's also not about this, since SNI isn't used:
python-requests 2.0.0 - [Errno 8] _ssl.c:504: EOF occurred in violation of protocol
I'm not able to reproduce the error coherently. Raven's tests are passed and nothing is acutely wrong, until an error pops up in the log.
My set up is: Raven 4.2.1 in Python 2.7.5, Nginx 1.6.0 as reverse proxy handling HTTPS, and finally Sentry 6.4.4 with default Gunicorn 0.17.4. Nginx configs are pretty much similar to official documentation (http://sentry.readthedocs.org/en/latest/quickstart/nginx.html) with a few alterations due to HTTPS.
I ran into the same issue and got it fixed by installing the following dependencies:
On Ubuntu:
sudo aptitude install libffi-dev
And then via pip:
pip install pyopenssl ndg-httpsclient pyasn1
The problem seems to be that Python 2.X doesn't support SNI (which is needed for TLS) out of the box as explained here.

Resources