nginx - connection timed out while reading upstream

I have a Flask server with an endpoint that processes some uploaded .csv files and returns a .zip (in a JSON response, as a base64 string).
This process can take up to 90 seconds.
I've been setting it up for production using gunicorn and nginx, and I'm testing the endpoint with smaller .csv files. They get processed fine, and within a couple of seconds I get the "got blob" log. But nginx never returns the response to the client, and eventually it times out. I set a longer fail_timeout of 10 minutes and the client WILL wait 10 minutes, then time out.
The proxy_read_timeout offered as a solution here is set to 3600s.
The proxy_connect_timeout is set to 75s according to this.
So is the timeout for the gunicorn workers, according to this.
The error log says: "upstream timed out (110: Connection timed out) while reading upstream".
I also see cases where nginx receives an OPTIONS request immediately followed by a POST request (some CORS preflight weirdness from the client); nginx passes the OPTIONS request to gunicorn but fails to pass the POST request, despite having received it.
Question:
What am I doing wrong here?
Many thanks
http {
    upstream flask {
        server 127.0.0.1:5050 fail_timeout=600;
    }

    # error log:
    # 2022/08/18 14:49:11 [error] 1028#1028: *39 upstream timed out (110: Connection timed out) while reading upstream, ...
    # ...

    server {
        # ...
        location /api/ {
            proxy_pass http://flask/;
            proxy_read_timeout 3600;
            proxy_connect_timeout 75s;
            # ...
        }
        # ...
    }
}
# wsgi.py
from main import app

if __name__ == '__main__':
    app.run()
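The exact gunicorn invocation isn't shown in the post; a typical command matching this setup might look as follows (the 600 s worker timeout is my assumption, mirroring the fail_timeout above):

    gunicorn --bind 127.0.0.1:5050 --timeout 600 wsgi:app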
# flask endpoint (imports needed by this snippet)
import base64
from flask import jsonify, Response

@app.route("/process-csv", methods=['POST'])
def process_csv():
    def wrapped_run_func():
        # ...
        return blob, export_filename
    # ...
    try:
        blob, export_filename = wrapped_run_func()
        b64_file = base64.b64encode(blob.getvalue()).decode()
        ret = jsonify(file=b64_file, filename=export_filename)
        # return Response(response=ret, status=200, mimetype="application/json")
        print("got blob")
        return ret
    except Exception as e:
        # note: export_filename may be unbound here if wrapped_run_func raised
        app.logger.exception(f"0: Error processing file: {export_filename}")
        return Response("Internal server error", status=500)
P.S. I'm getting this error from Stack Overflow:
"Your post appears to contain code that is not properly formatted as code. Please indent all code by 4 spaces using the code toolbar button or the CTRL+K keyboard shortcut. For more editing help, click the [?] toolbar icon."
for perfectly well-formatted code with language syntax highlighting; I'm sorry I had to post it ugly.

Sadly I got no response.
See the last lines for the "solution" I finally implemented.
CAUSE OF ERROR: I believe the problem is that I'm hosting the nginx server on WSL 1.
I tried updating to WSL 2 to see if that fixed it, but I'd need to enable some kind of "nested virtualization", since WSL 1 is already running on a VM.
Through conf changes I got it to the point where no error is logged: gunicorn returns the file, and then it just stays in the ether; nginx never gets/sends the response.
"SOLUTION":
I ended up changing the code for the client, the server and the nginx.conf file:
the server saves the resulting file and only returns the file name
the client inserts the filename into an href that then displays a link
on click a request is sent to nginx which in turn just sends the file from a static folder, leaving gunicorn alone
I guess this is the optimal way to do it anyway, though it still bugs me I couldn't (for sure) find the reason of the error
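A hypothetical sketch of that final setup (the paths and the /exports/ location are my assumptions, not the original code):

    # Flask side: write the zip where nginx can serve it, return only the name
    @app.route("/process-csv", methods=['POST'])
    def process_csv():
        blob, export_filename = wrapped_run_func()
        with open(f"/var/www/exports/{export_filename}", "wb") as f:  # hypothetical path
            f.write(blob.getvalue())
        return jsonify(filename=export_filename)

    # nginx side: serve the saved files directly, bypassing gunicorn
    # location /exports/ {
    #     alias /var/www/exports/;
    # }

The client then builds an href like /exports/<filename> from the JSON response.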

Related

GCP deployment with nginx - uwsgi - flask fails

I have a very simple Flask app that is deployed on GKE and exposed via a Google external load balancer, and I'm getting random 502 responses from the backend service. (I added custom headers on both the backend service and nginx to identify the source; I can see the backend service's header but not nginx's.)
The setup is:
LB -> backend-service -> NEG -> pod (nginx -> uwsgi), where the pod is the application built with Flask and deployed via uwsgi and nginx.
The scenario is handling image uploads in a simple, secured way. The sender sends me a token with the upload request.
My Flask app:
- receives the request and checks the sent token via another service, using "requests";
- if the token is valid, proceeds to handle the image and returns 200;
- if the token is not valid, stops and sends back a 401 response.
First, I got suspicious about the mix of 200s and 401s, so I reverted all responses to 200. After some of the expected responses, the server starts responding 502 and keeps sending it. Some of the messages at the very beginning succeeded.
The nginx error logs contain lines like the one below:
2023/02/08 18:22:29 [error] 10#10: *145 readv() failed (104: Connection reset by peer) while reading upstream, client: 35.191.17.139, server: _, request: "POST /api/v1/imageUpload/image HTTP/1.1", upstream: "uwsgi://127.0.0.1:21270", host: "example-host.com"
My uwsgi.ini file is as follows:

[uwsgi]
socket = 127.0.0.1:21270
master = true
processes = 8
threads = 1
buffer-size = 32768
stats = 127.0.0.1:21290
log-maxsize = 104857600
logdate = true
log-reopen = true
log-x-forwarded-for = true
uid = image_processor
gid = image_processor
need-app = true
chdir = /server/
wsgi-file = image_processor_application.py
callable = app
py-auto-reload = 1
pidfile = /tmp/uwsgi-imgproc-py.pid
My nginx.conf is as follows:

location ~ ^/api/ {
    client_max_body_size 15M;
    include uwsgi_params;
    uwsgi_pass 127.0.0.1:21270;
}
Lastly, my app has a healthcheck method with a simple JSON response. It does no extra work and simply returns; as explained above, it never fails.
Edit: my nginx access logs in the pod show the response as 401 while the client receives 502.
For those who are going to face the same issue: the problem was the POST data not being read.
nginx expects the POST body to be read by the proxied app (uwsgi in our case), but according to my logic I was not reading it in some cases and was returning the response anyway.
Setting uwsgi's post-buffering (here to 16 MiB, i.e. 16 * 1024 * 1024 bytes) solved the issue:
post-buffering = 16777216
This led me to the following solution:
https://stackoverflow.com/a/26765936/631965
(Nginx uwsgi (104: Connection reset by peer) while reading response header from upstream)
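An alternative to post-buffering is to drain the request body in the app before the early 401 return, so the proxied app always reads the POST data. A minimal hypothetical sketch, assuming Flask (token_is_valid and the X-Token header are stand-ins, not the original code; the route is taken from the error log above):

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def token_is_valid(token):
        # hypothetical stub; the real app validates via another service
        return token == "expected-token"

    @app.route("/api/v1/imageUpload/image", methods=["POST"])
    def upload_image():
        if not token_is_valid(request.headers.get("X-Token")):
            # consume the POST body even though we reject the upload,
            # so the uwsgi/nginx chain never sees an unread request body
            request.get_data()
            return jsonify(error="unauthorized"), 401
        # ... handle the image ...
        return jsonify(status="ok"), 200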

NGINX : Using Upstream server name

I am running nginx/1.19.6 on Ubuntu.
I am struggling to get the upstream module to work without returning a 404.
My *.conf files are located in /etc/nginx/conf.d/
FILE factory_upstream.conf:

upstream factoryserver {
    server factory.service.corp.com:443;
}
FILE factory_service.conf:

server {
    listen 80;
    root /data/www;

    proxy_cache factorycache;
    proxy_cache_min_uses 1;
    proxy_cache_methods GET HEAD POST;
    proxy_cache_valid 200 72h;
    # proxy_cache_valid any 5m;

    location /factory/ {
        access_log /var/log/nginx/access-factory.log main;
        proxy_set_header x-apikey abcdefgh12345678;

        ### Works when expressed as a literal:
        # proxy_pass https://factory.service.corp.com/;

        ### 404 when using the upstream name:
        proxy_pass https://factoryserver/;
    }
}
I have error logging set to debug, but after reloading the configuration and attempting a call, there are no new records in the error log.
nginx -t # Scans OK
nginx -s reload # no errors
cat /var/log/nginx/error.log
...
2021/03/16 11:29:52 [notice] 26157#26157: signal process started
2021/03/16 11:38:20 [notice] 27593#27593: signal process started
The access-factory.log does record the request:
127.1.2.3 --;[16/Mar/2021:11:38:50 -0400];";GET /factory/api/manifest/get-full-manifest/PID/904227100052 HTTP/1.1" ";/factory/api/manifest/get-full-manifest/PID/904227100052" ;404; - ;/factory/api/manifest/get-full-manifest/PID/904227100052";-" ";ID="c4cfc014e3faa803c8fe136d6eae347d ";-" ";10.8.9.123:443" ";404" ";-"
To help with debugging, I cached the 404 responses by enabling the "proxy_cache_valid any 5m;" line that is commented out in the example above.
When I use the upstream name, the cache file contains the following:
<##$ non-printable characters $%^>
KEY: https://factoryserver/api/manifest/get-full-manifest/PID/904227100052
HTTP/1.1 404 Not Found
Server: nginx/1.17.8
Date: Tue, 16 Mar 2021 15:38:50 GMT
...
The key contains the name 'factoryserver'; I don't know whether that matters or not. Does it?
The server version is different from what I see when I run nginx -v, which reports: nginx version: nginx/1.19.6.
Does the difference between the version in the cache file and on the command line indicate anything?
When I switch back to the literal server name in proxy_pass, I get a 200 response with the requested data. The key in the cache file then contains the literal upstream server name:
<##$ non-printable characters $%^>
KEY: https://factory.service.corp.com/api/manifest/get-full-manifest/PID/904227100052
HTTP/1.1 200
Server: nginx/1.17.8
Date: Tue, 16 Mar 2021 15:59:30 GMT
...
I will have multiple upstream servers, each providing different services. The configuration will be deployed to multiple factories, each with its own upstream servers.
I would like for the deployment team to only have to update the *_upstream.conf files, and keep the *_service.conf files static from deployment site to deployment site.
factory_upstream.conf
product_upstream.conf
shipping_upstream.conf
abc123_upstream.conf
Why do I get a 404 when using a named upstream server?
Based on the nginx version in the cached response not matching what you see on the command line, it seems the 404 may be coming from the upstream server. That is, your proxying is working, but the upstream server is returning a 404. To troubleshoot further, I would check the upstream server's nginx logs and verify that the incoming request is what you expect.
Note that when using proxy_pass, it makes a big difference whether you have a / at the end or not. With a trailing slash, nginx treats the given URI as the one to send upstream, replacing the part matched by the location block (/factory/); without one, it passes the full request URI as-is.
proxy_pass https://factoryserver/ maps /factory/api/... to https://factory.service.corp.com:443/api/...
proxy_pass https://factoryserver passes /factory/api/... through as https://factory.service.corp.com:443/factory/api/...
Docs: https://docs.nginx.com/nginx/admin-guide/web-server/reverse-proxy/
So maybe when switching between the upstream name and the literal server name, you're inadvertently being inconsistent with the trailing slash. It's a really easy detail to miss, especially when you don't know it's important.
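To make the difference concrete, here is the question's location block with both variants (one would be commented out):

    location /factory/ {
        # with the trailing slash, /factory/api/foo reaches the upstream as /api/foo
        proxy_pass https://factoryserver/;

        # without it, /factory/api/foo reaches the upstream as /factory/api/foo
        # proxy_pass https://factoryserver;
    }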
Please provide more information about the server configuration where you proxy_pass the requests. The only difference I see now is that you specify the port (443) in your upstream server.

Why Nginx sends requests to upstream servers sequentially

I use Nginx as Load Balancer with the following config:
http {
    upstream backend {
        server 127.0.0.1:8010;
        server 127.0.0.1:8011;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://backend;
        }
    }
}
So I have 2 local servers, which are Flask apps:

# app1.py
from flask import Flask, jsonify
import time

app = Flask(__name__)  # missing from the original snippet

@app.route("/", methods=['GET'])
def root():
    time.sleep(5)
    return jsonify({"response": "Hello, world!"})

app.run(debug=False, port=8010)  # for app2.py the only diff is port=8011
When I make 4 simultaneous calls to localhost:80 (in different tabs), I have to wait 20 seconds to see "Hello, world!" in all 4 tabs, instead of the 10 I expected (requests should be distributed across the 2 servers, so each server handles 2 requests at 5 seconds apiece). Instead they are processed sequentially, one by one. Can you explain why? And how could it be fixed?
I've played around with it a bit more and realized this behavior is only reproducible when I open several tabs in Chromium. In my other browser (Firefox) everything works as expected, and if I make the requests with curl everything works as expected as well.
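A hypothetical way to check this from outside any browser (Chromium may hold back duplicate in-flight requests to the same URL, which would serialize the tabs):

    # fire four requests in parallel and time each one
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(_):
        start = time.time()
        urlopen("http://localhost:80/").read()
        return round(time.time() - start, 1)

    with ThreadPoolExecutor(max_workers=4) as pool:
        # with round-robin over two single-threaded backends, expect
        # roughly two results near 5 s and two near 10 s
        print(list(pool.map(fetch, range(4))))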

Nginx memcached with fallback to remote service

I can't get nginx working with the memcached module. The requirement is to query a remote service, cache the data in memcached, and never fetch the remote endpoint again until the backend invalidates the cache. I have 2 containers with memcached v1.4.35 and one with nginx v1.11.10.
The configuration is the following:
upstream http_memcached {
    server 172.17.0.6:11211;
    server 172.17.0.7:11211;
}

upstream remote {
    server api.example.com:443;
    keepalive 16;
}

server {
    listen 80;

    location / {
        set $memcached_key "$uri?$args";
        memcached_pass http_memcached;
        error_page 404 502 504 = @remote;
    }

    location @remote {
        internal;
        proxy_pass https://remote;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
I tried setting the memcached upstream incorrectly, but then I get HTTP 499 instead, plus warnings:
*3 upstream server temporarily disabled while connecting to upstream
It seems that with the described configuration nginx can reach memcached successfully but can't write to or read from it. I can write to and read from memcached with telnet successfully.
Can you help me please?
My guesses on what's going on with your configuration:
1. 499 codes
HTTP 499 is nginx's custom code meaning the client terminated the connection before receiving the response (http://lxr.nginx.org/source/src/http/ngx_http_request.h#0120).
We can easily reproduce it. Just run
nc -k -l 172.17.0.6 11211
and curl your resource; curl will hang for a while, and when you press Ctrl+C you'll have this message in your access logs.
2. upstream server temporarily disabled while connecting to upstream
This means nginx didn't manage to reach your memcached and removed it from the pool of upstreams. It suffices to shut down both memcached servers and you'll constantly see this in your error logs (I see it every time with error_log ... info).
Since you see these messages, your assumption that nginx can freely communicate with the memcached servers doesn't seem to hold.
Consider explicitly setting memcached_bind (http://nginx.org/en/docs/http/ngx_http_memcached_module.html#memcached_bind), and use the -b option with telnet to make sure you're testing the memcached servers' availability from the same source address; a sketch of both follows.
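A hypothetical example of both suggestions (the bind address is an assumption for a docker bridge setup, not from the question):

    location / {
        set $memcached_key "$uri?$args";
        memcached_bind 172.17.0.1;      # pin the source address nginx uses
        memcached_pass http_memcached;
    }

    # then test availability from that same address:
    # telnet -b 172.17.0.1 172.17.0.6 11211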
3. nginx can reach memcached successfully but can't write or read from it
nginx can only read from memcached via its built-in module (http://nginx.org/en/docs/http/ngx_http_memcached_module.html):
"The ngx_http_memcached_module module is used to obtain responses from a memcached server. The key is set in the $memcached_key variable. A response should be put in memcached in advance by means external to nginx."
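For instance, the backend could populate the cache itself. A minimal sketch, assuming the pymemcache client (my choice; any memcached client works):

    from pymemcache.client.base import Client

    client = Client(("172.17.0.6", 11211))
    # the key must match nginx's $memcached_key, i.e. "$uri?$args";
    # for GET / with no args that is "/?"
    client.set("/?", b"cache", expire=900)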
4. overall architecture
It's not fully clear from your question how the overall scheme is supposed to work:
- nginx's upstream block uses weighted round-robin by default, which means your memcached servers are queried in turn, one per request. You can change this by setting memcached_next_upstream not_found, so that a missing key counts as an error and all of your servers are polled. That's probably fine for a farm of 2 servers, but unlikely to be what you want for 20 servers.
- the same is ordinarily true of memcached client libraries: they pick a server out of the pool according to some hashing scheme, so a given key ends up on only 1 server in the pool.
5. what to do
I managed to set up a similar configuration in 10 minutes on my local box, and it works as expected. To simplify debugging, I'd get rid of the docker containers to avoid networking overcomplication, run 2 memcached servers on different ports in single-threaded mode with the -vv option to see when requests reach them (memcached -p 11211 -U 0 -vv), and then play with tail -f and curl to see what's really happening in your case.
6. working solution
nginx config (https and http/1.1 are not used here, but it doesn't matter):
upstream http_memcached {
    server 127.0.0.1:11211;
    server 127.0.0.1:11212;
}

upstream remote {
    server 127.0.0.1:8080;
}

server {
    listen 80;
    server_name server.lan;

    access_log /var/log/nginx/server.access.log;
    error_log /var/log/nginx/server.error.log info;

    location / {
        set $memcached_key "$uri?$args";
        memcached_next_upstream not_found;
        memcached_pass http_memcached;
        error_page 404 = @remote;
    }

    location @remote {
        internal;
        access_log /var/log/nginx/server.fallback.access.log;
        proxy_pass http://remote;
        proxy_set_header Connection "";
    }
}
server.py, my dummy server (Python):

from random import randint
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello: {}\n'.format(randint(1, 100000))
This is how to run it (you just need to install flask first):
FLASK_APP=server.py flask run -p 8080
Filling in my first memcached server:
$ telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
set /? 0 900 5
cache
STORED
quit
Connection closed by foreign host.
Checking (note that we get a result every time, although we stored data only in the first server):
$ curl http://server.lan && echo
cache
$ curl http://server.lan && echo
cache
$ curl http://server.lan && echo
cache
This one is not in the cache, so we'll get a response from server.py:
$ curl http://server.lan/?q=1 && echo
Hello: 32337
(Screenshot omitted: the "whole picture" showed two terminal windows running memcached -p 11211 -U 0 -vv and memcached -p 11212 -U 0 -vv.)

nginx errors readv() and recv() failed

I use nginx along with fastcgi. I see a lot of the following errors in the error logs:
readv() failed (104: Connection reset by peer) while reading upstream
recv() failed (104: Connection reset by peer) while reading response header from upstream
I don't see any problem using the application. Are these errors serious, and how do I get rid of them?
I was using php-fpm in the background, and slow scripts were getting killed after a set timeout because it was configured that way. Scripts taking longer than the specified time were killed, and nginx reported a recv or readv error because the connection was closed from the php-fpm engine/process.
Update:
Since nginx version 1.15.3 you can fix this by setting the keepalive_requests option of your upstream to the same number as your php-fpm's pm.max_requests:
upstream name {
    ...
    keepalive_requests number;
    ...
}
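For example, with a hypothetical pool configured with pm.max_requests = 500, the matching upstream might look like this (the socket path is an assumption):

    upstream php_backend {
        server unix:/run/php/php-fpm.sock;   # hypothetical socket path
        keepalive 8;
        keepalive_requests 500;              # matches pm.max_requests
    }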
Original answer:
If you are using nginx to connect to php-fpm, one possible cause can also be having nginx' fastcgi_keep_conn parameter set to on (especially if you have a low pm.max_requests setting in php-fpm):
http|server|location {
    ...
    fastcgi_keep_conn on;
    ...
}
This may cause the described error every time a child process of php-fpm restarts (due to pm.max_requests being reached) while nginx is still connected to it. To test this, set pm.max_requests to a really low number (like 1) and see if you get even more of the above errors.
The fix is quite simple - just deactivate fastcgi_keep_conn:
fastcgi_keep_conn off;
Or remove the parameter completely (since the default value is off). This does mean your nginx will reconnect to php-fpm on every request, but the performance impact is negligible if you have both nginx and php-fpm on the same machine and connect via unix socket.
Regarding this error:
readv() failed (104: Connection reset by peer) while reading upstream and recv() failed (104: Connection reset by peer) while reading response header from upstream
there was one more case where I could still see this.
Quick setup overview:
- CentOS 5.5
- PHP with PHP-FPM 5.3.8 (compiled from scratch with some 3rd-party modules)
- Nginx 1.0.5
After looking at the PHP-FPM error logs as well and enabling catch_workers_output = yes in the php-fpm pool config, I found the root cause in this case was actually the amfext module (PHP module for Flash).
There's a known bug and fix for this module that can be corrected by altering the amf.c file.
After fixing this PHP extension issue, the error above was no longer an issue.
This is a very vague error, as it can mean a few things; the key is to look at all possible logs and figure it out.
In my case, which is probably somewhat unique, I had a working nginx + php/fastcgi config. I wanted to compile a new, updated version of PHP with PHP-FPM, and I did so. The reason was that I was working on a live server that couldn't afford downtime, so I had to upgrade and move to PHP-FPM as seamlessly as possible.
Therefore I had 2 instances of PHP:
- one talking directly to fastcgi (PHP 5.3.4), using TCP via 127.0.0.1:9000
- one configured with PHP-FPM (PHP 5.3.8), using a unix socket: unix:/dir/to/socket-fpm
Once I started up PHP-FPM (PHP 5.3.8) on an nginx vhost using the socket connection instead of TCP, I started getting this upstream error on any fastcgi page taking longer than x minutes, whether it was using FPM or not. Typically it was pages doing large SELECTs in MySQL that took ~2 min to load. Bad, I know, but that's due to the back-end DB design.
What I did to fix it was add this to my vhost configuration:
fastcgi_read_timeout 5m;
This can also be set in the global nginx fastcgi settings; it depends on your setup. http://wiki.nginx.org/HttpFcgiModule
Answer # 2.
Interestingly enough, fastcgi_read_timeout 5m; fixed one vhost for me.
However, I was still getting the error in another vhost, just by running phpinfo();.
What fixed it for me was copying over a default production php.ini file and adding the config I needed into it.
What I had before was an old copy of my php.ini from the previous PHP install.
Once I put in the default php.ini from 'shared' and added just the extensions and config I needed, this solved my problem and I no longer had the nginx errors readv() and recv() failed.
I hope one of these 2 fixes helps someone.
It can also be a very simple problem: an infinite loop somewhere in your code, or an endless attempt to connect to an external host from your page.
Sometimes this problem happens because of a large number of requests. By default, pm.max_requests in php5-fpm may be set to 100 or below.
To solve it, increase its value based on your site's request volume, for example to 500.
Afterwards, restart the service:
sudo service php5-fpm restart
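That setting lives in the php-fpm pool config; for example (path as used in the next answer, value from above):

    ; /etc/php5/fpm/pool.d/www.conf
    pm.max_requests = 500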
Others have mentioned the fastcgi_read_timeout parameter, which is located in the nginx.conf file:
http {
    ...
    fastcgi_read_timeout 600s;
    ...
}
In addition to that, I also had to change the setting request_terminate_timeout in the file: /etc/php5/fpm/pool.d/www.conf
request_terminate_timeout = 0
Source of information (there are also a few other recommendations for changing php.ini parameters, which may be relevant in some cases): https://ma.ttias.be/nginx-and-php-fpm-upstream-timed-out-failed-110-connection-timed-out-or-reset-by-peer-while-reading/
