nginx taking too long to respond

I have an nginx instance used mainly as a reverse proxy for a couple of upstream services. It has a simple endpoint used for health checks:
location /ping { return 200 '{"ping":"successful"}'; }
The problem I'm having is that this ping takes too long to get a response:
$ cat /proc/loadavg; date ; httpstat localhost/ping?foo=bar
2.93 1.98 1.94 8/433 16725
Thu Jul 15 15:25:08 UTC 2021
Connected to 127.0.0.1:80 from 127.0.0.1:42946
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 15:26:24 GMT
X-Request-ID: b8d276b0b3828113cfee3bf2daa01293
  DNS Lookup   TCP Connection   Server Processing   Content Transfer
[    4ms     |      0ms       |      76032ms       |       0ms      ]
             |                |                    |                |
    namelookup:4ms            |                    |                |
                        connect:4ms                |                |
                                          starttransfer:76036ms     |
                                                                total:76036ms
That output tells me that the average load was low at the time of the request (a 1-minute load average of 2.93 on an 8-core server is fine).
curl/httpstat initiated the request at 15:25:08 and the response arrived at 15:26:24.
The connection was established quickly and the request was sent, but then it took 76s for the server to respond.
If I look at the access log for this ping I see "req_time":"0.000" (this is the $request_time variable).
{"t":"2021-07-15T15:26:24+00:00","id":"b8d276b0b3828113cfee3bf2daa01293","cid":"18581172","pid":"13631","host":"localhost","req":"GET /ping?foo=bar HTTP/1.1","scheme":"","status":"200","req_time":"0.000","body_sent":"21","bytes_sent":"373","content_length":"","request_length":"85","stats":"","upstream":{"status":"","sent":"","received":"","addr":"","conn_time":"","resp_time":""},"client":{"id":"#","agent":"curl/7.58.0","addr":",127.0.0.1:42946"},"limit_status":{"conn":"","req":""}}
This is the access log format, in case anybody wonders what the rest of the values are:
log_format main escape=json '{"t":"$time_iso8601","id":"$ring_request_id","cid":"$connection","pid":"$pid","host":"$http_host","req":"$request","scheme":"$http_x_forwarded_proto","status":"$status","req_time":"$request_time","body_sent":"$body_bytes_sent","bytes_sent":"$bytes_sent","content_length":"$content_length","request_length":"$request_length","stats":"$location_tag","upstream":{"status":"$upstream_status","sent":"$upstream_bytes_sent","received":"$upstream_bytes_received","addr":"$upstream_addr","conn_time":"$upstream_connect_time","resp_time":"$upstream_response_time"},"client":{"id":"$http_x_auth_appid$http_x_ringdevicetype#$remote_user$http_x_auth_userid","agent":"$http_user_agent","addr":"$http_x_forwarded_for,$remote_addr:$remote_port"},"limit_status":{"conn":"$limit_conn_status","req":"$limit_req_status"}}';
My question is: where could nginx have spent those 76s if the request itself took 0s to be processed and responded to?
Something worth mentioning is that the server was also timing out a lot of connections to the upstreams at that moment: we see a lot of "upstream timed out (110: Connection timed out) while reading response header from upstream" and "upstream server temporarily disabled while reading response header from upstream".
So these two are probably related; what I can't see is why upstream timeouts would lead to a /ping taking 76s to be attended to and responded to when both CPU and load are low/acceptable.
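For what it's worth, my understanding is that $request_time only starts counting once nginx has read the first bytes of the request, so any time the connection spends queued before a worker actually reads it would never show up in that field. A throwaway probe I could run to double-check that (just a sketch, the file name is made up) times the TCP connect and the first response byte separately:

# timing_probe.py - rough diagnostic sketch, not part of my setup.
# The kernel completes the TCP handshake on its own, so connect() returns quickly
# even when nginx is busy; the first response byte only arrives once a worker has
# read and answered the request - and $request_time only starts counting there.
import socket
import time

HOST, PORT = "127.0.0.1", 80

t0 = time.monotonic()
sock = socket.create_connection((HOST, PORT), timeout=120)
t_connect = time.monotonic() - t0

sock.sendall(
    b"GET /ping?foo=bar HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

sock.recv(1)  # blocks until nginx starts answering
t_first_byte = time.monotonic() - t0
sock.close()

print(f"connect:    {t_connect * 1000:.1f} ms")
print(f"first byte: {t_first_byte * 1000:.1f} ms")

If connect stays at a few milliseconds while the first byte takes tens of seconds, the delay is happening before nginx ever reads the request, which would match the "req_time":"0.000" in the log.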
Any idea?

Related

How can I correctly measure ASP.NET Core request durations?

Given the following most basic of ASP.NET Core applications (note the Thread.Sleep):
using System.Diagnostics;
using System.Threading;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public class Program
{
    public static void Main(string[] args)
    {
        CreateHostBuilder(args).Build().Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.Configure(appBuilder =>
                    appBuilder.Run(async context =>
                    {
                        var stopwatch = Stopwatch.StartNew();
                        Thread.Sleep(1000); // deliberately blocks the ThreadPool thread
                        await context.Response.WriteAsync($"Finished in {stopwatch.ElapsedMilliseconds} milliseconds.");
                    }));
            });
}
And the following appsettings.json
{
  "Logging": {
    "LogLevel": {
      "Default": "None",
      "Microsoft.AspNetCore.Hosting.Diagnostics": "Information"
    }
  },
  "AllowedHosts": "*"
}
If I run even a moderate load test (100 requests, using bombardier in my case) I see latencies of around 5 seconds.
~/go/bin/bombardier http://localhost:5000 -l -n 100 -t 60s
Bombarding http://localhost:51568 with 100 request(s) using 125 connection(s)
100 / 100 [=================================================================================================================================] 100.00% 16/s 6s
Done!
Statistics Avg Stdev Max
Reqs/sec 19.46 250.28 4086.58
Latency 5.21s 366.21ms 6.05s
Latency Distribution
50% 5.05s
75% 5.05s
90% 6.04s
95% 6.05s
99% 6.05s
HTTP codes:
1xx - 0, 2xx - 100, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 3.31KB/s
However, all I see in the logs are entries like:
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
      Request finished in 1003.3658ms 200
Clearly the requests are taking more than 1 second end to end. I believe the unaccounted-for 4 seconds are spent while the request is queued on the ThreadPool.
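To illustrate what I mean by queuing, here is a rough sketch (in Python purely because it is compact; the worker count is an assumption, not something I measured) of 100 one-second requests all arriving at once on a small pool of workers:

# queueing_sketch.py - hypothetical illustration, not ASP.NET code.
# Each request is "processed" in ~1 second, but later requests also wait
# for a free worker, so the client-observed latency is much higher.
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 20      # assumption: roughly the number of pool threads available
REQUESTS = 100

def handle(submitted_at: float) -> float:
    time.sleep(1)                           # stands in for Thread.Sleep(1000)
    return time.monotonic() - submitted_at  # end-to-end latency seen by the client

start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    latencies = sorted(pool.map(handle, [start] * REQUESTS))

print(f"median latency: {latencies[len(latencies) // 2]:.2f} s")  # ~ half of REQUESTS / WORKERS
print(f"max latency:    {latencies[-1]:.2f} s")                   # ~ REQUESTS / WORKERS seconds

Every request still finishes its own work in about a second, which is exactly what the "Request finished in ~1003ms" log lines show, but the later ones spend several extra seconds waiting for a worker, which is what the load tester reports.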
So my question is how can I measure this latency from inside my application?
I ran your application in my environment, and the ASP.NET logs are very similar to yours:
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1022.4689ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/favicon.ico
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1004.1694ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1003.4582ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/favicon.ico
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1004.3703ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1003.3915ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/favicon.ico
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1004.3106ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1003.122ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1017.028ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1004.2742ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1006.5832ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1004.9214ms 200
info: Microsoft.AspNetCore.Hosting.Diagnostics[1]
Request starting HTTP/1.1 GET http://localhost:5000/
info: Microsoft.AspNetCore.Hosting.Diagnostics[2]
Request finished in 1012.4532ms 200
As for bombardier, I got the output below:
bombardier-windows-amd64.exe http://localhost:5000 -l -n 100 -t 60s
Bombarding http://localhost:5000 with 100 request(s) using 125 connection(s)
100 / 100 [==========================================================================================] 100.00% 11/s 8s
Done!
Statistics Avg Stdev Max
Reqs/sec 11.29 99.10 1303.09
Latency 5.78s 1.42s 8.78s
Latency Distribution
50% 5.17s
75% 7.79s
90% 7.88s
95% 8.24s
99% 8.34s
HTTP codes:
1xx - 0, 2xx - 100, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 2.27KB/s
And below is the Chrome DevTools network output.
I also tested it with cURL (note that I had to run echo [%date%, %time%] a second time after the curl command manually; better accuracy could be achieved by doing this in a .bat file, but overall the output confirms that the request took ~1100 ms):
C:\curl\bin>echo [%date%, %time%] && curl http://localhost:5000/
[Sun 07/12/2020, 13:12:35.61]
Finished in 1001 milliseconds.
C:\curl\bin\>echo [%date%, %time%]
[Sun 07/12/2020, 13:12:37.45]
So based on all of the above, it seems that bombardier's output differs from what the other tools report, hence we might have misunderstood the meaning of its reported latency! I made a small change to the command, letting it handle the 100 requests using only 10 connections instead of the default 125, and the output was:
bombardier-windows-amd64.exe -c10 http://localhost:5000 -l -n 100 -t 60s
Bombarding http://localhost:5000 with 100 request(s) using 10 connection(s)
100 / 100 [==========================================================================================] 100.00% 8/s 11s
Done!
Statistics Avg Stdev Max
Reqs/sec 9.07 26.73 211.08
Latency 1.06s 179.57ms 2.04s
Latency Distribution
50% 1.01s
75% 1.02s
90% 1.07s
95% 1.19s
99% 1.86s
HTTP codes:
1xx - 0, 2xx - 100, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 1.79KB/s
Based on all of the above, I confirm that a single request takes ~1 second. As for bulk request benchmarks, please try Postman; otherwise we need to dig deeper to understand what bombardier's Latency means exactly and how it is calculated.
Update
I made a small console tool that fires HttpClient requests in bulk, and it confirms the ~1 second response time. I also tried two benchmarking tools from awesome-http-benchmark:
baton.exe -u http://localhost:5000 -c 10 -r 100
Configuring to send GET requests to: http://localhost:5000
Generating the requests...
Finished generating the requests
Sending the requests to the server...
Finished sending the requests
Processing the results...
====================== Results ======================
Total requests: 100
Time taken to complete requests: 10.2670832s
Requests per second: 10
===================== Breakdown =====================
Number of connection errors: 0
Number of 1xx responses: 0
Number of 2xx responses: 100
Number of 3xx responses: 0
Number of 4xx responses: 0
Number of 5xx responses: 0
=====================================================
cassowary run -u http://localhost:5000 -c 10 -n 100
Starting Load Test with 100 requests using 10 concurrent users
100% |████████████████████████████████████████| [10s:0s] 10.2299727s
TCP Connect.....................: Avg/mean=1.90ms Median=2.00ms p(95)=2.00ms
Server Processing...............: Avg/mean=1014.94ms Median=1008.00ms p(95)=1093.00ms
Content Transfer................: Avg/mean=0.17ms Median=0.00ms p(95)=1.00ms
Summary:
Total Req.......................: 100
Failed Req......................: 0
DNS Lookup......................: 5.00ms
Req/s...........................: 9.78
Finally, I used nginx on port 2020 as a reverse proxy in front of Kestrel to see bombardier's output:
bombardier-windows-amd64.exe http://localhost:2020 -l -n 100 -t 60s
Bombarding http://localhost:2020 with 100 request(s) using 125 connection(s)
100 / 100 [==========================================================================================] 100.00% 9/s 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 11.76 128.07 2002.66
Latency 9.08s 761.43ms 10.04s
Latency Distribution
50% 9.06s
75% 9.07s
90% 9.07s
95% 10.02s
99% 10.04s
HTTP codes:
1xx - 0, 2xx - 95, 3xx - 0, 4xx - 0, 5xx - 5
others - 0
Throughput: 2.51KB/s
As you can see, even with nginx it shows 9 seconds of latency! That should point to an issue with bombardier's latency definition/calculation.
Below is the nginx config:
server {
    listen 2020;
    location / {
        proxy_pass http://localhost:5000/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_redirect off;
    }
}
Bonus
If you want to hack the performance of Thread.Sleep to be similar to await Task.Delay, then change Main to:
public static void Main(string[] args)
{
    ThreadPool.SetMinThreads(130, 130); // Don't use in production.
    CreateHostBuilder(args).Build().Run();
}

nginx response status variable

I have a question. We use an nginx + uwsgi stack and we see many errors like:
Aug 30 00:00:55 imfmce-va-81-2 uwsgi: Tue Aug 30 00:00:55 2016 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /provisioning/user/f205970b-6a9f-42b5-830f-c2bec9967b32 (ip 10.216.153.254) !!!
I understand that this error occurs when the client closes the connection before reading the response, or because of uwsgi_read_timeout, but I don't understand why I can't see any error in the access log; nginx just logs 200 OK:
Aug 30 00:00:55 imfmce-va-81-2 provisioning: active [ 55544 ] 10.216.153.254 Sync-Wopi-SyncLocksTask hostpilot 73af65e4-5984-4b2c-baf4-c88cf8385898 - ECDHE-RSA-AES256-GCM-SHA384 GET /provisioning/user/f205970b-6a9f-42b5-830f-c2bec9967b32 - 0,0,1,0 200 - 1 OK - 321 515 844
We use the following log format:
log_format ss_log_format "active\t[ \$pid ]\t\$remote_addr\t\$http_user_agent\t\$upstream_http_x_user_identity\t\$http_x_client_id\t\$http_x_request_id\t\$ssl_cipher\t\$request_method\t\$uri\t\$args\t\$upstream_http_x_durations\t\$status\t\$upstream_status\t\$http_x_error_code\t\$connection_requests\t\$request_completion\t\$content_length\t\$request_length\t\$body_bytes_sent\t\$bytes_sent";
I would like you to understand that we don't need to fix this error; we just need to have correct access logs.

HTTP streaming / chunked responses on Heroku with clojure

I'm making a clojure web app that streams data to clients using chunked HTTP responses. This works great when I run it locally using foreman, but doesn't work properly when I deploy it to Heroku.
A minimal example exhibiting this behaviour can be found on my github here. The frontend (in resources/index.html) performs an AJAX GET request and prints the response chunks as they arrive. The server uses http-kit to send a new chunk to connected clients every second. By design, the HTTP request never completes.
When the same code is deployed to Heroku, the HTTP connection is closed by the server immediately after the first chunk is sent. It seems to be Heroku's routing mesh which is causing this disconnection to occur.
This can also be seen by performing the GET request using curl:
$ curl -v http://arcane-headland-2284.herokuapp.com/stream
* About to connect() to arcane-headland-2284.herokuapp.com port 80 (#0)
* Trying 54.243.166.168...
* Adding handle: conn: 0x6c3be0
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x6c3be0) send_pipe: 1, recv_pipe: 0
* Connected to arcane-headland-2284.herokuapp.com (54.243.166.168) port 80 (#0)
> GET /stream HTTP/1.1
> User-Agent: curl/7.31.0
> Host: arcane-headland-2284.herokuapp.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Date: Sat, 17 Aug 2013 16:57:24 GMT
* Server http-kit is not blacklisted
< Server: http-kit
< transfer-encoding: chunked
< Connection: keep-alive
<
* transfer closed with outstanding read data remaining
* Closing connection 0
curl: (18) transfer closed with outstanding read data remaining
The time is currently Sat Aug 17 16:57:24 UTC 2013 <-- this is the first chunk
Can anybody suggest why this is happening? HTTP streaming is supposed to be supported in Heroku's Cedar stack. The fact that the code runs correctly using foreman suggests it is something in Heroku's routing mesh causing it to break.
Live demo of the failing project: http://arcane-headland-2284.herokuapp.com/
This was due to a bug in http-kit which will be fixed shortly.
https://devcenter.heroku.com/articles/request-timeout may be relevant: "long-polling" requests like yours have to send data every 55 seconds or be terminated.

NGINX + uWSGI Connection Reset by Peer

I'm trying to host a Bottle application on NGINX using uWSGI.
Here's my nginx.conf
location /myapp/ {
    include uwsgi_params;
    uwsgi_param X-Real-IP $remote_addr;
    uwsgi_param Host $http_host;
    uwsgi_param UWSGI_SCRIPT myapp;
    uwsgi_pass 127.0.0.1:8080;
}
I'm running uwsgi like this:
uwsgi --enable-threads --socket :8080 --plugin python --wsgi-file ./myApp/myapp.py
I'm sending a POST request, using Dev HTTP Client for that, and it hangs indefinitely when I send the request to
http://localhost/myapp
uWSGI server receives the request and prints
[pid: 4683|app: 0|req: 1/1] 127.0.0.1 () {50 vars in 806 bytes} [Thu Oct 25 12:29:36 2012] POST /myapp => generated 737 bytes in 11 msecs (HTTP/1.1 404) 2 headers in 87 bytes (1 switches on core 0)
but in the nginx error log I see
2012/10/25 12:20:16 [error] 4364#0: *11 readv() failed (104: Connection reset by peer) while reading upstream, client: 127.0.0.1, server: localhost, request: "POST /myApp/myapp/ HTTP/1.1", upstream: "uwsgi://127.0.0.1:8080", host: "localhost"
What to do?
Make sure to consume your POST data in your application.
For example, if you have a Django/Python application:
from django.http import HttpResponse

def my_view(request):
    # ensure the post data is read, even if you don't need it;
    # without this you get: failed (104: Connection reset by peer)
    data = request.body
    return HttpResponse("Hello World")
Some details: https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html
You cannot post data from the client without reading it in your application. While this is not a problem for uWSGI itself, nginx will fail. You can 'fake' it using the --post-buffering option of uWSGI to automatically read data from the socket (if available), but you'd better "fix" your app (even if I don't consider this a bug).
This problem occurs when the body of a request is not consumed, since uwsgi cannot know whether it will still be needed at some point. So uwsgi will keep holding on to the data either until it is consumed or until nginx resets the connection (because upstream timed out).
The author of uwsgi explains it here:
08:21 < unbit> plaes: does your DELETE request (not-response) have a body ?
08:40 < unbit> and do you read that body in your app ?
08:41 < unbit> from the nginx logs it looks like it has a body and you are not reading it in the app
08:43 < plaes> so DELETE request shouldn't have the body?
08:43 < unbit> no i mean if a request has a body you have to read/consume it
08:44 < unbit> otherwise the socket will be clobbered
So to fix this you need to make sure to always either read the whole request body, or not send a body at all if it is not necessary (e.g. for a DELETE).
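Since the app in the question is a Bottle app, the same idea there would look roughly like this (just a sketch; the route and handler name are made up):

# myapp.py - minimal Bottle sketch; the key point is that the handler reads the POST body.
import bottle
from bottle import post, request

@post('/myapp')
def handler():
    _ = request.body.read()  # consume the body even if you don't use it
    return "Hello World"

# uWSGI's default callable lookup expects a module-level "application"
application = bottle.default_app()

Alternatively, as noted above, uWSGI's --post-buffering option makes uWSGI read the body itself.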
Don't use threads!
I had the same problem with the Global Interpreter Lock in Python under uwsgi.
When I don't use threads, there are no connection resets.
Example uwsgi config (1 GB of RAM on the server):
[root@mail uwsgi]# cat myproj_config.yaml
uwsgi:
  print: Myproject Configuration Started
  socket: /var/tmp/myproject_uwsgi.sock
  pythonpath: /sites/myproject/myproj
  env: DJANGO_SETTINGS_MODULE=settings
  module: wsgi
  chdir: /sites/myproject/myproj
  daemonize: /sites/myproject/log/uwsgi.log
  max-requests: 4000
  buffer-size: 32768
  harakiri: 30
  harakiri-verbose: true
  reload-mercy: 8
  vacuum: true
  master: 1
  post-buffering: 8192
  processes: 4
  no-orphans: 1
  touch-reload: /sites/myproject/log/uwsgi

Siege unknown responses

I'm trying to test my server's resistance to high load with the siege utility:
siege http://my.server.ru/ -d1 -r10 -c100
Siege outputs a lot of messages like this:
HTTP/1.1 200 0.46 secs: 10298 bytes ==> /
but sometimes there are error messages like this:
Error: socket: unable to connect sock.c:220: Connection timed out
or this:
warning: socket: -598608128 select timed out: Connection timed out
Here is the siege report after testing:
Transactions: 949 hits
Availability: 94.90 %
...
Successful transactions: 949
Failed transactions: 51
Longest transaction: 9.87
Shortest transaction: 0.37
In the nginx logs on my server there are only 950 messages, all with code 200, indicating the responses were all right:
"GET / HTTP/1.1" 200 10311 "-" "JoeDog/1.00 [en] (X11; I; Siege 2.68)"
Can anyone tell me what these messages mean:
Error: socket: unable to connect sock.c:220: Connection timed out
warning: socket: -598608128 select timed out: Connection timed out
and why in my nginx logs I only see responses with code 200?
It probably means your pipe is full and can't handle more connections. You can't make nginx or the nginx backends accept more connections if your pipe is full. Try testing against localhost; you will then be testing the stack rather than the stack plus the pipe. It will resemble real load less, but it will give you an idea of what you can handle with a bigger pipe.
