NGINX Reverse Proxy : Many html status code 400 responses, why?

NGINX Reverse Proxy : Many html status code 400 responses, why? - nginx

We have recently implemented a nginx based reverse proxy.
While, debugging our access logs, we are seeing quite a bit of status code 400 results.
They look something like this:
[07/Sep/2011:05:49:04 -0700] - "400" 0 "-" "-" "-"
We have enabled debug error logging, and they usually correspond to something like this:
2011/09/07 05:09:28 [info] 5937#0: *30904 client closed prematurely connection while reading client request line
We have tried raising a number of the buffers, as mentioned by a few pages we were able to google up.
http://www.ruby-forum.com/topic/173362
or
http://blog.craz8.com/articles/2009/06/17/nginx-400-bad-request-errors-due-to-cookies-and-what-to-do-about-them
To no avail.
Why is this happening?
This is a strandard nginx reverse proxy -> apache backend server.
Worth mentioning, the unique type of content on our site is fairly minimal. We have tested this using many browsers and are not personally receiving any of these 400 results.
Thanks!
Further urls detailing similar entries in their logs:
http://blog.rayfoo.info/2009/10/weird-web-server-access-log-entries

I found this was caused by using Chrome, which apparently opens extra connections occasionally without sending any data.
Here's some more info: http://www.ruby-forum.com/topic/2953545
Now the question is what to do about them - the answer provided there wasn't very satisfying.

Are you handling SSL connections? Can you add $ssl_cipher $ssl_protocol to your access log format?

First, it's fairly possible that your clients send request with really big http headers or urls. Maybe an older version of your application set some (probably big) cookies which are unused now and some clients are still trying send them back.
I'd set the header buffers to a really big value and on the application side log the size of the headers/requests and the complete request if they are bigger than usual. Or completely take out the nginx from the chain and log the header/request with the same conditions. If you can, take out the nginx for only those IPs/subnets where the 400 errors came from. I suppose nginx can log the source IP for these 400 errors.

Related

How to monitor nginx response time with ELK stack?

I'd like to create a monitor that will show near realtime average response time of nginx.
Below image shows CPU usage for example, I'd like to create something similar for avg response time
I know how I can track the response time for individual requests (https://lincolnloop.com/blog/tracking-application-response-time-nginx/)
Although I 'll have to think how to ignore non-page / api requests such as static image request.
This must be pretty basic requirements, but couldn't find google how to do it.

This is actually trickier than you'd expect:
Metricbeat
The nginx module of Metricbeat doesn't contain this information. It's built around stubstatus and is more around the process itself rather than the timing of individual requests.
Filebeat
The nginx module for Filebeat is where you might expect this. It's built around the nginx access log and has the individual requests. Unfortunately the response time isn't part of the access log by default (at least on Ubuntu) — only the number of bytes sent. Here's an example (response code 200, 158 bytes sent):
34.24.14.22 - - [10/Nov/2019:06:54:51 +0000] "GET / HTTP/1.1" 200 159 "-" "Go-http-client/1.1"
Packetbeat
This one has a field called event.duration that sounds promising. But be careful with the HTTP module — this one is really only for HTTP traffic and not HTTPS (because you can't see the encrypted traffic). In most cases you'll want to use HTTPS for your application, so this isn't all that helpful and will mostly show redirects to HTTPS.
The other protocols such as TLS (this is only the time for the initial handshake) or Flow information (this is a group of packets) are not what you are after IMO.
Customization
I'm afraid you'll need some customization and you basically have two options:
Customize the log format of nginx as described in the blog post you linked to. You'll also need to change the pattern in the Elasticsearch ingest pipeline to extract the timing information correctly.
I assume you have an application behind nginx. Then you might want to get even more insights into that than just timing by using (APM / tracing](https://www.elastic.co/products/apm) with the agents for various languages. This way you'll also automatically skip static resources like images and focus on the relevant parts of your application.

Tomcat occasionally returns a response without HTTP headers

I’m investigating a problem where Tomcat (7.0.90 7.0.92) returns a response with no HTTP headers very occasionally.
According to the captured packets by Wireshark, after Tomcat receives a request it just returns only a response body. It returns neither a status line nor HTTP response headers.
It makes a downstream Nginx instance produce the error “upstream sent no valid HTTP/1.0 header while reading response header from upstream”, return 502 error to the client and close the corresponding http connection between Nginx and Tomcat.
What can be a cause of this behavior? Is there any possibility which makes Tomcat behave this way? Or there can be something which strips HTTP headers under some condition? Or Wireshark failed to capture the frames which contain the HTTP headers? Any advice to narrow down where the problem is is also greatly appreciated.
This is a screenshot of Wireshark's "Follow HTTP Stream" which is showing the problematic response:
EDIT:
This is a screen shot of "TCP Stream" of the relevant part (only response). It seems that the chunks in the second response from the last looks fine:
EDIT2:
I forwarded this question to the Tomcat users mailing list and got some suggestions for further investigation from the developers:
http://tomcat.10.x6.nabble.com/Tomcat-occasionally-returns-a-response-without-HTTP-headers-td5080623.html
But I haven’t found any proper solution yet. I’m still looking for insights to tackle this problem..

The issues you experience stem from pipelining multiple requests over a single connection with the upstream, as explained by yesterday's answer here by Eugène Adell.
Whether this is a bug in nginx, tomcat, your application, or the interaction of any combination of the above, would probably be a discussion for another forum, but for now, let's consider what would be the best solution:
Can you post your nginx configuration? Specifically, if you're using keepalive and a non-default value of proxy_http_version within nginx? – cnst 1 hour ago
#cnst I'm using proxy_http_version 1.1 and keepalive 100 – Kohei Nozaki 1 hour ago
As per an earlier answer to an unrelated question here on SO, yet sharing the configuration parameters as above, you might want to reconsider the reasons behind your use of the keepalive functionality between the front-end load-balancer (e.g., nginx) and the backend application server (e.g., tomcat).
As per a keepalive explanation on ServerFault in the context of nginx, the keepalive functionality in the upstream context of nginx wasn't even supported until very-very recently in the nginx development years. Why? It's because there are very few valid scenarios for using keepalive when it's basically faster to establish a new connection than to wait for an existing one to become available:
When the latency between the client and the server is on the order of 50ms+, keepalive makes it possible to reuse the TCP and SSL credentials, resulting in a very significant speedup, because no extra roundtrips are required to get the connection ready for servicing the HTTP requests.
This is why you should never disable keepalive between the client and nginx (controlled through http://nginx.org/r/keepalive_timeout in http, server and location contexts).
But when the latency between the front-end proxy server and the backend application server is on the order of 1ms (0.001s), using keepalive is a recipe for chasing Heisenbugs without reaping any benefits, as the extra 1ms latency to establish a connection might as well be less than the 100ms latency of waiting for an existing connection to become available. (This is a gross oversimplification of connection handling, but it just shows you how extremely insignificant any possible benefits of the keepalive between the front-end load-balancer and the application server would be, provided both of them live in the same region.)
This is why using http://nginx.org/r/keepalive in the upstream context is rarely a good idea, unless you really do need it, and have specifically verified that it produces the results you desire, given the points as above.
(And, just to make it clear, these points are irrespective of what actual software you're using, so, even if you weren't experiencing the problems you experience with your combination of nginx and tomcat, I'd still recommend you not use keepalive between the load-balancer and the application server even if you decide to switch away from either or both of nginx and tomcat.)
My suggestion?
The problem wouldn't be reproducible with the default values of http://nginx.org/r/proxy_http_version and http://nginx.org/r/keepalive.
If your backend is within 5ms of front-end, you most certainly aren't even getting any benefits from modifying these directives in the first place, so, unless chasing Heisenbugs is your path, you might as well keep these specific settings at their most sensible defaults.

We see that you are reusing an established connection to send the POST request and that, as you said, the response comes without the status-line and the headers.
after Tomcat receives a request it just returns only a response body.
Not exactly. It starts with 5d which is probably a chunk-size and this means that the latest "full" response (with status-line and headers) got from this connection contained a "Transfer-Encoding: chunked" header. For any reason, your server still believes the previous response isn't finished by the time it starts sending this new response to your last request.
A missing chunked seems confirmed as the screenshot doesn't show a last-chunk (value = 0) ending the previous request. Note that the last response ends with a last-chunk (the last byte shown is 0).
What causes this ? The previous response isn't technically considered as fully answered. It can be a bug on Tomcat, your webservice library, your own code. Maybe even, you're sending your request too early, before the previous one was completely answered.
Are some bytes missing if you compare the chunk-sizes from what is actually sent to the client ? Are all buffers flushed ? Beware of the line endings (CRLF vs LF only) too.
One last cause that I'm thinking about, if your response contains some kind of user input taken from the request, you can be facing HTTP Splitting.
Possible solutions.
It is worth trying to disable the chunked encoding at your library level, for example with Axis2 check the HTTP Transport.
When reusing a connection, check your client code to make sure that you aren't sending a request before you read all of the previous response (to avoid overlapping).
Further reading
RFC 2616 3.6.1 Chunked Transfer Coding

It turned out that the "sjsxp" library which JAX-WS RI v2.1.3 uses makes Tomcat behave this way. I tried a different version of JAX-WS RI (v2.1.7) which doesn't use the "sjsxp" library anymore and it solved the issue.
A very similar issue posted on Metro mailing list: http://metro.1045641.n5.nabble.com/JAX-WS-RI-2-1-5-returning-malformed-response-tp1063518.html

What is HTTP Status code 000?

Just switched some downloads over to the Akamai CDN network and I'm seeing some strange stuff in the log files they deliver. A number of entries have the status code 000. When I asked them they said that 000 is the status when the client disconnects without transferring the entire file. Since 000 doesn't appear to be a valid HTTP response code (from the RFC), I have to wonder if that's right.

There's a knowledge base article (requires login) which lists their log values:
Log Delivery Services (LDS) LDS will show a 000 for any 200 or 206
responses with a client abort: the object was served correctly from
the origin or edge, but the end-user terminated the
connection/transaction before it completed.
This is indeed a custom status because the standard log format doesn't include a field which can indicate a client abort.

000 is a common code to use when no HTTP code was received due to a network error. According to a knowledge base article for Amazon CloudFront, 000 also means that the client disconnected before completing the request for that service.

It normally means: No valid HTTP response code
(ie: Connection failed, or was aborted before any data happened).
I would guess that their are either network issues or Akamai isn't managing their webservers correctly.

Blackberry 9000 getting HTTP error 406 When using WiFi

So, I have a Blackberry 9000 application doing simple networking using HttpConnection. Everything works fine normally, when I go to urls of the form:
http://url.com
But I've discovered that I need to test this in wifi only situations (that is, without a BES or equivalent in sight). After some digging, I discovered that I need to add:
;interface=wifi
To all of my URLS, of the form:
http://url.com;interface=wifi
However, I'm noticing that this does not actually work, it gives me back a HTTP error 406. Which according to wiki is a:
406 Not Acceptable
The requested resource is only capable of generating content not acceptable according to the Accept headers sent in the request.[2]
Am I doing something completely wrong? Does Blackberry wrap wifi only requests in headers that require particularly formatted websites?

As explained on this page, you also need add "deviceside=true" to the URL.
To specify that the underlying TCP
connection should be opened directly
from the handheld, set this parameter
to "true". Specify "deviceside=false"
when receiving or sending data through
the BlackBerry MDS Connection Service.
So your full URL would be:
http://url.com;interface=wifi;deviceside=true

IIS file download hangs/timeouts - sc-win32-status = 64

Any thoughts on why I might be getting tons of "hangs" when trying to download a file via HTTP, based on the following?
Server is IIS 6
File being downloaded is a binary file, rather than a web page
Several clients hang, including TrueUpdate and FlexNet web updating packages, as well as custom .NET app that just does basic HttpWebRequest/HttpWebResponse logic and downloads using a response stream
IIS log file signature when success is 200 0 0 (sc-status sc-substatus sc-win32-status)
For failure, error signature is 200 0 64
sc-win32-status of 64 is "the specified network name is no longer available"
I can point firefox at the URL and download successfully every time (perhaps some retry logic is happening under the hood)
At this point, it seems like either there's something funky with my server that it's throwing these errors, or that this is just normal network behavior and I need to use (or write) a client that is more resilient to the failures.
Any thoughts?

Perhaps your issue was a low level networking issue with the ISP as you speculated in your reply comment. I am experiencing a similar problem with IIS and some mysterious 200 0 64 lines appearing in the log file, which is how I found this post. For the record, this is my understanding of sc-win32-status=64; I hope someone can correct me if I'm wrong.
sc-win32-status 64 means “The specified network name is no longer available.”
After IIS has sent the final response to the client, it waits for an ACK message from the client.
Sometimes clients will reset the connection instead of sending the final ACK back to server. This is not a graceful connection close, so IIS logs the “64” code to indicate an interruption.
Many clients will reset the connection when they are done with it, to free up the socket instead of leaving it in TIME_WAIT/CLOSE_WAIT.
Proxies may have a tendancy to do this more often than individual clients.

I've spent two weeks investigating this issue. For me I had the scenario in which intermittent random requests were being prematurely terminated. This was resulting in IIS logs with status code 200, but with a win32-status of 64.
Our infrastructure includes two Windows IIS servers behind two NetScaler load balancers in HA mode.
In my particular case, the problem was that the NetScaler had a feature called "Intergrated Caching" turned on (http://support.citrix.com/proddocs/topic/ns-optimization-10-5-map/ns-IC-gen-wrapper-10-con.html).
After disabling this feature, the request interruptions ceased. And the site operated normally. I'm not sure how or why this was causing a problem, but there it is.
If you use a proxy or a load balancer, do some investigation of what features they have turned on. For me the cause was something between the client and the server interrupting the requests.
I hope that this explanation will at least save someone else's time.

Check the headers from the server, especially content-type, and content-length, it's possible that your clients don't recognize the format of the binary file and hang while waiting for bytes that never come, or maybe they close the underlying TCP connection, which may cause IIS to log the win32 status 64.

Spent three days on this.
It was the timeout that was set to 4 seconds (curl php request).
Solution was to increase the timeout setting:
//curl_setopt($ch, CURLOPT_TIMEOUT, 4); // times out after 4s
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // times out after 60s

You will have to use wireshare or network monitor to gather more data on this problem. Me think.

I suggest you put Fiddler in between your server and your download client. This should reveal the differences between Firefox and other cients.

Description of all sc-win32-status codes for reference
https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
ERROR_NETNAME_DELETED
64 (0x40)
The specified network name is no longer available.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex