How to monitor nginx response time with ELK stack?

I'd like to create a monitor that will show near realtime average response time of nginx.
The image below shows CPU usage as an example; I'd like to create something similar for average response time.
I know how I can track the response time for individual requests (https://lincolnloop.com/blog/tracking-application-response-time-nginx/)
Although I'll have to think about how to ignore non-page/API requests, such as static image requests.
This must be a pretty basic requirement, but I couldn't find out how to do it on Google.

This is actually trickier than you'd expect:
Metricbeat
The nginx module of Metricbeat doesn't contain this information. It's built around stubstatus and is more around the process itself rather than the timing of individual requests.
Filebeat
The nginx module for Filebeat is where you might expect this. It's built around the nginx access log and has the individual requests. Unfortunately the response time isn't part of the access log by default (at least on Ubuntu); only the number of bytes sent is. Here's an example (response code 200, 159 bytes sent):
34.24.14.22 - - [10/Nov/2019:06:54:51 +0000] "GET / HTTP/1.1" 200 159 "-" "Go-http-client/1.1"
Packetbeat
This one has a field called event.duration that sounds promising. But be careful with the HTTP module — this one is really only for HTTP traffic and not HTTPS (because you can't see the encrypted traffic). In most cases you'll want to use HTTPS for your application, so this isn't all that helpful and will mostly show redirects to HTTPS.
The other protocols such as TLS (this is only the time for the initial handshake) or Flow information (this is a group of packets) are not what you are after IMO.
Customization
I'm afraid you'll need some customization and you basically have two options:
Customize the log format of nginx as described in the blog post you linked to. You'll also need to change the pattern in the Elasticsearch ingest pipeline to extract the timing information correctly.
I assume you have an application behind nginx. Then you might want to get even more insights into that than just timing by using [APM / tracing](https://www.elastic.co/products/apm) with the agents for various languages. This way you'll also automatically skip static resources like images and focus on the relevant parts of your application.
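As a rough illustration of the first option, here is a minimal Python sketch of the calculation you'd want once the timing is in the log. It assumes the access log keeps the default combined layout with nginx's $request_time appended as the last field. That layout, the two extra sample lines, the STATIC_SUFFIXES list and the average_request_time helper are assumptions made up for this example, not part of the Filebeat module. In the Elastic Stack itself you would do the equivalent with an adjusted ingest pipeline pattern and an average aggregation.

```python
import re
from statistics import mean

# Assumed log layout: default combined format with $request_time appended
# as the final field (seconds, e.g. 0.040). Adjust the pattern to your format.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '  # request line
    r'(?P<status>\d{3}) (?P<bytes>\d+) '           # status code, bytes sent
    r'"[^"]*" "[^"]*" '                            # referer, user agent
    r'(?P<request_time>\d+\.\d+)$'                 # appended $request_time
)

# Hypothetical list of extensions to treat as static resources.
STATIC_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico", ".svg")


def average_request_time(lines):
    """Return the mean request time in seconds, skipping static resources.

    Returns None if no line matches.
    """
    times = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # line does not follow the assumed format
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(STATIC_SUFFIXES):
            continue  # ignore images, stylesheets, scripts
        times.append(float(match.group("request_time")))
    return mean(times) if times else None


# First line is the example from above with a made-up timing appended;
# the other two lines are invented for illustration.
sample = [
    '34.24.14.22 - - [10/Nov/2019:06:54:51 +0000] "GET / HTTP/1.1" 200 159 "-" "Go-http-client/1.1" 0.040',
    '203.0.113.7 - - [10/Nov/2019:06:54:52 +0000] "GET /logo.png HTTP/1.1" 200 5120 "-" "Mozilla/5.0" 0.002',
    '203.0.113.7 - - [10/Nov/2019:06:54:53 +0000] "GET /api/users?page=2 HTTP/1.1" 200 812 "-" "Mozilla/5.0" 0.120',
]

average = average_request_time(sample)
```

Filtering by file extension is only one way to drop static requests; filtering by location prefix or by content type would work just as well if your URLs allow it.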

Related

NGINX as warm cache in front of wowza for HLS live streams - Get per stream data duration and data transferred?

I've set up NGINX as a warm cache server in front of Wowza > HTTP-Origin application to act as an edge server. The config is working great, streaming over HTTPS with nDVR and adaptive streaming support. I've combed the internet looking for examples and help on configuring NGINX and/or other solutions to give me live statistics (# of viewers per stream_name), as well as to parse the logs to give me stream duration per stream_name/session and data_transferred per stream_name/session. The logging in NGINX for HLS streams logs each video chunk. With Wowza, it is a bit easier to get this data by reading the duration or bytes transferred values from the logs when the stream is destroyed... Any help on this subject would be hugely appreciated. Thank you.
Nginx isn't aware of what the chunks are. It's only serving resources to clients over HTTP, and doesn't know or care that they're interrelated. Therefore, you'll have to derive the data you need from the logs.
To associate client requests together as one, you need some way to track state between requests, and then log that state. Cookies are a common way to do this. Alternatively, you could put some sort of session identifier in the request URI, but this hurts your caching ability since each client is effectively requesting a different resource.
Once you have some sort of session ID logged, you can process those logs with tools such as Elastic Stack to piece together the reports you're looking for.
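To make the log-processing step a bit more concrete, here is a minimal Python sketch of the aggregation, assuming you have already parsed each chunk request into a tuple of session ID, stream name, timestamp and bytes sent. The record layout, the sample values and the summarize_sessions helper are all hypothetical. With Elastic Stack you would get the same result from aggregations over the session ID field.

```python
def summarize_sessions(records):
    """Aggregate per-chunk log records into per-session totals.

    records: iterable of (session_id, stream_name, epoch_seconds, bytes_sent),
    one entry per chunk request found in the access log (assumed layout).
    Returns {(stream_name, session_id): {"duration": seconds, "bytes": total}}.
    """
    sessions = {}
    for session_id, stream_name, timestamp, bytes_sent in records:
        entry = sessions.setdefault(
            (stream_name, session_id),
            {"first": timestamp, "last": timestamp, "bytes": 0},
        )
        entry["first"] = min(entry["first"], timestamp)
        entry["last"] = max(entry["last"], timestamp)
        entry["bytes"] += bytes_sent
    return {
        key: {"duration": value["last"] - value["first"], "bytes": value["bytes"]}
        for key, value in sessions.items()
    }


# Made-up records: two viewers of the same stream.
records = [
    ("abc", "live1", 100, 1000),
    ("abc", "live1", 110, 2000),
    ("xyz", "live1", 105, 500),
]
summary = summarize_sessions(records)
viewers_per_stream = {}
for stream_name, _session_id in summary:
    viewers_per_stream[stream_name] = viewers_per_stream.get(stream_name, 0) + 1
```

Note that the duration here is only the time between the first and the last chunk request, so it undercounts by roughly the length of the final chunk.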
Depending on your goals with this, you might find it better to get your data client-side. There, you have a better idea of what a session actually is, and then you can log client-side items such as buffer levels and latency and what not. The HTTP requests don't really tell you much about the experience the end users are getting. If that's what you want to know, you should use the log from the clients, not from your HTTP servers. Your HTTP server log is much more useful for debugging underlying technical infrastructure issues.

Tomcat occasionally returns a response without HTTP headers

I’m investigating a problem where Tomcat (7.0.90 7.0.92) returns a response with no HTTP headers very occasionally.
According to the captured packets by Wireshark, after Tomcat receives a request it just returns only a response body. It returns neither a status line nor HTTP response headers.
It makes a downstream Nginx instance produce the error “upstream sent no valid HTTP/1.0 header while reading response header from upstream”, return 502 error to the client and close the corresponding http connection between Nginx and Tomcat.
What can be a cause of this behavior? Is there any possibility which makes Tomcat behave this way? Or there can be something which strips HTTP headers under some condition? Or Wireshark failed to capture the frames which contain the HTTP headers? Any advice to narrow down where the problem is is also greatly appreciated.
This is a screenshot of Wireshark's "Follow HTTP Stream" which is showing the problematic response:
EDIT:
This is a screenshot of the "TCP Stream" of the relevant part (only the response). It seems that the chunks in the second-to-last response look fine:
EDIT2:
I forwarded this question to the Tomcat users mailing list and got some suggestions for further investigation from the developers:
http://tomcat.10.x6.nabble.com/Tomcat-occasionally-returns-a-response-without-HTTP-headers-td5080623.html
But I haven't found any proper solution yet. I'm still looking for insights to tackle this problem.
The issues you experience stem from pipelining multiple requests over a single connection with the upstream, as explained by yesterday's answer here by Eugène Adell.
Whether this is a bug in nginx, tomcat, your application, or the interaction of any combination of the above, would probably be a discussion for another forum, but for now, let's consider what would be the best solution:
Can you post your nginx configuration? Specifically, if you're using keepalive and a non-default value of proxy_http_version within nginx? – cnst 1 hour ago
@cnst I'm using proxy_http_version 1.1 and keepalive 100 – Kohei Nozaki 1 hour ago
As per an earlier answer to an unrelated question here on SO, yet sharing the configuration parameters as above, you might want to reconsider the reasons behind your use of the keepalive functionality between the front-end load-balancer (e.g., nginx) and the backend application server (e.g., tomcat).
As per a keepalive explanation on ServerFault in the context of nginx, the keepalive functionality in the upstream context of nginx wasn't even supported until very-very recently in the nginx development years. Why? It's because there are very few valid scenarios for using keepalive when it's basically faster to establish a new connection than to wait for an existing one to become available:
When the latency between the client and the server is on the order of 50ms+, keepalive makes it possible to reuse the TCP and SSL credentials, resulting in a very significant speedup, because no extra roundtrips are required to get the connection ready for servicing the HTTP requests.
This is why you should never disable keepalive between the client and nginx (controlled through http://nginx.org/r/keepalive_timeout in http, server and location contexts).
But when the latency between the front-end proxy server and the backend application server is on the order of 1ms (0.001s), using keepalive is a recipe for chasing Heisenbugs without reaping any benefits, as the extra 1ms latency to establish a connection might as well be less than the 100ms latency of waiting for an existing connection to become available. (This is a gross oversimplification of connection handling, but it just shows you how extremely insignificant any possible benefits of the keepalive between the front-end load-balancer and the application server would be, provided both of them live in the same region.)
This is why using http://nginx.org/r/keepalive in the upstream context is rarely a good idea, unless you really do need it, and have specifically verified that it produces the results you desire, given the points as above.
(And, just to make it clear, these points are irrespective of what actual software you're using, so, even if you weren't experiencing the problems you experience with your combination of nginx and tomcat, I'd still recommend you not use keepalive between the load-balancer and the application server even if you decide to switch away from either or both of nginx and tomcat.)
My suggestion?
The problem wouldn't be reproducible with the default values of http://nginx.org/r/proxy_http_version and http://nginx.org/r/keepalive.
If your backend is within 5ms of front-end, you most certainly aren't even getting any benefits from modifying these directives in the first place, so, unless chasing Heisenbugs is your path, you might as well keep these specific settings at their most sensible defaults.
We see that you are reusing an established connection to send the POST request and that, as you said, the response comes without the status-line and the headers.
after Tomcat receives a request it just returns only a response body.
Not exactly. It starts with 5d which is probably a chunk-size and this means that the latest "full" response (with status-line and headers) got from this connection contained a "Transfer-Encoding: chunked" header. For any reason, your server still believes the previous response isn't finished by the time it starts sending this new response to your last request.
The missing last-chunk seems to be confirmed: the screenshot doesn't show a last-chunk (size 0) ending the previous response. Note that the last response does end with a last-chunk (the last byte shown is 0).
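To illustrate the framing, here is a minimal Python sketch of a chunked-body decoder that also reports whether the terminating zero-size chunk was seen. It is a simplified illustration, not what Tomcat or nginx actually run: it ignores trailers, and the decode_chunked name and the sample bodies are made up for this example.

```python
def decode_chunked(body: bytes):
    """Decode a chunked transfer-coded body (simplified, trailers ignored).

    Returns (payload, complete). complete is True only if the zero-size
    last-chunk was found, i.e. the response was properly terminated.
    """
    payload = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            return bytes(payload), False  # ran out of data before last-chunk
        # The chunk-size is hexadecimal; anything after ';' is an extension.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
        except ValueError:
            return bytes(payload), False  # not a valid chunk-size line
        if size == 0:
            return bytes(payload), True  # last-chunk reached
        start = eol + 2
        end = start + size
        if body[end:end + 2] != b"\r\n":
            return bytes(payload), False  # truncated or wrong line ending
        payload += body[start:end]
        pos = end + 2


# A chunk-size of 5d (hex) announces 93 bytes of data.
terminated = b"5d\r\n" + b"x" * 93 + b"\r\n0\r\n\r\n"
unterminated = b"5\r\nhello\r\n"

ok_payload, ok_complete = decode_chunked(terminated)
cut_payload, cut_complete = decode_chunked(unterminated)
```

Running something like this over the bytes captured for the previous response is one way to check whether its last-chunk ever went out on the wire.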
What causes this? The previous response isn't technically considered fully answered. It can be a bug in Tomcat, your web service library, or your own code. Maybe you're even sending your request too early, before the previous one was completely answered.
Are some bytes missing if you compare the chunk sizes with what is actually sent to the client? Are all buffers flushed? Beware of the line endings (CRLF vs LF only) too.
One last cause that I'm thinking about, if your response contains some kind of user input taken from the request, you can be facing HTTP Splitting.
Possible solutions.
It is worth trying to disable the chunked encoding at your library level, for example with Axis2 check the HTTP Transport.
When reusing a connection, check your client code to make sure that you aren't sending a request before you read all of the previous response (to avoid overlapping).
Further reading
RFC 2616 3.6.1 Chunked Transfer Coding
It turned out that the "sjsxp" library which JAX-WS RI v2.1.3 uses makes Tomcat behave this way. I tried a different version of JAX-WS RI (v2.1.7) which doesn't use the "sjsxp" library anymore and it solved the issue.
A very similar issue posted on Metro mailing list: http://metro.1045641.n5.nabble.com/JAX-WS-RI-2-1-5-returning-malformed-response-tp1063518.html

Load balancing TCP traffic using Apache Camel with Netty leads to transaction failures

I am new to Apache Camel and Netty and this is my first project. I am trying to use Camel with the Netty component to load balance heavy traffic in a back-end load test scenario. This is the setup I have right now:
from("netty:tcp://this-ip:9445?defaultCodec=false&sync=true").loadBalance().roundRobin().to("netty:tcp://backend1:9445?defaultCodec=false&sync=true", "netty:tcp://backend2:9445?defaultCodec=false&sync=true")
The issue is unexpected buffer sizes that I am receiving in the response that I see in the client system sending tcp traffic to Camel. When I send multiple requests one after the other I see no issues and the buffer size is as expected. But, when I try running multiple users sending similar requests to Camel on the same port, I intermittently see unexpected buffer sizes, sometimes 0 bytes to sometimes even greater than the expected number of bytes. I tried playing around with multiple options mentioned in the Camel-Netty page like:
Increasing backlog
keepAlive
buffersizes
timeouts
poolSizes
workerCount
synchronous
stream caching (did not work)
disabled useOriginalMessage for performance
System level TCP parameters, etc. among others.
I have yet to resolve the issue, and I am not sure if I'm fundamentally missing something. I did take a look at the encoders/decoders and wondered whether they could be the issue, but I don't understand why a load balancer needs to encode/decode messages. I have worked with other load balancers which just require endpoint configurations, and hence I am assuming that Camel does not require this. Am I right? Please note that the issue is not with my client/backend: I ran a 2000-user load test from my client directly to the backend with less than 1% failures, but I see a large number of failures (though there are some successes) with Camel. I have the following questions:
1. Is this a valid use case for Apache Camel with Netty? Should I be looking at Mina or others?
2. Can I try to route TCP traffic to JMS or other components and then finally to the TCP endpoint?
3. Do I need encoders/decoders or should this configuration work?
4. Should I continue with this approach or try some other load balancer?
Please let me know if you have any other suggestions. TIA.
Edit1:
I also tried the same approach with netty4 and mina components. The route looks similar to the one in netty. The route with netty4 is as follows:
from("netty4:tcp://this-ip:9445?defaultCodec=false&sync=true").to("netty4:tcp://backend1:9445?defaultCodec=false&sync=true")
I read a few posts which had the same issue but did not find any solution relevant to my issue.
Edit2:
I increased the receive timeout at my client and immediately saw the rate of buffer-length mismatches fall to less than 1%. However, the difference in response time per transaction between using Camel and not using it is huge: almost 10 times higher with Camel. Can you help me reduce the response time for each transaction? The message received back at my client varies from 5000 to 20000 bytes. Here is my latest route:
from("netty:tcp://this-ip:9445?sync=true&allowDefaultCodec=false&workerCount=20&requestTimeout=30000")
.threads(20)
.loadBalance()
.roundRobin()
.to("netty:tcp://backend-1:9445?sync=true&allowDefaultCodec=false","netty:tcp://backend-2:9445?sync=true&allowDefaultCodec=false")
I also used certain performance enhancements like:
context.setAllowUseOriginalMessage(false);
context.disableJMX();
context.setMessageHistory(false);
context.setLazyLoadTypeConverters(true);
Can you point me in the right direction about how I can reduce the individual transaction times?
For netty4 component there is no parameter called defaultCodec. It is called allowDefaultCodec. http://camel.apache.org/netty4.html
Also, try something like this first.
from("netty4:tcp://this-ip:9445?textline=true&sync=true").to("netty4:tcp://backend1:9445?textline=true&sync=true")
The above means the data being sent is normal text. If you are sending bytes or something else, you will need to provide decoding/encoding for netty to handle the data.
And a side note. Before running the Camel route, test manually to send test messages via a standard tcp tool like sockettest to verify that everything works. Then implement the same via Camel. You can find sockettest here http://sockettest.sourceforge.net/ .
I finally solved the issue with the same route settings as above. The problem was that the request and response delimiters were not configured properly. As a result, the connection was either closed too early, leading to unexpected buffer sizes, or kept waiting too long even after the entire buffer had been received, leading to high response times.
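For anyone wondering why the delimiter matters: TCP is a byte stream, so something has to decide where one message ends and the next begins. Here is a minimal Python sketch of delimiter-based framing. It is not Camel or Netty code, and the DelimiterFramer class and the newline delimiter are assumptions made up for illustration; the actual delimiter depends on your protocol.

```python
class DelimiterFramer:
    """Reassemble delimiter-terminated messages from arbitrary TCP reads."""

    def __init__(self, delimiter: bytes):
        if not delimiter:
            raise ValueError("delimiter must not be empty")
        self._delimiter = delimiter
        self._buffer = bytearray()

    def feed(self, data: bytes):
        """Add newly received bytes and return every complete message.

        Bytes after the last delimiter stay buffered until more data arrives.
        """
        self._buffer += data
        messages = []
        while True:
            index = self._buffer.find(self._delimiter)
            if index == -1:
                break  # no complete message yet, keep waiting
            messages.append(bytes(self._buffer[:index]))
            del self._buffer[:index + len(self._delimiter)]
        return messages


# One message can arrive split across reads, and one read can hold several.
framer = DelimiterFramer(b"\n")  # newline delimiter is an assumption
first = framer.feed(b"abc")
second = framer.feed(b"def\nghi\nj")
third = framer.feed(b"\n")
```

With the wrong delimiter, a receiver either hands over a message before it is complete or keeps waiting for a terminator that never comes, which would match both symptoms described above.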

What is HTTP Status code 000?

Just switched some downloads over to the Akamai CDN network and I'm seeing some strange stuff in the log files they deliver. A number of entries have the status code 000. When I asked them they said that 000 is the status when the client disconnects without transferring the entire file. Since 000 doesn't appear to be a valid HTTP response code (from the RFC), I have to wonder if that's right.
There's a knowledge base article (requires login) which lists their log values:
Log Delivery Services (LDS) LDS will show a 000 for any 200 or 206
responses with a client abort: the object was served correctly from
the origin or edge, but the end-user terminated the
connection/transaction before it completed.
This is indeed a custom status because the standard log format doesn't include a field which can indicate a client abort.
000 is a common code to use when no HTTP code was received due to a network error. According to a knowledge base article for Amazon CloudFront, 000 also means that the client disconnected before completing the request for that service.
It normally means: No valid HTTP response code
(ie: Connection failed, or was aborted before any data happened).
I would guess that there are either network issues or Akamai isn't managing their web servers correctly.

NGINX Reverse Proxy : Many html status code 400 responses, why?

We have recently implemented a nginx based reverse proxy.
While debugging our access logs, we are seeing quite a few status code 400 results.
They look something like this:
[07/Sep/2011:05:49:04 -0700] - "400" 0 "-" "-" "-"
We have enabled debug error logging, and they usually correspond to something like this:
2011/09/07 05:09:28 [info] 5937#0: *30904 client closed prematurely connection while reading client request line
We have tried raising a number of the buffers, as mentioned by a few pages we were able to google up.
http://www.ruby-forum.com/topic/173362
or
http://blog.craz8.com/articles/2009/06/17/nginx-400-bad-request-errors-due-to-cookies-and-what-to-do-about-them
To no avail.
Why is this happening?
This is a standard nginx reverse proxy -> apache backend server.
Worth mentioning, the unique type of content on our site is fairly minimal. We have tested this using many browsers and are not personally receiving any of these 400 results.
Thanks!
Further urls detailing similar entries in their logs:
http://blog.rayfoo.info/2009/10/weird-web-server-access-log-entries
I found this was caused by using Chrome, which apparently opens extra connections occasionally without sending any data.
Here's some more info: http://www.ruby-forum.com/topic/2953545
Now the question is what to do about them - the answer provided there wasn't very satisfying.
Are you handling SSL connections? Can you add $ssl_cipher $ssl_protocol to your access log format?
First, it's quite possible that your clients send requests with really big HTTP headers or URLs. Maybe an older version of your application set some (probably big) cookies which are unused now, and some clients are still trying to send them back.
I'd set the header buffers to a really big value and on the application side log the size of the headers/requests and the complete request if they are bigger than usual. Or completely take out the nginx from the chain and log the header/request with the same conditions. If you can, take out the nginx for only those IPs/subnets where the 400 errors came from. I suppose nginx can log the source IP for these 400 errors.
