Nginx not failing over when used for load balancing

We're using Nginx to load balance between two upstream app servers and we'd like to be able to take one or the other down when we deploy to it. We're finding that when we shut one down, Nginx is not failing over to the other. It keeps sending requests and logging errors.
Our upstream directive has the form:
upstream app_servers {
    server 10.100.100.100:8080;
    server 10.100.100.200:8080;
}
Our understanding from reading the Nginx docs is that we don't need to explicitly specify "max_fails" or "fail_timeout" because they have reasonable defaults (i.e. a max_fails of 1).
Any idea what we might be missing here?
Thanks much.

As per the documentation...
max_fails = NUMBER - number of unsuccessful attempts at communicating
with the server within the time period (assigned by parameter
fail_timeout) after which it is considered inoperative. If not set,
the number of attempts is one. A value of 0 turns off this check. What
is considered a failure is defined by proxy_next_upstream or
fastcgi_next_upstream (except http_404 errors which do not count
towards max_fails).
As per the documentation, a failure is defined by proxy_next_upstream or fastcgi_next_upstream.
It keeps sending requests and logging errors.
Please check the log to see what type of errors are being logged; if they are not of the default types (error or timeout), then you may want to list them explicitly in proxy_next_upstream or fastcgi_next_upstream.
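For example, if your deploy leaves the port open and the app answers with 502/503 while shutting down, those responses do not count as failures by default. A hedged sketch, reusing the addresses from the question and spelling out the defaults (values are illustrative, not a recommendation):

upstream app_servers {
    # max_fails=1 and fail_timeout=10s are already the defaults; shown here for clarity
    server 10.100.100.100:8080 max_fails=1 fail_timeout=10s;
    server 10.100.100.200:8080 max_fails=1 fail_timeout=10s;
}

server {
    location / {
        proxy_pass http://app_servers;
        # count 502/503/504 responses as failures too, and retry the other server
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}

With max_fails left at 1, a single refused connection during a deploy should already take that server out of rotation for fail_timeout seconds, so the interesting part is usually which response types count as failures.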

Related

NGINX routing based on server 200 response failures

My goal is to configure nginx's stream object(s) in the config to route requests to a backup upstream in the event that one fails on certain health checks (2/3)
The health checks, while somewhat specific, shouldn't (I believe) be an issue:
- TCP 1212 availability
- TCP 1912 availability
- HTTP GET on 7078 /?
- Response should be 200, and if I can somehow get the body to check that it's as expected, even better!
If these checks fail on one upstream "cluster", so to speak, I would like to route requests to another, identical cluster, much like a backup.
The issue I'm solving lies in the fact that the servers are quite literally half a world apart and so load balancing through one server would cause the same latency as if you waited for it to fail. So while a load balancer would have "routing" behavior in the end, the response time would be unacceptable.
Is there a way to do this in NGINX configs or am I spreading it too thin?
The NGINX upstream module will do passive health checks for you, meaning it will react to connection failures, and optionally switch to backup servers as necessary. To some extent, that might be enough for you.
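For the passive checks plus a designated fallback, a minimal sketch (the hostnames are placeholders; only the 7078 port is taken from your question):

upstream app_cluster {
    # primary cluster: taken out of rotation after 3 failed connections within 30s
    server primary-cluster.example.com:7078 max_fails=3 fail_timeout=30s;
    # the backup only receives traffic while every non-backup server is marked down
    server backup-cluster.example.com:7078 backup;
}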
What you're describing here though are active health checks that let you check different ports from the traffic port, assert HTTP status, header values and even body content. Unfortunately, having dangled that in front of you, these are only available as part of the NGINX Commercial Subscription, which I'm guessing isn't what you're looking for.
If you do need that kind of proactive health check, you can still do it from outside of NGINX. One approach might be:
- put your upstreams in separate confs, and include one of them where you need it
- use ncat and/or curl in an every-minute cron job to run the tests that matter to you
- if those tests ever fail, switch out the upstream confs and tell NGINX to do a zero-downtime reload
You can switch confs with a quick mv to rename the right one to match the include; you shouldn't have to rewrite anything.
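A hedged sketch of the include-based switching (all paths and file names below are made up for illustration):

# nginx.conf: include whichever upstream definition is currently active.
# cluster-a.conf and cluster-b.conf each define "upstream app_cluster { ... }";
# the cron job mv's the healthy one over active.conf and runs "nginx -s reload".
http {
    include /etc/nginx/upstreams/active.conf;

    server {
        listen 80;
        location / {
            proxy_pass http://app_cluster;
        }
    }
}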

Nginx High Open TCP connections

My web page returned error 500 and went down. Checking my nginx metrics in GCP, I noticed:
- Too many open TCP connections
- Too many accepted and handled connections
- Too many writing connections
- A normal number of requests per minute for each distinct IP in access.log (compared with other days and months)
=> The drop in the graphs is because I restarted the server.
So, according to these metrics, I don't see any relation between the number of connections (TCP, accepted, handled and writing) and the requests in access.log.
Also, is this number of open TCP connections normal? I don't think so.
I'd appreciate your opinions and possible reasons why this happened.
A 500 is a server-side error that generally occurs when the server is unable to process the request. The web server throws a 500 Internal Server Error when it encounters an unexpected condition which prevents it from fulfilling the client's request.
The probable causes of a 500 error include:
Permission errors
Incorrect code in an .htaccess file
PHP timeout
Syntax or code errors in CGI/Perl scripts
PHP memory limit
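If it helps to narrow things down, the nginx error log usually names the exact upstream failure behind a 500. A hedged sketch, assuming a PHP-FPM backend (the socket path and timeout are assumptions, not taken from your setup):

error_log /var/log/nginx/error.log warn;        # upstream errors behind the 500s show up here

location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php-fpm.sock;    # assumed socket path
    fastcgi_read_timeout 120s;                  # give slow PHP requests more time before a timeout
}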

Add delay between retries when origin is down (Nginx, 502)

Pretty much the title. Got a Node app behind nginx, and when I restart the app I would like nginx to delay the response and retry the request a couple of times, with some delay in between. Everything that I found would only instantly retry N times, but that obviously is not useful when the app is down for a restart, which is my use case. Is there some way? I don't even care how hacky it is; I just need a solution that is not starting a second instance of the app and killing the first one once the second has started.
Thanks!
You can add the same server as multiple upstream entries and configure the proxy_next_upstream, proxy_next_upstream_timeout and proxy_next_upstream_tries options as well (see the ngx_http_proxy_module reference).
upstream node_servers {
    server 127.0.0.1:12005;
    server 127.0.0.1:12005;
}
...
# inside the server/location block that proxies to node_servers:
proxy_next_upstream error timeout http_502;   # also retry on refused connections, not just 502s
proxy_next_upstream_timeout 60s;
proxy_next_upstream_tries 3;
However, I would recommend using a process manager like pm2, which supports graceful reload/restart. It is especially relevant if you are running your Node.js server on more than one CPU in clustered mode.

Nginx: Call all upstreams at once

How could I call all upstreams at once and return the result from the first one that responds with something other than a 404?
Example:
Call to load balancer at "serverX.org/some-resource.png" creates two requests to:
srv1.serverX.org/some-resource.png
srv2.serverX.org/some-resource.png
srv2 responds faster and the response is shown to the user.
Is this possible at all? :)
Thanks!
Short answer: NO. You can't do exactly what you described with nginx. Come to think of it, this operation can't really be called load balancing, since the whole back-end gets the total amount of traffic.
A good question is what you think you could accomplish with that. Better performance?
You can be sure that you will have better results with simple load balancing between your servers, since each will have to handle only half of the traffic.
In case you have a more complex architecture, i.e. different loads from different paths to your backend servers, we could discuss a more sophisticated load-balancing method.
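That is, plain round-robin between the two hosts from your example, so each handles roughly half of the requests:

upstream all_servers {
    # nginx alternates requests between the two by default (round-robin)
    server srv1.serverX.org;
    server srv2.serverX.org;
}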
So if your purpose is something other than performance, there are some things that you can do:
1) After you have sent the request to the first server, you can send it to another one using post_action.
location ~* \.png$ {
    proxy_pass http://srv1.serverX.org;
    ...
    post_action @mirror_to_srv2;
    ...
}
location @mirror_to_srv2 {
    proxy_ignore_client_abort on;
    ...
    proxy_pass http://srv2.serverX.org;
}
2) The request is available to you in nginx as a variable, so with some Lua scripting you can send it wherever you want.
Note that the above methods are not useful for tackling performance issues, but they let you do things like mirroring live traffic to dev servers for test/debug purposes.
Lastly, this one seems to provide the functionality you want, but remember that it isn't built for the use that you seem to have in mind.
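For what it's worth, if you are on nginx 1.13.4 or newer, the built-in ngx_http_mirror_module does the same kind of traffic copying without post_action; a minimal sketch reusing the hostnames above:

location / {
    proxy_pass http://srv1.serverX.org;
    mirror /_mirror;                            # every request is also replayed to the mirror location
}

location = /_mirror {
    internal;
    proxy_pass http://srv2.serverX.org$request_uri;
}

As with post_action, the mirrored response is discarded, so this still won't return whichever backend answers first.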

HTTP 504 timeout after exactly 120 seconds

I have a server application which runs in the Amazon EC2 cloud. From my client (the browser) I make an HTTP request which uploads a file to the server, which then processes the file. If there is a lot of processing (large file), the server always times out with a 504 backend continuation error, exactly after 120 seconds. Though I get this error, the server continues to process the request and completes it (verified by checking the database), but I cannot see the final result on my client because of the timeout.
I am clueless as to why this is happening. Has anyone faced a similar 504 timeout? Is there some intermediate proxy server not under my control which is timing out?
I have a similar problem and in my case I believe it is due to the connection between the Elastic Load Balancer (ELB) and the EC2 instance.
For a long-term solution I will go with the 303 Status response + back-end processing suggested by james.garriss above.
For short-term solution it may be possible for Amazon support to increase the ELB timeout (see their response in https://forums.aws.amazon.com/thread.jspa?messageID=491594&#491594). Unfortunately there doesn't seem to be any way to change the timeout yourself through either API or console.
[Update] AWS now does allow you to update the idle timeout either through console, CLI or .ebextensions configuration. See http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/config-idle-timeout.html (thanks @Daniel Patz for the update)
Assuming that the correct status code is being returned, the problem is that an intermediate proxy is timing out. "The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI." (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.5) It most likely indicates that the origin server is having some sort of issue (i.e., taking a long time to process your request), so it's not responding quickly.
Perhaps the best solution is to re-craft your server app so that it responds with a "303 See Other" status code; then your client can retrieve the result at a later point, once the server is done processing and has created the final result.
Edit: Another idea is to re-craft your server app so that it responds with a "413 Request Entity Too Large" status code when the request entity size is too large. This will get rid of the error, though it may make your app less useful if it can only process "small" files.
Other possible solutions:
Increase the timeout value of the proxy if it's under your control (for nginx, see the sketch after this list)
Make your request to a different server (if there's another, faster server with the same app)
Make your request differently (if possible) such that you are sending less data at a time
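If the intermediate proxy happens to be nginx rather than (or in addition to) the ELB, a hedged sketch of raising its timeouts (the upstream name and the 300s value are arbitrary examples, not recommendations):

location / {
    proxy_pass http://backend;      # assumed upstream name
    # allow the backend up to 5 minutes to produce and receive a response
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}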
It is also possible that the browser times out during the script execution.
