Trying to configure Nginx for two purposes:
Reverse proxy that forwards requests to a local Tomcat server (port 443 to port 10443, where Tomcat listens)
Mirror requests to a backend server for analysis purposes
Since we encountered very low performance with the default configuration and the mirror directive, we decided to try the reverse proxy alone to check whether there is an impact on the server, and indeed it seems Nginx caps the traffic by almost half (we are using Locust and JMeter as load tools).
Nginx version: 1.19.4
Worked through 10-tips-for-10x-application-performance & Tuning NGINX for Performance, to no avail.
The machine Nginx & Tomcat run on should be strong enough (EC2 c5.4xlarge), and we don't see a lack of resources, but rather what looks like network capping: a very high count of TIME_WAIT connections (20k-40k).
From the machine perspective:
Increased the local port range (1024 65300)
Lowered tcp_fin_timeout (to 15; the value is in seconds)
Increased the max file descriptor limit to the maximum
Nginx perspective (full nginx.conf below):
keepalive_requests 100000;
keepalive_timeout 1000;
worker_processes 10 (16 is cpu count)
worker_connections 3000;
worker_rlimit_nofile 100000;
nginx.conf:
user nginx;
worker_processes 10;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
worker_rlimit_nofile 100000;
events {
    worker_connections 3000;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    log_format main_ext '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for" '
                        '"$host" sn="$server_name" '
                        'rt=$request_time '
                        'ua="$upstream_addr" us="$upstream_status" '
                        'ut="$upstream_response_time" ul="$upstream_response_length" '
                        'cs=$upstream_cache_status';

    keepalive_requests 100000;
    keepalive_timeout 1000;
    ssl_session_cache shared:SSL:10m;

    sendfile on;
    #tcp_nopush on;
    #gzip on;

    include /etc/nginx/conf.d/*.conf;

    upstream local_host {
        server 127.0.0.1:10443;
        keepalive 128;
    }

    server {
        listen 443 ssl;
        ssl_certificate /etc/ssl/nginx/crt.pem;
        ssl_certificate_key /etc/ssl/nginx/key.pem;

        location / {
            # mirror /mirror;
            proxy_set_header Host $host;
            proxy_pass https://local_host$request_uri;
        }

        # Mirror configuration
        location = /mirror {
            internal;
            proxy_set_header Host test-backend-dns;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_connect_timeout 3s;
            proxy_read_timeout 100ms;
            proxy_send_timeout 100s;
            proxy_pass https://test-backend-ip:443$request_uri;
        }
    }
}
We also monitor using the Amplify agent; the connection count matches the expected number of requests and connections, but the actual request count is low.
Amplify monitor output
Seems like a simple task for Nginx, but something is misconfigured.
Thank you for your answers
After many attempts at figuring things out, we came to the conclusion that the application's response time was higher with Nginx in front.
Our assumption, and how we eventually overcame this issue, was SSL termination.
This is an expensive operation, in terms of both resources and time.
What we did was make Nginx (which is more than capable of handling much higher load than what we hit it with, ~4k RPS) responsible solely for SSL termination, and we changed the Tomcat app configuration so that it listens for HTTP requests rather than HTTPS.
This dramatically reduced the TIME_WAIT connections that were piling up and taking important resources from the server.
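For illustration, a rough sketch of the resulting setup (the certificate paths are from our config above and port 8080 comes from the Tomcat connector below; the upstream name is made up):

upstream tomcat_http {
    server 127.0.0.1:8080;
    keepalive 128;
}

server {
    listen 443 ssl;
    ssl_certificate /etc/ssl/nginx/crt.pem;
    ssl_certificate_key /etc/ssl/nginx/key.pem;

    location / {
        # upstream keepalive requires HTTP/1.1 and a cleared Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        # plain HTTP to Tomcat; Nginx alone terminates SSL
        proxy_pass http://tomcat_http;
    }
}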
Final configurations for nginx, tomcat & the kernel:
Linux machine configuration:
- /proc/sys/net/ipv4/ip_local_port_range - set to 1024 65535
(allows more ports, hence more connections)
- sysctl net.ipv4.tcp_timestamps=1
("..reduce performance spikes related to timestamp generation..")
- sysctl net.ipv4.tcp_tw_recycle=0
(This worked for us. Should be tested with/without tcp_tw_reuse)
- sysctl net.ipv4.tcp_tw_reuse=1
(Same as tw_recycle)
- sysctl net.ipv4.tcp_max_tw_buckets=10000
(self-explanatory)
Redhat explanation for tcp_timeouts conf
Tomcat configuration:
<Executor name="tomcatThreadPool" namePrefix="catalina-exec-"
maxThreads="4000"
minSpareThreads="10"
/>
<!-- A "Connector" using the shared thread pool - NO SSL -->
<Connector executor="tomcatThreadPool"
port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
acceptCount="5000"
pollerThreadCount="16"
acceptorThreadCount="16"
redirectPort="8443"
/>
Nginx-specific performance params configuration (assembled into a short sketch after this list):
main directive:
- worker_processes auto;
- worker_rlimit_nofile 100000;
events directive:
- worker_connections 10000; (we think this can be lower)
- multi_accept on;
http directive:
- keepalive_requests 10000;
- keepalive_timeout 10s;
- access_log off;
- ssl_session_cache shared:SSL:10m;
- ssl_session_timeout 10m;
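Putting those together, a minimal sketch of where each directive lives (values exactly as listed above; server/upstream blocks omitted):

worker_processes auto;
worker_rlimit_nofile 100000;

events {
    worker_connections 10000;
    multi_accept on;
}

http {
    keepalive_requests 10000;
    keepalive_timeout 10s;
    access_log off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;

    # ... upstream and server blocks as shown earlier
}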
It really helps to understand both sides of the equation: Nginx and Tomcat.
We used JMX metrics to understand what is going on in Tomcat, alongside Prometheus metrics from our app.
And the Amplify agent to monitor Nginx behavior.
Hope that helps anyone.
Related
I am using Nginx for a failover cluster, where a service is defined as an upstream.
The service runs on both nodes simultaneously, and Nginx is set up on a third node which forwards requests to the other nodes.
When we switch off the service on one node, Nginx successfully switches to the backup node, but when the service on the 1st node is turned back on, it switches back to the 1st node.
Below is the relevant part of the configuration in Nginx:
http {
    include mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"'
                    '"upstream: $upstream_addr"'
                    '"status: $upstream_status"';

    access_log logs\access.log main;

    sendfile on;
    #tcp_nopush on;

    #keepalive_timeout 0;
    keepalive_timeout 65;

    #gzip on;

    upstream backend {
        server ip1:port1 fail_timeout=5s max_fails=1;
        server ip2:port2 backup;
    }

    server {
        listen 8000;
        server_name ip3;

        ssl_certificate ...\certificate.crt;
        ssl_certificate_key ...\privateKey.key;
        ssl_session_cache shared:SSL:1m;
        ssl_session_timeout 5m;
        ssl_ciphers HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;

        location / {
            proxy_pass https://backend;
        }
How can I stop Nginx from switching back to the primary node once the service is up again?
Thanks.
This is the designed behaviour. If NGINX is balancing between two nodes and one of them is marked as backup, NGINX will only switch to that node in case the primary node is down.
As soon as the primary node is back up and healthy, NGINX will always switch back to the primary node.
Read more about it here: https://nginx.org/en/docs/http/ngx_http_upstream_module.html#upstream
If you are looking for something like Blue / Green deployments there are other concepts to do it.
One that works with NGINX Open Source is available here.
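As a rough illustration of one such concept with NGINX Open Source (the blue/green upstream names are made up; ip1/ip2 are the question's servers, and this is not necessarily the method from the link above): keep both servers in separate upstreams and flip a variable, then reload, to move traffic:

upstream blue  { server ip1:port1; }
upstream green { server ip2:port2; }

server {
    listen 8000;

    location / {
        set $active blue;            # change to green and run: nginx -s reload
        proxy_pass https://$active;  # the variable resolves to the matching upstream group
    }
}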
Write a bash script to change the primary node to the backup node once the primary node is back up. Do let me know if you need the bash code.
We are connecting to a system where 4 ports are exposed to serve gRPC requests. We used Nginx as a load balancer to forward the 4 clients' gRPC requests with the configuration below:
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
    worker_connections 768;
    # multi_accept on;
}

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent"';

    map $http_upgrade $connection_upgrade {
        default upgrade;
        '' close;
    }

    upstream backend {
        #least_conn;
        server localhost:9000 weight=1 max_conns=1;
        server localhost:9001 weight=1 max_conns=1;
        server localhost:9002 weight=1 max_conns=1;
        server localhost:9003 weight=1 max_conns=1;
    }

    server {
        listen 80 http2;

        access_log /tmp/access.log main;
        error_log /tmp/error.log error;

        proxy_buffering off;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Scheme $scheme;
        proxy_set_header Host $http_host;

        location / {
            #keepalive_timeout 0;
            grpc_pass grpc://backend;
            grpc_pass_header userid;
            grpc_pass_header transid;
        }
    }
}
We observed that sometimes all 4 client requests go to all 4 ports, but sometimes (say 30% of the time) they go to only 2 or 3 ports. It seems the default round-robin is not happening in NGINX as expected. We tried all the options, like max_conns, least_conn and weight, but no luck.
It seems I have encountered the same issue as in the links below:
https://serverfault.com/questions/895116/nginx-round-robin-nor-exactly-round-robin
https://stackoverflow.com/questions/40859396/how-to-test-load-balancing-in-nginx
When I was going through Quora I found that the "fair" module in nginx would resolve this:
"The Nginx fair proxy balancer enhances the standard round-robin load
balancer provided with Nginx so that it will track busy back end servers (e.g. Thin, Ebb, Mongrel) and balance the load to non-busy server processes. "
https://www.quora.com/What-is-the-best-way-to-get-Nginx-to-do-smart-load-balancing
I tried building the "fair" module into NGINX from source but encountered so many issues that I could not even start NGINX. Can anyone help with this issue?
We got the answer! We just changed "worker_processes auto;" to "worker_processes 1;" and now it is working fine.
All the requests are load balanced properly. Our feeling is that with more than a single worker, multiple workers may send requests to the same port.
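A possible explanation (an assumption, based on the documented upstream zone directive): without a shared memory zone, each worker process keeps its own run-time balancing state, so with only a few requests spread across workers the per-worker round-robin can land on the same ports. An untested alternative to dropping to a single worker would be to share that state via a zone, roughly (weight/max_conns omitted):

upstream backend {
    zone grpc_backend 64k;   # share the group's run-time state between workers
    server localhost:9000;
    server localhost:9001;
    server localhost:9002;
    server localhost:9003;
}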
I don't know why exactly this is happening but it may have something to do with the browser.
I encountered the same problem when I was using the browser to send the requests. When I sent the requests from the terminal using curl it was working fine.
We are trying to build an HA Kubernetes cluster with 3 core nodes, each having the full set of vital components: etcd + APIServer + Scheduler + ControllerManager, plus an external balancer. Since etcd can form a cluster by itself, we are stuck on making the APIServers HA. What seemed an obvious task a couple of weeks ago has now become a "no way" disaster...
We decided to use nginx as a balancer for the 3 independent APIServers. All the other parts of our cluster that communicate with the APIServer (kubelets, kube-proxies, Schedulers, ControllerManagers...) are supposed to use the balancer to access it. Everything went well until we started the "destructive" tests (as I call them) with some pods running.
Here is the part of the APIServer config that deals with HA:
.. --apiserver-count=3 --endpoint-reconciler-type=lease ..
Here is our nginx.conf:
user nginx;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
worker_processes auto;
events {
    multi_accept on;
    use epoll;
    worker_connections 4096;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;
    tcp_nodelay on;

    keepalive_timeout 65;
    types_hash_max_size 2048;
    gzip on;
    underscores_in_headers on;

    include /etc/nginx/conf.d/*.conf;
}
And apiservers.conf:
upstream apiserver_https {
    least_conn;
    server core1.sbcloud:6443; # max_fails=3 fail_timeout=3s;
    server core2.sbcloud:6443; # max_fails=3 fail_timeout=3s;
    server core3.sbcloud:6443; # max_fails=3 fail_timeout=3s;
}

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen 6443 ssl so_keepalive=1m:10s:3; # http2;

    ssl_certificate "/etc/nginx/certs/server.crt";
    ssl_certificate_key "/etc/nginx/certs/server.key";

    expires -1;
    proxy_cache off;
    proxy_buffering off;
    proxy_http_version 1.1;
    proxy_connect_timeout 3s;
    proxy_next_upstream error timeout invalid_header http_502; # non_idempotent # http_500 http_503 http_504;
    #proxy_next_upstream_tries 3;
    #proxy_next_upstream_timeout 3s;
    proxy_send_timeout 30m;
    proxy_read_timeout 30m;
    reset_timedout_connection on;

    location / {
        proxy_pass https://apiserver_https;
        add_header Cache-Control "no-cache";
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $http_host;
        proxy_set_header Authorization $http_authorization;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-SSL-CLIENT-CERT $ssl_client_cert;
    }
}
What came out after some tests is that Kubernetes seems to use a single long-living connection instead of traditional open-close sessions. This is probably due to SSL. So we had to increase proxy_send_timeout and proxy_read_timeout to a ridiculous 30m (the default value for the APIServer is 1800s). If these settings are under 10m, then all clients (like the Scheduler and ControllerManager) generate tons of INTERNAL_ERROR messages because of broken streams.
So, for the crash test I simply took one of the APIServers down by gently switching it off. Then I restarted another one, so nginx sees that the upstream went down and switches all current connections to the last one. A couple of seconds later the restarted APIServer comes back and we have 2 APIServers working. Then I took the network down on the third APIServer by running 'systemctl stop network' on that server, so it has no chance to inform Kubernetes or nginx that it is going down.
Now the cluster is totally broken! nginx seems to recognize that the upstream went down, but it will not reset the already existing connections to the upstream that is dead. I can still see them with 'ss -tnp'. If I restart the Kubernetes services, they reconnect and continue to work; the same happens if I restart nginx - new sockets show up in the ss output.
This happens only if I make the APIServer unavailable by taking the network down (preventing it from closing existing connections to nginx and from informing Kubernetes that it is switching off). If I just stop it, everything works like a charm. But this is not a realistic case: a server can go down without any warning, just instantly.
What are we doing wrong? Is there a way to force nginx to drop all connections to an upstream that went down? Anything to try before we move to HAProxy or LVS and write off a week of kicking nginx in our attempts to make it balance instead of breaking our not-so-HA cluster?
Problem Statement:
I want to set up active failover using nginx plus (I subscribed to the 30-day trial).
All requests should go to the primary server; only if that goes down (404) should the requests go to the second server. Once the primary is up, the requests should go back to the original server. Is that possible?
With the help of other threads I was able to create the following config file. I tried proxy_next_upstream with almost all the error codes I could find, but I am still not able to achieve the intended results.
I brought down the primary server manually so that it returns 404 (it briefly returns 503 while going down), but still no luck with redirecting the traffic.
Both the servers are hosted on IBM Bluemix as nodejs apps. I can share more details if needed.
upstream up1 {
    server up_server1;
}

upstream up2 {
    server up_server2;
}

server {
    listen 80;

    location / {
        proxy_pass http://up1;
        proxy_next_upstream non_idempotent invalid_header error timeout http_500 http_502 http_504 http_403 http_404;
    }
}
This is governed by another config file, which looks like the following. Just to give more info:
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log notice;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}

http {
    # geoip_city /etc/nginx/geoip/GeoLiteCity.dat;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/*.conf;
}
# TCP/UDP proxy and load balancing block
#
#stream {
# Example configuration for TCP load balancing
#upstream stream_backend {
# zone tcp_servers 64k;
# server backend1.example.com:12345;
# server backend2.example.com:12345;
#}
#server {
# listen 12345;
# status_zone tcp_server;
# proxy_pass stream_backend;
#}
#}
Your problem is that you do not use the upstream up2 in any way. proxy_next_upstream means the next server within your current upstream - which is the one referenced by proxy_pass http://up1 - it does not magically pick any other upstream.
So what you want is to delete upstream up2 and only leave:
upstream up1 {
    server up_server1;
    server up_server2;
}
I was able to resolve this issue with a small tweak, which was missing in almost all the answers posted on Stack Overflow. When you define 2 different upstreams for use with proxy_next_upstream, you need to add the second server as an additional entry in your original upstream directive. Check out the code below.
upstream up1 {
    server up_server1;
    server up_server2; # this entry is important, but not sure why!
}

upstream up2 {
    server up_server2;
}
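The extra entry most likely matters for the reason given in the previous answer: proxy_next_upstream only retries servers inside the upstream group that proxy_pass actually references, so up2 on its own is never consulted. If the intent is strict primary/failover behaviour, the documented backup flag (as used in the failover thread above) is another option; a rough sketch, keeping the question's server names and error codes:

upstream up1 {
    server up_server1 max_fails=1 fail_timeout=5s;
    server up_server2 backup;   # only receives traffic while up_server1 is considered unavailable
}

server {
    listen 80;

    location / {
        proxy_pass http://up1;
        proxy_next_upstream non_idempotent invalid_header error timeout http_500 http_502 http_504 http_403 http_404;
    }
}

Once up_server1 passes its fail_timeout and responds successfully again, traffic returns to it automatically.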
I want to ask about the nginx web server: when the site receives a lot of traffic, the server goes down and returns error code 502/504. I use Varnish 4 on the web server on port 8000. The physical server has the following specifications:
8 CPU Cores
16GB RAM
The nginx configuration is as follows:
user nginx;
worker_processes 8;
error_log /var/log/nginx/error.log;
#error_log /var/log/nginx/error.log notice;
#error_log /var/log/nginx/error.log info;
pid /run/nginx.pid;
events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    fastcgi_buffers 256 16k;
    fastcgi_buffer_size 256k;
    fastcgi_connect_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_read_timeout 300;
    fastcgi_max_temp_file_size 0;

    ......
}
While the php-fpm configuration is as follows:
pm = dynamic
pm.max_children = 246
pm.start_servers = 32
pm.min_spare_servers = 32
pm.max_spare_servers = 64
Please help me, I am confused. I followed some of the recommendations that I found from several sources, but still failed. Thanks.
Regards,
Janitra Panji
The HTTP status codes you are asking about are all defined on the related Wikipedia page
502 Bad Gateway
The server was acting as a gateway or proxy and received an invalid response from the upstream server.
503 Service Unavailable
The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.
504 Gateway Timeout
The server was acting as a gateway or proxy and did not receive a timely response from the upstream server.
They all indicate that the service behind your Nginx reverse proxy is down for one reason or another. You should study the tuning of your backend server; the issue is quite possibly there, and not with Nginx.
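As a first step in that investigation (a hedged suggestion, not a fix), you can make nginx log what the upstream is doing, which shows whether PHP-FPM/Varnish is timing out or refusing connections; the variables below are standard nginx ones, similar to the main_ext format in the first thread:

log_format upstream_debug '$remote_addr [$time_local] "$request" $status '
                          'us=$upstream_status '
                          'uct=$upstream_connect_time '
                          'urt=$upstream_response_time '
                          'rt=$request_time';

access_log /var/log/nginx/upstream_debug.log upstream_debug;

A 502 with an immediately failed connect usually points at the backend being down or out of workers (pm.max_children exhausted), while a 504 with an upstream response time close to your fastcgi_read_timeout of 300s points at requests that are simply too slow.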