I work for a rather busy internet site that often gets very large spikes of traffic. During these spikes, hundreds of pages per second are requested, and this produces random 502 gateway errors.
We run Nginx (1.0.10) and PHP-FPM on a machine with 4x 15k SAS drives (RAID 10), a 16-core CPU, and 24GB of DDR3 RAM. We also use the latest Xcache version. The DB is located on another machine, but its load is very low and it has no issues.
Under normal load everything runs perfectly: system load is below 1, and the PHP-FPM status report never really shows more than 10 active processes at one time. There is always about 10GB of RAM still available. Under normal load the machine handles about 100 pageviews per second.
The problem arises when huge spikes of traffic arrive and hundreds of pageviews per second are requested from the machine. I notice that FPM's status report then shows up to 50 active processes, but that is still way below the 300 max connections we have configured. During these spikes the Nginx status report shows up to 5000 active connections, instead of the normal average of 1000.
OS Info: CentOS release 5.7 (Final)
CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (16 cores)
php-fpm.conf
daemonize = yes
listen = /tmp/fpm.sock
pm = static
pm.max_children = 300
pm.max_requests = 1000
I have not set up rlimit_files, because as far as I know it should use the system default if you don't.
fastcgi_params (only the values added to the standard file)
fastcgi_connect_timeout 60;
fastcgi_send_timeout 180;
fastcgi_read_timeout 180;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 256k;
fastcgi_busy_buffers_size 256k;
fastcgi_temp_file_write_size 256k;
fastcgi_intercept_errors on;
fastcgi_pass unix:/tmp/fpm.sock;
nginx.conf
worker_processes 8;
worker_connections 16384;
sendfile on;
tcp_nopush on;
keepalive_timeout 4;
Nginx connects to FPM via a Unix socket.
sysctl.conf
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 1
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.eth0.rp_filter=1
net.ipv4.conf.lo.rp_filter=1
net.ipv4.ip_conntrack_max = 100000
limits.conf
* soft nofile 65536
* hard nofile 65536
These are the results for the following commands:
ulimit -n
65536
ulimit -Sn
65536
ulimit -Hn
65536
cat /proc/sys/fs/file-max
2390143
Question: If PHP-FPM is not running out of connections, the load is still low, and there is plenty of RAM available, what bottleneck could be causing these random 502 gateway errors during high traffic?
Note: by default this machine's ulimits were 1024; since I changed them to 65536 I have not fully rebooted the machine, as it's a production machine and it would mean too much downtime.
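To sanity-check without a full reboot (a rough sketch; 1234 is an example PID), I can at least count how many file descriptors a running FPM process holds, since limits.conf only applies to new PAM login sessions and a daemon started before the change may still carry the old 1024 limit:
# first PHP-FPM PID
pgrep php-fpm | head -1
# count its open file descriptors
ls /proc/1234/fd | wc -l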
This should fix it...
You have:
fastcgi_buffers 4 256k;
Change it to:
fastcgi_buffers 256 16k; # 4096k total
Also set fastcgi_max_temp_file_size 0; that will disable buffering to disk if replies start to exceed your fastcgi buffers.
A Unix socket accepts 128 connections by default, so it is good to put this line into /etc/sysctl.conf:
net.core.somaxconn = 4096
If that doesn't help in some cases, use a normal TCP port bind instead of the socket, because a socket at 300+ concurrent connections can block new requests, forcing nginx to show 502. See the sketch below.
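A sketch of both variants (the backlog and port values here are examples, not from the original setup). For the socket, also raise PHP-FPM's own listen backlog to match somaxconn, and restart PHP-FPM so the new backlog takes effect:
; php-fpm.conf (socket variant)
listen = /tmp/fpm.sock
listen.backlog = 4096
# apply the sysctl without a reboot
sysctl -w net.core.somaxconn=4096
; php-fpm.conf (TCP variant)
listen = 127.0.0.1:9000
# nginx side of the TCP variant
fastcgi_pass 127.0.0.1:9000;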
@Mr. Boon
I have 8 cores and 14 GB RAM, but the system gives gateway time-outs very often. Implementing the fix below also didn't solve the issue; still searching for better fixes.
You have: fastcgi_buffers 4 256k;
Change it to:
fastcgi_buffers 256 16k; # 4096k total
Also set fastcgi_max_temp_file_size 0; that will disable buffering to disk if replies start to exceed your fastcgi buffers.
Thanks.
I tried to restart nginx with a command, but an error occurred.
When I run "sudo systemctl restart nginx", this happens.
Job for nginx.service failed because the control process exited with error code. See "systemctl status nginx.service" and "journalctl -xe" for details.
When I run "systemctl status nginx.service", this happens.
Mar 30 08:55:04 ip-172-31-22-186 nginx[2624]: nginx: [emerg] "proxy_buffers" directive invalid value in /etc/nginx/sites-enabled/...:19
Mar 30 08:55:04 ip-172-31-22-186 nginx[2624]: nginx: configuration file /etc/nginx/nginx.conf test failed
In the nginx.conf file:
location / {
....
proxy_buffer_size 0M;
proxy_buffers 4 0M;
proxy_busy_buffers_size 0M;
client_max_body_size 0M;
}
Is there a problem with the configuration here?
proxy_buffers cannot be configured like this. Based on what they are used for and how they are designed, you can NOT set a buffer of 0M; that would set a buffer (memory page) size of zero.
proxy_buffers
Sets the number and size of the buffers used for reading a response from the proxied server, for a single connection. By default, the buffer size is equal to one memory page. This is either 4K or 8K, depending on a platform.
The proxy buffer size is equal to a memory page. To find your current memory page size, type:
getconf PAGE_SIZE
This should return 4096 (bytes), i.e. 4K.
So as you can see there is a reason why you can only use 4K or 8K depending on your system architecture.
We have a great blog post about proxying in general.
https://www.nginx.com/blog/performance-tuning-tips-tricks/
With proxy_buffering turned on, you can configure the proxy_buffers with the directives shown in the docs:
http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_buffering
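For example, a minimal valid version of the location block from the question (a sketch assuming a 4K page size; the client_max_body_size value is just an example) could look like:
location / {
proxy_buffer_size 4k;
proxy_buffers 8 16k;
proxy_busy_buffers_size 16k;
client_max_body_size 10m;
}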
I am testing server performance using JMeter to mimic a large number of users hitting our DigitalOcean server in a short period of time. When I set JMeter to 200 users and test it against my Laravel-based webpage, everything works fine. When I increase the number of users to 500, I start to get 524 errors.
The server CPU never goes over 10% and memory is at 30%, so the server has enough power. The first couple of hundred requests are processed correctly, but then the 524 errors begin to appear. The failed requests have higher latency and connection times than the successful requests, as shown in this screenshot. Any clue where I should start looking for the problem?
My conf file in sites-available
location ~ \.php$ {
try_files $uri =404;
fastcgi_pass unix:/var/run/php-fpm/xxxxxxxxxxxx.com.sock;
fastcgi_index index.php;
}
my nginx.conf settings
fastcgi_buffers 8 128k;
fastcgi_buffer_size 256k;
client_header_timeout 3000;
client_body_timeout 3000;
fastcgi_read_timeout 3000;
client_max_body_size 32m;
my /etc/sysctl.conf file settings, taken from this post after previously getting 504 errors https://www.digitalocean.com/community/questions/getting-nginx-fpm-sock-error
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same name in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
### IMPROVE SYSTEM MEMORY MANAGEMENT ###
# Increase size of file handles and inode cache
fs.file-max = 2097152
# Do less swapping
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2
### GENERAL NETWORK SECURITY OPTIONS ###
# Number of times SYNACKs are retried for a passive TCP connection.
net.ipv4.tcp_synack_retries = 2
# Allowed local port range
net.ipv4.ip_local_port_range = 2000 65535
# Protect Against TCP Time-Wait
net.ipv4.tcp_rfc1337 = 1
# Decrease the time default value for tcp_fin_timeout connection
net.ipv4.tcp_fin_timeout = 15
# Decrease the time default value for connections to keep alive
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
### TUNING NETWORK PERFORMANCE ###
# Default Socket Receive Buffer
net.core.rmem_default = 31457280
# Maximum Socket Receive Buffer
net.core.rmem_max = 12582912
# Default Socket Send Buffer
net.core.wmem_default = 31457280
# Maximum Socket Send Buffer
net.core.wmem_max = 12582912
# Increase number of incoming connections
net.core.somaxconn = 65535
# Increase number of incoming connections backlog
net.core.netdev_max_backlog = 65535
# Increase the maximum amount of option memory buffers
net.core.optmem_max = 25165824
# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.tcp_mem = 65535 131072 262144
net.ipv4.udp_mem = 65535 131072 262144
# Increase the read-buffer space allocatable
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.udp_rmem_min = 16384
# Increase the write-buffer-space allocatable
net.ipv4.tcp_wmem = 8192 65535 16777216
net.ipv4.udp_wmem_min = 16384
# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
Added this to /etc/security/limits.conf
nginx soft nofile 2097152
nginx hard nofile 2097152
www-data soft nofile 2097152
www-data hard nofile 2097152
My php-fpm.d conf file settings
pm = static
pm.max_children = 40
pm.start_servers = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 8
; number of requests each process handles before respawning. Lower this if you have memory leaks, but each respawn takes time
pm.max_requests=50
; pm.process_idle_timeout=10
chdir = /
php_admin_value[disable_functions] = exec,passthru,shell_exec,system
php_admin_flag[allow_url_fopen] = on
php_admin_flag[log_errors] = on
php_admin_value[post_max_size] = 8M
php_admin_value[upload_max_filesize] = 8M
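To see how many FPM children are actually busy while JMeter runs, I could also enable FPM's status page (a sketch; /status is an arbitrary path and the socket is the one from my config above):
; php-fpm.d pool config
pm.status_path = /status
# nginx, restricted to localhost
location = /status {
allow 127.0.0.1;
deny all;
include fastcgi_params;
fastcgi_param SCRIPT_NAME /status;
fastcgi_pass unix:/var/run/php-fpm/xxxxxxxxxxxx.com.sock;
}
Then poll it during a run (assuming the server block answers on localhost): curl -s http://127.0.0.1/status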
HTTP 524 errors are Cloudflare-specific; they're not being generated by your own nginx installation. Cloudflare gave up waiting for a response from your backend service, probably because of the low number of FPM children available to serve the requests.
524 A Timeout Occurred
Cloudflare was able to complete a TCP connection to the origin server, but did not receive a timely HTTP response.
If you're performance-testing your own server setup, don't go through the endpoint that points to Cloudflare; see the sketch below.
The general "backend timed out" response for HTTP servers is 504 Gateway Timeout.
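One way to do that (a sketch; the IP and hostname are placeholders, not values from the question) is to pin the load-testing machine's DNS to the origin server so requests bypass Cloudflare entirely:
# /etc/hosts on the JMeter machine: point the site at the origin IP
203.0.113.10 www.example.com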
I have a huge problem on my site. Please help me fix it.
I have a site where users can download files from various other sites (e.g. one-click hosters like uploaded.net). We act like a proxy: the user generates a link and downloads the file directly; our script downloads nothing to the server. A little bit like a premium link generator, but different. And not illegal.
If a user downloads a file larger than 1GB, the download is canceled when it reaches 1GB.
In the log files I repeatedly find the error
"Upstream timed out (110: Connection timed out) while reading response"
I have tried raising the settings, but that didn't help.
I tried the following:
1. nginx.conf:
fastcgi_send_timeout 300s;
fastcgi_read_timeout 300s;
2. nginx host file:
fastcgi_read_timeout 300;
fastcgi_buffers 8 128k;
fastcgi_buffer_size 256k;
3. PHP.ini:
max_execution_time = 60 (but my PHP script will set it automatically to 0)
max_input_time = 60
memory_limit = 128M
4. PHP-FPM >> www.conf
pm.max_children = 25
pm.start_servers = 2
pm.min_spare_servers = 2
pm.max_spare_servers = 12
request_terminate_timeout = 300s
But nothing helps. What can I do to fix this problem?
Server/Nginx info:
Memory: 32079MB
CPU: model name: Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz (8 cores)
PHP: PHP 5.5.15-1~dotdeb.1 (cli) (built: Jul 24 2014 16:44:04)
NGINX: nginx/1.2.1
nginx.conf:
worker_processes 8;
worker_connections 2048;
But I think the time settings don't matter, because the download stops at exactly 1,604,408 KB every time. If I download at 20 KB/s the download takes more time, but it still cancels at exactly 1,604,408 KB.
Thank you for any help. If you need more information, please ask.
I had a similar problem, where the download would stop at 1024MB with the error
readv() failed (104: Connection reset by peer) while reading upstream
Adding this to the nginx.conf file helped:
fastcgi_max_temp_file_size 1024m;
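If raising the limit only moves the cutoff, another variant consistent with the fastcgi_buffers advice above (a sketch, not something tested in the original answer) is to disable temp-file buffering entirely; nginx then relays the response to the client synchronously instead of spooling it to a capped temp file:
fastcgi_max_temp_file_size 0;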
I'm trying to figure out the correct tuning for nginx on an AWS server that is wholly backed by EBS. The basic issue is that when downloading a ~100MB static file, I'm seeing consistent download rates of ~60K/s. If I use scp to copy the same file from the AWS server, I'm seeing rates of ~1MB/s. (So, I'm not sure EBS even comes into play here).
Initially, I was running nginx with basically the out-of-the-box configuration (for CentOS 6.x). But in an attempt to speed things up, I've played around with various tuning parameters to no avail -- the speed has remained basically the same.
Here is the relevant fragment from my config as it stands at this moment:
location /download {
root /var/www/yada/update;
disable_symlinks off;
autoindex on;
# Transfer tuning follows
aio on;
directio 4m;
output_buffers 1 128k;
}
Initially, these tuning settings were:
sendfile on;
tcp_nopush on;
tcp_nodelay on;
Note, I'm not trying to optimize for a large amount of traffic. There is likely only a single client ever downloading at any given time. The AWS server is a 'micro' instance with 617MB of memory. Regardless, the fact that scp can download at ~1MB/s leads me to believe that HTTP should be able to match or beat that throughput.
Any help is appreciated.
[Update]
Additional information. Running a 'top' command while a download is running, I get:
top - 07:37:33 up 11 days, 1:56, 1 user, load average: 0.00, 0.01, 0.05
Tasks: 63 total, 1 running, 62 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
and 'iostat' shows:
Linux 3.2.38-5.48.amzn1.x86_64 04/03/2013 _x86_64_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.03 0.03 0.02 99.89
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
xvdap1 0.23 2.66 8.59 2544324 8224920
Have you considered turning sendfile on? sendfile allows nginx to use the kernel directly to send static files, so it should be faster than any other option.
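For instance, the location block from the question could be rewritten like this (a sketch; the aio/directio lines are dropped because directio disables sendfile for files above its threshold, so the two approaches conflict):
location /download {
root /var/www/yada/update;
disable_symlinks off;
autoindex on;
sendfile on; # kernel copies the file straight to the socket
tcp_nopush on; # send full packets
}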
scp will often be faster than your HTTP connection by default. I have a suggestion for you: if you are serving static files, I prefer to use S3 with CloudFront, which makes delivery faster. It's very difficult to achieve better performance where large file transfers are involved.
Given that things work well on the same machine, you are probably getting throttled. First check your usage policy with AWS; perhaps it's in the fine print. Alternatively, try different ISPs. If they all give you 60kB/s, you know it's AWS.
We are using Nginx as a load balancer for multiple Riak nodes. The setup worked fine for some time (a few hours) before Nginx started giving 502 Bad Gateway errors. On checking, the individual nodes seemed to be working. We found out that the problem was with the nginx buffer size, so we increased it to 16k; it worked fine for one more day before we started getting 502 errors for everything.
My Nginx configuration is as follows
upstream riak {
server 127.0.0.1:8091 weight=3;
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
server {
listen 8098;
server_name 127.0.0.1:8098;
location / {
proxy_pass http://riak;
proxy_buffer_size 16k;
proxy_buffers 8 16k;
}
}
Any help is appreciated, thank you.
Check if you are running out of file descriptors on the nginx box. Check with netstat whether you have too many connections in the TIME_WAIT state. If so, you will need to reduce your tcp_fin_timeout value from the default 60 seconds to something smaller.
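A quick sketch of those checks (the sysctl value is just an example):
# count connections stuck in TIME_WAIT
netstat -an | grep -c TIME_WAIT
# file descriptors held by the first nginx process
ls /proc/$(pidof nginx | cut -d' ' -f1)/fd | wc -l
# current FIN timeout, then lower it without a reboot
cat /proc/sys/net/ipv4/tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30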