Elastic Beanstalk WebSocket Connection Dropped - nginx

Under Elastic Beanstalk and behind an Application Load Balancer, I have a WebSockets application on Embedded Jetty.
Platform: Java 8 running on 64bit Amazon Linux/2.10.1
The problem is that connections are dropped at the one-minute mark, even though I have already set the Application Load Balancer's idle timeout to 300 seconds (which is Jetty's default timeout).
After some research, I now suspect that the timeout is imposed by Nginx, so I followed the answer here.
I could not deploy with an .ebextensions file formatted like that: Elastic Beanstalk told me that the file to be replaced did not exist. I then ran into this article and came up with the following configuration:
files:
  "/etc/nginx/conf.d/01_increase_timeouts.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      keepalive_timeout 300;
      proxy_connect_timeout 300;
      proxy_send_timeout 300;
      proxy_read_timeout 300;
      send_timeout 300;
container_commands:
  nginx_reload:
    command: "sudo service nginx reload"
With this, the deployment now succeeds. However, WebSocket connections continue to be dropped at the one-minute mark.
Can anyone point out what I am doing wrong or what I could try next?
Please, any help would be greatly appreciated.

Related

Videos greater than 2 MB are not processed by the Nginx server to backend & to AWS S3 bucket

We have been developing an enterprise application for the last two years. Based on microservice architecture, we have nine services with their respective databases and an Angular frontend on NGINX that calls/connects microservices. During our development, we implemented these services and their databases on the Hetzner cloud server with 4GB RAM and 2 CPUs over the internal network, and everything has been working seamlessly. We are uploading all images, pdf, and videos on AWS S3, and it has been smooth sailing. Videos of all sizes were uploaded and played without any issues.
We liked Hetzner and decided to go to production with them as well. We took the first server, installed Proxmox on it, and deployed LXC containers with our services. I tested again there and found no problems.
We then decided to take another server, deployed Proxmox, and clustered the two. This is where the problems started: we hired a network engineer who configured a bridged network between the containers on both nodes. The containers can ping each other, and telnet also connects over the internal network. The MTU set on this bridge is 1400.
Primary problem: we are no longer able to upload videos over 2 MB to S3 from this network.
Other problems (intermittent, noted in the logs):
NGINX:
504 Gateway Time-out errors like the following, on multiple services:
upstream timed out (110: Connection timed out) while reading response header from upstream, client: 223.235.101.169, server: abc.xyz.com, request: "GET /courses/course HTTP/1.1", upstream: "http://10.10.XX.XX:8080//courses/course/toBeApprove", host: " abc.xyz.com, ", referrer: "https:// abc.xyz.com, /"
Tomcat:
com.amazonaws.services.s3.model.AmazonS3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: 7J2EHKVDWQP3367G; S3 Extended Request ID: xGGCQhESxh/Mo6ddwtGYShLIeCJYbgCRT8oGleQu/IfguEfbZpTQXG/AIzgLnG2F5YuCqk7vVE8=), S3 Extended Request ID: xGGCQhESxh/Mo6ddwtGYShLIeCJYbgCRT8oGleQu/IfguEfbZpTQXG/AIzgLnG2F5YuCqk7vVE8=
(we increased all known timeouts, both in nginx and tomcat)
MySQL:
2022-09-08T04:24:27.235964Z 8 [Warning] [MY-010055] [Server] IP address '10.10.XX.XX' could not be resolved: Name or service not known
Other key points to note: we allow video uploads of up to 100 MB, so the known limits are set in the Nginx and Tomcat configurations.
Nginx:
client_max_body_size 100m;
Tomcat:
<Connector port="8080"
           protocol="HTTP/1.1" maxPostSize="102400" maxHttpHeaderSize="102400"
           connectionTimeout="20000" redirectPort="8443" />
During these tests, which ran over the last 15 days, we disabled all firewalls while debugging: ufw on the OS, the Proxmox firewall, and even the data center firewall.
This is our nginx.conf:
http {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    ##
    client_body_buffer_size 16K;
    client_header_buffer_size 1k;
    client_max_body_size 100m;
    client_header_timeout 100s;
    client_body_timeout 100s;
    large_client_header_buffers 4 16k;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 300;
    send_timeout 600;
    proxy_connect_timeout 600;
    proxy_send_timeout 600;
    proxy_read_timeout 600;
    gzip on;
    gzip_comp_level 2;
    gzip_min_length 1000;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain application/x-javascript text/xml text/css application/xml;
}
These are our primary test/debugging trials.
1. Testing with a small video (273 KB)
a. Nginx log: clean, nothing related to the operation
b. Tomcat log:
Start- CoursesServiceImpl - addCourse - Used Memory:73
add course 703
image file not null org.springframework.web.multipart.support.StandardMultipartHttpServletRequest$StandardMultipartFile@15476ca3
image save to s3 bucket
image folder name images
buckets3 lmsdev-cloudfront/images
image s3 bucket for call
imageUrl https://lmsdev-cloudfront.s3.amazonaws.com/images/703_4_istockphoto-1097843576-612x612.jpg
video file not null org.springframework.web.multipart.support.StandardMultipartHttpServletRequest$StandardMultipartFile@13419d27
video save to s3 bucket
video folder name videos
input Stream java.io.ByteArrayInputStream@4da82ff
buckets3 lmsdev-cloudfront/videos
video s3 bucket for call
video url https://lmsdev-cloudfront.s3.amazonaws.com/videos/703_4_giphy360p.mp4
Before Finally - CoursesServiceImpl - addCourse - Used Memory:126
After Finally- CoursesServiceImpl - addCourse - Used Memory:49
c. S3 bucket: see screenshot, https://i.stack.imgur.com/T7daW.png
3. Testing with a video of just under 2 MB
a. The progress bar keeps running for about 5 minutes, then the upload fails
b. Nginx logs:
2022/09/10 16:15:34 [error] 3698306#3698306: *24091 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 223.235.101.169, server: login.pathnam.education, request: "POST /courses/courses/course HTTP/1.1", upstream: "http://10.10.10.10:8080//courses/course", host: "login.pathnam.education", referrer: "https://login.pathnam.education/"
c. Tomcat logs:
Start- CoursesServiceImpl - addCourse - Used Memory:79
add course 704
image file not null org.springframework.web.multipart.support.StandardMultipartHttpServletRequest$StandardMultipartFile@352d57e3
image save to s3 bucket
image folder name images
buckets3 lmsdev-cloudfront/images
image s3 bucket for call
imageUrl https://lmsdev-cloudfront.s3.amazonaws.com/images/704_4_m_Maldives_dest_landscape_l_755_1487.webp
video file not null org.springframework.web.multipart.support.StandardMultipartHttpServletRequest$StandardMultipartFile@45bdb178
video save to s3 bucket
video folder name videos
input Stream java.io.ByteArrayInputStream@3a85dab9
And after a few minutes:
com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection timed out (Write failed)
d. S3 bucket: no entry
We then tried to upload the same video from our test server, and it was uploaded to the S3 bucket instantly.
Most posts describing similar problems relate to php.ini configuration and therefore do not apply to us.
I have now solved the issue: the MTU set in the LXC container differed from the one configured on the virtual switch. The Proxmox interface does not let you set the MTU while creating an LXC container (you would expect the bridge MTU to be used), so this is easy to miss.
Open the container's configuration file (the container ID is 100 in my case):
nano /etc/pve/lxc/100.conf
Find this line:
net0: name=eno1,bridge=vmbr4002,firewall=1,hwaddr=0A:14:98:05:8C:C5,ip=192.168.0.2/24,type=veth
and append the MTU value, matching the one configured on the switch, at the end:
net0: name=eno1,bridge=vmbr4002,firewall=1,hwaddr=0A:14:98:05:8C:C5,ip=192.168.0.2/24,type=veth,mtu=1400
(1400 is the value on my vSwitch.)
Reboot the container to make the change permanent.
After that, everything worked like a charm for me. I hope this helps someone who also uses the Proxmox interface to create containers and therefore missed this setting, which had to be configured via the CLI (a suggested enhancement for Proxmox).
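The manual edit of the net0 line can also be expressed as a small string transformation, which may help when several containers need the same change. The following is only a sketch: the helper name set_mtu is hypothetical, it works on a single line of text rather than on the configuration file itself, and it assumes the comma-separated key=value layout shown in the answer.

```python
def set_mtu(net_line: str, mtu: int) -> str:
    """Return a 'netN: key=value,...' line with mtu=<mtu> added or replaced."""
    # Split only at the first colon; the hwaddr value contains colons too.
    key, sep, value = net_line.partition(":")
    if not sep:
        raise ValueError("expected a line of the form 'net0: key=value,...'")
    options = [opt.strip() for opt in value.split(",") if opt.strip()]
    # Drop any existing mtu option so the result contains exactly one.
    options = [opt for opt in options if not opt.startswith("mtu=")]
    options.append(f"mtu={mtu}")
    return f"{key.strip()}: {','.join(options)}"


# The line from the answer above; 1400 is the author's vSwitch MTU.
line = ("net0: name=eno1,bridge=vmbr4002,firewall=1,"
        "hwaddr=0A:14:98:05:8C:C5,ip=192.168.0.2/24,type=veth")
print(set_mtu(line, 1400))
```

Replacing an existing mtu option instead of blindly appending keeps the helper safe to run twice on the same line.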

dotnet core - Server hangs on Production

We are currently experiencing an issue when we run our dotnet core server setup on Production. We publish it in Bamboo and run it from an AWS linux server, and it sits behind an nginx reverse proxy.
Essentially, every few days our dotnet core server process will go mute. It silently accepts and hangs on web requests, and even silently ignores our (more polite) attempts to stop it. We have verified that it is actually the netcore process that hangs by sending curl requests directly to port 5000 from within the server. We've replicated our production deployment to the best of our ability to our test environment and have not been able to reproduce this failure mode.
We've monitored the server with NewRelic and have inspected it at times when it's gone into failure mode. We've not been able to correlate this behaviour with any significant level of traffic, RAM usage, CPU usage, or open file descriptor usage. Indeed, these measurements all seem to stay at very reasonable levels.
My team and I are a bit stuck as to what might be causing our hung server, or even what we can do next to diagnose it. What might be causing our server process to hang? What further steps can we take to diagnose the issue?
Extra Information
Our nginx conf template:
upstream wfe {
    server 127.0.0.1:5000;
    server 127.0.0.1:5001;
}
server {
    listen 80 default_server;
    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://wfe;
        proxy_read_timeout 20s;
        # Attempting a fix suggested by:
        # https://medium.com/@mshanak/soved-dotnet-core-too-many-open-files-in-system-when-using-postgress-with-entity-framework-c6e30eeff6d1
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection keep-alive;
        proxy_cache_bypass $http_upgrade;
        fastcgi_buffers 16 16k;
        fastcgi_buffer_size 32k;
    }
}
Our Program.cs:
using System.Diagnostics.CodeAnalysis;
using System.IO;
using System.Net;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Logging;
using Serilog;

namespace MyApplication.Presentation
{
    [ExcludeFromCodeCoverage]
    public class Program
    {
        public static void Main(string[] args)
        {
            IWebHost host = WebHost.CreateDefaultBuilder(args)
#if DEBUG
                .UseKestrel(options => options.Listen(IPAddress.Any, 5000))
#endif
                .UseStartup<Startup>()
                .UseSerilog()
                .Build();
            host.Run();
        }
    }
}
During our CD build process, we publish our application for deployment with:
dotnet publish --self-contained -c Release -r linux-x64
We then deploy the folder bin/Release/netcoreapp2.0/linux-x64 to our server, and run publish/<our-executable-name> from within.
EDIT: dotnet --version outputs 2.1.4, both on our CI platform and on the production server.
When the outage starts, nginx logs show that server responses to requests change from 200 to 502, with a single 504 being emitted at the time of the outage.
At the same time, the logs from our server process just stop. They do contain warnings, but these are all explicit warnings that we have put into our application code. None of them indicate that any exceptions have been thrown.
After a few days of investigation I found the cause of this issue. It is caused by glibc >= 2.27, which leads to a GC hang under some conditions, so there is almost nothing you can do about it in the application itself. However, you have a few options:
Use Alpine Linux. It doesn't rely on glibc.
Use an older distro such as Debian 9, Ubuntu 16.04, or any other with glibc < 2.27.
Try to patch glibc yourself, at your own risk: https://sourceware.org/bugzilla/show_bug.cgi?id=25847
Or wait for the glibc patch to be reviewed by the community and included in your favorite distro.
More information can be found here: https://github.com/dotnet/runtime/issues/47700
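To see whether a given host falls into the glibc range named in this answer, the C library version can be read from Python's standard library. This is a minimal sketch, not part of the original answer: the function names are hypothetical, the 2.27 threshold is taken from the text above, and platform.libc_ver() reports the libc that the Python interpreter itself is linked against, which is assumed here to be the system glibc.

```python
import platform


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '2.27' into a tuple of ints: (2, 27)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_in_reported_range(version: str) -> bool:
    """True if the version is >= 2.27, the range the answer reports as affected."""
    return parse_version(version) >= (2, 27)


lib, version = platform.libc_ver()
if lib == "glibc" and version:
    status = "in" if glibc_in_reported_range(version) else "outside"
    print(f"glibc {version} is {status} the reported range (>= 2.27)")
else:
    # Empty values mean the libc could not be identified as glibc.
    print("glibc not detected (for example musl on Alpine, or a non-Linux host)")
```

Comparing integer tuples rather than strings matters here: as strings, "2.9" would sort after "2.27".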

NGINX configuration for Rails 5 ActionCable with puma

I am using Jelastic for my development environment (not yet in production).
My application is running with Unicorn, but I discovered WebSockets with ActionCable and integrated it into my application.
Everything works fine locally, but when deploying to my Jelastic environment (with the default NGINX/Unicorn configuration), I get this message in my JavaScript console and see nothing in my access log:
WebSocket connection to 'ws://dev.myapp.com:8080/' failed: WebSocket is closed before the connection is established.
I used to get this error in my local environment, and I solved it by adding the needed ActionCable.server.config.allowed_request_origins to my config file. I double-checked my development config for this and it is OK.
That's why I was wondering whether anything specific is needed in the NGINX config, other than what is explained on the ActionCable Git page:
bundle exec puma -p 28080 cable/config.ru
For my application, I followed everything from the linked guide, but it mentions nothing about NGINX configuration.
I know that WebSockets with ActionCable are quite new, but I hope someone can give me a lead on this.
Many thanks.
OK, so I finally managed to fix my issue. Here are the steps that made this work:
1. nginx: I don't really know if this is needed, but as my application is running with Unicorn, I added this to my nginx conf:
upstream websocket {
    server 127.0.0.1:28080;
}
server {
    location /cable/ {
        proxy_pass http://websocket/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
    }
}
And then in my config/environments/development.rb file:
config.action_cable.url = "ws://my.app.com/cable/"
2. Allowed request origin: I then noticed that my connection was refused even though I was using ActionCable.server.config.allowed_request_origins in my config/environments/development.rb file. I wonder whether this is due to the development default being http://localhost:3000, as stated in the documentation. So I added this:
ActionCable.server.config.disable_request_forgery_protection = true
I do not have a production environment yet, so I cannot test how it will behave there.
3. Redis password: as stated in the documentation, I was using a config/redis/cable.yml, but I was getting this error:
Error raised inside the event loop: Replies out of sync: #<RuntimeError: ERR operation not permitted>
/var/www/webroot/ROOT/public/shared/bundle/ruby/2.2.0/gems/em-hiredis-0.3.0/lib/em-hiredis/base_client.rb:130:in `block in connect'
So I understood that the way I was setting the password for my Redis server was wrong.
In fact, you have to do something like this:
development:
  <<: *local
  :url: redis://user:password@my.redis.com:6379
  :host: my.redis.com
  :port: 6379
And now everything is working fine, and ActionCable is really impressive.
Maybe some of my issues were trivial, but I am sharing them and how I resolved them so that anyone can pick up something if needed.

Nginx Load Balancer issue

We are using Nginx as a load balancer for multiple Riak nodes. The setup worked fine for some time (a few hours) before Nginx started returning 502 Bad Gateway errors. When we checked, the individual nodes seemed to be working. We found that the problem was the Nginx buffer size and increased it to 16k; it then worked fine for one more day before we started getting 502 errors for everything.
My Nginx configuration is as follows:
upstream riak {
    server 127.0.0.1:8091 weight=3;
    server 127.0.0.1:8092;
    server 127.0.0.1:8093;
    server 127.0.0.1:8094;
}
server {
    listen 8098;
    server_name 127.0.0.1:8098;
    location / {
        proxy_pass http://riak;
        proxy_buffer_size 16k;
        proxy_buffers 8 16k;
    }
}
Any help is appreciated. Thank you.
Check whether you are running out of file descriptors on the nginx box. Check with netstat whether you have too many connections in the TIME_WAIT state. If so, you will need to reduce your tcp_fin_timeout value from the default 60 seconds to something smaller.
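As an alternative to reading netstat output by eye, the per-state connection counts can be tallied from the kernel's socket tables. The sketch below is an illustration under stated assumptions: it is Linux-only, it reads /proc/net/tcp and /proc/net/tcp6 (where the fourth column holds the state as a hex code), and the function names are hypothetical.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Count sockets per TCP state from lines in /proc/net/tcp format."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row ("sl local_address ...") and malformed lines.
        if len(fields) < 4 or fields[0] == "sl":
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_socket_tables(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Read the IPv4 and IPv6 socket tables, ignoring files that are absent."""
    lines = []
    for path in paths:
        try:
            with open(path) as handle:
                lines.extend(handle)
        except OSError:
            pass  # not Linux, or the table is unavailable
    return lines


for state, count in count_tcp_states(read_socket_tables()).most_common():
    print(f"{state:12} {count}")
```

A TIME_WAIT count that is large relative to ESTABLISHED would be consistent with the situation the answer describes; the number of open file descriptors is a separate check.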

nginx errors readv() and recv() failed

I use nginx along with fastcgi. I see a lot of the following errors in the error logs:
readv() failed (104: Connection reset by peer) while reading upstream
and
recv() failed (104: Connection reset by peer) while reading response header from upstream
I don't see any problems when using the application. Are these errors serious, and how can I get rid of them?
I was using php-fpm in the background, and slow scripts were being killed after a set timeout because it was configured that way. Scripts taking longer than the specified time would get killed, and nginx would report a recv or readv error because the connection was closed by the php-fpm engine/process.
Update:
Since nginx version 1.15.3 you can fix this by setting the keepalive_requests option of your upstream to the same number as your php-fpm's pm.max_requests:
upstream name {
    ...
    keepalive_requests number;
    ...
}
Original answer:
If you are using nginx to connect to php-fpm, one possible cause can also be having nginx' fastcgi_keep_conn parameter set to on (especially if you have a low pm.max_requests setting in php-fpm):
http|server|location {
    ...
    fastcgi_keep_conn on;
    ...
}
This may cause the described error every time a child process of php-fpm restarts (due to pm.max_requests being reached) while nginx is still connected to it. To test this, set pm.max_requests to a really low number (like 1) and see if you get even more of the above errors.
The fix is quite simple - just deactivate fastcgi_keep_conn:
fastcgi_keep_conn off;
Or remove the parameter completely (since the default value is off). This does mean your nginx will reconnect to php-fpm on every request, but the performance impact is negligible if you have both nginx and php-fpm on the same machine and connect via unix socket.
Regarding this error:
readv() failed (104: Connection reset by peer) while reading upstream and recv() failed (104: Connection reset by peer) while reading response header from upstream
there was 1 more case where I could still see this.
Quick set up overview:
CentOS 5.5
PHP with PHP-FPM 5.3.8 (compiled from scratch with some 3rd party modules)
Nginx 1.0.5
After looking at the PHP-FPM error logs as well and enabling catch_workers_output = yes in the php-fpm pool config, I found the root cause in this case was actually the amfext module (PHP module for Flash).
There's a known bug and fix for this module that can be corrected by altering the amf.c file.
After fixing this PHP extension issue, the error above was no longer an issue.
This is a very vague error as it can mean a few things. The key is to look at all possible logs and figure it out.
In my case, which is probably somewhat unique, I had a working nginx + php / fastcgi config. I wanted to compile a new updated version of PHP with PHP-FPM and I did so. The reason was that I was working on a live server that couldn't afford downtime. So I had to upgrade and move to PHP-FPM as seamlessly as possible.
Therefore I had 2 instances of PHP.
1 directly talking with fastcgi (PHP 5.3.4), using TCP at 127.0.0.1:9000
1 configured with PHP-FPM (PHP 5.3.8), using a Unix socket at unix:/dir/to/socket-fpm
Once I started up PHP-FPM (PHP 5.3.8) on an nginx vhost using a socket connection instead of TCP I started getting this upstream error on any fastcgi page taking longer than x minutes whether they were using FPM or not. Typically it was pages doing large SELECTS in mysql that took ~2 min to load. Bad I know, but this is because of back end DB design.
What I did to fix it was add this in my vhost configuration:
fastcgi_read_timeout 5m;
Now this can be added in the nginx global fastcgi settings as well. It depends on your set up. http://wiki.nginx.org/HttpFcgiModule
Answer # 2.
Interestingly enough fastcgi_read_timeout 5m; fixed one vhost for me.
However I was still getting the error in another vhost, just by running phpinfo();
What fixed this for me was by copying over a default production php.ini file and adding the config I needed into it.
What I had was an old copy of my php.ini from the previous PHP install.
Once I put the default php.ini from 'shared' and just added in the extensions and config I needed, this solved my problem and no longer did I have nginx errors readv() and recv() failed.
I hope 1 of these 2 fixes helps someone.
It can also be a very simple problem: there is an infinite loop somewhere in your code, or your page keeps trying endlessly to connect to an external host.
Sometimes this problem happens because of a huge number of requests. By default, pm.max_requests in php5-fpm may be 100 or lower.
To solve it, increase its value depending on your site's requests, for example to 500.
After that, you have to restart the service:
sudo service php5-fpm restart
Others have mentioned the fastcgi_read_timeout parameter, which is located in the nginx.conf file:
http {
    ...
    fastcgi_read_timeout 600s;
    ...
}
In addition to that, I also had to change the setting request_terminate_timeout in the file: /etc/php5/fpm/pool.d/www.conf
request_terminate_timeout = 0
Source of information (there are also a few other recommendations for changing php.ini parameters, which may be relevant in some cases): https://ma.ttias.be/nginx-and-php-fpm-upstream-timed-out-failed-110-connection-timed-out-or-reset-by-peer-while-reading/
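The two timeouts discussed in these answers interact: whichever side gives up first tends to determine which error shows up in the nginx log. The sketch below makes that comparison explicit. It is a simplification under stated assumptions: the function names are hypothetical, only the suffixes ms, s, m, h and d are parsed, a request_terminate_timeout of 0 is treated as disabled, and the script is assumed to produce no output until it finishes (fastcgi_read_timeout measures the gap between reads, not the total request time).

```python
import re

# Seconds per unit for the subset of time suffixes handled here.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value: str) -> float:
    """Convert '600s', '5m', or a bare number (taken as seconds) to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if not match:
        raise ValueError(f"unsupported time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or "s"]


def first_to_fire(fastcgi_read_timeout: str, request_terminate_timeout: str) -> str:
    """Return 'nginx' or 'php-fpm', whichever timeout would expire first."""
    nginx_s = to_seconds(fastcgi_read_timeout)
    fpm_s = to_seconds(request_terminate_timeout)
    # 0 disables the php-fpm limit, so only nginx can time out.
    if fpm_s == 0 or nginx_s <= fpm_s:
        return "nginx"
    return "php-fpm"


# Values from the answers above: 600s in nginx.conf, 0 in www.conf.
print(first_to_fire("600s", "0"))
# A hypothetical pool that kills scripts after 30 seconds.
print(first_to_fire("5m", "30"))
```

When php-fpm fires first, nginx sees the connection closed underneath it, which matches the "Connection reset by peer" errors in the question; when nginx fires first, the log typically shows an "upstream timed out" message instead.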

Resources