Warning in Jetty 9 HTTP requests - underutilization of the network on local loopback - TCP

I am trying to run Jetty 9 on Ubuntu 12.10 (32-bit). The JVM I am using is JDK 1.7.0_40. I have set up a REST service on my server that uses RestLib. The REST service is a POST method that just receives the data, does no processing with it, and responds with a success.
I want to see what maximum load the Jetty 9 server will take with the given resources. I have an Intel i5 box with 8 GB of memory. I have set up JMeter to test this REST service against localhost. I know this is not advisable, but I would like to know this number (just out of curiosity).
When I run JMeter to test this POST method with 1 MB of payload data in the body, I am getting a throughput of around 20 (for 100 users).
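As a sanity check outside JMeter, a single large POST can be exercised with curl; this is a minimal sketch, and the endpoint path /myservice is hypothetical:

# Generate a 1 MB payload and POST it to the (hypothetical) REST endpoint
head -c 1048576 /dev/zero > payload.bin
curl -v -X POST --data-binary @payload.bin http://localhost:8080/myservice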
To begin with, I measured the bandwidth using iperf:
iperf -c 127.0.0.1 -p 8080
------------------------------------------------------------
Client connecting to 127.0.0.1, TCP port 8080
TCP window size: 167 KByte (default)
[ 3] local 127.0.0.1 port 44130 connected with 127.0.0.1 port 8080
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 196 MBytes 165 Mbits/sec
The figure of 165 Mbits/sec seems ridiculously small to me, but that is one observation.
I ran the server with the StatisticsHandler enabled and observed the request mean time. I also monitored system resources using the nmon monitoring tool.
CPU usage was around 20% overall, there was 4 GB of free memory, and the number of threads in the server (monitored using jconsole) was around 200 (I had specified a max thread count of 2000 in the start.ini file).
JMeter was configured to bombard the server repeatedly.
I observed the network usage on the local loopback interface in the nmon tool and it was around 30 MB. This was in line with the iperf data quoted earlier.
I tried the same experiment with WebLogic (using JDK 1.6) and it was using nearly 250 MBps on the lo interface. I had explicitly disabled TCP SYN cookies in the sysctl config to avoid the limitation caused by the system treating the test as a DoS attack.
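For reference, a minimal sketch of how SYN cookies are typically disabled via sysctl (assuming root access; the kernel key below is the standard one on Linux):

# Disable SYN cookies for the running kernel
sysctl -w net.ipv4.tcp_syncookies=0
# Persist the setting across reboots
echo "net.ipv4.tcp_syncookies = 0" >> /etc/sysctl.conf
sysctl -p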
Please help me comprehend these numbers. Am I missing something in the config? The network seems to be the limiting factor here, but since it is a loopback interface there is no physical limitation, as the WebLogic case shows.
Please help me understand what I am doing wrong in the Jetty 9 case.
Also, I am getting this warning very frequently in the Jetty 9 logs:
WARN:oejh.HttpParser:qtp14540840-309: Parsing Exception: java.lang.IllegalStateException: too much data after closed for HttpChannelOverHttp#1dee6d3{r=1,a=IDLE,uri=-}

This question is effectively being answered on this mailing list thread:
http://dev.eclipse.org/mhonarc/lists/jetty-users/msg03906.html

Related

nginx stalls after 180 KiB from uwsgi

I'm testing a new "flask" python3 app on a newly created (all packages at latest) Debian WSL2 install using (system) "nginx" and (userspace) "uwsgi" passing data between them via a unix-domain socket.
The response being generated by the app is 6.0MiB in size. Chrome, reading from localhost:8080, receives (according to WireShark) 180KiB (exactly) of content plus a few bytes (84) worth of headers. Then it stalls and never receives anything else. When it times out, the nginx "access" log indicates a transfer of the same number of bytes.
However, if before the timeout I do killall -9 uwsgi, then another 160 KiB (exactly) of the result page get immediately sent to Chrome and this larger number is reported in the nginx log.
If I run using the basic, embedded Flask server directly, I get the full 6M of content; no stalls.
Why is nginx not receiving the full response from uwsgi and/or not passing it to the browser?
Update 1: I changed the socket type from Unix-domain to TCP. The same problem occurs, but the stall point is no longer consistent between runs: it has stopped after as little as 180 KiB and as much as 540 KiB. However, it always stalls at one of a few fixed offsets, such as 184236, 405420 (exactly 216 KiB more), and 552876 (another 144 KiB more).
Update 2: I stopped uwsgi and instead ran nc -vlp 8079 >req.http to capture a request from nginx. I then restarted uwsgi and replayed the request (nc -v 127.0.0.1 8079 <req.http >out.html) and received the full 6 MiB response. The problem definitely appears to be in nginx.
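For reference, nginx's buffering of uwsgi responses is controlled by the directives sketched below; the values and socket path are illustrative, and it is not confirmed that buffering is the culprit here:

location / {
    include uwsgi_params;
    uwsgi_pass unix:/run/uwsgi/app.sock;   # hypothetical socket path
    uwsgi_buffering on;
    uwsgi_buffer_size 64k;
    uwsgi_buffers 16 64k;
    uwsgi_max_temp_file_size 1024m;        # large responses spill to a temp file nginx must be able to write
}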

VPN connection results in extremely slow connection (OpenVPN)

I am currently in the <hidden> and connecting to a research institute in India using OpenVPN. The client config file says I am using TCP; however, I have tried UDP too.
My issue is that my connection degrades to about 1 Mbps when I connect to the VPN (see speedtest results below). Please suggest any ways to improve it. I have read that many people have had this problem and that there is no single solution. I tried suggestions from various posts, such as https://serverfault.com/questions/686286/very-low-tcp-openvpn-throughput-100mbit-port-low-cpu-utilization, to change the buffer sizes and txqueuelen. I had also set
sndbuf 0
rcvbuf 0
and txqueuelen = 4000, but there was no improvement in connection speed (I have also tried other combinations of these variables). The MTU is set to 1500.
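For context, a minimal sketch of how those settings are typically applied, assuming the tunnel interface is named tun0 (the interface name is an assumption):

# OpenVPN client config: let the OS size the socket buffers
sndbuf 0
rcvbuf 0
txqueuelen 4000

# Equivalent host-level setting applied directly to the tunnel interface
ip link set dev tun0 txqueuelen 4000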
The server runs CentOS 7 and I am using Ubuntu 18.04.6 LTS. The version of OpenVPN that I am using is OpenVPN 2.5.7 x86_64-pc-linux-gnu.
(I am new to the technicalities of VPNs, even though I have used one before.)
Speedtest without VPN:
Testing from University of <hidden> ...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Bresco Broadband (Columbus, OH) [263.97 km]: 23.339 ms
Testing download speed..........................................
Download: 542.38 Mbit/s
Testing upload speed............................................
Upload: 611.33 Mbit/s
Speedtest with VPN:
Retrieving speedtest.net configuration...
Testing from <hidden> Communications (<hidden IP address>)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by BBNL (Bangalore) [2.23 km]: 649.717 ms
Testing download speed................................................................................
Download: 0.96 Mbit/s
Testing upload speed................................................................................................
Upload: 2.19 Mbit/s
OpenVPN client config file:
dev tun
proto TCP
persist-tun
persist-key
cipher AES-256-CBC
ncp-ciphers AES-256-GCM:AES-128-GCM
auth SHA1
tls-client
client
resolv-retry infinite
remote <hidden> 443 tcp
verify-x509-name "<hidden>-VPN" name
auth-user-pass
pkcs12 pfSense-TCP4-443-<username hidden>.p12
tls-auth pfSense-TCP4-443-<username hidden>-tls.key 1
remote-cert-tls server
sndbuf 512000
rcvbuf 512000
txqueuelen 1000
I was able to find this link that explains why this may be the case and offers a possible solution. Here is a reddit thread that offers a few more solutions in the comments. However, as I was researching I found that this is a common issue with OpenVPN; many articles and threads discuss its slowdowns and speed issues.
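The suggestions in those threads mostly revolve around avoiding TCP-over-TCP and clamping the MSS; a hedged sketch of client-side options that are commonly tried (whether the server also offers a UDP endpoint is an assumption):

# Client config fragments often suggested for throughput problems
proto udp      # avoid TCP-over-TCP meltdown, if the server offers a UDP endpoint
mssfix 1400    # clamp the TCP MSS to avoid fragmentation inside the tunnel
tun-mtu 1500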

Why can't I connect more than 8000 clients to MQTT brokers via HAProxy?

I am trying to establish 10k client connections (potentially 100k) to my 2 MQTT brokers using HAProxy as a load balancer.
I have a working simulator (using the Java Paho library) that can simulate 10k clients. On the same machine I run the 2 MQTT brokers in Docker. For the LB I am using another machine running a virtual image of Ubuntu 16.04.
When I connect directly to an MQTT broker, those connections are established without a problem. However, when I go through HAProxy I only get around 8.8k connections, while the rest throw: Error at client{insert number here}: Connection lost (32109) - java.net.SocketException: Connection reset. When I connect the simulator directly to a broker (same machine), about 20k TCP connections open; when I use the load balancer, only 17k do. This leads me to think the LB is causing the problem.
It is important to add that whenever I run the simulator I am unable to use the browser (it cannot connect to the internet). I haven't tested whether this is browser-only, but could it mean that I am actually running out of ports or something similar, and the real issue is not in the LB?
Here is my HAProxy configuration:
global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 500000
    ulimit-n 500000
    maxpipes 500000

defaults
    log global
    mode http
    timeout connect 3h
    timeout client 3h
    timeout server 3h

listen mqtt
    bind *:8080
    mode tcp
    option tcplog
    option clitcpka
    balance leastconn
    server broker_1 address:1883 check
    server broker_2 address:1884 check

listen stats
    bind 0.0.0.0:1936
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /
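One thing worth checking, though not confirmed to be the cause here: HAProxy applies a per-proxy maxconn in addition to the global one, and it defaults to a low value (commonly 2000) when it is not set in defaults or on the listen section. A hedged sketch of making it explicit:

defaults
    log global
    mode http
    maxconn 500000    # per-proxy connection limit, separate from the global maxconn
    timeout connect 3h
    timeout client 3h
    timeout server 3h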
This is what the MQTT broker shows for every successful/unsuccessful connection:
...
//Successful connection
1613382861: New connection from xxx:32850 on port 1883.
1613382861: New client connected from xxx:60974 as 356 (p2, c1, k1200, u'admin').
...
//Unsuccessful connection
1613382699: New connection from xxx:42861 on port 1883.
1613382699: Client <unknown> closed its connection.
...
And this is what ulimit -a shows on the LB machine:
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 102355
max locked memory (kb) (-l) 82000
max memory size (kb) (-m) unlimited
open files (-n) 500000
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) 500000
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Note: The LB process has the same limits.
I followed various tutorials and increased the open file limit as well as the port limit, TCP header sizes, etc. The number of connected users increased from 2.8k to about 8.5-9k (which is still far lower than the 300k the author of the tutorial had). The ss -s command shows roughly 17,000 TCP and inet connections.
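For reference, a hedged sketch of the checks usually done at this point (standard Linux tooling; the process lookup is illustrative):

# Ephemeral port range available for outgoing connections
sysctl net.ipv4.ip_local_port_range

# File descriptor limit actually applied to the running haproxy process
grep "open files" /proc/$(pidof haproxy)/limits

# Socket summary while the simulator is running
ss -s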
Any pointers would greatly help!
Thanks!
You can't do a normal LB of MQTT traffic, because you can't "pin" the connection based on the MQTT topic. If you send a SUBSCRIBE to Broker1 for topic "test/blatt/#", but the next client PUBLISHes "test/blatt/foo" to Broker2, then if the two brokers are not bridged your first subscriber will never get that message.
If your clients are terminating the TCP connection some time after the CONNECT, or HAProxy is round-robining the packets between the two brokers, you will get errors like this. You need to somehow persist the connections, and I don't know how you do that with HAProxy. Non-free LBs like A10 Thunder or F5 LTM can persist TCP connections, but you still need the MQTT brokers bridged for it all to work.
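As an illustration of the bridging idea, a minimal broker-to-broker bridge sketch, assuming the brokers are Mosquitto (the connection name and address are hypothetical):

# In broker_1's mosquitto.conf: bridge all topics to broker_2 in both directions
connection bridge-to-broker2
address broker2.example.local:1884
topic # both 0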
It turns out I was running out of resources on my computer.
I moved the simulator to another machine and managed to get 15k connections running. Due to resource limits I can't get more than that. The computer running the server side uses 20 of its 32 GB of RAM, and the computer running the simulator used 32 of 32 GB for approximately 15k devices. Now I see why running both on the same computer is not an option.

How to find out the errors behind a lot of non-2xx or 3xx responses when load testing an nginx reverse proxy with wrk

We are doing some tests with NGINX as a reverse proxy in front of two NGINX sample web servers. The tool being used in our tests is wrk. The web servers' configurations are very simple: each of them serves a static page (similar to the default welcome page), and the NGINX proxy directs traffic to them in a round-robin fashion. The aim of the test is to measure the impact of different OSes on the results when running an NGINX reverse proxy (we are doing this with CentOS 7, Debian 10 and FreeBSD 12).
In our results (except on FreeBSD), we see a lot of non-2xx or 3xx responses:
10 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 74.50ms 221.36ms 1.90s 91.31%
Req/Sec 5.88k 4.56k 16.01k 43.96%
Latency Distribution
50% 4.68ms
75% 7.71ms
90% 196.01ms
99% 1.03s
3509526 requests in 1.00m, 1.11GB read
Socket errors: connect 0, read 0, write 0, timeout 875
Non-2xx or 3xx responses: 3285230
Requests/sec: 58431.20
Transfer/sec: 18.96MB
As you can see, about 90 percent of the responses are in this category.
I've tried several different NGINX logging configurations to "catch" some of these errors, but all I get is 200 OK in the log. How can I get more information about these responses?
502 means the proxy was not able to connect to the backend. This could be due to resource exhaustion on either the proxy or the backend server. If your CPU is not saturated, you are most likely dealing with some artificial kernel limit. I've seen file descriptors, TCP connections, accept queues and firewall connection tracking cause this. dmesg sometimes has useful logs.
Usually adding keepalive connections to the backend helps: https://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
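A hedged sketch of what that typically looks like (upstream name and addresses are illustrative):

upstream backend {
    server 192.0.2.10:80;
    server 192.0.2.11:80;
    keepalive 32;                       # idle keepalive connections kept per worker
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;         # required for upstream keepalive
        proxy_set_header Connection ""; # clear the Connection header
    }
}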
Try something like this...
-- wrk Lua hook: called once for every HTTP response received
response = function(status, headers, body)
    if status ~= 200 then
        io.write("Status: " .. status .. "\n")
        io.write("Body:\n")
        io.write(body .. "\n")
    end
end
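Saved to a file, the hook is passed to wrk with its script option (the file name is hypothetical):

wrk -t10 -c400 -d60s -s log_errors.lua http://proxy.example.local/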
After some research, I was able to track this down with tcpdump on the proxy node. After running wrk against the proxy, I ran tcpdump like this:
tcpdump -i ens192 port 80 -nn
And the result, though quite large, contained some interesting insights:
10:53:33.317363 IP x.x.x.x.80 > y.y.y.y.28375: Flags [P.], seq 389527:389857, ack 37920, win 509, options [nop,nop,TS val 825684835 ecr 679917942], length 330: HTTP: HTTP/1.1 502 Bad Gateway
The reason I could not see the error in the nginx logs is that in reverse-proxy mode nginx logs these responses only in debug mode, which itself makes processing so slow that the error above no longer surfaces. Using tcpdump, I could see what the issue was inside the packets.
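For completeness, another avenue sometimes used is logging the upstream status next to the client-facing status in the access log; a hedged sketch using standard nginx variables:

log_format upstream_debug '$remote_addr "$request" status=$status '
                          'upstream_status=$upstream_status upstream_addr=$upstream_addr';
access_log /var/log/nginx/access.log upstream_debug;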

What could cause so many TIME_WAIT connections to be open?

So, I have application A on one server which sends 710 HTTP POST messages per second to application B on another server, which is listening on a single port. The connections are not keep-alive; they are closed.
After a few minutes, application A reports that it can't open new connections to application B.
I am running netstat continuously on both machines and see that a huge number of TIME_WAIT connections are open on each; virtually all connections shown are in TIME_WAIT. From reading online, it seems this is the state a connection stays in for 30 seconds (on our machines, 30 seconds according to the /proc/sys/net/ipv4/tcp_fin_timeout value) after each side closes the connection.
I have a script running on each machine that's continuously doing:
netstat -na | grep 5774 | wc -l
and:
netstat -na | grep 5774 | grep "TIME_WAIT" | wc -l
The value of each, on each machine, seems to get to around 28,000 before application A reports that it can't open new connections to application B.
I've read that the file /proc/sys/net/ipv4/ip_local_port_range defines the local (ephemeral) port range, and hence the total number of outgoing connections that can be open at once:
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
61000 - 32768 = 28232, which is right in line with the approximately 28,000 TIME_WAITs I am seeing.
My question is: how is it possible to have so many connections in TIME_WAIT?
It seems that at 710 connections per second being closed, I should see approximately 710 * 30 seconds = 21300 of these at a given time. I suppose that just because there are 710 being opened per second doesn't mean that there are 710 being closed per second...
The only other thing I can think of is a slow OS getting around to closing the connections.
TCP's TIME_WAIT indicates that the local endpoint (this side) has closed the connection. The connection is kept around so that any delayed packets can be matched to it and handled appropriately. The connections will be removed when they time out, within four minutes.
Assuming that all of those connections were valid, then everything is working correctly. You can eliminate the TIME_WAIT state by having the remote end close the connection or you can modify system parameters to increase recycling (though it can be dangerous to do so).
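A hedged sketch of the parameters usually involved (values are examples; the article quoted below explains why tcp_tw_recycle in particular should be treated with caution):

# Widen the ephemeral port range available for outgoing connections
sysctl -w net.ipv4.ip_local_port_range="15000 61000"

# Allow reuse of TIME-WAIT sockets for new outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1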
Vincent Bernat has an excellent article on TIME_WAIT and how to deal with it:
The Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle does:
Enable fast recycling TIME-WAIT sockets. Default value is 0. It should
not be changed without advice/request of technical experts.
Its sibling, net.ipv4.tcp_tw_reuse is a little bit more documented but the language is about the same:
Allow to reuse TIME-WAIT sockets for new connections when it is safe
from protocol viewpoint. Default value is 0. It should not be changed
without advice/request of technical experts.
The mere result of this lack of documentation is that we find numerous tuning guides advising to set both these settings to 1 to reduce the number of entries in the TIME-WAIT state. However, as stated by tcp(7) manual page, the net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you:
Enable fast recycling of TIME-WAIT sockets. Enabling this option is
not recommended since this causes problems when working with NAT
(Network Address Translation).
