package main

import (
    "io"
    "net/http"
)

func hello(w http.ResponseWriter, r *http.Request) {
    io.WriteString(w, "Hello world!\n")
}

func main() {
    http.HandleFunc("/", hello)
    http.ListenAndServe(":8000", nil)
}
I've got a couple of incredibly basic HTTP servers, and all of them are exhibiting this problem.
$ ab -c 1000 -n 10000 http://127.0.0.1:8000/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
apr_socket_recv: Connection refused (61)
Total of 5112 requests completed
With a smaller concurrency value, things still fall over; for me, the issue usually shows up around the 5k-6k request mark:
$ ab -c 10 -n 10000 http://127.0.0.1:8000/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
apr_socket_recv: Operation timed out (60)
Total of 6277 requests completed
And in fact, you can drop concurrency entirely and the problem still (sometimes) happens:
$ ab -c 1 -n 10000 http://127.0.0.1:8000/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
apr_socket_recv: Operation timed out (60)
Total of 6278 requests completed
I can't help but wonder if I'm hitting some kind of operating system limit somewhere. How would I tell, and how would I mitigate it?
In short, you're running out of ports.
The default ephemeral port range on OS X is 49152-65535, which is only 16,384 ports. Since each ab request uses HTTP/1.0 (without keep-alive, in your first examples), every new request consumes another port.
As each port is used, it gets put into a queue where it waits out the TCP "Maximum Segment Lifetime", which is configured to be 15 seconds on OS X. So if you use more than 16,384 ports in 15 seconds (roughly 1,100 new connections per second), the OS will effectively throttle further connections. Depending on which process runs out of ports first, you will get connection errors from the server, or hangs from ab.
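If you want to see the port exhaustion directly rather than through ab's error messages, a rough sketch along these lines (plain Go dialing the same local server; the address and loop are just for illustration) will usually fail with an "address not available" style error once the ephemeral range is used up:

package main

import (
    "fmt"
    "net"
)

func main() {
    for i := 0; ; i++ {
        // Each successful Dial consumes one ephemeral (source) port.
        c, err := net.Dial("tcp", "127.0.0.1:8000")
        if err != nil {
            // Typically something like "connect: can't assign requested address"
            // once the OS has no free ephemeral ports left to hand out.
            fmt.Printf("gave out after %d connections: %v\n", i, err)
            return
        }
        // Closing from this side leaves the port in TIME_WAIT, so it is not
        // immediately reusable.
        c.Close()
    }
}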
You can mitigate this by using an HTTP/1.1-capable load generator like wrk, or by using the keep-alive (-k) option for ab, so that connections are reused according to the tool's concurrency settings.
Now, the server code you're benchmarking does so little that the load generator is taxed just as much as the server itself, with the local OS and network stack likely making a good contribution. If you want to benchmark an HTTP server, it's better to have it do some meaningful work, and to drive it from multiple clients that aren't running on the same machine.
Related
I installed Ubuntu 18.04 on Hyper-V on Windows Server 2016.
The Ubuntu VM's network performance is bad: I'm hosting a few sites (Apache + PHP), and sometimes the response time is over 10 seconds, while at other times it is fast.
While troubleshooting, I found these netstat results:
# netstat -s | egrep -i 'loss|retran'
3447700 segments retransmitted
226 times recovered from packet loss due to fast retransmit
Detected reordering 6 times using reno fast retransmit
TCPLostRetransmit: 79831
45 timeouts after reno fast retransmit
6247 timeouts in loss state
2056435 fast retransmits
107095 retransmits in slow start
TCPLossProbes: 220607
TCPLossProbeRecovery: 3753
TCPSynRetrans: 90564
What can cause such a high "segments retransmitted" number, and how do I fix it?
A few notes:
- VMQ is disabled for Ubuntu VM
- The host system Network adapter is Intel I210
- I disabled IPv6 both on the host and in the VM
A Wireshark capture shows that it takes ~7 seconds just to establish the initial connection to my site Propovednik.com.
Update, Sep 20: So far, the issue seems to be caused by the bad OVH / SoYouStart network.
This command shows 20-30% packet loss:
sudo ping us.soyoustart.com -c 10 -i 0.2 -p 00 -s 1200 -l 5
The problem could be anywhere along the network path, including the workstation you work from. I suggest you check the network, as retransmissions and packet loss mean that something is either malfunctioning or misconfigured. If this is on a wireless network, you could be out of range of your router.
I am pinging the website you noted from my computer and there is no packet loss.
This is strange. How does this actually work? As far as I know, it should be "impossible" to have a network like this.
I'm going to explain in detail how my network works.
I have a LAN, 192.168.1.0/24, and the router is on 192.168.1.1. This router has a public address.
I can share the IP address because I'm only running a test server there, nothing more. So far, so good.
Now the magic happens.
When I trace the route to an IP (Google's public DNS), I get this:
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1 zonhub.home (192.168.1.1) 1.160 ms 1.676 ms 1.340 ms
2 * * *
3 10.137.211.97 (10.137.211.97) 12.915 ms 12.526 ms 12.145 ms
4 10.255.49.90 (10.255.49.90) 10.349 ms 10.255.49.102 (10.255.49.102) 11.483 ms 11.042 ms
5 80.157.128.249 (80.157.128.249) 34.577 ms 80.157.130.41 (80.157.130.41) 32.917 ms 80.157.130.33 (80.157.130.33) 30.602 ms
6 mad-sa3-i.MAD.ES.NET.DTAG.DE (217.5.95.161) 33.396 ms 80.157.128.22 (80.157.128.22) 27.107 ms mad-sa3-i.MAD.ES.NET.DTAG.DE (217.5.95.161) 29.510 ms
7 80.157.128.22 (80.157.128.22) 28.050 ms 72.14.235.20 (72.14.235.20) 32.767 ms 80.157.128.22 (80.157.128.22) 27.932 ms
8 72.14.235.20 (72.14.235.20) 29.780 ms 72.14.235.18 (72.14.235.18) 27.020 ms 26.706 ms
9 216.239.43.233 (216.239.43.233) 49.456 ms 209.85.240.191 (209.85.240.191) 44.034 ms 216.239.43.233 (216.239.43.233) 51.935 ms
10 72.14.236.191 (72.14.236.191) 53.374 ms 209.85.253.20 (209.85.253.20) 50.699 ms 216.239.43.233 (216.239.43.233) 44.918 ms
11 209.85.251.231 (209.85.251.231) 50.151 ms * 216.239.49.45 (216.239.49.45) 47.309 ms
12 google-public-dns-a.google.com (8.8.8.8) 51.536 ms 50.180 ms 45.505 ms
What are those 2nd, 3rd and 4th hops? How can they be on class A private addresses when 192.168.1.1 is running the NAT for my LAN and my 3 external public addresses (yes, I have 3, on the 88, 89 and 93 networks)?
Another thing: how can the 4th hop have 255 as its second octet?
Anyone, feel free to traceroute my no-ip domain: synackfiles.no-ip.org
Just don't mess with my router (it blocks you if you port scan or fail to log in over SSH or HTTP auth, so you get banned for that; plain traceroute is fine). :P
Now the second bit of magic and weirdness happens when I run nmap. I got this:
sudo nmap -sV -A -O 10.137.211.113 -vv -p 1-500 -Pn
Starting Nmap 6.00 ( http://nmap.org ) at 2013-11-14 15:24 WET
NSE: Loaded 93 scripts for scanning.
NSE: Script Pre-scanning.
NSE: Starting runlevel 1 (of 2) scan.
NSE: Starting runlevel 2 (of 2) scan.
Initiating Parallel DNS resolution of 1 host. at 15:24
Completed Parallel DNS resolution of 1 host. at 15:24, 0.04s elapsed
Initiating SYN Stealth Scan at 15:24
Scanning 10.137.211.113 [500 ports]
SYN Stealth Scan Timing: About 30.40% done; ETC: 15:26 (0:01:11 remaining)
SYN Stealth Scan Timing: About 60.30% done; ETC: 15:26 (0:00:40 remaining)
Completed SYN Stealth Scan at 15:26, 101.14s elapsed (500 total ports)
Initiating Service scan at 15:26
Initiating OS detection (try #1) against 10.137.211.113
Initiating Traceroute at 15:26
Completed Traceroute at 15:26, 9.05s elapsed
Initiating Parallel DNS resolution of 1 host. at 15:26
Completed Parallel DNS resolution of 1 host. at 15:26, 0.01s elapsed
NSE: Script scanning 10.137.211.113.
NSE: Starting runlevel 1 (of 2) scan.
Initiating NSE at 15:26
Completed NSE at 15:26, 0.00s elapsed
NSE: Starting runlevel 2 (of 2) scan.
Nmap scan report for 10.137.211.113
Host is up (0.0010s latency).
All 500 scanned ports on 10.137.211.113 are filtered
Device type: general purpose|specialized|media device
Running: Barrelfish, Microsoft Windows 2003|PocketPC/CE|XP, Novell NetWare 3.X, Siemens embedded, Telekom embedded
OS CPE: cpe:/o:barrelfish:barrelfish cpe:/o:microsoft:windows_server_2003::sp1 cpe:/o:microsoft:windows_server_2003::sp2 cpe:/o:microsoft:windows_ce cpe:/o:microsoft:windows_xp:::professional cpe:/o:novell:netware:3.12
Too many fingerprints match this host to give specific OS details
TCP/IP fingerprint:
SCAN(V=6.00%E=4%D=11/14%OT=%CT=%CU=%PV=Y%G=N%TM=5284EBAB%P=armv7l-unknown-linux-gnueabi)
T7(R=Y%DF=N%TG=80%W=0%S=Z%A=S+%F=AR%O=%RD=0%Q=R)
U1(R=N)
IE(R=N)
TRACEROUTE (using proto 1/icmp)
HOP RTT ADDRESS
1 2.32 ms zonhub.home (192.168.1.1)
2 ... 30
NSE: Script Post-scanning.
NSE: Starting runlevel 1 (of 2) scan.
NSE: Starting runlevel 2 (of 2) scan.
Read data files from: /usr/bin/../share/nmap
OS and Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 122.65 seconds
Raw packets sent: 1109 (49.620KB) | Rcvd: 4 (200B)
Well, this is odd. I don't know how my country's WAN is designed and built.
I'm from Portugal and my ISP is "ZON TVCABO" - you can look it up. :P
This is very, very interesting.
Sincerely,
int3
I cannot tell you how your provider's WAN is built, but in order to save public IPs, an ISP can design its internal network with private IPs. Routers that do not need to be reachable from the public Internet get private IPs only; the public IPs assigned to you can still be routed to your uplink over those ISP-internal routers.
The 2nd hop does not respond to traceroute probes, but it does forward them.
The 4th hop, 10.255.x.x, is a private IP in the 10.0.0.0/8 (class A) range: every octet after the first can be anything from 0-255, so 10.255.49.90 is still private.
We're working on a project where we need to serve a small static XML file at ~40k requests/s.
All incoming requests are sent to the server from HAProxy; however, none of the requests will be persistent.
The issue is that when benchmarking with non-persistent requests, the nginx instance tops out at 19,114 req/s. When persistent connections are enabled, performance increases by nearly an order of magnitude, to 168,867 req/s. The results are similar with G-WAN.
When benchmarking non-persistent requests, CPU usage is minimal.
What can I do to increase performance with non-persistent connections and nginx?
[root@spare01 lighttpd-weighttp-c24b505]# ./weighttp -n 1000000 -c 100 -t 16 "http://192.168.1.40/feed.txt"
finished in 52 sec, 315 millisec and 603 microsec, 19114 req/s, 5413 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 290000000 bytes total, 231000000 bytes http, 59000000 bytes data
[root@spare01 lighttpd-weighttp-c24b505]# ./weighttp -n 1000000 -c 100 -t 16 -k "http://192.168.1.40/feed.txt"
finished in 5 sec, 921 millisec and 791 microsec, 168867 req/s, 48640 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 294950245 bytes total, 235950245 bytes http, 59000000 bytes data
Your 2 tests are identical except for HTTP Keep-Alives:
./weighttp -n 1000000 -c 100 -t 16 "http://192.168.1.40/feed.txt"
./weighttp -n 1000000 -c 100 -t 16 -k "http://192.168.1.40/feed.txt"
And the one with HTTP Keep-Alives is 10x faster:
finished in 52 sec, 19114 req/s, 5413 kbyte/s
finished in 5 sec, 168867 req/s, 48640 kbyte/s
First, HTTP Keep-Alives (persistent connections) make HTTP requests run faster because:
Without HTTP Keep-Alives, the client must establish a new TCP connection for EACH request (this is slow because of the TCP handshake).
With HTTP Keep-Alives, the client can send all its requests over the SAME connection, which is faster because there is less work to do per request.
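To see that difference from code rather than from weighttp, here is a small Go sketch (the target URL and request count are placeholders) that times the same sequential requests with and without keep-alives:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// timeRequests issues n sequential GETs with the given client and returns the
// elapsed time. The body must be drained for the connection to be reusable.
func timeRequests(client *http.Client, url string, n int) time.Duration {
    start := time.Now()
    for i := 0; i < n; i++ {
        resp, err := client.Get(url)
        if err != nil {
            panic(err)
        }
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
    }
    return time.Since(start)
}

func main() {
    url := "http://192.168.1.40/feed.txt" // placeholder target

    // Default transport: keep-alives on, the TCP connection is reused.
    withKeepAlive := &http.Client{}

    // Same requests, but every one opens (and tears down) a new connection.
    withoutKeepAlive := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

    fmt.Println("keep-alive:   ", timeRequests(withKeepAlive, url, 10000))
    fmt.Println("no keep-alive:", timeRequests(withoutKeepAlive, url, 10000))
}

On a fast local server, the no-keep-alive run spends most of its time in TCP handshakes and connection teardown rather than in the web server itself.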
Second, you say that the static XML file is "small".
Is "small" nearer to 1 KB or 1 MB? We don't know, but it makes a huge difference in terms of the options available to speed things up.
Huge files are usually served through sendfile() because it works in the kernel, freeing the usermode server from the burden of reading from disk and buffering.
Small files can use more flexible options available for application developers in usermode, but here also, file size matters (bytes and kilobytes are different animals).
Third, you are using 16 threads with your test. Are you really enjoying 16 PHYSICAL CPU Cores on BOTH the client and the server machines?
If that's not the case, then you are simply slowing down the test to the point that you are no longer testing the web servers.
As you see, many factors have an influence on performance. And there are more with OS tuning (the TCP stack options, available file handles, system buffers, etc.).
To get the most out of a system, you need to examine all those parameters and pick the best values for your particular exercise.
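As one concrete example of those knobs, the per-process file-descriptor limit is often the first thing a busy benchmark client or server hits. A hedged Go sketch to inspect and try to raise it (Unix-only; whether raising the soft limit to the hard limit succeeds depends on system-wide caps):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        fmt.Println("getrlimit:", err)
        return
    }
    fmt.Printf("open-file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)

    // Try to raise the soft limit up to the hard limit; this can still be
    // refused by system-wide settings, so treat failure as informational.
    rl.Cur = rl.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        fmt.Println("setrlimit:", err)
        return
    }
    fmt.Printf("raised soft limit to %d\n", rl.Cur)
}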
I have this Google Dart test program:
#import('dart:io');

main() {
  var s = new HttpServer();
  s.defaultRequestHandler = (HttpRequest req, HttpResponse res) {
    res.statusCode = 200;
    res.contentLength = 4;
    res.outputStream.writeString("sup!!");
    res.outputStream.close();
  };
  s.listen('127.0.0.1', 80);
  print('its up!');
}
It works fine in Chrome and Firefox; I get the "sup!!" message.
However, as soon as I point ApacheBench at it, ab hangs:
Z:\www>ab -n 1 -c 1 "http://127.0.0.1/"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)...apr_poll: The timeout specified has expired (70007)
You can get ab by installing the Apache HTTP server; it will be located under its bin folder.
On a side note: is there some other benchmarking tool similar to ab that I could use instead, one that does not hang?
It could be a problem with the contentLength. You set it to 4, but the actual body ("sup!!") is 5 bytes. If ab honors the contentLength, it will read 4 characters and then wait for the connection to close; but the connection probably won't close, because the server is still waiting to write the last character. The client and server are each waiting for the other, resulting in deadlock.
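The general fix is to derive the declared length from the bytes you actually send (or let the library fill it in for you). Just to illustrate the idea in Go terms (this is a sketch of the same handler, not the Dart API):

package main

import (
    "net/http"
    "strconv"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        body := []byte("sup!!")
        // Declare exactly as many bytes as we are going to write (5, not 4),
        // so the client knows when the response is complete.
        w.Header().Set("Content-Length", strconv.Itoa(len(body)))
        w.WriteHeader(http.StatusOK)
        w.Write(body)
    })
    http.ListenAndServe(":80", nil) // port 80 to mirror the Dart example; may need privileges
}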
We are trying to performance-tune our Drupal site.
We are using Siege to measure performance (as a Drupal visitor).
Environment:
Nginx + FastCGI + Memcache
Siege runs fine for a few seconds, and then we run into connection errors:
Example:
HTTP/1.1 200 29.18 secs: 5877 bytes ==> /
HTTP/1.1 200 29.39 secs: 5877 bytes ==> /
warning: socket: -1656235120 select timed out: Connection timed out
warning: socket: -1673020528 select timed out: Connection timed out
Using the same Siege test configuration, Nginx + FastCGI + Drupal cache seems to work fine.
Example:
HTTP/1.1 200 1.41 secs: 5868 bytes ==> /
HTTP/1.1 200 1.40 secs: 5868 bytes ==> /
As you can see, response time is much higher with Memcache, in addition to the connection errors.
Any idea what could be wrong here, and why Drupal is throwing errors with Memcache under load?
Memcache runs on a separate instance, with 2 GB of memory allocated to it.
I guess that you are running out of memcached connections. Check your memcached installation with a simple script run every second, then start Siege. I suspect your memcached stops responding after a while.
Test Memcache PHP script:
<?php
$memcache = new Memcache;
$memcache->connect('localhost', 11211) or die ('Unable to connect');
$version = $memcache->getVersion();
echo 'Server version: '.$version;
?>
What I guess is happening is that you have not disabled persistent connections in Memcache, and they hang around in the PHP threads. Memcached can serve ~1023 of them at a time, and that might not be enough while sieging.
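One way to check for that is to poll memcached's stats while Siege runs and watch whether curr_connections climbs toward the limit. A rough sketch (in Go, speaking memcached's plain-text protocol to the default port 11211; adjust the host/port to your setup):

package main

import (
    "bufio"
    "fmt"
    "net"
    "strings"
    "time"
)

func main() {
    for {
        conn, err := net.DialTimeout("tcp", "127.0.0.1:11211", 2*time.Second)
        if err != nil {
            fmt.Println("connect failed:", err)
        } else {
            // "stats" returns "STAT name value" lines terminated by "END".
            fmt.Fprint(conn, "stats\r\n")
            scanner := bufio.NewScanner(conn)
            for scanner.Scan() {
                line := scanner.Text()
                if line == "END" {
                    break
                }
                if strings.Contains(line, "curr_connections") ||
                    strings.Contains(line, "listen_disabled_num") {
                    fmt.Println(line)
                }
            }
            conn.Close()
        }
        time.Sleep(time.Second)
    }
}

If curr_connections flattens out near the connection limit, or listen_disabled_num starts increasing while Siege reports timeouts, the connection limit is the problem rather than Drupal itself.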
You might also try ab, the Apache benchmarking tool, with a close look at the -c switch. Play around with it and see how the results change with different values.
Finally, you should run tcpdump on your memcached port (usually 11211) on the PHP machine to find out what is happening to the connections. Does Drupal open them? Does the other host respond with an RST, or does it time out?
There was a bug in the Memcache PHP API documentation that said connections are non-persistent by default; they are in fact persistent by default (well, they were at the time I ran into this problem).
Feel free to comment on this answer; I'll read the comments and assist further if necessary.