I'm currently implementing a reliable UDP transport inspired by KCP, Dragonite, and QUIC, purely for self-education. I want to apply several optimizations, one of which is multiplexing.
My idea is to split the data into small chunks (with the chunk size tied to the MTU) and to send and receive them through multiple datagram sockets asynchronously and in parallel (on both client and server), using coroutines.
Will this solution work? Should I expect performance improvement?
Unlike TCP, UDP has no slow start, i.e. it can send at full speed (if known) from the beginning. The sending speed is therefore limited either by how fast the local system can send data or by the available bandwidth. Assuming that sending is not CPU bound, that the traffic of all the sockets you envision takes the same path (outgoing network card, routers, incoming network card), and that no connection-specific traffic shaping is done in middleboxes, using multiple sockets should not increase speed, since it does not change how the various bottlenecks are used.
This changes if the sending is CPU bound. In this case the use of multiple coroutines combined with multiple sockets might make better use of today's multi-processor systems in that it is running on multiple CPU cores at the same time and this way can send more packets until it gets CPU bound again.
This also changes if the traffic is bandwidth bound but there are alternative paths to the target system which provide additional bandwidth. By binding the sockets to a different local IP address (on a different local network card) or by choosing a different target IP address (for the same target system), one might be able to use such alternative paths and thus make use of the additional bandwidth.
Similarly multiple sockets might help if there is some traffic shaping which limits the bandwidth per connection in between client and server. In this case multiple sockets can increase the amount of usable bandwidth.
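To make the chunking idea from the question concrete, here is a minimal Python sketch. It only covers splitting, distributing and reassembling; no sockets are opened, and loss, retransmission and congestion control are left out. The 8-byte header layout, the 1200-byte default and all function names are my own assumptions for illustration, not anything taken from KCP, Dragonite or QUIC.

```python
import struct

# Hypothetical header: 4-byte sequence number + 4-byte total chunk count.
HEADER = struct.Struct("!II")


def split_into_chunks(data: bytes, mtu: int = 1200) -> list:
    """Split data into datagrams whose size (header + payload) fits in mtu."""
    payload_size = mtu - HEADER.size
    if payload_size <= 0:
        raise ValueError("mtu is too small to hold the header")
    pieces = [data[i:i + payload_size]
              for i in range(0, len(data), payload_size)] or [b""]
    total = len(pieces)
    return [HEADER.pack(seq, total) + piece for seq, piece in enumerate(pieces)]


def assign_round_robin(datagrams, socket_count: int) -> list:
    """Return one list of datagrams per socket, distributed round-robin."""
    lanes = [[] for _ in range(socket_count)]
    for index, datagram in enumerate(datagrams):
        lanes[index % socket_count].append(datagram)
    return lanes


def reassemble(datagrams):
    """Rebuild the payload from datagrams in any order; None if some are missing."""
    parts = {}
    total = None
    for datagram in datagrams:
        seq, total = HEADER.unpack_from(datagram)
        parts[seq] = datagram[HEADER.size:]
    if total is None or len(parts) != total:
        return None
    return b"".join(parts[seq] for seq in range(total))
```

Each lane would then be written to its own datagram socket. Because the sequence number travels in every datagram, the receiver does not care which socket a chunk arrived on or in what order.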
Coming from a background of vSphere VMs with vNICs defined at creation: do GCE instances' internal and public IP network connections use a particular virtualised NIC and, if so, what speed is it: 100 Mbit/s, 1 Gbit/s or 10 Gbit/s?
I'm not so much interested in the bandwidth coming in from the public internet as in what kind of connection is possible between instances, given that networks can span regions.
Is it right to think of a GCE project network as a logical 100 Mbit/s, 1 Gbit/s or 10 Gbit/s network spanning the Atlantic that I plug my instances into, or should there be no minimum expectation because too many variables exist, such as noisy neighbours and inter-region bandwidth, not to mention physical distance?
The virtual network adapter advertised in GCE conforms to the virtio-net specification (specifically virtio-net 0.9.5 with multiqueue). Within the same zone we offer up to 2Gbps/core of network throughput. The NIC itself does not advertise a specific speed. Performance between zones and between regions is subject to capacity limits and quality-of-service within Google's WAN.
The performance relevant features advertised by our virtual NIC as of December 2015 are support for:
IPv4 TCP Transport Segmentation Offload
IPv4 TCP Large Receive Offload
IPv4 TCP/UDP Tx checksum calculation offload
IPv4 TCP/UDP Rx checksum verification offload
Event based queue signaling/interrupt suppression.
In our testing, it is advantageous for best performance to enable all of these features. Images supplied by Google will take advantage of all the features available in the shipping kernel (that is, some images ship with older kernels for stability and may not be able to take advantage of all of these features).
I can see up to 1 Gbit/s between instances within the same zone, but AFAIK that is not something which is guaranteed, especially for transatlantic communication. Things might change in the future, so I'd suggest following official product announcements.
There have been a few enhancements in the years since the original question and answers were posted. In particular, the "2Gbps/core" (really, per vCPU) is still there but there is now a minimum cap of 10 Gbps for VMs with two or more vCPUs. The maximum cap is currently 32 Gbps, with 50 Gbps and 100 Gbps caps in the works.
The per-VM egress caps remain "guaranteed not to exceed" not "guaranteed to achieve."
In terms of achieving peak, trans-Atlantic performance, one suggestion would be the same as for any high-latency path. Ensure that your sources and destinations are tuned to allow sufficient TCP window to achieve the throughput you desire. In particular, this formula would be in effect:
Throughput <= WindowSize / RoundTripTime
Of course that too is a "guaranteed not to exceed" rather than a "guaranteed to achieve" thing. As was stated before "Performance between zones and between regions is subject to capacity limits and quality-of-service within Google's WAN."
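As a rough worked example of the window formula above, here is a small Python sketch. The 80 ms round-trip time is an assumed, illustrative figure for a trans-Atlantic path, not a measured one, and the function names are mine.

```python
def max_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on throughput in bits/s: WindowSize / RoundTripTime."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window (in bytes) needed so the bound above reaches target_bps."""
    return target_bps * rtt_seconds / 8


# A 64 KiB window over an assumed 80 ms RTT caps out at roughly 6.5 Mbit/s.
print(max_throughput_bps(64 * 1024, 0.080))

# Reaching 1 Gbit/s over the same assumed RTT needs a window of about 10 MB.
print(required_window_bytes(1e9, 0.080))
```

Both numbers are ceilings in the same "guaranteed not to exceed" sense: a large enough window is necessary, but it does not by itself deliver the throughput.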
I'm using IT Guru's Opnet to simulate different networks. I've run the basic HomeLAN scenario and by default it uses an ethernet connection running at a data rate of 20Kbps. Throughout the scenarios this is changed from 20K to 40K, then to 512K and then to a T1 line running at 1.544Mbps. My question is - does increasing the data rate for the line increase the throughput?
I have this graph output from the program to display my results:
Please note it's the image on the forefront which is of interest
In general, the signaling capacity of a data path is only one factor in the net throughput.
For example, TCP is known to be sensitive to latency. For any particular TCP implementation and path latency, there will be a maximum speed beyond which TCP cannot go regardless of the path's signaling capacity.
Also consider the source and destination of the traffic: changing the network capacity won't change the speed if the source is not sending the data any faster or if the destination cannot receive it any faster.
In the case of network emulators, also be aware that buffer sizes can affect throughput. The size of the network buffer must be at least as large as the signal rate multiplied by the latency (the Bandwidth Delay Product). I am not familiar with the particulars of Opnet, but I have seen other emulators where it is possible to set a buffer size too small to support the selected rate and latency.
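To illustrate the buffer rule, here is a small Python sketch that computes the Bandwidth Delay Product and checks a buffer against it. The 100 ms latency and the buffer sizes are assumed values for illustration only; nothing here is specific to Opnet.

```python
def bandwidth_delay_product_bytes(rate_bps: int, latency_ms: int) -> float:
    """Bytes in flight on a path: signal rate multiplied by latency."""
    return rate_bps * latency_ms / 8000  # bits/s * ms -> bytes


def buffer_is_sufficient(buffer_bytes: int, rate_bps: int, latency_ms: int) -> bool:
    """True if the buffer can hold at least one Bandwidth Delay Product."""
    return buffer_bytes >= bandwidth_delay_product_bytes(rate_bps, latency_ms)


T1_BPS = 1_544_000  # T1 line rate from the question
LATENCY_MS = 100    # assumed latency, for illustration only

print(bandwidth_delay_product_bytes(T1_BPS, LATENCY_MS))  # 19300.0 bytes
print(buffer_is_sufficient(16_000, T1_BPS, LATENCY_MS))   # False: buffer too small
print(buffer_is_sufficient(32_000, T1_BPS, LATENCY_MS))   # True
```

With these assumed numbers, a 16,000-byte buffer would hold the T1 scenario below its line rate even though the slower scenarios fit comfortably.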
I have written a couple of articles related to these topics which may be helpful:
This one discusses common network bottlenecks: Common Network Performance Problems
This one discusses emulator configuration issues: Network Emulators
Say I were to get shared, virtual or dedicated hosting. I read somewhere that a server/machine can only handle 64,000 TCP connections at one time; is this true? How many could any type of hosting handle, regardless of bandwidth? I'm assuming HTTP works over TCP.
Would this mean only 64,000 users could connect to the website, and if I wanted to serve more I'd have to move to a web farm?
In short:
You should be able to achieve on the order of millions of simultaneous active TCP connections and, by extension, HTTP requests. This tells you the maximum performance you can expect with the right platform and the right configuration.
Today, I was worried whether IIS with ASP.NET would support on the order of 100 concurrent connections (see my update: expect ~10k responses per second on older ASP.NET Mono versions). When I saw this question and its answers, I couldn't resist answering myself; many of the answers here are completely incorrect.
Best Case
The answer to this question must only concern itself with the simplest server configuration to decouple from the countless variables and configurations possible downstream.
So consider the following scenario for my answer:
No traffic on the TCP sessions, except for keep-alive packets (otherwise you would obviously need a corresponding amount of network bandwidth and other computer resources)
Software designed to use asynchronous sockets and programming, rather than a hardware thread per request from a pool. (ie. IIS, Node.js, Nginx... webserver [but not Apache] with async designed application software)
Good performance/dollar CPU / Ram. Today, arbitrarily, let's say i7 (4 core) with 8GB of RAM.
A good firewall/router to match.
No virtual limit/governor - ie. Linux somaxconn, IIS web.config...
No dependency on other slower hardware - no reading from harddisk, because it would be the lowest common denominator and bottleneck, not network IO.
Detailed Answer
Synchronous thread-bound designs tend to be the worst performing relative to Asynchronous IO implementations.
WhatsApp can handle a million WITH traffic on a single Unix flavoured OS machine - https://blog.whatsapp.com/index.php/2012/01/1-million-is-so-2011/.
And finally, this one, http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html, goes into a lot of detail, exploring how even 10 million could be achieved. Servers often have hardware TCP offload engines, ASICs designed for this specific role more efficiently than a general purpose CPU.
Good software design choices
Asynchronous IO design will differ across Operating Systems and Programming platforms. Node.js was designed with asynchronous in mind. You should use Promises at least, and when ECMAScript 7 comes along, async/await. C#/.Net already has full asynchronous support like node.js. Whatever the OS and platform, asynchronous should be expected to perform very well. And whatever language you choose, look for the keyword "asynchronous", most modern languages will have some support, even if it's an add-on of some sort.
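As one illustration of the asynchronous style described above, here is a minimal sketch using Python's asyncio: a line-echo server where a single thread serves many connections, each one yielding control whenever it waits on the network. The function names, the loopback address and the client count are my own choices for the demo, and this is nowhere near a tuned production server.

```python
import asyncio


async def handle_client(reader, writer):
    """Echo each line back. Every await yields so other connections can run."""
    try:
        while line := await reader.readline():
            writer.write(line)
            await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()


async def echo_once(host, port, message: bytes) -> bytes:
    """Open one connection, send one line, and return the echoed reply."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(message)
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def demo(client_count: int = 50):
    """Serve client_count concurrent connections from a single thread."""
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]  # port 0 lets the OS pick one
    async with server:
        return await asyncio.gather(
            *(echo_once("127.0.0.1", port, f"hello {i}\n".encode())
              for i in range(client_count))
        )


if __name__ == "__main__":
    replies = asyncio.run(demo())
    print(len(replies), "connections served concurrently")
```

No thread is parked per connection here, which is the property that lets this style scale to very large numbers of mostly idle sockets.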
To WebFarm?
Whatever the limit is for your particular situation, yes a web-farm is one good solution to scaling. There are many architectures for achieving this. One is using a load balancer (hosting providers can offer these, but even these have a limit, along with bandwidth ceiling), but I don't favour this option. For Single Page Applications with long-running connections, I prefer to instead have an open list of servers which the client application will choose from randomly at startup and reuse over the lifetime of the application. This removes the single point of failure (load balancer) and enables scaling through multiple data centres and therefore much more bandwidth.
Busting a myth - 64K ports
To address the question component regarding "64,000", this is a misconception. A server can connect to many more than 65535 clients. See https://networkengineering.stackexchange.com/questions/48283/is-a-tcp-server-limited-to-65535-clients/48284
By the way, Http.sys on Windows permits multiple applications to share the same server port under the HTTP URL schema. They each register a separate domain binding, but there is ultimately a single server application proxying the requests to the correct applications.
Update 2019-05-30
Here is an up to date comparison of the fastest HTTP libraries - https://www.techempower.com/benchmarks/#section=data-r16&hw=ph&test=plaintext
Test date: 2018-06-06
Hardware used: Dell R440 Xeon Gold + 10 GbE
The leader has ~7M plaintext responses per second (responses, not connections)
The second one Fasthttp for golang advertises 1.5M concurrent connections - see https://github.com/valyala/fasthttp
The leading languages are Rust, Go, C++, Java, C, and even C# ranks at 11 (6.9M per second). Scala and Clojure rank further down. Python ranks at 29th at 2.7M per second.
At the bottom of the list, I note laravel, cakephp, rails, aspnet-mono-ngx, symfony and zend, all below 10k per second. Note that most of these frameworks are built for dynamic pages and are quite old; there may be newer variants that feature higher up in the list.
Remember this is HTTP plaintext, not for the Websocket specialty: many people coming here will likely be interested in concurrent connections for websocket.
This question is a fairly difficult one. There is no real software limitation on the number of active connections a machine can have, though some OS's are more limited than others. The problem becomes one of resources. For example, let's say a single machine wants to support 64,000 simultaneous connections. If the server uses 1MB of RAM per connection, it would need 64GB of RAM. If each client needs to read a file, the disk or storage array access load becomes much larger than those devices can handle. If a server needs to fork one process per connection then the OS will spend the majority of its time context switching or starving processes for CPU time.
The C10K problem page has a very good discussion of this issue.
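To add my two cents to the conversation: on Linux-type systems, /proc/sys/net/core/somaxconn caps the backlog of pending connections that a listening socket can queue (the upper bound applied to the listen() backlog). It is not a limit on the total number of sockets a process can have open.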
cat /proc/sys/net/core/somaxconn
This number can be modified on the fly (only by root user of course)
echo 1024 > /proc/sys/net/core/somaxconn
But the real number of sockets that can be connected before the system crashes depends entirely on the server process, the hardware of the machine and the network.
It looks like the answer is at least 12 million if you have a beefy server, your server software is optimized for it, you have enough clients. If you test from one client to one server, the number of port numbers on the client will be one of the obvious resource limits (Each TCP connection is defined by the unique combination of IP and port number at the source and destination).
(You need to run multiple clients as otherwise you hit the 64K limit on port numbers first)
When it comes down to it, this is a classic example of the witticism that "the difference between theory and practise is much larger in practise than in theory" - in practise achieving the higher numbers seems to be a cycle of a. propose specific configuration/architecture/code changes, b. test it till you hit a limit, c. Have I finished? If not then d. work out what was the limiting factor, e. go back to step a (rinse and repeat).
Here is an example with 2 million TCP connections onto a beefy box (128GB RAM and 40 cores) running Phoenix http://www.phoenixframework.org/blog/the-road-to-2-million-websocket-connections - they ended up needing 50 or so reasonably significant servers just to provide the client load (their initial smaller clients maxed out too early, eg "maxed our 4core/15gb box # 450k clients").
Here is another reference for go this time at 10 million: http://goroutines.com/10m.
This appears to be java based and 12 million connections: https://mrotaru.wordpress.com/2013/06/20/12-million-concurrent-connections-with-migratorydata-websocket-server/
Note that HTTP doesn't typically keep TCP connections open for any longer than it takes to transmit the page to the client; and it usually takes much more time for the user to read a web page than it takes to download the page... while the user is viewing the page, he adds no load to the server at all.
So the number of people that can be simultaneously viewing your web site is much larger than the number of TCP connections that it can simultaneously serve.
In the case of the IPv4 protocol, a server with one IP address that listens on one port only can handle 2^32 IP addresses x 2^16 ports, so 2^48 unique sockets. If you speak about a server as a physical machine, and you are able to utilize all 2^16 ports, then there could be a maximum of 2^48 x 2^16 = 2^64 unique TCP/IP sockets for one IP address. Please note that some ports are reserved for the OS, so this number will be lower. To sum up:
1 IP and 1 port --> 2^48 sockets
1 IP and all ports --> 2^64 sockets
all unique IPv4 sockets in the universe --> 2^96 sockets
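The three figures above can be checked with a few lines of Python. This is only the address-space arithmetic (client IP x client port x server port, and so on); the variable names are mine, and real limits such as memory and file descriptors are hit long before any of these.

```python
IPV4_ADDRESSES = 2 ** 32  # possible client IPv4 addresses
PORTS = 2 ** 16           # possible 16-bit port numbers

# One server IP, one listening port: every (client IP, client port) pair.
per_listening_port = IPV4_ADDRESSES * PORTS

# One server IP, all server ports.
per_server_ip = per_listening_port * PORTS

# Every (source IP, source port, destination IP, destination port) tuple.
all_ipv4_tuples = (IPV4_ADDRESSES * PORTS) ** 2

print(per_listening_port == 2 ** 48)  # True
print(per_server_ip == 2 ** 64)       # True
print(all_ipv4_tuples == 2 ** 96)     # True
```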
There are two different discussions here: One is how many people can connect to your server. This one has been answered adequately by others, so I won't go into that.
The other is how many ports your server can listen on. I believe this is where the 64K number came from. The TCP protocol uses a 16-bit identifier for a port, which translates to 65,536 values (a bit more than 64,000). This means that you can have that many different "listeners" on the server per IP address.
I think that the number of concurrent socket connections one web server can handle largely depends on the amount of resources each connection consumes and the amount of total resource available on the server barring any other web server resource limiting configuration.
To illustrate, if every socket connection consumed 1MB of server resource and the server has 16GB of RAM available (theoretically) this would mean it would only be able to handle (16GB / 1MB) concurrent connections. I think it's as simple as that... REALLY!
So regardless of how the web server handles connections, every connection will ultimately consume some resource.
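The back-of-the-envelope estimate above can be written out as a short Python sketch. It treats GB and MB as binary units (GiB and MiB), which is an assumption on my part, and the reserved-memory parameter is a hypothetical extra to account for the OS and the server process itself.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_connections_by_memory(total_ram_bytes: int,
                              per_connection_bytes: int,
                              reserved_bytes: int = 0) -> int:
    """Connections that fit in RAM once reserved_bytes are set aside."""
    usable = max(total_ram_bytes - reserved_bytes, 0)
    return usable // per_connection_bytes


# 16 GB of RAM at 1 MB per connection, as in the example above.
print(max_connections_by_memory(16 * GIB, 1 * MIB))           # 16384

# Same machine with a hypothetical 2 GB reserved for everything else.
print(max_connections_by_memory(16 * GIB, 1 * MIB, 2 * GIB))  # 14336
```

Memory is only one of the resources involved, so the result is an upper bound on this one axis rather than a capacity figure.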
When writing a custom server, what are the best practices or techniques to determine maximum number of users that can connect to the server at any given time?
I would assume that the capabilities of the computer hardware, network capacity, and server protocol would all be important factors.
Also, do you think it is a good practice to limit the number of network connections to a certain maximum number of users? Or should the server not limit the number of network connections and let performance degrade until the response time is extremely high?
Dan Kegel put together a summary of techniques for handling large amounts of network connections from a single server, here: http://www.kegel.com/c10k.html
In general modern servers can handle very large numbers of concurrent connections. I've worked on systems having over 8,000 concurrently open TCP/IP sockets.
You will need a high quality servicing interface to handle that kind of load, check out libevent or libev.
That is a good question, and it definitely is situational. What is your computer? Do you have a 4-socket machine filled with quad-core Xeons, 128 GB of RAM, and Fibre Channel connectivity (like the pair of Dell R900s we just bought)? Or are you running on a P3 550 with 256 MB of RAM and a 56K modem? How much load does each connection place on your server? What kind of response is acceptable?
These are the questions you need to answer. I guess the best way to find the answer is through load testing. Create a unit test of the expected (and maybe some unexpected) paths that your code will perform against your server. Find a load testing framework that will allow you to simulate 10, 100, 1000, 10000 users performing those tasks at the same time.
That will tell you how many connections your computer can support.
The great thing about the load/unit test scenario is that you can put in response time expectations in your unit tests and increase the load until you fall outside of your response time. If you have a requirement of supporting X number of Users with Y second response, you will be able to demonstrate it with your load tests.
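The "increase the load until you fall outside your response time" loop could be sketched like this in Python. It is deliberately generic: `request` is any coroutine function that performs one user action against your server. In the demo it is a simulated stand-in (ten workers, 10 ms per request) that I made up so the sketch runs without a real server; the function names and thresholds are assumptions as well.

```python
import asyncio
import time


async def timed(request) -> float:
    """Run one request and return how long it took, in seconds."""
    start = time.perf_counter()
    await request()
    return time.perf_counter() - start


async def run_load(request, concurrency: int) -> list:
    """Fire `concurrency` requests at the same time; return their latencies."""
    return await asyncio.gather(*(timed(request) for _ in range(concurrency)))


async def find_capacity(request, levels, max_latency: float) -> int:
    """Return the highest load level whose worst latency is within max_latency."""
    supported = 0
    for level in levels:
        latencies = await run_load(request, level)
        if max(latencies) > max_latency:
            break
        supported = level
    return supported


async def demo() -> int:
    workers = asyncio.Semaphore(10)  # simulated server: 10 requests at a time

    async def simulated_request():
        async with workers:
            await asyncio.sleep(0.01)  # simulated 10 ms of service time

    return await find_capacity(simulated_request, [10, 100, 1000], max_latency=0.5)


if __name__ == "__main__":
    print("supported concurrent users:", asyncio.run(demo()))
```

Against a real server, `request` would open a connection and perform the expected path. A single load-generating machine may itself become the bottleneck, so the measured figure can understate what the server supports.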
One of the biggest obstacles to high-concurrency connections is actually the routers involved. Home-user-oriented routers usually have a small NAT table, which prevents the router from passing all of the connections through to the server.
Be sure to research your router/ network infrastructure setup just as well.
I think you shouldn't limit the number of connections your server will allow. Just catch and properly handle any exceptions that might occur when accepting and closing connections and you should be fine. You should leave that kind of lower-level programming to the underlying OS layers; that way you can port your server more easily, etc.
This really depends on your operating system.
Some Unix flavors support an "unlimited" number of file handles / sockets; others have high values like 32768.
A typical user limit is 8192, but it can usually be set higher.
I think Windows is more limiting, but the server version may have higher limits.
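On Unix-like systems you can inspect the per-process limit from inside the server with Python's standard `resource` module (it is not available on Windows). The sketch below reads the soft and hard limits on open file descriptors, which is what caps open sockets per process, and optionally raises the soft limit up to the hard limit. The helper names are mine.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target: int) -> int:
    """Raise the soft limit towards target, never beyond the hard limit."""
    _, hard = fd_limits()
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard)
```

Raising the hard limit itself normally needs root or a system configuration change, so this only helps when the soft limit is set below the hard one.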