Comparing HTTP and FTP for transferring files

What are the advantages (or limitations) of one over the other for transferring files over the Internet?
(I am aware of secure forms of both protocols. I'd like to hear comparisons through personal experiences in terms of performance, reliability, file size limitations etc.)

Here's a performance comparison of the two. HTTP is more responsive for request-response of small files, but FTP may be better for large files if tuned properly. FTP used to be generally considered faster. FTP requires a control channel and state be maintained besides the TCP state but HTTP does not. There are 6 packet transfers before data starts transferring in FTP but only 4 in HTTP.
I think a properly tuned TCP layer would have more effect on speed than the difference between application layer protocols. The Sun Blueprint Understanding Tuning TCP has details.
Here's another good comparison of the individual characteristics of each protocol.

I just benchmarked a file transfer over both FTP and HTTP:
over two very good server connections
using the same 1GB .zip file
under the same network conditions (tested one after the other)
The result:
using FTP: 6 minutes
using HTTP: 4 minutes
using a concurrent http downloader software (fdm): 1 minute
So, basically under a "real life" situation:
1) HTTP is faster than FTP when downloading one big file.
2) HTTP can use parallel chunk downloads, which made it up to 6x faster than FTP here, depending on the network conditions.
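A concurrent downloader gets its speedup by splitting the file into byte ranges and fetching each range over its own connection. Here is a minimal sketch of the splitting step only, assuming the server reports the total size and honours HTTP Range requests. `split_ranges` and `range_header` are hypothetical helpers for illustration, not how fdm itself is implemented.

```python
def split_ranges(total_size, parts):
    """Split [0, total_size) into at most `parts` inclusive (start, end) byte ranges."""
    if total_size <= 0 or parts <= 0:
        return []
    chunk = -(-total_size // parts)  # ceiling division, so no bytes are left over
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + chunk, total_size) - 1  # HTTP ranges are inclusive
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start, end):
    """Format an inclusive byte range as the value of an HTTP Range header."""
    return f"bytes={start}-{end}"
```

Each range would then be requested in parallel with its own `Range` header and the pieces written back at their offsets.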

Many firewalls drop outbound connections which are not to ports 80 or 443 (http & https); some even drop connections to those ports that are not HTTP(S). FTP may or may not be allowed, not to speak of the active/PASV modes.
Also, HTTP/1.1 allows for much better partial requests ("only send from byte 123456 to the end of file"), conditional requests and caching ("only send if content changed/if last-modified-date changed") and content compression (gzip).
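To make the partial and conditional requests concrete, here is a rough sketch of the request headers involved. The header names (Range, If-Range, If-None-Match, If-Modified-Since, Accept-Encoding) are standard HTTP; the two helper functions are hypothetical and only build the header dictionaries, they do not perform any request.

```python
def resume_headers(offset, etag=None):
    """Headers asking for the file from byte `offset` to the end."""
    headers = {"Range": f"bytes={offset}-"}
    if etag is not None:
        # If-Range: only honour the range if the file is unchanged,
        # otherwise the server sends the whole file again.
        headers["If-Range"] = etag
    return headers


def revalidation_headers(etag=None, last_modified=None):
    """Headers asking for the content only if it changed, gzip-compressed if possible."""
    headers = {"Accept-Encoding": "gzip"}
    if etag is not None:
        headers["If-None-Match"] = etag
    if last_modified is not None:
        headers["If-Modified-Since"] = last_modified
    return headers
```

A server that supports ranges answers the first kind of request with 206 Partial Content, and an unchanged resource answers the second kind with 304 Not Modified.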
HTTP is much easier to use through a proxy.
From my anecdotal evidence, HTTP is easier to make work with dropped/slow/flaky connections; e.g. it is not needed to (re)establish a login session before (re)initiating transfer.
OTOH, HTTP is stateless, so you'd have to do authentication and building a trail of "who did what when" yourself.
The only difference in speed I've noticed is transferring lots of small files: HTTP with pipelining is faster (reduces round-trips, esp. noticeable on high-latency networks).
Note that HTTP/2 offers even more optimizations, whereas the FTP protocol has not seen any updates for decades (and even extensions to FTP have insignificant uptake by users). So, unless you are transferring files through a time machine, HTTP seems to have won.
(Tangentially: there are protocols that are better suited for file transfer, such as rsync or BitTorrent, but those don't have as much mindshare, whereas HTTP is Everywhere™)

One consideration is that FTP can use non-standard ports, which can make getting through firewalls difficult (especially if you're using SSL). HTTP is typically on a known port, so this is rarely a problem.
If you do decide to use FTP, make sure you read about Active and Passive FTP.
In terms of performance, at the end of the day they're both spewing files directly down TCP connections so should be about the same.

One advantage of FTP is that there is a standard way to list files using dir or ls. Because of this, ftp plays nice with tools such as rsync. Granted, rsync is usually done over ssh, but the option is there.

Both use TCP as the transport protocol, but HTTP can use a persistent connection, which improves TCP performance.

Related

How to handle 20k concurrent listeners on an Icecast server

I want to know how to handle more than 20k listeners concurrently on an Icecast server. I am using liquidsoap as the audio stream generator (only one audio stream is distributed through the Icecast server). The server is configured on AWS. Further, I want to know whether I need to use a load balancer and a CDN to handle this much traffic.
Your main concern is bandwidth. Nothing else, bandwidth. You always run out of bandwidth first. Really.
You'll likely want to spread the load across multiple servers and e.g. do simple DNS round-robin. Also because multiple servers means more bandwidth available.
Feeding in a Primary+Relays (master/slave) topology is typical and documented. For more details I'd recommend the Icecast documentation and searching the Icecast mailing list archives.
There are some minor things like making sure that your ulimit for file descriptors is high enough for the Icecast process.
PS: In theory you can squeeze around 20k concurrent connections out of Icecast, but most of the time you won't have enough actual bandwidth to feed those anyway.
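A back-of-the-envelope calculation shows why bandwidth runs out first. The sketch below assumes a 128 kbit/s stream and 1 Gbit/s of usable uplink per server; both figures are illustrative assumptions, not numbers from the question, and protocol overhead is ignored.

```python
import math


def required_mbps(listeners, stream_kbps):
    """Total outbound bandwidth in Mbit/s for `listeners` clients of one stream."""
    return listeners * stream_kbps / 1000


def servers_needed(listeners, stream_kbps, uplink_mbps):
    """Number of relays needed if each one can push `uplink_mbps` Mbit/s."""
    return math.ceil(required_mbps(listeners, stream_kbps) / uplink_mbps)
```

Under those assumptions, 20k listeners need about 2560 Mbit/s, i.e. three servers with 1 Gbit/s each, before any headroom is added.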

What strategies can I use to overcome networking limitations?

I maintain a service that basically pings sites to check whether they're online or not. The service per se is really simple, it relies only on the HTTP status code returned by the requested URL. For instance, I ignore the response body completely.
The service works fine for a small list of domains. However, networking becomes an issue as the number of sites to ping grows. I tried a couple of different languages and libraries. My latest implementation uses NodeJS and node-fetch, but I already had versions of it written in Python, PHP, Java, and Golang. From that experience, I now know the language is not what determines the request/response speed. There are differences between languages and libraries, for sure, but the bottleneck is not there.
Today, I think the only way to make the service scale is with multiple clusters in different networks (e.g. VPCs if we're talking AWS). I can't think of a way to deal with networking restrictions in a single instance or just a few instances.
So, I'm asking this really broad question: what strategies can I use to overcome networking limitations? I'm looking for both dev and ops answers, but mostly focusing on keeping the structure as light as possible.
One robust way to ping a website (or any TCP service in general) is to send TCP SYN packet to port 443 (or 80 for insecure HTTP) and measure the time till SYN+ACK response. Tools like hping3 and MTR utilize this method.
This method is one of the best because ICMP may be blocked, take a different path, be prioritized differently on routers in the path, or be responded to by a totally different host. Whereas TCP SYN is the actual scenario the users of the website exercise. The network load is minimal as no data is sent in SYN/SYN+ACK packets, only protocol headers (TCP, IP, and lower level protocol headers).
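Sending a bare SYN like hping3 does needs raw sockets and root privileges. A rough approximation that works from ordinary user code is to time a full TCP connect, which completes once the SYN+ACK arrives. The sketch below uses only the standard library; `tcp_connect_time` is a hypothetical helper, and unlike a half-open probe it finishes the handshake before closing the connection.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=3.0):
    """Return the seconds taken to establish a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection returns once the three-way handshake has completed
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # refused, timed out, unreachable, DNS failure, ...
        return None
```

For example, `tcp_connect_time("example.com", 443)` would give the handshake time to port 443, or `None` if the site cannot be reached.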
The answer of @Maxim Egorushkin is great; TCP SYN scanning is the most efficient way I can think of. Other tools, like Masscan, use pcap to send SYN packets from userspace, which reduces the TCP connection management overhead in the kernel. This approach may do the job with a single instance.
If you want to use HTTP to make sure the application layer works, use an HTTP HEAD request. It returns the same status code and headers as GET, but without the body.
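As a sketch of the HEAD approach, the standard library's `http.client` is enough to fetch just the status code. `head_status` is a hypothetical helper; redirects are not followed and errors simply propagate to the caller.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=5.0):
    """Send a HEAD request to `url` and return the HTTP status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        return conn.getresponse().status  # no body is sent for HEAD
    finally:
        conn.close()
```

Some servers handle HEAD differently from GET (or reject it), so it may be worth falling back to GET for sites that answer with 405.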
Another potential optimization is DNS: you can host a DNS server locally and update the domains beforehand, or use a script to update the hosts file before pinging those sites. This can save several milliseconds and some bandwidth while pinging the sites.
At the development level, you could implement a library that parses only the status code of the HTTP response, saving some CPU time on parsing headers.
It is helpful to identify the actual bottleneck first: is it a bandwidth limit? A memory limit? A file descriptor limit? Something else?
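The file descriptor limit is the easiest of these to check from code. Here is a small sketch for Unix-like systems using the standard `resource` module; `fits_in_fd_limit` and its `reserved` margin of 64 descriptors are assumptions made for illustration.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_in_fd_limit(wanted_sockets, reserved=64):
    """True if `wanted_sockets` concurrent sockets fit under the soft limit.

    `reserved` leaves room for stdin/stdout, log files, and so on.
    """
    soft, _hard = fd_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return wanted_sockets + reserved <= soft
```

If the check fails, the number of in-flight requests has to be capped or the limit raised (e.g. with `ulimit -n`) before the process starts.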

What is the difference between FTP and HTTP?

HTTP is used to display the info and also can be used to transfer files from one host to another host.
FTP is used to transfer files from one host to another.
So I come to this point that FTP and HTTP both are almost doing the same work. Then what is the exact benefit of using FTP while I can do this with the HTTP?
Correct me if I am wrong.
Thanks
FTP is a File Transfer Protocol, for transferring files.
FTP is significantly older, it is a protocol designed to enable the transfer of files over a long-running session. There are a wide array of commands and the intent is to allow you to navigate and browse a remote file system and retrieve files (originally over a separate data connection).
FTP still sees a lot of use, but many files are actually transferred over HTTP instead.
HTTP
The HyperText Transfer Protocol was originally designed to transfer hypertext documents and the various assets needed to render them. In practice, this is the way information is transferred on the web -- html, css, images, data are all transferred between web servers and web browsers, as well as between one server and another this way.
HTTP was designed to retrieve a resource from a URL that may or may not match the remote file system (in many web apps, the structure of the URLs has very little to do with the file locations). There is often only a single request in a single http connection and the data uses the same connection as the request.
So I come to this point that FTP and HTTP both are almost doing the same work.
Not really. FTP can be used for file transfer and not really much more. HTTP is way more flexible since it not only transfers byte streams but also meta data (what kind of data is this), supports implicit compression, client specific responses (like based on supported languages), has more flexible ways for authentication, is tuned for less overhead (i.e. can be faster) ...
Then what is the exact benefit of using FTP while I can do this with the HTTP?
There is no real benefit of FTP today. On the contrary, compared with alternatives like HTTP, the design of FTP leads to lots of problems in today's infrastructure, where NAT is heavily used (i.e. multiple internal systems behind a single router with a public IP address).
FTP remains mostly in places where clients or servers don't support more modern ways of exchanging files. A typical example is cheap web hosting, where access to the server to update files is often done by FTP, since lots of tools have FTP built in and it is easy to set up on the server too. Alternatives like WebDAV (HTTP based) or SFTP (SSH based) are less used here since they have less support in clients and servers, even though they would offer more security, more flexibility, and fewer problems.

How many socket connections can a web server handle?

Say if I was to get shared, virtual or dedicated hosting, I read somewhere a server/machine can only handle 64,000 TCP connections at one time, is this true? How many could any type of hosting handle regardless of bandwidth? I'm assuming HTTP works over TCP.
Would this mean only 64,000 users could connect to the website, and if I wanted to serve more I'd have to move to a web farm?
In short:
You should be able to achieve in the order of millions of simultaneous active TCP connections and by extension HTTP request(s). This tells you the maximum performance you can expect with the right platform with the right configuration.
Today, I was worried whether IIS with ASP.NET would support in the order of 100 concurrent connections (look at my update: expect ~10k responses per second on older ASP.NET Mono versions). When I saw this question and its answers, I couldn't resist answering myself; many of the answers here are completely incorrect.
Best Case
The answer to this question must only concern itself with the simplest server configuration to decouple from the countless variables and configurations possible downstream.
So consider the following scenario for my answer:
No traffic on the TCP sessions, except for keep-alive packets (otherwise you would obviously need a corresponding amount of network bandwidth and other computer resources)
Software designed to use asynchronous sockets and programming, rather than a hardware thread per request from a pool. (ie. IIS, Node.js, Nginx... webserver [but not Apache] with async designed application software)
Good performance/dollar CPU / Ram. Today, arbitrarily, let's say i7 (4 core) with 8GB of RAM.
A good firewall/router to match.
No virtual limit/governor - ie. Linux somaxconn, IIS web.config...
No dependency on other slower hardware - no reading from harddisk, because it would be the lowest common denominator and bottleneck, not network IO.
Detailed Answer
Synchronous thread-bound designs tend to be the worst performing relative to Asynchronous IO implementations.
WhatsApp can handle a million WITH traffic on a single Unix flavoured OS machine - https://blog.whatsapp.com/index.php/2012/01/1-million-is-so-2011/.
And finally, this one, http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html, goes into a lot of detail, exploring how even 10 million could be achieved. Servers often have hardware TCP offload engines, ASICs designed for this specific role more efficiently than a general purpose CPU.
Good software design choices
Asynchronous IO design will differ across Operating Systems and Programming platforms. Node.js was designed with asynchronous in mind. You should use Promises at least, and when ECMAScript 7 comes along, async/await. C#/.Net already has full asynchronous support like node.js. Whatever the OS and platform, asynchronous should be expected to perform very well. And whatever language you choose, look for the keyword "asynchronous", most modern languages will have some support, even if it's an add-on of some sort.
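To illustrate the asynchronous style in one concrete language, here is a minimal sketch using Python's `asyncio` streams: a single thread serves many connections at once because each handler yields while waiting on the network. The function names and the one-line echo protocol are made up for the example.

```python
import asyncio


async def handle(reader, writer):
    """Echo one line back to the client, then close the connection."""
    data = await reader.readline()  # yields to other connections while waiting
    writer.write(data)
    await writer.drain()
    writer.close()


async def echo_once(port, message):
    """Client side: send one line and return the echoed reply."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(message)
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def run_demo(n_clients=50):
    """Serve `n_clients` concurrent connections from one thread; True if all echo."""
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        messages = [f"hello {i}\n".encode() for i in range(n_clients)]
        replies = await asyncio.gather(*(echo_once(port, m) for m in messages))
    return list(replies) == messages
```

Running `asyncio.run(run_demo())` opens all the client connections concurrently. No thread is created per connection, which is the property the answer is describing.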
To WebFarm?
Whatever the limit is for your particular situation, yes a web-farm is one good solution to scaling. There are many architectures for achieving this. One is using a load balancer (hosting providers can offer these, but even these have a limit, along with bandwidth ceiling), but I don't favour this option. For Single Page Applications with long-running connections, I prefer to instead have an open list of servers which the client application will choose from randomly at startup and reuse over the lifetime of the application. This removes the single point of failure (load balancer) and enables scaling through multiple data centres and therefore much more bandwidth.
Busting a myth - 64K ports
To address the question component regarding "64,000", this is a misconception. A server can connect to many more than 65535 clients. See https://networkengineering.stackexchange.com/questions/48283/is-a-tcp-server-limited-to-65535-clients/48284
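The reason is that a TCP connection is identified by the combination of source IP, source port, destination IP, and destination port, so every accepted connection can share the server's one listening port. A small local sketch of this, with the hypothetical helper `accept_many` connecting several clients over loopback:

```python
import socket


def accept_many(n):
    """Accept `n` loopback connections on a single listening port.

    Returns (server-side local ports, client-side ports) as two sets.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(n)
    port = server.getsockname()[1]
    clients, accepted = [], []
    try:
        for _ in range(n):
            clients.append(socket.create_connection(("127.0.0.1", port)))
            conn, _addr = server.accept()
            accepted.append(conn)
        local_ports = {c.getsockname()[1] for c in accepted}
        peer_ports = {c.getpeername()[1] for c in accepted}
        return local_ports, peer_ports
    finally:
        for s in clients + accepted + [server]:
            s.close()
```

All accepted sockets report the same local port, while each client uses a different ephemeral port. It is the client side of one IP pair that runs out of ports, not the server.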
By the way, Http.sys on Windows permits multiple applications to share the same server port under the HTTP URL schema. They each register a separate domain binding, but there is ultimately a single server application proxying the requests to the correct applications.
Update 2019-05-30
Here is an up to date comparison of the fastest HTTP libraries - https://www.techempower.com/benchmarks/#section=data-r16&hw=ph&test=plaintext
Test date: 2018-06-06
Hardware used: Dell R440 Xeon Gold + 10 GbE
The leader has ~7M plaintext responses per second (responses, not connections)
The second one Fasthttp for golang advertises 1.5M concurrent connections - see https://github.com/valyala/fasthttp
The leading languages are Rust, Go, C++, Java, C, and even C# ranks at 11 (6.9M per second). Scala and Clojure rank further down. Python ranks at 29th at 2.7M per second.
At the bottom of the list, I note laravel, cakephp, rails, aspnet-mono-ngx, symfony, and zend, all below 10k per second. Note that most of these frameworks are built for dynamic pages and are quite old; there may be newer variants that feature higher up in the list.
Remember this is HTTP plaintext, not for the Websocket specialty: many people coming here will likely be interested in concurrent connections for websocket.
This question is a fairly difficult one. There is no real software limitation on the number of active connections a machine can have, though some OS's are more limited than others. The problem becomes one of resources. For example, let's say a single machine wants to support 64,000 simultaneous connections. If the server uses 1MB of RAM per connection, it would need 64GB of RAM. If each client needs to read a file, the disk or storage array access load becomes much larger than those devices can handle. If a server needs to fork one process per connection then the OS will spend the majority of its time context switching or starving processes for CPU time.
The C10K problem page has a very good discussion of this issue.
To add my two cents to the conversation: on Linux-type systems, one limit worth checking is /proc/sys/net/core/somaxconn. Note that it caps the backlog of pending connections waiting to be accepted on a listening socket, not the total number of sockets a process can have open.
cat /proc/sys/net/core/somaxconn
This number can be modified on the fly (only by root user of course)
echo 1024 > /proc/sys/net/core/somaxconn
But the real number of sockets that can be connected before the system crashes depends entirely on the server process, the hardware of the machine, and the network.
It looks like the answer is at least 12 million if you have a beefy server, your server software is optimized for it, you have enough clients. If you test from one client to one server, the number of port numbers on the client will be one of the obvious resource limits (Each TCP connection is defined by the unique combination of IP and port number at the source and destination).
(You need to run multiple clients as otherwise you hit the 64K limit on port numbers first)
When it comes down to it, this is a classic example of the witticism that "the difference between theory and practise is much larger in practise than in theory" - in practise achieving the higher numbers seems to be a cycle of a. propose specific configuration/architecture/code changes, b. test it till you hit a limit, c. Have I finished? If not then d. work out what was the limiting factor, e. go back to step a (rinse and repeat).
Here is an example with 2 million TCP connections onto a beefy box (128GB RAM and 40 cores) running Phoenix http://www.phoenixframework.org/blog/the-road-to-2-million-websocket-connections - they ended up needing 50 or so reasonably significant servers just to provide the client load (their initial smaller clients maxed out too early, e.g. "maxed our 4core/15gb box @ 450k clients").
Here is another reference for go this time at 10 million: http://goroutines.com/10m.
This appears to be java based and 12 million connections: https://mrotaru.wordpress.com/2013/06/20/12-million-concurrent-connections-with-migratorydata-websocket-server/
Note that HTTP doesn't typically keep TCP connections open for any longer than it takes to transmit the page to the client; and it usually takes much more time for the user to read a web page than it takes to download the page... while the user is viewing the page, he adds no load to the server at all.
So the number of people that can be simultaneously viewing your web site is much larger than the number of TCP connections that it can simultaneously serve.
In the case of IPv4, a server with one IP address that listens on one port only can handle 2^32 IP addresses x 2^16 ports, so 2^48 unique sockets. If you speak about a server as a physical machine, and you are able to utilize all 2^16 ports, then there could be a maximum of 2^48 x 2^16 = 2^64 unique TCP/IP sockets for one IP address. Please note that some ports are reserved for the OS, so this number will be lower. To sum up:
1 IP and 1 port --> 2^48 sockets
1 IP and all ports --> 2^64 sockets
all unique IPv4 sockets in the universe --> 2^96 sockets
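The arithmetic behind these three figures can be checked directly. This is only a restatement of the counts above as integer math, with made-up variable names:

```python
IPV4_ADDRESSES = 2 ** 32  # possible client IP addresses
PORTS = 2 ** 16           # possible port numbers per address

# One server IP and one server port: every (client IP, client port) pair.
one_ip_one_port = IPV4_ADDRESSES * PORTS

# One server IP listening on every port.
one_ip_all_ports = one_ip_one_port * PORTS

# Every (IP, port) endpoint paired with every other (IP, port) endpoint.
all_ipv4_sockets = (IPV4_ADDRESSES * PORTS) ** 2
```

These are theoretical upper bounds on distinct connections; memory and file descriptors run out long before them.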
There are two different discussions here: One is how many people can connect to your server. This one has been answered adequately by others, so I won't go into that.
The other is how many ports your server can listen on. I believe this is where the 64K number came from. TCP uses a 16-bit identifier for a port, which gives 65,536 possible values (64K, a bit more than 64,000). This means that you can have that many different "listeners" on the server per IP address.
I think that the number of concurrent socket connections one web server can handle largely depends on the amount of resources each connection consumes and the amount of total resource available on the server barring any other web server resource limiting configuration.
To illustrate, if every socket connection consumed 1MB of server resource and the server has 16GB of RAM available (theoretically) this would mean it would only be able to handle (16GB / 1MB) concurrent connections. I think it's as simple as that... REALLY!
So regardless of how the web server handles connections, every connection will ultimately consume some resource.
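Worked through with the numbers from this answer, and assuming binary units (1GB = 1024MB), the estimate looks like this. `max_connections` is a hypothetical helper, and real servers also need memory for the OS and the application itself.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def max_connections(total_ram_bytes, per_connection_bytes):
    """Upper bound on concurrent connections if RAM is the only constraint."""
    return total_ram_bytes // per_connection_bytes
```

With 16GB of RAM and 1MB per connection this gives 16,384 connections; shrinking the per-connection cost to 16KB would raise the bound to about a million.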

Is SCTP good for peer-to-peer apps?

I am considering using SCTP instead of TCP for a p2p app written in C. Should I do it? Also how does the speed of SCTP compare to the speed of TCP?
EDIT:
I found that SCTP can be tunneled over UDP with the only problem being tunneled SCTP is not interoperable with untunneled SCTP.
Have you considered whether your target systems will all have SCTP pre-installed on them or whether your application will need to include SCTP itself? In my experience I would not expect all systems to have SCTP installed on them, and I would expect them not to if it were Windows.
If you include SCTP in the application itself, that will more than double the number of messages being passed into and out of the kernel, which will hurt performance compared with using the pre-installed TCP.
Have you considered what benefits you want from SCTP? You mentioned fault tolerance, but for this to work with SCTP the host needs multiple ethernet ports and IP addresses. Is this likely for your app?
As much as I love SCTP (!) I would seriously consider sticking with TCP unless you are sure SCTP is needed or unless you control the hosts your app is deployed on.
Regards
If it's for a local area network, sure go for it.
Note however that if you plan to use it on the open internet many consumer grade firewalls aren't flexible enough to permit unrecognised IP protocols through them.
How does it help you?
You're P2P, so every peer must have at least one socket open to every other peer.
If you've got a socket open, then you can do everything you need to do over that. If you've taken the approach of one socket per file and you have multiple files being transferred concurrently between two given peers, then SCTP will save you one socket per file. However, on a normal P2P network of any size, you will almost never have multiple files being transferred concurrently between two peers.
Just have one socket and have your own little protocol; send a packet with a header, where the header indicates the content type, e.g. a command or part of a file, and if it is part of a file, which file and which byte range.
Of course, you get a little overhead for that, whereas if you have one socket for commands and one per file, you're more efficient. Is saving one socket per peer (assuming one download at a time) worth the time/hassle/complexity of using SCTP?
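A "little protocol" like the one described can be as small as a fixed-size header in front of every payload. The sketch below is in Python for brevity, although the question is about C. The field layout (1-byte type, 4-byte file id, 8-byte offset, 4-byte length) and all names are assumptions chosen for illustration.

```python
import struct

# Network byte order: type (1 byte), file id (4), byte offset (8), payload length (4)
HEADER = struct.Struct("!BIQI")
TYPE_COMMAND = 0
TYPE_FILE_CHUNK = 1


def pack_message(msg_type, file_id, offset, payload):
    """Prefix `payload` with a header describing what it is and where it belongs."""
    return HEADER.pack(msg_type, file_id, offset, len(payload)) + payload


def unpack_message(buffer):
    """Parse one message from the front of `buffer`.

    Returns (msg_type, file_id, offset, payload, remaining_bytes),
    or None if the buffer does not yet hold a complete message.
    """
    if len(buffer) < HEADER.size:
        return None
    msg_type, file_id, offset, length = HEADER.unpack_from(buffer)
    end = HEADER.size + length
    if len(buffer) < end:
        return None
    return msg_type, file_id, offset, buffer[HEADER.size:end], buffer[end:]
```

Because TCP is a byte stream, the receiver keeps appending incoming data to a buffer and calls `unpack_message` until it returns `None`. The length field is what lets commands and chunks of several files share the one socket.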

Resources