What strategies can I use to overcome networking limitations?

I maintain a service that basically pings sites to check whether they're online or not. The service itself is really simple: it relies only on the HTTP status code returned by the requested URL and ignores the response body completely.
The service works fine for a small list of domains. However, networking becomes an issue as the number of sites to ping grows. I have tried a couple of different languages and libraries. My latest implementation uses NodeJS and node-fetch, but I have already had versions of it written in Python, PHP, Java and Golang. From that experience, I now know the language is not what determines the request/response speed. There are differences between languages and libraries, for sure, but the bottleneck is not there.
Today, I think the only way to make the service scale is to run multiple clusters in different networks (e.g. VPCs, if we're talking AWS). I can't think of a way to deal with networking restrictions on a single instance or just a few.
So, I'm asking this really broad question: what strategies can I use to overcome networking limitations? I'm looking for both dev and ops answers, but mostly focusing on keeping the structure as light as possible.

One robust way to ping a website (or any TCP service in general) is to send a TCP SYN packet to port 443 (or 80 for insecure HTTP) and measure the time until the SYN+ACK response arrives. Tools like hping3 and MTR use this method.
This method is one of the best because ICMP may be blocked, take a different path, be prioritized differently on routers in the path, or be responded to by a totally different host. Whereas TCP SYN is the actual scenario the users of the website exercise. The network load is minimal as no data is sent in SYN/SYN+ACK packets, only protocol headers (TCP, IP, and lower level protocol headers).
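Sending a bare SYN needs raw sockets and elevated privileges, so as a rough approximation the sketch below times a full TCP handshake through an ordinary connect() call instead. The function name tcp_connect_time and its defaults are illustrative assumptions, and when a host name is passed the measured time also includes DNS resolution.

```python
import socket
import time


def tcp_connect_time(host, port=443, timeout=3.0):
    """Return the seconds taken to complete a TCP handshake, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection resolves the name and performs the handshake;
        # no application data is sent before the socket is closed again.
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Refused, unreachable, timed out, or the name did not resolve.
        return None
```

A call such as tcp_connect_time("example.com") returns a float on success and None when the host cannot be reached within the timeout.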

The answer from @Maxim Egorushkin is great; TCP SYN scanning is the most efficient way I can think of. Other tools, such as Masscan, use pcap to send SYN packets from user space, which reduces the TCP connection management overhead in the kernel. This approach may do the job with a single instance.
If you want to use HTTP to make sure the application layer works fine, use an HTTP HEAD request. The response carries the same headers and status code as a GET, but without the body.
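A minimal sketch of such a check with Python's standard http.client follows. The function name head_status is an assumption for illustration, and some servers answer HEAD with 405 or behave differently than for GET, so treat the result as a hint rather than proof.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=5.0):
    """Send a HEAD request and return the HTTP status code."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        # Only the status line and headers come back; there is no body to read.
        return conn.getresponse().status
    finally:
        conn.close()
```

Redirects are not followed here, so a 301 or 302 is returned as is and the caller decides whether that counts as "online".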
Another potential optimization is DNS: you can host a DNS server locally and refresh the domains beforehand, or use a script to update the hosts file before pinging those sites. This can save several milliseconds and some bandwidth while pinging the sites.
At the development level, you could implement a library that parses only the status code in the HTTP response, saving some CPU time on parsing headers.
It is helpful to identify the actual bottleneck first: is it a bandwidth limit, a memory limit, a file descriptor limit, or something else?
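As one example of checking a candidate bottleneck, the sketch below reads the per-process file descriptor limit with Python's resource module (Unix only). The helper names fd_limits and raise_fd_limit are made up for this illustration; each open connection consumes one descriptor, so a low soft limit caps concurrency regardless of language.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_fd_limit(target):
    """Raise the soft limit toward `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = fd_limits()
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return fd_limits()[0]
```

If the soft limit turns out to be something like 1024, that alone explains failures once more than about a thousand checks run concurrently.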


HTTP through a domain socket

I'm writing a bit of desktop software which has two components. Component B queries component A. Creating a web service seems like an ideal way to do IPC in principle. The data model fits, there are ready-made client and server libraries, a well known way to encode and decode parameters etc.
But setting up an HTTP server on a network socket doesn't seem right for a local application. For example what port do I choose? I don't really want people to be able to scan and talk to the app from outside etc.
So I was thinking that I might be able to do HTTP over a domain socket. Does that make any sense? Is there any precedent for it? Is there an equivalent protocol that I could use for IPC which has the same properties as HTTP (requests for specified resources (URIs), encoded parameters, response)?
Looking for C libraries (and possibly Go and ObjC for bonus points).
Binding to the loopback interface only solves your "external visibility" problem: only processes on the local machine will be able to connect.
It does not solve your port allocation problem though, the port number you choose might be taken by the time your app starts. Then your server can't bind and your client connects to the other process bound to your port.
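A Unix domain socket sidesteps both issues, since there is no port to pick and access is governed by file permissions on the socket path; Docker's daemon API is one well-known precedent for HTTP over such a socket. The sketch below uses Python's standard library only to keep it short (the question asks for C, Go or ObjC, where the same idea applies). The class names UnixHTTPServer, HelloHandler and UnixHTTPConnection are assumptions for illustration, and it requires a Unix-like OS.

```python
import http.client
import http.server
import socket
import socketserver


class UnixHTTPServer(socketserver.UnixStreamServer):
    """HTTP server bound to a filesystem path instead of a TCP port."""

    def get_request(self):
        request, _ = super().get_request()
        # Unix peers have no (host, port) pair; supply a placeholder so the
        # stock request handler, which expects one, keeps working.
        return request, ("unix-peer", 0)


class HelloHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = ("you asked for " + self.path).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that dials a Unix socket path."""

    def __init__(self, path, timeout=5.0):
        # "localhost" is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.unix_path = path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.unix_path)
        self.sock = sock
```

Component A would run UnixHTTPServer(path, HelloHandler).serve_forever(), and component B would create UnixHTTPConnection(path) and call request("GET", "/items?id=7") followed by getresponse(), exactly as with a TCP-backed connection.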
CORBA is old and less hip, but its implementations tend to have already figured out the problems you have not thought of yet.

Connection Speed using HTTP Request

We are making an application involving a server (Tomcat, Apache, Linux) and multiple mobile clients (Android, iPhone, Windows, Nokia J2ME).
Normally the clients and the server will communicate using HTTP.
I would like to know the download and upload speeds of the client from the HTTP request that it made.
Ideally I would not like to upload a file and download a file to come up with these speeds. I am assuming that there might be something at the HTTP protocol level that can give me this, or at some lower layer of the network.
If only it were that simple.
Even where the bandwidth and latency of a network are very well defined, the actual throughput will be limited by the congestion window and where the end points are in establishing the slow start threshold. These can affect throughput by a factor of 20 or more.
There's nothing in HTTP which will provide metrics for these. Some TCP stacks will expose limited information about throughput (as used by iftop, iptraf).
However, if you really want to gather useful metrics on HTTP throughput, then you need to start shoving data across the network. Have a look at Yahoo's Boomerang for an implementation.
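The "shove data across and time it" idea can be sketched as below. This is not how Boomerang works internally; it is only a generic illustration, and the function name measure_throughput is an assumption. It accepts any object with a read() method, such as the response returned by urllib.request.urlopen.

```python
import time


def measure_throughput(stream, chunk_size=64 * 1024):
    """Read `stream` to EOF and return (total_bytes, seconds, bytes_per_second)."""
    total = 0
    start = time.perf_counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on tiny inputs.
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

For the reasons given above (congestion window, slow start), a short transfer will understate the achievable rate, so the payload needs to be large enough for the connection to ramp up.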
If the HTTP connection goes to the Apache server first, you can use Apache Bench to do all sorts of load testing. It comes with Apache and can be invoked with something like the following.
Suppose we want to see how fast Yahoo can handle 100 requests, with a maximum of 10 requests running concurrently:
ab -n 100 -c 10 http://www.yahoo.com/
HTTP does not deal with connection speeds, although I could imagine some solution that involves an HTTP (reverse) proxy that estimates speeds on a connection and sets custom headers to pass this info. You would also need to associate the stats of different connections with a particular client. I have not yet seen a readily available solution for this.
Also note that:
Network traffic can be buffered or shaped, so download speed may depend on the amount of data transferred or the previous load on the network. So even downloading a file would not be accurate.
The amount of data transferred depends on the protocol level (payload wrapped in HTTP wrapped in gzip wrapped in TLS wrapped in TCP). Which one do you want to measure? Or what do you want to achieve with this measured speed?
I've seen some Real User Monitoring (RUM) tools that can do this passively (they get a feed from a SPAN port or network TAP in front of the servers at the data centre).
There are probably ways of integrating the data they produce into your applications, but I'm not sure it would be easy, or that accurate, given the way latency and bandwidth can change dynamically on a mobile network.
I guess the real thing to focus on is the design of the app: how much data is travelling across the network, how you can minimise it, etc.
Another thing to consider is whether you could offer a solution that allows some of the application to be hosted in the telco's POPs (some telcos route all their towers back to a central POP, others have multiple POPs).

Other common protocols besides HTTP?

I usually pass data between my web servers (in different locations) using HTTP requests (sometimes using SSL if it's sensitive). I was wondering if there were any lighter protocols that I might be able to swap HTTP(S) for that would also support public/private keys like SSH or something.
I used PHP sockets to build a SMTP client before so I wouldn't mind doing that if required.
There are lots and lots and lots of protocols. Lots. Start here for a list.
SFTP is fun for passing data around. It works well. You'll find that it's not much better than HTTP, however, because HTTP is pretty simple.
SMTP would work. http://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol
SNMP can be made to work. http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol You have to really push the envelope.
All of these, however, involve TCP/IP sockets, which involve a fair amount of overhead because of the negotiation for a connection and the acknowledgement of packets.
If you want real fun with very low overhead, use UDP.
You might want to use Reliable UDP if you're worried about messages getting dropped.
I'd like to mention XMPP in addition to protocols already listed in other answers.
It's lightweight, and it is used in some "realtime" communication systems (for example, in GTalk).
WebSocket is a good option if you are interested in keeping a connection open to pass multiple messages back and forth. It's useful for issuing updates from the server to clients in real time, for example.
Why don't you simply use FTPS?

How to retain one million simultaneous TCP connections?

I am to design a server that needs to serve millions of clients that are simultaneously connected with the server via TCP.
The data traffic between the server and the clients will be sparse, so bandwidth issues can be ignored.
One important requirement is that whenever the server needs to send data to any client it should use the existing TCP connection instead of opening a new connection toward the client (because the client may be behind a firewall).
Does anybody know how to do this, and what hardware/software is needed (at the least cost)?
What operating systems are you considering for this?
If using a Windows OS and using something later than Vista then you shouldn't have a problem with many thousands of connections on a single machine. I've run tests (here: http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html) with a low spec Windows Server 2003 machine and easily achieved more than 70,000 active TCP connections. Some of the resource limits that affect the number of connections possible have been lifted considerably on Vista (see here: http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html) and so you could probably achieve your goal with a small cluster of machines. I don't know what you'd need in front of those to route the connections.
Windows provides a facility called I/O Completion Ports (see: http://msdn.microsoft.com/en-us/magazine/cc302334.aspx) which allows you to service many thousands of concurrent connections with very few threads (I was running tests yesterday with 5000 connections saturating a link to a server with 2 threads to process the I/O...). Thus the basic architecture is very scalable.
If you want to run some tests then I have some freely available tools on my blog that allow you to thrash a simple echo server using many thousands of connections (1) and (2) and some free code which you could use to get you started (3)
The second part of your question, from your comments, is more tricky. If the client's IP address keeps changing and there's nothing between you and them that is providing NAT to give you a consistent IP address then their connections will, no doubt, be terminated and need to be re-established. If the clients detect this connection tear down when their IP address changes then they can reconnect to the server, if they can't then I would suggest that the clients need to poll the server every so often so that they can detect the connection loss and reconnect. There's nothing the server can do here as it can't predict the new IP address and it will discover that the old connection has failed when it tries to send data.
And remember, your problems are only just beginning once you get your system to scale to this level...
This problem is related to the so-called C10K problem. The C10K page lists a large number of good resources for addressing the problems you will encounter when you try to allow thousands of clients to connect to the same server.
I came across the APE Project a while back. It seems like a dream come true. They can support up to 100k concurrent clients on a single node. Spread them across 10 or 20 nodes, and you can serve millions. Perfect for RESTful applications. Might want to look deeper for any shared namespace. One drawback is that this is a standalone server, as in supplementary to a web server. This server is of course Open Source, so any cost is hardware/ISP related.
You cannot use UDP. If the client sends a request and you don't reply immediately, a router is going to forget the reverse route in 30 seconds or less, so your server will never be able to reply to the client.
TCP is the only option, and it, too, will give you headaches. Most routers are going to forget the route and/or drop the connection after a few minutes, so your client/server code is going to have to send "keep alives" fairly often.
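One way to get periodic traffic on an otherwise idle connection is the kernel's TCP keepalive, sketched below; an application-level heartbeat message is the other common choice and is not shown. The function name enable_keepalive and the timing values are assumptions, and the three tuning constants are platform-specific (they exist on Linux), so the code applies them only when Python exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=20, probes=3):
    """Turn on TCP keepalive probes for `sock`.

    idle: seconds of silence before the first probe
    interval: seconds between probes
    probes: unanswered probes before the connection is considered dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (("TCP_KEEPIDLE", idle),
              ("TCP_KEEPINTVL", interval),
              ("TCP_KEEPCNT", probes))
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:  # skip constants this platform lacks
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

The idle time has to be shorter than the shortest timeout of any router or NAT device on the path, which per the answer above may be only a few minutes.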
I recommend setting up a "sniffer", to see how the phone companies are staying in touch with your smartphone for their "push" technology. Copy whatever they're doing, because that stuff works!
As Greg mentioned, the problem you are describing is C10K (or rather "C1M" in your case).
I recently made a simple TCP echo server on Linux that scales very well with the number of sessions (only tested up to 200,000 though), by using the epoll queue. On BSD, you have something similar called kqueue.
You can check out the code if you want to. Hope this helps and good luck!
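The linked code is not reproduced here. As an independent sketch of the same readiness-based design, the example below uses Python's selectors module, whose DefaultSelector picks epoll on Linux and kqueue on BSD. The function name serve_echo and the stop_event parameter are assumptions for illustration, and writes are simplified as noted in the comments.

```python
import selectors
import socket


def serve_echo(listener, stop_event, poll_timeout=0.1):
    """Echo bytes back on every connection accepted from `listener`, in one thread."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)
    try:
        while not stop_event.is_set():
            for key, _ in sel.select(timeout=poll_timeout):
                sock = key.fileobj
                if sock is listener:
                    conn, _ = listener.accept()
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ)
                    continue
                try:
                    data = sock.recv(4096)
                except (BlockingIOError, InterruptedError):
                    continue
                except ConnectionError:
                    data = b""
                if data:
                    # Simplification: assumes the send buffer has room. A real
                    # server would queue unsent bytes and wait for EVENT_WRITE.
                    sock.send(data)
                else:
                    sel.unregister(sock)
                    sock.close()
    finally:
        sel.close()
```

The point of the design is that one thread waits on all sockets at once and only touches those that are ready, so idle connections cost memory and a descriptor but no CPU.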
EDIT: As noted in the comments below, my original assertion that there is a 64K limit based on the number of ports is incorrect, however there is a 32K limit on the number of socket handles, so my suggested design is valid.
With a typical TCP/IP server design, you're limited in the number of simultaneous open connections you can have. The server has one listening port, and when a client connects to it the server makes an accept call, and that creates a new socket for the rest of the connection.
To handle more than 64K simultaneous connections I think you need to use UDP instead. You only need one port for the server to listen on, and you need to manage the connections using a 32-bit client ID in the packet data instead of having a separate port for each client. The 32-bit client ID could be the client's IP address, and the client can listen on a known UDP port for messages coming back from the server. That port would be the only one that needs to be open on the firewall.
With this approach, your only limitation is how quickly you can handle and respond to UDP messages. With millions of clients, even sparse traffic could give you large spikes, and if you don't read the packets fast enough your input queue will fill up and you'll start dropping packets. The C10K page Greg points to will give you strategies for that.
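The framing described in this answer, a 32-bit client ID at the front of each datagram and a single server socket, might look like the sketch below. All names (HEADER, pack_message, unpack_message, handle_one) are hypothetical, and the sketch does not address the NAT and router timeout concerns raised in an earlier answer.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 32-bit client ID in network byte order


def pack_message(client_id, payload):
    """Prefix `payload` (bytes) with the client ID."""
    return HEADER.pack(client_id) + payload


def unpack_message(datagram):
    """Split a datagram into (client_id, payload)."""
    (client_id,) = HEADER.unpack_from(datagram)
    return client_id, datagram[HEADER.size:]


def handle_one(sock, peers):
    """Receive one datagram, record the sender's address under its client ID,
    send an acknowledgement back, and return the client ID."""
    datagram, addr = sock.recvfrom(2048)
    client_id, payload = unpack_message(datagram)
    peers[client_id] = addr  # where to reach this client later
    sock.sendto(pack_message(client_id, b"ack:" + payload), addr)
    return client_id
```

Here the peers dictionary replaces per-connection sockets: the server keeps one descriptor open and looks up the last known address when it needs to push data to a given client.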

Comparing HTTP and FTP for transferring files

What are the advantages (or limitations) of one over the other for transferring files over the Internet?
(I am aware of secure forms of both protocols. I'd like to hear comparisons through personal experiences in terms of performance, reliability, file size limitations etc.)
Here's a performance comparison of the two. HTTP is more responsive for request-response of small files, but FTP may be better for large files if tuned properly. FTP used to be generally considered faster. FTP requires a control channel and state be maintained besides the TCP state but HTTP does not. There are 6 packet transfers before data starts transferring in FTP but only 4 in HTTP.
I think a properly tuned TCP layer would have more effect on speed than the difference between application layer protocols. The Sun Blueprint Understanding Tuning TCP has details.
Here's another good comparison of the individual characteristics of each protocol.
I just benchmarked a file transfer over both FTP and HTTP:
over two very good server connections
using the same 1GB .zip file
under the same network conditions (tested one after the other)
The result:
using FTP: 6 minutes
using HTTP: 4 minutes
using a concurrent http downloader software (fdm): 1 minute
So, basically under a "real life" situation:
1) HTTP is faster than FTP when downloading one big file.
2) HTTP can use parallel chunk download, which made it 6x faster than FTP in this test; the gain depends on the network conditions.
Many firewalls drop outbound connections which are not to ports 80 or 443 (http & https); some even drop connections to those ports that are not HTTP(S). FTP may or may not be allowed, not to speak of the active/PASV modes.
Also, HTTP/1.1 allows for much better partial requests ("only send from byte 123456 to the end of file"), conditional requests and caching ("only send if content changed/if last-modified-date changed") and content compression (gzip).
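To make the partial-request idea concrete, the sketch below builds Range header values: one for resuming from an offset, and a set of contiguous ranges that a parallel downloader could fetch separately. The helper names resume_header and range_headers are assumptions, and the server must support range requests for any of this to work.

```python
def resume_header(offset):
    """Range value meaning 'from byte `offset` to the end of the file'."""
    return "bytes=%d-" % offset


def range_headers(total_size, parts):
    """Split [0, total_size) into up to `parts` contiguous Range header values."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    parts = min(parts, total_size)
    base, extra = divmod(total_size, parts)
    headers = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges take one additional byte each.
        length = base + (1 if i < extra else 0)
        end = start + length - 1  # Range end offsets are inclusive
        headers.append("bytes=%d-%d" % (start, end))
        start = end + 1
    return headers
```

For example, range_headers(10, 3) yields "bytes=0-3", "bytes=4-6" and "bytes=7-9", and resume_header(123456) yields "bytes=123456-", matching the example quoted above.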
HTTP is much easier to use through a proxy.
From my anecdotal evidence, HTTP is easier to make work with dropped/slow/flaky connections; e.g. it is not needed to (re)establish a login session before (re)initiating transfer.
OTOH, HTTP is stateless, so you'd have to do authentication and building a trail of "who did what when" yourself.
The only difference in speed I've noticed is transferring lots of small files: HTTP with pipelining is faster (reduces round-trips, esp. noticeable on high-latency networks).
Note that HTTP/2 offers even more optimizations, whereas the FTP protocol has not seen any updates for decades (and even extensions to FTP have insignificant uptake by users). So, unless you are transferring files through a time machine, HTTP seems to have won.
(Tangentially: there are protocols that are better suited for file transfer, such as rsync or BitTorrent, but those don't have as much mindshare, whereas HTTP is Everywhere™)
One consideration is that FTP can use non-standard ports, which can make getting through firewalls difficult (especially if you're using SSL). HTTP is typically on a known port, so this is rarely a problem.
If you do decide to use FTP, make sure you read about Active and Passive FTP.
In terms of performance, at the end of the day they're both spewing files directly down TCP connections so should be about the same.
One advantage of FTP is that there is a standard way to list files using dir or ls. Because of this, FTP plays nicely with tools such as rsync. Granted, rsync is usually done over ssh, but the option is there.
Both of them use TCP as the transport protocol, but HTTP can use a persistent connection, which improves TCP performance.
