How is a browser able to resume downloads? - http

Do downloads use HTTP? How can they resume downloads after they have been suspended for several minutes? Can they request a certain part of the file?

Downloads are done over either HTTP or FTP.
For a single, small file, FTP is slightly faster (though you'll barely notice a difference). For downloading large files, HTTP is faster due to automatic compression. For multiple files, HTTP is always faster because it reuses existing connections and can pipeline requests.
Parts of a file can indeed be requested independent of the whole file, and this is actually how downloads work. This is a process known as 'Chunked Encoding'. A browser requests individual parts of a file, downloads them independently, and assembles them in the correct order once all parts have been downloaded:
In chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks". The chunks are sent out and received independently of one another. No knowledge of the data stream outside the currently-being-processed chunk is necessary for both the sender and the receiver at any given time.
And according to FTP vs HTTP:
During a "chunked encoding" transfer, the sending party sends a stream of [size-of-data][data] blocks over the wire until there is no more data to send and then it sends a zero-size chunk to signal the end of it.
This is combined with a process called 'Byte Serving' to allow for resuming of downloads:
Byte serving begins when an HTTP server advertises its willingness to serve partial requests using the Accept-Ranges response header. A client then requests a specific part of a file from the server using the Range request header. If the range is valid, the server sends it to the client with a 206 Partial Content status code and a Content-Range header listing the range sent.
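As a concrete sketch of that exchange, the following Python snippet resumes an interrupted download by requesting everything from the byte where the previous attempt stopped; the URL, file name and offset are placeholders:

import urllib.request

url = "http://example.com/big-file.iso"     # placeholder URL
already_have = 1_000_000                     # bytes saved by the interrupted download

req = urllib.request.Request(url, headers={"Range": "bytes=%d-" % already_have})
with urllib.request.urlopen(req) as resp:
    if resp.status == 206:                   # 206 Partial Content: the server honoured the range
        print("resuming:", resp.headers.get("Content-Range"))
        with open("big-file.iso", "ab") as f:      # append to the partial file
            f.write(resp.read())
    else:                                    # 200 OK: range ignored, the whole file follows
        with open("big-file.iso", "wb") as f:
            f.write(resp.read())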

Do downloads use HTTP?
Yes. Especially since major browsers have deprecated and removed FTP support.
How can they resume downloads after they have been suspended for several minutes?
Not all downloads can be resumed after that long. If the (TCP or SSL/TLS) connection has been closed, a new one has to be established to resume the download. (If it's HTTP/3 over QUIC, that's another story.)
Can they request a certain part of the file?
Yes. This can be done with Range requests, but it requires server-side support (especially when the requested resource is generated by a dynamic script).
The other answer mentioning chunked transfer has mistaken it for the underlying mechanism of TCP. Chunked transfer is not designed for resuming partial downloads. It is designed to delimit the message boundary when the Content-Length header is not present and the communicating parties wish to reuse the connection. It is also used when the protocol version is HTTP/1.1 and there is a trailer section (which is similar to the header section, but comes after the message body). HTTP/2 and HTTP/3 have their own ways to convey trailers.
Even if multiple non-overlapping ranges of the resource are requested, they are encapsulated in a single multipart/* message.
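For example, asking for two non-overlapping ranges in one request yields a single 206 response with a multipart/byteranges body, assuming the server supports multi-range requests at all; a rough sketch with a placeholder URL:

import urllib.request

req = urllib.request.Request(
    "http://example.com/big-file.iso",              # placeholder URL
    headers={"Range": "bytes=0-4, 10-14"},           # two non-overlapping ranges
)
with urllib.request.urlopen(req) as resp:
    # A server that honours multi-range requests answers 206 Partial Content
    # with Content-Type multipart/byteranges; each part carries its own
    # Content-Range header and is delimited by a boundary string.
    print(resp.status, resp.headers.get("Content-Type"))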

Related

HTTP Request parsing

I would like to understand the usage of 'Transfer-encoding: Chunked' in case of HTTP requests.
Is it common for requests to be chunked?
My thinking is no: since requests need to be completely read before processing, it does not make sense to send chunked requests.
It is not that common, but it can be very useful for large request bodies.
My thinking is no: since requests need to be completely read before processing, it does not make sense to send chunked requests.
(1) No, they don't need to be read completely.
(2) ...and the main reason to compress it is to save bytes on the wire anyway.
For an HTTP agent acting as a reverse or forward proxy, i.e. taking a message from one side and sending it on to the other, chunked transmission means you can forward the parts of the message you already have without storing them locally. You avoid the buffering problems: slowdown and storage.
You also get some optimizations based on each actor's preferred data block size: one actor may like sending blocks of 8000 bytes because that suits its kernel settings (TCP windows, internal HTTP server buffer size, etc.), while another actor along the transmission path uses smaller chunks of 2048 bytes.
Finally, you do not need to compute the size of the message; it simply ends at the end-of-stream marker. That is also useful if you are sending something compressed on the fly, where you may not know the final size until everything has been compressed.
Chunked transmission is used a lot. It is the default mode of most HTTP servers if you ask for HTTP/1.1 mode and not HTTP/1.0.
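As a client-side sketch of the on-the-fly case: the third-party requests library switches to Transfer-Encoding: chunked when the request body is a generator whose total length it cannot know up front. The URL and file name below are placeholders:

import zlib
import requests

def compressed_body(path):
    """Gzip a file on the fly; the final size is unknown until the end."""
    gz = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)   # 16+MAX_WBITS = gzip container
    with open(path, "rb") as f:
        while True:
            block = f.read(8000)             # block size tuned to taste, as described above
            if not block:
                break
            data = gz.compress(block)
            if data:
                yield data
    yield gz.flush()

# Because the body is a generator with no known Content-Length, requests sends it
# with Transfer-Encoding: chunked and terminates it with the zero-size chunk.
resp = requests.post(
    "http://example.com/upload",             # placeholder URL
    data=compressed_body("big-input.bin"),   # placeholder file name
    headers={"Content-Encoding": "gzip"},
)
print(resp.status_code)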

How to tell a proxy a connection is still used using HTTP communication?

I have a client-side GUI app for human usage that consumes some SOAP web services and uses cURL as the underlying HTTP communication lib. Depending on the input, processing a request can take a large amount of time, even one hour. Neither the client nor the server times out on its own for that reason; that's tested and works. Most requests get processed in a few minutes anyway, so this is an edge case.
One of my users is forced to use a proxy between my client app and my server and for various reasons has no control over it. That proxy has a timeout configured and closes the connection to my client after 4 minutes of no data transfer. So the user can (and did) upload data for e.g. 30 minutes; afterwards the server starts to process the data, and after 4 minutes the proxy closes the connection. The server silently continues to process the request, but the user is left with an error message AND won't get the processing result. My app already uses TCP keep-alive, so that shouldn't be the problem; the timeout seems to apply to higher-level data instead. It behaves like Squid's read_timeout option, which I used to reproduce the behaviour in our internal setup.
What I would like to do now is start a background thread in my web service which simply outputs some garbage data to my client the whole time the request is being processed, which is ignored by the client and tells the proxy that the connection is still active. I can recognize my client using the user agent and can configure server-side whether to output that data or not, so other clients consuming the web service wouldn't have a problem.
What I'm asking is whether there's any HTTP-compliant method to output such garbage data before the actual HTTP response. So, for example, would it be enough to simply output \r\n without any additional content over and over again to be HTTP-compliant with all requesting libs? Or maybe even a binary 0? Or some full-fledged HTTP headers stating something like "real answer about to come, please be patient"? From my investigation this pretty much sounds like chunked HTTP encoding, but I'm not sure yet if this is applicable.
I would like to have the following, where all those "Wait" stuff is simply ignored in the end and the real HTTP response at the end contains Content-Length and such.
Wait...\r\n
Wait...\r\n
Wait...\r\n
[...]
HTTP/1.1 200 OK\r\n
Server: Apache/2.4.23 (Win64) mod_jk/1.2.41\r\n
[...]
<?xml version="1.0" encoding="UTF-8"?><soap:Envelope[...]
Is that possible in some standard HTTP way and if so, what's the approach I need to take? Thanks!
HTTP Status 102
Isn't HTTP Status 102 exactly what I need? As I understand the spec, I can simply print that response line over and over again until the final response is available?
HTTP status 102 was a dead end. Two things might work, depending on the proxy used. First, an NPH script can be used to regularly print headers directly to the client. The important thing is that NPH scripts normally bypass the web server's header buffers and can therefore be sent over the wire as needed. They "only" need to be correct HTTP headers, and depending on the web server and proxy it might be a good idea to create incrementing, unique headers, simply by adding a counter to the header name.
The second thing is chunked transfer-encoding, in which case small chunks of dummy data can be printed to the client in the response body. The good thing is that such small amounts of data can be sent over the wire as needed using server-side flushes; the bad thing is that the client receives this data and by default treats it as part of the expected response body. That might break the application, of course, but most HTTP libs provide callbacks for processing received data, and if you print something unique, the client should be able to filter the garbage out.
In my case the web service spawns a background thread, and depending on the entry point requested it prints either NPH headers or chunks of data. In both cases the data can be the same, so an NPH header can be used as the chunked dummy data as well.
My NPH solution doesn't work with Squid, but the chunked one does. The problem with Squid is that its read_timeout setting is not a low-level setting about receiving any data on the connection at all, but some logical HTTP thing. This means that Squid does receive my headers, but it expects a complete HTTP header section within the period defined by read_timeout. With my NPH approach that is not the case, simply because by design I only send some garbage headers to ignore until the real headers arrive.
Additionally, one has to be careful about NPH in Apache httpd, but in my use case it works. I can see the individual headers in Squid's log, without any garbage after the response body or such. Avoid the Action directive.
Apache2 sends two HTTP headers with a mapped "nph-" CGI
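For the chunked variant, here is a rough WSGI sketch of the idea; the slow_job helper and the 60-second filler interval are made up for illustration, and it assumes deployment behind an HTTP/1.1 server that streams the iterable without buffering (so the body goes out chunked because no Content-Length is set):

import time
from queue import Empty, Queue
from threading import Thread

def slow_job(result):
    """Stands in for the long-running SOAP processing (hypothetical)."""
    time.sleep(1800)
    result.put(b'<?xml version="1.0" encoding="UTF-8"?><soap:Envelope>...</soap:Envelope>')

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/xml")])
    result = Queue()
    Thread(target=slow_job, args=(result,), daemon=True).start()

    def body():
        while True:
            try:
                yield result.get(timeout=60)   # the real response, whenever it is ready
                return
            except Empty:
                # Filler sent well inside the proxy's 4-minute idle window.
                # The client has to recognise and strip it (e.g. in a cURL
                # write callback) before handing the XML to the SOAP parser.
                yield b" "
    return body()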

Checksum in HTTP response header - why not?

I'd like to know some kind of file checksum (like a SHA-256 hash, or anything else) when I start downloading a file from an HTTP server. It could be transferred as one of the HTTP response headers.
The HTTP ETag is something similar, but it's used only for invalidating the browser cache and, from what I've noticed, every site calculates it in a different way and it doesn't look like any hash I know.
Some software download sites provide various file checksums as separate files to download (for example, the latest Ubuntu 16.04 SHA-1 hashes: http://releases.ubuntu.com/16.04/SHA1SUMS). Wouldn't it be easier to just include them in an HTTP response header and have the browser calculate the hash when the download ends (instead of forcing the user to do it manually)?
I guess the whole HTTP-based Internet works because we're using the TCP protocol, which is reliable and ensures the received bytes are exactly the same as the ones sent by the server. But if TCP is so "cool", why do we check file hashes manually (see the Ubuntu example above)? A lot of things can go wrong during a file download (client/server disk corruption, file modification on the server side, etc.). And if I'm right, everything could be fixed simply by passing the file hash at download start.
The checksum provided separately from the file is used for integrity checks when doing non-TLS or indirect transfers.
I had the same question about checksums, so let's work through it.
There are two risks to consider:
The file being corrupted during transfer
The file being tampered with by an attacker
And three protocols involved in this question:
HTTP protocol
SSL/TLS protocol
TCP protocol
Now we separate into two situations:
1. File provider and client transfer the file directly: no proxy, no offline medium (USB disk).
TCP promises that the data the client receives is exactly the same as the data the server sent, via checksums and acknowledgements.
TLS promises that the server is authenticated (it really is ubuntu.com) and that the data is not changed by any middleman.
So there is no need to add a checksum header to the HTTP response when using HTTPS.
But when TLS is not enabled, forgery can happen: a man in the middle can give a bad file to the client.
2. File provider and client transfer the file indirectly: via a CDN, a mirror, or an offline medium (USB disk).
Many sites like ubuntu.com use a third-party CDN to serve static files, and the CDN servers are not managed by ubuntu.com.
http://releases.ubuntu.com/somefile.iso redirects to http://59.80.44.45/somefile.iso.
Now the checksum must be provided out of band, because the connection is not authenticated and we don't trust it. So a checksum header in the HTTP response is of no help in this situation.
Digest is the standard header used to convey the checksum of a selected representation of a resource (that is, the payload body).
An example response with digest.
200 OK
...
Digest: sha-256=X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=

{"hello": "world"}
Digest may be used both in request and responses.
It's a good practice to validate the data against the digest before processing it.
You can see the related page on the Mozilla website for an in-depth discussion of the payload body in HTTP.
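A minimal verification sketch in Python, assuming the server sends a Digest header in the RFC 3230 style shown above (i.e. the Base64 encoding of the raw SHA-256 of the payload); the URL is a placeholder:

import base64
import hashlib
import urllib.request

with urllib.request.urlopen("https://example.com/data.json") as resp:   # placeholder URL
    body = resp.read()
    digest_header = resp.headers.get("Digest", "")   # e.g. "sha-256=X48E9qOo..."

# The header can list several algorithms, comma-separated; pick sha-256 if present.
digests = dict(item.strip().split("=", 1) for item in digest_header.split(",") if "=" in item)
expected = digests.get("sha-256")
actual = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")

if expected is not None and expected != actual:
    raise ValueError("payload does not match the Digest header")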
I guess the whole HTTP-based Internet works because we're using the TCP protocol
No, the integrity on the web is ensured by TLS. Non-TLS communication should not be trusted. See RFC 8446.
The hashes on ubuntu.com and similar sites are there for two purposes:
check the integrity of the file (yes, hypothetically the browser could check it for you)
check the correctness of the file, to detect tampering (e.g. an attacker could intercept your download request and serve you a malicious file; while you may be covered by HTTPS on the browser side, that is not true for data at rest, e.g. on an external USB disk, and you may want to check its correctness by comparing the hashes)

How does HTTP/2 provide faster browsing speed compared to HTTP/1.1?

I was reading an article on the launch of HTTP/2. It said that HTTP/2 is based on the SPDY (speedy) protocol and that it can provide faster browsing than HTTP/1.1 by using "header field compression" and "multiplexing". How exactly do these techniques work?
Am I supposed to believe that in HTTP/1.1 requests are processed in a 'one after the other' manner?
Multiplexing
With HTTP/1.1, a lot of time is spent just waiting. A browser sends a request and waits for the response to come back before sending another GET, and so on. That is an inefficient use of the bandwidth. At times pipelining is used, but it too suffers from requests sometimes having to wait for the requests issued before them: the head-of-line blocking problem.
With multiplexing there's virtually no waiting: the browser can ask for hundreds of things at once, and they will be delivered in whatever order they can be delivered, without individual streams or objects having to wait for each other. (Prioritization and flow control help keep them under control.)
This will be most noticeable on high-latency connections. For a clear and visible demo of what it can do, see golang's gophertiles demo at https://http2.golang.org/gophertiles?latency=1000 (requires an HTTP/2-enabled browser).
Header compression
Additionally, HTTP/2 offers header compression, which lets a client squeeze more requests into the early part of a TCP connection's lifetime. During the slow-start period of a new TCP connection it is valuable to cram in more requests so that responses come back earlier. HTTP headers are extremely repetitive in nature.
Server push
An HTTP/2 server can send data to the client as if the client had asked for it, before the client asks for it. If the server thinks the client is likely to want or need that resource too, pushing it can save half a round trip.
Starting with HTTP/2, both headers and HTTP response content can be compressed. With HTTP/1.1, headers are never compressed, unlike the content (specified with the Content-Encoding header).
Multiplexing is related to server push. When the server sends an HTML page, it can use the same connection to push additional resources such as CSS and JavaScript files. If the HTML page needs those additional resources, no further requests have to be sent to the server, as they have already been pushed.
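As an illustration of multiplexing from the client side, here is a small sketch using the third-party httpx library (installed with its HTTP/2 extra, e.g. pip install 'httpx[http2]'); the URLs are placeholders, and it assumes the server actually speaks HTTP/2:

import asyncio
import httpx

async def main():
    urls = ["https://example.com/tile%d.png" % i for i in range(50)]   # placeholder URLs
    async with httpx.AsyncClient(http2=True) as client:
        # All 50 requests are in flight at once; against an HTTP/2 server they
        # share a single TCP+TLS connection instead of queueing as in HTTP/1.1.
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    for r in responses[:3]:
        print(r.http_version, r.status_code, r.url)

asyncio.run(main())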

How can I download a single file from multiple locations via HTTP?

I need to download a big file quickly, but all the sources I can find have throttled bandwidth. Each of them seems to support HTTP/1.1 byte serving (range requests), since I can pause and resume the downloads. How can I download it from multiple sources in parallel?
Assuming this is a programming question (given that this is StackOverflow) I am going to explain how instead of just linking to a download accelerator that takes advantage of this.
What is needed in terms of the server to do this?
A server that supports Range HTTP header.
A server that allows concurrent connections. It is possible to support Range while not allowing multiple simultaneous connections, by using either endpoint- or IP-based restrictions server side. For this reason, I recommend you set up a simple test server instead of downloading from a file-sharing site while testing this.
What is the Range Header?
Data transmission over HTTP is sent in order starting from the beginning of the file if the Range header is not set. The first byte of the file on the server will be the first byte of the HTTP response and the last byte of the file on the server will be the last byte of the HTTP response. The Range header allows you to specify where the bytes should start sending from allowing you to "skip" the beginning of the response.
Actual Answer Example
Our Situation
The response is plain text. The response content is just "StackOverflow!!", encoded in ASCII, meaning each character is one byte. Therefore, the Content-Length header's value is 15 octets (another term for bytes).
We are going to download this file using 3 requests. For the sake of this example, we will say it is 3 times faster, but you should realize that this method will make downloads slower for very small files, because HTTP headers must be sent with each request, as well as the 3-way handshake. We will also assume that the server supports HEAD requests and that the Content-Length header is sent with the download response. Finally, the requests will be performed using GET; there are workarounds for POST as well.
Juicy Details
First, perform an HTTP HEAD request. Take the Content-Length header and divide its value by the number of concurrent connections you wish to make. For this example, the Content-Length is 15 and we wish to make 3 connections, so the divided value is 5.
Now perform the requests in parallel. For each request, set the Range header to "Range: bytes=" followed by (the number of requests already made) times the divided value, then a "-", then that start value plus the divided value minus one (byte ranges are inclusive of both end points).
For this example, each request should have the header set as follows.
Range: bytes=0-4
Range: bytes=5-9
Range: bytes=10-14
The response of each of these requests should be
Stack
Overf
low!!
In essence, we are just conforming to the Range header specification (section 14.35 of RFC 2616) as well as the byte-range specification (section 3.12 of RFC 2616).
Finally, append the bytes of each request to form the final response data.
Disclaimer: I've never actually tried this but it should work in theory
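In the same spirit, here is a rough Python sketch of the procedure above, using the third-party requests library and a thread pool; the URL is a placeholder, and it assumes the server answers HEAD with Content-Length and honours Range (otherwise the 206 check fails):

from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://example.com/big-file.iso"    # placeholder URL
CONNECTIONS = 3

def fetch_range(start, end):
    # Byte ranges are inclusive: "bytes=0-4" means the first five bytes.
    r = requests.get(URL, headers={"Range": "bytes=%d-%d" % (start, end)})
    assert r.status_code == 206, "server did not honour the Range header"
    return r.content

size = int(requests.head(URL).headers["Content-Length"])
part = -(-size // CONNECTIONS)              # ceiling division; the last part takes the remainder

ranges = [(i * part, min((i + 1) * part, size) - 1) for i in range(CONNECTIONS)]
with ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
    parts = list(pool.map(lambda rng: fetch_range(*rng), ranges))

data = b"".join(parts)                      # reassemble in request order
assert len(data) == size

Ceiling division makes the last range absorb any remainder, so concatenating the parts in order reproduces the original file.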
I can't say whether wget is able to put a file back together again if it is fetched from multiple sources.
The following example shows how to do it with aria2c.
You would build a download description file and then pass that to aria, like so:
aria2c -i uri.txt --split=5 --min-split-size=1M --max-connection-per-server=5
where uri.txt might contain
http://a.com/file1.iso http://mirror-1.com/file1.iso http://mirror-2.com/file1.iso
dir=/downloads
out=file1.iso
This would fetch the same file, from 3 different locations and place it into the downloads folder (dir) with the name file1.iso (out).
