Assume I want to upload a file to a web server. Maybe even a rather big file (e.g. 30 MB). It's done with a typical file upload form (see minimal example below).
Now networks are not perfect. I see these types of errors as possible:
Bit flips can happen
Packets can get lost
The order in which packets arrive might not be the order in which they were sent
A packet could be received twice
Reading the TCP Wikipedia article, I see:
At the lower levels of the protocol stack, due to network congestion, traffic load balancing, or unpredictable network behaviour, IP packets may be lost, duplicated, or delivered out of order. TCP detects these problems, requests re-transmission of lost data, rearranges out-of-order data and even helps minimize network congestion to reduce the occurrence of the other problems. If the data still remains undelivered, the source is notified of this failure. Once the TCP receiver has reassembled the sequence of octets originally transmitted, it passes them to the receiving application. Thus, TCP abstracts the application's communication from the underlying networking details.
Reading that, the only reasons I can see why a downloaded file might be broken are (1) something went wrong after it was downloaded or (2) the connection was interrupted.
Am I missing something? Why do sites that offer Linux images often also provide an MD5 hash? Is the integrity of a file upload/download over HTTPS (and thus also over TCP) guaranteed or not?
Minimal File Upload Example
HTML:
<!DOCTYPE html>
<html>
  <head><title>Upload a file</title></head>
  <body>
    <form method="post" enctype="multipart/form-data">
      <input name="file" type="file" />
      <input type="submit" />
    </form>
  </body>
</html>
Python/Flask:
"""
Prerequesites:
$ pip install flask
$ mkdir uploads
"""
import os
from flask import Flask, flash, request, redirect, url_for
from werkzeug.utils import secure_filename
app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
#app.route("/", methods=["GET", "POST"])
def upload_file():
if request.method == "POST":
# check if the post request has the file part
if "file" not in request.files:
flash("No file part")
return redirect(request.url)
file = request.files["file"]
# if user does not select file, browser also
# submit an empty part without filename
if file.filename == "":
flash("No selected file")
return redirect(request.url)
filename = secure_filename(file.filename)
file.save(os.path.join(app.config["UPLOAD_FOLDER"], filename))
return redirect(url_for("upload_file", filename=filename))
else:
return """<!DOCTYPE html>
<html>
<head><title>Upload a file</title></head>
<body>
<form method="post" enctype="multipart/form-data">
<input name="file" type="file" />
<input type="submit"/>
</form>
</body>
</html>"""
return "upload handled"
if __name__ == "__main__":
app.run()
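For completeness, the form can also be exercised without a browser. The following is a minimal sketch using the third-party requests library (not part of the original example), assuming the Flask app above is running locally on port 5000 and that a file named example.bin exists in the current directory:
# Sketch: upload a local file to the Flask app above.
import requests

with open("example.bin", "rb") as f:
    # The field name "file" must match <input name="file"> in the form.
    resp = requests.post("http://127.0.0.1:5000/", files={"file": ("example.bin", f)})

print(resp.status_code)  # requests follows the redirect, so 200 means the upload was handled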
Is the integrity of file uploads/downloads guaranteed by TCP/HTTPS?
In short: No. But it is better with HTTPS than with plain TCP.
TCP only has very weak error detection, so it will likely detect simple bit flips and discard (and resend) the corrupted packet, but it will not detect more complex errors. HTTPS, though, has (through the TLS layer) pretty solid integrity protection, and undetected data corruption in transport is essentially impossible.
TCP also has robust detection and prevention of duplicates and reordering. TLS (in HTTPS) has even more robust detection of this kind of data corruption.
But it gets murky when the TCP connection simply closes early, for example if a server crashes. TCP by itself has no notion of a message, so a connection close is often used as an end-of-message indicator. This is true for FTP data connections, for example, but it can also be true for HTTP (and thus HTTPS). While HTTP usually has a length indicator (a Content-Length header, or explicit chunk sizes with Transfer-Encoding: chunked), it also defines the end of the TCP connection as an end of message. Clients vary in their behavior if the end of the connection is reached before the declared end of the message: some will treat the data as corrupted, others will assume a broken server (i.e. a wrong length declaration) and treat the connection close as a valid end of message.
In theory, TLS (in HTTPS) has a clear end-of-TLS message (TLS shutdown) which might help in detecting an early connection close. In practice, though, implementations might simply close the underlying socket without this explicit TLS shutdown, so one unfortunately cannot fully rely on it.
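One practical consequence of the above: a client can at least detect a truncated download by comparing the number of bytes it received with the declared Content-Length. A minimal sketch using the third-party requests library (the URL and filename are placeholders, not taken from the answer):
# Sketch: detect a truncated HTTPS download by checking Content-Length.
import requests

url = "https://example.com/somefile.iso"   # placeholder URL
received = 0
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    expected = resp.headers.get("Content-Length")
    with open("somefile.iso", "wb") as out:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            out.write(chunk)
            received += len(chunk)

if expected is not None and received != int(expected):
    print(f"Truncated: got {received} of {expected} bytes")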
Why do sites that offer Linux images often also provide an MD5 hash?
There is also another point of failure: the file might already have been corrupted before you download it. Download sites often have several mirrors, and the corruption might happen when the file is sent to a download mirror, or even when it is sent to the download master. Having a strong checksum alongside the download helps to detect such errors, as long as the checksum was created at the origin of the download and thus before the data corruption.
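To make that concrete, here is a small sketch of verifying a download against such a published checksum; the filename and expected digest are placeholders, and hashing in chunks avoids loading a large image into memory:
# Sketch: verify a downloaded image against a published SHA-256 checksum.
import hashlib

expected = "0123456789abcdef..."  # placeholder: value copied from the site's SHA256SUMS file

h = hashlib.sha256()
with open("ubuntu-16.04.iso", "rb") as f:       # placeholder filename
    for block in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
        h.update(block)

print("OK" if h.hexdigest() == expected else "MISMATCH")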
Related
I have a React front-end and Flask backend web application. In this web app, I upload large CSV files from the client to the server via HTTP multipart/form-data. To achieve this, I take the file in a <form encType='multipart/form-data'> element, with <input type='file'>. Then I use axios.post to make a POST request to the server.
On the Flask server side, I access the file using request.files['file'] and save it using file.save. This works as expected. The file is transferred successfully.
I'm thinking of computing an MD5 checksum on both the client and server side in order to make sure that both sides have files with the same MD5 hash. However, this requires reading the file in chunks from disk and computing the MD5 (since I'm dealing with large files, it is not possible to load the entire file into memory), so I think this is a little inefficient. I want to know whether this transfer via HTTP multipart/form-data provides a reliability guarantee. If so, can I skip the MD5 verification?
If reliability is not guaranteed, is there a good approach to make sure that both sides have the exact same file copy? Thanks in advance.
HTTP integrity is as reliable as the underlying transport protocol, be it TCP (HTTP/1 and 2) or UDP (HTTP/3). Bits can flip and still yield a valid checksum. This does happen.
If you want to make absolutely sure that you've received the same file as the uploader intended, you need to add a checksum yourself, using for example SHA or MD5.
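One way to do the server-side half of such a check without loading the whole upload into memory is to hash the incoming file stream in chunks. A Flask sketch, where the X-Content-SHA256 header name and the chunk size are assumptions made for illustration:
# Sketch: hash an uploaded file in chunks on the Flask side and compare it to a
# digest the client sent along (the "X-Content-SHA256" header is an assumption).
import hashlib
from flask import Flask, request, abort

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    file = request.files["file"]
    claimed = request.headers.get("X-Content-SHA256", "")

    h = hashlib.sha256()
    for chunk in iter(lambda: file.stream.read(1 << 20), b""):  # 1 MiB chunks
        h.update(chunk)  # a real handler would also write each chunk to disk here

    if claimed and h.hexdigest() != claimed:
        abort(400, "checksum mismatch")
    return "ok"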
Do downloads use HTTP? How can they resume downloads after they have been suspended for several minutes? Can they request a certain part of the file?
Downloads are done over either HTTP or FTP.
For a single, small file, FTP is slightly faster (though you'll barely notice a difference). For downloading large files, HTTP is faster due to automatic compression. For multiple files, HTTP is always faster due to reusing existing connections and pipelining.
Parts of a file can indeed be requested independently of the whole file, and this is actually how downloads work. This is a process known as 'Chunked Encoding'. A browser requests individual parts of a file, downloads them independently, and assembles them in the correct order once all parts have been downloaded:
In chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks". The chunks are sent out and received independently of one another. No knowledge of the data stream outside the currently-being-processed chunk is necessary for both the sender and the receiver at any given time.
And according to FTP vs HTTP:
During a "chunked encoding" transfer, the sending party sends a stream of [size-of-data][data] blocks over the wire until there is no more data to send and then it sends a zero-size chunk to signal the end of it.
This is combined with a process called 'Byte Serving' to allow for resuming of downloads:
Byte serving begins when an HTTP server advertises its willingness to serve partial requests using the Accept-Ranges response header. A client then requests a specific part of a file from the server using the Range request header. If the range is valid, the server sends it to the client with a 206 Partial Content status code and a Content-Range header listing the range sent.
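For illustration, resuming via byte serving amounts to sending a Range header and appending the 206 response to what is already on disk. A sketch with the third-party requests library (the URL and local filename are placeholders):
# Sketch: resume a partial download using an HTTP Range request.
import os
import requests

url = "https://example.com/somefile.iso"   # placeholder URL
local = "somefile.iso"                      # placeholder local filename

have = os.path.getsize(local) if os.path.exists(local) else 0
headers = {"Range": f"bytes={have}-"} if have else {}

with requests.get(url, headers=headers, stream=True) as resp:
    if resp.status_code == 206:   # server honored the Range header
        mode = "ab"               # append to the partial file
    else:
        mode = "wb"               # server sent the whole file again
    with open(local, mode) as out:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            out.write(chunk)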
Do downloads use HTTP?
Yes. Especially since major browsers have deprecated FTP.
How can they resume downloads after they have been suspended for several minutes?
Not all downloads can resume after that long. If the (TCP or SSL/TLS) connection has been closed, another one has to be initiated to resume the download. (If it's HTTP/3 over QUIC, then it's another story.)
Can they request a certain part of the file?
Yes. This can be done with Range Requests. But it requires server-side support (especially when the requested resource is provided by a dynamic script).
The other answer mentioning chunked transfer mistook it for the underlying mechanism of TCP. Chunked transfer is not designed for resuming partial downloads. It's designed for delimiting the message boundary when the Content-Length header is not present and the communicating parties wish to reuse the connection. It is also used when the protocol version is HTTP/1.1 and there's a trailer section (which is similar to the header section but comes after the message body). HTTP/2 and HTTP/3 have their own ways to convey trailers.
Even if multiple non-overlapping "chunks" of the resource are requested, they are encapsulated in a multipart/* message.
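As a side note on what chunked transfer is actually for: when a response body is produced incrementally and no Content-Length can be set up front, the body has to be delimited some other way, typically with Transfer-Encoding: chunked or by closing the connection. A Flask sketch (the route and generator are made up for illustration):
# Sketch: a streamed Flask response. With no Content-Length set, the server
# delimits the body either with Transfer-Encoding: chunked or by closing the connection.
from flask import Flask, Response

app = Flask(__name__)

@app.route("/stream")
def stream():
    def generate():
        for i in range(5):
            yield f"line {i}\n"   # body is produced piece by piece
    return Response(generate(), mimetype="text/plain")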
I'd like to know some kind of file checksum (like a SHA-256 hash, or anything else) when I start downloading a file from an HTTP server. It could be transferred as one of the HTTP response headers.
The HTTP ETag is something similar, but it's used only for invalidating the browser cache and, from what I've noticed, every site calculates it in a different way and it doesn't look like any hash I know.
Some software download sites provide various file checksums as separate files to download (for example, the latest Ubuntu 16.04 SHA1 hashes: http://releases.ubuntu.com/16.04/SHA1SUMS). Wouldn't it be easier to just include them in an HTTP response header and have the browser calculate the hash when the download ends (and not force the user to do it manually)?
I guess the whole HTTP-based Internet works because we're using the TCP protocol, which is reliable and ensures that the received bytes are exactly the same as the ones sent by the server. But if TCP is so "cool", why do we check file hashes manually (see the Ubuntu example above)? And a lot of things can go wrong during a file download (client/server disk corruption, file modification on the server side, etc.). If I'm right, everything could be fixed simply by passing the file hash at the start of the download.
The checksum provided separately from the file is used as an integrity check when doing a non-TLS or indirect transfer.
Maybe I know your doubt, because I had the same question about the checksums; let's figure it out.
There are two risks to be considered:
The file is broken during transfer
The file is changed by a hacker
And there are three protocols in this question:
HTTP protocol
SSL/TLS protocol
TCP protocol
Now we separate this into two situations:
1. File provider and client transfer the file directly: no proxy, no offline media (USB disk).
The TCP protocol promises: the data the client receives is exactly the same as the data the server sent, via checksums and ACKs.
The TLS protocol promises: the server is authenticated (it is truly ubuntu.com) and the data is not changed by any middleman.
So there is no need to add a checksum header to the HTTP protocol when doing HTTPS.
But when TLS is not enabled, forgery can happen: a bad guy in the middle gives a bad file to the client.
2. File provider and client transfer the file indirectly: via a CDN, a mirror, or offline means (USB disk).
Many sites like ubuntu.com use a third-party CDN to serve static files, where the CDN server is not managed by ubuntu.com.
http://releases.ubuntu.com/somefile.iso redirects to http://59.80.44.45/somefile.iso.
Now the checksum must be provided out-of-band, because the connection is not authenticated and we don't trust it. So a checksum header in the HTTP protocol is no help in this situation.
Digest is the standard header used to convey the checksum of a selected representation of a resource (that is, the payload body).
An example response with a digest:
200 OK
...
Digest: sha-256=X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=

{"hello": "world"}
Digest may be used in both requests and responses.
It's good practice to validate the data against the digest before processing it.
You can see the related page on the Mozilla website for an in-depth discussion of the payload body in HTTP.
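For illustration, a server could populate such a Digest header itself. A Flask sketch, where the route and response body are made up and the header value is the base64 encoding of the body's SHA-256 digest, as in the example above:
# Sketch: attach a Digest header (sha-256, base64-encoded) to a response.
import base64
import hashlib
from flask import Flask, Response

app = Flask(__name__)

@app.route("/hello")
def hello():
    body = b'{"hello": "world"}'
    digest = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    return Response(
        body,
        mimetype="application/json",
        headers={"Digest": f"sha-256={digest}"},
    )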
I guess that whole HTTP-based Internet is working, because we're using TCP protocol
No, the integrity on the web is ensured by TLS. Non-TLS communication should not be trusted. See RFC 8446.
The hashes on ubuntu.com and similar sites are there for two purposes:
check the integrity of the file (yes, hypothetically the browser could check it for you)
check the correctness of the file, to avoid tampering (e.g. an attacker could intercept your download request and serve you a malicious file; while you may be covered by HTTPS on the browser side, that would not be true for data at rest, e.g. on an external USB disk, and you may want to check its correctness by comparing the hashes)
I have a requirement to make legal documents available to mobile applications (e.g. Android, iPhone, etc.) via HTTP. Corruption can occur over HTTP (references: 1, 2). In my case it is imperative that the downloaded documents have not been corrupted during transmission.
One mechanism for ensuring integrity is to digitally sign the documents. This approach works well if the documents are XML, but the signing public key will need to be available to and trusted by the client.
Another mechanism is to create and store a checksum of the document (e.g. MD5). The client can download the document and the checksum, and then use the checksum to verify the document.
Question 1: Are there any other alternative mechanisms for ensuring the integrity?
Question 2: Does http have any built in mechanisms for ensuring downloaded data has not been corrupted during download?
Question 3: What is the statistical likelihood of document corruption during download over HTTP? (I would prefer this answer to be backed up by statistical data.)
As far as I know, HTTP itself does not have any built-in checksum mechanism, and your suggestion would work for ensuring the data is valid. The thing is, though, HTTP is generally implemented on top of the Transmission Control Protocol (TCP). TCP provides reliable communication between hosts.
Specifically, TCP itself implements error detection (using a checksum) and uses sequence numbers to ensure the data arrives in the order in which it was sent. If the host sending the data learns that the receiving host did not get the data, it will resend it.
If, however, the HTTP implementation on the device actually runs on top of the User Datagram Protocol (UDP), it isn't reliable. It is, however, unlikely that a device uses UDP for HTTP, or at least the unreliable version (there is a Reliable User Datagram Protocol).
Now, I couldn't find statistics or much information at all regarding corruption of an HTTP request. Depending on how mission-critical you deem this to be, treat corruption as if it will happen. There are mentions of downloaded files that end up corrupt. While these mostly seem to relate to ZIP files, I wouldn't think it is due to HTTP but rather to other things in between, like the device itself corrupting the information as it downloads.
Perhaps in your scenario it is best to add your own checksum if it is absolutely critical that your information arrives in one piece.
I want to deprecate (turn off and stop sending HTTP responses for) some old HTML & JS code that my clients have installed on their pages. Not all clients can update all of their webpages before we deprecate, but I have the OK to deprecate.
Simple example of what the code can look like:
Customer domain, customer.com, has HTML & JS on their pages:
<script src="http://mycompany.com/?customer=customer.com&..."></script>
We are considering configuring our switches to send a TCP RST response to incoming deprecated requests to http://mycompany.com/..., so my question is: are there any side effects (stalled page loading, for example) to configuring our switches to respond with a TCP RST on the incoming TCP connection? Obviously, I want the least (i.e. no) impact on a customer's site.
I have to think that RST is a fairly harsh mechanism for not replying to a single request. This request might be one of a hundred resources required to render one of your client pages, and if you tear down the connection, that connection cannot be re-used to request further resources. (See 19.7.1 in the HTTP/1.1 RFC: "Persistent connections are the default for HTTP/1.1 messages; we introduce a new keyword (Connection: close) for declaring non-persistence.")
Each new connection will require a new three-way handshake to set up, which might add half a second per failed request to one of the two connections the client is using to retrieve resources from your servers. What is the average latency between your servers and your customers? Multiply that by three to get the time for a new three-way handshake.
If you fail the requests at the HTTP protocol level instead (301? 302? 404? 410?), you can return a failure on the existing HTTP connection and save the three round trips needed to set up a new connection (which might also be for a resource that you're no longer interested in serving).
Plus, 410 ought to indicate that the browser shouldn't bother requesting the resource again (though I have no idea which browsers follow this advice). An RST-ed resource will probably be retried every single time it is requested.
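As an illustration of the HTTP-level alternative, here is a minimal Flask sketch that answers deprecated requests with 410 Gone instead of a TCP RST; the route is a placeholder for whatever path the old <script> tags request:
# Sketch: answer deprecated requests with 410 Gone at the HTTP level instead of
# a TCP RST. The "/" route is a placeholder for the old script endpoint.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def deprecated_endpoint():
    # 410 tells well-behaved clients the resource is gone permanently,
    # and the underlying connection stays usable for other requests.
    return "This endpoint has been deprecated.", 410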