I'd like to know some kind of file checksum (like a SHA-256 hash, or anything else) when I start downloading a file from an HTTP server. It could be transferred as one of the HTTP response headers.
The HTTP ETag is something similar, but it's used only for invalidating the browser cache and, from what I've noticed, every site calculates it in a different way, so it doesn't look like any hash I know.
Some software download sites provide various file checksums as separate files to download (for example, the SHA-1 hashes for the latest Ubuntu 16.04 release: http://releases.ubuntu.com/16.04/SHA1SUMS). Wouldn't it be easier to just include them in an HTTP response header and have the browser verify the checksum when the download ends, instead of forcing the user to do it manually?
I guess the whole HTTP-based Internet works because we're using the TCP protocol, which is reliable and ensures the received bytes are exactly the same as the ones sent by the server. But if TCP is so "cool", why do we check file hashes manually (see the Ubuntu example above)? A lot of things can go wrong during a file download (client/server disk corruption, file modification on the server side, etc.), and if I'm right, everything could be fixed simply by passing the file hash at the start of the download.
The checksum provided separately from the file is used for integrity checks when doing non-TLS or indirect transfers.
I think I understand your doubt, because I had the same question about checksums. Let's work it out.
There are two threats to consider:
The file gets corrupted during transfer
The file gets tampered with by an attacker
And three protocols involved in this question:
HTTP protocol
SSL/TLS protocol
TCP protocol
Now let's separate this into two situations:
1. The file provider and the client transfer the file directly: no proxy, no offline step (USB disk).
TCP promises that the data the client receives is exactly the data the server sent, using checksums and acknowledgements.
TLS promises that the server is authenticated (it really is ubuntu.com) and that the data is not changed by any middleman.
So there is no need for a checksum header at the HTTP level when using HTTPS.
But when TLS is not enabled, forgery can happen: a man in the middle can hand the client a bad file.
2. The file provider and the client transfer the file indirectly: via a CDN, a mirror, or an offline medium (USB disk).
Many sites like ubuntu.com use a third-party CDN to serve static files, and those CDN servers are not managed by ubuntu.com.
For example, http://releases.ubuntu.com/somefile.iso redirects to http://59.80.44.45/somefile.iso.
Now the checksum must be provided out-of-band, because that connection is not authenticated and we don't trust it. So a checksum header inside the same HTTP response is no help in this situation.
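To make the out-of-band check concrete, this is roughly what the manual verification amounts to. A minimal sketch, assuming a SHA1SUMS file in the coreutils `<hexdigest> *<filename>` format and a placeholder ISO name (a SHA256SUMS file works the same way with `algorithm="sha256"`):

```python
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so a large ISO never has to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Each line of a *SUMS file looks like "<hexdigest> *<filename>".
expected = {}
with open("SHA1SUMS") as sums:
    for line in sums:
        digest, name = line.split(maxsplit=1)
        expected[name.strip().lstrip("*")] = digest

iso = "ubuntu-16.04-desktop-amd64.iso"  # placeholder file name
if file_digest(iso) == expected[iso]:
    print("checksum OK")
else:
    print("checksum MISMATCH - do not trust this file")
```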
Digest is the standard header used to convey the checksum of a selected representation of a resource (that is, the payload body).
An example response with a digest:
>200 OK
>...
>Digest: sha-256=X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=
>
>{"hello": "world"}
Digest may be used in both requests and responses.
It's a good practice to validate the data against the digest before processing it.
You can see the related page on the Mozilla (MDN) website for an in-depth discussion of the payload body in HTTP.
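A minimal sketch of that validation step, assuming the server sends `Digest: sha-256=<base64>` as in the example above (the URL is a placeholder and the `requests` library is just one possible HTTP client):

```python
import base64
import hashlib
import requests

resp = requests.get("https://example.com/hello.json")   # placeholder URL
digest_header = resp.headers.get("Digest", "")

if digest_header.lower().startswith("sha-256="):
    claimed = digest_header.split("=", 1)[1]
    actual = base64.b64encode(hashlib.sha256(resp.content).digest()).decode()
    if actual != claimed:
        raise ValueError("payload body does not match the Digest header")

# only hand resp.content to the rest of the application after this check
```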
> I guess the whole HTTP-based Internet works because we're using the TCP protocol

No, integrity on the web is ensured by TLS. Non-TLS communication should not be trusted. See RFC 8446.
The hashes on ubuntu.com and similar sites are there for two purposes:
check the integrity of the file (yes, hypothetically the browser could check it for you)
check the correctness of the file, to detect tampering (e.g. an attacker could intercept your download request and serve you a malicious file; while HTTPS may cover you on the browser side, that is not true for data at rest, e.g. on an external USB disk, and you may want to check its correctness by comparing the hashes)
I have a React front-end and Flask backend web application. In this web app, I upload large CSV files from the client to the server via HTTP multipart/form-data. To achieve this, I take the file from a <form encType='multipart/form-data'> element with an <input type='file'>. Then I use axios.post to make a POST request to the server.
On the Flask server side, I access the file using request.files['file'] and save it using file.save. This works as expected; the file is transferred successfully.
I'm thinking of computing an MD5 checksum on both the client and server side in order to make sure that both sides have files with the same MD5 hash. However, this requires reading the file from disk in chunks and computing the MD5 (since I'm dealing with large files, it is not possible to load the entire file into memory), so I think this is a little inefficient. I want to know whether transfer via HTTP multipart/form-data provides a reliability guarantee. If so, can I ignore the MD5 verification?
If reliability is not guaranteed, is there any good approach to make sure that both sides have the exact same copy of the file? Thanks in advance.
HTTP integrity is only as reliable as the underlying transport protocol, be it TCP (HTTP/1 and 2) or UDP (HTTP/3). Bits can still get flipped in transit and yield a valid checksum. This does happen.
If you want to make absolutely sure that you've received the same file as the uploader intended, you need to add a checksum yourself, using for example SHA or MD5.
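If you do add that checksum, hashing the saved file in fixed-size chunks keeps memory use flat regardless of file size. A rough sketch of the Flask side (the /upload route, the destination path, and the "md5" form field are assumptions for illustration, not part of your existing code):

```python
import hashlib
from flask import Flask, request, jsonify

app = Flask(__name__)

def md5_of(path, chunk_size=1 << 20):
    """Hash the saved file in 1 MiB chunks so it never sits in memory whole."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["file"]
    path = "/tmp/upload.csv"            # placeholder destination
    f.save(path)
    digest = md5_of(path)
    # The client would send its own digest alongside the file, e.g. in a
    # form field named "md5" (the field name is an assumption).
    match = digest == request.form.get("md5")
    return jsonify({"md5": digest, "match": match})
```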
Do downloads use HTTP? How can they resume downloads after they have been suspended for several minutes? Can they request a certain part of the file?
Downloads are done over either HTTP or FTP.
For a single, small file, FTP is slightly faster (though you'll barely notice a difference). For downloading large files, HTTP is faster due to automatic compression. For multiple files, HTTP is always faster due to reusing existing connections and pipelining.
Parts of a file can indeed be requested independently of the whole file, and this is actually how downloads work. This is a process known as 'Chunked Encoding'. A browser requests individual parts of a file, downloads them independently, and assembles them in the correct order once all parts have been downloaded:
In chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks". The chunks are sent out and received independently of one another. No knowledge of the data stream outside the currently-being-processed chunk is necessary for both the sender and the receiver at any given time.
And according to FTP vs HTTP:
During a "chunked encoding" transfer, the sending party sends a stream of [size-of-data][data] blocks over the wire until there is no more data to send and then it sends a zero-size chunk to signal the end of it.
This is combined with a process called 'Byte Serving' to allow for resuming of downloads:
Byte serving begins when an HTTP server advertises its willingness to serve partial requests using the Accept-Ranges response header. A client then requests a specific part of a file from the server using the Range request header. If the range is valid, the server sends it to the client with a 206 Partial Content status code and a Content-Range header listing the range sent.
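You can watch byte serving happen with a couple of lines. A sketch using the `requests` library (the URL is a placeholder, and the server has to advertise Accept-Ranges for the 206 to appear):

```python
import requests

url = "https://example.com/big-file.iso"            # placeholder URL
resp = requests.get(url, headers={"Range": "bytes=0-1023"})

print(resp.status_code)                   # 206 if the server honoured the range
print(resp.headers.get("Content-Range"))  # e.g. "bytes 0-1023/1073741824"
print(len(resp.content))                  # 1024, not the size of the whole file
```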
Do downloads use HTTP?
Yes. Especially since major browsers have deprecated FTP support.
How can they resume downloads after they have been suspended for several minutes?
Not all downloads can resume after that long. If the (TCP or SSL/TLS) connection has been closed, another one has to be initiated to resume the download. (If it's HTTP/3 over QUIC, then it's another story.)
Can they request a certain part of the file?
Yes. This can be done with Range Requests. But it requires server-side support (especially when the requested resource is provided by a dynamic script).
The other answer mentioning chunked transfer has mistaken it for the underlying mechanism of TCP. Chunked transfer is not designed for resuming partial downloads. It's designed for delimiting the message boundary when the Content-Length header is not present and the communicating parties wish to reuse the connection. It is also used when the protocol version is HTTP/1.1 and there's a trailer fields section (which is similar to the header fields section, but comes after the message body). HTTP/2 and HTTP/3 have their own ways to convey trailers.
Even when multiple non-overlapping "chunks" (i.e. ranges) of the resource are requested, the response is encapsulated in a multipart/* message.
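To make the resume case concrete: the client checks how much of the file it already has on disk and asks only for the rest. A rough sketch (the `requests` library, the URL, and the file name are assumptions; the server must support range requests):

```python
import os
import requests

url = "https://example.com/big-file.iso"   # placeholder URL
path = "big-file.iso.part"                 # partially downloaded file

have = os.path.getsize(path) if os.path.exists(path) else 0
headers = {"Range": f"bytes={have}-"} if have else {}

with requests.get(url, headers=headers, stream=True) as resp:
    if resp.status_code == 206:        # server sends only the missing suffix
        mode = "ab"                    # append to the partial file
    else:
        resp.raise_for_status()        # 4xx/5xx (e.g. 416) become exceptions
        mode = "wb"                    # plain 200: Range ignored, start over
    with open(path, mode) as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```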
I have a requirement to make legal documents available to mobile applications (e.g. Android, iPhone, etc.) via HTTP. Corruption can occur over HTTP (references: 1, 2). In my case it is imperative that the downloaded documents have not been corrupted during transmission.
One mechanism for ensuring integrity is to digitally sign the documents. This approach works well if the documents are XML; however, the signing public key will need to be available to and trusted by the client.
Another mechanism is to create and store a checksum of the document (e.g. MD5). The client can download the document and the checksum, and then use the checksum to verify the document.
Question 1: Are there any other alternative mechanisms for ensuring the integrity?
Question 2: Does http have any built in mechanisms for ensuring downloaded data has not been corrupted during download?
Question 3: What is the statistical likelihood of document corruption during download over HTTP (I would prefer this answer to be backed up by statistical data)?
As far as I know, HTTP itself does not have any built-in checksum mechanism, and your suggestion would work for ensuring the data is valid. The thing is, though, that HTTP is generally implemented on top of the Transmission Control Protocol (TCP). TCP provides reliable communication between hosts.
Specifically, TCP implements error detection (using a checksum) and uses sequence numbers to ensure the data arrives in the order it was sent. If the sending host receives information that the receiving host did not get the data, it will resend.
If, however, the HTTP implementation on the device is actually running on top of the User Datagram Protocol (UDP), it isn't reliable. It is unlikely, though, that a device is using UDP for HTTP, or at least the unreliable variant (there is also a Reliable User Datagram Protocol).
Now, I couldn't find statistics or much information at all regarding corruption of HTTP transfers. Depending on how mission-critical you deem this to be, assume it will happen. There are reports of downloaded files ending up corrupt; while these mostly seem to relate to ZIP files, I wouldn't attribute that to HTTP but rather to other things in between, like the downloading device itself corrupting the data.
Perhaps in your scenario it is best to add your own checksum if it is absolutely critical that your information arrives in one piece.
I'm attempting to synchronize a set of files over HTTP.
For the moment, I'm using HTTP PUT, and sending files that have been altered. However, this is very inefficient when synchronizing large files where the delta is very small.
I'd like to do something closer to what rsync does to transmit the deltas, but I'm wondering what the best approach to do this would be.
I know I could use an rsync library on both ends, and wrap their communication over HTTP, but this sounds more like an antipattern; tunneling a standalone protocol over HTTP. I'd like to do something that's more in line with how HTTP works, and not wrap binary data (except my files, duh) in an HTTP request/response.
I've also failed to find any relevant/useful functionality already implemented in WebDAV.
I have total control over the client and server implementation, since this is a desktop-ish application (meaning "I don't need to worry about browser compatibility").
The HTTP PATCH recommended in a comment requires the client to keep track of local changes. You may not be able to do that due to the size of the file.
Alternatively, you could treat "chunks" of the huge file as resources; depending on the nature of the changes and the content of the file, the split could be by bytes, by chapters, whatever.
The client could query the hashes of all chunks, calculate the same for its local version, and PUT only the changed ones.
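A rough sketch of that chunk-as-resource idea. Everything here is hypothetical: the /chunks endpoint, the JSON shape of its response, and the fixed 1 MiB chunk size are assumptions for illustration, not an existing API:

```python
import hashlib
import requests

CHUNK = 1 << 20                             # 1 MiB chunks; size is arbitrary
url = "https://example.com/files/big.dat"   # placeholder resource

def local_chunks(path):
    """Yield (index, bytes) pairs for each fixed-size chunk of the local file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            yield index, data
            index += 1

# Hypothetical endpoint returning {"0": "<sha256>", "1": "<sha256>", ...}
remote_hashes = requests.get(url + "/chunks").json()

for index, data in local_chunks("big.dat"):
    digest = hashlib.sha256(data).hexdigest()
    if remote_hashes.get(str(index)) != digest:
        # PUT only the chunks whose content differs from the server's copy
        requests.put(f"{url}/chunks/{index}", data=data)
```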
I have some software which runs as a black box; I have no access to it. This software makes HTTP requests. What I want to do is intercept these requests, forward them on, catch the response, do something with it, and then pass the response back to the software.
Can this be done? What's the best method?
Thanks
Edit: Requests are to the public internet from a local intranet via a gateway/router. I have root access to my machine. Another machine could be used as intermediate gateway.
Edit 2: Requests are not encrypted. What I am actually trying to do is save down any images that are requested.
Try yellosoft-alchemy.
If the communication isn't encrypted, use Ethereal (now known as Wireshark, or any other similar sniffer) to sniff the communication on the wire.
Edit: since the communication isn't encrypted, you can do that easily with Ethereal. You can save each TCP stream independently from there.
Edit 2: OK, you want to do this automatically. In that case, I would suggest you look at two tools available on Linux called tcpflow and tcpreen.
tcpreen creates a proxy similar to what you want between a local port and a remote one. It's a TCP proxy, not an HTTP proxy, which means you'll have to write some parsing tool to isolate the HTTP streams that contain the images you want (probably based on the MIME type of the response). It's not too complex a task, though, if you understand how HTTP works.
tcpflow is similar to tcpreen except that it's a sniffer instead of a proxy. Use whichever tool you think is better suited to your environment.
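Once tcpflow has written each TCP stream to its own file, a small script can pick out the image responses by MIME type. A naive sketch (it assumes one HTTP response per stream file and no chunked or compressed bodies, which real traffic won't always give you):

```python
import os
import pathlib

def extract_image(stream_path, out_dir="images"):
    """Keep the body of a captured response if its Content-Type is image/*."""
    raw = pathlib.Path(stream_path).read_bytes()
    header_end = raw.find(b"\r\n\r\n")      # end of the HTTP header block
    if header_end == -1:
        return
    headers = raw[:header_end].decode("latin-1", errors="replace").lower()
    if "content-type: image/" not in headers:
        return
    body = raw[header_end + 4:]
    os.makedirs(out_dir, exist_ok=True)
    out = pathlib.Path(out_dir) / (pathlib.Path(stream_path).name + ".img")
    out.write_bytes(body)

# run it over whatever stream files tcpflow wrote into the current directory
for name in os.listdir("."):
    if os.path.isfile(name):
        extract_image(name)
```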