I want to know the exact mechanism behind the transfer of binary files using a browser. If the browser uses pure HTTP, does that mean only text is allowed, so the image is encoded using base64 and decoded later in the browser? Or does the browser download it using some other mechanism where this encoding/decoding is not needed?
Just in case someone wants to know the answer. While you could send binary data over HTTP using base64 encoding, it is not necessary and not efficient, since encoding and decoding add overhead and inflate the payload. When you request an image file over HTTP, the server sends response headers with metadata such as the MIME type and Content-Length, followed by the raw binary bytes of the image in the response body, all over the same HTTP connection. No base64 encoding or decoding is needed.
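To see this concretely, here is a minimal sketch using Python's standard library (the host and path are placeholders, not a real endpoint): the headers arrive as text, and the body arrives as raw bytes with no base64 step anywhere.

```python
# Minimal sketch: fetch an image and observe that the body is raw binary.
# "example.com" and "/logo.png" are placeholders.
import http.client

conn = http.client.HTTPSConnection("example.com")
conn.request("GET", "/logo.png")
resp = conn.getresponse()

# Metadata arrives in text headers...
print(resp.status, resp.getheader("Content-Type"), resp.getheader("Content-Length"))

# ...but the body is raw bytes, e.g. a PNG starts with b'\x89PNG\r\n\x1a\n'.
body = resp.read()
print(body[:8])
conn.close()
```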
I have a requirement of uploading a large file over HTTP to a remote server.
I am researching how to send the data using multipart/form-data.
I have gone through How does HTTP file upload work? and understood how it separates the file data using boundaries.
I wanted to know whether all the file data is sent in one go, or streamed across several requests to the remote server.
Because if it is sent in one go, it may not be possible to read the whole payload at once on the remote server and write it to a file.
But if it is streamed, how does the remote server parse the streamed data, write it to a file, and repeat until all the data has arrived?
Sorry if it is a noob question; I am still researching it myself.
Maybe it is outside the scope of multipart/form-data and HTTP itself takes care of it.
Any help is appreciated.
The logistics of the sending are not relevant. What matters is the maximum request size configured on the server side. How it is set depends on the technology used there: IIS, Apache, nginx? If the browser's POST request exceeds that size (because of a too-large file), errors will happen. There is nothing on the browser side you can tweak or change to fix breaking uploads. Unless you are building your own browser :-)
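To illustrate the server side of this, here is a minimal sketch with Python's standard library (the file name and port are made up, and real multipart/form-data parsing on the boundary is deliberately omitted): the upload is a single request, but its body arrives as a byte stream that can be read and written to disk in chunks rather than held in memory all at once.

```python
# Sketch of a server that consumes a large upload body incrementally.
# NOTE: boundary parsing for multipart/form-data is deliberately omitted;
# this only demonstrates chunk-by-chunk reading of one request body.
from http.server import BaseHTTPRequestHandler, HTTPServer

class UploadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        remaining = int(self.headers["Content-Length"])
        with open("upload.bin", "wb") as out:          # made-up destination
            while remaining > 0:
                chunk = self.rfile.read(min(64 * 1024, remaining))
                if not chunk:
                    break
                out.write(chunk)                       # write each piece as it arrives
                remaining -= len(chunk)
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8000), UploadHandler).serve_forever()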
When I click on a binary file (image files, PDF files, video files, etc.) in a browser to download it, does the server return these files in an HTTP response body? Does the HTTP protocol support a binary HTTP response body in the first place? Or does the browser use some other protocol internally to transfer these files?
Any reference (books, links) on how browsers work would be appreciated!
Does the HTTP protocol support a binary HTTP response body in the first place?
Yes. I believe the browser knows it is binary because of the Content-Type header in the response.
In a Wireshark capture of such a response, the body data shows up as raw binary.
You can see this data in the body of the response to an HTTP request for an image/x-icon.
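As a small sketch of that (Python standard library; the URL is a placeholder), a client reads the Content-Type header to learn how to interpret the raw bytes of the body:

```python
# Fetch a favicon and inspect the Content-Type header; the body is raw bytes.
# The URL is a placeholder.
from urllib.request import urlopen

with urlopen("https://example.com/favicon.ico") as resp:
    ctype = resp.headers.get("Content-Type")   # e.g. "image/x-icon"
    data = resp.read()                          # binary body, no decoding step
    print(ctype, len(data), data[:4])
```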
I've been looking at how to implement an API over HTTP that allows for online processing of the resource it returns. This resource could be a progressive JPEG, for example. In reading up on progressive JPEGs and how they are rendered in browsers, I never see any mention of this requiring chunked transfer encoding to work. If I'm understanding things correctly, I don't see how progressive JPEGs could be rendered before they are fully downloaded without the use of chunked transfer encoding. Is this correct?
Edit: To clarify why I think chunked encoding is needed: if you don't use chunked encoding to GET a progressive JPEG, then the browser or other application that sent the GET request for the JPEG wouldn't be passed the JPEG resource until it was fully received. With chunked encoding, on the other hand, as each chunk of the JPEG came in, the application (browser or otherwise) could render or otherwise process the portion received so far, instead of having nothing to work with until the full JPEG was downloaded.
the browser or other application that sent the GET request for the JPEG wouldn't be passed the JPEG resource until it was fully received
That's not true. Browsers can access resources they are downloading before the download completes.
In the end it's all received over a socket, and a proper abstraction layer lets the application code "stream" bytes from that socket as they arrive, packet by packet.
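A small sketch of that streaming (Python standard library; the URL is a placeholder and the decoder call is hypothetical). Note that it works with a plain Content-Length response; chunked transfer encoding is not required:

```python
# Consume a response body incrementally instead of waiting for all of it.
import http.client

conn = http.client.HTTPSConnection("example.com")   # placeholder host
conn.request("GET", "/photo-progressive.jpg")       # placeholder path
resp = conn.getresponse()

while True:
    piece = resp.read(16 * 1024)   # returns once some bytes are available
    if not piece:
        break                      # empty read means the body is complete
    # feed_to_progressive_decoder(piece)   # hypothetical incremental decoder
conn.close()
```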
I'd like to know some kind of file checksum (like a SHA-256 hash, or anything else) when I start downloading a file from an HTTP server. It could be transferred as one of the HTTP response headers.
The HTTP ETag is something similar, but it's used only for invalidating the browser cache, and from what I've noticed, every site calculates it in a different way; it doesn't look like any hash I know.
Some software download sites provide various file checksums as separate files to download (for example, the latest Ubuntu 16.04 SHA1 hashes: http://releases.ubuntu.com/16.04/SHA1SUMS). Wouldn't it be easier to just include them in an HTTP response header and have the browser calculate the hash when the download ends (rather than forcing the user to do it manually)?
I guess the whole HTTP-based Internet works because we're using the TCP protocol, which is reliable and ensures that the received bytes are exactly the same as the ones sent by the server. But if TCP is so "cool", why do we check file hashes manually (see the Ubuntu example above)? A lot of things can go wrong during a file download (client/server disk corruption, file modification on the server side, etc.). If I'm right, everything could be fixed simply by passing the file hash at the start of the download.
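For illustration, this is roughly what the question proposes, as a client-side sketch (Python standard library). The X-Checksum-Sha256 header is hypothetical; plain HTTP defines no such standard header, and the URL is a placeholder:

```python
# Hash the body while downloading and compare against a checksum announced
# up front. The "X-Checksum-Sha256" header is hypothetical.
import hashlib
from urllib.request import urlopen

with urlopen("https://example.com/big.iso") as resp:     # placeholder URL
    expected = resp.headers.get("X-Checksum-Sha256")      # hypothetical header
    h = hashlib.sha256()
    while chunk := resp.read(64 * 1024):
        h.update(chunk)
    print("match" if h.hexdigest() == expected else "mismatch")
```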
The checksum provided separately from the file is used for integrity checks when doing non-TLS or indirect transfers.
Maybe I know your doubt, because I had the same question about checksums; let's figure it out.
There are two risks to consider:
The file gets corrupted during transfer
The file gets changed by an attacker
And three protocols are involved in this question:
HTTP protocol
SSL/TLS protocol
TCP protocol
Now we separate into two situations:
1. The file provider and the client transfer the file directly: no proxy, no offline medium (USB disk).
The TCP protocol promises that the data the client receives is exactly the same as the data the server sent, via checksums and ACKs.
The TLS protocol promises that the server is authenticated (it is truly ubuntu.com) and that the data is not changed by any middleman.
So there is no need to add a checksum header to the HTTP protocol when using HTTPS.
But when TLS is not enabled, forgery can happen: a bad guy in the middle can give a bad file to the client.
2. The file provider and the client transfer the file indirectly: via a CDN, a mirror, or an offline medium (USB disk).
Many sites like ubuntu.com use a third-party CDN to serve static files, and those CDN servers are not managed by ubuntu.com.
http://releases.ubuntu.com/somefile.iso redirects to http://59.80.44.45/somefile.iso.
Now the checksum must be provided out-of-band, because the connection is not authenticated and we don't trust it. So a checksum header in the HTTP protocol is helpless in this situation.
Digest is the standard header used to convey the checksum of a selected representation of a resource (that is, the payload body).
An example response with a digest:

```
HTTP/1.1 200 OK
...
Digest: sha-256=X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=

{"hello": "world"}
```
Digest may be used both in requests and in responses.
It's good practice to validate the data against the digest before processing it.
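As a quick check (Python standard library), the Digest value in the example above is just the base64-encoded SHA-256 digest of the payload body:

```python
import base64
import hashlib

body = b'{"hello": "world"}'
digest = base64.b64encode(hashlib.sha256(body).digest()).decode()
print(digest == "X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=")  # True
```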
You can see the related page on the Mozilla website for an in-depth discussion of the payload body in HTTP.
I guess the whole HTTP-based Internet works because we're using the TCP protocol
No, integrity on the web is ensured by TLS. Non-TLS communication should not be trusted. See RFC 8446.
The hashes on ubuntu.com and similar sites are there for two purposes:
check the integrity of the file (yes, hypothetically the browser could check it for you)
check the correctness of the file, to detect tampering (e.g. an attacker could intercept your download request and serve you a malicious file; while you may be covered by HTTPS on the browser side, that is not true for data at rest, e.g. on an external USB disk, and you may want to verify its correctness by comparing the hashes, as sketched below)
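A minimal sketch of that manual comparison (Python standard library; the file names are illustrative):

```python
# Hash a downloaded ISO and look for the digest in the published SHA1SUMS file.
import hashlib

def sha1_of(path, chunk_size=64 * 1024):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

digest = sha1_of("ubuntu-16.04-desktop-amd64.iso")   # illustrative file name
with open("SHA1SUMS") as sums:
    # Each line looks like: "<hex digest> *<file name>"
    ok = any(line.split()[0] == digest for line in sums if line.strip())
print("verified" if ok else "hash not found - do not trust the file")
```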
I often hear people say "download with HTTP". What does that really mean, technically?
HTTP stands for Hyper Text Transfer Protocol. So to understand it literally, it is meant for transferring text. I used a sniffer tool to monitor the wire traffic, and what gets transferred is all ASCII characters. So I guess we have to convert whatever we want to download into characters before transferring it via HTTP. Using HTTP URL encoding? Or some binary-to-text encoding scheme such as base64? But that requires decoding on the client side.
I always thought it was TCP that can transfer any kind of data, so I am guessing "HTTP download" is a misused term. It arises because we view a web page via HTTP, find a downloadable link on that page, and click it to download. In fact, the browser opens a TCP connection to download it. Nothing about HTTP.
Could anyone shed some light?
The complete answer to "What does HTTP download exactly mean?" is in the RFC 2616 specification, which you can read here: https://www.rfc-editor.org/rfc/rfc2616
Of course that's a long (but very detailed) document.
I won't replicate or summarize its content here.
In the body of your question you are more specific:
So to understand it literally, it is meant for transferring text.
I think the word "TEXT" is misleading you.
And
have to convert whatever we want to download into characters before transferring it via HTTP
is false. You don't necessarily have to.
A file, for example a JPEG image, may be sent over the wire without any kind of encoding. See for example this: When a web server returns a JPEG image (mime type image/jpeg), how is that encoded?
Note that, optionally, compression or encoding may be applied (the most common case is GZIP for textual content like HTML, text, scripts...), but that depends on how the client and the server agree on how the data is to be transferred. That "agreement" is made with the Accept-Encoding and Content-Encoding headers in the request and the response, respectively.
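A small sketch of that negotiation (Python standard library; the URL is a placeholder). The client offers gzip via Accept-Encoding, and the server states what it actually applied via Content-Encoding:

```python
# urllib does not transparently decompress, so the Content-Encoding header
# tells us whether we must gunzip the body ourselves.
import gzip
from urllib.request import Request, urlopen

req = Request("https://example.com/", headers={"Accept-Encoding": "gzip"})
with urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)   # undo the transfer compression
    print(body[:60])
```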
I understand the name is misleading you, but if you read Hyper Text Transfer Protocol as a Transfer Protocol with Hypertext capabilities, then it changes a bit.
When HTTP was developed there were already lots of protocols (for example, the IP protocol, which is how data is widely transmitted between servers on the Internet), but there were no protocols that allowed for easy navigation between documents.
HTTP is a protocol that allows for the transfer of information AND for hypertext (i.e. links) embedded within text documents. These links don't necessarily have to point to other text documents, so you can basically transmit any information using HTTP (the sender and the receiver agree on the type of document being sent using something called the MIME type).
So the name still makes sense, even if you can send things other than text files.
HTTP stands for Hyper Text Transfer Protocol. So to understand it literally, it is meant for transferring text.
Yes, text transfer. Not necessarily plain text, but text nonetheless. It doesn't mean that the text has to be readable by a person, just by the computer.
I used a sniffer tool to monitor the wire traffic, and what gets transferred is all ASCII characters.
Your sniffer tool knows that you're a person, so it won't just present you with 0s and 1s. It converts whatever it captures to ASCII characters to make it readable to you. All communication over the wire is binary; the ASCII representation is just there for your sake.
So I guess we have to convert whatever we want to download into characters before transferring it via HTTP
No, not at all. Again, it's text – not necessarily plain text.
I always thought it was TCP that can transfer any kind of data, [...]
Here you're right. TCP does transfer all the data, but in a completely different layer. To understand this, let's look at the OSI model.
When you send anything over the network, your data goes through all the different layers. First, the application layer. Here we have HTTP and several others. Everything you send over HTTP goes through the layers, down through presentation and all the way to the physical layer.
So when you say that TCP transfers the data, you're right (HTTP could run over other transport protocols such as UDP, but that is rarely seen), but TCP carries all your data whether you download a file from a web server, copy a shared folder between computers on your local network, or send an email.
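To make that layering concrete (Python standard library; the host is a placeholder), an HTTP request is nothing more than bytes written to an ordinary TCP socket:

```python
# Speak HTTP by hand over a plain TCP socket.
import socket

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    response = b""
    while chunk := sock.recv(4096):        # TCP hands us the reply as raw bytes
        response += chunk

head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("ascii", errors="replace"))   # status line and headers
```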
HTTP can transfer "binary" data just fine. There is no need to convert anything.
HTTP is the protocol used to transfer your data; in your case, the file you are downloading.
You can either do that (open another type of connection) or send the data as raw bytes in the HTTP response body; what you see when opening the file in a text editor is just those same bytes rendered as characters. Your browser decides to save the file to your Downloads folder (or wherever you want it) because it sees that the file type is not one it can display (.rar, .zip).
If you look at the OSI model, HTTP is a protocol that lives in the application layer. So when you hear that someone uses "HTTP to transfer data", they are referring to an application layer protocol. Alternatives would be FTP or NFS, for example.
The browser indeed opens a TCP connection when HTTP is used. TCP lives in the transport layer and provides a reliable connection on top of IP.
The HTTP protocol provides different verbs that can be used to retrieve and send data; GET and POST are the most common ones. Look up REST.
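For example, the two most common verbs look like this with Python's standard library (the URLs are placeholders):

```python
from urllib.request import Request, urlopen

# GET: retrieve a representation of a resource.
with urlopen("https://example.com/users/42") as resp:
    print(resp.status, resp.read()[:60])

# POST: send data to the server in the request body.
req = Request("https://example.com/users",
              data=b'{"name": "alice"}',
              headers={"Content-Type": "application/json"})
with urlopen(req) as resp:
    print(resp.status)
```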