What does HTTP download exactly mean? - http

I often hear people say download with HTTP. What does it really mean technically?
HTTP stands for Hyper Text Transfer Protocol. So to understand it literally, it is meant for text transferring. And I used a sniffer tool to monitor the wire traffic. What gets transferred are all ASCII characters. So I guess we have to convert whatever we want to download into characters before transferring it via HTTP. Using HTTP URL encoding? Or some binary-to-text encoding scheme such as base64? But that requires some decoding on the client side.
I always think it is TCP that can transfer whatever data, so I am guessing "HTTP download" is a misused term. It arises because we view a web page via HTTP, find some downloadable link on that page, and then click it to download. In fact, the browser opens a TCP connection to download it. Nothing about HTTP.
Could anyone shed some light?

The complete answer to "What does HTTP download exactly mean?" is in the RFC 2616 specification, which you can read here: https://www.rfc-editor.org/rfc/rfc2616
Of course that's a long (but very detailed) document.
I won't replicate or summarize its content here.
In the body of your question you are more specific:
So to understand it literally, it is meant for text transferring.
I think the word "TEXT" is misleading you.
And
have to convert whatever we want to download into characters before transferring it via HTTP
is false. You don't necessarily have to.
A file, for example a JPEG image, may be sent over the wire without any kind of encoding. See for example this: When a web server returns a JPEG image (mime type image/jpeg), how is that encoded?
Note that compression or another encoding may optionally be applied (the most common case is gzip for textual content such as HTML, text, scripts...), but that depends on what the client and the server agree on. That "agreement" is made with the "Accept-Encoding" header in the request and the "Content-Encoding" header in the response.
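To make that agreement concrete, here is a minimal sketch in Python. The function names (build_response, read_body) are made up for illustration, and the Accept-Encoding check is deliberately simplified: it ignores q-values, which a real server would honour.

```python
import gzip


def build_response(body: bytes, accept_encoding: str) -> bytes:
    """Build a raw HTTP response, gzip-compressing the body only if the client offered gzip."""
    headers = ["HTTP/1.1 200 OK", "Content-Type: text/html"]
    # Simplified parsing: take the token before any ";q=..." parameter.
    offered = [token.split(";")[0].strip().lower() for token in accept_encoding.split(",")]
    if "gzip" in offered:
        body = gzip.compress(body)
        headers.append("Content-Encoding: gzip")
    headers.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(headers).encode("ascii") + b"\r\n\r\n"
    return head + body


def read_body(response: bytes) -> bytes:
    """Client side: split headers from body and undo gzip if the server says it was applied."""
    head, _, body = response.partition(b"\r\n\r\n")
    if b"content-encoding: gzip" in head.lower():
        body = gzip.decompress(body)
    return body
```

If the client does not list gzip, the body goes over the wire untouched; either way the client ends up with the same bytes the server started with.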

I understand the name is misleading you, but if you read Hyper Text Transfer Protocol as a Transfer Protocol with Hypertext capabilities, then it changes a bit.
When HTTP was developed there were already lots of protocols (for example, the IP protocol, which is how data are widely transmitted between servers on the internet) but there were no protocols that allowed for easy navigation between documents.
HTTP is a protocol that allows for transferring of information AND for hyper text (i.e. links) embedded within text documents. These links don't necessarily have to point to other text documents, so you can basically transmit any information using HTTP (the sender and the receiver agree on the type of document being sent using something called the mime type).
So the name still makes sense, even if you can send things other than text files.

HTTP stands for Hyper Text Transfer Protocol. So to understand it literally, it is meant for text transferring.
Yes, text transferring. Not necessarily plain text, but all text. It doesn't mean that your text has to be readable by a person, just the computer.
And I used some sniffer tool to monitor the wire traffic. What gets transferred are all ASCII characters.
Your sniffer tool knows that you're a person, so it won't just present you with 0s and 1s. It converts whatever it gets to ASCII characters to make it readable to you. All communication over the wire is binary. The ASCII representation is just there for your sake.
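As a rough illustration of what such a tool does for display, the sketch below renders arbitrary bytes as hex plus an ASCII column. The function name hexdump and the layout are my own choices here, not the output format of any particular sniffer.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values, and a printable-ASCII column."""
    pad = width * 3 - 1  # width of the hex column when a row is full
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)
```

The bytes on the wire are unchanged; only the right-hand column is a human-friendly rendering, and non-printable bytes (such as the start of a JPEG) simply show up as dots.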
So I guess we have to convert whatever we want to download into characters before transferring it via HTTP
No, not at all. Again, it's text – not necessarily plain text.
I always think it is TCP that can transfer whatever data, [...]
Here you're right. TCP does transfer all data, but in a completely different layer. To understand this, let's look at the OSI model:
When you send anything over the network, your data goes through all the different layers. First, the application layer. Here we have HTTP and several others. Everything you send over HTTP goes through the layers, down through presentation and all the way to the physical layer.
So when you say that TCP transfers the data, then you're right (HTTP could work over other transport protocols such as UDP, but that is rarely seen), but TCP transfers all your data whether you download a file from a webserver, copy a shared folder on your local network between computers or send an email.

HTTP can transfer "binary" data just fine. There is no need to convert anything.
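A small sketch of why no conversion is needed: the headers are ASCII text, but the body is just Content-Length raw bytes after the blank line. The helper names below (build_binary_response, parse_response) are illustrative only.

```python
def build_binary_response(body: bytes, content_type: str) -> bytes:
    """ASCII status line and headers, a blank line, then the body bytes untouched."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def parse_response(raw: bytes):
    """Split at the first blank line and read exactly Content-Length body bytes."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("ascii")
    length = int(headers["content-length"])
    return headers, rest[:length]


# Every possible byte value, including NUL and 0xFF, survives the round trip.
payload = bytes(range(256))
headers, body = parse_response(build_binary_response(payload, "image/jpeg"))
```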

HTTP is the protocol used to transfer your data. In your case any file you are downloading.

You can either do that (opening another type of connection) or you can send your data as raw text. What you'll send is just what you would see when opening the file in a text editor. Your browser just decides to save the file in your Downloads folder (or wherever you want it) because it sees the file type is not supported (.rar, .zip).

If you look at the OSI model, HTTP is a protocol that lives in the application layer. So when you hear that someone uses "HTTP to transfer data", they are referring to the application layer protocol. An alternative would be FTP or NFS, for example.
The browser does indeed open a TCP connection when HTTP is used. TCP lives in the transport layer and provides a reliable connection on top of IP.
The HTTP protocol provides different verbs that can be used to retrieve and send data; GET and POST are the most common ones. Look up REST.
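To see both layers at once, here is a minimal sketch using only the Python standard library: a throwaway local server, and a client that issues a GET. The handler name, the path /file.bin and the payload are made up for the example; http.client opens the TCP connection underneath and speaks HTTP on top of it.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = bytes(range(256)) * 4  # hypothetical stand-in for a downloadable file


class FileHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)  # raw bytes, no text encoding

    def log_message(self, format, *args):
        pass  # keep the example quiet


def download_once() -> bytes:
    server = HTTPServer(("127.0.0.1", 0), FileHandler)  # port 0: let the OS pick one
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/file.bin")  # the HTTP verb and path
        body = conn.getresponse().read()
        conn.close()
        return body
    finally:
        server.shutdown()
        server.server_close()
```

Calling download_once() returns exactly the bytes in PAYLOAD: the "download" is an HTTP GET whose response body travelled over a TCP connection.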

Related

Wireshark Traffic Analysis for File type

So I am wondering how we can use Wireshark to see whether a user has downloaded a txt file over the internet.
I tried this while running wireshark:
https://code.google.com/p/androidnetworktester/downloads/detail?name=1mb.txt
I followed the HTTP stream, and can see the URL and a bunch, but in the PCAP packet body, I can't find the 1mb.txt file anywhere. Just curious, if we are doing forensics work, how can we prove the person really downloaded this using this Wireshark information? Is it because it's using SSL that all the text in the PCAP is scattered with random code?
Thanks a bunch
if we are doing forensics work, how can we prove the person really downloaded this using this Wireshark information
You can't really prove it from the packet capture unless you are able to decode the content. In most cases this is not possible, but if you have access to the private key of the site (you usually don't, because it is private) and if RSA key exchange was used, then you can decode the traffic after capture.
What you can get from the packet capture is the target host of the request, but not the exact URL or even the content. But if the length of the captured transfer roughly matches the length of the content (there is some overhead in transport) and if you know that this is the only file of this size on the server, then you have at least an indicator that the user might have downloaded this file. But that is probably not enough as real proof.
For more proof you might then have a look at the history of the browser.

How do XMPP/HTML/etc. *really* work?

This might be a dumb question, however, I have been continually frustrated by what seems to be a big gap in every explanation I've seen of protocols like XMPP or HTML. So basically, when I've read documentation on either, in general, it will describe the structure of the data sent back and forth through the protocol, but it does not explain exactly how this data is transferred. It's one thing to provide an example of, say, a generic HTTP request, but it is something else to explain how this text is actually sent to the server.
I guess posed another way, what resources are there out there for learning best practices for implementing text-based protocols? At their core, are all text-based protocols basically the exact same thing? How, for example, would it differ at the binary level, were I to say send the text content of an HTTP request over IRC vs however it is done natively by HTTP?
If I wanted to develop my own, simple textual protocol, what would be the best way to send the text to a client? Does the content itself even really matter? What I mean is that, obviously, HTTP and XMPP are rather different protocols, but do they differ in terms of how the text is transferred from computer to computer?
HTTP, IRC and XMPP are all sent on top of TCP, which is a protocol that provides a bidirectional stream between two endpoints (IP address + port). Under the hood, the data you send is split into separate packets, sent across the network, and reassembled on the other end, so that the recipient just sees a stream of incoming data - except when something goes wrong; there is a somewhat accessible description here.
What that means is that while the application protocol (HTTP, XMPP etc) is different, the underlying transport mechanism is exactly the same. It would be possible (perhaps even interesting) to implement HTTP on top of IRC: an HTTP/IRC client enters a channel, sends the HTTP request as messages to the channel, line by line, a server is present in the channel, reads the request and sends the response the same way - but transporting HTTP over IRC is fundamentally different from transporting HTTP over TCP. The former means layering an application protocol over another application protocol (and the IRC connection needs to go over TCP anyway), while the latter is an application protocol over a transport protocol, which is the way things usually are done (except for various kinds of proxies).
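As a sketch of what "a text protocol on top of a byte stream" can look like, the example below frames messages with a newline delimiter. It uses socket.socketpair() as a stand-in for a real TCP connection, and the message contents and function names are invented for illustration.

```python
import socket


def send_message(sock: socket.socket, text: str) -> None:
    """One message per line: encode the text and terminate it with a newline."""
    if "\n" in text:
        raise ValueError("a message must not contain the delimiter")
    sock.sendall(text.encode("utf-8") + b"\n")


def read_messages(sock: socket.socket, count: int) -> list:
    """Read up to `count` newline-terminated messages from the stream."""
    messages = []
    with sock.makefile("rb") as stream:
        for _ in range(count):
            line = stream.readline()
            if not line:  # the peer closed the connection
                break
            messages.append(line.rstrip(b"\n").decode("utf-8"))
    return messages


def demo() -> list:
    client, server = socket.socketpair()  # connected pair, behaves like a stream
    try:
        send_message(client, "HELLO alice")
        send_message(client, "GET /index.html")
        return read_messages(server, 2)
    finally:
        client.close()
        server.close()
```

The stream itself has no notion of messages, so the protocol has to define where one ends; here a newline does that job, much as HTTP uses line endings and a blank line for its header section.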
Hope that makes some sense...

What is the better performing / more compact way to send binary data to a server in WP7

Given the no direct tcp / socket limitation in Windows Phone 7 I was wondering what is the way that has the least performance overhead and/or can send it in the most compact way.
I think I can send the data as a file using HTTP (probably with an HTTPWebRequest) and encode it as Base64, but this would increase the transfer size significantly. I could use WCF but the performance overhead is going to be large as well.
Is there a way to send plain binary data without encoding it, or some faster way to do so?
Network communication on WP7 is currently limited to HTTP only.
With that in mind you're going to have to allow for the HTTP header being included as part of the transmission. You can help keep this small by not adding any additional headers yourself (unless you really have to).
In terms of the body of the message then it's up to you to keep things as small as possible.
Formatting your data as JSON will typically be smaller than as XML.
If, however, your data will always be in a specific format you could just include it as raw data, i.e. if you know that the data will have the first n bits/bytes/characters representing one thing, then the next y bits/bytes/characters representing another, etc., you could format your data without any (field) identifiers. It just depends what you need.
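For a feel of the difference, here is a small Python sketch comparing a JSON body with a fixed-layout binary one. The record layout (a 32-bit id, a 32-bit float, one flags byte) is a hypothetical example, not something from the question.

```python
import json
import struct

# Hypothetical fixed layout, little-endian, no padding: uint32 id, float32 temperature, uint8 flags.
RECORD = struct.Struct("<IfB")


def pack_record(record_id: int, temperature: float, flags: int) -> bytes:
    """Encode the fields positionally, with no field names on the wire."""
    return RECORD.pack(record_id, temperature, flags)


def unpack_record(data: bytes) -> tuple:
    """Decode by position; both sides must agree on the layout in advance."""
    return RECORD.unpack(data)


as_json = json.dumps({"id": 1234, "temperature": 21.5, "flags": 3}).encode("utf-8")
as_raw = pack_record(1234, 21.5, 3)
```

The raw form is 9 bytes while the JSON form carries the field names and punctuation as well; the price is that the layout becomes an implicit contract between sender and receiver.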
If you want to send binary data, then certainly some people have been using raw sockets - see
Connect to attached pc from WP7 by opening a socket to localhost
However, unless you want to write your own socket server, then HTTP is very convenient. As Matt says, you can include binary content in your HTTP requests. To do this, you can use the headers:
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-Length: your length
To actually set these headers, you may need to send this as a multipart message... see questions like Upload files with HTTPWebrequest (multipart/form-data)
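For comparison, a plain (non-multipart) binary upload can be sketched like this in Python with urllib. The URL is a placeholder and build_upload is a made-up helper; the request is only constructed here, not sent. The sketch sets just Content-Type and Content-Length, since Content-Transfer-Encoding is a MIME field that RFC 2616 (section 19.4.5) says HTTP itself does not use.

```python
import urllib.request


def build_upload(url: str, payload: bytes) -> urllib.request.Request:
    """Prepare a POST whose body is the raw payload bytes, with no text encoding."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/octet-stream",
            "Content-Length": str(len(payload)),
        },
        method="POST",
    )


# Placeholder host; nothing is sent until the request is passed to urlopen().
request = build_upload("http://example.invalid/upload", b"\x00\x01\xff")
```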
There's some excellent sample code on AppHub forums - http://forums.create.msdn.com/forums/p/63646/390044.aspx - shows how to upload a binary photo to Facebook.
Unless your data is very large, then it may be easier to take the 4/3 hit of Base64 encoding :) (and there are other slightly more efficient encoding types too like Ascii85 - http://en.wikipedia.org/wiki/Ascii85)
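The size cost is easy to measure. The sketch below uses Python's base64 module, which provides both Base64 and Ascii85 encoders; the 3000-byte payload is arbitrary test data chosen to contain no zero bytes, since Ascii85 abbreviates all-zero groups and that would skew the comparison.

```python
import base64


def encoded_sizes(payload: bytes) -> dict:
    """Return the payload length raw, Base64-encoded, and Ascii85-encoded."""
    return {
        "raw": len(payload),
        "base64": len(base64.b64encode(payload)),    # 4 output chars per 3 input bytes
        "ascii85": len(base64.a85encode(payload)),   # 5 output chars per 4 input bytes
    }


# Arbitrary sample data with byte values 1..251 (no zero bytes).
payload = bytes(i % 251 + 1 for i in range(3000))
sizes = encoded_sizes(payload)
```

For this payload the Base64 form is 4000 characters and the Ascii85 form 3750, against 3000 raw bytes.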

Why does HTTP use only printable characters? [duplicate]

To show my complete ignorance of how TCP/IP works: Looking at the ASCII table, what is the rationale of HTTP using only tab, newline and [\x20-\x7E] for the protocol?
For example, why is "\x02" ("Start of Text") not used but a double newline "\x0A\x0A"?
Has it to do with any interference with TCP/IP (as in "is not allowed in Application Layer")? Is there perhaps a more trivial reason? Or a more complicated?
Duplicate of this question: Why HTTP protocol is designed in plain text way?.
Text-based protocols are easier to debug - no need for special formatting/decoding routines in your debug print code, just dump the entire request to the console. Likewise, you can make an HTTP request just by typing it into an appropriate generic TCP relay program (e.g. netcat).
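The same point can be sketched in Python with a bare TCP socket instead of netcat: the request is just a hand-written string. The local test server, its handler name and the "hello" body are made up so the example runs without touching the network.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def raw_request(request: bytes) -> bytes:
    """Send hand-written request bytes over a plain TCP socket and return the raw reply."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: let the OS pick one
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with socket.create_connection(("127.0.0.1", server.server_port), timeout=5) as sock:
            sock.sendall(request)
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:  # the HTTP/1.0 server closes the connection when done
                    break
                chunks.append(chunk)
        return b"".join(chunks)
    finally:
        server.shutdown()
        server.server_close()


reply = raw_request(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
```

Printing reply shows the status line, the headers and the body exactly as they arrived, which is the debugging convenience the answer describes.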
I can't speak as to why HTTP chose to use that particular restricted character set, but it doesn't have to do with the restrictions of TCP/IP. TCP/IP packets are structured more or less like an envelope; they have complex routing and formatting information that must be stored in a very specific format, but the actual contents of the packet payloads can be whatever you choose. The packet headers contain enough information to allow for forwarding and transmission regardless of the packet content.

How to analyse a HTTP dump?

I have a file that apparently contains some sort of dump of a keep-alive HTTP conversation, i.e. multiple GET requests and responses including headers, containing an HTML page and some images. However, there is some binary junk in between - maybe it's a dump on the TCP or even IP level (I'm not sure how to determine what it is).
Basically, I need to extract the files that were transferred. Are there any free tools I could use for this?
Use Wireshark.
Look into the file format for its dumps and convert your dump to it. It's very simple. It's called the pcap file format. Then you can open it in Wireshark with no problem and it should be able to recognize the contents. Wireshark supports many dozens, if not many hundreds, of communication formats at various OSI layers (including TCP/IP/HTTP) and is great for this kind of debugging.
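A rough sketch of writing the classic pcap format in Python follows. It assumes you have already split your dump into individual frames and that they are Ethernet frames (link type 1); if the dump holds something else, the link type would need to change. The function name pcap_bytes is invented for the example.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4   # classic pcap with microsecond timestamps
LINKTYPE_ETHERNET = 1     # assumption: the dump contains Ethernet frames


def pcap_bytes(frames, linktype: int = LINKTYPE_ETHERNET, snaplen: int = 65535) -> bytes:
    """Serialize (ts_sec, ts_usec, frame_bytes) tuples into a little-endian pcap file."""
    # 24-byte global header: magic, version 2.4, timezone offset, sigfigs, snaplen, link type.
    out = [struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)]
    for ts_sec, ts_usec, frame in frames:
        # 16-byte record header: timestamp, captured length, original length.
        out.append(struct.pack("<IIII", ts_sec, ts_usec, len(frame), len(frame)))
        out.append(frame)
    return b"".join(out)
```

Writing the returned bytes to a file with a .pcap extension should give Wireshark something it can open, provided the frames really match the declared link type.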
Wireshark will analyze on the packet level. If you want to analyze on the protocol level, I recommend Fiddler: http://www.fiddlertool.com/fiddler/
It will show you the headers sent, the responses, and will decrypt HTTPS sessions as well. And a ton more.
The Net tab in the Firebug plugin for Firefox might be of use.
