Detecting if a URL is a file download - http

How can I detect if a given URL is a file to be downloaded?
I came across the Content-Disposition header, but it seems that it isn't directly part of HTTP/1.1.
Is there a more standard way to detect whether the response to a GET request for a given URL is actually a file that is meant to be (or can be) downloaded?
That is, the response is not HTML or JSON or anything similar, but something like an image, an MP3, a PDF file, etc.

HTTP is a transfer protocol - which is a very different thing to hard drive storage layouts. The concept of "file" simply does not exist in HTTP. No more than your computer hard drive contains actual paper-and-cardboard "files" that one would see in an office filing system.
Whatever you may think the HTTP message or URL is saying, the response content does not have to come from any computer file, and does not have to be stored in one by the recipient.
The response to any GET message in HTTP can always be "downloaded" by sending another GET request with that same URL (and maybe other headers in the case of HTTP/1.1 variants). That is built into the definition of what a GET message is and has nothing to do with files.

I ended up using the Content-Type header to decide whether an HTML page or some other type of file is on the other end of a given URL.
I use the Content-Disposition header, when it exists, to detect the original file name; I can't rely on it alone because the header isn't available everywhere.
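A minimal Python sketch of that approach, assuming the response headers are already available as a plain mapping. The helper name describe_response and the PAGE_TYPES set are hypothetical choices made for illustration; the verdict is a heuristic, not a guarantee of what a browser will do.

```python
from email.message import Message
from typing import Mapping, NamedTuple, Optional

# Assumption: these media types count as "pages" rather than files.
PAGE_TYPES = {"text/html", "application/xhtml+xml", "application/json"}


class ResponseInfo(NamedTuple):
    content_type: Optional[str]  # media type without parameters, or None
    filename: Optional[str]      # suggested file name, or None
    is_file: bool                # heuristic verdict only


def describe_response(headers: Mapping[str, str]) -> ResponseInfo:
    """Classify a response from its headers alone (hypothetical helper)."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value

    # get_content_type() falls back to text/plain, so check presence first.
    content_type = msg.get_content_type() if "Content-Type" in msg else None
    # Lower-cased disposition type: "attachment", "inline", or None.
    disposition = msg.get_content_disposition()
    # Reads the Content-Disposition filename parameter (and falls back to
    # the Content-Type "name" parameter); None when neither is present.
    filename = msg.get_filename()

    is_file = disposition == "attachment" or (
        content_type is not None and content_type not in PAGE_TYPES
    )
    return ResponseInfo(content_type, filename, is_file)
```

For example, describe_response({"Content-Type": "text/html; charset=UTF-8"}) reports a page, while a response carrying Content-Disposition: attachment is reported as a file regardless of its type.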

Could checking for a file extension be a possibility? Sorry I can't enlarge on that much without knowing more, but I guess you could consider using PHP to implement this if HTML doesn't have enough functionality?

Related

How can I tell if visiting a URL would download a file of a certain mimetype?

I am building an application that tells me if visiting a URL would make a user download a file of a certain mimetype.
My question is: What information (like header fields) can be used to achieve this?
I was thinking about sending a HEAD request and looking for the Content-Disposition and Content-Type header fields. But an attacker might simply lie in these fields, and because of MIME sniffing my browser would still save the file.
Is there a way to get this information without downloading the file (which would cause unwanted traffic)?
EDIT:
I want to develop an application that gets a URL as input.
The output should be three things:
1: does visiting the URL make browsers save ("download") a file delivered by the webserver?
if 1:
2: what is the mimetype of this file?
3: what is the filename of this file?
Example: The URL https://foo.bar/game.exe, visited with a browser, saves the file game.exe
How could I tell (without causing huge traffic by downloading the file) that the url will: 1: make me download a file 2: application/octet-stream 3: game.exe
I already know how to make a head request. But can I really trust the Content-Disposition and Content-Type header fields? I have observed responses that did not contain a Content-Disposition field and my browser still saved the file. This would cause my application to think the URL is clear while it isn't.
Browsers generally do not guess the MIME type if the type is present in the Content-Type header (see MDN: MIME types), although they may still sniff in some cases unless the server also sends X-Content-Type-Options: nosniff.
So if that header and/or the Content-Disposition header is present, you can mostly rely on the browser not guessing.
Now, in order to detect what you are actually getting, the best way is to request the head of the file (the first line or the first few bytes) and decipher the magic value from that (i.e., the *NIX way of determining what a file is).
This is more reliable and less risky than depending on the file extension.
But if you need a foolproof method to determine whether a file will be downloaded, there isn't one that I know of.
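A rough Python sketch of the magic-value idea, assuming only a handful of well-known signatures. The table is illustrative and far smaller than what file(1) knows; the names sniff_type and fetch_prefix are hypothetical. fetch_prefix asks for the first bytes with a Range header, but a server is free to ignore it, so the function also stops reading after the requested size.

```python
from typing import Optional
from urllib.request import Request, urlopen

# Illustrative signature table (assumption: these few types are enough here).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
    (b"MZ", "application/vnd.microsoft.portable-executable"),
]


def sniff_type(prefix: bytes) -> Optional[str]:
    """Return a media type for known leading bytes, or None if unrecognised."""
    for magic, mime in SIGNATURES:
        if prefix.startswith(magic):
            return mime
    return None


def fetch_prefix(url: str, size: int = 16, timeout: float = 10.0) -> bytes:
    """Fetch at most `size` leading bytes of the response body."""
    request = Request(url, headers={"Range": f"bytes=0-{size - 1}"})
    with urlopen(request, timeout=timeout) as response:
        # read(size) caps the transfer even if the Range header was ignored.
        return response.read(size)
```

A matching signature only says what the content claims to be at its start; it does not prove the rest of the file is valid.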
This can be done using curl, with the -I option (to fetch headers only), like so:
curl -I https://www.irs.gov/pub/irs-pdf/f1040.pdf
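The same headers-only request can be sketched in Python with the standard library. The function names below are hypothetical, redirects and errors are left to urllib's default behaviour, and repeated header fields collapse to a single entry in the returned dictionary.

```python
from typing import Dict
from urllib.request import Request, urlopen


def build_head_request(url: str) -> Request:
    """Create a request that asks for headers only, like `curl -I`."""
    return Request(url, method="HEAD")


def fetch_headers(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Send the HEAD request and return the response headers as a dict."""
    with urlopen(build_head_request(url), timeout=timeout) as response:
        return dict(response.headers.items())
```

Calling fetch_headers("https://www.irs.gov/pub/irs-pdf/f1040.pdf") would perform the network request; the Content-Type and Content-Disposition entries of the result are the ones of interest here.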

How to know when to resolve referer

I was working on my server and needed to make use of request.headers.referer. When I ran tests and read the headers to work out how to write the parsing functions, I couldn't find a way to differentiate between requests that come from a link outside the server, requests from outside the directory, and requests for local resources made by a given HTML response. For instance:
Going from localhost/dir1 to localhost/dir2 using <a href="http://localhost/dir2"> will yield the request values:
referer:"http://localhost/dir1" url:"/dir2"
while the HTML file sent from localhost/dir2 asking for resources using the local URI style.css will yield:
referer:"http://localhost/dir2" url:"/style.css"
and the same situation involving an image could end up
referer:"http://localhost/dir2" url:"/_images/image.png"
How would I prevent incorrect resolution between url and referer, so that they are not accidentally parsed as http://localhost/dir1/dir2 or http://localhost/_images/image.png and so on? Is there a way to tell in what way the URI is being referred to by the browser, and how can either the browser or the server identify when http://localhost/dir2/../dir1 is the intended destination?
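One observation on the examples above: every url value starts with a slash, so it is already an absolute path on the host. The browser resolves relative references against the page's own URL before it sends the request, so the server does not have to combine url with referer at all. A small Python sketch of that resolution, using the standard urljoin (the helper names resolve and same_host are hypothetical):

```python
from urllib.parse import urljoin, urlsplit


def resolve(page_url: str, reference: str) -> str:
    """Resolve a reference the way a browser does before sending a request."""
    return urljoin(page_url, reference)


def same_host(referer: str, host: str) -> bool:
    """Tell whether the Referer points at the given host (case-insensitive)."""
    return urlsplit(referer).netloc.lower() == host.lower()


# The page http://localhost/dir2 (no trailing slash) referring to style.css
# resolves to the server root, which matches url:"/style.css" above.
print(resolve("http://localhost/dir2", "style.css"))
# Dot segments are removed during resolution as well.
print(resolve("http://localhost/dir2/", "../dir1"))
```

Under this reading, the Referer header is only useful for telling where a request came from, for instance with same_host, not for working out which resource is being requested.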

Send XML file over HTTP POST

I want to send a bunch of XML files from my client (iPad) to my application server (web). Is there any way I can pass them to the server using HTTP POST? I assume HTTP POST only allows embedding strings, not attaching files. We don't want to use FTP for security reasons. We even thought of a web service, but we're not sure whether attachments are possible. Please advise if you know any way of transferring files from client to server.
The maximum size of a POST body is very large (the practical limit is whatever the server is configured to accept), so no worries there: you can send XML just fine. POST can carry any type of data; just make sure you set the Content-Type header correctly, or you may get unexpected results.
Plain HTTP is no more and no less secure than FTP, however; use HTTPS if the transfer needs protecting.
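As an illustration of sending XML as the raw POST body with an explicit Content-Type, here is a short Python sketch. The asker's client is an iPad, so this only shows the shape of the request; the helper name build_xml_post and the endpoint https://example.com/upload are made up for the example.

```python
from urllib.request import Request


def build_xml_post(url: str, xml_text: str) -> Request:
    """Build a POST request whose body is the XML document itself."""
    body = xml_text.encode("utf-8")
    return Request(
        url,
        data=body,
        # Declare what the body is so the server does not have to guess.
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical endpoint; pass the request to urllib.request.urlopen to send it.
request = build_xml_post("https://example.com/upload", "<note>hi</note>")
```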

What is the correct way to determine the type of a file returned by a web server?

I've always believed that the HTTP Content-Type should correctly identify the contents of a returned resource. I've recently noticed a resource from google.com with a filename similar to /extern_chrome/799678fbd1a8a52d.js that was returned with these HTTP headers:
HTTP/1.1 200 OK
Expires: Mon, 05 Sep 2011 00:00:00 GMT
Last-Modified: Mon, 07 Sep 2009 00:00:00 GMT
Content-Type: text/html; charset=UTF-8
Date: Tue, 07 Sep 2010 04:30:09 GMT
Server: gws
Cache-Control: private, x-gzip-ok=""
X-XSS-Protection: 1; mode=block
Content-Length: 19933
The content is not HTML, but is pure JavaScript. When I load the resource using a local proxy (Burp Suite), the proxy states that the MIME type is "script".
Is there an accepted method for determining what is returned from a web server? The Content-type header seems to usually be correct. Extensions are also an indicator, but not always accurate. Is the only accurate method to examine the contents of the file? Is this what web browsers do to determine how to handle the content?
The browser knows it's JavaScript because it reached it via a <script src="..."> tag.
If you typed the URL to a .js file into your URL's address bar, then even if the server did return the correct Content-Type, your browser wouldn't treat the file as JavaScript to be executed. (Instead, you would probably either see the .js source code in your browser window, or be prompted to save it as a file, depending on your browser.)
Browsers don't do anything with JavaScript unless it's referenced by a <script> tag, plain and simple. No content-sniffing is required.
Is the only accurate method to examine the contents of the file?
It's the method browsers use to determine the file type, but it is by no means accurate. The fact that it isn't accurate is a security concern.
The only method available to the server to indicate the file type is via the Content-Type HTTP header. Unfortunately, in the past, not many servers set the correct value for this header. So browsers decided to play smart and tried to figure out the file type using their own proprietary algorithms.
The "guess work" done by browsers is called content-sniffing. The best resource to understand content-sniffing is the browser security handbook. Another great resource is this paper, whose suggestions have now been incorporated into Google Chrome and IE8.
How do I determine the correct file type?
If you are just dealing with a known/small list of servers, simply ask them to set the right content-type header and use it. But if you are dealing with websites in the wild that you have no control of, you will likely have to develop some kind of content-sniffing algorithm.
For text files, such as JavaScript, CSS, and HTML, the browser will attempt to parse the file. If that parsing fails before anything can get parsed, then it is considered completely invalid. Otherwise, as much as possible is kept and used. For JavaScript, it probably needs to syntactically compile everything.
For binary files, such as Flash, PNG, JPEG, WAVE files, they could use a library such as the magic library. The magic library determines the MIME type of a file using the content of the file which is really the only part that's trustworthy.
However, somehow, when you drag and drop a document in your browser, the browser heuristic in this case is to check the file extension. Really weak! So a file to attach to a POST could be a .exe and you would think it is a .png because that's the current file extension...
I have some code to test the MIME type of a file in JavaScript (after a drag and drop or Browse...):
https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/output/output.js
Search for MIME and you'll find the various functions doing the work. An example of usage is visible in the editor:
https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/editor/editor.js
There are extensions to the basic MIME types that can be found in the mimetype plugin.
It's all object-oriented code, so it may be a bit difficult to follow at first, and many of the calls are asynchronous.
Is there an accepted method for determining what is returned from a web server? The Content-type header seems to usually be correct. Extensions are also an indicator, but not always accurate.
As far as I know Apache uses file extensions. Assuming you trust your website administrator and end users cannot upload content, extensions are quite safe actually.
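To illustrate what an extension-based lookup amounts to, here is a small Python sketch using the standard mimetypes module. The wrapper name guess_from_extension is hypothetical. The lookup never opens the file, so an executable renamed to game.png is still reported as an image, which is why extensions are only safe when you trust whoever named the files.

```python
import mimetypes
from typing import Optional


def guess_from_extension(name: str) -> Optional[str]:
    """Guess a media type from the file name alone; the content is never read."""
    mime, _encoding = mimetypes.guess_type(name)
    return mime


# Only the name matters: the result is the same whatever the bytes inside are.
print(guess_from_extension("game.png"))
```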
Is the only accurate method to examine the contents of the file?
Accurate and secure, yes. That being said, a server that makes use of a database system can save such meta data in the database and thus not have to re-check each time it handles the file. Further, once the type is detected, it can attempt a load to double check that the MIME type is all proper. That can even happen in a backend process so you don't waste the client's time (actually my server goes further and checks each file for viruses too, so even files it cannot load get checked in some way.)
Is this what web browsers do to determine how to handle the content?
As mentioned by Joe White, in most cases the browser expects a specific type of data from a file: a link for CSS expects CSS data; a script tag expects JavaScript; an image or figure tag expects an image; etc.
So the browser can use a loader for that type of data and if the load fails it knows it was not of the right type. So the browser does not really need to detect the type per se. However, you have to trust that the loaders will properly fail when the data stream is invalid. This is why we have updates of the Flash player and way back had an update of the GIF library.
The detection of the type, as the magic library does, will only read a "few" bytes at the start of the file and determine a type based on that. This does not mean that the file is valid and can safely be loaded. The GIF bug meant that the file very much looked like a GIF image (it had the right signature) but at some point the buffers used in the library would overflow possibly creating a way to crash your browser and, hopefully for the hacker, take over your computer...

Better file uploading approach: HTTP post multipart or HTTP put?

Use-case: Upload a simple image file to a server, which clients could later retrieve.
1) Designate an FTP server for the job.
2) HTTP PUT: It can directly upload files to a server without the need for a server-side component to handle the bytestream.
3) HTTP POST: The bytestream is handled by a server-side component.
I think to safely use PUT on a public website requires even more effort than using POST (and is less commonly done) due to potential security issues. See http://bitworking.org/news/PUT_SaferOrDangerous.
OTOH, I think there are plenty of resources for safely uploading files with POST and checking them in the server side script, and that this is the more common practice.
PUT is only appropriate when you know the URL you are putting to.
You could also do:
4) POST to obtain a URL to which you then PUT the file.
edit: how are you going to get the HTTP server to decide whether it is OK to accept a particular PUT request?
What I usually do (via PHP) is HTTP POST.
And employ PHP's move_uploaded_file() to get it to whatever destination I want.
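For comparison, here is how the two HTTP options look from the client side, as a Python sketch built on the standard library. The function names are hypothetical and the multipart body is assembled by hand purely for illustration: it does not escape quotes in the file name, and a real client would normally use a library for this. With PUT the body is the file itself and the URL names the resource; with a multipart POST the file is wrapped in a form part and the server-side script decides where it goes.

```python
import uuid
from urllib.request import Request


def build_put(url: str, data: bytes, content_type: str) -> Request:
    """PUT: the request body is the file, the URL is where it should live."""
    return Request(
        url, data=data, headers={"Content-Type": content_type}, method="PUT"
    )


def build_multipart_post(
    url: str, field: str, filename: str, data: bytes, content_type: str
) -> Request:
    """POST: wrap the file in a single multipart/form-data part."""
    boundary = uuid.uuid4().hex  # random marker separating the parts
    disposition = (
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        disposition.encode("utf-8"),
        f"Content-Type: {content_type}\r\n\r\n".encode("ascii"),
        data,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    return Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Either request would be sent with urllib.request.urlopen; on the server, the POST variant is the one that PHP's move_uploaded_file() is designed to handle.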

Resources