How can I tell if visiting a URL would download a file of a certain mimetype? - http

I am building an application that tells me whether visiting a URL would make a user download a file of a certain MIME type.
My question is: what information (like header fields) can be used to achieve this?
I was thinking about sending a HEAD request and looking at the Content-Disposition and Content-Type header fields. But an attacker might simply lie in these fields, and because of MIME sniffing my browser would still save the file.
Is there a way to get this information without downloading the file (which would cause unwanted traffic)?
EDIT:
I want to develop an application that gets a URL as input.
The output should be three things:
1: Does visiting the URL make browsers save ("download") a file delivered by the webserver?
If 1:
2: What is the MIME type of this file?
3: What is the filename of this file?
Example: the URL https://foo.bar/game.exe, visited with a browser, saves the file game.exe.
How could I tell (without causing huge traffic by downloading the file) that the URL will: 1: make me download a file, 2: with MIME type application/octet-stream, 3: named game.exe?
I already know how to make a HEAD request. But can I really trust the Content-Disposition and Content-Type header fields? I have observed responses that did not contain a Content-Disposition field where my browser still saved the file. This would cause my application to consider the URL clean while it isn't.
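For illustration, a minimal sketch of the HEAD-based check I have in mind (Python with the requests library; the header parsing is deliberately simplified and, as noted, only as trustworthy as the headers themselves):

import requests

def inspect_url(url):
    # HEAD transfers only the headers, not the body; follow redirects so the
    # headers describe the resource that would actually be saved.
    resp = requests.head(url, allow_redirects=True, timeout=10)

    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    disposition = resp.headers.get("Content-Disposition", "")

    # Heuristic only: an explicit "attachment" disposition or a generic binary
    # type suggests the browser would save the response instead of rendering it.
    is_download = disposition.lower().startswith("attachment") or \
        content_type == "application/octet-stream"

    # Very naive filename extraction (a real parser should handle quoting and
    # the filename*= / charset rules).
    filename = None
    if "filename=" in disposition:
        filename = disposition.split("filename=")[-1].strip().strip('"')

    return is_download, content_type, filename

print(inspect_url("https://foo.bar/game.exe"))  # the example URL from above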

Browsers do not guess the MIME type if the type is present in the Content-Type header (see MDN: MIME types).
So you can rely on this: if the Content-Type and/or Content-Disposition header is present, the browser will not guess.
Now, in order to detect what it is you are getting, the best way is to request the head of the file (the first line / few bytes) and decipher the magic value from that (i.e. the *NIX way of determining what a file is).
This is more reliable and less risky than depending on the file extension...
But if you need a foolproof method to determine whether a file will be downloaded, there isn't one that I know of.
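For what it's worth, a rough sketch of the magic-value approach above in Python (the signature table is illustrative, not complete, and some servers ignore Range requests):

import requests

# A few well-known magic values (illustrative, by no means exhaustive).
SIGNATURES = {
    b"%PDF": "application/pdf",
    b"MZ": "application/x-msdownload",   # Windows executables
    b"PK\x03\x04": "application/zip",
    b"\x89PNG": "image/png",
}

def sniff_magic(url):
    # Ask for only the first few bytes instead of the whole body. Servers that
    # ignore the Range header will still stream, so read just one small chunk.
    resp = requests.get(url, headers={"Range": "bytes=0-15"}, stream=True, timeout=10)
    head = next(resp.iter_content(16), b"")
    resp.close()
    for magic, mime in SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return None  # unknown signature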

This can be done using curl, with the -I option (to fetch headers only), like so:
curl -I https://www.irs.gov/pub/irs-pdf/f1040.pdf
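If the URL redirects to the actual file (common for download links), adding -L follows the redirects so you see the headers of the final response:
curl -I -L https://www.irs.gov/pub/irs-pdf/f1040.pdf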

Related

File upload and store with lighttpd

I am running lighttpd on Linux on an embedded platform.
Now I want to make it possible to transfer a file to the system, with an upload web page containing a file selector and an "Upload" button. The selected file is transferred as a POST HTTP request containing multipart/form-data. The file should then simply be stored as a regular file in the file system.
I already have a CGI interface, a bash script which receives the request and passes it to the backend C++ application. And because it is an embedded platform, I would like to avoid using PHP, Python, etc. just for this case.
As far as I can see, lighttpd is not able to save the received files directly from a multipart-encoded request body to plain files, correct?
To decode the body I found the 'munpack' tool from the mpack package, which writes the encoded body parts to files on disk, but is intended for MIME-encoded emails. Nevertheless, I can call it from the CGI bash script, and it works almost as expected, except that it can't handle the terminating boundary ID (the boundary ID given in 'Content-Type' followed by two dashes), resulting in the last file still containing the final boundary. Update: this munpack behaviour came from a faulty script, but it still doesn't work: munpack produces wrong files when the body contains CRLF line endings; only LF produces the correct result.
Is there any other direct request-to-file-on-disk approach? Or do I really have to filter out the terminating boundary manually in the script, or write a multipart-message parser in my C++ application?
To make the use case clear: a user should be able to upload a firmware file to my system. So he connects to my system with a web browser and receives an upload page where he can select the file and send it with an "Upload" button. This transferred file should then simply be stored on my system. The CGI script for receiving the request already exists (as well as a C++ backend where I could handle the request, too); the only problem is converting the multipart/form-data encoded file to a plain file on disk.
Now I want to make it possible to transfer a file to the system, through a POST HTTP request. The file should simply be stored as a regular file in the file system.
That sounds more like it should be an HTTP PUT rather than an HTTP POST.
As far as I can see, lighttpd is not able to save the received files directly from a multipart-encoded request body to plain files, correct?
Do you mean application/x-www-form-urlencoded with the POST?
Why multipart-encoded? Are there multiple files being uploaded?
lighttpd mod_webdav supports PUT. Otherwise, you need your own program to handle the request body, be it a shell script or a compiled program. You can use libfcgi with your C++, or you can look at the C programs that lighttpd uses for testing, which implement FastCGI and SCGI in < 300 lines of C each.
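For illustration only, a sketch of such a body-handling program as a CGI script (Python here purely to show the parsing logic you would otherwise implement in the bash script or the C++ backend; the target paths are placeholders):

#!/usr/bin/env python3
# Minimal CGI sketch: read the request body from stdin and store uploaded files.
import os, sys
from email import message_from_bytes

length = int(os.environ.get("CONTENT_LENGTH", "0") or 0)
ctype = os.environ.get("CONTENT_TYPE", "")
body = sys.stdin.buffer.read(length)

if ctype.startswith("multipart/form-data"):
    # Re-attach the Content-Type header so the standard MIME parser can split
    # the parts and strip the boundaries (including the final "--" terminator).
    msg = message_from_bytes(b"Content-Type: " + ctype.encode() + b"\r\n\r\n" + body)
    for part in msg.walk():
        name = part.get_filename()
        if name:
            with open(os.path.join("/tmp", os.path.basename(name)), "wb") as out:
                out.write(part.get_payload(decode=True))
else:
    # e.g. a plain HTTP PUT: the body already is the file.
    with open("/tmp/upload.bin", "wb") as out:
        out.write(body)

print("Content-Type: text/plain\n")
print("stored")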

Securing HTTP referer

I develop software which stores files in directories with random names to prevent unauthorized users from downloading them.
The first thing we need for this is to store them on a separate top-level domain (to prevent cookie theft).
The second danger is the HTTP referer, which may reveal the name of the secret directory.
My experiments with the Chrome browser show that the HTTP referer is sent only when I click a link in my (secret) file. So the trouble is limited to files which may contain links (in Chrome, HTML and PDF). Can I rely on this behaviour (not sending the referer if the next page is opened not via a link on the current (secret) page but by some other method, such as entering the URL directly) for all browsers?
So the problem was limited only to HTML and PDF files. But it is not a complete security solution.
I suspect that we can fully solve this problem by adding Content-Disposition: attachment when serving all our secret files. Will it prevent the HTTP referer from being sent?
Also note that I am going to use HTTPS so that a man-in-the-middle is not able to download our secret files.
You can use the Referrer-Policy header to try to control referer behaviour. Note, however, that this requires clients to implement it.
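For example, to suppress the referer entirely, the responses serving the secret pages would include:
Referrer-Policy: no-referrer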
Instead of trying to conceal the file location, may I suggest you implement proper authentication and authorization handling?
I agree that Referrer-Policy is your best first step, but as DaSourcerer notes, it is not universally implemented on browsers you may support.
A fully server-side solution is as follows:
User connects to .../<secret>
Server generates a one-time token and redirects to .../<token>
Server provides the document and invalidates the token
Now the referer will point to .../<token>, which is no longer valid. This has usability trade-offs, however:
Reloading the page will not work (though you may be able to address this with a cookie or session)
Users cannot share the URL from the URL bar, since it's technically invalid (in some cases that could be a minor benefit)
You may be able to get the same basic benefits without the usability trade-offs by doing the same thing with an IFRAME rather than redirecting. I'm not certain how IFRAME influences Referer.
This entire solution is basically just Referer masking done proactively. If you can rewrite the links in the document, then you could instead use Referer masking on the way out. (i.e. rewrite all the links so that they point to https://yoursite.com/redirect/....) Since you mention PDF, I'm assuming that this would be challenging (or that you otherwise do not want to rewrite the document).
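A very rough sketch of the token-redirect flow described above (assuming a Flask-style server; lookup_secret is a hypothetical helper, the in-memory dict is placeholder storage, and a real implementation would also need token expiry):

import secrets
from flask import Flask, abort, redirect, send_file

app = Flask(__name__)
tokens = {}  # one-time token -> path of the secret file (placeholder storage)

@app.route("/files/<secret>")
def start(secret):
    # Resolve the secret directory to a real file, then hand out a single-use
    # token so the referer can only ever expose the token, not <secret>.
    path = lookup_secret(secret)          # hypothetical helper
    token = secrets.token_urlsafe(16)
    tokens[token] = path
    return redirect(f"/once/{token}")

@app.route("/once/<token>")
def serve(token):
    path = tokens.pop(token, None)        # invalidate on first use
    if path is None:
        abort(404)
    return send_file(path)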

Detecting if a URL is a file download

How can I detect if a given URL is a file to be downloaded?
I came across the Content-Disposition header; however, it seems that this isn't part of HTTP 1.1 directly.
Is there a more standard way to detect whether the response to a GET request made to a given URL is actually a file to be downloaded?
That is, the response is not HTML or JSON or anything similar, but something like an image, an MP3, a PDF file, etc.?
HTTP is a transfer protocol - which is a very different thing to hard drive storage layouts. The concept of "file" simply does not exist in HTTP. No more than your computer hard drive contains actual paper-and-cardboard "files" that one would see in an office filing system.
Whatever you may think the HTTP message or URL are saying, the response content does not have to come from any computer file, and does not have to be stored in one by the recipient.
The response to any GET message in HTTP can always be "downloaded" by sending another GET request with that same URL (and maybe other headers in the case of HTTP/1.1 variants). That is built into the definition of what a GET message is and has nothing to do with files.
I ended up using the Content-Type to decide whether it is an HTML page or some other type of file that sits at the other end of a given URL.
I'm using the Content-Disposition header content to detect the original file name when it exists, since that header isn't available everywhere.
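Roughly like this, sketched with Python requests (the Content-Disposition parsing is simplified and the fallback is just the last URL path segment):

from urllib.parse import unquote, urlsplit
import requests

def classify(url):
    resp = requests.head(url, allow_redirects=True, timeout=10)
    ctype = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
    is_html = ctype in ("text/html", "application/xhtml+xml")

    # Prefer the name from Content-Disposition when the server sends one...
    disposition = resp.headers.get("Content-Disposition", "")
    filename = None
    if "filename=" in disposition:
        filename = disposition.split("filename=")[-1].strip().strip('"')
    # ...otherwise fall back to the last segment of the URL path.
    if not filename:
        filename = unquote(urlsplit(url).path.rsplit("/", 1)[-1])

    return is_html, ctype, filename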
Could checking for a file extension be a possibility? Sorry I can't enlarge on that much without knowing more, but I guess you could consider using PHP to implement this if HTML doesn't have enough functionality?

How can I find the URL that downloads a file?

I am developing a web scraper and I need to download a .pdf file from a page. I can get the file name from the HTML tag, but can't find the complete URL (or request body) that downloads the file.
I have tried to sniff the traffic with the Chrome and Firefox network tools and with Wireshark, with no success. I can see it makes a POST request to the exact same URL as the page itself, and I can't understand why this happens. My guess is that the filename is being sent inside the POST request body, but I also can't find that information in those tools. If I could see the variable name in the body, I could create a copy of the request and then get the file.
How can I get that information?
Here is the website I am talking about: http://www2.trt8.jus.br/consultaprocesso/formulario/ProcessoConjulgado.aspx?sDsTelaOrigem=ListarProcessos.aspx&iNrInstancia=1&sFlTipo=T&iNrProcessoVaraUnica=126&iNrProcessoUnica=1267&iNrProcessoAnoUnica=2010&iNrRegiaoUnica=8&iNrJusticaUnica=5&iNrDigitoUnica=24&iNrProcesso=1267&iNrProcessoAno=2010&iNrProcesso2a=0&iNrProcessoAno2a=0
EDIT: for those seeking to do something similar, take a look at this website: http://curl.trillworks.com/
It converts a cURL command to Python requests code. Very useful.
The POST data used for the request is encoded content generated by ASP.NET. It contains various state/session information of the page that the link is on. This makes it difficult to directly scrape for the URL.
You can examine the HAR by exporting it from the Network tab in Chrome DevTools.
The __EVENTVALIDATION data is used to ensure events raised on the client originate from the controls rendered on the page from the server.
You might be able to achieve what you want by requesting the page the link is on first, then extract the required POST data from the response (containing the page state and embedded request for file), and then make a new request with this information. This assumes the server doesn't expire any sessions in the meantime.
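A rough sketch of that two-step approach with Python requests and BeautifulSoup (the hidden-field names are the standard ASP.NET ones; the __EVENTTARGET value is hypothetical and must be copied from the recorded postback, and the long query string from the question's URL is omitted here):

import requests
from bs4 import BeautifulSoup

PAGE_URL = "http://www2.trt8.jus.br/consultaprocesso/formulario/ProcessoConjulgado.aspx"  # plus the query string from the question

with requests.Session() as s:
    # Step 1: fetch the page the link is on and pull out the ASP.NET state fields.
    soup = BeautifulSoup(s.get(PAGE_URL).text, "html.parser")
    data = {
        name: (soup.find("input", {"name": name}) or {}).get("value", "")
        for name in ("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")
    }
    # Step 2: replay the postback. __EVENTTARGET must name the control that
    # triggers the PDF download (copy it from the HAR / DevTools request).
    data["__EVENTTARGET"] = "ctl00$LinkToPdf"   # hypothetical control name
    resp = s.post(PAGE_URL, data=data)
    with open("processo.pdf", "wb") as f:
        f.write(resp.content)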

What is the correct way to determine the type of a file returned by a web server?

I've always believed that the HTTP Content-Type should correctly identify the contents of a returned resource. I've recently noticed a resource from google.com with a filename similar to /extern_chrome/799678fbd1a8a52d.js that was returned with the following HTTP headers:
HTTP/1.1 200 OK
Expires: Mon, 05 Sep 2011 00:00:00 GMT
Last-Modified: Mon, 07 Sep 2009 00:00:00 GMT
Content-Type: text/html; charset=UTF-8
Date: Tue, 07 Sep 2010 04:30:09 GMT
Server: gws
Cache-Control: private, x-gzip-ok=""
X-XSS-Protection: 1; mode=block
Content-Length: 19933
The content is not HTML, but is pure JavaScript. When I load the resource using a local proxy (Burp Suite), the proxy states that the MIME type is "script".
Is there an accepted method for determining what is returned from a web server? The Content-type header seems to usually be correct. Extensions are also an indicator, but not always accurate. Is the only accurate method to examine the contents of the file? Is this what web browsers do to determine how to handle the content?
The browser knows it's JavaScript because it reached it via a <script src="..."> tag.
If you typed the URL of a .js file into your browser's address bar, then even if the server did return the correct Content-Type, your browser wouldn't treat the file as JavaScript to be executed. (Instead, you would probably either see the .js source code in your browser window, or be prompted to save it as a file, depending on your browser.)
Browsers don't do anything with JavaScript unless it's referenced by a <script> tag, plain and simple. No content-sniffing is required.
Is the only accurate method to examine the contents of the file?
It's the method browsers use to determine the file type, but it is by no means accurate. The fact that it isn't accurate is a security concern.
The only method available to the server to indicate the file type is via the Content-Type HTTP header. Unfortunately, in the past, not many servers set the correct value for this header. So browsers decided to play smart and tried to figure out the file type using their own proprietary algorithms.
The "guess work" done by browsers is called content-sniffing. The best resource to understand content-sniffing is the browser security handbook. Another great resource is this paper, whose suggestions have now been incorporated into Google Chrome and IE8.
How do I determine the correct file type?
If you are just dealing with a known/small list of servers, simply ask them to set the right content-type header and use it. But if you are dealing with websites in the wild that you have no control of, you will likely have to develop some kind of content-sniffing algorithm.
For text files, such as JavaScript, CSS, and HTML, the browser will attempt to parse the file. If that parsing fails before anything can get parsed, then it is considered completely invalid. Otherwise, as much as possible is kept and used. For JavaScript, it probably needs to syntactically compile everything.
For binary files, such as Flash, PNG, JPEG, WAVE files, they could use a library such as the magic library. The magic library determines the MIME type of a file using the content of the file which is really the only part that's trustworthy.
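For instance, a minimal use of such a magic library from Python (assuming the python-magic bindings around libmagic; other 'magic' packages expose different APIs):

import magic  # python-magic bindings around libmagic

def mime_of(path):
    # libmagic only needs the first couple of kilobytes to match its signatures.
    with open(path, "rb") as f:
        return magic.from_buffer(f.read(2048), mime=True)

print(mime_of("799678fbd1a8a52d.js"))  # e.g. "text/plain" or "application/javascript"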
However, when you drag and drop a document into your browser, the browser heuristic in this case is to check the file extension. Really weak! So a file attached to a POST could be a .exe and you would think it is a .png because that's the current file extension...
I have some code to test the MIME type of a file in JavaScript (after a drag and drop or Browse...):
https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/output/output.js
Search for MIME and you'll find the various functions doing the work. An example of usage is visible in the editor:
https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/editor/editor.js
There are extensions to the basic MIME types that can be found in the mimetype plugin.
It's all Object Oriented code so it may be a bit difficult to follow at first, but more or less, many of the calls are asynchronous.
Is there an accepted method for determining what is returned from a web server? The Content-type header seems to usually be correct. Extensions are also an indicator, but not always accurate.
As far as I know, Apache uses file extensions. Assuming you trust your website administrator and end users cannot upload content, extensions are actually quite safe.
Is the only accurate method to examine the contents of the file?
Accurate and secure, yes. That being said, a server that makes use of a database system can save such meta data in the database and thus not have to re-check each time it handles the file. Further, once the type is detected, it can attempt a load to double check that the MIME type is all proper. That can even happen in a backend process so you don't waste the client's time (actually my server goes further and checks each file for viruses too, so even files it cannot load get checked in some way.)
Is this what web browsers do to determine how to handle the content?
As mentioned by Joe White, in most cases the browser expects a specific type of data from a file: a link for CSS expects CSS data; a script expects JavaScript, Ruby, ASP; an image or figure tag expects an image; etc.
So the browser can use a loader for that type of data and if the load fails it knows it was not of the right type. So the browser does not really need to detect the type per se. However, you have to trust that the loaders will properly fail when the data stream is invalid. This is why we have updates of the Flash player and way back had an update of the GIF library.
The detection of the type, as the magic library does, will only read a "few" bytes at the start of the file and determine a type based on that. This does not mean that the file is valid and can safely be loaded. The GIF bug meant that the file very much looked like a GIF image (it had the right signature) but at some point the buffers used in the library would overflow possibly creating a way to crash your browser and, hopefully for the hacker, take over your computer...
