Tell if someone is accessing my HTTP-resources directly? - http

Is there a way to find out if anyone is calling the image located on my website directly on their website?
I have a website and I just want to make sure no one is using my bandwidth.

Sure, there are methods, some of which can be trusted a little more than others.
Using Referer-Header
There is an HTTP header named Referer which usually contains the URL of the page the user was on when the current request was made.
You can see it as a "I came from here"-header.
If it were guaranteed to always exist it would be a piece of cake to prevent people from leeching your bandwidth, but since it is not (clients and proxies may omit or alter it), relying on this value alone is pretty much a gamble.
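A minimal sketch of such a Referer check, in Python. The host names in ALLOWED_HOSTS and the function name is_hotlink are made up for illustration, and a missing header is let through by default precisely because the header cannot be relied on:

```python
from urllib.parse import urlsplit

# Assumption: the host names your own pages are served from.
ALLOWED_HOSTS = {"example.com", "www.example.com"}

def is_hotlink(referer, allow_missing=True):
    """Return True when the Referer points at a host that is not ours."""
    if not referer:
        # Header absent or empty: we cannot tell, so let it through by default.
        return not allow_missing
    host = urlsplit(referer).hostname
    return host not in ALLOWED_HOSTS
```

Treating a missing Referer as "probably fine" keeps real visitors from being blocked, at the cost of letting some leechers through.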
Using Cookies
Another way of telling whether a user is a true visitor on your website is to use cookies. A user who hasn't got a cookie and tries to access a specific resource (such as an image) could get a message saying "sorry, only real visitors of example.com get access to this image".
Too bad that nothing forces a client to implement and handle cookies.
Using links with a set expiration time [RECOMMENDED]
This is probably the safest option, though it's the hardest to implement.
Using links that are only valid for N hours makes it impossible to leech your bandwidth without going to the trouble of implementing some sort of crawler which regularly crawls your site and picks up the current access token required to get access to a resource (such as an image).
When a user visits the site, a token that is valid for N hours is generated and appended to the path of every protected resource in the page sent back to the visitor. This token is mandatory and only valid for N hours.
If the user tries to access an image with an invalid/non-existent token you could send back either 404 or 403 as the HTTP status code (preferably the latter, since 403 means Forbidden; 401 is meant for requests that lack authentication credentials).
There are however some quirks worth mentioning:
Crawlers from search engines might not visit the whole site within the N hours, so make sure that they can access the whole content of your site. Identify them by the value of the User-Agent header.
Don't be tempted to lower the lifespan of your token below a reasonable time. Remember that some users are on slow connections; a token lifetime of 5 seconds might sound cool, but real users can get flagged erroneously.
Never put a token on a resource that people should be able to find from an external point (search engines for one), such as the page containing the images you wish to protect.
If you do this by accident you will mostly harm the reputation of your site.
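One way the expiring-link idea could be sketched in Python is with an HMAC over the path and an expiry timestamp. This is only an illustration under assumptions: SECRET_KEY, TOKEN_LIFETIME (N = 6 hours here) and the expires/token query parameter names are all invented, not something prescribed above.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"       # assumption: a secret known only to the server
TOKEN_LIFETIME = 6 * 60 * 60    # assumption: N = 6 hours, in seconds

def _signature(path, expires):
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def sign_path(path, now=None):
    """Return the path with an expiry timestamp and signature appended."""
    now = int(time.time()) if now is None else int(now)
    expires = now + TOKEN_LIFETIME
    return f"{path}?expires={expires}&token={_signature(path, expires)}"

def is_valid(path, expires, token, now=None):
    """True if the token matches the path and has not expired yet."""
    now = int(time.time()) if now is None else int(now)
    try:
        expires = int(expires)
    except (TypeError, ValueError):
        return False
    if now > expires:
        return False
    expected = _signature(path, expires)
    # Constant-time comparison, done on bytes so odd input cannot raise.
    return hmac.compare_digest(expected.encode(), str(token).encode())
```

Because the expiry time is part of the signed message, a leecher cannot simply extend the lifetime by editing the expires value in the URL.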
Additional thoughts...
Please remember that any method implemented to make it impossible
for leechers to hotlink your resources never should result in true
visitors being flagged for bandwidth leech. You probably want to ease
up on the restriction rather than making it stronger.
I'd rather have 10 normal visitors and 2 leechers than no leechers but
only 5 normal users (because I accidentally flagged 5 of the real
visitors as leechers without thinking too much).

Related

Calculate the number of visits based on downloaded GB

I have a website hosted in firebase that totally went viral for a day. Since I wasn't expecting that, I didn't install any analytics tool. However, I would like to know the number of visits or downloads. The only metric I have available is the GB Downloaded: 686,8GB. But I am confused because if I open the website with the console of Chrome, I get two different metrics about the size of the page: 319KB transferred and 1.2MB resources. Furthermore, not all of those things are transferred from firebase but from other CDN as you can see in the screenshots. What is the proper way of calculating the visits I had?
Transferred metric is how much bandwidth was used after compression was applied.
Resources metric is how much disk space those resources use before they are compressed (for transfer).
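Since the bandwidth bill is based on what actually goes over the wire, the transferred figure is the one to divide by. A back-of-the-envelope sketch in Python, assuming decimal units and assuming every page load pulls the full 319 KB from Firebase (which the question says is not quite true, since part of it comes from other CDNs, and returning visitors will have cached some of it):

```python
TOTAL_DOWNLOADED_BYTES = 686.8e9   # 686,8 GB from the question (decimal GB assumed)
TRANSFERRED_PER_LOAD = 319e3       # 319 KB transferred per uncached page load

def estimate_page_loads(total_bytes, bytes_per_load):
    """Very rough estimate: total traffic divided by traffic per page load."""
    return total_bytes / bytes_per_load

loads = estimate_page_loads(TOTAL_DOWNLOADED_BYTES, TRANSFERRED_PER_LOAD)
print(f"about {loads:,.0f} page loads")
```

That comes out at roughly 2.15 million page loads. It is only an order of magnitude, and it counts requests from spiders and bots just like those from humans, which is what the rest of this answer is about.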
True analytics requires an understanding of what is on the web. There are three classifications:
Humans, composed of flesh and blood, who overwhelmingly (though not absolutely) use web browsers.
Spiders (or search engines) that request pages on the understanding that they obey robots.txt and will list your website in their results for relevant search queries.
Rejects (basically spammers and the unknowns) which include (though are far from limited to) content/email scrapers, brute-force password guessers, vulnerability scanners and POST spammers.
With this clarification in place what you're asking in effect is, "How many human visitors am I receiving?" The easiest way to obtain that information is to:
Determine what user agent requests are human (not easy, behavior based).
Determine the length of time a single visit from a human should count as.
Assign human visitors a session.
I presume you understand what a cookie is and how it differs from a session cookie. Obviously when you sign in to a website you are assigned a session. If that session cookie is not sent to the server on a page request you will in effect be signed out. You can make session cookies last for a long time and it will come down to factors such as convenience for the visitor and if you directly count those sessions or use it in conjunction with something else.
Now your next thought likely is, "But how do I count downloads?" Thankfully you mention PHP on your website, so I can give you some code that should make sense to you. If you just link directly to the file you'd be stuck with (at best) counting clicks via a click event on the anchor element, and a download that gets canceled because it was a mistake makes that count more subjective than my suggestion. Granted, my suggestion can still be subjective (e.g. they decide they actually don't want the download and cancel before completion), and of course whether they actually use the download is another aspect to consider. That being said, if you want the server to give you a download count you'd want to do the following:
You may want to use Apache rewrite (or whatever the other HTTP server equivalents are) so that PHP handles the download.
You may need to ensure Apache has the proper handling for PHP (e.g. AddType application/x-httpd-php5 .exe .msi .dmg) so your server knows to let PHP run on the requested file.
You'll want to use PHP's file_exists() with an absolute file path on the server for the sake of security.
You'll want to ensure that you set the correct MIME type for the file via PHP's header(), as you should expect browsers to be horrible at guessing.
You absolutely need to use die() or exit() to avoid Gecko (Firefox) bugs if your software leaks even whitespace as the browser would interpret it as part of the file likely causing corruption.
Here is the code for PHP itself:
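$file = basename((string) parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH)); // last path segment, query string dropped
$path_absolute = '/var/www/downloads/'; // assumption: absolute directory that holds the downloadable files
$full = $path_absolute.$file;
if (!file_exists($full) || is_dir($full)) {
    header('HTTP/1.1 404 Not Found');
    die();
}
$mime = mime_content_type($full) ?: 'application/octet-stream'; // requires the fileinfo extension
header('HTTP/1.1 200 OK');
header('Content-Type: '.$mime);
readfile($full); // streams the file instead of loading it all into memory
die();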
For counting downloads, if you want to get a little fancy you could create a couple of database tables: one for the files (download_files) and a second for requests (download_requests). Throw in basic SQL queries and you're collecting data. Record the client's IP address, IPv6 included (see "Storing IPv6 Addresses in MySQL"), and you'll be able to discern from a query how many unique downloads you have.
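The two-table idea could look roughly like this. It is a sketch in Python with SQLite standing in for MySQL so that it is self-contained. Only the table names come from the paragraph above; the column names and helper functions are assumptions.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the two tables if they do not exist yet and return the connection."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS download_files (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS download_requests (
            id        INTEGER PRIMARY KEY,
            file_id   INTEGER NOT NULL REFERENCES download_files(id),
            ip        TEXT NOT NULL,
            requested TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return db

def record_download(db, name, ip):
    """Log one request for the file called name from the address ip."""
    db.execute("INSERT OR IGNORE INTO download_files (name) VALUES (?)", (name,))
    file_id = db.execute(
        "SELECT id FROM download_files WHERE name = ?", (name,)).fetchone()[0]
    db.execute(
        "INSERT INTO download_requests (file_id, ip) VALUES (?, ?)", (file_id, ip))
    db.commit()

def unique_downloads(db, name):
    """Number of distinct IP addresses that requested the file."""
    row = db.execute(
        """SELECT COUNT(DISTINCT r.ip)
           FROM download_requests r
           JOIN download_files f ON f.id = r.file_id
           WHERE f.name = ?""", (name,)).fetchone()
    return row[0]
```

Counting distinct IPs is still an approximation of unique people: several visitors can share one address, and one visitor can show up under several.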
Back to human visitors: it takes a very thorough study to understand the differences between humans and bots. Things like Captcha are garbage and are utterly annoying. You can get a rough start by requiring a cookie to be sent back on requests though not all bots are ludicrously stupid. I hope this at least gets you on the right path.

How can I communicate to the user's browser that a POST request it made is side-effect-free?

I have to add a page to my website that will be accessed via a POST request. The request is side-effect-free, hence it is safe for the user to use their browser's "Refresh" button on the page. The reason why it has to be POST and not GET is that the volume of data needed to characterise the request is large (it includes a collection of arbitrarily many GUIDs identifying resources to be operated upon at a later stage in the process).
When the user of a browser refreshes a page that was the result of a POST request, the browser will typically warn them that the form will be resubmitted and may cause an action to be repeated. This is not a concern in this case, because as I said, the action of requesting this page is side-effect-free. I therefore want to communicate to the user's browser that no such warning should be presented to the user if they use the "Refresh" function. How can I do this?
You cannot prevent the browser from warning the user about resubmitting a POST request.
References
Mozilla forums (Firefox predecessor) discussed the feature extensively starting in 2002. Some discussion of other browsers also occurs. It is clear that the decision was made to enforce the feature and although workarounds were suggested, they were not taken up.
Google Chrome (2008) and other subsequent browsers also included the feature.
The reasons for this relate to the difference between GET and POST in RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1 (1999).
GET
retrieve whatever information is identified by the Request-URI
POST
request that the origin server accept the
entity enclosed in the request as a new subordinate of the resource
This implies that whilst a GET request only retrieves data, a POST request modifies the data in some way. As per the discussion on the Mozilla forum, the decision was that allowing the warning to be disabled created more risks for the user than the inconvenience of keeping it.
Solutions
Instead a solution is to use sessions to store the data in the POST request and redirect the user with a GET request to a URL that looks in the session data to find the original request parameters.
Assuming the server-side application has session support and it's enabled:
1. User submits POST request with data that generates a specific result: POST /results
2. Server stores that data in the session with a known key
3. Server responds with a 302 Redirect to a chosen URL (could be the same one)
4. The client will request the new page with a GET request: GET /results
5. Server identifies that the incoming GET request is asking for the results of the previous POST request and retrieves the data from the session using the known key.
If the user refreshes the page then steps 4 & 5 are repeated.
To make the solution more robust, the POST data could be assigned to a unique key that is passed as part of the path or query in the 302 redirect GET /results?set=1. This would enable multiple different pages to be viewed and refreshed, for example in different browser tabs. Consideration must be given to ensuring that the unique key is valid and does not allow access to other session data.
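The session-based flow with a per-request key might be sketched like this in Python. There is no web framework here: a plain dict stands in for the session, the handler names and the result_sets key are invented, and the handlers simply return a status plus headers or body. The sketch answers with 303 (See Other), which explicitly asks the client to follow up with a GET; the 302 mentioned above behaves the same way in browsers in practice.

```python
import secrets

def handle_post_results(session, form_data):
    """POST /results: stash the data in the session, then redirect to a GET URL."""
    key = secrets.token_urlsafe(8)  # unique key for this particular submission
    session.setdefault("result_sets", {})[key] = form_data
    return 303, {"Location": f"/results?set={key}"}

def handle_get_results(session, set_key):
    """GET /results?set=...: rebuild the page from the stored data."""
    data = session.get("result_sets", {}).get(set_key)
    if data is None:
        # Unknown key, or a key that belongs to somebody else's session.
        return 404, "unknown or expired result set"
    return 200, f"results for {len(data['guids'])} resources"
```

Because the lookup only ever goes through the caller's own session, a key copied from another user does not give access to their data. A refresh just repeats the GET.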
Some systems, Kibana, Grafana, pastebin.com and many others go one step further. The POST request values are stored in a persistent data store and a unique short URL is provided to the user. The short URL can be used in GET requests and shared with other users to view the same result of what was originally a POST request.
You can solve this problem by implementing the Post/Redirect/Get pattern.
You typically get a browser warning when trying to re-send a POST request for security reasons. Consider a form where you input personal data to register an account or order a product. If you double-send your data, you might register twice or buy the same thing two times (of course, this is just a theoretical example). Thus, the user should get warned when trying to send the same POST request several times. This behaviour is intended and cannot be disabled, but it can be avoided by using the aforementioned PRG pattern.
Image from Wikipedia (published under LGPL).
In simple words, this pattern can be used to avoid double submissions of form data that could possibly cause undesired results. You have to configure your server to redirect the affected incoming POST requests using the status code 303 ("see other"). Then the user will be redirected (using a GET request) to a confirmation page, showing that the request has been successful and will now be processed. If the user now reloads the page, they will simply request that same page again with GET, without re-submitting the POST request.
However, this strategy might not always work. If the response to the first submission has not arrived yet (because of traffic, for instance) and the user submits again, a second POST request could still be sent.
If you provide more information on your tech stack, I can expand my answer by adding specific code samples.
You can't prevent all browsers from showing this "Are you sure you want to re-submit this form?" popup when the user refreshes a page that is the result of a POST request. So you will have to turn this POST request into a GET request if you want to prevent this popup when your users hit F5 on that page.
And for a search form, which you kind of admitted this was for, turning a POST into a GET has its own problems.
For starters, are you sure you need POST to begin with? Is the data really too large to fit in the query string? Taking a reasonable limit of 1024 characters, being around 30 GUIDs (give or take some space for repeated &q=), why do you need the search parameters to be GUIDs to begin with? If you can map them or look them up somehow, you could perhaps limit the size of each parameter to a handful of characters instead of 32 for a non-dashed GUID, and with 5 characters per key you could suddenly fit 200 parameters in the query string.
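To put numbers on that, here is a small Python sketch that counts how many q=value pairs fit into a given limit. The 1024-character limit and the parameter name q are taken from the paragraph above; the function name max_params is made up. It counts the repeated q= and & overhead exactly, so the results come out somewhat lower than the ballpark figures:

```python
def max_params(limit, value_length, name="q"):
    """Largest n such that n 'name=value' pairs joined by '&' fit in limit characters."""
    per_param = len(name) + 1 + value_length   # "q=" plus the value itself
    # n pairs need n * per_param characters plus (n - 1) '&' separators.
    return (limit + 1) // (per_param + 1)

print(max_params(1024, 32))  # non-dashed GUIDs      -> 29
print(max_params(1024, 5))   # 5-character keys      -> 128
```

So with the overhead included it is 29 GUIDs versus 128 short keys: still more than a fourfold gain from shortening the identifiers.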
Still not enough? Then you need a POST indeed.
One approach, mentioned in comments, is using AJAX, so your search form doesn't actually submit, but instead it sends the query data in the background through a JavaScript HTTP POST request and updates the page with the results. This has the benefit that refreshing the page doesn't prompt, as there's only a GET as far as the browser is concerned, but there's one drawback: search results don't get a unique URL, so you can't cache, bookmark or share them.
If you don't care about caching or URL bookmarking, then AJAX definitely is the simplest option here and you need to read no further.
For all non-AJAX approaches, you need to persist the query parameters somewhere, enabling a Post/Redirect/Get pattern. This pattern ends up with a page that is the result of a GET request, which users can refresh without said popup. What the other answers are being quite handwavy about is how to properly do this.
Options are:
Serverside session
When POSTing to the server, you can let the server persist the query parameters in the session (all major serverside frameworks enable you to use sessions), then redirect the user to a generic /search-results page, which on the server side reads the data from the session and presents the user with the results built from querying the database combined with the query parameters from the session.
Drawbacks:
Sessions generally time out, and they do so for good reasons. If your user hits F5 after, say, 20 minutes, their session data is gone, and so are their query parameters.
Sessions are shared between browser tabs. If your user is searching for thing A on tab 1, and for thing B on tab 2, the parameters of the tab that's submitted latest, will overwrite the earlier tabs when those are refreshed.
Sessions are per browser. There's generally no trivial way to share sessions (apart from putting the session ID in the URL, but see the first bullet), so you can't bookmark or share your search results.
Local storage / cookies
You could think "but cookies can contain more data than the query string", but just no. Apart from having a limit as well, they're also shared between tabs, can't be (easily) shared between users and can't be bookmarked.
Local storage also isn't an option, because while that can contain way more data - it doesn't get sent to the server. It's local storage.
Serverside persistent storage
If your search queries actually are that complex that you need multiple KB of query parameters, then you could probably benefit from persisting the query parameters in a database.
So for each search request, you create a new search_query database record that contains the appropriate parameters for the query-to-execute, and, given search results aren't private, you could even write some code that looks up whether the given parameter combination has been used before and first perform a lookup.
So you get a unique search_id that points to a set of parameters with which you can perform a query. Now you can redirect your user, so they perform a GET request to this page:
/search-results?search_id=Xxx
And there you render the results for the given query. Benefits:
You can cache, bookmark and share the URL /search-results?search_id=Xxx
You can refresh the page displaying the search results without an annoying popup
Each browser tab displays its own search results
Of course this approach also has drawbacks:
Unless you use an unguessable key for search_id, users can enumerate earlier searches by other users
Each search costs permanent serverside storage, unless you decide to evict earlier searches based on some criteria
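A sketch of the persistent-storage variant in Python, again with SQLite as a self-contained stand-in for whatever database you use. The search_query table and search_id come from the text above; the column layout, the params_sha lookup column and the helper names are assumptions. It uses a random search_id to address the enumeration drawback and reuses an existing record when the same parameter combination comes in again.

```python
import hashlib
import json
import secrets
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE search_query (
    search_id  TEXT PRIMARY KEY,
    params_sha TEXT UNIQUE NOT NULL,
    params     TEXT NOT NULL)""")

def save_search(params):
    """Store the parameters once and return the search_id to redirect to."""
    canonical = json.dumps(params, sort_keys=True)  # same params, same string
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    row = db.execute(
        "SELECT search_id FROM search_query WHERE params_sha = ?", (digest,)).fetchone()
    if row:
        return row[0]  # this parameter combination has been used before
    search_id = secrets.token_urlsafe(16)  # random, so not enumerable
    db.execute("INSERT INTO search_query VALUES (?, ?, ?)",
               (search_id, digest, canonical))
    db.commit()
    return search_id

def load_search(search_id):
    """Return the stored parameters for /search-results?search_id=..., or None."""
    row = db.execute(
        "SELECT params FROM search_query WHERE search_id = ?", (search_id,)).fetchone()
    return json.loads(row[0]) if row else None
```

The POST handler would call save_search() and redirect to /search-results?search_id=<id>. The GET handler calls load_search() and runs the actual query. Reusing records only makes sense when, as noted above, search results aren't private.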

Ok to pass IDs in query string?

Is it okay to pass IDs in the query string? For example:
example.com/viewperson.aspx?personid=22d62e18-2383-42ca-ba6d-a535355b98bb
Does this change (less risk) for an intranet site?
If a public site, assume anyone can access it (shoulder surfing, browser logging, etc.). Even though we’re under SSL I still assume it's “out there”. Obviously, security will be applied to disallow an unauthenticated user from viewing the page. But is it still a risk?
An added benefit of using query string is bookmarking a few (e.g. persons in this case) you want to call back up without having to go through the front door and look back up.
Would never pass anything meaningful, but maybe an ID is meaningful enough to not pass?
An alternative would be a cookie or session variable, of course.
Assuming these are resources you want protected, it's not a risk provided that requests for resources by ID are both authenticated and authorised. Meaning you should verify that the request comes from a logged-in user and that the user has access to that resource.
So if I belong to company 5 I shouldn't be able to open /companies/4.
Otherwise no issues and there is no alternative approach I am aware of. (by which I mean you must provide an identifier somehow)
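As a sketch of those two checks in Python (the User class and the view_company function are invented for the example; a real application would hook this into its framework's authentication):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    name: str
    company_id: int

def view_company(user: Optional[User], company_id: int) -> int:
    """Return the HTTP status a request for /companies/<company_id> should get."""
    if user is None:
        return 401  # not authenticated: no logged-in user
    if user.company_id != company_id:
        return 403  # authenticated, but not authorised for this company
    return 200
```

With both checks in place, knowing or guessing an ID from a URL gains an outsider nothing.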

Why does CORS ask new domain and not original domain?

I believe I am finally grok-ing CORS and its motivations.
In brief, I understand that a script originating from original.com attempting to request a resource from other.com is potentially risky (for all parties: user, original.com and other.com) due to information leaking; that's the motivation behind CORS. My question is not about this.
CORS mandates that other.com opts-in/agrees to the request from the scripts that originates from original.com.
It took a while to grok because I had a certain intuition I had to go against. My intuition was that it's supposed to be original.com (and not other.com) that had to opt-in/agree to the cross-origin request.
The line of thinking was that original.com is the domain that the user trusts (after all, user went to original.com in the first place). Hence the browser (enforcer of CORS) should trust whomever the user trusts.
e.g.
If original.com says to trust other.com, ads.com and/or tracker.com then go ahead and allow requests to them. But if subsequently ads.com returns a script that requests something from shadows.com (whom original.com does not trust) then block it.
Currently, CORS will cause the browser to ask shadows.com if it accepts a request from original.com. And I imagine shadows.com to be a villain in a leather armchair saying 'why yes, absolutely' >:).
In short and a bit simplified, it's because data is downloaded from the other domain. In your example, other.com holds the data and can decide who (what origins) it shares that data with.
The kind of trust that you are mentioning is by the way also present. When a user visits original.com in the example above, he trusts that original.com will only make him download stuff from domains deemed ok to download data from by developers of original.com. However, whether those other sources (other.com, etc.) want to serve a request when it comes from a user visiting a different site (original.com), it's a decision for the other domains to make.
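From other.com's side, that decision boils down to looking at the Origin header of the incoming request and choosing whether to send Access-Control-Allow-Origin back. A minimal Python sketch, where ALLOWED_ORIGINS and the function name are assumptions:

```python
# Assumption: the origins other.com is willing to share its data with.
ALLOWED_ORIGINS = {"https://original.com"}

def cors_headers(request_origin):
    """Headers other.com adds to its response for a given Origin request header."""
    if request_origin in ALLOWED_ORIGINS:
        # Echo the origin back; Vary tells caches the response depends on Origin.
        return {"Access-Control-Allow-Origin": request_origin, "Vary": "Origin"}
    # No header: the browser will not let the requesting script read the response.
    return {}
```

A villainous shadows.com can of course allow everyone, but in doing so it only gives away its own data. It gains no access to anything on original.com.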

When should one use GET instead of POST in a web application?

It seems that sticking to POST is the way to go because it results in clean looking URLs. GET seems to create long confusing URLs. POST is also better in terms of security. Good for protecting passwords in forms. In fact I hear that many developers only use POST for forms. I have also heard that many developers never really use GET at all.
So why and in what situation would one use GET if POST has these 2 advantages?
What benefit does GET have over POST?
You are correct, however it can be better to use GET for search pages and such: places where you WANT the URLs to be obvious and discoverable. If you look at Google (or any search page), it puts the query into the URL, as in www.google.com/?q=my+search, so people can link directly to the search.
You actually use GET much more than you think. Simply returning the web page is a GET request. There are also POST, PUT, DELETE, HEAD, OPTIONS and these are all used in RESTful programming interfaces.
GET vs. POST has no implications on security; they are both insecure unless you use HTTPS (HTTP over SSL/TLS).
Check the manual, I'm surprised that nobody has pointed out that GET and POST are semantically different and intended for quite different purposes.
While it may appear in a lot of cases that there is no functional difference between the 2 approaches, until you've tested every browser, proxy and server combination you won't be able to rely on that being consistent in every case. e.g. mobile devices / proxies often cache aggressively even where they are requested not to (but I've never come across one which incorrectly caches a POST response).
The protocol does not allow for anything other than simple, scalar datatypes as parameters in a GET - e.g. you can only send a file using POST or PUT.
There are also implementation constraints - last time I checked, the size of a URL was limited to around 2k in MSIE.
Finally, as you've noted, there's the issue of data visibility - you may not want to allow users to bookmark a URL containing their credit card number / password.
POST is the way to go because it results in clean looking URLs
That rather defeats the purpose of what a URL is all about. Read RFC 1630 - The Need For a Universal Syntax.
Sometimes you want your web application to be discoverable as in users can just about guess what a URL should be for a certain operation. It gives a nicer user experience and for this you would use GET and base your URLs on some sort of RESTful specification like http://microformats.org/wiki/rest/urls
If by 'web application' you mean 'website', as a developer you don't really have any choice. It's not you as a developer that makes the GET or POST requests, it's your user. They make the requests via their web browser.
When you request a web page by typing its URL into the address bar of the browser (or clicking a link, etc), the browser issues a GET request.
When you submit a form using a button, the browser makes a POST request if the form's method is set to POST (a form without a method attribute is submitted with GET).
In a GET request, additional data is sent in the query string. For example, the URL www.mysite.com?user=david&password=fish sends the two bits of data 'user' and 'password'.
In a POST request, the values in the form's controls (e.g. text boxes etc) are sent. This isn't visible in the address bar, but it's completely visible to anyone viewing your web traffic.
Both GET and POST are completely insecure unless SSL is used (e.g. web addresses beginning https).
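To make the "completely visible" point concrete, here is a small Python sketch that builds the raw text of both kinds of request for the user/password example above. The helper names are made up, and real browsers send more headers than this.

```python
from urllib.parse import urlencode

def build_get(host, path, data):
    """Raw HTTP/1.1 GET request with the data in the query string."""
    return f"GET {path}?{urlencode(data)} HTTP/1.1\r\nHost: {host}\r\n\r\n"

def build_post(host, path, data):
    """Raw HTTP/1.1 POST request with the same data in the body."""
    body = urlencode(data)
    return (f"POST {path} HTTP/1.1\r\nHost: {host}\r\n"
            "Content-Type: application/x-www-form-urlencoded\r\n"
            f"Content-Length: {len(body)}\r\n\r\n{body}")

data = {"user": "david", "password": "fish"}
print(build_get("www.mysite.com", "/", data))
print(build_post("www.mysite.com", "/", data))
```

In both cases the string user=david&password=fish travels as plain text. POST only moves it from the request line to the body, so without HTTPS anyone on the path can read it either way.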

Resources