What is the best HTTP status code to use when the requested language is unavailable? - http

I'm working on a site which will be available in multiple languages. I'm using subdomains to identify the locale (seems like the best option for us after reading: https://support.google.com/webmasters/answer/182192), so our English site will be at en.mysite.org, French at fr.mysite.org etc.
Not all translations will be available right away and some will never be available. So I have two scenarios which may require two different status codes:
When the user visits ru.mysite.org I would like to show them a page that tells them that the language is not yet available but will provide more information on how they can help make it available.
When the user visits pirate.mysite.org I would like them to know that it will likely never be available (and Google should probably also be unaware of the site).
Right now I'm simply rendering a 404 in both cases but I'm thinking that there may be a better practise for these cases particularly for SEO purposes. For scenario 1 I'm starting to think that 501 may make more sense. For the second scenario I'm not sure if there is a better option.

I think a 404 is reasonable for both cases:
404 Not Found
The requested resource could not be found but may be available again in the future. Subsequent requests by the client are permissible.
(Wikipedia)
HTTP does have language support via the Accept-Language header, but your site does not use it. (Why not?) For all HTTP knows, the fr and ru subdomains are just other parts of your site. Using the subdomain to represent language is not "the HTTP way" to return a different representation of the same resource. So as far as HTTP is concerned, a user has requested something which is not available, but it might be available in the future.
I do not think that either case represents a server error so 5xx is not the appropriate category.

Related

HTTP status code for "never going to exist"

I'm building a service where resources are divided into categories, e.g. example.com/category/foo would represent "foo" (think: similar to SO tags).
New categories may be added at any time and, when attempting to view a category that doesn't yet exist, users would be able to suggest it added.
However, I'd like to just outright ban some category names (e.g. NSFW terms). This means that (assuming "bar" is one such name) example.com/category/bar not only never existed, but is guaranteed never to exist in the future.
Which HTTP status code is appropriate for this situation?
Several ideas come to mind:
410 Gone - while this makes it clear the resource will not be available in the future, I'm not sure if it's appropriate as it also seems to imply it used to exist in the past.
400 Bad Request - the request is technically not malformed, so this is probably not the way to go.
404 Not Found initially seemed like the logical option, but doesn't convey the permanence of the ban, especially since I plan to use 404 for categories that don't yet exist, but can be suggested.
301 Moved Permanently and redirect to another page, either the home page or some other page explaining that some category names are banned.
I don't think there's such a thing as "will never exist" in the specs of HTTP. Let's think about it slightly differenly :
the resource is created through your website (and only through your website). If a resource is created, that means that whatever validation you put in place succeeded.
So to keep it simple stupid, you should stick to HTTP semantics. If someone hits a URL : example.com/category/{cat}, either you know {cat} (it's in your DB and has a valid name, right?) and process the request safley, or you have never seen {cat} before and you just return 404.
After all there's an infinity of possible values that one could use for {cat} and all of them would be valid URLS.
Hope it helps
I would suggest a 410 Gone is an appropriate response, assuming that your web clients are common and implement the http spec correctly (most popular ones do).
When looking on this page here: Status Code Definitions
The section about 410 gone says:
10.4.11 410 Gone
The requested resource is no longer available at the server and no
forwarding address is known. This condition is expected to be
considered permanent. Clients with link editing capabilities SHOULD
delete references to the Request-URI after user approval. If the
server does not know, or has no facility to determine, whether or not
the condition is permanent, the status code 404 (Not Found) SHOULD be
used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web
maintenance by notifying the recipient that the resource is
intentionally unavailable and that the server owners desire that
remote links to that resource be removed. Such an event is common for
limited-time, promotional services and for resources belonging to
individuals no longer working at the server's site. It is not
necessary to mark all permanently unavailable resources as "gone" or
to keep the mark for any length of time -- that is left to the
discretion of the server owner.
I take that to mean the resource definitely does not exist now and will not exist in the future, giving the permanence that you want, and it is also cachable unless stated otherwise.
Usually I would not need to know that it had "never existed before", I just know that it doesn't exist right now when i'm requesting information about it, and it is not going to exist in the future if i request it again. If I did need to learn that certain categories never existed in the past (e.g. a blacklist of categories), i would probably want a separate explicit request that I could use to get all of those categories rather than having to check each category one at a time at run time.

how to prevent vulnerability scanning

I have a web site that reports about each non-expected server side error on my email.
Quite often (once each 1-2 weeks) somebody launches automated tools that bombard the web site with a ton of different URLs:
sometimes they (hackers?) think my site has inside phpmyadmin hosted and they try to access vulnerable (i believe) php-pages...
sometimes they are trying to access pages that are really absent but belongs to popular CMSs
last time they tried to inject wrong ViewState...
It is clearly not search engine spiders as 100% of requests that generated errors are requests to invalid pages.
Right now they didn't do too much harm, the only one is that I need to delete a ton of server error emails (200-300)... But at some point they could probably find something.
I'm really tired of that and looking for the solution that will block such 'spiders'.
Is there anything ready to use? Any tool, dlls, etc... Or I should implement something myself?
In the 2nd case: could you please recommend the approach to implement? Should I limit amount of requests from IP per second (let's say not more than 5 requests per second and not more then 20 per minute)?
P.S. Right now my web site is written using ASP.NET 4.0.
Such bots are not likely to find any vulnerabilities in your system, if you just keep the server and software updated. They are generally just looking for low hanging fruit, i.e. systems that are not updated to fix known vulnerabilities.
You could make a bot trap to minimise such traffic. As soon as someone tries to access one of those non-existant pages that you know of, you could stop all requests from that IP address with the same browser string, for a while.
There are a couple of things what you can consider...
You can use one of the available Web Application Firewalls. It usually has set of rules and analytic engine that determine suspicious activities and react accordingly. For example in you case it can automatically block attempts to scan you site as it recognize it as a attack pattern.
More simple (but not 100% solution) approach is check referer url (referer url description in wiki) and if request was originating not from one of you page you rejected it (you probably should create httpmodule for that purpose).
And of cause you want to be sure that you site address all known security issues from OWASP TOP 10 list (OWASP TOP 10). You can find very comprehensive description how to do it for asp.net here (owasp top 10 for .net book in pdf), i also recommend to read the blog of the author of the aforementioned book: http://www.troyhunt.com/
Theres nothing you can do (reliabily) to prevent vulernability scanning, the only thing to do really is to make sure you are on top of any vulnerabilities and prevent vulernability exploitation.
If youre site is only used by a select few and in constant locations you could maybe use an IP restriction

appropriate user-agent header value

I'm using HttpBuilder (a Groovy HTTP library built on top of apache's httpclient) to sent requests to the last.fm API. The docs for this API say you should set the user-agent header to "something appropriate" in order to reduce your chances of getting blocked.
Any idea what kind of values would be deemed appropriate?
The name of your application including a version number?
I work for Last.fm. "Appropriate" means something which will identify your app in a helpful way to us when we're looking at our logs. Examples of when we use this information:
investigating bugs or odd behaviour; for example if you've found an edge case we don't handle, or are accidentally causing unusual load on a system
investigating behaviour that we think is inappropriate; we might want to get in touch to help your application work better with our services
we might use this information to judge which API methods are used, how often, and by whom, in order to do capacity planning or to get general statistics on the API eco-system.
A helpful (appropriate) User-Agent:
tells us the name and version of your application (preferably something unique and easy to find on Google!)
tells us the specific version of your application
might also contain a URL at which to find out more, e.g. your application's homepage
Examples of unhelpful (inappropriate) User-Agents:
the same as any of the popular web browsers
the default user-agent for your HTTP Client library (e.g. curl/7.10.6 or PEAR HTTP_Request)
We're aware that it's not possible to change the User-Agent sent when your application is browser-based (e.g. Javascript or Flash) and don't expect you to do so. (That shouldn't be a problem in your case.)
If you're using a 3rd party Last.fm API library, such as one of the ones listed at http://www.last.fm/api/downloads , then we would prefer it if you added extra information to the User-Agent to identify your application, but left the library name and version in there as well. This is immensely useful when tracking down bugs (in either our service or in the client libraries).

When should one use GET instead of POST in a web application?

It seems that sticking to POST is the way to go because it results in clean looking URLs. GET seems to create long confusing URLs. POST is also better in terms of security. Good for protecting passwords in forms. In fact I hear that many developers only use POST for forms. I have also heard that many developers never really use GET at all.
So why and in what situation would one use GET if POST has these 2 advantages?
What benefit does GET have over POST?
you are correct, however it can be better to use gets for search pages and such. Places where you WANT the URL's to be obvious and discoverable. If you look at Google's (or any search page), it puts a www.google.com/?q=my+search at the end so people could link directly to the search.
You actually use GET much more than you think. Simply returning the web page is a GET request. There are also POST, PUT, DELETE, HEAD, OPTIONS and these are all used in RESTful programming interfaces.
GET vs. POST has no implications on security, they are both insecure unless you use HTTP/SSL.
Check the manual, I'm surprised that nobody has pointed out that GET and POST are semantically different and intended for quite different purposes.
While it may appear in a lot of cases that there is no functional difference between the 2 approaches, until you've tested every browser, proxy and server combination you won't be able to rely on that being a consistent in every case. e.g. mobile devices / proxies often cache aggressivley even where they are requested not to (but I've never come across one which incorrectly caches a POST response).
The protocol does not allow for anything other than simple, scalar datatypes as parameters in a GET - e.g. you can only send a file using POST or PUT.
There are also implementation constraints - last time I checked, the size of a URL was limited to around 2k in MSIE.
Finally, as you've noted, there's the issue of data visibility - you may not want to allow users to bookmark a URL containing their credit card number / password.
POST is the way to go because it results in clean looking URLs
That rather defeats the purpose of what a URL is all about. Read RFC 1630 - The Need For a Universal Syntax.
Sometimes you want your web application to be discoverable as in users can just about guess what a URL should be for a certain operation. It gives a nicer user experience and for this you would use GET and base your URLs on some sort of RESTful specification like http://microformats.org/wiki/rest/urls
If by 'web application' you mean 'website', as a developer you don't really have any choice. It's not you as a developer that makes the GET or POST requests, it's your user. They make the requests via their web browser.
When you request a web page by typing its URL into the address bar of the browser (or clicking a link, etc), the browser issues a GET request.
When you submit a web page using a button, you make a POST request.
In a GET request, additional data is sent in the query string. For example, the URL www.mysite.com?user=david&password=fish sends the two bits of data 'user' and 'password'.
In a POST request, the values in the form's controls (e.g. text boxes etc) are sent. This isn't visible in the address bar, but it's completely visible to anyone viewing your web traffic.
Both GET and POST are completely insecure unless SSL is used (e.g. web addresses beginning https).

Why shouldn't data be modified on an HTTP GET request?

I know that using non-GET methods (POST, PUT, DELETE) to modify server data is The Right Way to do things. I can find multiple resources claiming that GET requests should not change resources on the server.
However, if a client were to come up to me today and say "I don't care what The Right Way to do things is, it's easier for us to use your API if we can just use call URLs and get some XML back - we don't want to have to build HTTP requests and POST/PUT XML," what business-conducive reasons could I give to convince them otherwise?
Are there caching implications? Security issues? I'm kind of looking for more than just "it doesn't make sense semantically" or "it makes things ambiguous."
Edit:
Thanks for the answers so far regarding prefetching. I'm not as concerned with prefetching since is mostly surrounding internal network API use and not visitable HTML pages that would have links that could be prefetched by a browser.
Prefetch: A lot of web browsers will use prefetching. Which means that it will load a page before you click on the link. Anticipating that you will click on that link later.
Bots: There are several bots that scan and index the internet for information. They will only issue GET requests. You don't want to delete something from a GET request for this reason.
Caching: GET HTTP requests should not change state and they should be idempotent. Idempotent means that issuing a request once, or issuing it multiple times gives the same result. I.e. there are no side effects. For this reason GET HTTP requests are tightly tied to caching.
HTTP standard says so: The HTTP standard says what each HTTP method is for. Several programs are built to use the HTTP standard, and they assume that you will use it the way you are supposed to. So you will have undefined behavior from a slew of random programs if you don't follow.
How about Google finding a link to that page with all the GET parameters in the URL and revisiting it every now and then? That could lead to a disaster.
There's a funny article about this on The Daily WTF.
GETs can be forced on a user and result in Cross-site Request Forgery (CSRF). For instance, if you have a logout function at http://example.com/logout.php, which changes the server state of the user, a malicious person could place an image tag on any site that uses the above URL as its source: http://example.com/logout.php. Loading this code would cause the user to get logged out. Not a big deal in the example given, but if that was a command to transfer funds out of an account, it would be a big deal.
Good reasons to do it the right way...
They are industry standard, well documented, and easy to secure. While you fully support making life as easy as possible for the client you don't want to implement something that's easier in the short term, in preference to something that's not quite so easy for them but offers long term benefits.
One of my favourite quotes
Quick and Dirty... long after the
Quick has departed the Dirty remains.
For you this one is a "A stitch in time saves nine" ;)
Security:
CSRF is so much easier in GET requests.
Using POST won't protect you anyway but GET can lead easier exploitation and mass exploitation by using forums and places which accepts image tags.
Depending on what you do in server-side using GET can help attacker to launch DoS (Denial of Service). An attacker can spam thousands of websites with your expensive GET request in an image tag and every single visitor of those websites will carry out this expensive GET request against your web server. Which will cause lots of CPU cycle to you.
I'm aware that some pages are heavy anyway and this is always a risk, but it's bigger risk if you add 10 big records in every single GET request.
Security for one. What happens if a web crawler comes across a delete link, or a user is tricked into clicking a hyperlink? A user should know what they're doing before they actually do it.
I'm kind of looking for more than just "it doesn't make sense semantically" or "it makes things ambiguous."
...
I don't care what The Right Way to do things is, it's easier for us
Tell them to think of the worst API they've ever used. Can they not imagine how that was caused by a quick hack that got extended?
It will be easier (and cheaper) in 2 months if you start with something that makes sense semantically. We call it the "Right Way" because it makes things easier, not because we want to torture you.

Resources