How to know if it's actually a 404 page? - http

What I learned from Foregenix:
The HTTP 404 Not Found Error means that the webpage you were trying to reach could not be found on the server. It is a client-side error, which means that either the page has been removed or moved and the URL was not changed accordingly, or that you typed in the URL incorrectly.
But I also do web app pentests with Python, and I am wondering: if I only check for the string "404" on the page, it may not really be a 404 error.
The page may well exist, with a "404" heading just to fool us.
So how exactly do I find out?

You can check the HTTP status code, and see if it is 404 or not. The status code is on the first line of the response:
HTTP/1.1 404 Not Found
If you are using Python's httplib (http.client in Python 3), you can just read the status attribute of the HTTPResponse object.
However, it is the server that decides what HTTP status code to send. Just because 404 is defined to mean "page not found" does not mean the server cannot lie to you. It is quite common to do things like this:
Send 404 instead of 403, to hide the resource that requires authentication.
Send 404 instead of 500, to hide the fact something is not working.
Send 404 when your IP is blocked for some reason.
Without access to the server, it is impossible to know what is really going on behind the curtains.
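A minimal sketch of that check in Python 3, assuming the standard-library http.client module. The helper names fetch_status and describe are made up for illustration. It reads the status attribute and compares it with what the page text claims; as noted above, it cannot tell you whether the server is lying.

```python
import http.client


def fetch_status(host, path, use_tls=True, timeout=10):
    """Return (status, reason, body) for a GET request to host + path."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")
        # resp.status is the integer from the status line, e.g. 404.
        return resp.status, resp.reason, body
    finally:
        conn.close()


def describe(status, body):
    """Compare the status code with what the page text claims."""
    if status == 404:
        return "server reports 404"
    if "404" in body:
        return "page mentions 404 but status is %d" % status
    return "status %d" % status
```

Usage would look like status, reason, body = fetch_status("example.com", "/no-such-page") followed by describe(status, body), where example.com is only a placeholder host.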

You are right: someone could write "404 Page Not Found" in an HTML page and make you think that the page doesn't exist.
To recognize HTTP status codes such as 404 properly, you should capture the HTTP response with Python and parse it. Both HTTP/1.x and HTTP/2 require every response to carry a status code: HTTP/1.x puts it in the status line of the generic message format, and HTTP/2 carries it in the :status pseudo-header.
Example of an HTTP response (from Tutorials Point):
HTTP/1.1 404 Not Found
Date: Sun, 18 Oct 2012 10:36:20 GMT
Server: Apache/2.2.14 (Win32)
Content-Length: 230
Connection: Closed
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>404 Not Found</title>
</head>
<body>
<h1>Not Found</h1>
<p>The requested URL /t.html was not found on this server.</p>
</body>
</html>
You should definitely not trust the HTML part, which can show a 404 error (or even a 418 I'm a teapot) when the page can in fact be found.
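To illustrate, here is a small sketch that reads the status code from the first line of a raw HTTP/1.x response and ignores the HTML entirely. The function name parse_status_line and the misleading sample response are invented for this example.

```python
def parse_status_line(raw):
    """Return (version, status_code, reason) from a raw HTTP/1.x response."""
    first_line = raw.split(b"\r\n", 1)[0].decode("iso-8859-1")
    # The reason phrase may contain spaces (or be missing), so split at most twice.
    version, code, *rest = first_line.split(" ", 2)
    if not version.startswith("HTTP/") or not (len(code) == 3 and code.isdigit()):
        raise ValueError("not an HTTP/1.x status line: %r" % first_line)
    return version, int(code), rest[0] if rest else ""


# A made-up response whose body claims "404" while the status line says 200.
misleading = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><head><title>404 Not Found</title></head></html>"
)
```

Here parse_status_line(misleading) reports status 200 even though the title says "404 Not Found", which is exactly the case the question worries about.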

In addition to Anders' answer, I found a way to detect some cases where 404 is misused: a timing attack. It is hardly reliable, though.
Send 404 instead of 403, to hide the resource that requires authentication.
Servers often need more time to determine "you don't have authorization to get this resource", because that takes more round trips to external resources such as databases, than to determine "this is not there", which is quite often cacheable and quick to determine.
A typical example in an MVC application with an RDBMS as backend is the difference between a simple SELECT COUNT(id) FROM articles WHERE id=123 LIMIT 1
and the much more complex SELECT access FROM accesses JOIN articles ON articles.id = accesses.foreign_id WHERE articles.id = 123 AND accesses.type='articles' AND accesses.user_id = (SELECT id FROM users WHERE token='t0k3n' LIMIT 1). And that assumes the application can make such single-line queries in the first place: more often it is a lot of "fetch a user, extract some data, now fetch a Thing, now ask Thing if user may access it through an authorization API".
Unless the developers or the framework of the site took care to cover this case, you will quite often see a noticeable difference in the time it takes to serve the two kinds of 404.
Send 404 instead of 500, to hide the fact something is not working.
Typically, crashes or unexpected errors occur only after some code has run. 404 detection often comes early: after all, it is cheap to determine that something is not there (see above), whereas the error would occur later on. This means that such a 500 hidden as a 404 would quite often take a lot longer to reach you than a normal 404.
Send 404 when your IP is blocked for some reason.
Here, the timing is often the other way around, depending on the implementation. Such IP blocking is often kept outside of the web app (CMS etc.) because it is much simpler and more performant to handle higher up in the stack: the web server, a proxy etc.
However, when the application itself takes care of this, generating an actual 404 is often reasonably cheap, whereas looking up an IP in a database, applying masks and so on takes some time. This is similar to hiding a 403 as a 404.
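A rough sketch of how such a comparison could be done in Python, assuming the standard-library http.client, time and statistics modules. The function names and the factor of 2 are arbitrary choices for illustration, and network jitter makes any single measurement meaningless, so collect several samples per URL.

```python
import http.client
import statistics
import time


def time_request(host, path, timeout=10):
    """Return (status, seconds) for one GET over HTTPS, connection setup included."""
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        start = time.perf_counter()
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, time.perf_counter() - start
    finally:
        conn.close()


def suspiciously_slow(baseline, candidate, factor=2.0):
    """True if the candidate's median time exceeds factor times the baseline median."""
    return statistics.median(candidate) > factor * statistics.median(baseline)
```

The idea is to time a path that certainly does not exist (say, a random string) as the baseline, then time the 404 you are suspicious about, and compare the medians. A large difference is a hint, not proof.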

Related

Is there any reason to use the HTTP 410 GONE status code?

When permanently removing a page from your website, are there any practical benefits to setting up a "410 GONE" HTTP response for the URL (vs. letting it 404)?
Yes, the 410 Gone HTTP status code conveys that the resource requested was once available in the past, but it has now been retired or made obsolete.
The 404 Not Found HTTP status code could imply that the website has been incorrectly updated so as to be missing a file that would normally be defined there. It could also mean that the requesting client referenced a resource that never did exist and probably never will.
The 410 Gone status can have more immediate SEO implications because it tells search engines that the missing resource was intentionally removed. That should hasten the reduction of future search references to that page more so than the 404 Not Found status.
I could imagine that if you have a public API and you finally disable your long-deprecated v1 after publishing v4 or so, you could use this status code to make it obvious to consumers of that API. But then again, one could argue that a 301 is also valid for this type of situation. It also depends on how different the new version is, and on whether there is an actual replacement or the resource is simply gone.
From RFC 9110:
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
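On the server side, the distinction could look roughly like this. It is only a sketch: the path sets are hypothetical, and a real site would more likely keep the list of retired URLs in its web server or CMS configuration.

```python
from http import HTTPStatus

# Hypothetical site layout, for illustration only.
LIVE_PATHS = {"/", "/products"}
RETIRED_PATHS = {"/summer-sale-2019"}


def status_for(path):
    """200 for live pages, 410 for deliberately removed ones, 404 otherwise."""
    if path in LIVE_PATHS:
        return HTTPStatus.OK
    if path in RETIRED_PATHS:
        # The owner knows this page existed and was removed on purpose.
        return HTTPStatus.GONE
    return HTTPStatus.NOT_FOUND
```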

What is the most appropriate HTTP response code for a resource known NOT to exist?

I wonder if there is a better choice than 404 when someone requests a page like http://www.example.com/page-that-never-existed-nor-will-ever-exist ("ever" meaning for the foreseeable future but for all intents and purposes: never ever).
For instance, I get requests for pages that some "clever" crawlers think might exist based on the structure they have encountered on the website or elsewhere on the web. They are not misspellings but requests that I know lead nowhere.
I don't want to use 301 Moved Permanently because nothing has moved and there is no logical destination to move to.
I don't want to use 410 Gone because it was never there in the first place.
I would also like something more fitting than 404 Not Found, because I would really like to give the message "Does Not Exist", not just "Not Found, what happened? Who knows?". How can I tell a User-Agent that asking for it again is a waste of time for both of us?
Based on HTTP 1.1, 404 Not Found seems like the most correct option, because the definition ends with "or when no other response is applicable" but I am not fully satisfied with that. Any other idea?
Have you considered 403 Forbidden? It sounds like what you might be looking for and you can include a message in the body of the response that tells the client that the resource will never exist.
The server understood the request, but is refusing to fulfill it.
Authorization will not help and the request SHOULD NOT be repeated. If
the request method was not HEAD and the server wishes to make public
why the request has not been fulfilled, it SHOULD describe the reason
for the refusal in the entity. If the server does not wish to make
this information available to the client, the status code 404 (Not
Found) can be used instead.

HTTP Status Code for Resource not yet available

I have a REST endpoint accepting a POST request to mark a code as redeemed. The code can only be redeemed between certain dates.
How should I respond if someone attempts to redeem the code early?
I suspect HTTP 403, Forbidden, is the right choice but then the w3c states that "the request SHOULD NOT be repeated" whereas in this case I would anticipate the request being repeated, just at a later date.
409 Conflict
The request could not be completed due to a conflict with the current
state of the resource. This code is only allowed in situations where
it is expected that the user might be able to resolve the conflict and
resubmit the request. The response body SHOULD include enough
information for the user to recognize the source of the conflict.
Ideally, the response entity would include enough information for the
user or user agent to fix the problem; however, that might not be
possible and is not required.
403 Forbidden makes more sense if they are trying to redeem a coupon that has already been redeemed, though 410 Gone seems elegant in this situation as well.
404 Not Found isn't ideal because the resource does in fact exist, however you can use it if you don't want to specify a reason with the 403 or if you want to hide the existence of the resource for security reasons.
If you are using HATEOAS, then you can also head your clients off at the pass (so to speak) by only including a redeem hypermedia control in the coupon resource (retrieved via a GET) when the coupon can be redeemed; though this won't stop overly bound clients from trying to redeem it anyway.
EDIT: Thanks to some good critiques (see below), I want to caveat this answer. It is based on Richardson & Ruby's writeup, which arguably doesn't mesh well with the httpbis writing on 403 Forbidden. (Personally, I am now leaning towards 409, as explained by Tom in a separate answer.)
403 Forbidden is the best choice. I will cite RESTful Web Services by Richardson & Ruby line by line. As you will see, 403 is a great fit:
The client's request is formed correctly, but the server doesn't want to carry it out.
Check!
This is not merely the case of insufficient credentials: that would be a 401 ("Unauthorized"). This is more like a resource that is only accessible at certain times, or from certain IP addresses.
Check!
A response of 403 implies that the client requested a resource that really exists. As with 401 ("Unauthorized"), if the server doesn't want to give out even this information, it can lie and send a 404 ("Not Found") instead.
You wrote above: "The Code representation is available to be GETted before it goes live." So, you aren't trying to hide anything. So, stick with the 403. Check!
If the client's request is well-formed, why is this status code in the 4xx series (client-side error) instead of the 5xx series (server-side error)? Because the server made its decision based on some aspect of the request other than its form; say, the time of day the request was made.
Check! The client's request was formed correctly, but it was inappropriate for the particular time.
We went four for four. The 403 code is a winner. No other codes match as well.
All of this said, a plain, non-specific 400 wouldn't be wrong, but would not be as specific or useful.
Another answer suggested the 409 Conflict code. Although worth considering, it isn't as good a fit. Here is why. According to Richardson & Ruby again:
Getting this [409] response means that you tried to put the server's resources into an impossible or inconsistent state. Amazon S3 gives this response code when you try to delete a bucket that is not empty.
Claiming a promotion before it is 'active' wouldn't "put a server resource into an inconsistent state." It would break some business rules -- and result in cheating -- but it wouldn't cause a logical contradiction that I see.
So, whether you realized it at the onset of asking your question or not, 403 is a great choice. :)
Since REST URLs should represent resources, I would reply with 404 Not Found.
The resource is only available between certain dates, so on any other date it is not found.
When it says the request "SHOULD NOT be repeated", it is referring to the message that you should send to the viewer.
It has nothing to do with whether an actual request is repeated. (The user will get the same 403 message over and over again if s/he so desires.)
That said, a 404 is not appropriate for this because the resource is available - just that the code is not redeemable/forbidden to redeem. It is actually harmful because it tells the user that you probably made a mistake in your URL link or server configuration.
Of course, this assumes that on the appropriate date you return a 200 instead.
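Since the answers above disagree on the code for an early attempt, here is a sketch that keeps that choice as a parameter rather than settling it. The function name, the parameter names and both defaults are assumptions made for illustration; none of them come from the original posts.

```python
from datetime import datetime
from http import HTTPStatus


def redeem_status(now, valid_from, valid_until,
                  early_status=HTTPStatus.CONFLICT,
                  late_status=HTTPStatus.FORBIDDEN):
    """Status code for a POST that tries to redeem a code at time `now`."""
    if now < valid_from:
        # Too early: the answers above argue for 403, 404 or 409.
        return early_status
    if now > valid_until:
        # Too late: the default here is an assumption, not something the thread settles.
        return late_status
    return HTTPStatus.OK
```

Whichever code is chosen, the response body can explain when the code becomes redeemable, as both the 403 and the 409 definitions quoted above suggest.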

Is it acceptable to modify the text sent with the HTTP status code?

I'm implementing a 'testing mode' with my website which will forbid access to certain pages while they are undergoing construction, making them only accessible to administrators for private testing. I was planning on using the 401 status code, since the page does exist but they are not allowed to use it, and they may or may not be authenticated, yet only certain users (basically me) would still be allowed to access the page.
The thing I'm wondering is if the text after the HTTP/1.1 401 part mattered? Does it have to be Unauthorized or can it basically be whatever you want to put after it, so long as the 401 is still appropriate for the error? I wanted to send a message such as Temporarily Unavailable to indicate that the page is normally available to all visitors, but is undergoing reconstruction and is temporarily unavailable. Should I do this or not?
You may change them.
The status messages (technically called "reason phrases") are only recommendations and "MAY be replaced by local equivalents without affecting the protocol."
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1
However, you SHOULD :-) still use the codes properly and give meaningful messages. Only use a 401 if your condition is what the RFC says a 401 should be.
Yes, the reason phrase can be changed. It doesn't affect the meaning of the message.
But if you need to say "temporarily unavailable", you need to make it 5xx (server) code. 503 seems right here (see RFC 2616, Section 10.5.4).
You MAY change the text (very few HTTP clients pay any attention to it), but it is better to use the most applicable response code. After all, indicating the reason for failure is how the various response codes were intended to be used.
Perhaps this fits:
404 Not Found The requested resource could not be found but may be
available again in the future. Subsequent requests by the client
are permissible.
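A small Python sketch of the point made in these answers: the reason phrase is free text, while clients act on the numeric code. The helper names status_line and code_of are invented for this example.

```python
from http import HTTPStatus


def status_line(code, reason=None, version="HTTP/1.1"):
    """Build a status line; fall back to the standard phrase if none is given."""
    if reason is None:
        reason = HTTPStatus(code).phrase
    return "%s %d %s" % (version, code, reason)


def code_of(line):
    """What a client typically keys off: the three-digit code, not the phrase."""
    return int(line.split(" ", 2)[1])
```

So status_line(503, "Temporarily Unavailable") and status_line(503) differ only in text, and code_of returns 503 for both. If the site happens to run on Python's http.server, BaseHTTPRequestHandler.send_response(code, message) takes the custom phrase as its second argument.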

When does the standard 404 page appear?

I am building a simple HTTP server for a project.
Most websites have custom 404 error pages. Sometimes though, you'll see Firefox spitting a generic 404 page (or 405, etc...).
How does it decide what to do?
What should the HTTP response be?
Is "HTTP/1.0 404 NOT FOUND" enough?
Thanks
If the server can't find the requested resource (e.g. a webpage), it sends HTTP/1.0 404 NOT FOUND as the status line of the HTTP response.
Servers can map an error page to this error, so you get a readable error page. Browsers can also map their own error page, so you may see a browser-specific 404 error message.
You can see the error code in the status field in log files.
You can redirect your user to a specific page with this structure:
<HTML>
<head>
<meta HTTP-EQUIV="Refresh" CONTENT="5; URL=not404.htm">
</head>
</HTML>
See details on Welcome to 404 Error Pages .com
It is perfectly valid to return an HTML body with a 404 response code. If no body is provided then the browser will show a default page.
If you only send HTTP/1.0 404 NOT FOUND then the browser default will be displayed.
If you add a body to the response the browser will mostly use that.
If you are creating an HTTP server you might want to look at the RFC that describes the protocol: http://www.faqs.org/rfcs/rfc2616.html
For the 404 status code it says:
The server has not found anything
matching the Request-URI. No
indication is given of whether the
condition is temporary or
permanent. The 410 (Gone) status code
SHOULD be used if the server knows,
through some internally configurable
mechanism, that an old resource is
permanently unavailable and has no
forwarding address. This status
code is commonly used when the server
does not wish to reveal exactly why
the request has been refused, or when
no other response is applicable.
You can't control how the browser will treat each status code; you should rely on its good behaviour.
That said, you may benefit from using one of the existing HTTP servers. Look at this question on how to create an HTTP server in C or C++, posted a few days ago.
So, Firefox won't show a generic 404 error page under most circumstances; you're thinking of Internet Explorer, which ignores a website's 404 page if it's below a certain size and displays its own.
Usually it is set up in the web server, i.e.: when the server gets a 404, refer it to this page.
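For someone writing their own server, a complete 404 response could be assembled roughly like this. It is a sketch in Python rather than a reference implementation; the function name build_404 and the choice of headers are assumptions, and the original question does not say which language the project uses.

```python
def build_404(body_html=""):
    """Return a complete HTTP/1.0 404 response as bytes, with an optional HTML body."""
    body = body_html.encode("utf-8")
    headers = [
        "HTTP/1.0 404 Not Found",
        "Content-Type: text/html; charset=utf-8",
        # Content-Length counts bytes of the encoded body, not characters.
        "Content-Length: %d" % len(body),
        "Connection: close",
    ]
    # A blank line (CRLF CRLF) separates the headers from the body.
    return "\r\n".join(headers).encode("ascii") + b"\r\n\r\n" + body
```

Calling build_404() with no argument gives the status line and headers only, which is the case where the browser falls back to its own page. Passing some HTML gives the custom error page described in the answers above.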
