When does the standard 404 page appear? - http

I am building a simple HTTP server for a project.
Most websites have custom 404 error pages. Sometimes though, you'll see Firefox spitting a generic 404 page (or 405, etc...).
How does it decide what to do?
What should the HTTP response be?
Is "HTTP/1.0 404 NOT FOUND" enough?
Thanks

If the server can't find the requested resource (e.g. a web page), it sends HTTP/1.0 404 Not Found as the status line, the first line of the HTTP response, ahead of the headers.
Servers can map an error page to this status, so you get a readable error page. Browsers can also supply their own error page, so you may see a browser-specific 404 message instead.
You can see the error code in the status field in log files.
You can redirect the user to a specific page by serving a body like this with the error response (the browser waits 5 seconds, then loads not404.htm):
<html>
<head>
<meta http-equiv="refresh" content="5; url=not404.htm">
</head>
</html>
See details on Welcome to 404 Error Pages .com

It is perfectly valid to return an html body with a 404 response code. If no body is provided then the browser will show a default page.
If you only send HTTP/1.0 404 NOT FOUND then the browser default will be displayed.
If you add a body to the response the browser will mostly use that.
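For a hand-written server, the two cases described above can be sketched as follows. This is a minimal illustration, not a complete server: build_404_response is a hypothetical helper name, and the Content-Type, Content-Length and Connection headers are assumptions about what a simple server would reasonably send.

```python
def build_404_response(body_html: str = "") -> bytes:
    """Build a raw HTTP/1.0 404 response, optionally with an HTML body.

    With an empty body the browser is left to show its own default page;
    with a body, the browser will mostly render that body instead.
    """
    body = body_html.encode("utf-8")
    head_lines = [
        "HTTP/1.0 404 Not Found",                  # status line
        "Content-Type: text/html; charset=utf-8",  # assumed header
        f"Content-Length: {len(body)}",            # length in bytes
        "Connection: close",
    ]
    # An empty line (CRLF CRLF) separates the headers from the body.
    return ("\r\n".join(head_lines) + "\r\n\r\n").encode("ascii") + body


if __name__ == "__main__":
    print(build_404_response("<h1>Not here</h1>").decode("utf-8"))
```

The returned bytes can be written directly to the client socket before closing the connection.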

If you are creating an HTTP server you might want to look at the RFC that describes the protocol: http://www.faqs.org/rfcs/rfc2616.html
For the 404 status code it says:
The server has not found anything
matching the Request-URI. No
indication is given of whether the
condition is temporary or
permanent. The 410 (Gone) status code
SHOULD be used if the server knows,
through some internally configurable
mechanism, that an old resource is
permanently unavailable and has no
forwarding address. This status
code is commonly used when the server
does not wish to reveal exactly why
the request has been refused, or when
no other response is applicable.
You can't control how the browser will treat each status code; you should rely on its good behaviour.
That said, you may benefit from using one of the existing HTTP servers. Look at this question on how to create an HTTP server in C or C++, posted a few days ago.

So, Firefox won't show a generic 404 error page under most circumstances; you're thinking of Internet Explorer, which ignores a website's 404 page if it's below a certain size and displays its own.

Usually it is set up in the web server, i.e. when the server produces a 404, it serves this page.

Related

how to get a redirect with HTTP POST

I am trying to log in to a website with an HTTP POST. It doesn't work: I get status code 200 instead of 303. What mistakes could be the reason for that?
The typical explanation is that the website is designed to respond with the login page again if authentication fails. A 200 then only indicates that page content was returned successfully, while the content itself says that the login failed. Sites most likely do this for browser compatibility, since some non-standard browsers might react differently to non-200 responses.
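The behaviour described above can be reproduced locally. The sketch below is purely illustrative: LoginHandler, the /login and /account paths, and the credentials are all hypothetical, and a real site may behave differently. The client uses http.client, which does not follow redirects, so the 303 stays visible.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical login endpoint: 303 on success, 200 plus login page on failure."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        if form.get("password") == ["s3cret"]:
            self.send_response(303)
            self.send_header("Location", "/account")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"<form>Login failed, try again</form>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def post_login(port, password):
    """POST the form and return (status, Location header) without following redirects."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/login",
            urlencode({"user": "demo", "password": password}),
            {"Content-Type": "application/x-www-form-urlencoded"},
        )
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return post_login(port, "s3cret"), post_login(port, "wrong")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

If a real login keeps answering 200, comparing the returned body with the login page is a quick way to check whether the credentials or form fields were rejected.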

Server returns 404 for a web page, but page is showing fine in browser - why?

A strange web page crossed my way. (And being a developer I have to solve the mystery.)
When accessing the web page in any browser, all seems normal. The web page is displayed as expected.
But when looking in the console the server actually returns a 404 status code:
So why is the browser rendering a page?
Looking at the Body shows valid HTML is returned:
Hold on. Responding 404 and sending the HTML along the way? And the browser renders it??
Why is this happening? Is this some server misconfiguration? Or is something clever going on here that I don't understand? Is there a practical reason for configuring a server on purpose to behave like this?
Another answer on Stack Overflow contains some interesting information: an HTTP status code of 404 plus an HTML response body is actually recommended by the spec.
The 4xx class of status code is intended for cases in which the
client seems to have erred. Except when responding to a HEAD
request, the server SHOULD include a representation containing an
explanation of the error situation, and whether it is a temporary or
permanent condition. These status codes are applicable to any
request method. User agents SHOULD display any included
representation to the user.
This leaves me with two possible explanations:
Explanation 1: it's a server error.
the server wrongly returns a 404 status code
the browser thinks the response body contains details about the error and displays it - for the end user this is the actual page
Explanation 2: it's done on purpose to defeat crawlers and page watchers.
the server returns 404 on purpose - non-browser user agents won't process the result as they interpret it as error
browsers are unaffected, the end user doesn't care as long as the page is being displayed
The second one would indeed be kind of clever if you don't want your page to be indexed.
I faced the same situation. My portal was hosted on a Tomcat server. The portal loaded when the host name together with the Tomcat directory path was requested, but on loading, the web page redirected to a deep-link URL and rendered the page. If you requested the deep-link URL directly in the browser, the network tab in dev tools would show a 404 error even though the page rendered fine.
This happens because no resource matching the deep-link URL exists anywhere in the server config files, so when the server looks for it, it finds nothing and returns the 404 shown in the network tab.
But the browser behaves differently with the resource URL. It first loads and connects to the host name of the resource; when that succeeds, it is redirected according to the config file settings and renders the HTML and styling of the deep-link URL resource properly.
Note: I don't know whether this issue comes from me not being strict enough in the .htaccess or my CMS.
In my contrived .htaccess example I had the following rules to ignore these directories from being handled by the CMS.
RewriteCond $1 !^(branch|css|js|html|images) [NC]
I also had a branches directory inside my CMS' templates (created within CMS). I guess my .htaccess rule wasn't strict enough here. I had to change branch to branch\/, like so:
RewriteCond $1 !^(branch\/|css|js|html|images) [NC]
Only then would the page load without the 404 in the console.
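The difference between the two patterns can be checked outside Apache. The sketch below uses Python's re module as a stand-in for Apache's regex engine, which is an approximation that holds for patterns this simple; skipped_by_cms and the request paths are hypothetical names chosen for illustration.

```python
import re

# Patterns copied from the two RewriteCond lines above. [NC] means
# case-insensitive, which maps to re.IGNORECASE here.
LOOSE = re.compile(r"^(branch|css|js|html|images)", re.IGNORECASE)
STRICT = re.compile(r"^(branch\/|css|js|html|images)", re.IGNORECASE)


def skipped_by_cms(pattern, path):
    """Return True when the path matches the pattern.

    The RewriteCond is negated with "!", so a match means the condition
    fails and the request is not handed to the CMS.
    """
    return pattern.match(path) is not None


if __name__ == "__main__":
    # Hypothetical request paths.
    for path in ("branch/logo.png", "branches/overview", "css/site.css"):
        print(path, skipped_by_cms(LOOSE, path), skipped_by_cms(STRICT, path))
```

With the loose pattern, a path starting with branches is treated like the branch directory and never reaches the CMS; with branch\/ only paths under branch/ are excluded.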

How to know if it's actually a 404 page?

What I learned from Foregenix:
The HTTP 404 Not Found Error means that the webpage you were trying to reach could not be found on the server. It is a Client-side Error which means that either the page has been removed or moved and the URL was not changed accordingly, or that you typed in the URL incorrectly
But I also do web app pentests with Python, and I am wondering: if I only check for the string 404 on the page, it may not really be a 404 error.
It could happen that the page exists but the heading says 404 just to fool us.
So how exactly do I find out?
You can check the HTTP status code, and see if it is 404 or not. The status code is on the first line of the response:
HTTP/1.1 404 Not Found
If you are using httplib (http.client in Python 3), you can just read the status attribute of the HTTPResponse object.
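A minimal sketch of that check with Python 3's http.client follows. To stay self-contained it talks to a throwaway local server; MisleadingHandler and its paths are hypothetical and exist only to show that the status code and the page text can disagree.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MisleadingHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /real is a 200 whose body says 404; all else is a real 404."""

    def do_GET(self):
        if self.path == "/real":
            status, body = 200, b"<h1>404 Not Found</h1>"
        else:
            status, body = 404, b"<h1>Welcome</h1>"
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(host, port, path):
    """Return (status code, body) for a GET request."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()  # resp.status is the numeric code
    finally:
        conn.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), MisleadingHandler)  # port 0: free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return fetch("127.0.0.1", port, "/real"), fetch("127.0.0.1", port, "/missing")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for status, body in demo():
        print(status, body)
```

Against a real target, only the fetch function is needed; the decision should be based on the returned status, not on the body text.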
However, it is the server that decides what HTTP status code to send. Just because 404 is defined to mean "page not found" does not mean the server can not lie to you. It is quite common to do things like this:
Send 404 instead of 403, to hide the resource that requires authentication.
Send 404 instead of 500, to hide the fact something is not working.
Send 404 when your IP is blocked for some reason.
Without access to the server, it is impossible to know what is really going on behind the curtains.
You are right: someone could write "404 Page Not Found" in a HTML page and make you think that the page doesn't exist.
In order to properly recognize HTTP status codes such as the 404, you should capture the HTTP response with Python and parse it. HTTP 1 and HTTP 2 standards dictate that an HTTP response, which is written in the HTTP generic message format, must contain the status code.
Example of an HTTP response (from Tutorials Point):
HTTP/1.1 404 Not Found
Date: Sun, 18 Oct 2012 10:36:20 GMT
Server: Apache/2.2.14 (Win32)
Content-Length: 230
Connection: Closed
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>404 Not Found</title>
</head>
<body>
<h1>Not Found</h1>
<p>The requested URL /t.html was not found on this server.</p>
</body>
</html>
You should definitely not trust the HTML part, which can show a 404 error (or even a 418 I'm a teapot) when the page can in fact be found.
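One way to act on this is to read only the status line and ignore the body entirely. The following is a minimal sketch, assuming the raw HTTP/1.x response bytes have already been captured; parse_status_line is a hypothetical helper name.

```python
def parse_status_line(raw: bytes):
    """Return (version, status code, reason phrase) from a raw HTTP/1.x response.

    Only the first line is inspected; the HTML body is never consulted.
    """
    head, _, _body = raw.partition(b"\r\n\r\n")  # headers end at the empty line
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    version, code, *reason = status_line.split(" ", 2)
    return version, int(code), reason[0] if reason else ""


if __name__ == "__main__":
    # A page that exists (200) but whose body claims to be a 404.
    fake = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<h1>404 Not Found</h1>"
    )
    print(parse_status_line(fake))
```

For the fake response in the example, the function reports status 200 even though the body displays a 404 heading.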
In addition to Anders' answer, I found a way to detect some cases where 404 is misused with a Timing attack. It is hardly reliable, though.
Send 404 instead of 403, to hide the resource that requires authentication.
Servers often need more time to determine "you don't have authorization to get this resource", because that takes more round trips to external resources such as databases, than they need to determine "this is not there", which is quick to establish and quite often even cacheable.
A typical example in an MVC application with an RDBMS as backend is the difference between a simple SELECT COUNT(id) FROM articles WHERE id=123 LIMIT 1
and the much more complex SELECT access FROM accesses JOIN articles ON articles.id = accesses.foreign_id WHERE articles.id = 123 AND accesses.type='articles' AND accesses.user_id = (SELECT id FROM users WHERE token='t0k3n' LIMIT 1). And that assumes that the application can make such single-line queries in the first place: more often it is a lot of "fetch a user, extract some data, now fetch a Thing, now ask Thing if user may access it through an authorization-api".
Unless the developers or the framework of the site took care to cover this case, quite often you'll see a notable difference in time to serve both cases of 404.
Send 404 instead of 500, to hide the fact something is not working.
Typically, crashes or unexpected errors occur only after some code has run. 404 detection often comes early: after all, it is cheap to determine that something is not there (see above), whereas the error would occur later on. This means that such a 500 disguised as a 404 would quite often take a lot longer to reach you than a normal 404.
Send 404 when your IP is blocked for some reason.
Here, the timing is often the other way around, depending on the implementation. Such IP blocking is often kept outside of the web app (CMS etc.) because it is much simpler and more performant to handle higher up in the stack: the web server, a proxy, etc.
However, when the application itself takes care of this, generating an actual 404 is often reasonably cheap, whereas looking up an IP in a database, applying masks and so on takes some time, similar to hiding a 403 as a 404.
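The timing comparison described in this answer could be sketched like this. It is a rough illustration under stated assumptions: median_latency and looks_slower are hypothetical helpers, the 2.0 ratio and sample count are arbitrary, and the request functions stand for whatever code fetches a known-missing URL and the suspicious URL. As the answer says, the result is hardly reliable; network jitter alone can dominate.

```python
import statistics
import time


def median_latency(request_fn, samples=5):
    """Call request_fn several times and return the median duration in seconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        request_fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def looks_slower(baseline_fn, candidate_fn, ratio=2.0, samples=5):
    """Return True when the candidate's median latency exceeds ratio times the baseline's.

    baseline_fn should request a URL that certainly does not exist;
    candidate_fn should request the URL whose 404 is under suspicion.
    """
    baseline = median_latency(baseline_fn, samples)
    candidate = median_latency(candidate_fn, samples)
    return candidate > ratio * baseline


if __name__ == "__main__":
    # Stand-ins for real requests: a cheap 404 and a slow, disguised one.
    cheap_404 = lambda: time.sleep(0.001)
    slow_404 = lambda: time.sleep(0.02)
    print(looks_slower(cheap_404, slow_404))
```

The median is used rather than the mean so that a single slow outlier does not decide the outcome.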

How do servers behave when browsers request embedded resources that do not exist?

Let's take the following hypothetical situation:
an HTTP server has a custom error page set up /404.html and does a server-side forward for any URL that gives a 404 response (for example /blabla.html) to the 404.html page
a browser requests an existing page from the server, say /home.html
the page contains <img src="a.jpg" alt="a" />, but that resource does not exist on the server
the browser receives a 404 for the resource, marks it as missing and does not receive any response (tested this in Chrome and FF in the network tab of the dev console - the response bit is empty)
My question is: what happens on the server when the image is requested?
My guess is the browser cuts off the connection when it gets the 404 status in the header so it doesn't wait or download the response. My other guess is that it's implementation specific, but I'm curious if the servers notice that the connection has been cut off.
The browser will get your error page, but it can't handle HTML in an image. (It will throw an error in the console.)
If you did the same with a frame, it would show your error page.

when favicon.ico is requested?

When does a browser request favicon.ico? Is it after getting 200 HTTP status code? Or maybe before accessing page itself? I have no idea...
According to Will browsers request /favicon.ico or <link> first?, the <link> tag in the page source overrides the request to /favicon.ico, meaning the icon is requested after getting a response of some kind. I don't think a specific status matters much, except for redirects and the like, of course.

Resources