Serving 404 directly - nginx

So I have an Nginx server set up which is supposed to redirect all http to https (and non-www to www) using 4 server blocks.
The issue is that any 404 or non existent http URL first get a 301 redirect to what could have been an https version if it hypothetically existed (hence creating an extra URL and redirect).
See example:
1) http://example.com/thisurldoesntexit
301 Redirect
2) https://example.com/thisurldoesntexit
404
3) https://example.com/notfound
Is there a way to redirect user directly to a https 404 (URL 3)?

First of all, as already been pointed out, doing a 301 redirect from a non-existent page to a single /notfound moniker, is a really bad practice, and is likely against the RFCs.
What if the user simply mistyped a single character of a long URL? Modern browsers make it non-trivial to go back to what has been typed in order to correct it. The user would have to decide whether your site is worth a retyping from scratch, or whether your competitor might have thought of a better experience.
What if the user simply followed a broken link, which is broken in a very obvious way, and could be easily fixed? E.g., http://www.example.org/www.example.com/page, where an absolute URL was mistyped by the creator to be a relative one, or maybe a URI like /page.html., with an extra dot in the end. Likewise, you'll be totally confusing the user with what's going on, and offering a terrible user experience, where if left alone, the URL could easily have been corrected promptly.
But, more importantly, what real problem are you actually trying to solve?!
For better or worse, it's a pretty common practice to indiscriminately redirect from http to https scheme, without an account of whether a given page may or may not exist. In fact, if you employ HSTS, then content served over http effectively becomes meaningless; the browser with a policy would never even be requesting anything over http from there on out.
Undoubtedly, in order to know whether or not a given page exists, you must consult with the backend. As such, you might as well do the redirect from http to https from within your backend; but it'll likely tie up your valuable server resources for little to no extra benefit.
Moreover, the presence or absence of the page may be dictated by the contents of the cookies. As such, if you require that your backend must discern whether a page does or does not exist for an http request, then you'll effectively be leaking private information that was meant to be protected by https in the first place. (In turn, if your site has no such private information, then maybe you shouldn't be using https in the first place.)
So, overall, the whole approach is just a REALLY, REALLY bad idea!
Consider instead:
Do NOT do a 301 redirect from all non-existent pages to a single /notfound page. Very bad practice, very bad UX.
It is totally OK to do an indiscriminate redirect from http to https, without accounting for whether or not the page exists. In fact, it's not only okay, but it's the way God intended, because an adversary should not be capable of discerning whether or not a given page exists for an https-based site, so, if you do find and implement a solution for your "problem", then you'll effectively create a security vulnerability and a data leak.

Use https://www.drupal.org/project/fast_404 module for serving 404 pages directly without much overload.

I'd suggest redirecting to a 404 page is a poor choice, and you should instead serve the 404 on the incorrect URL.
My reasons for stating this are:
By redirecting away from the page, you are issuing headers that implicitly say "The content does not exist on this URL, but it does over here". I'm not sure how the various search engines would react to being redirected to a 404
I can speak from my own experience as a user when I say that having the URL change on me when I've mis-typed by a single character can be very frustrating. I then need to spend the time to type out the entire URL again.
You can avoid having logic in your .htaccess file or whatever to judge a page as a 404. This will greatly simplify your initial logic (which by-the-by gets computed on every single page load) - and will remove far more redirects than just the odd one of http://badurl to https://badurl to https://404

Related

Is a cached redirect loop possible?

tldr: 308 (permanent) redirects may be cached. What happens when a redirect loop includes one or more (or all) cached responses?
My situation: Say I serve a page /version-1 for a bit. After a while I set up /version-2 and start redirecting (code 308) all requests for /version-1 to /version-2. A while later I decide I actually preferred the first version and resume serving /version-1 while redirecting (308 again) /version-2 back to /version-1.
What is the expected behavior in this instance? Will a browser reuse the cached redirect(s) (causing an redirect loop), or is there a defined way it should break out of the loop?
(Please note that I am not asking how to improve this site design. It is an example to demonstrate my question. I am asking what the RFCs/actual browsers have to say about the situation.)
The closest I've been able to find myself is RFC 9110 Section 15.4 which states:
A client SHOULD detect and intervene in cyclical redirections (i.e., "infinite" redirection loops).
But afaict says nothing about eg. not reusing cached responses when attempting to resolve a redirect loop. What else is there?

What happens if a 302 URI can't be found?

If I make an HTTP request to get index.html on http://www.example.com but that URL has a 302 re-direct in place that points to http://www.foo.com/index.html, what happens if the redirect target (http://www.foo.com/index.html) isn't available? Will the user agent try the original URL (http://www.example.com/index.html) or just return an error?
Background to the question: I manage a legacy site that supports a few existing customers but doesn't allow new signs ups. Pretty much all the pages are redirected (using 302s rather than 301s for some unknown reason...) to a newer site. This includes the sign up page. On one of the pages that isn't redirected there is still a link to the sign up page which itself links through to a third party payment page (i.e. on another domain). Last week our current site went down for a couple of hours and in that period someone successfully signed up through the old site. The only way I can imagine this happened is that if a 302 doesn't find its intended URL some (all?) user agents bypass the redirect and then go to originally requested URL.
By the way, I'm aware there are many better ways to handle the particular situation we're in with the two sites. We're on it! This is just one of those weird situations I want to get to the bottom of.
You should receive a 404 Not Found status code.
Since HTTP is a stateless protocol, there is no real connection between two requests of a user agent. The redirection status codes are just a way for servers to politely tell their clients that the resource they were looking for is somewhere else now. The clients, however, are in no way obliged to actually request the resource from that other URL.
Oh, the signup page is at that URL now? Well then I don't want it anymore... I'll go and look at some kittens instead.
Moreover, even if the client decides to do request the new URL (which it usually does ^^), this can be considered as a completely new communication between server and client. Neither server nor client should remember that there was a previous request which resulted in a redirection status code. Instead, the current request should be treated as if it was the first (and only) request. And what happens when you request a URL that cannot be found? You get a 404 Not Found status code.

Can the actual index page be determined when requesting a directory?

When requesting http://example.com (using something like cURL), is there anyway to determine what the actual page server side is? Is it /index.php /index.html /index.asp?
This is a completely client side question.
Definitely no guaranteed way, there may not even be a default page at all on the server. Although there's usually some sort of page, script, or template associated with a given URL, it can be buried under several layers framework that make it not useful information anyway. You might be able to glean some extra info from the http response headers. But that's pretty much all you get on the client side.

Should I stop redirecting after successful POST or PUT requests?

It seems common in the Rails community, at least, to respond to successful POST, PUT or DELETE requests by redirecting instead of returning success. For instance, if I PUT a legal change to my user profile, the idiomatic response would be a 302 Redirect to the profile page.
Isn't this wrong? Shouldn't we be returning 200 OK from the request? Or a 201 Created, in the case of a POST request? Either of those, in the HTTP/1.1 Status Definitions are allowed to (or required to) include a response, anyway.
I guess I'm wondering, before I go and "fix" my application, whether there is there a darn good reason why the community has gone the way of redirects instead of successful responses.
I'll assume, your use of the PUT verb notwithstanding, that you're talking about a web app that will be accessed primarily through the browser. In that case, the usual reason for following up a POST with a redirect is the post-redirect-get pattern, which avoids duplicate requests caused by a user refreshing or using the back and forward controls of their browser. It seems that in many instances this pattern is overloaded by redirecting not to a success page, but to the next most likely place the user would visit. I don't think either way you mention is necessarily wrong, but doing the redirect may be more user-friendly at the expense of not strictly adhering to the semantics of HTTP.
It's called the POST-Redirect-GET (PRG) pattern. This pattern will prevent clients from (accidently) re-executing non-idempotent requests when for example navigating forth and back in browser's history.
It's a good general web development practice which doesn't only apply on RoR. I'd just keep it as is.
In a perfect world, yes, probably. However HTTP clients and servers are a mess when it comes to standardization and don't always agree on proper protocol. Redirecting after a post helps avoid things like duplicate form submissions.

Is it possible to redirect non-HTML files with HTTP? And chaining redirects?

I have been thinking about a neat way of load balancing and one thing that would be required is to be capable of loading an image on an HTML page from multiple locations without rewriting the URL(on each load)
So what I need to be able to do is have one URL which is the "static" URL. Such as http://example.com/myimage.png The image is not actually contained in example.com though. So example.com does a either a 302 or 301 or 307 HTTP response to cause a redirect to 2.example.com. How do browsers handle this with images like in this situation? Also, how do browsers handle multiple redirections for instance if 2.example.com also didn't contain it and it went to 3.example.com ? (Note, I am asking this because I've never seen a 301 redirect on anything but an HTML page)
Also, which status code would be best to use. 301 means "moved permanently" which this "move" isn't permanent so I don't want it cached. Should I use 307? Is that supported by search engines and modern browsers?
Redirect is an HTTP concept and applies to any resource that can be delivered over HTTP, not only HTML. Chaining redirects and non-HTML redirects work just fine in most modern browsers.
If you want temporary redirect, use 302, unless you want to redirect POST and PUTs as well. The problem is that most implementations will issue GET for the new resource address after POST or PUT that got 302.
Note that 303 and 307 are HTTP 1.1 specific.
I would advise against load balancing like this. Load balancing is not what 3xx responses are intended for.
The HTTP protocol has capabilities for caching which can help with reducing server load. There are also server technologies for load balancing. These technologies are well developed will be more stable and reusable.
As Benedict C says, I think you're barking up the wrong tree.
If you want to do load-balancing, then do load balancing. Round robin DNS is the simplest method (and is more effective in a number of regards than more expensive solutions). If you must try to load balance across servers with different FQDNs, then generate the URL client-side in javascript.
The remnants of your post are applicable to other questions about redirection. There's a lot of bad advice published about SEO. Google has approx 92% of the world market and publish quite detailled specs about how they spider and rank sites. Redirection within your domain should not affect your rankings in any competent search engine. Redirection outside your domain will only improve the ranking of the target.
Yes, browsers implement a limit on the number of redirections followed for a single request - but it varies by browser.

Resources