Browser cache behaviour for redirects - http

I'm trying to figure out if redirecting all www.example.com requests to example.com will be beneficial for caching or not, to which end I have 2 questions. SEO is not an issue here.
If the browser requests an image from the www URL (#1) and gets HTTP redirected to the www-less version (#2), will it store the result as cache value for just #2, or #1 as well.
The browser will occasionally ask for a new version of the image (and might get it, or a "Not Modified" response). There will then be an overhead for having to process the redirect every time. Is this overhead larger than the cost of storing two versions of the same image?

If the browser requests an image from the www URL (#1) and gets HTTP redirected to the www-less version (#2), will it store the result as cache value for just #2, or #1 as well.
#: See W3C Status Code Definitions for 301. If it's a 301 redirect, it 'should' be cacheable. See How to Redirect a Web Page (301).
The browser will occasionally ask for a new version of the image (and might get it, or a "Not Modified" response). There will then be an overhead for having to process the redirect every time. Is this overhead larger than the cost of storing two versions of the same image?
#: I'm not exactly sure about this. I suppose that if the redirect is handled by the web server (IIS, Apache, etc.), the overhead should be minimal. Don't quote me though :P
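To make the first point concrete, here is a minimal sketch of a toy browser cache keyed by URL. It is not how any real browser is implemented: the ToyBrowser class, the ORIGIN table and the logo.png URLs are all made up for illustration, and every cached entry is treated as fresh forever. What it shows is that the redirect is stored under #1 and the image body only under #2, so the image itself is never stored twice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Response:
    status: int
    location: Optional[str] = None
    body: bytes = b""


# Hypothetical origin server: the www URL permanently redirects to the bare domain.
ORIGIN = {
    "http://www.example.com/logo.png": Response(301, location="http://example.com/logo.png"),
    "http://example.com/logo.png": Response(200, body=b"PNG-bytes"),
}


@dataclass
class ToyBrowser:
    cache: dict = field(default_factory=dict)
    network_requests: list = field(default_factory=list)

    def fetch(self, url: str, max_hops: int = 5) -> Response:
        for _ in range(max_hops):
            response = self.cache.get(url)
            if response is None:
                # Cache miss: go to the network and store the response under this URL.
                self.network_requests.append(url)
                response = ORIGIN[url]
                self.cache[url] = response
            if response.status == 301:
                # Follow the (possibly cached) redirect to its target.
                url = response.location
                continue
            return response
        raise RuntimeError("too many redirects")
```

After one fetch of the www URL, the cache holds a small redirect entry for #1 and the image bytes for #2; a second fetch is answered without any network request in this simplified model.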

Related

Is a cached redirect loop possible?

tldr: 308 (permanent) redirects may be cached. What happens when a redirect loop includes one or more (or all) cached responses?
My situation: Say I serve a page /version-1 for a bit. After a while I set up /version-2 and start redirecting (code 308) all requests for /version-1 to /version-2. A while later I decide I actually preferred the first version and resume serving /version-1 while redirecting (308 again) /version-2 back to /version-1.
What is the expected behavior in this instance? Will a browser reuse the cached redirect(s) (causing a redirect loop), or is there a defined way it should break out of the loop?
(Please note that I am not asking how to improve this site design. It is an example to demonstrate my question. I am asking what the RFCs/actual browsers have to say about the situation.)
The closest I've been able to find myself is RFC 9110 Section 15.4 which states:
A client SHOULD detect and intervene in cyclical redirections (i.e., "infinite" redirection loops).
But afaict it says nothing about e.g. not reusing cached responses when attempting to resolve a redirect loop. What else is there?
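For illustration only, here is one way a client could "intervene" in such a loop: follow redirects using cached entries first and, if a cycle shows up, retry once while ignoring the cache. This is an assumed strategy, not something RFC 9110 prescribes and not a description of any particular browser. The resolve function and the two dictionaries (cached redirects and what the origin currently answers) are hypothetical.

```python
def resolve(start, cached, origin, max_hops=20):
    """Return the final URL, or None if the redirects loop even without the cache.

    cached: redirects remembered by the client (URL -> target)
    origin: redirects the server currently sends (URL -> target)
    """

    def follow(lookup):
        seen = []
        url = start
        while len(seen) < max_hops:
            if url in seen:
                return None  # cycle detected
            seen.append(url)
            target = lookup(url)
            if target is None:
                return url  # no redirect here: this is the final URL
            url = target
        return None  # gave up after too many hops

    # First attempt: prefer cached redirects, fall back to the origin.
    result = follow(lambda u: cached.get(u, origin.get(u)))
    if result is None:
        # Second attempt: bypass the cache and ask the origin for every hop.
        result = follow(origin.get)
    return result
```

With the situation from the question, the stale cached 308 from /version-1 to /version-2 combined with the current redirect from /version-2 back to /version-1 forms a cycle, and the cache-bypassing retry ends on /version-1. A loop that exists on the origin itself still yields None.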

Serving 404 directly

So I have an Nginx server set up which is supposed to redirect all http to https (and non-www to www) using 4 server blocks.
The issue is that any 404 or non-existent http URL first gets a 301 redirect to what could have been an https version if it hypothetically existed (hence creating an extra URL and redirect).
See example:
1) http://example.com/thisurldoesntexit
301 Redirect
2) https://example.com/thisurldoesntexit
404
3) https://example.com/notfound
Is there a way to redirect user directly to a https 404 (URL 3)?
First of all, as has already been pointed out, doing a 301 redirect from a non-existent page to a single /notfound moniker is a really bad practice, and is likely against the RFCs.
What if the user simply mistyped a single character of a long URL? Modern browsers make it non-trivial to go back to what has been typed in order to correct it. The user would have to decide whether your site is worth a retyping from scratch, or whether your competitor might have thought of a better experience.
What if the user simply followed a broken link, which is broken in a very obvious way, and could be easily fixed? E.g., http://www.example.org/www.example.com/page, where an absolute URL was mistyped by the creator to be a relative one, or maybe a URI like /page.html., with an extra dot in the end. Likewise, you'll be totally confusing the user with what's going on, and offering a terrible user experience, where if left alone, the URL could easily have been corrected promptly.
But, more importantly, what real problem are you actually trying to solve?!
For better or worse, it's a pretty common practice to indiscriminately redirect from http to https scheme, without an account of whether a given page may or may not exist. In fact, if you employ HSTS, then content served over http effectively becomes meaningless; the browser with a policy would never even be requesting anything over http from there on out.
Undoubtedly, in order to know whether or not a given page exists, you must consult with the backend. As such, you might as well do the redirect from http to https from within your backend; but it'll likely tie up your valuable server resources for little to no extra benefit.
Moreover, the presence or absence of the page may be dictated by the contents of the cookies. As such, if you require that your backend must discern whether a page does or does not exist for an http request, then you'll effectively be leaking private information that was meant to be protected by https in the first place. (In turn, if your site has no such private information, then maybe you shouldn't be using https in the first place.)
So, overall, the whole approach is just a REALLY, REALLY bad idea!
Consider instead:
Do NOT do a 301 redirect from all non-existent pages to a single /notfound page. Very bad practice, very bad UX.
It is totally OK to do an indiscriminate redirect from http to https, without accounting for whether or not the page exists. In fact, it's not only okay, but it's the way God intended, because an adversary should not be capable of discerning whether or not a given page exists for an https-based site, so, if you do find and implement a solution for your "problem", then you'll effectively create a security vulnerability and a data leak.
Use the https://www.drupal.org/project/fast_404 module for serving 404 pages directly without much overhead.
I'd suggest redirecting to a 404 page is a poor choice, and you should instead serve the 404 on the incorrect URL.
My reasons for stating this are:
By redirecting away from the page, you are issuing headers that implicitly say "The content does not exist on this URL, but it does over here". I'm not sure how the various search engines would react to being redirected to a 404.
I can speak from my own experience as a user when I say that having the URL change on me when I've mis-typed by a single character can be very frustrating. I then need to spend the time to type out the entire URL again.
You can avoid having logic in your .htaccess file or whatever to judge a page as a 404. This will greatly simplify your initial logic (which by-the-by gets computed on every single page load) - and will remove far more redirects than just the odd one of http://badurl to https://badurl to https://404
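The behaviour recommended in these answers can be sketched as a small decision function. It is written in Python rather than as Nginx configuration, and the respond function and its parameters are invented for this example. The http branch redirects without ever checking whether the page exists, and the https branch answers 404 on the requested URL itself instead of redirecting to /notfound.

```python
def respond(scheme, host, path, existing_paths):
    """Return (status, headers) for a request; a sketch, not a full server."""
    if scheme == "http":
        # Indiscriminate upgrade: existence of the page is deliberately not consulted,
        # so nothing about the site's contents leaks over plain http.
        return 301, {"Location": f"https://{host}{path}"}
    if path not in existing_paths:
        # Serve the 404 in place; the URL in the address bar stays as typed.
        return 404, {}
    return 200, {}
```

For example, respond("http", "example.com", "/thisurldoesntexit", {"/"}) gives a 301 to the https URL, and the follow-up https request gets a plain 404 with no Location header.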

What happens if a 302 URI can't be found?

If I make an HTTP request to get index.html on http://www.example.com but that URL has a 302 re-direct in place that points to http://www.foo.com/index.html, what happens if the redirect target (http://www.foo.com/index.html) isn't available? Will the user agent try the original URL (http://www.example.com/index.html) or just return an error?
Background to the question: I manage a legacy site that supports a few existing customers but doesn't allow new signs ups. Pretty much all the pages are redirected (using 302s rather than 301s for some unknown reason...) to a newer site. This includes the sign up page. On one of the pages that isn't redirected there is still a link to the sign up page which itself links through to a third party payment page (i.e. on another domain). Last week our current site went down for a couple of hours and in that period someone successfully signed up through the old site. The only way I can imagine this happened is that if a 302 doesn't find its intended URL some (all?) user agents bypass the redirect and then go to originally requested URL.
By the way, I'm aware there are many better ways to handle the particular situation we're in with the two sites. We're on it! This is just one of those weird situations I want to get to the bottom of.
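You should receive a 404 Not Found status code (or a connection error, if the target server cannot be reached at all). In neither case will the user agent fall back to the original URL.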
Since HTTP is a stateless protocol, there is no real connection between two requests of a user agent. The redirection status codes are just a way for servers to politely tell their clients that the resource they were looking for is somewhere else now. The clients, however, are in no way obliged to actually request the resource from that other URL.
Oh, the signup page is at that URL now? Well then I don't want it anymore... I'll go and look at some kittens instead.
Moreover, even if the client decides to request the new URL (which it usually does ^^), this can be considered a completely new communication between server and client. Neither server nor client should remember that there was a previous request which resulted in a redirection status code. Instead, the current request should be treated as if it was the first (and only) request. And what happens when you request a URL that cannot be found? You get a 404 Not Found status code.
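As a rough sketch of that reasoning, the toy client below follows a redirect and simply reports whatever the target answers. The follow function and the responses table are hypothetical; URLs that are missing from the table are treated as 404, which models "the target cannot be found" but not a server that is down entirely.

```python
REDIRECT_STATUSES = (301, 302, 307, 308)


def follow(url, responses, max_hops=5):
    """Return the list of (url, status) pairs the client saw, in order."""
    trail = []
    for _ in range(max_hops):
        # Anything not listed is assumed to answer 404 Not Found.
        status, location = responses.get(url, (404, None))
        trail.append((url, status))
        if status in REDIRECT_STATUSES and location:
            url = location  # a fresh, independent request to the new URL
            continue
        return trail
    raise RuntimeError("too many redirects")
```

With a 302 from http://www.example.com/index.html to http://www.foo.com/index.html and no entry for the target, the trail ends with the 404 from the target, and the original URL is requested exactly once.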

Any performance difference between http 304 and 404?

We have a web site that does not have a favicon (favourite icon).
Therefore, we get a lot of http 404 errors for the file favicon.ico.
For the functionality of the web site it does not make a difference.
But I was wondering if the system uses more time looking for a file that is not there, rather than returning a 304 not modified?
At max load we have ca. 15,000 concurrent connections across all frontend servers.
No, the performance difference is insignificant - if you don't want to have a favicon, I'd suggest creating a 0-byte /favicon.ico : the logs will stop complaining, and the browsers will behave as if there's no favicon.
Also, you could set a far future Expires header for your favicon - that way, the clients will only request it once, further reducing the load.
In terms of the load the request causes on the server, there is no difference.
In terms of network bandwidth, a valid 304 response will be a bit bigger, since you also have to include at least a Date header and an Expires or ETag header in the response.
If the idea is to send a response without content, then I think 204 No Content is more appropriate.
If you are absolutely sure that the web site will never ever have a favicon, you could use a 410 Gone response. That tells the client/browser not to come back and ask again. It is also more likely to be cached by a proxy server than a 404.
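The options suggested above (an empty file, 204 No Content, 410 Gone, each with a long cache lifetime) could look roughly like this on a server built with Python's standard http.server. Note the assumptions: Cache-Control: max-age is used here in place of the Expires header mentioned above, and the favicon_response function, the strategy names and the FaviconHandler class are made up for this sketch.

```python
from http.server import BaseHTTPRequestHandler

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def favicon_response(strategy):
    """Return (status, headers) for a site without a favicon."""
    long_lived = {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}"}
    if strategy == "empty-file":
        # Equivalent of serving a 0-byte /favicon.ico.
        return 200, {**long_lived, "Content-Length": "0"}
    if strategy == "no-content":
        return 204, long_lived
    if strategy == "gone":
        return 410, long_lived
    raise ValueError(f"unknown strategy: {strategy}")


class FaviconHandler(BaseHTTPRequestHandler):
    STRATEGY = "no-content"  # hypothetical setting; pick one of the three above

    def do_GET(self):
        if self.path != "/favicon.ico":
            self.send_error(404)
            return
        status, headers = favicon_response(self.STRATEGY)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
```

Whichever status is chosen, the long max-age is what keeps clients from asking again on every page view.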

HTTP redirect: 301 (permanent) vs. 302 (temporary)

Is the client supposed to behave differently? How?
Status 301 means that the resource (page) is moved permanently to a new location. The client/browser should not attempt to request the original location but use the new location from now on.
Status 302 means that the resource is temporarily located somewhere else, and the client/browser should continue requesting the original url.
When a search engine spider finds a 301 status code in the response header of a webpage, it understands that this webpage no longer exists at that URL. It reads the Location header of the response, picks up the new URL, replaces the indexed URL with the new one, and also transfers PageRank.
So the search engine replaces every indexed URL that no longer exists (301 found) with the new URL. This retains your old webpage's traffic and PageRank and diverts them to the new one (you will not lose the traffic of your old webpage).
Browser: if a browser finds 301 status code then it caches the mapping of the old URL with the new URL, the client/browser will not attempt to request the original location but use the new location from now on unless the cache is cleared.
When a search engine spider finds 302 status for a webpage, it will only redirect temporarily to the new location and crawl both of the pages. The old webpage URL still exists in the search engine database and it always attempts to request the old location and crawl it. The client/browser will still attempt to request the original location.
Read more about how to implement it in asp.net c# and what is the impact on search engines -
http://www.dotnetbull.com/2013/08/301-permanent-vs-302-temporary-status-code-aspnet-csharp-Implementation.html
Mostly 301 vs 302 is important for indexing in search engines, as their crawlers take this into account and transfer PageRank when using 301.
See Peter Lee's answer for more details.
301 redirects are cached indefinitely (at least by some browsers).
This means, if you set up a 301, visit that page, you not only get redirected, but that redirection gets cached.
When you visit that page again, your Browser* doesn't even bother to request that URL, it just goes to the cached redirection target.
The only way to undo a 301 for a visitor with that redirection in Cache, is re-redirecting back to the original URL**. In that case, the Browser will notice the loop, and finally really request the entered URL.
Obviously, that's not an option if you decided to 301 to facebook or any other resource you're not fully under control.
Unfortunately, many Hosting Providers offer a feature in their Admin Interface simply called "Redirection", which does a 301 redirect. If you're using this to temporarily redirect your domain to facebook as a coming soon page, you're basically screwed.
*at least Chrome and Firefox, according to How long do browsers cache HTTP 301s?. Just tried it with Chrome 45.
Edit: Safari 7.0.6 on Mac also caches, a browser restart didn't help (Link says that on Safari 5 on Windows it does help.)
**I tried JavaScript window.location = '', because it would be the solution that could be applied in most cases, but it doesn't work: it results in an undetected infinite loop. However, PHP header('Location: new.url') does break the loop.
Bottom Line: only use 301s if you're absolutely sure you're never going to use that URL again. Usually never on the root dir (example.com/)
301 means that the requested resource has been assigned a new permanent URI, and any future references to this resource should use one of the returned URIs.
302 means that the requested resource resides temporarily under a different URI.
Since the redirection may be altered on occasion, the client should continue to use the Request-URI for future requests.
A 302 response is only cacheable if indicated by a Cache-Control or Expires header field.
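The difference in default cacheability can be sketched as follows. The set of status codes is the list RFC 9110 gives as heuristically cacheable; everything else in the snippet is a heavy simplification. The may_be_reused function is a made-up name, and it ignores the request method, private, no-cache, Vary, Authorization and many other rules that real caches apply.

```python
# Status codes that RFC 9110 defines as heuristically cacheable.
HEURISTICALLY_CACHEABLE = {200, 203, 204, 206, 300, 301, 308, 404, 405, 410, 414, 501}


def may_be_reused(status, headers):
    """Very rough check: could a cache reuse this response without being told how long?"""
    cache_control = headers.get("Cache-Control", "").lower()
    directives = {part.strip().split("=")[0] for part in cache_control.split(",") if part.strip()}
    if "no-store" in directives:
        return False
    has_explicit_freshness = (
        "max-age" in directives or "s-maxage" in directives or "Expires" in headers
    )
    # 301 and 308 qualify on their own; 302 and 307 need explicit freshness information.
    return has_explicit_freshness or status in HEURISTICALLY_CACHEABLE
```

So may_be_reused(301, {}) is true while may_be_reused(302, {}) is false, and adding "Cache-Control": "max-age=60" to the 302 makes it reusable too.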
The main issue with 301 is that the browser will cache the redirection even after you have disabled it at the server level.
It's always better to use 302 if you are enabling the redirection for a short maintenance window.
There have already been plenty of good answers, but none describes the pitfalls, or when to use one over the other, from a plain browser's perspective.
Use a 302 rather than a 301 HTTP status whenever you need to keep dynamic server-side control over the final URL. Using a 301 HTTP status will make the browser always load the final URL from its own cache, without requesting the original URL again (totally skipping the first request). That may have unpredictable results if you need to keep server-side control over the redirect target.
As an example, if you need to redirect based on the user's IP geolocation (geo-IP switching), use a 302. If you used a 301 in such a scenario, the redirect target would always be taken from the browser's cache, potentially sending the user to the wrong content.
301 is a permanent redirect, and 302 is a temporary redirect.
The browser is allowed to cache the 301, but a 302 means it has to hit our system every time. Assuming that we want to minimize the load on our system, 301 is the right decision. Imagine creating a URL shortening service for a big company: we want clients to hit our servers as little as possible.
But if the user wants to edit their short URLs, it might take more time than usual for the browser to pick up the change because the browser has the old one cached. Also, if you want to offer users metrics on how often their URL is getting hit, 301 would mean we would not necessarily see every hit from the client. So if you want analytics as a feature later on and a smooth user experience for editing URLs, 302 is a better choice.
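The trade-off can be illustrated with a toy model. The Shortener and Browser classes are hypothetical, and the browser is assumed to cache a 301 forever and never cache a 302, which is a simplification of real behaviour.

```python
class Shortener:
    """Toy short-link server that counts how often it is hit."""

    def __init__(self, status, target):
        self.status = status
        self.target = target
        self.hits = 0

    def handle(self):
        self.hits += 1
        return self.status, self.target


class Browser:
    """Toy browser that remembers a 301 target and skips the server afterwards."""

    def __init__(self):
        self.cached_target = None

    def visit(self, shortener):
        if self.cached_target is not None:
            return self.cached_target  # served from cache: the server sees nothing
        status, target = shortener.handle()
        if status == 301:
            self.cached_target = target
        return target
```

In this model a link served with 301 is hit once no matter how often it is visited, and an edited target is not picked up by a browser that has already cached the redirect. The same link served with 302 is hit on every visit and always reflects the current target.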

Resources