What HTTP Status Code should be used for redirecting to a correct URL? - http

I am not so sure if the the title of the question is perfect to ask but here is what I am trying figure out. Of course in relation to SEO and such things.
You might have seen URLs with numbers at the end in addition to the title/slug in the URL. Such as
/topic/new-topic-at-this-website-1234
The web application actually takes that number (at the end) as an identity to the resource to deliver. So changing that number in the URL brings up another resource but in that case not all but some websites changes the full URL and redirect to the correct resource with correct URL. Even though some websites do not change the URL and just deliver the content.
While I want to redirect the user to correct URL if the numeric values changes, what status code should be used in that case?
Hope I have asked the question correctly.

OK, Until I find a suggestion, I am going use status code 301 for such redirection because I just found that in fact stackoverflow is handling the same case with 301. As requesting this page with following link,
/questions/33775526
redirects to
/questions/33775526/what-http-status-code-should-be-used-for-redirecting-to-a-correct-url
with status code 301.
Advices are still welcomed but

Related

404 error on my homepage although I can see my site

I am at my wits end with the following problem:
My site www.sebastianthalhammer.com is available under that URL without any problems.
However Google Search Console as well as other external third party test tools return a 404 error.
Status report from Uptrends
It is just the main page that's affected. All the other subpages and blog content isn't affected.
I have been in contact with the server stuff but it seems alright to them. As mentioned. The site can be reached. The site runs on wordpress - latest version.
I have no real clue where to start as this error seems to be quite a tricky one. Does anyone here might have an idea what's going on?
Sebastian
The 4xx class of status code is intended for cases in which the
client seems to have erred. Except when responding to a HEAD
request, the server SHOULD include a representation containing an
explanation of the error situation, and whether it is a temporary or
permanent condition. These status codes are applicable to any
request method. User agents SHOULD display any included
representation to the user.
This leaves me with two possible explanations:
Explanation 1: it's a server error.
the server wrongly returns a 404 status code
the browser thinks the response body contains details about the error and displays it - for the end user this is the actual page
Explanation 2: it's done on purpose to defeat crawlers and page watchers.
the server returns 404 on purpose - non-browser user agents won't process the result as they interpret it as error
browsers are unaffected, the end user doesn't care as long as the page is being displayed
The second one would indeed be kind of clever if you don't want your page to be indexed.
Thanks to your feedback I could think about the problem in a different way.
Ultimately at the unholy depths of a certain plugin I could dig out a setting that caused the error.
It was a redirection plugin that (for whatever reason) sent out a 404 signal when the URL was requested.
I don't know what the purpose would be for something. All I know is that the setting was on default for quite a while now and that caused the weird situation.
thanks guys for getting me on the right track.
Sebastian

Serving 404 directly

So I have an Nginx server set up which is supposed to redirect all http to https (and non-www to www) using 4 server blocks.
The issue is that any 404 or non existent http URL first get a 301 redirect to what could have been an https version if it hypothetically existed (hence creating an extra URL and redirect).
See example:
1) http://example.com/thisurldoesntexit
301 Redirect
2) https://example.com/thisurldoesntexit
404
3) https://example.com/notfound
Is there a way to redirect user directly to a https 404 (URL 3)?
First of all, as already been pointed out, doing a 301 redirect from a non-existent page to a single /notfound moniker, is a really bad practice, and is likely against the RFCs.
What if the user simply mistyped a single character of a long URL? Modern browsers make it non-trivial to go back to what has been typed in order to correct it. The user would have to decide whether your site is worth a retyping from scratch, or whether your competitor might have thought of a better experience.
What if the user simply followed a broken link, which is broken in a very obvious way, and could be easily fixed? E.g., http://www.example.org/www.example.com/page, where an absolute URL was mistyped by the creator to be a relative one, or maybe a URI like /page.html., with an extra dot in the end. Likewise, you'll be totally confusing the user with what's going on, and offering a terrible user experience, where if left alone, the URL could easily have been corrected promptly.
But, more importantly, what real problem are you actually trying to solve?!
For better or worse, it's a pretty common practice to indiscriminately redirect from http to https scheme, without an account of whether a given page may or may not exist. In fact, if you employ HSTS, then content served over http effectively becomes meaningless; the browser with a policy would never even be requesting anything over http from there on out.
Undoubtedly, in order to know whether or not a given page exists, you must consult with the backend. As such, you might as well do the redirect from http to https from within your backend; but it'll likely tie up your valuable server resources for little to no extra benefit.
Moreover, the presence or absence of the page may be dictated by the contents of the cookies. As such, if you require that your backend must discern whether a page does or does not exist for an http request, then you'll effectively be leaking private information that was meant to be protected by https in the first place. (In turn, if your site has no such private information, then maybe you shouldn't be using https in the first place.)
So, overall, the whole approach is just a REALLY, REALLY bad idea!
Consider instead:
Do NOT do a 301 redirect from all non-existent pages to a single /notfound page. Very bad practice, very bad UX.
It is totally OK to do an indiscriminate redirect from http to https, without accounting for whether or not the page exists. In fact, it's not only okay, but it's the way God intended, because an adversary should not be capable of discerning whether or not a given page exists for an https-based site, so, if you do find and implement a solution for your "problem", then you'll effectively create a security vulnerability and a data leak.
Use https://www.drupal.org/project/fast_404 module for serving 404 pages directly without much overload.
I'd suggest redirecting to a 404 page is a poor choice, and you should instead serve the 404 on the incorrect URL.
My reasons for stating this are:
By redirecting away from the page, you are issuing headers that implicitly say "The content does not exist on this URL, but it does over here". I'm not sure how the various search engines would react to being redirected to a 404
I can speak from my own experience as a user when I say that having the URL change on me when I've mis-typed by a single character can be very frustrating. I then need to spend the time to type out the entire URL again.
You can avoid having logic in your .htaccess file or whatever to judge a page as a 404. This will greatly simplify your initial logic (which by-the-by gets computed on every single page load) - and will remove far more redirects than just the odd one of http://badurl to https://badurl to https://404

What happens if a 302 URI can't be found?

If I make an HTTP request to get index.html on http://www.example.com but that URL has a 302 re-direct in place that points to http://www.foo.com/index.html, what happens if the redirect target (http://www.foo.com/index.html) isn't available? Will the user agent try the original URL (http://www.example.com/index.html) or just return an error?
Background to the question: I manage a legacy site that supports a few existing customers but doesn't allow new signs ups. Pretty much all the pages are redirected (using 302s rather than 301s for some unknown reason...) to a newer site. This includes the sign up page. On one of the pages that isn't redirected there is still a link to the sign up page which itself links through to a third party payment page (i.e. on another domain). Last week our current site went down for a couple of hours and in that period someone successfully signed up through the old site. The only way I can imagine this happened is that if a 302 doesn't find its intended URL some (all?) user agents bypass the redirect and then go to originally requested URL.
By the way, I'm aware there are many better ways to handle the particular situation we're in with the two sites. We're on it! This is just one of those weird situations I want to get to the bottom of.
You should receive a 404 Not Found status code.
Since HTTP is a stateless protocol, there is no real connection between two requests of a user agent. The redirection status codes are just a way for servers to politely tell their clients that the resource they were looking for is somewhere else now. The clients, however, are in no way obliged to actually request the resource from that other URL.
Oh, the signup page is at that URL now? Well then I don't want it anymore... I'll go and look at some kittens instead.
Moreover, even if the client decides to do request the new URL (which it usually does ^^), this can be considered as a completely new communication between server and client. Neither server nor client should remember that there was a previous request which resulted in a redirection status code. Instead, the current request should be treated as if it was the first (and only) request. And what happens when you request a URL that cannot be found? You get a 404 Not Found status code.

How can a bot get the contents of subsequent pages in a category listing in WordPress?

I'm writing a bot to automatically download pages from my WordPress blog. The bot gets most of the pages without a problem. For example, it can easily get the first page of the article listing of a given tag: http://example.com/myblog/index.php/archives/tag/mytag. However, for some reason it can't get the subsequent pages, like http://example.com/myblog/index.php/archives/tag/mytag/page/2.
I've tried to figure out what was going on, and here's what I found: while the server answers normally to most requests, upon such requests it answers with a 301 permanent redirect. Peculiarly, the Location header is set to the exact same URL as the request! Basically, the server tells me to redirect my request of the page http://example.com/myblog/index.php/archives/tag/mytag/page/2 to... the very same page :P
When trying to access the page from the browser I get the page without a problem. I thought maybe the browser sends some headers (including cookies) that my bot doesn't send, so I copied the headers (including the cookies) from my browser's web console, but the behaviour didn't change.
I would appreciate any suggestions regarding what might be causing this strange behaviour, what I can do in order to understand what's going on better, and of course what I can do in order to fetch those pages automatically, just like I fetch their brethren.
Thanks!
It seems this post hasn't generated much public interest. However, in case somebody ever runs into the same problem and finds this post, here's the solution I used. Important note: I still don't understand the behaviour I witnessed, and would appreciate it if somebody could explain it.
So the solution I've found is basically to use the URL http://example.com/myblog/archives/tag/mytag?paged=2 instead of http://example.com/myblog/index.php/archives/tag/mytag/page/2. Funnily enough, this URL gets redirected to the original one when browsed to from a browser! But when the bot requested it it got the page without redirection or anything. (So I managed to do what I wanted to do, but I've got no idea what happened there, why there was a problem in the first place, and why this solution worked: for one URL the bot gets infinite redirection and the browser just gets the page, while for the other the browser gets redirected [finitely] and the bot gets the page. I am yet to figure this one out...)

using customErrors for vanity URLs / asp.net url redirection

So, from here...
In ASP.NET, you have a choice about how to respond to that - it's in the web.config as CustomErrors. Turn that on, then redirect to a fancy 404 page (maybe you already do). The fancy 404 page, then, could be checking the requested querystring (which gets passed over to the custom error page as yet another querystring) to see if it's a valid redirect, lives in your database, etc. Just do a Response.Redirect() from there.
Then schooner writes:
Thanks, we do have a 404 now but we would prefer this not to be detected as a 404 in the process. We would like ot handle it directly and seperately if possible.
..and I'd like to know just how bad a practice this is. I don't expect to put my "pretty" URLs on the internet (just business cards) and I have a sample of 404-redirecting-to-a-helpful-site code working, but I don't want to get to production and have an issue with a browser that takes the initial 404 too seriously. Can anyone help me understand more about why I wouldn't want to use customErrors / 404 to flow users to the page they actually wanted?
The main problem with using customeErrors as your 404 error handler is that every time customErrors picks up an errored request rather than throwing a 404 error back to your browser and letting your browser know there was a bad request, it instead returns a 302 which indicates that a page has been relocated to whatever your customErrors page is. This isn't bad for most users because they don't know or even notice the difference, the problem comes from the fact that web crawlers DO know the difference and the status code they receive directly affects how their indexing works.
Consider the scenario where you have a page at http://mysite.com/MyAwesomePageAboutStuff.aspx for some period of time and then one day you decide you no longer need it and delete the file. If Google or some other crawler has already indexed that URL and goes back to it after you delete it the crawler will get a 302 status code instead of a 404 error and because of this status code the crawler will update the page's url to point to your error page rather deleting the non-existent link. Now, whenever someone finds that url by way of a search engine they'll end up at your error page.
It's not really a huge issue, but you can definitely see the headaches this can create for your users in the long run.
Look here for some corroborating data.
I created a vanity url system using the 404 as the handler. There's no need for a 302 on my side as the 404 dynamically loads the content and returns that. I am fully able to handle any and all POST / GET and SERVER data.
Works great. If you are interested TarantulaHawk is up on SourceForge.

Resources