I'm writing a new site to replace an old site and need to handle redirects for the old URLs (with appropriate content migration). I'm thinking I'd like to use rewrite maps rather than develop a bunch of regular expressions to match, because the old site has a huge variety of strange URL formats. But I could have up to 8,000 old URLs to map, and I'm worried that this might consume too many resources and slow down my site. Does anyone have any data/experience/guidance regarding rewrite map performance with large numbers of URLs?
Thanks!
The question is referring to the IIS 7 URL Rewrite module 2.0 and the implementation of a large set of items inside a rewrite map. Adding this as custom functionality to the 404 page is not a good design and is not optimal for the request pipeline.
The rewrite map is specifically designed to handle large URL collections and shouldn't be a problem as long as you have memory to hold 8,000 URLs in addition to your other memory needs.
You can easily test the server using JMeter or Visual Studio to see if the rewrite map size has any significant impact on response time.
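As a rough sanity check on the scale involved (this is just an in-memory hash lookup of the same size, an assumption about how the map behaves rather than the actual rewrite map implementation), a quick C# console sketch shows that 8,000 key/value URL pairs are a trivial amount of data to hold and query:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Rough scale check: 8,000 made-up URL mappings in an in-memory hash lookup,
// then a million lookups timed with Stopwatch.
class RewriteMapScaleCheck
{
    static void Main()
    {
        var map = new Dictionary<string, string>(8000, StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < 8000; i++)
        {
            map["/old/strange/url-" + i + ".asp"] = "/new/page-" + i;
        }

        var sw = Stopwatch.StartNew();
        int hits = 0;
        for (int i = 0; i < 1000000; i++)
        {
            string target;
            if (map.TryGetValue("/old/strange/url-" + (i % 8000) + ".asp", out target))
            {
                hits++;
            }
        }
        sw.Stop();

        Console.WriteLine("1,000,000 lookups against 8,000 entries: {0} ms ({1} hits)",
            sw.ElapsedMilliseconds, hits);
    }
}
```

The memory cost is on the order of a few hundred bytes per entry, so 8,000 mappings amount to at most a few megabytes.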
I have implemented a URL redirect mechanism on more than one occasion; I've mostly used the 404 handler for this, as it does not incur an overhead penalty for the new URLs. If a request triggers the 404 handler, check whether it's actually a redirect; if it is, send a 301, otherwise send a 404.
Redirects can be stored in an indexed lookup (like a Dictionary in C#), which gives near constant-time lookup performance for the size you're indicating (8,000 entries).
Alternatives could be an HttpModule or (ugh) ISAPI filters, but your "404 aspx error page" is the most straightforward and easiest to implement.
As for performance: this is a proven solution on a site with 9M+ hits/day.
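A minimal sketch of that 404-handler-plus-dictionary approach, assuming an ASP.NET custom error page wired up via customErrors in web.config (the page name, redirect entries, and query-string handling below are illustrative placeholders, not the exact implementation described above):

```csharp
using System;
using System.Collections.Generic;
using System.Web.UI;

// Code-behind for a custom 404 page (e.g. NotFound.aspx). The redirect table
// is hard-coded here for illustration; in practice it would be loaded once
// (from a file or database) and cached in a static field.
public partial class NotFound : Page
{
    private static readonly Dictionary<string, string> Redirects =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "/old/products.asp", "/products" },
            { "/old/about-us.html", "/about" }
        };

    protected void Page_Load(object sender, EventArgs e)
    {
        // With customErrors, ASP.NET passes the original path as
        // ?aspxerrorpath=...; fall back to the raw URL otherwise.
        string originalPath = Request.QueryString["aspxerrorpath"] ?? Request.RawUrl;

        string newUrl;
        if (originalPath != null && Redirects.TryGetValue(originalPath, out newUrl))
        {
            // Known old URL: permanent redirect to its new home.
            Response.StatusCode = 301;
            Response.RedirectLocation = newUrl;
            Response.End();
        }
        else
        {
            // Genuinely unknown URL: serve this page with a real 404 status.
            Response.StatusCode = 404;
        }
    }
}
```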
I'm working on an application where in a certain circumstance a redirect must take place, but the necessary querystring parameters are extremely large and at risk of not working in certain browsers / HTTP proxies.
In the past when working with .NET, I could do a Server.Transfer to perform a redirect without sending anything to the client.
Is there a similar capability or pattern I can use in Django to accomplish this? Some things I've considered are:
Transferring the parameters out of the querystring and into Session (or other server-side) state, and doing a normal 302 redirect
Having the first view function directly call the second view function
Neither of these approaches feels right; what other possibilities should I consider?
So I have an Nginx server set up which is supposed to redirect all http to https (and non-www to www) using 4 server blocks.
The issue is that any 404 or non-existent http URL first gets a 301 redirect to what would have been the https version if it hypothetically existed (hence creating an extra URL and redirect).
See example:
1) http://example.com/thisurldoesntexit
301 Redirect
2) https://example.com/thisurldoesntexit
404
3) https://example.com/notfound
Is there a way to redirect the user directly to an https 404 (URL 3)?
First of all, as has already been pointed out, doing a 301 redirect from a non-existent page to a single /notfound moniker is a really bad practice, and is likely against the RFCs.
What if the user simply mistyped a single character of a long URL? Modern browsers make it non-trivial to go back to what was typed in order to correct it. The user would have to decide whether your site is worth retyping the URL from scratch, or whether your competitor might have thought of a better experience.
What if the user simply followed a broken link, which is broken in a very obvious way and could easily be fixed? E.g., http://www.example.org/www.example.com/page, where an absolute URL was mistyped by the creator to be a relative one, or maybe a URI like /page.html., with an extra dot at the end. Likewise, you'll totally confuse the user about what's going on and offer a terrible user experience, when, if left alone, the URL could easily have been corrected promptly.
But, more importantly, what real problem are you actually trying to solve?!
For better or worse, it's a pretty common practice to indiscriminately redirect from the http scheme to https, without regard to whether a given page may or may not exist. In fact, if you employ HSTS, then content served over http effectively becomes meaningless; a browser with the policy in place would never even request anything over http from there on out.
Undoubtedly, in order to know whether or not a given page exists, you must consult with the backend. As such, you might as well do the redirect from http to https from within your backend; but it'll likely tie up your valuable server resources for little to no extra benefit.
Moreover, the presence or absence of the page may be dictated by the contents of the cookies. As such, if you require that your backend must discern whether a page does or does not exist for an http request, then you'll effectively be leaking private information that was meant to be protected by https in the first place. (In turn, if your site has no such private information, then maybe you shouldn't be using https in the first place.)
So, overall, the whole approach is just a REALLY, REALLY bad idea!
Consider instead:
Do NOT do a 301 redirect from all non-existent pages to a single /notfound page. Very bad practice, very bad UX.
It is totally OK to do an indiscriminate redirect from http to https without accounting for whether or not the page exists. In fact, it's not only okay, but it's the way God intended: an adversary should not be able to discern whether or not a given page exists on an https-based site, so if you do find and implement a solution for your "problem", you'll effectively create a security vulnerability and a data leak.
Use the https://www.drupal.org/project/fast_404 module to serve 404 pages directly without much overhead.
I'd suggest that redirecting to a 404 page is a poor choice; you should instead serve the 404 on the incorrect URL.
My reasons for stating this are:
By redirecting away from the page, you are issuing headers that implicitly say "the content does not exist at this URL, but it does over here". I'm not sure how the various search engines would react to being redirected to a 404.
I can speak from my own experience as a user when I say that having the URL change on me when I've mis-typed by a single character can be very frustrating. I then need to spend the time to type out the entire URL again.
You can avoid having logic in your .htaccess file (or wherever) to judge a page as a 404. This greatly simplifies your initial logic (which, by the by, gets evaluated on every single page load), and it removes far more redirects than just the odd chain of http://badurl to https://badurl to https://404.
We are working on the conversion of an ASP site to ASP.Net and are running into problems with redirects and resource locations. Our issues are coming from a peculiarity of our set-up. Our site can be accessed in two ways:
Directly by URL: http://www.mysite.com - in this case everything works fine
Via a proxy server with a URL like: http://www.proxy.com/mysite_proxy/proxy/
In #2, "mysite_proxy" is a mapping on proxy.com that directs the request behind the scenes to www.mysite.com, and "proxy" is a virtual sub-website that just redirects the request to the root of www.mysite.com. It is essentially meant to give us a convenient way of knowing whether a request is hitting the site from the proxy or not.
We are running into two problems with this setup:
Using Response.Redirect, either with "~" or a plain relative path (Default.aspx), generates a 302 response with a location of "/proxy/rest_of_the_path.aspx". This causes the browser to request http://www.proxy.com/proxy/rest_of_the_path.aspx, which isn't anything and doesn't even hit our server, so we can't do an after-the-fact rewrite.
Using "~" based URLs in our pages for links, images, style-sheets, etc. creates the same kind of path: "/proxy/path_to_resources.css." We could probably solve some of these by using relative paths for all these resources though that would be a lot of work and it would do nothing to address similar resource links generated by the framework and 3rd party components.
Ideally I want to find a global fix that will make these problems transparent to the developers working on the site. I have a few ideas at this point:
Getting rid of the proxy; it is not really needed and is there for administrative rather than technical reasons. This is the easiest to accomplish technically and the hardest to accomplish in the real world.
Hand the problem off to the group that runs the proxy and say it is their problem to fix.
Use a Response filter to modify the raw HTML before it is sent to the client. I know this could fix my resource links, but I am not certain about the headers (I need to test it out), and there would be a performance hit from having to parse every response looking for and rewriting URLs.
All of these solutions have big negatives in my mind and I was hoping someone might have another idea. So any thoughts?
Aside: there are a lot of posts up already that deal with the reverse of this issue (I have a relative URL, how do I make it absolute?), but I didn't come across anything that fit the bill for the other direction.
As a fix, I'd go with a small detection routine in Global.asax's Session_Start (since I imagine the proxy doesn't actually start another application instance), set a session variable with the correct path, and use it instead of "~".
In case a different application instance is used, use Application_Start instead of Session_Start and a static global variable instead of a session variable.
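A rough sketch of that idea, assuming the proxied requests can be recognized from the incoming path (the path prefixes below are assumptions based on the URLs in the question, not something the proxy is guaranteed to preserve):

```csharp
// Global.asax.cs -- detect whether the request arrived via the proxy and
// remember the externally visible base path for this session, to be used
// in place of "~" when building redirects and links.
using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Session_Start(object sender, EventArgs e)
    {
        // Assumption: requests coming through the proxy still carry the
        // "/proxy/" prefix when they reach this application.
        bool viaProxy = Request.Url.AbsolutePath.StartsWith("/proxy/",
            StringComparison.OrdinalIgnoreCase);

        Session["BasePath"] = viaProxy
            ? "/mysite_proxy/proxy/"
            : VirtualPathUtility.ToAbsolute("~/");
    }
}
```

Then, instead of Response.Redirect("~/Default.aspx"), pages would redirect to (string)Session["BasePath"] + "Default.aspx".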
We have a number of existing clients that point to URLs like:
http://sub1.site.com/images/image1.jpg
/images is a virtual directory that points to a directory that actually contains image1.jpg on that server.
We're moving all of the files out of this directory and onto a separate server that will not run this same application.
The file will now only be available at:
http://sub2.site.com/image1.jpg
What is the best way to make it so clients requesting
http://sub1.site.com/images/image1.jpg will get the content that now resides at http://sub2.site.com/image1.jpg?
A few requirements:
We need the actual content to be returned through that url - not a 302 response.
We cannot modify the IIS server configuration - only the web.config for the site
Again, we're running ASP.NET 3.5.
Thanks.
Not totally sure this would work, but you could set up URL routing on the old site so all requests are sent to a handler, and within that handler you could do a web request to get the file from its new location.
I use a variation of this approach to map image URLs to different locations, and my handler does some database queries to get the mapped relationship and provide the correct image. I don't see why you couldn't do a web request to get the image.
Since you are using IIS 7, you can use the built-in URL Rewrite module.
You would want an inbound and an outbound rule to change /images/image1.jpg to /image1.jpg.
It can get pretty involved, but this should be rather simple.
Assuming you can add handlers to your site (that is, add a DLL to the /bin directory of the site), and given the restriction that you can't just send a 302 response, you could alternatively write a custom handler to grab all requests that match that URL pattern, fetch the image from sub2.site.com via web client code, and serve it back out of the original site, sub1.site.com.
See "How To Create an ASP.NET HTTP Handler by Using Visual C# .NET" for the very basics of creating and setting up a custom handler. Then use HttpWebRequest to make the request to sub2.site.com, as in the guide "A Deeper Look at Performing HTTP Requests in an ASP.NET Page". Add a little extra code to handle errors and timeouts and to pass the image through with as little processing and memory usage as possible.
Depending on the response time and lag between the two servers, this may be slow, but it would fit all your requirements. If the point of moving the images to site 2 was performance (CPU or memory) or bandwidth limitations, then this solution would nullify any gains and would actually make things worse. But if they were moved for other business or technical reasons, then this solution might still be helpful.
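A minimal sketch of such a pass-through handler (the host names come from the question; the timeout, error handling, and web.config registration are simplified assumptions):

```csharp
using System;
using System.IO;
using System.Net;
using System.Web;

// IHttpHandler that answers requests for /images/* on sub1.site.com by
// fetching the same file name from sub2.site.com and streaming it back,
// so the client never sees a redirect. Register it for the images path in
// web.config (<httpHandlers> for classic mode, <handlers> for integrated).
public class ImagePassThroughHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        string fileName = Path.GetFileName(context.Request.Path);
        string remoteUrl = "http://sub2.site.com/" + fileName;

        var request = (HttpWebRequest)WebRequest.Create(remoteUrl);
        request.Timeout = 10000; // 10 seconds; tune for your network

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            {
                context.Response.ContentType = response.ContentType;

                // Copy the remote response to the client in small chunks to
                // keep memory usage low for large images.
                var buffer = new byte[8192];
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    context.Response.OutputStream.Write(buffer, 0, read);
                }
            }
        }
        catch (WebException)
        {
            // Remote file missing or unreachable: surface it as a 404.
            context.Response.StatusCode = 404;
        }
    }
}
```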
If you have other control over the server or anything upstream from the server, you could use mod_proxy (or similar Windows/IIS tool) to intercept those URLs and forward them to another server and respond back with the real request. Depending on your network configuration and available servers, this could be the simplest, best performing solution.
The Server Fault question "Can IIS be configure to forward request to another web server?" has a quick process and a link for an IIS 7.5 solution.
I have been thinking about a neat way of load balancing, and one thing that would be required is the ability to load an image on an HTML page from multiple locations without rewriting the URL (on each load).
So what I need is one URL which is the "static" URL, such as http://example.com/myimage.png. The image is not actually contained on example.com, though; example.com sends a 301, 302, or 307 HTTP response to cause a redirect to 2.example.com. How do browsers handle this with images in this situation? Also, how do browsers handle multiple redirections, for instance if 2.example.com also didn't contain it and it redirected on to 3.example.com? (Note: I am asking this because I've never seen a 301 redirect on anything but an HTML page.)
Also, which status code would be best to use? 301 means "moved permanently", but this "move" isn't permanent, so I don't want it cached. Should I use 307? Is that supported by search engines and modern browsers?
Redirect is an HTTP concept and applies to any resource that can be delivered over HTTP, not only HTML. Chaining redirects and non-HTML redirects work just fine in most modern browsers.
If you want a temporary redirect, use 302, unless you want to redirect POSTs and PUTs as well; the problem is that most implementations will issue a GET for the new resource address after a POST or PUT that got a 302.
Note that 303 and 307 are HTTP 1.1 specific.
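For completeness, here is what the redirecting side of the question's setup might look like, as a hedged sketch only (the handler name and mirror hosts are placeholders, and the following answers explain why this is not a great way to load balance):

```csharp
using System.Threading;
using System.Web;

// Sketch: the "static" image URL answers with a temporary (302) redirect to
// one of several mirrors, chosen round-robin, so the choice is not cached
// the way a 301 would be.
public class ImageRedirectHandler : IHttpHandler
{
    private static readonly string[] Mirrors =
    {
        "http://2.example.com",
        "http://3.example.com"
    };

    private static int _counter;

    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // Mask off the sign bit so the index stays valid even if the
        // counter eventually overflows.
        int i = Interlocked.Increment(ref _counter) & int.MaxValue;
        string mirror = Mirrors[i % Mirrors.Length];

        context.Response.StatusCode = 302;
        context.Response.RedirectLocation = mirror + context.Request.Path;
    }
}
```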
I would advise against load balancing like this. Load balancing is not what 3xx responses are intended for.
The HTTP protocol has capabilities for caching, which can help reduce server load. There are also server technologies for load balancing. These technologies are well developed and will be more stable and reusable.
As Benedict C says, I think you're barking up the wrong tree.
If you want to do load balancing, then do load balancing. Round-robin DNS is the simplest method (and is more effective in a number of regards than more expensive solutions). If you must load balance across servers with different FQDNs, then generate the URL client-side in JavaScript.
The rest of your post is applicable to other questions about redirection. There's a lot of bad advice published about SEO. Google has approximately 92% of the world market and publishes quite detailed specs about how it spiders and ranks sites. Redirection within your domain should not affect your rankings in any competent search engine. Redirection outside your domain will only improve the ranking of the target.
Yes, browsers implement a limit on the number of redirections followed for a single request, but it varies by browser.