Modify regex to be more selective - asp.net

I'm using URL rewriting on a website that also has a blog, but I have found an issue with conflicting rules.
I think it can be resolved by simply modifying the regex in the offending rule to exclude the blog pages, but I've not yet been able to get it to work.
Here's the issue that's causing the problem:
www.mysite.com <== my site
www.mysite.com/blog/... <== root of blog
At the moment, the regex is simply "(.*)/$", which is actionable on all pages.
However, it obviously picks up the blog pages. The blog has a slightly different setup which causes issues when I apply the rule, so I'm looking to be able to select the main site pages, but not anything that has the /blog/ directory in its structure.
Can anyone help me convert the regex pattern to exclude what I need? I've tried so many permutations. It's now causing issues, as our SEO is broken, and it's very difficult to test on the live system.

The elegance of the regex used depends on the power of the regex engine which in turn depends on the environment your regex is used in.
Basic
Works with any engine since it only uses basic operators.
Write
^www\.mysite\.com(/([^b]|b[^l]|bl[^o]|blo[^g]|blog[^/]).*)?/$
or for general domains (not advisable imho)
^[^/]+(/([^b]|b[^l]|bl[^o]|blo[^g]|blog[^/]).*)?/$
Note that this form misses paths whose first segment is a proper prefix of "blog" (for example /b/ or /bl/).
Negative Lookahead
Patterns may be specified contingent on non-occurrence of entire strings after a particular position within a match.
^(www\.mysite\.com(?!/blog/).*)$
For more details on lookahead (and the similar concept of lookbehind), see this part of a resource on regular expression engines and the concepts behind them.
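As a quick check of the lookahead idea, here is a minimal Python sketch. Python's re module is only a stand-in for whatever engine your rewriter uses, the host name and sample URLs are placeholders, and keeping the trailing /$ from the question's rule is an assumption.

```python
import re

# Assumed pattern: escaped dots, a negative lookahead right after the host,
# and the trailing-slash requirement taken from the question's "(.*)/$" rule.
NOT_BLOG = re.compile(r"^www\.mysite\.com(?!/blog/).*/$")


def is_rewritable(url: str) -> bool:
    """Return True when the URL ends in a slash and is not under /blog/."""
    return NOT_BLOG.match(url) is not None


if __name__ == "__main__":
    for url in (
        "www.mysite.com/",
        "www.mysite.com/products/",
        "www.mysite.com/blog/",
        "www.mysite.com/blog/post-1/",
        "www.mysite.com/blogger/",
    ):
        print(url, is_rewritable(url))
```

The lookahead is only evaluated directly after the host, so a path such as /blogger/ still matches while anything starting with /blog/ does not.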
Web Server URL Rewriting
If the execution context is a URL rewrite module, you may exploit rule ordering, special processing directives and flags to simplify the regexes. Aside from peculiarities of the syntax, the general ideas of the samples should be implementable in any decent URL rewriting plugin.
Microsoft URL Rewriting module for IIS
(Note: The content of this section reflects MS documentation only.)
There is a tutorial on how to specify rewriting rules in IIS. As the authoritative source for the available regex syntax, the ECMA-262 standard is given (basically that's JavaScript, so lookahead should be available).
Thus in the web.config configuration file, the rewrite section should look like:
<rewrite>
<rules>
<rule name="Whatever description fits">
<!-- The pattern is tested against the URL path only, without the host name or the leading slash. -->
<match url="^(?!blog/)(.*)$" />
<action type="Rewrite" url="example.com/whatever" />
</rule>
</rules>
</rewrite>
Apache mod_rewrite and compatible
The following sample addresses your problem for Apache 2.4/mod_rewrite and IIS/IIS-Mod-Rewrite.
(Note: The website of the IIS plugin claims to be 100% mod_rewrite compatible. I do not know whether that claim is correct.):
RewriteRule ^/?blog/ - [S=1]
RewriteRule ^(.*)$ ...
where ... represents the rewritten URL plus any applicable flags. Note that the RewriteRule pattern is matched against the URL path, not the host name. The first rule leaves a matched blog URL unchanged (the - substitution) and, due to the skip flag ([S=1]), skips the next rewriting rule.
You might also wish to look at the END flag and the RewriteCond directive.
Reference: Apache Docs ( Apache httpd 2.4: mod_rewrite )
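The rule-ordering idea can be sketched outside any particular server. The following Python snippet is only an illustration: the second rule (stripping the trailing slash) is a made-up stand-in for whatever your real rewrite does, and the function name is hypothetical.

```python
import re

BLOG = re.compile(r"^/?blog/")          # first rule: anything under /blog/
TRAILING_SLASH = re.compile(r"^(.*)/$")  # second rule: the general pattern


def rewrite(path: str) -> str:
    """Apply the rules in order; a blog match skips the general rule."""
    if BLOG.match(path):
        # Equivalent of "leave unchanged and skip the next rule".
        return path
    match = TRAILING_SLASH.match(path)
    if match:
        # Hypothetical rewrite: drop the trailing slash (keep "/" for the root).
        return match.group(1) or "/"
    return path


if __name__ == "__main__":
    print(rewrite("/blog/post/"))
    print(rewrite("/products/"))
```

Because the blog check runs first, the general pattern can stay as simple as it was in the question.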

Related

Should the server treat such endpoints as different? [duplicate]

When should a trailing slash be used in a URL? For example - should my URL look like /about-us/ or like /about-us?
I am fully aware of the SEO-related issues - duplicate content and the canonical thing; I'm trying to figure out which one I should use in the context of serving pages correctly alone.
For example, my colleague is thinking that a trailing slash at the end means it's a "folder" - a "directory", so this is not a correct style. But I think that without a slash in the end - it's not quite correct either, because it almost looks like a folder, but it isn't and it's not a normal file either, but a filename without extension.
Is there a proper way of knowing which to use?
It is not a question of preference. /base and /base/ have different semantics. In many cases, the difference is unimportant. But it is important when there are relative URLs.
child relative to /base/ is /base/child.
child relative to /base is (perhaps surprisingly) /child.
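The difference is easy to reproduce with Python's standard urljoin, which follows the usual relative reference resolution rules. The host name is a placeholder.

```python
from urllib.parse import urljoin


def resolve_child(base: str, relative: str = "child") -> str:
    """Resolve a relative reference against a base URL."""
    return urljoin(base, relative)


if __name__ == "__main__":
    print(resolve_child("http://example.com/base/"))
    print(resolve_child("http://example.com/base"))
```

With the trailing slash, the last path segment is kept as a directory; without it, the last segment is replaced by the relative reference.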
In my personal opinion trailing slashes are misused.
Basically, the URL format derives from the UNIX convention for files and folders, which was later used on DOS systems and finally adapted for the web.
A typical URL for this book on a Unix-like operating system would be a file path such as file:///home/username/RomeoAndJuliet.pdf, identifying the electronic book saved in a file on a local hard disk.
Source: Wikipedia: Uniform Resource Identifier
Another good source to read: Wikipedia: URI Scheme
According to RFC 1738, which defined URLs in 1994, when resources contain references to other resources, they can use relative links to define the location of the second resource as if to say, "in the same place as this one except with the following relative path". It went on to say that such relative URLs are dependent on the original URL containing a hierarchical structure against which the relative link is based, and that the ftp, http, and file URL schemes are examples of some that can be considered hierarchical, with the components of the hierarchy being separated by "/".
Source: Wikipedia Uniform Resource Locator (URL)
Also:
That is the question we hear often. Onward to the answers! Historically, it’s common for URLs with a trailing slash to indicate a directory, and those without a trailing slash to denote a file:
http://example.com/foo/ (with trailing slash, conventionally a directory)
http://example.com/foo (without trailing slash, conventionally a file)
Source: Google WebMaster Central Blog - To slash or not to slash
Finally:
A slash at the end of the URL makes the address look "pretty".
A URL without a slash at the end and without an extension looks somewhat "weird".
You would never name your CSS file (for example) http://www.sample.com/stylesheet/, would you?
BUT I am a proponent of web best practices regardless of the environment.
It can be wonky and unclear, just as you said about the URL with no ext.
I'm always surprised by the extensive use of trailing slashes on non-directory URLs (WordPress among others). This really shouldn't be an either-or debate because putting a slash after a resource is semantically wrong. The web was designed to deliver addressable resources, and those addresses - URLs - were designed to emulate a *nix-style file-system hierarchy. In that context:
Slashes always denote directories, never files.
Files may be named anything (with or without extensions), but cannot contain or end with slashes.
Using these guidelines, it's wrong to put a slash after a non-directory resource.
That's not really a question of aesthetics, but indeed a technical difference. The directory thinking of it is totally correct and pretty much explaining everything. Let's work it out:
You are back in the stone age now or only serve static pages
You have a fixed directory structure on your web server and only static files like images, html and so on — no server side scripts or whatsoever.
A browser requests /index.htm, it exists and is delivered to the client. Later you have lots of - let's say - DVD movies reviewed and a html page for each of them in the /dvd/ directory. Now someone requests /dvd/adams_apples.htm and it is delivered because it is there.
One day, someone requests just /dvd/, which is a directory, and the server tries to figure out what to deliver. Besides access restrictions and so on, there are two possibilities: show the user the directory content (I bet you have already seen this somewhere) or show a default file (in Apache it is DirectoryIndex, which sets the file that Apache will serve if a directory is requested).
So far so good, this is the expected case. It already shows the difference in handling, so let's get into it:
At 5:34am you made a mistake uploading your files
(Which is by the way completely understandable.) So, you did something entirely wrong and instead of uploading /dvd/the_big_lebowski.htm you uploaded that file as dvd (with no extension) to /.
Someone bookmarked your /dvd/ directory listing (of course you didn't want to create and always update that nifty index.htm) and is visiting your web-site. Directory content is delivered - all fine.
Someone heard of your list and is typing /dvd. And now it is screwed. Instead of your DVD directory listing the server finds a file with that name and is delivering your Big Lebowski file.
So, you delete that file and tell the guy to reload the page. Your server looks for the /dvd file, but it is gone. Most servers will then notice that there is a directory with that name and tell the client that what it was looking for is indeed somewhere else. The response will most likely be:
Status Code:301 Moved Permanently with Location: http://[...]/dvd/
So, totally ignoring what you think about directories or files, the server only can handle such stuff and - unless told differently - decides for you about the meaning of "slash or not".
Finally after receiving this response, the client loads /dvd/ and everything is fine.
Is it fine? No.
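The lookup order described above can be sketched as a small Python function. This is a simplified model, not how any particular server is implemented; the function name and the way files and directories are passed in are assumptions for illustration.

```python
def resolve(path, files, directories):
    """Return (status, location) for a request path.

    files: set of file paths, e.g. {"/index.htm"}
    directories: set of directory paths with a trailing slash, e.g. {"/dvd/"}
    """
    if path in files:
        # A file with exactly that name wins, extension or not.
        return 200, path
    if path in directories:
        # Directory requested with its slash: listing or default file.
        return 200, path
    if path + "/" in directories:
        # No such file, but a directory of that name exists: redirect.
        return 301, path + "/"
    return 404, None


if __name__ == "__main__":
    files = {"/index.htm", "/dvd/adams_apples.htm"}
    directories = {"/dvd/"}
    print(resolve("/dvd", files, directories))
    print(resolve("/dvd", files | {"/dvd"}, directories))
```

The second call models the 5:34am upload mistake: once a file named /dvd exists, the request for /dvd no longer reaches the directory.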
"Just fine" is not good enough for you
You have some dynamic page where everything is passed to /index.php and gets processed. Everything worked quite well until now, but that entire thing starts to feel slower and you investigate.
Soon, you'll notice that /dvd/list is doing exactly the same: Redirecting to /dvd/list/ which is then internally translated into index.php?controller=dvd&action=list. One additional request - but even worse! customer/login redirects to customer/login/ which in turn redirects to the HTTPS URL of customer/login/. You end up having tons of unnecessary HTTP redirects (= additional requests) that make the user experience slower.
Most likely you have a default directory index here, too: index.php?controller=dvd with no action simply internally loads index.php?controller=dvd&action=list.
Summary:
If it ends with / it can never be a file. No server guessing.
Slash and no slash have entirely different meanings. There is a technical/resource difference between the two, and you should be aware of it and use it accordingly. The server most likely loads /dvd/index.htm (or the correct script) when you ask for /dvd, but it does so even though you did not make the right request, which would have been /dvd/.
Omitting the slash even if you indeed mean the slashed version gives you an additional HTTP request penalty. Which is always bad (think of mobile latency) and has more weight than a "pretty URL" - especially since crawlers are not as dumb as SEOs believe or want you to believe ;)
When you make your URL /about-us/ (with the trailing slash), it's easy to start with a single file index.html and then later expand it and add more files (e.g. our-CEO-john-doe.jpg) or even build a hierarchy under it (e.g. /about-us/company/, /about-us/products/, etc.) as needed, without changing the published URL. This gives you a great flexibility.
Other answers here seem to favor omitting the trailing slash. There is one case in which a trailing slash will help with search engine optimization (SEO). That is the case that your document has what appears to be a file extension that is not .html. This becomes an issue with sites that are rating websites. They might choose between these two urls:
http://mysite.example.com/rated.example.com
http://mysite.example.com/rated.example.com/
In such a case, I would choose the one with the trailing slash. That is because the .com extension is an extension for Windows executable command files. Search engines and virus checkers often dislike URLs that appear that they may contain malware distributed through such mechanisms. The trailing slash seems to mitigate any concerns, allowing the page to rank in search engines and get by virus checkers.
If your URLs have no . in the file portion, then I would recommend omitting the trailing slash for simplicity.
Who says a file name needs an extension?? take a look on a *nix machine sometime...
I agree with your friend, no trailing slash.
From an SEO perspective, choosing whether or not to include a trailing slash at the end of a URL is irrelevant. These days, it is common to see examples of both on the web. A site will not be penalized either way, nor will this choice affect your website's search engine ranking or other SEO considerations.
Just choose a URL naming convention you prefer, and include a canonical link tag in the <head> section of each webpage.
Search engines may treat a single webpage as two separate duplicate URLs when they encounter it with and without the trailing slash, i.e. example.com/about-us/ and example.com/about-us.
It is best practice to include a canonical link tag on each page because you cannot control how other sites link to your URLs.
The canonical tag looks like this: <link rel="canonical" href="https://example.com/about-us" />. Using a canonical link tag ensures that search engines only count each of your URLs once, regardless of whether other websites include a trailing slash when they link to your site.
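If the tag is generated in code, the chosen convention can be applied in one place. The following is a minimal Python sketch; the helper names and the trailing_slash switch are hypothetical, and it drops any fragment from the URL as an assumption.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str, trailing_slash: bool = False) -> str:
    """Normalize the path to one convention; the root path stays "/"."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path != "/":
        path = path.rstrip("/")
        if trailing_slash:
            path += "/"
    # The fragment is dropped (assumption: it is irrelevant for canonical URLs).
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def canonical_tag(url: str, trailing_slash: bool = False) -> str:
    """Build the <link rel="canonical"> element for the page."""
    href = escape(canonical_url(url, trailing_slash), quote=True)
    return f'<link rel="canonical" href="{href}" />'


if __name__ == "__main__":
    print(canonical_tag("https://example.com/about-us/"))
```

Both spellings of a page then produce the same tag, whichever convention you picked.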
The trailing slash does not matter for your root domain or subdomain. Google sees the two as equivalent.
But trailing slashes do matter for everything else because Google sees the two versions (one with a trailing slash and one without) as being different URLs.
Conventionally, a trailing slash (/) at the end of a URL meant that the URL was a folder or directory.
A URL without a trailing slash at the end used to mean that the URL was a file.
Read more
Google recommendation

Matching part of url and redirect

I have a regular expression in my web.config file that I'm using to redirect users to some other domain:
<redirect url="/(.*/)?((da-DK)|(es-ES))/?$" to="http://www.example.com" />
This successfully matches following url:
http://www.example.com/ik/da-DK/
But not the below one:
http://www.example.com/da-DK/
Why is that? I'm certain that this regex is good because I've tested it against lots of example urls. Is this a bug in parser or something like that?
I'm using urlrewriter.net which is no longer maintained, but maybe any of you had such problems in the past?
Regex itself seems ok, so considering that urlrewriter.net is not actively maintained, I would suggest that you try to switch to IIS Url Rewrite:
http://www.iis.net/downloads/microsoft/url-rewrite
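To double-check the pattern independently of the rewriter, here is a small Python sketch. Python's re module is not the .NET engine that urlrewriter.net uses, but the constructs in this pattern are basic enough that the comparison should be fair; the extra sample URLs are made up.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"/(.*/)?((da-DK)|(es-ES))/?$")


def matches(url: str) -> bool:
    """Return True when the pattern matches somewhere in the URL."""
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://www.example.com/ik/da-DK/",
        "http://www.example.com/da-DK/",
        "http://www.example.com/en-US/",
    ):
        print(url, matches(url))
```

Both URLs from the question match here, which supports the view that the problem lies in how the tool applies the pattern rather than in the regex.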

ASP/VBSCRIPT URL Trailing dots issue

I've got a problem whereby some websites linking into mine have truncated the URL with trailing dots, 3 of them to be exact!
eg. http://www.mywebsite.com/7542-this-is-a-link-to...
The url should be http://www.mywebsite.com/7542-this-is-a-link-to-my-website.html
Naturally, ISAPI rewrite doesn't understand the truncated url so I need to do a redirect to the correct url using a 301 redirect
Something like:
RewriteRule ^7542-this-is-a-link-to... /7542-this-is-a-link-to-my-website.html [L,R=301]
But for the life of me I cannot get ISAPI rewrite to match against the 3 dots, annoyingly the incorrect URL doesn't even 404 redirect! I have no idea where it is going to... Just a blank screen so am guessing it has something to do with IIS web.config file...
Please help me before I become balder than I already am!
There could be several different reasons for that. Basically, a rule like:
RewriteRule ^7542-this-is-a-link-to\.{3}$ /7542-this-is-a-link-to-my-website.html [L,R=301]
would fix the problem, matching only the three literal dots after "-to" (a plain .* would also match the correct URL and redirect it to itself). But it's not ISAPI_Rewrite that throws 404. It's IIS. I had issues before and all googling ended up with IIS blocking suspicious characters. Try to tweak that.
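For the rewrite pattern itself, the effect of an unescaped wildcard versus escaped dots can be compared with a short Python sketch. Python's re module stands in for the ISAPI_Rewrite engine here, and the names LOOSE and STRICT are just labels for this illustration.

```python
import re

# A wildcard after "-to" matches the truncated URL and the full one.
LOOSE = re.compile(r"^7542-this-is-a-link-to.*")
# Escaped dots plus an end anchor match only the truncated form.
STRICT = re.compile(r"^7542-this-is-a-link-to\.{3}$")

TRUNCATED = "7542-this-is-a-link-to..."
FULL = "7542-this-is-a-link-to-my-website.html"


def needs_redirect(path: str) -> bool:
    """Return True only for the truncated link."""
    return STRICT.match(path) is not None


if __name__ == "__main__":
    print(needs_redirect(TRUNCATED), needs_redirect(FULL))
    print(LOOSE.match(FULL) is not None)
```

A rule that also matches the redirect target would send the correct URL back to itself, so the stricter form is the safer choice.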
It could be this "dots in the URL"
This thread has a lot of useful info
I'm totally ignorant of Microsoft internet tech, but is there any chance that the three dots in the incoming URL are actually a single "ellipsis" character (… not ...)? If so, you'd need to use that character in your RewriteRule. You'd have to check the docs to know how to correctly encode that character for the config file.

URL Rewrite from /default.aspx to /

I'm using the URL Rewriting.NET tool with IIS 6. I've got my default page content set for default.aspx in IIS. What I'm trying to do is have /default.aspx provide a 301 redirect to the root directory (www.example.com/default.aspx -> www.example.com). I've tried turning off default documents, to no avail.
What I'm hoping to do is use a couple of URL Rewriting.NET rules to accomplish this goal. Any thoughts?
EDIT:
Sorry, I forgot to clarify. If I redirect from /default.aspx to / with default documents turned on (I'd like to leave them on) then I get an infinite loop of default -> / -> default
In the end I wound up using IIS 7 with the URL Rewrite module, which allows you to do this redirect properly.
Edit :
The rule is
<rule name="Default Redirect" stopProcessing="true">
<match url="^default\.aspx$" />
<action type="Redirect" url="/" redirectType="Permanent" />
</rule>
You can do that with a separate rule for each folder, or you can use a single rule:
<rule name="All Redirect">
<match url="^(.*\/)*default\.aspx$" />
<action type="Rewrite" url="{R:1}" />
</rule>
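To see what the two patterns capture, here is a Python sketch. It assumes the pattern is tested against the path without a leading slash and case-insensitively; Python's re module stands in for the IIS engine, and redirect_target is a hypothetical helper.

```python
import re

ROOT_ONLY = re.compile(r"^default\.aspx$", re.IGNORECASE)
ANY_FOLDER = re.compile(r"^(.*/)*default\.aspx$", re.IGNORECASE)


def redirect_target(path: str):
    """Return the folder URL for a default.aspx request, or None."""
    match = ANY_FOLDER.match(path)
    if match is None:
        return None
    # Group 1 holds the folder part ("shop/cart/"); it is None at the root.
    return "/" + (match.group(1) or "")


if __name__ == "__main__":
    print(redirect_target("default.aspx"))
    print(redirect_target("shop/cart/default.aspx"))
    print(redirect_target("shop/defaults.aspx"))
```

The captured group corresponds to what {R:1} would refer to in the rule's action.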
I came across this very problem a while back while trying to work out why some IIS installs would work redirecting the /default.aspx and some would degenerate into a terminal loop.
I found the answer was whether or not asp.net was 'wildcard' mapped to run all requests within IIS.
Put simply, if you have an out-of-the-box IIS setup, it will always append the default document onto any request for the site root. Thus example.com becomes example.com/default.aspx when you inspect the Request.Url in ASP.NET. Therefore if you detect this situation and try to redirect away and back to example.com, IIS does so, appends the /default.aspx and your code is caught in a loop of its own making.
The exception to this is if you set up wildcard mapping so that all requests are processed through the asp.net pipeline. In this case, IIS no longer appends the default document onto each request at the Request.Url level. And thus you can do the redirect.
I put it all in this blog post : 301 Redirecting from /default.aspx to the site root - the final word - but this was written several years back and changes in IIS7 may have fixed the problem, as the currently accepted answer provides.
But if you're battling this problem, then looking at the wildcard mapping status is the right place to start.
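The loop can be modelled in a few lines of Python. This is a toy simulation of the behaviour described above, not of IIS itself; the function names and the hop limit are assumptions.

```python
def handle(requested_path: str, wildcard_mapped: bool):
    """Model one request: return (status, path or redirect target)."""
    seen = requested_path
    if not wildcard_mapped and requested_path.endswith("/"):
        # Without wildcard mapping, the default document is appended
        # before the application sees the URL.
        seen = requested_path + "default.aspx"
    if seen.lower().endswith("/default.aspx"):
        # The application's rule: redirect default.aspx to its folder.
        return 301, seen[: -len("default.aspx")]
    return 200, seen


def follow(path: str, wildcard_mapped: bool, max_hops: int = 5):
    """Return the number of redirects followed, or None if it loops."""
    for hops in range(max_hops):
        status, target = handle(path, wildcard_mapped)
        if status == 200:
            return hops
        path = target
    return None


if __name__ == "__main__":
    print(follow("/default.aspx", wildcard_mapped=True))
    print(follow("/default.aspx", wildcard_mapped=False))
```

With wildcard mapping the redirect settles after one hop; without it, every request for / looks like /default.aspx again and never settles.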
I had the same problem. For those who wonder why anyone would want to do this, it's a question of SEO. If Google indexes your home page with and without the default.aspx at the end, the PageRank and link popularity will be split between the two URLs. Now, if you're experiencing this problem, and you're able to consolidate the two URLs, then you may get a boost in search rankings. One more thing to keep in mind is that if you're going through the trouble, you MUST use a 301 redirect for Google to consolidate their index between the two URLs. Otherwise your efforts will be futile.
This is a little too late since you've already solved this by upgrading to IIS7. But I'll just add that the only solution to this problem I've come up with for IIS6 is to add an ISAPI filter.
I documented the complete solution here...
http://swortham.blogspot.com/2008/12/redirecting-default-page-defaultaspx-to.html
If I understand you correctly, you don't want to display 'default.aspx' whenever someone comes into a folder with that document available.
So if they do hit it, you want to automatically redirect to the '/' and just load the default document anyway?
If that's the case then, as stated above, you run the risk of an infinite loop. The second comment gives you an answer but I guess expanding that to the re-write engine what you'd want is to:
Turn off default documents
Register each folder with the re-write engine
When that folder is requested load the default.aspx file as per your target rule
Does this sound about right?
I have to ask, why do you want to do this?
I'm not sure I understand what the problem is.
Though if you turn off default documents then / will simply point to the directory rather than the default.aspx page.
Leave default documents on and just do a redirect based on whether default.aspx is in the requested url or not.
Well, you can use regular .NET to inspect the HttpRequest URL. If it has "default.aspx" in it, you can redirect to "/", and there will be no infinite loop. You'd better do this on preload and end the response afterwards, to minimize the time it takes to process.

ASP.NET URL Rewriting

How do I rewrite a URL in ASP.NET?
I would like users to be able to go to
http://www.website.com/users/smith
instead of
http://www.website.com/?user=smith
Try the Managed Fusion Url Rewriter and Reverse Proxy:
http://urlrewriter.codeplex.com
The rule for rewriting this would be:
# clean up old rules and forward to new URL
RewriteRule ^/\?user=(.*) /users/$1 [NC,R=301]
# rewrite the rule internally
RewriteRule ^/users/(.*) /?user=$1 [NC,L]
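The internal rewrite can be tried out with a short Python sketch. Python's re module stands in for the rewriter's engine, and to_internal is a hypothetical helper name; the IGNORECASE flag mirrors the NC flag.

```python
import re

# Same pattern as the internal rule above; IGNORECASE mirrors [NC].
USERS = re.compile(r"^/users/(.*)", re.IGNORECASE)


def to_internal(path: str) -> str:
    """Map the friendly URL onto the query-string form."""
    return USERS.sub(r"/?user=\1", path)


if __name__ == "__main__":
    print(to_internal("/users/smith"))
    print(to_internal("/about"))
```

Paths that do not start with /users/ pass through unchanged.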
Microsoft now ships an official URL Rewriting Module for IIS: http://www.iis.net/download/urlrewrite
It supports most types of rewriting including setting server variables and wildcards.
It also will exist on all Azure web instances out of the box.
I have used an httpmodule for url rewriting from www.urlrewriting.net with great success (albeit I believe a much earlier, simpler version)
If you have very few actual rewriting rules, then the URL mappings built into .NET 2.0 are probably an easier option. There are a few write-ups of these on the web; the 4guysfromrolla one seems fairly exhaustive. But as you can see, they don't support regular expression mappings and are as such fairly useless in a dynamic environment (assuming "smith" in your example is not a special case, these would be of no use).

Resources