I need to set up routing in global.asax so that anybody going to a certain page with an actual tilde in the URL (a tilde ended up in a shared link due to a bug) is redirected to the proper place using routing. How can I set up a route for a URL with an ACTUAL tilde ("~") in it, e.g. www.example.com/~/something/somethingelse, so it goes to the same place as www.example.com/something/somethingelse? It never seems to work!
In addition to Gerrie Schenck:
You should NEVER use unsafe characters in URLs. It's bad practice, and you cannot be sure that every browser will handle such a character.
Web development is about creating websites/web applications that function in all browsers (theoretically, of course; in practice that resolves to the few most-used browsers, depending on the audience the site serves :p).
The encoding should work; if it doesn't, that only proves Gerrie's point and mine about why you should not use unsafe characters.
List of unsafe characters and which encoding could be used:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
You could try escaping the tilde, but I doubt this will work since it's an unsafe character, meaning it should never be used in a URL.
For example:
www.example.com/%7E/something/somethingelse
%7E is the escape code for a tilde.
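If you want to double-check that escape code, a standard URL encoder will produce it for you. A minimal sketch in Java, used here purely as a neutral illustration (the question itself is ASP.NET); the class and method names are my own:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TildeEncoding {
    // Percent-encode one URL path segment. URLEncoder targets form
    // encoding, so swap its '+' (for space) for the path-safe %20.
    static String encodeSegment(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("~")); // %7E
    }
}
```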
I am using ^[\w-\.\+]+@([\w-]+\.)+[\w-]{2,4}$ to validate email addresses. When I use it from .aspx.cs it works fine for validating an IDN email, but when I use it from the aspx page directly it doesn't work.
return Regex.IsMatch(
    email,
    @"^[\w-\.\+]+@([\w-]+\.)+[\w-]{2,4}$",
    RegexOptions.Singleline);
the ID that I would like to validate looks like pelai@ÖßÜÄÖ.com
I am not good with regex; do you guys know what I am doing wrong?
You may want to take a look at regexlib.com; they have a fantastic selection of user-created patterns for these extremely common types of matches.
http://regexlib.com/Search.aspx?k=email
First, correct validation of an e-mail address is somewhat more complex than a regex can manage. But that apart, the regex is not at fault; the problem is probably in how you use it.
Edit (after seeing your code): do you make sure that the string to be tested has no whitespace and such in it? Put a breakpoint on it right there and inspect the string, that might give you an idea of what is going wrong.
You should escape the dash (-) inside the first character class, and there is no need to escape the dot and plus:
[\w\-.+]
or
[\w.+-]
There is no need to escape the dash if it is the last character in the class.
By "directly from the aspx page" you probably mean in a RegularExpressionValidator?
Then you need to be aware that the regex is used by a different system: JavaScript, which has its own implementation of regex. This means that regexes that work directly in .NET might fail in JS.
The implementations are not too different; the basics are identical. But there can be differences in details (such as how an unescaped - is handled), and JS lacks some "advanced features" (although your regex doesn't look too "advanced" ;-)).
Do you see any error messages in the browser?
The problem is those non-ASCII characters in your test address, ÖßÜÄÖ (which you only ever mentioned in a comment to @HansKesting's answer). In .NET, \w matches all Unicode letters and digits, and even several characters besides _ that are classified as connecting punctuation, but in JavaScript it only matches [A-Za-z0-9_].
JavaScript also lacks support for Unicode properties (like \p{L} for letters) and blocks (\p{IsLatin}), so you would have to list any non-ASCII characters you want to allow by their Unicode escapes (\uXXXX). If you just want to support Latin1 letters, I suppose you could use [\w\u00C0-\u00FF], but IDN is supposed to support more than just Latin1, isn't it?
By the way, JavaScript also doesn't support Singleline mode, and even if it did you wouldn't be able to use it. JS does support Multiline and IgnoreCase modes, but there's no way to set them on both the server and client side. The inline modifiers, (?i) and (?m), don't work in JS, and the RegexOptions argument only works server-side.
Fortunately, you don't really need Singleline mode anyway; it allows the . metacharacter to match linefeeds, but the only dots in your regex are matching literal dots.
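The \w difference is easy to reproduce. Java happens to expose both behaviours behind a flag, so a small sketch can mimic the two engines (ASCII-only \w like JavaScript's, Unicode-aware \w like .NET's); the class and field names are my own illustration:

```java
import java.util.regex.Pattern;

public class WordCharDemo {
    // ASCII-only \w, like JavaScript's regex engine
    static final Pattern ASCII = Pattern.compile("^[\\w.+-]+$");
    // Unicode-aware \w, like .NET's default behaviour
    static final Pattern UNICODE =
            Pattern.compile("^[\\w.+-]+$", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        System.out.println(ASCII.matcher("ÖßÜÄÖ").matches());   // false
        System.out.println(UNICODE.matcher("ÖßÜÄÖ").matches()); // true
    }
}
```

The same input string passes or fails depending purely on which engine's notion of \w is in play, which is exactly what happens between the server-side and client-side validators.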
How should I sanitize urls so people don't put 漢字 or other things in them?
EDIT: I'm using java. The url will be generated from a question the user asks on a form. It seems StackOverflow just removed the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is called slugify. There's no fixed mechanism for doing it; every framework handles it in its own way.
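A minimal hand-rolled slugify in Java might look like the sketch below. This is just one common approach (decompose accents with NFD, strip the combining marks, collapse everything else to hyphens), and the class and method names are my own; it also reproduces the behaviour you noticed, turning á into a and dropping 漢字 entirely:

```java
import java.text.Normalizer;
import java.util.Locale;

public class Slug {
    // Decompose accented letters, strip the combining marks, then collapse
    // every run of non-ASCII-alphanumeric characters into a single hyphen.
    static String slugify(String input) {
        String noAccents = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        return noAccents.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
    }

    public static void main(String[] args) {
        System.out.println(slugify("Héllo, Wörld!")); // hello-world
        System.out.println(slugify("漢字 test"));      // test (non-Latin chars dropped)
    }
}
```

Note the trade-off discussed elsewhere in this thread: distinct titles can collapse to the same slug, so you still want a unique ID in the URL.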
Yes, I would sanitize/remove them. Left encoded, the URL will either be inconsistent or look ugly.
Since you're using Java, see the URLEncoder API docs.
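For example, a quick check of what URLEncoder actually produces for non-ASCII input: it percent-encodes the UTF-8 bytes rather than removing the characters (the wrapper method name here is my own):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Each character becomes the percent-escapes of its UTF-8 bytes
        System.out.println(encode("漢字")); // %E6%BC%A2%E5%AD%97
        System.out.println(encode("á"));    // %C3%A1
    }
}
```

So URLEncoder gives you a safe URL, but not a pretty one; for readable URLs you still need a slugify step instead.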
Be careful! If you are removing elements such as odd characters, two distinct inputs could yield the same stripped URL even though they aren't meant to be the same.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means it will get encoded. URLs should be readable. Standards tend to be English-biased (what's that? Langist? Languagist?).
I'm not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious...
Unless the link is displayed properly, encoded by the browser and decoded at the other end... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php
I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns, we have generated URL-rewriting rules that mimic the layout of the application, which has led to having many non-ASCII characters in the URLs.
Examples š ž č
Some links are triggered from Flash using getURL, some are standard HTML links. Some are programmatic Response.Redirects and some are done by adding 301 status codes and Location headers to the response. I'm testing in IE6, IE7 and Firefox 3, and intermittently the browsers display the non-Latin characters in the URL encoded.
š = %c5%a1
ž = %c5%be
č = %c4%8d
I'm guessing this is something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...
Does anyone know of a way of forcing IIS to not URL encode these chars or is my best bet to replace these with non-diacritic chars?
Thanks
Ask yourself if you really want them non-url encoded. What happens when a user that does not have support for those characters installed comes around? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...
Instead, focus on why you need this feature. Is it to make the URLs look nice? If so, using a regular z instead of ž will do just fine. Do you use the URLs for user input? If so, URL-encode everything before putting it into link output, and URL-decode the input before using it. But don't use ž and other local letters in URLs...
As a side note, in Sweden we have å, ä and ö, but no one ever uses them in urls - we use a, a and o, because browsers won't support the urls otherwise. This doesn't surprise the users, and very few are unable to understand what words we're aiming at just because the ring in å is missing in the url. The text will still show correctly on the page, right? ;)
Does anyone know of a way of forcing IIS to not URL encode
You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so, the result won't be any different from writing ‘%C5%A1’ yourself in the first place.
Including a raw ‘š’ in a link is not wrong as such; the browser is supposed to encode it to UTF-8 and URL-encode it as per the IRI spec. But to make sure this actually works, you should ensure that the page containing the link is served as UTF-8. Again, manual URL-encoding is probably safest.
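If you do URL-encode manually, here is a sketch of what that gives for the characters in question. Java's URLEncoder is shown purely as an illustration: it emits the same UTF-8 percent-escapes you listed, just with uppercase hex (the two forms are equivalent), and decoding the result round-trips back to the original character:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DiacriticEncode {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("š")); // %C5%A1
        System.out.println(encode("ž")); // %C5%BE
        System.out.println(encode("č")); // %C4%8D
        // Decoding restores the original character
        System.out.println(URLDecoder.decode(encode("š"), StandardCharsets.UTF_8)); // š
    }
}
```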
I've had no trouble with UTF-8 URLs, can you link to an example that is not working?
do you have a link to a reference where it details what comprises a valid HTTP header?
Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’. TEXT, in most of the situations where it is included in HTTP, cannot be contrived to be an atom. Anyway, RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue for how one might interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.
In any case no actual browser attempts to find a way to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such as UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...
Those characters should be valid in a URL. I did the URL SEO work on a large travel site, and that's when I learned this. When you force diacritics to ASCII you can change the meaning of words if you're not careful; often there is no translation, as diacritics only have meaning in their context.
I have uriscan installed on my Win2003 server and it is blocking an older ColdFusion script. The log entry has the following--
2008-09-19 00:16:57 66.82.162.13 1416208729 GET /Admin/Uploads/Mountain/Wolf%2520Creek%2520gazeebo.jpg Rejected URL+is+double+escaped URL - -
How do I get uriscan to allow submissions like this without turning off the double-escaped url feature?
To quote another post on the subject, "some aspect of your process for submitting URIs is doing some bad encoding":
http://www.usenet-forums.com/archive/index.php/t-39111.html
I recommend changing the name of the JPG so it has no spaces, as good practice; then, on a non-production page, try to figure out why the %20 is being interpreted not as an encoded space but as a percent sign and two digits.
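To see why uriscan objects, note that %2520 really is a double escape: decoding it once yields %20, and only a second decode yields the space. A small Java illustration (Java is used here purely to demonstrate the decoding, not because it's part of the IIS setup):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DoubleEscape {
    static String decodeOnce(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String once = decodeOnce("Wolf%2520Creek");
        System.out.println(once);             // Wolf%20Creek  (%25 -> '%')
        System.out.println(decodeOnce(once)); // Wolf Creek
    }
}
```

So somewhere in the request chain the filename's space was encoded twice, which is exactly the pattern the double-escape filter is designed to reject.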
"How do I get uriscan to allow submissions like this without turning off the double-escaped url feature?"
How do you get it to allow double-escaped URLs without turning off the double-escaped URL feature? I think there's something wrong with what you're trying to do. My question is this: does your HTML source literally show image requests with "%2520" in them? Is that the correct name for your file? If so, you really have only two options: rename the file or turn off the feature disallowing double escapes.
How would one go about spotting URIs in a block of text?
The idea is to turn such runs of texts into links. This is pretty simple to do if one only considered the http(s) and ftp(s) schemes; however, I am guessing the general problem (considering tel, mailto and other URI schemes) is much more complicated (if it is even possible).
I would prefer a solution in C# if possible. Thank you.
Regexes may prove a good starting point for this, though URIs and URLs are notoriously difficult to match with a single pattern.
To illustrate, the simplest of patterns looks fairly complicated (in Perl 5 notation):
\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*
This would match
http://example.com/foo/bar-baz
and
ftp://192.168.0.1/foo/file.txt
but would cause problems for at least these:
mailto:support@stackoverflow.com (no match: there's no //, but there is an @)
ftp://192.168.0.1.2 (match, but too many numbers, so it's not a valid URI)
ftp://1000.120.0.1 (match, but the IP address needs numbers between 0 and 255, so it's not a valid URI)
nonexistantscheme://obvious.false.positive
http://www.google.com/search?q=uri+regular+expression (match, but the query string isn't handled explicitly)
I think this is a case of the 80:20 rule. If you want to catch most things, then I would do as suggested and find a decent regular expression if you can't write one yourself.
If you're looking at text pulled from fairly controlled sources (e.g. machine generated), then this will be the best course of action.
If you absolutely positively have to catch every URI that you encounter, and you're looking at text from the wild, then I think I would look for any word with a colon in it, e.g. \s(\w:\S+)\s. Once you have a suitable candidate for a URI, pass it to a real URI parser in the URI class of whatever library you're using.
If you're interested in why it's so hard to write a URI pattern, I guess it's because the definition of a URI is given by a Type-2 grammar, while regular expressions can only parse languages from Type-3 grammars.
Whether or not something is a URI is context-dependent. In general the only thing they always have in common is that they start "scheme_name:". The scheme name can be anything (subject to legal characters). But other strings also contain colons without being URIs.
So you need to decide what schemes you're interested in. Generally you can get away with searching for "scheme_name:", followed by characters up to a space, for each scheme you care about. Unfortunately URIs can contain spaces, so if they're embedded in text they are potentially ambiguous. There's nothing you can do to resolve the ambiguity - the person who wrote the text would have to fix it. URIs can optionally be enclosed in <>. Most people don't do that, though, so recognising that format will only occasionally help.
The Wikipedia article for URI lists the relevant RFCs.
[Edit to add: using regular expressions to fully validate URIs is a nightmare. Even if you somehow find or create one that's correct, it will be very large and difficult to comment and maintain. Fortunately, if all you're doing is highlighting links, you probably don't care about the odd false positive, so you don't need to validate. Just look for "http://", "mailto:\S*@", etc.]
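One way to combine the two ideas above (a loose scheme-based candidate match, then a real parser for validation) is sketched below in Java; the question asks for C#, but the same shape works there with System.Uri / Uri.TryCreate. The regex and the names are my own illustration, and it still accepts made-up schemes and trailing punctuation, per the caveats in this thread:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UriSpotter {
    // Candidate: a scheme-like token, a colon, then a run of non-space chars.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b[a-zA-Z][a-zA-Z0-9+.-]*:\\S+");

    // Keep only candidates the platform URI parser accepts as absolute URIs.
    static List<String> findUris(String text) {
        List<String> uris = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            try {
                if (new URI(m.group()).isAbsolute()) {
                    uris.add(m.group());
                }
            } catch (Exception ignored) {
                // not a parseable URI; skip it
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        System.out.println(findUris("see http://example.com/foo and mailto:a@b.com here"));
    }
}
```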
For a lot of the protocols you could just search for "://" without the quotes. Not sure about the others though.
Here is a code snippet with regular expressions for various needs:
http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/
That is not easy to do if you also want to match "something.tld", because normal text will contain many instances of that pattern; but if you want to match only URIs that begin with a scheme, you can try this regular expression (sorry, I don't know how to plug it into C#):
(http|https|ftp|mailto|tel):\S+[/a-zA-Z0-9]
You can add more schemes there, and it will match the scheme up to the next whitespace character, taking into account that the last character must not be invalid (for example, as in the very common string "http://www.example.com.").
the URL Tool for Ubiquity does the following:
findURLs: function(text) {
  var urls = [];
  var matches = text.match(/(\S+\.{1}[^\s\,\.\!]+)/g);
  if (matches) {
    for each (var match in matches) {
      urls.push(match);
    }
  }
  return urls;
},
The following Perl regexp should do the trick. Does C# have Perl regexps?
/\w+:\/\/[\w][\w\.\/]*/