Quotes inside ValidationExpression for RegularExpressionValidator - asp.net

Using said control to validate an ASP.NET TextBox, I'm curious what the most popular practice is. Currently using:
ValidationExpression="^[\w\d\s.,&quot;'-]+$"
Any shorter way of doing this? Tried \" and "" to no avail. Thanks.

Using \" won't work, and you won't be able to use a "" either. What you have matches correctly.
That said, to make it shorter you could always use the hexadecimal character escape equivalent: \x22. Even shorter is the octal representation: \42. While both are shorter, they don't help readability much. Frequent regex users would understand that it represents some character, but they might not know which character without looking it up. In addition, you won't be able to comment it, unless you plan to leave ASP.NET markup comments nearby to explain the regex.
Still, I don't particularly like how &quot; looks either. It seems odd and out of place, making \x22 or \42 look a tad cleaner. Your call.
ValidationExpression="^[\w\d\s.,\x22'-]+$"
ValidationExpression="^[\w\d\s.,\42'-]+$"
Ultimately this lets you shave 2-3 characters off.
EDIT: added an even shorter approach using octal representation.

I think setting that value from the codebehind (where you'll have more control over string formatting/escaping) is going to be your best bet.
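For example, a minimal code-behind sketch (the validator ID revComment is a placeholder; use your control's actual ID), where ordinary C# string escaping applies instead of attribute/entity escaping:

protected void Page_Load(object sender, EventArgs e)
{
    // A regular string literal: escape backslashes and the quote with \.
    revComment.ValidationExpression = "^[\\w\\d\\s.,\"'-]+$";

    // Or a verbatim string: backslashes stay literal, only the quote is doubled.
    // revComment.ValidationExpression = @"^[\w\d\s.,""'-]+$";
}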

Related

RegularExpressionValidator slow on multiline textbox (textarea)

I have a multiline textbox (textarea) that I want to verify has a particular string in it. I was trying:
<asp:RegularExpressionValidator runat="server" ControlToValidate="txtTemplate" ValidationExpression="^(.\s*)*Content(.\s*)*$" Text="content" ErrorMessage="Must contain: Content" />
Using ^(.\s*)*$ seems to pass for a textarea. So I tried to sandwich my criteria between two of these. But it seems to lock up both IE and Chrome.
This should be simple, I think I'm making it tougher than it needs to be.
If the validation is always being done on the server (that's what runat="server" means, isn't it?), the simplest solution is probably to use this regex:
(?s)^.*Content.*$
(?s) turns on Singleline mode, which allows the . metacharacter to match all characters including linefeeds. If you want it to run on the client as well, use this:
^[\s\S]*Content[\s\S]*$
That's because JavaScript has no equivalent for Singleline mode (also known as DOT_ALL, DOTALL, dot-matches-all, single-line, or /s mode). It doesn't recognize inline modifiers like (?s) and (?i), either.
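On the server side, a quick sketch (with a made-up multiline value) shows that both patterns behave the same in .NET:

using System;
using System.Text.RegularExpressions;

string text = "First line\r\nContent\r\nLast line"; // made-up textarea value

// (?s) turns on Singleline, so . also matches the line breaks (server-side only):
Console.WriteLine(Regex.IsMatch(text, @"(?s)^.*Content.*$"));        // True

// [\s\S] matches any character in any mode, so the same idea also works in JavaScript:
Console.WriteLine(Regex.IsMatch(text, @"^[\s\S]*Content[\s\S]*$"));  // True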
Watch out for constructs like (.\s*)*, where an expression with quantifiers (*, +, etc.) is enclosed in a group which is itself controlled by a quantifier. If the regex fails to achieve a match right away, it goes back and tries to match by different paths (i.e., by using different parts of the regex to match different parts of the string), which can get very expensive, performance-wise. This regex is especially bad because . and \s can match many of the same characters, which dramatically increases the number of paths it has to explore before giving up.
The phenomenon is commonly known as catastrophic backtracking, and it usually manifests in cases where there's no possibility of a match. I would expect your validator to work fine when the sequence Content is present.
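If you do have to keep a backtracking-prone pattern around, .NET 4.5+ lets you cap the damage with a match timeout. A rough sketch (the input string is just an illustrative worst case: lots of whitespace and no "Content"):

using System;
using System.Text.RegularExpressions;

Regex risky = new Regex(@"^(.\s*)*Content(.\s*)*$", RegexOptions.None,
                        TimeSpan.FromMilliseconds(250));
try
{
    // On pathological input the engine may backtrack past the timeout.
    risky.IsMatch("a a a a a a a a a a a a a a a a a a a a a a a a a");
    Console.WriteLine("Finished within the timeout.");
}
catch (RegexMatchTimeoutException)
{
    Console.WriteLine("Gave up instead of hanging the page.");
}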
By the way, if you want to match only on the complete word Content, you should add word boundaries, like so:
(?s)^.*\bContent\b.*$
That will prevent false positives on words like MalContent and Contentious. \b works differently in different regex flavors. In .NET it's Unicode-aware unless you specify ECMAScript mode. In JavaScript it's supposed to recognize only the ASCII letters and digits as word characters; in most browsers it does, but don't take it for granted.
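A quick illustration of the difference (the sample strings are made up):

using System;
using System.Text.RegularExpressions;

Console.WriteLine(Regex.IsMatch("Some Content here", @"(?s)^.*\bContent\b.*$"));    // True
Console.WriteLine(Regex.IsMatch("Contentious argument", @"(?s)^.*\bContent\b.*$")); // False: 'i' follows, so \b fails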
Try
[\S\s]*Content[\S\s]*
I think a regex more like .*Content.* would be more effective and possibly faster. Also, you may want to implement a custom validator if this continues to be a performance drag, where you use JavaScript to search the text for the string.

What is wrong with this Regex

I am using ^[\w-\.\+]+@([\w-]+\.)+[\w-]{2,4}$ to validate email addresses. When I use it from .aspx.cs it works fine to validate an IDN email, but when I use it from the aspx page directly it doesn't work.
return Regex.IsMatch(
    email,
    @"^[\w-\.\+]+@([\w-]+\.)+[\w-]{2,4}$",
    RegexOptions.Singleline);
the ID that I would like to validate looks like pelai@ÖßÜÄÖ.com
I am too bad at regex; do you guys know what I am doing wrong?
You may want to take a look at regexlib.com; they have a fantastic selection of user-contributed patterns for these extremely common types of matches.
http://regexlib.com/Search.aspx?k=email
First off, correct validation of an e-mail address is somewhat more complex than a regex can handle. That aside, the regex is not at fault; the problem is more likely in how you use it.
Edit (after seeing your code): do you make sure that the string being tested has no whitespace and such in it? Put a breakpoint right there and inspect the string; that might give you an idea of what is going wrong.
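For instance, a quick sanity check along those lines (email and pattern are hypothetical names for whatever you pass to Regex.IsMatch; assumes using System.Text.RegularExpressions):

System.Diagnostics.Debug.WriteLine("[" + email + "]");  // brackets make stray whitespace visible
bool ok = Regex.IsMatch(email.Trim(), pattern);          // Trim() rules out leading/trailing whitespace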
You should escape the dash (-) within the first character class, and there's no need to escape the dot and plus:
[\w\-.+]
or
[\w.+-]
There's no need to escape the dash if it is the last character in the class.
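Either way, both classes accept the same input; a tiny check in .NET (the unescaped-dash question matters more in JavaScript, where handling can differ):

using System;
using System.Text.RegularExpressions;

Console.WriteLine(Regex.IsMatch("a-b.c+d", @"^[\w\-.+]+$")); // True: dash escaped
Console.WriteLine(Regex.IsMatch("a-b.c+d", @"^[\w.+-]+$"));  // True: dash last, no escape needed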
With "directly from aspx page" you probably mean in a regularexpression validator?
Then you need to be aware that the regex is used by a different system: JavaScript, which has its own regex implementation. This means that regexes that work in .NET directly might fail in JS.
The implementations are not too different; the basics are identical. But there can be differences in the details (such as how an unescaped - is handled), and JS lacks some "advanced features" (although your regex doesn't look too "advanced" ;-) ).
Do you see any error messages in the browser?
The problem is those non-ASCII characters in your test address, ÖßÜÄÖ (which you only ever mentioned in a comment to @HansKesting's answer). In .NET, \w matches all Unicode letters and digits, and even several characters besides _ that are classified as connecting punctuation, but in JavaScript it only matches [A-Za-z0-9_].
JavaScript also lacks support for Unicode properties (like \p{L} for letters) and blocks (\p{IsLatin}), so you would have to list any non-ASCII characters you want to allow by their Unicode escapes (\uXXXX). If you just want to support Latin1 letters, I suppose you could use [\w\u00C0-\u00FF], but IDN is supposed to support more than just Latin1, isn't it?
By the way, JavaScript also doesn't support Singleline mode, and even if it did you wouldn't be able to use it. JS does support Multiline and IgnoreCase modes, but there's no way to set them on both the server and client side. The inline modifiers, (?i) and (?m), don't work in JS, and the RegexOptions argument only works server-side.
Fortunately, you don't really need Singleline mode anyway; it allows the . metacharacter to match linefeeds, but the only dots in your regex are matching literal dots.
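You can see the \w difference from the server side alone: RegexOptions.ECMAScript makes .NET approximate the JavaScript behaviour. A sketch, using your test address and the corrected pattern from above:

using System;
using System.Text.RegularExpressions;

string email = "pelai@ÖßÜÄÖ.com";
string pattern = @"^[\w.+-]+@([\w-]+\.)+[\w-]{2,4}$";

Console.WriteLine(Regex.IsMatch(email, pattern));                          // True: .NET \w is Unicode-aware
Console.WriteLine(Regex.IsMatch(email, pattern, RegexOptions.ECMAScript)); // False: \w is only [a-zA-Z0-9_]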

Avoiding escape sequence processing in ActionScript?

I need to mimic C# functionality of the @ symbol when it precedes a string.
@"C:\A\File\Path" for example
What is the best way to do this? Also, there are some sites that will escape larger strings for you so that they survive the processing, but I could not find one for ActionScript.
Help?
No, unfortunately there is no support for verbatim string literals in ActionScript. You are going to have to escape them manually. Even calling string.replace("\", "\\") doesn't work, since the lone backslash just escapes the closing quote.
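For reference, this is the C# behaviour being mimicked, so in ActionScript the second form is the one you have to write by hand (a trivial sketch):

using System;

string verbatim = @"C:\A\File\Path";    // verbatim string: backslashes taken literally
string escaped  = "C:\\A\\File\\Path";  // ordinary literal: every backslash doubled
Console.WriteLine(verbatim == escaped); // True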

Should I use Replace() instead of HtmlEncode()?

Should HtmlEncode() be abandoned and Replace() used instead if I want to parse links in posts/comments (with regular expressions)? HtmlEncode() replaces & with &amp;, which I assume can cause problems with links; should I just use Replace() to replace < with &lt;?
For example if a user posts something like:
See this site http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3
I want it to be:
See this site <a href="http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3">http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3</a>
But with HtmlEncode() the URL will become (notice the ampersands):
See this site http://www.somesite.com/somepage.aspx?qs1=1&amp;qs2=2&amp;qs3=3
Should I avoid the problem by using Replace() instead?
Thanks
Actually, your last example - the one you're worried about - is the only correct one. In HTML documents, ampersands are used to introduce entity references, and therefore must be escaped. While most browsers are forgiving enough to let them slip through when not obviously part of an entity reference, you can run into subtle problems should their use in a URL happen to look like an entity.
Let HtmlEncode() do its job.
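For example, a rough sketch of building the link markup (HttpUtility lives in System.Web; the URL is the one from the question):

using System;
using System.Web;

string url = "http://www.somesite.com/somepage.aspx?qs1=1&qs2=2&qs3=3";

// Encode when writing into HTML: the href attribute should carry &amp;,
// and the browser decodes it back to & before requesting the page.
string link = string.Format("<a href=\"{0}\">{0}</a>", HttpUtility.HtmlEncode(url));
Console.WriteLine(link);
// <a href="http://www.somesite.com/somepage.aspx?qs1=1&amp;qs2=2&amp;qs3=3">...same URL...</a>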
Perhaps you are looking for UrlEncode()?
http://msdn.microsoft.com/en-us/library/zttxte6w.aspx
What are you looking to replace and why? HtmlEncode() is typically used to sanitize user-supplied data. That said, if you're allowing users to submit links, you probably don't want to HtmlEncode them in the first place; you're basically going to render them exactly as the user supplied them.
Replacing & with &amp; inside of an href attribute is correct. If you do not, your markup is technically invalid. Also, you should escape it even if it's inside of a link. The only case where you'll run into problems is if you end up HTML-encoding it multiple times.
I recommend against using Replace to do the job of HtmlEncode or UrlEncode. These functions are designed to take care of most of the problems that you'd see in user-entered content, and if you try to replace them with your own code, the results might get ugly (I am talking from experience here) if you forget something vital.

How have Html entities inside asp.net page?

Inside an asp.net page, should I use
<html><title>My page’s title from México</title></html>
Or
<html><title>My page&rsquo;s title from M&eacute;xico</title></html>
Both examples produce the same output. Since ASP.NET encodes all my pages as UTF-8, there is no need to use HTML entities, is that right?
The ASCII table is a set of characters, arguably the first standardized character set from the days when you could only spare 1 byte per character: http://asciitable.com/. I did some looking around at the extended ASCII character set, and it appears that the character you are referencing is covered. So there really isn't a problem whichever way you choose to display your title.
My revised answer is to go for the less expensive one in terms of space (i.e. the first one).
The second example will ensure compatibility with ASCII-only HTML transmission. So my vote is for the second example: you don't have to ensure the HTML is output and encoded as UTF-8 all the way through all the proxy servers and any other caching and translation that might occur.
You're correct; as long as there's Unicode at both ends of the pipe, it really doesn't matter. Personally, I would use the first simply because it's more readable.
And, honestly, Unicode has been widespread for some time. I personally believe that it's time to leave anyone who can't handle UTF-8 behind.
