Asp.net Regular expression to support multiple languages - asp.net

I have an ASP.NET RegularExpressionValidator:
ValidationExpression="^[a-zA-Z\?*.\?!\@#\%\&\~`\$\^_\,()\//]{1,30}$" />
It is meant to accept any alphanumeric characters but not script tags. Right now it doesn't support any language other than English.
I want to modify this regular expression to support Arabic characters as well.
Please help me modify this expression.
Thanks in advance.

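You essentially need to change your regex from a whitelist to a blacklist: instead of listing the characters you allow, you list the characters you want to reject. You can achieve this by putting a ^ immediately after the opening bracket, which negates the character class. Because the validator only passes when the whole input matches the expression, keep the anchors and the quantifier. So
ValidationExpression="^[^\?*.\?!\@#\%\&\~`\$\^_\,()\//]{1,30}$"
will pass any string of 1 to 30 characters that does not contain the characters in the expression.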
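As a rough sketch of the blacklist idea in JavaScript (the language the validator runs on the client): the pattern below rejects only angle brackets, which is an assumption about what "script tags" means here and not the asker's actual rule. The helper name passesValidator is hypothetical; it mimics the requirement that the whole input match.

```javascript
// Hypothetical blacklist: anything except < and > is allowed, 1 to 30 characters.
const blacklistPattern = /^[^<>]{1,30}$/;

// Mimics the validator's rule: valid only if the entire input matches.
function passesValidator(pattern, value) {
  const match = pattern.exec(value);
  return match !== null && match[0] === value;
}

console.log(passesValidator(blacklistPattern, "Hello world"));    // true
console.log(passesValidator(blacklistPattern, "مرحبا بالعالم"));   // true
console.log(passesValidator(blacklistPattern, "<script>"));        // false
```

With a negated class, Arabic (or any other script) passes without being listed, at the cost of having to think of every character you want to keep out.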

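You can add Arabic characters to regular expressions; they match themselves. One problem with Unicode is that Arabic letters, digits, punctuation, and ornaments are scattered around in code blocks, so you may have to add the specific ranges or symbols you are looking for. The basic letters sit in two runs, U+0621-U+063F and U+0641-U+064A (the second run holds common letters such as lam and meem), and further letters follow in U+066E-U+06D3:
ValidationExpression="^[a-zA-Z\?*.\?!\@#\%\&\~`\$\^_\,()\//\u0621-\u063F\u0641-\u064A\u066E-\u06D3]{1,30}$"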
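A quick way to check a whitelist like this is to run it in JavaScript, which is what the validator does in the browser. The sketch below uses a close copy of the asker's character class plus three Arabic letter ranges: U+0621-U+063F, U+0641-U+064A, and U+066E-U+06D3. The middle range is included because everyday letters such as ل and م live there. The sample strings are illustrative only.

```javascript
// The asker's whitelist (close copy) plus three Arabic letter ranges.
const whitelistPattern =
  /^[a-zA-Z\?*.\?!\@#\%\&\~`\$\^_\,()\/\u0621-\u063F\u0641-\u064A\u066E-\u06D3]{1,30}$/;

console.log(whitelistPattern.test("Hello"));    // true
console.log(whitelistPattern.test("مرحبا"));    // true
console.log(whitelistPattern.test("<b>"));      // false: < and > are not whitelisted
console.log(whitelistPattern.test("\u0663"));   // false: Arabic-Indic digit three is outside the ranges
```

Note that the class contains neither ASCII digits nor a space, so multi-word input fails until those are added as well.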

Related

How search special characters in Kibana

I would like to find messages that contain the sequence "--->", but Kibana returns the wrong results.
How do I escape these characters to get the right results?
Thanks,
First, I think you should check your mappings to see whether your fields are marked as not_analyzed (or use the keyword analyzer). If they are not, you won't get any results for this sequence, because the standard analyzer removes such characters when indexing a document.
What happens if you search for it inside quotes, including your special characters?
This SO question could help you. You could also have a look at the Special Characters section of the Lucene docs. Hope it helps!

RegularExpressionValidator slow on multiline textbox (textarea)

I have a multiline textbox (textarea) that I want to verify has a particular string in it. I was trying:
<asp:RegularExpressionValidator runat="server" ControlToValidate="txtTemplate" ValidationExpression="^(.\s*)*Content(.\s*)*$" Text="content" ErrorMessage="Must contain: Content" />
Using ^(.\s*)*$ seems to pass for a textarea. So I tried to sandwich my criteria between two of these. But it seems to lock up both IE and Chrome.
This should be simple, I think I'm making it tougher than it needs to be.
If the validation is always being done on the server (that's what runat="server" means, isn't it?), the simplest solution is probably to use this regex:
(?s)^.*Content.*$
(?s) turns on Singleline mode, which allows the . metacharacter to match all characters including linefeeds. If you want it to run on the client as well, use this:
^[\s\S]*Content[\s\S]*$
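That's because the validator's client-side JavaScript gives you no way to turn on Singleline mode (also known as DOT_ALL, DOTALL, dot-matches-all, single-line, or /s mode): JavaScript only gained a dotAll flag in ES2018, and the validator does not let you set regex flags. JavaScript doesn't recognize inline modifiers like (?s) and (?i), either.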
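The difference between . and [\s\S] is easy to see on a multi-line string. This is a minimal sketch with made-up sample text:

```javascript
const text = "first line\nContent here\nlast line";

// Without dot-all behavior, . stops at the line break, so the anchored pattern fails.
const dotPattern = /^.*Content.*$/;
// [\s\S] matches any character, line breaks included.
const anyCharPattern = /^[\s\S]*Content[\s\S]*$/;

console.log(dotPattern.test(text));      // false
console.log(anyCharPattern.test(text));  // true
```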
Watch out for constructs like (.\s*)*, where an expression with quantifiers (*, +, etc.) is enclosed in a group which is itself controlled by a quantifier. If the regex fails to achieve a match right away, it goes back and tries to match by different paths (i.e., by using different parts of the regex to match different parts of the string), which can get very expensive, performance-wise. This regex is especially bad because . and \s can match many of the same characters, which dramatically increases the number of paths it has to explore before giving up.
The phenomenon is commonly known as catastrophic backtracking, and it usually manifests in cases where there's no possibility of a match. I would expect your validator to work fine when the sequence Content is present.
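To get a feel for the cost, the sketch below times the original pattern against a run of spaces that cannot match, and compares it with the [\s\S] version. The helper timeMatch and the chosen lengths are illustrative assumptions; the timings depend on the engine and machine, so none are shown. On a backtracking engine the slow pattern's time should roughly double with each extra space.

```javascript
const slowPattern = /^(.\s*)*Content(.\s*)*$/;   // nested quantifiers, overlapping . and \s
const fastPattern = /^[\s\S]*Content[\s\S]*$/;   // only one way to consume each character

// Returns whether the pattern matched and roughly how long the attempt took.
function timeMatch(pattern, input) {
  const start = Date.now();
  const matched = pattern.test(input);
  return { matched, ms: Date.now() - start };
}

// A run of spaces with no "Content": every space can be taken by . or by \s*.
for (const length of [10, 14, 18, 22]) {
  const input = " ".repeat(length);
  console.log(length, timeMatch(slowPattern, input), timeMatch(fastPattern, input));
}
```

Keep the lengths small when experimenting: a few dozen whitespace characters is enough to make the slow pattern appear to hang.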
By the way, if you want to match only on the complete word Content, you should add word boundaries, like so:
(?s)^.*\bContent\b.*$
That will prevent false positives on words like MalContent and Contentious. \b works differently in different regex flavors. In .NET it's Unicode-aware unless you specify ECMAScript mode. In JavaScript it's supposed to recognize only the ASCII letters and digits as word characters; in most browsers it does, but don't take it for granted.
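A small check of the word-boundary version, written in the client-side form with [\s\S] instead of (?s); the sample strings are illustrative:

```javascript
const wholeWordPattern = /^[\s\S]*\bContent\b[\s\S]*$/;

console.log(wholeWordPattern.test("Header\nContent goes here"));  // true
console.log(wholeWordPattern.test("MalContent"));                  // false
console.log(wholeWordPattern.test("Contentious"));                 // false
```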
Try
[\S\s]*Content[\S\s]*
I think a regex more like .*Content.* would be more effective and possibly faster. Also, you may want to implement a custom validator if this continues to be a performance drag, where you use JavaScript to search the text for the string.
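A custom validator along those lines could be as small as the sketch below. It assumes an ASP.NET CustomValidator whose ClientValidationFunction points at this function; the function name validateHasContent is made up for illustration. ASP.NET passes the function an arguments object with Value and IsValid properties.

```javascript
// Client-side function for a CustomValidator's ClientValidationFunction.
// args.Value holds the text being validated; the function sets args.IsValid.
function validateHasContent(source, args) {
  args.IsValid = args.Value.indexOf("Content") !== -1;
}

// Stand-alone check with a stand-in args object (no ASP.NET page involved):
const args = { Value: "Header\nContent\nFooter", IsValid: false };
validateHasContent(null, args);
console.log(args.IsValid);  // true
```

A plain substring search involves no backtracking at all, so it stays fast however long the textarea content gets. The same check would still need to be repeated on the server.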

Good whitelist for search terms

I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the current regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.
What about 2010 (A space odyssey)? What about Giscard d`Estaing's autobiography? ... This is really impossible to answer generally, it will depend on your application and data structures.
You want to look into the fulltext search functions of the database of your choice, or even specialized search appliances like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.
Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including @#$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and have some exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
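The audit script described above could look roughly like this. It is a sketch only: the function name strippedCharacters and the three sample titles are assumptions, and a real run would read the titles from the database. The default pattern is the same whitelist as in the question, translated to JavaScript.

```javascript
// Lists every distinct character that the whitelist would strip from the titles.
function strippedCharacters(titles, disallowed = /[^a-z0-9 -]/gi) {
  const stripped = new Set();
  for (const title of titles) {
    for (const ch of title.match(disallowed) || []) {
      stripped.add(ch);
    }
  }
  return [...stripped].sort();
}

// Illustrative titles only.
const sampleTitles = ["2010 (A space odyssey)", "C++", "Crêpes"];
console.log(strippedCharacters(sampleTitles));  // [ '(', ')', '+', 'ê' ]
```

Any character in the output is one that appears in real data but would be removed from a search term, which is where mismatches are most likely.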
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.

How to use a plus sign instead of dashes in WordPress

Hey, as I said in the title, I want to use the plus sign (+) instead of dashes, so that post URLs look like this:
some+post+test
Another question: when I use non-Latin characters, WordPress breaks the permalink and keeps just 30 words! How can I solve that?
The + sign is a reserved character in URLs and will translate to a space.
From what you are saying, though, your underlying issue is that you want to use non-ASCII characters in URLs. That is not valid in the first place: You will have to percent encode the slug before inserting it. Most modern browsers will show the URL in its proper form anyway.
Here is an on-line tool for percent-encoding incoming data. For example, the UTF-8 input of
Crêpes
will translate to
Cr%C3%AApes
Background info: Unicode characters in URLs
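Both points can be checked with standard JavaScript functions. This is only an illustration of the encoding rules, not of how WordPress builds its slugs:

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of non-ASCII characters.
console.log(encodeURIComponent("Crêpes"));          // Cr%C3%AApes
console.log(decodeURIComponent("Cr%C3%AApes"));     // Crêpes

// In a query string, + is decoded as a space.
const params = new URLSearchParams("title=some+post+test");
console.log(params.get("title"));                   // some post test

// A literal plus sign has to be sent as %2B.
console.log(encodeURIComponent("some+post+test"));  // some%2Bpost%2Btest
```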

What is wrong with this Regex

I am using ^[\w-\.\+]+@([\w-]+\.)+[\w-]{2,4}$ to validate email addresses. When I use it from .aspx.cs it validates IDN email addresses fine, but when I use it directly from the aspx page it doesn't work.
return Regex.IsMatch(
email,
@"^[\w-\.\+]+@([\w-]+\.)+[\w-]{2,4}$",
RegexOptions.Singleline);
The ID that I would like to validate looks like pelai@ÖßÜÄÖ.com
I am not good at regex. Do you know what I am doing wrong?
You may want to take a look at regexlib.com; they have a fantastic selection of user-created content for these extremely common types of matches.
http://regexlib.com/Search.aspx?k=email
First, correct validation of an e-mail address is somewhat more complex than this regex. But that aside, the regex is probably not at fault; more likely it is how you use it.
Edit (after seeing your code): do you make sure that the string to be tested has no whitespace and such in it? Put a breakpoint on it right there and inspect the string, that might give you an idea of what is going wrong.
You should escape the dash (-) within the first character class, and there is no need to escape the dot and the plus:
[\w\-.+]
or
[\w.+-]
There is no need to escape the dash if it is the last character.
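Both spellings of the class accept the same characters, as this small JavaScript check suggests (the sample strings are illustrative):

```javascript
const trailingDash = /^[\w.+-]+$/;   // trailing - is literal; . and + need no escape in a class
const escapedDash = /^[\w\-.+]+$/;   // same set of characters, with the dash escaped

console.log(trailingDash.test("first.last+tag-1"));  // true
console.log(escapedDash.test("first.last+tag-1"));   // true
console.log(trailingDash.test("first last"));        // false: a space is not in the class
```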
With "directly from aspx page" you probably mean in a regularexpression validator?
Then you need to be aware that the regex is used by a different system: JavaScript, which has its own implementation of regex. This means that regexes that work in .NET directly might fail in JS.
The implementations are not too different, the basics are identical. But there might be differences in details (as how an unescaped - is handled) and js lacks some "advanced features" (although your regex doesn't look too "advanced" ;-) ).
Do you see any error messages in the browser?
The problem is those non-ASCII characters in your test address, ÖßÜÄÖ (which you only ever mentioned in a comment to @HansKesting's answer). In .NET, \w matches all Unicode letters and digits, and even several characters besides _ that are classified as connecting punctuation, but in JavaScript it only matches [A-Za-z0-9_].
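The validator's client-side regexes also can't use Unicode properties (like \p{L} for letters) or blocks (\p{IsLatin}): JavaScript only understands property escapes in patterns compiled with the u flag (ES2018 and later), which the validator does not set, and it has no named blocks at all. So you would have to list any non-ASCII characters you want to allow by their Unicode escapes (\uXXXX). If you just want to support Latin1 letters, I suppose you could use [\w\u00C0-\u00FF], but IDN is supposed to support more than just Latin1, isn't it?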
By the way, Singleline mode is also unavailable on the client side: JavaScript had no such mode when this was written (a dotAll flag arrived in ES2018), and the validator gives you no way to set flags anyway. JS does support Multiline and IgnoreCase modes, but there's no way to set them on both the server and client side. The inline modifiers, (?i) and (?m), don't work in JS, and the RegexOptions argument only works server-side.
Fortunately, you don't really need Singleline mode anyway; it allows the . metacharacter to match linefeeds, but the only dots in your regex are matching literal dots.
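The \w difference can be reproduced directly in JavaScript. In this sketch the pattern withLatin1 is a hypothetical variant that adds the [\w\u00C0-\u00FF] idea to the domain labels only, and the address uses @ as the separator; it is meant to show the mechanism, not to be a complete IDN validator.

```javascript
const asciiOnly = /^[\w.+-]+@([\w-]+\.)+[\w-]{2,4}$/;
// Hypothetical variant: also allows U+00C0-U+00FF in the domain labels.
// That range is mostly Latin-1 letters but also contains × (U+00D7) and ÷ (U+00F7).
const withLatin1 = /^[\w.+-]+@([\w\u00C0-\u00FF-]+\.)+[\w-]{2,4}$/;

const address = "pelai@ÖßÜÄÖ.com";
console.log(/^\w+$/.test("ÖßÜÄÖ"));      // false: \w is [A-Za-z0-9_] in JavaScript
console.log(asciiOnly.test(address));     // false
console.log(withLatin1.test(address));    // true
```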

Resources