regex to allow space if followed by character - asp.net

I have an asp.net regularexpressionvalidator that I need to match on a textbox. If there is any text, logically the rules are as follows:
The text must be at least three characters, after any trimming to remove spaces.
Characters allowed are a-zA-Z0-9-' /\&.
I'm having major pain trying to construct an expression that will allow a space as the third character only if there is a fourth non-space character.
Can anyone suggest an expression? My last attempt was:
^[a-zA-Z0-9-'/\\&\.](([a-zA-Z0-9-'/\\&\.][a-zA-Z0-9-' /\\&\.])|([a-zA-Z0-9-' /\\&\.][a-zA-Z0-9-'/\\&\.]))[a-zA-Z0-9-' /\\&\.]{0,}$
but that does not match on 'a a'.
Thanks.

OK, now this is all in one regex:
^\s*(?=[a-zA-Z0-9'/\\&.-])([a-zA-Z0-9'/\\&.\s-]{3,})(?<=\S)\s*$
Explanation:
^ # Start of string
\s* # Optional leading whitespace, don't capture that.
(?= # Assert that...
[a-zA-Z0-9'/\\&.-] # the next character is allowed and non-space
)
( # Match and capture...
[a-zA-Z0-9'/\\&.\s-]{3,} # three or more allowed characters, including space
)
(?<=\S) # Assert that the previous character is not a space
\s* # Optional trailing whitespace, don't capture that.
$ # End of string
This matches
abc
aZ- &//
a ab abc x
aaa
a a
and doesn't match
aa
abc!
a&
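The two lists above can be checked mechanically. This is a minimal sketch, assuming Python's re module as a stand-in for the .NET engine behind the RegularExpressionValidator (both accept the lookahead and the fixed-width lookbehind used here). The names PATTERN and is_valid are made up for illustration.

```python
import re

# Same pattern as in the answer. Python's `re` is an assumed stand-in for .NET.
PATTERN = re.compile(
    r"^\s*(?=[a-zA-Z0-9'/\\&.-])([a-zA-Z0-9'/\\&.\s-]{3,})(?<=\S)\s*$"
)

def is_valid(text):
    """Return True if `text` satisfies the pattern."""
    return PATTERN.search(text) is not None

# Samples taken from the "matches" and "doesn't match" lists.
for sample in ["abc", "aZ- &//", "a ab abc x", "aaa", "a a", "aa", "abc!", "a&"]:
    print(repr(sample), is_valid(sample))
```

The lookahead rejects a leading space inside the captured part and the lookbehind rejects a trailing one, which is why 'a a' passes while 'aa' followed by a space would not.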

Simplifying your allowed characters to be a-z and space for clarity, doesn't this do it?
^ *[a-z][a-z ]+[a-z] *$
Ignore spaces. Now a letter. Then some letters or spaces. Then a letter. Ignore more spaces.
The full thing becomes:
^ *[a-zA-Z0-9-'/\\&\.][a-zA-Z0-9-'/\\&\. ]+[a-zA-Z0-9-'/\\&\.] *$
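A quick way to try this simpler pattern, again sketched in Python as an assumed stand-in for the .NET validator. In this sketch the hyphen is moved to the end of each character class so that it cannot be read as a range in any engine; that reordering is my assumption, not part of the answer. The name looks_valid is hypothetical.

```python
import re

# One non-space allowed char, one or more allowed chars (space included),
# one non-space allowed char, with optional spaces around the whole thing.
SIMPLE = re.compile(
    r"^ *[a-zA-Z0-9'/\\&.-][a-zA-Z0-9'/\\&. -]+[a-zA-Z0-9'/\\&.-] *$"
)

def looks_valid(text):
    """Return True if `text` has at least three allowed chars after trimming spaces."""
    return SIMPLE.search(text) is not None

for sample in ["a a", " abc ", "aa", "a  "]:
    print(repr(sample), looks_valid(sample))
```

Because the first and last positions exclude the space, any space can only sit in the middle, which gives the "space as third character only if a fourth character follows" rule without lookarounds.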

Related

Check if character is number

I want to check if a character can be safely converted to a numeric by using a regex.
However, I don't see my error. Example:
stringr::str_detect("4.", pattern = "-{0,1}[0-9]+(.[0-9]+){0,1}")
This produces TRUE. My intention was to specify that whenever a . follows the first sequence of digits, there must be at least one more digit, hence (.[0-9]+){0,1}.
What's wrong here?
Note:
(.[0-9]+){0,1} is an optional pattern because {0,1} (=?) makes the .[0-9]+ pattern sequence match one or zero times. So, yes, one or more digits ([0-9]+) must follow any char other than line break chars (matched with an unescaped .), but this pattern is optional, and thus you cannot require anything with it.
. is unescaped, so it matches any char other than line break chars. Escape it to match a literal dot (written as \\. inside an R string literal).
Your regex is not anchored, and can match partial substrings in a longer string. Use ^ and $ to make the pattern match the whole string.
So, consider using
stringr::str_detect("4.", pattern = "^-?[0-9]+(?:\\.[0-9]+)?$")
where
^ - start of string
-? - an optional - char
[0-9]+ - one or more digits
(?:\.[0-9]+)? - a non-capturing group matching an optional sequence of a . and then one or more digits
$ - end of string.
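The difference between the two patterns is easy to see side by side. This is a minimal sketch in Python rather than R (an assumption made so the example is self-contained); LOOSE, STRICT and is_numeric are illustrative names.

```python
import re

LOOSE = re.compile(r"-{0,1}[0-9]+(.[0-9]+){0,1}")  # pattern from the question
STRICT = re.compile(r"^-?[0-9]+(?:\.[0-9]+)?$")    # pattern from the answer

def is_numeric(text):
    """Return True if the whole string is an optionally signed decimal number."""
    return STRICT.search(text) is not None

# The loose pattern finds the substring "4" inside "4." and reports success.
print(LOOSE.search("4.").group(0))

for sample in ["4.", "4.5", "-12", "4x5"]:
    print(repr(sample), is_numeric(sample))
```

Without anchors the optional group simply matches zero times, so the trailing dot is never examined. With anchors the dot has to be consumed by something, and only the escaped \. followed by digits can do that.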

keep only alphanumeric characters and space in a string using gsub

I have a string which has alphanumeric characters, special characters and non UTF-8 characters. I want to strip the special and non utf-8 characters.
Here's what I've tried:
gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
However, this removes the special characters (punctuation and non-UTF-8) but the output has no spaces.
gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
The result has spaces but there are still non utf8 characters present.
Any work around?
For the sample string above, output should be:
Sample string here
You could use the classes [:alnum:] and [:space:] for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases illustrate clearly that the characters still there are considered letters. The E, B, H and P inside are also still letters, so the condition on which you're replacing is not correct. You don't want to keep all letters, you just want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even not bound by spaces, use the word boundary \\b instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
.*? fits anything (.) at least 0 times (*) but ungreedy (?). This means that gsub will try to fit the smallest amount possible by this piece.
everything between () will be stored and can be referred to in the replacement by \\1
\\b indicates a word boundary
This is followed at least once (+) by any character that's A-Z, a-z, 0-9 or a space. You have to do it that way, because the special letters are contained in between the upper and lowercase in the code table. So using A-z will include all special letters (which are UTF-8 btw!)
after that sequence, fit anything at least zero times to remove the rest of the string.
the backreference \\1 in combination with .* in the regex, will make sure only the required part remains in the output.
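The same two approaches (delete what you don't want, or capture what you do want) can be sketched outside R. The example below uses Python's re.sub in place of gsub, and writes the undecodable bytes of the sample as U+FFFD; both choices are assumptions made for the sake of a self-contained example.

```python
import re

# U+FFFD stands in for the undecodable bytes of the question's sample (an assumption).
sample = "\ufffd+ Sample 2 string here =\ufffd{\ufffd>E\ufffdBH\ufffdP<]\ufffd{\ufffd>"

# Approach 1: delete everything that is not an ASCII letter, digit or space.
stripped = re.sub(r"[^A-Za-z0-9 ]", "", sample)

# Approach 2: capture the first run of letters, digits and spaces that is
# delimited by word boundaries, and replace the whole string with that capture.
extracted = re.sub(r".*?(\b[A-Za-z0-9 ]+\b).*", r"\1", sample)

print(repr(stripped))
print(repr(extracted))
```

The first result still carries the stray E, B, H and P, while the second keeps only the bounded run, which mirrors the two R outputs shown above.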
stringr uses a different regex engine (ICU), which supports POSIX character classes. The :ascii: names the class, which must generally be enclosed in square brackets, [:ascii:], within the outer square brackets. The [^ indicates negation of the match.
library(stringr)
str_replace_all("�+ Sample string here =�{�>E�BH�P<]�{�>", "[^[:ascii:]]", "")
results in
[1] "+ Sample string here ={>EBHP<]{>"
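For comparison, here is a sketch of the same "drop everything that is not ASCII" idea in Python. Python's re module has no [:ascii:] class, so the explicit range \x00-\x7F is used instead, and U+FFFD again stands in for the undecodable bytes; both are assumptions of this sketch.

```python
import re

# U+FFFD stands in for the undecodable bytes of the question's sample (an assumption).
sample = "\ufffd+ Sample string here =\ufffd{\ufffd>E\ufffdBH\ufffdP<]\ufffd{\ufffd>"

# Remove every character outside the 7-bit ASCII range.
ascii_only = re.sub(r"[^\x00-\x7F]", "", sample)
print(ascii_only)
```

Note that this keeps ASCII punctuation such as + and {, so it answers "remove non-ASCII" rather than "keep only letters, digits and spaces".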

Regular Expression to match only letters

I need to write a regular expression for the RegularExpressionValidator ASP.NET Web Control.
The regular expression should ALLOW all alphabetic characters but not numbers or special characters (for example: |!"£$%&/()).
Any idea how to do it?
^[A-Za-z]+$
validates a string of length 1 or greater, consisting only of ASCII letters.
^[^\W\d_]+$
does the same for international letters, too.
Explanation:
[^ # match any character that is NOT a
\W # non-word character (anything other than letters, digits, underscore)
\d # digit
_ # or underscore
] # end of character class
Effectively, you get \w minus (\d and _).
Or, you could use the fact that ASP.NET supports Unicode properties:
^\p{L}+$
validates a string of Unicode letters of length 1 or more.
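The first two patterns can be compared with a short sketch. It is written in Python as an assumed stand-in for .NET: in both engines \w covers Unicode letters by default, but Python's built-in re module does not support \p{L}, so that third variant is left out here.

```python
import re

ASCII_LETTERS = re.compile(r"^[A-Za-z]+$")  # ASCII letters only
ANY_LETTERS = re.compile(r"^[^\W\d_]+$")    # \w minus digits and underscore

samples = ["Hello", "Hello1", "\u00c4rger", "a_b"]
for sample in samples:
    print(
        repr(sample),
        ASCII_LETTERS.search(sample) is not None,
        ANY_LETTERS.search(sample) is not None,
    )
```

The word with the umlaut fails the ASCII pattern but passes the second one, while digits and underscores fail both.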
Including spaces:
"^[a-zA-Z ]*$"
Excluding Spaces:
"^[a-zA-Z]*$"
To make it non-optional, change the * to a +
You can use the regex:
^[a-zA-Z]+$
Explanation:
^ : Start anchor
[..] : Char class
+ : one or more repetitions
$ : End anchor

Regular expression to match a maximum of five words

I have a regular expression
^[a-zA-Z+#-.0-9]{1,5}$
which validates that the word contains alphanumeric characters and a few special characters, and that its length is no more than 5 characters.
How do I make this regular expression accept a maximum of five words, each matching the above regular expression?
^[a-zA-Z+#\-.0-9]{1,5}(\s[a-zA-Z+#\-.0-9]{1,5}){0,4}$
Also, you could use for example [ ] instead of \s if you just want to accept space, not tab and newline. And you could write [ ]+ (or \s+) for any number of spaces (or whitespaces), not just one.
Edit: Removed the invalid solution and fixed the bug mentioned by unicornaddict.
I believe this may be what you're looking for. It forces at least one word of your desired pattern, then zero to four of the same, each preceded by one or more white-space characters:
^XX(\s+XX){0,4}$
where XX is your actual one-word regex.
It's separated into two distinct sections so that you're not required to have white-space at the end of the string. If you want to allow for such white-space, simply add \s* at that point. For example, allowing white-space both at start and end would be:
^\s*XX(\s+XX){0,4}\s*$
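The XX template can be filled in programmatically. This is a minimal sketch in Python (an assumed stand-in for whatever engine the pattern ends up in), using the one-word regex with the escaped hyphen; WORD and UP_TO_FIVE are illustrative names.

```python
import re

# One word: 1 to 5 characters from the allowed set (hyphen escaped).
WORD = r"[a-zA-Z+#\-.0-9]{1,5}"

# One word, then zero to four more words, each preceded by whitespace.
UP_TO_FIVE = re.compile(rf"^{WORD}(?:\s+{WORD}){{0,4}}$")

for sample in ["C# .NET C++", "one two three four five", "a b c d e f", "toolong", "a b "]:
    print(repr(sample), UP_TO_FIVE.search(sample) is not None)
```

Keeping the first word outside the repeated group is what lets the string end without trailing whitespace.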
Your regex has a small bug. It matches letters, digits, +, # and period, but also every character between # and period, because a hyphen surrounded on both sides inside a char class acts as a range metacharacter. The range #-. covers $ % & ' ( ) * + , as well as - itself, so the hyphen is matched only by accident. To avoid this you'll have to escape the hyphen:
^[a-zA-Z+#\-.0-9]{1,5}$
Or put it at the beginning or end of the char class, so that it's treated literally:
^[-a-zA-Z+#.0-9]{1,5}$
^[a-zA-Z+#.0-9-]{1,5}$
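The effect of the unescaped hyphen is easy to demonstrate. The sketch below uses Python, assumed here only as a convenient engine that treats ranges in character classes the same way; BUGGY and FIXED are illustrative names.

```python
import re

BUGGY = re.compile(r"^[a-zA-Z+#-.0-9]{1,5}$")   # "#-." is read as a range
FIXED = re.compile(r"^[a-zA-Z+#\-.0-9]{1,5}$")  # hyphen escaped, no range

for sample in ["$%&", "a-b", "C++"]:
    print(
        repr(sample),
        BUGGY.search(sample) is not None,
        FIXED.search(sample) is not None,
    )
```

The string of $, % and & is accepted only by the buggy pattern, because those characters sit between # and . in the ASCII table.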
Now to match a max of 5 such words you can use:
^(?:[a-zA-Z+#\-.0-9]{1,5}\s+){1,5}$
EDIT: This solution has a severe limitation: it matches only input that ends in white space. To overcome this limitation, see the answer by Jakob.

.net regex meaning of [^\\.]+

I have a question about a regex. Given this part of a regex:
(.[^\\.]+)
Does the part [^\\.]+ mean "get everything until the first dot"? With this text:
Hello my name is Martijn. I live in Holland.
I get 2 results: both sentences. But when I leave out the + sign, I get two-character matches: He, ll, o<space>, my, etc. Why is that?
Your regex .[^\\.]+ means:
Match any character
Match any character until you get a backslash or a dot ".". Note that [^\\.] means NOT backslash and NOT dot, so neither a dot nor a backslash can match. It will keep on matching characters until it finds a dot or a backslash because of the "+" at the end, which is a greedy quantifier: it takes as many characters as it can.
When you input (quotes not included): "Hello my name is Martijn. I live in Holland."
The matches are:
Hello my name is Martijn
. I live in Holland
Note that the dot is not included in the first match since it stops at n in Martijn and the second match starts with the dot.
When you remove the +: (.[^\\.])
It just means:
Match any character
Match any character except a dot or a backslash.
Because a dot outside a character class (ie, not between []) means (almost) any character.
So, .[^\\.] means match (almost) any character followed by something which is not a dot nor a backslash (dots don't need to be escaped in a character class to mean just a dot, but backslashes do),
This, in your example, is h (any character) e (not a dot nor a backslash) and so on and so forth.
Whereas with a + (one or more of not a dot nor a backslash) you will match all characters which are not dots until a dot.
The regex means:
any one character followed by more than zero characters that are not a backslash or a period.
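Both behaviours can be reproduced with a few lines. This is a sketch in Python, assumed here as a stand-in for .NET since the two engines agree on these constructs; the pattern is written as a raw string, so the class excludes a backslash and a dot.

```python
import re

text = "Hello my name is Martijn. I live in Holland."

# With "+": one arbitrary character, then as many non-dot, non-backslash
# characters as possible.
with_plus = re.findall(r".[^\\.]+", text)

# Without "+": one arbitrary character, then exactly one non-dot,
# non-backslash character, so every match is two characters long.
without_plus = re.findall(r".[^\\.]", text)

print(with_plus)
print(without_plus[:4])
```

The final dot of the text is never matched in either case, because after it there is no further character for the character class to consume.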

Resources