I'm trying to create a validator for a string that may contain 1-N words, which are separated by a single whitespace character (spaces only between words). I'm a newbie at regex, so I feel a bit confused, because my expression seems to be correct:
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
What am I doing wrong here? (it accepts only 2 words .. but I want it to accept 1+ words)
Any help is greatly appreciated :)
As often happens with someone beginning a new programming language or syntax, you're close, but not quite! The ^ and $ anchors are being used correctly, and the character classes [a-zA-Z] will match only letters (sounds right to me), but your repetition is a little off, and your grouping is not what you think it is - which is your primary problem.
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
 ^           ^^^^^^^^
 a           bbbacccc
It only matches two words because you effectively don't have any group repetition; this is because you don't really have any groups - only character classes. The simplest fix is to change the first [ and its matching closing bracket (marked by a's in the listing above) to parentheses:
^([a-zA-Z]+\s{1}){0,}[a-zA-Z]+$
This single change will make it work the way you expect! However, there are a few recommendations and considerations I'd like to make.
First, for readability and code maintenance, use the single character repetition operators instead of repetition braces wherever possible. * repeats zero or more times, + repeats one or more times, and ? repeats 0 or one times (AKA optional). Your repetition curly braces are syntactically correct, and do what you intend them to, but one (marked by b's above) should be removed because it is redundant, and the other (marked by c's above) should be shortened to an asterisk *, as they have exactly the same meaning:
^([a-zA-Z]+\s)*[a-zA-Z]+$
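To try the simplified pattern out, here is a minimal Python sketch. The question does not say which language the validator runs in, so Python and the helper name is_word_list are assumptions made only for illustration.

```python
import re

# The simplified pattern: zero or more "word + one whitespace" groups,
# followed by a final word. Both character classes are [a-zA-Z].
WORDS = re.compile(r"^([a-zA-Z]+\s)*[a-zA-Z]+$")


def is_word_list(text: str) -> bool:
    """Return True if text is 1+ letter-only words joined by single whitespace characters."""
    # fullmatch avoids Python's quirk where $ also matches just before a trailing newline.
    return WORDS.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["hello", "hello world", "one two three four", "hello  world", "hello "]:
        print(repr(sample), is_word_list(sample))
```

Note that \s accepts tabs and other whitespace as well as spaces; if only a literal space should be allowed, replacing \s with a space character in the pattern is one option.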
Second, I would recommend considering (depending upon your application requirements) the \w shorthand character class instead of the [a-zA-Z] character class, with the following considerations:
it matches both upper and lowercase letters
it does match more than letters (it matches digits 0-9 and the underscore as well)
it can often be configured to match non-English (unicode) letters for multi-lingual input
If any of these are unnecessary or undesirable, then you're on the right track!
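The trade-offs listed above can be seen in a small Python sketch. Python is an assumption here (the question does not name a language), and other engines configure Unicode matching differently; in Python, \w is Unicode-aware for str patterns unless re.ASCII is passed. The names below are illustrative only.

```python
import re

# Three variants of the same "words separated by one whitespace" pattern.
PATTERNS = {
    "letters": re.compile(r"([a-zA-Z]+\s)*[a-zA-Z]+"),
    "word": re.compile(r"(\w+\s)*\w+"),                  # Unicode letters, digits, underscore
    "word_ascii": re.compile(r"(\w+\s)*\w+", re.ASCII),  # only [a-zA-Z0-9_]
}


def accepted_by(text: str) -> set:
    """Return the names of the patterns that match the whole of text."""
    return {name for name, pattern in PATTERNS.items() if pattern.fullmatch(text)}


if __name__ == "__main__":
    for sample in ["foo bar", "foo_1 bar2", "héllo wörld"]:
        print(repr(sample), sorted(accepted_by(sample)))
```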
On a side note, the character combination \b is a word-boundary assertion and is not needed for your case, as you will already begin and end where there are letters and letters only!
As for learning more about regular expressions, I would recommend Regular-Expressions.info, which has a wealth of info about regexes and the inner workings and quirks of the various implementations. I also use a tool called RegexBuddy to test and debug expressions.
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
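One way to check the equivalence in a given engine is simply to compare the two forms on a handful of inputs. The sketch below does that for Python's re module only, including the reluctant (lazy) variants; it says nothing about other engines, and the sample strings and function names are made up for illustration.

```python
import re

# (explicit form, shorthand form) pairs, including lazy variants.
PAIRS = [
    (r"(ab){0,1}", r"(ab)?"),
    (r"(ab){1}", r"(ab)"),
    (r"(ab){0,}", r"(ab)*"),
    (r"(ab){1,}", r"(ab)+"),
    (r"(ab){0,}?", r"(ab)*?"),
    (r"(ab){1,}?", r"(ab)+?"),
]
SAMPLES = ["", "ab", "abab", "ababab", "xab", "abx"]


def span(pattern: str, text: str):
    """Return the (start, end) of a match anchored at the start of text, or None."""
    match = re.match(pattern, text)
    return match.span() if match else None


def forms_agree() -> bool:
    """True if every explicit form matches exactly like its shorthand on all samples."""
    return all(span(explicit, text) == span(short, text)
               for explicit, short in PAIRS
               for text in SAMPLES)


if __name__ == "__main__":
    print(forms_agree())
```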
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If those criteria aren't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
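A small Python sketch of that backreference case, assuming Python's re module (as far as I can tell, a bare \10 there is read as a reference to group 10, which is why some separator is needed). The second pattern shows one of the "other ways", wrapping the backreference in a non-capturing group; the variable names are illustrative.

```python
import re

# Group 1 captures "a"; we then want "the same text again, followed by a literal 0".
WITH_BRACES = re.compile(r"(a)\1{1}0")    # {1} separates \1 from the digit 0
WITH_GROUP = re.compile(r"(a)(?:\1)0")    # alternative: isolate \1 in a non-capturing group

if __name__ == "__main__":
    for pattern in (WITH_BRACES, WITH_GROUP):
        print(pattern.pattern, bool(pattern.fullmatch("aa0")), bool(pattern.fullmatch("a0")))
```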
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
I need to draw a DFA diagram that can recognize arithmetic expressions. Variables and brackets are not allowed; an expression can only contain numbers and the four arithmetic operators.
It has to accept any number string with or without a sign, e.g. 5, -7, +15.
Number strings can also be mixed with arithmetic operators, e.g. 3+5, -1+7*3.
I don't know if my diagram actually meets these requirements.
No, your diagram is not good enough yet: it allows +/-/+++ as an expression.
I'd start formulating this using something like EBNF first, then turn that into a regular expression (essentially inlining the non-terminals), and then build a DFA from that.
What is a number? Only integers, or can you have a decimal point? If decimal points are fine, do you need digits before and after, or just one of them? Do you allow scientific notation with "e" in it? May there be leading zeros in numbers, or only if the whole integer part is zero? And what about signs? Do you allow more than one? Do you allow a unary plus or minus sign after any arithmetic operator?
Depending on your answers above, the EBNF might look somewhat like this:
digits = digit { digit }
number = [ "+" | "-" ] ( digits [ "." [ digits ] ] | "." digits )
operator = "+" | "-" | "*" | "/"
expression = number { operator number }
Turning these things to regular expressions:
digits : [0-9]+
number : [+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)
operator : [+\-*/]
expression: [+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([+\-*/][+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+))*
Now that final regexp is what you can use to build a DFA from. May it have ε transitions or do you need inputs on all edges? Must it be deterministic? Depending on these answers, there may be more or less work ahead of you.
You don't have to mechanically turn the regexp into an automaton; with a bit of human intuition you can build a simpler automaton. But make sure that you see all the considerations which are captured in the building of these expressions reflected in your automaton. Avoid allowing more than one decimal point. Avoid chaining operators. Make sure every number has at least one digit. Things like this.
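As a rough illustration, here is one hand-built automaton written as a Python transition function and cross-checked against the regular expression above. The state names (START, SIGN, INT, DOT, FRAC, DEAD) and the choice of Python are my own assumptions, and the automaton follows the particular grammar sketched above (optional sign, optional decimal point, no exponent), so a different set of answers to the questions would need different states.

```python
import re

DIGITS = set("0123456789")
SIGNS = set("+-")
OPERATORS = set("+-*/")
ACCEPTING = {"INT", "FRAC"}  # states in which a complete number has been read

# The expression regex from above, assembled from its "number" part.
NUMBER = r"[+\-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)"
EXPRESSION = re.compile(NUMBER + r"(?:[+\-*/]" + NUMBER + r")*")


def step(state: str, ch: str) -> str:
    """Deterministic transition function; anything not listed goes to DEAD."""
    if state == "START":            # start of a number (start of input or after an operator)
        if ch in SIGNS:
            return "SIGN"
        if ch in DIGITS:
            return "INT"
        if ch == ".":
            return "DOT"
    elif state == "SIGN":           # a sign was read, a number must follow
        if ch in DIGITS:
            return "INT"
        if ch == ".":
            return "DOT"
    elif state == "INT":            # integer digits read so far
        if ch in DIGITS:
            return "INT"
        if ch == ".":
            return "FRAC"
        if ch in OPERATORS:
            return "START"
    elif state == "DOT":            # leading "." with no digits yet
        if ch in DIGITS:
            return "FRAC"
    elif state == "FRAC":           # after the decimal point; no second "." allowed
        if ch in DIGITS:
            return "FRAC"
        if ch in OPERATORS:
            return "START"
    return "DEAD"


def accepts(text: str) -> bool:
    """Run the automaton over text and report whether it ends in an accepting state."""
    state = "START"
    for ch in text:
        state = step(state, ch)
        if state == "DEAD":
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for sample in ["5", "-7", "+15", "3+5", "-1+7*3", "+/-/+++", "1..2", "3+"]:
        print(repr(sample), accepts(sample), bool(EXPRESSION.fullmatch(sample)))
```

Each of the checks mentioned above maps to a state: DOT and SIGN are non-accepting so every number needs a digit, FRAC has no transition on "." so a second decimal point is rejected, and START has no transition on * or / so operators cannot be chained.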
OK regex nerds!
I am using regex lookahead assertions for password validation that is similar to the pattern described here:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)
However, we want to only require that any 3 of the 4 assertions be valid - not necessarily all of them. Any thoughts on how this could be done?
To shorten any kind of pattern, factorize:
\A(?:
(?=\w{6,10}\z) (?=.*[a-z]) (?: (?:.*[A-Z]){3} | .*\d )
|
(?=.*\d) (?=(?:.*[A-Z]){3}) (?: .*[a-z] | \w{6,10}\z )
)
Note that you don't need a lookahead to test the last condition.
Other way, where each condition is optional and that uses a named group to count (.net only):
\A
(?<c>(?=\w{6,10}\z))?
(?<c>(?=[^a-z]*[a-z]))?
(?<c>(?=(?:[^A-Z]*[A-Z]){3}))?
(?<c>(?=\D*\d))?
(?<-c>){3} # decrement c 3 times
(?(c)|(?!$)) # conditional: force the pattern to fail if too few conditions succeed.
There's no "easy" way to do this in a single regular expression. The only way would be to define all possible permutations of the "three out of four" assertions - e.g.
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})| # Maybe no digit
\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)| # Maybe wrong length
\A(?=\w{6,10}\z)(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)| # Maybe no lower
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=\D*\d) # Maybe not enough uppers
However, this mind-melting regex is clearly not a good solution.
A better approach would be to perform the four checks separately (with regex or otherwise), and verify that at least three of them pass.
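A minimal sketch of that approach in Python (the question's pattern looks like .NET, so treat the language, the function names, and the required parameter as assumptions). Each of the four conditions from the original pattern is checked on its own and the passes are counted.

```python
import re


def passed_checks(password: str) -> int:
    """Count how many of the four conditions from the original pattern hold."""
    checks = [
        re.fullmatch(r"\w{6,10}", password) is not None,  # 6-10 word characters, nothing else
        re.search(r"[a-z]", password) is not None,        # at least one lowercase letter
        len(re.findall(r"[A-Z]", password)) >= 3,         # at least three uppercase letters
        re.search(r"\d", password) is not None,           # at least one digit
    ]
    return sum(checks)


def is_acceptable(password: str, required: int = 3) -> bool:
    """True if at least `required` of the four conditions pass."""
    return passed_checks(password) >= required


if __name__ == "__main__":
    for sample in ["AAAa1", "password1", "LETmein", "HorseBatteryStapleCorrect"]:
        print(sample, passed_checks(sample), is_acceptable(sample))
```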
...However, let's take a step back here and ask: Why are you doing this?? You're implementing a password entropy check. Based on your fuzzy rules, the following passwords are valid:
AAAa1
password1
LETmein
And the following passwords are invalid:
reallylongsecurepassword8374235359232
HorseBatteryStapleCorrect
I would strongly advise against such a bizarrely restrictive policy.
Brief
The easiest method would be to have separate regular expressions and check whether 3/4 of them are successful in your code's language. The only way to do this in regex is to present all cases. That being said, this is probably the easiest method (in regex) to present all options as it allows you to edit the patterns in one location (where they are defined) rather than multiple times (more prone to bugs). The DEFINE construct is seldom supported by regex engines, but PCRE does support it.
You can also have your code generate each regex permutation. See this question about generating all permutations of a list in Python.
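For instance, a short Python sketch could build the alternation from the four lookaheads. Since the order of lookaheads at the same position does not matter, itertools.combinations is enough here rather than full permutations. The names LOOKAHEADS and build_pattern are hypothetical, and \Z is used as the end-of-string anchor in Python's re where the original pattern uses \z.

```python
import re
from itertools import combinations

# The four conditions from the question, each as a lookahead.
LOOKAHEADS = [
    r"(?=\w{6,10}\Z)",           # 6-10 word characters in total
    r"(?=[^a-z]*[a-z])",         # at least one lowercase letter
    r"(?=(?:[^A-Z]*[A-Z]){3})",  # at least three uppercase letters
    r"(?=\D*\d)",                # at least one digit
]


def build_pattern(lookaheads, required: int) -> str:
    """Join every `required`-sized combination of lookaheads into one alternation."""
    alternatives = ("".join(combo) for combo in combinations(lookaheads, required))
    return r"\A(?:" + "|".join(alternatives) + ")"


PASSWORD_RE = re.compile(build_pattern(LOOKAHEADS, 3))

if __name__ == "__main__":
    print(PASSWORD_RE.pattern)
    for sample in ["AAAa1", "password1", "LETmein", "HorseBatteryStapleCorrect"]:
        print(sample, PASSWORD_RE.match(sample) is not None)
```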
I don't know why you want to do this for passwords, it's considered malpractice, but, since you're asking for it, I figured I'd give you the easiest solution possible in regex... You really should only check minimum length (and complexity if you want [based on algorithms] to show the user how secure your system finds their password to be).
Code
(?(DEFINE)
(?<w>(?=\w{6,10}\z))
(?<l>(?=[^a-z]*[a-z]))
(?<u>(?=(?:[^A-Z]*[A-Z]){3}))
(?<d>(?=\D*\d))
)
\A(?:
(?&w)(?&l)(?&u)|
(?&w)(?&l)(?&d)|
(?&w)(?&u)(?&d)|
(?&l)(?&u)(?&d)
)
Note: The regex above uses the x modifier (ignore whitespace) so that we can nicely organize the content.
Can someone give me an example/explanation what this regular expression does:
(?![#$])
This is part of <%(?![#$])(([^%]*)%)*?> which is what ASP.NET uses to parse server-side code blocks. I understand the second part of the expression but not the first.
I checked the documentation and found (?! ...) means a zero-width negative lookahead but I'm not entirely sure I understand what that means. Any input I tried so far that looks like <% ... %> seems to work - I wonder why this first sub-expression is even there.
Edit:
I came up with this expression for picking up ASP.NET expressions: <%.+?%> then I found the one Microsoft made (the above full expression in question). I'm trying to understand why they chose that particular expression when mine seems a lot simpler. (I'm trying to see if my expression ignores certain boundary conditions that the MS one doesn't.)
It's a negative lookahead assertion that matches if the next character is not # or $, but doesn't consume it.
It's very similar to the negative character class [^#$] except that the negative character class also consumes the character, preventing it from being matched by the rest of the expression.
To see the difference consider matching <%test%>.
The expression <%(?![#$])(([^%]*)%)*?> captures test%. (rubular)
The expression <%[^#$](([^%]*)%)*?> captures est% because the t was consumed by the negative character class. (rubular)
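The same comparison can be reproduced in Python, whose re module handles these particular constructs the same way (the original expression is a .NET one, so this is only a sketch). It also includes the simpler <%.+?%> from the question to show one boundary case where the two differ: a block starting with <%#, which the lookahead version deliberately skips. The sample input strings are made up.

```python
import re

LOOKAHEAD = re.compile(r"<%(?![#$])(([^%]*)%)*?>")  # next char must not be # or $, nothing consumed
NEG_CLASS = re.compile(r"<%[^#$](([^%]*)%)*?>")     # consumes one char that is not # or $
SIMPLE = re.compile(r"<%.+?%>")                     # the simpler pattern from the question

if __name__ == "__main__":
    print(LOOKAHEAD.search("<%test%>").group(1))    # first capture group with the lookahead
    print(NEG_CLASS.search("<%test%>").group(1))    # first capture group with the negated class
    print(LOOKAHEAD.search("<%# Eval(x) %>"))       # skipped because of the lookahead
    print(SIMPLE.search("<%# Eval(x) %>"))          # the simple pattern still matches it
```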
I have ValidationRegularExpression="[0-9]" which only allows a single character. How do I make it allow between (and including) 1 and 7 digits? I tried [0-9]{1-7} but it didn't work.
You got the syntax almost correct: [0-9]{1,7}.
You can make your solution a bit more elegant (and culture-sensitive) by replacing [0-9] with the generic character group "decimal digit": \d (remember that other languages might use different characters for digits than 0-9).
And here's the documentation for future reference:
.NET Framework Regular Expressions
If you want to avoid leading zeros, you can use this:
^(?!0\d)\d{1,7}$
The first part is a negative lookahead assertion that checks whether the string starts with a 0 followed by another digit. If it does, there is no match.
Check online here: http://regexr.com?2thtr
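Both patterns can be tried quickly in Python; this is only a sketch in a different engine than the .NET validator in the question. re.ASCII is passed so that \d means 0-9 only, and fullmatch is used so the whole string has to match. The pattern names are illustrative.

```python
import re

ONE_TO_SEVEN = re.compile(r"[0-9]{1,7}")                     # 1 to 7 digits
NO_LEADING_ZERO = re.compile(r"^(?!0\d)\d{1,7}$", re.ASCII)  # same, but no leading zero

if __name__ == "__main__":
    for sample in ["0", "7", "1234567", "0123", "12345678", ""]:
        print(repr(sample),
              ONE_TO_SEVEN.fullmatch(sample) is not None,
              NO_LEADING_ZERO.fullmatch(sample) is not None)
```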