I plan to remove repeated elements (each containing two or more characters) from strings. For example, from "aaa" I expect "aaa", from "aaaa" I expect "aa", from "abababcdcd" I expect "abcd", from "cdababcdcd" I expect "cdabcd".
I tried gsub("(.{2,})\\1+","\\1",str). It works in cases 1-3, but fails in case 4. How can I solve this problem?
SOLUTION
The solution is to rely on the PCRE or ICU regex engines, rather than TRE.
Use either base R gsub with perl=TRUE (which uses the PCRE regex engine) and the "(?s)(.{2,})\\1+" pattern, or stringr::str_replace_all() (which uses the ICU regex engine) with the same pattern:
> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"
The (?s) flag is necessary for . to match any char including line break chars (in TRE regex, . matches all chars by default).
DETAILS
TRE regex is not good at handling "pathological" cases that are mostly related to backtracking, which directly involves quantifiers (I bolded some parts):
The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M²N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.
Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.
In cases where TRE has trouble computing all the possible ways to match a string, it does not return any match and the string is returned as is. Hence, the gsub call changes nothing.
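For readers without R at hand, the four cases from the question can also be checked in another backtracking engine. The following is only a minimal sketch using Python's re module, which is an assumption of this example and not part of the original answer; the helper name collapse_repeats is hypothetical.

```python
import re

# Same pattern as in the R answer; (?s) lets "." match line breaks too.
PATTERN = re.compile(r"(?s)(.{2,})\1+")


def collapse_repeats(text: str) -> str:
    """Replace each run of a repeated chunk (2+ characters) with one copy."""
    return PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    for s in ["aaa", "aaaa", "abababcdcd", "cdababcdcd"]:
        print(s, "->", collapse_repeats(s))
```

A backtracking engine tries every start position and every chunk length, so it finds the "abab" run inside "cdababcdcd" instead of giving up on the string.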
As easy as:
gsub("(.{2,})\\1+","\\1",str, perl = TRUE)
When I match multiple negative lookaheads or multiple negative lookbehinds, I find that R behaves differently for the two. To illustrate, suppose I want to match anything following z except a, d, bd, or bcd in str. The following regex works:
grep("z(?!a|(bc?)?d)",str,perl=TRUE)
Next, I want to match anything preceding z except a, b, bd, or bcd in str. A regex constructed in a similar way fails (invalid regex):
grep("(?<!a|b(c?d)?)z",str,perl=TRUE)
Consequently I have to use a rather cumbersome regex:
grep("(?<!a|b)(?<!bd)(?<!bcd)z",str,perl=TRUE)
It seems that in the case of (negative) lookbehind, if I want to use the "or" operator |, the subexpressions must be of equal length, but there is no such limitation in the case of (negative) lookahead.
Am I missing anything here? My problem is that I have many patterns to match in the negative lookbehind case. Using | and ? would substantially simplify the regular expression, but for the reasons stated above I cannot use them. How can I solve this problem?
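One way to keep a long chain of lookbehinds manageable is to generate it from a list of excluded prefixes instead of writing it by hand. The following is a minimal sketch in Python, chosen only for illustration; the helper name not_preceded_by is hypothetical, and the optional parts (such as c? in bc?d) are assumed to have been expanded into explicit alternatives beforehand.

```python
import re


def not_preceded_by(prefixes, target):
    """Build a pattern matching `target` unless one of `prefixes` precedes it.

    Each prefix gets its own fixed-width negative lookbehind, so prefixes of
    different lengths can be mixed freely.
    """
    guards = "".join(f"(?<!{re.escape(p)})" for p in prefixes)
    return guards + re.escape(target)


pattern = not_preceded_by(["a", "b", "bd", "bcd"], "z")
print(pattern)  # (?<!a)(?<!b)(?<!bd)(?<!bcd)z

for s in ["xz", "az", "bz", "bdz", "bcdz", "cdz"]:
    print(s, bool(re.search(pattern, s)))
```

The generated string has the same meaning as the hand-written chain in the question, with one lookbehind per excluded prefix.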
I need to draw a DFA diagram that recognizes arithmetic expressions. Variables and brackets are not allowed; an expression can contain only numbers and the four arithmetic operators.
It has to accept any number string, with or without a sign, e.g. 5, -7, +15.
Number strings can be combined with arithmetic operators, e.g. 3+5, -1+7*3.
I don't know whether my diagram actually meets these requirements.
No, your diagram is not good enough yet: it allows +/-/+++ as an expression.
I'd start formulating this using something like EBNF first, then turn that into a regular expression (essentially inlining the non-terminals), and then build a DFA from that.
What is a number? Only integers, or can you have a decimal point? If decimal points are fine, do you need digits before and after, or just one of them? Do you allow scientific notation with "e" in it? May there be leading zeros in numbers, or only if the whole integer part is zero? And what about signs? Do you allow more than one? Do you allow a unary plus or minus sign after any arithmetic operator?
Depending on your answers above, the EBNF might look somewhat like this:
digits = digit { digit }
number = [ "+" | "-" ] ( digits [ "." [ digits ] ] | "." digits )
operator = "+" | "-" | "*" | "/"
expression = number { operator number }
Turning these things to regular expressions:
digits : [0-9]+
number : [+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)
operator : [+\-*/]
expression: [+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([+\-*/][+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+))*
Now that final regexp is what you can use to build a DFA from. May it have ε transitions or do you need inputs on all edges? Must it be deterministic? Depending on these answers, there may be more or less work ahead of you.
You don't have to mechanically turn the regexp into an automaton; with a bit of human intuition you can build a simpler automaton. But make sure that all the considerations captured in building these expressions are reflected in your automaton. Avoid allowing more than one decimal point. Avoid chaining operators. Make sure every number has at least one digit. Things like this.
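To make this concrete, here is one possible hand-built DFA for the final regular expression above, written as a transition table in Python. It is a sketch under the same assumptions as the EBNF (optional sign, optional decimal point, a sign allowed right after an operator); the state names and the accepts helper are hypothetical. Missing table entries stand for the implicit dead state.

```python
ACCEPTING = {"int", "frac"}

# (state, input class) -> next state
TRANSITIONS = {
    ("start", "sign"): "sign",    # leading or unary + / -
    ("start", "digit"): "int",
    ("start", "dot"): "dot",
    ("sign", "digit"): "int",
    ("sign", "dot"): "dot",
    ("int", "digit"): "int",
    ("int", "dot"): "frac",       # "5." is allowed by the EBNF
    ("int", "sign"): "start",     # here + / - act as binary operators
    ("int", "muldiv"): "start",
    ("dot", "digit"): "frac",     # ".5" needs at least one digit
    ("frac", "digit"): "frac",
    ("frac", "sign"): "start",
    ("frac", "muldiv"): "start",
}


def classify(ch):
    """Map a character to its input class, or None if it is not allowed."""
    if ch in "0123456789":
        return "digit"
    if ch in "+-":
        return "sign"
    if ch in "*/":
        return "muldiv"
    if ch == ".":
        return "dot"
    return None


def accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:  # fell into the dead state
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for s in ["5", "-7", "+15", "3+5", "-1+7*3", "+", "+++", "1..2", "3**4"]:
        print(repr(s), accepts(s))
```

The state after an operator is merged with the start state because both expect exactly the same thing: the beginning of a number. That is the kind of simplification a mechanical construction would not give you directly.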
I have an ifElse Statement which can be of the following two types
a) ifElse(condition, expression_bool_result, expression_bool_result)
whereas expression_bool_result may either be TRUE/FALSE, the result of and(), or(), ==, !=.... or further ifElse
b) ifElse(condition, expression_arith_result, expression_arith_result)
whereas expression_arith_result may either be any number, the result of calculations of further functions returning a number... (or further ifElse)
Since I am new to JavaCC, I would like to ask what a production could look like that allows the parser to make a clear decision.
Currently I get the warning
Warning: Choice conflict involving two expansions at
line 824, column 5 and line 825, column 5 respectively.
A common prefix is: "ifElse" "("
Consider using a lookahead of 3 or more for earlier expansion.
which - as far as I can tell - implies that my grammar (regarding ifElse) is ambiguous.
If there is no way to write it unambiguously, what could the suggested lookahead look like?
Thanks for your feedback in advance!
No fixed amount of lookahead could possibly resolve this ambiguity in all cases. You could have an arbitrarily long stream of tokens that form a valid expression_arith_result - but is then followed by a comparison operator and another arithmetic value, thus turning it into an expression_bool_result.
The solution would be to have a single ifElse statement, that takes two arbitrary expressions. The required agreement in type between the two expressions would be a matter of semantics, not grammar.
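To illustrate what "a matter of semantics" could mean in practice, here is a minimal sketch of a type check that runs over the parsed tree after a single, untyped ifElse production has been parsed. It is written in Python only for brevity; the node classes Num, Bool and IfElse and the function result_type are hypothetical and do not come from the question's grammar.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    value: float


@dataclass
class Bool:
    value: bool


@dataclass
class IfElse:
    condition: "Expr"
    then_branch: "Expr"
    else_branch: "Expr"


Expr = Union[Num, Bool, IfElse]


def result_type(expr: Expr) -> str:
    """Return "arith" or "bool"; raise TypeError if the branches disagree."""
    if isinstance(expr, Num):
        return "arith"
    if isinstance(expr, Bool):
        return "bool"
    if isinstance(expr, IfElse):
        if result_type(expr.condition) != "bool":
            raise TypeError("ifElse condition must be boolean")
        then_type = result_type(expr.then_branch)
        else_type = result_type(expr.else_branch)
        if then_type != else_type:
            raise TypeError(f"ifElse branches differ: {then_type} vs {else_type}")
        return then_type
    raise TypeError(f"unknown node: {expr!r}")


if __name__ == "__main__":
    nested = IfElse(Bool(True), Num(1), IfElse(Bool(False), Num(2), Num(3)))
    print(result_type(nested))
```

With this split, the grammar needs only one ifElse production, and the mismatch between a boolean and an arithmetic branch is reported by the check rather than by the parser.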
Jason's answer is correct in that you can't resolve the choice with a fixed length of lookahead. However JavaCC does not limit you to fixed length of lookahead. So you can do the following.
void IfExpression() :
{ }
{ LOOKAHEAD( <IFELSE> "(" Condition() "," BooleanExpression() )
BooleanIfExpression()
|
ArithmeticIfExpression()
}
I'm trying to create a validator for a string that may contain 1-N words, which are separated by exactly one whitespace character (spaces only between words). I'm a newbie at regex, so I feel a bit confused, because my expression seems to be correct:
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
What am I doing wrong here? (it accepts only 2 words .. but I want it to accept 1+ words)
Any help is greatly appreciated :)
As often happens with someone beginning a new programming language or syntax, you're close, but not quite! The ^ and $ anchors are being used correctly, and the character classes [a-zA-Z] will match only letters (sounds right to me), but your repetition is a little off, and your grouping is not what you think it is - which is your primary problem.
^[[a-zA-Z]+\s{1}]{0,}[a-zA-Z]+$
 ^           ^^^^^^^^
 a           bbbacccc
It only matches two words because you effectively don't have any group repetition; this is because you don't really have any groups - only character classes. The simplest fix is to change the first [ and its matching closing bracket (marked by a's in the listing above) to parentheses:
^([a-zA-Z]+\s{1}){0,}[a-zA-Z]+$
This single change will make it work the way you expect! However, there are a few recommendations and considerations I'd like to make.
First, for readability and code maintenance, use the single character repetition operators instead of repetition braces wherever possible. * repeats zero or more times, + repeats one or more times, and ? repeats 0 or one times (AKA optional). Your repetition curly braces are syntactically correct, and do what you intend them to, but one (marked by b's above) should be removed because it is redundant, and the other (marked by c's above) should be shortened to an asterisk *, as they have exactly the same meaning:
^([a-zA-Z]+\s)*[a-zA-Z]+$
Second, I would recommend considering (depending upon your application requirements) the \w shorthand character class instead of the [a-zA-Z] character class, with the following considerations:
it matches both upper and lowercase letters
it does match more than letters (it matches digits 0-9 and the underscore as well)
it can often be configured to match non-English (unicode) letters for multi-lingual input
If any of these are unnecessary or undesirable, then you're on the right track!
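As a quick check of the parenthesized pattern, here is a minimal sketch of a validator. The question does not say which regex flavor is in use, so Python's re module is an assumption here, and the function name is_word_list is hypothetical. fullmatch is used in place of the ^ and $ anchors.

```python
import re

# One or more letter-only words, each pair separated by exactly one
# whitespace character.
WORDS = re.compile(r"([a-zA-Z]+\s)*[a-zA-Z]+")


def is_word_list(text: str) -> bool:
    """Return True if text is 1..N letter-only words with single separators."""
    return WORDS.fullmatch(text) is not None


if __name__ == "__main__":
    for s in ["hello", "hello world", "a b c", "hello  world", " hello", "hello1"]:
        print(repr(s), is_word_list(s))
```

Note that \s also matches tabs and line breaks; if only the space character should be allowed between words, a literal space can be used in its place.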
On a side note, the character combination \b is a word-boundary assertion and is not needed for your case, as you will already begin and end where there are letters and letters only!
As for learning more about regular expressions, I would recommend Regular-Expressions.info, which has a wealth of info about regexes and the inner workings and quirks of the various implementations. I also use a tool called RegexBuddy to test and debug expressions.