R: matching multiple negative lookahead and negative lookbehind

When I put several alternatives inside a negative lookahead or a negative lookbehind, R behaves differently in the two cases. To illustrate, suppose I want to match any z in str that is not followed by a, d, bd, or bcd. The following regex works:
grep("z(?!a|(bc?)?d)",str,perl=TRUE)
Next, I want to match any z in str that is not preceded by a, b, bd, or bcd. A regex constructed in a similar way fails (invalid regex):
grep("(?<!a|b(c?d)?)z",str,perl=TRUE)
Consequently I have to use a rather cumbersome regex:
grep("(?<!a|b)(?<!bd)(?<!bcd)z",str,perl=TRUE)
It seems that in a (negative) lookbehind, if I want to use the "or" operator |, the subexpressions must be of equal length, but there is no such limitation in a (negative) lookahead.
Am I missing anything here? My problem is that I have many patterns to match in the negative lookbehind case. Using | and ? would substantially simplify the regular expression, but for the reasons stated above I cannot use them. How can I solve this problem?
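One possible way around this, as a minimal sketch: PCRE (which R uses with perl=TRUE) requires every top-level alternative in a lookbehind to have a fixed length, but the alternatives do not all have to share the same length. The optional parts can therefore be expanded into plain alternatives, and the alternation can be generated from a vector. The names prefixes, pattern and x are illustrative, and the prefixes are assumed to contain no regex metacharacters.

```r
# Prefixes that must NOT appear immediately before "z" (values taken from the
# question; assumed to contain no regex metacharacters).
prefixes <- c("a", "b", "bd", "bcd")

# Every top-level alternative has a fixed length, which PCRE accepts in a
# lookbehind even when the lengths differ between alternatives.
pattern <- paste0("(?<!", paste(prefixes, collapse = "|"), ")z")

x <- c("az", "bz", "bdz", "bcdz", "cz", "xdz")
grepl(pattern, x, perl = TRUE)
# FALSE for the first four strings, TRUE for "cz" and "xdz"
```

With many patterns, only the prefixes vector has to be maintained; what stays invalid is a quantifier such as ? inside a single alternative.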

Related

How to match duplicate letters in interjections [duplicate]

I plan to remove repeated elements (each containing two or more characters) from strings. For example, from "aaa" I expect "aaa", from "aaaa" I expect "aa", from "abababcdcd" I expect "abcd", and from "cdababcdcd" I expect "cdabcd".
I tried gsub("(.{2,})\\1+","\\1",str). It works in cases 1-3, but fails in case 4. How can I solve this problem?
SOLUTION
The solution is to rely on the PCRE or ICU regex engines, rather than TRE.
Use either base R gsub with perl=TRUE (it uses the PCRE regex engine) and the "(?s)(.{2,})\\1+" pattern, or stringr::str_replace_all() (it uses the ICU regex engine) with the same pattern:
> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"
The (?s) flag is necessary for . to match any char including line break chars (in TRE regex, . matches all chars by default).
DETAILS
TRE does not handle "pathological" cases well. These mostly involve backtracking, which is directly tied to quantifiers, as the TRE documentation describes:
The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M^2 N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.
Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.
When TRE has trouble working through all the possible ways of matching a string, it does not return any match and the string is returned as is. Hence the gsub call changes nothing.
As easy as:
gsub("(.{2,})\\1+", "\\1", str, perl = TRUE)
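As a quick check, the four cases from the question can be run through the PCRE version in one vectorised call. This is only a sketch; the variable names cases and result are illustrative.

```r
# The four inputs listed in the question.
cases <- c("aaa", "aaaa", "abababcdcd", "cdababcdcd")

# perl = TRUE switches gsub from TRE to PCRE, which handles the backreference
# with ordinary backtracking.
result <- gsub("(?s)(.{2,})\\1+", "\\1", cases, perl = TRUE)
result
# "aaa" "aa" "abcd" "cdabcd"
```

"aaa" is left alone because a two-character unit followed by at least one repeat needs four characters.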

DFA diagram to recognize arithmetic expressions

I need to draw a DFA diagram that can recognize arithmetic expressions. Variables and brackets are not allowed; an expression can only contain numbers and the four arithmetic operators.
It has to accept any number string with or without a sign, e.g. 5, -7, +15.
Number strings can also be mixed with arithmetic operators, e.g. 3+5, -1+7*3.
I don't know whether my diagram actually meets these requirements.
No, your diagram is not good enough yet: it allows +/-/+++ as an expression.
I'd start formulating this using something like EBNF first, then turn that into a regular expression (essentially inlining the non-terminals), and then build a DFA from that.
What is a number? Only integers, or can you have a decimal point? If decimal points are fine, do you need digits before and after, or just one of them? Do you allow scientific notation with "e" in it? May there be leading zeros in numbers, or only if the whole integer part is zero? And what about signs? Do you allow more than one? Do you allow a unary plus or minus sign after any arithmetic operator?
Depending on your answers above, the EBNF might look somewhat like this:
digits = digit { digit }
number = [ "+" | "-" ] ( digits [ "." [ digits ] ] | "." digits )
operator = "+" | "-" | "*" | "/"
expression = number { operator number }
Turning these things to regular expressions:
digits : [0-9]+
number : [+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)
operator : [+\-*/]
expression: [+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([+\-*/][+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+))*
Now that final regexp is what you can use to build a DFA from. May it have ε transitions or do you need inputs on all edges? Must it be deterministic? Depending on these answers, there may be more or less work ahead of you.
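Before drawing the automaton, the final expression can be sanity-checked against a few inputs. The sketch below rebuilds it in R from the number and operator pieces and anchors it with ^ and $ so the whole string has to match. The variable names are illustrative, and perl=TRUE is an assumption made so that \- inside the character class is read as a literal hyphen.

```r
# Pieces mirror the "number" and "operator" lines above.
number   <- "[+\\-]?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)"
operator <- "[+\\-*/]"

# expression = number { operator number }, anchored to the whole input.
expression <- paste0("^", number, "(", operator, number, ")*$")

inputs <- c("5", "-7", "+15", "3+5", "-1+7*3", "+/-/+++", "1..2", "3+")
grepl(expression, inputs, perl = TRUE)
# TRUE for the first five inputs, FALSE for "+/-/+++", "1..2" and "3+"
```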
You don't have to mechanically turn the regexp into an automaton; with a bit of human intuition you can build a simpler automaton. But make sure that all the considerations captured in building these expressions are reflected in your automaton. Avoid allowing more than one decimal point. Avoid chaining operators. Make sure every number has at least one digit. Things like this.
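One hand-built automaton that respects these considerations can be written down as a transition table and simulated. This is a sketch under the grammar above, not the only possible DFA; the state names START, SIGN, INT, DOT and FRAC and the helper functions are made up for illustration.

```r
# States of a hand-built DFA for: number { operator number }
#   START: a number is expected (an optional sign is allowed)
#   SIGN : a sign was read, a digit or "." must follow
#   INT  : inside the integer part (accepting)
#   DOT  : a leading "." was read, a digit must follow
#   FRAC : after "digits." or inside the fraction (accepting)
states  <- c("START", "SIGN", "INT", "DOT", "FRAC")
classes <- c("sign", "muldiv", "dot", "digit")

# Transition table; NA marks a missing edge, i.e. rejection.
delta <- matrix(NA_character_, nrow = length(states), ncol = length(classes),
                dimnames = list(states, classes))
delta["START", c("sign", "dot", "digit")] <- c("SIGN", "DOT", "INT")
delta["SIGN",  c("dot", "digit")]         <- c("DOT", "INT")
delta["INT",   c("sign", "muldiv", "dot", "digit")] <- c("START", "START", "FRAC", "INT")
delta["FRAC",  c("sign", "muldiv", "digit")] <- c("START", "START", "FRAC")
delta["DOT",   "digit"] <- "FRAC"

# Map a single character to its input class ("other" is never accepted).
char_class <- function(ch) {
  if (ch %in% c("+", "-")) {
    "sign"
  } else if (ch %in% c("*", "/")) {
    "muldiv"
  } else if (ch == ".") {
    "dot"
  } else if (ch %in% as.character(0:9)) {
    "digit"
  } else {
    "other"
  }
}

# Run the DFA over the string and report whether it ends in an accepting state.
accepts <- function(s) {
  state <- "START"
  for (ch in strsplit(s, "")[[1]]) {
    cls <- char_class(ch)
    if (cls == "other") return(FALSE)
    state <- delta[state, cls]
    if (is.na(state)) return(FALSE)
  }
  state %in% c("INT", "FRAC")
}

tests <- c("5", "-7", "+15", "3+5", "-1+7*3", "+/-/+++", "1..2", "3+")
vapply(tests, accepts, logical(1), USE.NAMES = FALSE)
# TRUE for the first five inputs, FALSE for the last three
```

After an operator the automaton returns to START, which is why a signed number such as the -5 in 3+-5 is accepted while chained operators are not.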

JavaCC - choice based on return type?

I have an ifElse Statement which can be of the following two types
a) ifElse(condition, expression_bool_result, expression_bool_result)
where expression_bool_result may be TRUE/FALSE, the result of and(), or(), ==, !=, ... or a further ifElse
b) ifElse(condition, expression_arith_result, expression_arith_result)
where expression_arith_result may be any number, the result of calculations or of further functions returning a number (or a further ifElse)
Since I am new to JavaCC, I would like to ask what a production could look like that lets the parser make a clear decision.
Currently I get the warning
Warning: Choice conflict involving two expansions at
line 824, column 5 and line 825, column 5 respectively.
A common prefix is: "ifElse" "("
Consider using a lookahead of 3 or more for earlier expansion.
which - as far as I can tell - implies that my grammar (regarding ifElse) is ambiguous.
If there is no way to write it unambiguously, what could the suggested lookahead look like?
Thanks for your feedback in advance!
No fixed amount of lookahead could possibly resolve this ambiguity in all cases. You could have an arbitrarily long stream of tokens that form a valid expression_arith_result - but is then followed by a comparison operator and another arithmetic value, thus turning it into an expression_bool_result.
The solution would be to have a single ifElse statement, that takes two arbitrary expressions. The required agreement in type between the two expressions would be a matter of semantics, not grammar.
Jason's answer is correct in that you can't resolve the choice with a fixed length of lookahead. However JavaCC does not limit you to fixed length of lookahead. So you can do the following.
void IfExpression() :
{ }
{ LOOKAHEAD( <IFELSE> "(" Condition() "," BooleanExpression() )
BooleanIfExpression()
|
ArithmeticIfExpression()
}

Regular expression to match 10-14 digits

I am using regular expressions for matching only digits, minimum 10 digits, maximum 14. I tried:
^[0-9]
I'd give:
^\d{10,14}$
a shot.
I also like to offer extra solutions for RE engines that don't support all that PCRE stuff so, in a pinch, you could use:
^[0-9]{10,14}$
If your RE engine is so primitive that it doesn't even allow specific repetitions, you'd have to resort to either some ugly hack like fully specifying the number of digits with alternate REs for 10 to 14 or, easier, just checking for:
^[0-9]*$
and ensuring the length was between 10 and 14.
But that won't be needed for this case (ASP.NET).
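The two approaches (bounded repetition versus a digits-only test plus a separate length check) can be compared side by side. The sketch below uses R rather than ASP.NET purely for illustration, and the sample inputs are made up.

```r
# Sample inputs: 9 digits, 10 digits, 14 digits, 15 digits, 10 mixed characters.
x <- c("123456789", "1234567890", "12345678901234", "123456789012345", "12345abcde")

# Bounded repetition: 10 to 14 digits and nothing else.
bounded <- grepl("^[0-9]{10,14}$", x)

# Fallback for engines without {min,max}: digits only, then a length check.
n <- nchar(x)
fallback <- grepl("^[0-9]*$", x) & n >= 10 & n <= 14

bounded
# Only the 10-digit and 14-digit inputs pass
identical(bounded, fallback)
# TRUE: both approaches agree on these inputs
```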
^\d{10,14}$
regular-expressions.info
Character Classes or Character Sets
\d is short for [0-9]
Limiting Repetition
The syntax is {min,max}, where min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches.
The limited repetition syntax also allows these:
^\d{10,}$ // match at least 10 digits
^\d{13}$ // match exactly 13 digits
try this
@"^\d{10,14}$"
\d - matches a character that is a digit
This will help you
If I understand your question correctly, this should work:
\d{10,14}
Note:
As the other answer points out, use ^\d{10,14}$ to match the entire input.
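The difference the anchors make shows up on an input that is too long. This is a small sketch in R; the 20-digit string is an arbitrary example.

```r
x <- "12345678901234567890"  # 20 digits

# Unanchored: succeeds because a run of 10 to 14 digits exists inside the input.
unanchored <- grepl("\\d{10,14}", x, perl = TRUE)

# Anchored: fails because the whole input must consist of 10 to 14 digits.
anchored <- grepl("^\\d{10,14}$", x, perl = TRUE)

c(unanchored = unanchored, anchored = anchored)
# unanchored is TRUE, anchored is FALSE
```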
