Recursive regex for matching bracket pairs in GNU R

I am experimenting with a recursive regex in R with the stringr package. It gives me a syntax error: U_REGEX_RULE_SYNTAX
The regex works correctly on regex101 and matches only the balanced brackets:
https://regex101.com/r/Uv9Xy4/1
But in R it gives me said syntax error:
str_extract("((blub))(", "(?s)\\((?:[^()]+|(?R))*+\\)")
Am I missing the escape of any control character?

The ICU regex library used in stringr cannot do everything PCRE can. In particular, the ICU regex engine does not support recursion.
So, use base R with perl=TRUE:
x <- "((blub))("
regmatches(x, regexpr("\\((?:[^()]+|(?R))*+\\)", x, perl=TRUE))
## => [1] "((blub))"
Note that the (?s) DOTALL modifier is redundant here, since there is no . in the pattern, so it can safely be removed.
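As a minimal sketch of the same approach applied to more than one string, gregexpr with perl = TRUE returns every balanced group rather than only the first. The second input string is made up for illustration and is not from the question.

```r
# Recursive pattern: an opening bracket, then any mix of non-bracket runs
# or nested balanced groups (via (?R)), then a closing bracket.
pattern <- "\\((?:[^()]+|(?R))*+\\)"

# The second string is a hypothetical input used only for illustration.
x <- c("((blub))(", "f(a(b)c) and g(d)")

# perl = TRUE selects PCRE, which supports recursion.
balanced <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
print(balanced)
```

The trailing unmatched "(" in the first string is skipped because no closing bracket follows it.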

Related

Regex issue in R when escaping regex special characters with str_extract

I'm trying to extract the status -- in this case the word "Active" from this pattern:
Status\nActive\nHometown\
Using this regex: https://regex101.com/r/xegX00/1, but I cannot get it to work in R using str_extract. It does seem weird to have dual escapes, but I've tried every possible combination here and cannot get this to work. Any help appreciated!
mutate(status=str_extract(df, "(?<=Status\\\\n)(.*?)(?=\\\\)"))
You can use sub in base R -
x <- "Status\nActive\nHometown\n"
sub('.*Status\n(.*?)\n.*', '\\1', x)
#[1] "Active"
If you want to use stringr, here is a suggestion with str_match, which avoids lookarounds:
stringr::str_match(x, 'Status\n(.*)\n')[, 2]
#[1] "Active"
Your regex fails because you tested it against a wrong text.
"Status\nActive\nHometown" is a string literal that denotes (defines, represents) the following plain text:
Status
Active
Hometown
In regular expression testers, you need to test against plain text!
To match a newline, you can use "\n" (i.e. a line feed char, an LF char), or "\\n", a regex escape that also matches a line feed char.
You can use
library(stringr)
x <- "Status\nActive\nHometown\n"
stringr::str_extract(x, "(?<=Status\\n).*") ## => [1] "Active"
## or
stringr::str_extract(x, "(?<=Status\n).*") ## => [1] "Active"
See the R demo online and a correct regex test.
Note that you do not need a \n at the end of the pattern: in the ICU regex flavor (used by the stringr regex functions), the . pattern matches any character other than line break characters, so .* is enough to match the rest of the line.
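The difference between the string literal "\n" and the regex escape "\\n" can be checked without stringr. The sketch below uses base R with perl = TRUE (PCRE) as a stand-in for ICU; in PCRE the dot also stops at a line feed by default. The variable names are illustrative.

```r
x <- "Status\nActive\nHometown\n"

# "\n" is one character (a line feed); "\\n" is two characters
# (a backslash and n) that the regex engine reads as an escape.
literal_lf   <- "\n"
regex_escape <- "\\n"
print(nchar(literal_lf))
print(nchar(regex_escape))

# Both forms match the line feed in x. perl = TRUE is needed for the lookbehind.
a <- regmatches(x, regexpr(paste0("(?<=Status", literal_lf, ").*"), x, perl = TRUE))
b <- regmatches(x, regexpr(paste0("(?<=Status", regex_escape, ").*"), x, perl = TRUE))
print(c(a, b))
```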

* vs .* in R with regexp

Why does R (at least with tidyverse/stringr) recognize the following regexp: *\.(png|jpg|jpeg)? (in R due to character escaping one actually needs to write the string "*\\.(png|jpg|jpeg)")
I think the correct regexp should be .*\.(png|jpg|jpeg) (written in R as ".*\\.(png|jpg|jpeg)").
When I enter the first expression on e.g. regex101.com, it says that it is an illegal regexp. But R seems to parse it without issues.
Why?
Is the expression *\.(png|jpg|jpeg) a valid regular expression? If so, why does regex101 complain? If not, why does R accept it?
If you use the base R regex functions with the default TRE regex library, the * at the start of the pattern is ignored. This is in line with how POSIX-based regex engines behave, see this sed demo (the tool uses POSIX BRE in the demo).
The TRE regex engine, being a POSIX-based regex engine, ignores the * at the start of the regex:
> gsub("*\\.png$", "", "abc.png")
[1] "abc"
However, other NFA regex engines treat it as an error:
> library(stringr)
> str_replace("abc.png", "*\\.png$", "")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX, context=`*\.png$`)
> gsub("*\\.png$", "", "abc.png", perl=TRUE)
Error in gsub("*\\.png$", "", "abc.png", perl = TRUE) :
invalid regular expression '*\.png$'
In addition: Warning message:
In gsub("*\\.png$", "", "abc.png", perl = TRUE) :
PCRE pattern compilation error
'quantifier does not follow a repeatable item'
at '*\.png$'
The stringr regex functions use the ICU regex library, and the base R regex functions with perl=TRUE use the PCRE regex library (not Perl!).
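To check how a given engine treats a pattern without aborting a script, the call can be wrapped in tryCatch. This is a minimal base R sketch; pattern_ok is a hypothetical helper name, and it only covers TRE (perl = FALSE) and PCRE (perl = TRUE), not ICU.

```r
# Returns TRUE if the pattern compiles and runs, FALSE if the engine rejects it.
# pattern_ok is a hypothetical helper, not part of base R.
pattern_ok <- function(pattern, perl) {
  tryCatch(
    suppressWarnings({
      grepl(pattern, "abc.png", perl = perl)
      TRUE
    }),
    error = function(e) FALSE
  )
}

bad  <- "*\\.png$"   # quantifier with nothing to repeat
good <- ".*\\.png$"

tre_bad   <- pattern_ok(bad,  perl = FALSE)  # TRE ignores the leading *
pcre_bad  <- pattern_ok(bad,  perl = TRUE)   # PCRE raises a compilation error
pcre_good <- pattern_ok(good, perl = TRUE)
print(c(tre_bad = tre_bad, pcre_bad = pcre_bad, pcre_good = pcre_good))
```

Writing .* explicitly keeps the pattern portable across all three engines.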

Conditional regular expressions in stringr

I'm wondering how to implement a conditional regular expression in R. It seems that this can be implemented in Perl:
(?(if)then|else)
However, I'm having trouble figuring out how to implement this in R. As a simple example, let's say I have the following strings:
c('abcabd', 'abcabe')
I would like the regular expression to match "bd" if it is there and "bc" otherwise, then replace it with "zz". Thus, I would like the strings above to be:
c('abcazz', 'azzabe')
I have tried this using both sub and str_replace neither of which seem to work. It seems that my syntax might be wrong in sub:
sub('b(?(?=d)d|c)', 'zz', c('abcabe','abcabd'), perl=TRUE)
[1] "azzabe" "azzabd"
The logic is "match b, if followed by d match d, otherwise match c". With str_replace, I get an error:
str_replace(c('abcabe','abcabd'), regex('b(?(?=d)d|c)'), 'zz')
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
I primarily use stringr so would prefer a solution using str_replace but open to solutions using sub.
You are close, but the condition should check whether bd occurs anywhere ahead in the string, so that bd is preferred over bc wherever it appears:
(?(?=.*bd)bd|bc)
You don't even need conditional regex:
^(.*)bd|bc
R code:
sub('^(.*)bd|bc', '\\1zz', c('abcabe','abcabd'))
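Both suggestions can be compared side by side in base R. This is a sketch under the assumption that the conditional pattern is run with perl = TRUE (PCRE), since ICU reports conditionals as unimplemented; the variable names are illustrative.

```r
x <- c("abcabd", "abcabe")

# Conditional: if "bd" occurs anywhere ahead, match "bd", otherwise match "bc".
conditional <- sub("(?(?=.*bd)bd|bc)", "zz", x, perl = TRUE)

# Plain alternation: capture everything before "bd" and put it back,
# or fall back to "bc" (group 1 is then empty in the replacement).
alternation <- sub("^(.*)bd|bc", "\\1zz", x)

print(conditional)
print(alternation)
```

The alternation version needs no PCRE-only syntax.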

Unexpected results when using regex to change case

I'm trying to change the case of a character using regex and the stringr package, but I'm getting a curious result. I would expect both expressions below to give the same result (capitalizing the first character), but only the gsub function gives the expected result:
> str_replace("will", "(^\\w)", regex("\\U\\1"))
[1] "1ill"
> gsub("(^\\w)", "\\U\\1", "will", perl = TRUE)
[1] "Will"
Related:
gsub error turning upper to lower case in R
With perl = TRUE, gsub uses a kind of PCRE regex. Note that PCRE itself does not provide the case-changing operators \L / \l and \U / \u with \E; R extends the replacement syntax to support case conversion, much like the Boost library, which supports those operators.
The stringr library uses the ICU regex library, which has no support for these case-changing operators, and stringr does not add that support on top of the original library functionality.
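For reference, here is a small base R sketch of the two routes that do work: the \\U operator with perl = TRUE, and an engine-independent version built from toupper and substring. The second input string is made up for illustration.

```r
# "hello world" is a hypothetical extra input for illustration.
x <- c("will", "hello world")

# PCRE route: \\U upper-cases the rest of the replacement (here, group 1).
first_upper_pcre <- sub("^(\\w)", "\\U\\1", x, perl = TRUE)

# Regex-free route: upper-case the first character and paste the rest back.
first_upper_base <- paste0(toupper(substring(x, 1, 1)), substring(x, 2))

# Capitalize the first letter of every word.
each_word <- gsub("\\b(\\w)", "\\U\\1", x, perl = TRUE)

print(first_upper_pcre)
print(first_upper_base)
print(each_word)
```

The regex-free route does not depend on which regex library is in use.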

different output using stringi and gsub (using the same pattern on the same string)

I wish to know why I obtain two different output strings by using gsub and stringi. Does the metacharacter "." not include new lines in stringi? Does stringi read "line by line"?
By the way I did not find any way to perform the "correct" substitution with stringi so I needed to use gsub here.
string <- "is it normal?\n\nhttp://www.20minutes.fr"
> gsub(" .*?http"," http", string)
[1] "is http://www.20minutes.fr"
> stri_replace_all_regex(string, " .*?http"," http")
[1] "is it normal?\n\nhttp://www.20minutes.fr"
One way would be to set . to also match line terminators instead of stopping at a line:
stri_replace_all_regex(string, " .*?http"," http",
opts_regex = stri_opts_regex(dotall = TRUE))
By default -- for historical reasons, see this tutorial -- in most regex engines a dot doesn't match a newline character.
As @lukeA suggested, to match a newline you may set the dotall option to TRUE in stringi regex-based functions.
By the way, gsub(..., perl=TRUE) gives results consistent with stringi.
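The same behaviour can be reproduced in base R alone. In this sketch, the inline (?s) flag plays the role of the dotall option when perl = TRUE; the variable names are illustrative.

```r
string <- "is it normal?\n\nhttp://www.20minutes.fr"

# PCRE default: "." does not match a line feed, so nothing is replaced.
no_dotall <- gsub(" .*?http", " http", string, perl = TRUE)

# (?s) switches on DOTALL, so "." can cross the two line feeds.
with_dotall <- gsub("(?s) .*?http", " http", string, perl = TRUE)

print(no_dotall == string)
print(with_dotall)
```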
