Unexpected results when using regex to change case - r

I'm trying to change the case of a character using regex and the stringr package, but I'm getting a curious result. I would expect both expressions below to give the same result (capitalizing the first character), but only the gsub call gives the expected result:
> str_replace("will", "(^\\w)", regex("\\U\\1"))
[1] "1ill"
> gsub("(^\\w)", "\\U\\1", "will", perl = TRUE)
[1] "Will"
Related:
gsub error turning upper to lower case in R

With perl = TRUE, gsub uses the PCRE regex library. PCRE itself has no case-changing operators (\L / \l, \U / \u, \E) for replacement strings; R implements \U, \L and \E on top of it, similar to the Boost library, which supports these operators.
The stringr library uses the ICU regex library, which has no support for these case-changing operators, and stringr does not add that support on top of ICU.
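A minimal sketch of the base R route, assuming only base R and no packages. The variable names are illustrative. The \\U and \\L forms follow the replacement syntax R documents for perl = TRUE; the last line is a regex-free alternative that does not depend on the engine at all.

```r
x <- c("will", "hello world")

# \\U upper-cases everything that follows it in the replacement (perl = TRUE only)
cap_first <- sub("^(\\w)", "\\U\\1", x, perl = TRUE)

# \\L switches back to lower case, so only the first letter of each word is raised
cap_words <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", x, perl = TRUE)

# Engine-independent alternative: no regex case operators involved
cap_plain <- paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))

print(cap_first)
print(cap_words)
print(cap_plain)
```

The regex-free version is the one to reach for when the rest of a pipeline is built on stringr, since it sidesteps the ICU limitation entirely.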

Related

* vs .* in R with regexp

Why does R (at least with tidyverse/stringr) accept the following regexp: *\.(png|jpg|jpeg) (because of character escaping in R, one actually has to write the string "*\\.(png|jpg|jpeg)")?
I think the correct regexp should be .*\.(png|jpg|jpeg) (written in R as ".*\\.(png|jpg|jpeg)").
When I enter the first expression on e.g. regex101.com, it says that it is an illegal regexp. But R seems to parse it without issues.
Why?
Is the expression *\.(png|jpg|jpeg) a valid regular expression? If so, why does regex101 complain? If not, why does R accept it?
If you use the base R regex functions with the default TRE regex library, a * at the start of the pattern is ignored. This is in line with how POSIX-based regex engines behave; see this sed demo (the tool uses POSIX BRE in the demo).
TRE, being a POSIX-based regex engine, ignores the * at the start of the regex:
> gsub("*\\.png$", "", "abc.png")
[1] "abc"
However, other NFA regex engines treat it as an error:
> library(stringr)
> str_replace("abc.png", "*\\.png$", "")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX, context=`*\.png$`)
> gsub("*\\.png$", "", "abc.png", perl=TRUE)
Error in gsub("*\\.png$", "", "abc.png", perl = TRUE) :
invalid regular expression '*\.png$'
In addition: Warning message:
In gsub("*\\.png$", "", "abc.png", perl = TRUE) :
PCRE pattern compilation error
'quantifier does not follow a repeatable item'
at '*\.png$'
stringr regex functions use the ICU regex library, and base R regex functions with perl=TRUE use the PCRE regex library (not Perl itself!).
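The engine difference described above can be probed from base R alone. This is a sketch under the assumption that the default engine behaves as shown in the answer (the leading * is ignored); the variable names are hypothetical, and the PCRE error is caught so the script keeps running.

```r
x <- "abc.png"
bad  <- "*\\.png$"   # quantifier with nothing to repeat
good <- "\\.png$"    # same intent, valid in every engine

# Default engine (TRE): per the answer above, the leading * is ignored
tre_out <- sub(bad, "", x)

# PCRE rejects the pattern; turn the error into NA instead of stopping
pcre_out <- tryCatch(
  suppressWarnings(sub(bad, "", x, perl = TRUE)),
  error = function(e) NA_character_
)

# The cleaned-up pattern gives the same answer with both engines
portable_tre  <- sub(good, "", x)
portable_pcre <- sub(good, "", x, perl = TRUE)

print(c(tre_out, pcre_out, portable_tre, portable_pcre))
```

Dropping the stray * is the safer habit: the pattern then means the same thing whether it ends up in TRE, PCRE or ICU.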

recursive regex for matching bracket pairs in gnu r

I am experimenting with a recursive regex in R and the stringr package. Somehow it gives me a syntax error: U_REGEX_RULE_SYNTAX
On regex101 the regex works correctly and matches only the balanced brackets:
https://regex101.com/r/Uv9Xy4/1
But in R it gives me said syntax error:
str_extract("((blub))(", "(?s)\\((?:[^()]+|(?R))*+\\)")
Am I missing the escape of any control character?
The ICU regex library used in stringr cannot do everything PCRE can. In particular, the ICU regex engine does not support recursion.
So, use base R with perl=TRUE:
x <- "((blub))("
regmatches(x, regexpr("\\((?:[^()]+|(?R))*+\\)", x, perl=TRUE))
## => [1] "((blub))"
Note that (?s) DOTALL modifier is redundant here since there is no . in the pattern and can safely be removed.
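To pull out every balanced group rather than just the first, the same recursive pattern can be passed to gregexpr. This is a small sketch with a made-up input string; only base R functions are used.

```r
x <- "f(a(b)c) and (d) and (e"

# (?R) recurses into the whole pattern, so nested pairs are consumed as a unit
pat <- "\\((?:[^()]++|(?R))*+\\)"

balanced <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
print(balanced)
```

The trailing "(e" is skipped because it has no closing bracket, which is the behaviour the original pattern was designed for.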

Conditional regular expressions in stringr

I'm wondering how to implement a conditional regular expression in R. It seems that this can be implemented in Perl:
(?(if)then|else)
However, I'm having trouble figuring out how to implement this in R. As a simple example, let's say I have the following strings:
c('abcabd', 'abcabe')
I would like the regular expression to match "bd" if it is there and "bc" otherwise, then replace it with "zz". Thus, I would like the strings above to be:
c('abcazz', 'azzabe')
I have tried this using both sub and str_replace neither of which seem to work. It seems that my syntax might be wrong in sub:
sub('b(?(?=d)d|c)', 'zz', c('abcabe','abcabd'), perl=TRUE)
[1] "azzabe" "azzabd"
The logic is "match b, if followed by d match d, otherwise match c". With str_replace, I get an error:
str_replace(c('abcabe','abcabd'), regex('b(?(?=d)d|c)'), 'zz')
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
I primarily use stringr so would prefer a solution using str_replace but open to solutions using sub.
You are close, but the condition has to look ahead for the whole alternative you prefer, so that it is checked at every position where a match is attempted:
(?(?=.*bd)bd|bc)
You don't even need conditional regex:
^(.*)bd|bc
R code:
sub('^(.*)bd|bc', '\\1zz', c('abcabe','abcabd'))
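Both suggestions can be checked side by side in base R. This is a sketch using the strings from the question; the conditional needs perl = TRUE because conditionals are a PCRE feature, and the variable names are illustrative.

```r
x <- c("abcabd", "abcabe")

# Conditional: if "bd" occurs anywhere ahead, match "bd", otherwise match "bc"
with_conditional <- sub("(?(?=.*bd)bd|bc)", "zz", x, perl = TRUE)

# Plain alternation: prefer "bd" (keeping its prefix in group 1), else fall back to "bc"
without_conditional <- sub("^(.*)bd|bc", "\\1zz", x, perl = TRUE)

print(with_conditional)
print(without_conditional)
```

The alternation version avoids conditionals altogether, so it is the more likely candidate to carry over to engines that report U_REGEX_UNIMPLEMENTED for the conditional syntax.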

different output using stringi and gsub (using the same pattern on the same string)

I wish to know why I obtain two different output strings by using gsub and stringi. Does the metacharacter "." not include new lines in stringi? Does stringi read "line by line"?
By the way I did not find any way to perform the "correct" substitution with stringi so I needed to use gsub here.
string <- "is it normal?\n\nhttp://www.20minutes.fr"
> gsub(" .*?http"," http", string)
[1] "is http://www.20minutes.fr"
> stri_replace_all_regex(string, " .*?http"," http")
[1] "is it normal?\n\nhttp://www.20minutes.fr"
One way would be to set . to also match line terminators instead of stopping at a line:
stri_replace_all_regex(string, " .*?http"," http",
opts_regex = stri_opts_regex(dotall = TRUE))
By default (for historical reasons, see this tutorial), a dot does not match a newline character in most regex engines.
As @lukeA suggested, to match a newline you can set the dotall option to TRUE in stringi's regex-based functions.
By the way, gsub(..., perl=TRUE) gives results consistent with stringi.
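The same switch is available in base R through the inline (?s) flag when perl = TRUE is used. This is a minimal sketch with the string from the question; it needs no packages, and the variable names are hypothetical.

```r
string <- "is it normal?\n\nhttp://www.20minutes.fr"

# PCRE default: . stops at a newline, so nothing is replaced
no_dotall <- gsub(" .*?http", " http", string, perl = TRUE)

# (?s) turns on DOTALL, letting . run across the line breaks
dotall <- gsub("(?s) .*?http", " http", string, perl = TRUE)

print(no_dotall == string)
print(dotall)
```

Writing the flag into the pattern keeps the behaviour explicit, which helps when the same pattern is later moved between gsub and stringi.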

stargazer and omit regular expressions

I am trying to use regular expressions to omit some variables in stargazer. I finally found a working regex, but it's using the Perl standard. This doesn't work for the base regex in R, though regexpr in R can take a perl=T option. Given that you wrap the regex for variable sets to omit in "", you can't really pass it this option. Any ideas on how to use perl regex with stargazer?
An example of the regex I would like to use is
placed.ind2*(?:(?!:switchind).)*$
applied to these 4 strings:
placed.ind2PROF SERVICES
placed.ind2TRANSPORT
placed.ind2PROF SERVICES:switchind2TRUE
placed.ind2TRANSPORT:switchind2TRUE
I would like the first two to be selected, but not the last two.
Starting from version 4.0 (on CRAN now), you can run stargazer with the argument perl=TRUE to allow for Perl-compatible regular expressions in your other arguments.
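Before handing the pattern to stargazer, it can be checked against the variable names with grepl and perl = TRUE. This is a sketch in base R only; the stargazer call in the final comment is an assumption based on the answer above, with `model` standing in for a hypothetical fitted model.

```r
vars <- c(
  "placed.ind2PROF SERVICES",
  "placed.ind2TRANSPORT",
  "placed.ind2PROF SERVICES:switchind2TRUE",
  "placed.ind2TRANSPORT:switchind2TRUE"
)

# The negative lookahead (?!...) is a PCRE feature, hence perl = TRUE
pat <- "placed.ind2*(?:(?!:switchind).)*$"

selected <- grepl(pat, vars, perl = TRUE)
print(vars[selected])

# Assumed usage, not run here: stargazer(model, omit = pat, perl = TRUE)
```

Testing the pattern this way separates regex mistakes from stargazer issues, since grepl with perl = TRUE uses the same PCRE syntax.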

Resources