Conditional regular expressions in stringr - r

I'm wondering how to implement a conditional regular expression in R. It seems that this can be implemented in Perl:
(?(if)then|else)
However, I'm having trouble figuring out how to implement this in R. As a simple example, let's say I have the following strings:
c('abcabd', 'abcabe')
I would like the regular expression to match "bd" if it is there and "bc" otherwise, then replace it with "zz". Thus, I would like the strings above to be:
c('abcazz', 'azzabe')
I have tried this using both sub and str_replace, neither of which seems to work. It seems that my syntax might be wrong in sub:
sub('b(?(?=d)d|c)', 'zz', c('abcabe','abcabd'), perl=TRUE)
[1] "azzabe" "azzabd"
The logic is "match b, if followed by d match d, otherwise match c". With str_replace, I get an error:
str_replace(c('abcabe','abcabd'), regex('b(?(?=d)d|c)'), 'zz')
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
I primarily use stringr so would prefer a solution using str_replace but open to solutions using sub.

You are close, but the condition needs to assert that "bd" occurs somewhere ahead in the string, so that it holds at every position where a match is attempted:
(?(?=.*bd)bd|bc)
You don't even need conditional regex:
^(.*)bd|bc
R code:
sub('^(.*)bd|bc', '\\1zz', c('abcabe','abcabd'))
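
As a quick check, the sketch below runs both patterns from this answer on the question's input, using base R only. The variable names (x, expected, res_conditional, res_alternation) are illustrative and not part of the original answer.

```r
# Input and desired output, taken from the question
x <- c("abcabd", "abcabe")
expected <- c("abcazz", "azzabe")

# Conditional pattern: if "bd" occurs anywhere ahead, match "bd"; otherwise match "bc".
# Conditionals are a PCRE feature, so perl = TRUE is required.
res_conditional <- sub("(?(?=.*bd)bd|bc)", "zz", x, perl = TRUE)

# Alternation without a conditional: the greedy ^(.*) runs up to the last "bd"
# if there is one; otherwise the "bc" branch matches and the unset group \\1
# contributes an empty string to the replacement.
res_alternation <- sub("^(.*)bd|bc", "\\1zz", x)

print(res_conditional)
print(res_alternation)
```

Since ICU does not implement the conditional, only the alternation pattern is a candidate for str_replace; it is worth verifying the result there before relying on it.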


stringr: regex to match and extract strings (including unique substrings) containing same substring

So I have a column in a dataframe that contains some names like this:
colnames <- c("YouAreHappy","YouAreHappy1", "YouAreHappy2", "NiceSmiles", "NiceSmiles1", "NiceSmiles2")
I am trying to use stringr's str_extract function to extract only a specific part of the names, namely things like "Happy", "Happy1", "Happy2", "Smiles", "Smiles1", and "Smiles2".
I tried to use regex with `str_extract` as follows:
> str_extract(colnames, regex("Happy|Happy1|Happy2|Smiles|Smiles1|Smiles2"))
[1] "Happy" "Happy" "Happy" "Smiles" "Smiles" "Smiles"
But I want to extract:
[1] "Happy" "Happy1" "Happy2" "Smiles" "Smiles1" "Smiles2"
I am obviously going about this wrong, but I do not know where and how. I get that the | implies OR but I don't know enough about regex to circumvent this hurdle. I am completely new to regular expressions and the like (just discovered regular expressions 101), so any pointers would be appreciated.
With the pattern Happy|Happy1|Happy2|Smiles|Smiles1|Smiles2, remember that the first alternative that matches "wins": the ICU regex engine (used in stringr) does not try the remaining alternatives once one has matched. Several alternatives in your regex can match at the same location, and the shorter ones come before the longer ones, so the shorter match is always returned. That is why the result is not as expected. See Remember That The Regex Engine Is Eager.
The TRE regex engine (the default in base R) works differently. regmatches(colnames, gregexpr("Happy|Happy1|Happy2|Smiles|Smiles1|Smiles2", colnames)) will get you the expected matches, because TRE is a text-directed regex engine and the longest matching alternative "wins". See Text-Directed Engine Returns the Longest Match.
However, you may just use
"(Smiles|Happy)\\d*"
in both engines to get the same output. As a best practice, make sure the alternatives cannot match at the same location in the string. (Smiles|Happy)\d* matches either Smiles or Happy and then 0 or more digits.
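
To make the difference concrete, here is a minimal base R sketch. It uses perl = TRUE (PCRE) as a stand-in for ICU, on the assumption that both engines are eager about alternation; first_match is a hypothetical helper name introduced only for this illustration.

```r
colnames <- c("YouAreHappy", "YouAreHappy1", "YouAreHappy2",
              "NiceSmiles", "NiceSmiles1", "NiceSmiles2")

# Hypothetical helper: return the first match of `pattern` in each string.
# perl = TRUE selects PCRE, which takes the first alternative that matches.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Shorter alternatives listed first: the trailing digit is never reached
eager <- first_match("Happy|Happy1|Happy2|Smiles|Smiles1|Smiles2", colnames)

# Alternatives that cannot overlap, followed by optional digits
fixed <- first_match("(Smiles|Happy)\\d*", colnames)

# Listing the longer alternatives first also works, but is easier to get wrong
reordered <- first_match("Happy1|Happy2|Happy|Smiles1|Smiles2|Smiles", colnames)

print(eager)
print(fixed)
print(reordered)
```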

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex looks fine. The error occurs because you are missing perl=TRUE, which lookaheads require. I also don't think grep is the right function here.
I would recommend using:
stringr::str_extract(time, "\\d+?(?=:)")
grep works differently from how it is being used here: it is good for filtering a vector down to the elements that match a pattern, but it cannot pluck a value out of a string.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and take the first piece, like below:
strsplit(time, ":")[[1]][1]
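
The difference between filtering and extracting can be seen in a short base R sketch. The names hour and hour_sub are illustrative, and the final sub call is an alternative that avoids the lookahead rather than something taken from the answer above.

```r
time <- "12:05:41"

# grep() filters whole elements: it returns the full string, not the matched part
print(grep(".+?(?=:)", time, value = TRUE, perl = TRUE))

# regexpr() locates the first match and regmatches() extracts it
hour <- regmatches(time, regexpr(".+?(?=:)", time, perl = TRUE))

# Alternative without a lookahead: delete everything from the first colon onwards
hour_sub <- sub(":.*", "", time)

print(hour)
print(hour_sub)
```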

Running regex in R using str_extract_all has regexp not yet implemented

I am trying to parse a file using regex. Most of the solutions for using regex in R rely on the stringr package. I have not found another way, or another package, that would work. If you have another way of going about this, that would also be acceptable.
What I am trying to accomplish is to grab a couple of values that are separated by spaces, with the last value being some comma-separated values of variable length. This should go into a matrix or df in a table-like format, as it is currently.
foo foo_123bar foo,bar,bazz
foo2 foo_456bar foo2,bar2
I have the working example of my regex here.
There could be a couple of issues I could be running into. The first could be that the regex I am writing is not supported by R's regex engine, although I have the feeling from this that it would be supported. I have seen that R uses a POSIX-like format, which could make things interesting. The second could simply be exactly what the error message below is showing: this is not a feature that has been coded in yet. This, however, would be the most troubling, because I don't know another way to solve my problem without this package.
Below is the R code that I am using to replicate this error
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
pattern = "
(?(DEFINE)
(?<blanks>[[:blank:]]+)
(?<var>\"?[[:alnum:]_]+\"?)
(?<csvar>(\"?[[:alnum:]_]+\"?,?)+)
)
^
(?&blanks)((?&var))
(?&blanks)((?&var))
(?&blanks)((?&csvar))"
# Both of these are throwing the error
str_extract_all(string, pattern)
str_extract_all(string, regex(pattern, multiline=TRUE, comments=TRUE))
> Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
> Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
# Using the example from ?str_extract_all runs without error
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
I am looking for a solution, not necessarily a stringr solution, but this is the only way I found that fits my needs. The other simpler R regex functions only accept the pattern and not the extra parameters that include the multi line and comment functionality that I am using.
You have a PCRE regex that can only be used in functions that parse the regex with the PCRE regex library (or Boost, which supports a similar Perl-compatible syntax). stringr's str_extract parses the regex with the ICU regex library. ICU regex does not support recursion or DEFINE blocks, so you can't use the in-pattern approach to define subpatterns and then re-use them.
Instead, just declare the regex parts you need to re-use as variables and build the pattern dynamically:
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
blanks <- "[[:blank:]]+"
vars <- "\"?[[:alnum:]_]+\"?"
csvar <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^",blanks,"(", vars, ")",blanks,"(", vars,")",blanks,"(",csvar, ")")
str_match_all(string, pattern)
# [[1]]
# [,1] [,2] [,3] [,4]
#[1,] " foo foo_123bar foo,bar,bazz" "foo" "foo_123bar" "foo,bar,bazz"
Note: you need to use str_match (or str_match_all) to extract the capturing group values, as str_extract and str_extract_all only return the whole match.
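
The output above contains only the first record, because ^ matches only at the start of the string by default. Below is a minimal sketch of one way to get one row per line using base R only; the names lines, matches, fields and values are illustrative. It splits the input into lines first and applies the same assembled pattern to each.

```r
string <- " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"

# Reusable pieces, as in the answer above
blanks <- "[[:blank:]]+"
vars   <- "\"?[[:alnum:]_]+\"?"
csvar  <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^", blanks, "(", vars, ")", blanks, "(", vars, ")",
                  blanks, "(", csvar, ")")

# Split into lines so that ^ applies to each record
lines <- strsplit(string, "\n", fixed = TRUE)[[1]]

# regexec() reports the full match plus the capture groups for each line
matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# One row per line; drop column 1 (the full match) to keep only the groups
fields <- do.call(rbind, matches)[, -1, drop = FALSE]

# Split the comma-separated third field into a list of character vectors
values <- strsplit(fields[, 3], ",", fixed = TRUE)

print(fields)
print(values)
```

With stringr, wrapping the pattern in regex(pattern, multiline = TRUE) before passing it to str_match_all should have a similar effect, since that option lets ^ match at the start of every line.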

Unexpected results when using regex to change case

I'm trying to change the case of a character using regex and the stringr package, but I'm getting a curious result. I would expect both expressions below to give the same result (capitalizing the first character), but only the gsub call gives the expected result:
> str_replace("will", "(^\\w)", regex("\\U\\1"))
[1] "1ill"
> gsub("(^\\w)", "\\U\\1", "will", perl = TRUE)
[1] "Will"
Related:
gsub error turning upper to lower case in R
With perl = TRUE, gsub uses the PCRE regex library. PCRE itself does not provide the case-changing operators \L / \l and \U / \u with \E, but R extends the replacement syntax to support case conversion, much as the Boost library does.
The stringr library uses the ICU regex library, which has no support for these case-changing operators, and stringr does not add that support on top of it.
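
For completeness, here is a small base R sketch of two ways to capitalize the first character. The names via_sub and via_substring are illustrative, and the second approach is a regex-free alternative that is not part of the original answer.

```r
x <- "will"

# Base R with PCRE: \\U in the replacement upper-cases what follows it
via_sub <- sub("^(\\w)", "\\U\\1", x, perl = TRUE)

# Regex-free alternative: upper-case the first character and paste the rest back
via_substring <- paste0(toupper(substring(x, 1, 1)), substring(x, 2))

print(via_sub)
print(via_substring)
```

The regex-free approach translates directly to stringr by combining str_sub with str_to_upper, which sidesteps the missing \U support in ICU.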

stargazer and omit regular expressions

I am trying to use regular expressions to omit some variables in stargazer. I finally found a working regex, but it's using the Perl standard. This doesn't work for the base regex in R, though regexpr in R can take a perl=T option. Given that you wrap the regex for variable sets to omit in "", you can't really pass it this option. Any ideas on how to use perl regex with stargazer?
An example of the regex I would like to use is
placed.ind2*(?:(?!:switchind).)*$
applied to these 4 strings:
placed.ind2PROF SERVICES
placed.ind2TRANSPORT
placed.ind2PROF SERVICES:switchind2TRUE
placed.ind2TRANSPORT:switchind2TRUE
I would like the first two to be selected, but not the last two.
Starting from version 4.0 (on CRAN now), you can run stargazer with the argument perl=TRUE to allow for Perl-compatible regular expressions in your other arguments.
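
Before handing the pattern to stargazer, it can be checked in isolation with grepl. The sketch below uses base R only; vars, omit_pattern and selected are illustrative names, and the stargazer call in the final comment is an untested assumption based on the answer above, with model standing in for a hypothetical fitted model.

```r
vars <- c("placed.ind2PROF SERVICES",
          "placed.ind2TRANSPORT",
          "placed.ind2PROF SERVICES:switchind2TRUE",
          "placed.ind2TRANSPORT:switchind2TRUE")

# Pattern from the question; the negative lookahead requires perl = TRUE
omit_pattern <- "placed.ind2*(?:(?!:switchind).)*$"

# TRUE for the names the pattern selects
selected <- grepl(omit_pattern, vars, perl = TRUE)
print(vars[selected])

# With stargazer >= 4.0 the same pattern would then be passed along the lines of
# stargazer(model, omit = omit_pattern, perl = TRUE)  # `model` is hypothetical
```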
