Why does R appear to be a lazy match [duplicate] - r

I want to use stri_replace_all_regex to replace strings as follows.
It's known that R defaults to greedy matching, so why does it appear to match lazily here?
library(stringi)
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)
The result is [1] "ab" "xyc" "mn", which is not what I want. I
expected "abc" "xyc" "mnb".

You are calling stri_replace_all_regex with four arguments:
a is length 3. That's the str argument.
"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that's beside the point.) That's the pattern argument.
b is length 5. That's the replacement argument.
The last argument is vectorize_all=FALSE.
What it tries to do is documented as follows:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]). Note that you
must set length(pattern) >= length(replacement).
That's pretty sloppy documentation (I want to know what it does, not "something like" what it does!), but I think the process is as follows:
Your first pattern is "\\bab\\S+". That says "word boundary followed by ab followed by one or more non-whitespace chars". That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.
The handling of a[3] is more complicated. The first match replaces it with "mnb", based on pattern[4]. Then a second replacement happens, because "mnb" matches pattern[5], and it gets changed again to "mn".
When you say R defaults to greedy matching, that's when doing a single regular expression match. You're doing five separate greedy matches, not one big greedy match.
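To make the sequential behaviour concrete, here is a minimal sketch of the loop that the documentation describes. It uses base R's gsub rather than stringi (an assumption made so the example has no dependencies), and the names patterns and out are illustrative only.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")

# One pattern per element of b, applied one after another
patterns <- paste0("\\b", b, "\\S+")

out <- a
for (i in seq_along(patterns)) {
  # Each pass works on the result of the previous pass,
  # so an earlier replacement can be matched again later
  out <- gsub(patterns[i], b[i], out, perl = TRUE)
}
out
```

The third element goes from "mnb345" to "mnb" on the fourth pass and then to "mn" on the fifth, which is the double replacement described above.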
EDITED to add:
I don't know the stringi functions well, but in the base regex functions you can do this with just one regex:
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
# Build a big pattern:
# "|" means "or", "(" ... ")" capture the match
pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
pattern
#> [1] "\\b(ab)\\S+|\\b(abc)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
# \\1 etc contain whatever matched the parenthesized
# patterns. Only one will match, the rest will be empty
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "ab" "xyc" "mnb"
# I would have guessed the greedy rule would have found "abc"
# Try again:
pattern <- paste0("\\b(", b[c(2, 1, 3:5)], ")\\S+", collapse = "|")
pattern
#> [1] "\\b(abc)\\S+|\\b(ab)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "abc" "xyc" "mnb"
Created on 2023-02-13 with reprex v2.0.2
It appears that "|" takes the first alternative that matches, not the longest one. I don't think the R docs specify it one way or the other.
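If the order of the alternatives decides the outcome, one way to avoid reordering by hand is to sort the prefixes by decreasing length before building the pattern. The following is a sketch under that assumption; it uses perl = TRUE, where alternatives are tried from left to right, and the name b_sorted is illustrative.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")

# Longest prefixes first, so "abc" is tried before "ab"
b_sorted <- b[order(nchar(b), decreasing = TRUE)]

# A single capture group holding all alternatives
pattern <- paste0("\\b(", paste(b_sorted, collapse = "|"), ")\\S+")

result <- sub(pattern, "\\1", a, perl = TRUE)
result
```

With a single group, the replacement only needs "\\1" instead of listing every group number.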

Related

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1
Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
See the regex demo and the R online demo.
Note that gsub replaces all occurrences; since you need only a single one, it makes sense to replace gsub with sub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
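As an alternative to consuming the whole string, the capture can be pulled out directly with regexec and regmatches. This is a sketch only: extract_param is a hypothetical helper, and it assumes the parameter value consists of digits and that the name contains no regex metacharacters.

```r
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"

# Hypothetical helper: return the digits following "<name>=" or NA if absent
extract_param <- function(x, name) {
  pattern <- paste0("(?:^|&)", name, "=(\\d+)")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each list element is c(full match, group 1), or character(0) on no match
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_, character(1))
}

extract_param(c(testAtEnd, testNotAtEnd), "AppliedWhenID")
```

Because nothing is replaced here, the pattern does not need leading or trailing .* parts.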
You could try using a lookbehind assertion, "(?<=...)", and str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)

Keep only the first letter of each word after a comma

I have strings like Sacher, Franz Xaver or Nishikawa, Kiyoko.
Using R, I want to change them to Sacher, F. X. or Nishikawa, K..
In other words, the first letter of each word after the comma should be retained with a dot (and a whitespace if another word follows).
Here is a related response, but it cannot be applied to my case 1:1 as it does not have a comma in its strings; it seems that simply adding (?<=, ) does not work.
E.g. in the following attempts, gsub() replaces everything, while my str_replace_all()-attempt leads to an error:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
# first attempt
# (resembles the response from the other thread)
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1', TEST, perl = TRUE)
# second attempt
# error: "Incorrect unicode property"
stringr::str_replace_all(TEST, '(?<=, )\\b(\\pL)\\pL{2,}|.','\\U\\1')
I would be grateful for your help!
You can use
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
See the regex demo. Details:
(*UCP) - the PCRE verb that will make \b Unicode aware
^[^,]+(*SKIP)(*F) - start of string and then any one or more chars other than a comma, and then the match is failed and skipped, the next match starts at the location where the failure occurred
| - or
\b - word boundary
(\p{L}) - Group 1: any Unicode letter
\p{L}* - zero or more Unicode letters
See the R demo:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
## => [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
A crude approach that splits the string:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
sapply(strsplit(TEST, '\\s+'), function(x)
  paste0(x[1], paste0(substr(x[-1], 1, 1), collapse = '.'), '.'))
#[1] "Sacher,F.X." "Nishikawa,K." "Al-Assam,M."
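The splitting approach drops the spaces that the question asks for. A possible variant that keeps them is sketched below; abbreviate_given is a hypothetical helper name, and the sketch assumes each string contains at most one comma, separating the surname from the given names.

```r
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")

# Hypothetical helper: keep the surname, reduce given names to initials
abbreviate_given <- function(x) {
  parts <- strsplit(x, ",\\s*")
  vapply(parts, function(p) {
    # No comma found: return the input unchanged
    if (length(p) < 2) return(p[1])
    given <- strsplit(p[2], "\\s+")[[1]]
    initials <- paste0(substr(given, 1, 1), ".", collapse = " ")
    paste0(p[1], ", ", initials)
  }, character(1))
}

abbreviate_given(TEST)
```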
An approach using multiple backreference:
gsub("(\\b\\w+,\\s)(\\b\\w).*(\\b\\w)*", "\\1\\2.\\3", TEST)
[1] "Sacher, F." "Nishikawa, K." "Al-Assam, M."
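Here, we use three capturing groups to refer back to in gsub's replacement argument via backreferences:
(\\b\\w+,\\s): this first group captures the last name plus the comma followed by whitespace
(\\b\\w): this second group captures the initial of the first name
(\\b\\w)*: this third group is meant to capture the initial of the middle name; note, however, that the greedy .* before it consumes the rest of the string, so the group stays empty and the middle initial is dropped, as the output above shows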

Extract string using `rm_between` function

I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I can't figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
Regex demo
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | matches the leftmost | and HUMAN matches the leftmost HUMAN after it.
Note the default value for the fixed argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern by setting fixed=FALSE. However, ^(?:[^|]*\|){2} as the left argument regex will not work, because the qdapRegex package creates an ICU regex with lookarounds (since you use extract=TRUE, which sets include.markers to FALSE): (?<=^(?:[^|]*\|){2}).*?(?=HUMAN). ICU requires the lookbehind pattern to have a bounded length, and the * quantifier is unbounded.
As a workaround, you could use a constrained-width lookbehind by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - start of string
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily: [,2] gets the Group 1 value.
This is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution is an application of regular expressions.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 refers to whatever was matched inside the first pair of parentheses.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The whole match is replaced by the part captured by (.*?), because the replacement contains the backreference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a maximum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = TRUE),
              function(item) strsplit(item[3], "_HUMAN", fixed = TRUE)))
# [1] "EIFCL" "EIF3C"
It just splits on | in the first list and on _HUMAN on every third element within that list.
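The same two-step idea can be written a little more defensively with vapply, which guarantees a character vector as the result. This is a sketch that assumes every input has at least three pipe-separated fields; the names fields and third are illustrative.

```r
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")

# Split on the literal pipe, then take the third field of each element
fields <- strsplit(example, "|", fixed = TRUE)
third <- vapply(fields, `[`, character(1), 3)

# Drop the trailing "_HUMAN" suffix
result <- sub("_HUMAN$", "", third)
result
```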

How to extract words containing combinations of certain characters in R

In this sample text:
turns <- tolower(c("Does him good to stir him up now and again .",
"When , when I see him he w's on the settees .",
"Yes it 's been eery for a long time .",
"blissful timing , indeed it was "))
I'd like to extract all words that contain the letters y and e, no matter in what position or combination, namely yes and eery, using str_extract from stringr.
This regex, in which I require that y occurs immediately before e, not surprisingly matches only yes but not eery:
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"
Putting y and e into a character class doesn't get me the desired result either, in that all words with either y or e are matched:
unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b"))
[1] "does" "when" "when" "see" "he" "the" "settees" "yes" "been" "eery" "time" "indeed"
So what is the right solution?
You may use both base R and stringr approaches:
stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))
Or, without turning the strings to lower case, you may use a case insensitive matching with (?i):
stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))
See the regex demo and the R demo. Also, if you want to make it a tiny bit more efficient, you may use the principle of contrast in the lookahead patterns: match any letters but y in the first and any letters but e in the second, using character class subtraction:
stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")
Details
(?i) - case insensitive modifier
\b - word boundary
(?=\p{L}*y) - after 0 or more Unicode letters, there must be y ([\p{L}--[y]]* matches any 0 or more letters but y up to the first y)
(?=\p{L}*e) - after 0 or more Unicode letters, there must be e ([\p{L}--[e]]* matches any 0 or more letters but e up to the first e)
\p{L}+ - 1 or more Unicode letters
\b - word boundary
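The lookahead idea extends to any set of required letters: one lookahead per letter, all anchored at the same word boundary. Below is a base R sketch; words_with_all is a hypothetical helper, and it assumes each required item is a single plain letter with no regex metacharacters.

```r
turns <- tolower(c("Does him good to stir him up now and again .",
                   "When , when I see him he w's on the settees .",
                   "Yes it 's been eery for a long time .",
                   "blissful timing , indeed it was "))

# Hypothetical helper: words that contain every letter in letters_needed
words_with_all <- function(text, letters_needed) {
  # One "(?=\p{L}*x)" lookahead per required letter
  lookaheads <- paste0("(?=\\p{L}*", letters_needed, ")", collapse = "")
  pattern <- paste0("(?i)\\b", lookaheads, "\\p{L}+\\b")
  unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

words_with_all(turns, c("y", "e"))
```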
In case there is no urgent need to use stringr::str_extract you can get words containing the letters y and e in base with strsplit and grepl like:
tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "eery"
In case you have letter chunks between words:
turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "year"

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
See the R demo
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
See the regex demo.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
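The difference between the two patterns only shows up when _dk or _none occurs in the middle of a name. The sketch below adds one hypothetical value, "business_metric_dk_total", which is not part of the original data, to illustrate this.

```r
x <- c("business_metric_one", "business_metric_one_dk", "business_metric_one_none",
       "business_metric_two", "business_metric_two_dk", "business_metric_two_none",
       "business_metric_dk_total")  # hypothetical extra value for illustration

# Rejects any value with _dk or _none anywhere after "metric"
anywhere <- grep("metric(?!.*_(?:dk|none))", x, value = TRUE, perl = TRUE)

# Rejects only values that end with _dk or _none
at_end <- grep("metric.*$(?<!_dk|_none)", x, value = TRUE, perl = TRUE)

anywhere
at_end
```

The first pattern drops the extra value, while the second keeps it because the value does not end with either suffix.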
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, more specifically:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
