str_detect how to distinguish between a1 and a11 - r

I am trying to detect the expressions "a1" and "a11" in the abc object.
abc <- c("a1","a1|a11","a14","a11", "a11|a14", "a1|a3|a14", "a11|a16")
The first query matches everything, including a11, a14, etc.:
str_detect(abc,"a1")
Is there a way to match only a1 and not a11, a12, etc.?

Use word boundaries to enclose the pattern you want.
The last regex uses "\\<" and "\\>" to illustrate word boundaries that match only at the beginning and only at the end of a word, respectively. "\\b" matches a boundary on either side.
abc <- c("a1","a1|a11","a14","a11", "a11|a14", "a1|a3|a14", "a11|a16")
stringr::str_detect(abc,"\\ba1\\b")
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
grepl("\\ba1\\b", abc)
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
grepl("\\<a1\\>", abc)
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
Created on 2022-10-19 with reprex v2.0.2
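If the separator is always a literal |, an alternative sketch (not part of the answer above) is to split each element on it and test for an exact token. The name has_a1 is only illustrative, and the code assumes the entries contain no whitespace around the separator.

```r
abc <- c("a1", "a1|a11", "a14", "a11", "a11|a14", "a1|a3|a14", "a11|a16")

# Split on the literal "|" (fixed = TRUE, so it is not treated as regex alternation)
tokens <- strsplit(abc, "|", fixed = TRUE)

# TRUE when one of the tokens is exactly "a1"
has_a1 <- vapply(tokens, function(tok) "a1" %in% tok, logical(1))
has_a1
# TRUE TRUE FALSE FALSE FALSE TRUE FALSE
```

This avoids regex entirely, which can be easier to reason about when the searched value itself contains regex metacharacters.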

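You could also reverse the roles and use each element of abc as the regex (the | then acts as alternation), matched against the string 'a1' with grepl and fixed = FALSE: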
abc <- c("a1","a1|a11","a14","a11", "a11|a14", "a1|a3|a14", "a11|a16")
unlist(lapply(abc, \(x) grepl(x, 'a1', fixed = FALSE)))
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
Created on 2022-10-19 with reprex v2.0.2
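One caveat with using the elements as patterns: an unanchored pattern also matches substrings, so an entry such as "a" would count as a hit for 'a1'. A minimal sketch of the difference, assuming a shortened vector with a made-up "a" entry and illustrative names loose and strict:

```r
# Shortened example vector; the "a" entry is added only to show the caveat
abc <- c("a1", "a1|a11", "a", "a11|a14")

# Unanchored: each element is a regex searched anywhere inside "a1"
loose <- vapply(abc, function(p) grepl(p, "a1"), logical(1), USE.NAMES = FALSE)

# Anchored: wrap each element in ^( ... )$ so a whole alternative must equal "a1"
strict <- vapply(abc, function(p) grepl(paste0("^(", p, ")$"), "a1"),
                 logical(1), USE.NAMES = FALSE)

loose   # TRUE TRUE TRUE FALSE
strict  # TRUE TRUE FALSE FALSE
```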

Related

regex a before b with and without whitespaces

I am trying to find all the strings which include 'true' when there is no 'act' before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works, i.e. SKIP the match when 'act' occurs as a preceding substring, but match true otherwise:
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
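For this particular vector, the same result can be sketched without the PCRE verbs by combining two plain grepl calls. This is an assumption-laden simplification rather than an equivalent of the pattern above: it rejects any string in which 'act' appears somewhere before a 'true', even if an earlier 'true' has no 'act' before it. The names has_true and act_before are illustrative.

```r
vector <- c("true", "trueact", "acttrue", "act true", "act really true")

# Does the string contain "true" at all?
has_true <- grepl("true", vector, ignore.case = TRUE)

# Is there an "act" somewhere before a "true"?
act_before <- grepl("act.*true", vector, ignore.case = TRUE)

result <- has_true & !act_before
result
# TRUE TRUE FALSE FALSE FALSE
```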
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
(?<!...): negative lookbehind
act: matches act
*?: matches .(?<!act) between 0 and unlimited times, as few times as possible (lazy)
true: matches true
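To see why the per-character lookbehind matters, the two patterns from this thread can be compared side by side. The plain lookbehind only inspects the three characters immediately before 'true', while the anchored version checks every position between the start of the string and 'true'. The names plain and tempered are illustrative.

```r
vector <- c("true", "trueact", "acttrue", "act true", "act really true")

# Lookbehind applied once, directly before "true"
plain <- grepl("(?<!act)true", vector, perl = TRUE, ignore.case = TRUE)

# Lookbehind applied after every consumed character, starting at ^
tempered <- grepl("^(.(?<!act))*?true", vector, perl = TRUE, ignore.case = TRUE)

plain     # TRUE TRUE FALSE TRUE TRUE
tempered  # TRUE TRUE FALSE FALSE FALSE
```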

(How) can multiple backreference be used in alternation patterns?

This question is a spin-off from the question "Function to count of consecutive digits in a string vector".
Assume I have strings such as x:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
and want to detect those strings where any digit between 2 and 5 occurs in immediate succession as many times as its own value. That is, match if the string contains 22, 333, 4444, or 55555.
If I approach this task in small chunks using backreference, everything is fine:
str_detect(x, "(2)\\1{1}")
[1] FALSE FALSE FALSE FALSE **TRUE** FALSE
str_detect(x, "(3)\\1{2}")
[1] FALSE **TRUE** FALSE FALSE FALSE FALSE
str_detect(x, "(4)\\1{3}")
[1] FALSE FALSE FALSE FALSE FALSE **TRUE**
However, if I pursue a single solution for all matches using a vector with the allowed numbers:
digits <- 2:5
and an alternation pattern, such as this:
patt <- paste0("(", digits, ")\\1{", digits - 1, "}", collapse = "|")
patt
[1] "(2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}"
and input patt into str_detect, this only detects the first alternative, namely (2)\\1{1}:
str_detect(x, patt)
[1] FALSE FALSE FALSE FALSE **TRUE** FALSE
Is it the backreference which cannot be used in alternation patterns? If so, then why does a for loop iterating through each option separately not work either?
res <- c()
for(i in 2:5){
res <- str_detect(x, paste0("(", i, ")\\1{", i - 1, "}"))
}
res
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Advice on this matter is greatly appreciated!
In your pattern (2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}, every alternative repeats the backreference \\1, which always refers to the first capture group, (2). In the other alternatives that group has not matched anything, so the backreference fails. That is why you only match the first alternative.
Since each alternative has its own capture group, refer to that group's number instead:
(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}
The (2)\\1{1} could be just (2)\\1, but the {1} is fine since you are assembling the pattern dynamically.
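A two-alternative sketch of the numbering issue, using base grepl with perl = TRUE (PCRE) rather than str_detect so that it runs without extra packages. The names shared_ref and own_ref are illustrative.

```r
# \\1 in the second alternative points at (2), which did not take part in the match
shared_ref <- grepl("(2)\\1|(3)\\1{2}", "333", perl = TRUE)

# \\2 points at (3), the group that belongs to the second alternative
own_ref <- grepl("(2)\\1|(3)\\2{2}", "333", perl = TRUE)

shared_ref  # FALSE
own_ref     # TRUE
```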
What about this?
grepl(
paste0(sapply(2:5, function(i) sprintf("(%s)\\%s{%s}", i, i - 1, i - 1)), collapse = "|"),
x
)
[1] FALSE TRUE FALSE FALSE TRUE TRUE
or
rowSums(sapply(2:5, function(i) grepl(sprintf("(%s)\\1{%s}", i, i - 1), x))) > 0
[1] FALSE TRUE FALSE FALSE TRUE TRUE
As mentioned in the comments, you need to update the regex:
patt = paste0(
"(", digits, ")\\", digits - 1, "{", digits - 1, "}",
collapse = "|"
)
str_detect(x, patt)
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
In your for loop, you are replacing res each time so when you print res at the end, you are seeing the result for when i is 5. If you use print() instead:
for(i in 2:5){
print(str_detect(x, paste0("(", i, ")\\1{", i - 1, "}")))
}
Output:
[1] FALSE FALSE FALSE FALSE TRUE FALSE
[1] FALSE TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
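If the goal is one combined result rather than four printed vectors, the loop can accumulate with a logical OR instead of overwriting res. A minimal sketch, using base grepl with perl = TRUE in place of str_detect so it runs without stringr:

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Start with all FALSE and OR in the result for each digit
res <- rep(FALSE, length(x))
for (i in 2:5) {
  res <- res | grepl(paste0("(", i, ")\\1{", i - 1, "}"), x, perl = TRUE)
}
res
# FALSE TRUE FALSE FALSE TRUE TRUE
```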
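If you wanted to iterate over the strings and digits instead of building one pattern (using purrr):
library(purrr)
library(stringr)
map_lgl(x, function(str) {
any(map_lgl(
2:5,
~ str_detect(str, paste0("(", .x, ")\\1{", .x - 1, "}"))
))
})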
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
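Because the digits are known in advance, the backreferences can also be avoided altogether by writing the repeated runs out literally. This is a different technique from the answers above, sketched here with base R's strrep; the name patt_literal is illustrative.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Repeat each digit as many times as its own value: "22", "333", "4444", "55555"
runs <- strrep(as.character(digits), digits)

# Join the runs into one alternation pattern
patt_literal <- paste(runs, collapse = "|")
patt_literal
# "22|333|4444|55555"

hits <- grepl(patt_literal, x)
hits
# FALSE TRUE FALSE FALSE TRUE TRUE
```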

Handling empty strings in string detection

I would like to use str_detect without converting "" to another string pattern. Is there an easy way to deal with empty string patterns ""? Right now they generate a warning. I would like this to produce TRUE, FALSE, FALSE, FALSE, FALSE:
library( tidyverse )
str_detect('matt', c( "matt","joe","liz","", NA))
We can use
library(stringr)
library(tidyr)
str_detect(replace_na(v1, ''), 'matt')
#[1] TRUE FALSE FALSE FALSE FALSE
If the match is not for a substring, then %in% would be useful
v1 %in% 'matt'
#[1] TRUE FALSE FALSE FALSE FALSE
data
v1 <- c( "matt","joe","liz","", NA)
If you're not tied to str_detect() perhaps try grepl()?
grepl("matt", c( "matt","joe","liz","", NA))
#[1] TRUE FALSE FALSE FALSE FALSE
Here is a way with the stringi package, on which stringr is built.
x <- c( "matt","joe","liz","", NA)
stringi::stri_detect_regex(x, 'matt') & !is.na(x)
#[1] TRUE FALSE FALSE FALSE FALSE
The NA value must be tested separately; otherwise stri_detect_* returns NA for it.
You could also do:
v1 <- c( "matt","joe","liz","", NA)
sapply(v1, identical, "matt")
Output:
matt   joe   liz        <NA>
TRUE FALSE FALSE FALSE FALSE
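The answers above treat the vector as the strings to search in. If it really is a vector of patterns, as in the question, a base R sketch is to skip the empty and NA entries explicitly and report FALSE for them. The helper name detect_nonempty is hypothetical, not part of stringr.

```r
# Hypothetical helper: test one string against several patterns,
# returning FALSE for patterns that are "" or NA
detect_nonempty <- function(string, patterns) {
  usable <- !is.na(patterns) & nzchar(patterns)
  out <- rep(FALSE, length(patterns))
  out[usable] <- vapply(patterns[usable], function(p) grepl(p, string),
                        logical(1), USE.NAMES = FALSE)
  out
}

found <- detect_nonempty("matt", c("matt", "joe", "liz", "", NA))
found
# TRUE FALSE FALSE FALSE FALSE
```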

regex fails with dollar sign

In R, I'm trying to match a series of strings from a vector of file names. I only want those without letters that end with .tif
allfiles <- c("181129_16_00_class_mlc.tif", "181129_16_00.tif.aux.xml", "181129_17_00_01_19.tif", "181129_17_00_01_20.tif", "181129_17_00_01_23.tif", "181129_17_00_01_24.tif", "181129_17_00_01_25.tif", "181129_17_00_01_26.tif", "181129_17_00_01_27.tif", "181129_17_00_01_28.tif", "181129_17_00_01_29.tif", "181129_17_00_01_30.tif")
grepl("^[0-9_]+[.tif]", allfiles)
grepl("^[0-9_]+[.tif]$", allfiles)
That returns:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Why does the dollar sign fail? The result I expected from the second grepl was:
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
It's not $ that fails but the use of brackets. Instead you want:
grepl("^[0-9_]+\\.tif$", allfiles)
# [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Meanwhile, ^[0-9_]+[.tif]$ means that after all the digits and/or _, the string must end with exactly one character out of t, i, f, or a literal dot. For instance,
grepl("^[0-9_]+[.tif]$", "1234t")
# [1] TRUE
grepl("^[0-9_]+[.tif]$", "1234tt")
# [1] FALSE
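The three variants can be compared on a small sample. Brackets still have a legitimate use here: [.] is a one-character class containing only the dot, so it works as an alternative to the \\. escape. The shortened vector files and the result names are illustrative.

```r
files <- c("181129_17_00_01_19.tif", "181129_16_00.tif.aux.xml", "1234t")

# [.tif] is a character class: exactly ONE of ".", "t", "i", "f"
char_class <- grepl("^[0-9_]+[.tif]$", files)

# Escaped dot followed by the literal letters t, i, f
escaped <- grepl("^[0-9_]+\\.tif$", files)

# Same meaning, with the dot written as a single-character class
bracket_dot <- grepl("^[0-9_]+[.]tif$", files)

char_class   # FALSE FALSE TRUE
escaped      # TRUE FALSE FALSE
bracket_dot  # TRUE FALSE FALSE
```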

Using grepl in R

In which cases would these two implementations give different results?
data(mtcars)
firstWay <- mtcars[grepl('6',mtcars$cyl),]
SecondWay <- mtcars[mtcars$cyl=='6',]
If these ways always give the same results which of them is recommended and why? Thanks
mtcars$cyl is a numeric column, so you would be better off comparing it to a number using mtcars[mtcars$cyl == 6, ].
But the difference between the equality operator == and grepl is that == will only be TRUE for members of the vector which are equal to "6", while grepl will match any member of the vector which has a 6 anywhere within it.
So, for example:
String                                                   ==     grepl
6                                                        TRUE   TRUE
123456                                                   FALSE  TRUE
6ABC                                                     FALSE  TRUE
This is a long sentence which happens to have a 6 in it  FALSE  TRUE
Whereas this long sentence does not                      FALSE  FALSE
The equivalent grepl pattern would be "^6$". There's a tutorial (one of many) on regex at http://www.regular-expressions.info/tutorial.html.
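A short check of the comparison on a few of the strings, plus the mtcars case from the question. For cyl the two approaches happen to agree, since the column only contains 4, 6 and 8. The vector s and the result names are illustrative.

```r
s <- c("6", "123456", "6ABC")

equal_6    <- s == "6"          # exact equality
contains_6 <- grepl("6", s)     # "6" anywhere in the string
anchored_6 <- grepl("^6$", s)   # the whole string is "6"

equal_6     # TRUE FALSE FALSE
contains_6  # TRUE TRUE TRUE
anchored_6  # TRUE FALSE FALSE

# On mtcars$cyl both approaches select the same rows
same_rows <- identical(grepl("6", mtcars$cyl), mtcars$cyl == 6)
same_rows   # TRUE
```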
Well, I think the first difference is that with grepl you can subset even if you do not know the exact value in advance: instead of searching for exactly 6, you can search for rows that start or end with 6.
If you try this with normal subsetting you get an empty object, because ^6, for example, is not interpreted as a regular expression but as a literal string made of the symbol ^ and 6.
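A quick sketch of that point on mtcars: the same "^6" selects nothing under == but works as a pattern under grepl. The names n_equal and n_regex are illustrative.

```r
# "^6" compared literally: no cyl value equals the two-character string "^6"
n_equal <- nrow(mtcars[mtcars$cyl == "^6", ])

# "^6" used as a regex: cyl values whose text starts with 6
n_regex <- nrow(mtcars[grepl("^6", mtcars$cyl), ])

n_equal  # 0
n_regex  # 7
```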
I am sure there are other differences, and that more experienced users will provide more detailed answers.
As for which one should be preferred, efficiency may be a reason:
system.time(mtcars[grepl('^6',mtcars$cyl),])
user system elapsed
0.029 0.002 0.035
system.time(mtcars[mtcars$cyl=='6',])
user system elapsed
0.031 0.002 0.046
This little example is only a rough guide; as @Nick K suggested first, a more precise investigation should be done with microbenchmark. Of course, with a big dataset I doubt that professional users (or anyone in need of speed) would prefer either of them; they would more likely rely on data.table or on tools like dplyr, which are written in a lower-level language and are therefore faster.
Using the microbenchmark package, we can see which is faster:
library(microbenchmark)
m <- microbenchmark(mtcars[grepl('6',mtcars$cyl),], mtcars[mtcars$cyl=='6',], times=10000)
m
Unit: microseconds
expr min lq mean median uq max neval
mtcars[grepl("6", mtcars$cyl), ] 229.080 234.738 247.5324 236.693 239.417 6713.914 10000
mtcars[mtcars$cyl == "6", ] 214.902 220.210 231.0240 221.956 224.471 7759.507 10000
It looks like == is faster, so use it when possible.
However, the two do not do exactly the same thing: grepl checks whether the pattern occurs anywhere in the string, whereas == checks whether the values are equal.
grepl("6", mtcars$disp)
[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mtcars$disp == "6"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Resources