(How) can multiple backreferences be used in alternation patterns? - r

This question is a spin-off from the question "Function to count of consecutive digits in a string vector".
Assume I have strings such as x:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
and want to detect those strings where any number between 2 and 5 occurs in immediate duplication as many times as itself. That is, match if the string contains 22, 333, 4444, or 55555.
If I approach this task in small chunks using backreference, everything is fine:
str_detect(x, "(2)\\1{1}")
[1] FALSE FALSE FALSE FALSE **TRUE** FALSE
str_detect(x, "(3)\\1{2}")
[1] FALSE **TRUE** FALSE FALSE FALSE FALSE
str_detect(x, "(4)\\1{3}")
[1] FALSE FALSE FALSE FALSE FALSE **TRUE**
However, if I pursue a single solution for all matches using a vector with the allowed numbers:
digits <- 2:5
and an alternation pattern, such as this:
patt <- paste0("(", digits, ")\\1{", digits - 1, "}", collapse = "|")
patt
[1] "(2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}"
and input patt into str_detect, this only detects the first alternative, namely (2)\\1{1}:
str_detect(x, patt)
[1] FALSE FALSE FALSE FALSE **TRUE** FALSE
Is it the backreference which cannot be used in alternation patterns? If so, then why does a for loop iterating through each option separately not work either?
res <- c()
for(i in 2:5){
res <- str_detect(x, paste0("(", i, ")\\1{", i - 1, "}"))
}
res
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Advice on this matter is greatly appreciated!

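In your pattern (2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4} every alternative uses \\1, which always refers to the first capture group, (2). In the other alternatives that group takes no part in the match, so the backreference cannot match and only the first alternative ever succeeds.
Capture groups are numbered from left to right across the whole pattern, so each alternative should refer to its own group instead: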
(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}
The (2)\\1{1} can be just (2)\\1, but the {1} is fine since you are assembling the pattern dynamically.
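A minimal sketch of the point above, using base R's grepl with perl = TRUE rather than str_detect (an assumption made to keep the example free of package dependencies). The names patt_backref and patt_literal are illustrative only. The first pattern numbers each backreference by the position of its alternative. The second avoids backreferences altogether, which is possible here because every alternative captures a fixed digit.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Backreference version: the group index is the position of the alternative
# in the pattern (1, 2, 3, ...), not the digit itself, hence seq_along(digits)
patt_backref <- paste(
  sprintf("(%d)\\%d{%d}", digits, seq_along(digits), digits - 1L),
  collapse = "|"
)

# Literal version: each digit repeated as many times as its own value
patt_literal <- paste(strrep(digits, digits), collapse = "|")

res_backref <- grepl(patt_backref, x, perl = TRUE)
res_literal <- grepl(patt_literal, x)

print(patt_backref)
print(patt_literal)
print(res_backref)
```

Using seq_along(digits) instead of digits - 1 keeps the group numbers correct even if digits does not start at 2.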

What about this?
> grepl(
+ paste0(sapply(2:5, function(i) sprintf("(%s)\\%s{%s}", i, i - 1, i - 1)), collapse = "|"),
+ x
+ )
[1] FALSE TRUE FALSE FALSE TRUE TRUE
or
> rowSums(sapply(2:5, function(i) grepl(sprintf("(%s)\\1{%s}", i, i - 1), x))) > 0
[1] FALSE TRUE FALSE FALSE TRUE TRUE

As mentioned in the comments, you need to update the regex:
patt = paste0(
"(", digits, ")\\", digits - 1, "{", digits - 1, "}",
collapse = "|"
)
str_detect(x, patt)
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
In your for loop, you overwrite res on every iteration, so when you print res at the end you only see the result for i = 5. If you use print() instead:
for(i in 2:5){
print(str_detect(x, paste0("(", i, ")\\1{", i - 1, "}")))
}
Output:
[1] FALSE FALSE FALSE FALSE TRUE FALSE
[1] FALSE TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
If you wanted to iterate instead (this uses map_lgl from purrr):
map_lgl(x, function(str) {
any(map_lgl(
2:5,
~ str_detect(str, paste0("(", .x, ")\\1{", .x - 1, "}"))
))
})
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
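The for loop from the question can also be kept as it is, provided each iteration's result is combined with the running result instead of replacing it. A small sketch, assuming base R's grepl with perl = TRUE in place of str_detect:

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Start with "no match yet" for every string
res <- logical(length(x))

for (i in 2:5) {
  # Within a single-alternative pattern, \\1 correctly refers to the only group
  hit <- grepl(paste0("(", i, ")\\1{", i - 1, "}"), x, perl = TRUE)
  # Accumulate with OR so that earlier hits are not overwritten
  res <- res | hit
}

print(res)
```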

Related

regex a before b with and without whitespaces

I am trying to find all the string which include 'true' when there is no 'act' before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is:
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works: it SKIPs the match when 'act' precedes 'true' in the string, but matches 'true' otherwise.
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
(?<!...): negative lookbehind
act: matches act
*?: repeats .(?<!act) between 0 and unlimited times, as few times as possible (lazy)
true: matches true
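The same condition can be sketched without lookbehind, in two steps: locate the first 'true', then check whether 'act' occurs anywhere before it. This is only an illustration of the logic, and the variable names (vec, pos, before, res) are made up for the example.

```r
vec <- c("true", "trueact", "acttrue", "act true", "act really true")

# Position of the first 'true' in each string (-1 if there is none)
pos <- as.integer(regexpr("true", vec, ignore.case = TRUE))

# Everything that comes before that first 'true' (empty if it starts the string)
before <- substr(vec, 1, pos - 1)

# Keep strings that contain 'true' and have no 'act' in front of it
res <- pos > 0 & !grepl("act", before, ignore.case = TRUE)
print(res)
```

Checking only the first 'true' is enough: if any 'true' has no 'act' before it, the first one has none either.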

How can I reference the vectors directly and not the list in a nested lapply function?

I have a list of vectors:
this_list <- list(a <- c("this one","that one","these two"),b <- c("those some","heres more","two plus this","and one last one"),c <- c("the final one","it ends here"))
search_words <- c("one","two")
I want to search the vectors in each element of the list for each/any keywords. I only care if a keyword appears, not which keyword.
lapply(search_words,grepl,lapply(this_list$,'['))
lapply(search_words,grepl,this_list[['a']])
These will tell me, respectively, if a list element has a keyword and if a vector element has a keyword, by keyword.
I want to have a function that at once search per list element, per vector element if a keyword is in there.
sample desired output:
this_list a
search_word "one"
true true false
search_word "two"
false false true
this_list b
search_word "one"
false false false true
Even better would be to not breakdown by search_word and just tell me, by list element, if a search word was found in the vectors
this_list 'a'
true true true
this_list 'b'
false false true true
...
sapply(this_list, function(L)
sapply(search_words, grepl, x = L, simplify = FALSE),
simplify = FALSE)
# $a
# $a$one
# [1] TRUE TRUE FALSE
# $a$two
# [1] FALSE FALSE TRUE
# $b
# $b$one
# [1] FALSE FALSE FALSE TRUE
# $b$two
# [1] FALSE FALSE TRUE FALSE
# $c
# $c$one
# [1] TRUE FALSE
# $c$two
# [1] FALSE FALSE
or
sapply(this_list, function(L)
rowSums(sapply(search_words, grepl, x = L, simplify = TRUE)) > 0,
simplify = FALSE)
# $a
# [1] TRUE TRUE TRUE
# $b
# [1] FALSE FALSE TRUE TRUE
# $c
# [1] TRUE FALSE
FYI, never use <- inside of a function call like list.
this_list <- list(a <- ..., b <- ...)
This doesn't name the entries in this_list, it produces objects in the current environment named a and b, and puts the contents of those objects into the list without names.
Instead, always use = inside of function calls (unless your intent is to create an object and use the object's contents only, disregarding the name ... this is valid use, but much less common.)
this_list <- list(a = ..., b = ...)
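A short sketch of the difference, with placeholder values standing in for the real vectors:

```r
# With <-, the assignments happen in the calling environment and the list
# itself gets no names
unnamed <- list(a <- 1:2, b <- 3:4)
print(names(unnamed))  # NULL
print(exists("a"))     # TRUE: 'a' was created as a side effect

# With =, the arguments are named and no extra objects are created
named <- list(a = 1:2, b = 3:4)
print(names(named))    # "a" "b"
```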
Data
this_list <- list(a = c("this one","that one","these two"),b = c("those some","heres more","two plus this","and one last one"),c = c("the final one","it ends here"))
search_words <- c("one","two")

Algorithm that returns false if the result has already come out before in R

I want to make an algorithm that returns false if the result has already come out before.
Below I attach the code I'm using, but it does not seem to be correct.
I'd appreciate any help.
x1 <- c("LMP","Dp","LMP","LMP","Dp")
x2 <- c("Dp","Dp","LMP","LMP","Dp")
for(i in 1:length(x)){
if(i==1){TRUE}else{
if(length(unique(x[1:i]))==1){FALSE}else{TRUE}}}
# The result that i want is
# for x1:
TRUE, TRUE, FALSE, FALSE, FALSE
# for x2:
TRUE, FALSE, TRUE, FALSE, FALSE
Use duplicated() instead of a for loop. duplicated() returns TRUE for every element that has already appeared earlier in the vector; negating it with ! flips TRUE to FALSE and vice versa, so first occurrences come out as TRUE.
!duplicated(x1)
[1] TRUE TRUE FALSE FALSE FALSE
!duplicated(x2)
[1] TRUE FALSE TRUE FALSE FALSE
It can be done in a for loop as well.
f1 <- function(vec) {
un1 <- character(0)  # values seen so far; starting empty avoids treating "" as already seen
out <- logical(length(vec))
for(i in seq_along(vec)) {
if(!vec[i] %in% un1) {
un1 <- c(un1, vec[i])
out[i] <- TRUE
}
}
out
}
Testing:
> f1(x1)
[1] TRUE TRUE FALSE FALSE FALSE
> f1(x2)
[1] TRUE FALSE TRUE FALSE FALSE
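Another way to see what !duplicated() computes: an element is a first occurrence exactly when the position of its first match in the vector is its own index. A sketch using match(), where first_seen is a hypothetical helper name:

```r
x1 <- c("LMP", "Dp", "LMP", "LMP", "Dp")
x2 <- c("Dp", "Dp", "LMP", "LMP", "Dp")

# match(v, v) gives, for each element, the index of its first occurrence
first_seen <- function(v) match(v, v) == seq_along(v)

print(first_seen(x1))
print(first_seen(x2))
```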

Find string with optional preceding string followed by an optional whitespace, both with negative lookbehind

I am not sure if the title of this question makes sense. I am looking for a string ("string") which can have an optional preceding string ("a"), which can or cannot be followed by a whitespace. All this should be with a negative lookbehind - this would basically be for the entire following expression.
My regex starts to fail with the negative lookbehind, which makes sense to me, and I wonder how to solve this.
This can be anywhere, and does not have to be at the start.
x <- c("string not false", "this is not a string", "this is a string", "not a string", "not astring", "a string", "astring", "string")
# all the below fail
grepl("(?<!not\\s{1})a?\\s?string", x, perl = TRUE)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
grepl("(?<!not\\s{1})a\\s?string", x, perl = TRUE)
#> [1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
grepl("(?<!not\\s{1})(\\b|a)\\s?string", x, perl = TRUE)
#> [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# expected output
#> [1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Why not avoid lookbehind and keep it simple, asking for what you want and what you don't want in two separate calls?
grepl("a?\\s?string", x) & !grepl("not\\s?a?\\s?string", x)
#[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Note:
If you really want only one call to grepl, you need to spell out more precisely what you want and what you don't want. Excluding only "not" in the lookbehind is not enough: "not " ("not" followed by a space) has to be excluded there as well. You also need to state what you want in a lookahead, because if the regex is too flexible (an optional "a", an optional space, etc.), grepl will still find a match somewhere.
The following code (more complicated than 2 grepl calls imo) works with your example:
grepl("(?<!(not)|(not ))(?=(^string)|(a string)|(astring))", x, perl=TRUE)
#[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
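A further single-call option, sketched here under the assumption that perl = TRUE is acceptable, is to express the unwanted case first and discard it with PCRE's (*SKIP)(*FAIL) verbs, then match the wanted case in the second alternative. The two patterns are the same ones used in the two-call version above.

```r
x <- c("string not false", "this is not a string", "this is a string",
       "not a string", "not astring", "a string", "astring", "string")

# First alternative: the negated form, consumed and then rejected
# Second alternative: the form we actually want to detect
patt <- "not\\s?a?\\s?string(*SKIP)(*FAIL)|a?\\s?string"

res_single <- grepl(patt, x, perl = TRUE)
res_two    <- grepl("a?\\s?string", x) & !grepl("not\\s?a?\\s?string", x)

print(res_single)
print(identical(res_single, res_two))
```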
Data:
x <- c("string not false", "this is not a string", "this is a string", "not a string", "not astring", "a string", "astring", "string")
A grepl solution:
grepl("^(?!not).*string", x, perl = TRUE)
Alternatively, check out:
library(stringr)
str_detect(x, "\\bnot\\b", negate = TRUE)
[1] TRUE FALSE FALSE TRUE TRUE TRUE
grepl has no argument for negating a pattern (but grep does, via invert = TRUE)
Data:
x <- c("this is a string", "not a string", "not astring", "a string", "astring", "string")
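To illustrate the remark about grep: its invert argument returns the elements (or indices) that do not match, and negating grepl with ! gives the equivalent logical vector. A brief sketch on the six-element data above, with perl = TRUE assumed for the \\b word boundary:

```r
x <- c("this is a string", "not a string", "not astring",
       "a string", "astring", "string")

# Elements that do NOT contain the word "not"
kept <- grep("\\bnot\\b", x, invert = TRUE, value = TRUE, perl = TRUE)
print(kept)

# The same information as a logical vector
no_not <- !grepl("\\bnot\\b", x, perl = TRUE)
print(no_not)
```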

Handling empty strings in string detection

I would like to use str_detect without converting "" to another string pattern. Is there an easy way to deal with empty string patterns "", which currently generate a warning? I would like this to produce TRUE, FALSE, FALSE, FALSE, FALSE:
library( tidyverse )
str_detect('matt', c( "matt","joe","liz","", NA))
We can use
library(stringr)
library(tidyr)
str_detect(replace_na(v1, ''), 'matt')
#[1] TRUE FALSE FALSE FALSE FALSE
If the match is not for a substring, then %in% would be useful
v1 %in% 'matt'
#[1] TRUE FALSE FALSE FALSE FALSE
data
v1 <- c( "matt","joe","liz","", NA)
If you're not tied to str_detect() perhaps try grepl()?
grepl("matt", c( "matt","joe","liz","", NA))
#[1] TRUE FALSE FALSE FALSE FALSE
Here is a way with package stringi, on which package stringr is built.
x <- c( "matt","joe","liz","", NA)
stringi::stri_detect_regex(x, 'matt') & !is.na(x)
#[1] TRUE FALSE FALSE FALSE FALSE
The NA value must be tested, if not stri_detect_* will return NA.
You could also do:
v1 <- c( "matt","joe","liz","", NA)
sapply(v1, identical, "matt")
Output:
matt joe liz <NA>
TRUE FALSE FALSE FALSE FALSE
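The answers above treat the vector as the strings and 'matt' as the pattern. If the roles in the question are kept (one string, a vector of patterns), a base R sketch could guard against NA and empty patterns explicitly. The names patterns and res are illustrative, and fixed = TRUE is an assumption that the patterns are literal text rather than regular expressions.

```r
patterns <- c("matt", "joe", "liz", "", NA)

# For each pattern: NA and "" count as "no match"; otherwise look for the
# pattern as a literal substring of "matt"
res <- vapply(
  patterns,
  function(p) !is.na(p) && nzchar(p) && grepl(p, "matt", fixed = TRUE),
  logical(1),
  USE.NAMES = FALSE
)
print(res)
```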

Resources