Handling empty strings in string detection - r

I would like to use str_detect without converting "" to another string pattern. Is there an easy way to deal with empty string patterns "", which currently generate a warning? I would like this to produce TRUE, FALSE, FALSE, FALSE, FALSE
library( tidyverse )
str_detect('matt', c( "matt","joe","liz","", NA))

We can use
library(stringr)
library(tidyr)
str_detect(replace_na(v1, ''), 'matt')
#[1] TRUE FALSE FALSE FALSE FALSE
If you need an exact match rather than a substring match, then %in% would be useful
v1 %in% 'matt'
#[1] TRUE FALSE FALSE FALSE FALSE
data
v1 <- c( "matt","joe","liz","", NA)
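To make the difference between the two approaches above concrete, here is a minimal base R sketch. The vector v2 and the value "matthew" are made up for illustration; grepl stands in for str_detect only so that the example needs no packages.

```r
# Hypothetical data: "matthew" contains "matt" but is not equal to it.
v2 <- c("matt", "matthew", "", NA)

# Exact, whole-element comparison; "" and NA simply yield FALSE.
exact <- v2 %in% "matt"

# Substring search; fixed = TRUE treats "matt" as a literal, not a regex.
partial <- grepl("matt", v2, fixed = TRUE)

print(exact)
print(partial)
```

Only the substring search flags "matthew", so which one to pick depends on whether partial matches should count.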

If you're not tied to str_detect() perhaps try grepl()?
grepl("matt", c( "matt","joe","liz","", NA))
#[1] TRUE FALSE FALSE FALSE FALSE

Here is a way with package stringi, on which package stringr is built.
x <- c( "matt","joe","liz","", NA)
stringi::stri_detect_regex(x, 'matt') & !is.na(x)
#[1] TRUE FALSE FALSE FALSE FALSE
The NA value must be tested separately; otherwise stri_detect_* will return NA for that element.
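The same NA-to-FALSE step can be written in a couple of ways. This is only a sketch: na_to_false is a hypothetical helper (not part of stringi or stringr), and res is a hand-written logical vector shaped like a detector result that ends in NA.

```r
# Hypothetical helper: turn NA into FALSE, keep TRUE/FALSE as they are.
na_to_false <- function(z) !is.na(z) & z

# A logical vector shaped like a detection result whose last element is NA.
res <- c(TRUE, FALSE, FALSE, FALSE, NA)

cleaned <- na_to_false(res)
print(cleaned)

# Same effect with %in%: NA is never %in% TRUE.
print(res %in% TRUE)
```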

You could also do-
v1 <- c( "matt","joe","liz","", NA)
sapply(v1, identical, "matt")
Output-
matt joe liz <NA>
TRUE FALSE FALSE FALSE FALSE

Related

str_detect how to distinguish between a1 and a11

I am trying to find the presence of the expressions "a1" and "a11" in the abc object.
abc <- c("a1","a1|a11","a14","a11", "a11|a14", "a1|a3|a14", "a11|a16")
The first query finds everything, also a11, a14 etc
str_detect(abc,"a1")
Is there a way to distinguish these, so that I find only a1 and not a11, a12, etc.?
Use word boundaries to enclose the pattern you want.
The first two examples use "\\b", which matches a word boundary on either side of the pattern. The last regex uses "\<" and "\>" instead, which match the boundary at the beginning and at the end of a word, respectively.
abc <- c("a1","a1|a11","a14","a11", "a11|a14", "a1|a3|a14", "a11|a16")
stringr::str_detect(abc,"\\ba1\\b")
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
grepl("\\ba1\\b", abc)
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
grepl("\\<a1\\>", abc)
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
Created on 2022-10-19 with reprex v2.0.2
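If the "|" in these strings is always a separator between whole items (an assumption about the data), the same result can be had without a regex by splitting and comparing items exactly. A minimal base R sketch:

```r
abc <- c("a1", "a1|a11", "a14", "a11", "a11|a14", "a1|a3|a14", "a11|a16")

# fixed = TRUE makes "|" a literal separator rather than regex alternation.
parts <- strsplit(abc, "|", fixed = TRUE)

# TRUE when one of the separated items is exactly "a1".
has_a1 <- vapply(parts, function(p) "a1" %in% p, logical(1))
print(has_a1)
```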
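You could also reverse the roles and use each element of abc as the regex pattern, tested against 'a1' with grepl (with fixed = FALSE, the | inside an element acts as alternation):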
abc <- c("a1","a1|a11","a14","a11", "a11|a14", "a1|a3|a14", "a11|a16")
unlist(lapply(abc, \(x) grepl(x, 'a1', fixed = FALSE)))
#> [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE
Created on 2022-10-19 with reprex v2.0.2
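One caveat with using the elements as patterns is that an unanchored pattern only has to match part of 'a1'. The sketch below anchors each element so it has to match the whole target; matches_whole is a hypothetical helper, and it assumes the elements contain no regex metacharacters other than "|".

```r
abc <- c("a1", "a1|a11", "a14", "a11", "a11|a14", "a1|a3|a14", "a11|a16")

# Hypothetical helper: wrap each element in ^(?:...)$ so that one of its
# alternatives must match all of `target`, not just a part of it.
matches_whole <- function(patterns, target) {
  vapply(
    patterns,
    function(p) grepl(paste0("^(?:", p, ")$"), target, perl = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
}

print(matches_whole(abc, "a1"))

# Unanchored, an element "a" would match inside "a1"; anchored, it does not.
print(grepl("a", "a1"))
print(matches_whole("a", "a1"))
```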

regex a before b with and without whitespaces

I am trying to find all the strings that include 'true' when there is no 'act' before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works: SKIP the match when 'act' appears as a preceding substring, but match true otherwise
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
(?<!): negative lookbehind
act: matches act
*?: matches .(?<!act) between 0 and unlimited times, as few times as possible (lazy)
true: matches true
see here for the regex demo
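For comparison, the same condition can be checked without lookbehinds by comparing match positions. This is a sketch under the reading that a string qualifies when its first 'true' comes before any 'act'; the name strings is just the example vector from the question under a different name.

```r
strings <- c("true", "trueact", "acttrue", "act true", "act really true")

# Position of the first "true" and the first "act" (-1 when absent).
pos_true <- as.integer(regexpr("true", strings, ignore.case = TRUE))
pos_act  <- as.integer(regexpr("act",  strings, ignore.case = TRUE))

# Keep strings whose first "true" is not preceded by any "act".
no_act_before <- pos_true > 0 & (pos_act < 0 | pos_true < pos_act)
print(no_act_before)
```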

R: Return a vector using %in% operator

I have a vector with delimiters and I want to generate a vector of the same length with boolean values based on whether or not one of the delimited values contains what I am after. I cannot find a way to do this neatly in vector-based logic. As an example:
x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')
Using some magic asking whether 'a' %in% x, I want to get the vector:
TRUE, TRUE, FALSE, FALSE, TRUE, FALSE
I initially tried the following:
'a' %in% trimws(strsplit(x, ';'))
But this unexpectedly collapses the entire list and returns TRUE, rather than a vector, since one of the elements in x is 'a'. Is there a way to get the vector I am looking for without rewriting the code into a for-loop?
Update: To consider white spaces:
library(stringr)
x <- str_replace_all(string=x, pattern=" ", replacement="")
x
[1] "a" "a;b" "ab;c" "b;c" "c;a" "c"
str_detect(x, 'a$|a;')
[1] TRUE TRUE FALSE FALSE TRUE FALSE
First answer:
If you want to use str_detect, the pattern has to account for a either at the end of the string or followed by the delimiter ;:
library(stringr)
str_detect(x, 'a$|a;')
[1] TRUE TRUE FALSE FALSE TRUE FALSE
Base R:
grepl("a$|a;", x)
or (when you want to use %in% explicitly), split on the delimiter and test each piece:
sapply(strsplit(x, ";\\s*"), function(x){ "a" %in% x})
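One thing to watch with a pattern like 'a$|a;' is that it only checks what follows the a, not what precedes it. A possible stricter pattern is sketched below in base R; item_a and the extra vector y are made up for illustration.

```r
x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')

# "a" must start the string or follow a ";", and must end it or precede a ";".
item_a <- "(^|;\\s*)a(\\s*;|$)"
strict <- grepl(item_a, x, perl = TRUE)
print(strict)

# Hypothetical input where "a" only ends a longer item.
y <- c("ba; c", "c; ba")
print(grepl("a$|a;", y))              # matches, although no item equals "a"
print(grepl(item_a, y, perl = TRUE))  # no match
```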
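When working with strings and letters I always use the great library stringr. Note that the plain pattern "a" would also match the a inside 'ab; c', so word boundaries are used here to match a as a whole item:
library(stringr)
x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')
str_detect(x, "\\ba\\b")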
If you would like to use %in%, here is a base R option
> mapply(`%in%`, list("a"), strsplit(x, ";\\s+"))
[1] TRUE TRUE FALSE FALSE TRUE FALSE
A more efficient way might be using grepl like below
> grepl("\\ba\\b",x)
[1] TRUE TRUE FALSE FALSE TRUE FALSE
You can read each item separately with scan, trim leading and trailing WS as you attempted, and test each resulting character vector in turn with:
sapply(x, function(x){"a" %in% trimws( scan( text=x, what="",sep=";", quiet=TRUE))})
a a; b ab; c b; c c; a c
TRUE TRUE FALSE FALSE TRUE FALSE
The top row of the result is just the names and would not affect a logical test that depended on this result. There is an unname function if needed.
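The reason the original attempt returns a single TRUE is that %in% with one value on the left always gives one value back, so the test has to be applied per element of the split list. A minimal sketch of that idea; has_item is a hypothetical helper name, not an existing function.

```r
x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')

# Hypothetical helper: does any separated item of each string equal `item`?
has_item <- function(strings, item, sep = ";") {
  pieces <- strsplit(strings, sep, fixed = TRUE)  # list, one vector per string
  vapply(pieces, function(p) item %in% trimws(p), logical(1))
}

found <- has_item(x, "a")
print(found)
```

vapply is used instead of sapply so the result is guaranteed to be a logical vector of the same length as x.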

(How) can multiple backreference be used in alternation patterns?

This question is a spin-off from the question "Function to count of consecutive digits in a string vector".
Assume I have strings such as x:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
and want to detect those strings where any number between 2 and 5 occurs in immediate duplication as many times as itself. That is, match if the string contains 22, 333, 4444, or 55555.
If I approach this task in small chunks using backreference, everything is fine:
str_detect(x, "(2)\\1{1}")
[1] FALSE FALSE FALSE FALSE TRUE FALSE
str_detect(x, "(3)\\1{2}")
[1] FALSE TRUE FALSE FALSE FALSE FALSE
str_detect(x, "(4)\\1{3}")
[1] FALSE FALSE FALSE FALSE FALSE TRUE
However, if I pursue a single solution for all matches using a vector with the allowed numbers:
digits <- 2:5
and an alternation pattern, such as this:
patt <- paste0("(", digits, ")\\1{", digits - 1, "}", collapse = "|")
patt
[1] "(2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}"
and input patt into str_detect, this only detects the first alternative, namely (2)\\1{1}:
str_detect(x, patt)
[1] FALSE FALSE FALSE FALSE TRUE FALSE
Is it the backreference which cannot be used in alternation patterns? If so, then why does a for loop iterating through each option separately not work either?
res <- c()
for(i in 2:5){
res <- str_detect(x, paste0("(", i, ")\\1{", i - 1, "}"))
}
res
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Advice on this matter is greatly appreciated!
In your pattern (2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}, every backreference \\1 refers to the first capture group, (2), even in the later alternatives. When one of those alternatives is tried, group 1 has not matched anything, so the backreference fails. That is why you only match the first alternative.
Capture groups are numbered from left to right across the whole pattern, so each alternative should refer to its own group instead:
(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}
The (2)\\1{1} can be just (2)\\1, but this is fine as you are assembling the pattern dynamically.
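A possible way to assemble such a pattern is to number each backreference by the group's position rather than by the digit's value, which also keeps working if digits does not start at 2. The sketch below uses base R grepl with perl = TRUE in place of str_detect; the second variant, which spells the runs out literally, is an alternative and not something from the question.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Number the backreference by group position, not by the digit's value.
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}",
               collapse = "|")
print(patt)
by_backref <- grepl(patt, x, perl = TRUE)
print(by_backref)

# Without backreferences: spell each run out literally ("22", "333", ...).
literal <- paste(strrep(as.character(digits), digits), collapse = "|")
print(literal)
by_literal <- grepl(literal, x)
print(by_literal)
```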
What about this?
> grepl(
+ paste0(sapply(2:5, function(i) sprintf("(%s)\\%s{%s}", i, i - 1, i - 1)), collapse = "|"),
+ x
+ )
[1] FALSE TRUE FALSE FALSE TRUE TRUE
or
> rowSums(sapply(2:5, function(i) grepl(sprintf("(%s)\\1{%s}", i, i - 1), x))) > 0
[1] FALSE TRUE FALSE FALSE TRUE TRUE
As mentioned in the comments, you need to update the regex:
patt = paste0(
"(", digits, ")\\", digits - 1, "{", digits - 1, "}",
collapse = "|"
)
str_detect(x, patt)
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
In your for loop, you overwrite res on every iteration, so when you print res at the end, you only see the result for i = 5. If you use print() inside the loop instead:
for(i in 2:5){
print(str_detect(x, paste0("(", i, ")\\1{", i - 1, "}")))
}
Output:
[1] FALSE FALSE FALSE FALSE TRUE FALSE
[1] FALSE TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
If you wanted to iterate over the strings explicitly (map_lgl is from purrr):
map_lgl(x, function(str) {
any(map_lgl(
2:5,
~ str_detect(str, paste0("(", .x, ")\\1{", .x - 1, "}"))
))
})
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
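If a plain for loop is preferred, the overwriting problem can be avoided by combining each iteration's result with the previous ones. A minimal base R sketch, using grepl with perl = TRUE in place of str_detect:

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Start from all FALSE and OR in each digit's result instead of overwriting.
res <- rep(FALSE, length(x))
for (i in 2:5) {
  hit <- grepl(sprintf("(%d)\\1{%d}", i, i - 1L), x, perl = TRUE)
  res <- res | hit
}
print(res)
```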

Algorithm that returns false If the result has already come out before in R

I want to make an algorithm that returns FALSE if the result has already come out before.
Below I attach the code I'm using, but it does not seem to be correct.
I would appreciate any help.
x1 <- c("LMP","Dp","LMP","LMP","Dp")
x2 <- c("Dp","Dp","LMP","LMP","Dp")
for(i in 1:length(x)){
if(i==1){TRUE}else{
if(length(unique(x[1:i]))==1){FALSE}else{TRUE}}}
# The result that i want is
# for x1:
TRUE, TRUE, FALSE, FALSE, FALSE
# for x2:
TRUE, FALSE, TRUE, FALSE, FALSE
Use duplicated instead of a for loop. duplicated returns TRUE for each element that has already appeared earlier in the vector; negating it with ! turns TRUE into FALSE and vice versa.
!duplicated(x1)
[1] TRUE TRUE FALSE FALSE FALSE
!duplicated(x2)
[1] TRUE FALSE TRUE FALSE FALSE
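To see what the negated duplicated call is doing, the same "first occurrence" flag can be sketched with match, which returns the index of each value's first appearance. first_seen is a hypothetical helper name used only for this illustration.

```r
x1 <- c("LMP", "Dp", "LMP", "LMP", "Dp")
x2 <- c("Dp", "Dp", "LMP", "LMP", "Dp")

# match() gives the index of each value's first occurrence;
# an element is "new" when that index is its own position.
first_seen <- function(v) match(v, v) == seq_along(v)

print(first_seen(x1))
print(first_seen(x2))
```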
It can be done in a for loop as well.
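f1 <- function(vec) {
  # values seen so far; starts empty so that "" is not treated as already seen
  un1 <- character(0)
  out <- logical(length(vec))
  for(i in seq_along(vec)) {
    # first time this value appears: remember it and flag the position
    if(!vec[i] %in% un1) {
      un1 <- c(un1, vec[i])
      out[i] <- TRUE
    }
  }
  out
}
Testing: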
> f1(x1)
[1] TRUE TRUE FALSE FALSE FALSE
> f1(x2)
[1] TRUE FALSE TRUE FALSE FALSE

Resources