How to use grepl on only part of a string - r

Vector<-c("Consider criterion1, criterion2, criterion3, stop considering criterion1,criterion2, criterion3")
Vector2<-c("Consider criterion2, criterion3, stop considering criterion1,criterion2, criterion3")
grepl("criterion1",Vector)
[1] TRUE
For this second vector I want FALSE, because I would like to ignore all characters after the word stop (as written, the call below actually returns TRUE, since criterion1 appears after stop):
grepl("criterion1",Vector2)
[1] FALSE

A few ways to tackle this:
You could remove everything after stop by using sub, to ensure that you only check before stop
grepl('criterion1', sub('stop.*', '', Vector))
[1] TRUE
grepl('criterion1', sub('stop.*', '', Vector2))
[1] FALSE
Or you could change the pattern altogether to ensure there is no stop before the value being checked.
grepl('^((?!stop).)*criterion1', Vector, perl = TRUE)
[1] TRUE
grepl('^((?!stop).)*criterion1', Vector2, perl = TRUE)
[1] FALSE
Note that grepl is vectorized over x, so we could simply do:
grepl('^((?!stop).)*criterion1', c(Vector, Vector2), perl = TRUE)
[1] TRUE FALSE
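The sub-based approach above can be wrapped in a small function. This is only a sketch: grepl_before and its stop_word argument are hypothetical names introduced here, and stop_word is treated as a regular expression, so special characters in it would need escaping.

```r
# Hypothetical helper: TRUE where `pattern` occurs before the first `stop_word`
grepl_before <- function(pattern, x, stop_word = "stop") {
  # Drop the stop word and everything after it; strings without it stay unchanged
  head_part <- sub(paste0(stop_word, ".*"), "", x)
  grepl(pattern, head_part)
}

Vector  <- "Consider criterion1, criterion2, criterion3, stop considering criterion1,criterion2, criterion3"
Vector2 <- "Consider criterion2, criterion3, stop considering criterion1,criterion2, criterion3"

res <- grepl_before("criterion1", c(Vector, Vector2))
res
# expected: TRUE FALSE
```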

Related

stringr::str_starts returns TRUE when it shouldn't

I am trying to detect whether a string starts with either of the provided strings (separated by | )
name = "KKSWAP"
stringr::str_starts(name, "RTT|SWAP")
returns TRUE, but
str_starts(name, "SWAP|RTT")
returns FALSE
This behaviour seems wrong, as KKSWAP doesn't start with "RTT" or "SWAP". I would expect this to be false in both above cases.
The reason can be found in the code of the function :
function (string, pattern, negate = FALSE)
{
    switch(type(pattern),
        empty = ,
        bound = stop("boundary() patterns are not supported."),
        fixed = stri_startswith_fixed(string, pattern, negate = negate,
            opts_fixed = opts(pattern)),
        coll = stri_startswith_coll(string, pattern, negate = negate,
            opts_collator = opts(pattern)),
        regex = {
            pattern2 <- paste0("^", pattern)
            attributes(pattern2) <- attributes(pattern)
            str_detect(string, pattern2, negate)
        })
}
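As you can see, it pastes '^' in front of the pattern, so in your example it looks for '^RTT|SWAP'. The anchor applies only to the first alternative, so the pattern means "starts with RTT, or contains SWAP anywhere", and it finds 'SWAP'.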
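The effect of pasting '^' in front of an alternation can be reproduced in base R without stringr. The sketch below assumes nothing beyond grepl; the variable names are illustrative. Wrapping the alternatives in a group makes the anchor apply to both of them.

```r
name <- "KKSWAP"
pattern <- "RTT|SWAP"

# What the pasted pattern looks like: "^RTT|SWAP"
anchored_naive <- paste0("^", pattern)
# Grouping first keeps the anchor attached to every alternative: "^(?:RTT|SWAP)"
anchored_grouped <- paste0("^(?:", pattern, ")")

naive   <- grepl(anchored_naive, name, perl = TRUE)    # matches SWAP anywhere
grouped <- grepl(anchored_grouped, name, perl = TRUE)  # requires a prefix match

c(naive = naive, grouped = grouped)
# expected: naive TRUE, grouped FALSE
```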
If you want to look at more than one pattern you should use a vector:
name <- "KKSWAP"
stringr::str_starts(name, c("RTT","SWAP"))
# [1] FALSE FALSE
If you want just one answer, you can combine with any()
name <- "KKSWAP"
any(stringr::str_starts(name, c("RTT","SWAP")))
# [1] FALSE
The advantage of stringr::str_starts() is the vectorisation of the pattern argument, but if you don't need it grepl('^RTT|^SWAP', name), as suggested by TTS, is a good base R alternative.
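Alternatively, the base function startsWith() suggested by jpsmith is vectorized over its prefix argument. Note that it compares prefixes literally rather than as regular expressions, so "RTT|SWAP" is tested as one literal prefix: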
startsWith(name, c("RTT","SWAP"))
# [1] FALSE FALSE
startsWith(name, "RTT|SWAP")
# [1] FALSE
I'm not familiar with the stringr version, but the base R version startsWith returns FALSE for your examples (it treats the prefix as a literal string, not as a regular expression). If you don't have to use stringr, this may be a solution:
startsWith(name, "RTT|SWAP")
startsWith(name, "SWAP|RTT")
startsWith(name, "KK")
# > startsWith(name, "RTT|SWAP")
# [1] FALSE
# > startsWith(name, "SWAP|RTT")
# [1] FALSE
# > startsWith(name, "KK")
# [1] TRUE
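A quick check of how startsWith() handles the | character, using a made-up second name SWAPKK that does start with SWAP. As far as base R documents it, the prefix is compared literally, so the literal form returns FALSE even here, while the vector form and an anchored regex both detect the prefix.

```r
other <- "SWAPKK"  # hypothetical name that really starts with "SWAP"

# The whole text "RTT|SWAP" is compared as one literal prefix
literal_prefix <- startsWith(other, "RTT|SWAP")

# One prefix per element, combined into a single answer with any()
vector_prefix <- any(startsWith(other, c("RTT", "SWAP")))

# Regex alternative with the anchor applied to both options
regex_prefix <- grepl("^(RTT|SWAP)", other)

c(literal = literal_prefix, vector = vector_prefix, regex = regex_prefix)
# expected: literal FALSE, vector TRUE, regex TRUE
```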
The help text describes str_starts as: "Detect the presence or absence of a pattern at the beginning or end of a string." This might be why it's not behaving quite as expected.
The pattern argument is described as the "Pattern with which the string starts or ends."
We can add ^ regex to make it search at the beginning of string and get the expected result.
name = 'KKSWAP'
str_starts(name, '^RTT|^SWAP')
I would prefer grepl in this instance because it seems less misleading.
grepl('^RTT|^SWAP', name)

regex a before b with and without whitespaces

I am trying to find all the strings that include 'true' when there is no 'act' before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works, i.e. SKIP the match when 'act' appears as a preceding substring, but match true otherwise:
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
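.: matches any character
(?<!act): negative lookbehind; fails if the three characters just consumed are act
act: matches act (inside the lookbehind)
*?: repeats .(?<!act) between 0 and unlimited times, as few times as possible (lazy)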
true: matches true
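To see the pattern at work on the example vector, the sketch below runs it and compares the result with a rough position-based check (first position of true versus first position of act). The cross-check is an assumption added here for illustration; it agrees on these five strings but is not equivalent to the regex in every edge case.

```r
vector <- c("true", "trueact", "acttrue", "act true", "act really true")

# Regex from the answer: no complete "act" may occur before the matched "true"
pattern <- "^(.(?<!act))*?true"
res <- grepl(pattern, vector, perl = TRUE, ignore.case = TRUE)

# Rough cross-check: position of the first "true" and the first "act" (-1 if absent)
first_true <- as.integer(regexpr("true", vector, ignore.case = TRUE))
first_act  <- as.integer(regexpr("act", vector, ignore.case = TRUE))
check <- first_true > 0 & (first_act < 0 | first_act > first_true)

setNames(res, vector)
# expected values: TRUE TRUE FALSE FALSE FALSE
```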

R: Return a vector using %in% operator

I have a vector with delimiters and I want to generate a vector of the same length with boolean values based on whether or not one of the delimited values contains what I am after. I cannot find a way to do this neatly in vector-based logic. As an example:
x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')
Using some magic asking whether 'a' %in% x, I want to get the vector:
TRUE, TRUE, FALSE, FALSE, TRUE, FALSE
I initially tried the following:
'a' %in% trimws(strsplit(x, ';'))
But this unexpectedly collapses the entire list and returns TRUE, rather than a vector, since one of the elements in x is 'a'. Is there a way to get the vector I am looking for without rewriting the code into a for-loop?
Update: To consider white spaces:
library(stringr)
x <- str_replace_all(string=x, pattern=" ", replacement="")
x
[1] "a" "a;b" "ab;c" "b;c" "c;a" "c"
str_detect(x, 'a$|a;')
[1] TRUE TRUE FALSE FALSE TRUE FALSE
First answer:
If you want to use str_detect, we have to account for a followed by the delimiter ; (or a at the end of the string):
library(stringr)
str_detect(x, 'a$|a;')
[1] TRUE TRUE FALSE FALSE TRUE FALSE
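Base R:
grepl("a$|a;", x)
or (when you explicitly want to use %in%), split on the delimiter rather than into single characters, so that "ab" is not counted as "a":
sapply(strsplit(x, ";\\s*"), function(parts) "a" %in% parts)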
When working with strings and letters I always use the great library stringr
library(stringr)
x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')
str_detect(x, "\\ba\\b") # word boundaries keep "ab" from matching
If you would like to use %in%, here is a base R option
> mapply(`%in%`, list("a"), strsplit(x, ";\\s+"))
[1] TRUE TRUE FALSE FALSE TRUE FALSE
A more efficient way might be using grepl like below
> grepl("\\ba\\b",x)
[1] TRUE TRUE FALSE FALSE TRUE FALSE
You can read each item separately with scan, trim leading and trailing whitespace as you attempted, and test each resulting character vector in turn with:
sapply(x, function(x){"a" %in% trimws( scan( text=x, what="",sep=";", quiet=TRUE))})
a a; b ab; c b; c c; a c
TRUE TRUE FALSE FALSE TRUE FALSE
The top row of the result is just the names and would not affect a logical test that depended on this result. There is an unname function if needed.
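The split-then-test idea used in several answers above can be collected into one function. This is a minimal sketch: has_token and its arguments are hypothetical names, and the separator is matched literally (fixed = TRUE) rather than as a regex.

```r
# Hypothetical helper: does each element of `x` contain `token` as a whole delimited item?
has_token <- function(x, token, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)  # one character vector per element of x
  # Trim surrounding whitespace so " b" and "b" compare equal, then test membership
  vapply(parts, function(p) token %in% trimws(p), logical(1))
}

x <- c('a', 'a; b', 'ab; c', 'b; c', 'c; a', 'c')
res <- has_token(x, "a")
res
# expected: TRUE TRUE FALSE FALSE TRUE FALSE
```

Using vapply with logical(1) guarantees a logical vector of the same length as x, which is what the question asks for.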

how do you do the equivalent of Excel's AND() and OR() operations in R?

drives_DF$block_device == ""
[1] TRUE TRUE TRUE FALSE TRUE
How do I reduce this down to a single FALSE like doing an AND() in Excel?
How do I reduce this down to a single TRUE like doing an OR() in Excel?
Wrapping your code with all() will return TRUE if all evaluated elements are TRUE
all(drives_DF$block_device == "")
[1] FALSE
Wrapping your code with any() will return TRUE if at least one of the evaluated elements is TRUE
any(drives_DF$block_device == "")
[1] TRUE
You can use the any and all functions available in R to get the required result, like this:
# Considering a vector of boolean values
boolVector <- c(FALSE, TRUE, FALSE, TRUE, FALSE)
print(all(boolVector, na.rm = FALSE)) # AND operation
print(any(boolVector, na.rm = FALSE)) # OR operation
The output of the print statements is:
[1] FALSE
[1] TRUE
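A self-contained version of the example, assuming a made-up drives_DF whose block_device column reproduces the TRUE TRUE TRUE FALSE TRUE pattern from the question. It also shows how missing values affect all(), which is where na.rm becomes relevant.

```r
# Hypothetical data frame standing in for the one in the question
drives_DF <- data.frame(block_device = c("", "", "", "sda", ""),
                        stringsAsFactors = FALSE)

is_empty <- drives_DF$block_device == ""

all_empty <- all(is_empty)  # like Excel's AND(): FALSE, one device is set
any_empty <- any(is_empty)  # like Excel's OR(): TRUE

# With a missing value the answer can be unknown unless NAs are dropped
with_na <- c(TRUE, NA, TRUE)
all_keep_na <- all(with_na)                # NA
all_drop_na <- all(with_na, na.rm = TRUE)  # TRUE
```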

How to apply list of regex pattern on list

I have a list of strings and a list of patterns
like:
links <- c(
"http://www.google.com"
,"google.com"
,"www.google.com"
,"http://google.com"
,"http://google.com/"
,"www.google.com/#"
,"www.google.com/xpto"
,"http://google.com/xpto"
,"http://google.com/xpto&utml"
,"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA")
patterns <- c(".com$","/$")
What I want is to remove all links that match any of these patterns,
and get this result:
"www.google.com/#"
"www.google.com/xpto"
"http://google.com/xpto"
"http://google.com/xpto&utml"
"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
if i use
x<-lapply (patterns, grepl, links)
i get
[[1]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which gives me these two results:
> links[!x[[2]]]
[1] "http://www.google.com" "google.com"
[3] "www.google.com" "http://google.com"
[5] "www.google.com/#" "www.google.com/xpto"
[7] "http://google.com/xpto" "http://google.com/xpto&utml"
[9] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
> links[!x[[1]]]
[1] "http://google.com/" "www.google.com/#"
[3] "www.google.com/xpto" "http://google.com/xpto"
[5] "http://google.com/xpto&utml" "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
In this case each result has only one pattern removed, but I want a single result with all patterns removed. How can I apply all the regexes to get one result, or somehow merge the n boolean vectors, choosing TRUE whenever any of them is TRUE?
Like:
b <- list(
c(TRUE,FALSE,FALSE,TRUE,FALSE),
c(FALSE,FALSE,TRUE,TRUE,FALSE),
c(FALSE,FALSE,FALSE,FALSE,FALSE))
res <- somefunction(b)
res
TRUE,FALSE,TRUE,TRUE,FALSE
In most cases the best solution will be to merge the regular expression patterns, and to apply a single pattern search, as shown in Thomas’ answer.
However, it is also trivial to merge logical vectors by combining them with logical operations. In your case, you want to compute the member-wise logical disjunction. Between two vectors, this can be computed as x | y. Between a list of multiple vectors, it can be computed using Reduce(`|`, logical_list).
In your case, this results in:
any_matching = Reduce(`|`, lapply(patterns, grepl, links))
result = links[! any_matching]
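Applied to the three example vectors from the question (stored here as a list named b, which is an assumption about how they would be held), Reduce gives the requested merge. The second computation is just an alternative sketch that binds the vectors into a matrix and checks each row for any TRUE.

```r
b <- list(
  c(TRUE,  FALSE, FALSE, TRUE,  FALSE),
  c(FALSE, FALSE, TRUE,  TRUE,  FALSE),
  c(FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Element-wise OR folded across the list: ((b[[1]] | b[[2]]) | b[[3]])
res <- Reduce(`|`, b)

# Alternative: one column per vector, then test whether each row has any TRUE
res_matrix <- rowSums(do.call(cbind, b)) > 0

res
# expected: TRUE FALSE TRUE TRUE FALSE
```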
This should do what you want:
links[!sapply("(\\.com|/)$", grepl, links)]
Explanation:
You can use sapply so you get a vector and not a list
I'd use the pattern "(\\.com|/)$" (i.e. ends with .com OR /).
In the end I negate the resulting boolean vector using !.
You can try the base R code below, using grep
r <- grep(paste0(patterns,collapse = "|"),links,value = TRUE,invert = TRUE)
such that
> r
[1] "www.google.com/#"
[2] "www.google.com/xpto"
[3] "http://google.com/xpto"
[4] "http://google.com/xpto&utml"
[5] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
You can do this using stringr::str_subset() function.
library(stringr)
str_subset(links, pattern = ".com$|/$", negate = TRUE)
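One detail worth checking when merging the patterns: an unescaped dot matches any character, so ".com$" also matches strings that merely end in "com" after some other character. The sketch below uses a made-up URL ending in "telecom" to show the difference; it does not affect the result for the links in the question.

```r
# The second URL is invented for illustration
urls <- c("http://google.com", "http://example.org/telecom")

unescaped <- grepl(".com$", urls)    # "." matches the "e" in "telecom"
escaped   <- grepl("\\.com$", urls)  # requires a literal dot before "com"

rbind(unescaped, escaped)
# expected: unescaped TRUE TRUE; escaped TRUE FALSE
```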

Resources