I am trying to find all the strings that include 'true' when there is no 'act' before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works: SKIP the match when 'act' appears as a preceding substring, but match 'true' otherwise.
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
(?<!): negative lookbehind
act: matches act
*?: matches .(?<!act) between 0 and unlimited times, as few times as possible (lazy)
true: matches true
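As a cross-check, the same idea can also be written with a negative lookahead placed before each consumed character instead of a lookbehind placed after it. The sketch below is only an illustration on the example vector; the names tempered_behind and tempered_ahead are made up here and are not part of the answer above.

```r
vector <- c("true", "trueact", "acttrue", "act true", "act really true")

# Lookbehind form from above: after each consumed character,
# the text consumed so far must not end in "act"
tempered_behind <- grepl("^(.(?<!act))*?true", vector,
                         perl = TRUE, ignore.case = TRUE)

# Lookahead form: before each consumed character,
# "act" must not start at the current position
tempered_ahead <- grepl("^((?!act).)*true", vector,
                        perl = TRUE, ignore.case = TRUE)

print(tempered_behind)
print(identical(tempered_behind, tempered_ahead))
```

On this vector both forms should agree with the desired result; for other inputs it is worth testing both before relying on either.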
Vector<-c("Consider criterion1, criterion2, criterion3, stop considering criterion1,criterion2, criterion3")
Vector2<-c("Consider criterion2, criterion3, stop considering criterion1,criterion2, criterion3")
grepl("criterion1",Vector)
[1] TRUE
For this second vector I want to get FALSE, because I would like to ignore all characters after the word stop
grepl("criterion1",Vector2)
[1] FALSE
A few ways to tackle this:
You could remove everything after stop by using sub, to ensure that you only check before stop
grepl('criterion1', sub('stop.*', '', Vector))
[1] TRUE
grepl('criterion1', sub('stop.*', '', Vector2))
[1] FALSE
Or you could change the pattern altogether to ensure there is no stop before the value being checked.
grepl('^((?!stop).)*criterion1', Vector, perl = TRUE)
[1] TRUE
grepl('^((?!stop).)*criterion1', Vector2, perl = TRUE)
[1] FALSE
Note that grepl is vectorized over x, so we could simply do:
grepl('^((?!stop).)*criterion1', c(Vector, Vector2), perl = TRUE)
[1] TRUE FALSE
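If this check is needed in several places, the sub-based approach can be wrapped in a small helper. This is only a sketch: the function name detect_before_stop and its arguments are hypothetical, and it assumes the target should be matched as a literal string rather than as a regex.

```r
Vector <- c("Consider criterion1, criterion2, criterion3, stop considering criterion1,criterion2, criterion3")
Vector2 <- c("Consider criterion2, criterion3, stop considering criterion1,criterion2, criterion3")

# Hypothetical helper: look for `target` only in the text before `stop_word`
detect_before_stop <- function(x, target, stop_word = "stop") {
  # Drop the first occurrence of the stop word and everything after it
  head_part <- sub(paste0(stop_word, ".*"), "", x)
  # fixed = TRUE treats `target` as a literal string
  grepl(target, head_part, fixed = TRUE)
}

result <- detect_before_stop(c(Vector, Vector2), "criterion1")
print(result)
```

Strings without the stop word are left unchanged by sub, so they are searched in full.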
This question is a spin-off from the question "Function to count of consecutive digits in a string vector".
Assume I have strings such as x:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
and want to detect those strings where any digit between 2 and 5 occurs in immediate succession as many times as its own value. That is, match if the string contains 22, 333, 4444, or 55555.
If I approach this task in small chunks using backreference, everything is fine:
str_detect(x, "(2)\\1{1}")
[1] FALSE FALSE FALSE FALSE TRUE FALSE
str_detect(x, "(3)\\1{2}")
[1] FALSE TRUE FALSE FALSE FALSE FALSE
str_detect(x, "(4)\\1{3}")
[1] FALSE FALSE FALSE FALSE FALSE TRUE
However, if I pursue a single solution for all matches using a vector with the allowed numbers:
digits <- 2:5
and an alternation pattern, such as this:
patt <- paste0("(", digits, ")\\1{", digits - 1, "}", collapse = "|")
patt
[1] "(2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}"
and input patt into str_detect, this only detects the first alternative, namely (2)\\1{1}:
str_detect(x, patt)
[1] FALSE FALSE FALSE FALSE TRUE FALSE
Is it the backreference which cannot be used in alternation patterns? If so, then why does a for loop iterating through each option separately not work either?
res <- c()
for(i in 2:5){
res <- str_detect(x, paste0("(", i, ")\\1{", i - 1, "}"))
}
res
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Advice on this matter is greatly appreciated!
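In your pattern (2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}, every alternative uses \\1, and \\1 always refers to the first capture group, (2). In the other alternatives that group did not take part in the match, so the backreference cannot match. That is why you only match the first alternative.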
You could repeat the next capture group instead as there are multiple groups.
(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}
The (2)\\1{1} can be just (2)\\1, but this is fine as you are assembling the pattern dynamically.
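The group numbering can be seen on a two-alternative version of the pattern. The sketch below uses base grepl with perl = TRUE instead of str_detect (an assumption made here so that it runs without extra packages); the names same_group and own_group are made up for illustration.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Both alternatives refer to group 1; in the second alternative
# group 1 never took part in the match, so "333" is not found
same_group <- grepl("(2)\\1{1}|(3)\\1{2}", x, perl = TRUE)

# Each alternative refers to its own capture group
own_group <- grepl("(2)\\1{1}|(3)\\2{2}", x, perl = TRUE)

print(same_group)
print(own_group)
```

Only the second pattern should flag "57333" in addition to "22144".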
What about this?
> grepl(
+ paste0(sapply(2:5, function(i) sprintf("(%s)\\%s{%s}", i, i - 1, i - 1)), collapse = "|"),
+ x
+ )
[1] FALSE TRUE FALSE FALSE TRUE TRUE
or
> rowSums(sapply(2:5, function(i) grepl(sprintf("(%s)\\1{%s}", i, i - 1), x))) > 0
[1] FALSE TRUE FALSE FALSE TRUE TRUE
As mentioned in the comments, you need to update the regex:
patt = paste0(
"(", digits, ")\\", digits - 1, "{", digits - 1, "}",
collapse = "|"
)
str_detect(x, patt)
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
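Since the digits are known in advance, one possible alternative is to avoid backreferences altogether and spell out the repeated digits with strrep. This is a sketch of a different technique, not the approach from the answers above; the name patt_literal is made up here.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Repeat each digit as many times as its own value: "22", "333", "4444", "55555"
runs <- strrep(as.character(digits), digits)

# Join the literal runs into one alternation pattern
patt_literal <- paste(runs, collapse = "|")

detected <- grepl(patt_literal, x)
print(patt_literal)
print(detected)
```

Because the pattern contains no capture groups, there is no group numbering to keep in sync.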
In your for loop, you are replacing res on each iteration, so when you print res at the end you only see the result for i = 5. If you use print() instead:
for(i in 2:5){
print(str_detect(x, paste0("(", i, ")\\1{", i - 1, "}")))
}
Output:
[1] FALSE FALSE FALSE FALSE TRUE FALSE
[1] FALSE TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
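To keep the for loop and still end up with one combined result, the per-digit results can be accumulated with a logical OR instead of being overwritten. A minimal sketch, using base grepl with perl = TRUE in place of str_detect (an assumption made here so that it runs without stringr):

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Start with all FALSE and switch an element to TRUE
# as soon as any of the per-digit patterns matches it
res <- logical(length(x))
for (i in 2:5) {
  res <- res | grepl(paste0("(", i, ")\\1{", i - 1, "}"), x, perl = TRUE)
}
print(res)
```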
If you wanted to iterate instead, for example with map_lgl from purrr (this needs library(purrr) and library(stringr)):
map_lgl(x, function(str) {
any(map_lgl(
2:5,
~ str_detect(str, paste0("(", .x, ")\\1{", .x - 1, "}"))
))
})
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
I am not sure if the title of this question makes sense. I am looking for a string ("string") which can have an optional preceding string ("a"), which may or may not be followed by a whitespace. All this should be combined with a negative lookbehind that applies to the entire following expression.
My regex starts to fail with the negative lookbehind, which makes sense to me, and I wonder how to solve this.
This can be anywhere, and does not have to be at the start.
x <- c("string not false", "this is not a string", "this is a string", "not a string", "not astring", "a string", "astring", "string")
# all the below fail
grepl("(?<!not\\s{1})a?\\s?string", x, perl = TRUE)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
grepl("(?<!not\\s{1})a\\s?string", x, perl = TRUE)
#> [1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
grepl("(?<!not\\s{1})(\\b|a)\\s?string", x, perl = TRUE)
#> [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# expected output
#> [1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Why not avoid lookbehind and keep it simple, asking what you want and what you don't want in two separate calls?
grepl("a?\\s?string", x) & !grepl("not\\s?a?\\s?string", x)
#[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Note:
If you really want only one call to grepl, you need to spell out more precisely what you want and what you don't want. A lookbehind that rejects only "not" but not "not " ("not" followed by a space) is not enough, so both forms have to go into the lookbehind. You also need to state what you want in a lookahead: if the pattern is too flexible (an optional "a", an optional space, and so on), grepl will still find a match somewhere in the string.
The following code (more complicated than 2 grepl calls imo) works with your example:
grepl("(?<!(not)|(not ))(?=(^string)|(a string)|(astring))", x, perl=TRUE)
#[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
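To see why an overly flexible pattern still matches, it can help to extract what the first attempt from the question actually matched. The sketch below uses regmatches with regexpr on two of the strings that should have been rejected; the variable names are made up for illustration.

```r
y <- c("not a string", "not astring")
flexible <- "(?<!not\\s{1})a?\\s?string"

# Extract the first match in each string
matched <- regmatches(y, regexpr(flexible, y, perl = TRUE))
print(matched)
```

Because "a" and the space are optional, the engine can start the match one character later, where the text just before it is no longer "not " and the lookbehind passes.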
Data:
x <- c("string not false", "this is not a string", "this is a string", "not a string", "not astring", "a string", "astring", "string")
A grepl solution:
grepl("^(?!not).*string", x, perl = TRUE)
Alternatively, check out:
library(stringr)
str_detect(x, "\\bnot\\b", negate = TRUE)
[1] TRUE FALSE FALSE TRUE TRUE TRUE
grepl does not allow for pattern negation (but grep does, via invert = TRUE!)
Data:
x <- c("this is a string", "not a string", "not astring", "a string", "astring", "string")
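For completeness, the negation can be sketched in base R in two ways: with grep and invert = TRUE, which returns the non-matching elements (or their indices), and by negating the logical vector from grepl. The names kept and keep_flag are made up for this sketch.

```r
x <- c("this is a string", "not a string", "not astring", "a string", "astring", "string")

# grep with invert = TRUE returns the elements that do NOT match
kept <- grep("\\bnot\\b", x, perl = TRUE, invert = TRUE, value = TRUE)

# grepl has no invert argument, but its result can simply be negated
keep_flag <- !grepl("\\bnot\\b", x, perl = TRUE)

print(kept)
print(keep_flag)
```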
I got this vector:
bar <- c("aaa:something", "111:something", "a1a1:something", "1a:something")
I want to check whether there are both letters and numbers before the colon (:). There can be arbitrarily many, but both need to be in there, so the result should be
FALSE, FALSE, TRUE, TRUE
How can I do that?
Assuming the numbers and letters can be in any order, you can do the following (the alternation is grouped so that the colon applies to both alternatives):
grepl('([a-zA-Z]+[0-9]+|[0-9]+[a-zA-Z]+)[a-zA-Z0-9]*:', bar)
#[1] FALSE FALSE TRUE TRUE
You can combine two grepl like:
grepl("[[:digit:]].*:", bar) & grepl("[[:alpha:]].*:", bar)
#[1] FALSE FALSE TRUE TRUE
#grepl("[0-9].*:", bar) & grepl("[a-zA-Z].*:", bar) #Alternative
To do it in one go you can use a lookahead, which does not consume any characters:
grepl("(?=.*[[:digit:]]).*[[:alpha:]].*:", bar, perl=TRUE)
#[1] FALSE FALSE TRUE TRUE
grepl("[a-z]+\\d+.*\\:|\\d+[a-z]+.*\\:", bar, ignore.case = TRUE)
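Another way to read the requirement is to cut off the part before the first colon and test that part on its own. The sketch below is an illustration of that idea, not one of the answers above; the names prefix and has_both are made up here.

```r
bar <- c("aaa:something", "111:something", "a1a1:something", "1a:something")

# Keep only the text before the first colon
prefix <- sub(":.*", "", bar)

# The prefix must contain at least one letter and at least one digit
has_both <- grepl("[[:alpha:]]", prefix) & grepl("[[:digit:]]", prefix)

print(prefix)
print(has_both)
```

Working on the prefix alone means that letters or digits after the colon cannot influence the result.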
I want to check whether the numbers I have in the list match a specific format (nnn.nnn.nnnn). I am expecting the code to return a logical vector (FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), but the last element returns TRUE when I want it to be FALSE.
library(stringr)
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222',
'222 333 4444', '2345.234.2345')
str_detect(numbers, "[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}")
If I use:
str_detect(numbers, "[:digit:]{4}\\.[:digit:]{3}\\.[:digit:]{4}")
I get (FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), so I know the pattern for the exact match works, but I am not sure why the first block of code returns TRUE for the last element when there are 4 digits and not 3 before the first '.'
It is because that last value has '345.234.2345' at the end, and your pattern does not require the match to start at the beginning of the string and finish at its end.
Try this pattern:
"^[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}$"
If you want to match the number when it appears inside a longer string, set off by a space or by the start or end of the string, a more general pattern is:
"(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)"
Testing:
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222',
'222 333 4444', '2345.234.2345', "interior test 456.456.4566 other",
'456.456.4566 beginning test', "end test 456.456.4566")
str_detect(numbers, "(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)")
#[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
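The effect of the anchors can be checked directly by comparing the unanchored and anchored patterns on the original six values and by extracting what the unanchored pattern matched in the last one. This sketch uses base grepl with double-bracketed classes (an assumption made here so that it runs without stringr); the variable names are made up for illustration.

```r
numbers <- c("571-566-6666", "456.456.4566", "apple", "222.222.2222",
             "222 333 4444", "2345.234.2345")
core <- "[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}"

# Without anchors the pattern may match anywhere inside the string
unanchored <- grepl(core, numbers)

# With ^ and $ the whole string has to fit the format
anchored <- grepl(paste0("^", core, "$"), numbers)

# What the unanchored pattern found in the last element
partial <- regmatches(numbers[6], regexpr(core, numbers[6]))

print(unanchored)
print(anchored)
print(partial)
```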
And as Wictor points out, you could also use the word boundary operator, as long as you double the backslash in R patterns.
grepl("\\b[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}\\b", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Caveat: the stringr functions (which, if I remember correctly, are based on stringi functions) appear to differ from the "ordinary" R regex functions in that they allow the special character classes to be used without double bracketing.
grepl("(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)", numbers)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
grepl("(^|[ ])[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}([ ]|$)", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
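Apparently this is because the regex engine behind stringr (ICU, via stringi) accepts [:digit:] on its own as a character set, whereas base R reads a bare [:digit:] as a bracket list of the literal characters ':', 'd', 'i', 'g' and 't'.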
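A small base-R check hints at why the single-bracket form fails in grepl: with the default regex engine, a bare [:digit:] appears to be read as an ordinary bracket list of the characters ':', 'd', 'i', 'g' and 't', not as the digit class. This is a sketch under that assumption; the test strings are made up for illustration.

```r
probe <- c("5", "d", ":")

# Single brackets: treated as a set of the literal characters : d i g t
bare <- grepl("[:digit:]", probe)

# Double brackets: the POSIX digit class inside a bracket list
proper <- grepl("[[:digit:]]", probe)

print(bare)
print(proper)
```

If the assumption holds, the digit "5" is matched only by the double-bracketed form.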