Pattern matching in R if string NOT followed by another string

I am trying to match the following in R using str_detect from the stringr package.
I want to detect whether a given string is followed or preceded by 'and' or '&'. For example, in:
string_1<-"A and B"
string_2<-"A B"
string_3<-"B and A"
string_4<-"A B and C"
I want str_detect(string_X) to be FALSE for string_1, string_3 and string_4 but TRUE for string_2.
I have tried:
str_detect(string_X, paste0(".*(?<!and |& )", "A")) & str_detect(string_X, paste0(".*", "A", "(?! and| &).*"))
I use paste0 because I want to run this over different strings. This works for all the cases above except string_4. I am new to regex, and this also does not seem very elegant. Is there a more general solution?
Thank you.

First let's combine your four strings into a single vector:
strings <- c(string_1, string_2, string_3, string_4)
Now using
library(stringr)
str_detect(strings, "(A|B)(?=\\s(and|&))", negate = TRUE)
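we look for "A" or "B" followed by a whitespace character and then "and" or "&". Because negate = TRUE inverts the result, strings that contain such a match give FALSE. So this returns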
#> [1] FALSE TRUE FALSE FALSE
You could wrap it into a function:
detector <- function(letters, strings) {
  pattern <- paste0("(", paste0(letters, collapse = "|"), ")(?=\\s(and|&))")
  str_detect(strings, pattern, negate = TRUE)
}
detector(c("A", "B"), strings)
#> [1] FALSE TRUE FALSE FALSE
detector(c("A"), strings)
#> [1] FALSE TRUE TRUE TRUE
detector(c("B"), strings)
#> [1] TRUE TRUE FALSE FALSE
detector(c("C"), strings)
#> [1] TRUE TRUE TRUE TRUE

You can use two negative lookahead assertions to make sure that the string contains neither A or B followed by and or &, nor and or & followed by A or B.
^(?!.*[AB] (?:and|&))(?!.*(?:and|&) [AB])
^ Start of string
(?!.*[AB] (?:and|&)) Assert that the string does not contain A or B followed by either and or &
(?!.*(?:and|&) [AB]) Assert that the string does not contain either and or & followed by either A or B
library(stringr)
string_1<-"A and B"
string_2<-"A B"
string_3<-"B and A"
string_4<-"A B and C"
string_5<-"& B"
strings <- c(string_1, string_2, string_3, string_4, string_5)
str_detect(strings, "^(?!.*[AB] (?:and|&))(?!.*(?:and|&) [AB])")
Output
[1] FALSE TRUE FALSE FALSE FALSE
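
Since the question asks for something that can be run over different strings, the pattern above could be assembled from a vector of terms. The following is only a minimal sketch, not part of the original answer: not_joined is a hypothetical helper name, it uses base R's grepl with perl = TRUE in place of str_detect (PCRE accepts the same lookaheads), and it assumes the terms contain no regex metacharacters.

```r
# Hypothetical helper: TRUE where none of `terms` is directly followed by
# " and" / " &" and none is directly preceded by "and " / "& ".
# Assumption: `terms` contain no regex metacharacters.
not_joined <- function(strings, terms) {
  alt <- paste0("(?:", paste(terms, collapse = "|"), ")")
  pattern <- paste0(
    "^(?!.*", alt, " (?:and|&))",  # no term followed by " and" or " &"
    "(?!.*(?:and|&) ", alt, ")"    # no "and " or "& " followed by a term
  )
  grepl(pattern, strings, perl = TRUE)
}

strings <- c("A and B", "A B", "B and A", "A B and C", "& B")

not_joined(strings, c("A", "B"))
#> [1] FALSE TRUE FALSE FALSE FALSE
not_joined(strings, "A")
#> [1] FALSE TRUE FALSE TRUE TRUE
```

With only "A" as the term, "A B and C" gives TRUE because "A" itself is not adjacent to "and".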

Related

Find string with optional preceding string followed by an optional whitespace, both with negative lookbehind

I am not sure if the title of this question makes sense. I am looking for a string ("string") that can have an optional preceding string ("a"), which may or may not be followed by whitespace. The whole expression should be guarded by a negative lookbehind, so that it does not match when it is preceded by "not ".
My regex starts to fail once I add the negative lookbehind, which makes sense to me, but I wonder how to solve this.
The match can be anywhere in the string; it does not have to be at the start.
x <- c("string not false", "this is not a string", "this is a string", "not a string", "not astring", "a string", "astring", "string")
# all the below fail
grepl("(?<!not\\s{1})a?\\s?string", x, perl = TRUE)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
grepl("(?<!not\\s{1})a\\s?string", x, perl = TRUE)
#> [1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
grepl("(?<!not\\s{1})(\\b|a)\\s?string", x, perl = TRUE)
#> [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# expected output
#> [1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE

Why not avoid the lookbehind and keep it simple, expressing what you want and what you don't want in two separate calls?
grepl("a?\\s?string", x) & !grepl("not\\s?a?\\s?string", x)
#[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Note:
If you really want only one call to grepl, you need to spell out more precisely what you want and what you don't want. Excluding only "not" in the lookbehind is not enough: "not " ("not" followed by a space) has to be excluded there as well. You also need to spell out what you do want in a lookahead, because if the pattern is too flexible (an optional "a", an optional space, etc.), grepl will still find a match somewhere.
The following code (more complicated than two grepl calls, in my opinion) works with your example:
grepl("(?<!(not)|(not ))(?=(^string)|(a string)|(astring))", x, perl=TRUE)
#[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Data:
x <- c("string not false", "this is not a string", "this is a string", "not a string", "not astring", "a string", "astring", "string")
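
The two grepl calls above can also be folded into a single pattern by moving the "don't want" part into a negative lookahead anchored at the start of the string. This is a sketch of that idea rather than the answerer's own code, and it has been checked only against the example data from the question.

```r
x <- c("string not false", "this is not a string", "this is a string",
       "not a string", "not astring", "a string", "astring", "string")

# (?!.*not\\s?a?\\s?string) rejects any string that contains "not",
# an optional space, an optional "a", an optional space, then "string".
# The trailing .*string still requires "string" to occur somewhere.
res <- grepl("^(?!.*not\\s?a?\\s?string).*string", x, perl = TRUE)
res
#> [1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
```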

A grepl solution (it only rejects strings that start with "not", which is enough for the data below):
grepl("^(?!not).*string", x, perl = TRUE)
Alternatively, check out:
library(stringr)
str_detect(x, "\\bnot\\b", negate = TRUE)
[1] TRUE FALSE FALSE TRUE TRUE TRUE
grepl does not allow for pattern negation (but grep does, via invert = TRUE!)
Data:
x <- c("this is a string", "not a string", "not astring", "a string", "astring", "string")

how to filter only vectors that contain all uppercase letters in all the strings in R

I need to filter rows that are uppercase in R. I managed to use the following code:
filter(str_detect(fruit, "^[:upper:]+$"))
However, some of the values of the column "fruit" contain two or three words, and the code above only works when there is a single word. I can't share the data, but this example works for my purposes (only the str_detect part):
fruit <- c("apple", "ORANGE", "kiwi" ,"TWO PEARS", "A BIG PINEAPPLE", "LEMON")
str_detect(fruit, "^[:upper:]+$")
[1] FALSE TRUE FALSE FALSE FALSE TRUE
What I want is to also be able to identify "TWO PEARS" and "A BIG PINEAPPLE". Could you please help me?
Many thanks!

Try to include the space character class.
stringr::str_detect(fruit, "^[[:upper:][:space:]]+$")
#[1] FALSE TRUE FALSE TRUE TRUE TRUE
Following the comment on the question, you can instead exclude lowercase letters:
stringr::str_detect(fruit, "^[^[:lower:]]+$")
#[1] FALSE TRUE FALSE TRUE TRUE TRUE
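
The two patterns agree on the example data but are not equivalent: the first accepts only uppercase letters and whitespace, while the second accepts anything that is not a lowercase letter, including digits and punctuation. A small sketch with made-up values ("KIWI 2" and "LEMON!" are not from the question), using base R's grepl so that it runs without extra packages:

```r
fruit2 <- c("apple", "ORANGE", "TWO PEARS", "KIWI 2", "LEMON!")

# Only uppercase letters and whitespace are allowed.
only_upper_space <- grepl("^[[:upper:][:space:]]+$", fruit2)
only_upper_space
#> [1] FALSE TRUE TRUE FALSE FALSE

# Anything except lowercase letters is allowed.
no_lower <- grepl("^[^[:lower:]]+$", fruit2)
no_lower
#> [1] FALSE TRUE TRUE TRUE TRUE
```

Which of the two is appropriate depends on whether values such as "KIWI 2" should count as uppercase.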

We can use grep in base R
grep("^[A-Z ]+$", fruit, value = TRUE)
#[1] "ORANGE" "TWO PEARS" "A BIG PINEAPPLE" "LEMON"
To get the other elements
grep("^[A-Z ]+$", fruit, value = TRUE, invert = TRUE)
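
Because the question is about filtering rows, here is a minimal base R sketch that applies the same kind of pattern to a data frame column. The data frame df is made up for illustration; only the fruit values come from the question.

```r
# Hypothetical data frame standing in for the asker's data.
df <- data.frame(
  fruit = c("apple", "ORANGE", "kiwi", "TWO PEARS", "A BIG PINEAPPLE", "LEMON"),
  stringsAsFactors = FALSE
)

# Keep only rows whose fruit value consists of uppercase letters and spaces.
keep <- grepl("^[A-Z ]+$", df$fruit)
upper_rows <- df[keep, , drop = FALSE]
upper_rows$fruit
#> [1] "ORANGE" "TWO PEARS" "A BIG PINEAPPLE" "LEMON"
```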

Grep when pattern is found exactly n times [duplicate]

I am looking for a regular expression that captures strings in which a pattern occurs exactly n times. Here is an example with the expected output.
# find sentences with 2 occurrences of the word "is"
z = c("this is what it is and is not", "this is not", "this is it it is")
regex_function(z)
[1] FALSE FALSE TRUE
I have gotten this far:
grepl("(.*\\bis\\b.*){2}",z)
[1] TRUE FALSE TRUE
But this will return TRUE if there are at least 2 matches. How can I force it to look for strings with exactly 2 occurrences?

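To find the strings that contain the word "is" exactly twice, you can remove every "is" with gsub and compare the string lengths with nchar. Each removed "is" shortens the string by 2 characters, so two occurrences give a difference of 4.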
nchar(z) - nchar(gsub("(\\bis\\b)", "", z)) == 4
#[1] FALSE FALSE TRUE
or count the hits of gregexpr like:
sapply(gregexpr("\\bis\\b", z), function(x) sum(x>0)) == 2
#[1] FALSE FALSE TRUE
or with a regex in grepl
grepl("^(?!(.*\\bis\\b){3})(.*\\bis\\b){2}.*$", z, perl=TRUE)
#[1] FALSE FALSE TRUE
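
The single-regex version generalises to any n by building the pattern from the word and the count: forbid n + 1 occurrences with a negative lookahead, then require n. A sketch under the assumption that the word contains no regex metacharacters; exactly_n_regex is a hypothetical helper name.

```r
# Hypothetical helper: build a PCRE pattern matching strings that contain
# `word` (as a whole word) exactly `n` times.
exactly_n_regex <- function(word, n) {
  unit <- paste0("(?:.*\\b", word, "\\b)")
  paste0("^(?!", unit, "{", n + 1, "})", unit, "{", n, "}")
}

z <- c("this is what it is and is not", "this is not", "this is it it is")

grepl(exactly_n_regex("is", 2), z, perl = TRUE)
#> [1] FALSE FALSE TRUE
grepl(exactly_n_regex("is", 3), z, perl = TRUE)
#> [1] TRUE FALSE FALSE
```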

This is an option that works but needs two regex calls. I am still looking for a compact single regex that solves this correctly.
grepl("(.*\\bis\\b.*){2}",z) & !grepl("(.*\\bis\\b.*){3}",z)
Basically, this adds a second grepl for n + 1 occurrences and keeps only the strings that satisfy the first call and do not satisfy the second.

library(stringi)
stri_count_regex(z, "\\bis\\b") == 2L
# [1] FALSE FALSE TRUE

with stringr:
library(stringr)
library(magrittr)
regex_function <- function(str) {
  str_extract_all(str, "\\bis\\b") %>%
    lapply(., function(x) length(x) == 2) %>%
    unlist()
}
> regex_function(z)
[1] FALSE FALSE TRUE
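
The counting approaches above can be wrapped so that the word and the count become parameters. A minimal base R sketch, assuming the word contains no regex metacharacters; has_exactly is a hypothetical name.

```r
# Hypothetical helper: TRUE where `word` occurs as a whole word exactly `n` times.
has_exactly <- function(x, word, n) {
  pattern <- paste0("\\b", word, "\\b")
  # regmatches() returns one character vector of matches per input string;
  # strings without a match yield character(0), so lengths() counts correctly.
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  lengths(hits) == n
}

z <- c("this is what it is and is not", "this is not", "this is it it is")

has_exactly(z, "is", 2)
#> [1] FALSE FALSE TRUE
has_exactly(z, "is", 1)
#> [1] FALSE TRUE FALSE
```

Counting and comparing tends to be easier to read than a lookahead-based regex, at the cost of not being a single pattern.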

How to match strings matching [a-z_]* but with non repetitive symbol "_"

I would like to match strings that:
are composed only of [a-z_];
do not start or end with "_";
do not contain repeated "_" symbols.
So, for example, the expected matching results would be:
"x"; "x_x" > TRUE
"_x"; "x_"; "_x_"; "x__x" > FALSE
My problem is that I can exclude strings starting or ending with "_", but my regexp also excludes strings of length 1.
grepl("^[a-z][a-z_]*[a-z]$", my.string)
My second issue is that I can detect doubled characters with grepl("(_)\\1", my.string), but I don't know how to negate that match or how to integrate it with the first part of my regexp.
If possible I would like to do this with perl = FALSE.

You need to use the following TRE regex:
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
Details:
^ - start of string
[a-z]+ - one or more ASCII letters
(?:_[a-z]+)* - zero or more sequences (*) of:
    _ - an underscore
    [a-z]+ - one or more ASCII letters
$ - end of string.
See R demo:
my.string <- c("x" ,"x_x", "x_x_x_x_x","_x", "x_", "_x_", "x__x")
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
## => [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE

This seems to identify the items correctly:
dat <- c("x" ,"x_x","_x", "x_", "_x_", "x__x")
grep("^_|__|_$", dat, invert=TRUE)
[1] 1 2
So try:
!grepl("^_|__|_$", dat)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
This just uses negation and a pattern with three alternatives separated by the regex alternation (OR) operator "|".

Another regex, which uses grouping with ( ) and the * quantifier for repetition.
myString <- c("x_", "x", "_x", "x_x_x", "x_x", "x__x")
grepl("^([a-z]_)*[a-z]$", myString)
[1] FALSE TRUE FALSE TRUE TRUE FALSE
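So ^([a-z]_)* matches zero or more "[a-z]_" pairs at the beginning of the string, and [a-z]$ ensures that the final character is a lowercase letter. Note that this pattern allows only a single letter between underscores; replacing each [a-z] with [a-z]+ permits longer segments.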
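
The three patterns proposed in the answers agree on the question's examples but differ on other inputs. The sketch below runs all three on a slightly larger test vector; "xx_yy" and "x1" are made-up additions, not from the question.

```r
tests <- c("x", "x_x", "xx_yy", "x1", "_x", "x_", "x__x")

# Positive pattern: letter runs separated by single underscores.
p1 <- grepl("^[a-z]+(?:_[a-z]+)*$", tests)
p1
#> [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE

# Negation only: rejects bad underscores but does not check the alphabet.
p2 <- !grepl("^_|__|_$", tests)
p2
#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE

# Single-letter segments only.
p3 <- grepl("^([a-z]_)*[a-z]$", tests)
p3
#> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
```

Only the first pattern enforces all three of the question's conditions for segments of any length.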

How can I check if multiple strings exist in another string?

I have this string:
myStr <- "I am very beautiful btw"
str <- c("very","beauti","bt")
Now I want to check whether myStr includes all strings in str, how can I do this in R? For example above it should be TRUE.
Many Thanks

Yes, you can use grepl (not grep, actually), but you must run it once for each substring:
> sapply(str, grepl, myStr)
very beauti bt
TRUE TRUE TRUE
To get only one result if all of them are true, use all:
> all(sapply(str, grepl, myStr))
[1] TRUE
Edit:
In case you have more than one string to check, say:
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
You then run the sapply code, which will return a matrix with one row for each string in myStrings. Apply all on each row:
> apply(sapply(str, grepl, myStrings), 1, all)
[1] TRUE FALSE
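
The same check can be written as one helper that works for a single string or a vector of strings. This is a sketch, with contains_all as a hypothetical name; it uses fixed = TRUE so that the substrings are treated literally rather than as regular expressions, which is an assumption about what is wanted here.

```r
# Hypothetical helper: TRUE for each element of `strings` that contains
# every element of `needles` as a literal substring.
contains_all <- function(strings, needles) {
  # One logical vector per needle, combined element-wise with `&`.
  per_needle <- lapply(needles, function(n) grepl(n, strings, fixed = TRUE))
  Reduce(`&`, per_needle)
}

str <- c("very", "beauti", "bt")

contains_all("I am very beautiful btw", str)
#> [1] TRUE
contains_all(c("I am very beautiful btw", "I am not beautiful btw"), str)
#> [1] TRUE FALSE
```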

Using stringr you could do:
str_detect(myStr, str)
Which returns a result for each substring:
#[1] TRUE TRUE TRUE
Or, as per @thelatemail's suggestion, if you want to know whether all of them are true:
all(str_detect(myStr,str))
Which gives:
#[1] TRUE
You could also find the location (start, end) in myStr of the first match of each element of str:
str_locate(myStr, str)
Which gives:
# start end
#[1,] 6 9
#[2,] 11 16
#[3,] 21 22
