This question already has answers here:
r grep by regex - finding a string that contains a sub string exactly one once
(6 answers)
Closed 3 years ago.
I am looking for a regex expression to capture strings where the pattern is repeated n times. Here is an example with expected output.
# find sentences with 2 occurrences of the word "is"
z = c("this is what it is and is not", "this is not", "this is it it is")
regex_function(z)
[1] FALSE FALSE TRUE
I have gotten this far:
grepl("(.*\\bis\\b.*){2}",z)
[1] TRUE FALSE TRUE
But this will return TRUE if there are at least 2 matches. How can I force it to look for strings with exactly 2 occurrences?
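To find the strings in which the word "is" occurs exactly two times, you can remove every "is" with gsub and compare the string lengths with nchar. Each removed "is" shortens the string by 2 characters, so two occurrences give a difference of 4.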
nchar(z) - nchar(gsub("(\\bis\\b)", "", z)) == 4
#[1] FALSE FALSE TRUE
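The same subtraction can be generalised to any word by dividing the length difference by the word's length. A minimal sketch, assuming the word contains no regex metacharacters; count_word is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: count whole-word occurrences by measuring how much
# shorter each string becomes once the word is removed.
# Assumption: `word` is plain text without regex metacharacters.
count_word <- function(x, word) {
  pattern <- paste0("\\b", word, "\\b")
  stripped <- gsub(pattern, "", x, perl = TRUE)
  (nchar(x) - nchar(stripped)) / nchar(word)
}

z <- c("this is what it is and is not", "this is not", "this is it it is")
count_word(z, "is")       # occurrences per string
count_word(z, "is") == 2  # strings with exactly two occurrences
```

Returning the count rather than a fixed comparison leaves the choice of n to the caller.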
or count the hits of gregexpr like:
sapply(gregexpr("\\bis\\b", z), function(x) sum(x>0)) == 2
#[1] FALSE FALSE TRUE
or with a regex in grepl
grepl("^(?!(.*\\bis\\b){3})(.*\\bis\\b){2}.*$", z, perl=TRUE)
#[1] FALSE FALSE TRUE
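The lookahead pattern can also be built for an arbitrary n: a negative lookahead rejects strings with n + 1 occurrences, and the rest of the pattern requires n. This is only a sketch under assumptions: exactly_n is a hypothetical name, and the word is assumed to contain no regex metacharacters.

```r
# Hypothetical helper: TRUE where `word` occurs exactly `n` times as a whole word.
# Assumption: `word` contains no regex metacharacters.
exactly_n <- function(x, word, n) {
  unit <- sprintf("(?:.*\\b%s\\b)", word)
  # (?!unit{n+1}) forbids n + 1 occurrences; unit{n} then requires n of them.
  pattern <- sprintf("^(?!%s{%d})%s{%d}", unit, n + 1L, unit, n)
  grepl(pattern, x, perl = TRUE)
}

z <- c("this is what it is and is not", "this is not", "this is it it is")
exactly_n(z, "is", 2)
```

perl = TRUE is needed because the default regex engine does not support lookaheads.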
This is an option that works but needs 2 regex calls. I am still looking for a compact regex call which correctly solves this issue.
grepl("(.*\\bis\\b.*){2}",z) & !grepl("(.*\\bis\\b.*){3}",z)
Basically, this adds a second grepl call for n + 1 occurrences and keeps only the strings that match the first pattern but not the second.
library(stringi)
stri_count_regex(z, "\\bis\\b") == 2L
# [1] FALSE FALSE TRUE
with stringr:
library(stringr)
library(magrittr)
regex_function = function(str){
  str_extract_all(str,"\\bis\\b")%>%
    lapply(.,function(x){length(x) == 2}) %>%
    unlist()
}
> regex_function(z)
[1] FALSE FALSE TRUE
Related
I am trying to match the following in R using str_detect from the stringr package.
I want to detect whether a given string is followed or preceded by 'and' or '&'. For example, in:
string_1<-"A and B"
string_2<-"A B"
string_3<-"B and A"
string_4<-"A B and C"
I want str_detect(string_X) to be FALSE for string_1, string_3 and string_4 but TRUE for string_2.
I have tried:
str_detect(string_X,paste0(".*(?<!and |& )","A"))==TRUE & str_detect(string_X,paste0(".*","A","(?! and| &).*"))==TRUE
I use paste0 because I want to run this over different strings. This works for all the cases above except string_4. I am new to regex, and it also does not seem very elegant. Is there a more general solution?
Thank you.
First let's combine your four strings into a single vector:
strings <- c(string_1, string_2, string_3, string_4)
Now using
library(stringr)
str_detect(strings, "(A|B)(?=\\s(and|&))", negate = TRUE)
we look for "A" or "B" followed by "and" or "&". So this returns
#> [1] FALSE TRUE FALSE FALSE
You could wrap it into a function:
detector <- function(letters, strings) {
  pattern <- paste0("(", paste0(letters, collapse = "|"), ")(?=\\s(and|&))")
  str_detect(strings, pattern, negate = TRUE)
}
detector(c("A", "B"), strings)
#> [1] FALSE TRUE FALSE FALSE
detector(c("A"), strings)
#> [1] FALSE TRUE TRUE TRUE
detector(c("B"), strings)
#> [1] TRUE TRUE FALSE FALSE
detector(c("C"), strings)
#> [1] TRUE TRUE TRUE TRUE
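The function above only checks the "followed by" direction. A base R sketch that also covers "preceded by", as the question asks, could look like the following. The names detector_both and targets are hypothetical, and the targets are assumed to be plain text without regex metacharacters.

```r
# Hypothetical base R variant: TRUE when none of the targets is directly
# followed or preceded by "and" or "&" (separated by one whitespace character).
# Assumption: `targets` contain no regex metacharacters.
detector_both <- function(targets, strings) {
  alt  <- paste0("(?:", paste(targets, collapse = "|"), ")")
  conj <- "(?:and|&)"
  # target + conjunction, or conjunction + target
  pattern <- paste0(alt, "\\s", conj, "|", conj, "\\s", alt)
  !grepl(pattern, strings, perl = TRUE)
}

strings <- c("A and B", "A B", "B and A", "A B and C")
detector_both(c("A", "B"), strings)
detector_both("C", strings)
```

Note that this sketch does not add word boundaries, so a target embedded in a longer word (or "and" at the end of a longer word) would also count as a hit.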
You can use two negative lookahead assertions to make sure that the string contains neither an A or B followed by and or &, nor the same in the other order.
^(?!.*[AB] (?:and|&))(?!.*(?:and|&) [AB])
^ Start of string
(?!.*[AB] (?:and|&)) Assert that the string does not contain A or B followed by either and or &
(?!.*(?:and|&) [AB]) Assert that the string does not contain either and or & followed by either A or B
library(stringr)
string_1<-"A and B"
string_2<-"A B"
string_3<-"B and A"
string_4<-"A B and C"
string_5<-"& B"
strings <- c(string_1, string_2, string_3, string_4, string_5)
str_detect(strings, "^(?!.*[AB] (?:and|&))(?!.*(?:and|&) [AB])")
Output
[1] FALSE TRUE FALSE FALSE FALSE
I want to check whether a specific number ("2020", the whole number/year) appears twice in a string. I tried this, but it did not work.
Who can help me?
grep(pattern = "2020{2}", x = "DataMW_2029__ForecastMW_2020")
Thank you :-)
You can use gregexpr to test if 2020 appears twice:
length(gregexpr("2020", "DataMW_2029__ForecastMW_2020")[[1]]) == 2
#[1] FALSE
length(gregexpr("2020", "DataMW_2020__ForecastMW_2020")[[1]]) == 2
#[1] TRUE
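One caveat with the length() test, offered as a sketch rather than a correction: gregexpr returns a single -1 when nothing matches, so length() is 1 for both zero and one occurrence. That does not affect the == 2 comparison, but it matters when testing for other counts. Counting only the positive positions avoids the ambiguity; count_fixed is a hypothetical helper name.

```r
# Hypothetical helper: count literal (non-regex) occurrences of `needle`.
# gregexpr() reports "no match" as a single -1, so only positive
# start positions are counted.
count_fixed <- function(x, needle) {
  hits <- gregexpr(needle, x, fixed = TRUE)
  vapply(hits, function(m) sum(m > 0L), integer(1))
}

x <- c("DataMW_2029__ForecastMW_2020", "DataMW_2020__ForecastMW_2020", "DataMW")
count_fixed(x, "2020")
count_fixed(x, "2020") == 2
```

fixed = TRUE treats the needle as a literal string, which is enough for a plain year like 2020.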
Or with a regex testing for 2 or more hits:
grepl("(.*2020){2}", "DataMW_2029__ForecastMW_2020")
#[1] FALSE
grepl("(.*2020){2}", "DataMW_2020__ForecastMW_2020")
#[1] TRUE
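As a side note on why the pattern from the question fails: a quantifier such as {2} applies only to the single item directly before it, so "2020{2}" means "202" followed by two zeros. Grouping makes the quantifier apply to the whole number, although even then the two copies must be adjacent. A small illustration using made-up input strings:

```r
# "2020{2}" is "202" followed by "0" twice, i.e. the literal text "20200".
grepl("2020{2}", "20200")                          # matches "20200"
grepl("2020{2}", "DataMW_2020__ForecastMW_2020")   # no "20200" in here

# A group makes {2} apply to all of "2020", but only for adjacent copies.
grepl("(2020){2}", "20202020")
grepl("(2020){2}", "DataMW_2020__ForecastMW_2020")

# Allowing arbitrary text before each copy handles separated occurrences.
grepl("(.*2020){2}", "DataMW_2020__ForecastMW_2020")
```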
or for exactly 2 hits:
grepl("^(?!(.*2020){3})(.*2020){2}.*$", "DataMW_2029__ForecastMW_2020", perl=TRUE)
#[1] FALSE
grepl("^(?!(.*2020){3})(.*2020){2}.*$", "DataMW_2020__ForecastMW_2020", perl=TRUE)
#[1] TRUE
grepl("^(?!(.*2020){3})(.*2020){2}.*$", "DataMW_2020__ForecastMW_2020_2020", perl=TRUE)
#[1] FALSE
I would use stringr::str_count():
x <- c("DataMW_2029__ForecastMW_2020", "DataMW_2020__ForecastMW_2020")
stringr::str_count(string = x, pattern = "2020")
# [1] 1 2
stringr::str_count(string = x, pattern = "2020") == 2
# [1] FALSE TRUE
This question already has answers here:
Test for numeric elements in a character string
(6 answers)
Closed 7 years ago.
I have tried the following, however, it goes wrong when the string contains any other character, say a space. As you can see below, there is a string called "subway 10", which does contain numeric characters, however, it is reported as false because of the space.
My string may contain any other character, but if it contains at least a single digit, I would like to get the indices of those strings from the array.
> mywords<- c("harry","met","sally","subway 10","1800Movies","12345")
> numbers <- grepl("^[[:digit:]]+$", mywords)
> letters <- grepl("^[[:alpha:]]+$", mywords)
> both <- grepl("^[[:digit:][:alpha:]]+$", mywords)
>
> mywords[xor((letters | numbers), both)] # letters & numbers mixed
[1] "1800Movies"
using \\d works for me:
grepl("\\d", mywords)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
so does [[:digit:]]:
grepl("[[:digit:]]", mywords)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
As @nrussel mentioned, your patterns test whether the strings contain only digits from the beginning ^ of the string to the end $.
You could also check whether the strings contain something other than letters, using ^ inside brackets to negate the letters, but then "something else" is not limited to digits (the space in "subway 10" would match as well):
grepl("[^a-zA-Z]", mywords)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
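Since the question asks for the indices of the matching strings rather than a logical vector, grep (or which on the grepl result) returns them directly. A short sketch using the vector from the question:

```r
mywords <- c("harry", "met", "sally", "subway 10", "1800Movies", "12345")

# grep() returns the positions of the elements that contain a digit.
idx <- grep("[[:digit:]]", mywords)
idx

# Equivalent, starting from the logical vector produced by grepl().
which(grepl("[[:digit:]]", mywords))

# value = TRUE returns the matching strings instead of their positions.
grep("[[:digit:]]", mywords, value = TRUE)
```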
I have this string:
myStr <- "I am very beautiful btw"
str <- c("very","beauti","bt")
Now I want to check whether myStr includes all the strings in str. How can I do this in R? For the example above, the result should be TRUE.
Many Thanks
Yes, you can use grepl (not grep, actually), but you must run it once for each substring:
> sapply(str, grepl, myStr)
very beauti bt
TRUE TRUE TRUE
To get only one result if all of them are true, use all:
> all(sapply(str, grepl, myStr))
[1] TRUE
Edit:
In case you have more than one string to check, say:
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
You then run the sapply code, which will return a matrix with one row for each string in myStrings. Apply all on each row:
> apply(sapply(str, grepl, myStrings), 1, all)
[1] TRUE FALSE
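The apply/sapply combination relies on sapply returning a matrix, which it does not do when there is only one string to check. A sketch that behaves the same for one or many strings is shown below; contains_all is a hypothetical helper name, and fixed = TRUE is an assumption that the substrings should be matched literally rather than as regular expressions.

```r
# Hypothetical helper: TRUE for each haystack that contains every needle.
# Assumption: needles are literal substrings (hence fixed = TRUE).
contains_all <- function(haystacks, needles) {
  # One logical vector per needle, each as long as `haystacks`.
  hits <- lapply(needles, grepl, x = haystacks, fixed = TRUE)
  # Element-wise AND across all needles.
  Reduce("&", hits)
}

str <- c("very", "beauti", "bt")
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
contains_all(myStrings, str)
contains_all("I am very beautiful btw", str)
```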
Using stringr you could do:
str_detect(myStr, str)
Which returns a result for each substring:
#[1] TRUE TRUE TRUE
Or, as per @thelatemail's suggestion, if you want to know whether all of them are true:
all(str_detect(myStr,str))
Which gives:
#[1] TRUE
You could also find the location (start, end) in myStr of the first match of each element of str:
str_locate(myStr, str)
Which gives:
# start end
#[1,] 6 9
#[2,] 11 16
#[3,] 21 22
I have a vector of three-character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*")
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?
I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
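As for why the glob attempt in the question returns TRUE everywhere: in glob syntax * stands for any run of characters, including none, so "*W*" only asks whether a W occurs anywhere. The glob wildcard for exactly one character is ?, so a glob version of the same test can be sketched like this:

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# "*W*" means "W anywhere", which every element satisfies.
grepl(glob2rx("*W*"), example)

# "?" matches exactly one character, so "?W*" pins W to position 2.
grepl(glob2rx("?W*"), example)

# One more "?" moves the test to position 3.
grepl(glob2rx("??W*"), example)
```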
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE
If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
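The substr test extends directly to longer strings and other positions, which covers the follow-up about 5-character strings. A minimal sketch; has_char_at is a hypothetical helper name and the sample vector is made up for illustration:

```r
# Hypothetical helper: TRUE where the character at position `pos` equals `ch`.
has_char_at <- function(x, ch, pos) {
  substr(x, pos, pos) == ch
}

# Made-up 5-character strings, checking the third character.
five <- c("ABWDE", "WWAWW", "XYWZZ")
has_char_at(five, "W", 3)

# Strings shorter than `pos` yield "" from substr(), so they compare as FALSE.
has_char_at("AW", "W", 3)
```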