This question already has answers here:
r grep by regex - finding a string that contains a sub string exactly one once
(6 answers)
Closed 3 years ago.
I want to check whether a specific number ("2020", the whole number/year) appears twice in a string. I tried this, but it did not work.
Who can help me?
grep(pattern = "2020{2}", x = "DataMW_2029__ForecastMW_2020")
Thank you :-)
You can use gregexpr to test if 2020 appears twice:
length(gregexpr("2020", "DataMW_2029__ForecastMW_2020")[[1]]) == 2
#[1] FALSE
length(gregexpr("2020", "DataMW_2020__ForecastMW_2020")[[1]]) == 2
#[1] TRUE
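One caveat with counting via length(): gregexpr returns -1 when there is no match, so a string with zero occurrences also yields a vector of length 1. For an equality test against 2 that does not matter, but for a general count it does. A minimal sketch of a count helper, assuming the search term is a literal string (count_fixed is a hypothetical name, not part of base R):

```r
# Hypothetical helper: count literal (non-regex) occurrences of `needle`
# in each element of `x`.
count_fixed <- function(x, needle) {
  hits <- gregexpr(needle, x, fixed = TRUE)
  # gregexpr() reports "no match" as -1, so count only positive start positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

x <- c("DataMW_2029__ForecastMW_2020",
       "DataMW_2020__ForecastMW_2020",
       "DataMW_2029__ForecastMW_2029")
count_fixed(x, "2020")
# [1] 1 2 0
count_fixed(x, "2020") == 2
# [1] FALSE  TRUE FALSE
```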
Or with a regex that tests for two or more occurrences:
grepl("(.*2020){2}", "DataMW_2029__ForecastMW_2020")
#[1] FALSE
grepl("(.*2020){2}", "DataMW_2020__ForecastMW_2020")
#[1] TRUE
Or for exactly two hits:
grepl("^(?!(.*2020){3})(.*2020){2}.*$", "DataMW_2029__ForecastMW_2020", perl=TRUE)
#[1] FALSE
grepl("^(?!(.*2020){3})(.*2020){2}.*$", "DataMW_2020__ForecastMW_2020", perl=TRUE)
#[1] TRUE
grepl("^(?!(.*2020){3})(.*2020){2}.*$", "DataMW_2020__ForecastMW_2020_2020", perl=TRUE)
#[1] FALSE
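As for why the pattern in the question fails: a quantifier such as {2} applies only to the element directly before it, so "2020{2}" means "202" followed by two zeros. Grouping with parentheses repeats the whole number, but only when the two copies are adjacent, which is why the answers above allow arbitrary text in between. A short illustration (the test strings are made up for this sketch):

```r
# {2} binds to the preceding "0" only: the pattern matches "202" + "00"
grepl("2020{2}", c("20200", "2020_2020"))
# [1]  TRUE FALSE

# Grouping repeats the whole number, but requires the copies to be adjacent
grepl("(2020){2}", c("20202020", "2020_2020"))
# [1]  TRUE FALSE

# Allowing any text before each copy finds two occurrences anywhere
grepl("(.*2020){2}", c("20202020", "2020_2020"))
# [1] TRUE TRUE
```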
I would use stringr::str_count():
x <- c("DataMW_2029__ForecastMW_2020", "DataMW_2020__ForecastMW_2020")
stringr::str_count(string = x, pattern = "2020")
# [1] 1 2
stringr::str_count(string = x, pattern = "2020") == 2
# [1] FALSE TRUE
Related
I have a dataframe with strings such as these, some of which are existing English words and others which are not:
df <- data.frame(
strings = c("'tis"," &%##","aah", "notexistingword", "823942", "abaxile"))
Now I'd like to check which of them are real words by matching them against a large dictionary such as GradyAugmented:
library(qdapDictionaries)
df$inGrady <- grepl(paste0("\\b(", paste(GradyAugmented[1:2500], collapse = "|"), ")\\b"), df$strings)
df
          strings inGrady
1            'tis    TRUE
2            &%##   FALSE
3             aah    TRUE
4 notexistingword   FALSE
5          823942   FALSE
6         abaxile    TRUE
Unfortunately, this works only as long as I restrict the size of GradyAugmented (the cut-off beyond which it no longer seems to work is around 2500 entries). As soon as I use the whole dictionary, I get an error saying that the regular expression is invalid. My hunch is that the problem is less the regex itself than memory. How can this be resolved?
Are you looking for something like this?
df$inGrady <- df$strings %in% GradyAugmented
#           strings inGrady
# 1            'tis    TRUE
# 2            &%##   FALSE
# 3             aah    TRUE
# 4 notexistingword   FALSE
# 5          823942   FALSE
# 6         abaxile    TRUE
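Note that %in% tests whether the whole string equals a dictionary entry, whereas the original \\b(...)\\b pattern also matched a dictionary word inside a longer string. If that behaviour is needed, one option is to split each string into words and look those up. A minimal sketch, using a tiny made-up dictionary in place of GradyAugmented so that it runs without qdapDictionaries (dict, strings, and the splitting rule are assumptions for illustration):

```r
# Stand-in for GradyAugmented (hypothetical, for illustration only)
dict <- c("aah", "abaxile", "is", "this")
strings <- c("aah", "this is", "notexistingword", "aah xyz", "823942")

# Exact whole-string lookup, as in the answer above
strings %in% dict
# [1]  TRUE FALSE FALSE FALSE FALSE

# Split on anything that is not a letter or apostrophe, then check
# whether at least one resulting word is in the dictionary
words <- strsplit(strings, "[^[:alpha:]']+")
vapply(words, function(w) any(w %in% dict), logical(1))
# [1]  TRUE  TRUE FALSE  TRUE FALSE
```

Both lookups avoid building one very large alternation, so the dictionary size no longer affects the regular expression.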
This question already has an answer here:
How to match a string with a tolerance of one character?
(1 answer)
Closed 2 years ago.
I would like my code below to return TRUE even though the first two letters differ.
Is there a way to accomplish this? I know == does not work, as it compares both strings exactly.
if ("UKVICTORIA" == "USVICTORIA") {
  print("TRUE")
} else {
  print("FALSE")
}
Use agrepl
> agrepl("UKVICTORIA", "USVICTORIA", max.distance = 1)
[1] TRUE
Note that if there is an extra character (Z), it returns FALSE:
> agrepl("UZKVICTORIA", "USVICTORIA", max.distance = 1)
[1] FALSE
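agrepl performs approximate substring matching, so the pattern only has to match some part of the target within the allowed distance. If the two strings should be compared as wholes, the edit distance from utils::adist is an alternative. A sketch, assuming "a tolerance of one character" means a Levenshtein distance of at most 1 (within_tol is a hypothetical helper name):

```r
# Hypothetical helper: TRUE if a and b differ by at most `tol`
# single-character edits (insertion, deletion, or substitution)
within_tol <- function(a, b, tol = 1) {
  as.vector(utils::adist(a, b)) <= tol
}

within_tol("UKVICTORIA", "USVICTORIA")   # one substitution
# [1] TRUE
within_tol("UZKVICTORIA", "USVICTORIA")  # one deletion plus one substitution
# [1] FALSE

# Substring vs whole-string matching: agrepl finds "VICTORIA" inside the
# target, while the whole-string distance is 2 (two missing characters)
agrepl("VICTORIA", "USVICTORIA", max.distance = 1)
# [1] TRUE
within_tol("VICTORIA", "USVICTORIA")
# [1] FALSE
```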
Alternatively, remove the first two characters and check that only one unique value remains:
length(unique(sub(".{2}", "", c("UKVICTORIA", "USVICTORIA")))) == 1
#[1] TRUE
I am looking for a regular expression to capture strings in which a pattern is repeated exactly n times. Here is an example with the expected output.
# find sentences with 2 occurrences of the word "is"
z = c("this is what it is and is not", "this is not", "this is it it is")
regex_function(z)
[1] FALSE FALSE TRUE
I have gotten this far:
grepl("(.*\\bis\\b.*){2}",z)
[1] TRUE FALSE TRUE
But this will return TRUE if there are at least 2 matches. How can I force it to look for strings with exactly 2 occurrences?
To find the strings in which the word "is" occurs exactly twice, you can remove every "is" with gsub and compare the string lengths with nchar. Two removed occurrences of a two-character word shorten the string by 4:
nchar(z) - nchar(gsub("(\\bis\\b)", "", z)) == 4
#[1] FALSE FALSE TRUE
Or count the gregexpr hits:
sapply(gregexpr("\\bis\\b", z), function(x) sum(x>0)) == 2
#[1] FALSE FALSE TRUE
Or with a single regex in grepl:
grepl("^(?!(.*\\bis\\b){3})(.*\\bis\\b){2}.*$", z, perl=TRUE)
#[1] FALSE FALSE TRUE
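The lookahead pattern generalizes to any count: forbid n+1 occurrences, then require n. A minimal sketch that builds the pattern with sprintf (exactly_n_pattern is a hypothetical helper; it assumes sub is already a valid PCRE fragment without top-level alternation and does no escaping):

```r
# Hypothetical helper: build a PCRE pattern matching strings that contain
# `sub` exactly `n` times
exactly_n_pattern <- function(sub, n) {
  n <- as.integer(n)
  # (?!(.*sub){n+1}) rejects strings with more than n occurrences;
  # (.*sub){n} then requires at least n
  sprintf("^(?!(.*%s){%d})(.*%s){%d}.*$", sub, n + 1L, sub, n)
}

z <- c("this is what it is and is not", "this is not", "this is it it is")
grepl(exactly_n_pattern("\\bis\\b", 2), z, perl = TRUE)
# [1] FALSE FALSE  TRUE
grepl(exactly_n_pattern("\\bis\\b", 1), z, perl = TRUE)
# [1] FALSE  TRUE FALSE
```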
This is an option that works but needs two regex calls. I am still looking for a compact single regex that solves this correctly.
grepl("(.*\\bis\\b.*){2}",z) & !grepl("(.*\\bis\\b.*){3}",z)
It adds a second grepl that tests for n+1 occurrences and keeps only the strings that satisfy the first grepl but not the second.
library(stringi)
stri_count_regex(z, "\\bis\\b") == 2L
# [1] FALSE FALSE TRUE
with stringr:
library(stringr)
library(magrittr)
regex_function = function(str){
  str_extract_all(str, "\\bis\\b") %>%
    lapply(., function(x){length(x) == 2}) %>%
    unlist()
}
> regex_function(z)
[1] FALSE FALSE TRUE
Why does regexpr() not find the word foo in this case:
foobar <- data.frame(one=c("foo bar", "foo"))
regexpr("foo",foobar[,1])>1
[1] FALSE FALSE
But does in this case:
foobar <- data.frame(one=c("bar foo", " foo"))
regexpr("foo",foobar[,1])>1
[1] TRUE TRUE
It would be nice if you could give an explanation in addition to a solution.
Thanks a lot.
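The reason is that regexpr returns the starting position of the first match (or -1 if there is none), not a logical value. In the first case, "foo" starts at position 1 in both strings, so the comparison > 1 is FALSE: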
regexpr("foo",foobar[,1])
#[1] 1 1
#attr(,"match.length")
#[1] 3 3
#attr(,"useBytes")
#[1] TRUE
For the second data frame, the positions are both greater than 1, so the comparison is TRUE:
#[1] 5 2
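To test for presence rather than position, compare against 0 (or -1), or use grepl, which returns a logical vector directly. A short sketch with the strings from both examples plus one non-matching string added for illustration:

```r
x <- c("foo bar", "foo", "bar foo", " foo", "bar")

# Start of the first match in each string, -1 where there is none;
# as.integer() drops the match.length attributes for plain printing
pos <- as.integer(regexpr("foo", x))
pos
# [1]  1  1  5  2 -1

pos > 0           # TRUE wherever "foo" occurs, including at position 1
# [1]  TRUE  TRUE  TRUE  TRUE FALSE

grepl("foo", x)   # same result without going through positions
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```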
I have this string:
myStr <- "I am very beautiful btw"
str <- c("very","beauti","bt")
Now I want to check whether myStr contains all the strings in str. How can I do this in R? For the example above, the result should be TRUE.
Many thanks.
Yes, you can use grepl (not grep, actually), but you must run it once for each substring:
> sapply(str, grepl, myStr)
  very beauti     bt
  TRUE   TRUE   TRUE
To get only one result if all of them are true, use all:
> all(sapply(str, grepl, myStr))
[1] TRUE
Edit:
In case you have more than one string to check, say:
myStrings <- c("I am very beautiful btw", "I am not beautiful btw")
You then run the sapply code, which will return a matrix with one row for each string in myStrings. Apply all on each row:
> apply(sapply(str, grepl, myStrings), 1, all)
[1] TRUE FALSE
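The apply(sapply(...)) form relies on sapply returning a matrix, which it does only when myStrings has more than one element; with a single string it returns a plain vector and apply stops with an error. A sketch of a variant that works for any number of strings, assuming the substrings are literal text rather than regular expressions (contains_all is a hypothetical name):

```r
# Hypothetical helper: for each element of `strings`, TRUE if it contains
# every element of `parts` as a literal substring
contains_all <- function(strings, parts) {
  vapply(strings, function(s) {
    # every part must occur somewhere in s; fixed = TRUE disables regex syntax
    all(vapply(parts, grepl, logical(1), x = s, fixed = TRUE))
  }, logical(1), USE.NAMES = FALSE)
}

parts <- c("very", "beauti", "bt")
contains_all("I am very beautiful btw", parts)
# [1] TRUE
contains_all(c("I am very beautiful btw", "I am not beautiful btw"), parts)
# [1]  TRUE FALSE
```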
Using stringr you could do:
library(stringr)
str_detect(myStr, str)
Which returns a result for each substring:
#[1] TRUE TRUE TRUE
Or, as per @thelatemail's suggestion, if you want to know whether all of them are true:
all(str_detect(myStr,str))
Which gives:
#[1] TRUE
You could also find the location (start, end) within myStr of the first match for each element of str:
str_locate(myStr, str)
Which gives:
#     start end
#[1,]     6   9
#[2,]    11  16
#[3,]    21  22