R: grep multiple strings at once

I have a data frame with 1 variable and 5,000 rows, where each element is a string.
1. "Am open about my feelings."
2. "Take charge."
3. "Talk to a lot of different people at parties."
4. "Make friends easily."
5. "Never at a loss for words."
6. "Don't talk a lot."
7. "Keep in the background."
.....
5000. "Speak softly."
I need to find and output row numbers that correspond to 3 specific elements.
Currently, I use the following:
grep("Take charge." , df[,1])
grep("Make friends easily.", df[,1])
grep("Speak softly.", df[,1])
And get the following output:
[1] 2
[1] 4
[1] 5000
Question 1. Is there a way to make the syntax more succinct, so that I do not have to repeat grep and df[,1] on every single line?
Question 2. If so, how do I output a single numeric vector of the necessary row positions, so that the result looks something like this?
2, 4, 5000
What I tried so far.
grep("Take charge.", "Make friends easily.", "Speak softly.", df[,1]) # this didn't work
I also tried creating a vector, called m1, that contains all three elements and then calling grep(m1, df[,1]), but this didn't work either.

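Since these are exact matches, use match, where phrases is a character vector of the phrases you want to find:
match(phrases, df[, 1])
grep also works, but only after the phrases are collapsed into a single alternation pattern, because grep uses just the first element of a pattern vector. It matches substrings, so it gives the same rows only when no phrase occurs inside a longer entry. Note also that the trailing . in each phrase is a regex metacharacter that matches any character:
grep(paste(phrases, collapse = "|"), df[, 1])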
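As a minimal sketch of the exact-match approach, the following uses a shortened, made-up version of the data frame (8 rows instead of 5,000; the column name item and the variable names are assumptions for illustration). match returns the first position of each phrase, while which with %in% returns every row that equals any phrase:

```r
# Hypothetical 8-row stand-in for the 5,000-row data frame
df <- data.frame(
  item = c("Am open about my feelings.", "Take charge.",
           "Talk to a lot of different people at parties.",
           "Make friends easily.", "Never at a loss for words.",
           "Don't talk a lot.", "Keep in the background.", "Speak softly."),
  stringsAsFactors = FALSE
)
phrases <- c("Take charge.", "Make friends easily.", "Speak softly.")

# First row position of each phrase, in the order of phrases (NA if absent)
rows_first <- match(phrases, df[, 1])

# All rows whose value equals any phrase, in row order
rows_all <- which(df[, 1] %in% phrases)

print(rows_first)
print(rows_all)
```

If a phrase can occur in more than one row, which(df[, 1] %in% phrases) is probably the safer choice, since match reports only the first occurrence.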

Related

How to filter R dataset by multiple partial match strings, similar to SQL % wildcard? [duplicate]

I have a dataset with a field of interest and a list of strings (several hundred of them).
What I want to do is, for each row of the data, check whether the field contains any of the partial strings.
Essentially, I want to replicate the SQL % wildcard. So, if for example a value is "Game123" and one of my strings is "Ga" I want that to be a match. (But I don't want "OGame" to match "Ga").
I'm hoping to write some statement like this:
df%>%
filter(My_Field contains any one of List_Of_Strings)
How do I fill in that filter statement?
I tried to use the %in% operator but couldn't make it work. I know how to use substrings to check against a single string, but I have a long list of them and need to check all of them.
R filter rows based on multiple partial strings applied to multiple columns: This post is similar to what I'm trying to do, but my list of substrings is 400 plus, so I can't write it all out manually in a grepl statement (I think?)
Since there is no particular dataset or reproducible example, here is one way to implement it with two nested apply-style calls and a regex anchor. Remember that the regex operator ^ anchors a pattern to the beginning of the string, so "^Ga" matches "Game123" but not "OGame".
library(dplyr)
MyField <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")
df <- data.frame(MyField)
ListOfStrings <- c("^Ga", "^Du") # ^ anchors each pattern to the start of the string
# TRUE if entry matches at least one of the patterns
match_s <- function(patterns, entry) {
  any(sapply(patterns, grepl, x = entry))
}
# one logical value per row (sapply rather than lapply, to avoid a list column)
df$match_string <- sapply(df$MyField, match_s, patterns = ListOfStrings, USE.NAMES = FALSE)
df %>% filter(match_string)
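With several hundred prefixes, it may be simpler to build one pattern (or one vectorized test) instead of looping over rows. The sketch below is one possible way, not the answerer's method; the names values, prefixes, hits_regex and hits_fixed are made up for illustration, and the regex variant assumes the prefixes contain no regex metacharacters:

```r
values   <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")
prefixes <- c("Ga", "Du")   # plain prefixes, no ^ needed here

# Variant 1: one anchored alternation, e.g. "^(Ga|Du)"
# (assumes the prefixes contain no regex metacharacters)
pattern    <- paste0("^(", paste(prefixes, collapse = "|"), ")")
hits_regex <- grepl(pattern, values)

# Variant 2: literal prefix test with base R's startsWith(),
# combined across prefixes with an element-wise OR
hits_fixed <- Reduce(`|`, lapply(prefixes, function(p) startsWith(values, p)))

print(values[hits_regex])
print(values[hits_fixed])
```

Either logical vector can be used directly inside filter(), for example filter(df, grepl(pattern, MyField)).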
Here is an approach with dplyr, using the words and sentences vectors that ship with stringr as example data. grepl is combined with \\b so that each word only matches at a word boundary, i.e. at the beginning of a word.
library(stringr)
library(dplyr)
set.seed(22)
tibble(sentences) %>%
rowwise() %>%
filter(any(sapply(words[sample(length(words), 10)], function(x)
grepl(paste0("\\b", x), sentences)))) %>%
ungroup()
# A tibble: 32 × 1
sentences
<chr>
1 It's easy to tell the depth of a well.
2 Kick the ball straight and follow through.
3 A king ruled the state in the early days.
4 March the soldiers past the next hill.
5 The dune rose from the edge of the water.
6 The grass curled around the fence post.
7 Cats and Dogs each hate the other.
8 The harder he tried the less he got done.
9 He knew the skill of the great young actress.
10 The club rented the rink for the fifth night.
# … with 22 more rows
I guess the problem you're facing is this:
You have a list of what could be called key words (what you call "a list of strings") and a vector/column with text (what you call "a field of interest") and your goal is to filter the vector/column on whether or not any of the key words is present. If that's correct the solution might be this:
Data:
a. List of key words:
keys <- c("how", "why", "what")
b. Dataframe with a vector/column of text:
df <- data.frame(
text = c("Hi there", "How are you?", "I'm fine.", "So how's work?", "Ah kinda stressful.", "Why?", "Well you know")
)
Solution:
To filter df on keys in text, you need to convert keys into a regex alternation pattern (by collapsing the strings with |). Depending on your keys, it may be useful or even necessary to also include word boundary markers (\\b), in case the keys should match only as whole words and not when they occur inside other words. Finally, if lower- versus upper-case may be an issue, we can use the case-insensitive flag (?i):
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(text, str_c("(?i)\\b(", str_c(keys, collapse = "|"), ")\\b")))
text
1 How are you?
2 So how's work?
3 Why?
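For comparison, roughly the same filter can be sketched in base R alone. This is an assumed equivalent, not part of the answer above: grepl with ignore.case = TRUE stands in for the (?i) flag, and perl = TRUE is used so that \\b is handled by the PCRE engine. The names text, pattern and hits are illustrative:

```r
keys <- c("how", "why", "what")
text <- c("Hi there", "How are you?", "I'm fine.", "So how's work?",
          "Ah kinda stressful.", "Why?", "Well you know")

# Collapse the keys into one alternation wrapped in word boundaries:
# "\\b(how|why|what)\\b"
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# Case-insensitive match, one logical per element of text
hits <- grepl(pattern, text, ignore.case = TRUE, perl = TRUE)

print(text[hits])
```

If the keys may contain regex metacharacters, they would need to be escaped before being pasted into the pattern.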

Filter rows based on dynamic pattern

I have speech data in a dataframe df, in column Orthographic:
df <- data.frame(
Orthographic = c("this is it at least probably",
"well not probably it's not intuitive",
"sure no it's I mean it's very intuitive",
"I don't mean to be rude but it's anything but you know",
"well okay maybe"),
Repeat = c(NA, "probably", "it's,intuitive", "I,mean,it's", NA),
Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b", "\\b(I,mean|it's)\\b",
NA))
I want to filter rows based on a dynamic pattern, namely the occurrence of no, never, not as words OR n't before any of the words listed in column Repeat. However, using the pattern \\b(no|never|not)\\b|n't\\b\\s together with the alternation patterns in column Repeat_pattern, I get an incomplete result and this warning:
df %>%
filter(grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), Orthographic))
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
Warning message:
In grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), :
argument 'pattern' has length > 1 and only the first element will be used
I don't know why "only the first element will be used" as the two pattern components seem to connect well:
paste0("\\b(no|never|not)\\b|n't\\b\\s", df$Repeat_pattern)
[1] "\\b(no|never|not)\\b|n't\\b\\sNA" "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"
[3] "\\b(no|never|not)\\b|n't\\b\\s\\b(it's|intuitive)\\b" "\\b(no|never|not)\\b|n't\\b\\s\\b(I,mean|it's)\\b"
[5] "\\b(no|never|not)\\b|n't\\b\\sNA"
The expected output is this:
2 well not probably it's not intuitive probably \\b(probably)\\b
3 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
4 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I,mean|it's)\\b
This looks like a vectorization issue: grepl accepts only a single pattern, so you need to use stringr::str_detect, which is vectorized over both the string and the pattern.
Also, the negative word alternatives are not grouped well. All of them must reside in a single group; as written, the top-level | makes only the n't branch continue into the Repeat_pattern part.
In addition, NA values are coerced to the text "NA" and pasted into the regex patterns, while it seems you want to discard the rows where Repeat_pattern is NA.
You can fix your code by using
df %>%
filter(ifelse(is.na(Repeat_pattern), FALSE, str_detect(Orthographic, paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", Repeat_pattern))))
Output:
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
3 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I|mean|it's)\\b
I also think the last pattern must be \\b(I|mean|it's)\\b, not \\b(I,mean|it's)\\b.
If there can only be whitespace between the "no" words and the word from the Repeat column, replace .* with \\s+ in my pattern. I used .* so that the match can occur anywhere to the right of the "no" words.
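If stringr is not available, the row-wise pairing of pattern and string can also be sketched in base R with mapply, which calls grepl once per (pattern, string) pair. This is an assumed alternative rather than the approach above; the vectors orth and rep_pat are stand-ins for the two columns, with the last pattern already corrected to \\b(I|mean|it's)\\b:

```r
orth <- c("this is it at least probably",
          "well not probably it's not intuitive",
          "sure no it's I mean it's very intuitive",
          "I don't mean to be rude but it's anything but you know",
          "well okay maybe")
rep_pat <- c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b",
             "\\b(I|mean|it's)\\b", NA)

# Negation words in a single group, followed by anything, then the row's pattern
neg      <- "(?:\\bno|\\bnever|\\bnot|n't)\\b.*"
patterns <- paste0(neg, rep_pat)

# One grepl call per row; rows with an NA pattern are forced to FALSE
keep <- !is.na(rep_pat) &
  mapply(function(p, x) grepl(p, x, perl = TRUE), patterns, orth,
         USE.NAMES = FALSE)

print(orth[keep])
```

The resulting logical vector can be passed to filter() or used to subset the data frame directly with df[keep, ].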

Counting all the matchings of a pattern in a vector in R

I have a boolean vector in which I want to count the number of occurrences of some patterns.
For example, for the pattern "(1,1)" and the vector "(1,1,1,0,1,1,1)", the answer should be 4.
The only built-in function I found to help is grepRaw, which finds the occurrences of a particular string in a longer string. However, it seems to fail when the sub-strings matching the pattern overlap:
length(grepRaw("11","1110111",all=TRUE))
# [1] 2
Do you have any ideas to obtain the right answer in this case?
Edit 1
I'm afraid that Rich's answer works for the particular example I posted, but fails in a more general setting:
> sum(duplicated(rbind(c(FALSE,FALSE),embed(c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE,TRUE),2))))
[1] 3
In this other example, the expected answer would be 0.
Using the function rollapply from the zoo package, you can slide a window of width 2 over the vector and sum the values in each window. Then count the windows whose sum equals 2, i.e. sum(c(1,1)). Note that this relies on the pattern consisting only of ones.
library(zoo)
z <- c(1,1,1,0,1,1,1)
sum(rollapply(z, 2, sum) == 2)
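For patterns that are not all ones, a more general sketch in base R compares every window against the pattern element by element. The helper name count_pattern is hypothetical; it relies on embed(), which returns each window in reversed column order, hence the column flip:

```r
# Count (possibly overlapping) occurrences of the vector pat inside x
count_pattern <- function(x, pat) {
  n <- length(pat)
  if (length(x) < n) return(0L)
  # embed() puts the most recent value first, so reverse the columns
  windows <- embed(x, n)[, n:1, drop = FALSE]
  sum(apply(windows, 1, function(w) all(w == pat)))
}

z <- c(1, 1, 1, 0, 1, 1, 1)
print(count_pattern(z, c(1, 1)))

b <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
print(count_pattern(b, c(FALSE, FALSE)))
```

This handles the (FALSE, FALSE) case from the edit in the question, where no window matches.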

Multiple Pattern Search - Pick the most hit Line from document

I am trying to search for a list of terms or keywords in a list of sentences. From the list of lines (which are review comments from customers), I want to pick the line that matches the most of my terms or keywords.
At present I am doing this:
mydata<-c("i like this product, awesome",
"i could not go with this produt, since s/w is problem",
"Very good s/w. keep up the good work. i really like it")
terms<-c("really", "good", "like", "product")
termco(mydata, 1:3, terms)
and i get
3 word.count really good like product
1 1 5 0 0 1(20.00%) 1(20.00%)
2 2 11 0 0 0 0
3 3 12 1(8.33%) 2(16.67%) 1(8.33%) 0
I also tried a few other suggestions, but I could not get the result I wanted, even though the solutions themselves are very nice.
My expectation is that only the line or lines containing the maximum number of the terms or keywords I am searching for should be displayed.
In this case I expected the line below, since it contains the most terms, i.e. "really", "good", and "like":
"Very good s/w. keep up the good work. i really like it"
Thanks in advance!
Here is a base R solution using sapply and grepl. The basic idea is to call grepl(term, sentence) for every term in a given sentence. Then, we sum the number of terms that hit for each sentence. Note carefully that we add word boundary markers around each term. This is to prevent false matches where a term happens to be a substring of another word in a sentence.
sapply(mydata, function(x) {
  # number of distinct terms that occur in sentence x as whole words
  sum(sapply(terms, function(y) grepl(paste0("\\b", y, "\\b"), x)))
})
i like this product, awesome
2
i could not go with this product, since s/w is problem
1
Very good s/w. keep up the good work. i really like it
3
stringr's str_count can help as well.
Use str_count to get the total number of matches per element (4 for the last record), and then use which.max to get the index of the maximum. In this case it returns 3, i.e. the third element of the vector mydata:
mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
In case you want whole-word matches with boundary conditions, you may use:
mydata[which.max(stringr::str_count(mydata,paste0("\\b",paste0(terms, collapse="\\b|\\b"),"\\b")))]
In your case both will give the same answer, but the second can give fewer matches. For example, if "keep" is a search term and a sentence contains "keeping", the second regex will not match because it requires a whole word, whereas the first regex will match because no boundary conditions are set.
Output:
> mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
[1] "Very good s/w. keep up the good work. i really like it"
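The same whole-word counting can be sketched without stringr, assuming base R's gregexpr and regmatches are acceptable. The variable names pattern, counts and best are made up for illustration, and mydata is reproduced as given in the question (including the "produt" spelling):

```r
mydata <- c("i like this product, awesome",
            "i could not go with this produt, since s/w is problem",
            "Very good s/w. keep up the good work. i really like it")
terms <- c("really", "good", "like", "product")

# "\\b(really|good|like|product)\\b": any term as a whole word
pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")

# Total number of matches per line (repeated terms are counted each time)
counts <- lengths(regmatches(mydata, gregexpr(pattern, mydata, perl = TRUE)))

# Line with the most matches (the first one, if there is a tie)
best <- mydata[which.max(counts)]

print(counts)
print(best)
```

To return all tied lines instead of only the first, mydata[counts == max(counts)] could be used.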

Counting positive smiles in string using R

In src$Review each row contains text in Russian. I want to count the number of positive smiles in each row. For example, in "My apricot is orange)) (for sure)" I do not want to count every closing bracket (the ordinary brackets in "(for sure)" must be excluded), only the positive smiling characters: runs of at least two closing brackets ("))"), plus ":)" and ":-)". A run of closing brackets only counts if it contains at least two of them.
Assume there is a string "I love this girl!)))) (she makes me happy) every day:):) :-)!" Here we count: )))) (4 units), ":)" (2 units), ":-)" (1 unit), and then add up the units (i.e., 7). Note that the brackets in "(she makes me happy)" are not counted.
Now I have following code in my script:
smilecounts <- str_count(src$Review, "[))]")
As far as I can tell from comparing the data set with the result of this command, it counts only the total number of bracket pairs ("()").
I only need the total number of ":)", ":-)" and "))" to be counted, where "))" stands for the closing brackets that appear in runs. For example, ")))))" contains 5 closing brackets; the condition of at least two closing brackets together is satisfied, so we then count all the brackets in that part of the text (i.e., 5 closing brackets).
Thank you so much for help in advance.
We can use regex lookarounds to extract each ) that follows a ) or ! (first pattern), a : (second pattern) or a :- (third pattern), then use length to get the count.
length(str_extract_all(str1, '(?<=\\)|\\!)\\)')[[1]])
#[1] 4
length(str_extract_all(str1, '(?<=:)\\)')[[1]])
#[1] 2
length(str_extract_all(str1, '(?<=:-)\\)')[[1]])
#[1] 1
Or this can be done using a loop
pat <- c('(?<=\\)|\\!)\\)', '(?<=:)\\)', '(?<=:-)\\)')
sum(sapply(lapply(pat, str_extract_all, string=str1),
function(x) length(unlist(x))))
#[1] 7
data
str1 <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
One way with gregexpr and regmatches:
vec <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
Solution:
#matches the locations of :-) or ))+ or :)
a <- gregexpr(':-)+|))+|:)+', vec)
#extracts those
b <- regmatches(vec, a)[[1]]
b
#[1] "))))" ":)" ":)" ":-)"
#table counts the instances
table(b)
)))) :-) :)
1 1 2
Then I suppose you could count the number of single )s using
nchar(b[1])
[1] 4
Or in a more automated way:
tab <- table(b)
#the following means "if a name of the table consists only of ) then
#count the number of )s"
tab2 <- ifelse(gsub(')','', names(table(b)))=='', nchar(names(table(b))), table(b))
names(tab2) <- names(tab)
> tab2
)))) :-) :)
4 1 2
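The pieces above can also be folded into one vectorized helper, which may be convenient for a whole column such as src$Review. This is only a sketch under the counting rules stated in the question; the function name count_smiles is hypothetical, and an edge case such as ":))" is counted as a single ":)" because the trailing lone bracket is ignored:

```r
# Count smile units per string: every ")" in a run of two or more,
# plus one unit for each ":)" or ":-)"
count_smiles <- function(x) {
  hits <- regmatches(x, gregexpr("\\){2,}|:-?\\)", x, perl = TRUE))
  vapply(hits, function(h) {
    # runs of brackets contribute their length, other smileys contribute 1
    sum(ifelse(grepl("^\\)+$", h), nchar(h), 1L))
  }, numeric(1))
}

str1 <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
print(count_smiles(str1))
print(count_smiles(c("My apricot is orange)) (for sure)", "no smiles (here)")))
```

Because the helper is vectorized, something like src$smiles <- count_smiles(src$Review) should work for the whole column.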
