Multiple pattern search - pick the line with the most hits from a document - r

I am trying to search for a list of terms or keywords in a list of sentences (customer review comments). I want to pick the line, out of the list of lines, that matches the most of my terms or keywords.
At present I am doing this:
mydata<-c("i like this product, awesome",
"i could not go with this produt, since s/w is problem",
"Very good s/w. keep up the good work. i really like it")
terms<-c("really, "good", "like", "product")
termco(mydata, 1:3, terms)
and i get
  x word.count   really     good      like   product
1 1          5        0        0 1(20.00%) 1(20.00%)
2 2         11        0        0         0         0
3 3         12 1(8.33%) 2(16.67%) 1(8.33%)         0
I also tried a few other suggestions from HERE, but I could not get the result I wanted (though that solution is very nice).
My expectation is that only the line (or lines) containing the maximum number of the terms or keywords I am searching for should be displayed.
In this case I expected the line below, since it has the maximum number of my terms or keywords present, i.e. "really", "good", and "like":
"Very good s/w. keep up the good work. i really like it"
Thanks in advance!

Here is a base R solution using the apply family (sapply) and grep. The basic idea is to call grep(term, sentence) for every term in a given sentence, then sum the number of matched terms for each sentence. Note carefully that we add word-boundary markers around each term; this prevents false matches where a term happens to be a substring of another word in a sentence.
sapply(mydata, function(x) {
  Reduce("+", sapply(terms, function(y) {
    sum(grep(paste0("\\b", y, "\\b"), x))
  }))
})
i like this product, awesome
2
i could not go with this product, since s/w is problem
1
Very good s/w. keep up the good work. i really like it
3
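If you only want the line (or lines) with the most hits, as the question asks, one option (a small sketch building on the counts above, not part of the original answer) is to keep the counts and index mydata with which.max:

hits <- sapply(mydata, function(x) {
  Reduce("+", sapply(terms, function(y) {
    sum(grep(paste0("\\b", y, "\\b"), x))
  }))
})
mydata[which.max(hits)]     # the single best-matching line
mydata[hits == max(hits)]   # all lines tied for the most hits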

Using stringr's str_count can help as well:
Use str_count to get the counts of all matches (4 in total for the last record), then use which.max to get the index of the maximum within the vector (in this case it returns 3, meaning the third element of mydata):
mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
In case you want an exact whole-word match with boundary conditions, you may use:
mydata[which.max(stringr::str_count(mydata, paste0("\\b", paste0(terms, collapse = "\\b|\\b"), "\\b")))]
In your case both give the same answer; however, the second will produce fewer matches in general. E.g. when a sentence contains "keeping" instead of "keep", the latter regex will not match, because it requires a whole word, whereas the former regex will match since no boundary conditions are set.
Output:
> mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
[1] "Very good s/w. keep up the good work. i really like it"

Related

Filter rows based on dynamic pattern

I have speech data in a data frame df, in column Orthographic:
df <- data.frame(
  Orthographic = c("this is it at least probably",
                   "well not probably it's not intuitive",
                   "sure no it's I mean it's very intuitive",
                   "I don't mean to be rude but it's anything but you know",
                   "well okay maybe"),
  Repeat = c(NA, "probably", "it's,intuitive", "I,mean,it's", NA),
  Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b", "\\b(I,mean|it's)\\b", NA)
)
I want to filter rows based on a dynamic pattern, namely the occurrence of no, never, or not as words, OR n't, before any of the words listed in column Repeat. However, using the pattern \\b(no|never|not)\\b|n't\\b\\s together with the alternation patterns in column Repeat_pattern, I get this warning and an incomplete result:
df %>%
  filter(grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), Orthographic))
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
Warning message:
In grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), :
argument 'pattern' has length > 1 and only the first element will be used
I don't know why "only the first element will be used" as the two pattern components seem to connect well:
paste0("\\b(no|never|not)\\b|n't\\b\\s", df$Repeat_pattern)
[1] "\\b(no|never|not)\\b|n't\\b\\sNA" "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"
[3] "\\b(no|never|not)\\b|n't\\b\\s\\b(it's|intuitive)\\b" "\\b(no|never|not)\\b|n't\\b\\s\\b(I,mean|it's)\\b"
[5] "\\b(no|never|not)\\b|n't\\b\\sNA"
The expected output is this:
2 well not probably it's not intuitive probably \\b(probably)\\b
3 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
4 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I,mean|it's)\\b
It looks like a vectorization issue here: you need to use stringr::str_detect rather than grepl, because grepl only uses a length-one pattern while str_detect is vectorized over both string and pattern.
Also, you did not group the negative word alternatives well: all of them must reside in a single group (otherwise only the n't alternative is actually combined with the Repeat_pattern part).
Also, NA values are coerced to text and appended to the regex patterns, whereas it seems you want to discard the rows where Repeat_pattern is NA.
You can fix your code by using
library(dplyr)
library(stringr)

df %>%
  filter(ifelse(is.na(Repeat_pattern), FALSE,
                str_detect(Orthographic,
                           paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", Repeat_pattern))))
Output:
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
3 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I|mean|it's)\\b
I also think the last pattern should be \\b(I|mean|it's)\\b, not \\b(I,mean|it's)\\b.
If there can only be whitespace between the "no" words and the word from the Repeat column, replace .* with \\s+ in my pattern. I used .* to allow the match to occur anywhere to the right of the "no" words.
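For illustration, a sketch of that stricter whitespace-only variant (same df as above; this is an assumption about the desired behaviour, not part of the original answer):

df %>%
  filter(ifelse(is.na(Repeat_pattern), FALSE,
                str_detect(Orthographic,
                           paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b\\s+", Repeat_pattern))))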

R: grep multiple strings at once

I have a data frame with 1 variable and 5,000 rows, where each element is a string.
1. "Am open about my feelings."
2. "Take charge."
3. "Talk to a lot of different people at parties."
4. "Make friends easily."
5. "Never at a loss for words."
6. "Don't talk a lot."
7. "Keep in the background."
.....
5000. "Speak softly."
I need to find and output row numbers that correspond to 3 specific elements.
Currently, I use the following:
grep("Take charge." , df[,1])
grep("Make friends easily.", df[,1])
grep("Make friends easily.", df[,1])
And get the following output:
[1] 2
[1] 4
[1] 5000
Question 1: Is there a way to make the syntax more succinct, so I do not have to repeat grep and df[,1] on every single line?
Question 2: If so, how do I output a single numeric vector of the necessary row positions, so the result would look something like this?
2, 4, 5000
What I tried so far:
grep("Take charge.", "Make friends easily.", "Speak softly.",
     df[,1]) # this didn't work
I also tried creating a vector, called m1, containing all three elements, and then grep(m1, df[,1]) # this didn't work either
Since these are exact matches, use this, where phrases is a character vector of the phrases you want to match:
match(phrases, df[, 1])
This also works, provided no phrase is a substring of another phrase:
grep(paste(phrases, collapse = "|"), df[, 1])
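For example (a quick sketch; phrases just holds the three target strings from the question):

phrases <- c("Take charge.", "Make friends easily.", "Speak softly.")
match(phrases, df[, 1])
# should give 2 4 5000 for the data described above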

General function to get the frequency of certain word in a string

I am trying to write a function that gets the frequency of a specific word in some text, and then to use this function to calculate the frequency of a selected word for each row of a data frame.
So far I have created a function that takes a string and a pattern as input (i.e. str, pattern). Since grep captures all occurrences of the pattern in the string, I figured length would take care of counting how often the selected pattern appears.
word_count = function(str, pattern) {
  string = gsub("[[:punct:]]", "", strsplit(str, " "))
  x = grep("pattern", string, value = TRUE)
  return(length(x))
}
My data frame (my_df) looks like this:
id description
123 "It is cozy and pretty comfy. I think you will have good time
here."
232 "NOT RECOMMENDED whatsover. You will suffer here."
3333 "BEACHES are awesome overhere!! Highly recommended!!"
...and so forth (more than 15,000 observations)
I have actually converted all the descriptions to lower case, so it actually looks more like this:
id description
123 "it is cozy and pretty comfy. i think you will have good time
here."
232 "not recommended whatsover. you will suffer here."
3333 "beaches are awesome overhere!! highly recommended!!"
...and so forth (more than 15,000 observations)
Here is what I really want my function to do:
word_count(my_df$description[1],recommended)
[1] 0
word_count(my_df$description[3],highly)
[1] 1
But what it is doing:
word_count(my_df$description[1],recommended)
[1] 2
word_count(my_df$description[3],highly)
[1] 2
It is essentially returning the wrong answer. Eventually I want to apply this function to all the rows in the data frame (I was planning to do so using if), but while testing on individual rows it doesn't do what I want.
You can change the function to
word_count = function(str, pattern) {
  sum(grepl(pattern, strsplit(str, " ")[[1]]))
}
We first split the string on spaces (" "), then search for pattern in every word using grepl. As grepl returns TRUE/FALSE values, we can directly use sum to count the number of times the pattern occurred.
Then when you try the function it will return you the expected output.
word_count(df$description[1],"recommended")
#[1] 0
word_count(df$description[3],"highly")
#[1] 1
However, note that there is a str_count function in stringr which can directly give you the number of occurrences for every row:
stringr::str_count(df$description, "recommended")
#[1] 0 1 1
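If you do want to apply your word_count function to every row yourself, one simple option (a small sketch; the new column name is just an example) is sapply:

my_df$recommended_count <- sapply(my_df$description, word_count, pattern = "recommended")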

How to create a word grouping report using R language and .Net?

I would like to create a simple application in C# that takes in a group of words, then returns all groupings of those individual words from a data set.
For example, given car and bike, return a list of groups/combinations of words (with the number of combinations found) from a data set.
To further clarify - given a category named "car", I would like to see a list of word groupings with the word "car". This category could also be several words rather than just one.
With a sample data set of:
CAR:
Another car for sale
Blue car on the horizon
For Sale - used car
this car is painted blue
should return
car : for sale : 2
car : blue : 2
I'd like to set a threshold, say 20 or greater, so if there are over 20 instances of the word(s) with car, then display them - category, words, count, where only category is known; words and count is determined by the algorithm.
The data set is in a SQL Server 2008 table, and I was hoping to use something like a .Net implementation of R to accomplish this.
I am guessing that the best way to accomplish this may be with the R programming language, and am only now looking at R.Net.
I would prefer to do this with .Net, as that is what I am most familiar with, but open to suggestions.
Can someone with some experience with this lead me in the right direction?
Thanks.
It seems your question consists of 4 parts:
1. Getting data from SQL Server 2008
2. Extracting substrings from a set of strings
3. Setting a threshold for when to accept that number
4. Producing some document or other output (?) containing this.
For 1, I think that's a different question (see the RODBC package), so I won't deal with it here, as it's not the main part of your question. You've left 4 a little vague, and I think that's also peripheral to the meat of your question.
Part 2 can be easily dealt with using regular expressions:
countstring <- function(string, pattern){
  stringcount <- sum(grepl(pattern, string, ignore.case = TRUE), na.rm = TRUE)
  paste(deparse(substitute(string)), pattern, stringcount, sep = " : ")
}
This function takes a vector of strings and a pattern to search for. It finds which strings match and sums the number that do (i.e. the count), then pastes the vector's name, the pattern, and the count together into one string. For example:
car <- c("Another car for sale", "Blue car on the horizon", "For Sale - used car", "this car is painted blue")
countstring(car, "blue")
## [1] "car : blue : 2"
Part 3 requires a small change to the function
countstring <- function(string, pattern, threshold = 20){
  stringcount <- sum(grepl(pattern, string, ignore.case = TRUE), na.rm = TRUE)
  if(stringcount >= threshold){
    paste(deparse(substitute(string)), pattern, stringcount, sep = " : ")
  }
}
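With the small sample above, the default threshold of 20 suppresses the output, while a lower threshold prints it (a quick illustration, not part of the original answer):

countstring(car, "blue")                 # returns NULL: only 2 matches, below the default threshold of 20
countstring(car, "blue", threshold = 2)  # "car : blue : 2"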

Counting syllables

I'm looking to assign some different readability scores to text in R, such as Flesch-Kincaid.
Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves, just a count.
so for instance:
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
would yield:
1, 1, 2, 2, 1, 3
Each number corresponding to the number of syllables in the word.
qdap version 1.1.0 does this task:
library(qdap)
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
syllable_sum(x)
## [1] 1 1 2 2 1 3
gsk3 is correct: if you want a correct solution, it is non-trivial.
For example, you have to watch out for strange things like a silent e at the end of a word (e.g. pane), or know when it's not silent, as in finale.
However, if you just want a quick-and-dirty approximation, this will do it:
> nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", tolower( x ))))
[1] 1 1 2 2 1 3
To understand how the parts work, just strip away the function calls from the outside in, starting with nchar and then gsub, etc., until the expression makes sense to you.
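Spelled out step by step (a small sketch of the same expression, evaluated from the inside out):

w <- tolower(x)                  # "dog" "cat" "pony" "cracker" "shoe" "popsicle"
v <- gsub("[aeiouy]+", "X", w)   # each run of vowels collapses to a single X
v2 <- gsub("[^X]", "", v)        # drop everything that is not an X
nchar(v2)                        # count the X's = vowel groups, roughly syllables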
But my guess is that, weighing R's power against the profusion of exceptions in the English language, you could get a decent answer (maybe 99% right?) parsing normal text without a lot of work - heck, the simple parser above may get 90%+ right. With a little more work, you could deal with silent e's if you like.
It all depends on your application - whether this is good enough or you need something more accurate.
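For instance, a minimal sketch of one way to handle the common silent final e (a rough heuristic, not the answer's code; words like finale will still be undercounted):

syll_approx <- function(words) {
  w <- tolower(words)
  counts <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", w)))  # vowel groups as before
  silent_e <- grepl("[^aeiouy]e$", w) & counts > 1              # consonant + final e, e.g. "pane"
  pmax(counts - silent_e, 1)                                    # never go below 1 syllable
}
syll_approx(c("dog", "pane", "pony", "cracker"))
# 1 1 2 2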
Some tools for NLP are available here:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
The task is non-trivial though. More hints (including an algorithm you could implement) here:
Detecting syllables in a word
The koRpus package will help you out immensely, but it's a little difficult to work with.
stopifnot(require(koRpus))
# 'text' below is assumed to be your text (a character vector or file) to score
tokens <- tokenize(text, format="obj", lang='en')
flesch.kincaid(tokens)

Resources