Filter rows based on dynamic pattern - r

I have speech data in a dataframe df, in column Orthographic:
df <- data.frame(
  Orthographic = c("this is it at least probably",
                   "well not probably it's not intuitive",
                   "sure no it's I mean it's very intuitive",
                   "I don't mean to be rude but it's anything but you know",
                   "well okay maybe"),
  Repeat = c(NA, "probably", "it's,intuitive", "I,mean,it's", NA),
  Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b", "\\b(I,mean|it's)\\b", NA)
)
I want to filter rows based on a dynamic pattern, namely the occurrence of no, never, not as words OR n't before any of the words listed in column Repeat. However, using the pattern \\b(no|never|not)\\b|n't\\b\\s together with the alternation patterns in column Repeat_pattern, I get this error:
df %>%
  filter(grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), Orthographic))
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
Warning message:
In grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), :
argument 'pattern' has length > 1 and only the first element will be used
I don't know why "only the first element will be used" as the two pattern components seem to connect well:
paste0("\\b(no|never|not)\\b|n't\\b\\s", df$Repeat_pattern)
[1] "\\b(no|never|not)\\b|n't\\b\\sNA" "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"
[3] "\\b(no|never|not)\\b|n't\\b\\s\\b(it's|intuitive)\\b" "\\b(no|never|not)\\b|n't\\b\\s\\b(I,mean|it's)\\b"
[5] "\\b(no|never|not)\\b|n't\\b\\sNA"
The expected output is this:
2 well not probably it's not intuitive probably \\b(probably)\\b
3 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
4 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I,mean|it's)\\b

This is a vectorization issue: grepl() accepts only a single pattern, so you need stringr::str_detect(), which is vectorized over both the string and the pattern.
Also, the negative-word alternatives are not grouped correctly: the top-level | makes \\b(no|never|not)\\b match on its own, ignoring Repeat_pattern entirely, while Repeat_pattern is glued only to the n't branch. All the alternatives must reside in a single group.
Also, NA values are coerced to the literal text "NA" and pasted into the regex patterns, while it seems you want to discard the rows where Repeat_pattern is NA.
You can fix your code by using
library(stringr)

df %>%
  filter(ifelse(is.na(Repeat_pattern),
                FALSE,
                str_detect(Orthographic,
                           paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", Repeat_pattern))))
Output:
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
3 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I|mean|it's)\\b
I also think the last pattern must be \\b(I|mean|it's)\\b, not \\b(I,mean|it's)\\b.
If there can only be whitespace between the "no" words and the word from the Repeat column, replace .* with \\s+ in my pattern. I used \\b.* to make sure there is a match anywhere to the right of the "no" words.
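For example, here is the whitespace-only variant, written with !is.na() instead of ifelse() for readability (a sketch, same filtering idea as above):
df %>%
  filter(!is.na(Repeat_pattern),
         str_detect(Orthographic,
                    paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b\\s+", Repeat_pattern)))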

How to filter R dataset by multiple partial match strings, similar to SQL % wildcard?

I have a dataset with a field of interest and a list of strings (several hundred of them).
What I want to do is, for each line of the data, check whether the field contains any of the partial strings.
Essentially, I want to replicate the SQL % wildcard. So, if for example a value is "Game123" and one of my strings is "Ga" I want that to be a match. (But I don't want "OGame" to match "Ga").
I'm hoping to write some statement like this:
df %>%
  filter(My_Field contains any one of List_Of_Strings)
How do I fill in that filter statement?
I tried to use the %in% operator but couldn't make it work. I know how to use substrings to check against a single string, but I have a long list of them and need to check all of them.
R filter rows based on multiple partial strings applied to multiple columns: This post is similar to what I'm trying to do, but my list of substrings is 400 plus, so I can't write it all out manually in a grepl statement (I think?)
Since there is no particular dataset or reproducible example, I can think of a way to implement it with two apply functions and a smart use of regex. Remember that the regex operator ^ anchors the match to the beginning of the string.
library(dplyr)

MyField <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")
df <- data.frame(MyField)
ListOfStrings <- c("^Ga", "^Du") # notice the use of ^ here

match_s <- function(patterns, entry) {
  lapply(patterns, grepl, x = entry) %>% unlist() %>% any()
}

df$match_string <- sapply(df$MyField, match_s, patterns = ListOfStrings)
df %>% filter(match_string)
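With several hundred patterns, looping per row can get slow. An alternative is to collapse the prefixes into a single alternation with a leading word boundary, so one vectorized grepl() call does all the work. A minimal sketch, assuming the prefixes are plain strings with no regex metacharacters (no ^ anchors needed here):
prefixes <- c("Ga", "Du")                              # hypothetical plain prefixes
pattern <- paste0("\\b(", paste(prefixes, collapse = "|"), ")")
df %>% filter(grepl(pattern, MyField))                 # keeps Game123, Duck, Dugame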
With dplyr (using stringr for words and sentences as examples) and grepl in conjunction with \\b to get the word boundary match at the beginning.
library(stringr)
library(dplyr)

set.seed(22)
tibble(sentences) %>%
  rowwise() %>%
  filter(any(sapply(words[sample(length(words), 10)], function(x)
    grepl(paste0("\\b", x), sentences)))) %>%
  ungroup()
# A tibble: 32 × 1
sentences
<chr>
1 It's easy to tell the depth of a well.
2 Kick the ball straight and follow through.
3 A king ruled the state in the early days.
4 March the soldiers past the next hill.
5 The dune rose from the edge of the water.
6 The grass curled around the fence post.
7 Cats and Dogs each hate the other.
8 The harder he tried the less he got done.
9 He knew the skill of the great young actress.
10 The club rented the rink for the fifth night.
# … with 22 more rows
I guess the problem you're facing is this:
You have a list of what could be called key words (what you call "a list of strings") and a vector/column with text (what you call "a field of interest") and your goal is to filter the vector/column on whether or not any of the key words is present. If that's correct the solution might be this:
Data:
a. List of key words:
keys <- c("how", "why", "what")
b. Dataframe with a vector/column of text:
df <- data.frame(
text = c("Hi there", "How are you?", "I'm fine.", "So how's work?", "Ah kinda stressful.", "Why?", "Well you know")
)
Solution:
To filter df on keys in text you need to convert keys into a regex alternation pattern (by collapsing the strings with |). Depending on your keys it may be useful or even necessary to also include word boundary markers \\b (in case the keys need to match as whole words and not inside other words). And finally, if lower- or upper-case may be an issue, we can use the case-insensitive flag (?i):
library(dplyr)
library(stringr)

df %>%
  filter(str_detect(text, str_c("(?i)\\b(", str_c(keys, collapse = "|"), ")\\b")))
text
1 How are you?
2 So how's work?
3 Why?

Remove Everything Except Specific Words From Text

I'm working with Twitter data using R. I have a large data frame where I need to remove everything from the text except for specific information. Specifically, I want to remove everything except for statistical information. So basically, I want to keep numbers as well as words such as "half", "quarter", "third". Also, is there a way to keep symbols such as "£", "%", "$"?
I have been using "gsub" to try and do this:
df$text <- as.numeric(gsub(".*?([0-9]+).*", "\\1", df$text))
This code removes everything except for numbers, but any information carried by words is lost. I'm struggling to figure out how to keep specific words within the text as well as the numbers.
Here's a mock data frame:
text <- c("here is some text with stuff inside that i dont need but also some that i do, here is a word half and quarter also 99 is too old for lego", "heres another one with numbers 132 1244 5950 303 2022 and one and a half", "plz help me with code i am struggling")
df <- data.frame(text)
I would like to be able to end up with a data frame containing only the numbers and the specific words from each text. Some of my observations will have neither a number nor the specific words, and those should end up as NA. The goal of this code is really just to be able to say that these observations contain some form of statistical language and these other observations do not.
Any help would be massively appreciated and I'll do my best to answer any Q's!
I am sure there is a more elegant solution, but I believe this will accomplish what you want!
df$newstrings <- unlist(lapply(regmatches(df$text, gregexpr("half|quarter|third|[[:digit:]]+", df$text)),
                               function(x) paste(x, collapse = "")))
df$newstrings[df$newstrings == ""] <- NA
> df$newstrings
# [1] "halfquarter99" "132124459503032022half" NA
You can capture what you need to keep and then match and consume any character to replace with a backreference to the group value:
text <- c("here is some text with stuff inside that i dont need but also some that i do, here is a word half and quarter also 99 is too old for lego", "heres another one with numbers 132 1244 5950 303 2022 and one and a half", "plz help me with code i am struggling")
gsub("(half|quarter|third|\\d+)|.", "\\1", text)
Details:
(half|quarter|third|\d+) - a half, quarter or third word, or one or more digits
| - or
. - any single char.
The \1 in the replacement pattern puts the captured value back into the resulting string.
Output:
[1] "halfquarter99" "132124459503032022half" ""

R: grep multiple strings at once

I have a data frame with 1 variable and 5,000 rows, where each element is a string.
1. "Am open about my feelings."
2. "Take charge."
3. "Talk to a lot of different people at parties."
4. "Make friends easily."
5. "Never at a loss for words."
6. "Don't talk a lot."
7. "Keep in the background."
.....
5000. "Speak softly."
I need to find and output row numbers that correspond to 3 specific elements.
Currently, I use the following:
grep("Take charge." , df[,1])
grep("Make friends easily.", df[,1])
grep("Make friends easily.", df[,1])
And get the following output:
[1] 2
[1] 4
[1] 5000
Question 1. Is there a way to make the syntax more succinct, so I do not have to use grep and df[,1] on every single line?
Question 2. If so, how do I output a single numerical array of the necessary row positions, so the result would look something like this?
2, 4, 5000
What I tried so far.
grep("Take charge." , "Make friends easily.","Make friends easily.",
df[,1]) # this didn't work
I tried to create a vector, called m1, that contains all three elements and then grep(m1, df[,1]) # this didn't work either
Since these are exact matches, use this, where phrases is a character vector of the phrases you want to match:
match(phrases, df[, 1])
This also works provided no phrase is a substring of another phrase (note that grep() takes a single pattern, so the phrases are collapsed into one alternation):
grep(paste(phrases, collapse = "|"), df[, 1])
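For example, with the three phrases from the question (a quick sketch; match() needs exact full-string matches, while grep() finds the pattern anywhere in the string):
phrases <- c("Take charge.", "Make friends easily.", "Speak softly.")
match(phrases, df[, 1])                       # exact matches, e.g. 2 4 5000
grep(paste(phrases, collapse = "|"), df[, 1]) # regex alternation, e.g. 2 4 5000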

R Regex for matching comma separated sections in a column/vector

The original title for this question was: R Regex for word boundary excluding space. It reflected the manner in which I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a particular delimiter is used to separate items within a 'cell'.
This must be very simple, but I've hit a brick wall on it.
I have a dataframe column where each cell(row) is a comma separated list of items. I want to find the rows that have a specific item.
df <- data.frame(nms = c("XXXCAP,XXX CAPITAL LIMITED",
                         "XXX,XXX POLYMERS LIMITED, 3455",
                         "YYY,XXX REP LIMITED,999,XXX"),
                 b = c("A", "X", "T"))
nms b
1 XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3 YYY,XXX REP LIMITED,999,XXX T
I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.
However, because the XXX in row 1 is set off by a comma and a space within a larger item, I am having trouble filtering it out with \\b or [[:<:]]:
grep("\\bXXX\\b",df$nms, value = F) #matches 1,2,3
The easiest way to do this of course is strsplit(), but I'd like to avoid it. Any suggestions on performance are welcome.
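For reference, the strsplit() approach I would like to avoid looks something like this:
# split each cell on commas, trim whitespace, and test exact membership
sapply(strsplit(as.character(df$nms), ","), function(parts) "XXX" %in% trimws(parts))
# [1] FALSE  TRUE  TRUE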
When \b does not "work", the problem usually lies in the definition of the "whole word".
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
It seems you want to only match a word in between commas or at the start/end of the string.
You may use a PCRE regex (note the perl=TRUE argument) like
(?<![^,])XXX(?![^,])
Details
(?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
XXX - an XXX word
(?![^,]) (equal to (?=$|,)) - either end of the string or a comma
R demo:
> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
The equivalent TRE regex will look like
> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)
Note that here, non-capturing groups are used to match either start of string or , (see (?:^|,)) and either end of string or , (see (?:$|,)).
This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:
library(stringr)
df$nms %>%
str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
str_detect('(^|,)XXX(,|$)') # Detects string XXX surrounded by comma or beginning/end
[1] FALSE TRUE TRUE
Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).

Multiple Pattern Search - Pick the most hit Line from document

I am trying to search for a list of terms or keywords in a list of sentences. I want to pick the line from the list of lines (which are review comments from customers) that matches the most of my terms or keywords.
At present I am doing this:
library(qdap) # termco() comes from the qdap package

mydata <- c("i like this product, awesome",
            "i could not go with this produt, since s/w is problem",
            "Very good s/w. keep up the good work. i really like it")
terms <- c("really", "good", "like", "product")
termco(mydata, 1:3, terms)
and I get
    3 word.count   really      good      like   product
1   1          5        0         0 1(20.00%) 1(20.00%)
2   2         11        0         0         0         0
3   3         12 1(8.33%) 2(16.67%)  1(8.33%)         0
I also tried a few other suggestions from HERE, but I could not get the result I wanted (though the solution is very nice).
My expectation is that only the line or lines containing the maximum number of the terms or keywords I am searching for should be displayed.
In this case I expected the line below, since it has the maximum number of my terms or keywords, i.e. "really", "good", and "like":
"Very good s/w. keep up the good work. i really like it"
Thanks in advance!!
Here is a base R solution using apply and grep. The basic idea is to call grep(term, sentence), for every term in a given sentence. Then, we sum the number of hit terms for each sentence. Note carefully that we add word boundary markers around each term. This is to prevent false matches where a term happens to be a substring of another word in a sentence.
sapply(mydata, function(x) {
  Reduce("+", sapply(terms, function(y) {
    sum(grep(paste0("\\b", y, "\\b"), x))
  }))
})
i like this product, awesome
2
i could not go with this product, since s/w is problem
1
Very good s/w. keep up the good work. i really like it
3
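To return the best-matching line itself rather than the per-sentence counts, index with which.max. A compact sketch of the same idea, using grepl() and sum() instead of Reduce():
counts <- sapply(mydata, function(x)
  sum(sapply(terms, function(y) grepl(paste0("\\b", y, "\\b"), x))))
mydata[which.max(counts)]
# [1] "Very good s/w. keep up the good work. i really like it"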
Using stringr's str_count can help as well:
Use str_count to get the counts of all matches (4 in total for the last record), then which.max to get the index within the vector (in this case it returns 3, meaning the third element of mydata):
mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
In case you want an absolute match with boundary conditions, you may use:
mydata[which.max(stringr::str_count(mydata,paste0("\\b",paste0(terms, collapse="\\b|\\b"),"\\b")))]
In your case both will give the same answer; however, the second will yield fewer matches, e.g. when a sentence contains "keeping" instead of "keep". The latter regex will not match it, since it requires an exact word, whereas the former will, as no boundary conditions are set.
Output:
> mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
[1] "Very good s/w. keep up the good work. i really like it"
