Remove words in one column if present in another column - r

I have a dataframe that is in this format:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)
My intention is to get the result back as:
ID A                         B
1  John Smith                is a very highly smart guy
2  Red Shirt                 We tried the tea but didn't enjoy it at all
3  Family values are better  is very important as it gives you
I have tried:
test <- df %>% filter(sapply(1:nrow(.), function(i) grepl(A[i], B[i])))
But this doesn't get me the desired output.

One solution is to use mapply along with strsplit.
The trick is to split df$A into separate words, collapse those words with |, and then use the result as the pattern in gsub to replace matches with "".
lst <- strsplit(df$A, split = " ")
df$B <- mapply(function(x, y) gsub(paste0(x, collapse = "|"), "", df$B[y]),
               lst, seq_along(lst))
df
# A B
# 1 John Smith is a very highly smart guy
# 2 Red Shirt We tried the tea but didn't enjoy it at all
# 3 Family values are better is very important as it gives you
Another option:
df$B <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", df$A), df$B)
Data:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)
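One caveat for both mapply options above: the alternation pattern matches substrings, so a word like "are" would also be chopped out of "aware". A minimal sketch of a stricter variant, assuming the same df as above, that adds word boundaries and tidies the leftover whitespace:
# wrap each word of A in \b so only whole words are removed from B,
# then collapse the doubled spaces and trim the ends
patterns <- sapply(strsplit(df$A, " "),
                   function(w) paste0("\\b", w, "\\b", collapse = "|"))
df$B <- trimws(gsub("\\s+", " ", mapply(gsub, patterns, "", df$B)))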

Just another option, using the stringr::str_split_fixed function:
library(stringr)
str_split_fixed(sapply(paste(df$A, df$B, sep = " columnbreaker "),
                       function(i) {
                         paste(unique(strsplit(as.character(i), split = " ")[[1]]),
                               collapse = " ")
                       }),
                " columnbreaker ", 2)
# [,1] [,2]
# [1,] "John Smith" "is a very highly smart guy"
# [2,] "Red Shirt" "We tried the tea but didn't enjoy it at all"
# [3,] "Family values are better" "is very important as it gives you"


How to extract all matching patterns (words in a string) in a dataframe column?

I have two data frames. One (txt.df) has a column (text) with the texts I want to extract phrases from. The other (wrd.df) has a column (phrase) with the phrases. Both are big data frames with complex texts and strings, but let's say:
txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
                     text = c("they love cats and dogs",
                              "he is drinking juice",
                              "the child is having a nap on the bed",
                              "they jump on the bed and break it",
                              "the cat is sleeping on the bed"))
wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
                     phrase = c("love cats", "love dogs", "juice drinking",
                                "nap on the bed", "break the bed",
                                "sleeping on the bed"))
What I finally need is txt.df with another column that contains the labels of the phrases detected.
What I tried was creating a column in wrd.df in which I tokenized the phrases, like this:
wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))
and then tried to write a custom function to sapply over the token column with grepl/str_detect and get the names (labels) of those rows where all tokens matched:
Extract.Fun <- function(text, df, label, token){
  for (i in token) {
    truefalse[i] <- sapply(token[i], function(x) grepl(x, text))
    truenames[i] <- names(which(truefalse[i] == TRUE))
    removedup[i] <- unique(truenames[i])
    return(removedup)
  }
}
and then sapply this custom function over txt.df$text to get a new column with the labels:
txt.df$extract <- sapply(txt.df$text, function(x) Extract.Fun(x, wrd.df, "label", "token"))
But I'm not good with custom functions and am really stuck. I would appreciate any help.
P.S. It would be very good if I could also have partial matches like "drink juice" and "broke the bed"... but it's not a priority... I'm fine with the original ones.
If you need to match the exact phrases, regex_join() from the fuzzyjoin package is what you need.
fuzzyjoin::regex_join(txt.df, wrd.df, by = c(text = "phrase"), mode = "left")
id text label phrase
1 1 they love cats and dogs a love cats
2 2 he is drinking juice <NA> <NA>
3 3 the child is having a nap on the bed d nap on the bed
4 4 they jump on the bed and break it <NA> <NA>
5 5 the cat is sleeping on the bed d sleeping on the bed
If you want to match all the words, I guess you can build a regex out of the phrases that covers such behaviour...
Update:
# build a regex for each phrase: split the phrase into individual words,
# then paste the lookaheads together
wrd.df$regex <- unlist(lapply(lapply(strsplit(wrd.df$phrase, " "),
                                     function(x) paste0("(?=.*", x, ")", collapse = "")),
                              function(x) paste0("^", x, ".*$")))
fuzzyjoin::regex_join(txt.df, wrd.df, by = c(text = "regex"), mode = "left")
id text label phrase regex
1 1 they love cats and dogs a love cats ^(?=.*love)(?=.*cats).*$
2 1 they love cats and dogs b love dogs ^(?=.*love)(?=.*dogs).*$
3 2 he is drinking juice c juice drinking ^(?=.*juice)(?=.*drinking).*$
4 3 the child is having a nap on the bed d nap on the bed ^(?=.*nap)(?=.*on)(?=.*the)(?=.*bed).*$
5 4 they jump on the bed and break it e break the bed ^(?=.*break)(?=.*the)(?=.*bed).*$
6 5 the cat is sleeping on the bed d sleeping on the bed ^(?=.*sleeping)(?=.*on)(?=.*the)(?=.*bed).*$
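If you then want the labels back on txt.df as a single column (an assumption about the final shape; the question only says "another column which contains labels"), you can collapse the joined labels per text. Note that aggregate() drops groups whose label is all NA, which is fine here because every text matches at least one regex:
# collapse all matched labels into one comma-separated value per text
joined <- fuzzyjoin::regex_join(txt.df, wrd.df, by = c(text = "regex"), mode = "left")
txt.df$extract <- aggregate(label ~ id, data = joined,
                            FUN = function(x) paste(unique(x), collapse = ", "))$label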

Approximate string matching in R between two datasets

I have the following dataset containing film titles and the corresponding genre, while another dataset contains plain text where these titles might be quoted or not:
dt1
title                   genre
Secret in Their Eyes    Dramas
V for Vendetta          Action & Adventure
Bottersnikes & Gumbles  Kids' TV
...                     ...
and
dt2
id Text
1  "I really liked V for Vendetta"
2  "Bottersnikes & Gumbles was a great film .... "
3  " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4  "#thewitcher was an interesting series"
5  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc
What I want to obtain is a function that takes the titles in dt1 and tries to find them in the text in dt2: if it finds a match or approximate match, I want a column in dt2 giving the title that was mentioned in the text; if more than one is mentioned, I want all the titles separated by a comma.
dt2
id Text                                                                     mentions
1  "I really liked V for Vendetta"                                          "V for Vendetta"
2  "Bottersnikes & Gumbles was a great film .... "                          "Bottersnikes & Gumbles"
3  " In any case, in my opinion bottersnikes &gumbles was a great film ..." "Bottersnikes & Gumbles"
4  "#thewitcher was an interesting series"                                  NA
5  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"      "Secret in Their Eyes, V for Vendetta"
... etc
You can do the fuzzy matching via agrepl(), which here I've applied to each title with lapply() to generate a logical vector of matches for each Text, and then used apply() across a data.frame built from those matches to create the vector of matched titles.
You can tweak the max.distance value but this worked just fine on your example.
dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)
dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "#thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)
match_titles <- function(target, titles) {
  # one logical vector per title: does it fuzzily occur in each element of target?
  matches <- lapply(titles, agrepl, target,
                    max.distance = 0.3,
                    ignore.case = TRUE, fixed = TRUE)
  # row-wise over the data.frame of match vectors: collapse the matching titles
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}
dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
## id Text
## 1 1 I really liked V for Vendetta
## 2 2 Bottersnikes & Gumbles was a great film ....
## 3 3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4 4 #thewitcher was an interesting series
## 5 5 Secret in Their Eye is a terrible film! but I Like V per Vendetta
## titles
## 1 V for Vendetta
## 2 Bottersnikes & Gumbles
## 3 Bottersnikes & Gumbles
## 4
## 5 Secret in Their Eyes,V for Vendetta
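Two cosmetic touch-ups you may want (assumptions about taste, not part of the question): the function returns "" when nothing matches, while the desired output shows NA; and the desired separator is ", " rather than ",", which you could get by passing collapse = ", " inside match_titles(). For the first:
# an empty string means no title matched; make that an explicit NA
dt2$titles[dt2$titles == ""] <- NA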

Words matching in two columns using r

I have two data frames: DF1 is a word dictionary and DF2 contains sentences. I want to do text matching in such a way that if a word in DF1 matches any word in a DF2 sentence, the output is a column with yes for a match and no otherwise. The data frames are as follows:
(DF1) word dictionary:
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
(DF2) sentences:
DF2 <- c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor")
and output should be:
Customer satisfaction index improvement (yes)
reduction in retail cycle (no)
Improve market share (yes)
% recovery from vendor (no)
Note: the yes/no is a separate column showing the result of the text matching.
Can anyone help? Thanks in advance.
You could do it like this:
df <- data.frame(sentence = c("Customer satisfaction index improvement",
                              "reduction in retail cycle",
                              "Improve market share",
                              "% recovery from vendor"))
words <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
# combine the words into one regular expression and bind the result as column "yes"
df <- cbind(df, yes = grepl(paste(words, collapse = "|"), df$sentence))
This outputs
sentence yes
1 Customer satisfaction index improvement TRUE
2 reduction in retail cycle FALSE
3 Improve market share TRUE
4 % recovery from vendor FALSE
Try this:
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
DF2 <- c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor")
result <- cbind(DF2, "word found" = ifelse(rowSums(sapply(DF1, grepl, x = DF2)) > 0,
                                           "YES", "NO"))
> result
DF2 word found
[1,] "Customer satisfaction index improvement" "YES"
[2,] "reduction in retail cycle" "NO"
[3,] "Improve market share" "YES"
[4,] "% recovery from vendor" "NO"

Maximum occurrence of any set of words in text in R

Given a set of lines, I have to find the maximum occurrences of words (not necessarily single words; sets of words too).
Say I have a text like:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
I want the output to be:
john beck - 3
chemical engineer - 2
Is there any function or package which does this?
Try this:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
library(tau)
library(tm)
tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")
## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# john beck 3
# chemical engineer 2
## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x) )), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# beck 3
# john 3
# john beck 3
# chemical 2
# engineer 2
# is 2
# chemical engineer 2
Could also try this, using the quanteda package:
require(quanteda)
mydfm <- dfm(string, ngrams = 1:2, concatenator = "_", stem = TRUE, verbose = FALSE)
topfeatures(mydfm)
##           beck           john      john_beck         chemic chemical_engin          engin             is
##              3              3              3              2              2              2              2
##             an        an_chem            are
##              1              1              1
You lose the stems, but this counts "john beck" three times instead of just two (since without stemming, "john beck's" will be a separate type).
It's simpler though!
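For comparison, a dependency-free base-R sketch of the bigram count (assuming lower-casing and stripping punctuation are acceptable; note that without stemming "beck's" stays a separate token, which is exactly why the answers above stem first):
# tokenize, then pair each word with its successor and tabulate
words <- strsplit(tolower(gsub("[[:punct:]]", "", string)), "\\s+")[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))
sort(table(bigrams), decreasing = TRUE)[1:3]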

Parallel for loop in R

I have a data.frame sent with sentences in sent$words and a dictionary of positive/negative words in the wordsDF data frame (wordsDF[x,1]). Positive words = 1 and negative = -1 (wordsDF[x,2]). The words in wordsDF are sorted in decreasing order of string length; my function below relies on this.
How this function works:
1) Count the occurrences of the words stored in wordsDF in each sentence.
2) Compute the sentiment score: the count of occurrences of a particular word (wordsDF) in a particular sentence * the sentiment value of that word (positive = 1, negative = -1).
3) Remove the matched word from the sentence before the next iteration.
Original solution, using the stringr package:
library(stringr)

scoreSentence_01 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    count <- str_count(sentence, wordsDF[x, 1])
    score <- score + (count * wordsDF[x, 2])  # compute score (count * sentValue)
    sentence <- str_replace_all(sentence, wordsDF[x, 1], " ")
  }
  score
}
A faster solution - lines 4 and 5 below replace line 4 of the original solution:
library(stringi)

scoreSentence_02 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    sd <- function(text) stri_count(text, regex = wordsDF[x, 1])  # stringi counting is faster
    results <- sapply(sentence, sd, USE.NAMES = FALSE)
    score <- score + (results * wordsDF[x, 2])  # compute score (count * sentValue)
    sentence <- str_replace_all(sentence, wordsDF[x, 1], " ")
  }
  score
}
The functions are called as:
scoreSentence_Score <- scoreSentence_01(sent$words)
In reality I'm using a data set with 300,000 sentences and a dictionary of 7,000 positive and negative words. This approach is very, very slow for that, and with my beginner knowledge of R programming I'm at the end of my efforts.
Could anyone help me rewrite this function into a vectorized or parallel solution, please? Any help or advice is much appreciated. Thank you very much in advance.
Dummy data:
sent <- data.frame(words = c("great just great right size and i love this notebook",
                             "benefits great laptop at the top",
                             "wouldnt bad notebook and very good",
                             "very good quality",
                             "bad orgtop but great",
                             "great improvement for that great improvement bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1, 2, 3, 4, 5, 6, 7),
                   stringsAsFactors = FALSE)
posWords <- c("great", "improvement", "love", "great improvement", "very good", "good",
              "right", "very", "benefits", "extra", "benefit", "top", "extraordinarily",
              "extraordinary", "super", "benefits super", "good", "benefits great",
              "wouldnt bad")
negWords <- c("hate", "bad", "not good", "horrible")
# Replicate the original data.frame - big data simulation (700,000 rows of sentences).
# coredata() comes from the zoo package.
library(zoo)
df.expanded <- as.data.frame(replicate(10000, sent$words))  # (not used below)
sent <- coredata(sent)[rep(seq(nrow(sent)), 10000), ]
sent$words <- paste("", sent$words, "")  # pad with spaces so whole words match
rownames(sent) <- NULL
# Order the words in pos/negWords by decreasing length
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)
wordsDF <- wordsDF[order(-wordsDF$lengths), ]
wordsDF$words <- paste("", wordsDF$words, "")  # pad dictionary entries the same way
rownames(wordsDF) <- NULL
Desired output is:
words                                                                              user scoreSentence_Score
great just great right size and i love this notebook                                 1                   4
benefits great laptop at the top                                                     2                   2
wouldnt bad notebook and very good                                                   3                   2
very good quality                                                                    4                   1
bad orgtop but great                                                                 5                   0
great improvement for that great improvement bad product but overall is not good     6                   0
notebook is not good but i love batterytop                                           7                   0
Okay, now that I know you have to work around phrases and words... here's another shot at it. Basically, you have to split out your phrases first, score them, remove them from the string, then score your words...
library(stringr)
sent <- data.frame(words = c("great just great right size and i love this notebook", "benefits great laptop at the top",
"wouldnt bad notebook and very good", "very good quality", "bad orgtop but great",
"great improvement for that great improvement bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
sent$words2 <- sent$words
# split bad into words and phrases...
bad_phrases <- negWords[grepl(" ", negWords)]
bad_words <- negWords[!negWords %in% bad_phrases]
bad_words <- paste0("\\b", bad_words, "\\b")
pos_phrases <- posWords[grepl(" ", posWords)]
pos_words <- posWords[!posWords %in% pos_phrases]
pos_words <- paste0("\\b", pos_words, "\\b")
score <- - str_count(sent$words2, paste(bad_phrases, collapse="|"))
sent$words2 <- gsub(paste(bad_phrases, collapse="|"), "", sent$words2)
score <- score + str_count(sent$words2, paste(pos_phrases, collapse="|"))
sent$words2 <- gsub(paste(pos_phrases, collapse="|"), "", sent$words2)
score <- score + str_count(sent$words2, paste(pos_words, collapse="|")) - str_count(sent$words2, paste(bad_words, collapse="|"))
score
Can't you just do:
library(stringr)
scoreSentence_Score <- str_count(sent$words, paste(posWords, collapse = "|")) -
  str_count(sent$words, paste(negWords, collapse = "|"))
(Note this counts overlapping terms such as "great" and "great improvement" independently, which is why the phrase-first approach above removes phrases before scoring single words.)
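Since the question title asks about parallelism: a minimal sketch with the base parallel package, assuming a Unix-alike where mclapply() can fork (on Windows you would set up a cluster and use parLapply() instead). scoreSentence_01 is already vectorised over a sentence vector, so each worker can take a chunk:
library(parallel)

n_cores <- detectCores()
chunks <- split(sent$words, cut(seq_along(sent$words), n_cores))
# forked workers inherit wordsDF and the loaded stringr functions from this session;
# cut() keeps the chunks in order, so unlist() restores the original row order
scores <- unlist(mclapply(chunks, scoreSentence_01, mc.cores = n_cores), use.names = FALSE)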
