I have the following dataset containing film titles and their corresponding genres, while another dataset contains plain text in which these titles may or may not be quoted:
dt1
title                   genre
Secret in Their Eyes    Dramas
V for Vendetta          Action & Adventure
Bottersnikes & Gumbles  Kids' TV
...                     ...
and
dt2
id Text
1  "I really liked V for Vendetta"
2  "Bottersnikes & Gumbles was a great film .... "
3  "In any case, in my opinion bottersnikes &gumbles was a great film ..."
4  "#thewitcher was an interesting series"
5  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc
What I want to obtain is a function that takes the titles in dt1 and tries to find them in the text in dt2: if it finds an exact or approximate match, I want a column in dt2 that records the title that was mentioned in the text. If more than one title is mentioned, I want all the titles separated by a comma.
dt2
id Text                                                                    mentions
1  "I really liked V for Vendetta"                                         "V for Vendetta"
2  "Bottersnikes & Gumbles was a great film .... "                         "Bottersnikes & Gumbles"
3  "In any case, in my opinion bottersnikes &gumbles was a great film ..." "Bottersnikes & Gumbles"
4  "#thewitcher was an interesting series"                                 NA
5  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"     "Secret in Their Eyes, V for Vendetta"
... etc
You can do the fuzzy matching via agrepl(), which here I've applied to each title with lapply() to generate a logical vector of matches for each Text, and then used apply() across a data.frame of those results to create the vector of matched titles.
You can tweak the max.distance value, but this worked just fine on your example.
dt1 <- data.frame(
title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
genre = c("Dramas", "Action & Adventure", "Kids' TV"),
stringsAsFactors = FALSE
)
dt2 <- data.frame(
id = 1:5,
Text = c(
"I really liked V for Vendetta",
"Bottersnikes & Gumbles was a great film .... ",
"In any case, in my opinion bottersnikes &gumbles was a great film ...",
"#thewitcher was an interesting series",
"Secret in Their Eye is a terrible film! but I Like V per Vendetta"
),
stringsAsFactors = FALSE
)
match_titles <- function(target, titles) {
  # one logical vector per title: does it fuzzily occur in each text?
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  # row-wise over texts: keep the titles that matched and collapse them
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}
dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
## id Text
## 1 1 I really liked V for Vendetta
## 2 2 Bottersnikes & Gumbles was a great film ....
## 3 3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4 4 #thewitcher was an interesting series
## 5 5 Secret in Their Eye is a terrible film! but I Like V per Vendetta
## titles
## 1 V for Vendetta
## 2 Bottersnikes & Gumbles
## 3 Bottersnikes & Gumbles
## 4
## 5 Secret in Their Eyes,V for Vendetta
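If the matching looks too loose or too strict, agrepl() is easy to probe directly before wiring it into the function. A minimal sketch on a sentence from the example (the thresholds here are just illustrative):

# "V per Vendetta" is two substitutions away from "V for Vendetta"
agrepl("V for Vendetta", "I Like V per Vendetta",
       max.distance = 2, ignore.case = TRUE, fixed = TRUE)
## [1] TRUE
agrepl("V for Vendetta", "I Like V per Vendetta",
       max.distance = 1, ignore.case = TRUE, fixed = TRUE)
## [1] FALSE
# an integer max.distance is an absolute edit count; a fraction (as in
# match_titles above) is interpreted relative to the pattern length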
Related
I have two dataframes. One (txt.df) has a column with text I want to extract phrases from (text). The other (wrd.df) has a column with the phrases (phrase). Both are big dataframes with complex texts and strings, but let's say:
txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
text = c("they love cats and dogs", "he is drinking juice",
"the child is having a nap on the bed", "they jump on the bed and break it",
"the cat is sleeping on the bed"))
wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed", "break the bed",
"sleeping on the bed"))
What I finally need is txt.df with another column that contains the labels of the phrases detected.
What I tried was creating a column in wrd.df in which I tokenized the phrases, like this:
wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))
and then tried to write a custom function to sapply over the token column with grepl/str_detect and get the names (labels) of those for which all tokens matched:
Extract.Fun <- function(text, df, label, token){
  for (i in token) {
    truefalse[i] <- sapply(token[i], function(x) grepl(x, text))
    truenames[i] <- names(which(truefalse[i] == T))
    removedup[i] <- unique(truenames[i])
    return(removedup)
  }
}
and then sapply this custom function on my txt.df$text to have a new column with the labels.
txt.df$extract <- sapply(txt.df$text, function (x) Extract.Fun(x, wrd.df, "label", "token"))
but I'm not good with custom functions and am really stuck. I would appreciate any help.
P.S. It would be very good if I could also have partial matches like "drink juice" and "broke the bed"... but it's not a priority... I'm fine with the original ones.
If you need to match the exact phrases, regex_join() from the fuzzyjoin package is what you need.
fuzzyjoin::regex_join(txt.df, wrd.df, by = c(text = "phrase"), mode = "left")
id text label phrase
1 1 they love cats and dogs a love cats
2 2 he is drinking juice <NA> <NA>
3 3 the child is having a nap on the bed d nap on the bed
4 4 they jump on the bed and break it <NA> <NA>
5 5 the cat is sleeping on the bed d sleeping on the bed
If you want to match all the words, you can build a regex out of the phrases that covers such behaviour...
update
#build regex for phrases
#done by splitting the phrases to individual words, and then paste the regex together
wrd.df$regex <- unlist( lapply( lapply( strsplit( wrd.df$phrase, " "),
function(x) paste0( "(?=.*", x, ")", collapse = "" ) ),
function(x) paste0( "^", x, ".*$") ) )
fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "regex"), mode = "left" )
id text label phrase regex
1 1 they love cats and dogs a love cats ^(?=.*love)(?=.*cats).*$
2 1 they love cats and dogs b love dogs ^(?=.*love)(?=.*dogs).*$
3 2 he is drinking juice c juice drinking ^(?=.*juice)(?=.*drinking).*$
4 3 the child is having a nap on the bed d nap on the bed ^(?=.*nap)(?=.*on)(?=.*the)(?=.*bed).*$
5 4 they jump on the bed and break it e break the bed ^(?=.*break)(?=.*the)(?=.*bed).*$
6 5 the cat is sleeping on the bed d sleeping on the bed ^(?=.*sleeping)(?=.*on)(?=.*the)(?=.*bed).*$
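Since the join returns one row per match (note that id 1 appears twice above), you may still want to collapse the result into the single labels column the question asks for. A sketch of one way to do that with dplyr, where joined is just a name I've made up for the join result:

library(dplyr)

joined <- fuzzyjoin::regex_join(txt.df, wrd.df, by = c(text = "regex"), mode = "left")

# one row per text, with all detected labels collapsed into one string
joined %>%
  group_by(id, text) %>%
  summarise(labels = paste(na.omit(label), collapse = ","), .groups = "drop")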
I have a dataframe that is in this format:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B)
My intention is to get the result back as:
ID A B
1 John Smith is a very highly smart guy
2 Red Shirt We tried the tea but didn't enjoy it at all
3 Family values are better is very important as it gives you
I have tried:
test <- df %>% filter(sapply(1:nrow(.), function(i) grepl(A[i], B[i])))
But this doesn't get me the desired output.
One solution is to use mapply along with strsplit.
The trick is to split df$A into separate words, collapse those words separated by |, and then use that as the pattern in gsub to replace them with "".
lst <- strsplit(df$A, split = " ")
df$B <- mapply(function(x, y) gsub(paste0(x, collapse = "|"), "", df$B[y]),
               lst, seq_along(lst))
df
# A B
# 1 John Smith is a very highly smart guy
# 2 Red Shirt We tried the tea but didn't enjoy it at all
# 3 Family values are better is very important as it gives you
Another option:
df$B <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", df$A), df$B)
Data:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)
Just another option, using the stringr::str_split_fixed() function:
library(stringr)
str_split_fixed(
  sapply(paste(df$A, df$B, sep = " columnbreaker "),
         function(i) paste(unique(strsplit(as.character(i), split = " ")[[1]]),
                           collapse = " ")),
  " columnbreaker ", 2)
# [,1] [,2]
# [1,] "John Smith" "is a very highly smart guy"
# [2,] "Red Shirt" "We tried the tea but didn't enjoy it at all"
# [3,] "Family values are better" "is very important as it gives you"
I have a vector of text strings, such as:
Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn’t good enough.Let me help you with your baggage.",
"Everyone was busy, so I went to the movie alone. Two seats were vacant.",
"TEST Rock music approaches at high velocity.",
"I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
"A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.",
"Rock music approaches at high velocity TEST.")
I would like to extract n (for example, three) words (a word is characterized by whitespace before and after a group of characters) AROUND (i.e., before and after) a particular term (e.g., 'TEST').
Important: several matches should be allowed (i.e., if a particular term occurs more than once, the intended solution should capture all those cases).
The result might look like this (the format can be improved):
S1 <- c(before = "the promotion, but", after = "my attendance wasn’t")
S2 <- c(before = "", after = "")
S3 <- c(before = "", after = "Rock music approaches")
S4a <- c(before = "to take your", after = "donation; any amount")
S4b <- c(before = "will be greatly", after = "appreciated.")
S5a <- c(before = "a green donkey", after = "flew a TEST")
S5b <- c(before = "TEST flew", after = "kite in the")
S6 <- c(before = "at high velocity", after = "")
How can I do this? I have already looked at other posts, but they either handle only single-match cases or rely on fixed sentence structures.
The quanteda package has a great function for this: kwic() (keywords in context).
Out of the box, this works pretty well on your example:
library("quanteda")
names(Sentences) <- paste0("S", seq_along(Sentences))
(kw <- kwic(Sentences, "TEST", window = 3))
#
# [S1, 9] promotion, but | TEST | my attendance wasn't
# [S3, 1] | TEST | Rock music approaches
# [S4, 7] to take your | TEST | donation; any
# [S4, 15] will be greatly | TEST | appreciated.
# [S5, 8] a green donkey | TEST | flew a TEST
# [S5, 11] TEST flew a | TEST | kite in the
# [S6, 7] at high velocity | TEST | .
(kw2 <- as.data.frame(kw)[, c("docname", "pre", "post")])
# docname pre post
# 1 S1 promotion , but my attendance wasn't
# 2 S3 Rock music approaches
# 3 S4 to take your donation ; any
# 4 S4 will be greatly appreciated .
# 5 S5 a green donkey flew a TEST
# 6 S5 TEST flew a kite in the
# 7 S6 at high velocity .
That's probably a better format than the separate objects you ask for in the question. But to get as close as possible to your target, you can further transform it as follows.
# this picks up the empty matching sentence S2
(kw3 <- merge(kw2,
data.frame(docname = names(Sentences), stringsAsFactors = FALSE),
all.y = TRUE))
# replaces the NA with the empty string
kw4 <- as.data.frame(lapply(kw3, function(x) { x[is.na(x)] <- ""; x} ),
stringsAsFactors = FALSE)
# renames pre/post to before/after
names(kw4)[2:3] <- c("before", "after")
# makes the docname unique
kw4$docname <- make.unique(kw4$docname)
kw4
# docname before after
# 1 S1 promotion , but my attendance wasn't
# 2 S2
# 3 S3 Rock music approaches
# 4 S4 to take your donation ; any
# 5 S4.1 will be greatly appreciated .
# 6 S5 a green donkey flew a TEST
# 7 S5.1 TEST flew a kite in the
# 8 S6 at high velocity .
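One caveat if you run this today: in more recent quanteda releases (v3 onward, as far as I know), kwic() no longer accepts a raw character vector, so the first call would need an explicit tokens() step, along these lines:

library("quanteda")
# newer quanteda versions want a tokens object rather than raw text
kw <- kwic(tokens(Sentences), pattern = "TEST", window = 3)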
Given a set of lines, I have to find the words with the maximum number of occurrences (need not be a single word; it can also be a set of words).
Say I have a text like:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
I want output to be,
john beck - 3
chemical engineer - 2
Is there any function or package which does this?
Try this:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
library(tau)
library(tm)
tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")
## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# john beck 3
# chemical engineer 2
## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x) )), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# beck 3
# john 3
# john beck 3
# chemical 2
# engineer 2
# is 2
# chemical engineer 2
Could also try this, using the quanteda package:
require(quanteda)
mydfm <- dfm(string, ngrams = 1:2, concatenator = "_", stem = TRUE, verbose = FALSE)
topfeatures(mydfm)
## beck john john_beck chemic chemical_engin engin is
## 3 3 3 2 2 2 2
## an an_chem are
## 1 1 1
You lose the full word forms to stemming, but this counts "john beck" three times instead of just two (since without stemming, "john beck's" is a separate type).
It's simpler, though!
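As with kwic() in the earlier answer, the ngrams/stem arguments to dfm() have since been removed from quanteda; under current versions the equivalent would, I believe, be spelled out as a tokens pipeline:

library(quanteda)
toks <- tokens(string)                                    # tokenize
toks <- tokens_wordstem(toks)                             # stem each token
toks <- tokens_ngrams(toks, n = 1:2, concatenator = "_")  # uni- and bi-grams
topfeatures(dfm(toks))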
My data is as shown below:
txt$txt:
my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
I have an exhaustive list of city names. A few of them are listed below:
city:
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
I am searching for city names (from the "city" list I have) in txt$txt and extracting them into another column if they are present. So the simple loop below works for me... but it's taking a lot of time on the bigger dataset.
for (i in 1:nrow(txt)) {
  a <- c()
  for (j in 1:nrow(city)) {
    a[j] <- grepl(paste("\\b", city[j, 1], "\\b", sep = ""), txt$txt[i])
  }
  txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a), 1], collapse = "_"), "NONE")
}
I tried to use an apply function, and this is the furthest I could get:
apply(as.matrix(txt$txt), 1, function(x) {
  words <- unlist(strsplit(x, " "))
  ifelse(sum(words %in% city[, 1]) > 0,
         paste(words[which(words %in% city[, 1])], collapse = "_"),
         "NONE")
})
[1] "NONE" "NONE" "bangalore" "bkc"
Desired Output:
> txt
txt city
1 my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z. NONE
3 Hi girls..Friends meet at bangalore bangalore
4 what do u think of ccd at bkc bkc
I want a faster process in R that does the same thing the for loop above does. Please advise. Thanks.
Here's a possibility using stri_extract_first_regex from the stringi package:
library(stringi)
# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
df$city <- stri_extract_first_regex(str = df$txt, regex = paste(city, collapse = "|"))
df
# txt city
# 1 in adarsh nagar adarsh nagar
# 2 sony experia z <NA>
# 3 at bangalore bangalore
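Note that stri_extract_first_regex() only returns the first city per text. If you need every mention joined with "_" as in the desired output, stringi also provides stri_extract_all_regex(). A sketch, reusing the df and city objects from the snippet above and adding word boundaries so partial words don't match (city_all is just an illustrative column name):

pattern <- paste0("\\b(", paste(city, collapse = "|"), ")\\b")
hits <- stri_extract_all_regex(df$txt, pattern)
# each element is a character vector of matches, or NA if nothing matched
df$city_all <- sapply(hits, function(x)
  if (all(is.na(x))) "NONE" else paste(unique(x), collapse = "_"))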
This should be much faster:
bigPattern <- paste('(\\b', city[, 1], '\\b)', collapse = '|', sep = '')
txt$city <- sapply(regmatches(txt$txt, gregexpr(bigPattern, txt$txt)),
                   FUN = function(x) ifelse(length(x) == 0, 'NONE', paste(unique(x), collapse = '_')))
Explanation:
In the first line we build a big regular expression matching all the cities, e.g.:
(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...
Then we use gregexpr in combination with regmatches; in this way we get a list of the matches for each element of txt$txt.
Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing duplicates, i.e. cities mentioned more than once).
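To make that intermediate step concrete, here is a tiny made-up example of what regmatches() over gregexpr() returns:

x <- "meet at bangalore, then bkc, then bangalore again"
regmatches(x, gregexpr("(\\bbangalore\\b)|(\\bbkc\\b)", x))
## [[1]]
## [1] "bangalore" "bkc"       "bangalore"
# unique() then collapses the repeated "bangalore" before pasting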
Try this:
# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE),
                           city))
(res <- sapply(seq_along(txt), function(x)
  paste0(names(matches)[matches == x], collapse = "___")))
# [1] "adarsh nagar___airoli" ""
# [3] "bangalore" "bkc"