keeping the best string matched by fuzzy matching in R


I have two data frames in R. One (df.word) holds the phrases I want to match, with their synonyms in a second column; the other (df.string) holds the strings I want to match against, along with codes. The real strings are complicated, but to keep it simple, say we have:
df.word <- data.frame(label = c('warm wet', 'warm dry', 'cold wet'),
synonym = c('hot and drizzling\nsunny and raining','sunny and clear sky\ndry sunny day', 'cold winds and raining\nsnowing'))
df.string <- data.frame(day = c(1,2,3,4),
weather = c('there would be some drizzling at dawn but we will have a hot day', 'today there are cold winds and a bit of raining or snowing at night', 'a sunny and clear sky is what we have today', 'a warm dry day'))
I want to create df.string$extract, containing the best available match for each string. The column should look like this:
df.string$extract <- c('warm wet', 'cold wet', 'warm dry', 'warm dry')
Thanks in advance to anyone who can help.

Some points in your question are not entirely clear to me, but here is a proposed solution; check whether it works for you.
I assume that you want to find the best-matching label for each weather text. If so, you can use the stringsim function from the stringdist package as follows.
First note: if you remove the \n characters from your data, the results should be more accurate. I removed them for this example, but you can keep them if you prefer.
Second note: you can change the similarity measure by choosing a different method. Here I used cosine similarity, which is a relatively good starting point. For the alternatives, see the function's help page:
?stringsim
The cleaned data are as follows:
df.word <- data.frame(
label = c("warm wet", "warm dry", "cold wet"),
synonym = c(
"hot and drizzling sunny and raining",
"sunny and clear sky dry sunny day",
"cold winds and raining snowing"
)
)
df.string <- data.frame(
day = c(1, 2, 3, 4),
weather = c(
"there would be some drizzling at dawn but we will have a hot day",
"today there are cold winds and a bit of raining or snowing at night",
"a sunny and clear sky is what we have today",
"a warm dry day"
)
)
Install the package and load it:
install.packages('stringdist')
library(stringdist)
Create an n x m matrix that contains the similarity score of each weather text with each synonym group. The rows correspond to the weather texts and the columns to the synonym groups.
match.scores <- sapply( ## Create a nested loop with sapply
seq_along(df.word$synonym), ## Loop for each synonym as 'i'
function(i) {
sapply(
seq_along(df.string$weather), ## Loop for each weather as 'j'
function(j) {
stringsim(df.word$synonym[i], df.string$weather[j], ## Check similarity
method = "cosine", ## Method cosine
q = 2 ## Size of the q -gram: 2
)
}
)
}
)
r$> match.scores
[,1] [,2] [,3]
[1,] 0.3657341 0.1919924 0.24629819
[2,] 0.6067799 0.2548236 0.73552828
[3,] 0.3333974 0.6300619 0.21791793
[4,] 0.1460593 0.4485426 0.03688556
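To make the cosine method less of a black box, here is a minimal base-R sketch of a cosine similarity over q-gram counts. The helper names qgram_counts and qgram_cosine are hypothetical and not part of stringdist; the sketch is meant to illustrate the idea behind method = "cosine" with q = 2, and it has not been checked against stringsim's exact output.

```r
# Count the overlapping q-grams (substrings of length q) of a single string.
# Hypothetical helper, not a stringdist function.
qgram_counts <- function(x, q = 2) {
  n <- nchar(x)
  if (n < q) return(table(character(0)))
  table(substring(x, 1:(n - q + 1), q:n))
}

# Cosine similarity between the q-gram count vectors of two strings:
# dot product of the counts divided by the product of their norms.
qgram_cosine <- function(a, b, q = 2) {
  ca <- qgram_counts(a, q)
  cb <- qgram_counts(b, q)
  shared <- intersect(names(ca), names(cb))
  dot <- sum(as.numeric(ca[shared]) * as.numeric(cb[shared]))
  denom <- sqrt(sum(ca^2)) * sqrt(sum(cb^2))
  if (denom == 0) return(0)
  dot / denom
}

# Score one weather text against one synonym group.
score <- qgram_cosine("a warm dry day", "sunny and clear sky dry sunny day")
```

Because only q-gram counts are compared, word order has little influence on the score, which suits free-form sentences like the weather texts.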
Get the best match in each row, that is, for each weather text: find the label with the highest matching score and add these labels to the data frame.
ranked.match <- apply(match.scores, 1, which.max)
df.string$extract <- df.word$label[ranked.match]
df.string
r$> df.string
day weather extract
1 1 there would be some drizzling at dawn but we will have a hot day warm wet
2 2 today there are cold winds and a bit of raining or snowing at night cold wet
3 3 a sunny and clear sky is what we have today warm dry
4 4 a warm dry day warm dry
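Note that which.max always returns some label, even when every score in a row is low. A possible refinement, sketched below in base R, is to keep a label only when its score clears a cutoff. The score matrix is copied from the output above; the cutoff min.score = 0.3 is an arbitrary assumption for illustration and would need tuning on real data.

```r
# Similarity scores copied from the match.scores output above
# (rows = weather texts, columns = synonym groups).
match.scores <- matrix(
  c(0.3657341, 0.1919924, 0.24629819,
    0.6067799, 0.2548236, 0.73552828,
    0.3333974, 0.6300619, 0.21791793,
    0.1460593, 0.4485426, 0.03688556),
  nrow = 4, byrow = TRUE
)
labels <- c("warm wet", "warm dry", "cold wet")

# Column index and value of the best score in each row.
best.col <- max.col(match.scores, ties.method = "first")
best.score <- match.scores[cbind(seq_len(nrow(match.scores)), best.col)]

# Hypothetical cutoff: weaker matches are reported as NA instead of a label.
min.score <- 0.3
extract <- ifelse(best.score >= min.score, labels[best.col], NA_character_)
```

With this cutoff all four texts keep their label; raising it to 0.4 would turn the first one (best score about 0.37) into NA.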

Related

R Regex for Positive Look-Around to Match Following

I have a dataframe in R. I want to match with and keep the row if
"woman" is the first or
the second word in a sentence, or
if it is the third word in a sentence and preceded by the words "no," "not," or "never."
phrases_with_woman <- structure(list(phrase = c("woman get degree", "woman obtain justice",
"session woman vote for member", "woman have to end", "woman have no existence",
"woman lose right", "woman be much", "woman mix at dance", "woman vote as member",
"woman have power", "woman act only", "she be woman", "no committee woman passed vote")), row.names = c(NA,
-13L), class = "data.frame")
In the above example, I want to be able to match with all rows except for "she be woman."
This is my code so far. I have a positive lookbehind, "(?<=woman\\s)\\w+", that seems to be on the right track, but it matches with too many preceding words. I tried using {1} to match just one preceding word, but this syntax didn't work.
matches <- phrases_with_woman %>%
filter(str_detect(phrase, "^woman|(?<=woman\\s)\\w+"))
Help is appreciated.

Each condition can be one alternative in the pattern, although the last one requires two alternatives, assuming that no/not/never can be either the first or the second word.
library(dplyr)
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"
phrases_with_woman %>%
filter(grepl(pat, phrase))
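As a quick check of the pattern above, the following base-R sketch applies it to the question's phrases stored as a plain character vector. Passing perl = TRUE is my own choice here so that \\w and \\b are handled by the PCRE engine; the answer itself calls grepl with the default engine.

```r
# The phrases from the question, as a plain character vector.
phrases <- c("woman get degree", "woman obtain justice",
             "session woman vote for member", "woman have to end",
             "woman have no existence", "woman lose right", "woman be much",
             "woman mix at dance", "woman vote as member", "woman have power",
             "woman act only", "she be woman", "no committee woman passed vote")

# Same pattern as in the answer: one alternative per allowed position.
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"

keep <- grepl(pat, phrases, perl = TRUE)

# Phrases that are filtered out.
dropped <- phrases[!keep]
```

Here dropped contains only "she be woman", which is the single row the question wants to exclude.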

I haven't come up with a regex solution but here is a workaround.
library(dplyr)
library(stringr)
phrases_with_woman %>%
filter(str_detect(word(phrase, 1, 2), "\\bwoman\\b") |
(word(phrase, 3) == "woman" & str_detect(word(phrase, 1, 2), "\\b(no|not|never)\\b")))
# phrase
# 1 woman get degree
# 2 woman obtain justice
# 3 session woman vote for member
# 4 woman have to end
# 5 woman have no existence
# 6 woman lose right
# 7 woman be much
# 8 woman mix at dance
# 9 woman vote as member
# 10 woman have power
# 11 woman act only
# 12 no committee woman passed vote

Give string values in vector an auto index

I have two vectors:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
I need to show the slowest person. I did it with if and else if statements, but I wonder if there is an easier way, such as automatically assigning "Slow" = 1, "Average" = 2, and so on. In other words, I want to attach values to them.
At the end there should be a vector like
names_speeds <- c(names_of_p, speeds)
so that I can compare people and find out who is faster.

You could turn speeds into an ordered factor, which would preserve the labeling while also creating an underlying numerical representation:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
speeds <- factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
names_of_p[order(speeds)]
[1] "John" "Robert" "James" "Adam"
names_of_p[as.numeric(speeds) < 3]
[1] "John" "James" "Robert"
It might also be a good idea to store the data in a data frame rather than in separate vectors:
library(tidyverse)
df <- data.frame(
names_of_p = names_of_p,
speeds = factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
)
df %>%
arrange(speeds)
names_of_p speeds
<chr> <ord>
1 John Slow
2 Robert Slow
3 James Average
4 Adam Fast
df %>%
filter(as.numeric(speeds) < 3)
names_of_p speeds
<chr> <ord>
1 John Slow
2 James Average
3 Robert Slow
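Since the question asks for the slowest person and for comparing people, here is a small base-R sketch of what the ordered factor allows directly, without converting to numbers. The variable names slowest, fastest and faster_than_john are only illustrative.

```r
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- factor(c("Slow", "Fast", "Average", "Slow"),
                 levels = c("Slow", "Average", "Fast"), ordered = TRUE)

# min() and max() respect the level order of an ordered factor.
slowest <- names_of_p[speeds == min(speeds)]
fastest <- names_of_p[speeds == max(speeds)]

# Ordered factors also support <, >, <= and >= between elements.
faster_than_john <- names_of_p[speeds > speeds[names_of_p == "John"]]
```

Ties are kept, so slowest holds both "John" and "Robert".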

First assign the names to the vector speeds to get a named vector.
After that you can use which:
names(speeds) <- names_of_p
which(speeds=="Slow")
John Robert
1 4

How to get nearest matching string along with score from column from another table?

I am trying to get the nearest matching string, along with its score, using the stringdist package with method = "jw" (Jaro-Winkler).
The first data frame (df_1) consists of 2 columns. For each string in str_1, I want to get the nearest string from str_2 in df_2 and the score for that match.
I have gone through the package documentation; my data and what I have found so far are shown below:
year = c(2001,2001,2002,2003,2005,2006)
str_1 =c("The best ever Puma wishlist","I finalised on buy a top from Myntra","Its perfect for a day at gym",
"Check out PUMA Unisex Black Running","i have been mailing my issue daily","xyz")
df_1 = data.frame(year,str_1)
ID = c(100,211,155,367,678,2356,927,829,397)
str_2 = c("VeRy4G3c7X","i have been mailing my issue twice","I finalised on buy a top from jobong",
"Qm9dZzJdms","Check out PUMA Unisex Black Running","The best Puma wishlist","Its a day at gym",
"fOpBRWCdSh","")
df_2 = data.frame(ID,str_2)
I need to get the nearest match from the str_2 column of df_2 using
stringdist(a, b, method = "jw")
and the final table would look like this:
df_1$Nearest_matching = c("The best Puma wishlist","I finalised on buy a top from jobong","Its a day at gym","Check out PUMA Unisex Black Running","i have been mailing my issue twice",NA)
df_1$Nearest_matching_score = c(0.099,0.092,0.205,0,0.078,NA)

Here is what I came to based on the documentation of the stringdist package:
First I created a distance matrix between str_1 and str_2, then I assigned column names to it like this:
nearest_matching <- stringdistmatrix(df_1$str_1,df_2$str_2, method = "jw")
colnames(nearest_matching) <- df_2$str_2
Then I selected the smallest value (distance) from each row.
apply(nearest_matching, 1, FUN = min)
output:
> apply(nearest_matching, 1, FUN = min)
[1] 0.09960718 0.09259259 0.20535714 0.00000000 0.07843137 0.52222222
Finally, I wrote out the column names corresponding to these values:
colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
output:
> colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
[1] "The best Puma wishlist" "I finalised on buy a top from jobong" "Its a day at gym"
[4] "Check out PUMA Unisex Black Running" "i have been mailing my issue twice" "VeRy4G3c7X"
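The question expects NA for "xyz", whereas taking the minimum always returns some candidate, here "VeRy4G3c7X" at a distance of about 0.52. One possible way to get the NA, sketched below in base R, is to discard matches whose distance exceeds a cutoff. The values are copied from the two outputs above; the cutoff max.dist = 0.3 is an arbitrary assumption chosen only to separate the sixth row from the rest.

```r
# Nearest strings and their Jaro-Winkler distances, copied from the output above.
nearest <- c("The best Puma wishlist", "I finalised on buy a top from jobong",
             "Its a day at gym", "Check out PUMA Unisex Black Running",
             "i have been mailing my issue twice", "VeRy4G3c7X")
score <- c(0.09960718, 0.09259259, 0.20535714, 0.00000000, 0.07843137, 0.52222222)

# Hypothetical cutoff: anything farther away is treated as "no match".
max.dist <- 0.3
too.far <- score > max.dist

Nearest_matching <- ifelse(too.far, NA_character_, nearest)
Nearest_matching_score <- ifelse(too.far, NA_real_, score)
```

With this cutoff only the "xyz" row becomes NA, which matches the table requested in the question.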

Here is a way to find the closest match and score for each value in df_1$str_1.
library(dplyr)
library(purrr)
library(stringdist)
result <- bind_cols(df_1, map_df(df_1$str_1, function(x) {
vals <- stringdist(x, df_2$str_2, method = 'jw')
# use min(), not max(): the nearest match has the smallest distance
data.frame(Nearest_matching = df_2$str_2[which.min(vals)],
Nearest_matching_score = min(vals))
}))
# year str_1
#1 2001 The best ever Puma wishlist
#2 2001 I finalised on buy a top from Myntra
#3 2002 Its perfect for a day at gym
#4 2003 Check out PUMA Unisex Black Running
#5 2005 i have been mailing my issue daily
#6 2006 xyz
# Nearest_matching Nearest_matching_score
#1 The best Puma wishlist 0.09960718
#2 I finalised on buy a top from jobong 0.09259259
#3 Its a day at gym 0.20535714
#4 Check out PUMA Unisex Black Running 0.00000000
#5 i have been mailing my issue twice 0.07843137
#6 VeRy4G3c7X 0.52222222

How do I count the number of words from a list mentioned in a data frame in R

I have a data frame with a review column and a text column, with multiple rows. I also have a list of words. I want a for loop to examine each row of the data frame and count how many words from the list appear in it. I want to keep each row's count separate and place the results into a new result data frame.
#Data Frame
Review Text
1 I like to run and play.
2 I eat cookies.
3 I went to swim in the pool.
4 I like to sleep.
5 I like to run, play, swim, and eat.
#List Words
Run
Play
Eat
Swim
#Result Data Frame
Review Count
1 2
2 1
3 1
4 0
5 4

Here is a solution for base R, where gregexpr is used for counting occurrences.
Given the pattern as below
pat <- c("Run", "Play", "Eat", "Swim")
then the counts added to the data frame can be made via:
df$Count <- sapply(gregexpr(paste0(tolower(pat),collapse = "|"),tolower(df$Text)),
function(v) ifelse(-1 %in% v, 0,length(v)))
such that
> df
Review Text Count
1 1 I like to run and play 2
2 2 I eat cookies 1
3 3 I went to swim in the pool. 1
4 4 I like to sleep. 0
5 5 I like to run, play, swim, and eat. 4

We can use stringr::str_count after pasting the words together as one pattern.
df$Count <- stringr::str_count(df$Text,
paste0("\\b", tolower(words), "\\b", collapse = "|"))
df
# Review Text Count
#1 1 I like to run and play. 2
#2 2 I eat cookies. 1
#3 3 I went to swim in the pool. 1
#4 4 I like to sleep. 0
#5 5 I like to run, play, swim, and eat. 4
data
df <- structure(list(Review = 1:5, Text = structure(c(2L, 1L, 5L, 4L,
3L), .Label = c("I eat cookies.", "I like to run and play.",
"I like to run, play, swim, and eat.", "I like to sleep.",
"I went to swim in the pool."), class = "factor")), class =
"data.frame", row.names = c(NA, -5L))
words <- c("Run","Play","Eat","Swim")
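The two answers above differ in one detail: only the second wraps each word in \\b word boundaries. The base-R sketch below illustrates why that can matter. The helper count_terms and the sentence in tricky are made up for this illustration, and perl = TRUE is my own choice of regex engine.

```r
text <- c("I like to run and play.", "I eat cookies.",
          "I went to swim in the pool.", "I like to sleep.",
          "I like to run, play, swim, and eat.")
words <- c("Run", "Play", "Eat", "Swim")

# Hypothetical helper: count case-insensitive matches of any word per string.
count_terms <- function(x, terms, whole_words = TRUE) {
  alt <- paste(tolower(terms), collapse = "|")
  pattern <- if (whole_words) paste0("\\b(", alt, ")\\b") else alt
  lengths(regmatches(tolower(x), gregexpr(pattern, tolower(x), perl = TRUE)))
}

counts <- count_terms(text, words)

# A made-up sentence where the words occur only inside longer words.
tricky <- "The runner displayed great speed"
loose <- count_terms(tricky, words, whole_words = FALSE)
strict <- count_terms(tricky, words, whole_words = TRUE)
```

For the question's data both variants give 2, 1, 1, 0, 4. For the made-up sentence, the version without boundaries counts 3 hits ("run" in "runner", "play" in "displayed", "eat" in "great"), while the version with boundaries counts 0.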

Base R solution (note this solution is intentionally case insensitive):
# Create a vector of patterns to search for:
patterns <- c("Run", "Play", "Eat", "Swim")
# Split on the review number, apply a term counting function (for each review number):
df$term_count <- sapply(split(df, df$Review),
function(x){length(grep(paste0(tolower(patterns), collapse = "|"),
tolower(unlist(strsplit(x$Text, "\\s+")))))})
Data:
df <- data.frame(Review = 1:5, Text = as.character(c("I like to run and play",
"I eat cookies",
"I went to swim in the pool.",
"I like to sleep.",
"I like to run, play, swim, and eat.")),
stringsAsFactors = FALSE)

matching text between two different data frames in R

I have the following data in a data frame:
structure(list(`head(ker$text)` = structure(1:6, .Label = c("#_rpg_17 little league travel tourney. These parents about to be wild.",
"#auscricketfan #davidwarner31 yes WI tour is coming soon", "#keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR",
"#NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave",
"Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy",
"Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO"
), class = "factor")), .Names = "head(ker$text)", row.names = c(NA,
-6L), class = "data.frame")
I have another data frame that contains hashtags extracted from the above data frame. It is as follows:
structure(list(destination = c("#topstation", "#destination", "#munnar",
"#Kerala", "#Delhi", "#beach")), .Names = "destination", row.names = c(NA,
6L), class = "data.frame")
I want to create a new column in my first data frame that contains only the tags that also appear in the second data frame. For example, the first line of df1 has no matching hashtags, so its cell in the new column will be blank. The third line, however, contains five hashtags, three of which match the second data frame. I have tried using the:
str_match
str_extract
functions. I came very close to getting this using code given in one of the posts here.
new_col <- ker[unlist(lapply(destn$destination, agrep, ker$text)), ]
I understand that I am getting a list as the output, but I also get this error:
replacement has 1472 rows, data has 644
I have tried setting max.distance to different values, and each gave me a different error. Can someone help me with a solution? One alternative I am considering is to put each hashtag in a separate column, but I am not sure whether that will help me analyse the data further with the other variables that I have. The output I am looking for is as follows:
text         new_col        new_col2   new_col3
statement1
statement2
statement3   #destination   #munnar    #topstation
statement4
statement5   #Kerala
statement6   #Kerala

library(stringi);
df1$tags <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) paste(x[x%in%df2[[1]]],collapse=','));
df1;
## head(ker$text) tags
## 1 #_rpg_17 little league travel tourney. These parents about to be wild.
## 2 #auscricketfan #davidwarner31 yes WI tour is coming soon
## 3 #keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination,#munnar,#topstation
## 4 #NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave
## 5 Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy #Kerala
## 6 Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO #Kerala
Edit: If you want a separate column for each tag:
library(stringi);
m <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) x[x%in%df2[[1]]]);
df1 <- cbind(df1,do.call(rbind,lapply(m,`[`,1:max(sapply(m,length)))));
df1;
## head(ker$text) 1 2 3
## 1 #_rpg_17 little league travel tourney. These parents about to be wild. <NA> <NA> <NA>
## 2 #auscricketfan #davidwarner31 yes WI tour is coming soon <NA> <NA> <NA>
## 3 #keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination #munnar #topstation
## 4 #NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave <NA> <NA> <NA>
## 5 Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy #Kerala <NA> <NA>
## 6 Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO #Kerala <NA> <NA>
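If installing stringi is not an option, the same idea can be sketched in base R with gregexpr and regmatches, a swapped-in technique rather than the answer's own code. The example below uses three of the question's tweets as a plain character vector; the names text, destination and matched are illustrative.

```r
# Three of the tweets from the question, as a plain character vector.
text <- c(
  "#keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR",
  "#auscricketfan #davidwarner31 yes WI tour is coming soon",
  "Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO"
)
destination <- c("#topstation", "#destination", "#munnar", "#Kerala", "#Delhi", "#beach")

# Extract every hashtag from each tweet (one character vector per tweet).
tags <- regmatches(text, gregexpr("#\\w+", text, perl = TRUE))

# Keep only the tags that also appear in the lookup vector and collapse them.
matched <- vapply(tags, function(x) paste(x[x %in% destination], collapse = ","),
                  character(1))
```

The comparison with %in% is exact and case-sensitive, so "#kerala" would not match "#Kerala"; lowercasing both sides first would be one way to relax that.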

You could do something like this:
library(stringr)
results <- sapply(df$`head(ker$text)`,
function(x) { str_match_all(x, paste(df2$destination, collapse = "|")) })
df$matches <- results
If you want to separate the results out, you can use:
df <- cbind(df, do.call(rbind, lapply(results, `[`, 1:max(sapply(results, length)))))
