How to compare two strings word by word in R

I have a dataset, let's call it "ORIGINALE", composed of several rows and only two columns: the first called "DESCRIPTION" and the second "CODICE". The description column has the right information, while the column CODICE, which is the key, is almost always empty, so I'm trying to fill it by searching for the corresponding CODICE in another dataset, let's call it "REFERENCE". I am using the DESCRIPTION column, which is in natural language, and trying to match it with the description in the second dataset. I have to match word by word since there may be a different order of words, synonyms or abbreviations. Then I calculate a similarity score to keep only the best match and accept those above a certain threshold. Is there a way to improve it? I'm working with around 300000 rows and, even though I know it is always going to take time, perhaps there is a way to make it even just slightly faster.
ORIGINALE <- data.frame(DESCRIPTION = c("mr peter 123 rose street 3b LA", " 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE <- data.frame(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67", "aw56", "83776250"))
library(stringdist)   # for stringsim()
library(tibble)

algoritmo <- function(ORIGINALE, REFERENCE) {
  split1 <- strsplit(ORIGINALE$DESCRIPTION, " ")
  split2 <- strsplit(REFERENCE$DESCRIPTION, " ")
  risultato <- vector()
  distanza <- vector()
  for (i in 1:NROW(split1)) {
    best_dist <- -5
    closest_match <- -5
    for (j in 1:NROW(split2)) {
      dist <- stringsim(as.character(split1[i]), as.character(split2[j]))
      if (dist > best_dist) {
        closest_match <- REFERENCE$DESCRIPTION[j]
        best_dist <- dist
      }
    }
    distanza <- append(distanza, best_dist)
    risultato <- append(risultato, closest_match)
  }
  confronto <<- tibble(ORIGINALE$DESCRIPTION, risultato, distanza)
}
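# (assumed usage: call the function once so that `confronto` exists before the subsets below)
algoritmo(ORIGINALE, REFERENCE)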
match <- subset.data.frame(confronto, confronto$distanza >= 0.6)
missing <- subset.data.frame(confronto, confronto$distanza < 0.6)

The R tm (text mining) library can help here:
library(tm)
library(proxy) # for computing cosine similarity
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# combine ORIGINALE and REFERENCE into one data.table
both = rbind(ORIGINALE,REFERENCE)
# create "doc_id" and "text" columns (required by tm)
both[,doc_id:=1:.N]
names(both)[1] = 'text'
# convert to tm corpus
corpus = SimpleCorpus(DataframeSource(both))
# convert to a tm document term matrix
dtm = DocumentTermMatrix(corpus)
# convert to a regular matrix
dtm = as.matrix(dtm)
# look at it (t() transpose for readability)
t(dtm)
Docs
Terms 1 2 3 4 5 6
123 1 0 0 0 1 0
peter 1 0 0 0 1 0
rose 1 0 0 0 1 0
street 1 0 0 1 1 1
chicago 0 1 0 0 0 1
flower 0 1 0 0 0 1
jane 0 1 0 0 0 1
jenny 0 1 0 0 0 1
str 0 1 0 0 0 0
430f 0 0 1 1 0 0
miss 0 0 1 0 0 0
name 0 0 1 1 0 0
sarah 0 0 1 1 0 0
strt 0 0 1 0 0 0
washington 0 0 1 1 0 0
brown 0 0 0 1 0 0
green 0 0 0 0 1 0
# compute similarity between each combination of documents 1:3 and documents 4:6
similarity = proxy::dist(dtm[1:3,], dtm[4:6,], method="cosine")
# result:
          REFERENCE document
ORIGINALE     4         5         6
    1     0.7958759 0.1055728 0.7763932   <-- difference (smaller = more similar)
    2     1.0000000 1.0000000 0.2000000
    3     0.3333333 1.0000000 1.0000000
# make a table of which REFERENCE document is most similar
most_similar = rbindlist(
apply(
similarity,1,function(x){
data.table(i=which.min(x),distance=min(x))
}
)
)
# result:
i distance
1: 2 0.1055728
2: 3 0.2000000
3: 1 0.3333333
# rows 1, 2, 3 are rows of ORIGINALE; i = 2, 3, 1 are rows of REFERENCE
# add the results back to ORIGINALE
ORIGINALE1 = cbind(ORIGINALE,most_similar)
REFERENCE[,i:=1:.N]
ORIGINALE2 = merge(ORIGINALE1,REFERENCE,by='i',all.x=T,all.y=F)
# result:
i DESCRIPTION.x CODICE.x distance DESCRIPTION.y CODICE.y
1: 1 washington miss sarah 430f name strt NA 0.3333333 sarah brown name street 430f washington 135tg67
2: 2 mr peter 123 rose street 3b LA NA 0.1055728 peter green 123 rose street 3b LA aw56
3: 3 4c flower str jenny jane Chicago NA 0.2000000 jenny jane flower street 4c Chicago 83776250
# now the documents in ORIGINALE2 are in a different order than in ORIGINALE.
# this is caused by merging by i (=REFERENCE document row).
# if order is important, then add these two lines around the merge line:
ORIGINALE1[,ORIGINALE_i:=1:.N]
ORIGINALE2 = merge(...
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]
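With around 300,000 descriptions, dtm = as.matrix(dtm) builds a huge dense matrix before proxy::dist can run. A sketch of how you could stay sparse instead, assuming the Matrix and text2vec packages are available (sim2() returns cosine similarity, so larger = more similar, the opposite convention of proxy::dist above):
library(Matrix)
library(text2vec)
# keep the DocumentTermMatrix in its sparse (triplet) form instead of as.matrix()
dtm0 <- DocumentTermMatrix(corpus)
dtm_sparse <- sparseMatrix(i = dtm0$i, j = dtm0$j, x = dtm0$v,
                           dims = c(dtm0$nrow, dtm0$ncol),
                           dimnames = dtm0$dimnames)
n_orig <- nrow(ORIGINALE)
# cosine similarity of every ORIGINALE document against every REFERENCE document
sim <- sim2(dtm_sparse[1:n_orig, ], dtm_sparse[-(1:n_orig), ], method = "cosine")
best <- apply(as.matrix(sim), 1, which.max)   # index of the most similar REFERENCE row
This is still n_ORIGINALE x n_REFERENCE comparisons, so for the full dataset you would likely process ORIGINALE in blocks of rows, but it avoids materializing a dense document-term matrix.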

Good question. for loops are slow in R:
for(i in 1:NROW(split1)) {
for(j in 1:NROW(split2)) {
For fast R, you need to vectorize your algorithm. I'm not that handy with data.frame anymore, so I'll use its successor, data.table.
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# split DESCRIPTION to make tables that have one word per row
ORIGINALE_WORDS = ORIGINALE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
REFERENCE_WORDS = REFERENCE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
# remove empty words introduced by extra spaces in your DESCRIPTIONS
ORIGINALE_WORDS = ORIGINALE_WORDS[word!='']
REFERENCE_WORDS = REFERENCE_WORDS[word!='']
# merge the tables by word
merged = merge(ORIGINALE_WORDS,REFERENCE_WORDS,by='word',all=F,allow.cartesian=T)
# count matching words for each combination of ORIGINALE DESCRIPTION and REFERENCE DESCRIPTION and CODICE
counts = merged[,.N,.(DESCRIPTION.x,DESCRIPTION.y,CODICE.y)]
# keep only the highest N CODICE.y for each DESCRIPTION.x
topcounts = counts[order(-N)][!duplicated(DESCRIPTION.x)]
# merge the counts back to ORIGINALE
result = merge(ORIGINALE,topcounts,by.x='DESCRIPTION',by.y='DESCRIPTION.x',all.x=T,all.y=F)
Here is the result:
DESCRIPTION CODICE DESCRIPTION.y CODICE.y N
1: 4c flower str jenny jane Chicago NA jenny jane flower street 4c Chicago 83776250 5
2: mr peter 123 rose street 3b LA NA peter green 123 rose street 3b LA aw56 6
3: washington miss sarah 430f name strt NA sarah brown name street 430f washington 135tg67 4
PS: There are more memory-efficient ways to do this; with large data the cartesian merge could exhaust RAM (or slow to a crawl by swapping to virtual memory), but if it fits, it should be much faster than the for loops.
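For example, a minimal chunking sketch, assuming data.table is loaded and REFERENCE_WORDS is built as above (the chunk size is an assumption to tune to your RAM):
chunk_size <- 10000                                      # assumption: tune to your machine
ORIGINALE[, chunk := ceiling(seq_len(.N) / chunk_size)]
topcounts <- rbindlist(lapply(split(ORIGINALE, by = "chunk"), function(block) {
  block_words <- block[, .(word = unlist(strsplit(DESCRIPTION, " ", fixed = TRUE))),
                       by = .(DESCRIPTION, CODICE)][word != ""]
  merged <- merge(block_words, REFERENCE_WORDS, by = "word",
                  all = FALSE, allow.cartesian = TRUE)
  counts <- merged[, .N, by = .(DESCRIPTION.x, DESCRIPTION.y, CODICE.y)]
  counts[order(-N)][!duplicated(DESCRIPTION.x)]          # best match per description in this chunk
}))
ORIGINALE[, chunk := NULL]
This bounds the size of the cartesian merge to one chunk of ORIGINALE at a time, at the cost of a little bookkeeping.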

What about:
library(stringdist)
library(dplyr)
library(tidyr)
data_o <- ORIGINALE %>% mutate(desc_o = DESCRIPTION) %>% select(desc_o)
data_r <- REFERENCE %>% mutate(desc_r = DESCRIPTION) %>% select(desc_r)
data <- crossing(data_o,data_r)
data %>%
  mutate(dist = stringsim(as.character(desc_o), as.character(desc_r))) %>%
  group_by(desc_o) %>%
  filter(dist == max(dist))
desc_o desc_r dist
<chr> <chr> <dbl>
1 " 4c flower str jenny jane Chicago" jenny jane flower street 4c Chicago 0.486
2 "mr peter 123 rose street 3b LA" peter green 123 rose street 3b LA 0.758
3 "washington miss sarah 430f name strt" sarah brown name street 430f washington 0.385

Related

searching for text and storing results in new columns within the dataframe

I have a data frame (df1) with one column, with each entry/row/observation consisting of a long string of text (df1$text). In a separate data frame (df2) I have one column, with each entry/row/observation consisting of a single name (df2$name).
I would like to note, for each row in df1, which of the names in df2$name appear in the text. Ideally, I'd like to store whether a name appears in df1$text as a 1/0 value in a new column in df1 (i.e. a dummy variable) named for that name:
> df1
text
1 ...
2 ...
3 ...
4 ...
> df2
name
1 John
2 James
3 Jerry
4 Jackson
After code is executed:
> df1
text John James Jerry Jackson
1 ... 1 1 0 1
2 ... 0 0 0 1
3 ... 1 1 0 1
4 ... 1 0 0 1
Is there a way to do this without using a for loop? My text fields are long and I have many observations in both df1 and df2.
A base R option using lapply -
df1[df2$name] <- lapply(df2$name, function(x) +(grepl(x, df1$text)))
If you want the match to be case insensitive then add ignore.case = TRUE in grepl.
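Note that grepl() matches substrings, so a name like "Jack" would also hit "Jackson". If that matters, a hedged variant of the same one-liner that anchors on word boundaries:
df1[df2$name] <- lapply(df2$name, function(x) +(grepl(paste0("\\b", x, "\\b"), df1$text)))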
You did not provide a reproducible example, so I made dummy data df1 myself, like
df1 <- data.frame(
text = c("John James John Jakson",
"Jackson abcd zxcv",
"John Jackson James Jerr aa",
"John Jackson JAJAJAJA")
)
text
1 John James John Jakson
2 Jackson abcd zxcv
3 John Jackson James Jerr aa
4 John Jackson JAJAJAJA
Then, you may try using dplyr like
library(dplyr)
df1 %>%
  mutate(John = as.numeric(grepl("John", text)),
         James = as.numeric(grepl("James", text)),
         Jerry = as.numeric(grepl("Jerry", text)),
         Jackson = as.numeric(grepl("Jackson", text)))
text John James Jerry Jackson
1 John James John Jakson 1 1 0 0
2 Jackson abcd zxcv 0 0 0 1
3 John Jackson James Jerr aa 1 1 0 1
4 John Jackson JAJAJAJA 1 0 0 1

Weighting a String Distance Metric based on regular expressions

Is it possible to weight a string distance metric such as the Damerau-Levenshtein distance where the weight changes based on the character type?
I am looking to create a fuzzy match of addresses and need to weight numbers and letters differently so that an address like:
"5 James Street" and "5 Jmaes Street" are considered identical and
"5 James Street" and "6 James Street" are considered different.
I considered splitting the addresses into numbers and letters prior to applying the string distance; however, this will miss flats at "5a" and "5b". The ordering is also not consistent across the data set, so one entry may be "James Street 5".
I am using R with the stringdist package currently but not restricted to these.
Thanks!
Here's an idea. It involves a bit of manual processing, but it might be a good starting point. First, we compute the approximate string distance between the addresses using adist() (or stringdist() with the method best suited to your data), without paying attention to street numbers.
m <- adist(v)
rownames(m) <- v
> m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#5 James Street 0 2 3 1 4 17 17
#5 Jmaes Street 2 0 4 3 6 17 17
#5#Jam#es Str$eet 3 4 0 4 6 17 17
#6 James Street 1 3 4 0 4 17 17
#James Street 5 4 6 6 4 0 16 17
#10a Cold Winter Road 17 17 17 17 16 0 1
#10b Cold Winter Road 17 17 17 17 17 1 0
In this case, we can clearly identify the two clusters, but we could also use hclust() to visualize a dendrogram.
cl <- hclust(as.dist(m))
plot(cl)
rect.hclust(cl, 2)
Then, we assign each address to its corresponding cluster of similarities, iterate through them, and check for matching street numbers.
library(dplyr)
res <- data.frame(cluster = cutree(cl, 2)) %>%
tibble::rownames_to_column("address") %>%
mutate(
# Extract all components of the address
lst = stringi::stri_extract_all_words(address),
# Identify the component containing the street number and return it
num = purrr::map_chr(lst, .f = ~ grep("\\d+", .x, value = TRUE))) %>%
# For each cluster, tag matching street numbers
mutate(group = group_indices_(., .dots = c("cluster", "num")))
Which gives:
# address cluster lst num group
#1 5 James Street 1 5, James, Street 5 1
#2 5 Jmaes Street 1 5, Jmaes, Street 5 1
#3 5#Jam#es Str$eet 1 5, Jam, es, Str, eet 5 1
#4 6 James Street 1 6, James, Street 6 2
#5 James Street 5 1 James, Street, 5 5 1
#6 10a Cold Winter Road 2 10a, Cold, Winter, Road 10a 3
#7 10b Cold Winter Road 2 10b, Cold, Winter, Road 10b 4
You could then pull() the unique addresses based on group using distinct():
> distinct(res, group, .keep_all = TRUE) %>% pull(address)
#[1] "5 James Street" "6 James Street" "10a Cold Winter Road"
# "10b Cold Winter Road"
Data
v <- c("5 James Street", "5 Jmaes Street", "5#Jam#es Str$eet", "6 James Street",
"James Street 5", "10a Cold Winter Road", "10b Cold Winter Road")

Remove observations from DF if duplicate in specific columns while other columns must differ

I have a large data frame with multiple columns and many rows (200k). I order the rows by a group variable, and each group can have one or more entries. The other columns for each group should have identical values; however, in some cases they don't. It looks like this:
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
I want to delete all entries of a group if age or city is not identical for all rows of the group (indication of observation error). Otherwise, I want to keep all the entries.
The output I'm hoping for would be:
group name age color city
2 Martin 78 black LA
2 Martin 78 blue LA
4 Jake 33 blue NJ
The closest I have come is this:
dup <- df[ duplicated(df[,c("group","name","color")]) | duplicated(df[,c("group","name","color")],fromLast=TRUE) ,"group"]
df_nodup <- df[!(df$group %in% dup),]
However, this is far from doing everything that I need.
P.S.: I had the same question answered for Python/pandas; I'd like to have a solution for R as well.
/e: While Frank's answer was helpful to understand the principle of a solution and his second suggestion worked, it was very slow (took ~15 min on my df). user20650's answer was harder to comprehend, but runs tremendously faster (~10 sec).
A similar approach to Frank's: you can count the number of unique combinations of age and city within each group using ave, then subset your data to keep only the groups where that count equals one.
# your data
df <- read.table(text="group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ ", header=T)
# calculate and subset
df[with(df, ave(paste(age, city), group, FUN=function(x) length(unique(x))))==1,]
# group name age color city
# 4 2 Martin 78 black LA
# 5 2 Martin 78 blue LA
# 8 4 Jake 33 blue NJ
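Since speed was the concern on 200k rows, a data.table version of the same idea might look like this (a sketch, assuming df is the example data frame above; uniqueN() counts distinct values within each group):
library(data.table)
dt <- as.data.table(df)
# keep a group only if all of its rows share a single age and a single city
dt[, if (uniqueN(age) == 1 && uniqueN(city) == 1) .SD, by = group]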
Here is an approach:
temp <- tapply(df$group, list(df$name, df$age, df$city), unique)
temp[!is.na(temp)] <- 1
keepers <- names(which(apply(temp, 1, sum, na.rm=TRUE)==1))
df[df$name %in% keepers, ]
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
Alternate, slightly simpler approach:
temp2 <- unique(df[,c('name','age','city')])
keepers2 <- names(which(tapply(temp2$name, temp2$name, length)==1))
df[df$name %in% keepers2, ]
# group name age color city
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
Here's an approach using dplyr:
df <- read.table(text = "
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
", header = TRUE)
library(dplyr)
df %>%
group_by(group) %>%
filter(n_distinct(age) == 1 && n_distinct(city) == 1)
I think it's pretty easy to see what's going on - you group, then filter to keep groups when there is only one distinct age and city.

How to produce an R count matrix

In R, I can return the count results for the specific column names I am interested in, as below.
require("plyr")
bevs <- data.frame(cbind(name = c("Bill", "Llib"), drink = c("coffee", "tea", "cocoa", "water"), cost = seq(1:8)))
count(bevs, c("name", "drink"))
# produces
name drink freq
1 Bill cocoa 2
2 Bill coffee 2
3 Llib tea 2
4 Llib water 2
How can I get the count result of two specific columns as a matrix whose columns are all unique drinks, rows are all unique names, and cells are the frequencies (like below)?
cocoa coffee tea water
Bill 2 2 0 0
Llib 0 0 2 2
P.S: Obviously, the solution does not need to use plyr.
You want a contingency table, which you can create using table:
table(bevs[, c("name", "drink")])
# drink
#name cocoa coffee tea water
# Bill 2 2 0 0
# Llib 0 0 2 2
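table() returns a contingency table object. If you need a plain matrix or a wide data frame instead, or prefer a formula interface, these base R variants should give the same counts (a sketch):
# formula interface, same contingency table
xtabs(~ name + drink, data = bevs)
# plain integer matrix (drop the "table" class)
unclass(table(bevs[, c("name", "drink")]))
# wide data.frame with names as row names and drinks as columns
as.data.frame.matrix(table(bevs[, c("name", "drink")]))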

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see, the value "Jane" appears 3 times. What I want to do is deduplicate the list based on the variable "Name", but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example, if I were to deduplicate the above file in Excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no. 3) has missing information in the other columns.
So in other words I want to deduplicate the list by "Name" but add a criteria to keep the rows that have any other value different from "0" in the column "Age". This way the result I would get would be this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like Excel, it keeps the row of "Jane" that has no usable information in the other columns.
How do I sort the rows based on column "Age" in a Z-A order so that anything that has "0" will be on the bottom and will be removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort decreasingly by Age and Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select the correct rows meeting your conditions by indexing; I really think the first approach is better than these last two.
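To do exactly what the question describes (sort Z-A by Age so the zero rows sink to the bottom, then deduplicate by Name), a base R sketch on the same DF:
DF2 <- DF[order(DF$Age, decreasing = TRUE), ]   # zeros end up last
DF2[!duplicated(DF2$Name), ]                    # keeps Mark 35, Jane 30, John 25, Kate 18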
If all the duplicated rows you want to drop have the value zero in column Age, subset will work:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18
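A dplyr alternative that keeps the highest-Age row per name, whether or not the unwanted duplicates are exactly zero (a sketch; slice_max() requires dplyr >= 1.0.0):
library(dplyr)
file1 %>%
  group_by(Name) %>%
  slice_max(Age, n = 1, with_ties = FALSE) %>%
  ungroup()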
