Approximate string matching within a single list - r

I have a data frame containing a long list of thousands of names. Many of the names differ only slightly from one another. I would like to find a way to match these names. For example:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')
I've looked at amatch from the stringdist package, as well as agrep, but these all require a master list of names to match another list of names against. In my case, I don't have such a master list, so I'd like to create one from the data by identifying names with highly similar patterns; I can then look at them and decide whether they're the same person (which in many cases they are). I'd like an output in a new column that flags likely matches, and maybe some sort of similarity score based on Levenshtein distance or similar. Maybe something like this:
            names match SimilarityScore
1       jon smith     a               9
2      jon, smith     a               8
3       Jon Smith     a               9
4 jon smith et al     a               5
5       bob seger     b               9
6      bob, seger     b               8
7     bobby seger     b               7
8   bob seger jr.     b               5
Is something like this possible?

Drawing on the post found here, I have found that hierarchical text clustering does what I'm looking for.
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied')
# Levenshtein Distance
e <- adist(names)
rownames(e) <- names
hc <- hclust(as.dist(e))
plot(hc)
rect.hclust(hc,k=3) #the k value provides the number of clusters
df <- data.frame(names,cutree(hc,k=3))
The output looks really good if you pick the right number of clusters (three in this case):
                          names cutree.hc..k...3.
jon smith             jon smith                 1
jon, smith           jon, smith                 1
Jon Smith             Jon Smith                 1
jon smith et al jon smith et al                 1
bob seger             bob seger                 2
bob, seger           bob, seger                 2
bobby seger         bobby seger                 2
bob seger jr.     bob seger jr.                 2
jake                       jake                 3
jakey                     jakey                 3
jack                       jack                 3
jakeyfied             jakeyfied                 3
However, names are oftentimes more complex than this, and after adding a few more difficult names, I found that the default adist options didn't give the best clustering:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied','1234 ranch','5678 ranch','9983','7777')
d <- adist(names)
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
I was able to improve on this by increasing the substitution cost to 2, leaving the insertion and deletion costs at 1, and ignoring case. This helped to minimize the mistaken grouping of totally different four-character number strings, which I didn't want grouped:
d <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2)) #i=insertion, d=deletion s=substitution
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
I further fine-tuned the clustering by removing common terms such as "ranch" and "et al" with base R's gsub function, and by increasing the number of clusters by one:
names<-gsub("ranch","",names)
names<-gsub("et al","",names)
d <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2))
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=7)
Although there are methods to let the data choose the best number of clusters instead of picking it manually, I found it easiest to use trial and error; there is information here about the automated approach.
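A minimal sketch of that automated alternative: instead of fixing k, cut the dendrogram at a distance threshold so the data determines the cluster count. The threshold h = 3 (edits) is an assumed value you would tune for your names:
d <- adist(names, ignore.case = TRUE, costs = c(i = 1, d = 1, s = 2))
rownames(d) <- names
hc <- hclust(as.dist(d))
# h is a height (distance) cutoff rather than a fixed cluster count
data.frame(names, cluster = cutree(hc, h = 3))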

The suggestion by Roman in the comments about natural language processing is probably the best place to start. But for a back-of-the-envelope approach you can look at the distance in terms of ASCII codes:
mynames <- c("abcd efghijkl mn", "zbcd efghijkl mn", "bbcd efghijkl mn", "erqe")

# Convert each string to a vector of ASCII codes
asc <- function(x) strtoi(charToRaw(x), 16L)
namesToChar <- sapply(mynames, asc)

# Pad the shorter vectors with -1 so all rows have equal length
maxLength <- max(unlist(lapply(namesToChar, length)))
namesToChar <- lapply(namesToChar, function(x) c(x, rep(-1, times = maxLength - length(x))))
namesToChar <- do.call("rbind", namesToChar)

dist(namesToChar, method = "euclidean")
dist(namesToChar, method = "canberra")
Though it seems to give reasonable numbers for the sample,
> dist(namesToChar, method = "manhattan")
                 abcd efghijkl mn zbcd efghijkl mn bbcd efghijkl mn
zbcd efghijkl mn               25
bbcd efghijkl mn                1               24
erqe                          257              274              256
this approach suffers from the fact that there does not seem to be an adequate distance method in the dist function for what you want to do. Perhaps an element-wise binary comparison followed by a more standard distance ('manhattan' seems closest to your needs)? You could always implement that yourself, of course. Also, padding with -1 is a hack here; you would want to replace it with the average ASCII code of your sample if you decide to go this route.
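A quick sketch of that element-wise idea (my own helper, not part of the answer above): mark each character position where two padded names differ and sum the markers, which amounts to a manhattan distance on binary vectors:
binary_dist <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  # Count positions where the two code vectors disagree
  for (i in seq_len(n - 1)) for (j in (i + 1):n)
    d[j, i] <- d[i, j] <- sum(m[i, ] != m[j, ])
  as.dist(d)
}
binary_dist(namesToChar)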
For a similarity score against the overall population you can take the inverse of the average distance to each other word.
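For instance (a sketch, continuing from namesToChar above):
dm <- as.matrix(dist(namesToChar, method = "manhattan"))
avg_dist <- rowSums(dm) / (nrow(dm) - 1) # diagonal is 0, so just divide by n-1
similarity_score <- 1 / avg_dist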

Related

Filter where there are at least two pattern matches

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.
This is further complicated by the fact that some of the patterns already are an either/or, for example something like "paul|john".
I think I either want an expression that would mean directly to subset on that basis, or alternatively if I could count the number of times the patterns occur I could then use that as a tool to subset. I've seen ways to count the number of times patterns occur but not where the info is clearly linked to the IDs in the original dataset, if that makes sense.
At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!
Example data
text_table <- data.table(ID = (1:5), text = c("lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
With the example data, I would want IDs 1 and 3 in the subsetted data.
Thanks for your help!
We can paste the 'text_patterns' together with |, use that as the pattern in str_count (from the stringr package) to count the matching substrings, and check whether the count is greater than 1 to filter the rows of the data.table:
library(data.table)
library(stringr)
text_table[str_count(text, paste(text_patterns, collapse="|")) > 1]
# ID text
#1: 1 lucy, sarah and paul live on the same street
#2: 3 lucy and sarah are cousins
#3: 5 paul and john have known each other a long time
Update
If we need to count each 'text_pattern' at most once, we loop through the patterns, check whether each is present (str_detect), add the results up with Reduce(`+`, ...), and use the sum to create the logical vector for subsetting rows:
i1 <- text_table[, Reduce(`+`, lapply(text_patterns,
function(x) str_detect(text, x))) >1]
text_table[i1]
# ID text
#1: 1 lucy, sarah and paul live on the same street
#2: 3 lucy and sarah are cousins

r string match exact words [duplicate]

This question already has answers here: Matching multiple patterns (6 answers). Closed 6 years ago.
I have a column in a dataframe where each row contains a bunch of names separated by commas, as shown below:
Col1
----------------------------------------------------
Missy Monroe, Andy Dalton P, Deny Grove, Easton West
Susan Schmidt, Bella Blu, Dennis Lee H, Georges Madison
Maya Unger, Kal Rapinsky, Richard Izzo, Rob Kolfax
Bismark Bison, Twyla Yellow Bird Bell, Yost Jefferson
I am searching for three names in this column: Missy Monroe, Dennis Lee, or Bismark Bison. If any one of these names is found, the value Yes should be placed in the second column; if none of them is found, the value should be No. The final output should be as follows:
Col1 Results
----------------------------------------------------------------------
Missy Monroe, Andy Dalton P, Deny Grove, Easton West Yes
Susan Schmidt, Bella Blu, Dennis Lee H, Georges Madison Yes
Maya Unger, Kal Rapinsky, Richard Izzo, Rob Kolfax No
Bismark Bison, Twyla Yellow Bird Bell, Yost Jefferson Yes
Any help on accomplishing this is much appreciated.
This should work for a data frame df:
df$Results <- ifelse(grepl("(Missy Monroe|Dennis Lee|Bismark Bison)",
df$Col1), "Yes", "No")
The grepl function returns TRUE or FALSE, which is a perfect input for ifelse.
As #david-arenburg notes, if you are planning to use this column for additional data analysis, it is probably better to construct it as a logical vector rather than a character vector. In this case,
df$Results <- grepl("(Missy Monroe|Dennis Lee|Bismark Bison)", df$Col1)
will suffice.
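As a quick check against the example data (reconstructed here as a plain data frame):
df <- data.frame(Col1 = c(
  "Missy Monroe, Andy Dalton P, Deny Grove, Easton West",
  "Susan Schmidt, Bella Blu, Dennis Lee H, Georges Madison",
  "Maya Unger, Kal Rapinsky, Richard Izzo, Rob Kolfax",
  "Bismark Bison, Twyla Yellow Bird Bell, Yost Jefferson"))
df$Results <- ifelse(grepl("(Missy Monroe|Dennis Lee|Bismark Bison)",
                           df$Col1), "Yes", "No")
df$Results
# [1] "Yes" "Yes" "No"  "Yes"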

R - Combinations of a dataframe with constraints

I'm trying to make an R script for fantasy football (proper UK football, not hand egg :-)) where I can input a list of players in a csv and it will spit out every 11-player combination that meets various constraints.
Here's my sample dataframe:
df <- read.csv("Filename.csv",
header = TRUE)
> print(df)
Name Positon Team Salary
1 Eric Dier D TOT 9300000
2 Erik Pieters D STO 9200000
3 Christian Fuchs D LEI 9100000
4 Héctor Bellerín D ARS 9000000
5 Charlie Daniels D BOU 9000000
6 Ben Davies D TOT 8900000
7 Federico Fernández D SWA 8800000
8 Per Mertesacker D ARS 8800000
9 Alberto Moreno D LIV 8700000
10 Chris Smalling D MUN 8700000
11 Seamus Coleman D EVE 8700000
12 Jan Vertonghen D TOT 8700000
13 Romelu Lukaku F EVE 12700000
14 Harry Kane F TOT 12500000
15 Max Gradel F BOU 11900000
16 Alexis Sánchez F ARS 11300000
17 Jamie Vardy F LEI 11200000
18 Theo Walcott F ARS 10700000
19 Olivier Giroud F ARS 10700000
20 Wilfried Bony F MCI 10000000
21 Kristoffer Nordfeldt G SWA 7000000
22 Joe Hart G MCI 6800000
23 Jack Rose G WBA 6600000
24 Asmir Begovic G CHE 6600000
25 Mesut Özil M ARS 15600000
26 Riyad Mahrez M LEI 15200000
27 Ross Barkley M EVE 13300000
28 Dimitri Payet M WHM 12800000
29 Willian M CHE 12500000
30 Bertrand Traore M CHE 12500000
31 Kevin De Bruyne M MCI 12400000
And the constraints are as follows:
1) The total salary of each 11-player lineup cannot exceed 100,000,000
2) There can only be a maximum of four players from one team, e.g. four players from 'CHE' (Chelsea).
3) There is a limit on how many players in each 11-player lineup can come from each position. There can be:
1 G (goalkeeper), 3 to 4 D (defenders), 3 to 5 M (midfielders), 1 to 3 F (forwards)
I'd like every 11-player combination that meets the above constraints to be returned. Order is not important (e.g. 1,2,3 is considered the same as 2,1,3 and shouldn't be duplicated) and a player can appear in more than one lineup.
I've done a fair bit of research and played around but can't seem to get anywhere with this. I'm new to R. I don't expect anyone to nail this for me, but if someone could point a newbie like myself in the right direction it would be much appreciated.
Thanks.
This can be solved as a linear integer program using the lpSolve package.
Problems of this kind are very tractable -- contrary to what has been written before -- because typically the number of solutions is much smaller than the domain size.
For each player you add a zero-one variable indicating whether or not that player is in the team.
The package can be installed using
install.packages("lpSolve")
install.packages("lpSolveAPI")
The documentation can be found at: https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf
The first constraint is that the player variables sum to 11.
The salary constraint is the sum of all player variables multiplied by the salary column, and so on.
To get proper solutions you need to specify all.bin=TRUE in the call to lp(), so that all variables referring to players are either zero or one.
(I understood that you are trying to learn; that's why I refrain from giving a full solution.)
EDIT
I got down-voted, probably for not giving the full solution. Kind of sad, as the original author explicitly wrote that he doesn't expect a full solution.
library(lpSolve)
df <- read.csv("/tmp/football.csv",header = TRUE,sep=";")
f.obj <- rep(1, nrow(df))               # maximise the number of players picked
f.con <- matrix(c(rep(1, nrow(df)),     # row 1: squad size
                  as.vector(df$Salary), # row 2: total salary
                  (df$Positon == "G") * 1.0,
                  (df$Positon == "D") * 1.0,
                  (df$Positon == "D") * 1.0,
                  (df$Positon == "M") * 1.0,
                  (df$Positon == "M") * 1.0,
                  (df$Positon == "F") * 1.0,
                  (df$Positon == "F") * 1.0),
                nrow = 9, byrow = TRUE)
f.dir <- c("==", "<=","==",">=","<=",">=","<=",">=","<=")
f.rhs <- c(11,        # number of players
           100000000, # salary cap
           1,         # goalkeepers
           3,         # defenders, min
           4,         # defenders, max
           3,         # midfielders, min
           5,         # midfielders, max
           1,         # forwards, min
           3)         # forwards, max
solutions <- lp("max", f.obj, f.con, f.dir, f.rhs, all.bin = TRUE)
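A usage sketch (not part of the original answer): the 0/1 vector in solutions$solution marks the chosen players, so it can index the data frame to pull out the squad and sanity-check the constraints:
squad <- df[solutions$solution == 1, ]
nrow(squad)          # should be 11
sum(squad$Salary)    # should be <= 100000000
table(squad$Positon) # should respect the position limits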
I didn't add the team constraint as it wouldn't have provided any additional insight here.
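For reference, a hedged sketch of what that team constraint could look like (an assumed extension, not from the original answer): one extra 0/1 indicator row per team, each capped at 4 selected players.
team_rows <- t(sapply(unique(df$Team), function(tm) (df$Team == tm) * 1.0))
f.con <- rbind(f.con, team_rows)
f.dir <- c(f.dir, rep("<=", nrow(team_rows)))
f.rhs <- c(f.rhs, rep(4, nrow(team_rows)))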
EDIT 2
This might come in handy if you change your data set:
R lpsolve binary find all possible solutions
A brute-force way to tackle this (which is also beautifully parallelizable and guarantees you all possible combinations) is to calculate all 11-player combinations and then filter out the ones that don't conform to your limits in a stepwise manner.
To make a program like this fit into your computer's memory, give each player a unique integer ID and create vectors of IDs as team sets. When you then implement your filters your functions can refer to the player info by that ID in a single dataframe.
Say df is your data frame with all player data.
df$id <- 1:nrow(df)
Get all combinations of ids:
# This will take a long time or run out of memory!
# On my 2.8GHz laptop it took 466 seconds just for your 31 players
teams <- combn(df$id, 11)
Of course, if your dataframe is big (like hundreds of players) this implementation could take impossibly long to finish. You would probably be better off just sampling 11-sets from your player set without replacement and constructing teams in an "on demand" fashion.
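A hedged sketch of that on-demand idea, generating random candidate teams instead of an exhaustive list:
# Draw one random 11-player team; repeat as many times as you need
random_team <- function(df) sort(sample(df$id, 11))
candidate_teams <- replicate(1000, random_team(df)) # one team per column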
A more clever way is to partition your dataset according to player position (one subset for goalkeepers, one for defenders, etc.), then use the above approach to create combinations of players from each position and combine the results, as in the sketch below. It would take far less time, would still be parallelizable, and would remain exhaustive (giving you all possible combinations).
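A minimal sketch of that idea, using the question's Positon column as-is and showing a single formation (1 G, 4 D, 3 M, 3 F); the other allowed formations would be handled the same way:
gk  <- combn(df$id[df$Positon == "G"], 1, simplify = FALSE)
def <- combn(df$id[df$Positon == "D"], 4, simplify = FALSE)
mid <- combn(df$id[df$Positon == "M"], 3, simplify = FALSE)
fwd <- combn(df$id[df$Positon == "F"], 3, simplify = FALSE)
# Cross the per-position sets; each row of idx indexes one full team
idx <- expand.grid(g = seq_along(gk), d = seq_along(def),
                   m = seq_along(mid), f = seq_along(fwd))
teams <- apply(idx, 1, function(r)
  c(gk[[r[1]]], def[[r[2]]], mid[[r[3]]], fwd[[r[4]]]))
# Still large, but orders of magnitude smaller than combn(31, 11)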

Consolidate data table factor levels in R

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to try and consider similar names as one factor level. I have learned the syntax to manually do this, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect as I can't anticipate every type of permutation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)
#      1  2  3  4  5
# 2    1
# 3    9 10
# 4    6  7 15
# 5   16 16 16 18
# 6   15 15 15 17  1
For more help, see ?stringdistmatrix or do searches on StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
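One hedged way to go from the distance matrix to consolidated levels is to reuse the hierarchical-clustering idea from the first answer; the cutoff h = 5 is an assumed value you would tune to your data:
hc <- hclust(stringdistmatrix(x))
groups <- cutree(hc, h = 5)            # assumed cutoff; tune per dataset
canonical <- tapply(x, groups, `[`, 1) # first name in each group
data.frame(x, level = canonical[groups])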

Create a unique ID by fuzzy matching of names (via agrep using R)

Using R, I am trying to match people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy-match names.
A sample chunk of the dataset is structured as follows:
df <- data.frame(matrix( c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013", "1996","1996","1996","1996","2000","2000","2004","2004","AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA","PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO","SEBASTIAO BOCALOM RODRIGUES","JOAO DE ALMEIDA","PAULO CESAR FERREIRA DE ARAUJO"), ncol=3,dimnames=list(seq(1:8),c("citycode","year","candidate")) ))
The neat version:
citycode year candidate
1 1200013 1996 AGUSTINHO FORTUNATO FILHO
2 1200013 1996 ANTONIO PEREIRA NETO
3 1200013 1996 FERNANDO JOSE DA COSTA
4 1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5 1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6 1200013 2000 SEBASTIAO BOCALOM RODRIGUES
7 1200013 2004 JOAO DE ALMEIDA
8 1200013 2004 PAULO CESAR FERREIRA DE ARAUJO
I'd like to check, separately for each city, whether any candidates appear in several years. E.g. in the example,
PAULO CEZAR FERREIRA DE ARAUJO
PAULO CESAR FERREIRA DE ARAUJO
appears twice (with a spelling mistake). Each candidate across the entire data set should be assigned a unique numeric candidate ID. The dataset is fairly large (5,500 cities, approx. 100K entries), so somewhat efficient code would be helpful. Any suggestions as to how to implement this?
EDIT: Here is my attempt (with help from the comments thus far); it is very slow (inefficient) at achieving the task at hand. Any suggestions as to improvements?
f <- function(x) {matches <- lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE)
levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
x
}
temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)
EDIT 2: Now running at good speed. The problem was comparing against the factor levels of the whole dataset at every step (thanks for pointing that out, Blue Magister). Reducing the comparison to only the candidates in one group (i.e. a city) runs the command in 5 seconds for 80,000 lines - a speed I can live with.
df$candidate <- as.character(df$candidate)
f <- function(x) {x <- as.factor(x)
matches <- lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE)
levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
as.character(x)
}
temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)
Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that df$candidate is of class factor.
#fuzzy matches candidate names to other candidate names
#compares each pair of names only once
##by looking at names that have a greater index
matches <- unlist(lapply(1:(length(levels(df[["candidate"]]))-1),
function(x) {max(x,x + agrep(
pattern=levels(df[["candidate"]])[x],
x=levels(df[["candidate"]])[-seq_len(x)]
))}
))
#assigns new levels (omits the last level because that doesn't change)
levels(df[["candidate"]])[-length(levels(df[["candidate"]]))] <-
levels(df[["candidate"]])[matches]
OK, given that the focus is on efficiency, I'd suggest the following.
First, from first principles we can predict that exact matching will be much faster than grep, which in turn will be faster than fuzzy grep. So exact-match first, then fuzzy-grep the remaining observations.
Second, vectorize and avoid loops. The apply commands aren't necessarily faster, so stick to native vectorization if you can. All the grep commands are natively vectorized, but it's going to be hard to avoid a *ply or loop to compare each element to the vector of others to match to.
Third, use outside information to narrow the problem down. Do fuzzy matching on names only within each city or state, which will dramatically reduce the number of comparisons which must be made, for instance.
You can combine the first and third principles: you might even try exact matching on the first character of each string, then fuzzy matching within that, as in the sketch below.
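A rough sketch combining those ideas (the assign_ids helper is hypothetical, not from the original answer): collapse exact duplicates cheaply with match, then run agrep only over the unique names that remain:
assign_ids <- function(nms) {
  canon <- unique(nms)            # exact pass: identical strings share an ID
  ids <- match(nms, canon)
  for (i in seq_along(canon)) {   # fuzzy pass over the unique names only
    hits <- agrep(canon[i], canon[-seq_len(i)], fixed = TRUE)
    if (length(hits)) ids[ids %in% (hits + i)] <- i
  }
  match(ids, unique(ids))         # renumber to consecutive integers
}
df$candidateID <- assign_ids(as.character(df$candidate))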
