Record linking and fuzzy name matching in big datasets in R

I'm trying to merge two large datasets. The common variables, first and last name, vary in spelling between the datasets, and there are many duplicates, even among similarly spelled names. I've included download links for the files and some R code below, and I'll walk through what I've tried and what went wrong.
There are a few R tutorials that tackle the (common) problem of record linkage, but none of them deal with large datasets. I'm hoping the SO community can help me solve this problem.
The first dataset is a large file (several hundred thousand rows) of Federal Election Commission political contributions. The second is a custom dataset of the names and companies of every Internet company founder (~5,000 rows):
https://www.dropbox.com/s/lfbr9lmurv791il/010614%20CB%20Founders%20%20-%20CB%20Founders.csv?dl=0
-- Attempted matching with regular expressions --
My first attempt, with the help of previous SO suggestions, was to use agrep and regular-expression string matching. This narrowed down the names but resulted in too many duplicates.
# Load files (fread() comes from the data.table package)
library(data.table)
expends12 <- fread("file path for FEC", sep = "|", header = FALSE)
crunchbase.raw <- fread("file path for internet founders")
exp <- expends12
cr <- crunchbase.raw

# Use regular-expression string matching to build comparable name keys:
# FEC names are "LAST, FIRST ..." -- flip them and keep only the first 7 characters after the comma
exp$xsub <- gsub("^([^,]+), (.{7})(.+)", "\\2 \\1", tolower(exp$V8))
# Crunchbase names are "First Last" -- keep the first 7 characters of the first name plus everything after the first word
cr$ysub <- gsub("^(.{7})([^ ]+) (.+)", "\\1 \\3", tolower(cr$name))

# Merge the files on the constructed keys
fec.merge <- merge(exp, cr, by.x = "xsub", by.y = "ysub")
The result is 6,900 rows, so there are a lot of duplicates. Many rows are people with names similar to an Internet founder's, such as Alexander Black, but who are from different states and have different job titles. So now it's a question of finding the real Internet founder.
One option to narrow the results would be to filter by state. For example, I might only keep the Alexander Black from California or New York, because that is where most startups are founded. I might also keep only certain job titles, such as CEO or founder. But many founders held other jobs before and after their companies, so I wouldn't want to narrow by job title too much.
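For instance, something like the sketch below is what I have in mind. The "state" and "occupation" column names are placeholders -- I don't know what the relevant FEC columns end up being called after the merge -- and the title list is arbitrary:

# Keep only matches from states where startups cluster and with plausible job titles.
# NOTE: "state" and "occupation" are placeholder column names -- substitute whatever
# those columns are actually called in fec.merge.
startup_states <- c("CA", "NY")
founder_titles <- c("founder", "co-founder", "ceo", "chief executive officer")
fec.filtered <- fec.merge[fec.merge$state %in% startup_states &
                            tolower(fec.merge$occupation) %in% founder_titles, ]
nrow(fec.filtered)  # hopefully far fewer than 6,900 rows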
Alternatively, there is an R package, RecordLinkage, but as far as I can tell it requires the two datasets to have matching rows and columns, which seems like a nonstarter for this task.
I'm familiar with R, but have somewhat limited statistical knowledge and programming ability. Any step-by-step help is very much appreciated. Thank you and please let me know if there's any trouble downloading the data.

Why don't you select the columns you need from both datasets, rename them to match, and run the comparison on those? In the result object you get back the row indices of the matching records, so as long as you don't reorder anything, you can use those indices to link the two original datasets.
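Roughly like this with RecordLinkage (a sketch, not tested on your files -- the single "name" comparison column and the 0.7 threshold are placeholders, and with datasets this size you would also want a blockfld argument so you don't generate every possible pair):

library(RecordLinkage)

# Reduce both datasets to identically named comparison columns
a <- data.frame(name = tolower(exp$V8))
b <- data.frame(name = tolower(cr$name))

# Compare candidate pairs with a string comparator (Jaro-Winkler by default)
pairs <- compare.linkage(a, b, strcmp = TRUE)

# Score and classify the pairs; the threshold is arbitrary and needs tuning
pairs <- epiWeights(pairs)
links <- epiClassify(pairs, threshold.upper = 0.7)

# id1/id2 are row indices into a and b, and therefore into exp and cr,
# as long as you don't reorder anything
matches <- links$pairs[links$prediction == "L", c("id1", "id2")]
merged  <- cbind(exp[matches$id1, ], cr[matches$id2, ])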

Related

Querying UniProt and RefSeq databases with FASTA headers

This is my first post to StackOverflow. I've been playing with this problem for over a week, and I haven't found a solution using the search function or my limited computational skills. I have a dataset comprising columns of FASTA headers, sequence, and count number. The FASTA headers technically contain all the info I need, but that's where I'm running into problems...
Different formats
Some of the entries come from UniProt:
tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1
Some of the entries come from RefSeq:
gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]
Synonyms
I'd like to make graphs of count number vs virus or gene, and I thought it would be easy enough to split up the headers and go from there. However, what I'm discovering is that there's a seemingly endless number of variations on the virus and gene names. In the example above, EBV goes by no fewer than 4 names, and each individual gene has several different formattings.
I used a lengthy ifelse statement to create a column for virus family name. I shortened the following below to just include EBV, but you can imagine it stretching on for all common viruses.
library(dplyr)
EBV <- c("EBVG", "Human_herpesvirus_4", "Epstein", "Human_gammaherpesvirus_4")
joint.virus <- joint.virus %>%
  mutate(Virus_Family = ifelse(grepl(paste(EBV, collapse = "|"), x = name), "EBV", NA))
This isn't so bad, but I had to do something similar for all of EBV's ~85 genes. Not only was this tedious, but it isn't feasible to do this for all the viruses I want to look at.
I looked into querying the databases using the UniProt.ws package to pull out organism name and gene name, but you need to start from the taxID (which isn't included in the UniProt header). I feel like there should be some way to use the FASTA header to get the organism name and gene name.
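In case it helps frame an answer, the direction I was imagining is a regex that branches on the two header formats, roughly like the sketch below. The patterns are only guesses based on the two example headers above and would surely need hardening for edge cases:

headers <- c(
  "tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1",
  "gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]"
)

parse_header <- function(h) {
  if (grepl("^(tr|sp)\\|", h)) {
    # UniProt-style: organism sits between "OS=" and "_GN=", gene name after "GN="
    organism <- sub(".*OS=(.*)_GN=.*", "\\1", h)
    gene     <- sub(".*GN=([^_]+).*", "\\1", h)
  } else {
    # RefSeq-style: gene name just before the bracket, organism inside the brackets
    organism <- sub(".*\\[(.*)\\].*", "\\1", h)
    gene     <- sub(".*\\|([^][|]+)\\[.*", "\\1", h)
  }
  data.frame(organism = gsub("_", " ", organism), gene = gene,
             stringsAsFactors = FALSE)
}

do.call(rbind, lapply(headers, parse_header))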
I am presently using R. I would greatly appreciate any advice going forward. Is there a package that I'm overlooking? Should I be using a different tool to do this?
Thanks!

Text matching using R when strings are dissimilar

I am trying to identify observations that match between two datasets, using text string vectors $contractor and $employer, and create a TRUE/FALSE indicator on whether the contractor is in the employer list.
library(caTools)
list <- data.frame(ID = c(1:6),
  employer = c("a.c. construction", "abc concrete company", "xyz pool construction inc",
               "frank studebager llc", "annoying contractors llc", "beaumont ditch digging co inc"))
jobs <- data.frame(contractor = c("a-c construction", "hank hill construction",
                                  "xyz pool const incorporated", "frank studebaer co", "hank hill const"),
  value = c(400000, 284590, 410280, 310980, NA))  # NA added so the column lengths match
jobs$match <- pmatch(jobs$contractor, list$employer, duplicates.ok = TRUE)
The pmatch command says there are 0 matches, but this is because the company names are sloppily entered and not spelled consistently; there are obviously matches. I have also used the fuzzy matching function agrepl, but in my actual data the number and quality of matches vary wildly with small changes to the accepted Levenshtein distance.
There are also some answers here and here but my lack of advanced programming experience has kept me from applying the concepts there. Any thoughts are appreciated!
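For what it's worth, the closest I've gotten is approximate matching with the stringdist package, though I'm unsure about the right cutoff -- the method and maxDist below are arbitrary starting points:

library(stringdist)

# Light normalisation before comparing
clean <- function(x) gsub("[[:punct:]]", " ", tolower(x))

# amatch() gives, for each contractor, the index of the closest employer
# within maxDist, or NA if nothing is close enough
idx <- amatch(clean(jobs$contractor), clean(list$employer),
              method = "jw", maxDist = 0.15)

jobs$employer_match   <- list$employer[idx]  # NA where nothing matched
jobs$in_employer_list <- !is.na(idx)         # the TRUE/FALSE indicator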

How to fuzzy match character strings of persons' names listed variously firstName lastName or lastName firstName and with misspellings [duplicate]

I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it stands, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
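To illustrate the scoring step: the weights below are arbitrary, and the little cands data frame just stands in for whatever your candidate-pair join returns.

library(RecordLinkage)

# 'cands' stands in for the candidate pairs coming back from the blocked join:
# one cleaned name from each list per row, plus the first-word columns
cands <- data.frame(
  name_a       = c("some company ltd", "acme widgets inc"),
  name_b       = c("some company limited", "ajax widgets inc"),
  first_word_a = c("some", "acme"),
  first_word_b = c("some", "ajax"),
  stringsAsFactors = FALSE
)

# Weighted similarity: extra weight on the first word, as described above
cands$score <- 0.7 * jarowinkler(cands$name_a, cands$name_b) +
               0.3 * jarowinkler(cands$first_word_a, cands$first_word_b)

# Sort descending so likely matches surface first for manual review
cands[order(-cands$score), ]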
Maybe Google Refine could help. It looks like a better fit if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case probably something like edit-distance calculation would work, but if you need to find near duplicates in larger text based documents, you can try
http://www.softcorporation.com/products/neardup/

Fixing string variables with varying spellings, etc

I have a dataset with individuals names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names/ and/or addressees and/or phone numbers. A snippet of the fake data is shown below:
first        last     address         phone
Jimmy        Bamboo   P.O. Box 1190   xxx-xx-xx00
Jimmy W.     Bamboo   P.O. Box 1190   xxx-xx-xx22
James West   Bamboo   P.O. Box 219    xxx-66-xxxx
... and so on. Sometimes E. is spelled out as East and St. as Street; at other times they are not.
What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Does anyone have a clue how this might be done without manually going through each record, one at a time? The more I stare at it, the more I think it's impossible without making some judgment calls, such as treating records as the same individual when at least two or three fields match.
thanks!!
Ani
As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.
There are several tools you can use (we use them all).
1) Distance matching, like Damerau-Levenshtein. You can use this for names, addresses and other things. It handles errors like transpositions, minor misspellings, omitted characters, etc.
2) Phonetic word matching - soundex is not good. There are other, more advanced options. We ended up writing our own to handle the mix of ethnicities we commonly encounter.
3) Nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup of nicknames to regular names (see the sketch after this list). Consider, though, variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.
Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.
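For the nickname lookup, a plain named-vector table is enough to get started. The handful of entries below are only illustrative -- real lists run to thousands of pairs:

# Map common nicknames to a canonical given name before comparing
nicknames <- c(jen = "jennifer", jenny = "jennifer", jennie = "jennifer",
               jenee = "jennifer", fanny = "frances", bill = "william")

canonical_first <- function(first) {
  f <- tolower(first)
  ifelse(f %in% names(nicknames), nicknames[f], f)
}

canonical_first(c("Jenee", "Fanny", "Jimmy"))  # "jennifer" "frances" "jimmy"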
Here are some other answers on similar topics I've made here on stackoverflow:
Processing of mongolian names
How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?
MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard
You can calculate the pairwise matrix of Levenshtein distances.
See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/
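A small sketch of that idea with stringdist; the cutoff height is hand-picked for this toy example, so inspect the dendrogram before choosing one for your data:

library(stringdist)

nm <- c("Jimmy Bamboo", "Jimmy W. Bamboo", "James West Bamboo", "Ann Smith")

# Pairwise Levenshtein distances between all names
d <- stringdistmatrix(nm, nm, method = "lv")
rownames(d) <- colnames(d) <- nm

# Cluster and cut the tree so near-duplicate spellings fall in the same group
cl <- hclust(as.dist(d))
plot(cl)                      # inspect before picking a cutoff
groups <- cutree(cl, h = 10)  # h = 10 is hand-picked for this toy example
split(nm, groups)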
