Fuzzy/flex/scatter string matching

Everybody has seen the "fuzzy" string matching that TextMate uses to search files (as do Emacs' ido/icicles, Vim's Command-T, Sublime Text 2, Xcode, etc.): you enter a partial path or file name, possibly just a few fragments of it. Applications like Quicksilver, LaunchBar and Alfred have popularized it as well.
So I'm wondering whether there are any ideas for building an index that would speed up this kind of matching. I have a list of thousands of strings (around 7k right now, a list of songs from iTunes), and I'd like to match against them quickly. Right now I've just taken the Quicksilver scoring algorithm, which can take 5 seconds on certain queries.
Any ideas on how to speed this up are welcome.
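One possible starting point, offered as a sketch and not as how TextMate or Quicksilver actually work: precompute a letter bitmask per string once, then use it to discard strings that cannot contain the query before running the (expensive) scorer on what is left. The names FuzzyIndex, char_mask and is_subsequence are made up for this example, and the scoring step itself is left out.

```python
def char_mask(s: str) -> int:
    """Bitmask with one bit per letter a-z that occurs in s (case-insensitive)."""
    mask = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def is_subsequence(query: str, text: str) -> bool:
    """True if the characters of query appear in text in the same order."""
    remaining = iter(text.lower())
    # `ch in remaining` consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in query.lower())


class FuzzyIndex:
    """Hypothetical prefilter: masks are computed once, at build time."""

    def __init__(self, items):
        self.items = [(item, char_mask(item)) for item in items]

    def candidates(self, query: str):
        qmask = char_mask(query)
        return [
            item
            for item, mask in self.items
            # Cheap test first: every query letter must occur in the item.
            if qmask & ~mask == 0 and is_subsequence(query, item)
        ]


songs = FuzzyIndex(["Yellow Submarine", "Let It Be", "Hey Jude"])
print(songs.candidates("ylsub"))
```

Only the strings returned by candidates() would then need to go through the scoring algorithm, so the slow part runs on a much shorter list.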

Related

Google Ngram Viewer - English One Million

I'm training a language model in PyTorch and I need the one million most common English words to serve as a dictionary.
From what I've understood, the Google Ngram English One Million (1-grams) dataset might suit this task, but after downloading every part (0-9) and running tail on each one to check that it was what I expected, I found that no part contains words beyond the letter F.
As far as I understand, every Version 1 file has its ngrams sorted alphabetically and chronologically, so I'm wondering whether it is possible that the one million most common words do not go beyond F.
Or am I missing the point of this dataset, and it isn't the one million most common words at all?
Try shuf <file> to get a random ordering and you will see that the data covers all letters. What you see at the end of the files is not an f but the ligature ﬂ (a single character, U+FB02), which sorts after z.
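To see why a ligature ends up at the tail of a sorted file, here is a small sketch. The word list is invented for the example; the only facts relied on are that U+FB02 is the "fl" ligature and that NFKC normalization expands it to the two letters f and l.

```python
import unicodedata

LIGATURE_FL = "\ufb02"  # LATIN SMALL LIGATURE FL, a single code point

# Made-up sample; the last word starts with the ligature, not with "f".
words = ["zebra", LIGATURE_FL + "oor", "apple", "fjord"]

# Plain code-point ordering puts the ligature after every ASCII letter.
ordered = sorted(words)
print(ordered)

# NFKC normalization turns the ligature back into the letters "f" + "l".
folded = unicodedata.normalize("NFKC", ordered[-1])
print(folded)
```

Normalizing the words this way before building a vocabulary may be worth considering, so that ligature spellings collapse into the ordinary ones.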

How to find out the longest definition entry in an English dictionary text file?

I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the number of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Super User but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which converted to text has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed by parentheses), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (treating multiple definitions of a single word as separate ones), then count the number of characters and/or words in it (ignoring the examples in parentheses, since they are not part of the definition proper), and finally provide a list of the longest definitions (I don't think I would need more than, say, a top 20 or so to compare). If the file format is an issue, I can convert the file to PDF, EPUB, etc. with no problem. Ideally, I would also like to be able to choose between counting length by characters and by words, if that is possible.
How should I go about this? I have a little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago, and give it a whack. When you run into problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems:
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, i.e.
All entries have a preamble with the pronunciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: First find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by presence of lone numerals or whatever; if it's not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.
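For readers who want to see how those subproblems might fit together, here is a minimal Python sketch. It assumes the one-entry-per-line format shown in the question, and the function names definitions and longest are invented for the example. Two known limitations: in single-definition entries the part-of-speech abbreviation (like "n") stays attached to the definition, and a lone number inside a definition would be mistaken for a sense marker.

```python
import heapq
import re

# headword, then /pronunciation/, then everything else on the line
ENTRY_RE = re.compile(r"^(?P<word>.+?)\s+/(?P<pron>[^/]*)/\s+(?P<rest>.+)$")
EXAMPLE_RE = re.compile(r"\([^)]*\)")         # examples in parentheses
SENSE_SPLIT_RE = re.compile(r"(?:^|\s)\d+\s+")  # lone numerals: 1, 2, 3 ...


def definitions(line):
    """Yield (headword, definition) pairs for one dictionary line."""
    match = ENTRY_RE.match(line.strip())
    if match is None:  # blank or malformed line: skip it
        return
    rest = EXAMPLE_RE.sub("", match.group("rest"))
    parts = SENSE_SPLIT_RE.split(rest)
    # With numbered senses, the text before "1" is only the preamble.
    senses = parts[1:] if len(parts) > 1 else parts
    for sense in senses:
        text = " ".join(sense.split()).rstrip(" .")
        if text:
            yield match.group("word"), text


def longest(lines, n=20, by="chars"):
    """Return the n longest definitions, measured in characters or words."""
    measure = (lambda text: len(text.split())) if by == "words" else len
    found = (pair for line in lines for pair in definitions(line))
    return heapq.nlargest(n, found, key=lambda pair: measure(pair[1]))
```

It could be used along the lines of `with open(path, encoding="utf-8") as f: print(longest(f, n=20, by="words"))`, where path is the location of the dictionary text file.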

How to fuzzy match character strings of persons' names listed variously firstName lastName or lastName firstName and with misspellings [duplicate]

I'm attempting to clean up a database that, over the years, has acquired many duplicate records with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it is, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
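The workflow above (normalize, restrict comparisons to records that share a key, score, then sort) can be sketched roughly as follows. This is a Python illustration, not the R/MySQL setup described in the answer: difflib.SequenceMatcher stands in for jarowinkler() and levenshteinSim(), an in-memory dictionary stands in for the join, and the SYNONYMS table, the three-letter key and the 0.85 threshold are all assumptions to be tuned.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Only "limited" -> "ltd" comes from the question; extend as needed.
SYNONYMS = {"limited": "ltd"}


def normalize(name):
    """Lower-case, drop non-alphabetic characters, apply synonyms."""
    words = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    return " ".join(SYNONYMS.get(word, word) for word in words)


def candidate_pairs(names, key_len=3):
    """Pair up only names whose normalized forms share their first letters."""
    blocks = defaultdict(list)
    for name in names:
        norm = normalize(name)
        blocks[norm[:key_len]].append((name, norm))
    for members in blocks.values():
        yield from combinations(members, 2)


def scored_matches(names, threshold=0.85):
    """Return (score, name_a, name_b) tuples, best matches first."""
    results = []
    for (a, norm_a), (b, norm_b) in candidate_pairs(names):
        score = SequenceMatcher(None, norm_a, norm_b).ratio()
        if score >= threshold:
            results.append((score, a, b))
    return sorted(results, reverse=True)


companies = ["Some Company Limited", "SOME COMPANY LTD!", "Other Firm Inc"]
print(scored_matches(companies))
```

Because names are only compared within a block, the number of comparisons stays far below the all-pairs count, at the price of missing duplicates whose first letters differ.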
Google Refine might help. It looks better suited to cases where you have lots of exceptions and don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case something like an edit-distance calculation would probably work, but if you need to find near duplicates in larger text-based documents, you can try:
http://www.softcorporation.com/products/neardup/

Fixing string variables with varying spellings, etc

I have a dataset with individuals' names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names and/or addresses and/or phone numbers. A snippet of the fake data is shown below:
first       last    address        phone
Jimmy       Bamboo  P.O. Box 1190  xxx-xx-xx00
Jimmy W.    Bamboo  P.O. Box 1190  xxx-xx-xx22
James West  Bamboo  P.O. Box 219   xxx-66-xxxx
... and so on. Sometimes E. is spelled out as East and St. as Street; at other times they are not.
What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Does anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it, the more I think it's impossible without making some judgment calls and saying that if at least two or three fields are the same, the records should be treated as a single individual.
thanks!!
Ani
As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.
There are several tools you can use (we use them all).
1) distance matching, like Damerau-Levenshtein. You can use this for names, addresses and other things. It handles errors like transpositions, minor misspellings, omitted characters, etc.
2) phonetic word matching - soundex is not good. There are other more advanced ones. We ended up writing our own to handle the mix of ethnicities we commonly encounter.
3) nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup of nicknames to regular name. Consider though the variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.
Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.
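As a rough illustration of tools 1) and 3), here is a Python sketch that combines a nickname lookup with the optimal string alignment variant of Damerau-Levenshtein (the restricted form, in which a transposed pair is not edited again). The NICKNAMES table is a tiny hypothetical sample: Fanny and the Jennifer variants come from the text above, Jimmy -> James is an assumption, and max_distance=1 is just a starting value.

```python
# Hypothetical lookup table; a real one would be far larger.
NICKNAMES = {
    "fanny": "frances",
    "jen": "jennifer",
    "jenny": "jennifer",
    "jennie": "jennifer",
    "jimmy": "james",
}


def osa_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            # Adjacent transposition, e.g. "my" <-> "ym".
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def canonical(name):
    """Map a nickname to its regular name; otherwise just lower-case it."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def names_match(a, b, max_distance=1):
    """Treat two first names as the same if they are within max_distance."""
    return osa_distance(canonical(a), canonical(b)) <= max_distance


print(names_match("Fanny", "Frances"))
print(osa_distance("jimmy", "jimym"))
```

Doing the nickname lookup before the distance check matters, because pairs like Fanny and Frances are too far apart for any sensible edit-distance threshold.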
Here are some other answers on similar topics I've made here on Stack Overflow:
Processing of mongolian names
How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?
MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard
You can calculate the pairwise matrix of Levenshtein distances.
See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/
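For illustration, a pairwise matrix can be built like this in plain Python. This is a hand-rolled sketch and not the stringdist package from the linked post; the function names are made up. Keep in mind that the number of pairs grows quadratically: 120,000 rows give roughly 7.2 billion pairs, so the full matrix is only practical on subsets.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def distance_matrix(strings):
    """Symmetric matrix of distances; only the upper half is computed."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(strings[i], strings[j])
    return matrix


for row in distance_matrix(["kitten", "sitting", "kitten"]):
    print(row)
```

Pairs with a small distance in the matrix are the candidates to inspect by hand.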


Resources