Related
I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the amount of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Superuser but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which converted to text has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed by parenthesis), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (including considering multiple definitions of a single word as separate ones), then count the amount of characters and/or words within (ignoring the examples in parenthesis since that is not the proper definition), and finally provide a list of the longest definitions (I don't think I would need more than say, a top 20 or so to compare). If the file format was an issue, I can convert the file to PDF, EPUB, etc. with no problem. And, I guess ideally I would want to be able to choose between counting length by characters and by words, if it was possible.
How should I go to do this? I have little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago, and give it a whack. When you run in to problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems:
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can give communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, i.e.
All entries have a preamble with the pronounciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: First find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by presence of lone numerals or whatever; if it's not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.
I'm in the process of evaluating how successful a script I wrote is and kind of a quick and dirty method I've employed is looking at the first few values and last few values of a single variable and doing a few calculations with them based on the same values in another netcdf file.
I know that there are better ways to approach this but again, this is a really quick and dirty method that has worked for me so far. My question though is that by looking at the raw data through ncdump, is there a way to tell which vertical layer that data belongs to? In my example, the file has 14 layers. I"m assuming that the first few values are a part of the surface layer and the last few values are a part of the top layer, but I suspect that this assumption is wrong, at least in part.
As a follow-up question, what would then be the easiest 'proper' way to tell what layer data belongs to? Thank you in advance!
ncview and NCO are both very powerful and quick command line operators to view data inside a netcdf file.
ncview: http://meteora.ucsd.edu/~pierce/ncview_home_page.html
NCO: http://nco.sourceforge.net/
You can easily show variables over all layers for example with
ncks -d layer,0,13 some_infile.nc
ncdump dumps the data with the last dimension varying fastest (http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/CDL-Syntax.html) so if 'layer' is the slowest/first dimension, the earlier values are all in the first layer, while the last few values are in the last layer.
As to whether the first layer is the top or bottom layer, you'd have to look to the 'layer' dimension and its data.
I wanna know if someone know how to do transformation of the channel four (FLH 4) without using the standard transformations offer by the flowCore package?
The values of the channel four are between 1 and 4096 and i need to convert in values between 1 and 246 with the rule 10^(x/1024).
Thank you.
Better to use flowTrans mclMultivArcSinh transformation.
trans<-flowTrans(flowData, "mclMultivArcSinh",colnames(flowData)[3:12], n2f=FALSE, parameters.only=FALSE)
You must not tranform FSC-A,SSC-A and Time thats why i have colnames [3:12].
you could get a custom transform in by doing something like..
plot(transform(someFlowFrame, FSC-H=10^(FSC-H/1024), SSC-H=10^(SSC-H/1024)), c("FSC-H","SSC-H"))
however as 10^(4096/1024) returns a max value of 10000 for your hypothetical example, the plot with your ranges -
plot(transform(someFlowFrame, FSC-H=10^(FSC-H/1024), SSC-H=10^(SSC-H/1024)), c("FSC-H","SSC-H"), xlim=c(0,256), ylim=c(0,256))
doesn't look good.
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the
Levenshtein edit distance which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be
done directly in the database? (It's an Access database, so I'd
rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that use my own weighting scheme (also, as it is, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
Maybe google refine could help. It looks maybe more fitted if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case probably something like edit-distance calculation would work, but if you need to find near duplicates in larger text based documents, you can try
http://www.softcorporation.com/products/neardup/
I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (ok, soundex might help to disinguish the T and R in this case, but the names might as well be 400T and 400R), the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be a 100% precise, my goal is to automatically match around 80% of the names with a high confidence.
Any ideas or references is much appreciated
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do an recursive classification. You start out with node per group, and then add the a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation in it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with the supernode containing at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages
Compare the fields, in this case just the name. You can use one or more comparator for this, for example an edit distance like the Levenshtein distance or something like the cosine distance that compares the number of common words.
Turn an array fo distance scores into a probability that a pair of records are truly about the same thing
Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
Not having any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term, and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to comes up with satisfactory results. i.e. working with words as a unit instead of a character.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that an word key.
When a suspect word comes up, looks for words with the same or similar word key.
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
based on keywords narrow down the scope of search:
in this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for a type
"Canon" for company and there you'd be left with much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on the exact same thing in the past. What I have done is using an NLP method; TF-IDF Vectorizer to assign weights to each word. For example in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care and which words to not. I had quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS, you may consider to use some kind of regex to filter such cases.
After that, you can use a numeric calculation like cosine similarity.