This is my first post with StackOverflow, I apologize if I violate any rules.
I am working with the R package qdap on spellchecking very messy medical record text. The goal of this work is to identify misspellings of drug side effects in order to build a side effect misspelling dictionary. The text I am working with contains many, many misspellings, abbreviations, and other things that make a simple spellcheck difficult. After I run a spellcheck on a small doctors note, I get hundreds of words returned to me by the spellcheck program. This makes it difficult to search for the side effect misspellings that I care about.
I attempted to use the following code to create a dictionary consisting only of correctly spelled side effects, so that qdap will trigger closely misspelled words as belonging to this dictionary. The problem is that with this, nearly every word in the text, properly or improperly spelled is not returned as incorrect (i.e. "notable" is spelled wrong and "nausea" is the suggested replacement from my dictionary).
dictionary <- readLines("dictionary.txt")
check_spelling(text$NOTE_TEXT[3379],range = 0, dictionary = dictionary,
assume.first.correct=FALSE)
Here the term "dictionary" is my self-built side-effects dictionary, and check_spelling is being run on text contained in a csv file. Is there any way to omit words that are very far away from words contained in my dictionary from appearing in the spellcheck function (such as my previous example)? This way I can cut down the number of words I am seeing in my spell_check output and identify only the misspelled side effects.
As a small note, changing assume.first.correct to TRUE will not change anything, because the dictionary does not run with it set that way.
Related
I am trying to find short variations of texts in long text strings.
long.string = "A lot of irrelevant text that features some of the words from the relevant sentence, including decision, affirmed, and order. The result of this long decision process is affirmed, without any exceptions. It is order that the instructions be executed very slowly. Even more irrelevant text that features some of the words from the relevant sentence, including decision, affirmed, and order, as well as promptly.
variation = "result.*?affirmed.*?promptly"
The text of interest is in bold. The variation would be used in a grepl to tell me if the bold text is in the long string. However, I will still get a hit in this circumstance, even though the word "promptly" in the long string is outside of the two sentences of interest.
Assuming I want to look in a 2-3 sentence (ending with a period) radius, how do I construct my variation so that it does return a hit even when some words are outside of the radius?
I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the amount of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Superuser but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which converted to text has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed by parenthesis), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (including considering multiple definitions of a single word as separate ones), then count the amount of characters and/or words within (ignoring the examples in parenthesis since that is not the proper definition), and finally provide a list of the longest definitions (I don't think I would need more than say, a top 20 or so to compare). If the file format was an issue, I can convert the file to PDF, EPUB, etc. with no problem. And, I guess ideally I would want to be able to choose between counting length by characters and by words, if it was possible.
How should I go to do this? I have little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago, and give it a whack. When you run in to problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems:
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can give communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, i.e.
All entries have a preamble with the pronounciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: First find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by presence of lone numerals or whatever; if it's not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.
I have one paragrah of text (a vector of words) and I would like to see if it is "part" of a long text (a vector of words). However, I am know that this paragraph does not appear in the text in its exact form, but with slight changes: a few words could miss, the order could be slightly different, some words could be inserted as parenthetical elements etc.
I am currently implementing solutions "by hand", such as looking if most of the words of the paragraph are in the text, looking the distance between these words, their order, etc...
I was however wondering if there is no built-in method to do that?
I already checked the tm package, but it does not seem to do that...
Any idea?
I fear that you are stuck with hand-writing an approach, e.g. grep-ing some word groups and having some kind of matching threshold.
I'm working on a system to analyze texts in english: I use stanford-core nlp to make sentences from whole documents and to make tokens from sentences. I also use the maxent tagger to get tokens pos tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. to its standard form(are, is, having, saying). I've been searching for some english dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to realize: is there some similar work or whole project that I could use?
Ideas:
I) use string edit distance on a subset of your text and try to match words that do not exist in the dictionary using edit distance against existing words in the dictionary.
II) The key feature of lots of those examples you have is that they are only 1 character different from the correct spelling. So, I suggest for those words that you fail to match with a dictionary entry, try and add all english characters to the front or back and lookup the resulting word in a dictionary. This is very expensive in the beginning but if you keep track of those misspellings in a lookup table (re -> are) at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their actual correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper and clean english text (i.e. newspaper articles), then run it over the entire corpus that you have and see for those words that your language model considers as unknown words (which means it hasn't seen them in training phase), what is the highest probable word according to the language model. Most probably the language model top-10 prediction will be the correct spelled word.
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the
Levenshtein edit distance which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be
done directly in the database? (It's an Access database, so I'd
rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that use my own weighting scheme (also, as it is, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
Maybe google refine could help. It looks maybe more fitted if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case probably something like edit-distance calculation would work, but if you need to find near duplicates in larger text based documents, you can try
http://www.softcorporation.com/products/neardup/