Why are my "y"s being changed to "i"s when using {litsearchr} in R to develop an academic search strategy?

I'm planning a systematic review and have been using the litsearchr package to help generate good search terms for the scientific databases.
The problem I'm seeing is that when I print my final search string, it has changed what I would expect to be "y" letters to "i" letters at the end of words that end in "y". For example, my review will involve intermittent-style sports (e.g. team sports, tennis, etc.) and thermoregulation, so my search terms include words like "rugby" and "body temperature", but the output is giving me terms such as "rugbi*" and "bodi* temperatur*".
For anyone who knows this package, or who searches scientific/academic databases regularly: do you think this is expected behaviour?
Thanks!!
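For what it's worth, the trailing y-to-i swap is the classic signature of Porter-style stemming, which litsearchr appears to apply before adding the * wildcards. A minimal illustration of that behaviour with SnowballC (a common R stemmer; this is a sketch of the stemming rule, not litsearchr's exact code path):

library(SnowballC)

# Porter stemming rewrites a trailing "y" as "i", which matches the
# output described above
wordStem(c("rugby", "body", "temperature"), language = "english")
# [1] "rugbi"      "bodi"       "temperatur"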

searchTwitter - search tweet but not handle

I'm trying to dabble in some basic sentiment analysis using the twitteR library and the searchTwitter function. Say I'm searching for tweets specific to "Samsung". I can retrieve the tweets with the command below:
samsung_t = searchTwitter("#samsung", n=1500, lang="en",cainfo="cacert.pem")
I know this will return all the tweets containing the hashtag #samsung. However, if I want to search for tweets containing "samsung" in them, I give the same command but without the "#":
samsung_t = searchTwitter("samsung", n=1500, lang="en",cainfo="cacert.pem")
This, however, will return all the tweets containing the term "samsung" anywhere in them, including the handle. For example, it will return a tweet like "@I_Love_Samsung: I like R programming", which is completely irrelevant to my criteria. If I want to do a sentiment analysis on, say, "Samsung phones", I'm afraid that data like this can skew the results.
Is there a way I can force searchTwitter to only look in the "Tweet" but not the "Handle"?
Thanks a lot in advance.
Looking at the search API documentation and the listing of available search operators, I don't think the Twitter search API offers this specific search capability (which seems kind of strange, frankly). I think your best bet is to run your search with the tools available to you and filter out the tweets that don't match your criteria from the results you get back.
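As a hedged sketch of that filtering step, using twListToDF() from twitteR to flatten the results, plus a word-boundary regex: "_" counts as a word character, so "\bsamsung\b" will not match inside a handle like I_Love_Samsung.

library(twitteR)

tweets <- searchTwitter("samsung", n = 1500, lang = "en", cainfo = "cacert.pem")
df <- twListToDF(tweets)   # one row per tweet, with text and screenName columns
# Keep only tweets where "samsung" appears as a standalone word in the text
samsung_df <- df[grepl("\\bsamsung\\b", df$text, ignore.case = TRUE), ]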

Genbank query (package seqinr): searching in sequence description

I am using the function query() of package seqinr to download myoglobin DNA sequences from Genbank. E.g.:
query("myoglobins","K=myoglobin AND SP=Turdus merula")
Unfortunately, for a lot of the species I'm looking for I don't get any sequence at all (or, for some species, only a very short one), even though I find sequences when I search manually on the website. This is because the query searches for "myoglobin" in the keywords only, and often there is no keyword entry at all. Often the protein type is only specified in the name ("definition" on Genbank), but I have no idea how to search for this.
The help page for query() doesn't seem to offer any option for this in the details, a "generic search" without any "K=" doesn't work, and I haven't found anything via googling.
I'd be happy about any links, explanations and help. Thank you! :)
There is a complete manual for the seqinr package which describes the query language in more depth in chapter 5 (available at http://seqinr.r-forge.r-project.org/seqinr_2_0-1.pdf). I was trying to do a similar query, and the description for many of the genes/CDS entries is blank, so they don't come up when searching with the K= option. One alternative would be to search for the organism alone, then match gene names in the individual annotations and pull out the accession numbers, which you could then use to re-query the database for your sequences.
This would pull out the annotation for the first gene:
choosebank("emblTP")
query("ACexample", "sp=Turdus merula")
getName(ACexample$req[[1]])
annotations <- getAnnot(ACexample$req[[1]])
cat(annotations, sep = "\n")
I think that this would be a pretty time consuming way to tackle the problem but there doesn't seem to be an efficient way of searching the annotations directly. I'd be interested in any solutions you might come up with.
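A rough sketch of that annotation-matching loop, reusing the ACexample query from the snippet above (untested against a live bank, so treat it as a starting point rather than a finished solution):

hits <- character(0)
for (i in seq_along(ACexample$req)) {
  ann <- getAnnot(ACexample$req[[i]])                  # annotation lines for record i
  if (any(grepl("myoglobin", ann, ignore.case = TRUE))) {
    hits <- c(hits, getName(ACexample$req[[i]]))       # keep the matching name
  }
}
hits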

Rhyme Dictionary from CMU pronunciation database

I'm looking for a free or open source rhyming database.
I've found the CMU pronunciation "database" and its series of apps but I can't make sense of them or figure out where the data's coming from.
A simple text file with the word and its phonemes is all I need.
Does anybody here know where I'd find one or where I would begin to derive such a list from the CMU files?
cmudict
The cmudict is a text file and its format is really simple. First, the word is listed. Then, there are two spaces. Everything following the two spaces is the pronunciation. Where a word has more than one way of being spoken, you will see multiple entries for the word, like
word
word(1)
At the beginning of the file they've listed symbols and punctuation. The symbol is followed by the English spelling of that symbol's name, with no space between them. This is then followed by the two-space divider and the ARPAbet code. Since you're only looking for rhymes, you don't have to do anything special with the symbols section, since you're never going to be looking for a rhyme to ...ELLIPSIS
ARPAbet
The information about how ARPAbet codes map to IPA is listed on Wikipedia (http://en.wikipedia.org/wiki/Arpabet), and each mapping shows example words. It's pretty easy to see how the two relate to one another, which may help you read the ARPAbet codes if you are familiar with IPA.
Summary
Basically, if you've already found the cmudict then you've already got what you asked for: a database of words and their pronunciations. To find words that rhyme you'll have to parse the flat file into a table and run a query to find words that end with the same ARPAbet code.
General Theory of Doing Stuff to Things
Part: Stuff
create a new database
create a table in the database with three fields: index, word, arpabet
read the cmudict file line by line
for each line, split it into two parts where the two consecutive spaces are found, and
increment the index count, then insert the index number, word, and arpabet code (a sketch follows this list)
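Since the questions in this thread are all R-flavoured, here is a minimal sketch of those steps in R, loading the file into a data frame rather than a database (it assumes the standard cmudict-0.7b release file is in the working directory):

lines <- readLines("cmudict-0.7b", encoding = "latin1")
lines <- lines[!startsWith(lines, ";;;")]         # drop the comment header
parts <- strsplit(lines, "  ", fixed = TRUE)      # split word from pronunciation
dict <- data.frame(
  index   = seq_along(parts),
  word    = vapply(parts, `[`, character(1), 1),
  arpabet = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)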
Then Umm...
Once you've got the data into whatever kind of database you choose, you can then use that database to find correlations between the arpabet codes. You could find rhymes, consonance, assonance, and other mnemonic devices. It would go something like
Part: Thing
get a word you want to find a rhyme for
query the database for the arpabet equivalent of the word
split the arpabet code into pieces by breaking it up everywhere there is a space
take the last piece of the code and query the database for words whose arpabet codes end with said piece (see the sketch below)
Do fancy things with the rhymes
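Continuing the R sketch, the lookup might look like the following. It is deliberately as coarse as the steps above (it matches only the final phoneme); the next answer refines the key using primary stress.

last_piece <- function(pron) {
  phones <- strsplit(pron, " ", fixed = TRUE)[[1]]
  phones[length(phones)]                          # final ARPAbet phoneme
}

key <- last_piece(dict$arpabet[dict$word == "LOVE"][1])   # "V"
candidates <- dict$word[endsWith(dict$arpabet, key)]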
Shortcuts and Spoilers
I got bored and wrote a Node.js module that covers "Part: Stuff" listed above. If you've got Node.js installed on your machine you can get the module by running npm install cmudict-to-sqlite See https://npmjs.org/package/cmudict-to-sqlite for the README or just look in the module for docs.
Rhyme Logic using CMU Pronouncing Dictionary
OK. Suppose you want to use CMU Pronouncing Dictionary data (example file: cmudict-0.7b) to build a list of all the words that rhyme with "LOVE".
Here's how you might do it:
First, you need to learn the pronunciation of "LOVE". You'll find this line in the dictionary, where "LOVE" and "L AH1 V" are separated by two spaces:
LOVE L AH1 V
This is saying that the word LOVE is pronounced like L AH1 V.
Then, find the vowel phoneme that has primary stress. In other words, look for the number "1" in that pronunciation. The text directly to the left of the 1 is the vowel sound that has primary stress (AH). That text, and everything to the right of it, are your "rhyme phonemes" (for lack of a better term). So the rhyme phonemes for LOVE are AH1 V.
We're half done! Now we just have to find other words whose pronunciations end with AH1 V. If you're playing along in Notepad++, try a Find All In Current Document for pattern AH1 V$ using Search Mode of "Regular expression". This will match lines like:
Line 392: ABOVE AH0 B AH1 V
Line 10266: BELOVE B IH0 L AH1 V
Line 30204: DENEUVE D IH0 N AH1 V
Line 30205: DENEUVE(1) D IY0 N AH1 V
Line 34064: DOVE D AH1 V
Line 48177: GLOVE G L AH1 V
Line 49053: GOV G AH1 V
... etc
Rhyming woooooords!
There are plenty of ways to implement this, and plenty of corner cases, but this is roughly the approach that many electronic rhyming dictionaries appear to take when finding perfect rhymes.
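The same search also works outside Notepad++. Sticking with R as in the rest of this thread, a one-off grep over the raw dictionary lines with the same AH1 V$ pattern would be:

lines <- readLines("cmudict-0.7b", encoding = "latin1")
grep(" AH1 V$", lines, value = TRUE)   # every entry ending in the rhyme phonemes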
Hypothetical SQL approach to storing rhyme data
Obviously, performance will be a problem if you just scan the dictionary every time someone wants a rhyme. If that's a concern, you might try storing or indexing the data differently.
Although it's not the most efficient on disk space, I've had a good experience storing this stuff in a SQL table with indexed columns.
For a simple conceptual example, you could compute the "rhyme phonemes" of all words in the dictionary, then insert them into a "Rhymes" table whose columns are { WordText, RhymePhonemes }. For example, you might see records like:
{"ABOVE", "AH1 V"}
{"DOVE", "AH1 V"}
{"OUTLIVE", "IH1 V"}
{"GRADUATE", "AE1 JH AH0 W AH0 T"}
{"GRADUATE", "AE1 JH AH0 W EY2 T"}
... etc
Then, to find rhymes, you'd issue a query like:
SELECT OTHER.WordText
FROM Rhymes INPUT
INNER JOIN Rhymes OTHER ON OTHER.RhymePhonemes = INPUT.RhymePhonemes
WHERE INPUT.WordText = 'love' AND
OTHER.WordText <> INPUT.WordText
ORDER BY OTHER.WordText
This also comes in handy if you're planning on printing a dictionary where all similar-sounding words are grouped together.
There are of course plenty of other ways to store/search the data of varying trade-offs, but hopefully this gets you started.
I've also had some luck storing the raw pronunciation in the database in varying "full" formats (forward and reversed strings of the pronunciation, with stress marks and without stress marks, etc) but not "chopped" into specific pieces like a rhyme-phoneme column.
Gotchas
Again, the original explanation with "love" will absolutely get you in the ballpark of rhyming. However, along the way you'll probably run into other gotchas to consider. Here's a heads-up:
Some words have multiple pronunciations. In the CMU dictionary, the alternate pronunciations are marked with text like (1), (2), etc following the word as in GRADUATE(2). If someone wants a rhyme of these words, you have to decide between showing rhymes of ALL matched pronunciations, or having the user choose which pronunciation they really meant.
What do you do when the pronunciation has two or more "1"s? Pick the first one? Pick the last one? If you pick the last one, you'll find more rhymes, but it might not be the most natural choice of stress.
What do you do when the pronunciation has no "1"s? It doesn't happen a lot, but it happens, like: ACCREDIT AH0 K R EH2 D AH0 T and AIKIN EY0 K IH0 N. In this case I'd pick the next best stress (e.g. pick the 2, 3, 4, etc if the 1 is absent). If they're all 0's, I don't have any good advice.
Some pronunciations are missing. The dictionary is a great start, but it doesn't have all the words or spellings of words you might want, and US spelling is preferred over UK spelling.
Some pronunciations are not what you'd expect, and you may want to prune. For example there's a pronunciation of "or" that sounds like "er".
You may want to compare the "rhyme phonemes" with stress marks removed. This only matters for words whose primary stress is not on the last vowel (so you don't see the problem on the "love" example).
I'm actively working on something like this right now, using the general approach suggested by Plate, and extending it. Here's my source code. Hope it helps!
You could always use http://www.rhymezone.com/, search a word, and then put its rhyme matches into a text file if you only need a small demo subset. If you want a full database of words, you could hook up a dictionary to a ZombieJS UI automation, screen-scrape the words, and put them into your own database. That would let you create your own rhyme database, although to be honest, that's quite an undertaking for your original request.

Techniques for finding near duplicate records

I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it stands, you can't use soundex() for big data sets with RecordLinkage).
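As a hedged sketch of such a blended score (the 70/30 weights here are made up for illustration; jarowinkler() and levenshteinSim() are vectorized over pairs of strings and both return similarities between 0 and 1):

library(RecordLinkage)

match_score <- function(a, b) {
  # Arbitrary blend of two string-similarity measures
  0.7 * jarowinkler(a, b) + 0.3 * levenshteinSim(a, b)
}

match_score("some company ltd", "some company limited")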
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe the first three letters of each name, or the first three letters of the name plus the first three letters of the acronym). Then I calculate match scores using the above functions (a sketch of this blocking step in R follows).
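The blocking step can be sketched without leaving R. Here a and b are hypothetical cleaned name vectors, match_score() is the sketch from above, and the join key is the first three letters; merge() does an inner join by default (add all.x = TRUE for the LEFT OUTER JOIN described above):

da <- data.frame(name = a, key = substr(a, 1, 3), stringsAsFactors = FALSE)
db <- data.frame(name = b, key = substr(b, 1, 3), stringsAsFactors = FALSE)

cand <- merge(da, db, by = "key", suffixes = c(".a", ".b"))   # blocked candidate pairs
cand$score <- match_score(cand$name.a, cand$name.b)
cand <- cand[order(-cand$score), ]                            # likely matches first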
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
Maybe Google Refine could help. It seems better suited to cases where you have lots of exceptions and don't know them all yet.
What you're doing is called record linkage, and it has been a huge field of research for many decades. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia page on record linkage links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive, you can buy a Master Data Management tool.
In your case, probably something like an edit-distance calculation would work, but if you need to find near-duplicates in larger text-based documents, you can try
http://www.softcorporation.com/products/neardup/
