I'm trying to compile a decent .zwl file for squiggly spell checking in Flex, using British words rather than the American list supplied by default.
I've managed to create a decent list of British words and run them through the AdobeSpellingGen app to get a .zwl; great stuff.
However, I need to add a list of names to this list so they won't be flagged.
Does anyone know of a good source, free or paid for, of English forenames and surnames? I'm trying BT as I type :)
Thanks, any help with this would be greatly appreciated.
There are lots of baby names sites out there. This one might be a good start, as it would be fairly easy to copy: http://www.listofbabynames.org/a_boys.htm
I'll keep looking.
You can screen-scrape http://www.britishsurnames.co.uk/browse for a list of surnames. I'm not sure where you'd find first names, though.
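If it helps, a minimal scraping sketch in R with the rvest package might look like this (the CSS selector is a guess and would need checking against the page's real markup):

library(rvest)

# Fetch the browse page and grab the text of every link on it;
# you would then filter this down to just the surname links
page <- read_html("http://www.britishsurnames.co.uk/browse")
surnames <- html_text(html_nodes(page, "a"))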
GNU Aspell has spell checking for common names. You can try it out here:
http://chxo.com/scripts/spellcheck.php?showsource=1
The source is here: http://aspell.net/
I'm not too familiar with it, though, so I couldn't tell you how to extract the dictionaries.
The US Census site has a list of more than 150,000 first names and surnames from the 1990 and 2000 censuses, at
http://www.census.gov/genealogy/www/
Of course, these aren't UK names, but they might do if you can't find anything better.
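For what it's worth, the 1990 files are plain whitespace-delimited text (name, frequency, cumulative frequency, rank), so reading one into R is straightforward. This sketch assumes you've downloaded one locally, since the exact file URLs have moved around over the years:

# dist.all.last is one of the 1990 census name files (local copy assumed)
surnames <- read.table("dist.all.last",
                       col.names = c("name", "freq", "cumfreq", "rank"),
                       stringsAsFactors = FALSE)
head(surnames$name)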
I have a .tsv file from my Amazon Kindle with a list of English words/vocabulary, but without translations or dictionary definitions, and I want to update the table with each word's definition from some dictionary. There are over 1,000 words on that list and I have no way of doing this manually.
Is there any app or program that might do the trick?
If programming something is necessary, I'm pretty good in R and know a bit of Swift, but I haven't found an R package that might apply.
Anyone have any ideas? I would really appreciate it. Thanks!
Here is a sample: [screenshot of the word list]
Most of that table is blank on the right side. I'd like some sort of definition for each word in those blanks.
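One possible starting point, sketched in R: look each word up against a free dictionary web service. This assumes the dictionaryapi.dev endpoint (not mentioned in the question; any dictionary API returning JSON would work similarly), and the column name "word" and the filename are placeholders for whatever your .tsv actually uses:

library(jsonlite)

# Look up one word and return its first definition, or NA on failure
define <- function(word) {
  url <- paste0("https://api.dictionaryapi.dev/api/v2/entries/en/", word)
  def <- tryCatch(fromJSON(url)$meanings[[1]]$definitions[[1]]$definition[1],
                  error = function(e) NA_character_)
  if (is.null(def)) NA_character_ else def
}

vocab <- read.delim("vocab.tsv", stringsAsFactors = FALSE)  # placeholder filename
# Consider pausing between calls to be polite to the service
vocab$definition <- vapply(vocab$word, define, character(1))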
I have a list of names in my dataframe and I want to find a way to query them on Wikipedia. It's not as simple as appending each name to "https://en.wikipedia.org/wiki/": I want to actually query Wikipedia's search, so that there will be a suggestion even if the name isn't spelt correctly. For example, if I were to put in Dick Dawkins, it should come up with Richard Dawkins; I checked, and that is actually the first hit on Wikipedia.
Ideally I'd want to use rvest, but I don't want to construct every URL manually. Is this possible?
You are right: I, too, had a hard time getting Dick Dawkins out of Wikipedia, so much so that even searching for Dick Dawkins in Wikipedia's own search brought me straight to Richard Dawkins.
However, if you want to search for a term (say, "Richard Dawkins"), Wikipedia has a proper API for that (https://www.mediawiki.org/wiki/API:Tutorial). You can play around and find the right parameters that work for you.
Just to get you started, I wrote a function (somewhat similar to rg255's post). You can change the parameter of the MySearch function. Please make sure that spaces in the search string are replaced by '%20' for every query from your dataframe; a simple gsub call should do the job. You will also have to install the 'jsonlite' package for this to work.
library(jsonlite)

# Build a MediaWiki search URL, fetch it, and return the parsed JSON
MySearch <- function(srsearch){
  FullSearchString <- paste("http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=",
                            srsearch, "&format=json", sep = "")
  Response <- fromJSON(FullSearchString)
  return(Response)
}

Response <- MySearch("Richard%20Dawkins")
You can now use the parsed JSON to pull out the properties you want. As I said, you will have to play with the parameters to get it right.
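For example, with jsonlite's default simplification the hits come back as a data frame, so pulling the top suggestion for a misspelt name is one line (assuming the first hit is the one you want):

Response <- MySearch("Dick%20Dawkins")
Response$query$search$title[1]   # "Richard Dawkins", if it is indeed the first hit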
Please let me know if this is not what you wanted.
I am using the query() function of the seqinr package to download myoglobin DNA sequences from GenBank, e.g.:
query("myoglobins","K=myoglobin AND SP=Turdus merula")
Unfortunately, for a lot of the species I'm looking for I don't get any sequence at all (or, as for this species, only a very short one), even though I do find sequences when I search manually on the website. This is because the query searches for "myoglobin" in the keywords only, and often there isn't any entry there; frequently the protein type is specified only in the name ("definition" on GenBank), and I have no idea how to search that field.
The help page for query() doesn't seem to offer any option for this in the details, a "generic search" without any "K=" doesn't work, and I haven't found anything via googling.
I'd be grateful for any links, explanations, or other help. Thank you! :)
There is a complete manual for the seqinr package which describes the query language in more depth in chapter 5 (available at http://seqinr.r-forge.r-project.org/seqinr_2_0-1.pdf). I was trying to do a similar query, and the description for many of the genes/CDS is blank, so they don't come up when searching with the "K=" option. One alternative would be to search for the organism alone, then match gene names in the individual annotations and pull out the accession numbers, which you could then use to re-query the database for your sequences.
This would pull out the annotation for the first entry:
choosebank("emblTP")
query("ACexample", "sp=Turdus merula")
getName(ACexample$req[[1]])
annotations <- getAnnot(ACexample$req[[1]])
cat(annotations, sep = "\n")
I think this would be a pretty time-consuming way to tackle the problem, but there doesn't seem to be an efficient way of searching the annotations directly. I'd be interested in any solutions you might come up with.
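For what it's worth, here is a sketch of that loop (hypothetical, but it only uses the same seqinr calls as above): pull all entries for the species, then keep the accession names whose annotations mention myoglobin.

library(seqinr)

choosebank("emblTP")
query("ACexample", "sp=Turdus merula")

hits <- character(0)
for (s in ACexample$req) {
  annot <- getAnnot(s)                      # annotation lines for this entry
  if (any(grepl("myoglobin", annot, ignore.case = TRUE))) {
    hits <- c(hits, getName(s))             # keep its accession name
  }
}
closebank()
hits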
I have a dataset with individuals' names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names and/or addresses and/or phone numbers. A snippet of the fake data is shown below:
first        last    address        phone
Jimmy        Bamboo  P.O. Box 1190  xxx-xx-xx00
Jimmy W.     Bamboo  P.O. Box 1190  xxx-xx-xx22
James West   Bamboo  P.O. Box 219   xxx-66-xxxx
... and so on. Sometimes E. is spelled out as East, and St. as Street; at other times they are not.
What I need to do is run through almost 120,000 rows of data and identify each unique individual based on their names, addresses, and phone numbers. Does anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it, the more I think it's impossible without making some judgment calls, e.g. saying that if at least two or three fields match, treat the records as the same individual.
thanks!!
Ani
As I mentioned in the comments, this is not trivial. You have to decide the trade-off between programmer time/solution complexity and results. You will not achieve 100% accuracy; you can only approach it, and the time and complexity cost increase the closer to 100% you get. Start with an easy solution (exact matches), see what issue most commonly causes the missed matches, and implement a fuzzy solution to address that. Rinse and repeat.
There are several tools you can use (we use them all).
1) Distance matching, like Damerau-Levenshtein. You can use this for names, addresses and other fields. It handles errors like transpositions, minor misspellings, omitted characters, etc.
2) Phonetic word matching - Soundex is not good; there are other, more advanced algorithms. We ended up writing our own to handle the mix of ethnicities we commonly encounter.
3) Nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup from nickname to regular name (a small sketch follows below). Consider, though, the variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.
Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.
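A toy version of such a lookup in R (illustrative entries only; real nickname tables run to thousands of pairs):

# Map lower-cased nicknames to a canonical form
nicknames <- c(fanny = "frances", jen = "jennifer",
               jenny = "jennifer", jennie = "jennifer")

canonical <- function(name) {
  key <- tolower(name)
  hit <- unname(nicknames[key])   # NA when the name has no nickname entry
  ifelse(is.na(hit), key, hit)
}

canonical(c("Fanny", "Jenny", "Sarah"))   # "frances" "jennifer" "sarah"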
Here are some other answers I've given on similar topics here on Stack Overflow:
Processing of mongolian names
How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?
MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard
You can calculate the pairwise matrix of Levenshtein distances.
See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/
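A minimal sketch with the stringdist package (method "dl" is full Damerau-Levenshtein; the threshold of 5 is arbitrary and would need tuning on your data):

library(stringdist)

names_vec <- c("Jimmy Bamboo", "Jimmy W. Bamboo", "James West Bamboo")
d <- stringdistmatrix(names_vec, names_vec, method = "dl")

# Row/column pairs below the threshold are candidate duplicates
which(d <= 5 & upper.tri(d), arr.ind = TRUE)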
With the foreign package, I'm reading in a .sav file. When I open the file with PSPP there are 95 variables; however, read.spss("file") returns a list of 353 variables. The extra variables are blank fields filled with 220 spaces. Has anyone ever experienced this?
Before you ask, I am unable to provide a reproducible example, as the data file and its contents are proprietary.
One obvious solution would be to search for list elements that contain only spaces and set those elements to NULL, or to set each element of 220 spaces to NA and then drop the all-NA columns (sketched below).
But I'd like to avoid having to post-process my files any further if possible. Does anyone have a fix for this?
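For reference, that fallback might look like this (a sketch; it assumes the junk variables come through as character columns of spaces, and the filename is a placeholder):

library(foreign)

dat <- read.spss("file.sav", to.data.frame = TRUE)

# A variable is junk if every value is empty after trimming whitespace
all_blank <- vapply(dat,
                    function(x) isTRUE(all(trimws(as.character(x)) == "")),
                    logical(1))
dat <- dat[, !all_blank, drop = FALSE]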
I've had something similar before. It happened when the data was exported from SPSS CATI (the field-interview application), rather than the SPSS we know and love.
In my case the resolution was to play around with the arguments to read.spss. I found that setting use.missings=FALSE resolved the problem, i.e. something like:
read.spss(global$datafile, to.data.frame=TRUE, use.missings=FALSE)
Good luck, and my sympathy. I know how frustrating this was for me.