I have this problem in R where I have a list of Spanish communities and inside each community there is a list of towns/municipalities.
For example, this is a list of municipalities inside the community of Catalonia.
https://en.wikipedia.org/wiki/Municipalities_of_Catalonia
So Catalonia is one community, and within it there is a list of towns/cities that I would like to group under a new value, 'Catalonia'.
I have a list of all the municipalities/towns/cities in my dataset and I would like to group them into communities such as Andalusia, Catalonia, Basque Country, Madrid, etc.
Firstly, how can I go about grouping these rows into the list of communities?
For example, el prat de llobregat is a municipality within Catalonia, so I would like to assign it to the region of Catalonia. Getafe is a municipality of Madrid, so I would like to assign it the value Madrid. Alicante is a municipality of Valencia, so I would like to assign it the value Valencia. Etc.
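The grouping described above can be sketched as a lookup-table join. This is only a minimal illustration: the `lookup` and `towns` data frames and their column names are hypothetical, and in practice the lookup table would have to be built from a full list of municipalities per community.

```r
# Hypothetical lookup table: one row per municipality, with its community.
lookup <- data.frame(
  municipality = c("el prat de llobregat", "castellbisbal", "getafe", "alicante"),
  community    = c("Catalonia", "Catalonia", "Madrid", "Valencia"),
  stringsAsFactors = FALSE
)

# Hypothetical dataset with a municipality column to be labelled.
towns <- data.frame(
  municipality = c("getafe", "el prat de llobregat", "alicante", "unknown town"),
  stringsAsFactors = FALSE
)

# match() gives, for each town, the row of `lookup` with the same name
# (NA when there is none), so the row order of `towns` is preserved.
towns$community <- lookup$community[match(towns$municipality, lookup$municipality)]

print(towns)
```

Municipalities that are missing from the lookup table, or misspelt, end up with `NA` in the `community` column, which makes them easy to find afterwards.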
That was my first question and if you are able to help with just that, I would be very thankful.
However, my dataset is not that clean. I did my best to remove Spanish accents and unnecessary code identifiers from the municipality names, but it still contains some small errors. For example, castellbisbal is a municipality of Catalonia, but some entries have very small spelling mistakes, e.g. one 'l' instead of two (castelbisbal).
These are small human errors. Is there a way I can work around them?
I was thinking of keeping a vector of all correctly spelt names and then renaming the incorrectly spelt names based on a percentage of incorrectness. Could this work? For instance, castellbisbal is 13 characters long; an error of 1 character gives an error rate of less than 8%. Can I rename values based on an error rate?
Do you have any suggestions on how I can proceed with the second part?
Any tips/suggestions would be great.
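The error-rate idea can be sketched with base R's `adist()`, which computes the Levenshtein edit distance. The vectors `correct` and `raw` and the 0.15 threshold are assumptions made for the illustration, not values taken from the real data.

```r
# Hypothetical reference vector of correctly spelt municipality names.
correct <- c("castellbisbal", "el prat de llobregat", "getafe", "alicante")

# Hypothetical raw entries, some with small typos.
raw <- c("castelbisbal", "getafe", "alicnte", "zaragoza")

# adist() (package utils) gives the Levenshtein distance between every
# raw entry (rows) and every correct name (columns).
d <- adist(raw, correct)

# Turn the distances into an error rate relative to the length of the
# correct name (divide each column by the length of its name).
rate <- sweep(d, 2, nchar(correct), "/")

# For each raw entry pick the closest correct name and accept it only if
# the error rate is at or below the threshold (0.15 is an arbitrary choice).
best      <- apply(rate, 1, which.min)
best_rate <- rate[cbind(seq_along(raw), best)]
fixed     <- ifelse(best_rate <= 0.15, correct[best], NA_character_)

print(data.frame(raw, fixed, error_rate = round(best_rate, 3)))
```

Entries that stay `NA` are too far from every known name and would need a manual check. The threshold trades missed corrections against wrong ones, so it is worth inspecting the borderline cases before settling on a value.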
As for the spelling errors, have you tried the soundex algorithm? It was meant for that and at least two R packages implement it.
library(stringdist)
phonetic("barradas")
[1] "B632"
phonetic("baradas")
[1] "B632"
And the soundex codes for the same words are the same with package phonics.
library(phonics)
soundex("barradas")
[1] "B632"
soundex("baradas")
[1] "B632"
All you would have to do is compare the soundex codes, not the words themselves. Note that soundex was designed for the English language, so it can only handle English-language characters, not accents. But you say you are already taking care of those, so it might work with the words you have to process.
I am trying to identify observations that match between two datasets, using text string vectors $contractor and $employer, and create a TRUE/FALSE indicator on whether the contractor is in the employer list.
library(caTools)
list<-data.frame(ID=c(1:6),
employer=c("a.c. construction","abc concrete company","xyz pool construction inc","frank studebager llc","annoying contractors llc","beaumont ditch digging co inc"))
jobs<-data.frame(contractor=c("a-c construction","hank hill construction","xyz pool const incorporated","frank studebaer co","hank hill const"),
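value=c(400000,284590,410280,310980,NA)) # the original lists only four values for five contractors; NA is a placeholder for the missing one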
jobs$match<-pmatch(jobs$contractor,list$employer,duplicates.ok=TRUE)
The pmatch command says there are 0 matches, but this is because the company names are sloppily entered and not spelled consistently; there are obviously matches. I have also used the fuzzy matching command agrepl, but in my actual data the number and quality of matching varies incredibly with small changes to the accepted Levenshtein distance.
There are also some answers here and here but my lack of advanced programming experience has kept me from applying the concepts there. Any thoughts are appreciated!
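One way to make the matching less sensitive to the chosen distance is to normalise the names first and only then compare them. The sketch below is an assumption-laden illustration, not a general solution: the `normalise()` helper, the abbreviation it expands and the list of legal suffixes it drops are all hypothetical choices made for this sample data.

```r
# Hypothetical helper: normalise a company name before comparing.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)                          # drop punctuation
  x <- gsub("\\bconst\\b", "construction", x, perl = TRUE) # expand one abbreviation
  x <- gsub("\\b(inc|incorporated|llc|co|company)\\b", "", x, perl = TRUE) # drop legal suffixes
  trimws(gsub("\\s+", " ", x))                             # collapse whitespace
}

employer   <- c("a.c. construction", "abc concrete company",
                "xyz pool construction inc", "frank studebager llc",
                "annoying contractors llc", "beaumont ditch digging co inc")
contractor <- c("a-c construction", "hank hill construction",
                "xyz pool const incorporated", "frank studebaer co",
                "hank hill const")

# Exact match on the normalised names.
exact <- normalise(contractor) %in% normalise(employer)

# agrepl() on the normalised names picks up remaining small typos.
fuzzy <- vapply(normalise(contractor), function(nm) {
  any(agrepl(nm, normalise(employer), max.distance = 0.1))
}, logical(1), USE.NAMES = FALSE)

print(data.frame(contractor, exact, fuzzy))
```

Because `agrepl()` does approximate substring matching, it can still produce false positives on short names, so the `exact` column is the more trustworthy of the two indicators.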
I have one dataframe with fewer than 5000 rows (a csv file). It has plenty of columns, one of which is the company name.
However, there are many duplicates with different names. For example, one company can be called: HH 785 EN
And its duplicate could be called: HH 785EN or HH784 EN
Every duplicate differs from the original company name by 1 or 2 characters.
I'm looking for an algorithm that could potentially detect these duplicates.
Most of the fuzzy matching problems I have seen involve 2 datasets, which isn't my case.
I have seen many algorithms that take one word and a list as input, but I want to check my whole column of company names against itself.
Thanks for your help.
I think you are looking for the agrep function, which does approximate matching based on the Levenshtein distance. You can combine agrep with sapply to find the fuzzy matches.
sapply(df$company_name,agrep,df$company_name)
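As a rough sketch of how the result of that call can be used: every name also matches itself, so the self-matches have to be dropped before the remaining indices can be read as candidate duplicates. The data frame below and the limit of 2 edits are assumptions for the illustration.

```r
# Hypothetical data frame with a company_name column.
df <- data.frame(
  company_name = c("HH 785 EN", "HH 785EN", "HH784 EN", "ZZ 100 AB"),
  stringsAsFactors = FALSE
)

# For every name, agrep() returns the indices of all names within the
# allowed edit distance (max.distance = 2 edits is an arbitrary choice).
hits <- lapply(df$company_name, agrep, x = df$company_name, max.distance = 2)

# Every name matches itself, so drop the self-match from each result.
others <- Map(setdiff, hits, seq_along(hits))
names(others) <- df$company_name

# Keep only the names that have at least one other candidate duplicate.
print(others[lengths(others) > 0])
```

Using `lapply()` instead of `sapply()` keeps the result a list even when every name happens to have the same number of matches.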
I'm trying to merge two large datasets. The common variable, first and last name, varies in spelling between the datasets and there are many duplicates, even between similarly spelled names. I've included download links for the files and some R code below. I'll walk through what I've tried and what went wrong.
There are a few R tutorials that have tried to tackle the (common) problem of record linking, but none of them dealt with large datasets. I'm hoping the SO community can help me solve this problem.
The first dataset is a large file (several hundred thousand rows) of Federal Elections Commission political contributions.
The second is a custom dataset of the name and companies of every Internet company founder (~5,000 rows)
https://www.dropbox.com/s/lfbr9lmurv791il/010614%20CB%20Founders%20%20-%20CB%20Founders.csv?dl=0
--Attempted code matching with regular expressions--
My first attempt, thanks to the help of previous SO suggestions, was to use agrep and regular string matching. This narrowed down the names, but resulted in too many duplicates.
#Load files#
library(data.table) # provides fread
expends12 <- fread("file path for FEC", sep="|", header=FALSE)
crunchbase.raw <- fread("file path for internet founders")
exp <- expends12
cr <- crunchbase.raw
#use regular string matching#
exp$xsub= gsub("^([^,]+)\\, (.{7})(.+)", "\\2 \\1", tolower(expends12$V8))
cr$ysub= gsub("^(.{7})([^ ]+) (.+)", "\\1 \\3", tolower(cr$name))
#merge files#
fec.merge <- merge(exp,cr, by.x="xsub", by.y="ysub")
The result is 6,900 rows, so there are a lot of duplicates. Many rows are people with names similar to those of Internet founders, such as Alexander Black, but who are from different states and have different job titles. So now it's a question of finding the real Internet founder.
One option to narrow the results would be to filter them by state. So I might only take the Alexander Black from California or New York, because that is where most startups are founded. I might also only take certain job titles, such as CEO or founder. But many founders had jobs before and after their companies, so I wouldn't want to narrow by job title too much.
Alternatively, there is an R package, RecordLinkage, but as far as I can tell, it needs similar rows and columns in both datasets, which is a nonstarter for this task.
I'm familiar with R, but have somewhat limited statistical knowledge and programming ability. Any step-by-step help is very much appreciated. Thank you and please let me know if there's any trouble downloading the data.
Why don't you select the columns you need from both datasets and rename them so that they match? In the result object you get the row indices of the matches, so as long as you don't reorder anything, you can use the results to match both datasets.
I have a dataset with individuals' names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names and/or addresses and/or phone numbers. A snippet of the fake data is shown below:
first       last    address        phone
Jimmy       Bamboo  P.O. Box 1190  xxx-xx-xx00
Jimmy W.    Bamboo  P.O. Box 1190  xxx-xx-xx22
James West  Bamboo  P.O. Box 219   xxx-66-xxxx
... and so on. Sometimes E. is spelled out as East and St. as Street; at other times they are not.
What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it, the more I think it's impossible without making some judgment calls, e.g. saying that if at least two or three fields are the same, the records are treated as a single individual.
thanks!!
Ani
As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.
There are several tools you can use (we use them all).
1) distance matching, like Damerau-Levenshtein. You can use this for names, addresses and other things. It handles errors like transpositions, minor misspellings, omitted characters, etc.
2) phonetic word matching - soundex is not good. There are other more advanced ones. We ended up writing our own to handle the mix of ethnicities we commonly encounter.
3) nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup of nicknames to regular name. Consider though the variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.
Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.
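The nickname lookup from point 3 can be sketched as a named vector that maps each variant to a canonical first name. The table below is a tiny hypothetical sample (the Jimmy to James entry in particular is an assumption), and a real one would be far larger.

```r
# Hypothetical nickname table: the names are the variants, the values are
# the canonical first names.
nicknames <- c(fanny = "frances", jen = "jennifer", jenny = "jennifer",
               jennie = "jennifer", jimmy = "james")

# Replace a first name by its canonical form when it is a known nickname,
# otherwise return it unchanged (lower-cased and trimmed).
canonical_first <- function(first) {
  key <- tolower(trimws(first))
  hit <- nicknames[key]            # NA for names that are not in the table
  unname(ifelse(is.na(hit), key, hit))
}

first <- c("Jimmy", "James", "Jenny", "Fanny", "Kaitlynn")
print(canonical_first(first))
```

Running the lookup before any distance or phonetic matching means that pairs such as Fanny and Frances, which neither of those methods would catch, are already identical when the fuzzy step starts.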
Here are some other answers I've written on similar topics here on Stack Overflow:
Processing of mongolian names
How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?
MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard
You can calculate the pairwise matrix of Levenshtein distances.
See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/
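A minimal sketch of the pairwise matrix, using base R's `adist()` rather than the stringdist package from the linked post. The name strings, the single-linkage clustering step and the cut-off of 7 edits are assumptions for the illustration.

```r
# Hypothetical full-name strings.
x <- c("Jimmy Bamboo", "Jimmy W. Bamboo", "James West Bamboo", "Maria Lopez")

# adist() (package utils) returns the full pairwise Levenshtein matrix.
d <- adist(x)
dimnames(d) <- list(x, x)
print(d)

# One possible use: single-linkage hierarchical clustering on the distances,
# cutting the tree at a chosen height (7 edits is an arbitrary cut-off).
groups <- cutree(hclust(as.dist(d), method = "single"), h = 7)
print(split(x, groups))
```

Note that a full pairwise matrix grows with the square of the number of rows, so for something like 120,000 records it would need to be computed within smaller blocks (for example per last-name initial) rather than all at once.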
First off, this may be the wrong Forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[!fData(cd4T)$symbol == "",]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So 6897 of my 19794 probesets are redundant: each maps to a gene symbol that another probeset already covers. I'd like to somehow combine the expression levels of each probeset associated with each gene. I don't care much about the actual probe id for each probe. I'd like very much to end up with an ExpressionSet containing the merged information as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods that I've seen used the most on a large scale are keeping only the probeset with the largest variance across the expression matrix, and taking the mean of the probesets to create a meta-probeset. For smaller blocks of probesets I've seen people use more intensive methods involving looking at per-probeset plots to get a feel for what's going on ... generally what happens is that one probeset turns out to be the 'good' one and the rest aren't very good.
I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.
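For what it's worth, the two approaches described above can be sketched on a plain matrix in a few lines of base R. The matrix `expr` and the mapping `gene` are made-up toy data. With a real ExpressionSet they would presumably come from `exprs()` and `fData()`, and the result would still have to be wrapped back into an ExpressionSet.

```r
# Hypothetical expression matrix: 5 probesets (rows) x 3 samples (columns).
expr <- matrix(c(1, 2, 3,
                 2, 2, 2,
                 5, 1, 9,
                 4, 4, 5,
                 7, 8, 9),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("probe", 1:5), paste0("s", 1:3)))

# Hypothetical probeset-to-gene mapping, one symbol per row of `expr`.
gene <- c("A", "A", "B", "B", "C")

# Option 1: average the probesets of each gene (a "meta-probeset").
# rowsum() adds up the rows per gene; dividing by the number of probesets
# per gene (same sorted order) turns the sums into means.
by_mean <- rowsum(expr, gene) / as.vector(table(gene))

# Option 2: keep, for each gene, the probeset with the largest variance
# across samples.
v <- apply(expr, 1, var)
keep <- tapply(seq_along(gene), gene, function(i) i[which.max(v[i])])
by_var <- expr[keep, , drop = FALSE]
rownames(by_var) <- names(keep)

print(by_mean)
print(by_var)
```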
The word you are looking for is 'nsFilter' in the R genefilter package. This function does two major things. First, it keeps only the probesets that have an Entrez Gene ID; the rest are filtered out. Second, when an Entrez ID has multiple probesets, the one with the largest value of the variance statistic (IQR by default) is retained and the others are removed. Now you have a matrix with unique Entrez Gene ID mappings. Hope this helps.