Joining data sets in R where the unique IDs have spelling mistakes

Hi, I am trying to join two large datasets (>10,000 entries each). To do this I have created a 'unique ID': a combination of full name and date of birth, both of which are present in each dataset. However, the datasets have spelling mistakes/different characters in the IDs, so when using a left join many won't match. I don't have access to fuzzyjoin/match, so I can't use those to partially match them. Someone has suggested using adist(). How can I use it to match and merge the datasets, or to flag IDs which are close to matching? As simple as possible please, I have only been using R for a few weeks!
Examples of code would be amazing.

You could just rename them to names that are spelled correctly:
df$correct_spelling <- df$incorrect_spelling

This may be a bit of a manual solution, but a base-R approach would be to look through the unique values of the join fields, correct any that are misspelled using the grep() function, and create a crosswalk to merge into the dataframes with the misspelled unique IDs. Here's a trivial example of what I mean:
Let's say we have a dataframe of scientists and their year of birth, and then we have a second dataframe with the scientists' names and their field of study, but the "names" column is riddled with spelling errors. Here is the code to make the example dataframes:
##Fake Data##
Names <- c("Jill", "Maria", "Carlos", "DeAndre") # Names
BirthYears <- c(1974, 1980, 1991, 1985) # Birth years
Field <- c("Astronomy", "Geology", "Medicine", "Ecology") # Fields of science
Mispelled <- c("Deandre", "Marai", "Jil", "Clarlos") # Names misspelled
##Creating Dataframes##
DF <- data.frame(Names = Names, Birth = BirthYears) # Dataframe with correct spellings
DF2 <- data.frame(Names = Mispelled, Field = Field) # Dataframe with incorrect spellings we want to fix and merge
What we can do is find all the unique values of the correctly spelled and the incorrectly spelled versions of the scientists' names, and then look up each misspelled name's correct counterpart with the pattern-matching function grep().
Mispelled2 <- unique(DF2$Names) # Get unique values of names from the misspelled dataframe
Correct <- unique(DF$Names) # Get unique values of names from the correctly spelled dataframe
fix <- NULL # Blank vector to save results from the loop
for (i in 1:length(Mispelled2)) { # Loop through all unique misspelled values
  ptn <- paste("^", substring(Mispelled2[i], 1, 1), "+", sep = "") # Regular expression: correct names starting with the same first letter
  fix[i] <- grep(ptn, Correct, value = TRUE) # Find and save the replacement name value
} # End loop
You'll have to come up with the regular expressions necessary for your situation; here is a link to a cheat sheet on how to build regular expressions:
https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
Now we can make a dataframe crosswalking the misspelled names with the correct spellings, i.e., row 1 would have "Deandre" and "DeAndre", row 2 would have "Jil" and "Jill":
CWX <- data.frame(Name_wrong = Mispelled2, Name_correct = fix)
Finally, we merge the crosswalk into the dataframe with the incorrect spellings, and then merge the resulting dataframe with the dataframe that has the correct spellings:
Mispelled3 <- merge(DF2, CWX, by.x = "Names", by.y = "Name_wrong")
Joined_DF <- merge(DF, Mispelled3[, -1], by.x = "Names", by.y = "Name_correct")
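Since the question specifically asks about adist(), here is a minimal sketch (reusing the toy data above; the distance threshold of 2 is an arbitrary choice) of how the same crosswalk could be built from edit distances instead of hand-written patterns, with borderline matches flagged for review:
# Edit-distance matrix: rows = misspelled names, columns = correct names
D <- adist(Mispelled2, Correct, ignore.case = TRUE)
# For each misspelled name, take the closest correct name and record the distance
CWX_adist <- data.frame(Name_wrong = Mispelled2,
                        Name_correct = Correct[apply(D, 1, which.min)],
                        dist = apply(D, 1, min))
# Flag rows whose best match is still far off, so they can be checked by hand
CWX_adist$check_me <- CWX_adist$dist > 2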

Here is what I was able to come up with for your question about matching in multiple ways. It's a bit clunky, but it works with the example data below. The trick is making the call to agrep() sensitive enough not to match names that are only partially similar but truly different, yet flexible enough to allow for partial matches and misspellings:
Example1 <- "deborahoziajames04/14/2000"
Example2 <- "Somepersonnotdeborah04/15/2002"
Example3 <- "AnotherpersonnamedJames01/23/1995"
Misspelled1 <- "oziajames04/14/2000"
Misspelled2 <- "deborahozia04/14/2000"
Misspelled3 <- "deborahoziajames10/14/1990"
Misspelled4 <- "personnamedJames"
String <- c(Example1, Example2, Example3)
Misspelled <- c(Misspelled1, Misspelled2, Misspelled3, Misspelled4)
Spell_Correct <- function(String, Misspelled) {
  out <- list() # Collects one matched string per misspelled ID
  for (i in 1:length(Misspelled)) {
    # Patterns anchored at the front, middle, and end of the target strings
    ptn_front <- paste('^', Misspelled[i], "\\B", sep = "")
    ptn_mid <- paste('\\B', Misspelled[i], "\\B", sep = "")
    ptn_end <- paste('\\B', Misspelled[i], "$", sep = "")
    ptn <- c(ptn_front, ptn_mid, ptn_end)
    out_front <- agrep(pattern = ptn[1], x = String, value = TRUE, max.distance = 0.3, ignore.case = TRUE, costs = c(0.6, 0.6, 0.6))
    out_mid <- agrep(pattern = ptn[2], x = String, value = TRUE, max.distance = 0.3, ignore.case = TRUE, costs = c(0.6, 0.6, 0.6))
    out_end <- agrep(pattern = ptn[3], x = String, value = TRUE, max.distance = 0.3, ignore.case = TRUE, costs = c(0.6, 0.6, 0.6))
    out_test <- list(out_front, out_mid, out_end)
    for (j in 1:length(out_test)) {
      if (length(out_test[[j]]) == 1) # [[ ]] so we test the match vector itself, not a length-1 list
        use_me <- out_test[[j]]
    }
    out[[i]] <- use_me
  }
  return(unlist(out))
}
Spell_Correct(String, Misspelled)
Basically this just repeats the previous answer multiple times, using the loop to try beginning, middle, and end calls to agrep() with differently anchored patterns. Depending on how bad the misspellings are, you may need to play around with the max.distance and costs arguments. Good luck.
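If the IDs mostly differ by a missing or garbled fragment, a single direct agrep() call per misspelled ID may already be enough before reaching for the anchored patterns; a quick illustration with the example data:
# Approximate matching of one misspelled ID against the full ID strings
agrep(Misspelled1, String, max.distance = 0.3, ignore.case = TRUE, value = TRUE)
# Should return "deborahoziajames04/14/2000"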
Take Care,
-Sean

Related

Nb_words code counting partial matches in R

Before I get started, I would like you to know that I am completely new to coding in R. For a group assignment, our professor set up a database by scraping data from Amazon. Within the database, which is called 'dat', there is a column named 'product_name'. We were given a set group of utilitarian words. I think you can guess where this is going: within the column 'product_name' we have to find, for each product name, whether any of the utilitarian words appears, and if yes, how many times. We were given the following code by our professor to use for this assignment:
nb_words <- function(lexicon, corpus) {
  rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}
After which I created the following code:
uti_words <- c("additives", "antioxidant", "artificial", "busy", "calcium", "calories", "carb", "carbohydrates", "chemicals", "cholesterol", "convenient", "dense", "diet", "fast")
sentences <- dat$product_name
nb_words(lexicon = uti_words, corpus = sentences)
When I run nb_words, however, I notice something goes wrong. A sentence contained the word 'breakfast', and my code counted this as a match because the word 'fast' from uti_words matched part of it. I don't want this to happen; does anyone know how to make it so that I only get exact matches and no partial matches?
We may have to add a word boundary (\\b) to avoid partial matches:
uti_words <- paste0("\\b", trimws(uti_words), "\\b")
Another option is to pass fixed = TRUE to the grepl() part of the code, so the words are treated as literal strings rather than regular expressions; note, though, that this alone still allows substring hits like 'fast' inside 'breakfast', so the word-boundary version above is the one that actually prevents partial matches:
nb_words <- function(lexicon, corpus) {
  rowSums(sapply(lexicon, function(x) grepl(x, corpus, fixed = TRUE)))
}
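As a quick sanity check with the original (regex) version of nb_words and a small made-up corpus (standing in for dat$product_name, which we don't have), the boundary-wrapped words no longer count 'breakfast' as a hit for 'fast':
sentences <- c("my breakfast bowl", "fast diet shake")
uti_words <- c("fast", "diet")
nb_words(lexicon = uti_words, corpus = sentences) # 1 2: "breakfast" wrongly matches "fast"
nb_words(lexicon = paste0("\\b", trimws(uti_words), "\\b"), corpus = sentences) # 0 2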

Creating Sub Lists from A to Z from a Master List

Task
I am attempting to use better functionality (a loop or vectorised code) to parse a larger list down into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list with the letter B ... the possible 27th list contains all remaining entries that start with numbers or other characters).
I am then attempting to identify which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled, e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case = TRUE)
write.table(stakeAdist, "test.csv", quote = TRUE, sep = ",", row.names = stakeA, col.names = stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and move my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for (i in 1:26) {
  stake <- subset(df, uniqueID %in% grep(paste0('^[', letters[i], LETTERS[i], '].*'), df$uniqueID, value = TRUE))
  stakeDist <- adist(stake$uniqueID, ignore.case = TRUE)
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote = TRUE, sep = ',')
}
Using a combination of paste0() and the built-in letters and LETTERS vectors, this creates your grep expression.
Using subset(), the matching IDs are extracted.
paste0() also creates a unique filename for write.table().
And it is all tied together using a for() loop.
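For Ask Two, the near-matches could also be flagged inside R rather than via Excel conditional formatting; a minimal sketch, reusing one letter's stake and stakeDist from inside the loop above and the 1-to-10 threshold from the question:
# Positions in the distance matrix where adist is between 1 and 10
flag <- which(stakeDist >= 1 & stakeDist <= 10, arr.ind = TRUE)
# Keep each pair once (the matrix is symmetric)
flag <- flag[flag[, 1] < flag[, 2], , drop = FALSE]
# Table of similar-looking ID pairs to review
data.frame(name1 = stake$uniqueID[flag[, 1]],
           name2 = stake$uniqueID[flag[, 2]],
           dist = stakeDist[flag])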

Data Frame containing hyphens using R

I have created a list (based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens (-) in them:
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {
  assign(paste("df1.", i, sep = ""), df[df$Dimension == i, ])
}
This works fine; however, when I come to aggregate the results in order to get some summary statistics, I can't reference the datasets, as R stops reading after the hyphen (I assume the hyphen is some special character).
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # ! & | ^ ( [ ' " in them. At ?quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments. Use split to divide up your data frame into a list. And keep it in a list rather than use assign, it will be much easier to loop over or use lapply with that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
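With the data in a named list, the aggregation step then works in one pass; a minimal sketch, assuming a hypothetical numeric column Value you want summary statistics for:
# One summary per Dimension level, no assign() or hyphenated names needed
lapply(nice_list, function(d) summary(d$Value))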

obtaining count of phrases contained between parentheses and containing specific character

There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) are always of this format: open parentheses, word beginning with capital letter, possible spaces and other words, v., another word beginning with a capital letter, and possibly more spaces and words, close parentheses.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!
I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
                          "(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
                          "I am trying to confuse you (this is not a court case)",
                          "this one is also confusing (But with Capital Letters)",
                          "this is confusing (With Capitols and v. d)"),
                 stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
                                      df$text, perl = TRUE))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
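With the example data above, this should yield counts of 1, 2, 0, 0 and 0: the pattern requires a capital letter on each side of ' v. ', so the lowercase 'v. d' in the last row is (correctly) not counted:
df$numCases
# [1] 1 2 0 0 0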
As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
Alternatively, change your regex as below:
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]

How to grep for all-but-one matching columns in R

I am trying to subset a large data frame to my columns of interest. I do so using the grep function, but this selects one column too many ("has_socio"), which I would like to remove.
The following code does exactly what I want, but I find it unpleasant to look at. I want to do it in one line. Aside from just calling the first subset inside the second subset, can it be optimized?
DF <- read.dta("./big.dta")
DF0 <- na.omit(subset(DF, select=c(other_named_vars, grep("has_",names(DF)))))
DF0 <- na.omit(subset(DF0, select=-c(has_socio)))
I know similar questions have been asked (e.g. Subsetting a dataframe in R by multiple conditions) but I do not find one that addresses this issue specifically. I recognize I could just write the grep RE more carefully, but I feel the above code more clearly expresses my intent.
Thanks.
Replace your grep with:
vec <- c("blah", "has_bacon", "has_ham", "has_socio")
grep("^has_(?!socio$)", vec, value=T, perl=T)
# [1] "has_bacon" "has_ham"
(?!...) is a negative lookahead operator, which looks ahead and makes sure that its contents do not follow the actual matching piece behind it (has_ being the matching piece).
Alternatively, drop the unwanted name with setdiff():
setdiff(grep("has_", vec, value = TRUE), "has_socio")
## [1] "has_bacon" "has_ham"
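Applied to the original code, the lookahead version collapses the two subset steps into one; a sketch, keeping the question's other_named_vars as-is:
DF0 <- na.omit(subset(DF, select = c(other_named_vars, grep("^has_(?!socio$)", names(DF), perl = TRUE))))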
