R - Merging two data files based on partial matching of inconsistent full name formats - r

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.
For example, I have the following two data files:
File name: Employee Data
Full Name Date Started Orders
ANGELA MUIR 6/15/14 25
EILEEN COWIE 6/15/14 44
LAURA CUMMING 10/6/14 43
ELENA POPA 1/21/15 37
KAREN MACEWAN 3/15/99 39
File name: Assessment data
Candidate Leading Factor SI-D SI-I
Angie muir I -3 12
Caroline Burn S -5 -3
Eileen Mary Cowie S -5 5
Elena Pope C -4 7
Henry LeFeuvre C -5 -1
Jennifer Ford S -3 -2
Karen McEwan I -4 10
Laura Cumming S 0 6
Mandip Johal C -2 2
Mubarak Hussain D 6 -1
I want to merge them based on names (Full Name in df1 and Candidate in df2) ignoring middle name (eg Eilen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming); misspells (e.g. Elena Popa = Elena Pope) etc.
The ideal output would look like this:
Name Full Name Candidate Date Started Orders Leading Factor SI-D SI-I
ANGELA MUIR ANGELA MUIR Angie muir 6/15/14 25 I -3 12
Caroline Burn N/A Caroline Burn N/A N/A S -5 -3
EILEEN COWIE EILEEN COWIE Eileen Mary Cowie 6/15/14 44 S -5 5
ELENA POPA ELENA POPA Elena Pope 1/21/15 37 C -4 7
Henry LeFeuvre N/A Henry LeFeuvre N/A N/A C -5 -1
Jennifer Ford N/A Jennifer Ford N/A N/A S -3 -2
KAREN MACEWAN KAREN MACEWAN Karen McEwan 3/15/99 39 I -4 10
LAURA CUMMING LAURA CUMMING Laura Cumming 10/6/14 43 S 0 6
Mandip Johal N/A Mandip Johal N/A N/A C -2 2
Mubarak Hussain N/A Mubarak Hussain N/A N/A D 6 -1
Any suggestions would be greatly appreciated!

Here's a process that may help. You will have to inspect the results and make adjustments as needed.
df1
# v1 v2
#1 ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith 5/3/14
df2
# u1 u2
#1 Eileen Mary Cowie I-3
#2 Angie muir -5 5
index <- sapply(df1$v1, function(x) {
agrep(x, df2$u1, ignore.case=TRUE, max.distance = .5)
}
)
index <- unlist(index)
df2$u1[index] <- names(index)
merge(df1, df2, by.x='v1', by.y='u1')
# v1 v2 u2
#1 ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14 I-3
I had to adjust the argument max.distance in the index function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.
Data
v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors=F)
df2 <- data.frame(u1, u2, stringsAsFactors = F)

Related

Match strings by distance between non-equal length ones

Say we have the following datasets:
Dataset A:
name age
Sally 22
Peter 35
Joe 57
Samantha 33
Kyle 30
Kieran 41
Molly 28
Dataset B:
name company
Samanta A
Peter B
Joey C
Samantha A
My aim is to match both datasets while ordering the subsequent one's values by distance and keeping only the relevant matches. In other words, the output should look as follows below:
name_a name_b age company distance
Peter Peter 35 B 0.00
Samantha Samantha 33 A 0.00
Samantha Samanta 33 A 0.04166667
Joe Joey 57 C 0.08333333
In this example I'm calculating the distance using method = "jw" in stringdist, but I'm happy with any other method that might work. Until now I've been doing attempts with packages such as stringr or stringdist.
You can use stringdist_inner_join to join the two dataframes and use levenshteinSim to get the similarity between the two names.
library(fuzzyjoin)
library(dplyr)
stringdist_inner_join(A, B, by = 'name') %>%
mutate(distance = 1 - RecordLinkage::levenshteinSim(name.x, name.y)) %>%
arrange(distance)
# name.x age name.y company distance
#1 Peter 35 Peter B 0.000
#2 Samantha 33 Samantha A 0.000
#3 Samantha 33 Samanta A 0.125
#4 Joe 57 Joey C 0.250

How to convert a n x 3 data frame into a square (ordered) matrix?

I need to reshape a table or (data frame) to be able to use an R package (NetworkRiskMetrics). Suppose I have a data frame of lenders, borrowers and loan values:
lender borrower loan_USD
John Mark 100
Mark Paul 45
Joe Paul 30
Dan Mark 120
How do I convert this data frame into:
John Mark Joe Dan Paul
John
Mark
Joe
Dan
Paul
(placing zeros in empty cells)?
Thanks.
Use reshape function
d <- data.frame(lander=c('a','b','c', 'a'), borower=c('m','p','m','p'), loan=c(10,20,15,12))
d
loan lander borower
10.1 1 a m
20.1 1 b p
15.1 1 c m
12.1 1 a p
reshape(data=d, direction='long', varying=list('lander','borower'), idvar='loan', timevar='loan')
lander borower loan
1 a m 10
2 b p 20
3 c m 15
4 a p 12

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
libary(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
This should work based on your sample.
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]

Rank values based on individual users

I have a data set that looks like the follow:
User Area
Sarah 123.4
Sarah 20.5
Sarah 43
Sam 86
Sam 101
Sam 32.6
Justin 45
Justin 125.8
Justin 39
Justin 88.4
Zac 21
Zac 4
Zac 111
I want to sort the greatest area to smallest, however I want separate top areas for each individual user.
I have tried: test$Ranking1 <- order(test$User, test$Area, decreasing = FALSE ), but this ranks them all together
I then tried: test$Ranking1 <- ave(test$User, test$Area, FUN= rank ), and while others seem to have said this will work my output/ results give the middle (average) value a ranking of 1 and then going up by which is closest to the average. I was 1 to be the largest area not the average.
Any suggestions?
This can be done very easily with data.table:
library(data.table) # Load package
setDT(dat) # convert to data.table
dat[,max(Area),by=User] # compute
dat[,sort(Area),by=User] # Sort increasing
dat[,sort(Area,decreasing = T),by=User] # Sort decreasing
Hope this helps!
Read the documentation of the package, it's very helpful.
I am assuming that you want to rank area within each individual, and also want to know the largest area for each individual:
## make up data
set.seed(1)
user <- rep(LETTERS[sample(26, 5)], each=sample(5, 1))
area <- rnorm(length(user), 100, 20)
d <- data.frame(user, area)
library(dplyr)
d %>%
group_by(user) %>%
mutate(ranking=rank(-area), top_area=max(area)) %>%
ungroup()
user area ranking top_area
1 G 131.90562 1 131.9056
2 G 106.59016 4 131.9056
3 G 83.59063 5 131.9056
4 G 109.74858 3 131.9056
5 G 114.76649 2 131.9056
6 J 111.51563 2 130.2356
7 J 93.89223 4 130.2356
8 J 130.23562 1 130.2356
9 J 107.79686 3 130.2356
10 J 87.57519 5 130.2356
...
I think this is your desired output? If you want the order reversed, wrap a rev() around the rank() function.
x = "User Area
Sarah 123.4
Sarah 20.5
Sarah 43
Sam 86
Sam 101
Sam 32.6
Justin 45
Justin 125.8
Justin 39
Justin 88.4
Zac 21
Zac 4
Zac 111"
rank.foo = function(x) {
z = numeric()
for (i in as.character(unique(x$User)))
{z = c(z, rank(subset(x, User == i)$Area))}
return(z)
}
cbind(df, rank.foo(df))
User Area rank.foo(df)
Sarah 123.4 3
Sarah 20.5 1
Sarah 43.0 2
Sam 86.0 2
Sam 101.0 3
Sam 32.6 1
Justin 45.0 2
Justin 125.8 4
Justin 39.0 1
Justin 88.4 3
Zac 21.0 2
Zac 4.0 1
Zac 111.0 3

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see the value "Jane" appears 3 times. What I want to do is to deduplicate the list based on the variable "Name" but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example if I was to deduplicate the above file in excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no3) has got missing information in the other columns.
So in other words I want to deduplicate the list by "Name" but add a criteria to keep the rows that have any other value different from "0" in the column "Age". This way the result I would get would be this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like excel it keeps the value of "Jane" that has no usable information in the other columns.
How do I sort the rows based on column "Age" in a Z-A order so that anything that has "0" will be on the bottom and will be removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort decreasingly by Age and Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select by indexing the correct rows meeting your conditions, I really think the first part is better than these two lasts.
If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18

Resources