Match strings of unequal length by distance in R

Say we have the following datasets:
Dataset A:
name age
Sally 22
Peter 35
Joe 57
Samantha 33
Kyle 30
Kieran 41
Molly 28
Dataset B:
name company
Samanta A
Peter B
Joey C
Samantha A
My aim is to match the two datasets, ordering the second one's values by distance and keeping only the relevant matches. In other words, the output should look like this:
name_a name_b age company distance
Peter Peter 35 B 0.00
Samantha Samantha 33 A 0.00
Samantha Samanta 33 A 0.04166667
Joe Joey 57 C 0.08333333
In this example I'm calculating the distance with method = "jw" from stringdist, but I'm happy with any other method that works. So far I've made attempts with packages such as stringr and stringdist.

You can use stringdist_inner_join from fuzzyjoin to join the two data frames and levenshteinSim from RecordLinkage to get the similarity between the two names.
library(fuzzyjoin)
library(dplyr)

stringdist_inner_join(A, B, by = 'name') %>%
  # distance = 1 - similarity, so identical names get 0
  mutate(distance = 1 - RecordLinkage::levenshteinSim(name.x, name.y)) %>%
  arrange(distance)
# name.x age name.y company distance
#1 Peter 35 Peter B 0.000
#2 Samantha 33 Samantha A 0.000
#3 Samantha 33 Samanta A 0.125
#4 Joe 57 Joey C 0.250
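If you want the Jaro-Winkler distances shown in the question instead, the same join can compute them directly: method is forwarded to stringdist, and distance_col stores the distance. A minimal sketch, assuming a max_dist cutoff of 0.1 is tight enough to keep only the relevant pairs:
library(fuzzyjoin)
library(dplyr)

# 'jw' distances lie in [0, 1], so max_dist must be well below the default of 2
stringdist_inner_join(A, B, by = 'name', method = 'jw',
                      max_dist = 0.1, distance_col = 'distance') %>%
  arrange(distance)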

Related

Questions about how to divide and find averages of a dataset

Let's say I have a dataset with a list of names and their ages:
Tom 65
Sam 40
Sue 88
Kay 4
Jon 25
Lia 85
Ian 39
Joe 10
Bea 17
Jan 43
Jen 17
Ike 24
Jay 35
Cam 77
Jin 12
Ron 1
Ray 45
Leo 29
Ken 98
Mel 56
Amy 49
Joy 67
Ivy 3
Noe 14
Max 31
Jax 61
Lee 19
Ace 28
Ben 5
Guy 74
I'm trying to divide the dataset into ten equal bins in descending order (e.g. the first bin will have Ken, Sue, and Lia, and the last bin will have Ben, Ivy, and Ron), and I want to find the average age for each bin (so the average age for the first bin would be 90.33). I was able to do this in MS Excel quite easily, but I'm not sure how to do it efficiently in R. Any suggestions?
We can use cut to create a grouping variable and then summarise by taking the mean. Note that cut(v2, breaks = 10) splits the range of ages into ten equal-width intervals, so the bins need not hold three names each:
library(dplyr)

# df1 holds the names in v1 and the ages in v2
df1 %>%
  group_by(grp = cut(v2, breaks = 10)) %>%
  summarise(v1 = list(v1), v2 = mean(v2))
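Since the question asks for ten equal-count bins of three names each, which cut's equal-width intervals don't guarantee, here is a sketch with dplyr's ntile(), assuming as above that v1 holds the names and v2 the ages:
library(dplyr)

df1 %>%
  mutate(bin = ntile(desc(v2), 10)) %>%   # bin 1 = three oldest, bin 10 = three youngest
  group_by(bin) %>%
  summarise(names = paste(v1, collapse = ", "), avg_age = mean(v2))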

Joining rows with columns to create vertical table

I am trying to figure out how to join 2 data frames to create a vertical table of the data. Here is some sample data:
people <- data.frame(person = c("John","David","Peter"), company = c("A", "B", "C"))
grades <- data.frame(person1=c(10, 40, 50, 60), person2=c(60,70,80, 100), person3=c(33,44,55, 75))
NOTE: The order of the columns in grades is the same as the order of the person column in the people data frame.
I would like to get a data frame like the following but can't think of how to get there. Would prefer a solution using base R (am using an older version of R so some packages don't work for me):
person | company | grade
-------------------------
John | A | 10
John | A | 40
John | A | 50
John | A | 60
David | B | 60
David | B | 70
David | B | 80
David | B | 100
Peter | C | 33
Peter | C | 44
Peter | C | 55
Peter | C | 75
We replace the column names of 'grades' with the 'person' column from 'people', gather into 'long' format, and then do a left_join:
library(tidyverse)

setNames(grades, people$person) %>%
  gather(person, grade) %>%
  left_join(people)
# person grade company
#1 John 10 A
#2 John 40 A
#3 John 50 A
#4 John 60 A
#5 David 60 B
#6 David 70 B
#7 David 80 B
#8 David 100 B
#9 Peter 33 C
#10 Peter 44 C
#11 Peter 55 C
#12 Peter 75 C
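In current tidyr, gather() has been superseded by pivot_longer(); a sketch of the same pivot. Note that pivot_longer() emits rows in row-major order, so an explicit arrange() is needed to reproduce the grouped output above:
library(tidyverse)

setNames(grades, people$person) %>%
  pivot_longer(everything(), names_to = 'person', values_to = 'grade') %>%
  left_join(people, by = 'person') %>%
  arrange(match(person, people$person))   # restore the per-person grouping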
Or using base R with merge
merge(stack(setNames(grades, people$person)),
      people, all.x = TRUE, by.x = 'ind', by.y = 'person')
A base R option using cbind would be
# repeat each row of people once per row of grades, then bind the stacked grades
idx <- rep(seq_along(people$person), each = dim(grades)[1])
cbind(people[idx, ], stack(unlist(grades))["values"])
Result
# person company values
#1 John A 10
#1.1 John A 40
#1.2 John A 50
#1.3 John A 60
#2 David B 60
#2.1 David B 70
#2.2 David B 80
#2.3 David B 100
#3 Peter C 33
#3.1 Peter C 44
#3.2 Peter C 55
#3.3 Peter C 75
Use unlist and stack on grades to get
stack(unlist(grades))
   values      ind
1      10 person11
2      40 person12
3      50 person13
4      60 person14
5      60 person21
6      70 person22
7      80 person23
8     100 person24
9      33 person31
10     44 person32
11     55 person33
12     75 person34
Since "The order of the columns in grades is the same as the order of the person column in the people data frame." we can use cbind next, after we expanded people to have the correct number of rows.
(idx <- rep(seq_along(people$person), each = dim(grades)[1]))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3
Another option, probably a little faster, would be
cbind(people[idx,], data.frame(grade = unlist(grades, use.names = FALSE)))
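The same result can also be assembled directly, without stack() and without the duplicated row names that people[idx, ] produces; a base R sketch (not from the original answers), relying on unlist() flattening grades column by column:
data.frame(person  = rep(people$person,  each = nrow(grades)),
           company = rep(people$company, each = nrow(grades)),
           grade   = unlist(grades, use.names = FALSE))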

R - Merging two data files based on partial matching of inconsistent full name formats

I'm looking for a way to merge two data files based on partial matching of participants' full names, which are sometimes entered in different formats and sometimes misspelled. I know there are some function options for partial matches (e.g. agrep and pmatch) and for merging data files, but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) storing both original name formats in the merged data file; and d) retaining unique values even if they don't have a match.
For example, I have the following two data files:
File name: Employee Data
Full Name Date Started Orders
ANGELA MUIR 6/15/14 25
EILEEN COWIE 6/15/14 44
LAURA CUMMING 10/6/14 43
ELENA POPA 1/21/15 37
KAREN MACEWAN 3/15/99 39
File name: Assessment data
Candidate Leading Factor SI-D SI-I
Angie muir I -3 12
Caroline Burn S -5 -3
Eileen Mary Cowie S -5 5
Elena Pope C -4 7
Henry LeFeuvre C -5 -1
Jennifer Ford S -3 -2
Karen McEwan I -4 10
Laura Cumming S 0 6
Mandip Johal C -2 2
Mubarak Hussain D 6 -1
I want to merge them based on names (Full Name in df1 and Candidate in df2), ignoring middle names (e.g. "Eileen Cowie" = "Eileen Mary Cowie"), extra spaces ("Laura  Cumming" = "Laura Cumming"), misspellings (e.g. "Elena Popa" = "Elena Pope"), etc.
The ideal output would look like this:
Name Full Name Candidate Date Started Orders Leading Factor SI-D SI-I
ANGELA MUIR ANGELA MUIR Angie muir 6/15/14 25 I -3 12
Caroline Burn N/A Caroline Burn N/A N/A S -5 -3
EILEEN COWIE EILEEN COWIE Eileen Mary Cowie 6/15/14 44 S -5 5
ELENA POPA ELENA POPA Elena Pope 1/21/15 37 C -4 7
Henry LeFeuvre N/A Henry LeFeuvre N/A N/A C -5 -1
Jennifer Ford N/A Jennifer Ford N/A N/A S -3 -2
KAREN MACEWAN KAREN MACEWAN Karen McEwan 3/15/99 39 I -4 10
LAURA CUMMING LAURA CUMMING Laura Cumming 10/6/14 43 S 0 6
Mandip Johal N/A Mandip Johal N/A N/A C -2 2
Mubarak Hussain N/A Mubarak Hussain N/A N/A D 6 -1
Any suggestions would be greatly appreciated!
Here's a process that may help. You will have to inspect the results and make adjustments as needed.
df1
# v1 v2
#1 ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith 5/3/14
df2
# u1 u2
#1 Eileen Mary Cowie I-3
#2 Angie muir -5 5
index <- sapply(df1$v1, function(x) {
  agrep(x, df2$u1, ignore.case = TRUE, max.distance = .5)
})
index <- unlist(index)

# overwrite the matched names in df2 with their df1 spelling, then merge exactly
df2$u1[index] <- names(index)
merge(df1, df2, by.x = 'v1', by.y = 'u1')
# v1 v2 u2
#1 ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14 I-3
I had to tune the max.distance argument for this sample data; it may not be right for your data, so adjust and test. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.
Data
v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors=F)
df2 <- data.frame(u1, u2, stringsAsFactors = F)
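The merge() above drops the df2 rows without a match, while points c) and d) of the question ask to keep them. A hedged sketch using the fuzzyjoin package (not part of the original answer); the max_dist = 0.25 cutoff is a guess that would need tuning against real data:
library(fuzzyjoin)

# a full join keeps unmatched rows from both sides (NA in the other side's columns),
# and both original name columns, v1 and u1, survive in the output
stringdist_full_join(df1, df2, by = c('v1' = 'u1'),
                     method = 'jw', max_dist = 0.25,
                     ignore_case = TRUE)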

Remove observations from DF if duplicate in specific columns while other columns must differ

I have a large data frame with multiple columns and many rows (200k). I order the rows by a group variable, and each group can have one or more entries. The other columns for each group should have identical values, however in some cases they don't. It looks like this:
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
I want to delete all entries of a group if age or city is not identical for all rows of the group (indication of observation error). Otherwise, I want to keep all the entries.
The output I'm hoping for would be:
group name age color city
2 Martin 78 black LA
2 Martin 78 blue LA
4 Jake 33 blue NJ
The closest I have come is this:
dup <- df[duplicated(df[, c("group", "name", "color")]) |
            duplicated(df[, c("group", "name", "color")], fromLast = TRUE), "group"]
df_nodup <- df[!(df$group %in% dup), ]
However, this is far from doing everything that I need.
P.S.: I had the same question answered for py/pandas; I'd like to have a solution for R as well.
Edit: While Frank's answer was helpful for understanding the principle of a solution, and his second suggestion worked, it was very slow (~15 min on my data frame).
user20650's answer was harder to comprehend, but runs tremendously faster (~10 sec).
Taking a similar approach to Frank's, you can use ave to count the unique combinations of age and city within each group, and then keep only the rows of groups with exactly one such combination
# your data
df <- read.table(text="group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ ", header=T)
# calculate and subset: keep groups whose rows all share one age/city combination
df[with(df, ave(paste(age, city), group, FUN = function(x) length(unique(x)))) == 1, ]
# group name age color city
# 4 2 Martin 78 black LA
# 5 2 Martin 78 blue LA
# 8 4 Jake 33 blue NJ
Here is an approach:
# mark each name/age/city combination that occurs
temp <- tapply(df$group, list(df$name, df$age, df$city), unique)
temp[!is.na(temp)] <- 1
# keep the names that occur with exactly one age/city combination
keepers <- names(which(apply(temp, 1, sum, na.rm = TRUE) == 1))
df[df$name %in% keepers, ]
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
An alternate, slightly simpler approach:
# one row per distinct name/age/city; names appearing once are consistent
temp2 <- unique(df[, c('name', 'age', 'city')])
keepers2 <- names(which(tapply(temp2$name, temp2$name, length) == 1))
df[df$name %in% keepers2, ]
# group name age color city
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
Here's an approach using dplyr:
df <- read.table(text = "
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
", header = TRUE)
library(dplyr)

df %>%
  group_by(group) %>%
  filter(n_distinct(age) == 1 & n_distinct(city) == 1)
I think it's pretty easy to see what's going on: you group, then filter to keep only the groups with exactly one distinct age and one distinct city.
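Given the edit about running time, a data.table translation of the same filter may scale better to 200k rows; a sketch using uniqueN() (not from the original answers):
library(data.table)

setDT(df)
# keep a group's rows only when age and city are each constant within the group
df[, if (uniqueN(age) == 1L && uniqueN(city) == 1L) .SD, by = group]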

Rank values based on individual users

I have a data set that looks like the following:
User Area
Sarah 123.4
Sarah 20.5
Sarah 43
Sam 86
Sam 101
Sam 32.6
Justin 45
Justin 125.8
Justin 39
Justin 88.4
Zac 21
Zac 4
Zac 111
I want to sort areas from greatest to smallest, but with a separate ranking for each individual user.
I have tried: test$Ranking1 <- order(test$User, test$Area, decreasing = FALSE), but this ranks them all together.
I then tried: test$Ranking1 <- ave(test$User, test$Area, FUN = rank), and while others seem to have said this will work, my output gives the middle (average) value a ranking of 1 and counts up by closeness to the average. I want 1 to be the largest area, not the average.
Any suggestions?
This can be done very easily with data.table:
library(data.table)                             # load package
setDT(dat)                                      # convert to data.table
dat[, max(Area), by = User]                     # max Area per user
dat[, sort(Area), by = User]                    # sort increasing
dat[, sort(Area, decreasing = TRUE), by = User] # sort decreasing
Hope this helps! Read the package's documentation; it's very helpful.
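The question asks for a per-user ranking rather than a per-user sort, so as a sketch (not part of the original answer), data.table's fast rank frank() can add that column, with 1 marking the largest Area:
library(data.table)

setDT(dat)
dat[, Ranking1 := frank(-Area), by = User]   # negate so rank 1 = largest Area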
I am assuming that you want to rank area within each individual, and also want to know the largest area for each individual:
## make up data
set.seed(1)
user <- rep(LETTERS[sample(26, 5)], each=sample(5, 1))
area <- rnorm(length(user), 100, 20)
d <- data.frame(user, area)
library(dplyr)

d %>%
  group_by(user) %>%
  mutate(ranking = rank(-area), top_area = max(area)) %>%
  ungroup()
user area ranking top_area
1 G 131.90562 1 131.9056
2 G 106.59016 4 131.9056
3 G 83.59063 5 131.9056
4 G 109.74858 3 131.9056
5 G 114.76649 2 131.9056
6 J 111.51563 2 130.2356
7 J 93.89223 4 130.2356
8 J 130.23562 1 130.2356
9 J 107.79686 3 130.2356
10 J 87.57519 5 130.2356
...
I think this is your desired output? If you want the order reversed so that 1 marks the largest area, rank the negated values, i.e. rank(-Area) instead of rank(Area).
x = "User Area
Sarah 123.4
Sarah 20.5
Sarah 43
Sam 86
Sam 101
Sam 32.6
Justin 45
Justin 125.8
Justin 39
Justin 88.4
Zac 21
Zac 4
Zac 111"
df = read.table(text = x, header = TRUE)

rank.foo = function(x) {
  z = numeric()
  # ascending rank of Area within each user (rows are already grouped by user)
  for (i in as.character(unique(x$User))) {
    z = c(z, rank(subset(x, User == i)$Area))
  }
  return(z)
}

cbind(df, rank.foo(df))
User Area rank.foo(df)
Sarah 123.4 3
Sarah 20.5 1
Sarah 43.0 2
Sam 86.0 2
Sam 101.0 3
Sam 32.6 1
Justin 45.0 2
Justin 125.8 4
Justin 39.0 1
Justin 88.4 3
Zac 21.0 2
Zac 4.0 1
Zac 111.0 3
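For completeness, the ave() attempt from the question also works in base R once the arguments are in the right order: the values to transform come first and the grouping variable second. A minimal sketch, negating Area so that rank 1 is the largest:
# per-user ranking, 1 = largest Area
test$Ranking1 <- ave(test$Area, test$User, FUN = function(x) rank(-x))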
