R function for finding similar names? - r

I'm working with a big dataset of names and need to be able to group by the individual. It's possible that in the dataset there are names that appear different but are the same person, such as John Doe or John A. Doe, or Michael Smith and Mike Smith. Is there a way for R to find instances like these and recognize them as the same person?
df <- data.frame(
name = c("John Doe", "John A. Doe", "Jane Smith", "Jane Anderson", "Jane Anderson Lowell",
"Jane B. Smith", "John Doe", "Jane Smith", "Michael Smith",
"Mike Smith", "A.K. Ross", "Ana Kristina Ross"),
rating = c(1,2,1,1,2,3,1,4,2,1,3,2)
)
Here, there are multiple repeated individuals, whether the variant be a middle initial, a shortened name, a lengthened name, or someone whose last name changed. I've been trying to find a function that could give a similarity percentage of characters in name matches, and from there I could manually examine cases of high percentage to evaluate if they are indeed the same person. My end goal is to find the average rating by person, where I would need to sort by the individual.

There are many algorithms that measure string distance. Here is a simple approach for this example dataset using stringdist package. As suggested by the documentation of stringdist() function, Jaro-Winkler distance is used to find the string distance between a name pair. Note that I only paired the names with the same first two letters. Through eye-balling, a string distance of 0.15 seems to be a reasonable threshold to define a match.
library(tidyverse)
library(stringdist)
get_string_distance <- function(x) {
if (length(x) == 1) {
data.frame(name1 = x, name2 = x, string_distance = NA_real_)
} else {
x %>%
unique() %>%
combn(2) %>%
t() %>%
as.data.frame() %>%
setNames(c("name1", "name2")) %>%
mutate(string_distance = stringdist(name1, name2, method = "jw"))
}
}
dat <- df %>%
mutate(two_letters = str_sub(name, 1, 2)) %>%
nest_by(two_letters) %>%
mutate(same_name = list(get_string_distance(data$name))) %>%
ungroup()
dat1 <- dat %>%
unnest(same_name) %>%
filter(string_distance < 0.15) %>%
select(name1, name2, string_distance)
dat1
# # A tibble: 4 x 3
# name1 name2 string_distance
# <chr> <chr> <dbl>
# 1 Jane Smith Jane B. Smith 0.0769
# 2 Jane Anderson Jane Anderson Lowell 0.117
# 3 John Doe John A. Doe 0.0909
# 4 Michael Smith Mike Smith 0.136

Related

R: Add new column in dataframe based on values in another dataframe column with repetitive values

My Dataframe1 looks like this, with Pattern and Name as my column headings:
Pattern Name
Floral Rose
Vector Jess
Medieval Monica
Victorian Marta
Floral Jane
Vector Monica
Vector Elise
Medieval Jess
Floral Monica
Then I have dataframe2, with Name and Rank as headings:
Name Rank
Jess Twenty
Elise One
Jane Two
Rose Ten
Marta Three
Monica Five
I would like to add Dataframe1 to dataframe2, and have one row per name for Rank and Pattern to be something like this (where there is extra information for Pattern, they both merge into a single row and be separated by comma:
Name Rank Pattern
Jess Twenty Vector, Medieval
Elise One Vector
Jane Two Floral
Rose Ten Floral
Marta Three Victorian
Monica Five Medieval, Floral, Vector
I have used:
dataframe2$Pattern <- Dataframe1$Pattern [match(dataframe2$Name, Dataframe1$Name)]
but it only captures one Pattern only. Is there any straightforward way to do it?
Cheers!
Data:
df1 <- tribble(
~Pattern, ~Name,
"Floral", "Rose",
"Vector", "Jess",
"Medieval" , "Monica",
"Victorian", "Marta",
"Floral", "Jane",
"Vector", "Monica",
"Vector", "Elise",
"Medieval", "Jess",
"Floral", "Monica"
)
df2 <- tribble(
~Name, ~Rank,
"Jess", "Twenty",
"Elise", "One",
"Jane", "Two",
"Rose", "Ten",
"Marta", "Three",
"Monica", "Five"
)
In first data.frame you need to find all Pattern for different Name and paste them together. You can achieve that by using group_by + summarise + str_c functions. Afterwards you need to join two tables by Name:
library(tidyverse)
df1 %>%
group_by(Name) %>%
summarise(Pattern = str_c(Pattern, collapse = ", ")) %>%
inner_join(df2, by = "Name")
# A tibble: 6 x 3
Name Pattern Rank
<chr> <chr> <chr>
1 Elise Vector One
2 Jane Floral Two
3 Jess Vector, Medieval Twenty
4 Marta Victorian Three
5 Monica Medieval, Vector, Floral Five
6 Rose Floral Ten
You also use a left join:
left_join(dataframe2,dataframe1,by='Name') %>%
group_by(Name, Rank) %>%
summarise(Pattern = str_c(Pattern, collapse = ", "))

How to extract matches from stringr::str_detect in R into a list vector

I am trying to perform the following search on a database of text.
Here is the sample database, df
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
name = c("john doe", "carol jones", "jimmy smith",
"jenny ruiz", "joey jones", "tim brown"),
place = c("reno nevada", "poland maine", "warsaw poland",
"trenton new jersey", "brooklyn new york", "atlanta georgia")
)
I have a vector of strings which contains terms I am trying to find.
new_search <- c("poland", "jones")
I pass the vector to str_detect to find ANY of the strings in new_search in ANY of the columns in df and then return rows which match...
df %>%
filter_all(any_vars(str_detect(., paste(new_search, collapse = "|"))))
Question... how can I extract the results of str_detect into a new column?
For each row which is returned... I would like to generate a list of the terms which were successfully matched and put them in a list or character vector (matched_terms)...something like this...
id name place matched_terms
1 2 carol jones poland maine c("jones", "poland")
2 3 jimmy smith warsaw poland c("poland")
3 5 joey jones brooklyn new york c("jones")
This is my naive solution:
new_search <- c("poland", "jones") %>% paste(collapse = "|")
df %>%
mutate(new_var = str_extract_all(paste(name, place), new_search))
You can extract all the patterns in multiple columns using str_extract_all, combine them into one column with unite. unite combines the column into one string hence the empty values are turned into "character(0)" which we remove using str_remove_all and keep only those rows that have any matched term.
library(tidyverse)
pat <- str_c(new_search, collapse = "|")
df %>%
mutate(across(-id, ~str_extract_all(., pat), .names = '{col}_new')) %>%
unite(matched_terms, ends_with('new'), sep = ',') %>%
mutate(matched_terms = str_remove_all(matched_terms,
'character\\(0\\),?|,character\\(0\\)')) %>%
filter(matched_terms != '')
# id name place matched_terms
#1 2 carol jones poland maine jones,poland
#2 3 jimmy smith warsaw poland poland
#3 5 joey jones brooklyn new york jones

Matching strings to values in a different data frame

Consider this data frame, containing multiple entries for a person named Steve/Stephan Jones and a person named Steve/Steven Smith (as well as Jane Jones and Matt/Matthew Smith)
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
What I'd like is to match values of First to the appropriate value of Name in this data frame.
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
N1 = c(rep("Steve", 3), "Matt"))
To yield this target
target <- data.frame(First = c("Stephan", "Stephan", "Stephan", "Jane", "Steven", "Steven", "Matthew"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
The issue is that there are multiple values of Name corresponding to a N1 (or First) value of "Steve", so I need to check within each group based of df$Last to see which version of Steven/Stephan/Stephen is correct.
Using something like this
library(dplyr)
library(stringr)
df %>%
group_by(Last) %>%
mutate(First = First[which.max(str_length(First))])
won't work because the value of "Jane" in row 4 will be converted to "Stephan"
I'm not sure, if this solves your problem and is consistent to your desired output:
library(dplyr)
df %>%
mutate(id = row_number()) %>%
left_join(nicknames, by=c("First" = "N1")) %>%
mutate(real_name = coalesce(Name, First)) %>%
group_by(Last, real_name) %>%
mutate(id = n()) %>%
group_by(Last, First) %>%
filter(id==max(id)) %>%
select(-Name, -id)
returns
# A tibble: 7 x 3
# Groups: Last, First [6]
First Last real_name
<chr> <chr> <chr>
1 Steve Jones Stephan
2 Stephan Jones Stephan
3 Steve Jones Stephan
4 Jane Jones Jane
5 Steve Smith Steven
6 Steven Smith Steven
7 Matt Smith Matthew

Pivot_longer: Rotating multiple columns of data with same data types

I'm trying to rotate multiple columns of data into single, data-type consistent columns.
I've created a minimum example below.
library(tibble)
library(dplyr)
# I have data like this
df <- tibble(contact_1_prefix=c('Mr.','Mrs.','Dr.'),
contact_2_prefix=c('Dr.','Mr.','Mrs.'),
contact_1 = c('Bob Johnson','Robert Johnson','Bobby Johnson'),
contact_2 = c('Tommy Two Tones','Tommy Three Tones','Tommy No Tones'),
contact_1_loc = c('Earth','New York','Los Angeles'),
contact_2_loc = c('London','Geneva','Paris'))
# My attempt at a solution:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)") %>%
pivot_wider(names_from='dat',values_from='contact')
#What I want is to widen that data to achieve a tibble with these two example lines
df_desired <- tibble(name=c('Bob Johnson','Tommy Two Tones'),
loc =c('Earth','London'),
prefix=c('Mr.','Dr.'))
I want all names under name, all locations under loc, and all prefixes under prefix.
If I use just this snippet from the middle statement:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)")
The dput of the output is:
structure(list(dat = c("prefix", "prefix", "name", "name", "loc",
"loc", "prefix", "prefix", "name", "name", "loc", "loc", "prefix",
"prefix", "name", "name", "loc", "loc"), contact = c("Mr.", "Dr.",
"Bob Johnson", "Tommy Two Tones", "Earth", "London", "Mrs.",
"Mr.", "Robert Johnson", "Tommy Three Tones", "New York", "Geneva",
"Dr.", "Mrs.", "Bobby Johnson", "Tommy No Tones", "Los Angeles",
"Paris")), row.names = c(NA, -18L), class = c("tbl_df", "tbl",
"data.frame"))
From that, I thought for sure pivot_wider was the solution, but there is a name conflict.
I assume a single pivot_longer statement will achieve the task. I studied Gathering wide columns into multiple long columns using pivot_longer carefully but can't quite figure this out. I have to admit I don't quite understand what the names_to = c(".value", "group") phrase does.
In any event, any help is appreciated.
Thanks
You were on the right path. Renaming is needed since only the name columns do not have any suffix to identify them. .value identifies part of the original column name that you want to uniquely identify as new columns. If you remove everything until the last underscore the part that remains are the new column names which you can specify using regex in names_pattern.
library(dplyr)
library(tidyr)
df %>%
rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols = everything(),
names_to = '.value',
names_pattern = '.*_(\\w+)')
# prefix name loc
# <chr> <chr> <chr>
#1 Mr. Bob Johnson Earth
#2 Dr. Tommy Two Tones London
#3 Mrs. Robert Johnson New York
#4 Mr. Tommy Three Tones Geneva
#5 Dr. Bobby Johnson Los Angeles
#6 Mrs. Tommy No Tones Paris
Here is a solution using split.default
data.table::rbindlist(
lapply( split.default( df, gsub( "[^0-9]+", "", names(df) ) ),
data.table::setnames,
new = c("prefix", "name", " loc" ) ) )
# prefix name loc
# 1: Mr. Bob Johnson Earth
# 2: Mrs. Robert Johnson New York
# 3: Dr. Bobby Johnson Los Angeles
# 4: Dr. Tommy Two Tones London
# 5: Mr. Tommy Three Tones Geneva
# 6: Mrs. Tommy No Tones Paris

Capitalize with dplyr

I am doing data cleaning with dplyr.
One of the things I want to do is to capitalize values in certain columns.
data$surname
john
Mary
John
mary
...
I suppose I have to use the mutate function of dplyr with something like this
titleCase <- function(x) {
+ s <- strsplit(as.character(x), " ")[[1]]
+ paste(toupper(substring(s, 1, 1)), substring(s, 2),
+ sep = "", collapse = " ")
+ }
But how to combine both? I get all kinds of errors or truncated data frames
Thanks
We can use sub
sub("(.)", "\\U\\1", data$surname, perl=TRUE)
#[1] "John" "Mary" "John" "Mary"
Implementing in the dplyr workflow
library(dplyr)
data %>%
mutate(surname = sub("(.)", "\\U\\1", surname, perl=TRUE))
If we need to do this on multiple columns
data %>%
mutate_each(funs(sub("(.)", "\\U\\1", ., perl=TRUE)))
Just to check
res <- data1 %>%
mutate(surname = sub("(.)", "\\U\\1", surname, perl=TRUE))
sum(grepl("[A-Z]", substr(res$surname, 1,1)))
#[1] 500000
data
data <- data.frame(surname=c("john", "Mary", "John", "mary"),
firstname = c("abe", "Jacob", "george", "jen"), stringsAsFactors=FALSE)
data1 <- data.frame(surname = sample(c("john", "Mary", "John", "mary"),
500000, replace=TRUE), stringsAsFactors=FALSE)
A little late to the party but you can use stringr package
library(stringr)
library(dplyr)
example1 <- tibble(names = c("john" ,"Mary", "John", "mary"))
example1 %>%
mutate(names = str_to_title(names))
## names
## <chr>
## 1 John
## 2 Mary
## 3 John
## 4 Mary
This will still work if you want all terms capitalized
example2 <- tibble(names = c("john james" ,"Mary carey", "John Jack", "mary Harry"))
example2 %>%
mutate(names = str_to_title(names))
## names
## <chr>
## 1 John James
## 2 Mary Carey
## 3 John Jack
## 4 Mary Harry
If you only want the first term capitalized, str_to_sentence() will work
example2 %>%
mutate(names = str_to_sentence(names))
## names
## <chr>
## 1 John james
## 2 Mary carey
## 3 John jack
## 4 Mary harry
There is a dedicated function for this that you can try:
R.utils::capitalize(data$surname)
If this needs to be implemented into a dplyr procedure, one could try the following:
library(dplyr)
library(R.utils)
data %>% mutate(surname = capitalize(surname))

Resources