R function for recognizing name variants?

I'm working with a big dataset of names and need to be able to group by the individual. The dataset may contain names that look different but refer to the same person, such as "John Doe" and "John A. Doe", or "Michael Smith" and "Mike Smith". Is there a way for R to find instances like these and recognize them as the same person?
df <- data.frame(
  name = c("John Doe", "John A. Doe", "Jane Smith", "Jane Anderson", "Jane Anderson Lowell",
           "Jane B. Smith", "John Doe", "Jane Smith", "Michael Smith",
           "Mike Smith", "A.K. Ross", "Ana Kristina Ross"),
  rating = c(1, 2, 1, 1, 2, 3, 1, 4, 2, 1, 3, 2)
)
Here there are several repeated individuals, whether the variant is a middle initial, a shortened name, a lengthened name, or a changed last name. I've been trying to find a function that gives a similarity percentage between name pairs, so that I could manually review the high-percentage cases and decide whether they really are the same person. My end goal is to find the average rating per person, which requires grouping by the individual.
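For the "similarity percentage" idea specifically, base R's adist() (generalized edit distance) or the stringdist package can produce a rough score to shortlist pairs for manual review. A minimal sketch with adist(), where scaling by the longer name's length and the 0.7 threshold are both arbitrary choices:

# Rough character-level similarity between every pair of distinct names
nms <- unique(df$name)
d   <- adist(nms, nms, ignore.case = TRUE)              # edit distances
sim <- 1 - d / outer(nchar(nms), nchar(nms), pmax)      # scale to roughly 0..1
dimnames(sim) <- list(nms, nms)

# Candidate pairs worth reviewing manually (threshold chosen arbitrarily)
which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)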

The actual solution depends on your data and the variations that it can have. How can you identify one unique individual? Maybe extracting first and last name from each name would help?
library(dplyr)
library(stringr)

df %>%
  group_by(firstname = word(name, 1),
           lastname = word(name, -1)) %>%
  summarise(rating = mean(rating)) %>%
  ungroup()
#   firstname lastname rating
#   <chr>     <chr>     <dbl>
# 1 Jane      Anderson   1
# 2 Jane      Smith      1.33
# 3 John      Doe        2

We could also do this with extract from tidyr:
library(dplyr)
library(tidyr)

df %>%
  tidyr::extract(name, into = c('firstname', 'lastname'),
                 "^(\\w+).*\\s(\\w+)$") %>%
  group_by(firstname, lastname) %>%
  summarise(rating = mean(rating, na.rm = TRUE), .groups = 'drop')
# A tibble: 3 x 3
  firstname lastname rating
  <chr>     <chr>     <dbl>
1 Jane      Anderson   1
2 Jane      Smith      1.33
3 John      Doe        2
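Note that neither approach merges nickname variants such as "Mike Smith" / "Michael Smith". For a handful of known variants, a hand-built recode table is a common workaround; a sketch building on the word() approach above, where the contents of nickname_map are an assumption you would curate yourself:

library(dplyr)
library(stringr)

# Hand-curated nickname -> canonical first name lookup (illustrative only)
nickname_map <- c("Mike" = "Michael")

df %>%
  mutate(firstname = word(name, 1),
         firstname = coalesce(unname(nickname_map[firstname]), firstname),
         lastname  = word(name, -1)) %>%
  group_by(firstname, lastname) %>%
  summarise(rating = mean(rating), .groups = "drop")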

Related

Using unlist on a column of strings within a data frame

I have a data frame with a column that contains a string with multiple names separated by commas:
df = data.frame(my.text = c("John Smith, Johnny Smith, John Smith",
                            "John Doe, Doe, Johnny",
                            "Jane Doe, Jane Doe"))
df
my.text
1 John Smith, Johnny Smith, John Smith
2 John Doe, Doe, Johnny
3 Jane Doe, Jane Doe
I'd like to eliminate the duplicate names within each row (i.e. keep only the unique names) and store the result back in my.text so it looks like this:
df
my.text
1 John Smith, Johnny Smith
2 John Doe, Doe, Johnny
3 Jane Doe
This code achieves this for a single string/row:
df$my.text[1] = paste(unique(unlist(strsplit(df$my.text[1], split = ", "))), collapse = ", ")
But how do I apply this on the entire my.text column? I have tried mapply but cannot figure out how to send it so many functions all at once. Or perhaps there's a better way I'm overlooking?
strsplit is already vectorized, but to reduce each result to a single string again, we can use sapply and paste:
sapply(strsplit(df$my.text, ",\\s*"), function(z) paste(unique(z), collapse = ", "))
# [1] "John Smith, Johnny Smith" "John Doe, Doe, Johnny" "Jane Doe"

In R, is there a way to identify similar string values in two columns in a dataframe?

I have a large dataframe with 70,000 observations, where column A and column B hold pairs of nurses and physicians who worked the same shift. Unfortunately, in some observations here and there (I can't quite gauge how many, but it's a minority) column A and column B are actually the same person whose name is spelled slightly differently, because a middle name or a nickname appears in one column but not in the other. I want to create a dataframe that ONLY has those rows. Is there a way to use %like%, which(), or something similar to identify all of these rows?
Here is an example of what I have:
A              B
Jimmy Fallon   Harry Potter
Jimmy Fallon   James Fallon
Harry Potter   John Oliver
Harry Potter   Harold Potter
What I want:
A              B
Jimmy Fallon   James Fallon
Harry Potter   Harold Potter
One possible option is to use adist then filter to the rows that have a low distance. This method kind of assumes that there is a common element in each column (e.g., the last name).
library(tidyverse)

df %>%
  rowwise() %>%
  filter(adist(x = A, y = B, ignore.case = TRUE) <= 3)
Output
  A            B
  <chr>        <chr>
1 Jimmy Fallon James Fallon
2 Harry Potter Harold Potter
Or with base R:
df[subset(t(t(mapply(adist, df$A, df$B))) <= 3),]
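If the double transpose looks opaque, the same base R filter can be written more directly (a sketch: the same adist distances, plain logical row indexing):

# mapply() returns one distance per row; keep rows whose distance is small
df[mapply(adist, df$A, df$B) <= 3, ]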
Data
df <- structure(list(A = c("Jimmy Fallon", "Jimmy Fallon", "Harry Potter",
"Harry Potter"), B = c("Harry Potter", "James Fallon", "John Oliver",
"Harold Potter")), class = "data.frame", row.names = c(NA, -4L
))
Determining Cutoff
You might need to change the cutoff value depending on your data. One way to pick it is to compute the distances first and see where the genuine near-matches separate from the rest when names are slightly misspelled.
df2 <- data.frame(A = c("Jimmy Fallon", "Jimmy Fallon", "Harry Potter", "Hary Poter"),
                  B = c("Harry Potter", "James Fallo", "John Oliver", "Harold Potter"))

df2 %>%
  rowwise() %>%
  mutate(dist = adist(x = A, y = B, ignore.case = TRUE)) %>%
  as.data.frame() %>%
  arrange(dist)
             A             B dist
1 Jimmy Fallon   James Fallo    4
2   Hary Poter Harold Potter    4
3 Harry Potter   John Oliver    9
4 Jimmy Fallon  Harry Potter   10
So now we know that a cutoff of 4 (i.e. <= 4) would work better for filtering this data.
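For example, applying that cutoff to df2 keeps just the two near-matches (a sketch; in practice pick whatever threshold separates your own distance distribution):

df2 %>%
  rowwise() %>%
  filter(adist(x = A, y = B, ignore.case = TRUE) <= 4) %>%
  ungroup()
# keeps "Jimmy Fallon"/"James Fallo" and "Hary Poter"/"Harold Potter"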

grepl for first column into last column: is this the most efficient

I have a list of names from different sources in one data set: one set is organized by FirstName LastName; the other has FullName. I want to see if the first name or the last name is within the full name column, and create a flag. Two questions:
First, I used this solution, but the resulting data doesn't have the right number of rows, and I'm not sure how to turn it into a flag. I tried wrapping it in an ifelse statement, but got another error. How do I fix this so that if FirstName is in FullName, I flag TRUE (or 1), and otherwise FALSE (or 0)?
Second, I have a few million names; is this an efficient way to do things?
FirstName = c("mary", "paul", "mother", "john", "red", "little", "king")
LastName = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur")
FullName = c("mary berry", "anthony horrowitz", "jennifer lawrence", "john jones", "red rover", "mick jagger", "king arthur")
df = data.frame(FirstName, LastName, FullName)
#attempt 1 and error
df$match_firstname <- df[mapply(grepl, df$FirstName, df$FullName), ]
Error in `$<-.data.frame`(`*tmp*`, match_firstname, value = list(FirstName = c("mary", :
replacement has 4 rows, data has 7
#attempt 2 and error
df$match_firstname <- ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0)
Error in ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0) :
'list' object cannot be coerced to type 'logical'
Instead we could use str_detect, which is vectorized over both pattern and string, whereas the Map/mapply code loops over each row and thus could be less efficient.
library(dplyr)
library(stringr)

df %>%
  filter(str_detect(FullName, FirstName))
-output
  FirstName LastName    FullName
1      mary    berry  mary berry
2      john    jones  john jones
3       red    rover   red rover
4      king   arthur king arthur
If we want to add a new binary column, instead of filtering, convert the logical to binary with as.integer or +
df <- df %>%
  mutate(match_firstname = +(str_detect(FullName, FirstName)))
-output
  FirstName  LastName          FullName match_firstname
1      mary     berry        mary berry               1
2      paul hollywood anthony horrowitz               0
3    mother   theresa jennifer lawrence               0
4      john     jones        john jones               1
5       red     rover         red rover               1
6    little       tim       mick jagger               0
7      king    arthur       king arthur               1
The error in the OP's code occurs because we are assigning a subset of the data as a new column in the original dataset, which obviously results in a length mismatch:
df[mapply(grepl, df$FirstName, df$FullName), ]
  FirstName LastName    FullName
1      mary    berry  mary berry
4      john    jones  john jones
5       red    rover   red rover
7      king   arthur king arthur
Similar to the previous solution, use +
df$match_firstname <- +(mapply(grepl, df$FirstName, df$FullName))
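Since the question mentions a few million names: if FirstName never needs regex features, literal matching is usually faster still and avoids surprises should a name ever contain a metacharacter such as "." — a sketch using stringr::fixed() (grepl(..., fixed = TRUE) is the base R analogue):

library(dplyr)
library(stringr)

df <- df %>%
  mutate(match_firstname = +str_detect(FullName, fixed(FirstName)))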

Summing Rows Next to a Name in R

I'm working on a banking project where I'm trying to find a yearly sum of money spent, while the dataset has these listed as monthly transactions.
Month Name       Money Spent
    2 John Smith          10
    3 John Smith          25
    4 John Smith          20
    2 Joe Nais            10
    3 Joe Nais            25
    4 Joe Nais            20
Right now, this is the code I have:
OTData <- OTData %>%
  mutate(
    OTData,
    Full Year = [CODE NEEDED TO SUM UP]
  )
Thanks!
As @Pawel said, there's no actual question here. I assume you want:
library(dplyr)

df <- data.frame(Month = c(2, 3, 4, 2, 3, 4),
                 Name = c("John Smith", "John Smith", "John Smith",
                          "Joe Nais", "Joe Nais", "Joe Nais"),
                 Money_Spent = c(10, 25, 20, 10, 25, 20))

df %>%
  group_by(Name) %>%
  summarize(Full_year = sum(Money_Spent))
  Name       Full_year
  <fct>          <dbl>
1 Joe Nais          55
2 John Smith        55
NOTE: You're going to run into trouble if you include spaces in your variable names. You really should replace them with ., _, or camelCase as in the above example.
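If the real column really is named Money Spent (with the space), backticks let you refer to it just long enough to rename it; a hypothetical sketch assuming the question's OTData columns:

library(dplyr)

# Assumes OTData has columns Name and `Money Spent` as shown in the question
OTData %>%
  rename(Money_Spent = `Money Spent`) %>%
  group_by(Name) %>%
  summarise(Full_Year = sum(Money_Spent), .groups = "drop")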

Merge data frames with partial id

Say I have these two data frames:
> df1 <- data.frame(name = c('John Doe',
                             'Jane F. Doe',
                             'Mark Smith Simpson',
                             'Sam Lee'))
> df1
                name
1           John Doe
2        Jane F. Doe
3 Mark Smith Simpson
4            Sam Lee
> df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6))
> df2
  family size
1    Doe    2
2  Smith    6
I want to merge both data frames in order to get this:
                name family size
1           John Doe    Doe    2
2        Jane F. Doe    Doe    2
3 Mark Smith Simpson  Smith    6
4            Sam Lee   <NA>   NA
But I can't wrap my head around a way to do this apart from the following very convoluted solution, which becomes very messy with my real data, where there are over 100 "family names":
> df3 <- within(df1, {
    family <- ifelse(test = grepl('Doe', name),
                     yes = 'Doe',
                     no = ifelse(test = grepl('Smith', name),
                                 yes = 'Smith',
                                 no = NA))
  })
> merge(df3, df2, all.x = TRUE)
  family               name size
1    Doe           John Doe    2
2    Doe        Jane F. Doe    2
3  Smith Mark Smith Simpson    6
4   <NA>            Sam Lee   NA
I've tried taking a look into pmatch as well as the solutions provided at R partial match in data frame, but still haven't found what I'm looking for.
Rather than attempting to use regular expressions and partial matches, you could split the names up into a lookup-table format, where each component of a person's name is kept in a row, and matched to their full name:
df1 <- data.frame(name = c('John Doe',
                           'Jane F. Doe',
                           'Mark Smith Simpson',
                           'Sam Lee'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6),
                  stringsAsFactors = FALSE)

library(tidyr)
library(dplyr)

str_df <- function(x) {
  ss <- strsplit(unlist(x), " ")
  data.frame(family = unlist(ss), stringsAsFactors = FALSE)
}

splitnames <- df1 %>%
  group_by(name) %>%
  do(str_df(.))
splitnames
                 name  family
1         Jane F. Doe    Jane
2         Jane F. Doe      F.
3         Jane F. Doe     Doe
4            John Doe    John
5            John Doe     Doe
6  Mark Smith Simpson    Mark
7  Mark Smith Simpson   Smith
8  Mark Smith Simpson Simpson
9             Sam Lee     Sam
10            Sam Lee     Lee
Now you can just merge or join this with df2 to get your answer:
left_join(df2, splitnames)
Joining by: "family"
  family size               name
1    Doe    2        Jane F. Doe
2    Doe    2           John Doe
3  Smith    6 Mark Smith Simpson
Potential problem: if one person's first name is the same as somebody else's last name, you'll get some incorrect matches!
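One way to reduce that risk, reusing the splitnames table from above and assuming the given name is always the first token, is to drop each name's first token before joining, so first names can no longer match a family name:

splitnames_nofirst <- splitnames %>%
  group_by(name) %>%
  slice(-1) %>%    # drop the first token (the given name) of each name
  ungroup()

left_join(df2, splitnames_nofirst, by = "family")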
Here is one strategy: you could use lapply with grepl to match over all the family names. This will find them at any position. First, let me define a helper function:
transindex <- function(start = 1) {
  function(x) {
    start <<- start + 1
    ifelse(x, start - 1, NA)
  }
}
and I will also be using the coalesce.R helper function to make things a bit simpler. Here is the code I'd run to match df2 up to df1:
idx <- do.call(coalesce,
               lapply(lapply(as.character(df2$family),
                             function(x) grepl(paste0("\\b", x, "\\b"),
                                               as.character(df1$name))),
                      transindex()))
Starting on the inside and working out, I loop over all the family names in df2 and grepl for those values (adding "\\b" to the pattern so I match entire words). grepl returns a logical vector (TRUE/FALSE). I then apply the helper function transindex() defined above to turn each of those vectors into either the index of the matching row in df2, or NA. Since it's possible that a row matches more than one family, I simply choose the first using the coalesce helper function.
Now that I can match up the rows in df1 to df2, I can bring them together with
cbind(df1, df2[idx, ])
#                   name family size
# 1             John Doe    Doe    2
# 1.1        Jane F. Doe    Doe    2
# 2   Mark Smith Simpson  Smith    6
# NA             Sam Lee   <NA>   NA
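If you don't have that coalesce.R helper handy, dplyr::coalesce() can stand in for it here, since it also returns the first non-NA value position-wise across the supplied vectors (a sketch; the rest of the code is unchanged):

library(dplyr)

idx <- do.call(dplyr::coalesce,
               lapply(lapply(as.character(df2$family),
                             function(x) grepl(paste0("\\b", x, "\\b"),
                                               as.character(df1$name))),
                      transindex()))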
Another approach that looks valid, at least with the sample data:
df1name <- as.character(df1$name)
df1name
# [1] "John Doe"           "Jane F. Doe"        "Mark Smith Simpson" "Sam Lee"

regmatches(df1name, regexpr(paste(df2$family, collapse = "|"), df1name), invert = TRUE) <- ""
df1name
# [1] "Doe"   "Doe"   "Smith" ""

cbind(df1, df2[match(df1name, df2$family), ])
#                   name family size
# 1             John Doe    Doe    2
# 1.1        Jane F. Doe    Doe    2
# 2   Mark Smith Simpson  Smith    6
# NA             Sam Lee   <NA>   NA
