String update for each group in data frame - r

I have a large data frame df like this:
firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom")
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost")
id = c(178, 649, 384, 479, 539, 261, 347)
df = data.frame(firstname, lastname, id)
It looks like this in the data frame view:
firstname lastname id
John L Doe 178
Robert C Doe 649
John Doe 384
J L Doe 479
Tom F Frost 539
T F Frost 261
Tom Frost 347
As you can see, the firstname column in the data frame is inconsistent; sometimes it is just an initial, for example. I would like to have a consistent firstname, so that the output data frame looks like this:
firstname lastname id
John L Doe 178
Robert C Doe 649
John L Doe 384
John L Doe 479
Tom F Frost 539
Tom F Frost 261
Tom F Frost 347
I have tried a few approaches, such as grouping by lastname, taking the longest string for each group, and then updating firstname in an if/else if statement by matching it against the other firstname values in the group, using something like the following:
> sapply(strsplit("John L Doe"," "), function(a) paste(a[1],a[3]))
[1] "John Doe"
> sapply(strsplit("John L Doe"," "), function(a) paste(substr(a[1],1,1),a[2],a[3]))
[1] "J L Doe"
It did not work, as I realized that simply taking the longest string in the group is not a good approach.
Mapping from the initials of the firstname to its full form is always going to be correct. For example, there will be a "John L Doe", but he will have three variants of his firstname: "John L", "John", and "J L". This is because these are lists of authors on a very narrow subject; there are just inconsistencies in the formatting of the names which I would like to fix. Having one consistent name will help me do further analysis on a wider scale.
How can I do this in R?

The following solution produces your expected result, but bear in mind that if both Jack L Doe and John L Doe exist, J L Doe will map to whichever of the longest matching names comes first.
firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom", "Jack L", "Robert Can","R C", "R C")
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost", "Doe","Frost","Doe", "Frost")
id = c(178, 649, 384, 479, 539, 261, 347,100,200,300,400)
df = data.frame(firstname, lastname, id,stringsAsFactors = FALSE)
df$Initials <- sapply(strsplit(as.vector(firstname), " "),
                      function(x) paste(substr(x, 1, 1), collapse = ""))
df$LongName <- apply(df, 1, function(x) {
  if (sub("\\s", "", x[["firstname"]]) == x[["Initials"]]) {
    choices <- df$firstname[grepl(x[["Initials"]], df$Initials) & df$lastname == x[["lastname"]]]
  } else {
    choices <- df$firstname[grepl(x[["Initials"]], df$Initials) & grepl(x[["firstname"]], df$firstname) & df$lastname == x[["lastname"]]]
  }
  choices[which.max(nchar(choices))]
})
Result
> df
firstname lastname id Initials LongName
1 John L Doe 178 JL John L
2 Robert C Doe 649 RC Robert C
3 John Doe 384 J John L
4 J L Doe 479 JL John L
5 Tom F Frost 539 TF Tom F
6 T F Frost 261 TF Tom F
7 Tom Frost 347 T Tom F
8 Jack L Doe 100 JL Jack L
9 Robert Can Frost 200 RC Robert Can
10 R C Doe 300 RC Robert C
11 R C Frost 400 RC Robert Can
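If you only need the cleaned firstname/lastname/id columns, the helper columns can be dropped afterwards; a small follow-up sketch (assumes the df built above):
# keep the normalized first name and drop the helper columns
df$firstname <- df$LongName
df$Initials <- df$LongName <- NULL
df[, c("firstname", "lastname", "id")]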

Your use case is not entirely clear.
As mentioned, there are issues if you have people with the same last name and the same initials but different first names. If you are convinced that this will never be the case in your data, then the solution may be quite simple.
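For completeness, a minimal sketch of such a simple solution (my own illustration, assuming the first initial is unique within each last name and that dplyr is acceptable):
library(dplyr)
df %>%
  # key each row by last name plus first initial...
  mutate(firstname = as.character(firstname),   # in case firstname is a factor
         key = substr(firstname, 1, 1)) %>%
  group_by(lastname, key) %>%
  # ...and replace every first name with the longest variant sharing that key
  mutate(firstname = firstname[which.max(nchar(firstname))]) %>%
  ungroup() %>%
  select(-key)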
However, if what you're trying to do is to find out whether the names refer to the same people, you'll need a lot more, and that means diving into the subject of entity reconciliation.
There are some neat R packages for this (I've worked on a project involving entity reconciliation), including RecordLinkage, but the bottom line is: if you want reliable record linkage, you'll need at least a little more than first name and last name.

What you are trying to achieve is usually done with a dictionary matching every spelling variant to a preferred name. There are smart solutions out there based on text similarity and text mining. Unless you already have a dictionary linking c("JL", "J L", "J.L.", ...) to "John L", I would not do it in R.
Have a look at DataWrangler, Trifacta, Dataiku, or OpenRefine; they all have a free version that will do what you are looking for. I know that OpenRefine (formerly Google Refine) can be scripted.
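That said, if such a dictionary does exist, the R side reduces to a simple lookup; a minimal sketch with a hypothetical two-column dictionary (the variant/preferred columns are made up for illustration):
# hypothetical dictionary mapping each spelling variant to a preferred name
dict <- data.frame(variant   = c("John L", "John", "J L", "JL", "J.L."),
                   preferred = "John L",
                   stringsAsFactors = FALSE)
idx <- match(df$firstname, dict$variant)
# keep the original value wherever the dictionary has no entry
df$firstname <- ifelse(is.na(idx), as.character(df$firstname), dict$preferred[idx])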

Related

Using unlist on a column of strings within a data frame

I have a data frame with a column that contains a string with multiple names separated by commas:
df = data.frame(my.text = c("John Smith, Johnny Smith, John Smith", "John Doe, Doe, Johnny", "Jane Doe, Jane Doe"))
df
my.text
1 John Smith, Johnny Smith, John Smith
2 John Doe, Doe, Johnny
3 Jane Doe, Jane Doe
I'd like to eliminate the duplicate names within each row (i.e. get unique names) and store these back in my.text so it looks this way:
df
my.text
1 John Smith, Johnny Smith
2 John Doe, Doe, Johnny
3 Jane Doe
This code achieves this for a single string/row:
df$my.text[1] = paste(unique(unlist(strsplit(df$my.text[1], split = ", "))), collapse = ", ")
But how do I apply this to the entire my.text column? I have tried mapply but cannot figure out how to pass it so many functions at once. Or perhaps there's a better way I'm overlooking?
strsplit is already vectorized, but to reduce each result to a single string again, we can use sapply and paste:
sapply(strsplit(df$my.text, ",\\s*"), function(z) paste(unique(z), collapse = ", "))
# [1] "John Smith, Johnny Smith" "John Doe, Doe, Johnny" "Jane Doe"

grepl for first column into last column: is this the most efficient

I have a list of names from different sources in one data set: one set is organized by FirstName LastName; the other has FullName. I want to see if the first name or the last name is within the full name column, and create a flag. Two questions:
First, I used this solution, but the resulting data doesn't have the right number of rows, and I'm not sure how to get it to make a flag. I tried to turn it into an ifelse statement, but got another error. How do I fix this so that if FirstName is in FullName, I flag TRUE (or 1), otherwise FALSE (or 0)?
Second, I have a few million names; is this an efficient way to do things?
FirstName = c("mary", "paul", "mother", "john", "red", "little", "king")
LastName = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur")
FullName = c("mary berry", "anthony horrowitz", "jennifer lawrence", "john jones", "red rover", "mick jagger", "king arthur")
df = data.frame(FirstName, LastName, FullName)
#attempt 1 and error
df$match_firstname <- df[mapply(grepl, df$FirstName, df$FullName), ]
Error in `$<-.data.frame`(`*tmp*`, match_firstname, value = list(FirstName = c("mary", :
replacement has 4 rows, data has 7
#attempt 2 and error
df$match_firstname <- ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0)
Error in ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0) :
'list' object cannot be coerced to type 'logical'
Instead, we could use str_detect, which is vectorized over both pattern and string, whereas the Map/mapply code loops over each row and thus could be less efficient:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(FullName, FirstName))
-output
FirstName LastName FullName
1 mary berry mary berry
2 john jones john jones
3 red rover red rover
4 king arthur king arthur
If we want to add a new binary column, instead of filtering, convert the logical to binary with as.integer or +
df <- df %>%
mutate(match_firstname = +(str_detect(FullName, FirstName)))
-output
FirstName LastName FullName match_firstname
1 mary berry mary berry 1
2 paul hollywood anthony horrowitz 0
3 mother theresa jennifer lawrence 0
4 john jones john jones 1
5 red rover red rover 1
6 little tim mick jagger 0
7 king arthur king arthur 1
The error in the OP's code occurs because a subset of the data is being assigned to a new column of the original dataset, which obviously results in a length mismatch:
df[mapply(grepl, df$FirstName, df$FullName), ]
FirstName LastName FullName
1 mary berry mary berry
4 john jones john jones
5 red rover red rover
7 king arthur king arthur
Similar to the previous solution, use +
df$match_firstname <- +(mapply(grepl, df$FirstName, df$FullName))

Summing Rows Next to a Name in R

I'm working on a banking project where I'm trying to find a yearly sum of money spent, while the dataset has these listed as monthly transactions.
Month Name Money Spent
2 John Smith 10
3 John Smith 25
4 John Smith 20
2 Joe Nais 10
3 Joe Nais 25
4 Joe Nais 20
Right now, this is the code I have:
OTData <- OTData %>%
mutate(
OTData,
Full Year = [CODE NEEDED TO SUM UP]
)
Thanks!
As @Pawel said, there's no question here. I assume you want:
df <- data.frame(Month = c(2,3,4,2,3,4),
Name = c("John Smith", "John Smith", "John Smith",
"Joe Nais", "Joe Nais", "Joe Nais"),
Money_Spent = c(10,25,20,10,25,20))
df %>%
group_by(Name) %>%
summarize(Full_year = sum(Money_Spent))
Name Full_year
<fct> <dbl>
1 Joe Nais 55
2 John Smith 55
NOTE: You're going to run into trouble if you include spaces in your variable names. You really should replace them with ., _, or camelCase as in the above example.
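If the column name with the space really must be kept, backticks make it usable in dplyr code, although renaming is still the cleaner option; a quick sketch:
df %>%
  group_by(Name) %>%
  summarize(`Full Year` = sum(Money_Spent))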

String updating by group - performance improvement

I have a data frame df:
df <- structure(list(firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom", "Jan Paul W R", "Jan Paul", "J P W R", "J P"),
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost", "Wilson", "Wilson", "Wilson", "Wilson"),
initial = c("JL", "RC", "J", "JL", "TF", "TF", "T", "JPWR", "JP", "JPWR", "JP")), .Names =c("firstname","lastname", "initial"), row.names = c(NA, -11L), class ="data.frame")
I want to replace all shorter first names with the longest first name within each group that shares the same last name but has different initials and/or firstname values. So, my resulting data frame df would look like this:
firstname lastname initial LongName
1 John L Doe JL John L
2 Robert C Doe RC Robert C
3 John Doe J John L
4 J L Doe JL John L
5 Tom F Frost TF Tom F
6 T F Frost TF Tom F
7 Tom Frost T Tom F
8 Jan Paul W R Wilson JPWR Jan Paul W R
9 Jan Paul Wilson JP Jan Paul W R
10 J P W R Wilson JPWR Jan Paul W R
11 J P Wilson JP Jan Paul W R
At present, I am doing this using grepl and if else, as below:
df$LongName <- apply(df,1,function(x) {
if(gsub("[[:space:]]","",x[["firstname"]]) == x[["initial"]]){
Longname <- df$firstname[grepl(x[["initial"]], df$initial) & df$lastname == x[["lastname"]]]
}
else{
Longname <- df$firstname[grepl(x[["initial"]], df$initial) & grepl(x[["firstname"]], df$firstname) & df$lastname == x[["lastname"]]]
}
Longname[which.max(nchar(Longname))]
})
The code above works well, but it is slow for a large data frame since it works row by row through apply and if/else. I was wondering whether the running time could be optimized, so I am looking for an alternative approach to speed it up.
Here's an entertaining way using adist with an insertion cost of 0 to create a string distance matrix:
library(dplyr)
df <- structure(list(firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom", "Jan Paul W R", "Jan Paul", "J P W R", "J P"),
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost", "Wilson", "Wilson", "Wilson", "Wilson"),
initial = c("JL", "RC", "J", "JL", "TF", "TF", "T", "JPWR", "JP", "JPWR", "JP")), .Names =c("firstname","lastname", "initial"),
row.names = c(NA, -11L), class ="data.frame")
df %>%
group_by(lastname) %>%
mutate(fullname = {
# Boolean matrix of where string distance with an insertion cost of 0 is 0
d <- adist(initial, firstname, costs = c(i = 0)) == 0;
# set TRUE values to the number of characters of that string
d[d] <- nchar(firstname[col(d)][d]);
# return whichever firstname has the most characters
firstname[max.col(d)]
})
#> # A tibble: 11 x 4
#> # Groups: lastname [3]
#> firstname lastname initial fullname
#> <chr> <chr> <chr> <chr>
#> 1 John L Doe JL John L
#> 2 Robert C Doe RC Robert C
#> 3 John Doe J John L
#> 4 J L Doe JL John L
#> 5 Tom F Frost TF Tom F
#> 6 T F Frost TF Tom F
#> 7 Tom Frost T Tom F
#> 8 Jan Paul W R Wilson JPWR Jan Paul W R
#> 9 Jan Paul Wilson JP Jan Paul W R
#> 10 J P W R Wilson JPWR Jan Paul W R
#> 11 J P Wilson JP Jan Paul W R

Merge data frames with partial id

Say I have these two data frames:
> df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'))
> df1
name
1 John Doe
2 Jane F. Doe
3 Mark Smith Simpson
4 Sam Lee
> df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6))
> df2
family size
1 Doe 2
2 Smith 6
I want to merge both data frames in order to get this:
name family size
1 John Doe Doe 2
2 Jane F. Doe Doe 2
3 Mark Smith Simpson Smith 6
4 Sam Lee <NA> NA
But I can't wrap my head around a way to do this apart from the following very convoluted solution, which is becoming very messy with my real data, which has over 100 "family names":
> df3 <- within(df1, {
family <- ifelse(test = grepl('Doe', name),
yes = 'Doe',
no = ifelse(test = grepl('Smith', name),
yes = 'Smith',
no = NA))
})
> merge(df3, df2, all.x = TRUE)
family name size
1 Doe John Doe 2
2 Doe Jane F. Doe 2
3 Smith Mark Smith Simpson 6
4 <NA> Sam Lee NA
I've tried taking a look into pmatch as well as the solutions provided at R partial match in data frame, but still haven't found what I'm looking for.
Rather than attempting to use regular expressions and partial matches, you could split the names up into a lookup-table format, where each component of a person's name is kept in a row, and matched to their full name:
df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'),
stringsAsFactors = FALSE)
df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6),
stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
str_df <- function(x) {
ss <- strsplit(unlist(x)," ")
data.frame(family = unlist(ss),stringsAsFactors = FALSE)
}
splitnames <- df1 %>%
group_by(name) %>%
do(str_df(.))
splitnames
name family
1 Jane F. Doe Jane
2 Jane F. Doe F.
3 Jane F. Doe Doe
4 John Doe John
5 John Doe Doe
6 Mark Smith Simpson Mark
7 Mark Smith Simpson Smith
8 Mark Smith Simpson Simpson
9 Sam Lee Sam
10 Sam Lee Lee
Now you can just merge or join this with df2 to get your answer:
left_join(df2,splitnames)
Joining by: "family"
family size name
1 Doe 2 Jane F. Doe
2 Doe 2 John Doe
3 Smith 6 Mark Smith Simpson
Potential problem: if one person's first name is the same as somebody else's last name, you'll get some incorrect matches!
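One hedged way to reduce that risk within the same lookup-table idea is to drop the leading word (the given name) before building the table, assuming the given name always comes first; this is a tweak of the str_df helper above, not part of the original answer:
str_df <- function(x) {
  ss <- strsplit(unlist(x), " ")
  # drop the first word of each name so a given name can no longer be
  # mistaken for someone else's family name
  data.frame(family = unlist(lapply(ss, `[`, -1)), stringsAsFactors = FALSE)
}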
Here is one strategy: you could use lapply with a grepl match over all the family names. This will find them at any position. First let me define a helper function:
transindex<-function(start=1) {
function(x) {
start<<-start+1
ifelse(x, start-1, NA)
}
}
and I will also be using the function from coalesce.R to make things a bit simpler. Here is the code I'd run to match df2 up to df1:
idx<-do.call(coalesce, lapply(lapply(as.character(df2$family),
function(x) grepl(paste0("\\b", x, "\\b"), as.character(df1$name))),
transindex()))
Starting on the inside and working out, I loop over all the family names in df2 and grepl for those values (adding "\b" to the pattern so I match entire words). grepl will return a logical vector (TRUE/FALSE). I then apply the above helper function transindex() to change each of those vectors into either the index of the matching row in df2, or NA. Since it's possible that a row may match more than one family, I simply choose the first using the coalesce helper function.
Now that I can match up the rows in df1 to df2, I can bring them together with
cbind(df1, size=df2[idx,])
#   name family size
# 1 John Doe Doe 2
# 1.1 Jane F. Doe Doe 2
# 2 Mark Smith Simpson Smith 6
# NA Sam Lee <NA> NA
Another approach that looks valid, at least with the sample data:
df1name = as.character(df1$name)
df1name
#[1] "John Doe" "Jane F. Doe" "Mark Smith Simpson" "Sam Lee"
regmatches(df1name, regexpr(paste(df2$family, collapse = "|"), df1name), invert = T) <- ""
df1name
#[1] "Doe" "Doe" "Smith" ""
cbind(df1, df2[match(df1name, df2$family), ])
# name family size
#1 John Doe Doe 2
#1.1 Jane F. Doe Doe 2
#2 Mark Smith Simpson Smith 6
#NA Sam Lee <NA> NA
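A hedged hardening of the same idea, borrowing the word-boundary trick from the previous answer so that a family name cannot match inside a longer word (re-run from a fresh df1name):
df1name <- as.character(df1$name)
pattern <- paste0("\\b(", paste(df2$family, collapse = "|"), ")\\b")
regmatches(df1name, regexpr(pattern, df1name), invert = TRUE) <- ""
df1name
#[1] "Doe"   "Doe"   "Smith" ""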
