How do I match names in two dataframes and output the matched value? - r

Say I have two dataframes:
a <- c("smith", "lee", "black", "gonzalez", "rodriguez")
df1 <- as.data.frame(a)
df1
a
1 smith
2 lee
3 black
4 gonzalez
5 rodriguez
b <- c("harry smith", "john smith", "laura smith", "carol black", "peter h. black", "cora lee", "benjamin d. black", "gonzalez 12323902130", "rodriguez 0931029321")
df2 <- as.data.frame(b)
df2
b
1 harry smith
2 john smith
3 laura smith
4 carol black
5 peter h. black
6 cora lee
7 benjamin d. black
8 gonzalez 12323902130
9 rodriguez 0931029321
If "harry smith" matches with anything from df1$a, I want it to output "smith." Ideally, I'll have something like this:
b <- c("harry smith", "john smith", "laura smith", "carol black", "peter h. black", "cora lee", "benjamin d. black", "gonzalez 12323902130", "rodriguez 0931029321")
match <- c("smith", "smith", "smith", "black", "black", "lee", "black", "gonzalez", "rodriguez")
df <- data.frame(match, b)
df
match b
smith harry smith
smith john smith
smith laura smith
black carol black
black peter h. black
lee cora lee
black benjamin d. black
gonzalez gonzalez 12323902130
rodriguez rodriguez 0931029321
I tried something like this and got an error message:
df$match <- ifelse(df1$a %in% df2$b, df1$a, NA)
Error in `$<-.data.frame`(`*tmp*`, match, value = c(NA, NA, NA, NA, NA :
replacement has 5 rows, data has 9

An alternative using regex partial matching:
lk <- sapply(a, grepl, x = b)
cbind(b, apply(lk, 1, function(i) names(which(i))))
[1,] "harry smith" "smith"
[2,] "john smith" "smith"
[3,] "laura smith" "smith"
[4,] "carol black" "black"
[5,] "peter h. black" "black"
[6,] "cora lee" "lee"
[7,] "benjamin d. black" "black"
[8,] "gonzalez 12323902130" "gonzalez"
[9,] "rodriguez 0931029321" "rodriguez"
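The lookup above relies on every entry in b containing exactly one of the surnames. A minimal sketch of how rows with no match could be given NA, and rows with several matches reduced to the first one, is shown here. The shortened b vector and the first_hit name are made up for illustration, and the matching is still plain substring matching (so "lee" would also hit "leeson").

```r
a <- c("smith", "lee", "black", "gonzalez", "rodriguez")
b <- c("harry smith", "cora lee", "jane doe")  # "jane doe" has no match

# One column per surname, one row per entry of b
# (sapply only returns a matrix when b has more than one element)
lk <- sapply(a, grepl, x = b, fixed = TRUE)

# Keep the first matching surname per row, or NA when nothing matches
first_hit <- apply(lk, 1, function(i) {
  hits <- names(which(i))
  if (length(hits) == 0) NA_character_ else hits[1]
})

res <- data.frame(match = first_hit, b = b, stringsAsFactors = FALSE)
res
```

Taking the first hit is an arbitrary tie-break; whether that is appropriate depends on the data.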

df <- data.frame(match = sapply(strsplit(df2$b, " "), function(x) x[x %in% df1$a]),
b = df2$b)
df
match b
1 smith harry smith
2 smith john smith
3 smith laura smith
4 black carol black
5 black peter h. black
6 lee cora lee
7 black benjamin d. black
8 gonzalez gonzalez 12323902130
9 rodriguez rodriguez 0931029321
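The sapply call above simplifies to a character vector only while every row contains exactly one matching word; with zero or several matches it returns a list instead. A hedged variant using vapply, which always yields one string per row, might look like this (the small df1 and df2 here are illustrative, not the original data):

```r
df1 <- data.frame(a = c("smith", "lee", "black"), stringsAsFactors = FALSE)
df2 <- data.frame(b = c("harry smith", "cora lee", "jane doe"),
                  stringsAsFactors = FALSE)

# Split each entry into words, then keep the first word found in df1$a
tokens <- strsplit(df2$b, " ", fixed = TRUE)
df2$match <- vapply(tokens, function(x) {
  hit <- x[x %in% df1$a]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

df2
```

Because this compares whole words, "lee" does not match inside a longer word such as "leeson", unlike the grepl approach.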

Related

Is there a function that will allow to get the difference between rows of the same type? [duplicate]

I want to find the difference between consecutive values that belong to the same name.
Please refer to the sample data frame below:
df <- data.frame(
x = c("Jimmy Page","Jimmy Page","Jimmy Page","Jimmy Page", "John Smith", "John Smith", "John Smith", "Joe Root", "Joe Root", "Joe Root", "Joe Root", "Joe Root"),
y = c(1,2,3,4,5,7,89,12,34,67,95,9674 )
)
I would like the difference between consecutive values within each name, e.g. Jimmy Page = 1 and Jimmy Page = 2 gives a difference of 1.
Where the next row belongs to a different name, the result should be NA.
You can use diff in ave.
df$diff <- ave(df$y, df$x, FUN=function(z) c(diff(z), NA))
df
# x y diff
#1 Jimmy Page 1 1
#2 Jimmy Page 2 1
#3 Jimmy Page 3 1
#4 Jimmy Page 4 NA
#5 John Smith 5 2
#6 John Smith 7 82
#7 John Smith 89 NA
#8 Joe Root 12 22
#9 Joe Root 34 33
#10 Joe Root 67 28
#11 Joe Root 95 9579
#12 Joe Root 9674 NA
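If the NA should sit on the first row of each group rather than the last (the layout the dplyr answer in this thread produces), the same ave call can pad at the front instead. A small sketch on made-up data, where toy and its columns are illustrative names:

```r
toy <- data.frame(
  x = c("a", "a", "a", "b", "b", "c"),
  y = c(1, 3, 6, 10, 15, 7),
  stringsAsFactors = FALSE
)

# diff() returns one value fewer than its input, so pad with a leading NA
toy$diff <- ave(toy$y, toy$x, FUN = function(z) c(NA, diff(z)))
toy
```

A group with a single row, such as "c" here, simply gets NA, because diff() of one value is empty.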
library(tidyverse)
df <-
data.frame(
x = c(
"Jimmy Page",
"Jimmy Page",
"Jimmy Page",
"Jimmy Page",
"John Smith",
"John Smith",
"John Smith",
"Joe Root",
"Joe Root",
"Joe Root",
"Joe Root",
"Joe Root"
),
y = c(1, 2, 3, 4, 5, 7, 89, 12, 34, 67, 95, 9674)
)
df %>%
group_by(x) %>%
mutate(res = c(NA, diff(y))) %>%
ungroup()
#> # A tibble: 12 x 3
#> x y res
#> <chr> <dbl> <dbl>
#> 1 Jimmy Page 1 NA
#> 2 Jimmy Page 2 1
#> 3 Jimmy Page 3 1
#> 4 Jimmy Page 4 1
#> 5 John Smith 5 NA
#> 6 John Smith 7 2
#> 7 John Smith 89 82
#> 8 Joe Root 12 NA
#> 9 Joe Root 34 22
#> 10 Joe Root 67 33
#> 11 Joe Root 95 28
#> 12 Joe Root 9674 9579
Created on 2021-09-14 by the reprex package (v2.0.1)

Separate column into two: before and after a certain word

I have the following data set
> data
firm_name
1: Light Ltd John Smith
2: Bolt Night Ltd Mary Poppins
3: Bright Yellow Sun Ltd Harry Potter
---
I want to separate it into two columns depending on the position of the "Ltd". So, the data would look like:
> data
               firm_name         name
1:             Light Ltd   John Smith
2:        Bolt Night Ltd Mary Poppins
3: Bright Yellow Sun Ltd Harry Potter
---
I tried the stringr package but could not find a solution.
Thanks in advance.
You can use separate from tidyr with a lookbehind regular expression for this.
library(tidyr)
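# Split on the whitespace that follows "Ltd"; the lookbehind keeps "Ltd" in
# firm_name, and consuming the whitespace avoids a leading space in name
df %>%
separate(col = firm_name, into = c("firm_name", "name"), sep = "(?<=Ltd)\\s+")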
#> firm_name name
#> 1 Light Ltd John Smith
#> 2 Bolt Night Ltd Mary Poppins
#> 3 Bright Yellow Sun Ltd Harry Potter
data
df <- data.frame(firm_name = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter"))
We can use base R with read.csv
read.csv(text = sub("(Ltd)", "\\1,", df$names),
header = FALSE, col.names = c('firm_name', 'name'))
# firm_name name
#1 Light Ltd John Smith
#2 Bolt Night Ltd Mary Poppins
#3 Bright Yellow Sun Ltd Harry Potter
data
df <- structure(list(names = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter")), row.names = c(NA, -3L
), class = "data.frame")
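Are you after something like this?
library(tibble)
library(tidyr)  # also re-exports %>%
df <-
tibble(
names = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter")
)
# Note: splitting on "Ltd" itself removes "Ltd" from half_1 and leaves
# the surrounding spaces in both halves
df %>%
tidyr::separate(names, c("half_1", "half_2"), sep = "Ltd")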
Does this work:
> df %>% mutate(name = gsub('([A-Za-z].*Ltd) (.*)','\\2', firm_name), firm_name = gsub('([A-Za-z].*Ltd) (.*)','\\1', firm_name))
# A tibble: 3 x 2
firm_name name
<chr> <chr>
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
>
Data used:
> df
# A tibble: 3 x 1
firm_name
<chr>
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
>
Using tidyr::extract :
tidyr::extract(df, names, c('firm_name', 'name'), regex = '(.*Ltd)\\s(.*)')
# A tibble: 3 x 2
# firm_name name
# <chr> <chr>
#1 Light Ltd John Smith
#2 Bolt Night Ltd Mary Poppins
#3 Bright Yellow Sun Ltd Harry Potter
Or in base R :
df$name <- sub('.*Ltd\\s', '', df$names)
df$firm_name <- sub('(.*Ltd).*', '\\1', df$names)
df$names <- NULL
Another base R option
setNames(
data.frame(
do.call(
rbind,
strsplit(df$names, "(?<=Ltd)\\s+", perl = TRUE)
)
),
c("firm_name", "name")
)
giving
firm_name name
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
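All of the approaches above assume that every row contains "Ltd". As a sketch of how rows without it might be handled in base R, regexec can capture both parts and leave NA where the pattern does not match. The third input row and the names firm_strings, parts and out are hypothetical:

```r
firm_strings <- c("Light Ltd John Smith",
                  "Bolt Night Ltd Mary Poppins",
                  "Acme Inc Jane Doe")  # hypothetical row without "Ltd"

# regexec() returns the full match plus one entry per capture group;
# regmatches() gives character(0) for rows that do not match
m <- regmatches(firm_strings, regexec("^(.*Ltd)\\s+(.*)$", firm_strings))

# Keep the two capture groups, or a pair of NAs when there was no match
parts <- t(vapply(m, function(p) {
  if (length(p) == 3) p[2:3] else c(NA_character_, NA_character_)
}, character(2)))

out <- data.frame(firm_name = parts[, 1], name = parts[, 2],
                  stringsAsFactors = FALSE)
out
```

Returning NA for unmatched rows is only one option; keeping the original string in firm_name would be another.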

Converting Names into Identification Codes in different columns in R

I am new to R and I am struggling with the following issue:
I have a dataset more or less like this:
NAME          Collegue1   Collegue 2
John Smith    Bill Gates  Brad Pitt
Adam Sandler  Bill Gates  John Smith
Bill Gates    Brad Pitt   Adam Sandler
Brad Pitt     John Smith  Bill Gates
I need to create an ID code for each name and substitute the names with the corresponding ID in all three columns. How can I do that?
You could try the code below
df[]<-as.integer(factor(unlist(df),levels = df$NAME))
such that
> df
NAME Collegue1 Collegue2
1 1 3 4
2 2 3 1
3 3 4 2
4 4 1 3
Or
df[-1] <- as.integer(factor(unlist(df[-1]),levels = df$NAME))
such that
> df
NAME Collegue1 Collegue2
1 John Smith 3 4
2 Adam Sandler 3 1
3 Bill Gates 4 2
4 Brad Pitt 1 3
Data
df <- structure(list(NAME = c("John Smith", "Adam Sandler", "Bill Gates",
"Brad Pitt"), Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt",
"John Smith"), Collegue2 = c("Brad Pitt", "John Smith", "Adam Sandler",
"Bill Gates")), class = "data.frame", row.names = c(NA, -4L))
You can convert the names to a factor and use unclass to get the ID codes.
x[-1] <- unclass(factor(unlist(x[-1]), x$NAME))
cbind(x["NAME"], ID=seq_along(x$NAME), x[-1])
# NAME ID Collegue1 Collegue.2
#1 John Smith 1 3 4
#2 Adam Sandler 2 3 1
#3 Bill Gates 3 4 2
#4 Brad Pitt 4 1 3
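The same recoding can also be written as a lookup with match(), which may be easier to read than going through a factor. This is only a sketch: it assumes every colleague also appears in the NAME column (anyone who does not would become NA), and the ids and coded names are made up here.

```r
x <- data.frame(
  NAME      = c("John Smith", "Adam Sandler", "Bill Gates", "Brad Pitt"),
  Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt", "John Smith"),
  Collegue2 = c("Brad Pitt", "John Smith", "Adam Sandler", "Bill Gates"),
  stringsAsFactors = FALSE
)

# Lookup table: the ID of a name is its row position in NAME
ids <- data.frame(ID = seq_along(x$NAME), NAME = x$NAME,
                  stringsAsFactors = FALSE)

# Replace every name in every column by its position in the lookup table
coded <- x
coded[] <- lapply(x, function(col) match(col, ids$NAME))
coded
```

Keeping ids as a separate table makes it possible to translate the codes back to names later.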
In case you are just interested in ID's:
levels(factor(unlist(x))) #Only in case you are interested in the codes of the table
#[1] "Adam Sandler" "Bill Gates" "Brad Pitt" "John Smith"
x[] <- unclass(factor(unlist(x)))
x
# NAME Collegue1 Collegue.2
#1 4 2 3
#2 1 2 4
#3 2 3 1
#4 3 4 2
Data:
x <- structure(list(NAME = c("John Smith", "Adam Sandler", "Bill Gates",
"Brad Pitt"), Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt",
"John Smith"), Collegue.2 = c("Brad Pitt", "John Smith", "Adam Sandler",
"Bill Gates")), class = "data.frame", row.names = c(NA, -4L))

String updating by group - performance improvement

I have a data frame df:
df <- structure(list(firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom", "Jan Paul W R", "Jan Paul", "J P W R", "J P"),
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost", "Wilson", "Wilson", "Wilson", "Wilson"),
initial = c("JL", "RC", "J", "JL", "TF", "TF", "T", "JPWR", "JP", "JPWR", "JP")), .Names =c("firstname","lastname", "initial"), row.names = c(NA, -11L), class ="data.frame")
I want to replace each shorter first name with the longest matching first name among the rows that share its last name (the rows may differ in initials and/or first name). The resulting data frame df would look like this:
      firstname lastname initial     LongName
1        John L      Doe      JL       John L
2      Robert C      Doe      RC     Robert C
3          John      Doe       J       John L
4           J L      Doe      JL       John L
5         Tom F    Frost      TF        Tom F
6           T F    Frost      TF        Tom F
7           Tom    Frost       T        Tom F
8  Jan Paul W R   Wilson    JPWR Jan Paul W R
9      Jan Paul   Wilson      JP Jan Paul W R
10      J P W R   Wilson    JPWR Jan Paul W R
11          J P   Wilson      JP Jan Paul W R
At present, I am doing this using grepl and if else, as below:
df$LongName <- apply(df,1,function(x) {
if(gsub("[[:space:]]","",x[["firstname"]]) == x[["initial"]]){
Longname <- df$firstname[grepl(x[["initial"]], df$initial) & df$lastname == x[["lastname"]]]
}
else{
Longname <- df$firstname[grepl(x[["initial"]], df$initial) & grepl(x[["firstname"]], df$firstname) & df$lastname == x[["lastname"]]]
}
Longname[which.max(nchar(Longname))]
})
The code above works, but it is slow on a large data frame because it goes through the rows one at a time with apply and if/else. Is there an alternative approach that runs faster?
Here's an entertaining way using adist with an insertion cost of 0 to create a string distance matrix:
library(dplyr)
df <- structure(list(firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom", "Jan Paul W R", "Jan Paul", "J P W R", "J P"),
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost", "Wilson", "Wilson", "Wilson", "Wilson"),
initial = c("JL", "RC", "J", "JL", "TF", "TF", "T", "JPWR", "JP", "JPWR", "JP")), .Names =c("firstname","lastname", "initial"),
row.names = c(NA, -11L), class ="data.frame")
df %>%
group_by(lastname) %>%
mutate(fullname = {
# Boolean matrix of where string distance with an insertion cost of 0 is 0
d <- adist(initial, firstname, costs = c(i = 0)) == 0;
# set TRUE values to the number of characters of that string
d[d] <- nchar(firstname[col(d)][d]);
# return whichever firstname has the most characters
firstname[max.col(d)]
})
#> # A tibble: 11 x 4
#> # Groups: lastname [3]
#> firstname lastname initial fullname
#> <chr> <chr> <chr> <chr>
#> 1 John L Doe JL John L
#> 2 Robert C Doe RC Robert C
#> 3 John Doe J John L
#> 4 J L Doe JL John L
#> 5 Tom F Frost TF Tom F
#> 6 T F Frost TF Tom F
#> 7 Tom Frost T Tom F
#> 8 Jan Paul W R Wilson JPWR Jan Paul W R
#> 9 Jan Paul Wilson JP Jan Paul W R
#> 10 J P W R Wilson JPWR Jan Paul W R
#> 11 J P Wilson JP Jan Paul W R
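To see why the distance trick works: with an insertion cost of 0, the distance from an initial string to a first name is 0 exactly when the name can be built from the initials by inserting characters only, while deletions and substitutions keep their default cost of 1. A small sketch on values taken from the Doe group (is_match is an illustrative name):

```r
initial   <- "JL"
firstname <- c("John L", "Robert C", "John", "J L")

# Insertions are free; deletions and substitutions still cost 1 each
d <- utils::adist(initial, firstname, costs = c(i = 0))

# TRUE where "JL" can be turned into the name by insertions alone
is_match <- as.vector(d == 0)
firstname[is_match]
```

"John" is not a match for "JL" because the "L" would have to be deleted, which is not free.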

String update for each group in data frame

I have a large data frame df like this:
firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom")
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost")
id = c(178, 649, 384, 479, 539, 261, 347)
df = data.frame(firstname, lastname, id)
Which looks as below in df view:
firstname lastname id
John L Doe 178
Robert C Doe 649
John Doe 384
J L Doe 479
Tom F Frost 539
T F Frost 261
Tom Frost 347
As you can see, the firstname column is inconsistent; sometimes it is just initials, for example. I would like consistent first names, with an output data frame like this:
firstname lastname id
John L Doe 178
Robert C Doe 649
John L Doe 384
John L Doe 479
Tom F Frost 539
Tom F Frost 261
Tom F Frost 347
I have tried a few approaches, such as grouping by lastname, taking the longest string in each group, and then updating firstname in an if/else if statement by matching it against the other first names in the group, using the following:
> sapply(strsplit("John L Doe"," "), function(a) paste(a[1],a[3]))
[1] "John Doe"
> sapply(strsplit("John L Doe"," "), function(a) paste(substr(a[1],1,1),a[2],a[3]))
[1] "J L Doe"
It did not work, because I realized that taking the longest string in the group is not a good approach.
Mapping from the initials of the first name to its full form is always going to be correct. For example, there will be a "John L Doe", but his first name will appear in 3 variants: "John L", "John", and "J L". This is because these are lists of authors on a very narrow subject. There are just inconsistencies in the formatting of the names, which I would like to fix. Having one consistent name will help me do more analysis on a wider scale.
How can I do this in R?
The following solution produces your expected result, but bear in mind if Jack L Doe and John L Doe exist, J L Doe will map to the first longest name.
firstname = c("John L", "Robert C", "John", "J L", "Tom F", "T F", "Tom", "Jack L", "Robert Can","R C", "R C")
lastname = c("Doe", "Doe", "Doe", "Doe", "Frost", "Frost", "Frost", "Doe","Frost","Doe", "Frost")
id = c(178, 649, 384, 479, 539, 261, 347,100,200,300,400)
df = data.frame(firstname, lastname, id,stringsAsFactors = FALSE)
df$Initials <- sapply(strsplit(as.vector(firstname), " "), function(x) paste(substr(x, 1,1), collapse=""))
df$LongName<-apply(df,1,function(x) {
if(gsub("\\s","",x[["firstname"]]) == x[["Initials"]]){
choices<-df$firstname[ grepl(x[["Initials"]], df$Initials) & df$lastname == x[["lastname"]]]
}
else{
choices<-df$firstname[ grepl(x[["Initials"]], df$Initials) & grepl(x[["firstname"]], df$firstname) & df$lastname == x[["lastname"]]]
}
choices[which.max(nchar(choices))]
}
)
Result
> df
firstname lastname id Initials LongName
1 John L Doe 178 JL John L
2 Robert C Doe 649 RC Robert C
3 John Doe 384 J John L
4 J L Doe 479 JL John L
5 Tom F Frost 539 TF Tom F
6 T F Frost 261 TF Tom F
7 Tom Frost 347 T Tom F
8 Jack L Doe 100 JL Jack L
9 Robert Can Frost 200 RC Robert Can
10 R C Doe 300 RC Robert C
11 R C Frost 400 RC Robert Can
Your use case is not entirely clear.
As mentioned, there are issues if you have people with the same last name and the same initials but different first names. If you are convinced that this will never be the case in your data, then the solution may be quite simple.
However, if what you're trying to do is find out whether the names refer to the same people, you'll need a lot more, and that means diving into the subject of Entity Reconciliation.
There are some neat R packages for this (I've worked on a project involving entity reconciliation), including RecordLinkage, but the bottom line is: if you want reliable record linkage, you'll need at least a little more than first name and last name.
What you are trying to achieve is usually done with a dictionary that maps every spelling variant to a preferred name. There are smart solutions out there based on text similarity and text mining. Unless you already have a dictionary linking c("JL", "J L", "J.L.", etc.) to John L, I would not do it in R.
Have a look at DataWrangler, Trifacta, Dataiku or OpenRefine; they all have a free version that will do what you are looking for. I know that OpenRefine (formerly Google Refine) can be scripted.
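For completeness, if such a dictionary does already exist, applying it in R is a short lookup. The sketch below uses a hand-written, purely illustrative dictionary (a named character vector from variant to preferred name) and leaves any name that is not in it unchanged:

```r
# Hypothetical dictionary: names are the variants, values the preferred form
dictionary <- c("JL" = "John L", "J L" = "John L",
                "J.L." = "John L", "John" = "John L")

firstname <- c("John L", "J L", "John", "Robert C")

# Look up each variant; keep the original where the dictionary has no entry
canonical <- unname(ifelse(firstname %in% names(dictionary),
                           dictionary[firstname],
                           firstname))
canonical
```

The hard part remains building and maintaining the dictionary, which is where the tools mentioned above come in.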
