Match records with a combination of regex and lookup - r

I want to match personal records between two tables using the following logic:
Regex match on last name up to minor variations - summarized by the following regex for a given last name: grepl("LNAME( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE).
The function fuzzyjoin::regex_*_join was suggested, but I'm not sure how to use it if the name isn't static...?
Match on first name based on the nicknames list, i.e. match any name in nicknames[[fname]], or just fname if that entry is empty. This should also be case-insensitive.
Exact match on city, not case-sensitive.
Right now I'm just iterating through df1 and implementing this logic by hand, but my data set is large and it's taking way too long. The manual implementation also doesn't lend itself to parallelization, which is a concern because I will want to optimize this in the future. There has to be a smarter way of doing this.
Example data:
df1 <- tibble("lname1"=c("SMITH","BLACK","MILLER"),
"fname1"=c("JOHN","THOMAS","JAMES"),
"city"=c("NEW YORK","LOS ANGELES","SEATTLE"),
"id1"=c("aaaa","bbbb","cccc"),
"misc1"=c("bla","ble","bla"))
df2 <- tibble("lname2"=c("Smith Jr.","Black III","Miller-Muller","Smith"),
"fname2"=c("Jon","Tom","Jamie","John"),
"city"=c("New York","Los Angeles","Seattle","New York"),
"id2"=c("1111","2222","3333","4444"),
"misc2"=c("bonk","bzdonk","boom","bam"))
nicknames <- list("john"=c("john","jon","johnny"),
"thomas"=c("thomas","tom","tommy"),
"james"=c("james","jamie","jim"))
Expected output:
expected_output <- tibble("id1"=c("aaaa","aaaa","bbbb","cccc"),
"id2"=c("1111","4444","2222","3333"),
"lname1"=c("SMITH","SMITH","BLACK","MILLER"),
"fname1"=c("JOHN","JOHN","THOMAS","JAMES"),
"lname2"=c("Smith Jr.","Smith","Black III","Miller-Muller"),
"fname2"=c("Jon","John","Tom","Jamie"),
"city"=c("New York","New York","Los Angeles","Seattle"),
"misc1"=c("bla","bla","ble","bla"),
"misc2"=c("bonk","bam","bzdonk","boom"))
# A tibble: 4 x 9
id1 id2 lname1 fname1 lname2 fname2 city misc1 misc2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 aaaa 1111 SMITH JOHN Smith Jr. Jon New York bla bonk
2 aaaa 4444 SMITH JOHN Smith John New York bla bam
3 bbbb 2222 BLACK THOMAS Black III Tom Los Angeles ble bzdonk
4 cccc 3333 MILLER JAMES Miller-Muller Jamie Seattle bla boom
EDIT:
This is as far as I got. Spent the past few hours trying to get the last step done but I can't. I have this:
df <- tibble("fname1"=c("JOHN","JOHN","JOHN"),
"lname1"=c("SMITH","SMITH","SMITH"),
"fname2"=c("FRANK","JOHN","BILL"),
"lname2"=c("SMITH","SMITH","SMITH"),
"city"=c("NEW YORK","NEW YORK","NEW YORK"))
nicknames_df <- tibble(fname = names(nicknames), nick = paste0("^(", sapply(nicknames, paste, collapse = "|"), ")$"))
>df
# A tibble: 3 x 5
fname1 lname1 fname2 lname2 city
<chr> <chr> <chr> <chr> <chr>
1 JOHN SMITH FRANK SMITH NEW YORK
2 JOHN SMITH JOHN SMITH NEW YORK
3 JOHN SMITH BILL SMITH NEW YORK
>nicknames_df
# A tibble: 3 x 2
fname nick
<chr> <chr>
1 john ^(john|jon|johnny)$
2 thomas ^(thomas|tom|tommy)$
3 james ^(james|jamie|jim)$
Expected output:
> out
# A tibble: 1 x 5
fname1 lname1 fname2 lname2 city
<chr> <chr> <chr> <chr> <chr>
1 JOHN SMITH JOHN SMITH NEW YORK
How do I join it with nicknames df to get just the 2nd row?!
out <- df %>% fuzzyjoin::regex_left_join(nicknames_df, ???)

fuzzyjoin::regex_right_join(
  df2, df1, by = c(lname2 = "lname1"),
  ignore_case = TRUE)
# # A tibble: 4 x 10
# lname2 fname2 city.x id2 misc2 lname1 fname1 city.y id1 misc1
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Smith Jr. Jon New York 1111 bonk SMITH JOHN NEW YORK aaaa bla
# 2 Smith John New York 4444 bam SMITH JOHN NEW YORK aaaa bla
# 3 Black III Tom Los Angeles 2222 bzdonk BLACK THOMAS LOS ANGELES bbbb ble
# 4 Miller-Muller Jamie Seattle 3333 boom MILLER JAMES SEATTLE cccc bla
I didn't want to assume any resolution for city.x vs city.y; while it's clear visually that they're good, I'll let you work through that.
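The join above only covers the last name. Here is a minimal base R sketch of how the remaining two rules (nickname lookup on first name, case-insensitive city) could be layered on top. It does not use fuzzyjoin; the anchored last-name pattern, the city_key helper column, and the variable names cand, last_ok, and first_ok are my own assumptions, and the misc columns are left out for brevity.

```r
# Base R sketch; the anchored pattern and helper names are assumptions, not from the question.
df1 <- data.frame(
  lname1 = c("SMITH", "BLACK", "MILLER"),
  fname1 = c("JOHN", "THOMAS", "JAMES"),
  city   = c("NEW YORK", "LOS ANGELES", "SEATTLE"),
  id1    = c("aaaa", "bbbb", "cccc"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  lname2 = c("Smith Jr.", "Black III", "Miller-Muller", "Smith"),
  fname2 = c("Jon", "Tom", "Jamie", "John"),
  city   = c("New York", "Los Angeles", "Seattle", "New York"),
  id2    = c("1111", "2222", "3333", "4444"),
  stringsAsFactors = FALSE
)
nicknames <- list(
  john   = c("john", "jon", "johnny"),
  thomas = c("thomas", "tom", "tommy"),
  james  = c("james", "jamie", "jim")
)

# 1. Exact, case-insensitive city match: join on a lower-cased key.
df1$city_key <- tolower(df1$city)
df2$city_key <- tolower(df2$city)
cand <- merge(df1, df2, by = "city_key", suffixes = c("1", "2"))

# 2. Last name: lname1 followed by a suffix, a hyphen, a numeral, or nothing.
last_ok <- mapply(
  function(l1, l2) grepl(paste0("^", l1, "( .r|-| [ivx]|$)"), l2, ignore.case = TRUE),
  cand$lname1, cand$lname2, USE.NAMES = FALSE
)

# 3. First name: any listed nickname, or the name itself when no list exists.
first_ok <- mapply(
  function(f1, f2) {
    alts <- nicknames[[tolower(f1)]]
    if (is.null(alts)) alts <- tolower(f1)
    tolower(f2) %in% alts
  },
  cand$fname1, cand$fname2, USE.NAMES = FALSE
)

out <- cand[last_ok & first_ok, ]
out[order(out$id1, out$id2), c("id1", "id2", "lname1", "fname1", "lname2", "fname2")]
```

Joining on the city key first keeps the candidate set small, so the row-wise regex checks only run on pairs that already share a city. Last names containing regex metacharacters would need escaping before being pasted into the pattern.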

Related

Is it possible to convert lines from a text file into columns to get a dataframe?

I have a text file containing information on book title, author name, and country of birth which appear in separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data:
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
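One caveat with the matrix approach: matrix() recycles values (with a warning) if the number of lines is not a multiple of 3, which would silently shift every later record. A small sketch of a guard, assuming blank lines should be ignored; the inline txt string stands in for the file, and with a real file you would use readLines() on its path instead.

```r
# Sketch: drop blank lines and verify the record length before reshaping.
txt <- "Oscar Wilde
De Profundis
Ireland

Nathaniel Hawthorn
Birthmark
USA"

lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
lines <- lines[nzchar(lines)]        # discard empty lines
stopifnot(length(lines) %% 3 == 0)   # every record needs exactly three lines

result <- as.data.frame(matrix(lines, ncol = 3, byrow = TRUE),
                        stringsAsFactors = FALSE)
names(result) <- c("Author", "Book", "Country")
result <- cbind(ID = seq_len(nrow(result)), result)
result
```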
Another option is to read each line as a row and reshape the data after importing.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
  mutate(
    rid = ((row_number() - 1) %% 3) + 1,
    pid = (row_number() - 1) %/% 3 + 1) %>%
  mutate(col = case_when(rid == 1 ~ "Author", rid == 2 ~ "Book", rid == 3 ~ "Country")) %>%
  select(-rid) %>%
  pivot_wider(names_from = col, values_from = V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA

How can I group the same value across multiple columns and sum subsequent values?

I have a table of information that looks like the following:
rusher_full_name receiver_full_name rushing_fpts receiving_fpts
<chr> <chr> <dbl> <dbl>
1 Aaron Jones NA 5 0
2 NA Aaron Jones 0 5
3 Mike Davis NA 0.5 0
4 NA Allen Robinson 0 3
5 Mike Davis NA 0.7 0
What I'm trying to do is get all of the values from the rushing_fpts and receiving_fpts to sum up depending on the rusher_full_name and receiver_full_name value. For example, for every instance of "Aaron Jones" (whether it's in rusher_full_name or receiver_full_name) sum up the values of rushing_fpts and receiving_fpts
In the end, this is what I'd like it to look like:
player_full_name total_fpts
<chr> <dbl>
1 Aaron Jones 10
2 Mike Davis 1.2
3 Allen Robinson 3
I'm pretty new to using R and have Googled a number of things but can't find any solution. Any suggestions on how to accomplish this?
library(tidyverse)
df %>%
mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
group_by(player_full_name) %>%
summarise(total_fpts = sum(rushing_fpts+receiving_fpts))
Output
# A tibble: 3 x 2
player_full_name total_fpts
<chr> <dbl>
1 Aaron Jones 10
2 Allen Robinson 3
3 Mike Davis 1.2
Data
df <- data.frame(
rusher_full_name = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
rushing_fpts = c(5,0,0.5,0,.7),
receiving_fpts = c(0,5,0,3,0),
stringsAsFactors = FALSE
)
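Note that the coalesce() approach relies on each row naming only one player. If a row could ever have both a rusher and a receiver, the receiving points would be credited to the rusher. A base R sketch that avoids this by stacking the two name/points pairs before summing; the names long, fpts, and totals are just illustrative.

```r
df <- data.frame(
  rusher_full_name   = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
  receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
  rushing_fpts       = c(5, 0, 0.5, 0, 0.7),
  receiving_fpts     = c(0, 5, 0, 3, 0),
  stringsAsFactors = FALSE
)

# Stack (rusher, rushing points) on top of (receiver, receiving points).
long <- rbind(
  data.frame(player_full_name = df$rusher_full_name,
             fpts = df$rushing_fpts, stringsAsFactors = FALSE),
  data.frame(player_full_name = df$receiver_full_name,
             fpts = df$receiving_fpts, stringsAsFactors = FALSE)
)

# Rows without a player name carry no attributable points.
long <- long[!is.na(long$player_full_name), ]

# One total per player.
totals <- aggregate(fpts ~ player_full_name, data = long, FUN = sum)
names(totals)[names(totals) == "fpts"] <- "total_fpts"
totals
```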

wide to long multiple columns issue

I have something like this:
id role1 Approved by Role1 role2 Approved by Role2
1 Amy 1/1/2019 David 4/4/2019
2 Bob 2/2/2019 Sara 5/5/2019
3 Adam 3/3/2019 Rachel 6/6/2019
I want something like this:
id Name Role Approved
1 Amy role1 1/1/2019
2 Bob role1 2/2/2019
3 Adam role1 3/3/2019
1 David role2 4/4/2019
2 Sara role2 5/5/2019
3 Rachel role2 6/6/2019
I thought something like this would work
melt(df,id.vars= id,
measure.vars= list(c("role1", "role2"),c("Approved by Role1", "Approved by Role2")),
variable.name= c("Role","Approved"),
value.name= c("Name","Date"))
but I am getting Error: measure variables not found in data:c("role1", "role2"),c("Approved by Role1", "Approved by Role2")
I have tried replacing this with the number of the columns as well and haven't had any luck.
Any suggestions?? Thanks!
I really like the tidyr::pivot_longer() function, which has been available in tidyr since version 1.0.0. First I'm going to clean up the column names slightly, so they have a consistent structure:
> df
# A tibble: 3 x 5
id name_role1 approved_role1 name_role2 approved_role2
<dbl> <chr> <chr> <chr> <chr>
1 1 Amy 1/1/2019 David 4/4/2019
2 2 Bob 2/2/2019 Sara 5/5/2019
3 3 Adam 3/3/2019 Rachel 6/6/2019
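The renaming itself isn't shown above, so here is one way it could be done in base R. This is a sketch that assumes the original column names are exactly those in the question; the data frame is rebuilt here only to make the example self-contained.

```r
df <- data.frame(
  id = 1:3,
  role1 = c("Amy", "Bob", "Adam"),
  `Approved by Role1` = c("1/1/2019", "2/2/2019", "3/3/2019"),
  role2 = c("David", "Sara", "Rachel"),
  `Approved by Role2` = c("4/4/2019", "5/5/2019", "6/6/2019"),
  check.names = FALSE, stringsAsFactors = FALSE
)

nm <- names(df)
# "Approved by Role1" -> "approved_role1"
nm <- sub("^Approved by Role([0-9]+)$", "approved_role\\1", nm)
# "role1" -> "name_role1" (anchored, so the renamed approval columns are untouched)
nm <- sub("^role([0-9]+)$", "name_role\\1", nm)
names(df) <- nm
names(df)
```

The point of the prefix_suffix structure is that names_sep = "_" can split every column name into a value part (name or approved) and a role part.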
Then it's easy to convert to long format with pivot_longer():
library(tidyr)
df %>%
  pivot_longer(
    -id,
    names_to = c(".value", "role"),
    names_sep = "_"
  )
Output:
id role name approved
<dbl> <chr> <chr> <chr>
1 1 role1 Amy 1/1/2019
2 1 role2 David 4/4/2019
3 2 role1 Bob 2/2/2019
4 2 role2 Sara 5/5/2019
5 3 role1 Adam 3/3/2019
6 3 role2 Rachel 6/6/2019

Merge two datasets

I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
gather("from_to", "name", from, to) %>%
distinct(name) %>%
full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
node_list <- tibble(name = c("Joe", "Frank", "Peter"),
                    city = c("New York", "Detroit", "Maimi"),
                    age = c(24, 55, 65))
edge_list <- tibble(from = c("Joe", "Frank", "Peter", "Albert"),
                    to = c("Frank", "Albert", "James", "Tony"),
                    to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
  full_join(select(edge_list, name = to, city = to_city)) %>%
  full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
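For comparison, the same two-step join can be sketched in base R with the join keys spelled out, which makes it explicit that the first join matches on both name and city. The helper names to_nodes, from_nodes, and master are hypothetical.

```r
node_list <- data.frame(
  name = c("Joe", "Frank", "Peter"),
  city = c("New York", "Detroit", "Maimi"),
  age  = c(24, 55, 65),
  stringsAsFactors = FALSE
)
edge_list <- data.frame(
  from    = c("Joe", "Frank", "Peter", "Albert"),
  to      = c("Frank", "Albert", "James", "Tony"),
  to_city = c("Detroit", "St. Louis", "New York", "Carson City"),
  stringsAsFactors = FALSE
)

# People on the receiving end of an edge, with their city.
to_nodes <- data.frame(name = edge_list$to, city = edge_list$to_city,
                       stringsAsFactors = FALSE)
# People who only ever appear as a sender have no city information.
from_nodes <- data.frame(name = unique(edge_list$from), stringsAsFactors = FALSE)

master <- merge(node_list, to_nodes, by = c("name", "city"), all = TRUE)
master <- merge(master, from_nodes, by = "name", all = TRUE)
master
```

Because the first join uses both name and city as keys, a person whose city in node_list differs from their to_city would end up with two rows, so it is worth checking for duplicated names afterwards.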

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The ID column and the values in X.1 : Y.3 each consist of a numeric value followed by the name, so the string will be, for example, "1 John Smith" or "20 Sam Smith".
I have an issue where in certain instances, the ID will remain as "1 John Smith" but in X.1 : Y.3 the number may change preceding "John Smith", so for example it might be "14 John Smith". The names will always be correct, it's just the number that sometimes gets mixed up.
I have a list of 200+ IDs that are impacted by this mismatch. What is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in, it could be X.1, or Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each of the 200+ IDs and for each column affected, but it seems very inefficient. Is there a quicker way than repeating something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
  idcol <- as.character(x$ID)
  searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
  sapply(x[5:10], function(y) {
    ifelse(grepl(searchname, y), idcol, as.character(y))
  })
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think a straightforward way to accomplish this is with a loop. The reason is that you have to repeat the replacement for every name in your ID list, and a loop automates that.
I will make some assumptions first:
The ID list can be read as a character vector.
You don't have any typos in the ID list or in your data.frame, including differences in lowercase and uppercase letters in the names.
Your ID list does not contain the numbers. If it does, you have to use gsub to erase them.
The example works with a data.frame (DF) with the same structure as the one in your question.
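ID <- c("John Smith", "Adam Smith", "George Smith")
# work on a character matrix so grepl() tests every cell, not whole columns
cells <- as.matrix(DF[, 5:10])
for(i in seq_along(ID)) {
  cells[grepl(ID[i], cells, fixed = TRUE)] <- ID[i]
}
DF[, 5:10] <- as.data.frame(cells, stringsAsFactors = FALSE)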
With each round this loop will:
Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
Then, it will change all those values to the one in the "i" position of the ID vector.
So, the first iteration will do: 1) Search for every position where the name "John Smith" appears in the data frame. 2) Replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the first space between the number and the name too. One way to do this is using gsub and a regular expression:
DF[, 5:10] <- lapply(DF[, 5:10], function(col) gsub("^[0-9]+ ", "", col))
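If the separate list holds the correct full IDs (number plus name), a lookup table avoids looping over the 200+ entries altogether. This is a rough sketch under that assumption; correct_ids, lookup, and the small DF below are made up for illustration, and it relies on the names being spelled identically everywhere, as stated in the question.

```r
# Hypothetical list of correct IDs (number + name).
correct_ids <- c("1 John Smith", "5 Adam Smith", "20 Sam Smith")

# Named vector: bare name -> correct ID.
lookup <- setNames(correct_ids, sub("^[0-9]+ +", "", correct_ids))

# Made-up example data with two value columns.
DF <- data.frame(
  EVENT = 1:2,
  ID    = c("1 John Smith", "1 John Smith"),
  X.1   = c("14 John Smith", "7 Luke Smith"),
  Y.1   = c("5 Adam Smith", "3 Sam Smith"),
  stringsAsFactors = FALSE
)

cols <- c("X.1", "Y.1")
DF[cols] <- lapply(DF[cols], function(col) {
  bare  <- sub("^[0-9]+ +", "", col)   # strip the leading number
  fixed <- unname(lookup[bare])        # NA when the name is not in the list
  ifelse(is.na(fixed), col, fixed)     # keep the original value if not listed
})
DF
```

Each column is processed in one vectorized pass, so the cost does not grow with the number of IDs the way repeated gsub calls do.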

Resources