How to merge based on a string in a column? - r

I would like to do exact joins on the columns year and state, but a fuzzy join between the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("# george v. SALLY", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("# george v. SALLY", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to split up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.

A simple approach, which depends somewhat on the structure of df2$versus, would be this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Update (Jul 14, 2022):
If name has a more complicated pattern than a single word (say, "Molly Homes, Jane Doe"), we need a way to extract each whole word and check whether any of them appears (case-insensitively) in the versus column. Here is one simple way to do this:
Create a function f(n, v) that takes strings n and v, extracts the whole words (wrds) from n, and counts how many of them are found in v. It returns TRUE if this count exceeds 0. Words of fewer than two characters are ignored.
f <- function(n,v) {
wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}
Left join the original frames, and apply f() by row
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))

Update 15/07:
See the comment. In that case one would want to check for a match in versus for each individual name in name. This could be done like this (using @langtang's 'new' data):
df1 |>
left_join(df2, by = c("year", "state")) |>
rowwise() |>
mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Old answer:
An approach could be:
library(tidyverse)
df1 |>
left_join(df2) |>
group_by(state) |>
mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
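Since the question also asks whether this can be done within fuzzyjoin itself, here is a minimal sketch of that route. It assumes that fuzzy_left_join() is given one matching function per column pair: plain equality for year and state, and a case-insensitive substring check for name against versus. The helper name_in_versus is a hypothetical name introduced here, not part of fuzzyjoin.

```r
library(dplyr)
library(fuzzyjoin)
library(stringr)

df1 <- data.frame(
  year  = c("2002", "2002", "1999", "1999", "1997", "2002"),
  state = c("TN", "TN", "AL", "AL", "CA", "TN"),
  name  = c("George", "Sally", "David", "Laura", "John", "Kate")
)
df2 <- data.frame(
  year   = c("2002", "1999"),
  state  = c("TN", "AL"),
  versus = c("# george v. SALLY", "#laura v. dAvid")
)

# Hypothetical helper: TRUE when `name` occurs anywhere in `versus`, ignoring case
name_in_versus <- function(name, versus) {
  str_detect(tolower(versus), fixed(tolower(name)))
}

joined <- fuzzy_left_join(
  df1, df2,
  by = c("year" = "year", "state" = "state", "name" = "versus"),
  # one function per column pair, in the same order as `by`
  match_fun = list(`==`, `==`, name_in_versus)
) %>%
  # shared column names come back suffixed with .x / .y; keep the df1 side
  select(year = year.x, state = state.x, name, versus)

joined
```

Because the name is matched as a literal substring, a short name such as "Al" would also match "Sally"; wrap it in word boundaries with a regex if that matters for your data.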

Related

Extracting and evaluating words in a text string against another dataset

I have two sets of data that I will be evaluating against one another. A heavily reduced example looks like this:
library(dplyr)
library(tidyverse)
library(sqldf)
library(dbplyr)
library(httr)
library(purrr)
library(jsonlite)
library(magrittr)
library(tidyr)
library(tidytext)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Craig Mills"), biography = c("Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!")), class = "data.frame", row.names = c(NA,
-3L))
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
I am trying to create a match against the contents of the biography text string in people_records_ex against college_name in college_records_ex so the final output will look like this:
final_records_ex <- structure(list(id = c(123L, 456L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Jeff Smith", "Craig Mills"), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
Or to provide a more visual example of the final output I'm expecting:
But when I run the following code, it produces zero results, which is not correct:
college_extract <- people_records_ex %>%
left_join(college_records_ex, by = c("biography" = "college_name")) %>%
filter(!is.na(college_state)) %>% dplyr::select(id, name, college_name, college_city, college_state) %>% distinct()
What am I doing incorrectly and what would the correct version look like?
Here's a very tidy and straightforward solution with fuzzy_join:
library(fuzzyjoin)
library(stringr)
library(dplyr)
fuzzy_join(
people_records_ex, college_records_ex,
by = c("biography" = "college_name"),
match_fun = str_detect,
mode = "left"
) %>%
select(-biography)
id name college_id college_name college_city college_state
1 123 Anna Wilson 234 Ohio State University Columbus OH
2 456 Jeff Smith 567 Stanford Stanford CA
3 456 Jeff Smith 891 William & Mary Williamsburg VA
4 789 Craig Mills 345 University of North Texas Denton TX
Assuming the college names in the biographies are spelled out exactly as they appear in the colleges table and the datasets are relatively small, all matches can be generated with a regex of all college names as follows
library(dplyr)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c(
"Anna Wilson",
"Jeff Smith", "Craig Mills"
), biography = c(
"Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!"
)), class = "data.frame", row.names = c(
NA,
-3L
)) %>% tibble::tibble()
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c(
"Ohio State University",
"Stanford", "William & Mary", "University of North Texas"
), college_city = c(
"Columbus",
"Stanford", "Williamsburg", "Denton"
), college_state = c(
"OH",
"CA", "VA", "TX"
)), class = "data.frame", row.names = c(NA, -4L)) %>%
tibble::tibble()
# join college names in a regex pattern
colleges_regex <- paste0(college_records_ex$college_name, collapse = "|")
colleges_regex
#> [1] "Ohio State University|Stanford|William & Mary|University of North Texas"
# match all against bio, giving a list-column of matches
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex))
#> # A tibble: 3 × 4
#> id name biography matches
#> <int> <chr> <chr> <list>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. <chr[…]>
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at … <chr[…]>
#> 3 789 Craig Mills University of North Texas Volleyball! <chr[…]>
# unnest the list column longer to give 1 row per person per match
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex)) %>%
tidyr::unnest_longer(matches)
#> # A tibble: 4 × 4
#> id name biography match…¹
#> <int> <chr> <chr> <chr>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. Ohio S…
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Stanfo…
#> 3 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Willia…
#> 4 789 Craig Mills University of North Texas Volleyball! Univer…
#> # … with abbreviated variable name ¹​matches[,1]
Created on 2022-10-26 with reprex v2.0.2
This may be joined back to the college table such that it is annotated with college info.
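A minimal sketch of that join-back step, under the same assumption that college names are spelled exactly as in the colleges table. It swaps str_match_all() for str_extract_all() so that the list column holds plain character vectors, which keeps the unnested column easy to join on. The object names people, colleges and annotated are placeholders chosen here.

```r
library(dplyr)
library(stringr)
library(tidyr)

people <- data.frame(
  id = c(123L, 456L, 789L),
  name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
  biography = c(
    "Student at Ohio State University. Class of 2024.",
    "Second year law student at Stanford. Undergrad at William & Mary",
    "University of North Texas Volleyball!"
  )
)
colleges <- data.frame(
  college_id = c(234L, 567L, 891L, 345L),
  college_name = c("Ohio State University", "Stanford",
                   "William & Mary", "University of North Texas"),
  college_city = c("Columbus", "Stanford", "Williamsburg", "Denton"),
  college_state = c("OH", "CA", "VA", "TX")
)

# one alternation pattern covering every college name
colleges_regex <- paste0(colleges$college_name, collapse = "|")

annotated <- people %>%
  # list column: all college names found in each biography
  mutate(college_name = str_extract_all(biography, colleges_regex)) %>%
  # one row per person per matched college
  unnest_longer(college_name) %>%
  # attach city, state and id from the colleges table
  left_join(colleges, by = "college_name") %>%
  select(-biography)

annotated
```

If a college name contained regex metacharacters (for example parentheses), it would need escaping before being pasted into the pattern.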
In base R you can do:
do.call(rbind, lapply(college_records_ex$college_name,
\(x) people_records_ex[grep(x, people_records_ex$biography),1:2])) |>
cbind(college_records_ex[-1])
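This matches each college name against the biographies, keeps the first two columns (id and name) of the matching rows, and cbinds the result with the second data.frame minus its first column. Note that the rows only line up if each college matches exactly one biography, as it does here.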
id name college_name college_city college_state
1 123 Anna Wilson Ohio State University Columbus OH
2 456 Jeff Smith Stanford Stanford CA
21 456 Jeff Smith William & Mary Williamsburg VA
3 789 Craig Mills University of North Texas Denton TX

How to add values from other column if conditional join does not execute?

I have two tables. This one holds the old names:
Last Name|First Name|ID
Clay|Cassius|1
Alcindor|Lou|2
Artest|Ron|3
Jordan|Michael|4
Scottie|Pippen|5
Kanter|Enes|6
And this one holds the new names:
Last Name|First Name|ID
Ali|Muhammad|1
Abdul Jabbar|Kareem|2
World Peace|Metta|3
Jordan|Michael|4
Pippen|Scottie|5
Freedom|Enes Kanter|6
Basically, I want to join onto the first table (old names) so that it shows the new last name if there has been a name change, and a blank otherwise:
Last Name|First Name|ID|Discrepancies
Clay|Cassius|1|Ali
Alcindor|Lou|2|Abdul Jabbar
Artest|Ron|3|World Peace
Jordan|Michael|4|
Pippen|Scottie|5|
Kanter|Enes|6|Freedom
Note that Michael and Scottie's name did not change so in Discrepancies there is a blank.
You could use
library(dplyr)
df1 %>%
left_join(df2, by = "ID", suffix = c("", ".y")) %>%
mutate(Discrepancies = ifelse(Last_Name.y == Last_Name, "", Last_Name.y)) %>%
select(-ends_with(".y"))
to get
# A tibble: 6 x 4
Last_Name First_Name ID Discrepancies
<chr> <chr> <dbl> <chr>
1 Clay Cassius 1 "Ali"
2 Alcindor Lou 2 "Abdul Jabbar"
3 Artest Ron 3 "World Peace"
4 Jordan Michael 4 ""
5 Scottie Pippen 5 "Pippen"
6 Kanter Enes 6 "Freedom"
Note:
I named the columns Last_Name and First_Name.
The first data frame contains Scottie Pippen instead of Pippen Scottie.
Another possible solution:
library(tidyverse)
old <- data.frame(
stringsAsFactors = FALSE,
check.names = FALSE,
Last = c("Clay",
"Alcindor","Artest","Jordan","Scottie","Kanter"),
`First` = c("Cassius","Lou",
"Ron","Michael","Pippen","Enes"),
`ID` = c(1L, 2L, 3L, 4L, 5L, 6L)
)
new <- data.frame(
stringsAsFactors = FALSE,
check.names = FALSE,
`Last` = c("Ali",
"Abdul Jabbar","World Peace","Jordan","Pippen","Freedom"),
`First` = c("Muhammad",
"Kareem","Metta","Michael","Scottie","Enes Kanter"),
ID = c(1L, 2L, 3L, 4L, 5L, 6L)
)
old %>%
bind_rows(new) %>%
group_by(ID) %>%
summarise(
discrepancies = if_else(n_distinct(Last) > 1, last(Last), NA_character_),
Last = first(Last), First = first(First), .groups = "drop" )
#> # A tibble: 6 × 4
#> ID discrepancies Last First
#> <int> <chr> <chr> <chr>
#> 1 1 Ali Clay Cassius
#> 2 2 Abdul Jabbar Alcindor Lou
#> 3 3 World Peace Artest Ron
#> 4 4 <NA> Jordan Michael
#> 5 5 Pippen Scottie Pippen
#> 6 6 Freedom Kanter Enes
You can simply merge your data by ID, and then blank out the entries where the last name did not change.
dfinal <- setNames( merge( dat1, dat2, "ID", suffixes=c(1,2) )[
,c("Last.Name1","First.Name1","ID","Last.Name2")], c(colnames(dat1),"Discrepancies") )
dfinal$Discrepancies[ dfinal$Last.Name == dfinal$Discrepancies ] <- ""
dfinal
Last.Name First.Name ID Discrepancies
1 Clay Cassius 1 Ali
2 Alcindor Lou 2 Abdul Jabbar
3 Artest Ron 3 World Peace
4 Jordan Michael 4
5 Scottie Pippen 5 Pippen
6 Kanter Enes 6 Freedom
Data
dat1 <- structure(list(Last.Name = c("Clay", "Alcindor", "Artest", "Jordan",
"Scottie", "Kanter"), First.Name = c("Cassius", "Lou", "Ron",
"Michael", "Pippen", "Enes"), ID = 1:6), class = "data.frame", row.names = c(NA,
-6L))
dat2 <- structure(list(Last.Name = c("Ali", "Abdul Jabbar", "World Peace",
"Jordan", "Pippen", "Freedom"), First.Name = c("Muhammad", "Kareem",
"Metta", "Michael", "Scottie", "Enes Kanter"), ID = 1:6), class = "data.frame", row.names = c(NA,
-6L))

Grouping by Multiple variables and summarizing character frequencies

I am trying to group my dataset by multiple variables and build a frequency table of the number of times a character variable appears. Here is an example data set:
Location State County Job Pet
Ohio Miami Data Dog
Urban Ohio Miami Business Dog, Cat
Urban Ohio Miami Data Cat
Rural Kentucky Clark Data Cat, Fish
City Indiana Shelby Business Dog
Rural Kentucky Clark Data Dog, Fish
Ohio Miami Data Dog, Cat
Urban Ohio Miami Business Dog, Cat
Rural Kentucky Clark Data Fish
City Indiana Shelby Business Cat
I want my output to look like this:
Location State County Job Frequency Pet:Cat Pet:Dog Pet:Fish
Ohio Miami Data 2 1 2 0
Urban Ohio Miami Business 2 2 2 0
Urban Ohio Miami Data 1 1 0 0
Rural Kentucky Clark Data 3 1 1 3
City Indiana Shelby Business 2 1 1 0
I have tried different iterations of the following code, and I get close, but not quite right:
Output<-df%>%group_by(Location, State, County, Job)%>%
dplyr::summarise(
Frequency= dplyr::n(),
Pet:Cat = count(str_match(Pet, "Cat")),
Pet:Dog = count(str_match(Pet, "Dog")),
Pet:Fish = count(str_match(Pet, "Fish")),
)
Any help would be appreciated! Thank you in advance
Try this:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
separate_rows(Pet,sep=',') %>%
mutate(Pet=trimws(Pet)) %>%
group_by(Location,State,County,Job,Pet) %>%
summarise(N=n()) %>%
mutate(Pet=paste0('Pet:',Pet)) %>%
group_by(Location,State,County,Job,.drop = F) %>%
mutate(Freq=n()) %>%
pivot_wider(names_from = Pet,values_from=N,values_fill=0)
Output:
# A tibble: 5 x 8
# Groups: Location, State, County, Job [5]
Location State County Job Freq `Pet:Cat` `Pet:Dog` `Pet:Fish`
<chr> <chr> <chr> <chr> <int> <int> <int> <int>
1 "" Ohio Miami Data 2 1 2 0
2 "City" Indiana Shelby Business 2 1 1 0
3 "Rural" Kentucky Clark Data 3 1 1 3
4 "Urban" Ohio Miami Business 2 2 2 0
5 "Urban" Ohio Miami Data 1 1 0 0
Some data used:
#Data
df <- structure(list(Location = c("", "Urban", "Urban", "Rural", "City",
"Rural", "", "Urban", "Rural", "City"), State = c("Ohio", "Ohio",
"Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky",
"Indiana"), County = c("Miami", "Miami", "Miami", "Clark", "Shelby",
"Clark", "Miami", "Miami", "Clark", "Shelby"), Job = c("Data",
"Business", "Data", "Data", "Business", "Data", "Data", "Business",
"Data", "Business"), Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish",
"Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")), row.names = c(NA,
-10L), class = "data.frame")
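One caveat about the pipeline above: Freq is computed after the data has been collapsed to one row per pet, so it counts the distinct pets in each group rather than the original rows. In this sample the two numbers happen to coincide for every group. A sketch that counts the rows before splitting the Pet column, assuming the same column names as the question (the result name pet_counts is made up here):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Location = c("", "Urban", "Urban", "Rural", "City",
               "Rural", "", "Urban", "Rural", "City"),
  State = c("Ohio", "Ohio", "Ohio", "Kentucky", "Indiana",
            "Kentucky", "Ohio", "Ohio", "Kentucky", "Indiana"),
  County = c("Miami", "Miami", "Miami", "Clark", "Shelby",
             "Clark", "Miami", "Miami", "Clark", "Shelby"),
  Job = c("Data", "Business", "Data", "Data", "Business",
          "Data", "Data", "Business", "Data", "Business"),
  Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", "Dog",
          "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")
)

pet_counts <- df %>%
  # rows per group, counted while each row is still one record
  add_count(Location, State, County, Job, name = "Frequency") %>%
  # one row per record per pet; the regex also drops the space after the comma
  separate_rows(Pet, sep = ",\\s*") %>%
  # occurrences of each pet within each group
  count(Location, State, County, Job, Frequency, Pet, name = "N") %>%
  mutate(Pet = paste0("Pet:", Pet)) %>%
  pivot_wider(names_from = Pet, values_from = N, values_fill = 0)

pet_counts
```

Carrying Frequency through count() as an extra grouping column keeps it constant within each group, so it survives the reshaping unchanged.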

Obtain a unique ID from a different dataframe

I have this dataframe:
First.Name Last.Name Country Unit Hospital
John Mars UK Sales South
John Mars UK Sales South
John Mars UK Sales South
Lisa Smith USA HHRR North
Lisa Smith USA HHRR North
and this other:
First.Name Last.Name ID
John Valjean 1254
Peter Smith 1255
Frank Mars 1256
Marie Valjean 1257
Lisa Smith 1258
John Mars 1259
and I would like to merge them or paste them together to have:
I tried x = merge(df1, df2, by.y=c('Last.Name','First.Name')) but it doesn't seem to work. I also tried x = df1[c(df1$Last.Name, df1$First.Name) %in% c(df2$Last.Name, df2$First.Name),] and that doesn't work either.
When using merge, you have to be careful with its arguments, especially by, by.x, by.y, all, all.x and all.y. Each of these arguments is described in the merge documentation (?merge).
Based on this, try out:
merge(df1, df2, by = c('First.Name', 'Last.Name')) # see @Sotos's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
merge(df1, df2, by.x = c('Last.Name','First.Name'),
by.y = c('Last.Name','First.Name')) # in your code, you set by.y but not by.x
# output
Last.Name First.Name Country Unit Hospital ID
1 Mars John UK Sales South 1259
2 Mars John UK Sales South 1259
3 Mars John UK Sales South 1259
4 Smith Lisa USA HHRR North 1258
5 Smith Lisa USA HHRR North 1258
# by in dplyr::left_join() works like by in merge()
dplyr::left_join(df1, df2, by = c('First.Name', 'Last.Name')) # see @tmfmnk's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
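To illustrate how by.x, by.y and all.x interact, here is a small base R sketch. The data frames staff and ids and the column names Given and Surname are hypothetical, chosen so that the key columns are named differently on each side and one person has no ID.

```r
# Hypothetical data: key columns are named differently in the two tables
staff <- data.frame(
  First.Name = c("John", "Lisa", "Omar"),
  Last.Name  = c("Mars", "Smith", "Reyes"),
  Unit       = c("Sales", "HHRR", "IT")
)
ids <- data.frame(
  Given   = c("John", "Lisa"),
  Surname = c("Mars", "Smith"),
  ID      = c(1259L, 1258L)
)

# by.x and by.y must be given together and in matching order
inner <- merge(staff, ids,
               by.x = c("First.Name", "Last.Name"),
               by.y = c("Given", "Surname"))

# all.x = TRUE keeps rows of `staff` that have no match; their ID becomes NA
left <- merge(staff, ids,
              by.x = c("First.Name", "Last.Name"),
              by.y = c("Given", "Surname"),
              all.x = TRUE)

inner
left
```

The key columns in the result take their names from the first data frame.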
Data
df1 <- structure(list(First.Name = c("John", "John", "John", "Lisa",
"Lisa"), Last.Name = c("Mars", "Mars", "Mars", "Smith", "Smith"
), Country = c("UK", "UK", "UK", "USA", "USA"), Unit = c("Sales",
"Sales", "Sales", "HHRR", "HHRR"), Hospital = c("South", "South",
"South", "North", "North")), .Names = c("First.Name", "Last.Name",
"Country", "Unit", "Hospital"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(First.Name = c("John", "Peter", "Frank", "Marie",
"Lisa", "John"), Last.Name = c("Valjean", "Smith", "Mars", "Valjean",
"Smith", "Mars"), ID = 1254:1259), .Names = c("First.Name", "Last.Name",
"ID"), class = "data.frame", row.names = c(NA, -6L))

Filling in NAs in data in R by id

I am having a little problem "filling in gaps". It's not a missing data question, it's more about merging but it's not working great.
So, my data looks like this
id name region Company
1 John Smith West Walmart
1 John Smith West Amazon
1 John Smith
1 John Smith West P&G
2 Jane Smith South Apple
2 Jane Smith
3 Richard Burkett
3 Richard Burkett West Walmart
And so on.
What I want to do is fill in those gaps in the region variable by their id. So, id 1, John Smith, on the third row, should have West in the third column. Jane Smith's region should be filled in "South" where it is missing.
I've tried creating a separate dataset and then merging it based on id but it creates duplicate rows and basically increases the N by something like 14 times (no idea why).
region1<-subset(df1, df1$region=="DC"| df1$region=="Midwest"|df1$region=="Northeast"|df1$region=="South"|df1$region=="West")
region<-region1[,c("id","region")]
df2<-merge(df1, region, by="id")
I've checked the structure of the variables. Id variable is interval and region is a factor. I think there should be a super simple way to do this but I'm just not getting it. Any ideas?
Thank you in advance.
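On the "no idea why" part: the lookup table built above still has one row per non-missing record, so each id appears several times in both tables, and merge() returns every pairing of rows that share an id. A minimal sketch on toy data shaped like the question's, assuming the gaps are stored as NA and that each id has a single region:

```r
# Toy data shaped like the question's; missing regions are assumed to be NA
df1 <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 3, 3),
  Company = c("Walmart", "Amazon", NA, "P&G", "Apple", NA, NA, "Walmart"),
  region  = c("West", "West", NA, "West", "South", NA, NA, "West")
)

# The question's lookup: one row per non-missing record, so ids repeat
region <- df1[!is.na(df1$region), c("id", "region")]

# Many-to-many merge: id 1 contributes 4 x 3 rows, ids 2 and 3 contribute 2 x 1 each
blown_up <- merge(df1, region, by = "id")
nrow(blown_up)  # 16 rows instead of 8

# Collapse the lookup to one row per id before merging
lookup <- unique(region)
fixed  <- merge(df1[, c("id", "Company")], lookup, by = "id")
nrow(fixed)     # 8 rows, every region filled in
```

Dropping the old region column from df1 before merging avoids ending up with region.x and region.y.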
Here's a base R solution. Suppose your data.frame is df:
regions <- sapply(split(df$region, df$id), function(x) {
ind <- is.na(x);
x[ind] <- x[!ind][1];
x
})
df$region <- unlist(regions)
df
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West <NA>
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South <NA>
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West <NA>
I would use dplyr::arrange followed by tidyr::fill
library(dplyr)
library(tidyr)
data.frame(id=c(1,1,1,1,2,2,3,3),
name=c(rep("John Smith",4), rep("Jane Smith", 2), rep("Richard Burkett", 2)),
region=c("West", "West", NA, "West", "South",NA, "West", NA),
Company=c("Walmart","Amazon",NA,"P&G","Apple",NA,"Walmart",NA)) %>%
arrange(id, name) %>%
fill(region)
Results in:
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West NA
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South NA
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West NA
A more robust solution is to group_by id and then fill, so that values never leak from one id into the next. To cover the OP's case, where the missing value can come before the known one within an id, fill in both directions.
library(tidyverse)
df %>% group_by(id) %>%
fill(region) %>%
fill(region, .direction = "up")
# id name region Company
# <int> <chr> <chr> <chr>
#1 1 John Smith West Walmart
#2 1 John Smith West Amazon
#3 1 John Smith West <NA>
#4 1 John Smith West P&G
#5 2 Jane Smith South Apple
#6 2 Jane Smith South <NA>
#7 3 Richard Burkett West Walmart
#8 3 Richard Burkett West <NA>
Data
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), name = c("John Smith",
"John Smith", "John Smith", "John Smith", "Jane Smith", "Jane Smith",
"Richard Burkett", "Richard Burkett"), region = c("West", "West",
NA, "West", "South", NA, "West", NA), Company = c("Walmart",
"Amazon", NA, "P&G", "Apple", NA, "Walmart", NA)), .Names = c("id",
"name", "region", "Company"), class = "data.frame", row.names = c(NA,
-8L))
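The two fill() calls can also be written as one by using .direction = "downup", which fills downwards first and then upwards within each group. A short sketch on data laid out like the question's, where id 3 has its missing region in the first row (the gaps are assumed to be NA):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
  name = c(rep("John Smith", 4), rep("Jane Smith", 2), rep("Richard Burkett", 2)),
  region = c("West", "West", NA, "West", "South", NA, NA, "West"),
  Company = c("Walmart", "Amazon", NA, "P&G", "Apple", NA, NA, "Walmart")
)

filled <- df %>%
  group_by(id) %>%                          # never borrow a region from another id
  fill(region, .direction = "downup") %>%   # down first, then up for leading gaps
  ungroup()

filled
```

Without group_by(id), the leading NA for id 3 would be filled with "South" from id 2.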
