Extracting and evaluating words in a text string against another dataset - r

I have two sets of data that I will be evaluating against one another. A heavily reduced example looks like this:
library(dplyr)
library(tidyverse)
library(sqldf)
library(dbplyr)
library(httr)
library(purrr)
library(jsonlite)
library(magrittr)
library(tidyr)
library(tidytext)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Craig Mills"), biography = c("Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!")), class = "data.frame", row.names = c(NA,
-3L))
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
I am trying to create a match against the contents of the biography text string in people_records_ex against college_name in college_records_ex so the final output will look like this:
final_records_ex <- structure(list(id = c(123L, 456L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Jeff Smith", "Craig Mills"), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
But when I run the following code, it produces zero results, which is not correct:
college_extract <- people_records_ex %>%
left_join(college_records_ex, by = c("biography" = "college_name")) %>%
filter(!is.na(college_state)) %>% dplyr::select(id, name, college_name, college_city, college_state) %>% distinct()
What am I doing incorrectly and what would the correct version look like?

Here's a very tidy and straightforward solution with fuzzy_join:
library(fuzzyjoin)
library(stringr)
library(dplyr)
fuzzy_join(
people_records_ex, college_records_ex,
by = c("biography" = "college_name"),
match_fun = str_detect,
mode = "left"
) %>%
select(-biography)
id name college_id college_name college_city college_state
1 123 Anna Wilson 234 Ohio State University Columbus OH
2 456 Jeff Smith 567 Stanford Stanford CA
3 456 Jeff Smith 891 William & Mary Williamsburg VA
4 789 Craig Mills 345 University of North Texas Denton TX
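As for what went wrong in the original attempt: left_join() with by = c("biography" = "college_name") keeps a pair only when the whole biography is equal to the college name, which never happens here. Below is a minimal base R sketch of the same idea as the fuzzy join, assuming a trimmed copy of the toy data from the question (the names people, colleges, pairs and matched are illustrative). It builds every person/college pair and keeps those where the college name occurs as a literal substring of the biography. Using fixed = TRUE means characters such as . or ( in a college name are not interpreted as regex syntax.

```r
# Toy data copied from the question (trimmed to the columns needed here)
people <- data.frame(
  id = c(123L, 456L, 789L),
  name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
  biography = c(
    "Student at Ohio State University. Class of 2024.",
    "Second year law student at Stanford. Undergrad at William & Mary",
    "University of North Texas Volleyball!"
  ),
  stringsAsFactors = FALSE
)
colleges <- data.frame(
  college_name = c("Ohio State University", "Stanford",
                   "William & Mary", "University of North Texas"),
  college_state = c("OH", "CA", "VA", "TX"),
  stringsAsFactors = FALSE
)

# An equality join compares whole strings, so no row survives
exact <- merge(people, colleges, by.x = "biography", by.y = "college_name")
nrow(exact)

# by = NULL gives the Cartesian product: every person paired with every college
pairs <- merge(people, colleges, by = NULL)

# Keep the pairs where the college name is a literal substring of the biography
is_match <- unname(mapply(grepl, pairs$college_name, pairs$biography,
                          MoreArgs = list(fixed = TRUE)))
matched <- pairs[is_match, c("id", "name", "college_name", "college_state")]
matched <- matched[order(matched$id), ]
matched
```

The Cartesian product grows as rows(people) times rows(colleges), so this is only a reasonable choice for small tables.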

Assuming the college names in the biographies are spelled exactly as they appear in the colleges table, and the datasets are relatively small, all matches can be found with a single regex that alternates over all college names:
library(dplyr)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c(
"Anna Wilson",
"Jeff Smith", "Craig Mills"
), biography = c(
"Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!"
)), class = "data.frame", row.names = c(
NA,
-3L
)) %>% tibble::tibble()
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c(
"Ohio State University",
"Stanford", "William & Mary", "University of North Texas"
), college_city = c(
"Columbus",
"Stanford", "Williamsburg", "Denton"
), college_state = c(
"OH",
"CA", "VA", "TX"
)), class = "data.frame", row.names = c(NA, -4L)) %>%
tibble::tibble()
# join college names in a regex pattern
colleges_regex <- paste0(college_records_ex$college_name, collapse = "|")
colleges_regex
#> [1] "Ohio State University|Stanford|William & Mary|University of North Texas"
# match all against bio, giving a list-column of matches
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex))
#> # A tibble: 3 × 4
#> id name biography matches
#> <int> <chr> <chr> <list>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. <chr[…]>
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at … <chr[…]>
#> 3 789 Craig Mills University of North Texas Volleyball! <chr[…]>
# unnest the list column longer to give 1 row per person per match
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex)) %>%
tidyr::unnest_longer(matches)
#> # A tibble: 4 × 4
#> id name biography match…¹
#> <int> <chr> <chr> <chr>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. Ohio S…
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Stanfo…
#> 3 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Willia…
#> 4 789 Craig Mills University of North Texas Volleyball! Univer…
#> # … with abbreviated variable name ¹​matches[,1]
Created on 2022-10-26 with reprex v2.0.2
This may be joined back to the college table such that it is annotated with college info.
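One way that final join might look is sketched below. It assumes the same toy data, uses str_extract_all() rather than str_match_all() so the list-column holds plain character vectors, and names the extracted column college_name so it can serve as the join key. The object names people, colleges and result are illustrative.

```r
library(dplyr)
library(stringr)
library(tidyr)

people <- tibble(
  id = c(123L, 456L, 789L),
  name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
  biography = c(
    "Student at Ohio State University. Class of 2024.",
    "Second year law student at Stanford. Undergrad at William & Mary",
    "University of North Texas Volleyball!"
  )
)
colleges <- tibble(
  college_id = c(234L, 567L, 891L, 345L),
  college_name = c("Ohio State University", "Stanford",
                   "William & Mary", "University of North Texas"),
  college_city = c("Columbus", "Stanford", "Williamsburg", "Denton"),
  college_state = c("OH", "CA", "VA", "TX")
)

# One alternation pattern over all college names
colleges_regex <- str_c(colleges$college_name, collapse = "|")

result <- people %>%
  # list-column: every college name found in each biography
  mutate(college_name = str_extract_all(biography, colleges_regex)) %>%
  # one row per person per match
  unnest_longer(college_name) %>%
  # annotate each match with the college details
  left_join(colleges, by = "college_name") %>%
  select(id, name, college_name, college_city, college_state)

result
```

Because the names are pasted into a regex unescaped, a college name containing regex metacharacters would need escaping first; none of the names in this sample do.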

In base R you can do:
do.call(rbind, lapply(college_records_ex$college_name,
\(x) people_records_ex[grep(x, people_records_ex$biography),1:2])) |>
cbind(college_records_ex[-1])
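This greps each college name against the biographies, keeps the first two columns (id and name) of the matching rows, and then cbinds the result to the college table without its first column (college_id). Note that the cbind step lines rows up by position, so it relies on each college matching exactly one biography, as is the case here: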
id name college_name college_city college_state
1 123 Anna Wilson Ohio State University Columbus OH
2 456 Jeff Smith Stanford Stanford CA
21 456 Jeff Smith William & Mary Williamsburg VA
3 789 Craig Mills University of North Texas Denton TX

Related

Running a string against multiple match dataframes

I have a dataset of text strings that look something like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
The first thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would just become "Jennifer Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston".
After that, I want to evaluate the first word of each string to see if it matches anything in the firstname dataframe. If it does match, it creates a new column in the final table called firstname with the result. If it doesn't match, the result is simply "N/A".
After that, I'd like to then evaluate the remaining words against the lastname dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.
The final dataframe should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What would be the most effective way to go about doing this?
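We can store the first three words of each string in a helper column, string2, and then call str_extract_all on it twice: once with the firstnames collapsed into a single pattern with | (regex OR) and once with the lastnames. Each call returns a list-column of matches, which unnest then expands into one row per match. Finally, lastname is set to NA wherever no firstname was found.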
library(dplyr)
library(stringr)
library(tidyr)
strings %>%
mutate(string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
firstname = str_extract_all(string2,
str_c(firstname$firstnames, collapse = "|")),
lastname =str_extract_all(string2,
str_c(lastname$lastnames, collapse = "|")) ) %>%
unnest(where(is.list), keep_empty = TRUE) %>%
select(-string2)%>%
mutate(lastname = case_when(complete.cases(firstname) ~ lastname))
-output
# A tibble: 9 × 3
string firstname lastname
<chr> <chr> <chr>
1 "Jennifer Rae Hancock Brown" Jennifer Hancock
2 "Lisa Smith Houston Blogger" Lisa Smith
3 "Lisa Smith Houston Blogger" Lisa Houston
4 "Tina Fay Las Cruces" Tina Fay
5 "\t\nJamie Tucker Style Expert" Jamie Tucker
6 "Jessica Wright Htx Satx" Jessica Wright
7 "Julie Green Lifestyle Blogger" Julie Green
8 "Mike S Thomas Football Player" Mike Thomas
9 "Tiny Fitness Houston Studio" <NA> <NA>
OP's expected
> final
string firstname lastname
1 Jennifer Rae Hancock Brown Jennifer Hancock
2 Lisa Smith Houston Blogger Lisa Smith
3 Lisa Smith Houston Blogger Lisa Houston
4 Tina Fay Las Cruces Tina Fay
5 \t\nJamie Tucker Style Expert Jamie Tucker
6 Jessica Wright Htx Satx Jessica Wright
7 Julie Green Lifestyle Blogger Julie Green
8 Mike S Thomas Football Player Mike Thomas
9 Tiny George Fitness Houston Studio N/A N/A
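Note that the patterns above are searched anywhere in the three-word substring, so a first name would also be picked up in second or third position. The following base R sketch is one possible stricter variant, under the assumption that names should match whole words only: the first word is tested against the first names and the remaining words against the last names. The shortened inputs and the names first_names, last_names, words and result are illustrative.

```r
strings <- c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
             "\t\nJamie Tucker Style Expert", "Tiny Fitness Houston Studio")
first_names <- c("Jennifer", "Lisa", "Jamie")
last_names <- c("Hancock", "Smith", "Houston", "Tucker")

# Trim leading/trailing whitespace, split on whitespace, keep at most three words
words <- lapply(strsplit(trimws(strings), "\\s+"), head, 3)

rows <- lapply(seq_along(strings), function(i) {
  w <- words[[i]]
  # Only the first word is a first-name candidate
  first <- if (w[1] %in% first_names) w[1] else NA_character_
  # Every remaining word that is a known last name counts as a match
  last <- w[-1][w[-1] %in% last_names]
  # As in the answer above, report no last name when there is no first name
  if (is.na(first) || length(last) == 0) last <- NA_character_
  # A length-2 `last` recycles the other columns into two rows
  data.frame(string = strings[i], firstname = first, lastname = last,
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
result
```

Exact word membership (%in%) avoids partial hits such as a short name matching inside a longer word, at the cost of being case-sensitive.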

How to merge based on a string in a column?

I would like to do exact joins on the columns year and state, but a fuzzy join between the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("# george v. SALLY", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("# george v. SALLY", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to split up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.
A simple approach, which depends somewhat on the structure of df2$versus, would be this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus = if_else(grepl(name, versus, ignore.case = TRUE), versus, NA_character_))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Update/Jul 14 2022:
If name has a more complicated pattern than a single word (say, Molly Homes, Jane Doe), we need a way to extract the whole words from it and check whether any of them appears (case-insensitively) in the versus column. Here is one simple way to do this:
Create a function f(n, v) that takes strings n and v, extracts the whole words (wrds) from n, and counts how many of them are found in v. It returns TRUE if this count exceeds 0.
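f <- function(n,v) {
  # whole words in n; the pattern also yields empty matches, which are dropped below
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  # TRUE if any word longer than one character occurs in v (case-insensitive)
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=TRUE))>0
}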
Left join the original frames, and apply f() by row
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))
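Since grepl looks for substrings, f() would also report a match when a name is merely part of a longer word in versus. If whole-word matches are wanted instead, one possible variant is sketched below in base R. The helper name any_word_in is hypothetical and not part of the answer above; it splits both strings on non-alphanumeric characters and compares the lower-cased words.

```r
# Hypothetical helper: TRUE when any word of `n` (longer than one character)
# appears as a whole word in `v`, ignoring case
any_word_in <- function(n, v) {
  # split on runs of non-alphanumeric characters
  words <- unlist(strsplit(tolower(n), "[^[:alnum:]]+"))
  words <- words[nchar(words) > 1]
  targets <- unlist(strsplit(tolower(v), "[^[:alnum:]]+"))
  any(words %in% targets)
}

any_word_in("Molly Homes, Jane Doe", "Homes (v. Vista)")  # "homes" is a word in v
any_word_in("Sally", "Homes (v. Vista)")                  # no shared word
any_word_in("Jane Doe", "Doerr v. Vista")                 # "doe" is only a substring
any_word_in("Laura", "#laura v. dAvid")                   # case is ignored
```

It can be dropped into the rowwise() pipeline in place of f(name, versus).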
Update 15/07:
See comment. In such a case, one would want to check for a match in versus for each individual name in name. This could be done like this (using @langtang's 'new' data):
library(dplyr)
library(stringr)
df1 |>
left_join(df2, by = c("year", "state")) |>
rowwise() |>
mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Old answer:
An approach could be:
library(tidyverse)
df1 |>
left_join(df2) |>
group_by(state) |>
mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA

Grouping by Multiple variables and summarizing character frequencies

I am trying to group my dataset by multiple variables and build a frequency table of the number of times a character variable appears. Here is an example data set:
Location  State     County  Job       Pet
          Ohio      Miami   Data      Dog
Urban     Ohio      Miami   Business  Dog, Cat
Urban     Ohio      Miami   Data      Cat
Rural     Kentucky  Clark   Data      Cat, Fish
City      Indiana   Shelby  Business  Dog
Rural     Kentucky  Clark   Data      Dog, Fish
          Ohio      Miami   Data      Dog, Cat
Urban     Ohio      Miami   Business  Dog, Cat
Rural     Kentucky  Clark   Data      Fish
City      Indiana   Shelby  Business  Cat
I want my output to look like this:
Location  State     County  Job       Frequency  Pet:Cat  Pet:Dog  Pet:Fish
          Ohio      Miami   Data      2          1        2        0
Urban     Ohio      Miami   Business  2          2        2        0
Urban     Ohio      Miami   Data      1          1        0        0
Rural     Kentucky  Clark   Data      3          1        1        3
City      Indiana   Shelby  Business  2          1        1        0
I have tried different iterations of the following code, and I get close, but not quite right:
Output<-df%>%group_by(Location, State, County, Job)%>%
dplyr::summarise(
Frequency= dplyr::n(),
Pet:Cat = count(str_match(Pet, "Cat")),
Pet:Dog = count(str_match(Pet, "Dog")),
Pet:Fish = count(str_match(Pet, "Fish")),
)
Any help would be appreciated! Thank you in advance
Try this:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
separate_rows(Pet,sep=',') %>%
mutate(Pet=trimws(Pet)) %>%
group_by(Location,State,County,Job,Pet) %>%
summarise(N=n()) %>%
mutate(Pet=paste0('Pet:',Pet)) %>%
group_by(Location,State,County,Job,.drop = F) %>%
mutate(Freq=n()) %>%
pivot_wider(names_from = Pet,values_from=N,values_fill=0)
Output:
# A tibble: 5 x 8
# Groups: Location, State, County, Job [5]
Location State County Job Freq `Pet:Cat` `Pet:Dog` `Pet:Fish`
<chr> <chr> <chr> <chr> <int> <int> <int> <int>
1 "" Ohio Miami Data 2 1 2 0
2 "City" Indiana Shelby Business 2 1 1 0
3 "Rural" Kentucky Clark Data 3 1 1 3
4 "Urban" Ohio Miami Business 2 2 2 0
5 "Urban" Ohio Miami Data 1 1 0 0
Some data used:
#Data
df <- structure(list(Location = c("", "Urban", "Urban", "Rural", "City",
"Rural", "", "Urban", "Rural", "City"), State = c("Ohio", "Ohio",
"Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky",
"Indiana"), County = c("Miami", "Miami", "Miami", "Clark", "Shelby",
"Clark", "Miami", "Miami", "Clark", "Shelby"), Job = c("Data",
"Business", "Data", "Data", "Business", "Data", "Data", "Business",
"Data", "Business"), Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish",
"Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")), row.names = c(NA,
-10L), class = "data.frame")
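One caveat about the code above: Freq is computed with n() after the data has already been summarised to one row per pet, so it counts the distinct pets in each group rather than the original rows. The two numbers happen to coincide for every group in this sample. The sketch below is one way to keep them separate, assuming Frequency is meant to be the number of original rows per group; the names freq, pets and result are illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Location = c("", "Urban", "Urban", "Rural", "City",
               "Rural", "", "Urban", "Rural", "City"),
  State = c("Ohio", "Ohio", "Ohio", "Kentucky", "Indiana",
            "Kentucky", "Ohio", "Ohio", "Kentucky", "Indiana"),
  County = c("Miami", "Miami", "Miami", "Clark", "Shelby",
             "Clark", "Miami", "Miami", "Clark", "Shelby"),
  Job = c("Data", "Business", "Data", "Data", "Business",
          "Data", "Data", "Business", "Data", "Business"),
  Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", "Dog",
          "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat"),
  stringsAsFactors = FALSE
)

# Rows per group, counted before the Pet column is split
freq <- df %>%
  count(Location, State, County, Job, name = "Frequency")

# Occurrences of each pet per group, spread to one column per pet
pets <- df %>%
  separate_rows(Pet, sep = ",\\s*") %>%
  count(Location, State, County, Job, Pet) %>%
  pivot_wider(names_from = Pet, values_from = n,
              values_fill = 0, names_prefix = "Pet:")

result <- left_join(freq, pets,
                    by = c("Location", "State", "County", "Job"))
result
```

Counting rows and counting pets in two separate steps keeps Frequency correct even when a group has, say, five rows that all list the same single pet.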

Obtain a unique ID from a different dataframe

I have this dataframe:
First.Name Last.Name Country Unit Hospital
John Mars UK Sales South
John Mars UK Sales South
John Mars UK Sales South
Lisa Smith USA HHRR North
Lisa Smith USA HHRR North
and this other:
First.Name Last.Name ID
John Valjean 1254
Peter Smith 1255
Frank Mars 1256
Marie Valjean 1257
Lisa Smith 1258
John Mars 1259
and I would like to merge them or paste them together to have:
I tried x = merge(df1, df2, by.y=c('Last.Name','First.Name')), but it doesn't seem to work. I also tried x = df1[c(df1$Last.Name, df1$First.Name) %in% c(df2$Last.Name, df2$First.Name),], and that doesn't work either.
When using merge, you have to be careful with its arguments, especially by, by.x, by.y, all, all.x and all.y. Each of these arguments is described in the function's help page (?merge).
Based on this, try out:
merge(df1, df2, by = c('First.Name', 'Last.Name')) # see @Sotos's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
merge(df1, df2, by.x = c('Last.Name','First.Name'),
by.y = c('Last.Name','First.Name')) # in your code, you set by.y but not by.x
# output
Last.Name First.Name Country Unit Hospital ID
1 Mars John UK Sales South 1259
2 Mars John UK Sales South 1259
3 Mars John UK Sales South 1259
4 Smith Lisa USA HHRR North 1258
5 Smith Lisa USA HHRR North 1258
# by in dplyr::left_join() works like by in merge()
dplyr::left_join(df1, df2, by = c('First.Name', 'Last.Name')) # see @tmfmnk's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
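The all.x argument mentioned above only matters when some rows of the first table have no counterpart in the second. A small sketch with made-up rows (Ann Lee is hypothetical and not part of the original data; the tables are named a and b to avoid clashing with df1 and df2):

```r
a <- data.frame(
  First.Name = c("John", "Lisa", "Ann"),
  Last.Name = c("Mars", "Smith", "Lee"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  First.Name = c("John", "Lisa", "Frank"),
  Last.Name = c("Mars", "Smith", "Mars"),
  ID = c(1259L, 1258L, 1256L),
  stringsAsFactors = FALSE
)

# Default: only rows whose key appears in both tables are kept
inner <- merge(a, b, by = c("First.Name", "Last.Name"))

# all.x = TRUE: every row of `a` is kept, with NA where `b` has no match
left <- merge(a, b, by = c("First.Name", "Last.Name"), all.x = TRUE)

nrow(inner)
left
```

With all.x = TRUE the result corresponds to what dplyr::left_join() returns, apart from row order.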
Data
df1 <- structure(list(First.Name = c("John", "John", "John", "Lisa",
"Lisa"), Last.Name = c("Mars", "Mars", "Mars", "Smith", "Smith"
), Country = c("UK", "UK", "UK", "USA", "USA"), Unit = c("Sales",
"Sales", "Sales", "HHRR", "HHRR"), Hospital = c("South", "South",
"South", "North", "North")), .Names = c("First.Name", "Last.Name",
"Country", "Unit", "Hospital"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(First.Name = c("John", "Peter", "Frank", "Marie",
"Lisa", "John"), Last.Name = c("Valjean", "Smith", "Mars", "Valjean",
"Smith", "Mars"), ID = 1254:1259), .Names = c("First.Name", "Last.Name",
"ID"), class = "data.frame", row.names = c(NA, -6L))

Find discrepancies between two tables

I'm working with R from a SAS/SQL background, and am trying to write code to take two tables, compare them, and provide a list of the discrepancies. This code would be used repeatedly for many different sets of tables, so I need to avoid hardcoding.
I'm working from "Identifying specific differences between two data sets in R", but it doesn't get me all the way there.
Example Data, using the combination of LastName/FirstName (which is unique) as a key --
Dataset One --
Last_Name First_Name Street_Address ZIP VisitCount
Doe John 1234 Main St 12345 20
Doe Jane 4321 Tower St 54321 10
Don Bob 771 North Ave 23232 5
Smith Mike 732 South Blvd. 77777 3
Dataset Two --
Last_Name First_Name Street_Address ZIP VisitCount
Doe John 1234 Main St 12345 20
Doe Jane 4111 Tower St 32132 17
Donn Bob 771 North Ave 11111 5
Desired Output --
LastName FirstName VarName TableOne TableTwo
Doe Jane StreetAddress 4321 Tower St 4111 Tower St
Doe Jane Zip 23232 32132
Doe Jane VisitCount 5 17
Note that this output ignores records where I don't have the same ID in both tables (for instance, because Bob's last name is "Don" in one table, and "Donn" in another table, we ignore that record entirely).
I've explored doing this by applying the melt function on both datasets, and then comparing them, but the size of the data I'm working with suggests that wouldn't be practical. In SAS, I used Proc Compare for this kind of work, but I haven't found an exact equivalent in R.
Here is a solution based on data.table:
library(data.table)
# Convert into data.table, melt
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]
setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]
# Set keys for merging
setkey(d1,Last_Name,First_Name,VarName)
# Join on the keys, then keep only the rows where the two values differ
d1[d2,nomatch=0][TableOne!=TableTwo]
# Last_Name First_Name VarName TableOne TableTwo
# 1: Doe Jane Street_Address 4321 Tower St 4111 Tower St
# 2: Doe Jane ZIP 54321 32132
# 3: Doe Jane VisitCount 10 17
where input data sets are:
# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John",
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St",
"771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L,
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name",
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John",
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St",
"771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L,
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address",
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
dplyr and tidyr work well here. First, a slightly reduced dataset:
dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
First_Name = c('John', 'Jane', 'Bob', 'Mike'),
ZIP = c(12345, 54321, 23232, 77777),
VisitCount = c(20, 10, 5, 3),
stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
First_Name = c('John', 'Jane', 'Bob'),
ZIP = c(12345, 32132, 11111),
VisitCount = c(20, 17, 5),
stringsAsFactors = FALSE)
(Sorry, I didn't want to type it all in. If it's important, please provide a reproducible example with well-defined data structures.)
Additionally, it looks like your "desired output" is a little off with Jane Doe's ZIP and VisitCount.
Your thought to melt them works well:
library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
## Last_Name First_Name key value
## 1 Doe John ZIP 12345
## 2 Doe Jane ZIP 54321
## 3 Don Bob ZIP 23232
## 4 Smith Mike ZIP 77777
## 5 Doe John VisitCount 20
## 6 Doe Jane VisitCount 10
From here, it's deceptively simple:
dat1g %>%
inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
filter(value.x != value.y)
## Last_Name First_Name key value.x value.y
## 1 Doe Jane ZIP 54321 32132
## 2 Doe Jane VisitCount 10 17
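The same approach can be written with pivot_longer(), which tidyr now recommends over gather(). The sketch below uses the full data from the question, including Street_Address, so the value columns are converted to character first to let text and numbers share one column. The helper to_long and the column names VarName, TableOne and TableTwo are illustrative choices that follow the desired output.

```r
library(dplyr)
library(tidyr)

d1 <- data.frame(
  Last_Name = c("Doe", "Doe", "Don", "Smith"),
  First_Name = c("John", "Jane", "Bob", "Mike"),
  Street_Address = c("1234 Main St", "4321 Tower St",
                     "771 North Ave", "732 South Blvd."),
  ZIP = c(12345L, 54321L, 23232L, 77777L),
  VisitCount = c(20L, 10L, 5L, 3L),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  Last_Name = c("Doe", "Doe", "Donn"),
  First_Name = c("John", "Jane", "Bob"),
  Street_Address = c("1234 Main St", "4111 Tower St", "771 North Ave"),
  ZIP = c(12345L, 32132L, 11111L),
  VisitCount = c(20L, 17L, 5L),
  stringsAsFactors = FALSE
)

keys <- c("Last_Name", "First_Name")

# Reshape to one row per key and variable, with all values as character
to_long <- function(df, value_name) {
  df %>%
    mutate(across(-all_of(keys), as.character)) %>%
    pivot_longer(-all_of(keys), names_to = "VarName", values_to = value_name)
}

# Keep keys present in both tables, then only the variables that differ
diffs <- inner_join(to_long(d1, "TableOne"), to_long(d2, "TableTwo"),
                    by = c(keys, "VarName")) %>%
  filter(TableOne != TableTwo)

diffs
```

Only keys has to change to reuse this on another pair of tables, which addresses the requirement to avoid hardcoding column names.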
The dataCompareR package aims to solve this exact problem. The vignette for the package includes some simple examples, and I've used this package to solve the original problem below.
Disclaimer: I was involved with creating this package.
library(dataCompareR)
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))
print(compd1d2)
All columns were compared, 3 row(s) were dropped from comparison
There are 3 mismatched variables:
First and last 5 observations for the 3 mismatched variables
FIRST_NAME LAST_NAME valueA valueB variable typeA typeB diffAB
1 Jane Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character
2 Jane Doe 10 17 VISITCOUNT integer integer -7
3 Jane Doe 54321 32132 ZIP integer integer 22189
To get a more detailed and pretty summary, the user can run
summary(compd1d2)
The use of FIRST_NAME and LAST_NAME as the 'join' between the two tables is controlled by the keys = argument to the rCompare function. In this case, any rows that do not match on these two variables are dropped from the comparison, but you can get more detailed output on the comparison performed by using summary().
