Running a string against multiple match dataframes - r

I have a dataset of text strings that look something like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
First thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would just become "Jessica Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston"
After that, I then want to evaluate the first word of each string to see if it matches to anything in the firstname dataframe. If it does match, it creates a new column called in the final table called firstname with the result. If it doesn't match, the result is simply "N/A".
After that, I'd like to then evaluate the remaining words against the lastname dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.
The final dataframe should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What would be the most effective way to go about doing this?

We may use str_extract_all on the substring of 'string2' with pattern as the firstnames, lastnames vector converted to a single string with | (OR as delimiter) and return a list of vectors, then use unnest to convert the list to vector
library(dplyr)
library(stringr)
library(tidyr)
strings %>%
mutate(string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
firstname = str_extract_all(string2,
str_c(firstname$firstnames, collapse = "|")),
lastname =str_extract_all(string2,
str_c(lastname$lastnames, collapse = "|")) ) %>%
unnest(where(is.list), keep_empty = TRUE) %>%
select(-string2)%>%
mutate(lastname = case_when(complete.cases(firstname) ~ lastname))
-output
# A tibble: 9 × 3
string firstname lastname
<chr> <chr> <chr>
1 "Jennifer Rae Hancock Brown" Jennifer Hancock
2 "Lisa Smith Houston Blogger" Lisa Smith
3 "Lisa Smith Houston Blogger" Lisa Houston
4 "Tina Fay Las Cruces" Tina Fay
5 "\t\nJamie Tucker Style Expert" Jamie Tucker
6 "Jessica Wright Htx Satx" Jessica Wright
7 "Julie Green Lifestyle Blogger" Julie Green
8 "Mike S Thomas Football Player" Mike Thomas
9 "Tiny Fitness Houston Studio" <NA> <NA>
OP's expected
> final
string firstname lastname
1 Jennifer Rae Hancock Brown Jennifer Hancock
2 Lisa Smith Houston Blogger Lisa Smith
3 Lisa Smith Houston Blogger Lisa Houston
4 Tina Fay Las Cruces Tina Fay
5 \t\nJamie Tucker Style Expert Jamie Tucker
6 Jessica Wright Htx Satx Jessica Wright
7 Julie Green Lifestyle Blogger Julie Green
8 Mike S Thomas Football Player Mike Thomas
9 Tiny George Fitness Houston Studio N/A N/A

Related

Extracting and evaluating words in a text string against another dataset

I have two sets of data that I will be evaluating against one another. A heavily reduced example looks like this:
library(dplyr)
library(tidyverse)
library(sqldf)
library(dbplyr)
library(httr)
library(purrr)
library(jsonlite)
library(magrittr)
library(tidyr)
library(tidytext)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Craig Mills"), biography = c("Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!")), class = "data.frame", row.names = c(NA,
-3L))
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
I am trying to create a match against the contents of the biography text string in people_records_ex against college_name in college_records_ex so the final output will look like this:
final_records_ex <- structure(list(id = c(123L, 456L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Jeff Smith", "Craig Mills"), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
Or to provide a more visual example of the final output I'm expecting:
But when I run the following code, it produces zero results, which is not correct:
college_extract <- people_records_ex %>%
left_join(college_records_ex, by = c("biography" = "college_name")) %>%
filter(!is.na(college_state)) %>% dplyr::select(id, name, college_name, college_city, college_state) %>% distinct()
What am I doing incorrectly and what would the correct version look like?
Here's a very tidy and straightforward solution with fuzzy_join:
library(fuzzyjoin)
library(stringr)
library(dplyr)
fuzzy_join(
people_records_ex, college_records_ex,
by = c("biography" = "college_name"),
match_fun = str_detect,
mode = "left"
) %>%
select(-biography)
id name college_id college_name college_city college_state
1 123 Anna Wilson 234 Ohio State University Columbus OH
2 456 Jeff Smith 567 Stanford Stanford CA
3 456 Jeff Smith 891 William & Mary Williamsburg VA
4 789 Craig Mills 345 University of North Texas Denton TX
Assuming the college names in the biographies are spelled out exactly as they appear in the colleges table and the datasets are relatively small, all matches can be generated with a regex of all college names as follows
library(dplyr)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c(
"Anna Wilson",
"Jeff Smith", "Craig Mills"
), biography = c(
"Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!"
)), class = "data.frame", row.names = c(
NA,
-3L
)) %>% tibble::tibble()
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c(
"Ohio State University",
"Stanford", "William & Mary", "University of North Texas"
), college_city = c(
"Columbus",
"Stanford", "Williamsburg", "Denton"
), college_state = c(
"OH",
"CA", "VA", "TX"
)), class = "data.frame", row.names = c(NA, -4L)) %>%
tibble::tibble()
# join college names in a regex pattern
colleges_regex <- paste0(college_records_ex$college_name, collapse = "|")
colleges_regex
#> [1] "Ohio State University|Stanford|William & Mary|University of North Texas"
# match all against bio, giving a list-column of matches
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex))
#> # A tibble: 3 × 4
#> id name biography matches
#> <int> <chr> <chr> <list>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. <chr[…]>
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at … <chr[…]>
#> 3 789 Craig Mills University of North Texas Volleyball! <chr[…]>
# unnest the list column wider to give 1 row per person per match
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex)) %>%
tidyr::unnest_longer(matches)
#> # A tibble: 4 × 4
#> id name biography match…¹
#> <int> <chr> <chr> <chr>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. Ohio S…
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Stanfo…
#> 3 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Willia…
#> 4 789 Craig Mills University of North Texas Volleyball! Univer…
#> # … with abbreviated variable name ¹​matches[,1]
Created on 2022-10-26 with reprex v2.0.2
This may be joined back to the college table such that it is annotated with college info.
In base R you can do:
do.call(rbind, lapply(college_records_ex$college_name,
\(x) people_records_ex[grep(x, people_records_ex$biography),1:2])) |>
cbind(college_records_ex[-1])
This does some matching and I subsetted the first two columns which are the id and name, cbinding it with the second data.frame getting rid of the first column
id name college_name college_city college_state
1 123 Anna Wilson Ohio State University Columbus OH
2 456 Jeff Smith Stanford Stanford CA
21 456 Jeff Smith William & Mary Williamsburg VA
3 789 Craig Mills University of North Texas Denton TX

R ifelse with match

I have a dataframe 'df' that looks like this:
LatName
ComName
Todd Smith
Becky Jones
Becky Jones
Becky Jones
Rachel Adams
And another dataframe 'df2' that looks like this:
LatinName
CommonName
Brad Robbins
Becky Jones
Steve Reisen
Rachel Adams
Connor McDougal
Charlie Williams
If want to match values in ComName and CommonName and if they match fill in LatName with LatinName only if LatName is empty to begin with. If LatName isn't empty, then I want that entry left alone, so that the end result of df looks like this:
LatName
ComName
Todd Smith
Becky Jones
Brad Robbins
Becky Jones
Brad Robbins
Becky Jones
Steve Reisen
Rachel Adams
I left the blank row in there on purpose because some of my rows have nothing in them.
Thank you!
We could do a join and then coalesce
library(dplyr)
left_join(df1, df2, by = c("ComName" = "CommonName")) %>%
mutate(LatName = coalesce(na_if(LatName, ""), LatinName)
-output
LatName ComName
1 Todd Smith Becky Jones
2 Brad Robbins Becky Jones
3 Brad Robbins Becky Jones
4 Steve Reisen Rachel Adams
data
df1 <- structure(list(LatName = c("Todd Smith", NA, NA, NA),
ComName = c("Becky Jones",
"Becky Jones", "Becky Jones", "Rachel Adams")), class = "data.frame",
row.names = c(NA,
-4L))
df2 <- structure(list(LatinName = c("Brad Robbins", "Steve Reisen",
"Connor McDougal"), CommonName = c("Becky Jones", "Rachel Adams",
"Charlie Williams")), class = "data.frame", row.names = c(NA,
-3L))

Obtain a unique ID from a different dataframe

I have this dataframe:
First.Name Last.Name Country Unit Hospital
John Mars UK Sales South
John Mars UK Sales South
John Mars UK Sales South
Lisa Smith USA HHRR North
Lisa Smith USA HHRR North
and this other:
First.Name Last.Name ID
John Valjean 1254
Peter Smith 1255
Frank Mars 1256
Marie Valjean 1257
Lisa Smith 1258
John Mars 1259
and I would like to merge them or paste them together to have:
I tried with x = merge(df1, df2, by.y=c('Last.Name','First.Name') but it doesnt seem to work. also with x = df1[c(df1$Last.Name, df1$First.Name) %in% c(df2$Last.Name, df2$First.Name),] and it also doesnt work.
When using merge, hou have to be careful with its arguments, especially with by, by.x, by.y, all, all.x and all.y. The description of each of these arguments is available here
Based on this, try out:
merge(df1, df2, by = c('First.Name', 'Last.Name')) # see #Sotos's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
merge(df1, df2, by.x = c('Last.Name','First.Name'),
by.y = c('Last.Name','First.Name')) # in you code, you set by.y but not by.x
# output
Last.Name First.Name Country Unit Hospital ID
1 Mars John UK Sales South 1259
2 Mars John UK Sales South 1259
3 Mars John UK Sales South 1259
4 Smith Lisa USA HHRR North 1258
5 Smith Lisa USA HHRR North 1258
# by in dplyr::left_join() works like by in merge()
dplyr::left_join(df1, df2, by = c('First.Name', 'Last.Name')) # see #tmfmnk's comment
# output
First.Name Last.Name Country Unit Hospital ID
1 John Mars UK Sales South 1259
2 John Mars UK Sales South 1259
3 John Mars UK Sales South 1259
4 Lisa Smith USA HHRR North 1258
5 Lisa Smith USA HHRR North 1258
Data
df1 <- structure(list(First.Name = c("John", "John", "John", "Lisa",
"Lisa"), Last.Name = c("Mars", "Mars", "Mars", "Smith", "Smith"
), Country = c("UK", "UK", "UK", "USA", "USA"), Unit = c("Sales",
"Sales", "Sales", "HHRR", "HHRR"), Hospital = c("South", "South",
"South", "North", "North")), .Names = c("First.Name", "Last.Name",
"Country", "Unit", "Hospital"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(First.Name = c("John", "Peter", "Frank", "Marie",
"Lisa", "John"), Last.Name = c("Valjean", "Smith", "Mars", "Valjean",
"Smith", "Mars"), ID = 1254:1259), .Names = c("First.Name", "Last.Name",
"ID"), class = "data.frame", row.names = c(NA, -6L))

Filling in NAs in data in R by id

I am having a little problem "filling in gaps". It's not a missing data question, it's more about merging but it's not working great.
So, my data looks like this
id name region Company
1 John Smith West Walmart
1 John Smith West Amazon
1 John Smith
1 John Smith West P&G
2 Jane Smith South Apple
2 Jane Smith
3 Richard Burkett
3 Richard Burkett West Walmart
And so on.
What I want to do is fill in those gaps in the region variable by their id. So, id 1, John Smith, on the third row, should have West in the third column. Jane Smith's region should be filled in "South" where it is missing.
I've tried creating a separate dataset and then merging it based on id but it creates duplicate rows and basically increases the N by something like 14 times (no idea why).
region1<-subset(df1, df1$region=="DC"| df1$region=="Midwest"|df1$region=="Northeast"|df1$region=="South"|df1$region=="West")
region<-region1[,c(id","region")]
df2<-merge(df1, region, by="id")
I've checked the structure of the variables. Id variable is interval and region is a factor. I think there should be a super simple way to do this but I'm just not getting it. Any ideas?
Thank you in advance.
Here´s an R base solution. Suppose your data.frame is df
regions <- sapply(split(df$region, df$id), function(x) {
ind <- is.na(x);
x[ind] <- x[!ind][1];
x
})
df$region <- unlist(regions)
df
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West <NA>
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South <NA>
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West <NA>
I would use dplyr::arrange followed by tidyr::fill
library(dplyr)
library(tidyr)
data.frame(id=c(1,1,1,1,2,2,3,3),
name=c(rep("John Smith",4), rep("Jane Smith", 2), rep("Richard Burkett", 2)),
region=c("West", "West", NA, "West", "South",NA, "West", NA),
Company=c("Walmart","Amazon",NA,"P&G","Apple",NA,"Walmart",NA)) %>%
arrange(id, name) %>%
fill(region)
Results in:
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West NA
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South NA
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West NA
The solution which should work is group_by on id and then fill. Ideally the solution which should work in OP condition should cover in both direction.
library(tidyverse)
df %>% group_by(id) %>%
fill(region) %>%
fill(region, .direction = "up")
# id name region Company
# <int> <chr> <chr> <chr>
#1 1 John Smith West Walmart
#2 1 John Smith West Amazon
#3 1 John Smith West <NA>
#4 1 John Smith West P&G
#5 2 Jane Smith South Apple
#6 2 Jane Smith South <NA>
#7 3 Richard Burkett West Walmart
#8 3 Richard Burkett West <NA>
Data
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), name = c("John Smith",
"John Smith", "John Smith", "John Smith", "Jane Smith", "Jane Smith",
"Richard Burkett", "Richard Burkett"), region = c("West", "West",
NA, "West", "South", NA, "West", NA), Company = c("Walmart",
"Amazon", NA, "P&G", "Apple", NA, "Walmart", NA)), .Names = c("id",
"name", "region", "Company"), class = "data.frame", row.names = c(NA,
-8L))

Find discrepancies between two tables

I'm working with R from a SAS/SQL background, and am trying to write code to take two tables, compare them, and provide a list of the discrepancies. This code would be used repeatedly for many different sets of tables, so I need to avoid hardcoding.
I'm working with Identifying specific differences between two data sets in R , but it doesn't get me all the way there.
Example Data, using the combination of LastName/FirstName (which is unique) as a key --
Dataset One --
Last_Name First_Name Street_Address ZIP VisitCount
Doe John 1234 Main St 12345 20
Doe Jane 4321 Tower St 54321 10
Don Bob 771 North Ave 23232 5
Smith Mike 732 South Blvd. 77777 3
Dataset Two --
Last_Name First_Name Street_Address ZIP VisitCount
Doe John 1234 Main St 12345 20
Doe Jane 4111 Tower St 32132 17
Donn Bob 771 North Ave 11111 5
Desired Output --
LastName FirstName VarName TableOne TableTwo
Doe Jane StreetAddress 4321 Tower St 4111 Tower St
Doe Jane Zip 23232 32132
Doe Jane VisitCount 5 17
Note that this output ignores records where I don't have the same ID in both tables (for instance, because Bob's last name is "Don" in one table, and "Donn" in another table, we ignore that record entirely).
I've explored doing this by applying the melt function on both datasets, and then comparing them, but the size data I'm working with indicates that wouldn't be practical. In SAS, I used Proc Compare for this kind of work, but I haven't found an exact equivalent in R.
Here is a solution based on data.table:
library(data.table)
# Convert into data.table, melt
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]
setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]
# Set keys for merging
setkey(d1,Last_Name,First_Name,VarName)
# Merge, remove duplicates
d1[d2,nomatch=0][TableOne!=TableTwo]
# Last_Name First_Name VarName TableOne TableTwo
# 1: Doe Jane Street_Address 4321 Tower St 4111 Tower St
# 2: Doe Jane ZIP 54321 32132
# 3: Doe Jane VisitCount 10 17
where input data sets are:
# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John",
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St",
"771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L,
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name",
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John",
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St",
"771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L,
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address",
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
dplyr and tidyr work well here. First, a slightly reduced dataset:
dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
First_Name = c('John', 'Jane', 'Bob', 'Mike'),
ZIP = c(12345, 54321, 23232, 77777),
VisitCount = c(20, 10, 5, 3),
stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
First_Name = c('John', 'Jane', 'Bob'),
ZIP = c(12345, 32132, 11111),
VisitCount = c(20, 17, 5),
stringsAsFactors = FALSE)
(Sorry, I didn't want to type it all in. If it's important, please provide a reproducible example with well-defined data structures.)
Additionally, it looks like your "desired output" is a little off with Jane Doe's ZIP and VisitCount.
Your thought to melt them works well:
library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
## Last_Name First_Name key value
## 1 Doe John ZIP 12345
## 2 Doe Jane ZIP 54321
## 3 Don Bob ZIP 23232
## 4 Smith Mike ZIP 77777
## 5 Doe John VisitCount 20
## 6 Doe Jane VisitCount 10
From here, it's deceptively simple:
dat1g %>%
inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
filter(value.x != value.y)
## Last_Name First_Name key value.x value.y
## 1 Doe Jane ZIP 54321 32132
## 2 Doe Jane VisitCount 10 17
The dataCompareR package aims to solve this exact problem. The vignette for the package includes some simple examples, and I've used this package to solve the original problem below.
Disclaimer: I was involved with creating this package.
library(dataCompareR)
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))
print(compd1d2)
All columns were compared, 3 row(s) were dropped from comparison
There are 3 mismatched variables:
First and last 5 observations for the 3 mismatched variables
FIRST_NAME LAST_NAME valueA valueB variable typeA typeB diffAB
1 Jane Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character
2 Jane Doe 10 17 VISITCOUNT integer integer -7
3 Jane Doe 54321 32132 ZIP integer integer 22189
To get a more detailed and pretty summary, the user can run
summary(compd1d2)
The use of FIRST_NAME and LAST_NAME as the 'join' between the two tables is controlled by the keys = argument to the rCompare function. In this case any rows that do not match on these two variables are dropped from the comparison, but you can get a more detailed output on the comparison performed by using summary

Resources