test if words are in a string (grepl, fuzzyjoin?) - r

I need to do a match and join on two data frames if the string from two columns of one data frame are contained in the string of a column from a second data frame.
Example dataframe:
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)
So basically, df1 has fairly orderly first and last names of people; df2 has names that may be organized as "First Name, Last Name", "Last Name First Name", "First Name MI Last Name", or something else entirely that also contains the name. I need the ID column from df2. So I want to run code that checks whether df1$First and df1$Last both appear somewhere in the string of df2$Name and, if they do, pulls df2$ID and joins it to df1.
My R guru told me to use fuzzy_left_join from the fuzzyjoin package:
fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")
but it gives me an error where the argument is not logical; and I can't figure out how to rewrite it to do what I want; the documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than df1$First and df1$Last. I think I might be able to use the grepl, but not sure how based on examples I've seen. Any advice?

The documentation says that match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." It's not TRUE or FALSE, it's a function that returns TRUE or FALSE. stringr::str_detect(string, pattern) does return TRUE or FALSE as required, but it takes the string to search first and the pattern second, so we switch your order and put df2 (which holds Name) on the left. Listing "Name" twice in by means a row only matches when both First and Last are found in Name.
fuzzyjoin::fuzzy_left_join(
  df2, df1,
  by = c("Name" = "First", "Name" = "Last"),
  match_fun = stringr::str_detect
)
#                                  Name  ID First    Last
# 1                       mr john smith ID1  john   smith
# 2                                     ID2  <NA>    <NA>
# 3                     timothy t mcgee ID3  <NA>    <NA>
# 4                         dinnozo tom ID4   tom dinnozo
# 5                          jane l doe ID5  jane     doe
# 6                         jimmy mcgee ID6 jimmy   mcgee
# 7 leah elizabeth arthur palmer and co ID7  leah  palmer
# 8                jerry bishop the cat ID8 jerry  bishop
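Since the question also asks about grepl, here is a minimal base R sketch of the same idea that attaches ID to df1 directly. The helper find_id is hypothetical (not part of any package), and the sketch assumes plain substring matching is acceptable (so "tom" would also match inside "tomas") and that the first matching row wins.

```r
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last  <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name  <- c("mr john smith", "", "timothy t mcgee", "dinnozo tom", "jane l doe",
           "jimmy mcgee", "leah elizabeth arthur palmer and co",
           "jerry bishop the cat")
ID    <- paste0("ID", 1:8)
df1 <- data.frame(First, Last, stringsAsFactors = FALSE)
df2 <- data.frame(Name, ID, stringsAsFactors = FALSE)

# Hypothetical helper: ID of the first Name containing both words, else NA.
# fixed = TRUE treats the words as literal text rather than regular expressions.
find_id <- function(first, last, names, ids) {
  hit <- grepl(first, names, fixed = TRUE) & grepl(last, names, fixed = TRUE)
  if (any(hit)) ids[which(hit)[1]] else NA_character_
}

df1$ID <- mapply(find_id, df1$First, df1$Last,
                 MoreArgs = list(names = df2$Name, ids = df2$ID),
                 USE.NAMES = FALSE)
df1
```

Rows of df1 with no match (here matt gibbs and peter lane) keep NA in the ID column.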

Related

In R, how can I find rows with any values included in a column

I have two dataframe:
df1: Names of staff in my organization.
df2: Names of staff in 10 different organizations
I would like to find the people listed in df1 within df2. In particular, I would like to add a variable showing whether each name in df2 also appears in df1 (yes: 1, no: 0).
How should I code this?
Thanks
You can try something like this: use data.table to check for matches between df1 and df2 on the staff_names column.
library(data.table)
# Manually create data tables
df1 <- data.table(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"))
df2 <- data.table(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"))
Code:
df3 <- df1[df2, on=c(staff_names="staff_names"), overlap:="1"]
df3[is.na(df3)] <- 0
#> staff_names overlap
#> 1: John Appleseed 1
#> 2: Daniel Lewis 0
#> 3: Todd Smith 0
Created on 2020-08-08 by the reprex package (v0.3.0)
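If the indicator should live on df2 instead, as the question describes, a base R sketch with %in% may be enough. It assumes exact, case-sensitive matches on the full name string, which is an assumption about the data rather than something stated in the question.

```r
df1 <- data.frame(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"),
                  stringsAsFactors = FALSE)

# 1 if the df2 name also appears in df1, 0 otherwise
df2$overlap <- as.integer(df2$staff_names %in% df1$staff_names)
df2
```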

How to update name based on other column's condition (Cleaning Data)

I have a df below
df <- data.frame(LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"))
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
Some of the first names are not consistent. I want to find and replace them, but when I put the code in a function, it doesn't work. Also, my data is big: there are hundreds of names, so is there a better way to do it?
My code works well on its own (not in a function), but I failed to find a way to do it when I have 100 names to find and replace. I found a reference here, but it does not resolve my problem. Any suggestions would be appreciated.
fil_name <- function(last,first,alternative){
df %>%
mutate(FIRSTNAME = ifelse(LASTNAME == "last" & FIRSTNAME == "first", "alternative", FIRSTNAME))
}
fil_name(Robinson,Dave,David)
Expected output:
LASTNAME FIRSTNAME
1 Robinson David
2 Anderson Adien
3 Beckham Joana
4 Wickham Kesley
5 Carlos Anberto
6 Robinson David
7 Beckham Joana
8 Anderson Adien
9 Carlos Anberto
We can convert the unquoted arguments to character strings inside the function, and it should work:
library(dplyr)
fil_name <- function(df, last, first, alternative) {
  # capture the bare names (e.g. Robinson) and turn them into strings
  last <- rlang::as_string(rlang::ensym(last))
  first <- rlang::as_string(rlang::ensym(first))
  alternative <- rlang::as_string(rlang::ensym(alternative))
  df %>%
    dplyr::mutate(FIRSTNAME = dplyr::case_when(
      LASTNAME == last & FIRSTNAME == first ~ alternative,
      TRUE ~ FIRSTNAME
    ))
}
fil_name(df, Robinson, Dave, David)
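For the "100 names" part of the question, one possible sketch is to keep the corrections as data and loop over them in base R. The fixes table and its column names are hypothetical; the pairs are the ones implied by the expected output above.

```r
df <- data.frame(
  LASTNAME  = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos",
                "Robinson", "Beckham", "Anderson", "Carlos"),
  FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto",
                "Dave", "Joana", "Adien", "An"),
  stringsAsFactors = FALSE
)

# Hypothetical corrections table: one row per (last, first) -> alternative
fixes <- data.frame(
  last        = c("Robinson", "Anderson", "Beckham", "Carlos"),
  first       = c("Dave", "Adi", "Joan", "An"),
  alternative = c("David", "Adien", "Joana", "Anberto"),
  stringsAsFactors = FALSE
)

# Apply each correction to the rows matching both last and first name
for (i in seq_len(nrow(fixes))) {
  hit <- df$LASTNAME == fixes$last[i] & df$FIRSTNAME == fixes$first[i]
  df$FIRSTNAME[hit] <- fixes$alternative[i]
}
df
```

Matching on both LASTNAME and FIRSTNAME keeps a correction for one family from touching an identical first name in another.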
Another approach is to create a separate data frame including the FIRSTNAME alternative name pairings, merge it into the original data, and update FIRSTNAME for those rows where ALTNAME is not NA.
This allows one to update the data with a vectorized process, rather than changing the names one by one.
# create data frame with a column to maintain original sort order
df <- data.frame(obs = 1:9,
                 LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
                 FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"),
                 stringsAsFactors = FALSE)
# create firstname / altname pairs
altnames <- data.frame(FIRSTNAME = c("Dave","Adi","Joan","An"),
                       ALTNAME = c("David","Adien","Joana","Anberto"),
                       stringsAsFactors = FALSE)
# merge by firstname, keeping all rows from original data frame
combined <- merge(df,altnames,by="FIRSTNAME",all.x=TRUE)
# update rows where ALTNAME is not NA
combined[!is.na(combined$ALTNAME),"FIRSTNAME"] <- combined[!is.na(combined$ALTNAME),"ALTNAME"]
# print the result, ordered by sequence in original data frame
combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
...and the output:
  LASTNAME FIRSTNAME
6 Robinson     David
1 Anderson     Adien
7  Beckham     Joana
9  Wickham    Kesley
4   Carlos   Anberto
5 Robinson     David
8  Beckham     Joana
2 Anderson     Adien
3   Carlos   Anberto
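As a variation on the lookup-table idea, match() can do the same update without merging, so the original row order is kept and no obs column is needed. This is only a sketch under the same assumption as the merge version, namely that a first name alone identifies the correction.

```r
df <- data.frame(
  LASTNAME  = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos",
                "Robinson", "Beckham", "Anderson", "Carlos"),
  FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto",
                "Dave", "Joana", "Adien", "An"),
  stringsAsFactors = FALSE
)
altnames <- data.frame(FIRSTNAME = c("Dave", "Adi", "Joan", "An"),
                       ALTNAME   = c("David", "Adien", "Joana", "Anberto"),
                       stringsAsFactors = FALSE)

# Position of each first name in the lookup table (NA when there is no entry)
idx <- match(df$FIRSTNAME, altnames$FIRSTNAME)

# Keep the original name where there is no entry, otherwise use the alternative
df$FIRSTNAME <- ifelse(is.na(idx), df$FIRSTNAME, altnames$ALTNAME[idx])
df
```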

Create a two new column by mapping multiple column

How can I match columns in R and extract a value? As an example: I want to match dataframe_one with dataframe_two on the Name and City columns and then return the output with two additional columns, temp and ID. If a row matches, it should return TRUE and the ID too.
My input is:
dataframe_one
Name   City
Sarah  ON
David  BC
John   KN
Diana  AN
Judy   ON
dataframe_two
Name   City  ID
Dave   ON    1092
Diana  AN    2314
Judy   ON    1290
Ari    KN    1450
Shanu  MN    1983
I want the output to be
Name   City  temp   ID
Sarah  ON    FALSE  NA
David  BC    TRUE   1450
John   KN    TRUE   1983
Diana  AN    FALSE  NA
Judy   ON    FALSE  NA
One thing that makes answering questions of this type easier is if you at least put the data frames in R, like so:
df1 <- data.frame(stringsAsFactors=FALSE,
  Name = c("Sarah", "David", "John", "Diana", "Judy"),
  City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors=FALSE,
  Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
  City = c("ON", "AN", "ON", "KN", "MN"),
  ID = c(1092, 2314, 1290, 1450, 1983)
)
Then search existing Stack Overflow questions that have answered similar questions (e.g. How to join (merge) data frames (inner, outer, left, right)).
Given that neither of your original dfs contain the column "Temp" you would need to create it in the joined (merged) data frame.
We'll be able to help a lot more if you at least make a start and then the community will help you troubleshoot.
That being said, I can't for the life of me figure out how you would generate your output df from the inputs.
Using biomiha's code to generate df1 and df2:
df1 <- data.frame(stringsAsFactors=FALSE,
  Name = c("Sarah", "David", "John", "Diana", "Judy"),
  City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors=FALSE,
  Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
  City = c("ON", "AN", "ON", "KN", "MN"),
  ID = c(1092, 2314, 1290, 1450, 1983)
)
you may try:
library(dplyr)
df1 %>%
  left_join(df2, by = c("Name" = "Name", "City" = "City")) %>%
  mutate(temp = !is.na(ID))
which gives the output:
   Name City   ID  temp
1 Sarah   ON   NA FALSE
2 David   BC   NA FALSE
3  John   KN   NA FALSE
4 Diana   AN 2314  TRUE
5  Judy   ON 1290  TRUE
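The same left join can be sketched in base R with merge(), in case dplyr is not available. Note that merge() sorts the result by the key columns by default, so the row order differs from the dplyr output above.

```r
df1 <- data.frame(stringsAsFactors = FALSE,
  Name = c("Sarah", "David", "John", "Diana", "Judy"),
  City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors = FALSE,
  Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
  City = c("ON", "AN", "ON", "KN", "MN"),
  ID = c(1092, 2314, 1290, 1450, 1983)
)

# all.x = TRUE keeps every row of df1, filling ID with NA where nothing matches
out <- merge(df1, df2, by = c("Name", "City"), all.x = TRUE)

# temp is TRUE exactly where a matching ID was found
out$temp <- !is.na(out$ID)
out
```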

2 data frames, 2 different styles of naming the same person, how to make them similar?

I have two datasets. In one dataset the first name, second name and last name are stored in separate variables, for instance:
ID  firstname  second name  last name
12  john       arnold       doe
14  jerry      k            wildlife
In the second one they are stored in a single variable:
ID  name
12  john arnold doe
14  jerry k wildlife
Now I want to be able to find the people from dataset two (full names) in dataset one (separate names).
A couple of problems that I have are:
not all names occur in both datasets,
not all names have a middle initial,
not all names have IDs, so I cannot search on that alone either.
So the question is: could someone suggest a command to split the names into first/second/last name? Secondly, would someone know how to search for these names with a simple command, something like:
df <- df.old[grepl("firstname", df.old$firstname, ignore.case = TRUE) & grepl("secondname", df.old$secondname, ignore.case = TRUE) & grepl("lastname", df.old$lastname, ignore.case = TRUE), ]
Any suggestions?
Dirk
You can use separate from the tidyr package.
library(tidyr)
separate(df2, name, into = c("firstname", "secondname", "last name"), sep = " ")
#   ID firstname secondname last name
# 1 12      john     arnold       doe
# 2 14     jerry          k  wildlife
For missing middle names, if it is acceptable for the last name to land in the middle-name column:
df2 <- data.frame(ID=c(12, 14), name=c("john arnold doe", "jerry wildlife"))
library(splitstackshape)
cSplit(df2, 2, sep = " ") # this reads "split 2nd column by white space"
#    ID name_1   name_2 name_3
# 1: 12   john   arnold    doe
# 2: 14  jerry wildlife     NA
name_1 corresponds to the first name and name_2 to the middle name; for a two-word name such as "jerry wildlife", the last name ends up in name_2 and name_3 is NA.
Try this:
Sample data:
df2 <- data.frame(ID=c(12, 14), name=c("john arnold doe", "jerry k wildlife"))
Split names by space:
df2 <- cbind(df2$ID, data.frame(do.call(rbind, strsplit(as.character(df2$name), " "))))
names(df2) <- c("ID", "firstname", "second name", "last name")
df2
Then join the two data frames by first and last name, or by ID.
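When some names have no middle part, a fixed three-way split puts the last name in the wrong column. A possible base R sketch is to treat the first word as the first name, the last word as the last name, and whatever sits between as the second name. The helper split_name is hypothetical and assumes every name has at least two words.

```r
# Hypothetical helper: first word, last word, and anything in between
split_name <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  n <- length(parts)
  data.frame(
    firstname  = parts[1],
    secondname = if (n > 2) paste(parts[2:(n - 1)], collapse = " ") else NA_character_,
    lastname   = parts[n],
    stringsAsFactors = FALSE
  )
}

df2 <- data.frame(ID = c(12, 14),
                  name = c("john arnold doe", "jerry wildlife"),
                  stringsAsFactors = FALSE)

# Split every name and bind the pieces back onto the original rows
df2 <- cbind(df2, do.call(rbind, lapply(df2$name, split_name)))
df2
```

With the names split this way, the two datasets could then be merged on firstname and lastname, lower-casing both sides first if the sources differ in case.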

Replacing vector values in R based on a list (hash)

I have a dataframe, one column of which is names. In a later phase of analysis, I will need to merge with other data by this name column, and there are a few names which vary by source. I'd like to clean up my names using a hash (map) of names->cleaned names. I've found several references to using R lists as hashes (e.g., this question on SE), but I can't figure out how to extract values for keys in a vector only as they occur. So for example,
> players=data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
> xref = c("Bob"="Robert", "Fred Jr." = "Fred")
> players$names
[1] Joe John Bob
Levels: Bob Joe John
Whereas players$names gives a vector of names from the original frame, I need the same vector, only with any values that occur in xref replaced with their equivalent (lookup) values; my desired result is the vector Joe John Robert.
The closest I've come is:
> players$names %in% names(xref)
[1] FALSE FALSE TRUE
Which correctly indicates that only "Bob" in players$names exists in the "keys" (names) of xref, but I can't figure out how to extract the value for that name and combine it with the other names in the vector that don't belong to xref as needed.
note: in case it's not completely clear, I'm pretty new to R, so if I'm approaching this in the wrong fashion, I'm happy to be corrected, but my core issue is essentially as stated: I need to clean up some incoming data within R by replacing some incoming values with known replacements and keeping all other values; further, the map of original->replacement should be stored as data (like xref), not as code.
Updated answer: ifelse
ifelse is an even more straightforward solution, in the case that xref is a named vector and not a list.
players <- data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), stringsAsFactors = FALSE)
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players$clean <- ifelse(is.na(xref[players$names]), players$names, xref[players$names])
players
Result
names scores clean
1 Joe 9.8 Joe
2 John 9.9 John
3 Bob 8.8 Robert
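The same lookup can be wrapped in a small function that also accepts a factor column, which is what the question's players$names was. The name recode_with is hypothetical; the sketch assumes xref is a named character vector as above.

```r
# Hypothetical helper: replace values found in names(xref), keep the rest
recode_with <- function(x, xref) {
  x <- as.character(x)            # works for factor or character input
  hit <- x %in% names(xref)       # which values have a replacement
  x[hit] <- unname(xref[x[hit]])  # look those up by name
  x
}

xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players <- data.frame(names = factor(c("Joe", "John", "Bob")),
                      scores = c(9.8, 9.9, 8.8))
players$clean <- recode_with(players$names, xref)
players
```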
Previous answer: sapply
If xref is a list, then the sapply function can be used to do conditional look-ups:
players <- data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), stringsAsFactors = FALSE)
xref <- list("Bob" = "Robert", "Fred Jr." = "Fred")
# xref[[x]] extracts the replacement string itself; xref[x] would return a one-element list
players$clean <- sapply(players$names, function(x) if (x %in% names(xref)) xref[[x]] else x, USE.NAMES = FALSE)
players
Result
  names scores  clean
1   Joe    9.8    Joe
2  John    9.9   John
3   Bob    8.8 Robert
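You can replace the factor levels with the desired text. This assumes players$names is a factor, which was the data.frame default before R 4.0.0; on newer versions, create the data frame with stringsAsFactors = TRUE. Here's an example which loops through xref and does the replacement:
for (n in names(xref)) {
  levels(players$names)[levels(players$names) == n ] <- xref[n]
}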
players
## names scores
## 1 Joe 9.8
## 2 John 9.9
## 3 Robert 8.8
Another example of replacing the factor levels.
allnames = levels(players$names)
levels(players$names)[ !is.na(xref[allnames]) ] = na.omit(xref[allnames])
players
# names scores
# 1 Joe 9.8
# 2 John 9.9
# 3 Robert 8.8
If you get into really big data sets, you might take a look at the merge function or the data.table package. Here is a data.table example of a join.
library(data.table)
players=data.table(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), key="names")
nms = data.table(names=names(xref),names2=xref, key="names")
out = nms[players]
out[is.na(names2),names2:=names]
out
# names names2 scores
# 1: Bob Robert 8.8
# 2: Joe Joe 9.8
# 3: John John 9.9
Here is a similar example with the merge function.
players=data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
nms = data.frame(names=names(xref),names2=xref,row.names=NULL)
merge(nms,players,all.y=TRUE)
# names names2 scores
# 1 Bob Robert 8.8
# 2 Joe <NA> 9.8
# 3 John <NA> 9.9
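The merge result above still has NA in names2 for players without a replacement. One way to finish the job, sketched here with stringsAsFactors = FALSE so the columns are plain character vectors, is to fall back to the original name:

```r
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players <- data.frame(names = c("Joe", "John", "Bob"),
                      scores = c(9.8, 9.9, 8.8),
                      stringsAsFactors = FALSE)
nms <- data.frame(names = names(xref), names2 = unname(xref),
                  stringsAsFactors = FALSE)

# all.y = TRUE keeps every player; unmatched players get NA in names2
out <- merge(nms, players, all.y = TRUE)

# Use the replacement where there is one, otherwise keep the original name
out$names2 <- ifelse(is.na(out$names2), out$names, out$names2)
out
```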
