Create two new columns by matching on multiple columns - r

How do I match columns in R and extract a value? As an example: I want to match dataframe_one with dataframe_two on the Name and City columns and return the output with two additional columns, temp and ID. If a row matches, temp should be TRUE and the matching ID should be returned too.
My input is:
dataframe_one
Name City
Sarah ON
David BC
John KN
Diana AN
Judy ON
dataframe_two
Name City ID
Dave ON 1092
Diana AN 2314
Judy ON 1290
Ari KN 1450
Shanu MN 1983
I want the output to be
Name City temp ID
Sarah ON FALSE NA
David BC TRUE 1450
John KN TRUE 1983
Diana AN FALSE NA
Judy ON FALSE NA

One thing that makes answering questions of this type easier is if you at least put the data frames in R, like so:
df1 <- data.frame(stringsAsFactors=FALSE,
Name = c("Sarah", "David", "John", "Diana", "Judy"),
City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors=FALSE,
Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
City = c("ON", "AN", "ON", "KN", "MN"),
ID = c(1092, 2314, 1290, 1450, 1983)
)
Then search existing Stack Overflow questions that have answered similar questions (e.g. How to join (merge) data frames (inner, outer, left, right)).
Given that neither of your original dfs contains the column "temp", you would need to create it in the joined (merged) data frame.
We'll be able to help a lot more if you at least make a start and then the community will help you troubleshoot.
That being said, I can't for the life of me figure out how you would generate your output df from the inputs.

Using biomiha's code to generate df1 and df2:
df1 <- data.frame(stringsAsFactors=FALSE,
Name = c("Sarah", "David", "John", "Diana", "Judy"),
City = c("ON", "BC", "KN", "AN", "ON")
)
df2 <- data.frame(stringsAsFactors=FALSE,
Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
City = c("ON", "AN", "ON", "KN", "MN"),
ID = c(1092, 2314, 1290, 1450, 1983)
)
you may try:
library(dplyr)
df1 %>%
left_join(df2, by = c("Name" = "Name", "City" = "City")) %>%
mutate(temp = !is.na(ID))
gives the output:
Name City ID temp
1 Sarah ON NA FALSE
2 David BC NA FALSE
3 John KN NA FALSE
4 Diana AN 2314 TRUE
5 Judy ON 1290 TRUE
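If you would rather not depend on dplyr, the same left join can be sketched in base R. This is only a sketch, assuming that "match" means identical Name and City, that each Name/City pair appears at most once in df2, and that ID is never NA in df2. The `row_id` column is a helper introduced here (not part of the original data) to restore df1's row order, because merge() sorts on the key columns.

```r
df1 <- data.frame(
  Name = c("Sarah", "David", "John", "Diana", "Judy"),
  City = c("ON", "BC", "KN", "AN", "ON"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("Dave", "Diana", "Judy", "Ari", "Shanyu"),
  City = c("ON", "AN", "ON", "KN", "MN"),
  ID = c(1092, 2314, 1290, 1450, 1983),
  stringsAsFactors = FALSE
)

# row_id is a hypothetical helper column, used only to restore df1's order
df1$row_id <- seq_len(nrow(df1))

# all.x = TRUE keeps every row of df1 (a left join); unmatched rows get ID = NA
res <- merge(df1, df2, by = c("Name", "City"), all.x = TRUE)
res <- res[order(res$row_id), ]

# temp is TRUE exactly where a matching row was found in df2
res$temp <- !is.na(res$ID)
res <- res[, c("Name", "City", "temp", "ID")]
rownames(res) <- NULL
res
```

The column order here follows the layout requested in the question (Name, City, temp, ID).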

Related

test if words are in a string (grepl, fuzzyjoin?)

I need to do a match and join on two data frames if the strings from two columns of one data frame are both contained in the string of a column from a second data frame.
Example dataframe:
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith","", "timothy t mcgee", "dinnozo tom", "jane l doe", "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last)
df2 <- data.frame(Name, ID)
So basically, I have df1, which has fairly orderly names of people split into first and last name; I have df2, which has names that may be organized as "First Name, Last Name", or "Last Name First Name", or "First Name MI Last Name", or something else entirely that also contains the name. I need the ID column from df2. So I want to run code to see if df1$First and df1$Last are both somewhere in the string of df2$Name, and if they are, have it pull and join df2$ID to df1.
My R guru told me to use fuzzy_left_join from the fuzzyjoin package:
fzjoin <- fuzzy_left_join(df1, df2, by = c("First" = "Name"), match_fun = "contains")
but it gives me an error saying the argument is not logical, and I can't figure out how to rewrite it to do what I want; the documentation says that match_fun should be TRUE or FALSE, but I don't know what to do with that. Also, it only matches on df1$First rather than on both df1$First and df1$Last. I think I might be able to use grepl, but I'm not sure how, based on the examples I've seen. Any advice?
The documentation says that match_fun should be a "Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match." It's not TRUE or FALSE, it's a function that returns TRUE or FALSE. If we switch your order, we can use stringr::str_detect, which does return TRUE or FALSE as required.
fuzzyjoin::fuzzy_left_join(
df2, df1,
by = c("Name" = "First", "Name" = "Last"),
match_fun = stringr::str_detect
)
# Name ID First Last
# 1 mr john smith ID1 john smith
# 2 ID2 <NA> <NA>
# 3 timothy t mcgee ID3 <NA> <NA>
# 4 dinnozo tom ID4 tom dinnozo
# 5 jane l doe ID5 jane doe
# 6 jimmy mcgee ID6 jimmy mcgee
# 7 leah elizabeth arthur palmer and co ID7 leah palmer
# 8 jerry bishop the cat ID8 jerry bishop
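Since the question also asks about grepl, here is a minimal base R sketch of that route. It is not the fuzzyjoin approach above: `find_id` is a helper name made up for this example, it returns only the first matching ID per person, and it does plain substring matching (so "tom" would also match "tommy").

```r
First <- c("john", "jane", "jimmy", "jerry", "matt", "tom", "peter", "leah")
Last <- c("smith", "doe", "mcgee", "bishop", "gibbs", "dinnozo", "lane", "palmer")
Name <- c("mr john smith", "", "timothy t mcgee", "dinnozo tom", "jane l doe",
          "jimmy mcgee", "leah elizabeth arthur palmer and co", "jerry bishop the cat")
ID <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8")
df1 <- data.frame(First, Last, stringsAsFactors = FALSE)
df2 <- data.frame(Name, ID, stringsAsFactors = FALSE)

# Hypothetical helper: ID of the first df2 row whose Name contains both the
# first and the last name as literal substrings (fixed = TRUE disables regex)
find_id <- function(first, last) {
  hit <- grepl(first, df2$Name, fixed = TRUE) & grepl(last, df2$Name, fixed = TRUE)
  if (any(hit)) df2$ID[which(hit)[1]] else NA_character_
}

# Apply the helper to each (First, Last) pair; unname() drops the names
# that mapply() attaches to its result
df1$ID <- unname(mapply(find_id, df1$First, df1$Last))
df1
```

This keeps df1 as the left-hand table, which is the orientation the question originally asked for.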

How can I match strings with at least one word in common in R?

Dataframe 1 example:
NAME                       CITY  STATE  SURNAME
Maria Antonia Sousa        A     X      Antonia Sousa
Josep Oliveira Carlos      A     X      Oliveira Carlos
Jose Mario Augusto Farias  B     Y      Augusto Farias
Andre Gois Lucas           B     Y      Gois Lucas
I want to create a column familyDummy in the second dataframe that flags people who share at least one word of their surname with a surname in the first dataframe, but only if they are from the same city and state. The same person may appear in both dfs, and I don't want to identify them as family. The dfs don't have the same length.
Dataframe 2 example:
NAME                    CITY  STATE  SURNAME          familyDummy
Maria Antonia Sousa     A     X      Antonia Sousa    0
Angela Oliveira Santos  A     X      Oliveira Santos  1
Fabio Silva Carlos      B     Y      Silva Carlos     0
Luan Gois Lucas         B     Y      Gois Lucas       1
I appreciate any help.
Here's one way to solve your problem. The solution first splits the SURNAME column of both df1 and df2 into two surnames so that each can be checked for individual matches (see df1_bis and df2_bis). Then it cycles over all the entries of df2 to check that the exact NAME is not found in df1 and that at least one surname of each entry of df2 is found in df1. If these two conditions are met, it then checks whether the CITY and STATE of these entries match in df1 and df2. If so, it assigns familyDummy as 1; if not, as 0.
library(tidyverse)
# Your data
df1 <-structure(list(NAME = c("Maria Antonia Sousa", "Josep Oliveira Carlos",
"Jose Mario Augusto Farias", "Andre Gois Lucas"), CITY = c("A",
"A", "B", "B"), STATE = c("X", "X", "Y", "Y"), SURNAME = c("Antonia Sousa",
"Oliveira Carlos", "Augusto Farias", "Gois Lucas")), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(NAME = c("Maria Antonia Sousa", "Angela Oliveira Santos",
"Fabio Silva Carlos", "Luan Gois Lucas"), CITY = c("A", "A",
"B", "B"), STATE = c("X", "X", "Y", "Y"), SURNAME = c("Antonia Sousa",
"Oliveira Santos", "Silva Carlos", "Gois Lucas"), familyDummy = c(0L,
1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -4L))
# Divide surnames
df1_bis <- df1 %>%
  # Divide SURNAME into two surnames to check independently for each single surname
  # ([A-Za-z] is used because [A-z] also matches some punctuation characters)
  mutate(surname1 = str_extract(SURNAME, "[A-Za-z]+(?=\\s)"),
         surname2 = str_extract(SURNAME, "(?<=\\s)[A-Za-z]+"))
df2_bis <- df2 %>%
  # Divide SURNAME into two surnames to check independently for each single surname
  mutate(surname1 = str_extract(SURNAME, "[A-Za-z]+(?=\\s)"),
         surname2 = str_extract(SURNAME, "(?<=\\s)[A-Za-z]+"))
df2 %>%
# Add the result as another column
# Use map to cycle over each row in df2
mutate(familyDummy = map_dbl(seq_len(nrow(df2_bis)), function(i){
# map_dbl returns a plain numeric vector, whereas map would create a list column
# Check if the same NAME is in df1 and df2: FALSE if it appears, TRUE if not.
dif_name = str_detect(df2_bis$NAME[i], df1_bis$NAME, negate = TRUE)
# Check if any of the surnames of df1 is in df2. If it appears, assign 1, if not 0,
surname_same = ifelse(str_detect(df2_bis$surname1[i], df1_bis$surname1) | str_detect(df2_bis$surname1[i], df1_bis$surname2) | str_detect(df2_bis$surname2[i], df1_bis$surname1) | str_detect(df2_bis$surname2[i], df1_bis$surname2), 1, 0)
# Get the indices in df1 of the cases that meet the two latter criteria
temp <- which(dif_name == 1 & surname_same == 1)
# Check if there are cases where at least one entry matches the two criteria
if(length(temp) >= 1){
# Check if city and state in df1 matches that in df2
# I used %in% instead of == because there might be more than 1 match
familyDummy = ifelse(df2_bis$CITY[i] %in% df1_bis$CITY[temp] & df2_bis$STATE[i] %in% df1_bis$STATE[temp], 1, 0)
}else{ # If no case match the previous two criteria return 0
familyDummy = 0
}
return(familyDummy)
}))
# NAME CITY STATE SURNAME familyDummy
#1 Maria Antonia Sousa A X Antonia Sousa 0
#2 Angela Oliveira Santos A X Oliveira Santos 1
#3 Fabio Silva Carlos B Y Silva Carlos 0
#4 Luan Gois Lucas B Y Gois Lucas 1
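If surnames can have more or fewer than two words, a variation on the idea is to split each SURNAME on whitespace and test for any shared word. The sketch below is a base R alternative, not the code above: `share_surname` is a helper name invented for this example, and it requires CITY and STATE to match on the same df1 row as the shared surname.

```r
df1 <- data.frame(
  NAME = c("Maria Antonia Sousa", "Josep Oliveira Carlos",
           "Jose Mario Augusto Farias", "Andre Gois Lucas"),
  CITY = c("A", "A", "B", "B"),
  STATE = c("X", "X", "Y", "Y"),
  SURNAME = c("Antonia Sousa", "Oliveira Carlos", "Augusto Farias", "Gois Lucas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  NAME = c("Maria Antonia Sousa", "Angela Oliveira Santos",
           "Fabio Silva Carlos", "Luan Gois Lucas"),
  CITY = c("A", "A", "B", "B"),
  STATE = c("X", "X", "Y", "Y"),
  SURNAME = c("Antonia Sousa", "Oliveira Santos", "Silva Carlos", "Gois Lucas"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if two surname strings share at least one word
share_surname <- function(a, b) {
  length(intersect(strsplit(a, "\\s+")[[1]], strsplit(b, "\\s+")[[1]])) > 0
}

family_dummy <- sapply(seq_len(nrow(df2)), function(i) {
  same_place <- df1$CITY == df2$CITY[i] & df1$STATE == df2$STATE[i]
  other_person <- df1$NAME != df2$NAME[i]  # do not count the same person as family
  shares <- sapply(df1$SURNAME, share_surname, b = df2$SURNAME[i])
  as.integer(any(same_place & other_person & shares))
})
df2$familyDummy <- family_dummy
df2
```

Both loops are fine for small data; for large data frames a join on CITY and STATE followed by the word comparison would scale better.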

How to update name based on other column's condition (Cleaning Data)

I have a df below
df <- data.frame(LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"))
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
Some of the first names are not consistent, and I want to find and replace them. My code works when I run it on its own, but when I put it in a function it doesn't work. Also, my data is big: there are hundreds of names, so is there a better way to do this?
I found a reference here, but it does not resolve my problem. Any suggestions would be appreciated.
fil_name <- function(last,first,alternative){
df %>%
mutate(FIRSTNAME = ifelse(LASTNAME == "last" & FIRSTNAME == "first", "alternative", FIRSTNAME))
}
fil_name(Robinson,Dave,David)
Expected output:
LASTNAME FIRSTNAME
1 Robinson David
2 Anderson Adien
3 Beckham Joana
4 Wickham Kesley
5 Carlos Anberto
6 Robinson David
7 Beckham Joana
8 Anderson Adien
9 Carlos Anberto
We can convert the unquoted arguments to character strings inside the function, and it should work:
fil_name <- function(df, last,first,alternative){
last <- rlang::as_string(rlang::ensym(last))
first <- rlang::as_string(rlang::ensym(first))
alternative <- rlang::as_string(rlang::ensym(alternative))
df %>%
  dplyr::mutate(FIRSTNAME = dplyr::case_when(
    LASTNAME == last & FIRSTNAME == first ~ alternative,
    TRUE ~ FIRSTNAME))
}
fil_name(df, Robinson,Dave,David)
Another approach is to create a separate data frame including the FIRSTNAME alternative name pairings, merge it into the original data, and update FIRSTNAME for those rows where ALTNAME is not NA.
This allows one to update the data with a vectorized process, rather than changing the names one by one.
# create data frame with a column to maintain original sort order
df <- data.frame(obs = 1:9,
LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"),
stringsAsFactors = FALSE)
# create firstname / altname pairs
altnames <- data.frame(FIRSTNAME = c("Dave","Adi","Joan","An"),
ALTNAME = c("David","Adien","Joana","Anberto"),
stringsAsFactors = FALSE)
# merge by firstname, keeping all rows from original data frame
combined <- merge(df,altnames,by="FIRSTNAME",all.x=TRUE)
# update rows where ALTNAME is not NA
combined[!is.na(combined$ALTNAME),"FIRSTNAME"] <- combined[!is.na(combined$ALTNAME),"ALTNAME"]
# print the result, ordered by sequence in original data frame
combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
...and the output:
> combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
LASTNAME FIRSTNAME
6 Robinson David
1 Anderson Adien
7 Beckham Joana
9 Wickham Kesley
4 Carlos Anberto
5 Robinson David
8 Beckham Joana
2 Anderson Adien
3 Carlos Anberto
>
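A related sketch, in case the replacement should depend on LASTNAME as well as FIRSTNAME (as in the original fil_name attempt): a named character vector can act as the lookup table. The name `fixes` and the "|" separator are choices made for this example only, and the separator is assumed not to occur in the names.

```r
df <- data.frame(
  LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos",
               "Robinson", "Beckham", "Anderson", "Carlos"),
  FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto",
                "Dave", "Joana", "Adien", "An"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: "LASTNAME|FIRSTNAME" -> corrected first name
fixes <- c("Robinson|Dave" = "David",
           "Anderson|Adi"  = "Adien",
           "Beckham|Joan"  = "Joana",
           "Carlos|An"     = "Anberto")

# Build the same composite key for every row and replace only the rows that hit
key <- paste(df$LASTNAME, df$FIRSTNAME, sep = "|")
hit <- key %in% names(fixes)
df$FIRSTNAME[hit] <- unname(fixes[key[hit]])
df
```

Because nothing is merged, the original row order is preserved and no obs column is needed.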

Getting a few tables together

I have three tables read into R as dataframes.
Table1:
Student ID School_Name Campus Area
4356791 BCCS Northwest Springdale
03127. BZS South Vernon
12437. BCCS. South Vernon
Table 2:
ProctorID. Date. Score. Student ID Form#
0211 10/05/16 75.57 55612 25432178
0211 10/17/16 83.04 55612 47135671
5134 10/17/16 63.28 02613 2371245
Table 3:
ProctorID First. Last. Campus Area
O211. Simone Lewis. Northwest Springdale
5134. Mona. Yashamito Northwest Springdale
0712. Steven. Lewis. South Vernon
I want to combine the data frames and create a table with the scores next to each other for each area, by school name. I want an output like the following:
School_Name Form# Northwest Springdale Southvernon
BCCS. 2543127. 83.04. 63.25
BCCS. 35674. 75.14. *
BZS. 5321567. 65.2. 62.3
A particular form for a particular school may not have a score for a certain area. Any ideas? I have been playing with the sqldf package. Also, is it possible to do this in R without using any SQL?
To cast, something like this:
library(reshape2)
# df is the merged data frame; a column name containing a space needs backticks in the formula
casted_df <- dcast(df, ... ~ `Campus Area`, value.var="Score.")
An example that seems to work for me:
df1 <- data.frame("StudentID" = 1:3, "SchoolName" = c("School1", "School2", "School3"), "Area" = c("Area1", "Area2", "Area3"))
df2 <- data.frame("StudentID" = 1:3, "Score" = 100:102, "Proctor" = 4:6)
df3 <- data.frame("Proctor" = 4:6, "Area" = c("Area1", "Area2", "Area3"), "Name" = c("John", "Jane", "Jim"))
combined <- merge(df1, df2, by = "StudentID")
# Area exists in both inputs, so the merge produces Area.x and Area.y
combined2 <- merge(combined, df3, by = "Proctor")
library(reshape2)
final <- dcast(combined2, ... ~ Area.x, value.var="Score")
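To the question of whether this can be done without SQL (or extra packages): a rough base R sketch is to merge and then cross-tabulate with tapply(). It reuses the toy column names from the example above rather than the real tables, and it assumes that taking the mean is an acceptable way to combine several scores that fall in the same school/area cell.

```r
df1 <- data.frame(StudentID = 1:3,
                  SchoolName = c("School1", "School2", "School3"),
                  Area = c("Area1", "Area2", "Area3"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(StudentID = 1:3, Score = 100:102, Proctor = 4:6)

# Join scores to schools, then spread Area across the columns
combined <- merge(df1, df2, by = "StudentID")
wide <- tapply(combined$Score,
               list(School = combined$SchoolName, Area = combined$Area),
               mean)
wide
```

Cells with no score for that school and area come out as NA, which corresponds to the `*` in the desired output.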

Change the values in one data set based on the values in another

I have the following problem:
names <- c("Peter", "Gabriel", "James", "Philip")
city <- c("LA", "NY","Chicago","Chicago")
number <- seq(1, length(names))
from <- c("Peter", "Peter", "Gabriel", "James", "James")
to <- c("James","Gabriel", "Philip", "Gabriel", "Philip")
nodes <- data.frame(names, city, number)
edges <- data.frame(from, to)
How do I change the values of edges$from to match those in nodes$number?
You can use the following,
edges$from <- nodes$number[match(edges$from, nodes$names)]
edges
# from to
#1 1 James
#2 1 Gabriel
#3 2 Philip
#4 3 Gabriel
#5 3 Philip
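If the to column should be converted as well, the same match() lookup can be applied to every column of edges in one step. This is a sketch that assumes every name in edges appears in nodes$names; names that do not appear would become NA.

```r
nodes <- data.frame(names = c("Peter", "Gabriel", "James", "Philip"),
                    city = c("LA", "NY", "Chicago", "Chicago"),
                    number = 1:4,
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("Peter", "Peter", "Gabriel", "James", "James"),
                    to = c("James", "Gabriel", "Philip", "Gabriel", "Philip"),
                    stringsAsFactors = FALSE)

# match() gives the row position of each name in nodes$names;
# edges[] <- keeps edges a data frame while replacing each column
edges[] <- lapply(edges, function(col) nodes$number[match(col, nodes$names)])
edges
```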

Resources