Merge two datasets using multiple column checks - r

I have two data frames with one ID variable in the first df ("ID") and three in the second df ("SIC", "Ur", "Sonst"). Now I am trying to merge these two datasets by checking whether the "ID" in the first df matches SIC, Ur, or Sonst in the respective row. Here is my reproducible example:
df1 <- data.frame(ID = c("A", "B", "C", "D"),
                  Value = c(1:4))
df2 <- data.frame(SIC = c("B", NA, NA, NA, NA, NA),
                  Ur = c(NA, "C", NA, NA, NA, NA),
                  Sonst = c(NA, NA, "A", NA, NA, NA),
                  Age = c(14:19))
Now I want a final df with only the IDs and all information from the first df (as it is the target df), plus the corresponding Age information if the ID matches either SIC, Ur, or Sonst. I have tried dplyr and merge() approaches but did not come up with a proper solution. I'm thankful for any suggestions.

An approach using dplyr's left_join() with tidyr's unite():
library(dplyr)
library(tidyr)
left_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm = TRUE))
Joining with `by = join_by(ID)`
  ID Value Age
1  A     1  16
2  B     2  14
3  C     3  15
4  D     4  NA
or an inner_join if you only want A, B and C to show up
inner_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm = TRUE))
Joining with `by = join_by(ID)`
  ID Value Age
1  A     1  16
2  B     2  14
3  C     3  15

The most convenient way is perhaps the tidyverse approach nicely shown by Andre Wildberg. You can also do it with the base R merge() function, but in that case we first need a nested ifelse() to collect the non-missing values of the three df2 columns into a single ID column:
df2$ID <- ifelse(!is.na(df2$SIC), df2$SIC,
                 ifelse(!is.na(df2$Ur), df2$Ur, df2$Sonst))
Merge the two dfs:
df3 <- merge(df1, df2, by = "ID", all.x = TRUE)
Discard unwanted columns from the merged data (df3):
df3 <- df3[, c("ID", "Value", "Age")]
df3
 ID Value Age
  A     1  16
  B     2  14
  C     3  15
  D     4  NA
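For the column-collapsing step, dplyr::coalesce() offers a more compact equivalent of the nested ifelse(); a minimal sketch, assuming dplyr is available:
# Take the first non-missing value across SIC, Ur and Sonst, element-wise
library(dplyr)
df2$ID <- coalesce(df2$SIC, df2$Ur, df2$Sonst)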

Related

Finding Differences Between Two Dataframes

I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name = c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
                  Vessel_ID = c('1', '2', '3', '4', '5'),
                  special_NO = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'),
                  Vessel_ID = c('1', '6', '7', '3', '5', '10'),
                  special_NO = NA,
                  stringsAsFactors = FALSE)
Ideally I would want an output like this:
df3
     Name Vessel_ID special_NO add_remove
  Vessel2         2         20     remove
  Vessel4         4         40     remove
  Vessel6        10         NA        add
        x         6         NA        add
        y         7         NA        add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which df each row originally belonged to, then merging the dataframes and using the duplicated() function. This seemed to work, but I still wasn't sure which rows to remove or to add, and got different results depending on whether I specified fromLast=TRUE or fromLast=FALSE.
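A minimal sketch of that tagging-and-duplicated() idea, assuming the example frames above; checking duplicated() from both directions keeps exactly the Vessel_IDs that occur in only one frame, so the fromLast choice no longer matters:
# Tag each row with its intended action, stack the frames, and keep
# Vessel_IDs that appear only once (i.e. rows unique to one frame).
df1$add_remove <- "remove"
df2$add_remove <- "add"
combined <- rbind(df1, df2)
dup <- duplicated(combined$Vessel_ID) | duplicated(combined$Vessel_ID, fromLast = TRUE)
combined[!dup, ]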
An approach using bind_rows
library(dplyr)
bind_rows(df1 %>% mutate(add_remove = "remove"),
          df2 %>% mutate(add_remove = "add")) %>%
  group_by(Vessel_ID) %>%
  filter(n() == 1) %>%
  ungroup()
# A tibble: 5 × 4
  Name    Vessel_ID special_NO add_remove
  <chr>   <chr>          <dbl> <chr>
1 Vessel2 2                 20 remove
2 Vessel4 4                 40 remove
3 x       6                 NA add
4 y       7                 NA add
5 Vessel6 10                NA add
Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, mostly in base R:
df1$old_new <- "old"
df2$old_new <- "new"
#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
# only keep the columns you need
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO", "status")]
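If only the changed rows are wanted (as in the add/remove output asked for above), the rows flagged as "same" can simply be dropped afterwards; a small follow-up sketch:
# Keep only the rows that were added or deleted between the two versions
df.changed <- df.comb[df.comb$status != "same", ]
df.changed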

In R: How to consistently replace (/ anonymize) ids or names within two separate data frames

Suppose I have two data frames, df1 and df2. Both data frames have an identifier id. My goal is to merge those data sets on this identifier, but I want to anonymize the names in the id column. However, the problem is that I want to do so for both data sets individually, i.e. for df1 and df2, and not directly on df3 (because that would be easy: just replace the id column with some random characters).
I think my solution would need to look something like this. First, I make a separate data frame consisting of all unique ids from both df1 and df2. Then, I would assign some randomization, for example idxxxx, where xxxx is a unique number for each id in this separate data frame. With a dplyr, gsub, or stringr approach I can then replace the ids in df1 and df2 according to the value assigned in this separate data frame. After this, I will merge the two data sets.
Here I have two example data frames, my try at solving the problem, and the desired result. Note that the numbering of the ids does not really matter to me (e.g., it does not matter whether John Terry gets id0004 or id0003), as long as it is changed consistently within both data frames.
Can someone help me out with this? Thanks!
id <- c("John Williams", "John Terry", "Rick Fire", "Katie Blue", "Unknown")
row1 <- c("28", "17", "17", "29", "39")
df1 <- data.frame(id,row1)
id <- c("Frank Johnson", "John Terry", "Rick Fire", "Katie Blue")
row2 <- c("Purple", "Red", "Yellow", "Green")
df2 <- data.frame(id,row2)
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
#My try
#Make separate data frame
id_df <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
id_df <- subset(id_df,TRUE,select = c(id))
id_df$anonymous <- id_df %>% mutate(id = row_number()) #it would be nicer to have something like id0001
#Replace ids within df1 and df2 according to the id_df anonymous variable
library(stringr)
df1$id <- str_replace(df1$id, id_df$id, as.character(id_df$anonymous)) #replacement does not work
#desired result
#df1 row1
#id0003 28
#id0002 17
#id0005 17
#id0004 29
#id0006 39
#df2 row2
#id0001 Purple
#id0002 Red
#id0005 Yellow
#id0004 Green
#df3
#id #row1 #row2
#id0001 NA Purple
#id0002 17 Red
#id0003 28 NA
#id0004 29 Green
#id0005 17 Yellow
#id0006 39 NA
If you want to anonymize the ids and make it reasonably difficult to reverse them, you could calculate the md5 hash of each string. Two identical strings will produce the same md5 hash:
df1$id <- sapply(df1$id, digest::digest, algo = "md5")
df2$id <- sapply(df2$id, digest::digest, algo = "md5")
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
df3
#> id row1 row2
#> 1 22e35044ed452870ad5b014e87121d9d 39 <NA>
#> 2 69d61b42a2f549c4765699f06de3b351 28 <NA>
#> 3 ad1cc76e26c5d73ba4a03bf51df1b6af 17 Yellow
#> 4 b3bdcc4913da319308e6ddf47e09da12 <NA> Purple
#> 5 badea53ae1e8a2fa66ebd1cdde9dd413 17 Red
#> 6 d1f305c19a2f9649abe11efcf26ac645 29 Green
Here is a solution with all base R (no tidyverse). We create a lookup table with all unique IDs (use the set operation union to find the IDs) and then merge the lookup table with each data frame separately.
# Find all unique ids and create a lookup table.
all_ids <- union(df1$id, df2$id)
id_df <- data.frame(id = all_ids, code = paste0('id', sprintf('%04d', 1:length(all_ids))))
# Merge df1 with the lookup table, remove the id column, and rename the code column to id.
df1 <- merge(df1, id_df, all.x = TRUE)
df1 <- df1[, c('code', 'row1')]
names(df1)[1] <- 'id'
# Repeat for df2
df2 <- merge(df2, id_df, all.x = TRUE)
df2 <- df2[, c('code', 'row2')]
names(df2)[1] <- 'id'
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
Note that sprintf('%04d', ...) will pad the numeric code with zeroes to a total width of 4.
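For example, applied to a small vector:
sprintf('%04d', c(1, 23, 456))
#> [1] "0001" "0023" "0456"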
A tidyverse solution that honors your request not to operate on df3 until you join
id_df <- data.frame(id = union(df1$id, df2$id))
id_df <- id_df %>%
  mutate(anonymous = paste0("id", stringr::str_pad(row_number(),
                                                   width = 4,
                                                   pad = 0)))
newdf1 <- left_join(df1, id_df) %>% select(-id) %>% relocate(anonymous)
newdf2 <- left_join(df2, id_df) %>% select(-id) %>% relocate(anonymous)
full_join(newdf1, newdf2)
#> Joining, by = "anonymous"
#> anonymous row1 row2
#> 1 id0001 28 <NA>
#> 2 id0002 17 Red
#> 3 id0003 17 Yellow
#> 4 id0004 29 Green
#> 5 id0005 39 <NA>
#> 6 id0006 <NA> Purple

How to create a variable based on the number of unique values in another data frame?

This is a simplified example of what I want to do.
Dataset 1 (DF1) has data on apples (like their size or number of holes), and a second dataset (DF2) has information on the worms found inside them, including their color and in which apple they were found.
What I want to do is to add a variable in DF1 with the number of unique colors (of the worms) that exist in each apple.
DF1 <- data.frame(x = c("A1","A2","A3","A4","A5"), y = c(3,26,5,27,5))
DF2 <- data.frame(Q = c("A1","A1","A1","A1","A1","A1","A2","A2","A3","A3","A3","A4","A5","A5","A5","A5"),
                  R = c("red","red","blue","yellow","yellow","blue","orange","orange","green","red","red","blue","blue","purple","black","red"),
                  S = c(4,5,3,5,4,3,5,4,3,5,4,3,5,4,3,5))
I am new to R, and when trying to solve it I thought of:
DF1$N.Colors<-length(unique(DF2$R[match(DF1$X,DF2$Q)]))
But it gives me back a new variable filled with 0s, instead of the wanted vector:
DF1$N.Colors<-c(3,1,2,1,4)
I'd very much appreciate your help with this.
This can be done by joining on the 'Q' and 'x' columns of the two datasets, counting the unique values of 'R', and assigning the result to a new column in 'DF1':
library(data.table)
DF1$N.Colors <- setDT(DF2)[DF1, uniqueN(R), on = .(Q = x), by = .EACHI]$V1
Or using the tidyverse:
library(dplyr)
DF2 %>%
  group_by(x = Q) %>%
  summarise(N.Colors = n_distinct(R)) %>%
  right_join(DF1)
A base solution with aggregate() and merge():
merge(DF1,
      aggregate(N.Colors ~ Q, list(N.Colors = DF2$R, Q = DF2$Q),
                function(x) length(unique(x))),
      all.x = TRUE, by.x = "x", by.y = "Q")
#    x  y N.Colors
# 1 A1  3        3
# 2 A2 26        1
# 3 A3  5        2
# 4 A4 27        1
# 5 A5  5        4

Create data frame, matching based on first elements of a list

I want to create a data frame based on the first element of a list. Specifically, I have:
One vector containing variable names (names1);
One list whose elements contain two variables (some of the names and the values);
And the end product should be a data.frame with the "names1" variables that contains as many rows as there are matching cases.
If there is no match between a specific list element and the vector, the value should be NA.
The values can also be factors or strings.
names1 <- c("a", "b", "c")
dat1 <- data.frame(names1 =c("a", "b", "c", "f"),values= c("val1", 13, 11, 0))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 =c("a", "b", "x"),values= c(12, 10, 2))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be a new data frame with the variables from "names1" and all matching values from each list element:
     a  b  c
  val1 13 11
    12 10 NA
One option would be to loop through the list ('list1'), filter the 'names' column based on the 'names' vector, convert it to a single dataset while creating an identification column with .id, spread from 'long' to 'wide' and remove the 'grp' column
library(tidyverse)
map_df(list1, ~ .x %>%
                 filter(names %in% !! names), .id = 'grp') %>%
  spread(names, values) %>%
  select(-grp)
# a b c
#1 25 13 11
#2 12 10 NA
Or another option is to bind the datasets together with bind_rows, create a grouping id 'grp' to identify the list element, filter the rows by keeping only those whose 'names' value matches the 'names' vector, and spread from 'long' to 'wide':
bind_rows(list1, .id = 'grp') %>%
  filter(names %in% !! names) %>%
  spread(names, values)
NOTE: It is better not to use the names of base functions (such as names) for your own objects. Also, to avoid confusion, the object name should differ from the column names of the data frame.
It can also be done with only base R. Create a group identifier with Map, rbind the list elements into a single dataset, subset the rows by keeping only the values in the 'names' vector, and reshape from 'long' to 'wide':
df1 <- subset(do.call(rbind, Map(cbind, list1, ind = seq_along(list1))),
              names %in% .GlobalEnv$names)
reshape(df1, idvar = 'ind', direction = 'wide', timevar = 'names')[-1]
A mix of base R and dplyr. For every list element we create a data frame with one row, row-bind them together with dplyr's rbind_list, and then subset only the columns we need using names.
library(dplyr)
rbind_list(lapply(list1, function(x)
  setNames(data.frame(t(x$values)), x$names)))[names]
# a b c
# <dbl> <dbl> <dbl>
#1 25 13 11
#2 12 10 NA
Output without subset looks like this
rbind_list(lapply(list1, function(x) setNames(data.frame(t(x$values)), x$names)))
# a b c x
# <dbl> <dbl> <dbl> <dbl>
#1 25 13 11 NA
#2 12 10 NA 2
In base R
t(sapply(list1, function(x) setNames(x$values, names)[match(names, x$names)]))
# a b c
# [1,] 25 13 11
# [2,] 12 10 NA
Using base R only
body <- do.call('rbind', lapply(list1, function(list.element) {
  element.vals <- list.element[['values']]
  element.names <- list.element[['names']]
  names(element.vals) <- element.names
  return.vals <- element.vals[names]
  if (all(is.na(return.vals))) NULL else return.vals
}))
df <- as.data.frame(body)
names(df) <- names
df
For the sake of completeness, here is a data.table approach using dcast() and rowid():
library(data.table)
nam <- names1 # avoid name conflict with column name
rbindlist(list1)[names1 %in% nam, dcast(.SD, rowid(names1) ~ names1)][, names1 := NULL][]
      a  b    c
1: val1 13   11
2:   12 10 <NA>
Or, more concisely, pick columns after reshaping:
library(data.table)
rbindlist(list1)[, dcast(.SD, rowid(names1) ~ names1)][, .SD, .SDcols = names1]

Merging two data.frames by key column

I have two dataframes. In the first one, I have a KEY/ID column and two variables:
KEY V1 V2
  1 10  2
  2 20  4
  3 30  6
  4 40  8
  5 50 10
In the second dataframe, I have a KEY/ID column and a third variable
KEY V3
  1  5
  2 10
  3 20
I would like to extract the rows of the first dataframe that are also in the second dataframe by matching them according to the KEY column. I would also like to add the V3 column to final dataset.
KEY V1 V2 V3
  1 10  2  5
  2 20  4 10
  3 30  6 20
These are my attempts using the subset and merge functions:
subset(data1, data1$KEY == data2$KEY)
merge(data1, data2, by.x = "KEY", by.y = "KEY")
Neither of them does the task.
Any hint would be appreciated. Thank you!
merge(data1, data2, by="KEY") should do it!
If what you want is an inner join, then your merge() attempt should do it. If it doesn't, check the classes of the KEY columns in both tables using class(data1$KEY).
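For instance, a quick check (column names taken from the question) and a possible fix if the classes differ:
# Compare the classes of the key columns before merging
class(data1$KEY)
class(data2$KEY)
# If one is character and the other numeric, convert one of them, e.g.:
# data2$KEY <- as.numeric(data2$KEY)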
Apart from these and the merge suggested by Christian, you can use:
library(plyr)
join(data1, data2, by="KEY", type="inner")
or
library(data.table)
setkey(data1, KEY)
setkey(data2, KEY)
data1[data2]
You could use a dplyr *_join. Given the sample data, both of the following would give the same result:
library(dplyr)
df_merged <- inner_join(data1, data2, by = 'KEY')
df_merged <- right_join(data1, data2, by = 'KEY')
An inner_join returns all rows from data1 where there are matching values in data2, and all columns from both.
A right_join returns all rows from data2, and all columns from both data1 and data2.
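For contrast, a left_join would keep every row of data1 and fill V3 with NA where there is no match; a quick sketch using the same sample data:
left_join(data1, data2, by = 'KEY')
#   KEY V1 V2 V3
# 1   1 10  2  5
# 2   2 20  4 10
# 3   3 30  6 20
# 4   4 40  8 NA
# 5   5 50 10 NA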
