Merging two data.frames by key column

I have two dataframes. In the first one, I have a KEY/ID column and two variables:
KEY V1 V2
1 10 2
2 20 4
3 30 6
4 40 8
5 50 10
In the second dataframe, I have a KEY/ID column and a third variable
KEY V3
1 5
2 10
3 20
I would like to extract the rows of the first dataframe that are also in the second dataframe, matching them on the KEY column. I would also like to add the V3 column to the final dataset:
KEY V1 V2 V3
1 10 2 5
2 20 4 10
3 30 6 20
These are my attempts, using the subset and merge functions:
subset(data1, data1$KEY == data2$KEY)
merge(data1, data2, by.x = "KEY", by.y = "KEY")
Neither of them does the task.
Any hint would be appreciated. Thank you!

merge(data1, data2, by="KEY") should do it!
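As a minimal, self-contained sketch (the data frames are rebuilt from the tables in the question; the names merged and filtered are only illustrative), the merge looks like this. It also hints at why the subset attempt is unreliable: == compares element by element and recycles the shorter vector, whereas %in% tests membership.

```r
# Sample data copied from the question
data1 <- data.frame(KEY = 1:5,
                    V1 = c(10, 20, 30, 40, 50),
                    V2 = c(2, 4, 6, 8, 10))
data2 <- data.frame(KEY = 1:3,
                    V3 = c(5, 10, 20))

# Inner join: keeps only the KEYs present in both data frames and adds V3
merged <- merge(data1, data2, by = "KEY")
print(merged)

# Row filter only (no V3 column): %in% checks membership in data2$KEY
filtered <- data1[data1$KEY %in% data2$KEY, ]
print(filtered)
```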

If what you want is an inner join, then your merge attempt should do it. If it doesn't, check the type of the KEY column in both tables using class(data1$KEY) and class(data2$KEY).
Apart from this and the merge suggested by Christian, you can use:
library(plyr)
join(data1, data2, by="KEY", type="inner")
or
library(data.table)
setDT(data1)  # convert both data.frames to data.tables by reference
setDT(data2)
data1[data2, on = "KEY", nomatch = NULL]  # inner join on KEY
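The type check mentioned above can be sketched as follows. The mismatch here is a hypothetical case (KEY read as character in one table and as integer in the other); the question does not say that this is what actually happened.

```r
# Hypothetical case: KEY is character in data1 but integer in data2
data1 <- data.frame(KEY = c("1", "2", "3", "4", "5"),
                    V1 = c(10, 20, 30, 40, 50),
                    stringsAsFactors = FALSE)
data2 <- data.frame(KEY = 1:3,
                    V3 = c(5, 10, 20))

print(class(data1$KEY))
print(class(data2$KEY))

# Bring both keys to the same type before joining
data1$KEY <- as.integer(data1$KEY)
stopifnot(identical(class(data1$KEY), class(data2$KEY)))

merged <- merge(data1, data2, by = "KEY")
print(merged)
```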

You could use a dplyr *_join. Given the sample data, both of the following would give the same result:
library(dplyr)
df_merged <- inner_join(data1, data2, by = 'KEY')
df_merged <- right_join(data1, data2, by = 'KEY')
An inner_join returns all rows from data1 that have matching values in data2, and all columns from data1 and data2.
A right_join returns all rows from data2, and all columns from data1 and data2.
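The two joins only coincide because every KEY in data2 also occurs in data1. A sketch with one extra, hypothetical row (KEY 6) in data2 shows the difference. It uses base merge(), where all.y = TRUE plays the role of a right join, so it runs without dplyr.

```r
data1 <- data.frame(KEY = 1:5,
                    V1 = c(10, 20, 30, 40, 50),
                    V2 = c(2, 4, 6, 8, 10))
# KEY 6 is a made-up row that has no partner in data1
data2 <- data.frame(KEY = c(1L, 2L, 3L, 6L),
                    V3 = c(5, 10, 20, 40))

# Inner join: only KEYs found in both tables
inner <- merge(data1, data2, by = "KEY")

# Right join: every row of data2, with NA where data1 has no match
right <- merge(data1, data2, by = "KEY", all.y = TRUE)

print(inner)
print(right)
```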

Related

Merge two datasets using multiple column checks

I have two dataframes, with one ID variable in the first df ("ID") and three in the second df ("SIC", "Ur", "Sonst"). I am trying to merge these two datasets by checking whether the "ID" in the first df matches SIC, Ur, or Sonst in the respective row. Here is my reproducible example:
df1 <- data.frame(ID = c("A", "B", "C","D"),
Value=c(1:4))
df2 <- data.frame(SIC = c("B", NA,NA,NA,NA,NA),
Ur = c(NA, "C", NA,NA,NA,NA),
Sonst=c(NA,NA,"A",NA,NA,NA),
Age=c(14:19))
Now I want a final df only with IDs and all information of the first df (as it is the target df) plus the corresponding age information, if ID either matches with SIC, Ur or Sonst. I have tried dplyr and merge function approaches but did not come up with a proper solution. I'm thankful for any suggestions.

An approach using dplyr's left_join with tidyr's unite:
library(dplyr)
library(tidyr)
left_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm=T))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
4 D 4 NA
or an inner_join if you only want A, B and C to show up
inner_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm=T))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15

The most convenient way is probably the tidyverse approach shown by Andre Wildberg. You can also do it with the base R merge() function, but in this case you first need nested ifelse() calls to collect the non-missing values of the three df2 columns into a single column:
df2$ID <- ifelse(!is.na(df2$SIC), df2$SIC,
ifelse(!is.na(df2$Ur), df2$Ur, df2$Sonst))
Merge the two data frames:
df3 <- merge(df1, df2, by= "ID", all.x = TRUE)
Discard unwanted columns from merged data (df3):
df3 <- df3[, c("ID", "Value", "Age")]
df3
ID Value Age
A 1 16
B 2 14
C 3 15
D 4 NA
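An alternative base R sketch stacks the three candidate key columns into one long lookup table instead of collapsing them row by row. The names key_cols, lookup and result are illustrative. Unlike the ifelse() chain, which keeps only the first non-missing value, this also covers a row of df2 that has more than one of the three columns filled in.

```r
df1 <- data.frame(ID = c("A", "B", "C", "D"),
                  Value = 1:4,
                  stringsAsFactors = FALSE)
df2 <- data.frame(SIC = c("B", NA, NA, NA, NA, NA),
                  Ur = c(NA, "C", NA, NA, NA, NA),
                  Sonst = c(NA, NA, "A", NA, NA, NA),
                  Age = 14:19,
                  stringsAsFactors = FALSE)

# One (ID, Age) pair per candidate key column, stacked on top of each other
key_cols <- c("SIC", "Ur", "Sonst")
lookup <- do.call(rbind, lapply(key_cols, function(k) {
  data.frame(ID = df2[[k]], Age = df2$Age, stringsAsFactors = FALSE)
}))
lookup <- lookup[!is.na(lookup$ID), ]

# Left join: keep every row of df1, Age is NA where no key matches
result <- merge(df1, lookup, by = "ID", all.x = TRUE)
print(result)
```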

Finding Differences Between Two Dataframes

I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name=c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
Vessel_ID=c('1','2','3','4','5'), special_NO=c(10,20,30,40,50),
stringsAsFactors=F)
df2 <- data.frame(Name=c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'), Vessel_ID=c('1', '6', '7', '3', '5', '10'), special_NO=NA, stringsAsFactors=F)
Ideally I would want an output like this:
df3
Name Vessel_ID special_NO add_remove
Vessel2 2 20 remove
Vessel4 4 40 remove
Vessel6 10 NA add
x 6 NA add
y 7 NA add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which df each row originally belonged to, then merging the dataframes and using the duplicated() function. This seemed to work, but I still wasn't sure which rows to remove or add, and I got different results depending on whether I specified fromLast=T or fromLast=F.

An approach using bind_rows
library(dplyr)
bind_rows(df1 %>% mutate(add_remove="remove"),
df2 %>% mutate(add_remove="add")) %>%
group_by(Vessel_ID) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 × 4
Name Vessel_ID special_NO add_remove
<chr> <chr> <dbl> <chr>
1 Vessel2 2 20 remove
2 Vessel4 4 40 remove
3 x 6 NA add
4 y 7 NA add
5 Vessel6 10 NA add

Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, mostly in base R (the join itself uses dplyr, with a base merge() alternative in the comments):
df1$old_new <- "old"
df2$old_new <- "new"
#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
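# only keep the columns you need; special_NO exists in both inputs, so the join
# suffixes it, and special_NO.x is the value that came from df1
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO.x", "status")]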
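Since the question states that Vessel_ID is the only unique identifier, the comparison can also be sketched in base R on that column alone. The names removed, added, df3 and df2_filled are illustrative. The last step addresses the side question of carrying special_NO over from df1 where the IDs match.

```r
df1 <- data.frame(Name = c("Vessel1", "Vessel2", "Vessel3", "Vessel4", "Vessel5"),
                  Vessel_ID = c("1", "2", "3", "4", "5"),
                  special_NO = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Vessel1", "x", "y", "Vessel3", "x", "Vessel6"),
                  Vessel_ID = c("1", "6", "7", "3", "5", "10"),
                  special_NO = NA,
                  stringsAsFactors = FALSE)

# Rows whose ID exists only in df1 must be removed to get df2
removed <- df1[!df1$Vessel_ID %in% df2$Vessel_ID, ]
removed$add_remove <- rep("remove", nrow(removed))

# Rows whose ID exists only in df2 must be added
added <- df2[!df2$Vessel_ID %in% df1$Vessel_ID, ]
added$add_remove <- rep("add", nrow(added))

df3 <- rbind(removed, added)
rownames(df3) <- NULL
print(df3)

# Side question: copy special_NO from df1 into df2 where Vessel_ID matches
df2_filled <- df2
df2_filled$special_NO <- df1$special_NO[match(df2$Vessel_ID, df1$Vessel_ID)]
print(df2_filled)
```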

How to create a variable based on the number of unique values in another data frame?

This is a simplified example of what I want to do.
Dataset 1 (DF1) has data of apples (like the size or number of holes), and a second dataset (DF2) has information of worms found inside them, including color, and in which apple they were found.
What I want to do is to add a variable in DF1 with the number of unique colors (of the worms) that exist in each apple.
DF1<-data.frame(x=c("A1","A2","A3","A4","A5"),y=c(3,26,5,27,5))
DF2<-data.frame(Q=c("A1","A1","A1","A1","A1","A1","A2","A2","A3","A3","A3","A4","A5","A5","A5","A5"),R=c("red","red","blue","yellow","yellow","blue","orange","orange","green","red","red","blue","blue", "purple","black","red"),S=c(4,5,3,5,4,3,5,4,3,5,4,3,5,4,3,5))
I am new to R, and when trying to solve it I thought of:
DF1$N.Colors<-length(unique(DF2$R[match(DF1$X,DF2$Q)]))
But it gives me back a new variable filled with 0s, instead of the wanted vector:
DF1$N.Colors<-c(3,1,2,1,4)
I'd very much appreciate your help with it.

This can be done with a join between column 'Q' of 'DF2' and column 'x' of 'DF1': count the unique values of 'R' for each match and assign the result to a new column in 'DF1'
library(data.table)
DF1$N.Colors <- setDT(DF2)[DF1, uniqueN(R), on = .(Q = x), by = .EACHI]$V1
Or using tidyverse
library(dplyr)
DF2 %>%
group_by(x = Q) %>%
summarise(N.Colors = n_distinct(R)) %>%
right_join(DF1)

A base solution with aggregate() and merge():
merge(DF1, aggregate(N.Colors ~ Q, list(N.Colors = DF2$R, Q = DF2$Q), function(x) length(unique(x))), all.x = T, by.x = "x", by.y = "Q")
# x y N.Colors
# 1 A1 3 3
# 2 A2 26 1
# 3 A3 5 2
# 4 A4 27 1
# 5 A5 5 4
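Another base R sketch uses tapply() to count the distinct colors per apple and then looks the counts up by name. It also suggests why the attempt in the question cannot work: length() collapses everything to a single number, and match() only ever returns the first matching row. The name counts is illustrative.

```r
DF1 <- data.frame(x = c("A1", "A2", "A3", "A4", "A5"),
                  y = c(3, 26, 5, 27, 5),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Q = c("A1", "A1", "A1", "A1", "A1", "A1", "A2", "A2",
                        "A3", "A3", "A3", "A4", "A5", "A5", "A5", "A5"),
                  R = c("red", "red", "blue", "yellow", "yellow", "blue",
                        "orange", "orange", "green", "red", "red", "blue",
                        "blue", "purple", "black", "red"),
                  stringsAsFactors = FALSE)

# Number of distinct colors for each apple, named by apple ID
counts <- tapply(DF2$R, DF2$Q, function(r) length(unique(r)))

# Look up each apple of DF1 by name; an apple without worms would get NA
DF1$N.Colors <- as.integer(counts[DF1$x])
print(DF1)
```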

How to join a data frame at the bottom of an existing one without rewriting the whole?

df1 = data.frame(c(1,2),c(3,4))
colnames(df1) = c("V1","V2")
df2 = data.frame(c(2,3),c(5,6))
colnames(df2) = c("V1","V2")
How can I add df2 at the bottom of df1 without using rbind, which requires rewriting the entire dataframe?

We can do an assignment by creating a new row in 'df1' after extracting the first column of 'df2'
df1[nrow(df1)+1, ] <- df2[[1]]
df1
# V1 V2
#1 1 3
#2 2 4
#3 2 3
NOTE: The OP originally showed a dataset 'df2' with a single column. It is assumed that the number of rows in that column equals the number of columns in 'df1'
With the new dataset, we can use
df1[nrow(df1) + seq_len(nrow(df2)),] <- df2
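A minimal sketch of this row-indexed assignment with the two-column data from the question (new_rows is an illustrative name). Note, as an assumption about R's usual copy-on-modify behaviour rather than something measured here, that this avoids calling rbind but may still copy the data frame internally.

```r
df1 <- data.frame(V1 = c(1, 2), V2 = c(3, 4))
df2 <- data.frame(V1 = c(2, 3), V2 = c(5, 6))

# Row positions directly below the current last row of df1
new_rows <- nrow(df1) + seq_len(nrow(df2))

# Assigning to row indices beyond nrow(df1) extends the data frame
df1[new_rows, ] <- df2
print(df1)
```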

How to find indices of specific rows in dataframe r

I have a dataframe, A, which looks like this:
col1 col2 col3
NL 6 9
UK 5 5
US 9 7
and I have a dataframe, B, consisting of a subset of the rows of the large dataframe looking like this:
col1 col2 col3
NL 6 9
UK 5 5
Now, I want to find the indices of the rows from B in A, so it should return 1 and 2. Does someone know how to do this?
EDIT
Next, I also want to find the indices of the rows in A when I have only the first two columns in B. In that case it should also return 1 and 2. Does anyone have an idea how to do this?

Generally, match gets the index. In this case, one approach is to paste the values of each row together and get the index with match (here df1 is A and df2 is B)
match(do.call(paste, df2), do.call(paste, df1))
If the datasets share only a subset of column names, get the vector of common column names with intersect, subset both datasets, do the paste and get the index with match
nm1 <- intersect(names(df1), names(df2))
match(do.call(paste, df2[nm1]), do.call(paste, df1[nm1]))
Another option is a join: create a row index in both datasets, join them, and extract the row index that came from df1
library(dplyr)
df2 %>%
mutate(rn = row_number()) %>%
left_join(df1 %>%
mutate(rn = row_number()), by = c('col1', 'col2', 'col3')) %>%
pull(rn.y)
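A self-contained sketch of the paste-and-match idea, with A and B rebuilt from the question (row_key, B2, idx and idx2 are illustrative names). It passes an explicit separator that is unlikely to occur in the data, because with the default space separator two different rows such as ("a b", "c") and ("a", "b c") would produce the same key.

```r
A <- data.frame(col1 = c("NL", "UK", "US"),
                col2 = c(6, 5, 9),
                col3 = c(9, 5, 7),
                stringsAsFactors = FALSE)
B <- A[1:2, ]

# Collapse each row into one string key
row_key <- function(d) do.call(paste, c(d, sep = "\r"))

# All columns present in B
idx <- match(row_key(B), row_key(A))
print(idx)

# Only the first two columns present in B: match on the shared columns
B2 <- B[c("col1", "col2")]
nm <- intersect(names(A), names(B2))
idx2 <- match(row_key(B2[nm]), row_key(A[nm]))
print(idx2)
```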
