Double merge two data frames in R

I have two dataframes
df1 = data.frame(Sites=c("A","B","C"),total=c(12,6,35))
df2 = data.frame(Site.1=c("A","A","B"),Site.2=c("B","C","C"), Score=c(60,70,80))
I need to merge them to produce the dataframe
df3=data.frame(Site.1=c("A","A","B"),Site.2=c("B","C","C"),
Score=c(60,70,80),Site.1.total=c(12,12,6),Site.2.total=c(6,35,35))
Any advice on the simplest way to do such a double merge? Thanks

Simply merge twice:
x <- merge(df2, df1, all.x=TRUE, by.x="Site.2", by.y="Sites", sort=FALSE)
merge(x, df1, all.x=TRUE, by.x="Site.1", by.y="Sites", sort=FALSE)
Site.1 Site.2 Score total.x total.y
1 A B 60 6 12
2 A C 70 35 12
3 B C 80 35 6
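If you want the exact column names from the desired df3, a small rename step can follow; this is just a sketch, with res as an assumed name for the result of the second merge above:
res <- merge(x, df1, all.x = TRUE, by.x = "Site.1", by.y = "Sites", sort = FALSE)
# total.x came from the Site.2 merge, total.y from the Site.1 merge
names(res)[names(res) == "total.x"] <- "Site.2.total"
names(res)[names(res) == "total.y"] <- "Site.1.total"
res[, c("Site.1", "Site.2", "Score", "Site.1.total", "Site.2.total")]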

Here are a couple of sqldf solutions.
First let's rename the columns that contain a dot in their names, since the dot is an SQL operator. (Had we not wished to do that, we could have referred to those columns in the SQL statements as Site_1 and Site_2 and sqldf would have understood that we meant Site.1 and Site.2.)
library(sqldf)
df1 = data.frame(Sites = c("A","B","C"), total = c(12,6,35))
df2 = data.frame(Site1 = c("A","A","B"), Site2 = c("B","C","C"),
Score = c(60,70,80))
Now that we have our inputs, let's try a couple of approaches with sqldf:
sqldf with three SQL statements
temp1 <- sqldf("SELECT * FROM df1 as a, df2 as b WHERE a.Sites = b.Site1 ")
temp2 <- sqldf("SELECT * FROM df1 as a, df2 as b WHERE a.Sites = b.Site2 ")
sqldf("SELECT
Site1,
b.Site2,
a.Score,
a.Total as Site1Total,
b.Total as Site2Total
FROM temp1 as a, temp2 as b
USING (Site1)
GROUP BY a.Total, b.Total")
sqldf reduced to a triple join
We can further reduce the above to a triple join which perhaps clarifies the essence of the computation. That is, the three SQL statements above can be reduced to this single statement:
> sqldf("SELECT Site1, Site2, Score, a1.total AS total1, a2.total AS total2
+ FROM df1 AS a1, df1 a2, df2 AS b
+ WHERE a1.Sites = Site1 AND a2.Sites = Site2")
Site1 Site2 Score total1 total2
1 A B 60 12 6
2 A C 70 12 35
3 B C 80 6 35
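For comparison, here is a dplyr sketch of the same double merge using the renamed inputs (Site1/Site2) defined above; the Site1Total/Site2Total names are only illustrative:
library(dplyr)
df2 %>%
  left_join(rename(df1, Site1 = Sites, Site1Total = total), by = "Site1") %>%
  left_join(rename(df1, Site2 = Sites, Site2Total = total), by = "Site2")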

Related

Efficient conditional join on multiple columns

I have two tables that I would like to join using multiple columns, and this is perfectly feasible using the dplyr join functions. The complication is that I need to join on multiple columns, and the join should succeed if at least one of the column joins succeeds. To demonstrate my case, here is a reproducible example:
df1 <- data.frame(
A1 = c(1,2,3,4),
B1 = c(4,5,6,7),
C1 = c("a", "b", "c", "d")
)
df2 <- data.frame(
A2 = c(8,"",3,4),
B2 = c(9,5,"",7),
C2 = c("aa", "bb", "cc", "dd")
)
I would like to join df1 and df2 on columns A or B, meaning keep all rows where at least one of df1$A1 == df2$A2 or df1$B1 == df2$B2 holds (please note my real dataset has six columns that I would like to use for the joining). The end result for the simplified example should be:
data.frame(
A1 = c(2,3,4),
A2 = c("",3,7),
B1 = c(5,6,7),
B2 = c(5,"", 7),
C1 = c("b", "c", "d"),
C2 = c("bb", "cc", "dd")
)
Many thanks in advance for any recommendations on how this can be done efficiently; if a fast approach is not possible then a slower solution is acceptable as well.
Not quite sure how to do this using dplyr, but sqldf could help you out:
library(sqldf)
sqldf("SELECT *
FROM df1
JOIN df2
ON df1.A1 = df2.A2
OR df1.B1 = df2.B2")
You can add additional OR statements after this for more columns.
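The same OR-join can also be sketched in base R with a cross join followed by a filter, assuming both tables are small enough for the Cartesian product:
# merge() with by = NULL returns the Cartesian product of the two data frames
pairs <- merge(df1, df2, by = NULL)
# keep the row pairs where at least one key column matches
pairs[pairs$A1 == pairs$A2 | pairs$B1 == pairs$B2, ]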
A simple way can be:
library(dplyr)
df1 <- df1 %>%
mutate(A1 = as.character(A1), B1 = as.character(B1))
df1 %>%
bind_cols(df2) %>%
filter(A1 == A2 | B1 == B2) %>%
relocate(sort(names(.)))
#>   A1 A2 B1 B2 C1 C2
#> 1  2     5  5  b bb
#> 2  3  3  6     c cc
#> 3  4  4  7  7  d dd
It seems like this isn't possible with a single call to a dplyr join function.
If you would like to use a dplyr join, here is a hacky workaround I created using a purrr map function to do a separate inner join for each of the conditions in the conditional join. Then bind them together and remove duplicate rows. It can be generalized to more columns by appending to the key1 and key2 vectors.
Note: first we need to modify the example data so the columns to be joined have the same type; dplyr throws an error if you try to join incompatible column types, in this case numeric and character.
library(dplyr)
library(purrr)
df1 <- df1 %>%
mutate(A1 = as.character(A1), B1 = as.character(B1))
key1 <- c('A1', 'B1')
key2 <- c('A2', 'B2')
map2_dfr(key1, key2, ~ inner_join(df1, df2, by = setNames(.y, .x), keep = TRUE)) %>%
distinct()
Result:
  A1 B1 C1 A2 B2 C2
1  3  6  c  3    cc
2  4  7  d  4  7 dd
3  2  5  b     5 bb

Is there a way to merge with one table and, if no match is found, fall back to a second table?

df1 <- data.frame(id=c(1,2,3,4,5,8), var=c("a","b","c","d","e","t"), stringsAsFactors = F)
df2 <- data.frame(id=c(1,2,3,4,5,6,7), var=c("e","f","c","d","e","g","h"), stringsAsFactors = F)
df <- data.frame(id=c(1,2,3,4,5,6,7,8))
I need to join to get the var value for df. I want to take var from df1 and, if there is no matching id in df1, take it from df2. I have the code below, but is there an easier way to do this? And how can I add a column showing which data frame var came from?
df %>% left_join(df1, by="id") %>% left_join(df2, by="id") %>%
dplyr::mutate(var=ifelse(!is.na(var.x), var.x, var.y))
Use bind_rows on df1 and df2 first and you can see where var came from if the argument .id is set.
library(dplyr)
bind_rows(df1 = df1, df2 = df2, .id = "from") %>%
distinct(id, .keep_all = T) %>%
right_join(df)
# from id var
# 1 df1 1 a
# 2 df1 2 b
# 3 df1 3 c
# 4 df1 4 d
# 5 df1 5 e
# 6 df2 6 g
# 7 df2 7 h
# 8 df1 8 t
We can use an SQL triple join like this:
library(sqldf)
sqldf("select a.*, coalesce(b.var, c.var) as var
from df a
left join df1 b using(id)
left join df2 c using(id)")
giving:
id var
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 g
7 7 h
8 8 t
If you need to put it into a pipeline:
df %>%
{ sqldf("select a.*, coalesce(b.var, c.var) as var
from [.] a
left join df1 b using(id)
left join df2 c using(id)") }

Joining / merging two data frames by symmetric differences in rows and columns

I would like to join / merge two data frames so that the rows and columns they share are not duplicated in the resulting data frame. Consider the following example:
df1 <- data.frame(
id = c("a","b","c"),
a = runif(3,1,9),
b = runif(3,1,9)
)
df2 <- data.frame(
df1[1:2,],
c = runif(2,1,9)
)
This results in two data frames that have exactly four cells in common (not counting id), i.e. df1[1:2, 2:3] == df2[1:2, 2:3]. However, they differ in that df1 has an additional row and df2 has an additional column:
> print(df1)
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469
> print(df2)
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
I want a new data frame to consist of the symmetric differences between these two, so no duplicates in rows or columns. The closest result I have achieved is by using dplyr::full_join(df1, df2, by = "id"), but this results in duplicated columns.
The result should look like this:
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
3 c 5.608775 4.219469 NA
What's the best way of achieving this dynamically? Thanks
With data.table, we can join on 'id' and assign the 'c' column from the second dataset to create 'c' in the first dataset. By default, non-matching elements are assigned NA.
library(data.table)
setDT(df1)[df2, c := c, on = .(id)]
df1
# id a b c
#1: a 4.601639 1.065642 7.476494
#2: b 6.065758 6.234421 8.929932
#3: c 4.000351 7.365717 NA
NOTE: The values differ from the question because the example did not use set.seed.
In base R, an option would be match
df1$c <- df2$c[match(df1$id, df2$id)]
Regarding the OP's use of full_join (left_join would be fine based on the example), the trick is to drop the columns of the second dataset that are not needed:
library(dplyr)
nm1 <- c("id", setdiff(names(df2), names(df1)))
left_join(df1, select(df2, all_of(nm1)), by = 'id')
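The setdiff() idea also works in base R if df2 could carry several extra columns; a sketch that assumes 'id' is the only shared key column:
new_cols <- setdiff(names(df2), names(df1))            # columns present only in df2
df1[new_cols] <- df2[match(df1$id, df2$id), new_cols]  # NA where an id has no match in df2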
Another approach if one of the data frames has all the rows you want (df2 here):
library(dplyr)
bind_rows(df2, anti_join(df1, df2))
#Joining, by = c("id", "a", "b")
# id a b c
#1 a 1.912298 5.792475 6.899253
#2 b 2.537666 1.495075 1.186120
#3 c 5.947766 6.594028 NA
In this particular case this would be sufficient
library(sqldf)
sqldf("select * from df1 left natural join df2")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr:
library(dplyr)
left_join(df1, df2)
but in general you might need the following. Note that this is perfectly general: we did not need to specify column or row names in either the code above or below, and the code below is symmetric in df1 and df2, so it does not rely on knowing the structure of either.
sqldf("select * from df1 left natural join df2
union
select * from df2 left natural join df1")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr. This will give a warning but still works; the warning can be avoided if id is character rather than factor, or by converting it to character first.
library(dplyr)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct
Note
Because the question did not use set.seed, the code that generates the input is not reproducible, but we can copy the particular df1 and df2 so that we have the same data as in the question.
Lines1 <- "
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469"
df1 <- read.table(text = Lines1)
Lines2 <- "
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280"
df2 <- read.table(text = Lines2)

Merging 3 dataframes with a left join

I have 3 dataframes with unequal numbers of rows.
df1
T1 T2 T3
1 Joe TTT
2 PP YYY
3 JJ QQQ
5 UU OOO
6 OO GGG
df2
X1 X2
1 09/20/2017
2 08/02/2015
3 05/02/2000
8 06/03/1999
df3
L1 L2
1 New
6 Notsure
9 Also
The final dataframe should be like a left join of all 3, retaining only the rows of df1. The key columns are T1, X1 and L1, but they have different header names, and the number of rows differs in each dataframe. I couldn't find a solution for this situation: what I found on SO covered 2 dataframes, or 3 dataframes with equal rows or the same column names:
T1 T2 T3 X2 L2
1 Joe TTT 09/20/2017 New
2 PP YYY 08/02/2015 NA
3 JJ QQQ 05/02/2000 NA
5 UU OOO NA NA
6 OO GGG NA Notsure
I am comparatively new to R and couldn't find R code for this.
The idea is to put your data frames in a list, change the name of the first column, and use Reduce to merge, i.e.
Reduce(function(...) merge(..., by = 'Var1', all.x = TRUE),
       lapply(mget(ls(pattern = 'df[0-9]+')), function(i) {names(i)[1] <- 'Var1'; i}))
which gives,
Var1 T2 T3 X2 L2
1 1 Joe TTT 09/20/2017 New
2 2 PP YYY 08/02/2015 <NA>
3 3 JJ QQQ 05/02/2000 <NA>
4 5 UU OOO <NA> <NA>
5 6 OO GGG <NA> Notsure
Using tidyverse functions, you can try:
df1 %>%
left_join(df2, by = c("T1" = "X1")) %>%
left_join(df3, by = c("T1" = "L1"))
which gives:
T1 T2 T3 X2 L2
1 1 Joe TTT 09/20/2017 New
2 2 PP YYY 08/02/2015 <NA>
3 3 JJ QQQ 05/02/2000 <NA>
4 5 UU OOO <NA> <NA>
5 6 OO GGG <NA> Notsure
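When there are many tables, the same chain can be collapsed with purrr::reduce(); a sketch where the rename() calls just give every table the same key name:
library(dplyr)
library(purrr)
list(df1, rename(df2, T1 = X1), rename(df3, T1 = L1)) %>%
  reduce(left_join, by = "T1")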
1) sqldf
library(sqldf)
sqldf("select df1.*, X2, L2
from df1
left join df2 on T1 = X1
left join df3 on T1 = L1")
1a) Although slightly longer, this variation can make the code easier to review later because it is explicit about which source each column came from. If the data frame names were long you might want to use aliases, e.g. from df1 as a, but here we don't bother since they are short.
sqldf("select df1.*, df2.X2, df3.L2
from df1
left join df2 on df1.T1 = df2.X1
left join df3 on df1.T1 = df3.L1")
2) merge Using repeated merge. No packages used.
Merge <- function(x, y) merge(x, y, by = 1, all.x = TRUE)
Merge(Merge(df1, df2), df3)
2a) This could also be written using a magrittr pipeline like this:
library(magrittr)
df1 %>% Merge(df2) %>% Merge(df3)
2b) Using Reduce we can do the repeated merges like this:
Reduce(Merge, list(df1, df2, df3))
Note: The inputs in reproducible form are:
Lines1 <- "
T1 T2 T3
1 Joe TTT
2 PP YYY
3 JJ QQQ
5 UU OOO
6 OO GGG"
Lines2 <- "
X1 X2
1 09/20/2017
2 08/02/2015
3 05/02/2000
8 06/03/1999"
Lines3 <- "
L1 L2
1 New
6 Notsure
9 Also"
df1 <- read.table(text = Lines1, header = TRUE)
df2 <- read.table(text = Lines2, header = TRUE)
df3 <- read.table(text = Lines3, header = TRUE)
With left_join() it would be something like this:
df1 = data.frame(X = c("a", "b", "c"), var1 = c(1,2, 3))
df2 = data.frame(V = c("a", "b", "c"), var2 =c(5,NA, NA) )
df3 = data.frame(Y = c("a", "b", "c"), var3 =c("name", NA, "age") )
library(dplyr)
# rename the key columns so all three data frames share "X"
df2 = df2 %>% rename(X = V)
df3 = df3 %>% rename(X = Y)
df = left_join(df1, df2, by = "X") %>%
left_join(., df3, by = "X")
> df
X var1 var2 var3
1 a 1 5 name
2 b 2 NA <NA>
3 c 3 NA age

Use an id value to perform element-wise calculations on two data frames

I'm doing element-wise calculations on two data frames, but only where the same id exists in both sets of data.
The current method I'm using is to subset both data frames where the same ids exist, then sort the data by id, then do the calculation:
## Example data
id <- c('a','b','c','d','e')
v1 <- c(10, 20, 30,20,40)
v2 <- c(20,30,20,20,40)
df1 <- data.frame(id, v1, v2, stringsAsFactors=FALSE)
id <- c('a','c','d','b','f')
v1 <- c(20,60,30,10,20)
v2 <- c(60,20,50,10,20)
df2 <- data.frame(id, v1, v2, stringsAsFactors=FALSE)
## subset both data frames by ids that exist in both
df1_subset <- df1[df1$id %in% df2$id,]
df2_subset <- df2[df2$id %in% df1$id,]
id <- df1_subset$id
## arrange by id value
library(dplyr)
df1_sorted <- df1_subset %>% arrange(id)
df2_sorted <- df2_subset %>% arrange(id)
## find the difference between each value
df_result <- cbind(id, df2_sorted[,2:3] - df1_sorted[,2:3])
Is there a 'better' way of doing this calculation, one where the data doesn't need to be subset and sorted, and which uses the id value directly to ensure the calculation is performed on the correct rows and columns of data?
library(dplyr)
inner_join(df1, df2, by="id") %>%
mutate(v1=v1.y-v1.x, v2=v2.y-v2.x) %>%
select(id, v1, v2)
# id v1 v2
#1 a 10 40
#2 b -10 -20
#3 c 30 0
#4 d 10 30
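A small variation on the above (just a sketch): the suffix argument of inner_join() makes the intermediate column names more explicit before subtracting.
library(dplyr)
inner_join(df1, df2, by = "id", suffix = c("_old", "_new")) %>%
  mutate(v1 = v1_new - v1_old, v2 = v2_new - v2_old) %>%  # df2 minus df1, as in the question
  select(id, v1, v2)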
You can use merge and then a single transform to do what you need:
#merge will find the common ids between the dataframes
a <- merge(df1,df2, by='id')
#transform will add the two columns you need (subtracting one from the other)
a <- transform(a, v1 = v1.y - v1.x, v2 = v2.y - v2.x)
Output:
> a
id v1.x v2.x v1.y v2.y v1 v2
1 a 10 20 20 60 10 40
2 b 20 30 10 10 -10 -20
3 c 30 20 60 20 30 0
4 d 20 20 30 50 10 30
Which is the same as your df_result
> df_result
id v1 v2
1 a 10 40
2 b -10 -20
3 c 30 0
4 d 10 30
First, you can easily join these DFs on id with merge() (in base R):
df_merged = merge(df1,df2, by='id')
which gives you the following new column names:
names(df_merged)
# [1] "id" "v1.x" "v2.x" "v1.y" "v2.y"
because merge() by default adds suffixes to colliding column names.
Then consider this combination to get your result ...
df_result = with(df_merged, data.frame(id, result1 = v1.y - v1.x, result2 = v2.y - v2.x))
with() adds readability. There are many ways to do this; nice libraries like plyr and sqldf make it easy. I look forward to seeing a more R-er way in the answers.
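For completeness, a data.table sketch of the same keyed subtraction, assuming ids are unique within each table:
library(data.table)
setDT(df1); setDT(df2)
# inner join on id, then subtract df1's columns (prefixed i.) from df2's
df2[df1, on = "id", nomatch = NULL][, .(id, v1 = v1 - i.v1, v2 = v2 - i.v2)]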
