Efficient conditional join on multiple columns in R

I have two tables that I would like to join on multiple columns, which is perfectly feasible with the dplyr join functions. The complication is that the join should succeed if at least one of the column pairs matches. To demonstrate, here is a reproducible example:
df1 <- data.frame(
  A1 = c(1, 2, 3, 4),
  B1 = c(4, 5, 6, 7),
  C1 = c("a", "b", "c", "d")
)
df2 <- data.frame(
  A2 = c(8, "", 3, 4),
  B2 = c(9, 5, "", 7),
  C2 = c("aa", "bb", "cc", "dd")
)
I would like to join df1 and df2 on columns A or B, meaning keep all pairs of rows where at least one of df1$A1 == df2$A2 or df1$B1 == df2$B2 holds (please note my real dataset has 6 columns that I would like to use for the join). The end result for the simplified example should be:
data.frame(
  A1 = c(2, 3, 4),
  A2 = c("", 3, 4),
  B1 = c(5, 6, 7),
  B2 = c(5, "", 7),
  C1 = c("b", "c", "d"),
  C2 = c("bb", "cc", "dd")
)
Many thanks in advance for any recommendations on how this can be done efficiently; if a fast solution is not possible, a slow one is acceptable as well.

Not quite sure how to do this using dplyr, but sqldf could help you out:
library(sqldf)
sqldf("SELECT *
       FROM df1
       JOIN df2
       ON df1.A1 = df2.A2
       OR df1.B1 = df2.B2")
You can add further OR conditions to the ON clause for more columns, as sketched below.
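For instance, a third condition slots straight into the ON clause. This is a minimal sketch of the syntax only: C1/C2 from the toy data are label columns, so the extra condition never actually matches here, but in a real dataset they would be further key columns.
library(sqldf)
# Same pattern with one more OR condition; C1/C2 stand in for an
# additional key pair purely to show the syntax.
sqldf("SELECT *
       FROM df1
       JOIN df2
       ON df1.A1 = df2.A2
       OR df1.B1 = df2.B2
       OR df1.C1 = df2.C2")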

A simple way, given that both data frames here have the same number of rows and the matching rows are already aligned, is to compare the columns position-wise (note this is not a general join; it relies on that row alignment):
library(dplyr)
df1 <- df1 %>%
  mutate(A1 = as.character(A1), B1 = as.character(B1))
df1 %>%
  bind_cols(df2) %>%
  filter(A1 == A2 | B1 == B2) %>%
  relocate(sort(names(.)))
#> A1 A2 B1 B2 C1 C2
#> 1 2 5 5 b bb
#> 2 3 3 6 c cc
#> 3 4 4 7 7 d dd

It seems like this isn't possible with a single call to a dplyr join function.
If you would like to use a dplyr join, here is a hacky workaround using a purrr map function to do a separate inner join for each condition of the conditional join, then bind the results together and remove duplicate rows. It generalizes to more columns by appending to the key1 and key2 vectors.
Note: first we need to modify the example data so the columns to be joined have the same type; dplyr throws an error if you try to join incompatible column types, in this case numeric and character.
library(dplyr)
library(purrr)
df1 <- df1 %>%
  mutate(A1 = as.character(A1), B1 = as.character(B1))
key1 <- c('A1', 'B1')
key2 <- c('A2', 'B2')
map2_dfr(key1, key2, ~ inner_join(df1, df2, by = setNames(.y, .x), keep = TRUE)) %>%
  distinct()
Result:
A1 B1 C1 A2 B2 C2
1 3 6 c 3 cc
2 4 7 d 4 7 dd
3 2 5 b 5 bb
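The same idea can be wrapped as a small helper for readability when there are many key pairs. A sketch; or_join is a made-up name, and it assumes the key columns already share a common type:
library(dplyr)
library(purrr)
# Hypothetical helper: inner join once per key pair, stack the
# results, and drop duplicate rows.
or_join <- function(x, y, key1, key2) {
  map2_dfr(key1, key2,
           ~ inner_join(x, y, by = setNames(.y, .x), keep = TRUE)) %>%
    distinct()
}
or_join(df1, df2, c('A1', 'B1'), c('A2', 'B2'))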

Related

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join variant for this that I'm missing, but I have two data frames, where:
1. The merging should happen into the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame with values from the second.
3. The second data frame contains more rows than the first one.
Conditions #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need some steps in between and am wondering if there's an easier way to get the desired output.
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
  left_join(y %>% select(id, q), by = "id") %>%
  rows_update(y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
  left_join(y, by = 'id') %>%
  transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
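If you are on dplyr >= 1.1.0, rows_patch() expresses the "replace only NAs" part directly, and its unmatched argument handles the extra rows in y. A sketch under that version assumption:
library(dplyr)
# First join in the new column q, then patch NAs in the existing
# columns from y; unmatched = "ignore" skips rows of y (here id = 4)
# that have no match in x. Requires dplyr >= 1.1.0.
x %>%
  left_join(select(y, id, q), by = "id") %>%
  rows_patch(select(y, id, a), by = "id", unmatched = "ignore")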

merging and filling the NA values of another column based on another dataframe

I have 2 dfs, a subset of which looks like this. Where available, I want the NA values to be replaced by the rsid values from the other df.
df1:
SNP A1 A2 rsid
1:100000012 A G rs1234
1:1000066 T C <NA>
1:2032101 C T rs5678
df2:
SNP A1 A2 rsid
2:107877 A G rs1112023
3:1000066 T C rs8213723
1:1000066 T C rs7778899
This is what I want, where the NA is replaced by the rsid value from the other df. In this example, the rsid in row 3 of df2 replaces the NA rsid in row 2 of df1. I only want the new df to include the rows of df1, like so:
df3
SNP A1 A2 rsid
1:100000012 A G rs1234
1:1000066 T C rs7778899
1:2032101 C T rs5678
I tried this, but am getting some error messages. Can someone help?
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(SNP, A1, A2) %>%
  summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
Error: Column `rsid` must be length 1 (a summary value), not 2
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
We can bind the datasets together with bind_rows and then do a grouped summarise, removing the NAs with complete.cases (dplyr version >= 1.0):
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(SNP, A1, A2) %>%
  summarise(rsid = rsid[complete.cases(rsid)], .groups = 'drop')
Output:
# A tibble: 5 x 4
# SNP A1 A2 rsid
# <chr> <chr> <chr> <chr>
#1 1:100000012 A G rs1234
#2 1:1000066 T C rs7778899
#3 1:2032101 C T rs5678
#4 2:107877 A G rs1112023
#5 3:1000066 T C rs8213723
If the version of dplyr is < 1.0, summarise expects the output to be of length 1 per group. We can wrap it in a list and then unnest (from tidyr):
library(tidyr)
bind_rows(df1, df2) %>%
  group_by(SNP, A1, A2) %>%
  summarise(rsid = list(rsid[complete.cases(rsid)])) %>%
  ungroup %>%
  unnest(c(rsid))
Update
Based on the updated post, if we need to update the column 'rsid' based on the second data, one option is to do a join and then assign (:=) after coalescing the 'rsid' columns:
library(data.table)
setDT(df1)[df2, rsid := fcoalesce(rsid, i.rsid), on = .(SNP, A1, A2)]
Output:
df1
# SNP A1 A2 rsid
#1: 1:100000012 A G rs1234
#2: 1:1000066 T C rs7778899
#3: 1:2032101 C T rs5678
A similar option is also possible with dplyr
left_join(df1, df2, by = c('SNP', 'A1', 'A2')) %>%
  transmute(SNP, A1, A2, rsid = coalesce(rsid.x, rsid.y))
data
df1 <- structure(list(SNP = c("1:100000012", "1:1000066", "1:2032101"
), A1 = c("A", "T", "C"), A2 = c("G", "C", "T"), rsid = c("rs1234",
NA, "rs5678")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(SNP = c("2:107877", "3:1000066", "1:1000066"),
A1 = c("A", "T", "T"), A2 = c("G", "C", "C"), rsid = c("rs1112023",
"rs8213723", "rs7778899")), class = "data.frame", row.names = c(NA,
-3L))

Joining / merging two data frames by symmetric differences in rows and columns

I would like to join / merge two data frames without duplicating the rows and columns they share in the resulting data frame. Consider the following example:
df1 <- data.frame(
  id = c("a", "b", "c"),
  a = runif(3, 1, 9),
  b = runif(3, 1, 9)
)
df2 <- data.frame(
  df1[1:2, ],
  c = runif(2, 1, 9)
)
This results in two data frames that have exactly four cells in common (not counting id), so df1[1:2, 2:3] == df2[1:2, 2:3]. However, they differ in that df1 has an additional row and df2 has an additional column:
> print(df1)
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469
> print(df2)
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
I want a new data frame to consist of the symmetric differences between these two, so no duplicates in rows or columns. The closest result I have achieved is by using dplyr::full_join(df1, df2, by = "id"), but this results in duplicated columns.
The result should look like this:
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
3 c 5.608775 4.219469 NA
What's the best way of achieving this dynamically? Thanks
With data.table we can join on 'id' and assign the 'c' column from the second dataset to create 'c' in the first data. By default, the non-matching elements will be assigned NA.
library(data.table)
setDT(df1)[df2, c := c, on = .(id)]
df1
# id a b c
#1: a 4.601639 1.065642 7.476494
#2: b 6.065758 6.234421 8.929932
#3: c 4.000351 7.365717 NA
NOTE: the values are different because no seed was set.
In base R, an option would be match
df1$c <- df2$c[match(df1$id, df2$id)]
Regarding the OP's use of full_join (left_join would be fine based on the example), the trick is to drop the columns that are not needed from the second dataset (all_of() is used because the column names are held in a character vector):
library(dplyr)
nm1 <- c("id", setdiff(names(df2), names(df1)))
left_join(df1, select(df2, all_of(nm1)), by = 'id')
Another approach, if one of the data frames has all the columns you want (df2 here):
library(dplyr)
bind_rows(df2, anti_join(df1, df2))
#Joining, by = c("id", "a", "b")
# id a b c
#1 a 1.912298 5.792475 6.899253
#2 b 2.537666 1.495075 1.186120
#3 c 5.947766 6.594028 NA
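Spelling out the join key makes the same approach quieter and more explicit, a sketch: anti_join() keeps the rows of df1 whose id is absent from df2, and bind_rows() fills the missing c column with NA.
library(dplyr)
# Same idea with an explicit key, which avoids the "Joining, by = ..."
# message.
bind_rows(df2, anti_join(df1, df2, by = "id"))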
In this particular case this would be sufficient:
library(sqldf)
sqldf("select * from df1 left natural join df2")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr:
library(dplyr)
left_join(df1, df2)
but in general you might need the following. Note that this is perfectly general: we did not need to specify column or row names in either the code above or the code below, and the code below is symmetric in df1 and df2, so it does not rely on knowing the structure of either.
sqldf("select * from df1 left natural join df2
       union
       select * from df2 left natural join df1")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr. This will give a warning but still works. You can avoid the warning if id is character rather than factor, or if you convert it to character first.
library(dplyr)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct
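For instance, the conversion mentioned above might look like this (a sketch):
library(dplyr)
# Converting the factor id to character first avoids the coercion
# warning from rbind.
df1$id <- as.character(df1$id)
df2$id <- as.character(df2$id)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct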
Note
Because the question did not use set.seed, the code that generates the input is not reproducible, but we can copy the particular df1 and df2 so that we have the same data as in the question.
Lines1 <- "
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469"
df1 <- read.table(text = Lines1)
Lines2 <- "
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280"
df2 <- read.table(text = Lines2)

Select minimum data of grouped data - keeping all columns [duplicate]

This question already has an answer here: R: Uniques (or dplyr distinct) + most recent date.
I am running into a wall here. I have a dataframe with many rows. Here is a schematic example:
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my dataframe by ID, then select the row with the oldest date per group, and write the output into a new dataframe, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the nrow of the resulting dataframe should equal the number of unique IDs.
unique(myDf$ID) %>% length == nrow(test)
FALSE
Does not work. I tried this:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
That ran forever, so I terminated it.
Why is the first approach not working, and what would be a good way to approach the problem?
Considering you have a pretty large dataset, I think using data.table will be better! (As for why your first attempt fails: mutate(date == ...) uses the comparison operator == instead of =, so no date column is ever created, and filter(date == min(b2)) then refers to columns that don't exist.) Here is the data.table version to solve your problem; it will be quicker than dplyr:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
> df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
> df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS: you can use setDT(mydf) to convert a data.frame to a data.table in place.
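A more compact data.table spelling of the same idea (a sketch; .SD is the per-group subset of rows, so this keeps the whole row of each group's earliest date):
# Index each group's subset at its minimum date.
df[, .SD[which.min(myDate)], by = ID]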
After grouping by 'ID', we can use which.min to get the index of the minimum 'myDate' (after converting it to Date class), and extract that row with slice.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
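With dplyr >= 1.0.0 the same idea can also be written with slice_min() (a sketch; with_ties = FALSE keeps a single row per group even if dates tie):
library(dplyr)
# Convert the date first so the comparison is chronological, then keep
# the earliest row per group. Requires dplyr >= 1.0.0.
df1 %>%
  mutate(myDate = as.Date(myDate, '%d.%m.%Y')) %>%
  group_by(ID) %>%
  slice_min(myDate, n = 1, with_ties = FALSE) %>%
  ungroup()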
data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
If you want to use only base functions, you can also go with aggregate and merge.
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to look for the minimum dates by group.
# In this case our variable of interest in the myDate column and the
# group to sort by is the "ID" column.
# The function will sort out the minimum date and create a new data frame
# with names "myDate" and "ID"
df2 = aggregate(list(myDate = df1$myDate), list(ID = df1$ID),
                function(x) x[which(x == min(x))])
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)
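The aggregate step can also be written with the formula interface, which is a bit shorter (a sketch):
# Per-ID minimum via the formula interface; merge() then joins the
# minima back to the full rows.
df2 = aggregate(myDate ~ ID, data = df1, FUN = min)
merge(df1, df2)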

Double merge two data frames in r

I have two dataframes
df1 = data.frame(Sites = c("A","B","C"), total = c(12,6,35))
df2 = data.frame(Site.1 = c("A","A","B"), Site.2 = c("B","C","C"),
                 Score = c(60,70,80))
I need to merge them to produce the dataframe
df3 = data.frame(Site.1 = c("A","A","B"), Site.2 = c("B","C","C"),
                 Score = c(60,70,80), Site.1.total = c(12,12,6),
                 Site.2.total = c(6,35,35))
Any advice on the simplest way to do such a double merge? Thanks.
Simply merge twice:
x <- merge(df2, df1, all.x=TRUE, by.x="Site.2", by.y="Sites", sort=FALSE)
merge(x, df1, all.x=TRUE, by.x="Site.1", by.y="Sites", sort=FALSE)
Site.1 Site.2 Score total.x total.y
1 A B 60 6 12
2 A C 70 35 12
3 B C 80 35 6
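If you want the column names from the desired output, a final rename does it (a sketch; total.x came from the Site.2 merge and total.y from the Site.1 merge, hence the mapping below):
# Optional renaming to match the desired output.
out <- merge(x, df1, all.x = TRUE, by.x = "Site.1", by.y = "Sites", sort = FALSE)
names(out)[names(out) == "total.y"] <- "Site.1.total"
names(out)[names(out) == "total.x"] <- "Site.2.total"
out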
Here are a couple of sqldf solutions.
First, let's rename the columns containing a dot in their names to remove the dot, since dot is an SQL operator. (Had we not wished to do that, we could have referred to those columns in the SQL statements as Site_1 and Site_2, and sqldf would have understood that we were referring to Site.1 and Site.2.)
library(sqldf)
df1 = data.frame(Sites = c("A","B","C"), total = c(12,6,35))
df2 = data.frame(Site1 = c("A","A","B"), Site2 = c("B","C","C"),
                 Score = c(60,70,80))
Now that we have our inputs, let's try a couple of approaches with sqldf:
sqldf with three SQL statements
temp1 <- sqldf("SELECT * FROM df1 as a, df2 as b WHERE a.Sites = b.Site1 ")
temp2 <- sqldf("SELECT * FROM df1 as a, df2 as b WHERE a.Sites = b.Site2 ")
sqldf("SELECT
         Site1,
         b.Site2,
         a.Score,
         a.Total AS Site1Total,
         b.Total AS Site2Total
       FROM temp1 AS a, temp2 AS b
       USING (Site1)
       GROUP BY a.Total, b.Total")
sqldf reduced to a triple join
We can further reduce the above to a triple join, which perhaps clarifies the essence of the computation. That is, the three SQL statements above can be reduced to this single statement:
sqldf("SELECT Site1, Site2, Score, a1.total AS total1, a2.total AS total2
       FROM df1 AS a1, df1 AS a2, df2 AS b
       WHERE a1.Sites = Site1 AND a2.Sites = Site2")
Site1 Site2 Score total1 total2
1 A B 60 12 6
2 A C 70 12 35
3 B C 80 6 35
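For comparison, here is the same double lookup written with dplyr (a sketch; rename() just lines df1's columns up with each key before joining):
library(dplyr)
# Join df1's totals once per site column.
df2 %>%
  left_join(rename(df1, Site1 = Sites, total1 = total), by = "Site1") %>%
  left_join(rename(df1, Site2 = Sites, total2 = total), by = "Site2")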
