A merge indicator for R data.table?

My question is related to this question, but that one asked for a dplyr solution.
What I'd like to do is perform an outer join and create an indicator variable that explains the merge result, as pandas or Stata would do.
To be specific, I would like a _merge column after the full outer join that marks each row as left_only, right_only, or both, as in the example below.
UPDATE: I've updated the example.
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
> df1
key v1
1: a 1
2: b 2
3: c 3
4: d NA
5: e 5
> df2
key v2
1: a 4
2: b 5
3: d 6
4: f 7
# merge result I'd like to have
key v1 v2 _merge
1: a 1 4 both
2: b 2 5 both
3: c 3 NA left_only
4: d NA 6 both # <- not right_only, both
5: e 5 NA left_only
6: f NA 7 right_only
I'm wondering if I'm missing an existing data.table feature, or is there a simple way to do this task?

You can use merge.data.table with all=TRUE for a full outer join:
library(data.table)
setDT(df1)
setDT(df2)
DT <- merge(df1[, r1 := .I], df2[, r2 := .I], by="key", all=TRUE)
DT[, merge_ := "both"][
is.na(r1), merge_ := "right_only"][
is.na(r2), merge_ := "left_only"]
output (for the original example data listed below):
key v1 r1 v2 r2 merge_
1: a 1 1 4 1 both
2: b 2 2 5 2 both
3: c 3 3 NA NA left_only
4: d NA NA 6 3 right_only
data:
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
As mentioned by Michael Chirico, with data.table_1.13.0 released on Jul 24, 2020, one can also use fcase as follows:
DT[, merge_ := fcase(
is.na(r1), "right_only",
is.na(r2), "left_only",
default = "both"
)]
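Putting the pieces together for the updated example from the question, a minimal sketch might look like the following. It assumes data.table 1.13.0 or later for fcase. The helper columns r1 and r2 are the row indices used in the answer above and are dropped at the end; the indicator column is named merge_, as in the answer, rather than _merge.

```r
library(data.table)

# Updated example data from the question; data.frame() is used because
# data.table() would treat `key =` as its own argument, not as a column.
df1 <- data.frame(key = c("a", "b", "c", "d", "e"), v1 = c(1, 2, 3, NA, 5))
df2 <- data.frame(key = c("a", "b", "d", "f"), v2 = c(4, 5, 6, 7))
setDT(df1)
setDT(df2)

# Row indices record which table a row came from, independent of NAs in v1/v2
df1[, r1 := .I]
df2[, r2 := .I]

DT <- merge(df1, df2, by = "key", all = TRUE)
DT[, merge_ := fcase(
  is.na(r1), "right_only",
  is.na(r2), "left_only",
  default = "both"
)]
DT[, c("r1", "r2") := NULL]  # drop the helper columns
print(DT)
```

Because the indicator is derived from the row indices rather than from v1 or v2, key d is classified as both even though its v1 is NA.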

Related

Left join adding all rows from right table

I noticed that when updating by reference, I lose some rows from my right table if there is more than one row per join key.
No matter how much I browse the forum, I can't find how to do it. Is something escaping me?
Even with mult= it doesn't seem to work.
Because of performance and data volume constraints, I would like to keep updating by reference.
In my reprex, I expected two rows for a=2
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2,2,5), b = 23:25)
A[B, on = 'a', newvar := i.b, mult = 'all']
Thanks !!
One option is to create a list column in 'B', then do the join and assign (:=). A list column is needed because := cannot add rows to the original data.
A[B[, .(b = .(b)), a], on = .(a), newvar := i.b]
output:
> A
a b newvar
1: 1 12
2: 2 13 23,24
3: 3 14
4: 4 15
Once we have the list, it is easier to unnest
library(tidyr)
A[, unnest(.SD, newvar, keep_empty = TRUE)]
# A tibble: 5 x 3
a b newvar
<int> <int> <int>
1 1 12 NA
2 2 13 23
3 2 13 24
4 3 14 NA
5 4 15 NA
Or use a left join with merge.data.table (all.x = TRUE keeps every row of A):
merge(A, B, by = 'a', all.x = TRUE)
a b.x b.y
1: 1 12 NA
2: 2 13 23
3: 2 13 24
4: 3 14 NA
5: 4 15 NA
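If the tidyr dependency is unwanted, the same expansion can be sketched with data.table's own join syntax. This is an illustration rather than part of the original answers: B[A, on = .(a)] keeps every row of A and repeats it for each match in B, returning a new table instead of updating A by reference. B's key is written as integer here to match A's, and the renaming to newvar is just a choice for this example.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2L, 2L, 5L), b = 23:25)

# For each row of A (the i table), look up matching rows in B (the x table).
# Rows of A without a match are kept, with NA in B's columns.
res <- B[A, on = .(a)]

# In the result, `b` comes from B and `i.b` from A; rename to match the question
setnames(res, "b", "newvar")
setnames(res, "i.b", "b")
setcolorder(res, c("a", "b", "newvar"))
print(res)
```

The row for a = 5 in B does not appear, since only rows of A are retained.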

Replacing some values of a column based on some match in data.table

Let's say I have the data.table below:
library(data.table)
DT = data.table(Col1 = LETTERS[1:10], Col2 = c(1,4,2,3,6,NA,4,2, 5, 4))
DT
Col1 Col2
1: A 1
2: B 4
3: C 2
4: D 3
5: E 6
6: F NA
7: G 4
8: H 2
9: I 5
10: J 4
Now I want to replace the 4 and NA values in Col2 with 999.
In my actual scenario the DT is very large, so I am looking for the most efficient way to do this.
Any insight will be highly appreciated.
An option with na_if (from dplyr) and replace_na (from tidyr):
library(dplyr)
library(tidyr)
library(data.table)
DT[, Col2 := replace_na(na_if(Col2, 4), 999)]
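For a large table, a plain data.table subassignment avoids the extra packages and only touches the matching rows. A minimal sketch on the example data (not taken from the original answer):

```r
library(data.table)

DT <- data.table(Col1 = LETTERS[1:10], Col2 = c(1, 4, 2, 3, 6, NA, 4, 2, 5, 4))

# Select rows where Col2 is NA or equal to 4 and overwrite them by reference
DT[is.na(Col2) | Col2 == 4, Col2 := 999]
print(DT)
```

The is.na() test comes first in the condition so that the NA row evaluates to TRUE rather than NA.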

How to merge two data.tables with complementary column data in one go?

I have two data.tables, columns v2 of each one are complementary:
set.seed(1234)
v1 <- sample(1:20, 5)
v2a <- c(1:2,NA,NA,NA)
v2b <- c(NA,NA,3:5)
id <- c(letters[1:5])
library(data.table)
dt1 <- data.table(id = id, v1=v1,v2=v2a)
dt2 <- data.table(id = id, v2=v2b)
dt1
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 NA
4: d 15 NA
5: e 9 NA
dt2
id v2
1: a NA
2: b NA
3: c 3
4: d 4
5: e 5
The goal is to merge the two data.tables and have column v2 with the proper values without NA.
I got it correctly done either by:
dt <- rbindlist(list(dt1,dt2), use.names = T, fill = T)
dt <- dt[,v2:= sum(v2, na.rm = T), by = id]
dt <- dt[!is.na(v1)]
or:
dt <- merge(dt1, dt2, by = "id", all = T)
dt[, v2:=sum(v2.x, v2.y, na.rm = T), by = id][, v2.x := NULL][,v2.y := NULL]
both giving the correct desired result:
dt
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
Is there an easier/one go way to do it?
The code below updates the values of dt1$v2 where is.na(dt1$v2) == TRUE with the values of dt2$v2, matched on id.
dt1[is.na(v2), v2 := dt2[ dt1[is.na(v2),], v2, on = .(id)] ][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
A less convoluted approach uses fcoalesce(), which was introduced in data.table v1.12.4 (on CRAN 03 Oct 2019):
dt1[dt2, on = .(id), v2 := fcoalesce(x.v2, i.v2)][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
dt1[dt2, on = .(id), v2 := fcoalesce(v2, i.v2)][]
works as well because
dt1[dt2, on = .(id)]
returns
id v1 v2 i.v2
1: a 16 1 NA
2: b 5 2 NA
3: c 12 NA 3
4: d 15 NA 4
5: e 9 NA 5
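To see what fcoalesce does on its own, here is a small illustration with made-up vectors (x, y and z are hypothetical, not from the question). For each position it returns the first non-NA value among its arguments, which must all have the same type.

```r
library(data.table)

x <- c(1L, NA, NA)
y <- c(NA, 2L, NA)
z <- c(9L, 9L, 9L)

# Position 1 is taken from x, position 2 from y, position 3 falls through to z
res <- fcoalesce(x, y, z)
print(res)
```

In the join above, v2 plays the role of x and i.v2 the role of y, so each NA in dt1's column is filled from dt2.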

R: Binding columns by key variable

I want to combine two dataframes, df1 and df2, by different groups of a key variable in x1. It is basically some join operation, however, I do not want the rows to duplicate and do not care about the relationship among the added columns.
Assume:
df1:
x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7
df2:
x1 x3
A a
A b
A c
A d
A e
A f
B g
C h
The result should look like this.
df1 + df2:
x1 x2 x3
A 1 a
A 2 b
A 3 c
A NA d
A NA e
A NA f
B 4 g
B 5 NA
C 6 h
C 7 NA
Does anyone have an idea? I would most appreciate your help!
The full_join in dplyr works well for this too. See below:
# recreate your data
library(data.table)
library(dplyr)
df1 <- data.table(x1 = c("A","A","A","B","B","C","C"), x2 = seq(from = 1, to = 7))
df2 <- data.table(x1 = c("A","A","A","A","A","A","B","C"), x3 = c("a","b","c","d","e","f","g","h"))
# number the rows within each x1 group, then join on group and row number
df1[, rowid := rowid(x1)]
df2[, rowid := rowid(x1)]
df3 <- full_join(df1, df2, by = c("x1", "rowid"))
df3$rowid <- NULL
setorder(df3, x1)
To replicate your resulting data.frame you can create row ids by x1 and then merge on those row ids and x1 (but I don't really know if that is what you are trying to accomplish)
library(data.table)
df1 = read.table(text = "x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7", header = T)
df2 = read.table(text = "x1 x3
A a
A b
A c
A d
A e
A f
B g
C h", header = T)
setDT(df1)
setDT(df2)
df1[, rowid := seq(.N), by = x1] # create rowid
df2[, rowid := seq(.N), by = x1] # create rowid
merge(df1, df2, by = c("x1", "rowid"), all = T)[, rowid := NULL][]
x1 x2 x3
1: A 1 a
2: A 2 b
3: A 3 c
4: A NA d
5: A NA e
6: A NA f
7: B 4 g
8: B 5 NA
9: C 6 h
10: C 7 NA

Update join: replace with values from a column with the same name

I need to merge some data from one data.table into another. I know how to add a new column from one data.table to another in a join. Below, values from column b in df2 are added to df1 in a join on the Id column:
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, b := b][]
#> Id a b
#> 1: a 1 7
#> 2: b 2 8
#> 3: c 3 9
#> 4: d 4 NA
#> 5: e 5 NA
However, that idiom does not work when the column already exists in df1, here column a:
df1[df2, a := a][]
#> Id a
#> 1: a 1
#> 2: b 2
#> 3: c 3
#> 4: d 4
#> 5: e 5
I understand that a is not updated by this assignment because the column a already exists in df1. The reference to a on the right-hand side of the assignment resolves to that column, not the one in df2.
So how do I update the values in df1$a with those in df2$a, joining on matching Id, to get the following?
#> Id a
#> 1: a 7
#> 2: b 8
#> 3: c 9
#> 4: d 4
#> 5: e 5
From ?data.table:
When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's.
Thus, in the RHS of :=, use the i. prefix to refer to the a column in df2, i.a:
library(data.table)
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, a := i.a]
# or instead of setting keys, use `on` argument:
df1[df2, on = .(Id), a := i.a]
df1
# Id a
# <char> <int>
# 1: a 7
# 2: b 8
# 3: c 9
# 4: d 4
# 5: e 5
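The same prefix works when several columns are updated at once. As a sketch that goes slightly beyond the question, the functional form of := below overwrites a and adds b in a single join; rows of df1 without a match in df2 keep their a and get NA for b.

```r
library(data.table)

df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)

# i.a and i.b refer to df2's columns; a plain `a` would resolve to df1's own column
df1[df2, on = .(Id), `:=`(a = i.a, b = i.b)]
print(df1)
```

Writing i.b is not strictly required here, since df1 has no column b to shadow it, but using the prefix consistently keeps the intent explicit.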
