Update join: replace with values from a column with the same name - r

I need to merge some data from one data.table into another. I know how to add a new column from one data.table to another in a join. Below, the values from column b in df2 are added to df1 in a join on the Id column:
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, b := b][]
#> Id a b
#> 1: a 1 7
#> 2: b 2 8
#> 3: c 3 9
#> 4: d 4 NA
#> 5: e 5 NA
However, that idiom does not work when the column already exists in df1, here column a:
df1[df2, a := a][]
#> Id a
#> 1: a 1
#> 2: b 2
#> 3: c 3
#> 4: d 4
#> 5: e 5
I understand that a is not updated by this assignment because the column a already exists in df1. The reference to a on the right-hand side of the assignment resolves to that column, not the one in df2.
So how do I update the values in df1$a with those in df2$a in a join on matching Id, to get the following:
#> Id a
#> 1: a 7
#> 2: b 8
#> 3: c 9
#> 4: d 4
#> 5: e 5

From ?data.table:
When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's.
Thus, in the RHS of :=, use the i. prefix to refer to the a column in df2, i.a:
library(data.table)
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, a := i.a]
# or instead of setting keys, use `on` argument:
df1[df2, on = .(Id), a := i.a]
df1
# Id a
# <char> <int>
# 1: a 7
# 2: b 8
# 3: c 9
# 4: d 4
# 5: e 5
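The same prefix works for several columns in one pass. As a minimal sketch on the question's data (not part of the quoted documentation), the functional form of := can overwrite a and add b in a single update join; rows of df1 without a match in df2 are left as they are:

```r
library(data.table)

df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)

# Overwrite the existing column `a` and add the new column `b` in one update join.
# The i. prefix makes both right-hand sides refer to df2's columns.
df1[df2, on = .(Id), `:=`(a = i.a, b = i.b)]

print(df1)
```

Here b needs no prefix to resolve correctly, since df1 has no column of that name yet, but writing i.b keeps the intent explicit.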

Related

Left join adding all rows from right table

I noticed that, when updating by reference, I lose some rows from my right table if there is more than one row per join key.
However much I browse the forum, I can't find how to do it. Am I missing something?
Even with mult= it doesn't seem to work.
For performance and data-volume reasons I would like to keep updating by reference.
In my reprex, I expected two rows for a=2
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2,2,5), b = 23:25)
A[B, on = 'a', newvar := i.b, mult = 'all']
Thanks !!
One option is to collapse 'B' to one row per key with a list column and then join and assign, because := cannot add rows to the original data.
A[B[, .(b = .(b)), a], on = .(a), newvar := i.b]
Output:
> A
a b newvar
1: 1 12
2: 2 13 23,24
3: 3 14
4: 4 15
Once we have the list, it is easier to unnest
library(tidyr)
A[, unnest(.SD, newvar, keep_empty = TRUE)]
# A tibble: 5 x 3
a b newvar
<int> <int> <int>
1 1 12 NA
2 2 13 23
3 2 13 24
4 3 14 NA
5 4 15 NA
Or use a left join with merge.data.table (all.x = TRUE keeps every row of A)
merge(A, B, by = 'a', all.x = TRUE)
a b.x b.y
1: 1 12 NA
2: 2 13 23
3: 2 13 24
4: 3 14 NA
5: 4 15 NA
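If the tidyr dependency is unwanted, a plain data.table join in the other direction also keeps every matching row. This is only a sketch and it is not an update by reference: B[A, on = .(a)] looks up each row of A in B and returns a new table, with b holding B's values and i.b holding A's. The integer literals in B are an assumption made here so that both key columns have the same type.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2L, 2L, 5L), b = 23:25)  # integer keys (assumption, see above)

# For every row of A, return all matching rows of B; unmatched rows of A get NA.
# Column `b` comes from B, column `i.b` from A.
res <- B[A, on = .(a)]

print(res)
```

The key a = 5 exists only in B, so it does not appear in the result, just as in the left join above.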

A merge indicator for R data.table?

My question is related to this question, but that one asked for a dplyr solution.
What I'd like to do is perform an outer join and create an indicator variable that explains the merge result, as pandas or Stata would do.
Specifically, after a full outer join I would like a _merge column that marks each row as left_only, right_only, or both, as in the example below.
UPDATE: I've updated the example.
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
> df1
key v1
1: a 1
2: b 2
3: c 3
4: d NA
5: e 5
> df2
key v2
1: a 4
2: b 5
3: d 6
4: f 7
# merge result I'd like to have
key v1 v2 _merge
1: a 1 4 both
2: b 2 5 both
3: c 3 NA left_only
4: d NA 6 both # <- not right_only, both
5: e 5 NA left_only
6: f NA 7 right_only
I'm wondering if I'm missing an existing data.table feature, or is there a simple way to do this task?
You can use merge.data.table with all=TRUE for a full outer join:
library(data.table)
setDT(df1)
setDT(df2)
DT <- merge(df1[, r1 := .I], df2[, r2 := .I], by="key", all=TRUE)
DT[, merge_ := "both"][
is.na(r1), merge_ := "right_only"][
is.na(r2), merge_ := "left_only"]
output:
key v1 r1 v2 r2 merge_
1: a 1 1 4 1 both
2: b 2 2 5 2 both
3: c 3 3 NA NA left_only
4: d NA NA 6 3 right_only
data:
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
As mentioned by Michael Chirico, with data.table_1.13.0 released on Jul 24, 2020, one can also use fcase as follows:
DT[, merge_ := fcase(
is.na(r1), "right_only",
is.na(r2), "left_only",
default = "both"
)]
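A variant without the helper columns r1 and r2, sketched on the updated example from the question: membership of key in each input decides the label, so row d stays both even though its v1 is NA. The data is built with data.frame() and setDT() because data.table() would treat an argument named key as its own key setting.

```r
library(data.table)

df1 <- data.frame(key = c("a", "b", "c", "d", "e"), v1 = c(1, 2, 3, NA, 5))
df2 <- data.frame(key = c("a", "b", "d", "f"), v2 = c(4, 5, 6, 7))
setDT(df1)
setDT(df2)

# Full outer join on `key`.
DT <- merge(df1, df2, by = "key", all = TRUE)

# Label each row by which input tables contain its key (needs fcase, data.table >= 1.13.0).
DT[, merge_ := fcase(
  !key %in% df1$key, "right_only",
  !key %in% df2$key, "left_only",
  default = "both"
)]

print(DT)
```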

How to merge two data.tables with complementary column data in one go?

I have two data.tables, columns v2 of each one are complementary:
set.seed(1234)
v1 <- sample(1:20, 5)
v2a <- c(1:2,NA,NA,NA)
v2b <- c(NA,NA,3:5)
id <- c(letters[1:5])
library(data.table)
dt1 <- data.table(id = id, v1=v1,v2=v2a)
dt2 <- data.table(id = id, v2=v2b)
dt1
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 NA
4: d 15 NA
5: e 9 NA
dt2
id v2
1: a NA
2: b NA
3: c 3
4: d 4
5: e 5
The goal is to merge the two data.tables and have column v2 with the proper values without NA.
I got it correctly done either by:
dt <- rbindlist(list(dt1, dt2), use.names = TRUE, fill = TRUE)
dt <- dt[, v2 := sum(v2, na.rm = TRUE), by = id]
dt <- dt[!is.na(v1)]
or:
dt <- merge(dt1, dt2, by = "id", all = TRUE)
dt[, v2 := sum(v2.x, v2.y, na.rm = TRUE), by = id][, v2.x := NULL][, v2.y := NULL]
both giving the correct desired result:
dt
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
Is there an easier/one go way to do it?
The code below updates the values of dt1$v2 where is.na(dt1$v2) == TRUE with the values of dt2$v2, matched on id.
dt1[is.na(v2), v2 := dt2[ dt1[is.na(v2),], v2, on = .(id)] ][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
There is another, less convoluted approach which uses the fcoalesce() function which was introduced with data.table v1.12.4 (on CRAN 03 Oct 2019):
dt1[dt2, on = .(id), v2 := fcoalesce(x.v2, i.v2)][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
dt1[dt2, on = .(id), v2 := fcoalesce(v2, i.v2)][]
works as well because
dt1[dt2, on = .(id)]
returns
id v1 v2 i.v2
1: a 16 1 NA
2: b 5 2 NA
3: c 12 NA 3
4: d 15 NA 4
5: e 9 NA 5
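To see why the join works, it may help to look at fcoalesce() on plain vectors: for each position it returns the first non-missing value among its arguments. A small sketch with the two v2 columns from above (x and y are illustrative names):

```r
library(data.table)

x <- c(1L, 2L, NA, NA, NA)  # plays the role of dt1$v2
y <- c(NA, NA, 3L, 4L, 5L)  # plays the role of dt2$v2

# Take x where it is not NA, otherwise fall back to y.
filled <- fcoalesce(x, y)

print(filled)
```

Both vectors must have the same type, which holds here because both are integer.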

R: Binding columns by key variable

I want to combine two dataframes, df1 and df2, by different groups of a key variable in x1. It is basically some join operation, however, I do not want the rows to duplicate and do not care about the relationship among the added columns.
Assume:
df1:
x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7
df2:
x1 x3
A a
A b
A c
A d
A e
A f
B g
C h
The result should look like this.
df1 + df2:
x1 x2 x3
A 1 a
A 2 b
A 3 c
A NA d
A NA e
A NA f
B 4 g
B 5 NA
C 6 h
C 7 NA
Does anyone have an idea? I would most appreciate your help!
The full_join in dplyr works well for this too. See below:
# recreate your data
library(data.table)
library(dplyr)
df1 <- data.table(x1 = c("A","A","A","B","B","C","C"), x2 = seq(from = 1, to = 7))
df2 <- data.table(x1 = c("A","A","A","A","A","A","B","C"), x3 = c("a","b","c","d","e","f","g","h"))
# number the rows within each x1 group, then join on group and row number
df1[, rowid := rowid(x1)]
df2[, rowid := rowid(x1)]
df3 <- full_join(df1, df2, by = c("x1", "rowid"))
df3$rowid <- NULL
setorder(df3, x1)
To replicate your resulting data.frame you can create row ids by x1 and then merge on those row ids and x1 (but I don't really know if that is what you are trying to accomplish)
library(data.table)
df1 = read.table(text = "x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7", header = T)
df2 = read.table(text = "x1 x3
A a
A b
A c
A d
A e
A f
B g
C h", header = T)
setDT(df1)
setDT(df2)
df1[, rowid := seq(.N), by = x1] # create rowid
df2[, rowid := seq(.N), by = x1] # create rowid
merge(df1, df2, by = c("x1", "rowid"), all = T)[, rowid := NULL][]
x1 x2 x3
1: A 1 a
2: A 2 b
3: A 3 c
4: A NA d
5: A NA e
6: A NA f
7: B 4 g
8: B 5 NA
9: C 6 h
10: C 7 NA
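Both answers rest on the same per-group row counter. As a small sketch, here it is on its own for the x1 column of df1, computed once with rowid() and once with a grouped seq_len(.N) (ids_rowid and ids_seq are names chosen for this illustration):

```r
library(data.table)

x1 <- c("A", "A", "A", "B", "B", "C", "C")

# rowid() numbers the occurrences of each value in order of appearance.
ids_rowid <- rowid(x1)

# The grouped equivalent: count 1..N within each x1 group.
dt <- data.table(x1 = x1)
dt[, ids_seq := seq_len(.N), by = x1]

print(ids_rowid)
print(dt)
```

Joining on x1 plus this counter pairs the first A with the first A, the second with the second, and so on, which is what prevents the rows from multiplying.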

Join single variable to multiple variables in r data.table

DT1 holds the mapping for all IDs, DT2 holds relationships between IDs.
I'd like to join the mappings from DT1 directly to both IDs in DT2.
Example datasets below:
# Join a mapping to multiple variables.
library(data.table)
# Dataset with mappings.
set.seed(1)
dt1 <- data.table(id=1:10,
group=sample(letters[1:4], 10, replace=TRUE))
# > dt1
# id group
# 1: 1 b
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 a
# 6: 6 d
# 7: 7 d
# 8: 8 c
# 9: 9 c
# 10: 10 a
# Dataset with relationship between IDs.
dt2 <- data.table(id1=1:5,
id2=6:10)
# > dt2
# id1 id2
# 1: 1 6
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 10
I could of course use two joins, first on ID1, then on ID2. Another way of achieving what I want is first melting DT2, so all the ID values are a single variable before joining...
# Now melt, join group variable of DT1 to DT2, then cast again to obtain
# original structure.
dt2[, i := .I] # need an observation ID
dt2Long <- melt(dt2, id="i")
setkey(dt2Long, value)
dcast(dt2Long[dt1], i ~ variable, value.var=c("value", "group"))
# i value_id1 value_id2 group_id1 group_id2
# 1: 1 1 6 b d
# 2: 2 2 7 b d
# 3: 3 3 8 c c
# 4: 4 4 9 d c
# 5: 5 5 10 a a
This gives the desired result, but I would like to know if something like the following is possible (i.e. merging a single variable with two variables)?
setkey(dt1, id)
dt1[dt2, on=c("id1", "id2")]
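A sketch of the two-join route mentioned above, written as update joins so that dt2 keeps its shape and no melt/dcast round trip is needed. The group values are typed in from the printed dt1 rather than regenerated with sample(), since its output can differ between R versions; group_id1 and group_id2 are names chosen here to match the dcast output.

```r
library(data.table)

# Mapping table, values copied from the printed dt1 above.
dt1 <- data.table(id = 1:10,
                  group = c("b", "b", "c", "d", "a", "d", "d", "c", "c", "a"))
dt2 <- data.table(id1 = 1:5, id2 = 6:10)

# One update join per id column: match dt2's column against dt1$id
# and copy the group over by reference.
dt2[dt1, on = .(id1 = id), group_id1 := i.group]
dt2[dt1, on = .(id2 = id), group_id2 := i.group]

print(dt2)
```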

Resources