I noticed that, when updating by reference, I lose some rows from my right table if there is more than one row per join key.
However much I browse the forum, I can't find how to do it. Am I missing something?
Even with mult= it doesn't seem to work.
Because of performance and data volume constraints, I would like to keep updating by reference.
In my reprex, I expected two rows for a = 2:
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2,2,5), b = 23:25)
A[B, on = 'a', newvar := i.b, mult = 'all']
Thanks !!
One option is to create a list column in 'B', then do the join and assign (:=). The list column is needed because := cannot add rows to the original data.
A[B[, .(b = .(b)), a], on = .(a), newvar := i.b]
-output
> A
a b newvar
1: 1 12
2: 2 13 23,24
3: 3 14
4: 4 15
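To see why the list column is needed, here is a minimal sketch of what the plain update join does with the question's data (B$a is written as integer here, which is my own simplification). Since := writes into the existing rows of A, A keeps its four rows; when several rows of B match the same row of A, only one value can be stored, and as far as I know it is the last match that sticks.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2L, 2L, 5L), b = 23:25)

# Update join: assigns into A's existing rows, never adds rows
A[B, on = "a", newvar := i.b]

nrow(A)    # the row count of A is unchanged by the update
A[a == 2]  # a single row, holding only one of the two matching values
```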
Once we have the list, it is easier to unnest
library(tidyr)
A[, unnest(.SD, newvar, keep_empty = TRUE)]
# A tibble: 5 x 3
a b newvar
<int> <int> <int>
1 1 12 NA
2 2 13 23
3 2 13 24
4 3 14 NA
5 4 15 NA
Or use a left join with merge.data.table (all.x = TRUE keeps every row of A and repeats the rows that have several matches in B)
merge(A, B, by = 'a', all.x = TRUE)
a b.x b.y
1: 1 12 NA
2: 2 13 23
3: 2 13 24
4: 3 14 NA
5: 4 15 NA
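If the extra rows are wanted while staying with data.table's own join syntax, one sketch is to put A in i. This returns a new table with one row per match, so it is no longer an update by reference. The integer keys and the name `res` are my own choices for the illustration.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2L, 2L, 5L), b = 23:25)

# For every row of A, look up all matching rows of B.
# Unprefixed b is B's column; the i. prefix refers to A's columns.
res <- B[A, on = .(a), .(a = i.a, b = i.b, newvar = b)]
res
```

Rows of A without a match are kept with newvar set to NA, and the row for a = 2 appears once per matching row of B.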
My question is related to this question, but that one asked for a dplyr solution.
What I'd like to do is perform an outer join and create an indicator variable that explains the merge result, as pandas or Stata would do.
To be specific, I would like to have a _merge column after the full outer join that indicates the merge result with left_only, right_only or both, as in the example below.
UPDATE: I've updated the example.
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
> df1
key v1
1: a 1
2: b 2
3: c 3
4: d NA
5: e 5
> df2
key v2
1: a 4
2: b 5
3: d 6
4: f 7
# merge result I'd like to have
key v1 v2 _merge
1: a 1 4 both
2: b 2 5 both
3: c 3 NA left_only
4: d NA 6 both # <- not right_only, both
5: e 5 NA left_only
6: f NA 7 right_only
I'm wondering if I'm missing an existing data.table feature, or is there a simple way to do this task?
You can use merge.data.table with all=TRUE for a full outer join:
library(data.table)
setDT(df1)
setDT(df2)
DT <- merge(df1[, r1 := .I], df2[, r2 := .I], by="key", all=TRUE)
DT[, merge_ := "both"][
is.na(r1), merge_ := "right_only"][
is.na(r2), merge_ := "left_only"]
output:
key v1 r1 v2 r2 merge_
1: a 1 1 4 1 both
2: b 2 2 5 2 both
3: c 3 3 NA NA left_only
4: d NA NA 6 3 right_only
data:
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
As mentioned by Michael Chirico, with data.table_1.13.0 released on Jul 24, 2020, one can also use fcase as follows:
DT[, merge_ := fcase(
is.na(r1), "right_only",
is.na(r2), "left_only",
default = "both"
)]
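As a sketch of how this plays out on the data from the updated question, the same steps can be run end to end. The row-index columns r1 and r2 record whether a key was present on each side, which is what lets key d come out as both even though its v1 is NA. I build the inputs with data.frame() plus setDT() as above, because key is also the name of an argument of data.table().

```r
library(data.table)

# Data from the updated question
df1 <- data.frame(key = c("a", "b", "c", "d", "e"), v1 = c(1, 2, 3, NA, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(key = c("a", "b", "d", "f"), v2 = c(4, 5, 6, 7),
                  stringsAsFactors = FALSE)
setDT(df1)
setDT(df2)

# Row-index markers record which side each row came from
DT <- merge(df1[, r1 := .I], df2[, r2 := .I], by = "key", all = TRUE)

# Classify each row from the markers, not from the value columns
DT[, merge_ := fcase(
  is.na(r1), "right_only",
  is.na(r2), "left_only",
  default = "both"
)]
DT[, c("r1", "r2") := NULL][]
```

Note that df1 and df2 keep the helper columns r1 and r2 afterwards, because := modifies them by reference.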
I have two data.tables, columns v2 of each one are complementary:
set.seed(1234)
v1 <- sample(1:20, 5)
v2a <- c(1:2,NA,NA,NA)
v2b <- c(NA,NA,3:5)
id <- c(letters[1:5])
library(data.table)
dt1 <- data.table(id = id, v1=v1,v2=v2a)
dt2 <- data.table(id = id, v2=v2b)
dt1
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 NA
4: d 15 NA
5: e 9 NA
dt2
id v2
1: a NA
2: b NA
3: c 3
4: d 4
5: e 5
The goal is to merge the two data.tables and have column v2 with the proper values without NA.
I got it correctly done either by:
dt <- rbindlist(list(dt1,dt2), use.names = T, fill = T)
dt <- dt[,v2:= sum(v2, na.rm = T), by = id]
dt <- dt[!is.na(v1)]
or:
dt <- merge(dt1, dt2, by = "id", all = T)
dt[, v2:=sum(v2.x, v2.y, na.rm = T), by = id][, v2.x := NULL][,v2.y := NULL]
both giving the correct desired result:
dt
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
Is there an easier/one go way to do it?
The code below updates the values of dt1$v2 where is.na(dt1$v2) == TRUE with the values of dt2$v2, matched on id.
dt1[is.na(v2), v2 := dt2[ dt1[is.na(v2),], v2, on = .(id)] ][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
A less convoluted approach uses the fcoalesce() function, which was introduced in data.table v1.12.4 (on CRAN 03 Oct 2019):
dt1[dt2, on = .(id), v2 := fcoalesce(x.v2, i.v2)][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
dt1[dt2, on = .(id), v2 := fcoalesce(v2, i.v2)][]
works as well because
dt1[dt2, on = .(id)]
returns
id v1 v2 i.v2
1: a 16 1 NA
2: b 5 2 NA
3: c 12 NA 3
4: d 15 NA 4
5: e 9 NA 5
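To make the role of fcoalesce() concrete, here is a small sketch on plain vectors that mirror the v2 and i.v2 columns above (the names x_v2, i_v2 and res are mine). For each position it takes the first argument's value unless that is NA, in which case it falls back to the second.

```r
library(data.table)

x_v2 <- c(1L, 2L, NA, NA, NA)   # plays the role of dt1$v2
i_v2 <- c(NA, NA, 3L, 4L, 5L)   # plays the role of dt2$v2

# Element-wise: keep x_v2 where it is not NA, otherwise use i_v2
res <- fcoalesce(x_v2, i_v2)
res
```

Both vectors need to have the same type; mixing, say, integer and character would not be a valid call.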
I have a dt:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5), b = c(4,5,6,7,8), c = c("X","X","X","Y","Y") )
I want to add a column d, computed within each group of column c:
in the first row of each group, d should equal b[i];
from the second row to the last row of each group, d should be d[i-1] + 2*b[i]
Intended results:
a b c d
1: 1 4 X 4
2: 2 5 X 14
3: 3 6 X 26
4: 4 7 Y 7
5: 5 8 Y 23
I tried to use functions such as shift, but I struggle to update rows dynamically (so to speak) here.
Is there an elegant data.table-style solution?
We can use cumsum on 2 * b and subtract the group's first value of b, b[1], so that the first row is counted once instead of twice:
DT[, d := cumsum(2 * b) - b[1], .(c)][]
#> a b c d
#> 1: 1 4 X 4
#> 2: 2 5 X 14
#> 3: 3 6 X 26
#> 4: 4 7 Y 7
#> 5: 5 8 Y 23
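The reason this works is that unrolling d[i] = d[i-1] + 2*b[i] with d[1] = b[1] gives d[i] = 2 * (b[1] + ... + b[i]) - b[1]. As a sanity check, the sketch below compares that closed form with a step-by-step evaluation of the recurrence using base R's Reduce; the column names d_closed and d_step are only for this illustration.

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3, 4, 5), b = c(4, 5, 6, 7, 8),
                 c = c("X", "X", "X", "Y", "Y"))

# Closed form: cumulative sum of 2 * b, minus the first b of the group
DT[, d_closed := cumsum(2 * b) - b[1], by = "c"]

# Step-by-step recurrence: start at b[1], then add 2 * b[i] at each step
DT[, d_step := Reduce(function(prev, cur) prev + 2 * cur, b, accumulate = TRUE),
   by = "c"]

all.equal(DT$d_closed, DT$d_step)
```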
Here we can use accumulate
library(purrr)
library(data.table)
DT[, d := accumulate(b, ~ .x + 2 *.y), by = c]
Or with Reduce and accumulate = TRUE from base R
DT[, d := Reduce(function(x, y) x + 2 * y, b, accumulate = TRUE), by = c]
This post is related to the previous post here: match rows of two data.tables to fill subset of a data.table
I'm not sure how to combine the two.
In my situation, besides colB being NA in DT1, a couple more conditions have to hold for a row to be merged, and my attempt at that doesn't work.
> DT1 <- data.table(colA = c(1,1, 2,2,2,3,3), colB = c('A', NA, 'AA', 'B', NA, 'A', 'C'), timeA = c(2,4,3,4,6,1,4))
> DT1
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
> DT2 <- data.table(colC = c(1,1,1,2,2,3), timeB1 = c(1,3,6, 2,4, 1), timeB2 = c(2,5,7,3,5,4), colD = c('Z', 'YY', 'AB', 'JJ', 'F', 'RR'))
> DT2
colC timeB1 timeB2 colD
1: 1 1 2 Z
2: 1 3 5 YY
3: 1 6 7 AB
4: 2 2 3 JJ
5: 2 4 5 F
6: 3 1 4 RR
Following the same approach as in that post, I'd like to merge colD of DT2 into colB of DT1 only where colB in DT1 is NA, AND use only the values of colD for which timeA in DT1 lies between timeB1 and timeB2 in DT2. I tried the following, but the merge doesn't happen:
> output <- DT1[DT2, on = .(colA = colC), colB := ifelse(is.na(x.colB) & i.timeB1 <= x.timeA & x.timeA <= i.timeB2, i.colD, x.colB)]
> output
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
Nothing changes in output.
This is my desired output:
> desired_output
colA colB timeA
1: 1 A 2
2: 1 YY 4 --> should find a match
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6 --> shouldn't find a match
6: 3 A 1
7: 3 C 4
why doesn't this work?
I'd like to use data.table operations only without using additional packages.
An in place update of the colB in DT1 would work as follows:
DT1[is.na(colB), colB := DT2[DT1[is.na(colB)],
on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
print(DT1)
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
This indexes the values where colB is NA and after a join on the condition, as defined in on= ..., replaces the missing values by the matching values found in colD.
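To look at the intermediate step on its own, the inner join can be run separately, as in the sketch below (the names todo and looked_up are mine). It returns one row per NA row of DT1, with NA in colD where no interval of DT2 contains timeA. This lines up with the subset being assigned to only because each NA row matches at most one interval here; I am assuming the intervals within a colC group do not overlap.

```r
library(data.table)

DT1 <- data.table(colA = c(1, 1, 2, 2, 2, 3, 3),
                  colB = c("A", NA, "AA", "B", NA, "A", "C"),
                  timeA = c(2, 4, 3, 4, 6, 1, 4))
DT2 <- data.table(colC = c(1, 1, 1, 2, 2, 3),
                  timeB1 = c(1, 3, 6, 2, 4, 1),
                  timeB2 = c(2, 5, 7, 3, 5, 4),
                  colD = c("Z", "YY", "AB", "JJ", "F", "RR"))

# Rows of DT1 that still need a value
todo <- DT1[is.na(colB)]

# Non-equi join: look up the interval of DT2 that contains each timeA
looked_up <- DT2[todo, on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA)]
looked_up$colD
```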
Possibly not the shortest answer, but it gets the job done. I'm no data.table expert, so I welcome improvements and suggestions.
DT1[ is.na(colB), colB := DT1[ is.na(colB), ][ DT2, colB := i.colD, on = c( "colA == colC", "timeA >= timeB1", "timeA <= timeB2")]$colB]
What it does:
first, subset DT1 to the rows where is.na(colB) is TRUE
then, update the value of colB in these rows with the colB vector that results from a non-equi join of the same subset of rows with DT2
A bonus is that DT1 is changed by reference, so it should be fast and memory efficient on large data (I think).
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
I need to merge some data from one data.table into another. I know how to add a new column from one data.table to another in a join. Below, the values from column b in df2 are added to df1 in a join on the Id column:
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, b := b][]
#> Id a b
#> 1: a 1 7
#> 2: b 2 8
#> 3: c 3 9
#> 4: d 4 NA
#> 5: e 5 NA
However, that idiom does not work when the column already exists in df1, here column a:
df1[df2, a := a][]
#> Id a
#> 1: a 1
#> 2: b 2
#> 3: c 3
#> 4: d 4
#> 5: e 5
I understand that a is not updated by this assignment because the column a already exists in df1. The reference to a on the right-hand side of the assignment resolves to that column, not the one in df2.
So how to update values in df1$a with those in df2$a in a join on matching id to get the following:
#> Id a
#> 1: a 7
#> 2: b 8
#> 3: c 9
#> 4: d 4
#> 5: e 5
From ?data.table:
When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's.
Thus, in the RHS of :=, use the i. prefix to refer to the a column in df2, i.a:
library(data.table)
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, a := i.a]
# or instead of setting keys, use `on` argument:
df1[df2, on = .(Id), a := i.a]
df1
# Id a
# <char> <int>
# 1: a 7
# 2: b 8
# 3: c 9
# 4: d 4
# 5: e 5