How to join tables with OR condition in data.table

Is this possible in data.table to join tables with OR condition?
For example:
library(data.table)
X<-data.table(x=c('a','b','c','d','e','f'),y=c(1,1,2,2,3,3),z=c(10,11,12,13,14,15))
x y z
1: a 1 10
2: b 1 11
3: c 2 12
4: d 2 13
5: e 3 14
6: f 3 15
Y<-data.table(x=c('a','e','a'),z=c(12,20,14),t=c('a','b','c'))
x z t
1: a 12 a
2: e 20 b
3: a 14 c
# and I need something like this:
X[Y,on=c("x"|"z"),.(x,y,z,i.t)]
x y z t
1: a 1 10 a
2: a 1 10 c
3: b 1 11 NA
4: c 2 12 a
5: d 2 13 NA
6: e 3 14 b
7: e 3 14 c
8: f 3 15 NA
I haven't found information about joining with OR in documentation.
Have I missed something?

The OP requested that the result set consist of 3 subsets:
rows matching on column x
rows matching on column z
remaining rows of data.table X
So, this is a kind of right outer join of table X with Y on either column x or column z.
This can be translated into 2 separate inner joins on columns x and z respectively, a union of both result sets, and a final outer join to add the remaining rows from table X.
Combined in one data.table statement this becomes
unique(rbindlist(list(
  X[Y, on = "x", .(x, y, z, t), nomatch = 0],
  X[Y, on = "z", .(x, y, z, t), nomatch = 0]
)))[X, on = .(x, y, z)]
# x y z t
#1: a 1 10 a
#2: a 1 10 c
#3: b 1 11 NA
#4: c 2 12 a
#5: d 2 13 NA
#6: e 3 14 b
#7: e 3 14 c
#8: f 3 15 NA
The inner joins are enforced by the parameter nomatch = 0. The union operation is implemented using rbindlist(list(...)). EDIT: unique() is required to remove double matches in cases where x and z match in the same row of X and of Y (thanks to filius_arator for pointing this out).
The final right outer join uses all rows of X, including those which haven't been matched yet. Note that this join is on all three columns of X.
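The two-joins-plus-union pattern can also be written as a loop over the OR columns, which generalizes to more than two conditions. A minimal sketch (the select list in j is hardcoded to this example's columns; `matched` and `result` are just illustrative names):

```r
library(data.table)

X <- data.table(x = c("a","b","c","d","e","f"), y = c(1,1,2,2,3,3),
                z = c(10,11,12,13,14,15))
Y <- data.table(x = c("a","e","a"), z = c(12,20,14), t = c("a","b","c"))

# One inner join per OR column, then a union, then a right join back to X
matched <- rbindlist(lapply(c("x", "z"), function(cl)
  X[Y, on = cl, .(x, y, z, t), nomatch = 0L]
))
result <- unique(matched)[X, on = .(x, y, z)]
result
```

Adding a third OR column is then just a matter of extending the character vector passed to lapply.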

I am not sure if this is what you want or if it is very data.table-esque, but there are no other answers at the moment:
join1 <- merge(X, Y[,c('x', 't'), with=FALSE], all.x=TRUE)
merge(join1, Y[,c('z', 't'), with=FALSE], all.x=TRUE, by = 'z')[,
t := ifelse(!is.na(t.x), t.x, t.y)][,
t.x := NULL][,
t.y := NULL][]
Giving:
z x y t
1: 10 a 1 a
2: 11 b 1 NA
3: 12 c 2 a
4: 13 d 2 NA
5: 14 e 3 b
6: 15 f 3 NA
EDIT: with the updated example, here's an approach, but I'm sure there are better ways that the data.table gurus could show:
join1 <- merge(X, Y[,c('x', 't'), with=FALSE], all.x=TRUE)
merge(join1, Y[,c('z', 't'), with=FALSE], all.x=TRUE, by = 'z')[,
id := seq(.N)][,
.(t =list( na.omit(c(t.x, t.y)))), by = c('id', 'x', 'y', 'z')][,
.(x=x, y=y, z=z, t=unlist(t)), by = c('id')][]
## id x y z t
## 1: 1 a 1 10 a
## 2: 2 a 1 10 c
## 3: 3 b 1 11 NA
## 4: 4 c 2 12 a
## 5: 5 d 2 13 NA
## 6: 6 e 3 14 b
## 7: 6 e 3 14 c
## 8: 7 f 3 15 NA

Related

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as a character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works fine.
Can anyone help with sorting using a character vector of column names?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
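setorderv also has an order argument (one entry per column: 1 for ascending, -1 for descending), which makes mixed-direction sorts possible with the same character vector; a small sketch:

```r
library(data.table)
DT <- data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
keycol <- c("x", "y")

# Ascending by x, descending by y; `order` pairs positionally with `cols`
setorderv(DT, keycol, order = c(1L, -1L))
DT
```

As with plain setorderv(DT, keycol), this updates DT by reference, so no reassignment is needed.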

R - unpivot list in data.table rows

I have a dataset that contains several columns, including 1 with list entries:
DT = data.table(
x = c(1:5),
y = seq(2, 10, 2),
z = list(list("a","b","a"), list("a","c"), list("b","c"), list("a","b","c"), list("b","c","b"))
)
Basically, I'm trying to unlist a, b, c from column z, and aggregate the data based on the x & y values.
Desired output:
z x sum(y)
1: a 1 4
2: b 1 2
3: a 2 4
4: c 2 4
5: b 3 6
6: c 3 6
7: a 4 8
8: b 4 8
9: c 4 8
10: b 5 20
11: c 5 10
My current method is rather roundabout: I created 2 other columns with the x and y values in lists of the same length as the list entry in the z column, then unlisted all 3 columns simultaneously before aggregating, i.e. summing the y values grouped by z & x.
Code (before unlisting & aggregation):
DT[, listlen := sapply(z, function(x) length(x))]
for (a in seq_len(nrow(DT))) {
  DT[a, x1 := list(list(rep(DT[a, x], DT[a, listlen])))]
  DT[a, y1 := list(list(rep(DT[a, y], DT[a, listlen])))]
}
DT_out = data.table(x = unlist(DT[,x1]), y = unlist(DT[,y1]), z = unlist(DT[,z]))
x y z listlen x1 y1
1: 1 2 <list> 3 1,1,1 2,2,2
2: 2 4 <list> 2 2,2 4,4
3: 3 6 <list> 2 3,3 6,6
4: 4 8 <list> 3 4,4,4 8,8,8
5: 5 10 <list> 3 5,5,5 10,10,10
Is there a method in the data.table or reshape packages that can help me melt the dataset / do this much more simply? I'm working with a lot more rows than this, and this step seems very inefficient.
Any other help regarding the aggregation step would be much appreciated too!
unlist your z column first and then just aggregate as per normal via by=:
DT[, .(z=unlist(z)), by=.(x,y)][, .(sumy=sum(y)), by=.(x,z)]
# x z sumy
# 1: 1 a 4
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 c 4
# 5: 3 b 6
# 6: 3 c 6
# 7: 4 a 8
# 8: 4 b 8
# 9: 4 c 8
#10: 5 b 20
#11: 5 c 10
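If grouping only in order to unlist feels indirect, another option is to expand the rows with rep()/lengths() before flattening. A sketch, assuming every z entry is a non-empty list (`flat` and `agg` are just illustrative names):

```r
library(data.table)
DT <- data.table(
  x = 1:5,
  y = seq(2, 10, 2),
  z = list(list("a","b","a"), list("a","c"), list("b","c"),
           list("a","b","c"), list("b","c","b"))
)

# Repeat each row once per element of its z entry, then flatten z in one shot
flat <- DT[rep(seq_len(nrow(DT)), lengths(DT$z))]
flat[, z := unlist(DT$z)]
agg <- flat[, .(sumy = sum(y)), by = .(x, z)]
agg
```

This avoids the per-row for loop from the question entirely, since rep() and lengths() are both vectorized.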

Update a data.table based on another data table

I want to update the columns in an old data.table based on a new data.table only when the value is not NA.
DT_old = data.table(x=rep(c("a","b","c")), y=c(1,3,6), v=1:3, l=c(1,1,1))
DT_old
x y v l
1: a 1 1 1
2: b 3 2 1
3: c 6 3 1
DT_new = data.table(x=rep(c("b","c",'d')), y=c(9,6,10), v=c(2,NA,10), z=c(9,9,9))
DT_new
x y v z
1: b 9 2 9
2: c 6 NA 9
3: d 10 10 9
I want the output to be
x y v z
1: b 9 2 9
2: c 6 3 9
3: d 10 10 9
4: a 1 1 NA
Currently I am merging the two data.tables and going through each column, replacing the NAs in the new data.table:
DT_merged <- merge(DT_new, DT_old, all=TRUE, by='x')
DT_merged
x y.x v.x z y.y v.y l
1: a NA NA NA 1 1 1
2: b 9 2 9 3 2 1
3: c 6 NA 9 6 3 1
4: d 10 10 9 NA NA NA
DT_merged[is.na(y.x), y.x := y.y]
DT_merged[is.na(v.x), v.x := v.y]
DT_merged = DT_merged[, list(x, y = y.x, v = v.x, z = z)]
Is there a better way to do the above?
Here's how I would approach this. First, I will expand DT_new according to the unique combination of x values from both tables, using a binary join
res <- setkey(DT_new, x)[unique(c(DT_new$x, DT_old$x))]
res
# x y v z
# 1: b 9 2 9
# 2: c 6 NA 9
# 3: d 10 10 9
# 4: a NA NA NA
Then, I will update the two columns in res by reference using another binary join
setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
# x y v z
# 1: a 1 1 NA
# 2: b 3 2 9
# 3: c 6 3 9
# 4: d 10 10 9
Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such a thing in R or any language AFAIK. Thus, your own solution could be a good option by itself.
Though, here are some other alternatives, mainly taken from a similar question I myself asked not long ago.
Using two ifelse statements
setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
v = ifelse(is.na(v), i.v, v))]
Two separate conditional joins
setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]
Both will give you what you need.
P.S.
If you don't want warnings, you need to define the corresponding column classes correctly, e.g. the v column in DT_new should be defined as v = c(2L, NA_integer_, 10L)
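For what it's worth, newer versions of data.table (1.12.4+, to the best of my recollection) ship fcoalesce(), which can replace the ifelse pair. Note the as.double() coercion, needed because fcoalesce is strict about types, as the P.S. about column classes hints:

```r
library(data.table)
DT_old <- data.table(x = c("a","b","c"), y = c(1,3,6), v = 1:3, l = c(1,1,1))
DT_new <- data.table(x = c("b","c","d"), y = c(9,6,10), v = c(2, NA, 10), z = c(9,9,9))

# Expand DT_new to all x values from both tables, then fill NAs from DT_old
res <- DT_new[.(unique(c(DT_new$x, DT_old$x))), on = "x"]
res[DT_old, on = "x", `:=`(y = fcoalesce(y, i.y),
                           v = fcoalesce(v, as.double(i.v)))]
res
```

fcoalesce keeps the first non-NA value per row, so existing values in res are never overwritten, unlike the plain y = i.y, v = i.v update.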

Eliminate all duplicates according to a key + keep records from a table which are not in another table

Question 1:
I am relatively new to R and I have two distinct questions.
I need to eliminate duplicates according to a key (single or multiple), but all occurrences of them, so unique wouldn't do it. I also found the function duplicated, but it marks only the second occurrence onward as TRUE, while I need to eliminate every occurrence.
> DT <- data.table(Key=c("a","a","a","b","c"),var=c(1:5))
> DT
Key var
1: a 1
2: a 2
3: a 3
4: b 4
5: c 5
> unique(DT)
Key var
1: a 1
2: b 4
3: c 5
> duplicated(DT)
[1] FALSE TRUE TRUE FALSE FALSE
what I want instead is
Key var
1: b 4
2: c 5
Question 2:
I have 2 data.tables and I want to keep only the records from DTFrom for which the combination of values from the 2 (or more) keys is not in DTFilter (I found similar questions for SQL but not for R):
> DTFrom
key1 key2 var
1: q m 1
2: q n 2
3: q b 3
4: w n 4
5: e m 5
6: e n 6
7: e b 7
8: r n 8
9: r b 9
10: t m 10
11: t n 11
12: t b 12
13: t v 13
> DTFilter
key1 key2 var
1: q m 1
2: w n 4
3: e b 7
4: e n 6
5: r n 8
6: r b 9
7: t m 10
8: t v 13
and I want the result to be:
> DTOut
key1 key2 var
1: q n 2
2: q b 3
3: e m 5
4: t n 11
5: t b 12
Thanks in advance!
For the first question, you can use the fromLast argument in duplicated:
DT[ !(duplicated(Key) | duplicated(Key, fromLast = TRUE))]
# Key var
#1: b 4
#2: c 5
For the second question, you can do:
setkey(DTFrom, key1, key2)
DTFrom[!DTFilter]
# key1 key2 var
#1: e m 5
#2: q b 3
#3: q n 2
#4: t b 12
#5: t n 11
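In recent data.table versions (on= was added in 1.9.6), the same anti-join works without setting keys, and it preserves DTFrom's original row order; a sketch:

```r
library(data.table)
DTFrom <- data.table(
  key1 = c("q","q","q","w","e","e","e","r","r","t","t","t","t"),
  key2 = c("m","n","b","n","m","n","b","n","b","m","n","b","v"),
  var  = 1:13
)
DTFilter <- data.table(
  key1 = c("q","w","e","e","r","r","t","t"),
  key2 = c("m","n","b","n","n","b","m","v"),
  var  = c(1,4,7,6,8,9,10,13)
)

# Anti-join: rows of DTFrom whose (key1, key2) pair is absent from DTFilter
out <- DTFrom[!DTFilter, on = .(key1, key2)]
out
```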
As for the first question, you can use the table function for additional filtering:
DT[!duplicated(Key)][table(DT$Key) == 1,]
# Key var
# 1: b 4
# 2: c 5
As for the second question, there is the anti_join function in the dplyr package specifically for this case:
require("dplyr")
anti_join(DTFrom, DTFilter, by = c("key1", "key2"))
# key1 key2 var
# 1 e m 5
# 2 q b 3
# 3 q n 2
# 4 t n 11
# 5 t b 12
My preferred method for the first question is:
DT[.(DT[ , .N, by=key(DT)][N==1L, !"N"])]
Similarly we can do:
DT[.(DT[, .N, by=key(DT)][N==1L, ]), .SD]
#docendodiscimus 's answer for Q2 is also mine.
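A keyless variant of the same group-size idea filters directly in j, so no setkey() is required; a minimal sketch:

```r
library(data.table)
DT <- data.table(Key = c("a","a","a","b","c"), var = 1:5)

# .N is the row count of the current group; keep only singleton groups
singles <- DT[, if (.N == 1L) .SD, by = Key]
singles
```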

get rows of unique values by group

I have a data.table and want to pick those rows of the data.table where the values of a variable x are unique relative to another variable y.
It's possible to get the unique values of x, grouped by y, in a separate dataset, like this:
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
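duplicated.data.table also accepts a by argument, so the key (and the reordering it causes) can be skipped entirely; a sketch that keeps the original row order:

```r
library(data.table)
dt <- data.table(y = rep(letters[1:2], each = 3), x = c(1,2,2,3,2,1), z = 1:6)

# `by` names the columns that define a duplicate; no setkey needed
res <- dt[!duplicated(dt, by = c("y", "x"))]
res
```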
The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
Thanks to dplyr:
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))
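Since this is a data.table question, for completeness here is a data.table sketch of the same distinct operation on that data.frame:

```r
library(data.table)
col1 <- c(1, 1, 3, 3, 5, 6, 7, 8, 9)
col2 <- c("cust1","cust1","cust3","cust4","cust5","cust5","cust5","cust5","cust6")
df1 <- data.frame(col1, col2)

# unique() with `by` on a data.table mirrors dplyr::distinct()
res <- unique(as.data.table(df1), by = c("col1", "col2"))
res
```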
