Non-equi join with calculated column - r

I would like to perform a non-equi join on two data.tables, where the comparison is made against the sum of two columns of the first data.table:
library(data.table)
set.seed(2018)
DT1 <- data.table(ID = seq(10), X = round(500*runif(10)), Y = round(500*runif(10))); DT1
DT2 <- data.table(ID = seq(10), X_min = seq(0, 900, 100), X_max = seq(99, 999, 100), Text = LETTERS[1:10]); DT2
Now I would like to join in the Text from DT2 whose interval [X_min, X_max] contains the sum X+Y from DT1. I could do:
DT1[, Z :=X+Y]
DT1[DT2, Description:=i.Text, on =.(Z>=X_min, Z<= X_max)]
Is it possible to avoid calculating Z explicitly? This fails:
DT1[DT2, Description:=i.Text, on =.((X+Y)>=X_min, (X+Y)<= X_max)]
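One possible workaround (a sketch, not from the original post) is to build the sum inside a temporary table used as i, so Z never becomes a column of DT1; it assumes each sum falls into at most one [X_min, X_max] interval:
# compute Z = X + Y only inside the i table; the join then returns exactly
# one Text value (or NA) per row of DT1, in DT1's row order
DT1[, Description := DT2[DT1[, .(Z = X + Y)], Text, on = .(X_min <= Z, X_max >= Z)]]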

Related

Replace N/As in a data.table join

When joining data tables, I'd like to be able to replace the NA values in rows that aren't matched. Is there a way to do this all in one line? I've provided my own two-line solution, but I imagine there must be a cleaner way. It would also help, when I'm using this for multiple variables, not to need a line for each.
dt1[dt2, frequency.lrs := lr, on = .(joinVariable)]
dt1[is.na(frequency.lrs), frequency.lrs := 1]
You could create (and fill) the column frequency.lrs with the value 1 before joining with dt2, and then use an update join to replace frequency.lrs on matched rows only.
dt1[, frequency.lrs := 1][dt2, frequency.lrs := lr, on = .(joinVariable)]
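If several columns need the same treatment, one way to extend this (a sketch; the column names lr1, lr2 and their dt1 counterparts are made up for illustration) is a single update join followed by setnafill():
library(data.table)
# hypothetical data with two replacement columns instead of one
dt1 <- data.table(joinVariable = 1:3)
dt2 <- data.table(joinVariable = 1L, lr1 = 0.5, lr2 = 0.25)

src_cols <- c("lr1", "lr2")
new_cols <- c("frequency.lrs1", "frequency.lrs2")

# one update join for all columns, then fill the unmatched (NA) rows with 1
dt1[dt2, (new_cols) := mget(paste0("i.", src_cols)), on = .(joinVariable)]
setnafill(dt1, type = "const", fill = 1, cols = new_cols)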
Another option:
dt1[, VAL := dt2[dt1, on = .(ID), replace(VAL, is.na(VAL), 1)]]
output:
ID VAL
1: 1 3
2: 2 1
data:
library(data.table)
dt1 <- data.table(ID=1:2)
dt2 <- data.table(ID=1, VAL=3)

Should non-equi inner join (nomatch=0L) be bidirectional?

When doing a non-equi inner join, should the order of X[Y] versus Y[X] matter? I am under the impression that it should not.
library(data.table) #data.table_1.12.2
dt1 <- data.table(ID=LETTERS[1:4], TIME=2L:5L)
cols1 <- names(dt1)
dt2 <- data.table(ID=c("A", "B"), START=c(1L, 20L), END=c(3L, 30L))
cols2 <- names(dt2)
> dt1
ID TIME
1: A 2
2: B 3
3: C 4
4: D 5
> dt2
ID START END
1: A 1 3
2: B 20 30
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END. Desired output:
ID TIME
1: A 2
Since I wanted rows from dt1, I started by using dt1 as the i in x[i], but I either got columns from dt2 or ran into errors:
#no error but using x. values
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L]
#error for the rest
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, mget(paste0("i.", cols1))]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .SD]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .(START)]
Error message:
Error in [.data.table(dt2, dt1, on = .(ID, START < TIME, END > TIME), : column(s) not found: START
So I had to use dt2 as the i as a workaround:
#need to type out all the columns:
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L, .(ID, TIME=x.TIME)]
#using setNames
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L,
setNames(mget(paste0("x.", cols1)), cols1)]
Or is this a simple case of my misunderstanding?
References:
Confusion arose from answering: r compare two data.tables by row
https://github.com/Rdatatable/data.table/issues/1700
https://github.com/Rdatatable/data.table/issues/1807
https://github.com/Rdatatable/data.table/pull/2706
https://github.com/Rdatatable/data.table/pull/3093
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END.
That sounds like a semi join: Perform a semi-join with data.table
dt1[
  dt1[dt2, on = .(ID, TIME >= START, TIME <= END), nomatch = 0, which = TRUE]
]
# ID TIME
# 1: A 2
If it's possible that multiple rows of dt2 match a row of dt1, then the which=TRUE output can be wrapped in unique(), as in the linked answer.
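For instance, a sketch of the same semi-join with the duplicate-safe wrapper:
dt1[
  unique(dt1[dt2, on = .(ID, TIME >= START, TIME <= END), nomatch = 0L, which = TRUE])
]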
There are a couple linked feature requests for a more convenient way to do this: https://github.com/Rdatatable/data.table/issues/2158

Operate in data.table column by matching column from second data.table

I am trying to perform a character operation (paste) on a column of one data.table using data from a second data.table.
Since I also perform other, unrelated merge operations before and after this particular code, the row order might change, so I am currently setting the order both before and after this manipulation.
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3)) # N used
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
# without merge
DT1 <- DT1[order(ID)]
DT2 <- DT2[order(ID)]
DT1[, N := paste0(N, "/", DT2$N)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
I know a merge of the two DTs (by definition) would take care of the matching, but this creates extra columns that I need to remove afterwards.
# using merge
DT1 <- merge(DT1, DT2, by = "ID")
DT1[, N := paste0(N.x, "/", N.y)]
DT1[, c("N.x", "N.y") := list(NULL, NULL)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
Is there a more intelligent way of doing this using data.table?
We can use an update join after converting the 'N' column to character:
DT1[DT2, N := paste0(N, "/", i.N), on = .(ID)]
DT1
# ID N
#1: a 4/10
#2: b 1/10
#3: c 3/15
data
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3))
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
DT1[, N:= as.character(N)]
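As a side note (not part of the original answer), the as.character() step is only needed because N is overwritten in place with a character value; writing the result to a new column (here called Ratio, a made-up name) avoids the conversion:
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4, 1, 3))
DT2 <- data.table(ID = c("b", "a", "c"), N = c(10, 10, 15))

# i.N is DT2's N, plain N is DT1's N; the join handles the matching, so no ordering is needed
DT1[DT2, Ratio := paste0(N, "/", i.N), on = .(ID)]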

Assign a value based on closest neighbour from other data frame

With generic data:
set.seed(456)
a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)
c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)
For each df1$b we want to find the nearest df2$d.
Then we create a new variable df1$XYZ that takes the df2$c value of the nearest df2$d
This question has guided me towards the data.table library, but I am not sure whether dplyr with group_by could also be used.
Here was my data.table attempt:
library(data.table)
dt1 <- data.table( df1 , key = "b" )
dt2 <- data.table( df2 , key = "d" )
dt2[ dt1, list( d ), roll = "nearest" ]
Here's one way with data.table:
require(data.table)
setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]
You need to get df2$c corresponding to the nearest value in df2$d for every df1$b. So we need to join as df2[df1], which results in nrow(df1) rows. That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"].
It returns the result you require. All we need to do is to add this back to df1 with the name XYZ. We do that using :=.
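Spelled out as two separate steps (equivalent to the one-liner above, using the df1/df2 from the question):
library(data.table)
setDT(df1); setDT(df2)

# the rolling join returns one value of c per row of df1, in df1's row order
nearest_c <- df2[df1, c, on = c(d = "b"), roll = "nearest"]
df1[, XYZ := nearest_c]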
The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data tables):
We need to get some value(s) for each row of df1. That means i = df1 in the x[i] syntax.
df2[df1]
We need to join df2$d with df1$b. Using on= that'd be:
df2[df1, on=c(d="b")]
We need just the c column. Use j to select just that column.
df2[df1, c, on=c(d="b")]
We don't need an equi-join but a roll-to-nearest join.
df2[df1, c, on=c(d="b"), roll="nearest"]
Hope this helps.

data.table join (multiple) selected columns with new names

I'd like to join two tables that have some identical columns (names and values) and others that are not. I'm only interested in joining the columns that are not identical, and I would like to give them new names. The way I currently do it seems verbose and hard to handle for the real tables I have with 100+ columns, i.e. I would like to determine the columns to be joined in advance, not inside the join statement. Reproducible example:
# create table 1
DT1 = data.table(id = 1:5, x=letters[1:5], a=11:15, b=21:25)
# create table 2 with changed values for a, b via pre-determined cols
DT2 = copy(DT1)
cols <- c("a", "b")
DT2[, (cols) := lapply(.SD, function(x) x*2), .SDcols = cols]
# both of these work but are verbose for many columns
DT1[DT2, c("a_new", "b_new") := list(i.a, i.b), on=c(id="id")]
DT1[DT2, `:=` (a_new=i.a, b_new=i.b), on = c(id="id")]
I was thinking about something like this (doesn't work):
cols_new <- c("a_new", "b_new")
cols <- c("a", "b")
DT1[DT2, cols_new := i.cols, on=c(id="id")]
Updated answer based on Arun's recommendation:
cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
You could also generate cols_old by doing:
paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
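Putting it together for the reproducible example above (a sketch; both name vectors are derived from cols rather than typed out):
cols     <- c("a", "b")
cols_new <- paste0(cols, "_new")
cols_old <- paste0("i.", cols)

# update join: pull DT2's versions of cols into DT1 under the new names
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]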

Resources