Should non-equi inner join (nomatch=0L) be bidirectional? - r

When doing a non-equi inner join, should the order of X[Y] versus Y[X] matter? I am under the impression that it should not.
library(data.table) #data.table_1.12.2
dt1 <- data.table(ID=LETTERS[1:4], TIME=2L:5L)
cols1 <- names(dt1)
dt2 <- data.table(ID=c("A", "B"), START=c(1L, 20L), END=c(3L, 30L))
cols2 <- names(dt2)
> dt1
ID TIME
1: A 2
2: B 3
3: C 4
4: D 5
> dt2
ID START END
1: A 1 3
2: B 20 30
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END. Desired output:
ID TIME
1: A 2
Since I wanted rows from dt1, I started by using dt1 as i in [.data.table, but I either get columns from dt2 or run into errors:
#no error but using x. values
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L]
#error for the rest
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, mget(paste0("i.", cols1))]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .SD]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .(START)]
Error message:
Error in [.data.table(dt2, dt1, on = .(ID, START < TIME, END > TIME), : column(s) not found: START
So I had to use dt2 as the i as a workaround:
#need to type out all the columns:
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L, .(ID, TIME=x.TIME)]
#using setNames
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L,
setNames(mget(paste0("x.", cols1)), cols1)]
Or is this a simple case of my misunderstanding?
References:
Confusion arose from answering: r compare two data.tables by row
https://github.com/Rdatatable/data.table/issues/1700
https://github.com/Rdatatable/data.table/issues/1807
https://github.com/Rdatatable/data.table/pull/2706
https://github.com/Rdatatable/data.table/pull/3093

I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END.
That sounds like a semi join: Perform a semi-join with data.table
dt1[
dt1[dt2, on=.(ID, TIME >= START, TIME <= END), nomatch=0, which=TRUE]
]
# ID TIME
# 1: A 2
If it's possible that multiple rows of dt2 will match rows of dt1, then the "which" output can be wrapped in unique() as in the linked answer.
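A minimal sketch of the unique() variant, assuming a hypothetical dt2 in which two intervals for ID "A" both cover the same row of dt1 (this dt2 is made up for illustration and is not the one from the question):

```r
library(data.table)

dt1 <- data.table(ID = LETTERS[1:4], TIME = 2L:5L)
# Hypothetical dt2: two overlapping intervals for ID "A"
dt2 <- data.table(ID    = c("A", "A", "B"),
                  START = c(1L, 0L, 20L),
                  END   = c(3L, 5L, 30L))

# Row numbers of dt1 matched by each row of dt2; row 1 is matched twice
idx <- dt1[dt2, on = .(ID, TIME >= START, TIME <= END),
           nomatch = 0L, which = TRUE]

# De-duplicate (and restore dt1's row order) before subsetting
res <- dt1[sort(unique(idx))]
res
```

Without unique(), the row for ID "A" would be returned once per matching interval.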
A couple of linked feature requests ask for a more convenient way to do this: https://github.com/Rdatatable/data.table/issues/2158

Related

subsetting ID and dates in data.table R

I have a large table similar to the example created below (the real one has 70 columns and millions of rows):
a <- seq(as.IDate("2011-12-30"), as.IDate("2014-01-04"), by="days")
data <- data.table(ID = 1:length(a), date1 = a)
I want to extract all the rows of data that are covered by IDs. IDs contains the ID of each individual together with the date range (date1 to date2) to extract for that individual. An individual can have multiple rows.
a <- seq(as.IDate("2011-12-30"), as.IDate("2014-01-04"), by="week")
b <- seq(as.IDate("2012-01-01"), as.IDate("2014-01-06"), by="week")
IDs <- data.table(ID = 1:length(a), date1 = a, date2 = b)
My current solution is not very fast. What would be better?
A <- list()
for(i in 1:dim(IDs)[1]){
A[[i]] <- data[ID == IDs[i,ID] & (date1 %between% IDs[i,.(date1,date2)]),]
}
I think you are looking for a non-equi inner join:
IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, .(ID, date1=i.date1)]
Or equivalently, with the roles of the two tables swapped:
data[IDs, on=.(ID, date1>=date1, date1<=date2), nomatch=0L, .(ID, date1=x.date1)]
Or viewing it as a non-equi semi-join:
data[IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, which=TRUE]]
output:
ID date1
1: 1 2011-12-30
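As a rough check that the join returns the same rows as the loop, here is a sketch on a small made-up data set (the values below are hypothetical and chosen so that more than one row matches):

```r
library(data.table)

# Hypothetical data: several rows per individual
data <- data.table(
  ID    = c(1L, 1L, 1L, 2L, 2L),
  date1 = as.IDate(c("2012-01-01", "2012-01-05", "2012-02-01",
                     "2012-01-03", "2012-03-01"))
)
IDs <- data.table(
  ID    = c(1L, 2L),
  date1 = as.IDate(c("2012-01-01", "2012-01-01")),
  date2 = as.IDate(c("2012-01-10", "2012-01-31"))
)

# Loop version, one subset per row of IDs
loop_res <- rbindlist(lapply(seq_len(nrow(IDs)), function(k) {
  data[ID == IDs$ID[k] & date1 >= IDs$date1[k] & date1 <= IDs$date2[k]]
}))

# Non-equi semi-join: row numbers of data that fall in any range of IDs
idx <- data[IDs, on = .(ID, date1 >= date1, date1 <= date2),
            nomatch = 0L, which = TRUE]
join_res <- data[sort(unique(idx))]

all.equal(loop_res, join_res)
```

The join does the matching in one vectorised call instead of one subset per row of IDs.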

Replace N/As in a data.table join

When joining data tables I'd like to be able to replace NA values that aren't matched. Is there a way to do this all in one line? I've provided my own two line solution but I imagine there must be a cleaner way. It would also help when I'm using it for multiple variables not to require a line for each.
dt1[dt2, frequency.lrs := lr, on = .(joinVariable)]
dt1[is.na(frequency.lrs), frequency.lrs := 1]
You could create the column frequency.lrs and fill it with the value 1 before joining with dt2, and then use the update join to replace frequency.lrs on matched rows only.
dt1[, frequency.lrs := 1][dt2, frequency.lrs := lr, on = .(joinVariable)]
Another option:
dt1[, VAL :=
dt2[dt1, on=.(ID), replace(VAL, is.na(VAL), 1)]
]
output:
ID VAL
1: 1 3
2: 2 1
data:
library(data.table)
dt1 <- data.table(ID=1:2)
dt2 <- data.table(ID=1, VAL=3)
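For several variables at once, the pre-fill approach might be generalised as in the sketch below. The tables and the column names A and B are made up for illustration, and mget(paste0("i.", cols)) is just one way to refer to the i columns by name:

```r
library(data.table)

dt1 <- data.table(ID = 1:3)
# Hypothetical lookup table with two value columns
dt2 <- data.table(ID = c(1L, 3L), A = c(3, 5), B = c(10, 20))

cols <- c("A", "B")

# Step 1: create the columns with the default value for unmatched rows
dt1[, (cols) := 1]

# Step 2: the update join overwrites the default on matched rows only
dt1[dt2, on = .(ID), (cols) := mget(paste0("i.", cols))]
dt1
```

Unmatched rows (here ID 2) keep the default, so no separate is.na() pass is needed per column.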

Update join with multiple rows

Question
When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?
Example
In this example, the update-join returns the last row from dt2
library(data.table)
dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)
dt1[
dt2
, on = "id"
, letter := i.letter
]
dt1
# id letter
# 1: 1 z
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
References
A couple of references similar to this by user @Frank
data.table tutorial - in particular the 'warning' on update-joins
Issue on github
The most flexible idea I can think of is to only join the part of dt2 which contains the rows you want. So, for the second row:
dt1[
dt2[, .SD[2], by=id]
, on = "id"
, letter := i.letter
]
dt1
# id letter
#1: 1 b
With a hat-tip to @Frank for simplifying the sub-select of dt2.
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
Not elegant, but sort-of works:
n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
A couple problems:
It doesn't select using GForce, unlike the plain grouped query shown here:
> dt2[, letter[3], by=id, verbose=TRUE]
Detected that j uses these columns: letter
Finding groups using forderv ... 0.020sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'letter[3]'
GForce optimized j to '`g[`(letter, 3)'
Making each group and running j (GForce TRUE) ... 0.000sec
id V1
1: 1 c
If n is outside of 1:.N for some joined groups, no warning will be given:
n = 40L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
Alternately, make a habit of checking that i in an update join x[i] is "keyed" by the join columns:
cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
And then make a different i table to join on if appropriate
mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]
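When only the first or the last matching row is needed, another option is to look the value up from dt2 with the mult argument. This is a sketch rather than a general answer: mult only offers "first" and "last", so it does not cover an arbitrary nth row, and the column names first_letter and last_letter are made up here.

```r
library(data.table)

dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)

# For each row of dt1, take the first (or last) matching row of dt2
dt1[, first_letter := dt2[.SD, on = .(id), mult = "first", x.letter]]
dt1[, last_letter  := dt2[.SD, on = .(id), mult = "last",  x.letter]]
dt1
```

Because dt2 is the x table in the inner join, mult picks among its matching rows; in the original update join dt2 is i, where mult does not help.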

Non-equi join with calculated column

I would like to perform a non-equi join on two data.table where the comparison is made with the sum of two columns of the first data.table:
library(data.table)
set.seed(2018)
DT1 <- data.table(ID = seq(10), X = round(500*runif(10)), Y = round(500*runif(10))); DT1
DT2 <- data.table(ID = seq(10), X_min = seq(0, 900, 100), X_max = seq(99, 999, 100), Text = LETTERS[1:10]); DT2
Now I would like to join the text from DT2 that corresponds to the interval [X_min, X_max] that the sum X+Y from DT1 is in. I could do:
DT1[, Z :=X+Y]
DT1[DT2, Description:=i.Text, on =.(Z>=X_min, Z<= X_max)]
Is it possible to avoid calculating Z explicitly? This fails:
DT1[DT2, Description:=i.Text, on =.((X+Y)>=X_min, (X+Y)<= X_max)]
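The thread shown here has no answer. One possible workaround, offered as a sketch and not as the canonical approach, is to keep the sum in a temporary table (called tmp here, a made-up name) and use that table as i, so that Z is never added to DT1. It assumes the intervals in DT2 do not overlap, so each sum matches at most one row:

```r
library(data.table)

set.seed(2018)
DT1 <- data.table(ID = seq(10), X = round(500 * runif(10)), Y = round(500 * runif(10)))
DT2 <- data.table(ID = seq(10), X_min = seq(0, 900, 100),
                  X_max = seq(99, 999, 100), Text = LETTERS[1:10])

# Temporary one-column table holding the sum; DT1 itself gets no Z column
tmp <- DT1[, .(Z = X + Y)]

# Look up the interval for each sum; unmatched sums give NA
DT1[, Description := DT2[tmp, on = .(X_min <= Z, X_max >= Z), x.Text]]
DT1
```

The sum is still computed, but only in a throwaway object, since on accepts column names rather than expressions.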

Finding the appropriate interval

Suppose I have several intervals which are subsets of the real line, as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now given a real number x = 6.4, say, I'd like to find which interval contains the number x. I would like to know the algorithm to find this interval, and/or how to do this in R.
Thanks in advance.
Update using non-equi joins:
This is much simpler and more straightforward using the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
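Applied to the four intervals from the question, the same idea can be sketched as follows. The table name intervals and the extra test points are made up for illustration:

```r
library(data.table)

# The intervals I_1 .. I_4 from the question
intervals <- data.table(start = c(0, 1.5, 5, 13), end = c(1, 2, 9, 16))

# Several points at once; 3 lies in none of the intervals
pts <- c(0.5, 3, 6.4, 14)

res <- intervals[.(x = pts), on = .(start <= x, end >= x), which = TRUE]
res
# [1]  1 NA  3  4
```

With the default nomatch, a point outside every interval gives NA, so the result stays aligned with the input points as long as the intervals do not overlap.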
Old solution using foverlaps:
One way would be to use interval/overlap joins using the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This efficiently checks whether each interval in DT2 lies completely within an interval in DT1. In your case the entry in DT2 is a point (start equals end), not a proper interval. If it did not fall within any interval in DT1, yid would be NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example, the mult= argument controls whether to return all the matching rows or just the first or last.
Since setkey reorders DT1, the yid returned refers to row positions in the sorted table. To recover the original row numbers, add a separate id column before setting the key:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7
