Unexpected behaviour in data.table non-equi join - r

This is a follow-up to this question, where the accepted answer showed an example of a matching exercise using data.table, including non-equi conditions.
Background
The basic set-up is that we have DT1, containing a sample of people's details, and DT2, which is a sort of master database. The aim is to find out whether each person in DT1 matches at least one entry in DT2.
First, we initialize a match-indicator column to FALSE, so that its values can be updated to TRUE whenever a match is found.
DT1[, MATCHED := FALSE]
The following general solution is then used to update the column:
DT1[, MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
In theory, it looks (and should work) fine. The sub-expression DT2[.SD, on=.(Criteria), .N, by=.EACHI] produces a sub-table with one row for each row of DT1, and computes the N column, which is the number of matches found for that row in DT2. Then, whenever N is greater than zero, the value of MATCHED in DT1 is updated to TRUE.
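As a sanity check, here is a minimal sketch of the idiom on toy data (the single equi-join column ID is hypothetical, purely to illustrate how N maps to MATCHED):
library(data.table)
DT1 <- data.table(ID = c(1L, 2L, 3L))
DT2 <- data.table(ID = c(1L, 1L, 3L))
DT1[, MATCHED := FALSE]
# count, for each row of .SD (here, all of DT1), how many DT2 rows it matches
DT1[, MATCHED := DT2[.SD, on = .(ID), .N, by = .EACHI]$N > 0L]
DT1
#    ID MATCHED
# 1:  1    TRUE
# 2:  2   FALSE
# 3:  3    TRUE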
It works as intended in a trivial reproducible example, but I encountered some unexpected behaviour when using it with the real data and cannot get to the bottom of it. I may be missing something, or it may be a bug. Unfortunately, I cannot provide a minimal reproducible example, because the data is big and the problem only shows up in the big data. But I will try to document it as best I can.
Unexpected behaviour or a bug
What helped me notice this is that, for historic reasons, the matches needed to be sought in two separate databases, and hence the filter !(MATCHED) was added to the expression so that only rows which had not already been matched would be updated:
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
I then noticed that if the line is re-run several times, each subsequent run produces more and more matches that were not found in the preceding runs. (This has nothing to do with the separate databases; each run matches against DT2.)
First run:
MATCHED N
1: FALSE 3248007
2: TRUE 2379514
Second run:
MATCHED N
1: FALSE 2149648
2: TRUE 3477873
To investigate, I then filtered the cases which weren't matched on the first run but were matched on the second. It looks like most of them were false negatives, i.e. cases which should have been matched on the first run but weren't. (With many runs, many false positives eventually appear as well.)
For example, here is one entry from DT1:
DATE FORENAME SURNAME
1: 2016-01-01 JOHN SMITH
And a matching entry from DT2:
START_DATE EXPIRY_DATE FORENAME SURNAME
1: 2015-09-09 2017-05-01 JOHN SMITH
Running the sub-expression (described above) alone, outside the main expression, to look at the N numbers, we see that it does not produce a match when it should (N=0). (You may also note that START_DATE and EXPIRY_DATE take on the value of DATE in the output, but that is a whole other issue.)
SUB <- DT2[DT1, on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
SUB[FORENAME=="JOHN" & SURNAME=="SMITH"]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 0
However, the buggy behaviour is that the result is affected by which other rows are present in DT1. For example, suppose I know that JOHN SMITH's row number in DT1 is 149 and filter DT1 to only that row:
DT2[DT1[149], on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 1
Secondly, I also noticed that the buggy behaviour occurs only with more than one non-equi criterion in the conditions. If the conditions are on=.(FORENAME, SURNAME, START_DATE <= DATE), there are no longer any differences between the runs and all rows appear to be matched correctly the first time.
Unfortunately, to solve the real-world problem, I must have several non-equi matching conditions: not only to ensure that DT1's DATE is between DT2's START_DATE and EXPIRY_DATE, but also that DT1's CHECKING_DATE is before DT2's EFFECTIVE_DATE, etc.
To summarize
Non-equi joins in data.table behave in a buggy way when:
- some rows are present/absent from one of the tables, AND
- more than one non-equi condition is used.
Update: Reproducible example
set.seed(123)
library(data.table)
library(stringi)
n <- 100000
DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))
DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))
DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]
#Initialization
DT1[, MATCHED := FALSE]
#First run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 85833
2: TRUE 14167
#Second run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 73733
2: TRUE 26267
#And so on with subsequent runs...
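For anyone trying to verify this, a rough diagnostic (a sketch on the reproducible data above; slow, as it loops over a sample of rows) is to compare the per-row N from the full join against the N obtained when joining one row at a time:
# N for every DT1 row, computed in a single join
N_all <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]$N
# N for a sample of rows, joined one row at a time
idx <- sample(nrow(DT1), 100)
N_one <- sapply(idx, function(i)
  DT2[DT1[i], on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]$N)
# rows where the two disagree exhibit the behaviour described above
sum(N_all[idx] != N_one)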

This is a workaround, not at all elegant, but it appears to give the right result until the bug is fixed.
First, we need each row in DT1 and DT2 to have a unique id. A row number will do.
DT1[, DT1_ID := 1:nrow(DT1)]
DT2[, DT2_ID := 1:nrow(DT2)]
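Equivalently, the .I special symbol gives the same row numbers and is slightly more idiomatic:
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]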
Then we do the following right join to find the matches:
M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
head(M, 3)
RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
1: diejk 2016-03-30 2016-03-30 NA 1
2: afjgf 2016-09-14 2016-09-14 NA 2
3: kehgb 2016-12-11 2016-12-11 NA 3
M has each row from DT1 next to all matches for that row in DT2. When DT2_ID = NA, there was no match. nrow(M) = 100969, indicating that some DT1 rows were matched to >1 DT2 row. (Dates also took on the wrong values.)
Next, we can use an ifelse() statement to label rows in the original DT1 according to whether or not they were matched.
DT1$MATCHED <- ifelse(DT1$DT1_ID %in% M[!is.na(DT2_ID)]$DT1_ID, TRUE, FALSE)
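Since %in% already returns a logical vector, the ifelse() wrapper is not strictly needed; an equivalent one-liner would be:
DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]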
Final result: 13,316 matches out of 100,000
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 86684
2: TRUE 13316

Related

Does unique.data.table with by behave like dplyr::distinct with .keep_all = TRUE?

I want to get unique rows in a data frame based on one variable, while still choosing which rows (based on other variables) are included.
Example:
library(data.table)
dt <- as.data.table(list(group = c("A", "A", "B", "B", "C", "C"), number = c(1, 2, 1, 2, 2, 1)))
I would normally do this, as it allows me to always keep the row where number == 1.
library(dplyr)
dt %>%
  arrange(group, number) %>%
  distinct(group, .keep_all = TRUE)
This is now too slow, and I'm hoping the data.table equivalent will be faster.
This seems to work:
dt <- dt[order(group, number)]
unique(dt, by = c("group"))
But I couldn't find anything in the unique.data.table documentation which says that the first row per group is the one which is kept. Is it safe to assume it is?
According to the documentation:
unique returns a data.table with duplicated rows removed, by columns specified in by argument. When no by then duplicated rows by all columns are removed.
From that we can reason that it returns the first row for each unique group.
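A quick check on the example data supports this: the default keeps the first row per group, and fromLast = TRUE (see ?unique.data.table) keeps the last one instead.
dt <- dt[order(group, number)]
unique(dt, by = "group")
#    group number
# 1:     A      1
# 2:     B      1
# 3:     C      1
unique(dt, by = "group", fromLast = TRUE)
#    group number
# 1:     A      2
# 2:     B      2
# 3:     C      2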
To complement the options provided by #Ian, here is another one, which will probably be the fastest.
setkeyv(dt, c("group","number"))
unique(dt, by="group")
At least as of now, because there are possible improvements coming. An example of reducing the time from 3.544s to 0.075s (it needs an index rather than a key) can be found in "unique can be optimized on keyed data.tables" #2947.
How about subsetting .SD in j?
library(data.table)
dt[order(group,number),.SD[1],by=group]
# group number
#1: A 1
#2: B 1
#3: C 1
You might also find using .I faster because it avoids assembling .SD:
In this version, we first assemble the row indices using the .I special symbol, keep only those where number equals 1, and take the first one ([1]) per group. We then access just the indices with $V1 and subset the original dt by them.
dt[,.I[number == 1][1], by=group]
group V1
1: A 1
2: B 3
3: C 6
dt[dt[,.I[number == 1][1], by=group]$V1]
group number
1: A 1
2: B 1
3: C 1
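On six rows any difference is noise; a rough way to compare the two approaches on larger data (a sketch, assuming the microbenchmark package is installed) would be:
library(microbenchmark)
big <- data.table(group  = sample(LETTERS, 1e6, replace = TRUE),
                  number = sample(1:2,     1e6, replace = TRUE))
# every group here contains a number == 1 row, so both calls return one such row per group
microbenchmark(
  SD_subset = big[order(group, number), .SD[1], by = group],
  I_subset  = big[big[, .I[number == 1][1], by = group]$V1],
  times = 10
)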
Edit:
As #IceCreamToucan points out in the comments, another, easier-to-read option is head.data.table:
dt[order(group,number), head(.SD, 1), by=group]
group number
1: A 1
2: B 1
3: C 1

Join two data frames together and use most recent result as rows added

I am trying to achieve the 'Final.Data' output shown below.
We start with the Reference.Data and I want to add the Add.Data, joining on Person and returning the most recent Order.Date prior to the reference Date.
I am looking for dplyr, data.table or SQL solutions in R.
I then want to be able to reproduce this for thousands of entries, so I am looking for a reasonably efficient solution.
library(tibble)
Reference.Data <- tibble(Person = "John",
                         Date = "2019-07-10")
Add.Data <- tibble(Person = "John",
                   Order.Date = c("2019-07-09","2019-07-08"),
                   Order = 1:2)
Final.Data <- tibble(Person = "John",
                     Date = "2019-07-10",
                     Order.Date = "2019-07-09",
                     Order = 1)
A rolling join to the nearest preceding date should work pretty fast:
# data preparation:
# convert to data.tables, set dates as 'real' dates
library(data.table)
DT1 <- setDT(Reference.Data)[, Date := as.IDate( Date )]
DT2 <- setDT(Add.Data)[, Order.Date := as.IDate( Order.Date )]
#set keys (this also orders the dates, convenient for the join later)
setkey(DT1, Person, Date)
setkey(DT2, Person, Order.Date)
#perform rolling update join on DT1
DT1[ DT2, `:=`( Order.Date = i.Order.Date, Order = i.Order), roll = -Inf][]
#    Person       Date Order.Date Order
# 1:   John 2019-07-10 2019-07-09     1
An approach using data.table non-equi join and update by reference directly on Reference.Data:
library(data.table)
setDT(Add.Data)
setDT(Reference.Data)
setorder(Add.Data, Person, Order.Date)
Reference.Data[, (names(Add.Data)) :=
Add.Data[.SD, on=.(Person, Order.Date<Date), mult="last",
mget(paste0("x.", names(Add.Data)))]
]
output:
Person Date Order.Date Order
1: John 2019-07-10 2019-07-09 1
Another data.table solution:
setDT(Add.Data)[, Order.Date := as.Date(Order.Date)]
setDT(Reference.Data)[, Date := as.Date(Date)]
Reference.Data[, c("Order.Date", "Order") := Add.Data[.SD,
on = .(Person, Order.Date = Date),
roll = TRUE,
.(x.Order.Date, x.Order)]]
Reference.Data
# Person Date Order.Date Order
# 1: John 2019-07-10 2019-07-09 1
We can do an inner_join, then group by 'Person' and slice the row with the max 'Order.Date':
library(tidyverse)
inner_join(Add.Data, Reference.Data) %>%
  group_by(Person) %>%
  slice(which.max(as.Date(Order.Date)))
# A tibble: 1 x 4
# Groups: Person [1]
# Person Order.Date Order Date
# <chr> <chr> <int> <chr>
#1 John 2019-07-09 1 2019-07-10
Or using data.table:
library(data.table)
setDT(Add.Data)[as.data.table(Reference.Data), on = .(Person)][,
.SD[which.max(as.Date(Order.Date))], by = Person]
Left join the Reference.Data to the Add.Data, joining on Person and on Order.Date being at or before Date. Group that by the original Reference.Data rows and take the maximum Order.Date within each group. The way it works is that the Add.Data row used for each row of Reference.Data will be the one with the maximum Order.Date, so the correct Order is shown.
Note that the dot is an SQL operator and ORDER is an SQL keyword, so we must surround names containing a dot, and the name Order (regardless of case), with square brackets.
library(sqldf)
sqldf("select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
from [Reference.Data] as r
left join [Add.Data] as a on r.Person = a.Person and a.[Order.Date] <= r.Date
group by r.rowid")
giving:
Person Date Order.Date Order
1 John 2019-07-10 2019-07-09 1
I haven't checked how fast this is (adding indexes could speed it up if need be), but with only a few thousand rows efficiency is not likely to be as important as readability.

R: Function varies the length of its output when used with data.table

I have a problem where the use of a user-defined function in data.table varies the function's output. I have constructed a simple version which has the same problem:
library(data.table)
tmp.f <- function(Date.v){var.v <- Date.v }
dt1 <- data.table( Date = c("2018-05-15","2018-05-16") )
dt1[, tmp := length( tmp.f(Date.v = Date)) ]
dt2 <- data.table( Date = c("2018-05-14","2018-05-15","2018-05-16") )
dt2[, tmp := length( tmp.f(Date.v = Date)) ]
dt1
# Date tmp
#1: 2018-05-15 2
#2: 2018-05-16 2
dt2
# Date tmp
#1: 2018-05-14 3
#2: 2018-05-15 3
#3: 2018-05-16 3
I would need the function to simply pick up the respective date from the Date column in the data.table and calculate the corresponding value (in this example the same date). The length of the function output should always be 1. But somehow it seems to pick up the column length.
(The example is just constructed to show the problem that I have within a larger function)
Thank you very much.
As suggested by Roman, you can use by to obtain the desired output:
dt2[, tmp := length( tmp.f(Date.v = Date)), by = Date ]
Functions like length, sum or max take vectors as inputs but return single values. What happens in your example is that your column Date is passed in its entirety to tmp.f and then to length, which outputs 3 as a single value. That value is then recycled to fill the tmp column, giving the impression that length(tmp.f(Date.v = Date)) has been computed for each row, while it has only been computed once.
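You can see this by evaluating the expression on its own: the whole column goes in once and a single number comes back.
dt2[, length(tmp.f(Date.v = Date))]
# [1] 3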
Using by or not will mainly depend on whether the function you apply is naturally vectorized (or, like cumsum, outputs a vector of the same length):
dt2[, tmp := as.Date(Date) + 10] # works as expected because function(x){as.Date(x)+10}
# is naturally vectorized

Update join with multiple rows

Question
When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?
Example
In this example, the update-join returns the last row from dt2
library(data.table)
dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)
dt1[
dt2
, on = "id"
, letter := i.letter
]
dt1
# id letter
# 1: 1 z
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
References
A couple of references similar to this by user #Frank
data.table tutorial - in particular the 'warning' on update-joins
Issue on github
The most flexible idea I can think of is to only join the part of dt2 which contains the rows you want. So, for the second row:
dt1[
dt2[, .SD[2], by=id]
, on = "id"
, letter := i.letter
]
dt1
# id letter
#1: 1 b
With a hat-tip to #Frank for simplifying the sub-select of dt2.
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
Not elegant, but sort-of works:
n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
A couple of problems:
It doesn't select using GForce, e.g. as seen here:
> dt2[, letter[3], by=id, verbose=TRUE]
Detected that j uses these columns: letter
Finding groups using forderv ... 0.020sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'letter[3]'
GForce optimized j to '`g[`(letter, 3)'
Making each group and running j (GForce TRUE) ... 0.000sec
id V1
1: 1 c
If n is outside of 1:.N for some joined groups, no warning will be given:
n = 40L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
Alternatively, make a habit of checking that i in an update join x[i] is "keyed" by the join columns:
cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
And then make a different i table to join on, if appropriate:
mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]

Finding the appropriate interval

Suppose I have several intervals which are subsets of the real line, as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now given a real number x = 6.4, say, I'd like to find which interval contains the number x. I would like to know the algorithm to find this interval, and/or how to do this in R.
Thanks in advance.
Update using non-equi joins:
This is much simpler and more straightforward using the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
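As a usage note, the returned row number can be fed straight back into DT1 to get the interval itself:
DT1[DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]]
#    start end
# 1:     4   5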
Old solution using foverlaps:
One way would be to use interval/overlap joins using the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This efficiently checks whether each interval in DT2 lies completely within an interval of DT1. In your case DT2 is a point, not an interval. If it did not lie within any interval in DT1, it would return NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example, the mult= argument controls whether you want to return all the matching rows or just the first or last, etc.
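For instance, to get just the first overlapping interval for each query point (continuing the example above):
foverlaps(DT2, DT1, which=TRUE, type="within", mult="first")
# [1] 3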
Since setkey sorts the table, you'll have to add a separate id to recover the original row numbers, as follows:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7
