Question
When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?
Example
In this example, the update-join assigns the value from the last matching row of dt2:
library(data.table)
dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)
dt1[
dt2
, on = "id"
, letter := i.letter
]
dt1
# id letter
# 1: 1 z
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
References
A couple of related references, both by user @Frank:
data.table tutorial - in particular the 'warning' on update-joins
Issue on github
The most flexible idea I can think of is to join only the part of dt2 that contains the rows you want. So, for the second row:
dt1[
dt2[, .SD[2], by=id]
, on = "id"
, letter := i.letter
]
dt1
# id letter
#1: 1 b
With a hat-tip to @Frank for simplifying the sub-select of dt2.
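If .SD[2] proves slow on a very large dt2, a commonly used alternative (a sketch on the same toy data, not part of the original answer) is to compute the target row numbers with .I and subset dt2 by them:
# Pick the 2nd row per id via .I (global row numbers); .I[2] is NA for
# groups with fewer than two rows, and those ids then update nothing
idx <- dt2[, .I[2], by = id]$V1
dt1[dt2[idx], on = "id", letter := i.letter]
dt1
# id letter
# 1: 1 b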
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
Not elegant, but sort-of works:
n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
A couple of problems:
It doesn't get the GForce optimization that plain grouping does, e.g. as seen here:
> dt2[, letter[3], by=id, verbose=TRUE]
Detected that j uses these columns: letter
Finding groups using forderv ... 0.020sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'letter[3]'
GForce optimized j to '`g[`(letter, 3)'
Making each group and running j (GForce TRUE) ... 0.000sec
id V1
1: 1 c
If n is outside of 1:.N for some joined groups, no warning will be given:
n = 40L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
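One way to guard against that (my own sketch, not part of the original answer) is to check the joined group sizes before assigning:
# Error out if any id in dt1 has fewer than n matching rows in dt2,
# since x.letter[n] silently yields NA for such groups
sizes = dt2[dt1, on=.(id), .N, by=.EACHI]
stopifnot(all(sizes$N >= n))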
Alternatively, make a habit of checking that i in an update join x[i] is "keyed" by the join columns:
cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
And then build a different i table to join on, if appropriate:
mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]
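The same pattern generalizes to any n. A sketch (assuming the letter column contains no genuine NAs) that errors instead of silently writing NA when a group is too short:
n = 3L
mDT = dt2[, .(letter = letter[n]), by=id]
stopifnot(!anyNA(mDT$letter))  # letter[n] is NA when a group has fewer than n rows
dt1[mDT, on=cols, v := i.letter]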
Related
I need to group a data table using rleid. There should be three groups: one for the first row, one for the last row, and one for all rows in between.
I know how to group if I have a condition, like:
dt[, group := rleid(condition)]
You can build a vector whose first and last elements differ from a constant run of length nrow(dt) - 2 in between, and apply rleid() to it:
dt[, group := rleid(c(1, rep(2, nrow(dt) - 2), 3))]
You can make a vector of all the same values then replace individual elements (e.g. first and last elements) with something else. The code below creates a column which is 1L for the first row, 3L for the last row, and 2L otherwise.
df[, group := replace(rep(2L, .N), c(1L, .N), c(1L, 3L))]
Another way using rleid is:
df[, group := rleid(.I %in% c(1L, .N))]
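To see all three groups at once, a quick sketch on made-up data:
library(data.table)
df <- data.table(x = 1:5)
df[, group := rleid(.I %in% c(1L, .N))]
df
#    x group
# 1: 1     1
# 2: 2     2
# 3: 3     2
# 4: 4     2
# 5: 5     3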
You can also do grouping operations on variables you create on the fly, not just those already in the data table:
df <- data.table(x = runif(100))
df[, .(sumx = sum(x)),
.(group = replace(rep(2L, nrow(df)), c(1L, nrow(df)), c(1L, 3L)))]
# group sumx
# 1: 1 0.1546382
# 2: 2 48.1939765
# 3: 3 0.4710213
When joining data tables, I'd like to be able to replace NA values that aren't matched. Is there a way to do this all in one line? I've provided my own two-line solution, but I imagine there must be a cleaner way. It would also help, when using it for multiple variables, not to require a line for each.
dt1[dt2, frequency.lrs := lr, on = .(joinVariable)]
dt1[is.na(frequency.lrs), frequency.lrs := 1]
You could create (and fill) the column frequency.lrs with the value 1 before joining with dt2, and then use the update join to replace frequency.lrs on matched rows only:
dt1[, frequency.lrs := 1][dt2, frequency.lrs := lr, on = .(joinVariable)]
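For several variables at once, the same chaining works with the multi-column form of := (a sketch; lr2 and frequency.lrs2 are made-up names, assuming dt2 carries both value columns):
dt1[, c("frequency.lrs", "frequency.lrs2") := .(1, 1)
  ][dt2, c("frequency.lrs", "frequency.lrs2") := .(i.lr, i.lr2), on = .(joinVariable)]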
Another option:
dt1[, VAL :=
  dt2[dt1, on=.(ID), replace(VAL, is.na(VAL), 1)]
]
output:
ID VAL
1: 1 3
2: 2 1
data:
library(data.table)
dt1 <- data.table(ID=1:2)
dt2 <- data.table(ID=1, VAL=3)
I want to group a data table by an id column and then count how many times each id occurs. This can be done as follows:
dt <- data.table(id = c(1, 1, 2))
dt_by_id <- dt[, .N, by = id]
dt_by_id
id N
1: 1 2
2: 2 1
That's pretty fine, but I want the N column to have a different name (e.g. count). In the help it says:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called ".N" could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicitly named; ...
How to "explicitly name" the N-column when creating the dt_by_id data table? (I know how to rename it afterwards.) I tried
dt_by_id <- dt[, count = .N, by = id]
but this led to
Error in `[.data.table`(dt, , count = .N, by = id) :
unused argument (count = .N)
You have to wrap the output of your calculation in list() if you want to give it your own name:
dt[, .(count=.N), by = id]
This is identical to dt[, list(count=.N), by = id], if you prefer; . is an alias for list here.
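The same list() form lets you name several computed columns at once, e.g. with a made-up value column x:
dt <- data.table(id = c(1, 1, 2), x = c(10, 20, 30))
dt[, .(count = .N, total = sum(x)), by = id]
#    id count total
# 1:  1     2    30
# 2:  2     1    30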
If the column has already been created as N, rename it with setnames():
setnames(dt_by_id, "N", "count")
or use dplyr::rename():
library(dplyr)
dt_by_id %>%
rename(count = N)
# id count
#1: 1 2
#2: 2 1
Using dplyr::count(x, name = "new_column") will replace the default column name n with a new name:
dt <- data.frame(id = c(1, 1, 2))
dt %>%
dplyr::count(id, name = 'count')
This is a follow-up on this question, where the accepted answer showed an example of a matching exercise using data.table, including non-equi conditions.
Background
The basic set up is that we have DT1 with a sample of people's details, and DT2, which is a sort-of master database. And the aim is to find out whether each person in DT1 matches at least one entry in DT2.
First, we initialize a column that would indicate a match to FALSE, so that its values could be updated to TRUE whenever a match is found.
DT1[, MATCHED := FALSE]
The following general solution is then used to update the column:
DT1[, MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
In theory, it looks (and should work) fine. The sub-expression DT2[.SD, on=.(Criteria), .N, by=.EACHI] produces a sub-table with each row from DT1, and computes the N column which is the number of matches for that row found in DT2. Then, whenever N is greater than zero, the value of MATCHED in DT1 is updated to TRUE.
It works as intended in a trivial reproducible example. But I encountered some unexpected behaviour using it with the real data, and cannot get to the bottom of it. I may be missing something or it may be a bug. Unfortunately, I cannot provide a minimal reproducible example, because the data is big, and it only shows in the big data. But I will try to document it as best I can.
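For reference, a trivial sketch (toy data of my own, shadowing the real tables purely for illustration) where the pattern behaves as intended:
library(data.table)
DT1 <- data.table(NAME = c("A", "B", "C"))
DT2 <- data.table(NAME = c("A", "A", "C"))
DT1[, MATCHED := DT2[.SD, on=.(NAME), .N, by=.EACHI]$N > 0L]
DT1
#    NAME MATCHED
# 1:    A    TRUE
# 2:    B   FALSE
# 3:    C    TRUE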
Unexpected behaviour or a bug
What helped in noticing this is that, for a historic reason, the matches needed to be sought in two separate databases; hence, the filter !(MATCHED) was added to the expression to update only those values which had not already been matched:
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
I then noticed that if the line is re-run several times, each subsequent run yields more and more matches which were not found in the preceding runs. (This has nothing to do with the separate databases; each run matches against the same DT2.)
First run:
MATCHED N
1: FALSE 3248007
2: TRUE 2379514
Second run:
MATCHED N
1: FALSE 2149648
2: TRUE 3477873
To investigate, I then filtered cases which weren't matched on the first run but were matched on the second. It looks like most of them were false negatives, i.e. cases which should have been matched on the first run but weren't. (But over many runs, many false positives eventually appear as well.)
For example, here is one entry from DT1:
DATE FORENAME SURNAME
1: 2016-01-01 JOHN SMITH
And a matching entry from DT2:
START_DATE EXPIRY_DATE FORENAME SURNAME
1: 2015-09-09 2017-05-01 JOHN SMITH
Running the sub-expression (described above) alone, outside the main expression, to look at the N numbers, we see that it does not result in a match when it should (N=0). (You may also note that START_DATE and EXPIRY_DATE take on the value of DATE in the output, but that is a whole other issue.)
SUB <- DT2[DT1, on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
SUB[FORENAME=="JOHN" & SURNAME=="SMITH"]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 0
However, the buggy behaviour is that the result is affected by which other rows are present in DT1. For example, suppose I know that JOHN SMITH's row number in DT1 is 149 and filter DT1 to only that row:
DT2[DT1[149], on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 1
Secondly, I also noticed that the buggy behaviour occurs only with more than one non-equi criterion among the join conditions. If the conditions are on=.(FORENAME, SURNAME, START_DATE <= DATE), there are no longer any differences between the runs, and all rows appear to be matched correctly the first time.
Unfortunately, to solve the real-world problem, I must have several non-equi matching conditions: not only to ensure that DT1's DATE is between DT2's START_DATE and EXPIRY_DATE, but also that DT1's CHECKING_DATE is before DT2's EFFECTIVE_DATE, etc.
To summarize
Non-equi joins in data.table behave in a buggy way when:
some rows are present in, or absent from, one of the tables,
AND
more than one non-equi condition is used.
Update: Reproducible example
set.seed(123)
library(data.table)
library(stringi)
n <- 100000
DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))
DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))
DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200, 300))]
#Initialization
DT1[, MATCHED := FALSE]
#First run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 85833
2: TRUE 14167
#Second run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 73733
2: TRUE 26267
#And so on with subsequent runs...
This is a workaround solution, not at all elegant, but it appears to give the right result until the bug is fixed.
First, we need each row in DT1 and DT2 to have a unique id. A row number will do.
DT1[, DT1_ID := 1:nrow(DT1)]
DT2[, DT2_ID := 1:nrow(DT2)]
Then we do the following right join to find the matches:
M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
head(M, 3)
RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
1: diejk 2016-03-30 2016-03-30 NA 1
2: afjgf 2016-09-14 2016-09-14 NA 2
3: kehgb 2016-12-11 2016-12-11 NA 3
M has each row from DT1 next to all matches for that row in DT2. When DT2_ID = NA, there was no match. nrow(M) = 100969, indicating that some DT1 rows were matched to >1 DT2 row. (Dates also took on the wrong values.)
Next, we flag rows in the original DT1 according to whether or not they were matched (the %in% test already yields the logical vector we need, so no ifelse() is required):
DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]
Final result: 13,316 matches of 100,000
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 86684
2: TRUE 13316
Suppose I have several intervals which are subsets of the real line, as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now given a real number x = 6.4, say, I'd like to find which interval contains the number x. I would like to know the algorithm to find this interval, and/or how to do this in R.
Thanks in advance.
Update using non-equi joins:
This is much simpler and more straightforward with the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
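A point that falls in no interval returns NA under the default nomatch, e.g. continuing with the same DT1:
DT1[.(x=100), on=.(start<=x, end>=x), which=TRUE]
# [1] NA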
Old solution using foverlaps:
One way would be to use interval/overlap joins using the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This efficiently checks whether each interval in DT2 lies completely within some interval in DT1. In your case, DT2 is a point, not an interval. If it did not lie within any interval in DT1, the join would return NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example, the mult= argument controls whether to return all matching rows, or just the first or last, etc.
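For instance, to keep only the first matching interval per query point (with which=TRUE and a mult other than "all", foverlaps returns a plain vector of y row numbers):
foverlaps(DT2, DT1, which=TRUE, type="within", mult="first")
# [1] 3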
Since setkey() sorts the table, the row indices returned refer to the sorted order; to recover the original row numbers, add a separate id column first:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7
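As a base-R aside (my own sketch, assuming the intervals are disjoint and sorted as in the original question), findInterval() answers the point-in-interval query without any joins:
# Flatten the endpoints into one sorted breakpoint vector; a point lies
# inside an interval exactly when findInterval() returns an odd index.
# (Exact right endpoints need care: findInterval() is left-closed, so a
# point equal to an interval's end lands on the following even index.)
starts <- c(0, 1.5, 5, 13)
ends <- c(1, 2, 9, 16)
breaks <- as.vector(rbind(starts, ends))  # 0 1 1.5 2 5 9 13 16
pos <- findInterval(6.4, breaks)          # 5, odd, so inside an interval
if (pos %% 2 == 1) (pos + 1) %/% 2 else NA  # interval number 3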