Replace N/As in a data.table join - r

When joining data tables I'd like to be able to replace NA values that aren't matched. Is there a way to do this all in one line? I've provided my own two line solution but I imagine there must be a cleaner way. It would also help when I'm using it for multiple variables not to require a line for each.
dt1[dt2, frequency.lrs := lr, on = .(joinVariable)]
dt1[is.na(frequency.lrs), frequency.lrs := 1]

You could create (and fill fill) the column frequency.lrs with value 1 before joining with dt2, and then use the update join to replace frequency.lrs on matched rows only.
dt1[, frequency.lrs := 1][dt2, frequency.lrs := lr, on = .(joinVariable)]

Another option:
dt1[, VAL :=
dt2[dt1, on=.(ID), replace(VAL, is.na(VAL), 1)]
]
output:
ID VAL
1: 1 3
2: 2 1
data:
library(data.table)
dt1 <- data.table(ID=1:2)
dt2 <- data.table(ID=1, VAL=3)

Related

R - Most efficient way to remove all non-matched rows in a data.table rolling join (instead of 2-step procedure with semi join)

Currently solve this with a workaround, but I would like to know if there is a more efficient way.
See below for exemplary data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(EXCL_DATE, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling. I want to get the most recent preceding data from dt (aka lookup table) into dt_fill. I do this with an easy 1-line rolling join like this.
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end with a bunch of rows filled with NAs for the newly added columns that I would like to get rid of. I do this with a semi-join. I found outdated solutions to be quite hard to understand and settled for batchtools::sjoin() function which is essentially also a one liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of accomplishing a clean output result from a rolling join than by doing the rolling join first and then a semi-join with the original dataset. It is also not very fast for very long data sets. Thanks!
Essentially, there are two approaches I find that are both viable solutions.
Solution 1
First, proposed by lil_barnacle is an elegant one-liner that reads like following:
# Rolling join with nomtach-argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Adding the nomatch argument and setting it to 0 like this nomatch = 0, is equivalent to doing the rolling join first and doing the semi-join thereafter.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
Solution 2
Second, the solution that I came up with was to 'align' both data sets before the rolling join by means of filtering by the 'joined variable' like so:
# Aligning data sets by filtering accd. to joined 'variable'
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]

Should non-equi inner join (nomatch=0L) be bidirectional?

When doing a non-equi inner join, should the order of X[Y] and Y[X] matters? I am under the impression that it should not.
library(data.table) #data.table_1.12.2
dt1 <- data.table(ID=LETTERS[1:4], TIME=2L:5L)
cols1 <- names(dt1)
dt2 <- data.table(ID=c("A", "B"), START=c(1L, 20L), END=c(3L, 30L))
cols2 <- names(dt2)
> dt1
ID TIME
1: A 2
2: B 3
3: C 4
4: D 5
> dt2
ID START END
1: A 1 3
2: B 20 30
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END. Desired output:
ID TIME
1: A 2
Since I wanted rows from dt1, I started with using dt1 as i in data.table[ but I am getting either columns from dt2 or encountered errors:
#no error but using x. values
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L]
#error for the rest
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, mget(paste0("i.", cols1))]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .SD]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .(START)]
Error message:
Error in [.data.table(dt2, dt1, on = .(ID, START < TIME, END > TIME), : column(s) not found: START
So I had to use dt2 as the i as a workaround:
#need to type out all the columns:
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L, .(ID, TIME=x.TIME)]
#using setNames
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L,
setNames(mget(paste0("x.", cols1)), cols1)]
Or is this a simple case of my misunderstanding?
References:
Confusion arise from answering: r compare two data.tables by row
https://github.com/Rdatatable/data.table/issues/1700
https://github.com/Rdatatable/data.table/issues/1807
https://github.com/Rdatatable/data.table/pull/2706
https://github.com/Rdatatable/data.table/pull/3093
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END.
That sounds like a semi join: Perform a semi-join with data.table
dt1[
dt1[dt2, on=.(ID, TIME >= START, TIME <= END), nomatch=0, which=TRUE]
]
# ID TIME
# 1: A 2
If it's possible that multiple rows of dt2 will match rows of dt1, then the "which" output can be wrapped in unique() as in the linked answer.
There are a couple linked feature requests for a more convenient way to do this: https://github.com/Rdatatable/data.table/issues/2158

Update join with multiple rows

Question
When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?
Example
In this example, the update-join returns the last row from dt2
library(data.table)
dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)
dt1[
dt2
, on = "id"
, letter := i.letter
]
dt1
# id letter
# 1: 1 z
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
References
A couple of references similar to this by user #Frank
data.table tutorial - in particular the 'warning' on update-joins
Issue on github
The most flexible idea I can think of is to only join the part of dt2 which contains the rows you want. So, for the second row:
dt1[
dt2[, .SD[2], by=id]
, on = "id"
, letter := i.letter
]
dt1
# id letter
#1: 1 b
With a hat-tip to #Frank for simplifying the sub-select of dt2.
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
Not elegant, but sort-of works:
n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
A couple problems:
It doesn't select using GForce, eg as seen here:
> dt2[, letter[3], by=id, verbose=TRUE]
Detected that j uses these columns: letter
Finding groups using forderv ... 0.020sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'letter[3]'
GForce optimized j to '`g[`(letter, 3)'
Making each group and running j (GForce TRUE) ... 0.000sec
id V1
1: 1 c
If n is outside of 1:.N for some joined groups, no warning will be given:
n = 40L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
Alternately, make a habit of checking that i in an update join x[i] is "keyed" by the join columns:
cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
And then make a different i table to join on if appropriate
mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]

data.table join (multiple) selected columns with new names

I like to join two tables that have some identical columns (names and values) and others that are not. I'm only interested in joining those that are not identical and I would like to determine a new name for them. The way I currently do it seems verbose and hard to handle for the real tables I have with 100+ columns, i.e. I would like to determine the columns to be joined in advance and not in join statement. Reproducible example:
# create table 1
DT1 = data.table(id = 1:5, x=letters[1:5], a=11:15, b=21:25)
# create table 2 with changed values for a, b via pre-determined cols
DT2 = copy(DT1)
cols <- c("a", "b")
DT2[, (cols) := lapply(.SD, function(x) x*2), .SDcols = cols]
# this both works but is verbose for many columns
DT1[DT2, c("a_new", "b_new") := list(i.a, i.b), on=c(id="id")]
DT1[DT2, `:=` (a_new=i.a, b_new=i.b), on = c(id="id")]
I was thinking about something like this (doesn't work):
cols_new <- c("a_new", "b_new")
cols <- c("a", "b")
DT1[DT2, cols_new := i.cols, on=c(id="id")]
Updated answer based on Arun's recommendation:
cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
you could also generate the cols_old by doing:
paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
See history for the old answer.

Finding the appropriate interval

Suppose I've several intervals which are subset of real line as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now given a real number x = 6.4, say, I'd like to find which interval contains the number x. I would like to know the algorithm to find this interval, and/or how to do this in R.
Thanks in advance.
Update using non-equi joins:
This is much simpler and straightforward using the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
Old solution using foverlaps:
One way would be to use interval/overlap joins using the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This searches if each interval in DT2 lies completely within DT1 efficiently. In your case DT2 is a point, not an interval. If it did not exist within any intervals in DT1, it'd return NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example mult= argument controls if you'd want to return all the matching rows or just the first or last etc..
Since setkey sorts the result, you'll have to add a separate id as follows:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7

Resources