Suppose I have several intervals that are subsets of the real line, as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now, given a real number, say x = 6.4, I'd like to find which interval contains it. I would like to know an algorithm for finding this interval, and/or how to do this in R.
Thanks in advance.
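For reference, the core idea needs only a vectorised comparison in base R (a minimal sketch, assuming the intervals are closed and stored as start/end vectors):
start <- c(0, 1.5, 5, 13)
end   <- c(1, 2, 9, 16)
x <- 6.4
which(start <= x & x <= end)  # index of the containing interval; here 3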
Update using non-equi joins:
This is much simpler and more straightforward using the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
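For example, you can look up several points at once with the same non-equi join (a small sketch using the DT1 defined above); points that fall within no interval return NA:
# 4.5 lies in the 7th interval; 10 lies in none, so NA is returned
DT1[.(x = c(4.5, 10)), on = .(start <= x, end >= x), which = TRUE]
# [1]  7 NA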
Old solution using foverlaps:
One way would be to use interval/overlap joins from the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This efficiently checks whether each interval in DT2 lies completely within some interval in DT1. In your case DT2 is a point, not an interval. If it did not fall within any interval in DT1, it would return NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example, the mult= argument controls whether you want to return all matching rows or just the first or last one.
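For instance, with the DT1 and DT2 from above, mult="first" keeps only the first overlapping interval per row of DT2 (a small sketch):
# return just the first matching interval in DT1 for each row of DT2
foverlaps(DT2, DT1, type = "within", mult = "first", which = TRUE)
# [1] 3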
Since setkey() sorts the data, the row numbers returned by foverlaps() refer to the sorted table. To map them back to the original rows, add a separate id column first, as follows:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7
Related
Given a data.table DT with a column Col1, I want to select the rows of DT where the values x in Col1 satisfy some boolean expression, for example f(x) == TRUE or f(x) <= 4, and then do more data.table operations.
For example, I tried something like
DT[f(Col1) == TRUE, Col2 := 2]
which does not work because f() acts on scalar values, not vectors. Using lapply() seems to work, but it takes a long time to run with a very large DT.
A workaround would be to create a column and use it to select the rows:
DT[, fvalues := f(Col1)][fvalues == TRUE, Col2 := 2]
but it would be better not to increase the size of DT.
EDIT: Here is an example.
map1<-data.table(k1=c("A","B","C"), v=c(-1,2,3))
map2<-data.table(k2=c("A","B","A","A","C","B"), k3=c("A","B","C","B","C","B"))
f <- function(x) map1[k1 == x, v]
To find the rows in map2 using the corresponding value in map1, these do not work (they return an error about a length mismatch around ==):
map2[f(k2) == 2, flag1 := TRUE]
map2[f(k2) + f(k3) == 2, flag2 := TRUE]
Using lapply(), the first one works, but it is somehow slower (for a large data.table) than adding a column to map2 with the values of f and selecting based on that new column:
map2[lapply(k2,f) == 2, flag1 := TRUE]
and the second one
map2[lapply(k2,f) + lapply(k3,f) == 2, flag2 := TRUE]
returns an error (non-numeric argument).
The question is how to do this most efficiently, particularly without making the data.table larger.
I think the problem arises in the parts that aim to modify the data.table in place (i.e. your := parts). I don't think you can filter in place, since filtering really requires writing to a new memory location.
This works for filtering, whilst creating a new object:
library(data.table)
f <- function(x) x > 0.5
DT <- data.table(Col1 = runif(10))
DT[f(Col1),]
#> Col1
#> 1: 0.7916055
#> 2: 0.5391773
#> 3: 0.6855657
#> 4: 0.5250881
#> 5: 0.9089948
#> 6: 0.6639571
To do more data.table operations on a filtered table, assign to a new object and work with that one:
DT2 <- DT[f(Col1),]
DT2[, Col2 := 2]
Perhaps I've misunderstood your problem though - what function are you using? Could you post more code so we can replicate your problem more precisely?
If f is a function working only on scalar values, you could Vectorize it:
DT[Vectorize(f)(Col1)]
Not sure this fully answers your question, because Vectorize uses lapply/mapply internally.
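Applied to the map1/map2 example above, that would look roughly like this (a sketch; it assumes f() returns exactly one value per key, otherwise Vectorize() will return a list):
fv <- Vectorize(f)
map2[fv(k2) == 2, flag1 := TRUE]
map2[fv(k2) + fv(k3) == 2, flag2 := TRUE]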
I currently solve this with a workaround, but I would like to know if there is a more efficient way.
See below for example data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(Date_2, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling: I want to get the most recent preceding data from dt (the lookup table) into dt_fill. I do this with an easy one-line rolling join:
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end up with a bunch of rows where the newly added columns are filled with NAs, which I would like to get rid of. I do this with a semi-join. I found the older solutions quite hard to understand and settled for the batchtools::sjoin() function, which is essentially also a one-liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of getting a clean result from a rolling join than doing the rolling join first and then a semi-join with the original dataset? It is also not very fast for very long data sets. Thanks!
Essentially, there are two approaches that I find to be viable solutions.
Solution 1
The first, proposed by lil_barnacle, is an elegant one-liner that reads as follows:
# Rolling join with the nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Setting the nomatch argument to 0 (nomatch = 0) is equivalent to doing the rolling join first and the semi-join thereafter.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
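To convince yourself that the two routes agree on the example data above, a quick check could look like this (a sketch; it assumes both results come back with the same column order):
# one step: rolling join that drops unmatched rows directly
one_step <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch = 0]
# two steps: rolling join, then semi-join back to the lookup table's ids
two_step <- sjoin(dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE], dt, by = "id")
all.equal(one_step, two_step)  # should be TRUE if the two are equivalent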
Solution 2
Second, the solution I came up with was to 'align' both data sets before the rolling join by filtering on the join variable, like so:
# Align the data sets by filtering according to the join variable
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
When joining data.tables, I'd like to be able to replace the NA values produced in rows that aren't matched. Is there a way to do this all in one line? I've provided my own two-line solution, but I imagine there must be a cleaner way. It would also help, when I'm using this for multiple variables, not to require a line for each.
dt1[dt2, frequency.lrs := lr, on = .(joinVariable)]
dt1[is.na(frequency.lrs), frequency.lrs := 1]
You could create (and fill) the column frequency.lrs with the value 1 before joining with dt2, and then use an update join to replace frequency.lrs on matched rows only:
dt1[, frequency.lrs := 1][dt2, frequency.lrs := lr, on = .(joinVariable)]
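If you need the same pattern for several variables, one option (a sketch; the columns other in dt2 and other.val in dt1 are hypothetical, not from the question) is to initialise all the defaults in one call and update-join all the columns in a second call, rather than writing a pair of lines per variable:
dt1[, c("frequency.lrs", "other.val") := .(1, 0)]
dt1[dt2, c("frequency.lrs", "other.val") := .(i.lr, i.other), on = .(joinVariable)]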
Another option:
dt1[, VAL := dt2[dt1, on = .(ID), replace(VAL, is.na(VAL), 1)]]
output:
ID VAL
1: 1 3
2: 2 1
data:
library(data.table)
dt1 <- data.table(ID=1:2)
dt2 <- data.table(ID=1, VAL=3)
I'm trying to build a column in a data.table by interacting with another data.table, and I'm having trouble referring to variables correctly to do this without a for-loop. Once I'm inside the second data.table, I can no longer seem to refer to the columns of the first data.table correctly.
This is kind of similar to Subsetting a data.table using another data.table, but I believe the merge-style solutions aren't appropriate.
Consider something like
# used numbers instead of dates to avoid dealing with formatting, but the idea is the same
dt1 <- data.table(id = c('a', 'b', 'c'), date1 = c(1.1, 5.4, 9.1), amt= '100')
dt2 <- data.table(date2 = c(1.3, 3, 6.4, 10.5),
                  dt2col = c(1.5, 1.02, 1.005, .99))
dt1[result := prod(dt2[date2 - (date1) > 0, dt2col])]
I want the result to be a new column in dt1 which is the product of dt2col when date2 (in dt2) is later than date1 (in dt1) for each specific row in dt1. I think the (date1) part is the problem.
I expect result[1] to be the product of dt2col over all rows of dt2, but result[2] to be the product of dt2col only for dates after 5.4 (i.e. '5/4/2018' in the real data), etc.
Here are some data.table options:
1) Using non-equi joins:
dt1[, result := dt2[dt1, on=.(date2 > date1), prod(dt2col), by=.EACHI]$V1]
dt1
2) Using rolling joins after calculating the cumulative product:
setorder(dt2, -date2)            # sort dt2 by date2 in descending order
dt2[, cprod := cumprod(dt2col)]  # running product of dt2col, from the latest date2 downwards
dt1[dt2, result := cprod, on=.(date1=date2), roll=Inf]
output:
id date1 amt result
1: a 1.1 100 1.522273
2: b 5.4 100 0.994950
3: c 9.1 100 0.990000
Try this:
dt1[,`:=`(date1 = as.Date.character(date1,format = "%d/%m/%Y"))]
dt2[,`:=`(date2 = as.Date.character(date2,format = "%d/%m/%Y"))]
dt1[, `:=`(inds = lapply(X = date1, function(t) {
  intersect(x = which(year(t) == year(dt2$date2)),
            y = which(as.integer(dt2$date2 - t) > 0))
}))][, result := lapply(X = inds, function(t) { prod(dt2$dt2col[t]) })]
# id date1 amt inds result
#1: a 2018-01-01 100 1,2,3,4 1.522273
#2: b 2018-04-05 100 1,4 1.485
#3: c 2018-01-09 100 1,4 1.485
Question
When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?
Example
In this example, the update-join returns the last row from dt2:
library(data.table)
dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)
dt1[
  dt2
  , on = "id"
  , letter := i.letter
]
dt1
# id letter
# 1: 1 z
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
References
A couple of references similar to this, by user #Frank:
data.table tutorial - in particular the 'warning' on update-joins
Issue on github
The most flexible idea I can think of is to only join the part of dt2 which contains the rows you want. So, for the second row:
dt1[
  dt2[, .SD[2], by=id]
  , on = "id"
  , letter := i.letter
]
dt1
# id letter
#1: 1 b
With a hat-tip to #Frank for simplifying the sub-select of dt2.
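A related trick, not from the original answer: if you only ever need the first row per key, unique() on the join column should give the same effect without .SD (a small sketch):
# unique() keeps the first occurrence of each id, so the join picks up "a"
dt1[unique(dt2, by = "id"), on = "id", letter := i.letter]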
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
Not elegant, but sort-of works:
n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
A couple of problems:
It doesn't select using GForce, unlike the plain grouped version, e.g. as seen here:
> dt2[, letter[3], by=id, verbose=TRUE]
Detected that j uses these columns: letter
Finding groups using forderv ... 0.020sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'letter[3]'
GForce optimized j to '`g[`(letter, 3)'
Making each group and running j (GForce TRUE) ... 0.000sec
id V1
1: 1 c
If n is outside of 1:.N for some joined groups, no warning will be given:
n = 40L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
Alternatively, make a habit of checking that i in an update join x[i] is "keyed" by the join columns:
cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
And then make a different i table to join on, if appropriate:
mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]