Updating a data.table column sequentially with loop - r

I've implemented a simple dynamic programming example described here, using data.table, in the hope that it would be as fast as vectorized code.
library(data.table)
B=100; M=50; alpha=0.5; beta=0.9;
n = B + M + 1
m = M + 1
u <- function(c)c^alpha
dt <- data.table(s = 0:(B+M))[, .(a = 0:min(s, M)), s] # State Space and corresponding Action Space
dt[, u := (s-a)^alpha] # rewards u(s, a)
dt <- dt[, .(s_next = a:(a+B), u = u), .(s, a)] # all possible (s') for each (s, a)
dt[, p := 1/(B+1), s] # transition probs
#             s  a s_next  u        p
#      1:     0  0      0  0 0.009901
#      2:     0  0      1  0 0.009901
#      3:     0  0      2  0 0.009901
#      4:     0  0      3  0 0.009901
#      5:     0  0      4  0 0.009901
#     ---
# 649022:   150 50    146 10 0.009901
# 649023:   150 50    147 10 0.009901
# 649024:   150 50    148 10 0.009901
# 649025:   150 50    149 10 0.009901
# 649026:   150 50    150 10 0.009901
To give a little context to my question: conditional on s and a, the future state s (s_next) is realized as one of a:(a+B), each with probability p = 1/(B+1). The u column gives u(s, a) for each combination (s, a).
Given an initial guess V (always an n-by-1 vector, one value per unique state s), V updates according to the Bellman equation V[s] = max over a of { u(s, a) + beta * sum(p * V[s_next]) }.
The maximization is with respect to a, hence the [, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] step in the iteration below.
Actually, there is a very efficient vectorized solution. I thought the data.table solution would be comparable in performance to the vectorized approach.
I know that the main culprit is dt[, v := V[s_next + 1]]. Alas, I have no idea how to fix it.
# Iteration starts here
system.time({
  V <- rep(0, n) # initial guess for Value function
  i <- 1
  tol <- 1
  while(tol > 0.0001){
    dt[, v := V[s_next + 1]]
    dt[, v := u + beta * sum(p*v), by = .(s, a)
       ][, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] # Iteration
    dt1 <- dt[, .(v[1L], i[1L]), by = s]
    Vnew <- dt1$V1
    sig <- dt1$V2
    tol <- max(abs(V - Vnew))
    V <- Vnew
    i <- i + 1
  }
})
# user system elapsed
# 5.81 0.40 6.25
To my dismay, the data.table solution is even slower than the following highly non-vectorized solution. As a sloppy data.table user, I must be missing some data.table functionality. Is there a way to improve things, or is data.table simply not suited to these kinds of computations?
S <- 0:(n-1) # StateSpace
VFI <- function(V){
  out <- rep(0, length(V))
  for(s in S){
    x <- -Inf
    for(a in 0:min(s, M)){
      s_next <- a:(a+B) # (s')
      x <- max(x, u(s-a) + beta * sum(V[s_next + 1]/(B+1)))
    }
    out[s+1] <- x
  }
  out
}
system.time({
  V <- rep(0, n) # initial guess for Value function
  i <- 1
  tol <- 1
  while(tol > 0.0001){
    Vnew <- VFI(V)
    tol <- max(abs(V - Vnew))
    V <- Vnew
    i <- i + 1
  }
})
# user system elapsed
# 3.81 0.00 3.81
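For reference, here is a minimal sketch of what a fully vectorized iteration might look like. This is an illustration only (the reward matrix R and the vector EV are names I introduce here), reusing B, M, alpha, beta and n from the setup and marking infeasible actions with -Inf:
R <- outer(0:(B+M), 0:M, function(s, a) ifelse(a <= pmin(s, M), pmax(s - a, 0)^alpha, -Inf))
system.time({
  V <- rep(0, n)
  tol <- 1
  while(tol > 0.0001){
    # expected continuation value of each action a, averaging V over s_next = a:(a+B)
    EV <- vapply(0:M, function(a) sum(V[(a:(a+B)) + 1]) / (B+1), numeric(1))
    Vnew <- apply(R + beta * rep(EV, each = n), 1, max) # row-wise maximization over a
    tol <- max(abs(V - Vnew))
    V <- Vnew
  }
})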

Here's how I would do this...
DT = CJ(s = seq_len(n)-1L, a = seq_len(m)-1L, s_next = seq_len(n)-1L)
DT[ , p := 0]
# p is 0 unless this is true
DT[between(s_next, a, a + B), p := 1/(B+1)]
# may as well subset to eliminate irrelevant states
DT = DT[p > 0 & s >= a]
DT[ , util := u(s - a)]
# don't technically need by, but just to be careful
DT[ , V0 := rep(0, .N), by = .(a, s_next)]
while (TRUE) {
  # for each s, maximize given past value;
  #   within each s, have to sum over s_nexts,
  #   to do so, sum by a
  DT[ , V1 := max(.SD[ , util[1L] + beta*sum(V0*p), by = a],
                  na.rm = TRUE), by = s]
  if (DT[ , max(abs(V0 - V1))] < 1e-4) break
  DT[ , V0 := V1]
}
On my machine this takes about 15 seconds (so not good)... but maybe this will give you some ideas. For example, this data.table is far too large since there really are only n unique values of V ultimately.
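One way to act on that observation (a sketch of my own, not benchmarked): keep the value function in a small n-row lookup table keyed by state, and refresh the big table's v column with an update join on each iteration:
VT <- data.table(s_next = 0:(B+M), V = 0, key = "s_next")
dt[VT, on = "s_next", v := i.V] # plays the role of dt[, v := V[s_next + 1]]
Whether this beats the plain vector lookup would have to be measured; it is just one way to exploit the fact that V has only n distinct values.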

Related

Loop calculation with previous value not using for in R

I'm a beginner R programmer. I'm having trouble with a loop calculation that uses the previous value, like a recursion.
An example of my data:
dt <- data.table(a = c(0:4), b = c( 0, 1, 2, 1, 3))
The calculated value y is y[n] = (y[n-1] + b[n]) * a[n]. The initial value of y is 0 (y[1] = 0).
I used the for loop and the code and result is as below.
dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}
   a b  y
1: 0 0  0
2: 1 1  1
3: 2 2  6
4: 3 1 21
5: 4 3 96
This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm trying to find a way to do this without a for loop. I tried to use Reduce(), but it seems to work only with a single vector (e.g. y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I couldn't find a solution that way.
Is there a more efficient way to be faster without using a for loop, such as using a recursive function or any good package functions?
This kind of computation cannot make use of R's vectorization because of the iterative dependency. But the slow-down appears to be coming mainly from indexing performance on a data.frame or data.table.
Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows) versus as columns in a data.table or data.frame.
This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
Please note that I also made a different toy data.frame, so I could test a larger example without returning Inf as y grew with i:
Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):
vec_length <- 200000
dt <- data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                 b = seq(from = 0, to = -1, length.out = vec_length),
                 y = 0)
system.time(for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#  user  system elapsed
# 79.39  146.30  225.78
# NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length = 2*10^5.
tail(dt$y)
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option vector (numeric vectors extracted in advance of loop):
vec_length <- 200000
dt <- data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                 b = seq(from = 0, to = -1, length.out = vec_length),
                 y = 0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
# user  system elapsed
# 0.03    0.00    0.03
tail(y)
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option matrix (data.frame converted to matrix before loop):
vec_length <- 200000
dt <- as.matrix(data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                           b = seq(from = 0, to = -1, length.out = vec_length),
                           y = 0))
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
# user  system elapsed
# 0.67    0.01    0.69
tail(dt[, 3])
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
# NOTE: a matrix is actually a vector with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns
Option data.frame with matrix style indexing:
vec_length <- 200000
dt <- data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                 b = seq(from = 0, to = -1, length.out = vec_length),
                 y = 0)
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#   user  system elapsed
# 110.69    0.03  112.01
tail(dt[, 3])
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
An option is to use Rcpp, since this recursive equation is easy to code in C++:
library(Rcpp)
cppFunction("
NumericVector func(NumericVector b, NumericVector a) {
  int len = b.size();
  NumericVector y(len);
  for (int i = 1; i < len; i++) {
    y[i] = (y[i-1] + b[i]) * a[i];
  }
  return(y);
}
")
func(c( 0, 1, 2, 1, 3), c(0:4))
#[1] 0 1 6 21 96
timing code:
vec_length <- 1e7
dt <- data.frame(a = 1:vec_length, b = 1:vec_length, y = 0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
#  user  system elapsed
# 19.22    0.06   19.44
system.time(func(b, a))
# user  system elapsed
# 0.09    0.02    0.09
Here is a base R solution.
According to the information from #ThetaFC, one indication for a speedup is to work on a matrix or plain vectors (rather than a data.frame or data.table). Thus, it is better to have the following preprocessing before calculating df$y, i.e.,
a <- as.numeric(df$a)
b <- as.numeric(df$b)
Then, you have two approaches to get df$y:
writing your customized recursion function
f <- function(k) {
  if (k == 1) return(0)
  c(f(k-1), (tail(f(k-1), 1) + b[k]) * a[k])
}
df$y <- f(nrow(df))
Or a non-recursive function (I guess this will be much faster than the recursive approach):
g <- Vectorize(function(k) sum(rev(cumprod(rev(a[2:k])))*b[2:k]))
df$y <- g(seq(nrow(df)))
such that
> df
  a b  y
1 0 0  0
2 1 1  1
3 2 2  6
4 3 1 21
5 4 3 96
I don't think this will be any faster, but here's one way to do it without an explicit loop
dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b) * a,
                             .init = 0)[-1]]
dt
# a b y
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 6
# 4: 3 1 21
# 5: 4 3 96
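Since the question mentions that Reduce() seemed limited to a single vector, it may be worth adding that the same row-index trick works with base Reduce() as well; a minimal sketch (the column name y2 is mine):
dt[, y2 := Reduce(function(prev, i) (prev + b[i]) * a[i],
                  x = seq_len(.N)[-1], init = 0, accumulate = TRUE)]
The accumulation runs over the row numbers 2:.N, so both a and b are visible inside the function, and init = 0 supplies y[1].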

Iterative partitioning and labeling using data.table

I have an iterative method of partitioning that assigns a label to each observation and continues until all partitions are less than or equal to the specified minimum observations.
Using data.table I have run into issues incorporating '{' and ':='. My current solution is the following:
part.test <- function(x, y, min.obs = 4){
  PART = data.table(x = as.numeric(x), y = as.numeric(y),
                    quadrant = 'q', prior.quadrant = 'q',
                    key = c('quadrant', 'x', 'y'))
  PART = PART[, counts := .N, quadrant]
  setkey(PART, counts, quadrant, x, y)
  i = 0L
  while(i >= 0){
    PART = PART[, counts := .N, quadrant]
    l.PART = sum(PART$counts > min.obs)
    if(l.PART == 0){break}
    min.obs.rows = PART[counts >= min.obs, which = TRUE]
    PART[min.obs.rows, prior.quadrant := quadrant]
    PART[min.obs.rows, quadrant :=
           ifelse(x <= mean(x) & y <= mean(y), paste0(quadrant, 4),
           ifelse(x <= mean(x) & y >  mean(y), paste0(quadrant, 2),
           ifelse(x >  mean(x) & y <= mean(y), paste0(quadrant, 3), paste0(quadrant, 1)))),
         by = quadrant]
    i = i + 1
  }
  return(PART[])
}
Here is an example:
> set.seed(123);x=rnorm(1e5);y=rnorm(1e5)
> part.test(x,y)
x y quadrant prior.quadrant counts
1: 2.45670228 2.4710128 q1111141 q111114 1
2: 2.36216477 2.3211246 q1111144 q111114 1
3: 2.03019608 3.1102172 q1111212 q111121 1
4: 2.18349873 2.7801719 q1111213 q111121 1
5: 2.14224180 2.5529947 q1111231 q111123 1
---
99996: 0.51221861 0.1992352 q143234342 q14323434 4
99997: 0.08995397 -0.6415489 q324423131 q32442313 4
99998: 0.09069140 -0.6427392 q324423131 q32442313 4
99999: 0.09077251 -0.6406127 q324423131 q32442313 4
100000: 0.09077963 -0.6413572 q324423131 q32442313 4
> system.time(part.test(x,y))
user system elapsed
3.45 0.00 3.53
What is the best way to improve this performance using data.table?
EDIT: I have moved the setkey outside of the loop per Frank's comment.
Elaborating on my comment, here's some improvement:
f <- function(x, y, min.obs = 4){
  DT = data.table(x, y, q = "q")[, counts := .N]
  while(TRUE){
    DT[counts >= min.obs, counts := .N, by = q]
    if (max(DT$counts) == min.obs) break
    w = DT[counts >= min.obs, which = TRUE]
    mDT = DT[w, lapply(.SD, mean), by = q, .SDcols = x:y]
    DT[mDT, on = .(q), q_new := {
      lox = x.x <= i.x
      loy = x.y <= i.y
      1L + lox + loy*2L
    }]
    DT[w, q := paste0(q, q_new)]
    DT[, q_new := NULL]
  }
  setorder(DT[], counts, q, x, y)[]
}
system.time(res <- part.test(x,y))
# user system elapsed
# 2.65 0.00 2.66
system.time(fres <- f(x,y))
# user system elapsed
# 0.65 0.05 0.70
# verify they match
fsetequal(
zf <- setnames(copy(fres), "q", "quadrant"),
z <- copy(res)[, prior.quadrant := NULL ]
) # TRUE
Maybe why it's faster:
GForce is used to compute the means in mDT.
Arithmetic is used instead of ifelse.
Keying/sorting is only done once.
It could probably be even faster than this.
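As a small check of my own (not part of the answer), the arithmetic 1L + lox + loy*2L reproduces exactly the quadrant labels that the nested ifelse in part.test pastes on:
grid <- CJ(lox = c(FALSE, TRUE), loy = c(FALSE, TRUE))
grid[, code := 1L + lox + loy * 2L]
grid
#      lox   loy code
# 1: FALSE FALSE    1
# 2: FALSE  TRUE    3
# 3:  TRUE FALSE    2
# 4:  TRUE  TRUE    4
Both coordinates at or below their group means give 4, only x below gives 2, only y below gives 3, and neither gives 1, matching part.test.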

R, for loop, scalable solutions

I have data that looks like this :
    char_column date_column1 date_column2 integer_column
415  18JT9R6EKV   2014-08-28   2014-09-06              1
26   18JT9R6EKV   2014-12-08   2014-12-11              2
374  18JT9R6EKV   2015-03-03   2015-03-09              1
139  1PEGXAVCN5   2014-05-06   2014-05-10              3
969  1PEGXAVCN5   2014-06-11   2014-06-15              2
649  1PEGXAVCN5   2014-08-12   2014-08-16              3
I want to perform a loop that checks every row against the preceding row and, given certain conditions, assigns them the same number (so I can group them later). The point is that if the date segments are close enough, I want to collapse them into one segment.
My attempt is the following:
i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)){
  z[i] <- ifelse(df[i, 'char_column'] == df[i-1, 'char_column'],
                 ifelse((df[i, 'date_column1'] - df[i-1, 'date_column2']) <= 5,
                        ifelse(df[i, 'integer_column'] == df[i-1, 'integer_column'],
                               v, v <- v+1),
                        v <- v+1),
                 v <- v+1)
}
df$grouping <- z
Then I would just group using min(date_column1) and max(date_column2).
This method works perfectly for, say, 100,000 rows (22.86 seconds), but for a million rows it takes 33.18 minutes, and I have over 60 million rows to process.
Is there a way I can make the process more efficient?
PS: to generate a similar table you can use the following code:
x <- NULL
for (i in 1:200) { x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse = '') }
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
                 date_column1 = as.Date(y, origin = '1970-01-01'),
                 date_column2 = as.Date(y2, origin = '1970-01-01'),
                 integer_column = sample(1:3, 1000, replace = T),
                 row.names = NULL)
df <- df[order(df$char_column, df$date_column1),]
Since data.table::rleid does not work, I post another (hopefully) fast solution
1. Get rid of nested ifelse
ifelse is often slow, especially for scalar evaluation; use if instead.
Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A&B) x else y
f1 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col  <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col   <- df[, 'integer_column']
  for (i in 2:nrow(df)){
    if((char_col[i] == char_col[i-1]) &
       ((date_col1[i] - date_col2[i-1]) <= 5) &
       (int_col[i] == int_col[i-1])){
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1] + 1
    }
  }
  z
}
f1 is about 40% faster than the original solution for 10,000 rows.
system.time(f1(df))
user system elapsed
2.72 0.00 2.79
2. Vectorize
Upon closer inspection the conditions inside if can be vectorized
library(data.table)
f2 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col  <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col   <- df[, 'integer_column']
  cond <- (char_col == shift(char_col)) &
          (date_col1 - shift(date_col2) <= 5) &
          (int_col == shift(int_col))
  for (i in 2:nrow(df)){
    if(cond[i]){
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1] + 1
    }
  }
  z
}
# for 10000 rows
system.time(f2(df))
# user system elapsed
# 0.01 0.00 0.02
3. Vectorize, Vectorize
While f2 is already quite fast, a further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 when cond is FALSE. This is none other than cumsum(!cond).
f3 <- function(df){
  setDT(df)
  df[, cond := (char_column == shift(char_column)) &
               (date_column1 - shift(date_column2) <= 5) &
               (integer_column == shift(integer_column))]
  df[, group := cumsum(!c(FALSE, cond[-1L]))]
}
For 1M rows
system.time(f3(df))
# user system elapsed
# 0.05 0.05 0.09
system.time(f2(df))
# user system elapsed
# 1.83 0.05 1.87
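For completeness, a quick sanity check of my own (not part of the answer), confirming on a small slice of df that the fully vectorized grouping reproduces the loop-based one:
df0 <- as.data.frame(df)[1:1000, c('char_column', 'date_column1', 'date_column2', 'integer_column')]
z1 <- f1(df0)                       # loop version from step 1
z3 <- f3(as.data.table(df0))$group  # vectorized version, run on a copy
identical(as.integer(z1), z3)       # should be TRUE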

How to write a cumulative calculation in data.table

A sequential, cumulative calculation
I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.
A simplified example of the calculation looks like this, for each time step (row) i:
v[i] <- a[i] + b[i] * v[i-1]
a and b are vectors of parameter values, and v is the result vector. For the first row (i == 1) the initial value of v is taken as v0 = 0.
First attempt
My first thought was to use shift() in data.table. A minimal example, including the desired result v.ans, is
library(data.table) # version 1.9.7
DT <- data.table(a = 1:4,
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321))
DT
#    a   b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321
DT[, v := NA] # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
#    a   b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4
This doesn't work, because shift(v) gives a copy of the original column v, shifted by 1 row. It is unaffected by assignment to v.
I also considered building the equation using cumsum() and cumprod(), but that won't work either.
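(A side note of my own: when none of the b[i] are zero, the recursion does unroll into a cumprod()/cumsum() identity, v = cumprod(b) * cumsum(a / cumprod(b)), but the cumulative product under- or overflows quickly for long series (with b = 0.1 it reaches zero within a few hundred rows), so it is not a robust general answer:)
all.equal(DT[, cumprod(b) * cumsum(a / cumprod(b))], DT$v.ans)
# [1] TRUE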
Brute force approach
So I resort to a for loop inside a function for convenience:
vcalc <- function(a, b, v0 = 0) {
  v <- rep(NA, length(a)) # initialize v
  for (i in 1:length(a)) {
    v[i] <- a[i] + b[i] * ifelse(i == 1, v0, v[i-1])
  }
  return(v)
}
This cumulative function works fine with data.table:
DT[, v := vcalc(a, b, 0)][]
# a b v.ans v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE
My question
My question is, can I write this calculation in a more concise and efficient data.table way, without having to use the for loop and/or function definition? Using set() perhaps?
Or is there a better approach all together?
Edit: A better loop
David's Rcpp solution below inspired me to remove the ifelse() from the for loop:
vcalc2 <- function(a, b, v0 = 0) {
  v <- rep(NA, length(a))
  for (i in 1:length(a)) {
    v0 <- v[i] <- a[i] + b[i] * v0
  }
  return(v)
}
vcalc2() is 60% faster than vcalc().
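For what it's worth, a minimal sketch of how that difference could be measured (timings will of course vary by machine; no numbers are claimed here):
library(microbenchmark)
n <- 1e4
a <- 1:n; b <- rnorm(n)
microbenchmark(vcalc(a, b, 0), vcalc2(a, b, 0))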
It may not be 100% what you are looking for, as it does not use the "data.table-way" and still uses a for-loop. However, this approach should be faster (I assume you want to use data.table and the data.table-way to speed up your code). I leverage Rcpp to write a short function called HydroFun, that can be used in R like any other function (you just need to source the function first). My gut-feeling tells me that the data.table way (if existent) is pretty complicated because you cannot compute a closed-form solution (but I may be wrong on this point...).
My approach looks like this:
The Rcpp function looks like this (in the file: hydrofun.cpp):
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
  // get the size of the vectors
  int vecSize = a.length();
  // initialize a numeric vector "v" (for the result)
  NumericVector v(vecSize);
  // compute v_0
  v[0] = a[0] + b[0] * v0;
  // loop through the vector and compute the new value
  for (int i = 1; i < vecSize; ++i) {
    v[i] = a[i] + b[i] * v[i - 1];
  }
  return v;
}
To source and use the function in R you can do:
Rcpp::sourceCpp("hydrofun.cpp")
library(data.table)
DT <- data.table(a = 1:4,
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321))
DT[, v_ans2 := HydroFun(a, b, 0)]
DT
#    a   b v.ans v_ans2
# 1: 1 0.1 1.000  1.000
# 2: 2 0.1 2.100  2.100
# 3: 3 0.1 3.210  3.210
# 4: 4 0.1 4.321  4.321
Which gives the result you are looking for (at least from the value-perspective).
Comparing the speeds reveals a speed-up of roughly 65x.
library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
                 b = rnorm(n))
microbenchmark(dt[, v1 := vcalc(a, b, 0)],
               dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
#                              expr       min        lq       mean    median         uq       max neval
#    dt[, `:=`(v1, vcalc(a, b, 0))] 28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433   100
# dt[, `:=`(v2, HydroFun(a, b, 0))]   381.307   421.697   512.2957   512.717   560.8585  1496.297   100
identical(dt$v1, dt$v2)
# [1] TRUE
Does that help you in any way?
I think Reduce together with accumulate = TRUE is a commonly used technique for these types of calculations (see e.g. recursively using the output as an input for a function). It is not necessarily faster than a well-written loop*, and I don't know how data.table-esque you believe it is, still I want to suggest it for your toolbox.
DT[ , v := 0][
  , v := Reduce(f = function(v, i) a[i] + b[i] * v,
                x = .I[-1], init = a[1], accumulate = TRUE)]
DT
# a b v.ans v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
Explanation:
Set the initial value of v to 0 (v := 0). Use Reduce to apply the function f over an integer vector of row numbers, excluding the first row (x = .I[-1]). The value a[1] is instead supplied as the starting value (init = a[1]).
Reduce then "successively applies f to the elements [...] from left to right".
The successive reduce combinations are "accumulated" (accumulate = TRUE).
*See e.g. here, where you also can read more about Reduce in this section.

Find values in a given interval without a vector scan

With the R package data.table, is it possible to find the values that are in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot
Here is a variation of the code proposed by #user1935457 (see the comment in #user1935457's post).
system.time({
  if(!identical(key(DT), "x")) setkey(DT, x)
  setkey(IT, min)
  # below is the line that differs from #user1935457
  # using IT to address the lines of DT creates a smaller intermediate table
  # we can also directly use .I
  target.low <- DT[IT, list(i = i, min = .I), roll = -Inf, nomatch = 0][, list(min = min[1]), keyby = i]
  setattr(IT, "sorted", "max")
  # same here
  target.high <- DT[IT, list(i = i, max = .I), roll = Inf, nomatch = 0][, list(max = last(max)), keyby = i]
  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]
  rm(target.low, target.high)
  ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][
    , i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY = FALSE))]
  setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# #user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should be more noticeable with a larger DT and a smaller IT. Thanks again to #user1935457 for your answer.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
That out of the way, we need a bigger example to exhibit difference from vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
                 min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                 max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the data table, is a not-too-random subset of 1:n. IT, the interval table, is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
  ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
##  user  system elapsed
## 84.15    4.48   88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
  # Save time if DT is already keyed correctly
  if(!identical(key(DT), "x")) setkey(DT, x)
  DT[, row := .I]
  setkey(IT, min)
  target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
  # Non-overlapping intervals => (sorted by min => sorted by max)
  setattr(IT, "sorted", "max")
  target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]
  rm(target.low, target.high)
  ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][
    , i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY = FALSE))]
  ans.roll[, row := NULL]
  setcolorder(ans.roll, c("i", "x"))
})
##  user  system elapsed
##  0.12    0.00    0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As #mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
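For completeness, a sketch using non-equi joins, which were added to data.table after this exchange (version 1.9.8+); this is not part of the original answers and assumes that feature is available:
# for each interval in I, pull the DT rows whose x falls inside it;
# x.x refers to DT's own x column (the join column itself would report I's bound)
DT[I, .(i, x = x.x), on = .(x >= min, x <= max), nomatch = 0]
#    i  x
# 1: 1  3
# 2: 1  5
# 3: 1  8
# 4: 2 21
# 5: 2 34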
