As a simple example I need to create the following table (this is my desired result):
library(data.table)
DT <- data.table( A= c(2,5,4,-2,4),
B= c(1000,1100,1375,1650,1485),
C= c(50,55,68.75,82.5,74.25),
D= c(100,275,275,-165,297))
DT
This is my attempt so far, which is not working:
DT.2 <- data.table(A= c(2,5,4,-2,4) )
DT.2[, B := 1000 ] # B should begin at 1000, then cumulatively add column D
DT.2[, C := B * 0.05 ]
DT.2[, D := A * C ]
DT.2[, B := 1000 + cumsum(shift(D, type= "lag", fill=FALSE)) ]
DT.2
As you can see, the columns do not update correctly, since each column relies on the results of the other columns. It's a very easy calculation in Excel; I just need to understand how to adapt it to R.
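To make the dependency explicit, each row of the desired table satisfies:
B[1] = 1000, and B[t] = B[t-1] + D[t-1] for t > 1
C[t] = 0.05 * B[t]
D[t] = A[t] * C[t]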
Thank you
So far as I can tell, you need a loop. Here's my input data:
DT <- data.table(A = c(2, 5, 4, -2, 4),
B = c(1000, rep(NA, 4)),
C = numeric(5),
D = numeric(5))
And the loop I use:
#initial row
DT[1, c("C", "D") := .(.05 * B, .05 * A * B)]
#subsequent rows
for (nn in 2:nrow(DT)){
new_B <- DT[nn - 1L, B + D]
DT[nn, c("B", "C", "D") := .(new_B, .05 * new_B, .05 * A * new_B)]
}
Could also easily be translated to a sapply so you don't have nn sitting around in your namespace (but now it will print something meaningless to your console):
sapply(2:nrow(DT), function(nn){
  new_B <- DT[nn - 1L, B + D]
  DT[nn, c("B", "C", "D") := .(new_B, .05 * new_B, .05 * A * new_B)]
})
And yes, I can imagine this would look much simpler in Excel since it auto-updates the formula for you with click+drag.
I think that you actually want a function, and I am guessing that it would make more sense like this:
fun <- function(id, ## not sure what this is, probably an id
start = 1000,
rate = .05,
gain_or_loss){
require(data.table)
cnt <- length(id)
ret <- data.table(ID = c(id, NA),
bal = cumsum(c(start, gain_or_loss)),
fixed_change = rep(NA_real_, cnt + 1),
total_change = c(gain_or_loss, NA_real_))
ret[ , fixed_change := bal * rate ] ## use the rate argument instead of hard-coding 0.05
ret <- ret[] ## needed so the result prints when returned (a data.table quirk)
return(ret)
}
Which would give:
fun(id = c(2,5,4,-2,4),
gain_or_loss = c(100,275,275,-165,297))
# ID bal fixed_change total_change
# 1: 2 1000 50.00 100
# 2: 5 1100 55.00 275
# 3: 4 1375 68.75 275
# 4: -2 1650 82.50 -165
# 5: 4 1485 74.25 297
# 6: NA 1782 89.10 NA
I would also add a check right after the require line: if (length(id) != length(gain_or_loss)) stop("id and gain_or_loss need to be of the same length").
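For example, a minimal sketch of where that check could sit (same function as above, abbreviated):
fun <- function(id, start = 1000, rate = .05, gain_or_loss){
  require(data.table)
  if (length(id) != length(gain_or_loss)) stop("id and gain_or_loss need to be of the same length")
  ## ... rest of the function body as above
}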
Summary of real-world problem
Essentially this is a scenario evaluation of a linear system of equations.
I have two data tables.
s_dt contains the scenarios, drivers (d) and values (v) for each observed scenario (o).
c_dt contains a series of terms (n) for a number of fitted model bases (b).
The individual powers of the drivers and the associated coefficients are coded into d and t as name-value pairs.
Each basis (b) is essentially a polynomial with n terms.
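Spelled out (this is just my reading of fun defined below), the result for scenario o and basis b is:
r[o, b] = sum over terms n of ( t_const[b, n] * product over drivers d of v[o, d] ^ t[b, n, d] )
where t_const[b, n] is the term's coefficient (the row with d == "c"), and the product runs over the drivers that appear in term n of basis b.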
The issue
The repro case below gives the desired output format.
But it is far too slow for the required use case, even on a cut-down problem.
The numbers are junk, as I can't share the actual data; running on the real-world data gives similar timings.
It takes circa 3 sec for the "lil" problem on my system (12 threads).
But the "big" problem is about 4000 times larger, so expect circa 3 hours. Ouch!
The aim is to have the "big" problem run in under 5 min (or ideally much faster!).
So, awesome clever people, how can this be made a lot faster?
(And what is the root cause of the slowdown?)
I'll happily accept base/tidyverse solutions too, if they meet the performance needs. I just assumed data.table was the best way to go for the size of the problem.
Current solution
Run fun on s_dt, grouping by o.
fun: joins c_dt with each group's data to populate v, enabling calculation of r (the result of evaluating each of the polynomial equations).
In data.table parlance:
s_dt[, fun(.SD), keyby = .(o)]
Repro case
Creates two data.tables that have the combinations and field types matching the real-world problem,
but with a cut-down size for illustrative purposes.
Defines fun, then runs it to populate r for all scenarios.
library(data.table)
# problem sizing ----
dims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
dims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function() {
o <- seq_len(dims$o)
d <- paste0("d",seq_len(dims$d))
v <- as.double(seq_len(dims$o * dims$d))/10000
CJ(o, d)[, `:=`(v = v)]
}
s_dt <- build_s()
build_c <- function() {
b <- paste0("c", seq_len(dims$b))
n <- seq_len(dims$n)
d <- c("c", paste0("d", seq_len(dims$d)))
t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d+1)))
dt <- CJ(d, b, n)[, `:=`(t = t)]
dt <- dt[t != 0]
}
c_dt <- build_c()
# define fun and evaluate ----
# (this is what needs optimising)
profvis::profvis({
fun <- function(dt) {
# don't use chaining here, for more useful profvis output
dt <- dt[c_dt, on = .(d)]
dt <- dt[, r := fcase(d == "c", t,
is.na(v), 0,
rep(TRUE, .N), v^t)]
dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
dt <- dt[, .(r = sum(r)), keyby = .(b)]
}
res <- s_dt[, fun(.SD), keyby = .(o)]
})
Example inputs and outputs
> res
o b r
1: 1 c1 0.000000e+00
2: 1 c10 0.000000e+00
3: 1 c11 0.000000e+00
4: 1 c12 0.000000e+00
5: 1 c13 0.000000e+00
---
2496: 100 c5 6.836792e-43
2497: 100 c6 6.629646e-43
2498: 100 c7 6.840915e-43
2499: 100 c8 6.624668e-43
2500: 100 c9 6.842608e-43
> s_dt
o d v
1: 1 d1 0.0001
2: 1 d10 0.0002
3: 1 d11 0.0003
4: 1 d12 0.0004
5: 1 d13 0.0005
---
4996: 100 d50 0.4996
4997: 100 d6 0.4997
4998: 100 d7 0.4998
4999: 100 d8 0.4999
5000: 100 d9 0.5000
> c_dt
d b n t
1: c c1 2 1
2: c c1 3 2
3: c c1 4 3
4: c c1 5 4
5: c c1 6 5
---
218567: d9 c9 195 5
218568: d9 c9 196 6
218569: d9 c9 198 1
218570: d9 c9 199 2
218571: d9 c9 200 3
This would be difficult to fully vectorize. The "big" problem requires so many operations that going parallel is probably the most straightforward way to get to ~5 minutes.
But first, we can get a ~3x speed boost by using RcppArmadillo for the product and sum calculations instead of data.table's grouping operations.
library(data.table)
library(parallel)
Rcpp::cppFunction(
"std::vector<double> sumprod(arma::cube& a) {
for(unsigned int i = 1; i < a.n_slices; i++) a.slice(0) %= a.slice(i);
return(as<std::vector<double>>(wrap(sum(a.slice(0), 0))));
}",
depends = "RcppArmadillo",
plugins = "cpp11"
)
cl <- makeForkCluster(detectCores() - 1L)
The following approach requires extensive preprocessing. The upshot is that it makes it trivial to parallelize. However, it will work only if the values of s_dt$d are the same for each o as in the MRE:
identical(s_dt$d, rep(s_dt[o == 1]$d, length.out = nrow(s_dt)))
#> [1] TRUE
Now let's build the functions to accept s_dt and c_dt:
# slightly modified original function for comparison
fun1 <- function(dt, c_dt) {
# don't use chaining here, for more useful profvis output
dt <- dt[c_dt, on = .(d)]
dt <- dt[, r := fcase(d == "c", t,
is.na(v), 0,
rep(TRUE, .N), v^t)]
dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
dt <- dt[, .(r = sum(r)), keyby = .(b)]
}
fun2 <- function(s_dt, c_dt, cl = NULL) {
s_dt <- copy(s_dt)
c_dt <- copy(c_dt)
# preprocess to get "a", "tt", "i", and "idxs"
i_dt <- s_dt[o == 1][, idxs := .I][c_dt, on = .(d)][, ic := .I][!is.na(v)]
ub <- unique(c_dt$b)
un <- unique(c_dt$n)
nb <- length(ub)
nn <- length(un)
c_dt[, `:=`(i = match(n, un) + nn*(match(b, ub) - 1L), r = 0)]
c_dt[, `:=`(i = i + (0:(.N - 1L))*nn*nb, ni = .N), i]
c_dt[d == "c", r := t]
a <- array(1, c(nn, nb, max(c_dt$ni)))
a[c_dt$i] <- c_dt$r # 3-d array to store v^t (updated for each unique "o")
i <- c_dt$i[i_dt$ic] # the indices of "a" to update (same for each unique "o")
tt <- c_dt$t[i_dt$ic] # c_dt$t ordered for "a" (same for each unique "o")
idxs <- i_dt$idxs # the indices to order s_dt$v (same for each unique "o")
uo <- unique(s_dt$o)
v <- collapse::gsplit(s_dt$v, s_dt$o)
if (is.null(cl)) {
# non-parallel solution
data.table(
o = rep(uo, each = length(ub)),
b = rep(ub, length(v)),
r = unlist(
lapply(
v,
function(x) {
a[i] <- x[idxs]^tt
sumprod(a)
}
)
),
key = "o"
)
} else {
# parallel solution
clusterExport(cl, c("a", "tt", "i", "idxs"), environment())
data.table(
o = rep(uo, each = length(ub)),
b = rep(ub, length(v)),
r = unlist(
parLapply(
cl,
v,
function(x) {
a[i] <- x[idxs]^tt
sumprod(a)
}
)
),
key = "o"
)
}
}
Now the data:
# problem sizing ----
bigdims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
lildims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function(dims) {
o <- seq_len(dims$o)
d <- paste0("d",seq_len(dims$d))
v <- as.double(seq_len(dims$o * dims$d))/10000
CJ(o, d)[, `:=`(v = v)]
}
build_c <- function(dims) {
b <- paste0("c", seq_len(dims$b))
n <- seq_len(dims$n)
d <- c("c", paste0("d", seq_len(dims$d)))
t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d+1)))
dt <- CJ(d, b, n)[, `:=`(t = t)]
dt <- dt[t != 0]
}
Timing the lil problem, which is so small that parallelization doesn't help:
s_dt <- build_s(lildims)
c_dt <- build_c(lildims)
microbenchmark::microbenchmark(fun1 = s_dt[, fun1(.SD, c_dt), o],
fun2 = fun2(s_dt, c_dt),
times = 10,
check = "equal")
#> Unit: seconds
#> expr min lq mean median uq max neval
#> fun1 3.204402 3.237741 3.383257 3.315450 3.404692 3.888289 10
#> fun2 1.134680 1.138761 1.179907 1.179872 1.210293 1.259249 10
Now the big problem:
s_dt <- build_s(bigdims)
c_dt <- build_c(bigdims)
system.time(dt2p <- fun2(s_dt, c_dt, cl))
#> user system elapsed
#> 24.937 9.386 330.600
stopCluster(cl)
A bit longer than 5 minutes with 31 cores.
Is it possible to use data.table to apply a two-parameter function quickly by group across a data set? On a 1 million row data set, I am finding that calling the simple function defined below is taking over 11 seconds, which is much longer than I would expect for something of this complexity.
The self-contained code below outlines the essentials of what I am trying to do:
# generate data frame - 1 million rows
library(data.table)
set.seed(42)
nn = 1e6
daf = data.frame(aa=sample(1:1000, nn, repl=TRUE),
bb=sample(1:1000, nn, repl=TRUE),
xx=rnorm(nn),
yy=rnorm(nn),
stringsAsFactors=FALSE)
# myfunc is the function to apply to each group
myfunc = function(xx, yy) {
if (max(yy)>1) {
return(mean(xx))
} else {
return(weighted.mean(yy, ifelse(xx>0, 2, 1)))
}
}
# running the function takes around 11.5 seconds
system.time({
dt = data.table(daf, key=c("aa","bb"))
dt = dt[,myfunc(xx, yy), by=c("aa","bb")]
})
head(dt)
# OUTPUT:
# aa bb V1
# 1: 1 2 -1.02605645
# 2: 1 3 -0.49318243
# 3: 1 4 0.02165797
# 4: 1 5 0.40811793
# 5: 1 6 -1.00312393
# 6: 1 7 0.14754417
Is there a way to significantly reduce the time for a function call like this?
I am interested in whether there is a more efficient way to perform the above calculation without completely re-writing the function call, or whether it can only be sped up by breaking apart the function and somehow rewriting it in data.table syntax.
Many thanks in advance for your replies.
Your results:
system.time({
dt = data.table(daf, key = c("aa","bb"))
dt = dt[,myfunc(xx, yy), by = c("aa","bb")]
}) # 21.25
dtInitial <- copy(dt)
V1: if NA values do not concern you, you can modify your function like this:
myfunc2 = function(xx, yy) {
if (max(yy) > 1) {
return(mean(xx))
} else {
w <- ifelse(xx > 0, 2, 1)
return(sum((yy * w)[w != 0])/sum(w))
}
}
system.time({
dt = data.table(daf, key = c("aa","bb"))
dtM = dt[, myfunc2(xx, yy), by = c("aa","bb")]
}) # 6.69
all.equal(dtM, dtInitial)
# [1] TRUE
V2: Also, you can do it faster like this:
system.time({
dt3 <- data.table(daf, key = c("aa","bb"))
dt3[, maxy := max(yy), by = c("aa","bb")]
dt3[, meanx := mean(xx), by = c("aa","bb")]
dt3[, w := ifelse(xx > 0, 2, 1)]
dt3[, wm2 := sum((yy * w)[w != 0])/sum(w), by = c("aa","bb")]
r2 <- dt3[, .(aa, bb, V1 = ifelse(maxy > 1, meanx, wm2))]
r2 <- unique(r2)
}) #2.09
all.equal(r2, dtInitial)
# [1] TRUE
20 sec vs 2 sec for me
Update:
Or a little bit faster:
system.time({
dt3 <- data.table(daf, key = c("aa","bb"))
dt3[, w := ifelse(xx > 0, 2, 1)]
dt3[, yyw := yy * w]
r2 <- dt3[, .(maxy = max(yy),
meanx = mean(xx),
wm2 = sum(yyw)/sum(w)),
by = c("aa","bb")]
r2[, V1 := ifelse(maxy > 1, meanx, wm2)]
r2[, c("maxy", "meanx", "wm2") := NULL]
}) # 1.51
all.equal(r2, dtInitial)
# [1] TRUE
Another solution
system.time({
dat <- data.table(daf, key = c("aa","bb"))
dat[, xweight := (xx > 0) * 1 + 1]
result <- dat[, list(MaxY = max(yy), Mean1 = mean(xx), Mean2 = sum(yy*xweight)/sum(xweight)), keyby=c("aa", "bb")]
result[, FinalMean := ifelse(MaxY > 1, Mean1, Mean2)]
})
user system elapsed
1.964 0.059 1.348
I've found a way to gain a further speedup of 8x, which reduces the time down to around 0.2 seconds on my machine. See below. Rather than calculating sum(yyw)/sum(w) directly for each group, which is time-consuming, we instead calculate the quantities sum(yyw) and sum(w) for each group, and only afterwards perform the division. Magic!
system.time({
dt <- data.table(daf, key = c("aa","bb"))
dt[, w := 1][xx > 0, w := 2]
dt[, yyw := yy * w]
res <- dt[, .(maxy = max(yy),
meanx = mean(xx),
wm2num = sum(yyw),
wm2den = sum(w)),
by = c("aa","bb")]
res[, wm2 := wm2num/wm2den]
res[, V1 := wm2][maxy > 1, V1 := meanx]
res[, c("maxy", "meanx", "wm2num", "wm2den", "wm2") := NULL]
}) # 0.19
all.equal(res, dtInitial)
# [1] TRUE
I've implemented a simple dynamic programming example described here, using data.table, in the hope that it would be as fast as vectorized code.
library(data.table)
B=100; M=50; alpha=0.5; beta=0.9;
n = B + M + 1
m = M + 1
u <- function(c)c^alpha
dt <- data.table(s = 0:(B+M))[, .(a = 0:min(s, M)), s] # State Space and corresponding Action Space
dt[, u := (s-a)^alpha] # rewards r(s, a)
dt <- dt[, .(s_next = a:(a+B), u = u), .(s, a)] # all possible (s') for each (s, a)
dt[, p := 1/(B+1), s] # transition probs
# s a s_next u p
# 1: 0 0 0 0 0.009901
# 2: 0 0 1 0 0.009901
# 3: 0 0 2 0 0.009901
# 4: 0 0 3 0 0.009901
# 5: 0 0 4 0 0.009901
# ---
#649022: 150 50 146 10 0.009901
#649023: 150 50 147 10 0.009901
#649024: 150 50 148 10 0.009901
#649025: 150 50 149 10 0.009901
#649026: 150 50 150 10 0.009901
To give a little context to my question: conditional on s and a, the future value of s (s_next) is realized as one of a:(a+B), each with probability p = 1/(B + 1). The u column gives u(s, a) = (s - a)^alpha for each combination (s, a).
Given initial values V (always an n-by-1 vector, one entry per unique state s), V updates according to V[s] = max over a of { u(s, a) + beta * sum(p * V[s_next]) } (the Bellman equation).
The maximization is with respect to a, hence the [, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] step in the iteration below.
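Written out in full, the update being iterated is:
V_new[s] = max over a in 0:min(s, M) of { (s - a)^alpha + beta * sum over s_next in a:(a+B) of V[s_next + 1] / (B + 1) }
which is exactly what the VFI function further below computes state by state.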
Actually, there is a very efficient vectorized solution. I thought the data.table solution would be comparable in performance to the vectorized approach.
I know that the main culprit is dt[, v := V[s_next + 1]]. Alas, I have no idea how to fix it.
# Iteration starts here
system.time({
V <- rep(0, n) # initial guess for Value function
i <- 1
tol <- 1
while(tol > 0.0001){
dt[, v := V[s_next + 1]]
dt[, v := u + beta * sum(p*v), by = .(s, a)
][, `:=`(v = max(v), i = s_next[which.max(v)]), by = .(s)] # Iteration
dt1 <- dt[, .(v[1L], i[1L]), by = s]
Vnew <- dt1$V1
sig <- dt1$V2
tol <- max(abs(V - Vnew))
V <- Vnew
i <- i + 1
}
})
# user system elapsed
# 5.81 0.40 6.25
To my dismay, the data.table solution is even slower than the following highly non-vectorized solution. As a sloppy data.table user, I must be missing some data.table functionality. Is there a way to improve things, or is data.table simply not suitable for these kinds of computations?
S <- 0:(n-1) # StateSpace
VFI <- function(V){
out <- rep(0, length(V))
for(s in S){
x <- -Inf
for(a in 0:min(s, M)){
s_next <- a:(a+B) # (s')
x <- max(x, u(s-a) + beta * sum(V[s_next + 1]/(B+1)))
}
out[s+1] <- x
}
out
}
system.time({
V <- rep(0, n) # initial guess for Value function
i <- 1
tol <- 1
while(tol > 0.0001){
Vnew <- VFI(V)
tol <- max(abs(V - Vnew))
V <- Vnew
i <- i + 1
}
})
# user system elapsed
# 3.81 0.00 3.81
Here's how I would do this...
DT = CJ(s = seq_len(n)-1L, a = seq_len(m)-1L, s_next = seq_len(n)-1L)
DT[ , p := 0]
#p is 0 unless this is true
DT[between(s_next, a, a + B), p := 1/(B+1)]
#may as well subset to eliminate irrelevant states
DT = DT[p>0 & s>=a]
DT[ , util := u(s - a)]
#don't technically need by, but just to be careful
DT[ , V0 := rep(0, n), by = .(a, s_next)]
while(TRUE) {
#for each s, maximize given past value;
# within each s, have to sum over s_nexts,
# to do so, sum by a
DT[ , V1 := max(.SD[ , util[1L] + beta*sum(V0*p), by = a],
na.rm = TRUE), by = s]
if (DT[ , max(abs(V0 - V1))] < 1e-4) break
DT[ , V0 := V1]
}
On my machine this takes about 15 seconds (so not good)... but maybe this will give you some ideas. For example, this data.table is far too large, since ultimately there are really only n unique values of V.
1) Is it possible to do operations (multiplication, division, addition, subtraction) between unequal-sized data.tables using data.table, or will it have to be done with data.frame?
The following example is a simplified version of my original posting. In my actual data set, it would be A1:A12, B1:B12, C1:C12, E1:E12, F1:F12, etc. I've added in columns J and K to get close to my original data set and to show that I cannot do the following in a matrix.
# Sample Data
library(data.table)
input1a <- data.table(ID = c(37, 45, 900),
A1 = c(1, 2, 3),
A2 = c(43, 320, 390),
B1 = c(-0.94, 2.2, -1.223),
B2 = c(2.32, 4.54, 7.21),
C1 = c(1, 2, 3),
C2 = c(-0.94, 2.2, -1.223),
D = c(43, 320, 390),
J = paste0("measurement_1", 1:3),
K = paste0("type_1", 1:3))
setkey(input1a, ID)
input1a
# ID A1 A2 B1 B2 C1 C2 D J K
# 1: 37 1 43 -0.940 2.32 1 -0.940 43 measurement_11 type_11
# 2: 45 2 320 2.200 4.54 2 2.200 320 measurement_12 type_12
# 3: 900 3 390 -1.223 7.21 3 -1.223 390 measurement_13 type_13
input2a <- data.table(ID = c(37, 45, 900),
E1 = c(23, -0.2, 12),
E2 = c(-0.33, -0.012, -1.342))
setkey(input2a, ID)
input2a
# ID E1 E2
# 1: 37 23.0 -0.330
# 2: 45 -0.2 -0.012
# 3: 900 12.0 -1.342
outputa <- 0.00066 * input1a[, c(4:5), with = FALSE] *
input1a[, 8, with = FALSE] * input2a[, c(2:3), with = FALSE] # no keys, but would
# like to keep the keys
# i.e. outputa <- 0.00066 * B1:B2 * D * E1:E2, referring back to the column names
setnames(outputa, 2:3, c("F1", "F2"))
Result using outputa
outputa # using existing code and gives a result with no keys
# F1 F2
# 1: -0.6135756 -0.02172773
# 2: -0.0929280 -0.01150618
# 3: -3.7776024 -2.49055607
In the following code I took outputa, which did not keep the keys, and rewrote outputa as outputause. I would like to have the following question answered so that I can perform the needed operations on the data set while keeping the keys intact.
2) How can the following code be rewritten with x defined for each group of columns? This question stems from Weighted sum of variables by groups with data.table and my trouble trying to replicate any of the answers with my data set.
Each group of columns is defined below:
A1:A2 (input1a[, 2:3]),
B1:B2 (input1a[, 4:5]), and
D input1a[, 8]
In outputause, if input1a[, c(4:5), with = FALSE] were the only group taken from input1a, then it alone would be x.
What about when you have more than one group from a single data.table, as shown below?
outputause <- input1a[, lapply(.SD, function(x) {
0.00066 * input1a[, c(4:5), with = FALSE] * input1a[, 8, with = FALSE] *
input2a[, c(2, 3), with = FALSE]
}), by = key(input1a)] # keeping keys intact
setnames(outputause, 2:3, c("F1", "F2"))
Result using outputause
outputause # using revised code and result includes the keys
# ID F1 F2
# 1: 37 -0.6135756 -0.02172773
# 2: 45 -0.0929280 -0.01150618
# 3: 900 -3.7776024 -2.49055607
UPDATE
input2at <- data.table(t(input2a))
inputs <- data.table(input1a, input2at)
I have transposed input2a and combined it with input1a in the data.table inputs. In this simple example I had 3 rows, but in my actual data set I'll have close to 1300 rows. This is the reason why I've asked question 2).
Thank you.
I am answering my own question based on an answer provided to me in R data.table operations with multiple groups in single data.table and outside function with lapply.
outputa <- data.table(input1a, input2a)
setnames(outputa, 8, "D1")
outputa[, D2 := D1]
fun <- function(B, D, E) 0.00066 * B * D * E
outputa[, lapply(1:2, function(i) fun(get(paste0('B', i)),
get(paste0('D', i)),
get(paste0('E', i)))),
by = ID]
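If I have followed the pieces correctly, this should reproduce the F1/F2 values from outputause above, now keyed by ID (the columns come out as V1 and V2 unless renamed):
# ID V1 V2
# 1: 37 -0.6135756 -0.02172773
# 2: 45 -0.0929280 -0.01150618
# 3: 900 -3.7776024 -2.49055607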
With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot
Here is a variation of the code proposed by @user1935457 (see the comment in @user1935457's post).
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
#below is the line that differs from #user1935457
#Using IT to address the lines of DT creates a smaller intermediate table
#We can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# #user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should be more noticeable with a larger DT and a smaller IT. Thanks again to @user1935457 for your answer.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
With that out of the way, we need a bigger example to exhibit the difference from the vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the Data Table, is a not-too-random subset of 1:n. IT, the Interval Table, contains ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As @mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
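For completeness: newer versions of data.table (1.9.8+) support non-equi joins, which can express this interval lookup directly and use binary search rather than a vector scan. A minimal sketch (untested at scale; the x.x / i.i prefixes pull the columns from DT and I respectively):
DT <- data.table(x = c(1,1,2,3,5,8,13,21,34,55,89), key = "x")
I <- data.table(i = c(1,2), min = c(3,20), max = c(10,40))
DT[I, .(i = i.i, x = x.x), on = .(x >= min, x <= max), nomatch = 0L]
# i x
# 1: 1 3
# 2: 1 5
# 3: 1 8
# 4: 2 21
# 5: 2 34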