Equivalent for dlply in data.table

I'm trying to achieve with data.table what dlply does. Just as a very simple example:
library(plyr)
library(data.table)
dt <- data.table( p = c("A", "B"), q = 1:2 )
dlply( dt, "p", identity )
$A
p q
1 A 1
$B
p q
1 B 2
dt[ , identity(.SD), by = p ]
p q
1: A 1
2: B 2
foo <- function(x) as.list(x)
dt[ , foo(.SD), by = p ]
p q
1: A 1
2: B 2
Obviously the return values of foo are collapsed into one data.table. And I don't want to use dlply because it passes the split data.tables as data.frames to foo, which makes further data.table operations within foo inefficient.

Here's a more data.table-oriented approach:
setkey(dt, p)
dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1
#[[1]]
# p q
#1: A 1
#
#[[2]]
# p q
#1: B 2
There are more data.table-style alternatives to the above, but that seems to be the fastest. Here's a comparison with lapply:
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )
microbenchmark(dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1, lapply(unique(dt$p), function(x) dt[x]), times = 10)
#Unit: seconds
# expr min lq median uq max neval
#dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1 1.111385 1.508594 1.717357 1.966694 2.108188 10
# lapply(unique(dt$p), function(x) dt[x]) 1.871054 1.934865 2.216192 2.282428 2.367505 10
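For reference, one of those alternatives is simply to wrap .SD itself (a sketch, not benchmarked above); note that, unlike the subset-join, each list element then lacks the grouping column p:
dt[, list(list(.SD)), by = p]$V1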

Try this:
> split(dt, dt[["p"]])
$A
p q
1: A 1
$B
p q
1: B 2

Regarding G. Grothendieck's answer, I was curious how well split performs:
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )
system.time(
ll <- split(dt, dt[ ,p ] )
)
user system elapsed
5.237 1.340 6.563
system.time(
ll <- lapply( unique(dt[,p]), function(x) dt[x] )
)
user system elapsed
1.179 0.363 1.541
So if there is no better answer, I'd stick with
lapply( unique(dt[,p]), function(x) dt[x] )
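A later note: newer data.table versions (around 1.9.8 and later) add a dedicated split.data.table method with a by argument, implemented with data.table's own grouping. A minimal sketch, assuming such a version is installed:
split(dt, by = "p")
It may be worth re-benchmarking this against the lapply approach on your data.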


How can this R data.table join+group+summarise operation be made a lot faster?

Summary of real-world problem
Essentially this is a scenario evaluation of a linear system of equations.
I have two data tables.
s_dt contains the scenarios, drivers (d) and values (v) for each observed scenario (o).
c_dt contains a series of terms (n) for a number of fitted model bases (b).
The individual powers of the drivers and the associated coefficients are coded into d and t as name-value pairs.
Each basis (b) is essentially a polynomial with n terms.
The issue
The repro case below gives the desired output format.
But it is far too slow for the required use case, even on a cut-down problem.
The numbers are junk, but I can't share the actual data; running on the real-world data gives similar timings.
It takes circa 3 sec for the "lil" problem on my system (12 threads).
But the "big" problem is 4000 times larger, so expect circa 3 hours. Ouch!
The aim is to have the "big" problem run in under 5 min (or ideally much faster!).
So, awesome clever people, how can this be made a lot faster?
(And what is the root cause of the slowdown?)
I'll happily accept base/tidyverse solutions too, if they meet the performance needs. I just assumed data.table was the best way to go for the size of the problem.
Current solution
Run fun on s_dt, grouping by o.
fun: joins c_dt with each group's data to populate v, enabling calculation of r (the result of evaluating each of the polynomial equations).
In data.table parlance:
s_dt[, fun(.SD), keyby = .(o)]
Repro case
Creates two data.tables whose combinations and field types match the real-world problem, but with a cut-down size for illustrative purposes.
Defines fun, then runs it to populate r for all scenarios.
library(data.table)
# problem sizing ----
dims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
dims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function() {
o <- seq_len(dims$o)
d <- paste0("d",seq_len(dims$d))
v <- as.double(seq_len(dims$o * dims$d))/10000
CJ(o, d)[, `:=`(v = v)]
}
s_dt <- build_s()
build_c <- function() {
b <- paste0("c", seq_len(dims$b))
n <- seq_len(dims$n)
d <- c("c", paste0("d", seq_len(dims$d)))
t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d+1)))
dt <- CJ(d, b, n)[, `:=`(t = t)]
dt <- dt[t != 0]
}
c_dt <- build_c()
# define fun and evaluate ----
# (this is what needs optimising)
profvis::profvis({
  fun <- function(dt) {
    # don't use chaining here, for more useful profvis output
    dt <- dt[c_dt, on = .(d)]
    dt <- dt[, r := fcase(d == "c", t,
                          is.na(v), 0,
                          rep(TRUE, .N), v^t)]
    dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
    dt <- dt[, .(r = sum(r)), keyby = .(b)]
  }
  res <- s_dt[, fun(.SD), keyby = .(o)]
})
Example inputs and outputs
> res
o b r
1: 1 c1 0.000000e+00
2: 1 c10 0.000000e+00
3: 1 c11 0.000000e+00
4: 1 c12 0.000000e+00
5: 1 c13 0.000000e+00
---
2496: 100 c5 6.836792e-43
2497: 100 c6 6.629646e-43
2498: 100 c7 6.840915e-43
2499: 100 c8 6.624668e-43
2500: 100 c9 6.842608e-43
> s_dt
o d v
1: 1 d1 0.0001
2: 1 d10 0.0002
3: 1 d11 0.0003
4: 1 d12 0.0004
5: 1 d13 0.0005
---
4996: 100 d50 0.4996
4997: 100 d6 0.4997
4998: 100 d7 0.4998
4999: 100 d8 0.4999
5000: 100 d9 0.5000
> c_dt
d b n t
1: c c1 2 1
2: c c1 3 2
3: c c1 4 3
4: c c1 5 4
5: c c1 6 5
---
218567: d9 c9 195 5
218568: d9 c9 196 6
218569: d9 c9 198 1
218570: d9 c9 199 2
218571: d9 c9 200 3
This would be difficult to fully vectorize. The "big" problem requires so many operations that going parallel is probably the most straightforward way to get to ~5 minutes.
But first, we can get a ~3x speed boost by using RcppArmadillo for the product and sum calculations instead of data.table's grouping operations.
library(data.table)
library(parallel)
Rcpp::cppFunction(
  "std::vector<double> sumprod(arma::cube& a) {
     // accumulate the element-wise product of all slices into slice 0
     for(unsigned int i = 1; i < a.n_slices; i++) a.slice(0) %= a.slice(i);
     // column sums of the accumulated slice, returned as a plain numeric vector
     return(as<std::vector<double>>(wrap(sum(a.slice(0), 0))));
   }",
  depends = "RcppArmadillo",
  plugins = "cpp11"
)
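To convince yourself of what sumprod() returns, here is a small sanity check in plain R (just a sketch with made-up dimensions; as I read the Armadillo code, it multiplies the slices element-wise and then takes column sums):
a_chk <- array(runif(3 * 4 * 2), dim = c(3, 4, 2))
ref   <- colSums(apply(a_chk, c(1, 2), prod))  # product over slices, then column sums
all.equal(sumprod(a_chk), ref)                 # expected to be TRUE if the reading above is right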
cl <- makeForkCluster(detectCores() - 1L)
The following approach requires extensive preprocessing. The upshot is that it makes it trivial to parallelize. However, it will work only if the values of s_dt$d are the same for each o as in the MRE:
identical(s_dt$d, rep(s_dt[o == 1]$d, length.out = nrow(s_dt)))
#> [1] TRUE
Now let's build the functions to accept s_dt and c_dt:
# slightly modified original function for comparison
fun1 <- function(dt, c_dt) {
  # don't use chaining here, for more useful profvis output
  dt <- dt[c_dt, on = .(d)]
  dt <- dt[, r := fcase(d == "c", t,
                        is.na(v), 0,
                        rep(TRUE, .N), v^t)]
  dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
  dt <- dt[, .(r = sum(r)), keyby = .(b)]
}

fun2 <- function(s_dt, c_dt, cl = NULL) {
  s_dt <- copy(s_dt)
  c_dt <- copy(c_dt)
  # preprocess to get "a", "tt", "i", and "idxs"
  i_dt <- s_dt[o == 1][, idxs := .I][c_dt, on = .(d)][, ic := .I][!is.na(v)]
  ub <- unique(c_dt$b)
  un <- unique(c_dt$n)
  nb <- length(ub)
  nn <- length(un)
  c_dt[, `:=`(i = match(n, un) + nn*(match(b, ub) - 1L), r = 0)]
  c_dt[, `:=`(i = i + (0:(.N - 1L))*nn*nb, ni = .N), i]
  c_dt[d == "c", r := t]
  a <- array(1, c(nn, nb, max(c_dt$ni)))
  a[c_dt$i] <- c_dt$r   # 3-d array to store v^t (updated for each unique "o")
  i <- c_dt$i[i_dt$ic]  # the indices of "a" to update (same for each unique "o")
  tt <- c_dt$t[i_dt$ic] # c_dt$t ordered for "a" (same for each unique "o")
  idxs <- i_dt$idxs     # the indices to order s_dt$v (same for each unique "o")
  uo <- unique(s_dt$o)
  v <- collapse::gsplit(s_dt$v, s_dt$o)
  if (is.null(cl)) {
    # non-parallel solution
    data.table(
      o = rep(uo, each = length(ub)),
      b = rep(ub, length(v)),
      r = unlist(
        lapply(
          v,
          function(x) {
            a[i] <- x[idxs]^tt
            sumprod(a)
          }
        )
      ),
      key = "o"
    )
  } else {
    # parallel solution
    clusterExport(cl, c("a", "tt", "i", "idxs"), environment())
    data.table(
      o = rep(uo, each = length(ub)),
      b = rep(ub, length(v)),
      r = unlist(
        parLapply(
          cl,
          v,
          function(x) {
            a[i] <- x[idxs]^tt
            sumprod(a)
          }
        )
      ),
      key = "o"
    )
  }
}
Now the data:
# problem sizing ----
bigdims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
lildims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function(dims) {
o <- seq_len(dims$o)
d <- paste0("d",seq_len(dims$d))
v <- as.double(seq_len(dims$o * dims$d))/10000
CJ(o, d)[, `:=`(v = v)]
}
build_c <- function(dims) {
b <- paste0("c", seq_len(dims$b))
n <- seq_len(dims$n)
d <- c("c", paste0("d", seq_len(dims$d)))
t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d+1)))
dt <- CJ(d, b, n)[, `:=`(t = t)]
dt <- dt[t != 0]
}
Timing the lil problem, which is so small that parallelization doesn't help:
s_dt <- build_s(lildims)
c_dt <- build_c(lildims)
microbenchmark::microbenchmark(fun1 = s_dt[, fun1(.SD, c_dt), o],
fun2 = fun2(s_dt, c_dt),
times = 10,
check = "equal")
#> Unit: seconds
#> expr min lq mean median uq max neval
#> fun1 3.204402 3.237741 3.383257 3.315450 3.404692 3.888289 10
#> fun2 1.134680 1.138761 1.179907 1.179872 1.210293 1.259249 10
Now the big problem:
s_dt <- build_s(bigdims)
c_dt <- build_c(bigdims)
system.time(dt2p <- fun2(s_dt, c_dt, cl))
#> user system elapsed
#> 24.937 9.386 330.600
stopCluster(cl)
A bit longer than 5 minutes with 31 cores.
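If you want extra reassurance on the real-sized data, a quick spot-check of fun2 against the original function on a handful of scenarios is cheap relative to the full run (a sketch; sub and chk are just illustrative names, and the subset still satisfies the identical-d precondition):
sub <- s_dt[o <= 10]
chk <- sub[, fun1(.SD, c_dt), keyby = .(o)]
all.equal(fun2(sub, c_dt), chk, check.attributes = FALSE)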

Performance considerations: get() in data.table

I've been using get() in a loop to manipulate a column j by i with reference to multiple other columns.
I wonder if there is a faster/more efficient way? Any performance considerations?
Here is a minimal example of the type of operation I have in mind:
require(data.table) # version 1.12.8
dt = data.table(v1=c(1,2,NA),v2=c(0,0,1),v3=c(0,0,0))
for (i in 1:2){
dt[ is.na(get(paste0('v',i))), (paste0('v',i)):= get(paste0('v',i+1))+2 ][]
}
The actual tables I do this with are much larger (~5 million rows, ~300 columns).
I'd highly appreciate any thoughts.
We can use set, which assigns in place:
library(data.table)
for(j in 1:2) {
i1 <- which(is.na(dt[[j]]))
set(dt, i = i1, j = j, value = dt[[j+1]][i1]+ 2)
}
dt
# v1 v2 v3
#1: 1 0 0
#2: 2 0 0
#3: 3 1 0
There is not much difference between a for loop and lapply if both use get(). For a performance improvement, it is better to use set().
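If the columns of interest are not conveniently in positions 1:n, the same set() pattern works with column names. A minimal sketch of that variant (the v1/v2/v3 names come from the example above; this is not part of the original answer):
cols <- paste0("v", 1:2)
for (j in cols) {
  nxt <- paste0("v", as.integer(sub("v", "", j)) + 1L)  # the "next" column to copy from
  i1  <- which(is.na(dt[[j]]))
  set(dt, i = i1, j = j, value = dt[[nxt]][i1] + 2)
}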
In base R, we can also do
setDF(dt)
i1 <- is.na(dt[-length(dt)])
dt[-length(dt)][i1] <- dt[-1][i1] + 2
dt
# v1 v2 v3
#1 1 0 0
#2 2 0 0
#3 3 1 0
Yes, your for loop slows you down considerably. Even a simple lapply (and there are probably more elegant ways to do this) brings significant performance gains:
library(data.table)
dt <- data.table(v1 = rnorm(100), v2 = sample(c(NA,1:5)), v3 = sample(c(NA,1:5)), v4 = sample(c(NA,1:5)))
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)
microbenchmark::microbenchmark(
for (i in 1:2){
dt[ is.na(get(paste0('v',i))), (paste0('v',i)):= get(paste0('v',i+1))+2 ]
},
for (i in 1:2){
dt2[ is.na(get(paste0('v',i))), (paste0('v',i)):= get(paste0('v',i+1))+2 ][]
},
lapply(1:2, function(i) dt3[ is.na(get(paste0('v',i))), (paste0('v',i)):= get(paste0('v',i+1))+2 ]),
for(j in 1:2) {
i1 <- which(is.na(dt4[[j]]))
set(dt4, i = i1, j = j, value = dt[[j+1]][i1]+ 2)
}
)
Unit: milliseconds
expr  min  lq  mean  median  uq  max  neval
for (i in 1:2) { dt[is.na(get(paste0("v", i))), `:=`((paste0("v", i)), get(paste0("v", i + 1)) + 2)] }  8.439924  8.651308  10.257670  8.900500  9.588571  35.259060  100
for (i in 1:2) { dt2[is.na(get(paste0("v", i))), `:=`((paste0("v", i)), get(paste0("v", i + 1)) + 2)][] }  8.902435  9.098875  10.469305  9.262659  9.729876  23.245224  100
lapply(1:2, function(i) dt3[is.na(get(paste0("v", i))), `:=`((paste0("v", i)), get(paste0("v", i + 1)) + 2)])  1.032788  1.144117  1.561741  1.224858  1.349337  9.467026  100
for (j in 1:2) { i1 <- which(is.na(dt4[[j]])); set(dt4, i = i1, j = j, value = dt[[j + 1]][i1] + 2) }  6.216452  6.392754  7.970074  6.502356  7.046646  30.857044  100
Checking results are equivalent:
identical(dt,dt2)
# [1] TRUE
identical(dt,dt3)
# [1] TRUE
identical(dt,dt4)
# [1] TRUE
There's probably a more elegant way to do that, but dividing the mean computation time by about 10 for something that only took a few seconds to program is a good return ;)

R - fast two sample t test

I would like to perform a two-sample t-test in R using separate groupings. The t-test must be "unbiased", meaning that for all transactions in the outer group (group 2 below), the t-test must be run for each inner group (group 1 below) as "inner group A" vs. "inner group not A". The for-loop code shown below is probably clearer than a verbal explanation...
My current code is below. Does anyone know a faster/better way to do this? Open to using any package, but currently using data.table.
For context, I have ~1 million rows of transaction data. Group 1 indicates a person (if there are multiple rows they have multiple transactions) and contains ~30k unique values. Group 2 indicates a zip code and contains ~500 unique values
Thanks!
library(data.table)
# fake data
grp1 <- c('A','A','A','B','B','C','C','D','D','D','D','E','E','E','F','F')
grp2 <- c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 2,2)
vals <- c(10,20,30, 40,15, 25,60, 70,100,200,300, 400,1000,2000, 3000,5000)
DT <- data.table(grp1 = grp1, grp2 = grp2, vals = vals)
# "two sample t.test" --------------------------------------------------
# non vectorized, in-place
# runtime is ~50 mins for real data
for (z in DT[, unique(grp2)]) {
  for (c in DT[grp2 == z, unique(grp1)]) {
    res = t.test(
      DT[grp2 == z & grp1 == c, vals],
      DT[grp2 == z & grp1 != c, vals],
      alternative = 'greater'
    )
    DT[grp2 == z & grp1 == c, pval := res$p.value]
    DT[grp2 == z & grp1 == c, tstat := res$statistic]
  }
}
# vectorized, creates new summarized data.table
# runtime is 1-2 mins on real data
vec <- DT[,{
grp2_vector = vals
.SD[,.(tstat = t.test(vals, setdiff(grp2_vector, vals), alternative = 'g')$statistic,
pval = t.test(vals, setdiff(grp2_vector, vals), alternative = 'g')$p.value), by=grp1]
} , by=grp2]
stats::t.test is generalized and does a number of checks. You can calculate just what you need, i.e. the t-statistic and p-value, and also make use of data.table's optimization for calculating length, mean and variance. Here is a possible approach:
#combinations of grp1 and grp2 and those not in grp1 for each grp2
comb <- unique(DT[, .(grp1, grp2)])[,
rbindlist(lapply(1:.N, function(n) .(g1=rep(grp1[n], .N-1L), notIn=grp1[-n]))),
.(g2=grp2)]
#this is optimized, switch on verbose to see the output
X <- DT[, .(nx=.N, mx=mean(vals), vx=var(vals)), .(grp1, grp2)] #, verbose=TRUE]
#calculate length, mean, var for values not in grp1
Y <- DT[comb, on=.(grp2=g2, grp1=notIn), allow.cartesian=TRUE][,
.(ny=.N, my=mean(vals), vy=var(vals)), by=.(grp1=g1, grp2=grp2)]
#calculate outputs based on stats:::t.test.default
ans <- X[Y, on=.(grp1, grp2)][, c("tstat", "pval") := {
stderrx <- sqrt(vx/nx)
stderry <- sqrt(vy/ny)
stderr <- sqrt(stderrx^2 + stderry^2)
df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
tstat <- (mx - my)/stderr
.(tstat, pt(tstat, df, lower.tail = FALSE))
}, by=1:Y[,.N]]
output:
grp1 grp2 nx mx vx ny my vy tstat pval
1: C 1 2 42.500 612.50 5 23.0000 145.0000 1.06500150 0.22800432
2: B 1 2 27.500 312.50 5 29.0000 355.0000 -0.09950372 0.53511601
3: A 1 3 20.000 100.00 4 35.0000 383.3333 -1.31982404 0.87570431
4: F 2 2 4000.000 2000000.00 7 581.4286 489747.6190 3.30491342 0.08072148
5: E 2 3 1133.333 653333.33 6 1445.0000 4323350.0000 -0.32174451 0.62141500
6: D 2 4 167.500 10891.67 5 2280.0000 3292000.0000 -2.59809850 0.97016160
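As a quick sanity check (a sketch, not part of the original answer), you can compare one row of ans against stats::t.test on the fake data; the tstat/pval pair should agree:
chk <- t.test(DT[grp2 == 1 & grp1 == "A", vals],
              DT[grp2 == 1 & grp1 != "A", vals],
              alternative = "greater")
ans[grp1 == "A" & grp2 == 1, .(tstat, pval)]
c(chk$statistic, chk$p.value)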
timing code:
library(data.table) #data.table_1.12.4
set.seed(0L)
np <- 4.2e5
nzc <- 4.2e3
DT <- data.table(grp1=rep(1:np, each=5), grp2=rep(1:nzc, each=np/nzc*5),
vals=abs(rnorm(np*5, 5000, 2000)), key=c("grp1", "grp2"))
mtd0 <- function() {
DT[, {
grp2_vector <- vals
.SD[,{
tres <- t.test(vals, setdiff(grp2_vector, vals), alternative = 'g')
.(tstat=tres$statistic, pval=tres$p.value)
}, by=grp1]
} , by=grp2]
}
mtd1 <- function() {
comb <- unique(DT[, .(grp1, grp2)])[,
rbindlist(lapply(1:.N, function(n) .(g1=rep(grp1[n], .N-1L), notIn=grp1[-n]))),
.(g2=grp2)]
X <- DT[, .(nx=.N, mx=mean(vals), vx=var(vals)), .(grp1, grp2)] #, verbose=TRUE]
Y <- DT[comb, on=.(grp2=g2, grp1=notIn), allow.cartesian=TRUE][,
.(ny=.N, my=mean(vals), vy=var(vals)), by=.(grp1=g1, grp2=grp2)]
ans <- X[Y, on=.(grp1, grp2)][, c("tstat", "pval") := {
stderrx <- sqrt(vx/nx)
stderry <- sqrt(vy/ny)
stderr <- sqrt(stderrx^2 + stderry^2)
df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 1))
tstat <- (mx - my)/stderr
.(tstat, pt(tstat, df, lower.tail = FALSE))
}, by=1:Y[,.N]]
}
microbenchmark::microbenchmark(mtd0(), mtd1(), times=1L)
timings:
Unit: seconds
expr min lq mean median uq max neval
mtd0() 65.76456 65.76456 65.76456 65.76456 65.76456 65.76456 1
mtd1() 18.29710 18.29710 18.29710 18.29710 18.29710 18.29710 1
I would suggest looking at the Rfast package. It has functions such as ttest1, ttest2 and ttests for one-sample, two-sample and many t-tests, respectively.

Strange behaviour matching strings in data.table [duplicate]

Let's say I have two columns of strings:
library(data.table)
DT <- data.table(x = c("a","aa","bb"), y = c("b","a","bbb"))
For each row, I want to know whether the string in x is present in column y. A looping approach would be:
for (i in 1:length(DT$x)){
DT$test[i] <- DT[i,grepl(x,y) + 0]
}
DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1
Is there a vectorized implementation of this? Using grep(DT$x,DT$y) only uses the first element of x.
You can simply do
DT[, test := grepl(x, y), by = x]
Grouping by x makes the grouping value available in j as a length-one vector, so grepl() gets a single pattern per group (its pattern argument is not vectorized), matched against that group's y values.
Or use mapply (Vectorize is really just a wrapper for mapply):
DT$test <- mapply(grepl, pattern=DT$x, x=DT$y)
Thank you all for your responses. I've benchmarked them all, and come up with the following:
library(data.table)
library(microbenchmark)
DT <- data.table(x = rep(c("a","aa","bb"),1000), y = rep(c("b","a","bbb"),1000))
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
microbenchmark(
DT1[, test := grepl(x, y), by = x]
,
DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2]))
,
DT3$test <- mapply(grepl, pattern=DT3$x, x=DT3$y)
,
{vgrepl <- Vectorize(grepl)
DT4[, test := as.integer(vgrepl(x, y))]}
)
Results
Unit: microseconds
expr min lq mean median uq max neval
DT1[, `:=`(test, grepl(x, y)), by = x] 758.339 908.106 982.1417 959.6115 1035.446 1883.872 100
DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2])) 16840.818 18032.683 18994.0858 18723.7410 19578.060 23730.106 100
DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y) 14339.632 15068.320 16907.0582 15460.6040 15892.040 117110.286 100
{ vgrepl <- Vectorize(grepl) DT4[, `:=`(test, as.integer(vgrepl(x, y)))] } 14282.233 15170.003 16247.6799 15544.4205 16306.560 26648.284 100
Along with being the most syntactically simple, the data.table solution is also the fastest.
You can pass the grepl function to apply to operate on each row of your data table, where the first column contains the string to search for and the second column contains the string to search in. This gives a concise row-wise solution to your problem (though apply still loops internally, as the benchmark above shows).
> DT$test <- apply(DT, 1, function(x) as.integer(grepl(x[1], x[2])))
> DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1
You can use Vectorize:
vgrepl <- Vectorize(grepl)
DT[, test := as.integer(vgrepl(x, y))]
DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1

Replace NA with last non-NA in data.table by using only data.table

I want to replace NA values with the last non-NA value in a data.table, using only data.table. I have one solution, but it's considerably slower than zoo::na.locf:
library(data.table)
library(zoo)
library(microbenchmark)
f1 <- function(x) {
x[, X := na.locf(X, na.rm = F)]
x
}
f2 <- function(x) {
cond <- !is.na(x[, X])
x[, X := .SD[, X][1L], by = cumsum(cond)]
x
}
m1 <- data.table(X = rep(c(NA,NA,1,2,NA,NA,NA,6,7,8), 100))
m2 <- data.table(X = rep(c(NA,NA,1,2,NA,NA,NA,6,7,8), 100))
microbenchmark(f1(m1), f2(m2), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# f1(m1) 2.648938 2.770792 2.959156 3.894635 6.032533 10
# f2(m2) 994.267610 1916.250440 1926.420436 1941.401077 2008.929024 10
I'd like to know why it is so slow and whether a faster solution exists.
Here's a data.table-only solution, but it's slightly slower than na.locf:
m1[, X := X[1], by = cumsum(!is.na(X))]
m1
# X
# 1: NA
# 2: NA
# 3: 1
# 4: 2
# 5: 2
# ---
# 996: 2
# 997: 2
# 998: 6
# 999: 7
#1000: 8
Speed test:
m1 <- data.table(X = rep(c(NA,NA,1,2,NA,NA,NA,6,7,8), 1e6))
f3 = function(x) x[, X := X[1], by = cumsum(!is.na(X))]
system.time(f1(copy(m1)))
# user system elapsed
# 3.84 0.58 4.62
system.time(f3(copy(m1)))
# user system elapsed
# 5.56 0.19 6.04
And here's a perverse way of making it faster, but I think one that makes it considerably less readable:
f4 = function(x) {
x[, tmp := cumsum(!is.na(X))]
setattr(x, "sorted", "tmp") # set the key without any checks
x[x[!is.na(X)], X := i.X][, tmp := NULL]
}
system.time(f4(copy(m1)))
# user system elapsed
# 3.32 0.51 4.00
As I mentioned in my comment, Rcpp is pretty fast for this. Below I compare the zoo::na.locf approach, @eddi's f3 and f4, and the Rcpp approach posted here by @RomainFrancois.
First, the benchmark results:
microbenchmark(f.zoo(m1), eddi.f3(m2), eddi.f4(m3), f.Rcpp(m4), times = 10)
## Unit: milliseconds
## expr min lq median uq max neval
## f.zoo(m1) 1297.969 1403.67418 1443.5441 1527.7644 1597.9724 10
## eddi.f3(m2) 2982.103 2998.48809 3039.6543 3068.9303 3078.3963 10
## eddi.f4(m3) 1970.650 2017.55740 2061.6599 2074.1497 2099.8892 10
## f.Rcpp(m4) 95.411 98.44505 107.6925 119.2838 171.7855 10
And the function definitions:
library(data.table)
library(zoo)
library(microbenchmark)
library(Rcpp)
m1 <- m2 <- m3 <- m4 <-
data.table(X = rep(c(NA, NA, 1, 2, NA, NA, NA, 6, 7, 8), 1e6))
f.zoo <- function(x) {
x[, X := na.locf(X, na.rm = F)]
x
}
eddi.f3 = function(x) x[, X := X[1], by = cumsum(!is.na(X))]
eddi.f4 = function(x) {
x[, tmp := cumsum(!is.na(X))]
setattr(x, "sorted", "tmp")
x[x[!is.na(X)], X := i.X][, tmp := NULL]
}
# Make the Cpp function available
cppFunction('
NumericVector naLocfCpp(NumericVector x) {
  double *p = x.begin(), *end = x.end() ;
  double v = *p ; p++ ;                                 // most recent value seen (may be NA at the start)
  while( p < end ){
    while( p < end && !NumericVector::is_na(*p) ) p++ ; // skip ahead over non-NA values
    v = *(p-1) ;                                        // value just before the NA run
    while( p < end && NumericVector::is_na(*p) ) {      // fill the run of NAs with that value
      *p = v ;
      p++ ;
    }
  }
  return x;                                             // x is modified in place
}')
f.Rcpp <- function(x) {
naLocfCpp(x$X)
x
}
And all produce identical results:
out1 <- f.zoo(m1)
out2 <- eddi.f3(m2)
out3 <- eddi.f4(m3)
out4 <- f.Rcpp(m4)
all(identical(out1, out2), identical(out1, out3), identical(out1, out4))
## TRUE
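For completeness: more recent data.table releases (1.12.4 and later, i.e. after this question was asked) ship nafill() and setnafill(), which implement LOCF in C. A minimal sketch, assuming one of those versions:
m5 <- data.table(X = rep(c(NA, NA, 1, 2, NA, NA, NA, 6, 7, 8), 1e6))
m5[, X := nafill(X, type = "locf")]        # fill NAs with the last observation carried forward
# or, updating by reference without [.data.table overhead:
# setnafill(m5, type = "locf", cols = "X")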
