I am trying to apply functions over lagged/forwarded versions of a variable. I use data.table extensively and I even have working code, but knowing the power of data.table I think there must be a simpler way to achieve the same result, possibly with better performance (I create a lot of variables inside the functions). Below is the working code for both functions (also available at https://gist.github.com/tomaskrehlik/5262087#file-gistfile1-r ).
# Lag-function lags the given variable by the date_variable
lag_variable <- function(data, variable, lags, date_variable = c("Date")) {
  if (lags == 0) {
    return(data)
  }
  if (lags > 0) {
    name <- "lag"
  } else {
    name <- "forward"
  }
  require(data.table)
  setkeyv(data, date_variable)
  if (lags > 0) {
    data[, index := seq_len(.N)]
  } else {
    data[, index := rev(seq_len(.N))]
  }
  setkeyv(data, "index")
  lags <- abs(lags)
  position <- which(names(data) == variable)
  for (j in 1:lags) {
    lagname <- paste(variable, "_", name, j, sep = "")
    lag <- paste("data[, ", lagname, ":=data[list(index-", j, "), ", variable, ", roll=TRUE][[", position, "L]]]", sep = "")
    eval(parse(text = lag))
  }
  setkeyv(data, date_variable)
  data[, index := NULL]
}
# window_func applies the function to the lagged or forwarded variables created by lag_variable
window_func <- function(data, func.name, variable, direction = "window", steps, date_variable = c("Date"), clean = TRUE) {
  require(data.table)
  require(stringr)
  transform <- match.fun(func.name)
  l <- length(names(data))
  if (direction == "forward") {
    lag_variable(data, variable, -steps, date_variable)
    cols <- which(!is.na(str_match(names(data), paste(variable, "_forward(", paste(1:steps, collapse = "|"), ")", sep = ""))[, 1]))
  } else if (direction == "backward") {
    lag_variable(data, variable, steps, date_variable)
    cols <- which(!is.na(str_match(names(data), paste(variable, "_lag(", paste(1:steps, collapse = "|"), ")", sep = ""))[, 1]))
  } else if (direction == "window") {
    lag_variable(data, variable, -steps, date_variable)
    lag_variable(data, variable, steps, date_variable)
    cols <- which(!is.na(str_match(names(data), paste(variable, "_lag(", paste(1:steps, collapse = "|"), ")", sep = ""))[, 1]))
    cols <- c(cols, which(!is.na(str_match(names(data), paste(variable, "_forward(", paste(1:steps, collapse = "|"), ")", sep = ""))[, 1])))
  } else {
    stop("The direction must be either backward, forward or window.")
  }
  data[, transf := apply(data[, cols, with = FALSE], 1, transform)]
  if (clean) {
    data[, (cols) := NULL]
  }
  return(data)
}
# Typical use:
# I have a data.table DT with variables Date (class IDate), value1, value2
# I want to get cumulative sum of next five days
# window_func(DT, "sum", "value1", direction = "forward", steps = 5)
Edit: Sample data can be created by:
a <- data.table(Date = 1:1000, value = rnorm(1000))
For each Date (integers here just for the example; the class does not matter much), I want the sum of the next ten observations. To run the code and get the output, do:
window_func(data = a, func.name = "sum", variable = "value",
direction = "forward", steps = 10, date_variable = "Date", clean = TRUE)
The function first takes the variable and creates ten shifted copies of it (using lag_variable), then applies the function row-wise across those columns and cleans up after itself. The code is bloated because I sometimes need to apply functions only to lagged observations, sometimes only to forward observations, and sometimes to both, which I call a window.
Any suggestions on how to implement this better? My code seems far too long for what it does.
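For the record, on recent data.table versions this particular task can be expressed much more directly. A minimal sketch, assuming data.table >= 1.12.0 for frollsum() (>= 1.9.6 for multi-column shift()) and using the sample data a from above; the new column names are just illustrative:
library(data.table)
setorder(a, Date)                       # make sure rows are in Date order
# sum over the current row and the next 9 (forward window of 10)
a[, winSum10 := frollsum(value, n = 10, align = "left")]
# sum over the next 10 rows, excluding the current one (matches window_func's "forward")
a[, nextSum10 := shift(frollsum(value, n = 10, align = "left"), type = "lead")]
# for arbitrary functions, the lead/lag columns themselves can be built with shift()
a[, paste0("value_forward", 1:3) := shift(value, 1:3, type = "lead")]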
I'm not sure about the rest of your function, but you can get your lagged sum rather efficiently as follows, with lag set to the window length (10 in your example):
a[ , lagSum :=
       a[, list(sum=sum(value)), by=list(grp=(Date+lag-i) %/% lag)] [grp!=0, sum]
   , by=list(i=Date %% lag)]
For example:
set.seed(1)
a <- data.table(Date = 1:1000, value = rnorm(1000))
lag <- 10   # window length
a[ , lagSum :=
       a[, list(sum=sum(value)), by=list(grp=(Date+lag-i) %/% lag)] [grp!=0, sum]
   , by=list(i=Date %% lag)]
> a
Date value lagSum
1: 1 -0.6264538 1.32202781
2: 2 0.1836433 3.46026279
3: 3 -0.8356286 3.66646270
4: 4 1.5952808 3.88085074
5: 5 0.3295078 0.07087005
---
996: 996 -0.3132929 -3.79332038
997: 997 -0.8806707 -3.48002750
998: 998 -0.4192869 -2.59935677
999: 999 -1.4827517 -2.18006988
1000: 1000 -0.6973182 -1.88854602
Confirming correct values:
# first n values
n <- 5
for (i in seq(n))
  a[seq(i, length.out=10), print(sum(value))]
# [1] 1.322028
# [1] 3.460263
# [1] 3.666463
# [1] 3.880851
# [1] 0.07087005
BENCHMARKS (against a for loop, so not quite fair)
set.seed(1)
a <- data.table(Date = 1:1000, value = rnorm(1000))
system.time({ a[ , lagSum :=
a[, list(sum=sum(value)), by=list(grp=(Date+lag-i) %/% lag)] [grp!=0, sum]
, by=list(i=Date %% lag)]
})
# user system elapsed
# 0.049 0.001 0.056
set.seed(1)
a <- data.table(Date = 1:1000, value = rnorm(1000))
system.time({ for (i in seq(nrow(a)-lag+1))
a[seq(i, length.out=10), lagSum := sum(value)]})
# user system elapsed
# 1.526 0.019 2.220
Related
I have a large dataset in R (say >40,000 rows and >20 categorical columns) that I repeatedly subset, so I would like to speed this up as much as possible. It needs to be a general function (each categorical column has a discrete number of possible values, say in string format).
Each time I subset, I need to identify the subset of rows that satisfy multiple logical set-membership conditions (e.g. >10 conditions). That is, I need to check several columns and test whether the values in each of them belong to a given set (hence the use of %in%).
# simple dataset example
library(dplyr)
library(lazyeval)   # for interp() used below
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
  dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, replace = TRUE))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(15))
I've looked around the internet and SO a lot, but haven't found the discussion of performance I'm looking for. I mostly code using dplyr, so apologies if there's a clear performance improvement available in data.table; I've tried some simple benchmarks between the two (without using any data.table indexing, etc.) and it's not obvious whether one is faster.
Example options I've considered (since I'm not great at data.table, I've excluded data.table options from here):
base_filter <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- dat[dat[[col_name]] %in% sample(letters[1:10], size = 4), ]
  }
  dat
}
dplyr_filter1 <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- filter_(dat,
                   .dots = interp(~ colname %in% vals,
                                  colname = as.name(col_name),
                                  vals = sample(letters[1:10], size = 4)))
  }
  dat
}
dplyr_filter2 <- function(dat) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = sample(letters[1:10], size = 4))
  }
  filter_(dat, .dots = dots_filter)
}
Note: In practice, on my real datasets, dplyr_filter2 actually works fastest. I've also tried dtplyr and converting my data to a data.table, but this seems slower than staying with plain data frames.
Note: On the other hand, in practice, the base R function outperforms the dplyr examples when the data has fewer rows and fewer columns (perhaps due to copying speed?).
Thus, I'd like to ask SO what the most efficient general way(s) to subset a categorical data frame under multiple set-membership conditions are, and, if possible, to explain the mechanics of why. Does the answer differ for smaller datasets? Does it depend on copying time or search time?
Useful related links
fast lookup for one key
using hash tables in R for key-value pairs
I understand that you prefer not to use data.table; I'm just providing some timings below for reference. With indexing, subsetting can be performed much faster, and an inner join of the two tables can also be done easily in data.table.
# simple dataset example
library(dplyr)
library(lazyeval)
set.seed(0L)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
  dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, replace = TRUE))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(15))
selection <- lapply(1:7, function(n) sample(letters[1:10], size = 4))
base_filter <- function(df) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    df <- df[df[[col_name]] %in% selection[[i]], ]
  }
  df
}
dplyr_filter1 <- function(df) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    df <- filter_(df,
                  .dots = interp(~ colname %in% vals,
                                 colname = as.name(col_name),
                                 vals = selection[[i]]))
  }
  df
}
dplyr_filter2 <- function(df) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = selection[[i]])
  }
  filter_(df, .dots = dots_filter)
}
library(data.table)
#convert data.frame into data.table
dt <- data.table(dat, key=names(dat)[1:7])
#create the sets of selection
dtSelection <- data.table(expand.grid(selection, stringsAsFactors=FALSE))
library(microbenchmark)
microbenchmark(
base_filter(dat),
dplyr_filter1(dat),
dplyr_filter2(dat),
dt[dtSelection, nomatch=0], #perform inner join between dataset and selection
times=5L)
#Unit: milliseconds
# expr min lq mean median uq max neval
# base_filter(dat) 27.084801 27.870702 35.849261 32.045900 32.872601 59.372301 5
# dplyr_filter1(dat) 23.130100 24.114301 26.922081 24.860701 29.804301 32.701002 5
# dplyr_filter2(dat) 29.641101 30.686002 32.363681 31.103000 31.884701 38.503601 5
# dt[dtSelection, nomatch = 0] 3.626001 3.646201 3.829341 3.686601 3.687001 4.500901 5
In addition to chinsoon12's alternatives, one thing to consider is to avoid subsetting the data.frame in each iteration. So, instead of
f0 = function(x, cond)
{
  for(j in seq_along(x)) x = x[x[[j]] %in% cond[[j]], ]
  return(x)
}
one alternative is to accumulate a logical vector of whether to include each row in the final subset (a one-line Reduce/Map version of this is sketched after f2 below):
f1 = function(x, cond)
{
  i = rep_len(TRUE, nrow(x))
  for(j in seq_along(x)) i = i & (x[[j]] %in% cond[[j]])
  return(x[i, ])
}
or, another alternative is to iteratively reduce the number of comparisons by shrinking a vector of row indices instead of the data.frame itself:
f2 = function(x, cond)
{
  i = 1:nrow(x)
  for(j in seq_along(x)) i = i[x[[j]][i] %in% cond[[j]]]
  return(x[i, ])
}
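As an aside, the accumulation in f1 can also be written in one line with Reduce and Map; a sketch (f1b is just an illustrative name and is not included in the benchmark below):
f1b = function(x, cond) x[Reduce(`&`, Map(`%in%`, x, cond)), ]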
And a comparison with data:
set.seed(1821)
dat = as.data.frame(replicate(30, sample(c(letters, LETTERS), 5e5, TRUE), FALSE),
                    stringsAsFactors = FALSE)
conds = replicate(ncol(dat), sample(c(letters, LETTERS), 48), FALSE)
system.time({ ans0 = f0(dat, conds) })
# user system elapsed
# 3.44 0.28 3.86
system.time({ ans1 = f1(dat, conds) })
# user system elapsed
# 0.66 0.01 0.68
system.time({ ans2 = f2(dat, conds) })
# user system elapsed
# 0.34 0.01 0.39
identical(ans0, ans1)
#[1] TRUE
identical(ans1, ans2)
#[1] TRUE
I asked a question before and received a good answer, but I need to apply it to a more specific problem. The data.table needs to be divided into 16 sectors based on the X and Y values, which are the coordinates I loop through to split it. I have successfully divided this data table into 16 'sectors', and I need to apply the sCalc function to each sector and output a number. I'm looking for a faster way to do this.
Refer to this link for clarification if needed: Faster way to subset data table instead of a for loop R.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))
sCalc <- function(DT) {
  setkey(DT, Norm)
  cells <- DT[1:(nrow(DT)*0.02)]
  nCells <- nrow(DT)
  sumCell <- sum(cells[, Norm/sqrt(Unif)])
  return(sumCell/nCells)
}
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
    dup <- anyDuplicated(c(start, stop))
    stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}
x <- startstop(2000)
y <- startstop(1600, y = TRUE)
sectorLoop <- sectorCalc(x,y,DT)
sectorLoop returns:
-4.729271 -4.769156 -4.974996 -4.931120 -4.777013 -4.644919 -4.958968 -4.663221
-4.771545 -4.909868 -4.821098 -4.795526 -4.846709 -4.931514 -4.875148 -4.847105
One solution was using the cut function.
DT[, x.sect := cut(DT[, X], seq(0, 2000, by = 500), dig.lab=10)]
DT[, y.sect := cut(DT[, Y], seq(0, 1600, by = 400), dig.lab=10)]
sectorRef <- DT[order(Norm), .(sCalc = sum(Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/(0.02*.N)), by = .(x.sect, y.sect)]
sectorRef <- sectorRef[[3]]
After extracting the third column, the above solution returns the values:
-4.919447 -4.778576 -4.757455 -4.779086 -4.739814 -4.836497 -4.776635 -4.656748
-4.939441 -4.707901 -4.751791 -4.864481 -4.839134 -4.973294 -4.663360 -5.055344
cor(sectorRef, sectorLoop)
The above returns: 0.0726904
As far as I can understand the question, the first thing I would point out is that you can use .N to get the number of rows in each by = .(...) group; that is analogous to your nCells.
And where your cells takes the top 2% of rows in each group, this can be done at the vector level by indexing [1:(0.02*.N)]. Assuming you want the top 2% in order of increasing Norm (which is the order you would get from setkey(DT, Norm), although setting a key does more than just sorting), you could call setkey(DT, Norm) before the calculation, as in the example, or, to make the intent clearer, use order(Norm) inside the calculation.
The sum() part doesn't change, so the equivalent third line is:
DT[order(Norm),
.(sCalc = sum( Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/.N),
by = .(x.sect, y.sect)]
Which returns the result for the 16 groups:
x.sect y.sect sCalc
1: (1500,2000] (800,1200] -0.09380209
2: (499,1000] (399,800] -0.09833151
3: (499,1000] (1200,1600] -0.09606350
4: (0,499] (399,800] -0.09623751
5: (0,499] (800,1200] -0.09598717
6: (1500,2000] (0,399] -0.09306580
7: (1000,1500] (399,800] -0.09669593
8: (1500,2000] (399,800] -0.09606388
9: (1500,2000] (1200,1600] -0.09368166
10: (499,1000] (0,399] -0.09611643
11: (1000,1500] (0,399] -0.09404482
12: (0,499] (1200,1600] -0.09387951
13: (1000,1500] (1200,1600] -0.10069461
14: (1000,1500] (800,1200] -0.09825285
15: (0,499] (0,399] -0.09890184
16: (499,1000] (800,1200] -0.09756506
I am trying to calculate a measure of association between all variables in a data.table. (This is not a stats question, but as an aside: the variables are all factors, and the measure is Cramér's V.)
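(For reference: for an r x c contingency table built from n complete observations with chi-square statistic chi^2, Cramér's V is sqrt(chi^2 / (n * min(r - 1, c - 1))), which is what the chisq.test-based code below computes.)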
Example dataset:
p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer
set.seed(1234)
obs <- as.data.table(
data.frame(
cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
nrow=n, ncol=p/2),
matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
nrow=n, ncol=p/2) ),
stringsAsFactors=TRUE ) )
I am currently using the split-apply-combine approach, which involves looping (via plyr::adply) through all pairs of indices and returning one row for each pair. (I attempted to parallelize adply but failed.)
# Calculate Cramér's V between all variables -- my kludgey approach
pairs <- t( combn(ncol(obs), 2) ) # nx2 matrix contains indices of upper triangle of df
# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix]
chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix],
formula = paste( names(obs)[ix], collapse='~'),
value.var = names(obs)[ix][1], # arbitrary
fun.aggregate=length)[,-1, with=FALSE] )
return(data.table(index_1 = ix[1],
var_1 = names(obs)[ix][1],
index_2 = ix[2],
var_2 = names(obs)[ix][2],
cramers_v = sqrt(chsq$statistic /
(sum(chsq$observed) *
(pmin(nrow(chsq$observed),
ncol(chsq$observed) ) -1 ) )
) )
)
})[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) :
# task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table
# stopCluster(cl)
What are my options for speeding up this calculation? My challenge is in passing the row-wise operation on pairs into the column-wise calculations in obs. I am wondering if it is possible to generate the column pairs directly into J, but the Force is just not strong enough with this data.table padawan.
First, I would go with the 'long' data format, as follows:
obs[, id := 1:n]
mobs <- melt(obs, id.vars = 'id')
Next, set the key on the data.table: setkeyv(mobs, 'id').
Finally, iterate through variables and do calculations on pairs:
out <- list()
for (i in 1:p) {
  vari <- paste0('X', i)
  tmp <- mobs[mobs[variable == vari]]
  nn <- tmp[!(is.na(value) | is.na(i.value)), list(i.variable = i.variable[1], nij = length(id)), keyby = list(variable, value, i.value)]
  cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable]
  setkeyv(cj, c('variable', 'value', 'i.value'))
  nn <- nn[cj]
  nn[is.na(nij), nij := 0]
  nn[, ni := sum(nij), by = list(variable, i.value)]
  nn[, nj := sum(nij), by = list(variable, value)]
  nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable]
  # Cramér's V: sqrt(chi-square / (n * min(k - 1, r - 1)))
  out[[i]] <- nn[, list(i.variable = vari, cramers_v = sqrt(sum((nij - ni * nj / n) ^ 2 / (ni * nj / n)) / (n[1] * min(k[1] - 1, r[1] - 1)))), by = variable]
}
out <- rbindlist(out)
So you need to iterate through the variables only once. As you can see, I also wouldn't use chisq.test; I would write the computations myself.
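For completeness, the per-pair computation can also be wrapped in a standalone helper. A minimal hand-rolled sketch (the name cramers_v and the droplevels guard are my additions, and unlike chisq.test it applies no continuity correction):
cramers_v <- function(a, b) {
  ok    <- !(is.na(a) | is.na(b))                       # complete cases only
  tab   <- table(droplevels(factor(a[ok])), droplevels(factor(b[ok])))
  n     <- sum(tab)
  expct <- outer(rowSums(tab), colSums(tab)) / n        # expected counts
  chi2  <- sum((tab - expct)^2 / expct)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}
# e.g. cramers_v(obs$X1, obs$X2)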
Using a database with a numeric range defined by two columns start and end, I am trying to look up the factor, code, associated with a numeric value in a separate vector identityCodes.
database <- data.frame(start = seq(1, 150000000, 1000),
                       end = seq(1000, 150000000, 1000),
                       code = paste0(sample(LETTERS, 15000, replace = TRUE),
                                     sample(LETTERS, 15000, replace = TRUE)))
identityCodes <- sample(1:15000000, 1000)
I've come up with a method for finding the corresponding codes using a for loop and subsetting:
fun <- function(x, y) {
  z <- rep(NA, length(x))
  for (i in 1:length(x)) {
    z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
  }
  return(z)
}
a <- fun(identityCodes, database)
But the method is slow, especially if I am to scale it:
system.time(fun(identityCodes, database))
user system elapsed
15.36 0.00 15.50
How can I identify the factors associated with each identityCodes faster? Is there a better way to go about this than using a for loop and subsetting?
Here's my attempt using data.table. Very fast - even though I am sure I am not leveraging it efficiently.
Given function:
# method 1
system.time(result1 <- fun(identityCodes, database))
user system elapsed
8.99 0.00 8.98
Using data.table
# method 2
require(data.table)
# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
  # convert x to a data.table and set 'start' and 'end' as keys
  DT <- setDT(x)
  setkey(DT, start, end)
  # create a lookup data.table where start and end are the identityCodes
  DT2 <- data.table(start = y, end = y)
  # overlap join where DT2 start & end are within DT start and end
  res <- foverlaps(DT2, DT[, .(start, end)], type = "within")
  # store i as row number and key (for sorting later)
  res[, i := seq_len(nrow(res))]
  setkey(res, i)
  # merge the joined table to the original to get codes
  final <- merge(res, DT, by = c("start", "end"))[order(i), .(code)]
  # export as character the codes
  as.character(final[[1]])
}
system.time(result2 <- dt_comb(x=database, y=identityCodes))
user system elapsed
0.08 0.00 0.08
identical(result1, result2)
[1] TRUE
Edit: trimmed a couple of lines from the function.
This is about 45% faster on my machine:
result = lapply(identityCodes, function(x) {
  data.frame(identityCode = x,
             code = database[database$start <= x & database$end >= x, "code"])
})
result = do.call(rbind, result)
Here's a sample of the output:
identityCode code
1 6836845 OK
2 14100352 RB
3 2313115 NK
4 8440671 XN
5 11349271 TI
6 14467193 VL
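Since the start/end ranges in this particular example are contiguous, non-overlapping, 1000-wide bins sorted by start, a plain findInterval lookup is another option. A sketch (result3 is just an illustrative name; this relies on the bins being contiguous and would not handle gaps or overlapping ranges the way foverlaps does):
# row i covers [start[i], start[i] + 999], so the matching row is just findInterval on start
result3 <- as.character(database$code[findInterval(identityCodes, database$start)])
identical(result1, result3)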
I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide the code to simulate large data, since that's the point of this question. My actual problem size is a list length of 1 million entries.
Say, I've two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:
Each element of the list x always sums up to 100.
Each element of y will always be sorted and will be always between 1 and 100.
The problem:
Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly for x[[2]] and y[[2]]: c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
Let me know if this is still unclear.
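To make the expected output concrete, here is one compact base R way to compute it on the small example above (just a sketch of the definition, not a speed claim; findInterval's left.open argument requires R >= 3.3.0):
# bin each y[[i]] by the cumulative breaks of x[[i]], using (lower, upper] intervals
Map(function(xx, yy) tabulate(findInterval(yy, cumsum(xx), left.open = TRUE) + 1L,
                              nbins = length(xx)), x, y)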
Simulated data with 1 million entries
set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
  probs <- sample(100:1000, len[ix])
  probs <- probs/sum(probs)
  oo <- round(N * probs)
  if (sum(oo) != 100) {
    oo[1] <- oo[1] + (100 - sum(oo))
  }
  oo
})
require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)),by=grp]$V1
What I've done so far:
Using mapply (slow):
I thought of using rank with ties.method="first" and mapply (the obvious choice with two lists), and tried this first:
tt1 <- mapply(y, x, FUN = function(a, b) {
  tt <- rank(c(a, cumsum(b)), ties = "first")[-(1:length(a))]
  c(tt[1] - 1, diff(tt) - 1)
})
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
Therefore, I decided to try and overcome the usage of rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution shown below:
Using data.table (faster):
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <- rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1
This takes 26 seconds, so it's about 9 times faster. I'm wondering if it's possible to get much more speedup, as I'll have to compute this repeatedly on 5-10 such 1-million-element lists. Thank you.
Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.
# compile the data.table's, set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))
# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval')) # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table in that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))
# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
This is about 25% faster but outputs a matrix rather than a list. You may be able to use apply/sapply to make it work with a list (saving as a list was slowing it down).
c = matrix(0, length(x), 100)
for (j in 1:length(x)) {
  a = -1
  b = 0
  for (i in 1:length(x[[j]])) {
    a = b
    b = b + x[[j]][i]
    # count the y values in the half-open interval (a, b]
    c[j, i] = sum((a < y[[j]]) * (y[[j]] <= b))
  }
}
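If the list form is needed, each row of the matrix can be trimmed back to the length of the corresponding x element, e.g. (a sketch; tt3 is just an illustrative name):
tt3 <- lapply(seq_along(x), function(j) c[j, seq_along(x[[j]])])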