by-group calculation, limited to first N rows of each group - r

I asked a question before and received a good answer, but I need to apply it to a more specific problem. The data.table needs to be divided into 16 sectors based on X and Y values; the X and Y variables are the coordinates used to loop through and divide the table. I have successfully divided this data table into 16 different 'sectors' and I need to apply the sCalc function on each sector and output a number. I'm looking for a faster way to do this.
Refer to this link for clarification if needed: Faster way to subset data table instead of a for loop R.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))

sCalc <- function(DT) {
  setkey(DT, Norm)                 # sort the sector by Norm (ascending)
  cells <- DT[1:(nrow(DT)*0.02)]   # keep the first 2% of rows
  nCells <- nrow(cells)
  sumCell <- sum(cells[, Norm/sqrt(Unif)])
  return(sumCell/nCells)
}
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
    dup <- anyDuplicated(c(start, stop))
    stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}
x <- startstop(2000)
y <- startstop(1600, y = TRUE)
sectorLoop <- sectorCalc(x,y,DT)
sectorLoop returns:
-4.729271 -4.769156 -4.974996 -4.931120 -4.777013 -4.644919 -4.958968 -4.663221
-4.771545 -4.909868 -4.821098 -4.795526 -4.846709 -4.931514 -4.875148 -4.847105
One solution was using the cut function.
DT[, x.sect := cut(DT[, X], seq(0, 2000, by = 500), dig.lab=10)]
DT[, y.sect := cut(DT[, Y], seq(0, 1600, by = 400), dig.lab=10)]
sectorRef <- DT[order(Norm),
                .(sCalc = sum(Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)])) / (0.02*.N)),
                by = .(x.sect, y.sect)]
sectorRef <- sectorRef[[3]]
The above solution returns a data table with the values:
-4.919447 -4.778576 -4.757455 -4.779086 -4.739814 -4.836497 -4.776635 -4.656748
-4.939441 -4.707901 -4.751791 -4.864481 -4.839134 -4.973294 -4.663360 -5.055344
cor(sectorRef, sectorLoop)
The above returns: 0.0726904

As far as I can understand the question, the first thing I would explain is that you can use .N to tell you how many rows there are in each by = .(...) group. I think that is analogous to your nCells.
And where your cells takes the top 2% of rows in each group, this can be accomplished at the vector level by indexing [1:(0.02*.N)]. Assuming you want the top 2% in order of increasing Norm (which is the order you would get from setkey(DT, Norm), although setting a key does more than just sorting), you could call setkey(DT, Norm) before the calculation, as in the example, or, to make it clearer what you are doing, you could use order(Norm) inside your calculation.
The sum() part doesn't change, so the equivalent third line is:
DT[order(Norm),
   .(sCalc = sum(Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)])) / .N),
   by = .(x.sect, y.sect)]
Which returns the operation for the 16 groups:
x.sect y.sect sCalc
1: (1500,2000] (800,1200] -0.09380209
2: (499,1000] (399,800] -0.09833151
3: (499,1000] (1200,1600] -0.09606350
4: (0,499] (399,800] -0.09623751
5: (0,499] (800,1200] -0.09598717
6: (1500,2000] (0,399] -0.09306580
7: (1000,1500] (399,800] -0.09669593
8: (1500,2000] (399,800] -0.09606388
9: (1500,2000] (1200,1600] -0.09368166
10: (499,1000] (0,399] -0.09611643
11: (1000,1500] (0,399] -0.09404482
12: (0,499] (1200,1600] -0.09387951
13: (1000,1500] (1200,1600] -0.10069461
14: (1000,1500] (800,1200] -0.09825285
15: (0,499] (0,399] -0.09890184
16: (499,1000] (800,1200] -0.09756506
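One thing to watch: the by-group result comes back in group-appearance order, not in your loop's sector order, which is why cor(sectorRef, sectorLoop) is near zero. A minimal sketch of an apples-to-apples comparison (assuming the loop advances x fastest and y slowest, as startstop's rep(times =)/rep(each =) pattern suggests, and keeping the question's /(0.02*.N) divisor):
sectorRef2 <- DT[order(Norm),
                 .(sCalc = sum(Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)])) / (0.02*.N)),
                 keyby = .(y.sect, x.sect)]$sCalc   # keyby sorts: y outer, x inner
cor(sectorRef2, sectorLoop)  # should now be close to 1, up to the boundary differences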

Related

How to create a summation function with data frame in R?

Just for fun, I am trying to create a basic savings calculator. My current code is:
value <- function(years, apr, initial, investment) {
  df <- as.data.frame(matrix(nrow = years, ncol = 2))
  colnames(df) <- c("year", "value")
  df$year <- c(1:years)
  for (i in 1:years) {
    current_value <- (last_value + investment) * apr
  }
  # repeating calculation for the data frame
  print(df)
}
What I am trying to do is have the calculator create a table that displays the value each year. I've adapted my code from an old homework assignment, so I am not concerned with how to make the data frame. However, I do not know how to make the formula for the summation.
I am trying to model
Current Value = (Cumulative Value + Investment)*(Annual Percentage Rate)
As an example, let's say initial value is 10, investment is 10, and the APR is 1.05
(10+10)*(1.05)=21
(21+10)*(1.05)=32.55
(32.55+10)*(1.05)=44.68
and so on.
Year is there to number the rows accordingly.
We can use Reduce with accumulate = TRUE
calc_fun <- function(years, apr, initial, investment) {
  value <- Reduce(function(x, y) (x + investment) * y, rep(apr, years), initial,
                  accumulate = TRUE)
  data.frame(year = 0:years, value)
}
calc_fun(3, 1.05, 10, 10)
# year value
#1 0 10.0000
#2 1 21.0000
#3 2 32.5500
#4 3 44.6775
Using a for loop we can do:
calc_fun1 <- function(years, apr, initial, investment) {
  value <- numeric(years + 1)
  value[1] <- initial
  for (i in 1:years) value[i + 1] <- (value[i] + investment) * apr
  data.frame(year = 0:years, value)
}

how can I group a large number of elements quickly

I've created some 'data' here as an example. It contains 100 elements, each starting with a letter followed by 3 random digits.
I want to know the best/quickest way to convert them into groups, like I've started to do in the for loop.
Assume I need to create 50 groups, and that instead of 100 elements in 'data' there are a million.
The groupings themselves will be fairly random. I've used A000-A599 and A600-A999 as the first 2 groupings in my example, but the groupings are not neatly spaced out, e.g. B000-B599 and B600-B999 are not necessarily the next groupings. The next groupings could, for example, be B000-C299, C300-C799, C800-D499, etc. I would need to enter these groupings manually.
I'm guessing a for loop isn't the best way to do this, as it would take a long time to complete the loop.
library(stringr)
library(magicfor)

data <- paste(sample(LETTERS, 100, replace = T),
              sample(str_pad(000:999, width = 3, side = "left", pad = "0"), 100, replace = T),
              sep = "")

magic_for()
for (x in seq_along(data)) {
  if (grepl("A[0-5]", data[x])) {
    range <- "A000-A599"
  } else if (grepl("A[6-9]", data[x])) {
    range <- "A600-A999"
  }
  put(range)
}
You can try something like this:
# Round-up function
roundUp <- function(x, to = 10) {
  to * (x %/% to + as.logical(x %% to))
}
# Create a data frame for easy storage
# (dplyr is needed for the pipeline below)
library(dplyr)
df <- data.frame(data = data, stringsAsFactors = F)
df %>%
  mutate(C = substr(data, 1, 1),
         N = as.integer(substr(data, 2, 4))) %>%
  mutate(N = roundUp(N, to = 500)) %>%
  mutate(data2 = paste0(C, N)) %>%
  select(data, data2)
data data2
1 U493 U500
2 A429 A500
3 N564 N1000
4 W656 W1000
5 J978 J1000
6 B232 B500
7 D240 D500
8 I796 I1000
9 E831 E1000
...(truncated)
The data2 field contains the new groups
Using the tidyverse packages, I would convert your data vector into a data.frame (or tibble) format.
library(tidyverse)
df <- tibble(my_variable = data) %>%
  mutate(
    first_char = substr(my_variable, 1, 1),
    random_numbers = substr(my_variable, 2, 4)
  )
Once you've gotten that far, it's easy to group the data however you want. Your loop could be achieved like so:
df %>%
  mutate(
    group = ifelse(as.numeric(random_numbers) < 600, "000-599", "600-999"),
    desired_result = paste0(first_char, group)
  )
I recommend reading the following (free) ebook cover to cover; it will equip you with a bunch of useful tools for everyday R tasks like the one you've outlined:
https://r4ds.had.co.nz/index.html
The groupings themselves will be fairly random. I've used A000-A599 and A600-A999 as the first 2 groupings in my example but the groupings are not neatly spaced out, e.g. B000-B599 and B600-B999 are not necessarily the next groupings. The next groupings could for example be B000-C299, C300-C799, C800-D499 etc.
Since your groups are lexicographic intervals, you could use rolling joins. In this case you only need to specify the lower bound for each group:
library(data.table)
# define decrement function
dec = function(x) {
  ltr = substr(x, 1, 1)
  num = as.integer(substr(x, 2, 4))
  w0 = num == 0L
  ltr = replace(ltr, w0, LETTERS[match(ltr[w0], LETTERS) - 1L])
  num = replace(num - 1L, w0, 999L)
  sprintf("%s%03d", ltr, num)
}
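# e.g. dec("A600") == "A599"; dec("B000") == "A999" (borrows from the previous letter)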
# enumerate lower bounds and derive ranges
rangeDT = data.table(lb = c("A000", "A600", "B000", "C300", "C800"))
rangeDT[, ub := dec(shift(lb, type="lead", fill="Z999"))]
rangeDT[, range := sprintf("%s-%s", lb, ub)]
# lb ub range
# 1: A000 A599 A000-A599
# 2: A600 A999 A600-A999
# 3: B000 C299 B000-C299
# 4: C300 C799 C300-C799
# 5: C800 Z998 C800-Z998
Then the rolling update join is...
DT = data.table(x = data)
DT[, range := rangeDT[.SD, on=.(lb = x), roll=TRUE, x.range]]
The result looks like
> head(DT)
x range
1: C965 C800-Z998
2: Q973 C800-Z998
3: V916 C800-Z998
4: C701 C300-C799
5: A363 A000-A599
6: F144 C800-Z998
If your data were numeric, cut or findInterval from base R would work, but for whatever reason those do not support strings.
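To illustrate that: a sketch of the numeric route, encoding each code as letter-index * 1000 + digits and then using findInterval against the encoded lower bounds (the key helper and the encoding are just illustrative choices, not part of the answer above):
key <- function(s)
  (match(substr(s, 1, 1), LETTERS) - 1L) * 1000L + as.integer(substr(s, 2, 4))
grp <- findInterval(key(data), key(rangeDT$lb))
head(data.frame(x = data, range = rangeDT$range[grp]))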
How about this?
library(data.table)
# (str_sub below comes from stringr, loaded in the question)
ranges <- c(paste0(LETTERS, "[0-5]"), paste0(LETTERS, "[6-9]"))
final <- lapply(ranges, function(y) {
  matches <- grepl(y, data)
  if (sum(matches) > 0) {
    data.table(element = data[matches],
               range = paste0(str_sub(y, 1, 1), str_sub(y, 3, 3), 0, 0, "-",
                              str_sub(y, 1, 1), str_sub(y, 5, 5), 9, 9))
  } else {
    return(NULL)
  }
})
final_2 <- rbindlist(final)
# element range
# A374 A000-A599
# B498 B000-B599
# B064 B000-B599
# C131 C000-C599
# C460 C000-C599
# C099 C000-C599
structure(list(element = c("A374", "B498", "B064", "C131", "C460", "C099", "C193", "E428", "E108", "E527", "E138", "E375", "E312", "F046", "F417", "F094", "G142", "G461", "G068", "H372", "H523", "H027", "H506", "I470", "I169", "I050", "I495", "I405", "J298", "K165", "K169", "K131", "L510", "L210", "L277", "N257", "N554", "N452", "N484", "N247", "N373", "N492", "O347", "O221", "O176", "P578", "P477", "Q062", "Q257", "Q083", "R306", "S415", "S154", "S226", "S400", "T132", "T181", "T321", "V109", "V118", "V267", "W381", "W047", "X317", "X192", "Y390", "Y132", "Y327", "Y141", "Y353", "Z429", "C981", "D813", "F934", "G910", "G673", "G664", "I754", "I624", "L603", "N991", "N996", "O689", "O932", "P854", "P689", "P761", "P681", "Q631", "S620", "T923", "T841", "U787", "U929", "W942", "W702", "X770", "X880", "Y719", "Y969"), range = c("A000-A599", "B000-B599", "B000-B599", "C000-C599", "C000-C599", "C000-C599", "C000-C599", "E000-E599", "E000-E599", "E000-E599", "E000-E599", "E000-E599", "E000-E599", "F000-F599", "F000-F599", "F000-F599", "G000-G599", "G000-G599", "G000-G599", "H000-H599", "H000-H599", "H000-H599", "H000-H599", "I000-I599", "I000-I599", "I000-I599", "I000-I599", "I000-I599", "J000-J599", "K000-K599", "K000-K599", "K000-K599", "L000-L599", "L000-L599", "L000-L599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "O000-O599", "O000-O599", "O000-O599", "P000-P599", "P000-P599", "Q000-Q599", "Q000-Q599", "Q000-Q599", "R000-R599", "S000-S599", "S000-S599", "S000-S599", "S000-S599", "T000-T599", "T000-T599", "T000-T599", "V000-V599", "V000-V599", "V000-V599", "W000-W599", "W000-W599", "X000-X599", "X000-X599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Z000-Z599", "C600-C999", "D600-D999", "F600-F999", "G600-G999", "G600-G999", "G600-G999", "I600-I999", "I600-I999", "L600-L999", "N600-N999", "N600-N999", "O600-O999", "O600-O999", "P600-P999", "P600-P999", "P600-P999", "P600-P999", "Q600-Q999", "S600-S999", "T600-T999", "T600-T999", "U600-U999", "U600-U999", "W600-W999", "W600-W999", "X600-X999", "X600-X999", "Y600-Y999", "Y600-Y999")), row.names = c(NA,
-100L), class = c("data.table", "data.frame"))

Faster way to subset data table instead of a for loop R

I have a data table in R (you'll need the data.table package installed) generated with X and Y coordinates and random data values from both normal and uniform distributions. The coordinates represent points on a 2000x1600 array and have to be divided into 16 smaller "sectors", each 500x400. For each sector I need the mean of the normal-distribution values divided by the square of the minimum of the uniform-distribution values. I also created two variables x and y, using a provided function startstop, that hold the coordinates for the 16 sectors, and a function that calculates the numbers for each sector.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))

sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
    dup <- anyDuplicated(c(start, stop))
    stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
x <- startstop(2000)
y <- startstop(1600, T)
sectorNos <- sectorCalc(x,y,DT)
The startstop function isn't really an issue, but I need a faster way to subset the data table, so some modifications have to be made to the sectorCalc function. The for loop was the best way I could think of, but I don't have much experience with data.tables. Any ideas on a faster method of breaking up the data table?
A solution using not only the package data.table but also the cut function to build the interval "groups":
# Create your test data
library(data.table)
set.seed(123)  # make random numbers reproducible to allow comparison of different answers
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))

# Calculate the sector by cutting the x and y values into groups defined by the interval breaks.
# Intervals should be seq(0, 2000, by = 500); the first lower bound is one less
# since it is not included in the interval (see the help for the cut function).
DT[, x.sect := cut(DT[, X], c(0, 499, 1000, 1500, 2000), dig.lab = 10)]
# Intervals should be seq(0, 1600, by = 400)
DT[, y.sect := cut(DT[, Y], c(0, 399, 800, 1200, 1600), dig.lab = 10)]

# Now calculate per group (calculation logic "stolen" from the working answer of user "Symbolix")
DT[, .(sect = mean(Norm)/min(Unif)^2), by = .(x.sect, y.sect)]
Please note: I think the size of the first interval is wrong in the original solution (499 instead of 500 for x, and 399 instead of 400 for y), so I could not use the seq function to reproduce your desired intervals but had to enumerate the interval breaks manually.
Edit 1: I have replaced the original code that adds the x.sect and y.sect columns by an improved solution that adds columns by reference (:=).
Edit 2: If you want to order the result you have (at least) two options:
# "Chaining" (output is input of next)
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(x.sect, y.sect),]
# Or: Use the "keyby" param instead of "by"
DT[, .(sect = mean(Norm)/min(Unif)^2), keyby=.(x.sect, y.sect)]
Edit 3: Added dig.lab=10 param to cut function in code above to avoid scientific notation of the interval breaks.
To replace your sectorCalc function, I think we can make use of data.table joins. As you are looping over each row of sector, you just have to create a data.table of your sector data to join onto, specify a column to join on (here I'm using key_col), and specify a 'group' variable for each row, so we can do the calculation at the end:
x <- startstop(2000)
y <- startstop(1600, T)

## copy the original DT
dt <- copy(DT)

dt_xy <- data.table(x_1 = x[[1]],
                    x_2 = x[[2]],
                    y_1 = y[[1]],
                    y_2 = y[[2]])

dt[, key_col := 1]
dt_xy[, `:=`(key_col = 1, xy_grp = seq(1, .N))]

## Use a data.table join, allowing cartesian, then filter out results.
dt_res <- dt[dt_xy, on = "key_col", allow.cartesian = T][x_1 <= X & X <= x_2 & y_1 <= Y & Y <= y_2]

## calculate 'sect' as required.
dt_sect <- dt_res[, .(sect = mean(Norm)/min(Unif)^2), by = .(xy_grp)]
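As an aside, on data.table 1.9.8+ the same thing can be written as a non-equi join, avoiding the cartesian product plus filter (a sketch using the same dt_xy, not part of the original answer):
dt_sect2 <- DT[dt_xy,
               on = .(X >= x_1, X <= x_2, Y >= y_1, Y <= y_2),
               .(sect = mean(Norm)/min(Unif)^2),
               by = .EACHI]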

How to calculate Mode (Statistics) for a set of every 10 numbers in a large data set

For example, if I have 1223455567 1777666666, I want the output to be 5 and 6.
How can I do this in the R language?
I know how to find the mean for every 10 data points, but what I want is the mode.
Here is what I tried for the mean:
mean10 <- aggregate(level, list(rep(1:(nrow(level) %/% n+1),each = n, len = nrow(level))), mean)[-1];
And there is a function for the mode as follows:
MODE <- function(dataframe) {
  DF <- as.data.frame(dataframe)
  MODE2 <- function(x) {
    if (is.numeric(x) == FALSE) {
      df <- as.data.frame(table(x))
      df <- df[order(df$Freq), ]
      m <- max(df$Freq)
      MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1]))
      if (sum(df$Freq)/length(df$Freq) == 1) {
        warning("No Mode: Frequency of all values is 1", call. = FALSE)
      } else {
        return(MODE1)
      }
    } else {
      df <- as.data.frame(table(x))
      df <- df[order(df$Freq), ]
      m <- max(df$Freq)
      MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))
      if (sum(df$Freq)/length(df$Freq) == 1) {
        warning("No Mode: Frequency of all values is 1", call. = FALSE)
      } else {
        return(MODE1)
      }
    }
  }
  return(as.vector(lapply(DF, MODE2)))
}
This should work
Mode <- function(x) {
  y <- unique(x)
  y[which.max(tabulate(match(x, y)))]
}

library(zoo)
x <- c(1,2,2,3,4,5,5,5,6,7,1,7,7,7,6,6,6,6,6,6)
rollapply(data = x, width = 10, FUN = Mode, by = 10)
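For the sample x above, the first block of ten (1,2,2,3,4,5,5,5,6,7) has mode 5 and the second (1,7,7,7,6,6,6,6,6,6) has mode 6, so this call should return:
# [1] 5 6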
Given you're not after a rolling mode but really a group mode, none of the other answers are accurate. It's actually much easier to do this in the case you have in mind; I'll use data.table.
#fixed cost: set-up of 'data.table'
library(data.table)
setDT(DF)
Now solving:
# this works on a single column;
# the rep(...) bit is about creating the
# sequence (1, ..., 1, 2, ..., 2, ...)
# of integers, each repeated 10 times.
# Here, .N will give the frequency -- i.e.,
# this first step is basically running 'table' for every 10 rows
DF[ , .N, by = .(col1, grp = rep(1:(.N %/% 10 + 1), each = 10, length.out = .N))
#by going in descending order on frequency, we can simply
# extract the first element of each 'grp' to get the mode.
# (this glosses over the issue of ties, but you haven't given
# any guidance to that end)
][order(-N), .SD[1L], by = grp]
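A quick check on the two blocks of digits from the question (DF here is a hypothetical single-column table built from them; output shown as a comment):
DF <- data.frame(col1 = c(1,2,2,3,4,5,5,5,6,7, 1,7,7,7,6,6,6,6,6,6))
setDT(DF)
DF[ , .N, by = .(col1, grp = rep(1:(.N %/% 10 + 1), each = 10, length.out = .N))
  ][order(-N), .SD[1L], by = grp]
#    grp col1 N
# 1:   2    6 6
# 2:   1    5 3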
You can use the zoo package to calculate a moving mode:
library(zoo)
# sample data
d <- data.frame(x = sample(1:3, 100, T))
# mode function (handles ties by choosing one; note which.max returns the
# position in the table, which coincides with the value here since x is in 1:3)
my_mode <- function(x) as.numeric(which.max(table(x)))
# add moving mode as new variable
transform(d, moving_mode = rollapply(x, 10, FUN = my_mode, fill = NA))
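Note that without a by argument rollapply slides one step at a time, giving a true moving mode; for disjoint blocks of ten, pass by = 10 as in the previous answer.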
You can always convert to character and see which char is max in a table. E.g.
> which.max(table(strsplit(as.character(1777666666), "")[[1]]))
6
2

Efficiently counting numbers falling within each range of numbers

I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide code to simulate large data, as that's the point of this question. My actual problem size is a list of length 1 million.
Say, I've two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:
Each element of the list x always sums up to 100.
Each element of y will always be sorted and will be always between 1 and 100.
The problem:
Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly for x[[2]] and y[[2]]: c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
Let me know if this is still unclear.
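For the toy example, a quick sanity check of the expected counts with base R (far too slow for the real data, just to pin down the semantics):
lapply(seq_along(x), function(i) as.vector(table(cut(y[[i]], c(0, cumsum(x[[i]]))))))
# [[1]]
# [1] 4 1
#
# [[2]]
# [1] 0 2 1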
Simulated data with 1 million entries
set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
  probs <- sample(100:1000, len[ix])
  probs <- probs/sum(probs)
  oo <- round(N * probs)
  if (sum(oo) != 100) {
    oo[1] <- oo[1] + (100 - sum(oo))
  }
  oo
})

require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val = sample(1:N, sum(ss), TRUE), grp = rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)), by = grp]$V1
What I've done so far:
Using mapply (slow):
My first thought was to use rank with ties.method = "first" and mapply (the obvious choice with 2 lists), and I tried this:
tt1 <- mapply(y, x, FUN = function(a, b) {
  tt <- rank(c(a, cumsum(b)), ties = "first")[-(1:length(a))]
  c(tt[1] - 1, diff(tt) - 1)
})
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
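To see why the rank trick works, take the first toy pair: rank(c(1,2,3,55,90, 82,100), ties = "first") assigns ranks 5 and 7 to the breakpoints 82 and 100, so c(5 - 1, 7 - 5 - 1) = c(4, 1) -- each breakpoint's rank minus the breakpoints and y values that precede it.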
Therefore, I decided to try and overcome the usage of rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution shown below:
Using data.table (faster):
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <-rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1
This takes 26 seconds. So it's about 9 times faster. I'm wondering if it's possible to get much more speedup, as I'll have to recursively compute this on 5-10 such lists of 1 million elements. Thank you.
Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.
# compile the data.table's, set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))
# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval')) # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table in that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))
# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
This is about 25% faster but outputs a matrix rather than a list. You may be able to use apply/sapply to make it work with a list (saving as a list was slowing it down).
c <- matrix(0, length(x), 100)
for (j in 1:length(x)) {
  b <- 0
  for (i in 1:length(x[[j]])) {
    a <- b
    b <- b + x[[j]][i]
    # count y values in (a, b]; the strict lower bound avoids
    # double-counting a value that falls exactly on a break
    c[j, i] <- sum((a < y[[j]]) * (y[[j]] <= b))
  }
}
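To get the list shape of the expected output back from the matrix, something like this should do it (a sketch; c is the count matrix filled above):
res <- lapply(seq_along(x), function(j) c[j, seq_along(x[[j]])])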
