Expanding window (cumulative calculation) in data.table: how to improve performance

I have grouped data collected at different time steps. Within each time step, there are several registrations of values. Each value may occur one or more times within and among time steps.
Some toy data:
df <- data.frame(grp  = rep(1:2, each = 8),
                 time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
                 val  = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
df
# grp time val
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 4 1 2 2
# 5 1 2 3
# 6 1 3 2
# 7 1 3 3
# 8 1 3 4
# 9 2 1 1
# 10 2 1 2
# 11 2 1 3
# 12 2 2 1
# 13 2 2 1
# 14 2 3 1
# 15 2 3 2
# 16 2 3 3
Objectives
I wish to do some calculations within an expanding time window, i.e. within time step 1, within time 1 and 2 together, within 1, 2, and 3 together, and so on. Within each window, I wish to calculate the number of unique values, the number of values which have occurred more than once, and the proportion of values which have occurred more than once.
For example, in my toy data, in group (grp) 1, in the second time window (time = 1 & 2 together) three unique values (val 1, 2, 3) have been registered (n_val = 3). Two of them (1, 2) occur more than once (n_re = 2), resulting in a "re_rate" of 0.67 (see below).
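A quick hand check of that window with base table() (the grp 1 values registered at time 1 and 2 are 1, 2, 1, 2, 3):
tab <- table(c(1, 2, 1, 2, 3))
length(tab)                           # n_val = 3
sum(tab > 1)                          # n_re  = 2
round(sum(tab > 1) / length(tab), 2)  # re_rate = 0.67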
My data.table code produces the desired result. On the small data set it is slower than my base attempt, which I believe is fair enough, given some possible overhead in the data.table code. With a larger data set, the data.table code catches up, but is still slower. I expected (hoped) that the benefits would show up earlier.
What made me post this question is that I believe the relative performance of my code is a strong indicator of me abusing data.table (I am sure the reason is not data.table performance itself). Thus, the main objective of my question is to get some advice on how to code this in a more data.table-esque way. For example, is it possible to avoid the loop over time windows altogether by vectorizing the calculations, as shown e.g. in the nice answer by @Khashaa here? If not, are there ways to make the loop and assignment more efficient?
My data.table code:
library(data.table)
f_dt <- function(df){
  setDT(df, key = c("grp", "time", "val"))[ , {
    # key or not only affects speed marginally
    # unique time steps
    times <- .SD[ , unique(time)]
    # index vector to loop over
    idx <- seq_along(times)
    # pre-allocate data table
    d2 <- data.table(time = times,
                     n_val = integer(1),
                     n_re = integer(1),
                     re_rate = numeric(1))
    # loop to generate expanding window
    for(i in idx){
      # number of registrations per val
      n <- .SD[time %in% times[seq_len(i)], .(n = .N), by = val][ , n]
      # number of unique val
      set(x = d2, i = i, j = 2L, length(n))
      # number of val registered more than once
      set(x = d2, i = i, j = 3L, sum(n > 1))
    }
    # proportion values registered more than once
    d2[ , re_rate := round(n_re / n_val, 2)]
    d2
  }, by = grp]
}
...which gives the desired result:
f_dt(df)
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.50
# 2: 1 2 3 2 0.67
# 3: 1 3 4 3 0.75
# 4: 2 1 3 0 0.00
# 5: 2 2 3 1 0.33
# 6: 2 3 3 3 1.00
Corresponding base code:
f_by <- function(df){
  do.call(rbind,
          by(data = df, df$grp, function(d){
            times <- unique(d$time)
            idx <- seq_along(times)
            d2 <- data.frame(grp = d$grp[1],
                             time = times,
                             n_val = integer(1),
                             n_re = integer(1),
                             re_rate = numeric(1))
            for(i in idx){
              dat <- d[d$time %in% times[seq_len(i)], ]
              tt <- table(dat$val)
              n_re <- sum(tt > 1)
              n_val <- length(tt)
              re_rate <- round(n_re / n_val, 2)
              d2[i, ] <- data.frame(d2$grp[1], time = times[i], n_val, n_re, re_rate)
            }
            d2
          }))
}
Timings:
Tiny toy data from above:
library(microbenchmark)
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.481724 1.450203 1.474037 1.452887 1.521378 1.502686 10
Some larger data:
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.094424 1.099642 1.107821 1.096997 1.097693 1.194983 10
No, the data is still not large, but I would expect data.table to catch up by now, if my code were written properly. I believe this suggests that there is large potential for improving my code. Any advice is highly appreciated.

f <- function(df){
  setDT(df)[, n_val := cumsum(!duplicated(val)), grp
            ][, occ := 1:.N, .(grp, val)
            ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
            ][, n_re := n_val - occ1,
            ][, re_rate := round(n_re/n_val, 2),
            ][, .(n_val = n_val[.N], n_re = n_re[.N], re_rate = re_rate[.N]), .(grp, time)]
}
where
cumsum(!duplicated(val)) counts the cumulative number of distinct values seen so far, n_val,
occ counts the cumulative occurrences of each value (note that it is grouped by val),
occ1 then counts the number of values in val that have occurred only once so far.
The count of values that have occurred only once increases by 1 when occ == 1 and decreases by 1 when occ == 2; hence cumsum(occ == 1) - cumsum(occ == 2).
The number of values which have occurred more than once is then n_val - occ1. These helper columns are illustrated in the sketch below.
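To make the helper columns concrete, here is a small sketch restricted to grp == 1 of the toy df, building them step by step before the final collapse to the last row of each (grp, time); d1 is just an illustration object:
library(data.table)
d1 <- as.data.table(df)[grp == 1]
d1[, n_val := cumsum(!duplicated(val))]            # running count of distinct val
d1[, occ := seq_len(.N), by = val]                 # running occurrence number of each val
d1[, occ1 := cumsum(occ == 1) - cumsum(occ == 2)]  # values seen exactly once so far
d1[, n_re := n_val - occ1]                         # values seen more than once so far
d1[]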
Speed Comparison
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
system.time(f(df))
# user system elapsed
# 0.038 0.000 0.038
system.time(f_dt(df))
# user system elapsed
# 16.617 0.013 16.727
system.time(f_by(df))
# user system elapsed
# 16.077 0.040 16.122
Hope this helps.

I was looking for a better way to code an expanding window of non-duplicated groups and came across this question.
This question seems to be more about an expanding window where the group (i.e. time in the question) is duplicated. Below is a solution making use of between.
#expanding group by where groups are duplicated
library(data.table)
setDT(df)
df[ , {
  # get list of unique time groups to be used in the expanding group
  uniqt <- unique(time)
  c(list(time = uniqt),  # output time as well
    # expanding window of each unique time group
    do.call(rbind, lapply(uniqt, function(n) {
      # tabulate the occurrences
      x <- table(val[between(time, uniqt[1L], n)])
      # calculate desired values
      n_val <- length(x)
      n_re <- sum(x > 1)
      data.frame(n_val = n_val, n_re = n_re, re_rate = n_re/n_val)
    })))
}, by = grp]
result:
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.5000000
# 2: 1 2 3 2 0.6666667
# 3: 1 3 4 3 0.7500000
# 4: 2 1 3 0 0.0000000
# 5: 2 2 3 1 0.3333333
# 6: 2 3 3 3 1.0000000
I was unable to find out in which version of data.table between was first released, so between may have been introduced after this question was posted.

Related

Find distribution of consecutive zeros

I have a vector, say x which contains only the integer numbers 0,1 and 2. For example;
x <- c(0,1,0,2,0,0,1,0,0,1,0,0,0,1,0)
From this I would like to extract how many times zero occurs in each "pattern". In this simple example it occurs three times on its own, twice as 00 and exactly once as 000, so I would like to output something like:
0 3
00 2
000 1
My actual dataset is quite large (1000-2000 elements in the vector) and, at least in theory, the maximum number of consecutive zeros is length(x).
1) rle Use rle and table like this. No packages are needed.
tab <- with(rle(x), table(lengths[values == 0]))
giving:
> tab
1 2 3
3 2 1
or
> as.data.frame(tab)
Var1 Freq
1 1 3
2 2 2
3 3 1
That is, there are 3 runs of one zero, 2 runs of two zeros and 1 run of three zeros.
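The intermediate step can be inspected directly; these are the zero-run lengths that table() then counts (r is just an illustration object):
r <- rle(x)
r$lengths[r$values == 0]
# [1] 1 1 2 2 3 1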
The output format in the question is not really feasible if there are very long runs but just for fun here it is:
data.frame(Sequence = strrep(0, names(tab)), Freq = as.numeric(tab))
giving:
Sequence Freq
1 0 3
2 00 2
3 000 1
2) gregexpr Another possibility is to use a regular expression:
tab2 <- table(attr(gregexpr("0+", paste(x, collapse = ""))[[1]], "match.length"))
giving:
> tab2
1 2 3
3 2 1
Other output formats could be derived as in (1).
Note
I checked the speed with a length(x) of 2000 and (1) took about 1.6 ms on my laptop and (2) took about 9 ms.
1) We can use rleid from data.table
data.table(x)[, strrep(0, sum(x == 0)), rleid(x == 0)][V1 != "", .N, V1]
# V1 N
#1: 0 3
#2: 00 2
#3: 000 1
2) or we can use tidyverse
library(tidyverse)
tibble(x) %>%
  group_by(grp = cumsum(x != 0)) %>%
  filter(x == 0) %>%
  count(grp) %>%
  ungroup %>%
  count(n)
# A tibble: 3 x 2
# n nn
# <int> <int>
#1 1 3
#2 2 2
#3 3 1
3) Or we can use tabulate with rleid
tabulate(tabulate(rleid(x)[x==0]))
#[1] 3 2 1
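Broken into steps (a sketch on the same x as above; r and inner are just illustration objects):
library(data.table)
r <- rleid(x)                 # run id of every element of x
inner <- tabulate(r[x == 0])  # inner[i] = number of zeros in run i (0 for non-zero runs)
tabulate(inner)               # how many zero runs have length 1, 2, 3, ... (zero entries are ignored)
# [1] 3 2 1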
Benchmarks
By checking with system.time on @SymbolixAU's dataset
system.time({
tabulate(tabulate(rleid(x2)[x2==0]))
})
# user system elapsed
# 0.03 0.00 0.03
Compared with the Rcpp function, the above is not that bad:
system.time({
m <- zeroPattern(x2)
m[m[,2] > 0, ]
})
# user system elapsed
# 0.01 0.01 0.03
With microbenchmark, I removed the methods that take the most time (based on @SymbolixAU's comparisons) and ran a new comparison. Note that it is still not exactly apples to apples, but it is much closer than the previous comparison, in which there was the overhead of data.table along with some formatting to replicate the OP's expected output.
microbenchmark(
akrun = {
tabulate(tabulate(rleid(x2)[x2==0]))
},
G = {
with(rle(x2), table(lengths[values == 0]))
},
sym = {
m <- zeroPattern(x2)
m[m[,2] > 0, ]
},
times = 5, unit = "relative"
)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5 a
# G 6.049181 8.272782 5.353175 8.106543 7.527412 2.905924 5 b
# sym 1.385976 1.338845 1.661294 1.399635 3.845435 1.211131 5 a
You mention a 'quite large' data set, so you can make use of C++ through Rcpp to speed this up (however, the benchmarking shows the base rle solution is fairly quick anyway).
A function could be
library(Rcpp)
cppFunction('Rcpp::NumericMatrix zeroPattern(Rcpp::NumericVector x) {
  int consecutive_counter = 0;
  Rcpp::IntegerVector iv = seq(1, x.length());
  Rcpp::NumericMatrix m(x.length(), 2);
  m(_, 0) = iv;
  for (int i = 0; i < x.length(); i++) {
    if (x[i] == 0) {
      consecutive_counter++;
    } else if (consecutive_counter > 0) {
      m(consecutive_counter-1, 1)++;
      consecutive_counter = 0;
    }
  }
  if (consecutive_counter > 0) {
    m(consecutive_counter-1, 1)++;
  }
  return m;
}')
This gives you a matrix of the counts of consecutive zeros:
x <- c(0,1,0,2,0,0,1,0,0,1,0,0,0,1,0)
zeroPattern(x)
m <- zeroPattern(x)
m[m[,2] > 0, ]
# [,1] [,2]
# [1,] 1 3
# [2,] 2 2
# [3,] 3 1
On a larger data set we notice the speed improvements
set.seed(20180411)
x2 <- sample(x, 1e6, replace = T)
m <- zeroPattern(x2)
m[m[,2] > 0, ]
library(microbenchmark)
library(data.table)
microbenchmark(
akrun = {
data.table(x2)[, strrep(0, sum(x2==0)) ,rleid(x2 == 0)][V1 != "",.N , V1]
},
G = {
with(rle(x2), table(lengths[values == 0]))
},
sym = {
m <- zeroPattern(x2)
m[m[,2] > 0, ]
},
times = 5
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# akrun 3727.66899 3782.19933 3920.9151 3887.6663 4048.2275 4158.8132 5
# G 236.69043 237.32251 258.4320 246.1470 252.1043 319.8956 5
# sym 97.54988 98.76986 190.3309 225.2611 237.5781 292.4955 5
Note:
My function and G's return a 'table'-style answer. Akrun has formatted his to include the padded-zero strings, which incurs a slight cost.
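If the padded-zero labels are wanted on top of the tabulate() output, they can be added afterwards, e.g. (counts is just an illustration object):
counts <- tabulate(tabulate(rleid(x2)[x2 == 0]))
data.frame(Sequence = strrep("0", seq_along(counts)), Freq = counts)[counts > 0, ]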

How to store replicate runs in a dataframe

I have a function (call it random_func) that generates random numbers according to some rules using parameters. I'm trying to repeatedly call that function and store the results in a dataframe.
df <- lapply(c(1,2,3,4,5), FUN = function(x) replicate(100, expr = random_func(n=10, param=x)))
Right now, the output is a list of 5 vectors each with 100 elements. What R voodoo do I need to do in order to get it to look something like:
param, result
1, 5
1, 6
1, 8
...
5, 10
set.seed(42)
do.call(rbind,  # rbind results for different x together
        lapply(c(1, 2), FUN = function(x)
          data.frame(param = x,  # will be recycled
                     result = do.call(what = c,  # concatenate results of replicate
                                      replicate(n = 2,
                                                expr = rnorm(n = 3, mean = x),  # replace with random_func
                                                simplify = FALSE)))))  # when FALSE, replicate returns list
# param result
# 1 1 2.3709584
# 2 1 0.4353018
# 3 1 1.3631284
# 4 1 1.6328626
# 5 1 1.4042683
# 6 1 0.8938755
# 7 2 3.5115220
# 8 2 1.9053410
# 9 2 4.0184237
# 10 2 1.9372859
# 11 2 3.3048697
# 12 2 4.2866454
A rerun and map_df solution from purrr:
library(dplyr)
library(purrr)
Random function
random_func <- function(n, param) {
rnorm(n)+(param*10)
}
solution
myfun <- function() {
  df <- 100 %>%
    rerun(x = 10, y = 1:5) %>%
    map_df(~ data.frame(param = .x$y, result = random_func(n = .x$x, param = .x$y)))
}
Output
df <- myfun()
head(df)
param result
1 1 10.15325
2 2 19.52867
3 3 30.08218
4 4 40.06418
5 5 48.39804
6 1 11.00435
Additional validation
df %>%
group_by(param) %>%
summarise(mean = mean(result))
param mean
1 1 10.00634
2 2 20.03874
3 3 30.11093
4 4 40.06166
5 5 50.02632
Performance
library(microbenchmark)
microbenchmark(myfun())
expr min lq mean median uq max neval
myfun() 65.93166 66.80521 69.42349 68.5152 69.57185 90.77295 100

Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one   = c(1, 1, NA, 13),
                 two   = c(2, NA, 10, 14),
                 three = c(NA, NA, 11, NA),
                 four  = c(4, 9, 12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Each row contains measurements for weeks 1, 2, 3 and 4, respectively. Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value over weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by evenly spreading out each measurement over its own week and any immediately preceding weeks in which no measurement took place. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (weeks "three" and "four"), and 4/2 is 2. Similarly, the 9 in row 2 covers weeks "two", "three" and "four", so that row becomes 1 3 3 3.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how to best approach this. One candidate solution would be to get the indices of all missing values, then count the lengths of the runs (NAs occurring multiple times), and use that to fill up the values somehow. However, my real data is large, and I think such a strategy might be time-consuming. Is there an easier and more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, finally assigning grouped values with the ave function:
clean <- function(x) {
  # indices to replace: every NA plus the first non-NA that follows a run of NAs
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  # group each run of NAs together with the measurement that closes it
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  # within each group, spread the closing measurement evenly over all its cells
  x[to.rep] <- ave(x[to.rep], groups, FUN=function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
  "NumericVector cleanRcpp(NumericVector x) {
    const int n = x.size();
    NumericVector y(x);
    int consecNA = 0;
    for (int i = 0; i < n; ++i) {
      if (R_IsNA(x[i])) {
        ++consecNA;
      } else if (consecNA > 0) {
        const double replacement = x[i] / (consecNA + 1);
        for (int j = i - consecNA; j <= i; ++j) {
          y[j] = replacement;
        }
        consecNA = 0;
      } else {
        consecNA = 0;
      }
    }
    return y;
  }")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
  nc <- ncol(df)
  # flatten row-wise, appending a -Inf sentinel to each row, then reverse;
  # the sentinel keeps runs from crossing row boundaries
  x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
  # running count of non-NA values: each measurement and the NAs it must cover share one index
  ii <- cumsum(!is.na(x))
  # length of each run, i.e. how many cells each measurement is spread over
  f <- tabulate(ii)
  # the measurement value heading each run
  v <- x[!duplicated(ii)]
  # spread each measurement evenly over its run
  xx <- v[ii]/f[ii]
  # runs headed by the sentinel (trailing NAs in a row) stay NA
  xx[xx == -Inf] <- NA
  # un-reverse, rebuild the matrix and drop the sentinel column
  m <- matrix(rev(xx), ncol=nc+1, byrow=TRUE)[,seq_len(nc)]
  as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
Here are benchmarks for all currently posted solutions:
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR = t(apply(mat, 1, clean)),
josilberRcpp = t(apply(mat, 1, cleanRcpp)),
Josh = spread_left(df),
Henrik = t(apply(df, 1, fn)),
replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base R possibility. I first create a grouping variable (grp), over which the 'spread' is then done with ave.
fn <- function(x){
  grp <- rev(cumsum(!is.na(rev(x))))
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  res[grp == 0] <- NA
  res
}
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
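To illustrate the grouping variable for the first row of df, c(1, 2, NA, 4) (row1 is just an illustration object):
row1 <- c(1, 2, NA, 4)
rev(cumsum(!is.na(rev(row1))))
# [1] 3 2 1 1   # the NA in week "three" falls into the same group as the 4 in week "four"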
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df){
  # switch to long format
  DT  <- data.table(id = (1:nrow(df))*ncol(df), df)
  mDT <- setkey(melt(DT, id.vars = "id"), id)
  mDT[, value := as.numeric(value)]
  mDT[, badv := is.na(value)]
  mDT[
    # subset to rows that need modification
    badv | shift(badv),
    # apply @Henrik's function, more or less
    value := {
      g = ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
      ave(value, g, FUN = function(x){ n = length(x); x[n]/n })
    }]
  # revert to wide format
  (setDF(dcast(mDT, id ~ variable)[, id := NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(
c(NA,1:3),
nr*nc,
replace=TRUE,
prob=c(nafreq,rep((1-nafreq)/3,3))
),nrow=nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
fill_naseq_long <- function(mDT){
mDT[,badv := is.na(value)]
mDT[badv|shift(badv),value:={
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
}]
mDT
}
benchmark(
F2=fill_naseq_long(mDT),F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

Groupby bins and aggregate in R

I have data like (a,b,c)
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where the range of 'a' is divided into n (say 3) equal parts, an aggregate function (say max) is applied to the b values, and the result is also grouped by 'c'.
So the output looks like
a_bin b_m(c=1) b_m(c=2)
1-3 3 6
4-6 NaN NaN
7-9 NaN 2
The result is an M x N table, where M is the number of 'a' bins and N is the number of unique 'c' values (or the full range of 'c').
How do I approach this? Can any R package help me with this?
A combination of aggregate, cut and reshape seems to work
df <- data.frame(a = c(1, 2, 9, 1),
                 b = c(2, 3, 2, 6),
                 c = c(1, 1, 2, 2))
breaks <- c(0, 3, 6, 9)
# Aggregate data
ag <- aggregate(df$b, FUN = max,
                by = list(a = cut(df$a, breaks, include.lowest = T), c = df$c))
# Reshape data
res <- reshape(ag, idvar = "a", timevar = "c", direction = "wide")
There would be easier ways.
If your dataset is dat
res <- sapply(split(dat[, -3], dat$c), function(x) {
  a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = T,
                       labels = c("1-3", "4-6", "7-9")))
  c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
                 c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
a_bin b_m(c=1) b_m(c=2)
1 1-3 3 6
2 4-6 NA NA
3 7-9 NA 2
I would use a combination of data.table and reshape2, which are both optimized for speed (avoiding loops from the apply family).
Note that the output won't include the unused bins.
v <- c(1, 4, 7, 10) # creating bins
temp$int <- findInterval(temp$a, v)  # 'temp' is the input data.frame with columns a, b, c
library(data.table)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
library(reshape2)
temp <- dcast.data.table(temp, int ~ c, value.var = "b_m")
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # Optional for prettier table
## temp$a_bin<- c("1-3", "7-9") # Optional for prettier table
## a_bin b_m(c=1) b_m(c=2)
## 1 1-3 3 6
## 2 7-9 NA 2

Operate in defined number of rows of a data.table

I am working with a data table that has groups of data and for each a position (from -1000 to +1000) and a count for each position. A small example looks this this:
dt.ex <- data.table(newID = rep(c("A","B"), each = 6),
                    pos   = rep(c(-2:3), 2),
                    count = sample(c(1:100), 12))
newID pos count
1: A -2 29
2: A -1 32
3: A 0 33
4: A 1 45
5: A 2 51
6: A 3 26
7: B -2 22
8: B -1 79
9: B 0 2
10: B 1 48
11: B 2 87
12: B 3 38
What I want to do is to calculate the mean (or sum) over every n rows within each group of newID. That is, split each group into chunks of n rows and aggregate within each chunk. This would be the output assuming n=3 and summing:
newID pos count
A -2 94
A 1 122
B -2 103
B 1 173
And I honestly have no idea how to start without resorting to some kind of looping, which is not advisable for a 67094000 x 3 table. If I wanted to calculate per newID only, something like this would do the trick, but I have yet to see a solution that comes close to answering my question. plyr solutions are also welcome, although I feel they might be too slow for this.
An alternate way (without using .SD) would be:
dt.ex[, seq := (seq_len(.N)-1) %/% 3, by=newID][,
list(pos = mean(pos), count=sum(count)), list(newID, seq)]
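To see how the bucket index behaves within a single 6-row newID group with n = 3:
(seq_len(6) - 1) %/% 3
# [1] 0 0 0 1 1 1   # rows 1-3 form bucket 0, rows 4-6 form bucket 1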
Benchmarking on (relatively) bigger data:
set.seed(45)
get_grps <- function() paste(sample(letters, 5, TRUE), collapse="")
grps <- unique(replicate(1e4, get_grps()))
dt.in <- data.table(newID = sample(grps, 6e6, TRUE),
pos = sample(-1000:1000, 6e6, TRUE),
count = runif(6e6))
setkey(dt.in, newID)
require(microbenchmark)
eddi <- function(dt) {
dt[, .SD[, list(pos = mean(pos), count = sum(count)),
by = seq(0, .N-1) %/% 3], by = newID]
}
arun <- function(dt) {
dt[, seq := (seq_len(.N)-1) %/% 3, by=newID][,
list(pos = mean(pos), count=sum(count)), list(newID, seq)]
}
microbenchmark(o1 <- eddi(copy(dt.in)), o2 <- arun(copy(dt.in)), times=2)
Unit: seconds
expr min lq median uq max neval
o1 <- eddi(copy(dt.in)) 25.23282 25.23282 26.16009 27.08736 27.08736 2
o2 <- arun(copy(dt.in)) 13.59597 13.59597 14.41190 15.22783 15.22783 2
Try this:
dt.ex[, .SD[, list(pos = mean(pos), count = sum(count)),
by = seq(0, .N-1) %/% 3],
by = newID]
Note that the parent data.table's .N is used in the nested by, because .N only exists in the j-expression.
