I am using the following code to create two columns in a data.table and update them with some numbers:
T <- data.table(Init_1 = rep(0, 100), Init_2 = rep(0, 100))
for (i in 1:100){
  T[, Init_1 := i]
  T[, Init_2 := 2*i]
}
I expected this code to fill the two columns of the data.table T (Init_1 and Init_2) with the numbers 1:100 and (2, 4, ..., 200), respectively.
However, the code returns constant values:
> T
Init_1 Init_2
1: 100 200
2: 100 200
3: 100 200
4: 100 200
5: 100 200
6: 100 200
7: 100 200
8: 100 200
.................
Could you explain why my code is not working as expected and how it could be fixed?
Your advice will be appreciated.
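For the record, the root cause is that T[, Init_1 := i] assigns the scalar i to every row of the column, so each pass through the loop overwrites the whole column and only the last value (i = 100) survives. A minimal sketch of two possible fixes:
library(data.table)
T <- data.table(Init_1 = rep(0, 100), Init_2 = rep(0, 100))
# Fix 1: supply the row number in i, so each iteration updates a single row
for (i in 1:100) {
  T[i, `:=`(Init_1 = i, Init_2 = 2*i)]
}
# Fix 2: drop the loop entirely; := accepts whole vectors
T[, `:=`(Init_1 = 1:100, Init_2 = 2*(1:100))]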
Edit:
In relation to answer 2, eventually I want to use a function inside the for loop. More specifically:
# A FUNCTION THAT RETURNS THE TRANSITION PROBABILITIES AFTER N STEPS IN A MARKOV CHAIN
#-------------------------------------------------------------------------------------
R <- function(P, n){
  if (n == 1) return(P)
  R(P, n-1) %*% P
}
# A ONE-STEP PROBABILITY MATRIX
#---------------------------------------------------------------------------------------
P = matrix(c(0.6, 0.1, 0.3, 0.2, 0.7, 0.1, 0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)
# EXAMINING THE CONVERGENCE PROCESS OF THE PROBABILITIES OVER TIME
#########################################################################
T <- data.table(Init_1 = rep(0, 100), Init_2 = rep(0, 100))
for (i in 1:100){
  T[, Init_1 := R(P, i)[1,1]]
  T[, Init_2 := R(P, i)[2,1]]
}
or
for (i in 1:100){
  T[, ':=' (Init_1 = R(P, i)[1,1],
            Init_2 = R(P, i)[2,1]) ]
}
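Note that both loop variants overwrite the whole column on every iteration, just as before, and R(P, i) is recomputed from scratch twice per pass. One possible alternative (a sketch, not the only way): compute all the matrix powers once with Reduce(..., accumulate = TRUE) and fill both columns in a single vectorised step:
library(data.table)
P <- matrix(c(0.6, 0.1, 0.3, 0.2, 0.7, 0.1, 0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)
# powers[[n]] is the n-step transition matrix P %*% ... %*% P (n factors)
powers <- Reduce(`%*%`, replicate(100, P, simplify = FALSE), accumulate = TRUE)
T <- data.table(Init_1 = vapply(powers, function(M) M[1, 1], numeric(1)),
                Init_2 = vapply(powers, function(M) M[2, 1], numeric(1)))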
I'm no data.table expert. But I know it throws
helpful error messages. If you e.g. create an empty data.table and
try to use := to add columns, it says
T <- data.table()
T[,a:=1]
# Error in `[.data.table`(T, , `:=`(a, 1)) :
# Cannot use := to add columns to a null data.table (no columns), currently.
# You can use := to add (empty) columns to a 0-row data.table (1 or more empty columns),
# though.
Your problem might be related, because data.table(numeric()) (or rather T <- data.table(numeric(length = 0))) creates a 0-row data.table. The empty column gets named V1 by default. Here you could use := to add empty columns. However, that's not what you want.
Instead you could do
T <- data.table(numeric(0))
for (i in 1:5){
  T <- T[, .(
    Init_1 = if (exists("Init_1")) c(Init_1, i) else i,
    Init_2 = if (exists("Init_2")) c(Init_2, 2*i) else 2*i )]
}
T
# Init_1 Init_2
# 1: 1 2
# 2: 2 4
# 3: 3 6
# 4: 4 8
# 5: 5 10
Although that's pretty ugly and probably super inefficient.
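For this toy example, at least, the loop can be dropped entirely, since both columns are plain functions of the row index:
T <- data.table(Init_1 = 1:5, Init_2 = 2*(1:5))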
First, you should not use T as an object name: T is the built-in shorthand for TRUE in R. Also, some recommend against using i as a loop index because the suffix i denotes imaginary literals, for example
> (2i)^2
[1] -4+0i
Third, explicit iteration is slow in R; avoid loops whenever a vectorised alternative exists.
Here is some simple code to generate such a matrix. Hope this helps.
T.data <- matrix(NA, nrow=100, ncol=2)
T.data[,1] <- 1:100
T.data[,2] <- 2*T.data[,1]
I'm a beginning R programmer, and I'm having trouble with a loop calculation that depends on the previous value, like a recursion.
An example of my data:
dt <- data.table(a = c(0:4), b = c( 0, 1, 2, 1, 3))
The calculated value y is given by y[n] = (y[n-1] + b[n]) * a[n], with initial value y[1] = 0.
I used the for loop and the code and result is as below.
dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}
a b y
1: 0 0 0
2: 1 1 1
3: 2 2 6
4: 3 1 21
5: 4 3 96
This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm trying to find a way to do this without a for loop. I tried Reduce(), but it only works with a single vector (e.g. y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I haven't found a solution.
Is there a faster way to do this without a for loop, such as a recursive function or a suitable package function?
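For what it's worth, Reduce() can handle the two-vector dependence after all if you fold over the row indices rather than over a single vector, carrying y[n-1] as the accumulator. A sketch with the dt from above (note this still computes element by element, so it will not match compiled code for speed):
dt$y <- Reduce(function(y_prev, n) (y_prev + dt$b[n]) * dt$a[n],
               2:nrow(dt), init = 0, accumulate = TRUE)
dt$y
# [1]  0  1  6 21 96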
This kind of computation cannot make use of R's advantage of vectorization because of the iterative dependencies. But the slow-down appears to really be coming from indexing performance on a data.frame or data.table.
Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows) versus as columns in a data.table or data.frame.
This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
Please note that I also made a different toy data.frame, so I could test a larger example without returning Inf as y grew with i:
Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#user system elapsed
#79.39 146.30 225.78
#NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length=2*10^5.
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option vector (numeric vectors extracted in advance of loop):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
y[i] <- (y[i - 1] + b[i]) * a[i]
})
#user system elapsed
#0.03 0.00 0.03
tail(y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option matrix (data.frame converted to matrix before loop):
vec_length <- 200000
dt <- as.matrix(data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0))
system.time(for (i in 2:nrow(dt)) {
dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#0.67 0.01 0.69
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
#NOTE: a matrix is actually a vector with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns
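A quick illustration of that note:
m <- matrix(1:6, nrow = 2)
attributes(m)
# $dim
# [1] 2 3
m[5] == m[1, 3] # linear indexing works because storage is column-major
# [1] TRUE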
Option data.frame with matrix style indexing:
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#110.69 0.03 112.01
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
An option is to use Rcpp, since this recursive equation is easy to code in C++:
library(Rcpp)
cppFunction("
NumericVector func(NumericVector b, NumericVector a) {
  int len = b.size();
  NumericVector y(len);
  for (int i = 1; i < len; i++) {
    y[i] = (y[i-1] + b[i]) * a[i];
  }
  return(y);
}
")
func(c( 0, 1, 2, 1, 3), c(0:4))
#[1] 0 1 6 21 96
timing code:
vec_length <- 1e7
dt <- data.frame(a=1:vec_length, b=1:vec_length, y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
y[i] <- (y[i - 1] + b[i]) * a[i]
})
# user system elapsed
# 19.22 0.06 19.44
system.time(func(b, a))
# user system elapsed
# 0.09 0.02 0.09
Here is a base R solution.
According to the information from #ThetaFC, one route to a speedup is to use a matrix or vector (rather than a data.frame or data.table). Thus, it is better to do the following preprocessing before calculating df$y, i.e.,
a <- as.numeric(df$a)
b <- as.numeric(df$b)
Then, you have two approaches to get df$y:
writing your customized recursion function
f <- function(k) {
  if (k == 1) return(0)
  prev <- f(k - 1)  # compute the previous values once, not twice
  c(prev, (tail(prev, 1) + b[k]) * a[k])
}
df$y <- f(nrow(df))
Or a non-recursive function (I guess this will be much faster than the recursive approach):
g <- Vectorize(function(k) if (k == 1) 0 else sum(rev(cumprod(rev(a[2:k])))*b[2:k]))
df$y <- g(seq(nrow(df)))
such that
> df
a b y
1 0 0 0
2 1 1 1
3 2 2 6
4 3 1 21
5 4 3 96
I don't think this will be any faster, but here's one way to do it without an explicit loop
dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b)*a
, .init = 0)[-1]]
dt
# a b y
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 6
# 4: 3 1 21
# 5: 4 3 96
I am a bit surprised by the behaviour of data.table. I want to select, from one row in the data.table, all non-NA values.
With NA values present, it works:
t = data.table(a=1,b=NA)
t[, !is.na(t), with=F]
Without NA values, it doesn't work:
t = data.table(a=1, b=2)
t[, !is.na(t), with=F]
The basic difference is that t[, !c(F, F), with=F] doesn't work, while, interestingly, t[, c(T, T), with=F] works fine.
I know there are many ways to achieve the desired output, but I am only interested in this - for me strange - behaviour of data.table.
I've investigated the data.table:::`[.data.table` source code
And it indeed looks like a bug to me. What basically happens is that the !is.na() call is split into a ! and an is.na() call. data.table then evaluates the is.na() part on its own, reduces the resulting logical vector with which(), and if the result has length zero it returns null.data.table(). The issue is that for dt <- data.table(a = 1, b = 2), is.na(dt) is all FALSE, so which() always returns an empty vector.
Below is a shortened code to illustrate what goes under the hood
sim_dt <- function(...) {
  ## data.table catches the call
  jsub <- substitute(...)
  cat("This is your call:", paste0(jsub, collapse = ""))
  ## data.table separates the `!` from the call and sets notj = TRUE instead
  ## and saves `is.na(t)` into `jsub`
  if (is.call(jsub) && deparse(jsub[[1L]], 500L, backtick=FALSE) %in% c("!", "-")) { # TODO is deparse avoidable here?
    notj = TRUE
    jsub = jsub[[2L]]
  } else notj = FALSE
  cat("\nnotj:", notj)
  cat("\nThis is the new jsub: ", paste0(jsub, collapse = "("), ")", sep = "")
  ## data.table evaluates just the `jsub` part, which obviously returns a vector of `FALSE`s (because `!` was removed)
  cat("\nevaluated j:", j <- eval(jsub, setattr(as.list(seq_along(dt)), 'names', names(dt)), parent.frame())) # else j will be evaluated for the first time on next line
  ## data.table checks if `j` is a logical vector, keeps only the TRUE positions, and gets an empty vector
  if (is.logical(j)) cat("\nj after `which`:", j <- which(j))
  cat("\njs length:", length(j), "\n\n")
  ## data.table checks if `j` is empty (it obviously is) and returns a null.data.table
  if (!length(j)) return(data.table:::null.data.table()) else return(dt[, j, with = FALSE])
}
## Your data.table
dt <- data.table(a = 1, b = 2)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE FALSE
# j after `which`:
# js length: 0
#
# Null data.table (0 rows and 0 cols)
dt <- data.table(a = 1, b = NA)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE TRUE
# j after `which`: 2
# js length: 1
#
# b
# 1: NA
As #Roland has already mentioned, is.na(t) returns a matrix, whereas you need a vector to select columns.
But column selection should still work in the OP's example, as the data.table has only a single row. All we need to do is wrap the expression in () so that it gets evaluated first, e.g.:
library(data.table)
t = data.table(a=1, b=2)
t[,(!c(FALSE,FALSE)),with=FALSE]
# a b
# 1: 1 2
t[,(!is.na(t)),with=FALSE]
# a b
# 1: 1 2
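Note that the () trick leans on the data.table having a single row, so the logical matrix happens to contain exactly one entry per column. For a multi-row table, one way (a sketch, assuming you want only the columns that are entirely NA-free) is to collapse the matrix column-wise first:
t <- data.table(a = c(1, 2), b = c(NA, 3))
t[, which(colSums(is.na(t)) == 0), with = FALSE]
#    a
# 1: 1
# 2: 2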
I am currently working on a function that operates on a big two-column matrix (generally more than 2000 values) and I have a performance problem.
Here is the head of my matrix (posted as an image in the original question, not reproduced here).
Here is my function:
get <- function()
{
  v <- sample(1:1e6, 20000, replace=TRUE) #for example
  table <- #mymatrix
  for ( i in 1:nrow(table))
  {
    b <- which(v > table[i,1] & v < table[i,2]) #want index between 2 intervals
  }
  return(b)
}
The problem is the which(): it takes too long when I repeat the loop over the whole table, and I can't figure out how to fix it (I'm still learning R).
As Andrey said in a comment, you’re only returning the result for the last row. You’re also not passing table into the function (in fact, your function has no arguments), and it’s also unclear what v represents and in particular why it has more values than table has rows.
However, assuming that you want the results for all rows, you can do two things:
Don’t use which, you probably don’t need numeric indices.
Use vectorisation instead of a for loop:
get = function(table) {
  v = sample(1 : 1E6, 20000, replace = TRUE)
  v > table[, 1] & v < table[, 2]
}
That’s it.
Here is code that, for every value in the vector v, tells you which of the bins it fell into.
tbl = matrix(c(0,224,
225,233,
234,239,
240,243,
244,290,
291,292),
byrow = TRUE,
ncol = 2);
v = c(0,100,224,
225, 230, 233,
235)
fi1 = findInterval(v, tbl[,1]+1)
fi2 = findInterval(v, tbl[,2]-1)
set = (fi1!=fi2)
b = double(length(v))
b[set] = fi1[set];
# show the results
cbind(value = v, bin = b)
# value bin
# [1,] 0 0
# [2,] 100 1
# [3,] 224 0
# [4,] 225 0
# [5,] 230 2
# [6,] 233 0
# [7,] 235 3
I can't imagine it should be that difficult, but probably, coming from Python, my mindset is biased.
I know I'm going to carry out 50 calculations and the result of each calculation, together with two parameters characterizing the calculation, should build up a data frame.
So my approach is to instantiate the data frame and then I want to add the results whenever they become available. Please see the indicated row below:
# Number of simulations
nsim = 50
# The data frame which should carry the calculation (parameters and solutions).
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
# Fifty values for n.
n <- seq.int(5, 5000, length.out=nsim)
for(ni in n)
{
  # A random sample containing possible duplicates.
  X <- sample(seq(-ni, ni, length=ni+1), replace=T)
  Xbar <- round(mean(X), 3)
  sol <- rbind(sol, c(ni, Xbar, n)) # <<-- How to do this correctly??
}
This doesn't work.
There are two ways to do this correctly. One is to pre-define your data.frame (its size) and then populate it iteratively in a for-loop:
nsim <- 10 # reduce to 10 to simplify output
n <- seq.int(5, 5000, length.out=nsim)
sol <- setNames(data.frame(matrix(nrow=nsim, ncol=3)), c("ni", "Xbar", "n"))
set.seed(1) # for reproducibility
for(ni in seq_along(n)) {
  Xbar <- round(mean(sample(seq(-n[ni], n[ni], length=n[ni]+1), replace=T)), 3)
  sol[ni,] <- c(ni, Xbar, n[ni])
}
Alternatively, you can use sapply on your n vector to create a vector of results and then cbind everything back together:
set.seed(1) # for reproducibility
sol <- data.frame(
  ni = seq_along(n),
  Xbar = sapply(n, function(ni) {
    round(mean(sample(seq(-ni, ni, length=ni+1), replace=T)), 3)
  }),
  n = n
)
Either way, you'll end up with a nice dataframe:
> str(sol)
'data.frame': 10 obs. of 3 variables:
$ ni : num 1 2 3 4 5 6 7 8 9 10
$ Xbar: num 0.667 -0.232 -14.599 -26.026 36.51 ...
$ n : num 5 560 1115 1670 2225 ...
1) Check what your initial sol contains.
> sol <- data.frame(col.names=c("ni", "Xbar", "n"))
> sol
col.names
1 ni
2 Xbar
3 n
Not what you want. See this question.
2) Make sure seq.int does what you expect - check the documentation of (or just the output of) seq.int. e.g. look at what n contains:
> n
[1] 5.0000 106.9388 208.8776 310.8163 412.7551 514.6939 616.6327
[8] 718.5714 820.5102 922.4490 1024.3878 1126.3265 1228.2653 1330.2041
[15] 1432.1429 1534.0816 1636.0204 1737.9592 1839.8980 1941.8367 2043.7755
[22] 2145.7143 2247.6531 2349.5918 2451.5306 2553.4694 2655.4082 2757.3469
[29] 2859.2857 2961.2245 3063.1633 3165.1020 3267.0408 3368.9796 3470.9184
[36] 3572.8571 3674.7959 3776.7347 3878.6735 3980.6122 4082.5510 4184.4898
[43] 4286.4286 4388.3673 4490.3061 4592.2449 4694.1837 4796.1224 4898.0612
[50] 5000.0000
Is that what you meant?
3) Given (1), the problems are not surprising, but in any case, just step through the first pass of the loop one line at a time and see what happens:
nsim = 50
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
ni=5
X <- sample(seq(-ni, ni, length=ni+1), replace=T)
Xbar <- round(mean(X), 3)
sol <- rbind(sol, c(ni, Xbar, n))
print(sol)
Gives:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 5) :
invalid factor level, NA generated
> print(sol)
col.names
1 ni
2 Xbar
3 n
4 <NA>
Now the behavior is unsurprising; we can't add three columns to something with one column.
4) You don't want to do it this way anyway. It's better to initialize sol to be its final size and then fill it in.
See, for example, this answer.
However, the more common R idiom is to avoid loops where possible; there are a number of functions that will let you create the whole thing at once.
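For instance, essentially the sapply() idea from the earlier answer, written here with vapply() for type safety, builds the whole result in one expression:
nsim <- 50
n <- seq.int(5, 5000, length.out = nsim)
sol <- data.frame(
  ni = seq_along(n),
  Xbar = vapply(n, function(ni) {
    round(mean(sample(seq(-ni, ni, length = ni + 1), replace = TRUE)), 3)
  }, numeric(1)),
  n = n
)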
First of all, can you clarify the output format that you expect?
As of now, modifying the code to generate a data frame yields the following output (let me know if this is what you expect; if so, it is not difficult to generate):
ni Xbar n
10.000 2.182 12.000
If this is what you expect, then one way to do this would be as follows:
Step 1: Create Vectors
Step 2: Create Data frame from the above vectors
Step 3: Run your operations in a loop & fill in row by row.
nsim <- 50
n <- seq.int(5, 5000, length.out = nsim)
ni <- vector(mode = 'numeric', length = nsim)
Xbar <- vector(mode = 'numeric', length = nsim)
out <- data.frame(ni = ni, Xbar = Xbar, n = n)
for (i in seq_along(n)) {
  X <- sample(seq(-n[i], n[i], length = n[i] + 1), replace = T)
  out[i, 'Xbar'] <- round(mean(X), 3)
  out[i, 'ni'] <- n[i]
}
This fills in the output data frame row by row.
Say there is a two-column data frame with a time (or distance) column that sequentially increases and an observation column that may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), then slide the window over Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in the window is based on the time column, so both the number of observations per window and the number of observations by which the window slides may vary. The function should accept any window size up to the number of observations, along with a step size.
Here is sample data (see "Edit:" for a larger sample set)
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )
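A naive reference implementation of this spec (a sketch assuming open intervals at both ends, as in the mean(...) call above) reproduces the desired output, though it is far too slow for the real data:
starts <- seq(-2.5, max(dat$time), by = 2.5)
sapply(starts, function(s) mean(dat$measure[dat$time > s & dat$time < s + 5]))
# [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA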
What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.
Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several million observations contained in a list of data frames (the largest data frame has ~400,000 observations).
Edit: As requested, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
Here is an attempt with Rcpp. The function assumes that data is sorted according to time. More testing would be advisable and adjustments could be made.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times,
NumericVector & vals,
double start,
const double winlen,
const double winshift) {
int n = ceil((max(times) - start) / winshift);
NumericVector winvals;
NumericVector means(n);
int ind1(0), ind2(0);
for(int i=0; i < n; i++) {
if (times[0] < (start+winlen)) {
while((times[ind1] <= start) &
(times[ind1+1] <= (start+winlen)) &
(ind1 < (times.size() - 1))) {
ind1++;
}
while((times[ind2+1] <= (start+winlen)) & (ind2 < (times.size() - 1))) {
ind2++;
}
if (times[ind1] >= start) {
winvals = vals[seq(ind1, ind2)];
means[i] = mean(winvals);
} else {
means[i] = NA_REAL;
}
} else {
means[i] = NA_REAL;
}
start += winshift;
}
return means;
}
Testing it:
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA
With your list of data.frames (using data.table):
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?
dat <- rbindlist(dat)
system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user system elapsed
#1.51 0.02 1.54
print(res)
# i V1
# 1: 1 1.0217126
# 2: 1 0.9334415
# 3: 1 0.9609050
# 4: 1 1.0123473
# 5: 1 0.9965922
# ---
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300 NA
#6000600: 300 NA
Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes several seconds to run on one of the larger datasets in your second dat example.
rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window=min(DF[, time_col])) {
  # time_col is name of time column
  # fun is function to apply to the subsetted data frames
  # min_window is the start time of the earliest window
  times <- DF[, time_col]
  # window_starts is a vector of the windows' minimum times
  window_starts <- seq(from=min_window, to=max(times), by=step_size)
  # The i-th element of window_rows is a vector that tells us the row numbers of
  # the data-frame rows that are present in window i
  window_rows <- lapply(window_starts, function(x) { which(times>=x & times<x+window_size) })
  window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
  data.frame(start_time=window_starts, end_time=window_starts+window_size, summary=window_summaries)
}
rolling_summary(DF=dat,
                time_col="time",
                fun=function(DF) mean(DF$measure),
                window_size=5,
                step_size=2.5,
                min_window=-2.5)
Here are some functions that will give the same output on your first example:
partition <- function(x, window, step = 0){
  a = x[x < step]
  b = x[x >= step]
  ia = rep(0, length(a))
  ib = cut(b, seq(step, max(b) + window, by = window))
  c(ia, ib)
}
roll <- function(df, window, step = 0, fun, ...){
  tapply(df$measure, partition(df$time, window, step), fun, ...)
}
roll_steps <- function(df, window, steps, fun, ...){
  X = lapply(steps, roll, df = df, window = window, fun = fun, ...)
  names(X) = steps
  X
}
Output for your first example:
> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
1 2 3 4 5
NA 1.0126639 0.9514456 NA NA
$`2.5`
0 1 2 3 4
1.0222694 NA 0.9965048 1.0518228 NA
You can also ignore missing values this way easily:
> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
1 2 3 4 5
0.7275438 1.0126639 0.9514456 0.9351326 NaN
$`2.5`
0 1 2 3 4
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983
This can also be used for a list of data.frames:
> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)
Ok, how about this.
library(data.table)
dat <- data.table(dat)
setkey(dat, time)
# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
  pos_vec <- my_dt[, which(time>=start_tm & time<=start_tm+window_len)]
  return(stat_fn(my_dt$measure[pos_vec]))
}
# a vector of window start times
start_vec <- seq(from=-2.5, to=dat$time[nrow(dat)], by=2.5)
# sapply'ing the function above over vector of start times
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary,
window_len=5, stat_fn=mean, my_dt=dat)
On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs; all rows in 51.58614 secs
Here's another attempt at a pure data.table approach, using its between function.
I have compared Rprof results against the above answers (except #Roland's) and it seems to be the most optimized one.
I haven't tested for bugs, though; if you like it, I'll expand the answer.
Using your dat from above
library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
  temp <- seq.int(-slide, max(dat$time), by = slide)
  temp <- cbind(temp, temp + wind)
  setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}
Rollfunc(dat, time, measure, 5, 2.5)
## [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA
## [10] NA
You can also specify the function and its arguments, for example:
Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)
will also work
Edit: I did some benchmarks against #Roland's answer and his method clearly wins (by far), so I would go with the Rcpp approach.