I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work but was wondering if there's already a function from some package (which will probably be much more efficient than my attempt) doing exactly this. An example dataset:
tmp.df <- data.frame(
x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
The first 5 elements of y are NAs because the first time x has 3 previous non-NA values is on row 6 and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7 and so on.
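For concreteness, those two window means are just:
mean(c(1, 2, 3)) # rows 2, 3 and 5 -> y is 2 on row 6
mean(c(3, 4, 5)) # rows 5, 6 and 7 -> y is 4 on rows 8 to 11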
My attempt is this:
roll_mean_previous_k <- function(x, k){
  require(dplyr)
  res <- NA
  # lag(x) shifts the window so that the current value is excluded
  lagged_vector <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  # mean of the last k non-NA values, but only if k of them exist
  previous_k_values <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}
to be used as follows (using the slide_dbl function from the slider package):
library(dplyr)
tmp.df %>%
  mutate(
    y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
  )
which gives the desired output. However, I'm wondering if there's a ready-made and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively but, unless I'm mistaken, they work on a fixed rolling window with the option of dealing with NA values (e.g. ignoring them). In my case, I want to "extend" my window to include k non-NA values.
Any thoughts/suggestions are welcome.
EDIT - SIMULATION RESULTS
Thank you to all contributors. First of all, I had not mentioned that my datasets are indeed much larger and run often, so any performance improvements are most welcome. I therefore ran the following simulation to check execution times before deciding which answer to accept. Note that some of the answers needed small tweaks to return the desired output, but if you feel that your solution is misrepresented (and therefore looks less efficient than intended), feel free to let me know and I'll edit accordingly. I've used G. Grothendieck's trick from his answer below to remove the need for if-else checks regarding the length of the lagged, non-NA vector.
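For reference, the reason the trick removes the if-else check: `length<-`(v, k) pads v with NAs up to length k, so mean() automatically returns NA whenever fewer than k lagged non-NA values exist:
`length<-`(c(1, 2), 3)          # [1]  1  2 NA
mean(`length<-`(c(1, 2), 3))    # [1] NA  (fewer than 3 values seen so far)
mean(`length<-`(c(1, 2, 3), 3)) # [1] 2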
So here's the simulation code:
library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)
set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA
# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  runner(
    x = z,
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    }
  )
}
# Based on my initial answer (but simplified), also mentioned by GoGonzo
f_slider <- function(z, k){
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    .before = Inf
  )
}
# Based on helios' answer. Returns the correct results, but with a warning.
f_helios <- function(z, k){
  reduced_vec <- na.omit(z)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  start <- which(!is.na(z))[k] + 1
  repeater <- which(is.na(z)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(z))]
  final <- as.numeric(rep(NA, length(z)))
  index <- start:length(z)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[start:length(final)] <- na.locf(final)
  final
}
# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  rollapplyr(
    z,
    seq_along(z),
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    k
  )
}
# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  map_dbl(
    seq_along(z),
    ~ ifelse(
      length(tail(na.omit(z[1:(.x - 1)]), k)) == k,
      mean(tail(na.omit(z[1:(.x - 1)]), k)),
      NA
    )
  )
}
# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ),
    FUN = identical,
    f_slider(test_vector, 10)
  )
)
# Run benchmarking #
microbenchmark(
  # f_helios(test_vector, 10),
  f_purrr(test_vector, 10),
  f_runner(test_vector, 10),
  f_slider(test_vector, 10),
  f_zoo(test_vector, 10)
)
And the results:
Unit: milliseconds
expr min lq mean median uq max neval cld
f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613 100 c
f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485 98.7239 100 b
f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250 32.1847 100 a
f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793 100 d
Based on the above, and unless the code can be further improved, it seems as the slider and runner solutions are faster. Any final suggestions are more than welcome.
Many thanks for your time!!
With runner it will be something like the mean of a 3-element tail window of non-NA values. You can achieve the same result with slider.
library(runner)
tmp.df <- data.frame(
x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x,
  f = function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }
)
# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x,
  function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  },
  .before = Inf
)
tmp.df
# x y_runner y_slider
# 1 NA NaN NaN
# 2 1 1.0 1.0
# 3 2 1.5 1.5
# 4 NA 1.5 1.5
# 5 3 2.0 2.0
# 6 4 3.0 3.0
# 7 5 4.0 4.0
# 8 NA 4.0 4.0
# 9 NA 4.0 4.0
# 10 NA 4.0 4.0
# 11 6 5.0 5.0
# 12 7 6.0 6.0
# 13 NA 6.0 6.0
rollapplyr. Regarding the comment about rollmean in the question, zoo also has rollapply and rollapplyr (right aligned), and those allow different widths (and offsets) for each component of the input by specifying a vector (as we do here) or a list for width -- see ?rollapply for more info. We use a relatively naive vector of widths below and also show some improved width vectors which run faster.
Operation. Create a Mean function which takes a vector, removes the last element and all NAs, and takes the last k elements of what is left, extending it to k elements with NAs as needed. Finally, take the mean of that. We use rollapplyr to apply it to x with a width of seq_along(x).
Performance improvements. With this small data the following likely don't make much difference, but if you have larger data you could try these, which might improve the speed (a combined sketch follows the list):
replace na.omit with na_rm from the collapse package
replace the second argument of rollapplyr with the code shown below.
The idea here is that the sum of the lengths of the k+1 longest runs of NA, plus k+1, bounds the number of elements that we need to consider. This (plus using na_rm) ran about 25% faster than the code in the question when I tried it on a problem with 1300 rows (formed from 100 copies of the data in the question), and does not add much extra code.
pmin(with(rle(is.na(x)), sum(tail(sort(lengths[values]), k+1)))+k+1, seq_along(x))
replace the second argument of rollapplyr with w, where w is computed as shown below. The idea here is to use findInterval to find the element k non-NAs back, which provides an even tighter bound. This one (plus using na_rm) ran nearly twice as fast as the code in the question when tried with the same 1300 rows, at the expense of adding 2 more lines of code.
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
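Putting the pieces together, a sketch of what the improved call might look like (my combination of the above, untested as written; Mean2 and f_zoo_fast are just illustrative names, and the collapse package must be installed):
library(zoo)
library(collapse) # provides na_rm, a faster na.omit for vectors

Mean2 <- function(x, k) mean(`length<-`(tail(na_rm(head(x, -1)), k), k))

f_zoo_fast <- function(x, k) {
  # width vector from above: reach back just far enough to cover k non-NAs
  tt <- length(x) - rev(cumsum(rev(!is.na(x))))
  w <- seq_along(tt) - findInterval(tt - k - 1, tt)
  rollapplyr(x, w, Mean2, k)
}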
Code. With the data in the question, the code below (not using the above improvements) ran slightly faster (not by a lot) than the code in the question in my benchmarking, and it is only two lines of code.
library(dplyr)
library(zoo)
Mean <- function(x, k) mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
tmp.df %>% mutate(y = rollapplyr(x, seq_along(x), Mean, k = 3))
giving:
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
Since I am not aware of a ready-made way of computing your output in any standard library, I came up with the implementation roll_mean_k_efficient below, which seems to speed up your computations considerably. Note that this implementation makes use of the rollapply and na.locf functions from the zoo package.
rm(list = ls())
library("zoo")
library("rbenchmark")
library("dplyr")
x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
# your sample (extended)
tmp.df <- data.frame(
x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
)
# enhanced implementation
roll_mean_k_efficient <- function(x, k){
  # rolling means over the non-NA values only
  reduced_vec <- na.omit(x)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  # first position that already has k non-NA predecessors
  start <- which(!is.na(x))[k] + 1
  # positions right after an NA just repeat the previous mean
  repeater <- which(is.na(x)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(x))]
  final <- as.numeric(rep(NA, length(x)))
  index <- start:length(x)
  final[setdiff(index, repeater_cut)] <- unique_means
  # carry the last computed mean forward into the repeated positions
  final[start:length(final)] <- na.locf(final)
  final
}
# old implementation
roll_mean_previous_k <- function(x, k){
  res <- NA
  lagged_vector <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}
# wrapper function for the benchmarking below
roll_mean_benchmark = function(){
  res = tmp.df %>%
    mutate(
      y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
    )
  return(res)
}
# some benchmarking
benchmark(roll_mean_k_efficient(x = x, k = 3),
          roll_mean_benchmark(),
          columns = c('test', 'elapsed', 'replications'),
          replications = 100)
Furthermore, I extended your example vector x to get some more reliable benchmark results via the benchmark function from the rbenchmark package.
In my case the benchmark runtimes that are printed after running the code are:
test elapsed replications
2 roll_mean_benchmark() 4.463 100
1 roll_mean_k_efficient(x = x, k = 3) 0.039 100
Without using zoo. In tidyverse fashion, you can also do it using purrr::map:
tmp.df %>% mutate(y = map(seq_along(x),
                          ~ ifelse(length(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)) == 3,
                                   mean(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)),
                                   NA)))
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
Related
I have a vector of threshold values, thresholds, and another vector, x. I'd like to create a new vector, say vec_sum, of the same length as thresholds, that stores, for each element of thresholds, the sum of values of x larger than this element.
What is the fastest way of doing this?
The naive way I'm doing it is
vec_sum <- rep(NA, length(thresholds))
for(i in seq_along(thresholds))
{
  vec_sum[i] <- sum(x[x > thresholds[i]])
}
In case it helps, thresholds is already sorted.
Here is another solution using cumsum:
f1 <- function(v, th){
  v2 <- v[order(v)]                      # sort the values ascending
  v2s <- rev(cumsum(rev(v2)))            # suffix sums: v2s[i] = sum of v2[i:length(v2)]
  return(v2s[findInterval(th, v2) + 1])  # index of the first value above each threshold
}
Here are some tests and a comparison with the other answer by Ronak (using his example data as well):
f2 <- function(x, thresholds){
  if (all(x < thresholds[1])) return(rep(0, length(thresholds)))
  if (all(x > thresholds[length(thresholds)])) return(rep(sum(x), length(thresholds)))
  return(rev(cumsum(rev(tapply(x,
                               findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
}
test_th <- c(3, 5, 10)
test_x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20)
vec_sum <- rep(NA, length(test_th))
for(i in seq_along(test_th)) {
  vec_sum[i] <- sum(test_x[test_x > test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] TRUE
set.seed(123)
test_x <- rnorm(10000)
test_th <- sort(rnorm(100)) ## f2 requires sorted threshold values
vec_sum <- rep(NA, length(test_th))
for(i in seq_along(test_th)) {
  vec_sum[i] <- sum(test_x[test_x > test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] FALSE
# Warning message:
# In x - y : longer object length is not a multiple of shorter object length
library(microbenchmark)
microbenchmark(
  a = f1(test_x, test_th),
  b = f2(test_x, test_th)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# a 587.116 682.864 900.3572 694.713 703.726 10647.206 100
# b 1157.213 1203.063 1260.0663 1223.600 1258.552 2143.069 100
Not sure if this is any faster, but we can use findInterval to cut x by thresholds. We take the sum of each group using tapply and take the cumsum in reverse.
as.integer(rev(cumsum(rev(tapply(x,
                                 findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
Tested on
thresholds <- c(3, 5, 10)
x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20) #1:20 in random order
vec_sum <- rep(NA, length(thresholds))
for(i in seq_along(thresholds)) {
  vec_sum[i] <- sum(x[x > thresholds[i]])
}
vec_sum
#[1] 204 195 155
Using the proposed solution
as.integer(rev(cumsum(rev(tapply(x,
                                 findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
#[1] 204 195 155
Explaining the answer. findInterval returns the group to which each value of x belongs:
findInterval(x, thresholds, left.open = TRUE)
#[1] 0 0 0 3 1 2 1 3 2 2 2 2 3 3 3 3 3 3 3 3
We use tapply to get the sum of each group:
tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)
# 0 1 2 3
# 6 9 40 155
The 0-group should be excluded since those values are smaller than all the thresholds (hence the -1). The total for group 1 should also include the sums of groups 2 and 3, and the total for group 2 should include the sum of group 3. So we reverse the sequence and take the cumsum:
cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1]))
# 3 2 1
#155 195 204
To get it in the original order and match it with the thresholds, we reverse it again:
rev(cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1])))
# 1 2 3
#204 195 155
Edge cases:
If all values are below the smallest threshold, or all values are above the largest threshold, we might need to do an extra check and return the following.
if (all(x < thresholds[1])) rep(0, length(thresholds))
if (all(x > thresholds[length(thresholds)])) rep(sum(x), length(thresholds))
I am a newbie in R.
I am trying to calculate the deviation between columns, applying several rules:
the deviation is calculated as the current value minus the previous value
if the current value is NA, return NA without calculating
if the previous value is NA, subtract the value before the previous one, going back until a valid value is reached
the values in the first column are always valid
For example:
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
test
start a b
1 1 2 4
2 2 NA 5
3 3 5 NA
4 4 6 8
Expected:
result
a_delta b_delta
1 1 2
2 NA 3
3 2 NA
4 2 2
Note:
cell (2, 1) in result is NA because cell (2, 2) in test is NA
cell (2, 2) in result is 3 because cell (2, 3) minus cell (2, 1) in test gives 3
Here is my broken code. Any suggestions are welcome:
f <- function(data){
cn <- colnames(data)
cl <- ncol(data)
for (i in 2:cl)){
if (is.na(data$i)) {a <- NA}
else if (!is.na(data$(i-1))) {paste(cn[i], "_delta") <- data$cn[i] - data$cn[i-1]}
else { # check if previous value is NA repeatively
t < i - 1
while (is.na(data$cn[t])) {
t <- t - 1
}
paste(cn[i], "_delta") <- data$cn[i] - data$cn[t]
}
}
}
f(test)
Main function:
CalcDev <- function(x){
  x <- unlist(x)
  if (!any(is.na(x))){
    # no NAs in this row: plain first differences
    return(diff(x, 1))
  } else {
    # diff the non-NA values, then place the results back at the
    # positions whose current value is not NA
    tmp <- x[-1]
    res <- diff(na.omit(x), 1)
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}
And how to use:
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
plyr::adply(test, 1, CalcDev)[, -1]
Result:
a b
1 1 2
2 NA 3
3 2 NA
4 2 2
You just need to rename columns.
I was unable to run your code, so no benchmark.
EDIT:
Answering your comment: you can use the CalcDev function in a dplyr chain if you vectorize it:
CalcDev.Vect <- Vectorize(CalcDev)
test %>%
  CalcDev.Vect %>%
  .[, -1] %>%
  as.data.frame
You will get similar results, and it will be much faster, especially for bigger data sets.
There are two alternatives: using CalcDev inside do({}) or using adply directly in the chain, but both will be slower. A sketch of the adply variant follows.
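For example, the adply variant might look like this (a sketch, assuming magrittr's pipe is available; it just moves the earlier adply call into the chain):
test %>%
  plyr::adply(1, CalcDev) %>%
  .[, -1]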
Benchmarks for the small data set:
expr min lq mean median uq max neval cld
foo.plyr(test) 2240.34 2392.08 2511.3 2490.13 2577.32 3199.16 100 b
foo.do_dplyr(test) 2680.34 2933.70 3104.4 3015.15 3109.48 5771.83 100 d
foo.plyr_in_dplyr(test) 2471.51 2635.04 2805.7 2702.99 2802.29 9422.46 100 c
foo.Vect(test) 441.55 490.58 539.7 539.92 564.74 928.41 100 a
And for bigger data sets the difference in evaluation time will be more drastic.
I am trying to clean a large dataset. I have a matrix of prices in date order with the most recent date in the first row and different stocks along the columns. If a price is missing or NA on a given day for a given stock, I use the price from the day before. If the price on the final day is NA, I leave it.
I started by looping over the whole matrix and using an IF statement for each (i,j) pair. This was extremely slow. The next approach, shown below for the matrix of prices b, uses indexing to find the indices which are NA and deals only with these.
for(j in 1:ncol(b))
{
  Index <- which( is.na(b[,j]) | b[,j]==0 )
  if(length(Index)==0)
  {
    print("0 Missings")
    Index <- c(1) # to ensure it's not empty
  }
  for(k in length(Index):1) # backwards to fill from the bottom
  {
    i <- Index[k]
    # If the oldest date is missing, then set it to N/A so that N/A is
    # passed forward as opposed to 0.
    if( i==nrow(b) & ( b[[i,j]]==0 | is.na(b[[i,j]]) ) )
    {
      b[[i,j]] <- '#N/A'
    }
    else if( b[[i,j]]==0 | is.na(b[[i,j]]) )
    {
      b[[i,j]] <- b[[i+1,j]] # Take the price from the date before
    }
  }
}
This is a little faster, but not much. It still takes over an hour for a 400x6000 matrix. I was hoping for a fully vectorised approach, where I did something like:
b[[Index,j]]<-b[[Index+1,j]]
However, I don't think R will use sequentially updated values. By this I mean it won't progressively update from the bottom so that new values are used. This is important when I have 2 NA entries in a row, since the vectorised approach above will only fill one of them, while efficient sequential code would update the first and then use it to update the second. Any ideas?
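To illustrate (my own toy example, not from the data above): a one-shot vectorised assignment evaluates the whole right-hand side before writing anything, so the first of two consecutive NAs never sees the updated value:
v <- c(1, NA, NA, 4)
Index <- which(is.na(v))
v[Index] <- v[Index + 1] # RHS is read in full before any element is written
v
# [1]  1 NA  4  4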
Many thanks for your efforts
Here's a possibility using the MESS package; it's essentially no different from @Roland's comment above, so I'm only including it here so you can see the formatting. The filldown function is written in C++, so it's rather fast:
x <- matrix(c(1, 2, 3, 4, NA, 6, NA, NA, NA, NA, 11, 12, 13, 14, 15, NA, 17, 18, NA, 20), nrow=5)
x
[,1] [,2] [,3] [,4]
[1,] 1 6 11 NA
[2,] 2 NA 12 17
[3,] 3 NA 13 18
[4,] 4 NA 14 NA
[5,] NA NA 15 20
and then use
library(MESS)
apply(x, 2, filldown)
which produces
[,1] [,2] [,3] [,4]
[1,] 1 6 11 NA
[2,] 2 6 12 17
[3,] 3 6 13 18
[4,] 4 6 14 18
[5,] 4 6 15 20
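For comparison, the zoo equivalent (presumably what the referenced comment suggested) gives the same fill:
library(zoo)
apply(x, 2, na.locf, na.rm = FALSE)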
If you want a pure R version, then I have a consideration for you:
First define a proper, big, test set:
set.seed(42)
nRow <- 1000
nCol <- 500
test <- matrix(rnorm(nRow * nCol),
nrow = nRow,
ncol = nCol)
test[sample(nRow * nCol, nRow)] <- NA
Then write some code that passes values down in the way you want using vectorisation (applied to each column). Please note that the horrible Excel '#N/A' has been converted to NaN, which maintains the storage mode (i.e. numeric).
innerF <- function(x){
  # Nothing to change
  if(!any(idx <- is.na(x) | x == 0))
    return(x)
  # Alter first value
  if(is.na(x[1]) | x[1] == 0)
    x[1] <- NaN
  # First value altered
  idx[1] <- FALSE
  # Pass down
  x[idx] <- x[which(idx) - 1]
  # Return
  x
}
Then define a calling strategy:
outerF <- function(x){
  # Run once
  y <- innerF(x)
  # Run till done
  while(any((is.na(y) & !is.nan(y)) |
            (!is.na(y) & y == 0L))){
    y <- innerF(y)
  }
  # Return
  y
}
Then test it against the alternative, and weep... (hint: use MESS and its C++):
library(microbenchmark)
library(MESS)
microbenchmark(apply(test, 2, outerF), times = 100)
#Unit: microseconds
# expr min lq mean median uq max neval
# apply(test, 2, outerF) 630.07 652.4505 806.4808 670.6965 686.234 3253.27 100
microbenchmark(apply(test, 2, filldown), times = 100)
#Unit: microseconds
# expr min lq mean median uq max neval
# apply(test, 2, filldown) 107.482 110.048 134.9092 112.329 114.895 1980.016 100
My RStudio does not allow me to install the packages MESS and zoo, so I have to settle for a solution similar to na.locf. The code is here in case anybody wants to use this approach:
start.time <- Sys.time()
nrow <- nrow(b)
for(j in 2:ncol(b))
{
  ColumnReversed <- rev(b[,j]) # So we fill from the bottom - oldest date first
  Index <- !is.na(as.numeric(matrix(ColumnReversed, ncol=1))) # 1 for valid, 0 for missing
  ValidVals <- c("NA", ColumnReversed[Index]) # [NA, final known, second final known, ..., first known]
  FilledIndex <- cumsum(Index) + 1 # [0,0,0,...,1,1,1,...,2,2,2,...,k,k] + 1
  # This line stores the index of ValidVals which contains the prices (and values to be filled)
  b[,j] <- rev( matrix(ValidVals[FilledIndex], ncol=1) ) # Store in reversed order
}
Timing improved from 90 minutes to 65 seconds. AMAZING!
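The indexing trick in isolation (my illustration, not part of the code above): cumsum over the validity mask maps every position to the most recently seen valid value, and positions before the first valid one map to the leading NA:
v <- c(NA, 3, NA, NA, 7)
valid <- !is.na(v)
c(NA, v[valid])[cumsum(valid) + 1]
# [1] NA  3  3  3  7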
Say there is a 2-column data frame with a time or distance column which sequentially increases and an observation column which may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in the window is based on the time column, so both the number of observations per window and the number of observations to slide the window may vary. The function should accept any window size up to the number of observations, plus a step size.
Here is sample data (see "Edit:" for a larger sample set)
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )
What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.
Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several millions of observations contained in a list of data frames (max data frame is ~400,000 observations).
Below is extra info: a larger sample set.
Edit: As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
Here is an attempt with Rcpp. The function assumes that data is sorted according to time. More testing would be advisable and adjustments could be made.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times,
                          NumericVector & vals,
                          double start,
                          const double winlen,
                          const double winshift) {
  int n = ceil((max(times) - start) / winshift);
  NumericVector winvals;
  NumericVector means(n);
  int ind1(0), ind2(0);

  for(int i = 0; i < n; i++) {
    if (times[0] < (start + winlen)) {
      // advance ind1 to the first element inside the window;
      // bounds check first, with &&, so we never read past the end
      while((ind1 < (times.size() - 1)) &&
            (times[ind1] <= start) &&
            (times[ind1+1] <= (start + winlen))) {
        ind1++;
      }
      // advance ind2 to the last element inside the window
      while((ind2 < (times.size() - 1)) &&
            (times[ind2+1] <= (start + winlen))) {
        ind2++;
      }
      if (times[ind1] >= start) {
        winvals = vals[seq(ind1, ind2)];
        means[i] = mean(winvals);
      } else {
        means[i] = NA_REAL;
      }
    } else {
      means[i] = NA_REAL;
    }
    start += winshift;
  }
  return means;
}
Testing it:
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA
With your list of data.frames (using data.table):
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?
dat <- rbindlist(dat)
system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user system elapsed
#1.51 0.02 1.54
print(res)
# i V1
# 1: 1 1.0217126
# 2: 1 0.9334415
# 3: 1 0.9609050
# 4: 1 1.0123473
# 5: 1 0.9965922
# ---
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300 NA
#6000600: 300 NA
Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes several seconds to run on one of the larger datasets in your second dat example.
rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window=min(DF[, time_col])) {
  # time_col is name of time column
  # fun is function to apply to the subsetted data frames
  # min_window is the start time of the earliest window
  times <- DF[, time_col]
  # window_starts is a vector of the windows' minimum times
  window_starts <- seq(from=min_window, to=max(times), by=step_size)
  # The i-th element of window_rows is a vector that tells us the row numbers of
  # the data-frame rows that are present in window i
  window_rows <- lapply(window_starts, function(x) { which(times >= x & times < x + window_size) })
  window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
  data.frame(start_time=window_starts, end_time=window_starts+window_size, summary=window_summaries)
}
rolling_summary(DF=dat,
                time_col="time",
                fun=function(DF) mean(DF$measure),
                window_size=5,
                step_size=2.5,
                min_window=-2.5)
Here are some functions that will give the same output on your first example:
partition <- function(x, window, step = 0){
  a = x[x < step]
  b = x[x >= step]
  ia = rep(0, length(a))
  ib = cut(b, seq(step, max(b) + window, by = window))
  c(ia, ib)
}

roll <- function(df, window, step = 0, fun, ...){
  tapply(df$measure, partition(df$time, window, step), fun, ...)
}

roll_steps <- function(df, window, steps, fun, ...){
  X = lapply(steps, roll, df = df, window = window, fun = fun, ...)
  names(X) = steps
  X
}
Output for your first example:
> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
1 2 3 4 5
NA 1.0126639 0.9514456 NA NA
$`2.5`
0 1 2 3 4
1.0222694 NA 0.9965048 1.0518228 NA
You can also ignore missing values this way easily:
> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
1 2 3 4 5
0.7275438 1.0126639 0.9514456 0.9351326 NaN
$`2.5`
0 1 2 3 4
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983
This can also be used for a list of data.frames:
> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)
Ok, how about this.
library(data.table)
dat <- data.table(dat)
setkey(dat, time)
# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
  pos_vec <- my_dt[, which(time >= start_tm & time <= start_tm + window_len)]
  return(stat_fn(my_dt$measure[pos_vec]))
}
# a vector of window start times
start_vec <- seq(from=-2.5, to=dat$time[nrow(dat)], by=2.5)
# sapply'ing the function above over vector of start times
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary,
window_len=5, stat_fn=mean, my_dt=dat)
On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs, and all rows in 51.58614 secs.
Here's another attempt using a pure data.table approach and its between function.
I have compared Rprof against the above answers (except @Roland's answer) and it seems the most optimized one.
I haven't tested it for bugs, but if you'd like it, I'll expand the answer.
Using your dat from above
library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
  temp <- seq.int(-slide, max(dat$time), by = slide)
  temp <- cbind(temp, temp + wind)
  setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}
Rollfunc(dat, time, measure, 5, 2.5)
## [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA
## [10] NA
You can also specify the function and its arguments; for example:
Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)
will also work
Edit: I did some benchmarks against @Roland's answer and his method clearly wins (by far), so I would go with the Rcpp approach.
I have two vectors, A and B. For every element in A, I want to find the index of the first element in B that is greater than it and has a higher index. The lengths of A and B are the same.
So for vectors:
A <- c(10, 5, 3, 4, 7)
B <- c(4, 8, 11, 1, 5)
I want a result vector:
R <- c(3, 3, 5, 5, NA)
Of course I can do it with two loops, but it's very slow, and I don't know how to use apply() in this situation, when the indices matter. My data set has vectors of length 20000, so the speed is really important in this case.
A few bonus questions:
What if I have a sequence of numbers (like seq = 2:10), and I want to find the first number in B that is higher than a+s, for every a of A and every s of seq?
Like question 1), but I want to know both the first greater and the first lower value, and create a matrix which stores which one came first. So, for example, with a of A and 10 from seq, I want to find the first value of B which is higher than a+10 or lower than a-10, and then store its index and value.
sapply(sapply(seq_along(A), function(x) which(B[-seq(x)] > A[x]) + x), "[", 1)
[1] 3 3 5 5 NA
This is a great example of when sapply is less efficient than loops.
Although the sapply does make the code look neater, you are paying for that neatness with time.
Instead you can wrap a while loop inside a for loop inside a nice, neat function.
Here are benchmarks comparing a nested-apply loop against nested for-while loop (and a mixed apply-while loop, for good measure). Update: added the vapply..match.. mentioned in comments. Faster than sapply, but still much slower than while loop.
BENCHMARK:
test elapsed relative
1 for.while 0.069 1.000
2 sapply.while 0.080 1.159
3 vapply.match 0.101 1.464
4 nested.sapply 0.104 1.507
Notice you save a third of your time; the savings will likely be larger when you start adding the sequences to A.
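Since the vapply..match code itself isn't shown above, here is my reconstruction of the idea (match(TRUE, ...) finds the position of the first TRUE, i.e. the first later element of B exceeding A[j]):
vapply(seq_along(A), function(j) {
  k <- match(TRUE, B[-seq_len(j)] > A[j])
  if (is.na(k)) NA_integer_ else as.integer(k + j)
}, integer(1))
# [1]  3  3  5  5 NA   (for the A and B in the question)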
For the second part of your question:
If you have this all wrapped up in a nice function, it is easy to add a seq to A:
# Sample data
A <- c(10, 5, 3, 4, 7, 100, 2)
B <- c(4, 8, 11, 1, 5, 18, 20)
# Sample sequence
S <- seq(1, 12, 3)
# matrix with all index values (with names cleaned up)
indexesOfB <- t(sapply(S, function(s) findIndx.gt(A + s, B))) # findIndx.gt is defined below
dimnames(indexesOfB) <- list(S, A)
Lastly, if you want to instead find values of B less than A, just swap the operation in the function.
(You could include an if-clause in the function and use only a single function. I find it more efficient to have two separate functions.)
findIndx.gt(A, B) # [1] 3 3 5 5 6 NA 8 NA NA
findIndx.lt(A, B) # [1] 2 4 4 NA 8 7 NA NA NA
Then you can wrap it all up in one nice function:
rangeFindIndx(A, B, S)
# A S indxB.gt indxB.lt
# 10 1 3 2
# 5 1 3 4
# 3 1 5 4
# 4 1 5 NA
# 7 1 6 NA
# 100 1 NA NA
# 2 1 NA NA
# 10 4 6 4
# 5 4 3 4
# ...
FUNCTIONS
(Notice they depend on reshape2)
rangeFindIndx <- function(A, B, S) {
  # For each s in S, and for each a in A,
  # find the first value of B which is higher than a+s, or lower than a-s
  require(reshape2)
  # Create gt & lt matrices; add dimnames for the melting function
  indexesOfB.gt <- sapply(S, function(s) findIndx.gt(A+s, B))
  indexesOfB.lt <- sapply(S, function(s) findIndx.lt(A-s, B))
  dimnames(indexesOfB.gt) <- dimnames(indexesOfB.lt) <- list(A, S)
  # melt the matrices and combine into one
  gtltMatrix <- cbind(melt(indexesOfB.gt), melt(indexesOfB.lt)$value)
  # clean up their names
  names(gtltMatrix) <- c("A", "S", "indxB.gt", "indxB.lt")
  return(gtltMatrix)
}
findIndx.gt <- function(A, B) {
  lng <- length(A)
  ret <- integer(0)
  b <- NULL
  for (j in seq(lng-1)) {
    i <- j + 1
    while (i <= lng && ((b <- B[[i]]) < A[[j]])) {
      i <- i + 1
    }
    ret <- c(ret, ifelse(i <= lng, i, NA)) # <= so a match at the last position is kept
  }
  c(ret, NA)
}
findIndx.lt <- function(A, B) {
  lng <- length(A)
  ret <- integer(0)
  b <- NULL
  for (j in seq(lng-1)) {
    i <- j + 1
    while (i <= lng && ((b <- B[[i]]) > A[[j]])) { # the only difference from findIndx.gt
      i <- i + 1
    }
    ret <- c(ret, ifelse(i <= lng, i, NA))
  }
  c(ret, NA)
}