Calculating moving average - r

I'm trying to use R to calculate the moving average over a series of values in a matrix. There doesn't seem to be a built-in function in R that will allow me to calculate moving averages. Do any packages provide one? Or do I need to write my own?

Or you can simply calculate it using filter, here's the function I use:
ma <- function(x, n = 5){filter(x, rep(1 / n, n), sides = 2)}
If you use dplyr, be careful to specify stats::filter in the function above.

Rolling Means/Maximums/Medians in the zoo package (rollmean)
MovingAverages in TTR
ma in forecast

Using cumsum should be sufficient and efficient. Assuming you have a vector x and you want a running sum of n numbers
cx <- c(0,cumsum(x))
rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
As pointed out in the comments by #mzuther, this assumes that there are no NAs in the data. to deal with those would require dividing each window by the number of non-NA values. Here's one way of doing that, incorporating the comment from #Ricardo Cruz:
cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))
cn <- c(0, cumsum(ifelse(is.na(x), 0, 1)))
rx <- cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]
rn <- cn[(n+1):length(cx)] - cn[1:(length(cx) - n)]
rsum <- rx / rn
This still has the issue that if all the values in the window are NAs then there will be a division by zero error.

In data.table 1.12.0 new frollmean function has been added to compute fast and exact rolling mean carefully handling NA, NaN and +Inf, -Inf values.
As there is no reproducible example in the question there is not much more to address here.
You can find more info about ?frollmean in manual, also available online at ?frollmean.
Examples from manual below:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three above are embarrassingly parallel using openmp

The caTools package has very fast rolling mean/min/max/sd and few other functions. I've only worked with runmean and runsd and they are the fastest of any of the other packages mentioned to date.

You could use RcppRoll for very quick moving averages written in C++. Just call the roll_mean function. Docs can be found here.
Otherwise, this (slower) for loop should do the trick:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n):i])
}
res
}

Here is example code showing how to compute a centered moving average and a trailing moving average using the rollmean function from the zoo package.
library(tidyverse)
library(zoo)
some_data = tibble(day = 1:10)
# cma = centered moving average
# tma = trailing moving average
some_data = some_data %>%
mutate(cma = rollmean(day, k = 3, fill = NA)) %>%
mutate(tma = rollmean(day, k = 3, fill = NA, align = "right"))
some_data
#> # A tibble: 10 x 3
#> day cma tma
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 2 2 NA
#> 3 3 3 2
#> 4 4 4 3
#> 5 5 5 4
#> 6 6 6 5
#> 7 7 7 6
#> 8 8 8 7
#> 9 9 9 8
#> 10 10 NA 9

In fact RcppRoll is very good.
The code posted by cantdutchthis must be corrected in the fourth line to the window be fixed:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
Another way, which handles missings, is given here.
A third way, improving cantdutchthis code to calculate partial averages or not, follows:
ma <- function(x, n=2,parcial=TRUE){
res = x #set the first values
if (parcial==TRUE){
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res
}else{
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res[-c(seq(1,n-1,1))] #remove the n-1 first,i.e., res[c(-3,-4,...)]
}
}

In order to complement the answer of cantdutchthis and Rodrigo Remedio;
moving_fun <- function(x, w, FUN, ...) {
# x: a double vector
# w: the length of the window, i.e., the section of the vector selected to apply FUN
# FUN: a function that takes a vector and return a summarize value, e.g., mean, sum, etc.
# Given a double type vector apply a FUN over a moving window from left to the right,
# when a window boundary is not a legal section, i.e. lower_bound and i (upper bound)
# are not contained in the length of the vector, return a NA_real_
if (w < 1) {
stop("The length of the window 'w' must be greater than 0")
}
output <- x
for (i in 1:length(x)) {
# plus 1 because the index is inclusive with the upper_bound 'i'
lower_bound <- i - w + 1
if (lower_bound < 1) {
output[i] <- NA_real_
} else {
output[i] <- FUN(x[lower_bound:i, ...])
}
}
output
}
# example
v <- seq(1:10)
# compute a MA(2)
moving_fun(v, 2, mean)
# compute moving sum of two periods
moving_fun(v, 2, sum)

You may calculate the moving average of a vector x with a window width of k by:
apply(embed(x, k), 1, mean)

The slider package can be used for this. It has an interface that has been specifically designed to feel similar to purrr. It accepts any arbitrary function, and can return any type of output. Data frames are even iterated over row wise. The pkgdown site is here.
library(slider)
x <- 1:3
# Mean of the current value + 1 value before it
# returned as a double vector
slide_dbl(x, ~mean(.x, na.rm = TRUE), .before = 1)
#> [1] 1.0 1.5 2.5
df <- data.frame(x = x, y = x)
# Slide row wise over data frames
slide(df, ~.x, .before = 1)
#> [[1]]
#> x y
#> 1 1 1
#>
#> [[2]]
#> x y
#> 1 1 1
#> 2 2 2
#>
#> [[3]]
#> x y
#> 1 2 2
#> 2 3 3
The overhead of both slider and data.table's frollapply() should be pretty low (much faster than zoo). frollapply() looks to be a little faster for this simple example here, but note that it only takes numeric input, and the output must be a scalar numeric value. slider functions are completely generic, and you can return any data type.
library(slider)
library(zoo)
library(data.table)
x <- 1:50000 + 0L
bench::mark(
slider = slide_int(x, function(x) 1L, .before = 5, .complete = TRUE),
zoo = rollapplyr(x, FUN = function(x) 1L, width = 6, fill = NA),
datatable = frollapply(x, n = 6, FUN = function(x) 1L),
iterations = 200
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slider 19.82ms 26.4ms 38.4 829.8KB 19.0
#> 2 zoo 177.92ms 211.1ms 4.71 17.9MB 24.8
#> 3 datatable 7.78ms 10.9ms 87.9 807.1KB 38.7

EDIT: took great joy in adding the side parameter, for a moving average (or sum, or ...) of e.g. the past 7 days of a Date vector.
For people just wanting to calculate this themselves, it's nothing more than:
# x = vector with numeric data
# w = window length
y <- numeric(length = length(x))
for (i in seq_len(length(x))) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- mean(x[ind])
}
y
But it gets fun to make it independent of mean(), so you can calculate any 'moving' function!
# our working horse:
moving_fn <- function(x, w, fun, ...) {
# x = vector with numeric data
# w = window length
# fun = function to apply
# side = side to take, (c)entre, (l)eft or (r)ight
# ... = parameters passed on to 'fun'
y <- numeric(length(x))
for (i in seq_len(length(x))) {
if (side %in% c("c", "centre", "center")) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
} else if (side %in% c("l", "left")) {
ind <- c((i - floor(w) + 1):i)
} else if (side %in% c("r", "right")) {
ind <- c(i:(i + floor(w) - 1))
} else {
stop("'side' must be one of 'centre', 'left', 'right'", call. = FALSE)
}
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- fun(x[ind], ...)
}
y
}
# and now any variation you can think of!
moving_average <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = mean, side = side, na.rm = na.rm)
}
moving_sum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = sum, side = side, na.rm = na.rm)
}
moving_maximum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = max, side = side, na.rm = na.rm)
}
moving_median <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = median, side = side, na.rm = na.rm)
}
moving_Q1 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.25)
}
moving_Q3 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.75)
}

Though a bit slow but you can also use zoo::rollapply to perform calculations on matrices.
reqd_ma <- rollapply(x, FUN = mean, width = n)
where x is the data set, FUN = mean is the function; you can also change it to min, max, sd etc and width is the rolling window.

One can use runner package for moving functions. In this case mean_run function. Problem with cummean is that it doesn't handle NA values, but mean_run does. runner package also supports irregular time series and windows can depend on date:
library(runner)
set.seed(11)
x1 <- rnorm(15)
x2 <- sample(c(rep(NA,5), rnorm(15)), 15, replace = TRUE)
date <- Sys.Date() + cumsum(sample(1:3, 15, replace = TRUE))
mean_run(x1)
#> [1] -0.5910311 -0.2822184 -0.6936633 -0.8609108 -0.4530308 -0.5332176
#> [7] -0.2679571 -0.1563477 -0.1440561 -0.2300625 -0.2844599 -0.2897842
#> [13] -0.3858234 -0.3765192 -0.4280809
mean_run(x2, na_rm = TRUE)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] -0.13873536 -0.14571604 -0.12596067 -0.11116961 -0.09881996 -0.08871569
#> [13] -0.05194292 -0.04699909 -0.05704202
mean_run(x2, na_rm = FALSE )
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] NA NA NA NA NA NA
#> [13] NA NA NA
mean_run(x2, na_rm = TRUE, k = 4)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.10546063 -0.16299272
#> [7] -0.21203756 -0.39209010 -0.13274756 -0.05603811 -0.03894684 0.01103493
#> [13] 0.09609256 0.09738460 0.04740283
mean_run(x2, na_rm = TRUE, k = 4, idx = date)
#> [1] -0.187600111 -0.090220655 -0.004349696 0.168349653 -0.206571573 -0.494335093
#> [7] -0.222969541 -0.187600111 -0.087636571 0.009742884 0.009742884 0.012326968
#> [13] 0.182442234 0.125737145 0.059094786
One can also specify other options like lag, and roll only at specific indexes. More in package and function documentation.

Here is a simple function with filter demonstrating one way to take care of beginning and ending NAs with padding, and computing a weighted average (supported by filter) using custom weights:
wma <- function(x) {
wts <- c(seq(0.5, 4, 0.5), seq(3.5, 0.5, -0.5))
nside <- (length(wts)-1)/2
# pad x with begin and end values for filter to avoid NAs
xp <- c(rep(first(x), nside), x, rep(last(x), nside))
z <- stats::filter(xp, wts/sum(wts), sides = 2) %>% as.vector
z[(nside+1):(nside+length(x))]
}

vector_avg <- function(x){
sum_x = 0
for(i in 1:length(x)){
if(!is.na(x[i]))
sum_x = sum_x + x[i]
}
return(sum_x/length(x))
}

I use aggregate along with a vector created by rep(). This has the advantage of using cbind() to aggregate more than 1 column in your dataframe at time. Below is an example of a moving average of 60 for a vector (v) of length 1000:
v=1:1000*0.002+rnorm(1000)
mrng=rep(1:round(length(v)/60+0.5), length.out=length(v), each=60)
aggregate(v~mrng, FUN=mean, na.rm=T)
Note the first argument in rep is to simply get enough unique values for the moving range, based on the length of the vector and the amount to be averaged; the second argument keeps the length equal to the vector length, and the last repeats the values of the first argument the same number of times as the averaging period.
In aggregate you could use several functions (median, max, min) - mean shown for example. Again, could could use a formula with cbind to do this on more than one (or all) columns in a dataframe.

Related

matching "sub-vectors" in R

so R has a great function, match, to find values in vectors (also %in% to test the existence). but what if I want to find a short vector in a big vector? that is, to test if a given vector is contained (in order!) in another vector? what if I want to find whether a given vector is a prefix/suffix of another vector? are there such functions in R?
example of what I would like:
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)
v_contains(x,y) # return TRUE x is contained in y
v_contains(z,y) # FALSE the values of z are in y, but not in the right order
v_match(x,y) # returns 2 because x appears in y starting at position 2
is there anything like it? how would you approach it efficiently?
A recent post uncovered this solution by Jonathan Carroll. I doubt a faster solution exists in R.
v_match <- function(needle, haystack, nomatch = 0L) {
sieved <- which(haystack == needle[1L])
for (i in seq.int(1L, length(needle) - 1L)) {
sieved <- sieved[haystack[sieved + i] == needle[i + 1L]]
}
sieved
}
v_contains <- function(needle, haystack) {
sieved <- which(haystack == needle[1L])
for (i in seq.int(1L, length(needle) - 1L)) {
sieved <- sieved[haystack[sieved + i] == needle[i + 1L]]
}
length(sieved) && !anyNA(sieved)
}
Tests and benchmarks:
library(testthat)
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)
expect_true(v_contains(x,y)) # return TRUE x is contained in y
expect_false(v_contains(z,y)) # FALSE the values of z are in y, but not in order
expect_equal(v_match(x,y), 2) # returns 2 because x appears in y starting at position 2
x <- c(5, 1, 3)
yes <- c(sample(5:1e6), c(5, 1, 3))
no <- c(sample(5:1e6), c(4, 1, 3))
expect_true(v_contains(x, yes))
expect_false(v_contains(x, no))
expect_equal(v_match(x, yes), 1e6 - 3)
v_contains_roll <- function(x, y) {
any(zoo::rollapply(y, length(x), identical, x))
}
v_contains_stri <- function(x, y) {
stringr::str_detect(paste(y, collapse = "_"),
paste(x, collapse = "_"))
}
options(digits = 2)
options(scipen = 99)
library(microbenchmark)
gc(0, 1, 1)
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 527502 28 1180915 63 527502 28
#> Vcells 3010073 23 8388608 64 3010073 23
microbenchmark(v_contains(x, yes),
v_contains(x, no),
v_contains_stri(x, yes),
v_contains_stri(x, no),
v_contains_roll(x, yes),
v_contains_roll(x, no),
times = 2L,
control = list(order = "block"))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> v_contains(x, yes) 3.8 3.8 3.8 3.8 3.9 3.9 2
#> v_contains(x, no) 3.7 3.7 3.7 3.7 3.8 3.8 2
#> v_contains_stri(x, yes) 1658.4 1658.4 1676.7 1676.7 1695.0 1695.0 2
#> v_contains_stri(x, no) 1632.3 1632.3 1770.0 1770.0 1907.8 1907.8 2
#> v_contains_roll(x, yes) 5447.4 5447.4 5666.1 5666.1 5884.7 5884.7 2
#> v_contains_roll(x, no) 5458.8 5458.8 5521.7 5521.7 5584.6 5584.6 2
#> cld
#> a
#> a
#> b
#> b
#> c
#> c
Created on 2018-08-18 by the reprex package (v0.2.0).
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)
# 1. return TRUE x is contained in y
stringr::str_detect(paste(y, collapse = "_"), paste(x, collapse = "_"))
# 2. FALSE the values of z are in y, but not in the right order
all(z %in% y) & stringr::str_detect(paste(y, collapse = "_"), paste(z, collapse = "_"))
# 3. returns 2 because x appears in y starting at position 2
stringr::str_locate(paste(y, collapse = "_"), paste(x, collapse = "_"))[1] - 1
If x and y are as in the question then here are some alternatives.
1) rollapply This checks if x is contained in y.
library(zoo)
any(rollapply(y, length(x), identical, x))
## [1] TRUE
2) embed Slightly more complex but still one line and no package dependencies.
any(apply(t(embed(y, length(x))) == rev(x), 2, all))
## [1] TRUE
2a) or this variation:
any(apply(embed(y, length(x)), 1, identical, rev(x)))
## [1] TRUE
3) strings Convert both x and y into character strings and use grepl. The comments to the question already point to code for such an approach.
4) Rcpp If speed is important then we can code it in C++. The standard library already has this as a builtin. Place this in a file Search.cpp in the current directory and from within R run library(Rcpp); sourceCpp("Search.cpp"). Then the R code Search(x, y) will invoke it.
#include <Rcpp.h>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
bool Search(NumericVector x, NumericVector y) {
return search(begin(y), end(y), begin(x), end(x)) != end(y);
}

How to perform sequential mathematical operations with R [duplicate]

I'm trying to use R to calculate the moving average over a series of values in a matrix. There doesn't seem to be a built-in function in R that will allow me to calculate moving averages. Do any packages provide one? Or do I need to write my own?
Or you can simply calculate it using filter, here's the function I use:
ma <- function(x, n = 5){filter(x, rep(1 / n, n), sides = 2)}
If you use dplyr, be careful to specify stats::filter in the function above.
Rolling Means/Maximums/Medians in the zoo package (rollmean)
MovingAverages in TTR
ma in forecast
Using cumsum should be sufficient and efficient. Assuming you have a vector x and you want a running sum of n numbers
cx <- c(0,cumsum(x))
rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
As pointed out in the comments by #mzuther, this assumes that there are no NAs in the data. to deal with those would require dividing each window by the number of non-NA values. Here's one way of doing that, incorporating the comment from #Ricardo Cruz:
cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))
cn <- c(0, cumsum(ifelse(is.na(x), 0, 1)))
rx <- cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]
rn <- cn[(n+1):length(cx)] - cn[1:(length(cx) - n)]
rsum <- rx / rn
This still has the issue that if all the values in the window are NAs then there will be a division by zero error.
In data.table 1.12.0 new frollmean function has been added to compute fast and exact rolling mean carefully handling NA, NaN and +Inf, -Inf values.
As there is no reproducible example in the question there is not much more to address here.
You can find more info about ?frollmean in manual, also available online at ?frollmean.
Examples from manual below:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three above are embarrassingly parallel using openmp
The caTools package has very fast rolling mean/min/max/sd and few other functions. I've only worked with runmean and runsd and they are the fastest of any of the other packages mentioned to date.
You could use RcppRoll for very quick moving averages written in C++. Just call the roll_mean function. Docs can be found here.
Otherwise, this (slower) for loop should do the trick:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n):i])
}
res
}
Here is example code showing how to compute a centered moving average and a trailing moving average using the rollmean function from the zoo package.
library(tidyverse)
library(zoo)
some_data = tibble(day = 1:10)
# cma = centered moving average
# tma = trailing moving average
some_data = some_data %>%
mutate(cma = rollmean(day, k = 3, fill = NA)) %>%
mutate(tma = rollmean(day, k = 3, fill = NA, align = "right"))
some_data
#> # A tibble: 10 x 3
#> day cma tma
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 2 2 NA
#> 3 3 3 2
#> 4 4 4 3
#> 5 5 5 4
#> 6 6 6 5
#> 7 7 7 6
#> 8 8 8 7
#> 9 9 9 8
#> 10 10 NA 9
In fact RcppRoll is very good.
The code posted by cantdutchthis must be corrected in the fourth line to the window be fixed:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
Another way, which handles missings, is given here.
A third way, improving cantdutchthis code to calculate partial averages or not, follows:
ma <- function(x, n=2,parcial=TRUE){
res = x #set the first values
if (parcial==TRUE){
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res
}else{
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res[-c(seq(1,n-1,1))] #remove the n-1 first,i.e., res[c(-3,-4,...)]
}
}
In order to complement the answer of cantdutchthis and Rodrigo Remedio;
moving_fun <- function(x, w, FUN, ...) {
# x: a double vector
# w: the length of the window, i.e., the section of the vector selected to apply FUN
# FUN: a function that takes a vector and return a summarize value, e.g., mean, sum, etc.
# Given a double type vector apply a FUN over a moving window from left to the right,
# when a window boundary is not a legal section, i.e. lower_bound and i (upper bound)
# are not contained in the length of the vector, return a NA_real_
if (w < 1) {
stop("The length of the window 'w' must be greater than 0")
}
output <- x
for (i in 1:length(x)) {
# plus 1 because the index is inclusive with the upper_bound 'i'
lower_bound <- i - w + 1
if (lower_bound < 1) {
output[i] <- NA_real_
} else {
output[i] <- FUN(x[lower_bound:i, ...])
}
}
output
}
# example
v <- seq(1:10)
# compute a MA(2)
moving_fun(v, 2, mean)
# compute moving sum of two periods
moving_fun(v, 2, sum)
You may calculate the moving average of a vector x with a window width of k by:
apply(embed(x, k), 1, mean)
The slider package can be used for this. It has an interface that has been specifically designed to feel similar to purrr. It accepts any arbitrary function, and can return any type of output. Data frames are even iterated over row wise. The pkgdown site is here.
library(slider)
x <- 1:3
# Mean of the current value + 1 value before it
# returned as a double vector
slide_dbl(x, ~mean(.x, na.rm = TRUE), .before = 1)
#> [1] 1.0 1.5 2.5
df <- data.frame(x = x, y = x)
# Slide row wise over data frames
slide(df, ~.x, .before = 1)
#> [[1]]
#> x y
#> 1 1 1
#>
#> [[2]]
#> x y
#> 1 1 1
#> 2 2 2
#>
#> [[3]]
#> x y
#> 1 2 2
#> 2 3 3
The overhead of both slider and data.table's frollapply() should be pretty low (much faster than zoo). frollapply() looks to be a little faster for this simple example here, but note that it only takes numeric input, and the output must be a scalar numeric value. slider functions are completely generic, and you can return any data type.
library(slider)
library(zoo)
library(data.table)
x <- 1:50000 + 0L
bench::mark(
slider = slide_int(x, function(x) 1L, .before = 5, .complete = TRUE),
zoo = rollapplyr(x, FUN = function(x) 1L, width = 6, fill = NA),
datatable = frollapply(x, n = 6, FUN = function(x) 1L),
iterations = 200
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slider 19.82ms 26.4ms 38.4 829.8KB 19.0
#> 2 zoo 177.92ms 211.1ms 4.71 17.9MB 24.8
#> 3 datatable 7.78ms 10.9ms 87.9 807.1KB 38.7
EDIT: took great joy in adding the side parameter, for a moving average (or sum, or ...) of e.g. the past 7 days of a Date vector.
For people just wanting to calculate this themselves, it's nothing more than:
# x = vector with numeric data
# w = window length
y <- numeric(length = length(x))
for (i in seq_len(length(x))) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- mean(x[ind])
}
y
But it gets fun to make it independent of mean(), so you can calculate any 'moving' function!
# our working horse:
moving_fn <- function(x, w, fun, ...) {
# x = vector with numeric data
# w = window length
# fun = function to apply
# side = side to take, (c)entre, (l)eft or (r)ight
# ... = parameters passed on to 'fun'
y <- numeric(length(x))
for (i in seq_len(length(x))) {
if (side %in% c("c", "centre", "center")) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
} else if (side %in% c("l", "left")) {
ind <- c((i - floor(w) + 1):i)
} else if (side %in% c("r", "right")) {
ind <- c(i:(i + floor(w) - 1))
} else {
stop("'side' must be one of 'centre', 'left', 'right'", call. = FALSE)
}
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- fun(x[ind], ...)
}
y
}
# and now any variation you can think of!
moving_average <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = mean, side = side, na.rm = na.rm)
}
moving_sum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = sum, side = side, na.rm = na.rm)
}
moving_maximum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = max, side = side, na.rm = na.rm)
}
moving_median <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = median, side = side, na.rm = na.rm)
}
moving_Q1 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.25)
}
moving_Q3 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.75)
}
Though a bit slow but you can also use zoo::rollapply to perform calculations on matrices.
reqd_ma <- rollapply(x, FUN = mean, width = n)
where x is the data set, FUN = mean is the function; you can also change it to min, max, sd etc and width is the rolling window.
One can use runner package for moving functions. In this case mean_run function. Problem with cummean is that it doesn't handle NA values, but mean_run does. runner package also supports irregular time series and windows can depend on date:
library(runner)
set.seed(11)
x1 <- rnorm(15)
x2 <- sample(c(rep(NA,5), rnorm(15)), 15, replace = TRUE)
date <- Sys.Date() + cumsum(sample(1:3, 15, replace = TRUE))
mean_run(x1)
#> [1] -0.5910311 -0.2822184 -0.6936633 -0.8609108 -0.4530308 -0.5332176
#> [7] -0.2679571 -0.1563477 -0.1440561 -0.2300625 -0.2844599 -0.2897842
#> [13] -0.3858234 -0.3765192 -0.4280809
mean_run(x2, na_rm = TRUE)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] -0.13873536 -0.14571604 -0.12596067 -0.11116961 -0.09881996 -0.08871569
#> [13] -0.05194292 -0.04699909 -0.05704202
mean_run(x2, na_rm = FALSE )
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] NA NA NA NA NA NA
#> [13] NA NA NA
mean_run(x2, na_rm = TRUE, k = 4)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.10546063 -0.16299272
#> [7] -0.21203756 -0.39209010 -0.13274756 -0.05603811 -0.03894684 0.01103493
#> [13] 0.09609256 0.09738460 0.04740283
mean_run(x2, na_rm = TRUE, k = 4, idx = date)
#> [1] -0.187600111 -0.090220655 -0.004349696 0.168349653 -0.206571573 -0.494335093
#> [7] -0.222969541 -0.187600111 -0.087636571 0.009742884 0.009742884 0.012326968
#> [13] 0.182442234 0.125737145 0.059094786
One can also specify other options like lag, and roll only at specific indexes. More in package and function documentation.
Here is a simple function with filter demonstrating one way to take care of beginning and ending NAs with padding, and computing a weighted average (supported by filter) using custom weights:
wma <- function(x) {
wts <- c(seq(0.5, 4, 0.5), seq(3.5, 0.5, -0.5))
nside <- (length(wts)-1)/2
# pad x with begin and end values for filter to avoid NAs
xp <- c(rep(first(x), nside), x, rep(last(x), nside))
z <- stats::filter(xp, wts/sum(wts), sides = 2) %>% as.vector
z[(nside+1):(nside+length(x))]
}
vector_avg <- function(x){
sum_x = 0
for(i in 1:length(x)){
if(!is.na(x[i]))
sum_x = sum_x + x[i]
}
return(sum_x/length(x))
}
I use aggregate along with a vector created by rep(). This has the advantage of using cbind() to aggregate more than 1 column in your dataframe at time. Below is an example of a moving average of 60 for a vector (v) of length 1000:
v=1:1000*0.002+rnorm(1000)
mrng=rep(1:round(length(v)/60+0.5), length.out=length(v), each=60)
aggregate(v~mrng, FUN=mean, na.rm=T)
Note the first argument in rep is to simply get enough unique values for the moving range, based on the length of the vector and the amount to be averaged; the second argument keeps the length equal to the vector length, and the last repeats the values of the first argument the same number of times as the averaging period.
In aggregate you could use several functions (median, max, min) - mean shown for example. Again, could could use a formula with cbind to do this on more than one (or all) columns in a dataframe.

How to efficiently do complex row operations with nested functions in R?

Given a multidimensional array, e.g. a zoo object z, with columns a,b,c,x. Given further a function W(w=c(1,1,1), x) which for example weights every column individually, but which also DEPENDS on the specific row value in column x. How to efficiently do row operations here, e.g. calculating the rowWeightedMeans?
It is known that R::zoo is very fast and efficient for row operations, if the function is very simple, e.g.:
W <- function(w) { return(w); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3)))
But what if W() depends on a value in that row? E.g.:
W <- function(w, x) { return(w*x); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3), z[,4]))
R complains here because it does not know how to hanlde the multi-dimensions of the arguments in the nested function.
The solution could be a for(i in 1:nrow(z)) loop, and computing the values individually for every row i. However, for large data sets this takes a enormous amount of extra computational effort and time.
EDIT
Ok guys, thanks for your time and critics. I tried and tested all your answers but must admit that the actual problem was not solved or understood. For example, I hadn't ask to rewrite my weight function or calculations, because I already presented a minimal version of much more complex calculations. The issue or question here lies much deeper. So I sat back and tried to boil down the problem to the root of the evil and found a minimal working example for you without any zoos, weightedMeans, and so on. Here you go:
z <- data.frame(matrix (1:20, nrow = 4))
colnames (z) <- c ("a", "b", "c", "x", "y")
z
# a b c x y
#1 1 5 9 13 17
#2 2 6 10 14 18
#3 3 7 11 15 19
#4 4 8 12 16 20
W <- function(abc, w, p) {
ifelse (w[1] == p, return(length(p)), return(0))
# Please do not complain! I know this is stupid, but it is an MWE
# and my calculations contained in W() are much more complex!
}
z[,"y"] <- W(z[,1:3], c(14,7,8), z[,"x"])
# same result: z[,"y"] <- apply(z[,1:3], 1, W, c(14,7,8), z[,"x"])
z
# a b c x y
#1 1 5 9 13 4
#2 2 6 10 14 4
#3 3 7 11 15 4
#4 4 8 12 16 4
# expected outcome:
# a b c x y
#1 1 5 9 13 0
#2 2 6 10 14 4
#3 3 7 11 15 0
#4 4 8 12 16 0
The problem I am facing is, that R passes all lines of z[,"x"] to the function, however, I expect it to take only the line which corresponds to the line of z[,"y"] that is currently processed internally when R loops through it. In this example, I expect 14==14 only in line number 2!
So: how to tell R to pass line by line to functions?
SOLUTION
Besides the awarded and accepted answer, I like to summarize the solution here to improve clarity and provide a better overview about the discussion.
This question was not about rewriting the specific function W (e.g. weighting). It was only about the inability of R to pass multiple row-by-row arguments to a general function. By either using z$y <- f(z$a, z$x) or z$y <- apply(z$a, 1, f, z$x), both methods only pass the first argument as row-by-row, and the second argument as a complete column with all rows. It seems that this is an intrinsic behaviour of R around which we need to work around.
To solve this, the whole row needs to be passed as a single argument to a wrapper function, which in turn then applies the specific calculations on that row. Solution for the problem with the weights:
f <- function(x) weighted.mean(x[1:3], W(c(0.1,0.5,0.3), x[4]))
z[,"wmean"] <- apply(z[,1:4], 1, f)
Solution for the geenral problem with the data frame:
f <- function(x) W(x[1:3], c(14,7,8), x[4])
z$y <- apply(z, 1, f)
Brian presents also even faster methods using compiled C code in his accepted answer. Thanks to #BrianAlbertMonroe, #jaimedash and #inscaven for dealing with the poorly clarified question and for hinting to this solution.
Haven't really worked with zoo or rowWeightedMeans but if you simply apply weights to row elements before taking the mean of them, and require the weights to depend on one of the elements of the row:
z <- matrix(rnorm(100),ncol=4)
W <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
sum(row2) / sum(weights)
}
w.means <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
If the above gives the correct answer but you're worried about quickness write the W function in Rcpp or use the built in cmpfun,
N <- 10000
z <- matrix(rnorm(N),ncol=4)
# Interpreted R function
W1 <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
mean(row2)
}
# Compiled R function
W2 <- compiler::cmpfun(W1)
# C++ function imported into R via Rcpp
Rcpp::cppFunction('double Wcpp(NumericVector row, NumericVector weights){
int x = row.size() ;
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
nweights = weights * row[x - 1];
for( int i = 0; i < (x-1) ; i++){
wrow[i] = row[i] * nweights[i];
}
double res = sum(wrow) / sum(nweights);
return(res);
}')
w.means0 <- apply(z,1,W,weights=c(0.1,0.5,0.3))
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3))
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3))
identical( w.means0, w.means1, w.means2 )
#[1] TRUE
Or
# Write the whole thing in C++
Rcpp::cppFunction('NumericVector WM(NumericMatrix z , NumericVector weights){
int x = z.ncol() ;
int y = z.nrow() ;
NumericVector res(y);
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
double nwsum;
double mult;
for( int row = 0 ; row < y ; row++){
mult = z(row,x-1);
nweights = weights * mult;
nwsum = sum(nweights);
for( int i = 0; i < (x-1) ; i++){
wrow[i] = z(row,i) * nweights[i] ;
}
res[row] = sum(wrow) / nwsum;
}
return(res);
}')
microbenchmark::microbenchmark(
w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3)),
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3)),
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3)),
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3))
)
Unit: microseconds
expr min lq mean median uq max neval
w.means0 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 12114.834 12536.9330 12995.1722 12838.2805 13163.4835 15796.403 100
w.means1 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 9941.571 10286.8085 10769.7330 10410.9465 10788.6800 19526.840 100
w.means2 <- apply(z, 1, Wcpp, weights = c(0.1, 0.5, 0.3)) 10919.112 11631.5530 12849.7294 13262.9705 13707.7465 17438.524 100
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3)) 94.172 107.9855 146.2606 125.0075 140.2695 2089.933 100
EDIT:
Incorporating the weighted.means function slows down the computation dramatically, and does not handle missing values specially according to the help file, so you will still need to write code to manage them.
> z <- matrix(rnorm(100),ncol=4)
> W <- function(row, weights){
+ weights <- weights * row[4]
+ row2 <- row[1:3] * weights
+ sum(row2) / sum(weights)
+
+ }
> W1 <- compiler::cmpfun(W)
> W2 <- function(row, weights){
+ weights <- weights * row[4]
+ weighted.mean(row[1:3],weights)
+ }
> W3 <- compiler::cmpfun(W2)
> w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
> w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3))
> identical(w.means1,w.means2)
[1] TRUE
> microbenchmark(
+ w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)),
+ w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)),
+ w.means2 < .... [TRUNCATED]
Unit: microseconds
expr min lq mean median uq max neval
w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)) 145.315 167.4550 172.8163 172.9120 180.6920 194.673 100
w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 124.087 134.3365 143.6803 137.8925 148.7145 225.459 100
w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 307.311 346.6320 356.4845 354.7325 371.7620 412.110 100
w.means2 <- apply(z, 1, W3, weights = c(0.1, 0.5, 0.3)) 280.073 308.7110 323.0156 324.1230 333.7305 407.963 100
Here's a solution with zoo::rollapply. It produces the same answer as matrixStats::rowWeightedMeans for the simpler case.
if(! require(matrixStats)) {
install.packages('matrixStats')
library(matrixStats)
}
if(! require(zoo)) {
install.packages('zoo')
library(zoo)
}
z <- zoo (matrix (1:20, nrow = 5))
colnames (z) <- c ("a", "b", "c", "x")
z$x <- 0 # so we can see an effect below...
z
## a b c x
## 1 1 6 11 0
## 2 2 7 12 0
## 3 3 8 13 0
## 4 4 9 14 0
## 5 5 10 15 0
weights <- c(0.1,0.5,0.3)
W <- function (w) { return(w); }
z$wmean <- rowWeightedMeans(z[,1:3], w=W(weights))
## z[,new]<- doesn't work to create new columns in zoo
## objects
## use $
rowWeightMean_zoo <- function (r, W, weights) {
s <- sum(W(weights))
return(sum(r[1:3] * W(weights) / s))
}
z$wmean_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo(r, W, weights))
z
For the requirement in the question, that the return value be dependent on some ancillary data in the row, rowWeightedMeans doesn't work. But, the function passed to rollapply can be modified to use other elements of the row.
W2 <- function (w, x) { return(w * x); }
# z$wmean2 <- rowWeightedMeans(z[,1:3], w=W2(c(0.1,0.5,0.3), z[,4]))
## doesn't work
## Error in rowWeightedMeans(z[, 1:3], w = W#(c(0.1, 0.5, 0.3), z[, 4])) :
## The length of argument 'w' is does not match the number of column in 'x': 5 != 3
## In addition: Warning message:
## In `*.default`(w, x) :
## longer object length is not a multiple of shorter object length
## Calls: rowWeightedMeans -> W -> Ops.zoo -> NextMethod
rowWeightMean_zoo_dependent <- function (r, W, weights) {
s <- sum(W(weights, r[4]))
return(sum(r[1:3] * W2(weights, r[4]) / s))
}
z$wmean2_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo_dependent(r, W2, weights))
z
## a b c x wmean wmean_zoo wmean2_zoo
## 1 1 6 11 0 7.111111 7.111111 NaN
## 2 2 7 12 0 8.111111 8.111111 NaN
## 3 3 8 13 0 9.111111 9.111111 NaN
## 4 4 9 14 0 10.111111 10.111111 NaN
## 5 5 10 15 0 11.111111 11.111111 NaN
I think this can be solved by clever reshaping. I would use dplyr for that - but the workflow should work similar for plyr or data.table - all these packages are heavily optimized.
for this example I assume the weight function is w(x) = w0 ^ x
Here I create some sample data z, and generic weights w (note I add a row number r to z):
library(dplyr)
library(tidyr)
N <- 10
z <- data.frame(r=1:N, a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
Now the calculation would be:
res <- z %>% gather(key,value,-r,-x) %>% # convert to long format, but keep row numbers and x
left_join(w, 'key') %>% # add generic weights
mutate(eff_weight = weight^x) %>% # calculate effective weights
group_by(r) %>% # group by the orignal lines for the weighted mean
summarise(ws = sum(value*eff_weight), ww=sum(eff_weight)) %>% # calculate to helper values
mutate(weighted_mean = ws/ww) %>% # effectively calculate the weighted mean
select(r, weighted_mean) # remove unneccesary output
left_join(z, res) # add to the original data
I added some notes - but if you have trouble understanding you could evaluate res stepwise (remove tail including %>%) and have a look at the results.
Update
took the challenge to find the way to do the same in base R:
N <- 10
z <- data.frame(a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
long.z <- reshape(z, idvar = "row", times=c('a','b','c'),
timevar='key',
varying = list(c('a','b','c')), direction = "long")
compose.z <- merge(long.z,w, by='key')
compose.z2 <- within(compose.z, eff.weight <- weight^x)
sum.stat <- by(compose.z2, compose.z2$row, function(x) {sum(x$a * x$eff.weight )/sum(x$eff.weight)})
nice.data <- c(sum.stat)
It requires a bit more verbose function. But the same pattern can be applied.

average sequence of lag variables R [duplicate]

I'm trying to use R to calculate the moving average over a series of values in a matrix. There doesn't seem to be a built-in function in R that will allow me to calculate moving averages. Do any packages provide one? Or do I need to write my own?
Or you can simply calculate it using filter, here's the function I use:
ma <- function(x, n = 5){filter(x, rep(1 / n, n), sides = 2)}
If you use dplyr, be careful to specify stats::filter in the function above.
Rolling Means/Maximums/Medians in the zoo package (rollmean)
MovingAverages in TTR
ma in forecast
Using cumsum should be sufficient and efficient. Assuming you have a vector x and you want a running sum of n numbers
cx <- c(0,cumsum(x))
rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
As pointed out in the comments by #mzuther, this assumes that there are no NAs in the data. to deal with those would require dividing each window by the number of non-NA values. Here's one way of doing that, incorporating the comment from #Ricardo Cruz:
cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))
cn <- c(0, cumsum(ifelse(is.na(x), 0, 1)))
rx <- cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]
rn <- cn[(n+1):length(cx)] - cn[1:(length(cx) - n)]
rsum <- rx / rn
This still has the issue that if all the values in the window are NAs then there will be a division by zero error.
In data.table 1.12.0 new frollmean function has been added to compute fast and exact rolling mean carefully handling NA, NaN and +Inf, -Inf values.
As there is no reproducible example in the question there is not much more to address here.
You can find more info about ?frollmean in manual, also available online at ?frollmean.
Examples from manual below:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three above are embarrassingly parallel using openmp
The caTools package has very fast rolling mean/min/max/sd and few other functions. I've only worked with runmean and runsd and they are the fastest of any of the other packages mentioned to date.
You could use RcppRoll for very quick moving averages written in C++. Just call the roll_mean function. Docs can be found here.
Otherwise, this (slower) for loop should do the trick:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n):i])
}
res
}
Here is example code showing how to compute a centered moving average and a trailing moving average using the rollmean function from the zoo package.
library(tidyverse)
library(zoo)
some_data = tibble(day = 1:10)
# cma = centered moving average
# tma = trailing moving average
some_data = some_data %>%
mutate(cma = rollmean(day, k = 3, fill = NA)) %>%
mutate(tma = rollmean(day, k = 3, fill = NA, align = "right"))
some_data
#> # A tibble: 10 x 3
#> day cma tma
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 2 2 NA
#> 3 3 3 2
#> 4 4 4 3
#> 5 5 5 4
#> 6 6 6 5
#> 7 7 7 6
#> 8 8 8 7
#> 9 9 9 8
#> 10 10 NA 9
In fact RcppRoll is very good.
The code posted by cantdutchthis must be corrected in the fourth line to the window be fixed:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
Another way, which handles missings, is given here.
A third way, improving cantdutchthis code to calculate partial averages or not, follows:
ma <- function(x, n=2,parcial=TRUE){
res = x #set the first values
if (parcial==TRUE){
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res
}else{
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res[-c(seq(1,n-1,1))] #remove the n-1 first,i.e., res[c(-3,-4,...)]
}
}
In order to complement the answer of cantdutchthis and Rodrigo Remedio;
moving_fun <- function(x, w, FUN, ...) {
# x: a double vector
# w: the length of the window, i.e., the section of the vector selected to apply FUN
# FUN: a function that takes a vector and return a summarize value, e.g., mean, sum, etc.
# Given a double type vector apply a FUN over a moving window from left to the right,
# when a window boundary is not a legal section, i.e. lower_bound and i (upper bound)
# are not contained in the length of the vector, return a NA_real_
if (w < 1) {
stop("The length of the window 'w' must be greater than 0")
}
output <- x
for (i in 1:length(x)) {
# plus 1 because the index is inclusive with the upper_bound 'i'
lower_bound <- i - w + 1
if (lower_bound < 1) {
output[i] <- NA_real_
} else {
output[i] <- FUN(x[lower_bound:i, ...])
}
}
output
}
# example
v <- seq(1:10)
# compute a MA(2)
moving_fun(v, 2, mean)
# compute moving sum of two periods
moving_fun(v, 2, sum)
You may calculate the moving average of a vector x with a window width of k by:
apply(embed(x, k), 1, mean)
The slider package can be used for this. It has an interface that has been specifically designed to feel similar to purrr. It accepts any arbitrary function, and can return any type of output. Data frames are even iterated over row wise. The pkgdown site is here.
library(slider)
x <- 1:3
# Mean of the current value + 1 value before it
# returned as a double vector
slide_dbl(x, ~mean(.x, na.rm = TRUE), .before = 1)
#> [1] 1.0 1.5 2.5
df <- data.frame(x = x, y = x)
# Slide row wise over data frames
slide(df, ~.x, .before = 1)
#> [[1]]
#> x y
#> 1 1 1
#>
#> [[2]]
#> x y
#> 1 1 1
#> 2 2 2
#>
#> [[3]]
#> x y
#> 1 2 2
#> 2 3 3
The overhead of both slider and data.table's frollapply() should be pretty low (much faster than zoo). frollapply() looks to be a little faster for this simple example here, but note that it only takes numeric input, and the output must be a scalar numeric value. slider functions are completely generic, and you can return any data type.
library(slider)
library(zoo)
library(data.table)
x <- 1:50000 + 0L
bench::mark(
slider = slide_int(x, function(x) 1L, .before = 5, .complete = TRUE),
zoo = rollapplyr(x, FUN = function(x) 1L, width = 6, fill = NA),
datatable = frollapply(x, n = 6, FUN = function(x) 1L),
iterations = 200
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slider 19.82ms 26.4ms 38.4 829.8KB 19.0
#> 2 zoo 177.92ms 211.1ms 4.71 17.9MB 24.8
#> 3 datatable 7.78ms 10.9ms 87.9 807.1KB 38.7
EDIT: took great joy in adding the side parameter, for a moving average (or sum, or ...) of e.g. the past 7 days of a Date vector.
For people just wanting to calculate this themselves, it's nothing more than:
# x = vector with numeric data
# w = window length
y <- numeric(length = length(x))
for (i in seq_len(length(x))) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- mean(x[ind])
}
y
But it gets fun to make it independent of mean(), so you can calculate any 'moving' function!
# our working horse:
moving_fn <- function(x, w, fun, ...) {
# x = vector with numeric data
# w = window length
# fun = function to apply
# side = side to take, (c)entre, (l)eft or (r)ight
# ... = parameters passed on to 'fun'
y <- numeric(length(x))
for (i in seq_len(length(x))) {
if (side %in% c("c", "centre", "center")) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
} else if (side %in% c("l", "left")) {
ind <- c((i - floor(w) + 1):i)
} else if (side %in% c("r", "right")) {
ind <- c(i:(i + floor(w) - 1))
} else {
stop("'side' must be one of 'centre', 'left', 'right'", call. = FALSE)
}
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- fun(x[ind], ...)
}
y
}
# and now any variation you can think of!
moving_average <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = mean, side = side, na.rm = na.rm)
}
moving_sum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = sum, side = side, na.rm = na.rm)
}
moving_maximum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = max, side = side, na.rm = na.rm)
}
moving_median <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = median, side = side, na.rm = na.rm)
}
moving_Q1 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.25)
}
moving_Q3 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.75)
}
Though a bit slow but you can also use zoo::rollapply to perform calculations on matrices.
reqd_ma <- rollapply(x, FUN = mean, width = n)
where x is the data set, FUN = mean is the function; you can also change it to min, max, sd etc and width is the rolling window.
One can use runner package for moving functions. In this case mean_run function. Problem with cummean is that it doesn't handle NA values, but mean_run does. runner package also supports irregular time series and windows can depend on date:
library(runner)
set.seed(11)
x1 <- rnorm(15)
x2 <- sample(c(rep(NA,5), rnorm(15)), 15, replace = TRUE)
date <- Sys.Date() + cumsum(sample(1:3, 15, replace = TRUE))
mean_run(x1)
#> [1] -0.5910311 -0.2822184 -0.6936633 -0.8609108 -0.4530308 -0.5332176
#> [7] -0.2679571 -0.1563477 -0.1440561 -0.2300625 -0.2844599 -0.2897842
#> [13] -0.3858234 -0.3765192 -0.4280809
mean_run(x2, na_rm = TRUE)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] -0.13873536 -0.14571604 -0.12596067 -0.11116961 -0.09881996 -0.08871569
#> [13] -0.05194292 -0.04699909 -0.05704202
mean_run(x2, na_rm = FALSE )
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] NA NA NA NA NA NA
#> [13] NA NA NA
mean_run(x2, na_rm = TRUE, k = 4)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.10546063 -0.16299272
#> [7] -0.21203756 -0.39209010 -0.13274756 -0.05603811 -0.03894684 0.01103493
#> [13] 0.09609256 0.09738460 0.04740283
mean_run(x2, na_rm = TRUE, k = 4, idx = date)
#> [1] -0.187600111 -0.090220655 -0.004349696 0.168349653 -0.206571573 -0.494335093
#> [7] -0.222969541 -0.187600111 -0.087636571 0.009742884 0.009742884 0.012326968
#> [13] 0.182442234 0.125737145 0.059094786
One can also specify other options like lag, and roll only at specific indexes. More in package and function documentation.
Here is a simple function with filter demonstrating one way to take care of beginning and ending NAs with padding, and computing a weighted average (supported by filter) using custom weights:
wma <- function(x) {
wts <- c(seq(0.5, 4, 0.5), seq(3.5, 0.5, -0.5))
nside <- (length(wts)-1)/2
# pad x with begin and end values for filter to avoid NAs
xp <- c(rep(first(x), nside), x, rep(last(x), nside))
z <- stats::filter(xp, wts/sum(wts), sides = 2) %>% as.vector
z[(nside+1):(nside+length(x))]
}
vector_avg <- function(x){
sum_x = 0
for(i in 1:length(x)){
if(!is.na(x[i]))
sum_x = sum_x + x[i]
}
return(sum_x/length(x))
}
I use aggregate along with a vector created by rep(). This has the advantage of using cbind() to aggregate more than 1 column in your dataframe at time. Below is an example of a moving average of 60 for a vector (v) of length 1000:
v=1:1000*0.002+rnorm(1000)
mrng=rep(1:round(length(v)/60+0.5), length.out=length(v), each=60)
aggregate(v~mrng, FUN=mean, na.rm=T)
Note the first argument in rep is to simply get enough unique values for the moving range, based on the length of the vector and the amount to be averaged; the second argument keeps the length equal to the vector length, and the last repeats the values of the first argument the same number of times as the averaging period.
In aggregate you could use several functions (median, max, min) - mean shown for example. Again, could could use a formula with cbind to do this on more than one (or all) columns in a dataframe.

Improving efficiency in R program [duplicate]

I'm trying to use R to calculate the moving average over a series of values in a matrix. There doesn't seem to be a built-in function in R that will allow me to calculate moving averages. Do any packages provide one? Or do I need to write my own?
Or you can simply calculate it using filter, here's the function I use:
ma <- function(x, n = 5){filter(x, rep(1 / n, n), sides = 2)}
If you use dplyr, be careful to specify stats::filter in the function above.
Rolling Means/Maximums/Medians in the zoo package (rollmean)
MovingAverages in TTR
ma in forecast
Using cumsum should be sufficient and efficient. Assuming you have a vector x and you want a running sum of n numbers
cx <- c(0,cumsum(x))
rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
As pointed out in the comments by #mzuther, this assumes that there are no NAs in the data. to deal with those would require dividing each window by the number of non-NA values. Here's one way of doing that, incorporating the comment from #Ricardo Cruz:
cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))
cn <- c(0, cumsum(ifelse(is.na(x), 0, 1)))
rx <- cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]
rn <- cn[(n+1):length(cx)] - cn[1:(length(cx) - n)]
rsum <- rx / rn
This still has the issue that if all the values in the window are NAs then there will be a division by zero error.
In data.table 1.12.0 new frollmean function has been added to compute fast and exact rolling mean carefully handling NA, NaN and +Inf, -Inf values.
As there is no reproducible example in the question there is not much more to address here.
You can find more info about ?frollmean in manual, also available online at ?frollmean.
Examples from manual below:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three above are embarrassingly parallel using openmp
The caTools package has very fast rolling mean/min/max/sd and few other functions. I've only worked with runmean and runsd and they are the fastest of any of the other packages mentioned to date.
You could use RcppRoll for very quick moving averages written in C++. Just call the roll_mean function. Docs can be found here.
Otherwise, this (slower) for loop should do the trick:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n):i])
}
res
}
Here is example code showing how to compute a centered moving average and a trailing moving average using the rollmean function from the zoo package.
library(tidyverse)
library(zoo)
some_data = tibble(day = 1:10)
# cma = centered moving average
# tma = trailing moving average
some_data = some_data %>%
mutate(cma = rollmean(day, k = 3, fill = NA)) %>%
mutate(tma = rollmean(day, k = 3, fill = NA, align = "right"))
some_data
#> # A tibble: 10 x 3
#> day cma tma
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 2 2 NA
#> 3 3 3 2
#> 4 4 4 3
#> 5 5 5 4
#> 6 6 6 5
#> 7 7 7 6
#> 8 8 8 7
#> 9 9 9 8
#> 10 10 NA 9
In fact RcppRoll is very good.
The code posted by cantdutchthis must be corrected in the fourth line to the window be fixed:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
Another way, which handles missings, is given here.
A third way, improving cantdutchthis code to calculate partial averages or not, follows:
ma <- function(x, n=2,parcial=TRUE){
res = x #set the first values
if (parcial==TRUE){
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res
}else{
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res[-c(seq(1,n-1,1))] #remove the n-1 first,i.e., res[c(-3,-4,...)]
}
}
In order to complement the answer of cantdutchthis and Rodrigo Remedio;
moving_fun <- function(x, w, FUN, ...) {
# x: a double vector
# w: the length of the window, i.e., the section of the vector selected to apply FUN
# FUN: a function that takes a vector and return a summarize value, e.g., mean, sum, etc.
# Given a double type vector apply a FUN over a moving window from left to the right,
# when a window boundary is not a legal section, i.e. lower_bound and i (upper bound)
# are not contained in the length of the vector, return a NA_real_
if (w < 1) {
stop("The length of the window 'w' must be greater than 0")
}
output <- x
for (i in 1:length(x)) {
# plus 1 because the index is inclusive with the upper_bound 'i'
lower_bound <- i - w + 1
if (lower_bound < 1) {
output[i] <- NA_real_
} else {
output[i] <- FUN(x[lower_bound:i, ...])
}
}
output
}
# example
v <- seq(1:10)
# compute a MA(2)
moving_fun(v, 2, mean)
# compute moving sum of two periods
moving_fun(v, 2, sum)
You may calculate the moving average of a vector x with a window width of k by:
apply(embed(x, k), 1, mean)
The slider package can be used for this. It has an interface that has been specifically designed to feel similar to purrr. It accepts any arbitrary function, and can return any type of output. Data frames are even iterated over row wise. The pkgdown site is here.
library(slider)
x <- 1:3
# Mean of the current value + 1 value before it
# returned as a double vector
slide_dbl(x, ~mean(.x, na.rm = TRUE), .before = 1)
#> [1] 1.0 1.5 2.5
df <- data.frame(x = x, y = x)
# Slide row wise over data frames
slide(df, ~.x, .before = 1)
#> [[1]]
#> x y
#> 1 1 1
#>
#> [[2]]
#> x y
#> 1 1 1
#> 2 2 2
#>
#> [[3]]
#> x y
#> 1 2 2
#> 2 3 3
The overhead of both slider and data.table's frollapply() should be pretty low (much faster than zoo). frollapply() looks to be a little faster for this simple example here, but note that it only takes numeric input, and the output must be a scalar numeric value. slider functions are completely generic, and you can return any data type.
library(slider)
library(zoo)
library(data.table)
x <- 1:50000 + 0L
bench::mark(
slider = slide_int(x, function(x) 1L, .before = 5, .complete = TRUE),
zoo = rollapplyr(x, FUN = function(x) 1L, width = 6, fill = NA),
datatable = frollapply(x, n = 6, FUN = function(x) 1L),
iterations = 200
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slider 19.82ms 26.4ms 38.4 829.8KB 19.0
#> 2 zoo 177.92ms 211.1ms 4.71 17.9MB 24.8
#> 3 datatable 7.78ms 10.9ms 87.9 807.1KB 38.7
EDIT: took great joy in adding the side parameter, for a moving average (or sum, or ...) of e.g. the past 7 days of a Date vector.
For people just wanting to calculate this themselves, it's nothing more than:
# x = vector with numeric data
# w = window length
y <- numeric(length = length(x))
for (i in seq_len(length(x))) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- mean(x[ind])
}
y
But it gets fun to make it independent of mean(), so you can calculate any 'moving' function!
# our working horse:
moving_fn <- function(x, w, fun, ...) {
# x = vector with numeric data
# w = window length
# fun = function to apply
# side = side to take, (c)entre, (l)eft or (r)ight
# ... = parameters passed on to 'fun'
y <- numeric(length(x))
for (i in seq_len(length(x))) {
if (side %in% c("c", "centre", "center")) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
} else if (side %in% c("l", "left")) {
ind <- c((i - floor(w) + 1):i)
} else if (side %in% c("r", "right")) {
ind <- c(i:(i + floor(w) - 1))
} else {
stop("'side' must be one of 'centre', 'left', 'right'", call. = FALSE)
}
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- fun(x[ind], ...)
}
y
}
# and now any variation you can think of!
moving_average <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = mean, side = side, na.rm = na.rm)
}
moving_sum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = sum, side = side, na.rm = na.rm)
}
moving_maximum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = max, side = side, na.rm = na.rm)
}
moving_median <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = median, side = side, na.rm = na.rm)
}
moving_Q1 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.25)
}
moving_Q3 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.75)
}
Though a bit slow but you can also use zoo::rollapply to perform calculations on matrices.
reqd_ma <- rollapply(x, FUN = mean, width = n)
where x is the data set, FUN = mean is the function; you can also change it to min, max, sd etc and width is the rolling window.
One can use runner package for moving functions. In this case mean_run function. Problem with cummean is that it doesn't handle NA values, but mean_run does. runner package also supports irregular time series and windows can depend on date:
library(runner)
set.seed(11)
x1 <- rnorm(15)
x2 <- sample(c(rep(NA,5), rnorm(15)), 15, replace = TRUE)
date <- Sys.Date() + cumsum(sample(1:3, 15, replace = TRUE))
mean_run(x1)
#> [1] -0.5910311 -0.2822184 -0.6936633 -0.8609108 -0.4530308 -0.5332176
#> [7] -0.2679571 -0.1563477 -0.1440561 -0.2300625 -0.2844599 -0.2897842
#> [13] -0.3858234 -0.3765192 -0.4280809
mean_run(x2, na_rm = TRUE)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] -0.13873536 -0.14571604 -0.12596067 -0.11116961 -0.09881996 -0.08871569
#> [13] -0.05194292 -0.04699909 -0.05704202
mean_run(x2, na_rm = FALSE )
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] NA NA NA NA NA NA
#> [13] NA NA NA
mean_run(x2, na_rm = TRUE, k = 4)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.10546063 -0.16299272
#> [7] -0.21203756 -0.39209010 -0.13274756 -0.05603811 -0.03894684 0.01103493
#> [13] 0.09609256 0.09738460 0.04740283
mean_run(x2, na_rm = TRUE, k = 4, idx = date)
#> [1] -0.187600111 -0.090220655 -0.004349696 0.168349653 -0.206571573 -0.494335093
#> [7] -0.222969541 -0.187600111 -0.087636571 0.009742884 0.009742884 0.012326968
#> [13] 0.182442234 0.125737145 0.059094786
One can also specify other options like lag, and roll only at specific indexes. More in package and function documentation.
Here is a simple function with filter demonstrating one way to take care of beginning and ending NAs with padding, and computing a weighted average (supported by filter) using custom weights:
wma <- function(x) {
wts <- c(seq(0.5, 4, 0.5), seq(3.5, 0.5, -0.5))
nside <- (length(wts)-1)/2
# pad x with begin and end values for filter to avoid NAs
xp <- c(rep(first(x), nside), x, rep(last(x), nside))
z <- stats::filter(xp, wts/sum(wts), sides = 2) %>% as.vector
z[(nside+1):(nside+length(x))]
}
vector_avg <- function(x){
sum_x = 0
for(i in 1:length(x)){
if(!is.na(x[i]))
sum_x = sum_x + x[i]
}
return(sum_x/length(x))
}
I use aggregate along with a vector created by rep(). This has the advantage of using cbind() to aggregate more than 1 column in your dataframe at time. Below is an example of a moving average of 60 for a vector (v) of length 1000:
v=1:1000*0.002+rnorm(1000)
mrng=rep(1:round(length(v)/60+0.5), length.out=length(v), each=60)
aggregate(v~mrng, FUN=mean, na.rm=T)
Note the first argument in rep is to simply get enough unique values for the moving range, based on the length of the vector and the amount to be averaged; the second argument keeps the length equal to the vector length, and the last repeats the values of the first argument the same number of times as the averaging period.
In aggregate you could use several functions (median, max, min) - mean shown for example. Again, could could use a formula with cbind to do this on more than one (or all) columns in a dataframe.

Resources