How can I use rollapply with a 5 month window?

I noticed this example in the documentation of rollapply() for rolling by a 3-day window:
## rolling mean by time window (e.g., 3 days) rather than
## by number of observations (e.g., when these are unequally spaced):
#
## - test data
tt <- as.Date("2000-01-01") + c(1, 2, 5, 6, 7, 8, 10)
z <- zoo(seq_along(tt), tt)
## - fill it out to a daily series, zm, using NAs
## using a zero width zoo series g on a grid
g <- zoo(, seq(start(z), end(z), "day"))
zm <- merge(z, g)
## - 3-day rolling mean
rollapply(zm, 3, mean, na.rm = TRUE, fill = NA)
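The grid trick works because zm has one entry per calendar day, with NAs on the days missing from z (here Jan 4, Jan 5 and Jan 10), so a width of 3 now really means 3 calendar days:
## inspect the filled-out series: the missing days show up as NA
zm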
Suppose I have the following data:
data.zoo <- read.zoo(
  data.frame(
    date = sample(seq(as.Date('2001-04-12'), as.Date("2019-04-05"), by = "day"), 600),
    val = runif(600),
    val2 = runif(600)
  ))
Is it possible to somehow use rollapply() with a 5-month rolling window to calculate the rolling mean of val? The problem with a 5-month rolling window is that the number of days in a month varies...
NOTE: I would prefer a base-R solution but other libraries would be interesting to see

Since width can be a vector of widths, one for each row of the input, we can simply compute the number of days between each date and 5 months prior and use those numbers for the width vector:
library(zoo)
ym <- as.yearmon(time(data.zoo))
w <- as.Date(ym) - as.Date(ym - 5/12)
r <- rollapplyr(data.zoo, w, mean, fill = NA)
Alternately we could write w like this with lubridate.
library(lubridate)
w <- time(data.zoo) - (time(data.zoo) %m-% months(5))
Update
If there can be missing dates then
library(lubridate)
w <- sapply(time(data.zoo), function(x)
  length(intersect(seq(x %m-% months(5), x, "day"), time(data.zoo))))
or repeat this, replacing %m-% months(5) with subtract5m, which does not use additional packages:
subtract5m <- function(x) {
  if (length(x) == 1) seq(x, length = 2, by = "-5 month")[2]
  else as.Date(sapply(x, subtract5m))
}
w <- sapply(time(data.zoo), function(x)
  length(intersect(seq(subtract5m(x), x, "day"), time(data.zoo))))
Note that because the definition of "5 months ago" is ambiguous, the various computations of w may differ slightly, as each makes slightly different assumptions.
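As an illustration of that ambiguity (my own example, not from the original answer), take 2001-07-31: subtracting 5 months lands on the non-existent Feb 31, and the two approaches resolve it differently:
seq(as.Date("2001-07-31"), length = 2, by = "-5 month")[2]  # "2001-03-03" (base R overflows past Feb 28)
as.Date("2001-07-31") %m-% months(5)                        # "2001-02-28" (lubridate rolls back)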

Improving on G. Grothendieck's ideas, I went with:
ym <- as.yearmon(time(data.zoo))
ym.cutoff.ideal <- ym - 5/12
ym.cutoff.closest.to.ideal <- as.yearmon(time(data.zoo)[findInterval(as.Date(ym.cutoff.ideal), as.Date(ym)) + 1])
w <- time(data.zoo) - as.Date(ym.cutoff.closest.to.ideal) + 1
r <- rollapplyr(data.zoo, w, mean, fill = NA)
It looks like it is working correctly...

Related

rollapply for moving average with non-business day

I'd like to compute a moving average on data that has NAs in the middle, like below.
library(xts)
date <- seq.Date(as.Date("2018-07-02"), as.Date("2018-07-14"), by = "days")
A <- c(100,110,120,130,140,NA,NA,150,160,170,180,190,200)
B <- c(200,220,240,260,280,NA,NA,300,320,340,360,380,400)
C <- c(150,160,170,180,190,200,210,NA,NA,220,230,240,250)
dataset <- data.frame(A,B,C)
dataset <- as.xts(dataset, order.by = date)
If I use rollapply like below to get the 3-day moving average...
y <- rollapply(dataset, width = 3, function(x) mean(x, na.rm = TRUE ))
This is not what I want. For example, in the moving average of A at "2018-07-09", the result is (NA+NA+150)/1 = 150 because the NAs are dropped from the mean. But I want to get (130+140+150)/3 = 140.
How can I do that?
I assume you want NAs to stay as NA and otherwise to take the mean of the last 3 non-NAs.
1) Take 5 elements at a time and if the last element is NA then return NA; otherwise, remove the NAs and take the mean of the last 3. Note that this does imply that the first 4 rows will be NA.
mean_bus <- function(x) if (is.na(tail(x, 1))) NA else mean(tail(na.omit(x), 3))
y1 <- rollapplyr(dataset, width = 5, mean_bus)
2) An alternative would be to take the mean of the last 3 non-NAs and then overwrite the result with NA in all positions where the input is NA.
mean_omit <- function(x) mean(tail(na.omit(x), 3))
y <- rollapplyr(dataset, 5, mean_omit)
y2 <- replace(y, is.na(dataset), NA)
all.equal(y1, y2)
## [1] TRUE
3) If you prefer to fill in the first 4 rows with partial values then convert to zoo and use the partial= argument of rollapplyr.zoo. mean_bus is from (1).
y3 <- as.xts(rollapplyr(as.zoo(dataset), 5, mean_bus, partial = TRUE))
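As a quick sanity check of (1) against the value expected in the question:
y1["2018-07-09", "A"]  ## (130 + 140 + 150)/3 = 140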
You could either remove the NAs in each series before you compute the moving average (MA), or use a larger window and keep only the last three non-NA values for the MA:
y <- rollapply(dataset, width = 5,
               function(x) mean(tail(x[!is.na(x)], 3)))

Calculate a weighted sum from a timeseries with irregular spacing based on a function

Given a data frame containing a time series with irregularly spaced intervals, defined as:
df <- data.frame(date = as.Date("2016-01-01") + ((1:100) + sample(1:5, 100, replace = TRUE)),
                 data = rnorm(100))
How can I calculate a rolling sum of the data column over the previous 30 days, with weights defined by this decay function?
tau <- 0.05
decay <- function(tau, day) {
  exp(-tau * day)
}
The current day's data then has a weight of 1 and the data from 30 days ago has a weight of decay(0.05, 30) = 0.2231302. Missing days from the input time series should still be accounted for in computing the weights using the decay function.
If possible, I would like to convert the data frame to a zoo or xts object and then use the rollapplyr function or similar, and to do this with dplyr pipes.
Define a function weighted that takes the last 30 points, keeps only those within 30 days of the last one, and then computes the weighted sum of those values.
In the pipeline we convert df to zoo and then use rollapplyr with weighted. Note that it is important that we use coredata = FALSE so that the time index is passed to weighted; without it, only the raw values would be.
library(dplyr)
library(zoo)
weighted <- function(x, tau) {
  tx <- time(x)
  cx <- coredata(x)[tx > tail(tx, 1) - 30]  # only keep points within 30 days of the last
  w <- decay(tau, seq(to = 0, by = -1, length.out = length(cx)))
  sum(w * cx)
}
df %>%
  read.zoo %>%
  rollapplyr(30, weighted, tau = tau, partial = TRUE, coredata = FALSE)
If you want to treat missing days as 0 then use this instead:
weighted <- function(x, tau) {
  tx <- as.numeric(time(x))
  days <- tail(tx, 1) - tx
  w <- (days < 30) * decay(tau, days)
  sum(w * coredata(x))
}
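To see how the zero-weighting behaves, here is a small illustration with made-up day offsets: a point 35 days old gets weight 0, while points inside the window keep their decayed weights:
days <- c(35, 29, 10, 0)
(days < 30) * decay(0.05, days)
## [1] 0.0000000 0.2345703 0.6065307 1.0000000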
Note
We have used the following input, modified from the question by adding set.seed for reproducibility. Also, the code in the question can by chance produce multiple values with the same date, so we eliminated such duplicates.
set.seed(123)
df <- data.frame(date = as.Date("2016-01-01") + 1:100 + sample(1:5, 100, replace = TRUE),
                 data = rnorm(100))
df <- df[!duplicated(df$date), ]
tau <- 0.05
decay <- function(tau, day) {
  exp(-tau * day)
}
I am not sure about pipes, but this should get you going:
d <- decay(tau, 29:0)
rollapply(df, 30, function(z) {
  data <- as.data.frame(z, stringsAsFactors = FALSE)
  data$data <- as.numeric(data$data)
  sum(data$data * d)
}, by.column = FALSE)

Rolling mean with different window length

How can I calculate a two-month rolling mean if the two-month windows are not of equal length? Preferably in data.table. Sample code:
set.seed(24)
test <- data.table(x = rnorm(762), time = seq(as.Date("1988/03/15"), as.Date("1990/04/15"), "day"))
So here, the first mean would be from 1988/03/15 until 1988/04/30, the second one from 1988/04/01 until 1988/05/31, and so on. The data availability in each month can be of different length, by intention.
Add a yearmon column and then summarize the sum and length of x by yearmon.
Finally divide a rolling sum over x by a rolling sum over the length N.
library(data.table)
library(zoo)
Means <- test[, yearmon := as.yearmon(time)][
  , list(x = sum(x), N = .N), by = "yearmon"][
  , list(yearmon, mean = rollsumr(x, 2, fill = NA) / rollsumr(N, 2, fill = NA))]
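rollsumr is zoo's right-aligned rolling sum, so each month is paired with the one before it; a tiny example of its behavior:
rollsumr(1:5, 2, fill = NA)
## [1] NA  3  5  7  9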
Alternately, convert test to a zoo object, aggregate the sum of x and the count n by yearmon, calculate the rolling sum of both, and divide, giving a zoo object with the year/months and means. See ?fortify.zoo if you would like to convert that to a data frame.
z <- cbind(x = read.zoo(test, index = "time"), n = 1)
zym <- aggregate(z, as.yearmon, sum)
transform(rollsumr(zym, 2), mean = x / n)

Faster way to subset a data table instead of a for loop in R

I have a data table in R (you'll need the data.table package installed), generated with X and Y coordinates and random data values drawn from both normal and uniform distributions. The coordinates represent points on a 2000x1600 array that has to be divided into 16 smaller "sectors", each 500x400. For each sector I need the mean of its normal-distribution values divided by the squared minimum of its uniform-distribution values. Using a provided function startstop, I created two variables x and y that hold the coordinates of the 16 sectors, plus a function sCalc that computes the value for one sector.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))
## sCalc was not shown in the question; judging from the description it computes
## the mean of the normal values over the squared minimum of the uniform values
sCalc <- function(d) d[, mean(Norm) / min(Unif)^2]
sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
    dup <- anyDuplicated(c(start, stop))
    stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
x <- startstop(2000)
y <- startstop(1600, T)
sectorNos <- sectorCalc(x,y,DT)
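If I trace through startstop with these inputs, the x sector boundaries come out as below (the duplicate-adjustment step turns the first stop of 500 into 499), which is worth keeping in mind when comparing answers:
x[[1]][1:4]  ## 0 500 1001 1501    (sector starts)
x[[2]][1:4]  ## 499 1000 1500 2000 (sector stops)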
The startstop function isn't really an issue but I need a faster way to subset the data table. Some modifications have to be made to the 'sectorCalc' function. The for loop was the best way I could think of but I don't have too much experience with data tables. Any ideas on a faster method of breaking up the data table?
A solution using not only the package data.table but also the cut function to build the interval "groups":
# Create your test data
library(data.table)
set.seed(123) # make random numbers reproducible to allow comparison of different answers
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))
# calculate the sector by cutting the x and y values into groups defined by the interval breaks
DT[, x.sect := cut(DT[, X], c(0, 499, 1000, 1500, 2000), dig.lab=10)] # Intervals should be: seq(0, 2000, by=500) lower bound is less one since it is not included in the interval (see help for cut function)
DT[, y.sect := cut(DT[, Y], c(0, 399, 800, 1200, 1600), dig.lab=10)] # Intervals should be: seq(0, 1600, by=400)
# Now calculate per group (calculation logic "stolen" from the working answer of user "Symbolix")
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)]
Please note: I think the size of the first and second interval is wrong in the original solution (499 instead of 500 for x and 399 instead of 400 for y), so I could not use the seq function to reproduce your desired intervals but had to enumerate the interval breaks manually.
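A small check of the boundary behavior (my own illustration, not from the answer): cut builds intervals that are open on the left and closed on the right, so these breaks reproduce the loop's inclusive sector bounds:
cut(c(499, 500, 1000, 1001), c(0, 499, 1000, 1500, 2000), dig.lab = 10)
## [1] (0,499]     (499,1000]  (499,1000]  (1000,1500]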
Edit 1: I have replaced the original code that adds the x.sect and y.sect columns by an improved solution that adds columns by reference (:=).
Edit 2: If you want to order the result you have (at least) two options:
# "Chaining" (output is input of next)
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(x.sect, y.sect),]
# Or: Use the "keyby" param instead of "by"
DT[, .(sect = mean(Norm)/min(Unif)^2), keyby=.(x.sect, y.sect)]
Edit 3: Added dig.lab=10 param to cut function in code above to avoid scientific notation of the interval breaks.
To replace your sectorCalc function, I think we can make use of data.table's joins. As you are looping over each row of sector, you just have to create a data.table to join onto that holds your sector data, specify a column to join on (here I'm using key_col), and specify a 'group' variable for each row, to enable us to do the calculation at the end:
x <- startstop(2000)
y <- startstop(1600, T)
## copy the original DT
dt <- copy(DT)
dt_xy <- data.table(x_1 = x[[1]],
                    x_2 = x[[2]],
                    y_1 = y[[1]],
                    y_2 = y[[2]])
dt[, key_col := 1]
dt_xy[, `:=`(key_col = 1, xy_grp = seq(1,.N))]
## Use a data.table join, allowing cartesian, then filter out results.
dt_res <- dt[ dt_xy, on="key_col", allow.cartesian=T][x_1 <= X & X <= x_2 & y_1 <= Y & Y <= y_2]
## calculate 'sect' as required.
dt_sect <- dt_res[, .(sect = mean(Norm)/min(Unif)^2) , by=.(xy_grp)]
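If you want to verify the join-based result against the original loop (this assumes the sCalc sketch given with the question), the group values should match row for row, since xy_grp follows the order of the 16 sectors:
sectorNos <- sectorCalc(x, y, DT)
all.equal(sectorNos, dt_sect[, sect])  ## should be TRUE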

Rolling Mean from fixed starting point (and by Group)

Suppose you have the following data frame:
set.seed(100)
Pts <- floor(runif(20, 0, 10))
Individual <- c(rep("Adam",5), rep("Ben",5), rep("Charlie",5), rep("Daisy",5))
Date <- c(rep(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), "days"), 4))
RollMean <- rep(NA,20)
df <- data.frame(Pts, Individual, Date, RollMean)
I would like to calculate the mean (RollMean) of Pts for each row, by Individual, including only entries between the earliest date and the date on the current row.
For example:
df$RollMean[3] = (5+2+3)/3
df$RollMean[4] = (5+2+3+0)/4
df$RollMean[7] = (8+4)/2
I have tried using functions such as SMA() from the TTR package, applied by group with ave, such as:
df$RollMean <- ave(df$Pts, df$Individual, FUN= function(x) SMA(x, n))
but there I have to pre-specify n, which changes depending on which row R is dealing with.
What code can I use to generate the Rolling Means I am looking for?
You can try:
library(data.table)
setDT(df)[,cumsum(Pts[order(Date)])/seq(.N), Individual]
Here are a few alternatives:
1) This does not use any packages:
transform(df, Rollmean = ave(Pts, Individual, FUN = function(x) cumsum(x) / seq_along(x)))
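Using the numbers from the question as a quick illustration, the cumulative-sum approach reproduces the expected expanding means for the first four rows:
x <- c(5, 2, 3, 0)
cumsum(x) / seq_along(x)
## [1] 5.000000 3.500000 3.333333 2.500000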
2) An alternative is to use zoo's rollapplyr with a vector of widths:
library(zoo)
Rollmean <- function(x) rollapplyr(x, seq_along(x), mean)
transform(df, Rollmean = ave(Pts, Individual, FUN = Rollmean))
