Moving average with changing period in R

I have a data frame named abc on which I'm doing moving average using rollapply. The following code works:
forecast <- rollapply(abc, width=12, FUN=mean, align = "right", fill=NA)
Now I want to do the same thing with the width being variable: for the 1st month the result should be empty, for the 2nd month it should be the 1st month's value, for the 3rd month it should be (first month + second month)/2, i.e. for the ith month with i <= 12 the value is sum(months 1 to i-1)/(i-1), and for i > 12 it is the average of the previous 12 months, as done by forecast above. Please help.
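For example, for a toy series (numbers made up just to illustrate the rule), the desired result would be:
x <- c(10, 20, 30, 40)
# desired forecast: NA, 10, 15, 20
# i = 2: 10/1,  i = 3: (10 + 20)/2,  i = 4: (10 + 20 + 30)/3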

Here are some approaches:
1) partial=TRUE
n <- length(x)
c(NA, rollapplyr(x, 12, mean, partial = TRUE)[-n])
Note the r at the end of rollapplyr.
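For instance, on a toy vector (made-up data), the value at position i is the mean of the previous min(i - 1, 12) values:
library(zoo)
x <- 1:20
n <- length(x)
head(c(NA, rollapplyr(x, 12, mean, partial = TRUE)[-n]))
# NA 1.0 1.5 2.0 2.5 3.0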
2) width as list. The width argument of rollapply can be a list such that the ith list element is a vector of the offsets to use for the ith rolling computation. If we specify partial=TRUE then offsets that run off the end of the vector will be ignored. If we only specify one element in the list it will be recycled:
rollapply(x, list(-seq(12)), mean, partial = TRUE, fill = NA)
2a) Rather than recycling and depending on partial we can write it out. Here we want width <- list(numeric(0), -1, -(1:2), -(1:3), ..., -(1:12), ..., -(1:12)) which can be calculated like this:
width <- lapply(seq_along(x), function(i) -seq_len(min(12, i - 1)))
rollapply(x, width, mean)
This one would mainly be of interest if you want to modify the specification slightly because it is very flexible.
Note: Later in the comments the poster asked for the same rolling average except for it not to be lagged. That would be just:
rollapplyr(x, 12, mean, partial = TRUE)
Note the r at the end of rollapplyr.
Update: Some improvements and additional solutions.

Related

Calculating the difference of elements in a vector with varying lag/lead

I have some lab data and I am looking to calculate the difference between sample measurements over a moving time frame/window, e.g. 2 minutes (as opposed to static 0-2, 2-4, 4-6 minute windows).
The problem is that although the data is sampled every second, there are some missed samples (e.g. 1, 2, 4, 6, 7), so I cannot use a fixed lag, especially for larger time windows.
Here is the most promising approach I have tried: I calculate the difference in row positions and then use that to determine the lag value.
library(tidyverse)
df <- data.frame(sample_group = c(rep("a", 25), rep("b", 25)),
                 t_seconds = 1:50,
                 measurement = seq(1, 100, 2))
df <- df[-c(5, 10, 23, 33, 44), ]  # remove samples
t_window <- 5
df_diff <- df %>%
  group_by(sample_group) %>%
  arrange(t_seconds) %>%
  mutate(lag_row = min(which(t_seconds >= t_seconds + t_window)) -
                   min(which(t_seconds == t_seconds)),  # attempt to identify the lag value for each element
         Meas_diff = measurement - lag(measurement, lag_row))
In the lag_row expression I am trying to refer both to an element of a vector and to the vector itself, which obviously does not work! To make it clearer, I have added '_v' to mark what I want treated as the whole vector and '_e' to mark a single element of it: min(which(t_seconds_v >= t_seconds_e + t_window)) - min(which(t_seconds_v == t_seconds_e))
I have tried to stay away from loops but I have failed to solve the problem.
I would appreciate it if anyone has better ideas.
Your first step should be inserting missing observations into your time series. Then you could fill the missing values using a Last-Observation-Carried-Backwards operation. This provides you with a complete regular time series.
Your desired output is very unclear, so the next step after that in the following example is just a guess. Adjust as needed.
# complete the time series (using a data.table join):
library(data.table)
setDT(df)
df_fill <- df[, .SD[data.table(t_seconds = min(t_seconds):max(t_seconds)),
                    on = "t_seconds"],
              by = sample_group]
df_fill[, filled := is.na(measurement)]
#last observation carried backwards
library(zoo)
df_fill[, measurement := na.locf(measurement, fromLast = TRUE), by = sample_group]
# differences: value t_window rows ahead minus the current value
df_fill[, diff_value := shift(measurement, t_window, type = "lead") - measurement, by = sample_group]
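To eyeball the result for one group of the example df (the filled column marks rows inserted by the join):
head(df_fill[sample_group == "a", .(t_seconds, measurement, filled, diff_value)], 8)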

Apply between function over a matrix by using lower bound and upper bound vectors

I have a data frame composed of numeric values. I calculated the standard deviation and mean for each column and created Upper_Bound and Lower_Bound vectors as follows:
std_devs = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, sd)
means = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, mean)
Upper_Bound = means + 3*std_devs
Lower_Bound = means - 3*std_devs
Now I want to detect the rows that have at least one value that does not fall between the relevant upper and lower bounds. A value in column j must be greater than or equal to Lower_Bound[j] and less than or equal to Upper_Bound[j]; if at least one value in row i violates this condition, I want to save the index of that row (I also have row names, so saving row names would be fine too). What I want to obtain is a vector of indices (or row names) identifying all rows which violate the rule. I tried the following:
outliers = apply(my_data, 1, between(x, Lower_Bound, Upper_Bound, incbounds = TRUE))
But I guess it was too much to expect between to automatically go over every value in a row and compare it with the relevant bounds. This was my second hopeless attempt, which did not work:
outliers = apply(exp_vars_numeric, 1, apply(x, 2, between(x, Lower_Bound, Upper_Bound, incbounds = TRUE)))
I know that I can do it with a for loop, but I am hoping for a more efficient solution. Any suggestion is highly appreciated.
Thanks in advance.
Consider keeping everything in one data frame by adding lower and upper bound columns with the help of ave() for inline aggregation of mean and sd. Then run a conditional ifelse() to flag such rows.
num_cols <- sapply(exp_vars, is.numeric)
num_names <- colnames(exp_vars)[num_cols]
means <- sapply(exp_vars[, num_cols], function(x) ave(x, FUN = mean))
std_devs <- sapply(exp_vars[, num_cols], function(x) ave(x, FUN = sd))
exp_vars[, paste0(num_names, "_lower")] <- means - 3 * std_devs
exp_vars[, paste0(num_names, "_upper")] <- means + 3 * std_devs
# CONDITIONALLY ASSIGN FLAG COLS (1 = WITHIN BOUNDS, 0 = OUTSIDE)
exp_vars[, paste0(num_names, "_flag")] <- ifelse(exp_vars[, num_names] >= exp_vars[, paste0(num_names, "_lower")] &
                                                 exp_vars[, num_names] <= exp_vars[, paste0(num_names, "_upper")], 1, 0)
# KEEP ROW NAME ONLY IF EVERY FLAG COL IS 1 (ALL VALUES WITHIN BOUNDS)
exp_vars$index <- ifelse(rowSums(exp_vars[, paste0(num_names, "_flag")]) == length(num_names), row.names(exp_vars), NA)
# ROWS WITH AT LEAST ONE OUT-OF-BOUNDS VALUE
exp_vars[is.na(exp_vars$index), ]
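To try this on something concrete, a toy exp_vars could be built like this (made-up data; the last row is intended to violate the upper bound of v1) and the snippet above then run unchanged:
set.seed(42)
exp_vars <- data.frame(v1 = c(rnorm(50), 10), v2 = rnorm(51))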
It is recommended to include a small example of what your data looks like so that it is easier for us to respond to your question :) I generated data frames based on your description, and it seems that the following solves your problem:
df <- data.frame(a = 1:10, b = 5:14)
ncols <- ncol(df)
bounds <- data.frame(lower = seq(.5, 5, .5), upper = seq(6.5, 11, .5))
one_plus_fall_outside <- sapply(1:nrow(df), function(i)
  # unlist() turns the one-row data frame into a plain vector for between()
  sum(between(unlist(df[i, ]), bounds$lower[i], bounds$upper[i])) / ncols < 1
)
which(one_plus_fall_outside)
You can check whether this works well by looking at all the columns together:
cbind(df,bounds,one_plus_fall_outside)

alignment and offsets in rollapply

I am trying to calculate some statistics for a moving window and am using rollapply in the zoo package. My question is: how do I get rollapply to apply that function to the previous n observations, instead of the current observation and the previous n-1 observations as align = "right" seems to do?
require(zoo)
z <- data.frame(x1 = 11:111, x2 = 111:211, x3 = as.Date(31:131))  # generate data
output <- data.frame(dates = z$x3,
                     rollapply(z[, 1:2], by.column = TRUE, 5, max, fill = NA, align = "right"))
I have a hunch this is answered by ?rollapply: "If width is a plain numeric vector its elements are regarded as widths to be interpreted in conjunction with align whereas if width is a list its components are regarded as offsets. In the above cases if the length of width is 1 then width is recycled for every by-th point. If width is a list its components represent integer offsets such that the i-th component of the list refers to time points at positions i + width[[i]]." But I have no idea what that means in terms of R code, and no example is provided.
Never mind, I deciphered the help. Adding the width parameter to rollapply like this:
width = list(-1:-5)
accomplishes it.
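For completeness, here is that offset form written out against the z from the question (same call as before, only the width changed):
require(zoo)
z <- data.frame(x1 = 11:111, x2 = 111:211, x3 = as.Date(31:131))
# offsets -1..-5: each result uses the previous 5 observations, excluding the current one
output <- data.frame(dates = z$x3,
                     rollapply(z[, 1:2], width = list(-1:-5), FUN = max,
                               by.column = TRUE, fill = NA, align = "right"))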
If I'm reading correctly, you just want the column "shifted" down by one, so that the value for row n is the value that row n-1 has now.
This can be done with dplyr::lag() (note that base R's stats::lag() only shifts the time attributes of a ts object and will not reorder the values of a plain column):
library(zoo)
z <- data.frame(x1 = 11:111, x2 = 111:211, x3 = as.Date(31:131))  # generate data
output <- data.frame(dates = z$x3,
                     rollapply(z[, 1:2], by.column = TRUE, 5, max, fill = NA, align = "right"))
# dplyr::lag shifts the values down by one row (the first entry becomes NA)
output$x1 <- dplyr::lag(output$x1, 1)
output$x2 <- dplyr::lag(output$x2, 1)

rollmean length of timeseries

I have a time series of regular 15-minute intervals and I want to take a rolling average of the data over every 2 hours, so the width of the rollmean window would be 8. The only issue is that the length of the original time series I put in is 35034, but the length of the output is 35027. Is there a way to make the length of the output the same as the input, with data at the ends as well? I don't want to fill it with NAs at the end.
interval <- 2 #2 hours
data <- data.frame(t=streamflowDateTime,flow=streamflow)
data2hr <- data
rollingWidth <- (interval*60)/15
library(zoo)
smoothedFlow <- rollmean(data2hr$flow,rollingWidth,align="center")
I am not completely clear on how you want to do the filling but here are some ways:
1) Add the argument fill = NA to rollmean to add NA's to the ends.
2) If you want partial means at the ends then use this where x is your data, width is the number of points to take the mean of each time:
rollapply(x, width, mean, partial = TRUE)
(The points near the ends will be the mean of fewer than width points since, of course, there just aren't that many near the ends.)
3) Assuming width is odd you could pad the input series with (width-1)/2 values (possibly NA depending on what you want) on each end; see the sketch after this list.
4) This keeps the original values near the ends:
out <- rollmean(x, width, fill = NA)
isNA <- is.na(out)
out[isNA] <- x[isNA]
Note: align = "center" is the default for both rollmean and rollapply so that can be omitted.
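A minimal sketch of option 3), assuming an odd width and repeating the end values as padding (toy data):
library(zoo)
x <- sin(seq(0, 10, by = 0.25))                        # example series
width <- 9
pad <- (width - 1) / 2
x_pad <- c(rep(x[1], pad), x, rep(x[length(x)], pad))  # pad each end
out <- rollmean(x_pad, width)                          # same length as the original x
length(out) == length(x)                               # TRUE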
If you don't want NA, you can use the fill parameter to pad the ends with constant values instead:
With the keyword "extend", rollmean fills the ends by repeating the nearest computed values, so the output has the same length as the input:
rollmean(x, k, align="center", fill = "extend")
or you can supply a three-component constant for the left end, interior, and right end:
rollmean(x, k, align="center", fill = c(42, 69, 666))

R xts: apply over a rolling window

I wish to execute a function FUN over a rolling window of 1 year. My xts does not have the same number of points per year. How can I do that in an efficient way?
P.S. Usually, to execute FUN over a fixed number of data points (for instance 100), I use:
as.xts(rollapply(data = zoo(indicator), FUN = FUN, width = 100, align = "right"))
but obviously this doesn't work if there are not always the same number of points per year.
I'll try to answer my own question: One way to do that is:
First, NA-pad the time series so that there is one data point per day (or whatever unit is relevant for your case).
(Optional, depending on your FUN) Then use na.locf to carry the last observation over to fill the holes.
Finally, use the usual rollapply as shown in the question, over the fixed number of data points that corresponds to 1 year.
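A minimal sketch of those three steps, assuming a daily grid so that 365 points stand in for one year (names and data here are made up, with mean standing in for FUN):
library(xts)
set.seed(1)
days <- seq(as.Date("2015-01-01"), as.Date("2019-12-31"), by = "day")
indicator <- xts(cumsum(rnorm(800)), order.by = sort(sample(days, 800)))
# 1) NA-pad so that there is one point per day
daily <- merge(indicator, xts(, seq(start(indicator), end(indicator), by = "day")))
# 2) carry the last observation over to fill the holes
daily <- na.locf(daily)
# 3) the usual fixed-width rollapply over 365 points
out <- rollapplyr(daily, 365, mean, fill = NA)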
You can use the apply.yearly(x, FUN, ...) function from the xts library.
library(xts)
dat <- xts(rnorm(1000), order.by = as.Date(1:1000))
plot(dat)
apply.yearly(dat, mean)
