Rolling average from irregular time series in R

I've encountered this problem several times over the years, so maybe I'm just misunderstanding something or being silly about it. I've run into a weird problem when computing a rolling average on an irregular time series. A good overview of the available methods in packages, along with a simple script, is here: Calculating moving average in R
I may be misreading some of them, but I see an issue when dealing with irregular time series. For example, the common rolling-mean approach in the zoo package requires a unique value for each time point, so the data have to be aggregated per time unit first. But then the result is not a true rolling average, it is a rolling average of averages per time unit: time units with fewer data points have comparably more influence on the average than ones with more.
A true moving average, it seems to me, needs to work not with aggregates but with the underlying distribution of observations behind each calculated point.
Given the following data frame of irregular values, how can I best create a moving-average measure for each of the value columns?
df <- data.frame(year = c(rep(2000, 3), rep(2001, 1), rep(2004, 4), rep(2005, 3),
                          rep(2006, 3), rep(2007, 1), rep(2008, 2), rep(2009, 6), rep(2010, 8)),
                 value1 = rnorm(31), value2 = rnorm(31), value3 = rnorm(31))
I found an easy way to do it via subsetting that I'll post as an initial answer, but it only works in limited circumstances and needs to be customized each time. I'm wondering what a general solution would look like. Also, if anyone is able to comment on the practice of using averages of averages vs. averages over the full distributions in rolling-mean calculations, that would be extra helpful. Thanks!
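To make the distinction concrete, here is a small sketch (using the df above; the window bounds are only an example) comparing a pooled window mean with a mean of per-year means:
# 4-year window around 2005: pooled mean vs. mean of per-year means
w <- df[df$year >= 2003 & df$year < 2007, ]
mean(w$value1)                        # every observation weighted equally
mean(tapply(w$value1, w$year, mean))  # every year weighted equally, regardless of its size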

Calculations within a sliding or rolling window of an irregular time series can be solved with data.table's ability to aggregate in a non-equi join.
There are many similar questions, e.g., r calculating rolling average with window based on value (not number of rows or date/time variable) or Rolling regression on irregular time series.
However, this question is different and thus deserves an answer on its own. From OP's own answer it can be concluded that the OP is looking for a centred rolling window. In addition, the rolling mean is to be computed for several columns.
library(data.table)
cols <- c("value2", "value3")
setDT(df)[SJ(year = (min(year) + 2):(max(year) - 2))[, c("start", "end") := .(year - 2, year + 2)],
          on = .(year >= start, year < end),
          c(.(year = i.year), lapply(.SD, mean)), .SDcols = cols, by = .EACHI][, -(1:2)]
year value2 value3
1: 2002 0.57494219 -0.53001134
2: 2003 0.33925292 0.75541896
3: 2004 -0.05834453 0.23987209
4: 2005 0.17031099 0.13074666
5: 2006 0.05272739 0.09297215
6: 2007 -0.12935805 -0.38780964
7: 2008 0.19716437 -0.11587017
The result is identical to the OP's own result rmeans.
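One way to check this (a sketch, not part of the original answer; it assumes the pipeline above has been assigned to a variable, here called result, and that the OP's rmeans from the answer further below has been computed on the same data):
all.equal(as.data.frame(result), rmeans, check.attributes = FALSE)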
Data
set.seed(123) # ensure reproducible sample data
df <- data.frame(
  year = rep(2000:2010, c(3, 1, 0, 0, 4, 3, 3, 1, 2, 6, 8)),
  value1 = rnorm(31), value2 = rnorm(31), value3 = rnorm(31))

So here is the simple subsetting approach I came up with. It could be helpful if anyone runs into the same issue:
df <- data.frame(year = c(rep(2000, 3), rep(2001, 1), rep(2004, 4), rep(2005, 3),
                          rep(2006, 3), rep(2007, 1), rep(2008, 2), rep(2009, 6), rep(2010, 8)),
                 value1 = rnorm(31), value2 = rnorm(31), value3 = rnorm(31))
rmeans <- data.frame()
for (i in (min(df$year) + 2):(max(df$year) - 2)) {
  rmeans <- rbind(rmeans, data.frame(year = i, as.data.frame.list(
    colMeans(df[df$year >= (i - 2) & df$year < (i + 2), -c(1, 2)]))))
}

Related

Moving average varying window

I have an unbalanced panel with observations of a variable x per ID and month. I am trying to calculate a 6-month rolling average of x, but only for every March. I know that with zoo I can calculate the average at every single time point, but I think that is computationally expensive. I have a very large panel, so it would be better to define an index first and pass it to the function. Also, because the panel is unbalanced, sometimes I have all 6 past values at a given March and sometimes I do not. If a minimum of 3 values is available, I would still like to compute the average.
Here is some sample code and my solution so far:
library(data.table)
set.seed(1)
time=rep(seq(as.Date("2010-02-01"), length=42, by="1 month") - 1,2)
IDs=rep(letters[1:2],each=length(time))
DT <- data.table(time = time,
                 ID = IDs,
                 ind = rep(1:(2 * length(time))),
                 row = 1:(2 * length(time)),
                 x = sample(2 * length(time)))
DT
DT <- DT[!ind %in% c(11,12,26)]
DT
library(zoo)
DT[, movavg := if (length(x) >= 3) {
  rollapply(x, 6, sum, na.rm = FALSE, align = "right", fill = NA)
} else rep(NA, length(x)), by = ID]
DT
The goal is simply to show, for each March, the corresponding moving average over the past 6 observations. I don't mind whether the original panel is kept (with results shown only in March) or whether only the March values are extracted and nothing else is shown.
My code works, but it does the calculation for every row/month. What I want is for it to work only at a defined index. The issue is that, because the panel is unbalanced, the distance between Marches is not always the same. For example, it can be 12 months from one year to the next, but only 10 months to the following year when 2 observations happen to be missing. Can rollapply still be used? Any hints for data.table or dplyr are highly appreciated.
If this code from the question gives what you want
DT[, movavg := if (length(x) >= 3) {
  rollapply(x, 6, sum, na.rm = FALSE, align = "right", fill = NA)
} else rep(NA, length(x)), by = ID]
then the first of these ran 2.8x faster and gave the same result, and the second one, using frollsum from data.table, ran 4.8x faster.
DT[, movavg := rollsumr(x, 6, fill = NA), by = ID]
DT[, movavg := frollsum(x, 6), by = ID]
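If only the March rows are needed, one option (not part of the original answer) is to filter afterwards, e.g. with data.table's month() helper:
DT[month(time) == 3]   # keep only the March rows after computing the rolling sums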

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
library(dplyr)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%  # convert from character/factor first
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(mean(Difference, na.rm = TRUE))
Due to the grouping, the first value of Difference for any customer will be NA (including customers with only one visit), so it is dropped when taking the mean.
Using base R (no extra packages):
1. Sort the data, ordering by customer ID, then by timestamp.
2. Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3. Find the average.
4. Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
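For example, a minimal sketch assuming the strings look like the ones in the sample df above:
library(lubridate)
# ymd_hms() parses "2019-12-31 12:13:25"-style strings and returns POSIXct
df$Timestamp <- ymd_hms(df$Timestamp)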

Define different timeseries for different columns

I have a dataframe where some of the columns start later than the others. Please find a reproducible example below.
set.seed(354)
df <- data.frame(Product_Id = rep(1:100, each = 50),
                 Date = seq(from = as.Date("2014/1/1"),
                            to = as.Date("2018/2/1"),
                            by = "month"),
                 Sales = rnorm(100, mean = 50, sd = 20))
df <- df[-c(251:256, 301:312, 2551:2562, 2651:2662, 2751:2762), ]
library(zoo)
z <- read.zoo(df, index = "Date", split = "Product_Id", FUN = as.yearmon)
tt <- as.ts(z)
Now, for columns 6, 7, 52, 54 and 56 of this dataframe, I want to define the time series as starting from a different date than the rest of the dataframe. Say the data begins in Jan 2000; column 6 would begin in July 2000, column 7 in Jan 2001, and so on. How should I proceed?
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a separate dataframe and do the forecasting? Or can I convert each column to a different time series object that starts from the first non-NA value?
Now, for columns 6, 7, 52, 54 and 56 of this dataframe, I want to define the time series as starting from a different date than the rest of the dataframe. Say the data begins in Jan 2000; column 6 would begin in July 2000, column 7 in Jan 2001, and so on. How should I proceed?
There is, AFAIK, no way to do this in R with a time series matrix. And if each column started at a different date, then (since each column has the same number of entries) each column would also need to end at a different date. Is this really what you need? A collection of time series that all happen to be of the same length (so they can fit into a matrix), but that start and end with offsets? I struggle to understand where something like this would be useful, outside a kind of forecasting competition.
If you really need this, then I would recommend you put your time series into a list structure. Then each one can start and end at any date, and they can be the same or different lengths. Take inspiration from Mcomp::M3.
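A minimal sketch of such a list structure (made-up data, just to illustrate different start dates and lengths):
series_list <- list(
  ts(rnorm(24), start = c(2000, 1), frequency = 12),  # starts Jan 2000
  ts(rnorm(18), start = c(2000, 7), frequency = 12),  # starts Jul 2000
  ts(rnorm(12), start = c(2001, 1), frequency = 12)   # starts Jan 2001
)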
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a separate dataframe and do the forecasting? Or can I convert each column to a different time series object that starts from the first non-NA value?
Since your tt is already a time series object, the simplest way would be simply to iterate over its columns:
library(forecast)
fcst <- matrix(nrow = 10, ncol = ncol(tt))
for (ii in 1:ncol(tt)) fcst[, ii] <- forecast(ets(tt[, ii]), 10)$mean
Note that most modeling functions in forecast will throw a warning and do something reasonable on encountering NA values. Here, e.g.:
1: In ets(tt[, ii]) :
Missing values encountered. Using longest contiguous portion of time series
Of course, you could do something yourself inside the loop, e.g., search for the last NA and start the time series for modeling right after that (but make sure you fail gracefully if the last entry is NA).
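A rough sketch of that idea (hedged; it reuses the loop variable ii from above and assumes the series does not end in NA):
y <- tt[, ii]
last_na <- max(which(is.na(y)), 0)   # position of the last NA, 0 if there are none
if (last_na < length(y)) {
  y_trimmed <- window(y, start = time(y)[last_na + 1])
  fit <- ets(y_trimmed)
}  # else: the series ends in NA, so skip it or handle it explicitly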

Efficient time-weighted averages

Given a dataset containing short intervals, and a value representing an average measure of something over each interval, I would like to average those values up to the calendar year, separately for each individual ("id").
The issue is that these intervals are not aligned with calendar year, so time-weighting of those values is necessary in order to get the best estimate of the annual average from the shorter interval averages.
Note that the intervals are inclusive for the start date and exclusive for the end date.
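To spell out the weighting (a conceptual sketch, not from the original post; the helper name is made up): for a given id and calendar year, each interval's value is weighted by the number of its days that fall in that year:
# hypothetical helper illustrating the time-weighting
annual_avg <- function(values, days_in_year) {
  sum(values * days_in_year) / sum(days_in_year)
}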
Example data
start_date and end_date define intervals that are unique and non-overlapping within levels of id:
set.seed(30)
library(lubridate)
library(data.table)
x <- CJ(id=1:5, start_date=seq(from=as.Date("2005-01-12"),by=14,length=100))
#add noise so intervals don't all start on 2005-01-12
x[,start_date:=start_date + rbinom(1,size=20,prob=.15)*15L,by=id]
#all intervals are two weeks:
x[,end_date:=start_date+14]
x[,value:=rnorm(nrow(x))]
#for each id, calculate the mean value over each calendar year.
years <- c(year(min(x$start_date)), year(max(x$start_date)))
Additional constraints:
- works for intervals that aren't exactly two weeks long
- works even if the intervals aren't all the same length (as long as they're non-overlapping)
- works even if the earliest start_date isn't the same for each participant
- averages for calendar years that don't have enough periods for that id to complete the year should be NA
Here is a potential solution that is too slow for my purposes:
complete_date_seq <- seq(as.Date(ymd(paste0(years[1], "-01-01"))),
                         as.Date(ymd(paste0(years[2], "-12-31"))), by = 1)
m <- matrix(NA, nrow = length(unique(x$id)), ncol = length(complete_date_seq))
rownames(m) <- unique(x$id)
colnames(m) <- as.character(complete_date_seq)
for (i in 1:nrow(m)) {
  temp <- x[id == rownames(m)[i]]
  for (j in 1:nrow(temp)) {
    m[i, as.Date(complete_date_seq) %within% temp[j, interval(start_date, end_date - 1)]] <- temp[j, value]
  }
}
out <- CJ(id = unique(x$id), year = years[1]:years[2])
intervalfromyear <- function(y) interval(as.Date(ymd(paste0(y, "-01-01"))), as.Date(ymd(paste0(y, "-12-31"))))
out[, annual_avg := mean(m[rownames(m) == .BY$id, complete_date_seq %within% intervalfromyear(.BY$year)]),
    by = c("id", "year")]
I'm guessing there's some package for doing time-weighting that I'm not aware of. Is this true? Ideally there's a native data.table solution that's fast.
This is basically the same approach I proposed in the question, but a lot more efficient since it creates a long data.table rather than a matrix. I spent some time looking for a different solution (one that doesn't actually create a cell for each date and instead uses a weighted-average formula) using foverlaps, but it was way more work, less easily extensible, and more error-prone.
#switch from exclusive to inclusive end_date
x[, actual_end_date:=as.Date(as.numeric(end_date)-1,origin="1970-01-01")]
z <- x[, list(date=seq(start_date,actual_end_date,by=1),value),by=c("id","start_date")]
complete_date_seq <- seq(from=as.Date(paste0(years[1],"-01-01")),
to=as.Date(paste0(years[2],"-12-31")),by=1)
missing_dates <- z[,list(date=as.Date(setdiff( complete_date_seq,date ),origin="1970-01-01"),value=NA),by=id]
result <- rbind(z,missing_dates,fill=TRUE)[order(id,date)]
result[, year:=substr(date,1,4)]
result[, mean(value),by=c("id","year")]

R carry forward last observation n times by group

This one is driving me nuts. I have a large data.table with monthly stock data. Every June I assign every stock to one of 10 portfolios based on an accounting variable. I would like to carry the assigned portfolio variable forward for the next 11 months, until each stock is assigned to a new portfolio 1 to 10 in June of the following year. na.locf is basically what I'm looking for, but I am running into 2 issues:
1. Some stocks lack sufficient accounting data the next year, so they shouldn't be assigned to a portfolio in that year (i.e. the portfolio variable should stay NA). But of course na.locf keeps carrying the portfolio number forward until there is a new one.
2. Some stocks may get delisted after e.g. 3 months, so they don't have another 11 months of data.
That's why I'm looking for code that carries forward the last observation a maximum of 11 times, until June of the next year (when there is a new portfolio number).
That's the na.locf solution right now with the 2 issues (PERMNO is the stock identifier):
COMPUSTAT_CRSP_IBES1[,
Portfolio_Monthly := na.locf(Portfolio_Monthly,
na.rm = FALSE),
by = PERMNO]
I tried to use rep but that didn't work at all:
COMPUSTAT_CRSP_IBES1[,
Portfolio_Monthly := if_else(!is.na(Portfolio_Monthly),
rep(Portfolio_Monthly, 11),
NA),
by = PERMNO]
Thanks for any hints!
You can create and/or use the fiscal year (June through May) as one of the grouping criteria in your na.locf solution:
#show data before calculations
data.frame(dat)
#demo FY calculation
dat[, FY := year(MONTH) + as.numeric(month(MONTH) >= 6)]
#actual code
dat[, Portfolio_Monthly := zoo::na.locf(Portfolio_Monthly, na.rm=FALSE),
by=list(PERMNO, year(MONTH) + as.numeric(month(MONTH) >= 6))]
#show results
data.frame(dat)
sample data:
library(data.table)
set.seed(0L)
dat <- data.table(PERMNO = rep(LETTERS[1:12], each = 20),
                  MONTH = rep(seq(as.Date("2000-01-01"), by = "1 month", length.out = 20), 12),
                  Portfolio_Monthly = NA_real_)
for (i in sample(1:dat[,.N], 5)) {
set(dat, i, 3L, rnorm(1))
}
setorder(dat, PERMNO, MONTH)
