Moving average varying window - r

I have an unbalanced panel, in which I have certain observations (variable x) per ID and month. I am trying to calculate a 6-month rolling average of x, but only every March. I know that with zoo I can calculate the average at every single row, but I think that is computationally expensive. I have a very large panel, so it would be better to define an index first and pass it to the function. Also, because the panel is unbalanced, sometimes I have all 6 past values at a given March and sometimes I do not. If there is a minimum of 3 values available, I would still like to compute the average.
Here is some sample code and my solution so far:
library(data.table)
set.seed(1)
time <- rep(seq(as.Date("2010-02-01"), length = 42, by = "1 month") - 1, 2)
IDs  <- rep(letters[1:2], each = length(time))
DT <- data.table(time = time,
                 ID   = IDs,
                 ind  = rep(1:(2 * length(time))),
                 row  = 1:(2 * length(time)),
                 x    = sample(2 * length(time)))
DT
DT <- DT[!ind %in% c(11, 12, 26)]
DT
library(zoo)
DT[, movavg := if (length(x) >= 3) {
    rollapply(x, 6, sum, na.rm = FALSE, align = "right", fill = NA)
  } else {
    rep(NA, length(x))
  }, by = ID]
DT
The goal is simply to show, for each March, the corresponding moving average over the past 6 observations. I don't mind whether the original panel is kept and results are shown only in March, or whether only the March values are extracted and nothing else is shown.
My code works, but it does the calculation for every row/month. What I want is for it to work only at a defined index. The issue is that, because the panel is unbalanced, the distance between the Marches is not always the same. For example, it can be 12 months from one year to the next, but only 10 months to the following year if 2 observations happen to be missing. Can rollapply still be used? Any hints for data.table or dplyr are highly appreciated.

If this code from the question gives what you want
DT[, movavg := if (length(x) >= 3) {
    rollapply(x, 6, sum, na.rm = FALSE, align = "right", fill = NA)
  } else {
    rep(NA, length(x))
  }, by = ID]
then the first alternative below ran 2.8x faster and gave the same result, and the second one, using frollsum from data.table, ran 4.8x faster.
DT[, movavg := rollsumr(x, 6, fill = NA), by = ID]
DT[, movavg := frollsum(x, 6), by = ID]
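To get only the March values asked for in the question, a minimal sketch (not part of the benchmark above; month() here is data.table's helper for Date columns):
DT[, movavg := frollsum(x, 6), by = ID]   # NA until 6 observations per ID
DT[month(time) == 3]                      # keep only the March rows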

Related

Identify Min & Max Numeric Value within Date/Datetime range repeatedly

I am completely new to R, so this is proving too complex for me to handle right now; any help is much appreciated.
I am analysing price action data for BTC. I have 1 minute candles from 2019-09-08 19:13:00 to 2022-03-15 00:22:00 with the variables of open, high, low, close price as well as volume in BTC & USD and trade count for each of those minutes. Data source is https://www.cryptodatadownload.com/data/binance/ for anyone interested.
I cleaned up & correctly formatted the data and now want to analyse when BTC price made a low & high for various date & time ranges, for example:
What time of day, in 30-minute increments, did BTC make a low for the week?
Here is what I believe I need to do:
I need to tell R that 30 minutes is a range and identify the lowest and highest values of the "Low" and "High" variables within it; likewise that a day is a range, finding the lowest and highest values of "Low" and "High" within it; and that a week is a range, again finding the lowest and highest values of "Low" and "High" within it.
Then I'd need to mark these values; the best method I can think of would be to create a new TRUE/FALSE variable like so:
btcusdt_binance_fut_1min$pa.low.of.week.30min
btcusdt_binance_fut_1min$pa.high.of.week.30min
Every minute row that falls within that 30-minute low or high window would be marked TRUE, and every other minute within that week FALSE.
I looked at lubridate's interval() function, but as far as I can tell I'd need to define each year, month, week, day and 30-minute interval individually with a start and end time, which is obviously not feasible. I believe I run into the same problem with the subset() function.
Another option seems to be the seq() and seq.POSIXt() functions, as well as the range() function, but I haven't found a way to make them work.
Here is all my code and I am using this data set: https://www.cryptodatadownload.com/cdd/BTCUSDT_Binance_futures_data_minute.csv
library(readr)
library(lubridate)
library(tidyverse)
library(plyr)
library(dplyr)
# IMPORT CSV FILE AS DATA SET
# Name data set & choose import file
# Skip = 1 for skipping first row of CSV
btcusdt_binance_fut_1min <-
read.csv(
file.choose(),
skip = 1,
header = T,
sep = ","
)
# CLEAN UP & REORGANISE DATA
# Remove unix & symbol column
btcusdt_binance_fut_1min$unix = NULL
btcusdt_binance_fut_1min$symbol = NULL
# Rename date column to datetime
colnames(btcusdt_binance_fut_1min)[colnames(btcusdt_binance_fut_1min) == "date"] <-
"datetime"
# Convert datetime column to POSIXct format
btcusdt_binance_fut_1min$datetime <-
as_datetime(btcusdt_binance_fut_1min$datetime, tz = "UTC")
# Create variable column for each time element
btcusdt_binance_fut_1min$year <-
year(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$month <-
month(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$week <-
isoweek(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$weekday <-
wday(btcusdt_binance_fut_1min$datetime,
label = TRUE,
abbr = FALSE)
btcusdt_binance_fut_1min$hour <-
hour(btcusdt_binance_fut_1min$datetime)
btcusdt_binance_fut_1min$minute <-
minute(btcusdt_binance_fut_1min$datetime)
# Reorder columns
btcusdt_binance_fut_1min <-
btcusdt_binance_fut_1min[, c(1, 9, 10, 11, 12, 13, 14, 4, 3, 2, 5, 6, 7, 8)]
Using data.table we can do the following:
library(data.table)
btcusdt_binance_fut_1min <- data.table(datetime = seq.POSIXt(as.POSIXct("2022-01-01 0:00"), as.POSIXct("2022-01-01 2:59"), by = "1 min"))
btcusdt_binance_fut_1min[, group := format(as.POSIXct(cut(datetime, breaks = "30 min")), "%H:%M")]
The cut() function will "floor" each datetime into a 30-minute bucket. The format() and as.POSIXct() calls are just there to remove the date part so the same half hours can be compared easily across dates, but if you prefer to keep it a datetime you can remove them. Note that cut() anchors the breaks at the earliest timestamp in the data (only the seconds are zeroed for a "min" specification), so the buckets line up with clock half hours only if the series starts on one; lubridate::floor_date(datetime, "30 minutes") always floors to calendar half hours.
After this the next steps are pretty straightforward (on the real data, which has the High and Low columns):
btcusdt_binance_fut_1min[, .(High = max(High), Low = min(Low)), by=.(group)]
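To get the TRUE/FALSE flag columns asked for in the question, here is a minimal sketch. It assumes the question's btcusdt_binance_fut_1min with its datetime, High and Low columns; the flag names come from the question, and floor_date() is used so buckets are calendar-aligned:
library(data.table)
library(lubridate)
dt <- as.data.table(btcusdt_binance_fut_1min)
# label each minute with its 30-minute bucket and its ISO week
dt[, bucket  := floor_date(datetime, "30 minutes")]
dt[, week_id := paste(isoyear(datetime), isoweek(datetime))]
# TRUE for every minute inside the 30-minute bucket that contains the week's extreme
dt[, pa.low.of.week.30min  := bucket == bucket[which.min(Low)],  by = week_id]
dt[, pa.high.of.week.30min := bucket == bucket[which.max(High)], by = week_id]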

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr (converting Timestamp to POSIXct first, since the sample data stores it as character/factor):
library(dplyr)
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(mean(Difference, na.rm = TRUE))
Due to the grouping, the first Difference value for each customer is NA (including customers with only one visit), so those NAs are dropped by na.rm = TRUE when taking the mean.
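If you also want the grand mean across all customers mentioned in the question, one possible follow-up (a sketch; avg_gap_hours is just an illustrative name) is to store the per-customer summary and then average it:
library(dplyr)
per_customer <- df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  summarise(avg_gap_hours = mean(as.numeric(difftime(Timestamp, lag(Timestamp),
                                                     units = "hours")),
                                 na.rm = TRUE))
mean(per_customer$avg_gap_hours, na.rm = TRUE)  # grand mean across customers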
Using base R (no extra packages):
1. Sort the data, ordering by customer ID, then by timestamp.
2. Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3. Find the averages.
4. Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
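For example, a tiny sketch assuming the real strings look like "2019-12-31 12:13:25" (swap in the parser matching your actual format, e.g. dmy_hms()):
library(lubridate)
df$Timestamp <- ymd_hms(df$Timestamp, tz = "UTC")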

Efficient time-weighted averages

Given a dataset containing short intervals, and a value representing an average measure of something over each interval, I would like to average those values up to the calendar year, separately for each individual ("id").
The issue is that these intervals are not aligned with calendar year, so time-weighting of those values is necessary in order to get the best estimate of the annual average from the shorter interval averages.
Note that the intervals are inclusive for the start date and exclusive for the end date.
Example data
start_date and end_date are intervals that are unique non-overlapping within levels of id:
set.seed(30)
library(lubridate)
library(data.table)
x <- CJ(id=1:5, start_date=seq(from=as.Date("2005-01-12"),by=14,length=100))
#add noise so intervals don't all start on 2005-01-12
x[,start_date:=start_date + rbinom(1,size=20,prob=.15)*15L,by=id]
#all intervals are two weeks:
x[,end_date:=start_date+14]
x[,value:=rnorm(nrow(x))]
#for each id, calculate the mean value over each calendar year.
years <- c(year(min(x$start_date)), year(max(x$start_date)))
Additional constraints:
works for intervals that aren't exactly two weeks long
works even if the intervals aren't all the same length (as long as they're non-overlapping)
works even if the earliest start_date isn't the same for each participant
averages for calendar years that don't have enough periods for that id to complete the year should be NA
A potential solution that is too slow for my purposes:
complete_date_seq <- seq(as.Date(ymd(paste0(years[1],"-01-01"))), as.Date(ymd(paste0(years[2],"-12-31"))),by=1)
m <- matrix(NA,nrow=length(unique(x$id)),ncol=length(complete_date_seq))
rownames(m) <- unique(x$id)
colnames(m) <- as.character(complete_date_seq)
for(i in 1:nrow(m)){
temp <- x[id==rownames(m)[i]]
for(j in 1:nrow(temp)){
m[i, as.Date(complete_date_seq) %within% temp[j,interval(start_date,end_date-1)]] <- temp[j,value]
}
}
out <- CJ(id=unique(x$id),year=years[1]:years[2])
intervalfromyear <- function(y) interval(as.Date(ymd(paste0(y,"-01-01"))), as.Date(ymd(paste0(y,"-12-31"))))
out[, annual_avg:=mean(m[rownames(m)==.BY$id,complete_date_seq %within% intervalfromyear(.BY$year)]) ,by=c("id","year")]
I'm guessing there's some package for doing time-weighting that I'm not aware of. Is this true? Ideally there's a native data.table solution that's fast.
This is basically the same approach I proposed in the question, but a lot more efficient since it creates a long data.table rather than a matrix. I spent some time looking for a different solution using foverlaps (one that doesn't create a cell for each date and instead uses a weighted-average formula), but it was more work, less easily extensible, and more error-prone.
#switch from exclusive to inclusive end_date
x[, actual_end_date:=as.Date(as.numeric(end_date)-1,origin="1970-01-01")]
z <- x[, list(date=seq(start_date,actual_end_date,by=1),value),by=c("id","start_date")]
complete_date_seq <- seq(from=as.Date(paste0(years[1],"-01-01")),
to=as.Date(paste0(years[2],"-12-31")),by=1)
missing_dates <- z[,list(date=as.Date(setdiff( complete_date_seq,date ),origin="1970-01-01"),value=NA),by=id]
result <- rbind(z,missing_dates,fill=TRUE)[order(id,date)]
result[, year:=substr(date,1,4)]
result[, mean(value),by=c("id","year")]
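For reference, a rough sketch of the foverlaps()-based weighted-average idea mentioned above (not the answer's actual code; it assumes non-overlapping intervals with exclusive end dates as in the question, and it does not implement the NA-for-incomplete-years constraint):
library(data.table)
# one row per calendar year, with an exclusive end date to match the intervals in x
yrs <- data.table(year = years[1]:years[2])
yrs[, `:=`(ystart = as.Date(paste0(year, "-01-01")),
           yend   = as.Date(paste0(year + 1, "-01-01")))]
setkey(yrs, ystart, yend)
xi <- copy(x)[, `:=`(istart = start_date, iend = end_date)]
# match each observation interval to every calendar year it overlaps
ov <- foverlaps(xi, yrs, by.x = c("istart", "iend"))
# days of overlap become the weights of the time-weighted mean
ov[, days := as.numeric(pmin(iend, yend) - pmax(istart, ystart))]
ov[, .(annual_avg = weighted.mean(value, days)), by = .(id, year)]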

R carry forward last observation n times by group

This one is driving me nuts. I have a large data.table with monthly stock data. Every June I assign every stock to one of 10 portfolios based on an accounting variable. I would like to carry the assigned portfolio variable forward for the next 11 months, until each stock is assigned to a new portfolio (1 to 10) in June of the following year. na.locf is basically what I'm looking for, but I am running into 2 issues:
Some stocks lack sufficient accounting data in the next year, so they shouldn't be assigned to a portfolio in that year (i.e. the portfolio variable should stay NA). But of course na.locf keeps carrying the portfolio number forward until there is a new one.
Some stocks may get delisted after e.g. 3 months, so they don't have another 11 months of data.
That's why I'm looking for code that carries the last observation forward a maximum of 11 times, until June of the next year (when there is a new portfolio number).
Here's the na.locf solution right now, with the 2 issues described above (PERMNO is the stock identifier):
COMPUSTAT_CRSP_IBES1[,
Portfolio_Monthly := na.locf(Portfolio_Monthly,
na.rm = FALSE),
by = PERMNO]
I tried to use rep but that didn't work at all:
COMPUSTAT_CRSP_IBES1[,
Portfolio_Monthly := if_else(!is.na(Portfolio_Monthly),
rep(Portfolio_Monthly, 11),
NA),
by = PERMNO]
Thanks for any hints!
You can create and/or use the fiscal year (June to May) as one of the group-by criteria in your na.locf solution:
#show data before calculations
data.frame(dat)
#demo FY calculation
dat[, FY := year(MONTH) + as.numeric(month(MONTH) >= 6)]
#actual code
dat[, Portfolio_Monthly := zoo::na.locf(Portfolio_Monthly, na.rm=FALSE),
by=list(PERMNO, year(MONTH) + as.numeric(month(MONTH) >= 6))]
#show results
data.frame(dat)
sample data:
library(data.table)
set.seed(0L)
dat <- data.table(PERMNO=rep(LETTERS[1:12], each=20),
MONTH=rep(seq(as.Date("2000-01-01"), by="1 month", length.out=20), 12),
Portfolio_Monthly=NA_real_)
for (i in sample(1:dat[,.N], 5)) {
set(dat, i, 3L, rnorm(1))
}
setorder(dat, PERMNO, MONTH)
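Alternatively, if you would rather cap the carry-forward at 11 rows explicitly instead of grouping by fiscal year, here is a minimal sketch (not from the answer above; Portfolio_Monthly_filled is just an illustrative name, and it assumes dat is sorted by PERMNO and MONTH as in the sample data):
library(data.table)
# group on the running count of non-NA assignments within each PERMNO,
# then keep each assignment for at most 11 following rows
dat[, grp := cumsum(!is.na(Portfolio_Monthly)), by = PERMNO]
dat[, Portfolio_Monthly_filled :=
      ifelse(seq_len(.N) - 1L <= 11L, Portfolio_Monthly[1L], NA_real_),
    by = .(PERMNO, grp)]
dat[, grp := NULL]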

R Rolling average from irregular time series

I've encountered this problem several times over the years, so maybe I'm just misunderstanding something or somehow being silly about this. I've hit a weird problem when doing a rolling average on an irregular time series. A good overview of the available methods in packages, plus a simple script, is here: Calculating moving average in R
I may be misreading some of them, but I see an issue in dealing with irregular time series. For example, the common method of rolling means in the zoo package requires a unique value for each date. But then this is not really a rolling average; it is a rolling average of per-time-unit averages, and time units with fewer data points will have comparably more influence on the average than ones with more.
A true moving average seems to me to need to work not with aggregates, but with distributions for each calculated point.
Given that I have the following data frame of irregular values, how can I best create a moving average measure for each of the values?
df <- data.frame(year = c(rep(2000,3), rep(2001,1), rep(2004,4), rep(2005,3),
                          rep(2006,3), rep(2007,1), rep(2008,2), rep(2009,6), rep(2010,8)),
                 value1 = rnorm(31), value2 = rnorm(31), value3 = rnorm(31))
I found an easy way to do it via subsetting that I'll post as an initial answer, but it works only in limited circumstances and needs to be customized each time. I'm wondering what a general solution would be. Also, if anyone is able to comment on the practice of using averages of averages vs. averages of distributions in rolling-mean calculations, that would be extra helpful. Thanks!
Calculations within a sliding or rolling window of an irregular time series can be solved by data.table's ability to aggregate in a non-equi join.
There are many similar questions, e.g., r calculating rolling average with window based on value (not number of rows or date/time variable) or Rolling regression on irregular time series.
However, this question is different and thus deserves an answer on its own. From OP's own answer it can be concluded that the OP is looking for a centred rolling window. In addition, the rolling mean is to be computed for several columns.
library(data.table)
cols <- c("value2", "value3")
setDT(df)[SJ(year = (min(year) + 2):(max(year) - 2))[, c("start", "end") := .(year - 2, year + 2)],
on = .(year >= start, year < end),
c(.(year = i.year), lapply(.SD, mean)), .SDcols = cols, by = .EACHI][, -(1:2)]
year value2 value3
1: 2002 0.57494219 -0.53001134
2: 2003 0.33925292 0.75541896
3: 2004 -0.05834453 0.23987209
4: 2005 0.17031099 0.13074666
5: 2006 0.05272739 0.09297215
6: 2007 -0.12935805 -0.38780964
7: 2008 0.19716437 -0.11587017
The result is identical to OP's own result rmeans.
Data
set.seed(123) # ensure reproducible sample data
df <- data.frame(
year = rep(2000:2010, c(3, 1, 0, 0, 4, 3, 3, 1, 2, 6, 8)),
value1 = rnorm(31), value2 = rnorm(31), value3 = rnorm(31))
So here is the simple subsetting approach I came up with. It could be helpful if anyone runs into the same issue:
df <- data.frame(year = c(rep(2000,3), rep(2001,1), rep(2004,4), rep(2005,3),
                          rep(2006,3), rep(2007,1), rep(2008,2), rep(2009,6), rep(2010,8)),
                 value1 = rnorm(31), value2 = rnorm(31), value3 = rnorm(31))
rmeans <- data.frame()
for (i in (min(df$year) + 2):(max(df$year) - 2)) {
  rmeans <- rbind(rmeans,
                  data.frame(year = i,
                             as.data.frame.list(colMeans(df[df$year >= (i - 2) & df$year < (i + 2), -c(1, 2)]))))
}
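A slightly cleaner variant of the same subsetting idea (a sketch only; it avoids growing the data frame inside the loop by using sapply()):
years_mid <- (min(df$year) + 2):(max(df$year) - 2)
rmeans2 <- data.frame(year = years_mid,
                      t(sapply(years_mid, function(i)
                        colMeans(df[df$year >= (i - 2) & df$year < (i + 2), -c(1, 2)]))))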
