Rolling mean with different window lengths in R

How can I calculate a two-month rolling mean if the two-month windows are not of equal length? Preferably with data.table. Sample code:
set.seed(24)
test <- data.table(x = rnorm(762),time=seq(as.Date("1988/03/15"), as.Date("1990/04/15"), "day"))
So here, the first mean would cover 1988/03/15 through 1988/04/30, the second 1988/04/01 through 1988/05/31, and so on. The number of observations available in each month can differ, and that is intentional.

Add a yearmon column and then summarize the sum and length of x by yearmon.
Finally divide a rolling sum over x by a rolling sum over the length N.
library(data.table)
library(zoo)
Means <- test[, yearmon := as.yearmon(time)][
, list(x = sum(x), N = .N), by = "yearmon"][
, list(yearmon, mean = rollsumr(x, 2, fill = NA) / rollsumr(N, 2, fill = NA))]
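As a quick sanity check (my addition, not part of the answer): the second entry of Means covers the first two-month window, 1988-03-15 through 1988-04-30, so it should match a plain mean over those dates.
# sanity check: compare the second rolling mean with a direct mean over the window
test[time <= as.Date("1988-04-30"), mean(x)]
Means$mean[2]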
Alternatively, convert test to a zoo object, sum x and the count n by yearmon, compute the rolling sum of both x and n, and divide, giving a zoo object with the year/months and means. See ?fortify.zoo if you would like to convert that to a data frame.
z <- cbind(x = read.zoo(test, index = "time"), n = 1)
zym <- aggregate(z, as.yearmon, sum)
transform(rollsumr(zym, 2), mean = x / n)
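For instance, a minimal sketch of that conversion to a data frame (my addition):
# convert the rolling sums to a data frame with fortify.zoo, then add the means
zz <- rollsumr(zym, 2)
out <- fortify.zoo(zz)   # columns: Index (yearmon), x, n
out$mean <- out$x / out$n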

Related

Memory efficient ways of merging a vector with a large data.table to perform calculations (R)

I have a dataset with year-based data predicted by multiple models, in data.table format.
library(data.table)
nYears = 20 # real data: 110
nMod = 3 # real data: ~ 100
nGrp = 45
dataset <- data.table(
  group_code = rep(seq(1:nGrp), times = 3 * nYears),
  Year = rep(seq(1:nYears), each = nGrp),
  value = rnorm(2700, mean = 10, sd = 2),
  var1 = rep(rnorm(nGrp, mean = nMod, sd = 1), times = nMod * nYears),
  var2 = rep(rnorm(nGrp, mean = 1.5, sd = 0.5), times = nMod * nYears),
  model = as.character(rep(seq(from = 1, to = nMod), each = nGrp * nYears))
)
setkey(dataset, Year, model)
I need to perform a set of calculations on this dataset based on a vector named x, which has length 1001 and consists of seq(-2, 8, by = 0.01).
To do so, I created a new data.table (dt) with repeated copies of dataset in order to merge in the vector x:
dt <- dataset[, lapply(.SD, function(x) rep(x, 1001))]
dt[, x := rep(round(seq(-2, 8, by=0.01), 2), each= nYears*nGrp*nMod) ]
Since my original dataset includes hundreds of models, this operation is not memory efficient.
The most important operation I need involves generating the normal density of x, with mean = var1 and sd = var2, by group_code, Year and model. For example:
# key computation
dt[, norm_dist := dnorm(x, var1, var2), by = .(group_code, Year, model)]
This last operation is quite fast on my desktop. However, I have other operations to perform that require subsetting the data.table and are highly RAM-consuming. An example:
dt[x %between% c(2, 5.99), dt2 := rep_len(rev(dt[x %between% c(-2, 1.99)]$value), length.out = .N), by = .(Year, model)]
The following error pops up:
Error: cannot allocate vector of size 1.3 Gb
I believe the problem in this specific step is related to the subset and the rev() function.
Nevertheless, the approach I am using to perform the calculations based on the vector "x" does not seem appropriate, starting from the moment I merged the dataset with that vector.
I was hoping someone could show me how to make this code more efficient, since the original dataset contains a considerable number of models, which greatly increases its size.
Thank you!
I think this part of the code needs to be clearer
dt[x %between% c(2, 5.99), dt2 := rep_len(rev(dt[x %between% c(-2, 1.99)]$value), length.out = .N), by = .(Year, model)]
as it is a bit of a black box to me, especially because this double subsetting is where your problem is generated.
The two subsets, x %between% c(2, 5.99) and dt[x %between% c(-2, 1.99)], should always select the same positions in all your cases. You should take advantage of that to make your code more efficient.
Try something like this to make things a bit clearer:
by_YM <- split(dt, by = c("Year", "model"))
ind1 <- which(by_YM[[1]][["x"]] %between% c(2, 5.99))
ind2 <- which(by_YM[[1]][["x"]] %between% c(-2, 1.99))
for (i in 1:length(by_YM)) {
  dt_i <- by_YM[[i]]
  # val1 <- rep_len(rev(dt_i$value[ind2]), length.out = length(ind1)) # val1 is equal to val, no need for rep_len
  val <- rev(dt_i$value[ind2])
  by_YM[[i]] <- dt_i[ind1, dt2 := val]
}
However, our dt2 columns are not equal; since I am not sure what the final result should look like, I cannot debug it further.
dt2_a <- dt[Year == 20 & model == 3, dt2]
dt2_b <- by_YM[["20.3"]][, dt2]
test <- cbind(dt2_a, dt2_b)
The second version is also much faster.
library(microbenchmark)
microbenchmark(
  "new_code" = {
    by_YM <- split(dt, by = c("Year", "model"))
    ind1 <- which(by_YM[[1]][["x"]] %between% c(2, 5.99))
    ind2 <- which(by_YM[[1]][["x"]] %between% c(-2, 1.99))
    for (i in 1:length(by_YM)) {
      dt_i <- by_YM[[i]]
      val1 <- rep_len(rev(dt_i$value[ind2]), length.out = length(ind1)) # val1 is equal to val, no need for rep_len
      val <- rev(dt_i$value[ind2])
      by_YM[[i]] <- dt_i[ind1, dt2 := val]
    }
  },
  "old_code" = dt[x %between% c(2, 5.99),
                  dt2 := rep_len(rev(dt[x %between% c(-2, 1.99)]$value), length.out = .N),
                  by = .(Year, model)],
  times = 5)
Unit: milliseconds
     expr      min        lq      mean    median        uq       max neval cld
 new_code  155.426  156.4916  200.6587  185.0347  188.9436  317.3977     5  a
 old_code 1290.909 1299.8570 1398.6866 1370.4526 1471.0569 1561.1574     5   b
Give it a try and good luck
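Separately from the answer above, a hedged sketch for the memory side of the question: since x is the same seq(-2, 8, by = 0.01) grid for every combination, the dnorm step can be done group by group without first replicating every column of dataset 1001 times. This is only an idea to adapt, not part of the original answer:
# sketch: build the x grid inside the grouped call instead of storing a
# 1001-fold copy of every column of `dataset`
x_grid <- round(seq(-2, 8, by = 0.01), 2)
norm_only <- dataset[, .(x = x_grid, norm_dist = dnorm(x_grid, var1[1], var2[1])),
                     by = .(group_code, Year, model)]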

How can I use rollapply with a 5 month window?

I noticed this in the documentation of rollapply() to roll by 3 days:
## rolling mean by time window (e.g., 3 days) rather than
## by number of observations (e.g., when these are unequally spaced):
#
## - test data
tt <- as.Date("2000-01-01") + c(1, 2, 5, 6, 7, 8, 10)
z <- zoo(seq_along(tt), tt)
## - fill it out to a daily series, zm, using NAs
## using a zero width zoo series g on a grid
g <- zoo(, seq(start(z), end(z), "day"))
zm <- merge(z, g)
## - 3-day rolling mean
rollapply(zm, 3, mean, na.rm = TRUE, fill = NA)
Suppose I have the following data:
data.zoo <- read.zoo(
data.frame(
date = sample(seq(as.Date('2001-04-12'), as.Date("2019-04-05"), by="day"), 600),
val = runif(1:600),
val2 = runif(1:600)
))
Is it possible to somehow use rollapply() with a 5 month rolling window to calculate the rolling mean of val? The problem with a 5-month rolling window is that the number of days in a month varies...
NOTE: I would prefer a base-R solution but other libraries would be interesting to see
Since width can be a vector of widths, one for each row of the input, we can simply compute the number of days between each date and 5 months prior and use those numbers for the width vector:
library(zoo)
ym <- as.yearmon(time(data.zoo))
w <- as.Date(ym) - as.Date(ym - 5/12)
r <- rollapplyr(data.zoo, w, mean, fill = NA)
Alternatively, we could compute w like this with lubridate.
library(lubridate)
w <- time(data.zoo) - (time(data.zoo) %m-% months(5))
Update
If there can be missing dates, then
library(lubridate)
w <- sapply(time(data.zoo), function(x)
  length(intersect(seq(x %m-% months(5), x, "day"), time(data.zoo))))
or repeat this, replacing %m-% months(5) with subtract5m, which does not use additional packages:
subtract5m <- function(x) {
if (length(x) == 1) seq(x, length = 2, by = "-5 month")[2]
else as.Date(sapply(x, subtract5m))
}
w <- sapply(time(data.zoo), function(x)
length(intersect(seq(subtract5m(x), x, "day"), time(data.zoo))))
Note that, due to the ambiguity in the definition of "5 months ago", the various computations for w may vary slightly, based on slightly different assumptions.
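For instance (a small illustration of that ambiguity, not part of the original answer), the yearmon and lubridate conventions pick different anchor dates for "5 months before" a month end:
d <- as.Date("2019-03-31")
as.Date(as.yearmon(d) - 5/12)   # first day of the month five months back: 2018-10-01
d %m-% months(5)                # same day of the month five months back:  2018-10-31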
Improving on G. Grothendieck's ideas, I went with:
ym <- as.yearmon(time(data.zoo))
ym.cutoff.ideal <- ym - 5/12
ym.cutoff.closest.to.ideal <- as.yearmon(time(data.zoo)[findInterval(as.Date(ym.cutoff.ideal), as.Date(ym)) + 1])
w <- time(data.zoo) - as.Date(ym.cutoff.closest.to.ideal) + 1
r <- rollapplyr(data.zoo, w, mean, fill = NA)
It looks like it is working correctly...

How to interpolate missing values in a time series, limited by the number of sequential NAs (R)?

I have missing values in a time series of dates. For example:
set.seed(101)
df <- data.frame(DATE = as.Date(c('2012-01-01', '2012-01-02',
'2012-01-03', '2012-01-05', '2012-01-06', '2012-01-15', '2012-01-18',
'2012-01-19', '2012-01-20', '2012-01-22')),
VALUE = rnorm(10, mean = 5, sd = 2))
How can I write a function that will fill in all the missing dates between the first and last date (i.e. 2012-01-01 and 2012-01-22), then use interpolation (linear and smoothing spline) to fill the missing values, but without interpolating across more than 3 sequential missing values (i.e. no interpolation between 2012-01-06 and 2012-01-15)?
The function will be applied to a very large data frame. I have been able to write a function that uses linear interpolation to fill all missing values between two dates (see below), but I cannot figure out how to stop it from interpolating across long stretches of missing values.
interpolate.V <- function(df){
  # sort data by time
  df <- df[order(df$DATE),]
  # linearly interpolate VALUE for all missing DATEs
  temp <- with(df, data.frame(approx(DATE, VALUE, xout = seq(DATE[1], DATE[nrow(df)], "day"))))
  colnames(temp) <- c("DATE", "VALUE_INTERPOLATED")
  temp$ST_ID <- df$ST_ID[1]
  out <- merge(df, temp, all = T)
  rm(temp)
  return(out)
}
Any help will be greatly appreciated!
Thanks
Function that adds rows for all missing dates:
date.range <- function(sub){
  sub$DATE <- as.Date(sub$DATE)
  DATE <- seq.Date(min(sub$DATE), max(sub$DATE), by = "day")
  all.dates <- data.frame(DATE)
  out <- merge(all.dates, sub, all = T)
  return(out)
}
Use na.approx or na.spline from the zoo package with the maxgap argument:
interpolate.zoo <- function(df){
  df$VALUE_INT <- na.approx(df$VALUE, maxgap = 3, na.rm = F)
  return(df)
}
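Putting the two pieces together, a minimal sketch (my addition, using the date.range helper from the question; the na.spline line is the analogous spline variant):
library(zoo)
df.filled <- date.range(df)              # add rows for the missing dates
df.filled <- interpolate.zoo(df.filled)  # linear interpolation, gaps of at most 3 NAs
# spline interpolation with the same gap limit
df.filled$VALUE_SPLINE <- na.spline(df.filled$VALUE, maxgap = 3, na.rm = FALSE)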

Faster way to subset a data.table instead of a for loop in R

I have a data.table in R (you'll need the data.table package installed), generated with X and Y coordinates and random data values drawn from both normal and uniform distributions. The coordinates represent points on a 2000x1600 array that has to be divided into 16 smaller "sectors", each 500x400. Each sector needs the mean of its normal-distribution values divided by the squared minimum of its uniform-distribution values. Using a provided function startstop, I created two variables, x and y, that hold the coordinates of the 16 sectors, together with a function that calculates the number for each sector.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm =rnorm(1600*2000), Unif = runif(1600*2000))
sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)  # sCalc (not shown) computes mean(Norm) / min(Unif)^2 for one sector
  }
  return(sector)
}
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
    dup <- anyDuplicated(c(start, stop))
    stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
x <- startstop(2000)
y <- startstop(1600, T)
sectorNos <- sectorCalc(x,y,DT)
The startstop function isn't really the issue, but I need a faster way to subset the data.table, so some modifications have to be made to the sectorCalc function. The for loop was the best way I could think of, but I don't have much experience with data.table. Any ideas on a faster method of breaking up the data.table?
A solution using not only the package data.table but also the cut function to build the interval "groups":
# Create your test data
library(data.table)
set.seed(123) # make random numbers reproducible to allow comparison of different answers
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm =rnorm(1600*2000), Unif = runif(1600*2000))
# Calculate the sector by cutting the X and Y values into groups defined by the interval breaks.
# The intervals should be seq(0, 2000, by = 500) for X and seq(0, 1600, by = 400) for Y; the lower
# bound is one less since it is not included in the interval (see the help for the cut function).
DT[, x.sect := cut(DT[, X], c(0, 499, 1000, 1500, 2000), dig.lab = 10)]
DT[, y.sect := cut(DT[, Y], c(0, 399, 800, 1200, 1600), dig.lab = 10)]
# Now calculate per group (calculation logic "stolen" from the working answer of user "Symbolix")
DT[, .(sect = mean(Norm)/min(Unif)^2), by = .(x.sect, y.sect)]
Please note: I think the sizes of the first and second intervals are wrong in the original solution (499 instead of 500 for x and 399 instead of 400 for y), so I could not use the seq function to reproduce your desired intervals and had to enumerate the interval breaks manually.
Edit 1: I have replaced the original code that adds the x.sect and y.sect columns by an improved solution that adds columns by reference (:=).
Edit 2: If you want to order the result you have (at least) two options:
# "Chaining" (output is input of next)
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(x.sect, y.sect),]
# Or: Use the "keyby" param instead of "by"
DT[, .(sect = mean(Norm)/min(Unif)^2), keyby=.(x.sect, y.sect)]
Edit 3: Added dig.lab=10 param to cut function in code above to avoid scientific notation of the interval breaks.
To replace your sectorCalc function, I think we can make use of data.table joins. As you are looping over each row of sector, you just have to create a data.table of your sector boundaries to join onto, specify a column to join on (here I'm using key_col), and specify a 'group' variable for each row, which lets us do the calculation at the end:
x <- startstop(2000)
y <- startstop(1600, T)

## copy the original DT
dt <- copy(DT)

dt_xy <- data.table(x_1 = x[[1]],
                    x_2 = x[[2]],
                    y_1 = y[[1]],
                    y_2 = y[[2]])

dt[, key_col := 1]
dt_xy[, `:=`(key_col = 1, xy_grp = seq(1, .N))]

## Use a data.table join, allowing cartesian, then filter out results.
dt_res <- dt[dt_xy, on = "key_col", allow.cartesian = T][x_1 <= X & X <= x_2 & y_1 <= Y & Y <= y_2]

## calculate 'sect' as required.
dt_sect <- dt_res[, .(sect = mean(Norm)/min(Unif)^2), by = .(xy_grp)]
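A further compact variant (my sketch, not from either answer, and assuming exact 500 x 400 sectors rather than the slightly uneven breaks produced by startstop): because the grid is regular, the sector of each point can be computed arithmetically, with no subsetting or join at all.
# sketch: derive sector indices by integer division over an exact 500 x 400 grid
DT[, .(sect = mean(Norm)/min(Unif)^2),
   by = .(x.sect = (X - 1) %/% 500L, y.sect = (Y - 1) %/% 400L)]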

Rolling Mean from fixed starting point (and by Group)

Suppose you have the following data frame:
set.seed(100)
Pts <- floor(runif(20, 0, 10))
Individual <- c(rep("Adam",5), rep("Ben",5), rep("Charlie",5), rep("Daisy",5))
Date <- c(rep(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), "days"), 4))
RollMean <- rep(NA,20)
df <- data.frame(Pts, Individual, Date, RollMean)
I would like to calculate RollMean for each row as the mean of Pts by Individual, but only including entries between the earliest date and the date on the current row.
For example:
df$RollMean[3] = (5+2+3)/3
df$RollMean[4] = (5+2+3+0)/4
df$RollMean[7] = (8+4)/2
I have tried using functions such as SMA() from the TTR package, applied by group with ave, such as:
df$RollMean <- ave(df$Pts, df$Individual, FUN= function(x) SMA(x, n))
but there I have to pre-specify n, which changes depending on which row R is dealing with.
What code can I use to generate the Rolling Means I am looking for?
You can try:
library(data.table)
setDT(df)[,cumsum(Pts[order(Date)])/seq(.N), Individual]
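If you want the running mean stored in the table rather than just printed, a small variant of the same idea (my addition; it writes to a new column to leave the RollMean placeholder untouched):
# same idea, assigning the result by reference into a new column
setDT(df)[order(Date), RollMean2 := cumsum(Pts) / seq_len(.N), by = Individual]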
Here are a few alternatives:
1) This does not use any packages:
transform(df, Rollmean = ave(Pts, Individual, FUN = function(x) cumsum(x) / seq_along(x)))
2) An alternative is to use zoo's rollapplyr with a vector of widths:
library(zoo)
Rollmean <- function(x) rollapplyr(x, seq_along(x), mean)
transform(df, Rollmean = ave(Pts, Individual, FUN = Rollmean))
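A quick check (my addition): the base-R and zoo versions produce the same values.
a <- ave(df$Pts, df$Individual, FUN = function(x) cumsum(x) / seq_along(x))
b <- ave(df$Pts, df$Individual, FUN = Rollmean)
all.equal(a, b)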
