Suppose you have the following data frame:
set.seed(100)
Pts <- floor(runif(20, 0, 10))
Individual <- c(rep("Adam",5), rep("Ben",5), rep("Charlie",5), rep("Daisy",5))
Date <- c(rep(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), "days"), 4))
RollMean <- rep(NA,20)
df <- data.frame(Pts, Individual, Date, RollMean)
I would like to calculate, for each row, the mean of Pts by Individual, including only the entries between that individual's earliest date and the date on the current row, and store the result in RollMean.
For example:
df$RollMean[3] = (5+2+3)/3
df$RollMean[4] = (5+2+3+0)/4
df$RollMean[7] = (8+4)/2
I have tried using functions such as SMA() from the TTR package and then using ave to sort by Group, such as:
df$RollMean <- ave(df$Pts, df$Individual, FUN= function(x) SMA(x, n))
but there I have to pre-specify n which changes based on which row R is dealing with.
What code can I use to generate the Rolling Means I am looking for?
You can try:
library(data.table)
setDT(df)[,cumsum(Pts[order(Date)])/seq(.N), Individual]
Here are a few alternatives:
1) This does not use any packages:
transform(df, Rollmean = ave(Pts, Individual, FUN = function(x) cumsum(x) / seq_along(x)))
2) An alternative is to use zoo's rollapplyr with a vector of widths:
library(zoo)
Rollmean <- function(x) rollapplyr(x, seq_along(x), mean)
transform(df, Rollmean = ave(Pts, Individual, FUN = Rollmean))
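As a quick sanity check of the cumsum(x) / seq_along(x) idea, here is a minimal sketch on made-up data. The toy data frame and its values are hypothetical, not the df from the question, and the sketch assumes the rows are already sorted by date within each individual.

```r
# Hypothetical toy data, already ordered by date within each individual
toy <- data.frame(
  Individual = rep(c("A", "B"), each = 3),
  Pts = c(3, 2, 5, 4, 8, 3)
)

# Expanding mean per group: running total divided by running count
toy$RollMean <- ave(toy$Pts, toy$Individual,
                    FUN = function(x) cumsum(x) / seq_along(x))

toy$RollMean
# A: 3, 2.5, 3.333...   B: 4, 6, 5
```

If the rows are not in date order, sort with df[order(df$Individual, df$Date), ] first, since cumsum works in row order.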
Related
reprod:
df1 <- data.frame(X = c(0:9), Y = c(10:19))
df2 <- data.frame(X = c(0:9), Y = c(10:19))
df3 <- data.frame(X = c(0:9), Y = c(10:19))
list_of_df <- list(A = df1, B = df2, C = df3)
list_of_df
I'm trying to apply the rollmean function from zoo to every 'Y' column in this list of dataframes.
I've tried lapply with no success. It seems that no matter which way I spin it, there is no way to get around specifying the dataframe you want to apply it to at some point.
This does one of the dataframes
roll_mean <- rollmean(list_of_df$A, 2)
roll_mean
obviously this doesn't work:
roll_mean1 <- rollmean(list_of_df, 2)
roll_mean1
I also tried this:
# subset (may not be necessary)
Sub1 <- lapply(list_of_df, "[", 2)
roll_mean1 <- rollmean(Sub1, 2)
roll_mean1
# there doesn't seem to be a way to do it without having to
# specify the particular dataframe in the rollmean function
lapply(list_of_df), function(x) rollmean(list_of_df, 2))
for loop? also no success
For (i in list_of_df) {roll_mean1 <- rollmean(Sub1, 2)
Exp
}
Stating the obvious but I'm very new to coding in general and would appreciate some pointers.
It has occurred to me that even if it did work, the column that has been averaged would be one value longer than the rest of the dataframe; how would I get around that?
The question at one point says that it wants to perform the rollmean only on Y, and at another point says that roll_mean <- rollmean(list_of_df$A, 2) works, but that call averages all columns.
1) Assuming that you want to apply rollmean to all columns:
Use lapply like this:
lapply(list_of_df, rollmean, 2)
This also works:
for(i in seq_along(list_of_df)) list_of_df[[i]] <- rollmean(list_of_df[[i]], 2)
2) If you only want to apply it to the Y column:
lapply(list_of_df, transform, Y = rollmean(Y, 2, fill = NA))
or
for(i in seq_along(list_of_df)) {
list_of_df[[i]]$Y <- rollmean(list_of_df[[i]]$Y, 2, fill = NA)
}
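For anyone who wants to try the lapply pattern without installing zoo, here is a rough base-R sketch. Note the swap: it uses stats::filter instead of zoo's rollmean, and roll2 is a hypothetical helper name. It also shows how padding with NA keeps the averaged column the same length as the rest of the data frame.

```r
df1 <- data.frame(X = 0:9, Y = 10:19)
list_of_df <- list(A = df1, B = df1, C = df1)

# Hypothetical helper: right-aligned 2-point moving average.
# stats::filter with sides = 1 leaves NA where the window is incomplete,
# so the result has the same length as the input.
roll2 <- function(x) as.numeric(stats::filter(x, rep(1 / 2, 2), sides = 1))

# Replace only the Y column in every data frame of the list
result <- lapply(list_of_df, function(d) {
  d$Y <- roll2(d$Y)
  d
})

result$A$Y
# NA 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5
```

The anonymous function receives each data frame in turn as d, which is why no particular data frame has to be named inside the call.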
I noticed this in the documentation of rollapply() to roll by 3 days:
## rolling mean by time window (e.g., 3 days) rather than
## by number of observations (e.g., when these are unequally spaced):
#
## - test data
tt <- as.Date("2000-01-01") + c(1, 2, 5, 6, 7, 8, 10)
z <- zoo(seq_along(tt), tt)
## - fill it out to a daily series, zm, using NAs
## using a zero width zoo series g on a grid
g <- zoo(, seq(start(z), end(z), "day"))
zm <- merge(z, g)
## - 3-day rolling mean
rollapply(zm, 3, mean, na.rm = TRUE, fill = NA)
Suppose I have the following data:
data.zoo <- read.zoo(
data.frame(
date = sample(seq(as.Date('2001-04-12'), as.Date("2019-04-05"), by="day"), 600),
val = runif(1:600),
val2 = runif(1:600)
))
Is it possible to somehow use rollapply() with a 5 month rolling window to calculate the rolling mean of val? The problem with a 5-month rolling window is that the number of days in a month varies...
NOTE: I would prefer a base-R solution but other libraries would be interesting to see
Since width can be a vector of widths, one for each row of the input, we can simply compute the number of days between each date and 5 months prior and use those numbers for the width vector:
library(zoo)
ym <- as.yearmon(time(data.zoo))
w <- as.Date(ym) - as.Date(ym - 5/12)
r <- rollapplyr(data.zoo, w, mean, fill = NA)
Alternately we could write w like this with lubridate.
library(lubridate)
w <- time(data.zoo) - (time(data.zoo) %m-% months(5))
Update
If there can be missing dates then
library(lubridate)
w <- sapply(time(data.zoo), function(x)
length(intersect(seq(x %m-% months(5), x, "day"), time(data.zoo))))
or repeat this replacing %m-% months(5) with subtract5m which does not use additional packages:
subtract5m <- function(x) {
if (length(x) == 1) seq(x, length = 2, by = "-5 month")[2]
else as.Date(sapply(x, subtract5m), origin = "1970-01-01")  # origin is required for numeric input before R 4.3
}
w <- sapply(time(data.zoo), function(x)
length(intersect(seq(subtract5m(x), x, "day"), time(data.zoo))))
Note that due to the ambiguity of the definition of 5 months ago the various computations for w may vary slightly based on slightly different assumptions.
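To make the "width varies per row" point concrete, here is a small base-R sketch that computes the number of days in a 5-month lookback for two dates. The dates are made up for illustration, and this only computes the widths; it does not call rollapplyr.

```r
# Hypothetical example dates (mid-month, so no end-of-month ambiguity)
dates <- as.Date(c("2019-07-15", "2019-12-15"))

widths <- vapply(seq_along(dates), function(i) {
  d <- dates[i]
  # the date five calendar months earlier, via seq()'s "by" argument
  start <- seq(d, length.out = 2, by = "-5 month")[2]
  as.numeric(d - start)
}, numeric(1))

widths
# 150 153
```

The two windows differ by three days because the first spans February, which is why a single fixed width cannot represent "5 months".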
Improving on G. Grothendieck's ideas I went with:
ym <- as.yearmon(time(data.zoo))
ym.cutoff.ideal <- ym - 5/12
ym.cutoff.closest.to.ideal <- as.yearmon(time(data.zoo)[findInterval(as.Date(ym.cutoff.ideal), as.Date(ym)) + 1])
w <- time(data.zoo) - as.Date(ym.cutoff.closest.to.ideal) + 1
r <- rollapplyr(data.zoo, w, mean, fill = NA)
It looks like it is working correctly...
How can I calculate a two-month rolling mean if the two-month windows are not of equal length? Preferably in data.table. Sample code:
set.seed(24)
test <- data.table(x = rnorm(762),time=seq(as.Date("1988/03/15"), as.Date("1990/04/15"), "day"))
So here, the first mean would be from 1988/03/15 until 1988/04/30, the second one from 1988/04/01 until 1988/05/31, and so on. The data availability in each month can be of different length, by intention.
Add a yearmon column and then summarize the sum and length of x by yearmon.
Finally divide a rolling sum over x by a rolling sum over the length N.
library(data.table)
library(zoo)
Means <- test[, yearmon := as.yearmon(time)][
, list(x = sum(x), N = .N), by = "yearmon"][
, list(yearmon, mean = rollsumr(x, 2, fill = NA) / rollsumr(N, 2, fill = NA))]
Alternately, convert test to a zoo object, sum x and the length by yearmon, calculate the rolling sum of both x and n, and divide, giving a zoo object with the year/months and means. See ?fortify.zoo if you would like to convert that to a data frame.
z <- cbind(x = read.zoo(test, index = "time"), n = 1)
zym <- aggregate(z, as.yearmon, sum)
transform(rollsumr(zym, 2), mean = x / n)
Note
Input used is:
set.seed(24)
test <- data.table(x = rnorm(762), time=seq(as.Date("1988/03/15"),
as.Date("1990/04/15"), "day"))
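The reason for dividing a rolling sum by a rolling count, rather than averaging monthly means, can be seen in a tiny base-R sketch. The numbers and the roll2sum helper are hypothetical, chosen so the months have different numbers of observations.

```r
# Hypothetical observations in three months of unequal length
x     <- c(1, 2, 3, 10, 20, 100)
month <- c("2020-01", "2020-01", "2020-01", "2020-02", "2020-02", "2020-03")

s <- as.vector(tapply(x, month, sum))     # monthly sums:   6 30 100
n <- as.vector(tapply(x, month, length))  # monthly counts: 3  2   1

# Hypothetical helper: current month plus previous month, NA for the first
roll2sum <- function(v) c(NA, v[-1] + v[-length(v)])

pooled_mean <- roll2sum(s) / roll2sum(n)
pooled_mean
# NA 7.2 43.333...
```

For January and February the pooled mean is 36 / 5 = 7.2, whereas the mean of the two monthly means (2 and 15) would be 8.5, so the two approaches disagree whenever the months have different numbers of rows.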
I have missing values in a time series of dates. For example:
set.seed(101)
df <- data.frame(DATE = as.Date(c('2012-01-01', '2012-01-02',
'2012-01-03', '2012-01-05', '2012-01-06', '2012-01-15', '2012-01-18',
'2012-01-19', '2012-01-20', '2012-01-22')),
VALUE = rnorm(10, mean = 5, sd = 2))
How can I write a function that will fill all the missing dates between the first and last date (i.e. 2012-01-01 and 2012-01-22), then use interpolation (linear and smoothing spline) to fill the missing values, but not more than 3 sequential missing values (i.e. no interpolation between 2012-01-06 and 2012-01-15)?
The function will be applied to a very large dataframe. I have been able to write a function that uses linear interpolation to fill all missing values between two dates (see below), but I cannot figure out how to stop it from interpolating long stretches of missing values.
interpolate.V <- function(df){
  # sort data by time
  df <- df[order(df$DATE),]
  # linearly interpolate VALUE for all missing DATEs
  temp <- with(df, data.frame(approx(DATE, VALUE, xout = seq(DATE[1],
                                     DATE[nrow(df)], "day"))))
  colnames(temp) <- c("DATE", "VALUE_INTERPOLATED")
  temp$ST_ID <- df$ST_ID[1]
  out <- merge(df, temp, all = TRUE)
  rm(temp)
  return(out)
}
Any help will be greatly appreciated!
Thanks
Function that adds rows for all missing dates:
date.range <- function(sub){
  sub$DATE <- as.Date(sub$DATE)
  DATE <- seq.Date(min(sub$DATE), max(sub$DATE), by = "day")
  all.dates <- data.frame(DATE)
  out <- merge(all.dates, sub, all = TRUE)
  return(out)
}
Use na.approx or na.spline from the zoo package with the maxgap argument:
interpolate.zoo <- function(df){
  df$VALUE_INT <- na.approx(df$VALUE, maxgap = 3, na.rm = FALSE)
  return(df)
}
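To show what limiting the gap length does, here is a rough base-R sketch that approximates the effect of maxgap = 3 using approx and rle. It is not how zoo implements it; the vector and variable names are hypothetical.

```r
# Hypothetical series: one short gap (1 NA) and one long gap (4 NAs)
value <- c(1, NA, 3, NA, NA, NA, NA, 8)
idx   <- seq_along(value)

# Step 1: linearly interpolate every missing value
filled <- approx(idx[!is.na(value)], value[!is.na(value)], xout = idx)$y

# Step 2: find runs of NA longer than 3 and blank them out again
runs     <- rle(is.na(value))
too_long <- rep(runs$values & runs$lengths > 3, runs$lengths)
filled[too_long] <- NA

filled
# 1 2 3 NA NA NA NA 8
```

The single missing value is filled, while the run of four stays missing because it exceeds the limit of three.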
I am trying to sum values that are greater than 70 in several different data sets. I believe that aggregate can do this but my research has not pointed to an obvious solution to obtaining the values that exceed seventy in my data sets. I have first used aggregate to get the daily max values and put these values into the data frame called yearmaxs. Here is my code and what I have tried:
# number of times O3 > 70 in a year per site
Sys.setenv(TZ = "UTC")
library(openair)
library(lubridate)
filedir <- "C:/Users/dfmcg/Documents/Thesisfiles/8hravg"
myfiles <- c(list.files(path = filedir))
paste(filedir, myfiles, sep = '/')
npsfiles <- c(paste(filedir, myfiles,sep = '/'))
for (i in npsfiles[22]) {
x <- substr(i,45,61)
y <- paste('C:/Users/dfmcg/Documents/Thesisfiles/exceedenceall', x, sep='/')
timeozone <- import(i, date="DATES", date.format = "%Y-%m-%d %H", header=TRUE, na.strings="NA")
overseventy <- c()
yearmaxs <- aggregate(rolling.O3new ~ format(as.Date(date)), timeozone, max)
colnames(yearmaxs) <- c("date", "daymax")
overseventy <- aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
colnames(overseventy) <- c("date", "daymax")
aggregate(daymax ~ format(as.Date(date), "%Y"), overseventy, sum)
}
I have also tried: sum > "70" and sum(daymax > "70").
My other idea at this point is to use a for loop to iterate through the values. I was hoping that I could use aggregate again to sum the values of interest. Any help at all would be greatly appreciated!
I think you want:
aggregate(daymax ~ format(as.Date(date)), yearmaxs, FUN = length,
subset = as.numeric(daymax) > 70)
Two things:
you need numerical comparison, so use as.numeric(daymax) > 70 not daymax > "70";
use the subset argument in aggregate.formula.
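A small self-contained sketch of both points, counting days above 70 per year. The yearmaxs values here are made up for illustration, and the year column is added only to keep the formula short.

```r
# Hypothetical daily maxima for two years
yearmaxs <- data.frame(
  date = as.Date(c("2014-06-01", "2014-06-02", "2014-07-01",
                   "2015-06-01", "2015-06-02")),
  daymax = c(65, 72, 80, 71, 50)
)
yearmaxs$year <- format(yearmaxs$date, "%Y")

# Numeric comparison in subset; FUN = length counts the rows that remain
counts <- aggregate(daymax ~ year, yearmaxs, FUN = length,
                    subset = daymax > 70)
counts
# 2014 has 2 days above 70, 2015 has 1
```

Comparing against the string "70" instead would compare character strings, so for example "100" > "70" is FALSE.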