Finding a more elegant way to aggregate hourly data to mean hourly data using zoo - R

I have a chunk of data logging temperatures from a few dozen devices every hour for over a year. The data are stored as a zoo object. I'd very much like to summarize those data by looking at the average value for each of the 24 hours in a day (1am, 2am, 3am, etc.), so that for each device I can see its average value across all the 1am times, all the 2am times, and so on. I can do this with a loop but sense that there must be a way to do this in zoo with an artful use of aggregate.zoo. Any help?
require(zoo)
# random hourly data over 30 days for five series
x <- matrix(rnorm(24 * 30 * 5), ncol = 5)
# Assign the hourly data real dates and times (one time stamp per row of x)
x.DateTime <- as.POSIXct("2014-01-01 0100", format = "%Y-%m-%d %H") +
  seq(0, by = 3600, length.out = 24 * 30)
# make a zoo object
x.zoo <- zoo(x, x.DateTime)
#plot(x.zoo)
# what I want:
# the average value for each series at 1am, 2am, 3am, etc. so that
# the dimensions of the output are 24 (hours) by 5 (series)
# If I were just working on x I might do something like:
res <- matrix(NA, ncol = 5, nrow = 24)
for (i in 1:nrow(res)) {
  res[i, ] <- apply(x[seq(i, nrow(x), by = 24), ], 2, mean)
}
res
# how can I avoid the loop and write an aggregate statement in zoo that
# will get me what I want?

Calculate the hour for each time point and then aggregate by that:
hr <- as.numeric(format(time(x.zoo), "%H"))
ag <- aggregate(x.zoo, hr, mean)
dim(ag)
## [1] 24 5
ADDED Alternatively, use hours from chron or hour from data.table:
library(chron)
ag <- aggregate(x.zoo, hours, mean)
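For completeness, here is a minimal sketch of the data.table variant mentioned above (assuming data.table is installed; ag.dt is just an illustrative name). Its hour() function extracts the hour of day directly from the POSIXct index:
library(data.table)
ag.dt <- aggregate(x.zoo, hour, mean)
dim(ag.dt)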

This is quite similar to the other answer but takes advantage of the fact that the by=... argument to aggregate.zoo(...) can be a function, which will be applied to time(x.zoo):
as.hour <- function(t) as.numeric(format(t,"%H"))
result <- aggregate(x.zoo,as.hour,mean)
identical(result,ag) # ag from G. Grothendieck answer
# [1] TRUE
Note that this produces a result identical to the other answer, but not the same as yours. This is because your dataset starts at 1:00am, not midnight, so your loop produces a matrix in which the first row corresponds to 1:00am and the last row corresponds to midnight. These solutions produce zoo objects in which the first row corresponds to midnight.
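If you prefer the rows ordered as in the loop output (1am first, midnight last), one possible sketch is to rotate the rows of the aggregated object; hr.order and res.like are illustrative names:
hr.order <- c(2:24, 1)                    # hours 1, 2, ..., 23, 0
res.like <- coredata(result)[hr.order, ]  # rows now run 1am, 2am, ..., midnight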

Related

Creating time stamps and repeat in the same column

I want to create a vector of time stamps consisting of 60 monthly dates and repeat the process n times. That means, if n = 2, the vector should contain 120 time stamps.
I am creating a single vector of time stamps this way:
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
To repeat it n times I am doing the following:
n <- 2
X <- data.frame(replicate(n, seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")))
Y <- stack(X)[,"values", drop=FALSE]
head(Y)
values
1 16071
2 16102
3 16130
4 16161
5 16191
6 16222
As you can see, the values are not in date format anymore. My question is: how do I retain the date format in the vector Y? Is there a smarter way to solve this problem?
Take a look at the zoo package; there is an old thread here https://stat.ethz.ch/pipermail/r-help//2010-March/233159.html
where they discuss much the same problem.
Either way, after loading zoo you can do
as.Date(16071)
and it will return the date in Date format. Hope this makes sense.
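Alternatively, a minimal sketch that avoids the numeric conversion altogether: rep() preserves the Date class, so the repeated vector can go straight into a data frame (Y2 is just an illustrative name):
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
n <- 2
Y2 <- data.frame(values = rep(t, n))
head(Y2)
class(Y2$values)
## [1] "Date"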

R - Next highest value in a time series

A relatively simple question, but one I can't seem to find any examples.
I have simple forex price data which is in a 2 column xts object called subx1:
Datetime, Price
2016-09-01 00:00:01, 1.11563
2016-09-01 00:00:01, 1.11564
2016-09-01 00:00:02, 1.11564
2016-09-01 00:00:03, 1.11565
... and so forth.
I'm trying to find the first time after 2pm when the price goes higher than the pre-2pm high, which is held in another object's column called daypeakxts$before2.High.
A sample of daypeakxts is:
Date, before2.High
2016-09-01, 1.11567
2016-09-02, 1.11987
This is a bad example of what I'm trying to do:
subxresult <- index(subx1, subx1$datetime > daypeakxts$before2.High)
... so I'm looking to discover a datetime for a price using a conditional statement with a day's value in another xts object.
You didn't provide enough data for a reproducible example, so I'm going to use some daily data that comes with the xts package.
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")
# Aggregate and find the high for each week
Week.High <- apply.weekly(x, function(x) max(x$High))
# Finding the pre-2pm high would be something like:
# Pre.2pm.High <- apply.daily(x["T00:00/T14:00"], function(x) max(x$High))
# Merge the period high with the original data, and
# fill NA with the last observation carried forward
y <- merge(x, Week.High, fill = na.locf)
# Lag the period high, so it aligns with the following period
y$Week.High <- lag(y$Week.High)
# Find the first instance where the next period's high
# is higher than the previous period's high
y$First.Higher <- apply.weekly(y, function(x) which(x$High > x$Week.High)[1])
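The asker's subx1 and daypeakxts objects aren't reproducible here, so the sketch below builds a small hypothetical one-day intraday series and applies the same idea at the daily level: compute the pre-2pm high, then search the post-2pm window for the first price above it. All object names are illustrative assumptions:
library(xts)
set.seed(1)
# one hypothetical day of 1-minute prices
tm <- seq(as.POSIXct("2016-09-01 00:00:00"), by = "1 min", length.out = 1440)
prices <- xts(1.115 + cumsum(rnorm(1440, sd = 1e-4)), tm)
colnames(prices) <- "Price"
# high before 2pm, then the post-2pm window
pre2.high <- as.numeric(max(prices["T00:00/T13:59"]))
post2 <- prices["T14:00/T23:59"]
# datetime of the first post-2pm price above the pre-2pm high (if any)
idx <- which(as.numeric(post2$Price) > pre2.high)
if (length(idx) > 0) index(post2)[idx[1]] else NA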

How do I subset every day except the last five days of zoo data?

I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
set.seed(123)
price <- rnorm(365)
data <- cbind(seq(as.Date("2013-01-01"), by = "day", length.out = 365), price)
zoodata <- zoo(data[,2], as.Date(data[,1]))
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since each month has a different number of days (and February changes in leap years), it's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month, apply head(x, -5) to the data via tapply and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
Exactly the same way as in the answer to your other question:
Split the dataset by month and remove the last 5 days by just adding a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And the same way, if you want it on one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))
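As a quick sanity check (a sketch reusing the objects created above), the zoo subset from approach (1) and the combined xts object should hold the same values for this daily series:
z1 <- zoodata[ last.date.of.month - tt >= 5 ]
xz <- do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))
length(z1) == nrow(xz)
all.equal(as.numeric(coredata(z1)), as.numeric(coredata(xz)))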

Mean hour-of-day and imputation...would this be easier with time calculations?

I'm working with a data set and am imputing NAs for times. I have a simplified example below where I create a new column that contains the original data plus imputed values for the NAs (i.e., the mean time of day). The code works fine, but I am so weak with dates that I was wondering whether there is an easier way to calculate the mean of time-of-day date/time values.
arrivals <- data.frame(
  ships = c("Glory", "Discover", "Intrepid", "Enchantment", "Summit"),
  times = c("8:00", "10:00", "11:42", NA, "9:20"), stringsAsFactors = FALSE)
sumtime <- sapply(strsplit(as.character(arrivals$times), ":"),
                  function(x) as.numeric(x[1]) * 60 + as.numeric(x[2]))
avgtime <- paste(trunc(mean(sumtime, na.rm = TRUE) / 60), ":",
                 trunc(mean(sumtime, na.rm = TRUE) %% 60), sep = "")
arrivals$times2 <- arrivals$times
arrivals$times2[is.na(arrivals$times)] <- avgtime
You can use the chron package to convert your times column to a numeric representation that you can take the average of:
library(chron)
Arrivals <- arrivals[,c("ships","times")]
# Will give some warnings due to the missing value
Arrivals$times <- chron(times.=paste(Arrivals$times, ":00", sep=""))
Arrivals$times[is.na(Arrivals$times)] <- mean(Arrivals$times, na.rm = TRUE)
Arrivals
ships times
1 Glory 08:00:00
2 Discover 10:00:00
3 Intrepid 11:42:00
4 Enchantment 09:45:30
5 Summit 09:20:00
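If you'd rather avoid an extra package, here is a base-R sketch of the same idea (times3 is just an illustrative column name): parse the times as POSIXct on a common, arbitrary date, average them, and format the mean back to hours and minutes:
tm <- as.POSIXct(arrivals$times, format = "%H:%M", tz = "UTC")
avg <- mean(tm, na.rm = TRUE)
arrivals$times3 <- arrivals$times
arrivals$times3[is.na(arrivals$times3)] <- format(avg, "%H:%M")
arrivals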

Calculating a daily mean in R

Say I have the following matrix:
x1 = 1:288
x2 = matrix(x1,nrow=96,ncol=3)
Is there an easy way to get the mean of rows 1:24,25:48,49:72,73:96 for column 2?
Basically I have a one year time series and I have to average some data every 24 hours.
There is.
Suppose we have the days:
Days <- rep(1:4, each = 24)
Then you can easily do:
tapply(x2[, 2], Days, mean)
If you have a dataframe with a Date variable, you can use that one. You can do that for all variables at once, using aggregate :
x2 <- as.data.frame(cbind(x2,Days))
aggregate(x2[,1:3],by=list(Days),mean)
Take a look at the help files of these functions to start with. Also do a search here; there are quite a few other interesting answers on this problem:
Aggregating daily content
Compute means of a group by factor
PS: If you're going to do a lot of time series work, you should take a look at the zoo package (on CRAN: http://cran.r-project.org/web/packages/zoo/index.html ).
1) ts. Since this is a regularly spaced time series, convert it to a ts series and then aggregate it from frequency 24 to frequency 1:
aggregate(ts(x2[, 2], freq = 24), 1, mean)
giving:
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 108.5 132.5 156.5 180.5
2) zoo. Here it is using zoo. The zoo package can also handle irregularly spaced series (if we needed to extend this). Below day.hour is the day number (1, 2, 3, 4) plus the hour as a fraction of the day so that floor(day.hour) is just the day number:
library(zoo)
day.hour <- seq(1, length = length(x2[, 2]), by = 1/24)
z <- zoo(x2[, 2], day.hour)
aggregate(z, floor, mean)
## 1 2 3 4
## 108.5 132.5 156.5 180.5
If zz is the output from aggregate then coredata(zz) and time(zz) are the values and times, respectively, as ordinary vectors.
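For example (a small sketch continuing the zoo code above), pulling the values and times out of the aggregated object as plain vectors:
zz <- aggregate(z, floor, mean)
coredata(zz)
## [1] 108.5 132.5 156.5 180.5
time(zz)
## [1] 1 2 3 4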
A quite compact and computationally fast way of doing this is to reshape the vector into a suitable matrix and calculate the column means:
colMeans(matrix(x2[,2],nrow=24))
