Deseasonalize a "zoo" object containing intraday data - r

Using the zoo package (and help from SO) I have created a time series from the following:
z <- read.zoo("D:\\Futures Data\\BNVol3.csv", sep = ",", header = TRUE, index = 1:2,
tz="", format = "%d/%m/%Y %H:%M")
This holds data in the following format:(Intra-day from 07:00 to 20.50)
2012-10-01 14:50:00 2012-10-01 15:00:00 2012-10-01 15:10:00 2012-10-01 15:20:00
8638 9014 9402 9505
I want to "deseasonalize" the intra-day component of this data so that 1 day is considered a complete seasonal cycle. (I am using the day component because not all days will run from 07.00 to 20.50 due to bank holidays etc, but running from 07.00 to 20.50 is usually the standard. I assume that if i used the 84 intra-day points as 1 seasonal cycle then as some point the deseasonalizing will begin to get thrown off track)
I have tried to use the decompose method but this has not worked.
x <- Decompose(z)
Not sure "zoo" and decompose method are compatible but I thought "zoo" and "ts" were designed to be. Is there another way to do this?
Thanks in advance for any help.

Related

best practices for avoiding roundoff gotchas in date manipulation

I am doing some date/time manipulation and experiencing explicable, but unpleasant, round-tripping problems when converting date -> time -> date . I have temporarily overcome this problem by rounding at appropriate points, but I wonder if there are best practices for date handling that would be cleaner. I'm using a mix of base-R and lubridate functions.
tl;dr is there a good, simple way to convert from decimal date (YYYY.fff) to the Date class (and back) without going through POSIXt and incurring round-off (and potentially time-zone) complications??
Start with a few days from 1918, as separate year/month/day columns (not a critical part of my problem, but it's where my pipeline happens to start):
library(lubridate)
dd <- data.frame(year=1918,month=9,day=1:12)
Convert year/month/day -> date -> time:
dd <- transform(dd,
time=decimal_date(make_date(year, month, day)))
The successive differences in the resulting time vector are not exactly 1 because of roundoff: this is understandable but leads to problems down the road.
table(diff(dd$time)*365)
## 0.999999999985448 1.00000000006844
## 9 2
Now suppose I convert back to a date: the dates are slightly before or after midnight (off by <1 second in either direction):
d2 <- lubridate::date_decimal(dd$time)
# [1] "1918-09-01 00:00:00 UTC" "1918-09-02 00:00:00 UTC"
# [3] "1918-09-03 00:00:00 UTC" "1918-09-03 23:59:59 UTC"
# [5] "1918-09-04 23:59:59 UTC" "1918-09-05 23:59:59 UTC"
# [7] "1918-09-07 00:00:00 UTC" "1918-09-08 00:00:00 UTC"
# [9] "1918-09-09 00:00:00 UTC" "1918-09-09 23:59:59 UTC"
# [11] "1918-09-10 23:59:59 UTC" "1918-09-12 00:00:00 UTC"
If I now want dates (rather than POSIXct objects) I can use as.Date(), but to my dismay as.Date() truncates rather than rounding ...
tt <- as.Date(d2)
## [1] "1918-09-01" "1918-09-02" "1918-09-03" "1918-09-03" "1918-09-04"
## [6] "1918-09-05" "1918-09-07" "1918-09-08" "1918-09-09" "1918-09-09"
##[11] "1918-09-10" "1918-09-12"
So the differences are now 0/1/2 days:
table(diff(tt))
# 0 1 2
# 2 7 2
I can fix this by rounding first:
table(diff(as.Date(round(d2))))
## 1
## 11
but I wonder if there is a better way (e.g. keeping POSIXct out of my pipeline and staying with dates ...
As suggested by this R-help desk article from 2004 by Grothendieck and Petzoldt:
When considering which class to use, always
choose the least complex class that will support the
application. That is, use Date if possible, otherwise use
chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
The extensive table in this article shows how to translate among Date, chron, and POSIXct, but doesn't include decimal time as one of the candidates ...
It seems like it would be best to avoid converting back from decimal time if at all possible.
When converting from date to decimal date, one also needs to account for time. Since Date does not have a specific time associated with it, decimal_date inherently assumes it to be 00:00:00.
However, if we are concerned only with the date (and not the time), we could assume the time to be anything. Arguably, middle of the day (12:00:00) is as good as the beginning of the day (00:00:00). This would make the conversion back to Date more reliable as we are not at the midnight mark and a few seconds off does not affect the output. One of the ways to do this would be to add 12*60*60/(365*24*60*60) to dd$time
dd$time2 = dd$time + 12*60*60/(365*24*60*60)
data.frame(dd[1:3],
"00:00:00" = as.Date(date_decimal(dd$time)),
"12:00:00" = as.Date(date_decimal(dd$time2)),
check.names = FALSE)
# year month day 00:00:00 12:00:00
#1 1918 9 1 1918-09-01 1918-09-01
#2 1918 9 2 1918-09-02 1918-09-02
#3 1918 9 3 1918-09-03 1918-09-03
#4 1918 9 4 1918-09-03 1918-09-04
#5 1918 9 5 1918-09-04 1918-09-05
#6 1918 9 6 1918-09-05 1918-09-06
#7 1918 9 7 1918-09-07 1918-09-07
#8 1918 9 8 1918-09-08 1918-09-08
#9 1918 9 9 1918-09-09 1918-09-09
#10 1918 9 10 1918-09-09 1918-09-10
#11 1918 9 11 1918-09-10 1918-09-11
#12 1918 9 12 1918-09-12 1918-09-12
It should be noted, however, that the value of decimal time obtained in this way will be different.
lubridate::decimal_date() is returning a numeric. If I understand you correctly, the question is how to convert that numeric into Date and have it round appropriately without bouncing through POSIXct.
as.Date(1L, origin = '1970-01-01') shows us that we can provide as.Date with days since some specified origin and convert immediately to the Date type. Knowing this, we can skip the year part entirely and set it as origin. Then we can convert our decimal dates to days:
as.Date((dd$time-trunc(dd$time)) * 365, origin = "1918-01-01").
So, a function like this might do the trick (at least for years without leap days):
date_decimal2 <- function(decimal_date) {
years <- trunc(decimal_date)
origins <- paste0(years, "-01-01")
# c.f. https://stackoverflow.com/questions/14449166/dates-with-lapply-and-sapply
do.call(c, mapply(as.Date.numeric, x = (decimal_date-years) * 365, origin = origins, SIMPLIFY = FALSE))
}
Side note: I admit I went down a bit of a rabbit hole with trying to move origin around deal with the pre-1970 date. I found that the further origin shifted from the target date, the more weird the results got (and not in ways that seemed to be easily explained by leap days). Since origin is flexible, I decided to target it right on top of the target values. For leap days, seconds, and whatever other weirdness time has in store for us, on your own head be it. =)

Why am I getting this error message even after transforming my data set into a ts file for time series analysis?

I intend to perform a time series analysis on my data set. I have imported the data (monthly data from January 2015 till December 2017) from a csv file and my codes in RStudio appear as follows:
library(timetk)
library(tidyquant)
library(timeSeries)
library(tseries)
library(forecast)
mydata1 <- read.csv("mydata.csv", as.is=TRUE, header = TRUE)
mydata1
date pkgrev
1 1/1/2015 39103770
2 2/1/2015 27652952
3 3/1/2015 30324308
4 4/1/2015 35347040
5 5/1/2015 31093119
6 6/1/2015 20670477
7 7/1/2015 24841570
mydata2 <- mydata1 %>%
mutate(date = mdy(date))
mydata2
date pkgrev
1 2015-01-01 39103770
2 2015-02-01 27652952
3 2015-03-01 30324308
4 2015-04-01 35347040
5 2015-05-01 31093119
6 2015-06-01 20670477
7 2015-07-01 24841570
class(mydata2)
[1] "data.frame"
It is when running this piece of code that things get a little weird (for me at least):
mydata2_ts <- ts(mydata2, start=c(2015,1), freq=12)
mydata2_ts
date pkgrev
Jan 2015 16436 39103770
Feb 2015 16467 27652952
Mar 2015 16495 30324308
Apr 2015 16526 35347040
May 2015 16556 31093119
Jun 2015 16587 20670477
Jul 2015 16617 24841570
I don't really understand the values in the date column! It seems the dates have been converted into numeric format.
class(mydata2_ts)
[1] "mts" "ts" "matrix"
Now, running the following codes give me an error:
stlRes <- stl(mydata2_ts, s.window = "periodic")
Error in stl(mydata2_ts, s.window = "periodic") :
only univariate series are allowed
What is wrong with my process?
The reason that you got this error is because you tried to feed a data set with two variables (date + pkgrev) into STL's argument, which only takes a univariate time series as a proper argument.
To solve this problem, you could create a univariate ts object without the date variable. In your case, you need to use mydata2$pkgrev (or mydata2["pkgrev"] after mydata2 is converted into a dataframe) instead of mydata2 in your code mydata2_ts <- ts(mydata2, start=c(2015,1), freq=12). The ts object is already supplied with the temporal information as you specified start date and frequency in the argument.
If you would like to create a new dataframe with both the ts object and its corresponding date variable, I would suggest you to use the following code:
mydata3 = cbind(as.Date(time(mydata2_ts)), mydata2_ts)
mydata3 = as.data.frame(mydata3)
However, for the purpose of STL decompostion, the input of the first argument should be a ts object, i.e., mydata2_ts.

computing and formatting averages and squares of time intervals

I have a model which predicts the duration of certain events, and measures of durations for those events. I then want to compute the difference between Predicted and Measured, the mean difference and the RMSE. I'm able to do it, but the formatting is really awkward and not what I expected:
database <- data.frame(Predicted = c(strptime(c("4:00", "3:35", "3:38"), format = "%H:%M")),
Measured = c(strptime(c("3:39", "3:40", "3:53"), format = "%H:%M")))
database
> Predicted Measured
1 2016-11-28 04:00:00 2016-11-28 03:39:00
2 2016-11-28 03:35:00 2016-11-28 03:40:00
3 2016-11-28 03:38:00 2016-11-28 03:53:00
This is the first weirdness: why does R shows me a time and a date, even if I clearly specified a time-only format (%H:%M), and there was no date in my data to start with? It gets weirder:
database$Error <- with(database, Predicted-Measured)
database$Mean_Error <- with(database, mean(Predicted-Measured))
database$RMSE <- with(database, sqrt(mean(as.numeric(Predicted-Measured)^2)))
> database
Predicted Measured Error Mean_Error RMSE
1 2016-11-28 04:00:00 2016-11-28 03:39:00 21 mins 0.3333333 15.17674
2 2016-11-28 03:35:00 2016-11-28 03:40:00 -5 mins 0.3333333 15.17674
3 2016-11-28 03:38:00 2016-11-28 03:53:00 -15 mins 0.3333333 15.17674
Why is the variable Error expressed in minutes? For Error it's not a bad choice, but it becomes quite hard to read for Mean_Error. For RMSE it's even worse, but this could be due to the as.numeric function: if I remove it, R complains that '^' not defined for "difftime" objects. My questions are:
Is it possible to show the first 2 columns (Predicted and Measured) shown in the %H:%M format?
for the other 3 columns ( Error, Mean_Error and RMSE) I would like to compare a %M:%S format and a format in only seconds, and choose among the two. Is it possible?
EDIT: just to be more clear, my goal is to insert observations of time intervals into a dataframe and compute a vector of time interval differences. Then, compute some statistics for that vector: mean, RMSE, etc.. I know I could just enter the time observations in seconds, but that doesn't look very good: it's difficult to tell that 13200 seconds are 3 hours and 40 minutes. Thus I would like to be able to store the time intervals in the %H:%M, but then be able to manipulate them algebraically and show the results in a format of my choosing. Is that possible?
We can use difftime to specify the units for the difference in time. The output of difftime is an object of class difftime. When this difftime object is coerced to numeric using as.numeric, we can change these units (see the examples in ?difftime):
## Note we don't convert to date-time because we just want %H:%M
database <- data.frame(Predicted = c("4:00", "3:35", "3:38"),
Measured = c("3:39", "3:40", "3:53"))
## We now convert to date-time and use difftime to compute difference in minutes
database$Error <- with(database, difftime(strptime(Predicted,format="%H:%M"),strptime(Measured,format="%H:%M"), units="mins"))
## Use as.numeric to change units to seconds
database$Mean_Error <- with(database, mean(as.numeric(Error,units="secs")))
database$RMSE <- with(database, sqrt(mean(as.numeric(Error,units="secs")^2)))
## Predicted Measured Error Mean_Error RMSE
##1 4:00 3:39 21 mins 20 910.6042
##2 3:35 3:40 -5 mins 20 910.6042
##3 3:38 3:53 -15 mins 20 910.6042

R: using stl() on an xts object

I'm relatively new to R so please bear with me. I'm trying to get to grips with basic irregular time-series analysis.
That's what my data file looks like, some 40k lines. The spacing is not always exactly 20sec.
Time, Avg
04/03/2015 00:00:23,20.24
04/03/2015 00:00:43,20.38
04/03/2015 00:01:03,20.53
04/03/2015 00:01:23,20.54
04/03/2015 00:01:43,20.53
data <- read.zoo("data.csv",sep=",",tz='',header=T,format='%d/%m/%Y %H:%M:%S')
I'm happy to aggregate by minutes
data <- to.minutes(as.xts(data))
Using the "open" column as an example
head(data[,1])
as.xts(data).Open
2015-03-04 00:00:43 20.24
2015-03-04 00:01:43 20.53
2015-03-04 00:02:43 20.47
2015-03-04 00:03:43 20.38
2015-03-04 00:04:43 20.05
2015-03-04 00:05:43 19.84
data <- data[,1]
And here is where it all falls apart for me
fit <- stl(data, t.window=15, s.window="periodic", robust=TRUE)
Error in stl(data, t.window = 15, s.window = "periodic", robust = TRUE) :
series is not periodic or has less than two periods
I've googled the error message, but it's not really clear to me. Is period = frequency? For my dataset I would expect the seasonal component to be weekly.
frequency(data) <- 52
fit <- stl(data, t.window=15, s.window="periodic", robust=TRUE)
Error in na.fail.default(as.ts(x)) : missing values in object
?
head(as.ts(data))
[1] 20.24 NA NA NA NA NA
Uh, what?
What am I doing wrong? How do I have to prepare the xts object to be able to properly pass it to stl()?
Thank you.
I extract numeric values of xts_object and build a ts object for stl function. However the time stamps of xts_object is completely ignored in this case.
stl(ts(as.numeric(xts_object), frequency=52), s.window="periodic", robust=TRUE)

Remove time from XTS

I am trying to compare different timeseries, by day.
Currently a typical XTS object looks like:
> vwap.crs
QUANTITY QUANTITY.1
2014-03-03 13:00:00 3423.500 200000
2014-03-04 17:00:00 3459.941 4010106
2014-03-05 16:00:00 3510.794 1971234
2014-03-06 17:00:00 3510.582 185822
now, i can strip the time out of the index as follows:
> round(index(vwap.crs),"day")
[1] "2014-03-04" "2014-03-05" "2014-03-06" "2014-03-07"
My question is, how do I replace the existing index in variable vwap.crs, with the rounded output above?
EDIT: to.daily fixed it
This should do it
indexClass(vwap.crs) <- "Date"
Also, take a look at the code in xts:::.drop.time
You could also do it the way you're trying to do it if you use index<-
index(vwap.crs) <- round(index(vwap.crs),"day")

Resources