Correct imputation for a zooreg object? - r

My objective is to impute NAs in a zooreg time series object. The pattern of the time series is cyclic. My code is:
#load libraries required
library("zoo")
# create sequence every 15 minutes from 1st Dec to 20th Dec, 2018
timeStamp <- seq.POSIXt(from=as.POSIXct('2018-01-01 00:00:00', tz="UTC"), to=as.POSIXct('2018-01-20 23:45:00', tz="UTC"), by = "15 min")
# data which increases from 12am to 12pm, then decreases till 12 am of next day, for 20 days
readings <- rep(c(seq(1,48,1), seq(48,1,-1)), 20)
dF <- data.frame(timeStamp=timeStamp, readings=readings)
# create a regular zooreg object, frequency is 1 day( 4 readings * 24 hours)
readingsZooReg <- zooreg(dF$readings, order.by = dF$timeStamp, frequency = 4*24)
plot(readingsZooReg)
# force some data to be NAs
window(readingsZooReg, start = as.POSIXct("2018-01-14 00:00:00", tz="UTC"), end = as.POSIXct("2018-01-16 23:45:00", tz="UTC")) <- NA
plot(readingsZooReg)
# plot imputed values
plot(na.approx(readingsZooReg))
The plots are:
Full time series, NAs added, Imputed time series
I'm purposely using zoo here, since the time series I work on are irregular(eg. solar, oil wells, etc)
1) Is my usage of "zooreg" correct? Or would a "zoo" object suffice ?
2) Is my frequency variable right?
3) Why won't na.approx work? I've also tried na.StructTs, the R script hangs.
4) Is there a solution using any other package? xts, ts, etc?

Your current example time-series is a regular time-series.
(a irregular time series would have time-steps with different time distances between observations)
E.g.:
10:00:10, 10:00:20, 10:00:30, 10:00:40, 10:00:50 (regular spaced)
10:00:10, 10:00:17, 10:00:33, 10:00:37, 10:00:50 (irregular spaced)
If you really need to handle irregular spaced time-series, zoo is your go to package. Otherwise you can also use other time series classes as xts and ts.
About the frequency:
You set the frequency of a time-series usually according to a value where you expect patterns to repeat. (in your example this could be 96). In real live this is often 1 day, 1 week, 1 month,....but it can be also different from these like 1,5 days. (e.g. if you have daily returning patterns and 1 minute observations you would set the frequency to 1440).
na.approx of zoo workes perfectly. It is exactly doing what it is expected to. A interpolation between the points 0 before the gap and 0 at the end of the gap will give a straight line at 0. Of course that is probably not the result you expected, because it does not account for seasonality. That is why G. Grothendieck suggests you na.StructTS as a method to choose. (this method is usually better in accounting for seasonality)
The best choice if you are not bound to zoo would in this specific case be using na_seadec from the imputeTS package ( a package solely dedicated to time series imputation).
I have added you a example also with nice plots from the imputeTS package
library(imputeTS)
yourTS <- ts(coredata(readingsZooReg), frequency = 96)
ggplot_na_distribution(yourTS)
imputedTS <- na_seadec(yourTS)
ggplot_na_imputations(yourTS, imputedTS)
Usually imputeTS also works perfectly with zoo time-series as input. I only changed it to ts again, because something with your zoo object seems odd...that is also why na.StructTS from zoo itself breaks. Maybe somebody with better knowledge can help out here.
Beware, if you really should have irregular time series do not use other packages / imputation functions than from zoo. Because they all assume the data to be regular spaced and will give results accordingly.

Related

R: What does the frequency argument to xts do? [duplicate]

I'm creating an xts object with a weekly (7 day) frequency to use in forecasting. However, even when using the frequency=7 argument in the xts call, the resulting xts object has a frequency of 1.
Here's an example with random data:
> values <- rnorm(364, 10)
> days <- seq.Date(from=as.Date("2014-01-01"), to=as.Date("2014-12-30"), by='days')
> x <- xts(values, order.by=days, frequency=7)
> frequency(x)
[1] 1
I have also tried, after using the above code, frequency(x) <- 7. However, this changes the class of x to only zooreg and zoo, losing the xts class and messing with the time stamp formats.
Does xts automatically choose a frequency based on analyzing the data in some way? If so, how can you override this to set a specific frequency for forecasting purposes (in this case, passing a seasonal time series to ets from the forecast package)?
I understand that xts may not allow frequencies that don't make sense, but a frequency of 7 with daily time stamps seems pretty logical.
Consecutive Date class dates always have a frequency of 1 since consecutive dates are 1 apart. Use ts or zooreg to get a frequency of 7:
tt <- ts(values, frequency = 7)
library(zoo)
zr <- as.zooreg(tt)
# or
zr <- zooreg(values, frequency = 7)
These will create a series whose times are 1, 1+1/7, 1+2/7, ...
If we have some index values of zr
zrdates <- index(zr)[5:12]
we can recover the dates from zrdates like this:
days[match(zrdates, index(zr))]
As pointed out in the comments xts does not support this type of series.

Weekly time series plot in R

I am trying to create a plot of weekly data. Though this is not the exact problem I am having it illustrates it well. Basically imagine you want to make a plot of 1,2,....,7 for for 7 weeks from Jan 1 2015. So basically my plot should just be a line that trends upward but instead I get 7 different lines. I tried the code (and some other to no avail). Help would be greatly appreciated.
startDate = "2015-01-01"
endDate = "2015-02-19"
y=c(1,2,3,4,5,6,7)
tsy=ts(y,start=as.Date(startDate),end=as.Date(endDate))
plot(tsy)
You are plotting both the time and y together as individual plots.
Instead use:
plot(y)
lines(y)
Also, create a date column based on the specifics you gave which will be a time series. From here you can add the date on the x-axis to easily see how your variable changes over time.
To make your life easier I think your first step should be to create a (xts) time series object (install/load the xts-package), then it is a piece of cake to plot, subset or do whatever you like with the series.
Build your vector of dates as a sequence with start/end date:
seq( as.Date("2011-07-01"), by=1, len=7)
and your data vector: 1:7
a one-liner builds and plots the above time series object:
plot(as.xts(1:7,order.by=seq( as.Date("2011-07-01"), by=1, len=7)))

Interpolation in R

I have hourly time series and would like to interpolate sub-hourly values like every 15 min. Linear interpolation will do. But if there is any way to specify Gaussian, Polynomial, that would be great.
For example if I have
a<-c(4.5,7,3.3) which is the first three hour data. How can I get 15 min sub-hourly data, total of 9 values in this case? I have been using approx function and studying zoo package and still don't know how I can do it. Thank you very much!
How about this:
b<-xts(c(4.5,7,3.3), order.by=as.POSIXct(c('2013-07-26 0:00',
'2013-07-26 2:00',
'2013-07-26 3:00')))
approx(b, n=13) ,
adjusting n for the appropriate time interval?
Within xts package, you can either na.approx or na.spline.
Coerce you times series to an xts object
Create a new index having 15 minutes intervals
Use this new index to create a NULL xts object that you merge with your object
Approximate missing values using na.approx for linear/constant approx or na.spline for polynomial one.
here a complete example:
library(xts)
set.seed(21)
## you create the xts object
x <- xts(rnorm(10),
seq(from=as.POSIXct(Sys.Date()),
length.out=10,
by=as.difftime(1,units='hours')))
## new index to be used
new.index <-
seq(min(index(x)),max(index(x)), by=as.difftime(15,units='mins'))
## linear approx
na.approx(merge(x,xts(NULL,new.index)))
## polynomial approx
na.spline(merge(x,xts(NULL,new.index)))

Plotting truncated times from zoo time series

Let's say I have a data frame with lots of values under these headers:
df <- data.frame(c("Tid", "Value"))
#Tid.format = %Y-%m-%d %H:%M
Then I turn that data frame over to zoo, because I want to handle it as a time series:
library("zoo")
df <- zoo(df$Value, df$Tid)
Now I want to produce a smooth scatter plot over which time of day each measurement was taken (i.e. discard date information and only keep time) which supposedly should be done something like this: https://stat.ethz.ch/pipermail/r-help/2009-March/191302.html
But it seems the time() function doesn't produce any time at all; instead it just produces a number sequence. Whatever I do from that link, I can't get a scatter plot of values over an average day. The data.frame code that actually does work (without using zoo time series) looks like this (i.e. extracting the hour from the time and converting it to numeric):
smoothScatter(data.frame(as.numeric(format(df$Tid,"%H")),df$Value)
Another thing I want to do is produce a density plot of how many measurements I have per hour. I have plotted on hours using a regular data.frame with no problems, so the data I have is fine. But when I try to do it using zoo then I either get errors or I get the wrong results when trying what I have found through Google.
I did manage to get something plotted through this line:
plot(density(as.numeric(trunc(time(df),"01:00:00"))))
But it is not correct. It seems again that it is just producing a sequence from 1 to 217, where I wanted it to be truncating any date information and just keep the time rounded off to hours.
I am able to plot this:
plot(density(df))
Which produces a density plot of the Values. But I want a density plot over how many values were recorded per hour of the day.
So, if someone could please help me sort this out, that would be great. In short, what I want to do is:
1) smoothScatter(x-axis: time of day (0-24), y-axis: value)
2) plot(density(x-axis: time of day (0-24)))
EDIT:
library("zoo")
df <- data.frame(Tid=strptime(c("2011-01-14 12:00:00","2011-01-31 07:00:00","2011-02-05 09:36:00","2011-02-27 10:19:00"),"%Y-%m-%d %H:%M"),Values=c(50,52,51,52))
df <- zoo(df$Values,df$Tid)
summary(df)
df.hr <- aggregate(df, trunc(df, "hours"), mean)
summary(df.hr)
png("temp.png")
plot(df.hr)
dev.off()
This code is some actual values that I have. I would have expected the plot of "df.hr" to be an hourly average, but instead I get some weird new index that is not time at all...
There are three problems with the aggregate statement in the question:
We wish to truncate the times not df.
trunc.POSIXt unfortunately returns a POSIXlt result so it needs to be converted back to POSIXct
It seems you did not intend to truncate to the hour in the first place but wanted to extract the hours.
To address the first two points the aggregate statement needs to be changed to:
tt <- as.POSIXct(trunc(time(df), "hours"))
aggregate(df, tt, mean)
but to address the last point it needs to be changed entirely to
tt <- as.POSIXlt(time(df))$hour
aggregate(df, tt, mean)

15min time aggregation in R

I have a zoo series in R. I can choose between a chron or a POSIXct index.
How can I aggregate to 15min, taking the last element every 15min?
I know how to aggregate daily, writing as.Date, but not how to aggregate every 15min.
thanks.
If I recall, this is documented in the zoo vignettes. Did you look there?
The xts package, which builds on zoo has helper functions -- see help(to.period) in particular and the to.minutes15 function.
Here are a couple of possibilities depending on what you want. Both make use of trunc.times from the chron package. The aggregate.zoo solution takes the last value within each 15 minute interval and labels it using the time at the beginning of the 15 minute interval so the times used are: 00:00:00, 00:15:00, 00:30:00 and 00:45:00. The duplicated solution uses the same values but labels them using the last time actually found in the data. In both cases we only include intervals for which data is present.
There are more examples of aggregate.zoo in (1) ?aggregate.zoo, (2) all three of the zoo vignettes have examples and (3) searching the r-help archives for the words aggregate.zoo and trunc finds even more examples.
library(zoo)
library(chron)
z <- zoo(1:10, chron(1:10/(24*13)))
# 1. last value in each 15 minute interval
# using time at which interval begins
aggregate(z, trunc(time(z), "00:15:00"), tail, 1)
# 2. last value in each 15 minute interval
# time of last point in data within interval
z[!duplicated(trunc(time(z), "00:15:00"), fromLast = TRUE)]

Resources