Dealing with NA in seasonal cycle analysis in R

I have a time series of monthly data with lots of missing data points, set to NA. I simply want to subtract the annual cycle from the data, ignoring the missing entries. The decompose function can't handle missing data points, and I have seen the seasonal package suggested elsewhere as an alternative. However, I am running into problems with the NAs there too.
Here is a minimal reproducible example of the problem using a built-in dataset:
library(seasonal)
# set a range of values to NA in the co2 dataset
c2<-co2
c2[c2>330 & c2<350]=NA
seas(c2,na.action=na.omit)
Error in na.omit.ts(x) : time series contains internal NAs
Yes, I know! that's why I asked you to omit them! Let's try this:
seas(c2,na.action=na.x13)
Error: X-13 run failed
Errors:
- Adding MV1981.Apr exceeds the number of regression effects
allowed in the model (80).
Hmmm, interesting, no idea what that means, okay, please just exclude the NA:
seas(c2,na.action=na.exclude)
Error in na.omit.ts(x) : time series contains internal NAs
that didn't help much! and for good measure
decompose(c2)
Error in na.omit.ts(x) : time series contains internal NAs
I'm on the following:
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Why is leaving out NA such a problem? I'm obviously being completely stupid, but I can't see what I'm doing wrong with the seas function. Happy to consider an alternative solution using xts.

My first solution was to calculate the seasonal cycle manually, convert the series to a year-by-month matrix to subtract the cycle, and then transform it back into a time series.
# seasonal cycle
scycle=tapply(c2,cycle(c2),mean,na.rm=TRUE)
# converting to df
df=tapply(c2, list(year=floor(time(c2)), month = cycle(c2)), c)
# subtract seasonal cycle
for (i in 1:nrow(df)){df[i,]=df[i,]-scycle}
# convert back to timeseries
anomco2=ts(c(t(df)),start=start(c2),freq=12)
Not very pretty, and not very efficient either.
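The same subtraction can be written without the round trip through a year-by-month matrix. The following is only a sketch in base R, assuming the c2 series from the question; the name anomco2 is reused for comparison, and the idea is simply to index the twelve monthly means by each observation's position in the cycle.

```r
# Rebuild the example series with a block of missing values
c2 <- co2
c2[c2 > 330 & c2 < 350] <- NA

# Mean of each calendar month, ignoring NA
scycle <- tapply(c2, cycle(c2), mean, na.rm = TRUE)

# Look up the monthly mean for every observation and subtract it;
# NA entries stay NA, and the result is still a ts object
anomco2 <- c2 - as.numeric(scycle)[as.integer(cycle(c2))]
```

Because the arithmetic is done on the ts object itself, the start date and frequency are carried over without calling ts() again.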
The comment by missuse led me to a near-duplicate question that I had missed, "Seasonal decompose of monthly data including NA in r". It suggested the package zoo, which seems to work really well for additive series:
library(zoo)
c2=co2
c2[c2>330&c2<350]=NA
d=decompose(na.StructTS(c2))
plot(co2)
lines(d$x,col="red")
shows that the series is very well reconstructed through the missing period.
The output of decompose has the trend and seasonal cycle available. I wish I could transfer my bounty to user https://stackoverflow.com/users/516548/g-grothendieck for this helpful response. Thanks to user missuse too.
However, if the missing portion is at the end of the series, the software has to extrapolate the trend and has more difficulties. The original series (in black) maintains the trend, while the trend is smaller in the reconstructed series (red):
c2=co2
c2[c2>350]=NA
d=decompose(na.StructTS(c2))
plot(co2)
lines(d$x,col="red")
Lastly, if instead the missing portion is at the start of the series, the software is unable to extrapolate backwards in time and throws an error... I feel another SO question coming on...
c2=co2
c2[c2<330]=NA
d=decompose(na.StructTS(c2))
Error in StructTS(y) :
the first value of the time series must not be missing
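One possible way around this error, offered only as a sketch and assuming it is acceptable to discard the leading gap rather than reconstruct it, is to trim the series so that it starts at the first observed value. The names first_ok and c2_trim are illustrative; the trimmed series still contains internal NAs near the threshold and could then be passed to na.StructTS, a step that is not run here.

```r
# Series with the early part set to NA, as in the failing example
c2 <- co2
c2[c2 < 330] <- NA

# Position of the first non-missing observation
first_ok <- which(!is.na(c2))[1]

# Drop everything before it; window() keeps the ts attributes
c2_trim <- window(c2, start = time(c2)[first_ok])
```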

You could first fill in the missing data with an imputation algorithm (e.g. from the package imputeTS or zoo).
imputeTS, for example, has additional imputation algorithms for seasonal time series, e.g.:
x <- na_seadec(co2)
Another good option for seasonal data:
x <- na_kalman(co2)
And now just go on without the missing data.
An important hint from Adrian Tompkins (see also comment below):
This works best when the missing data is somewhere in the middle of the series. With many leading NAs the method is not a good choice: it fills in the NAs, but it cannot extrapolate the trend backwards:
c2<-co2
c2[c2<330]<-NA
c3<-na_kalman(c2)
c4<-na_seadec(c2)
plot(co2)
lines(c3,col="blue")
lines(c4,col="red")

Related

Analyzing disparate time series in R

Are there tools in R that simplify the analysis of lagged and disparate time series? For example:
Daily values that only occur on weekdays (no entry on weekends or holidays)
vs
Bi-annual values
What I'm seeking is ways to:
Complete the missing daily values (with interpolated, or last value rolled forward, etc.)
Look for correlation between daily values and the bi-annual value (only the values that came before the bi-annual event)
As an example:
10-year treasury note interest rate (daily on non-holiday weekdays) as "X" and i-bond fixed rate as "Y" (set May 1/Nov 1)
Any suggestions appreciated.
I've built a test dataset manually for "x" and used functions in zoo to populate the missing values (interpolated), but I'm hoping for a less "brute-force" method for analyzing the disparate time series. I've used lag functions in the past, but those were on time series with matching intervals.
What Jon commented is what I had in mind:
expand a weekday time series to full week using missing value function(s) in zoo
Sample the daily value - say April 15 for the May 1, Oct 15 for Nov 1
Ideally be able to automate - say loop through April 1-30, Oct 1-30 to look for highest RSqr for the model of choice (linear, polynomial, etc.)
Not have to build discrete datasets for each of the above - but if that is what is required I can do it programmatically - I've done that with stock data in the past. I was looking for a more efficient means of selecting the datasets ad hoc during the analysis.
I don't have code to post, because I'm clueless as to the feature/function that would make the date selection I'm after possible (at least in R).
Thanks for the input so far. It has already been useful in helping me look at alternative methods to achieve what I'm after.
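The first two steps described above (expanding a weekday series to every calendar day, then sampling the value on a chosen date) might look roughly like this with zoo. This is a sketch only: the weekday series x is made-up illustrative data, not treasury rates, and the dates are arbitrary.

```r
library(zoo)

# Hypothetical weekday-only series for April 2020 (illustrative values)
days <- seq(as.Date("2020-04-01"), as.Date("2020-04-30"), by = "day")
wk   <- days[!format(days, "%u") %in% c("6", "7")]  # drop Sat/Sun
x    <- zoo(seq_along(wk), wk)

# Expand to every calendar day; weekends become NA
full <- merge(x, zoo(, days))

# Two ways to fill the gaps
x_locf <- na.locf(full)    # last value rolled forward
x_lin  <- na.approx(full)  # linear interpolation

# Sample the filled series on an arbitrary date (a Saturday here)
x_locf[as.Date("2020-04-18")]
```

Looping over candidate sampling dates would then only require indexing the filled series by date, without building a separate dataset for each date.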

How to calculate daily standard deviation from data collected hourly in R?

I calculated daily means from hourly data for all four variables in the dataset without any issues using the xts function apply.daily(df.xts, FUN=mean). This gives me daily averages of each of my variables. However, I am not able to do the same for the standard deviation. When using apply.daily(df.xts, FUN=sd) I just get the index (time stamps) and only one column of values as output. What am I missing? I have a similar issue with var.
Thank you
See issues 124 and 128 on GitHub for the reasoning behind this.
A solution (for now?) is to use an extra package called matrixStats. The code below returns the standard deviation per column of an xts object.
apply.daily(df.xts, matrixStats::colSds)
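If adding matrixStats is not desirable, a column-wise function built from base R's apply should give the same kind of result. The sketch below uses a small made-up hourly object (the columns a and b and the two-day index are illustrative assumptions, not the asker's data).

```r
library(xts)

# Hypothetical hourly data covering two full days, two variables
idx <- seq(as.POSIXct("2021-01-01 00:00", tz = "UTC"),
           by = "hour", length.out = 48)
df.xts <- xts(cbind(a = 1:48, b = rep(c(1, 3), 24)), order.by = idx)

# Apply sd to each column within each day
daily_sd <- apply.daily(df.xts, function(d) apply(d, 2, sd))
daily_sd
```

The anonymous function returns one value per column, so the result keeps one column per variable and one row per day.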

Understanding the time() and cycle() functions in R

I have time series of the monthly international airplane passengers for many years, what does the time function applied to my data set tell me? What does the cycle function do to my data set? What are these functions useful for?
The time function is called as follows:
library(tseries)
library(forecast)
data(AirPassengers)
AP <- AirPassengers
time(AP, offset = 0)
An offset of 0 (the default) indicates that sampling took place at the start of each time unit; 0.5 would indicate the middle and 1 the end. The time function creates a vector of the times at which the series was sampled, starting with the first unit, which here is the first month.
The cycle function on the other hand, will just show the position of the datapoint or the observation in the entire cycle. The syntax would be
cycle(AP)
If you run this in R, you would see that the position assigned to Jan is 1, Feb is 2, March is 3 and so on for all the 12 months.
Use of these functions would be to get an overview of the data before starting to model it.
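To make the difference between the two functions concrete, here is a short sketch using only base R and the built-in AirPassengers data. The variable names are illustrative; the per-month mean at the end is just one example of what cycle() is handy for.

```r
AP <- AirPassengers

# time(): the sampling time of each observation, in years
t_start <- time(AP)                # January 1949 is 1949.000
t_mid   <- time(AP, offset = 0.5)  # shifted to the middle of each month

# cycle(): the position of each observation within the year (1 to 12)
pos <- cycle(AP)

head(t_start, 3)
head(pos, 12)

# cycle() works well as a grouping variable, e.g. mean passengers per month
monthly_mean <- tapply(AP, pos, mean)
monthly_mean
```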
Here is an additional link for you to browse.
An old question, but perhaps someone is googling this, so an illuminating example could still help. I ran into it when trying to make sense of Cowpertwait's R book...
Try this to see a visual example of what is going on:
AP
cycle(AP)
layout(matrix(c(1,2,3, 4), 2, 2, byrow = TRUE),widths=c(1,1), heights=c(1,1))
plot(AP)
plot(aggregate(AP))
boxplot(AP)
boxplot(AP ~ cycle(AP))

NA in time series handling?

I am dealing with a forecast of time series in R. I have several questions:
I would like to ask how we can handle missing values in time series?
I guess we can somehow interpolate them?
Can you suggest some solution in R for this?
One solution is the imputeTS package.
library(imputeTS)
# amount of NA
table(is.na(tsAirgap))
# Kalman smoothing imputation (one of the best)
imp_tsAirgap <- na_kalman(tsAirgap)
# Imputed time-series, no NAs
table(is.na(imp_tsAirgap))
If you would like to delete the missing values and their corresponding time-stamps, you can also use the na.remove function within the tseries package.
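For the plain interpolation the question asks about, base R's approx function is one option that needs no extra package. This is a minimal sketch on a made-up monthly series (the values are illustrative); it does linear interpolation only and is not a substitute for the Kalman approach above.

```r
# Hypothetical monthly series with internal gaps
x <- ts(c(1, NA, 3, NA, NA, 6), start = c(2020, 1), frequency = 12)

ok <- !is.na(x)
tt <- as.numeric(time(x))

# Interpolate linearly between the observed points
filled <- ts(approx(tt[ok], x[ok], xout = tt)$y,
             start = start(x), frequency = frequency(x))
filled
```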

Designating dynamic start in time-series vectors in R

I have some problems with time-series designation of vectors in R.
I work with time series, and when I want to set a vector to a certain period, I feel quite confident about how to do it. I have simply done the following: name<- ts(name, frequency=12, start=c(2007,1)). As you can see, I have monthly data.
I am making an R template for colleagues to use, and I want them to be able to carry out a recursive ARIMA regression from any given starting point. That is, I have a range of in-sample predicted values and I want to designate a start value that is n monthly observations after 2007 (or whatever start date is used), where n is the start value of the recursive regression.
first and last from the xts time-series package do exactly what you want, i.e. to get the first 2 months of an object x:
first(x, '2 months')
or the last 6 weeks:
last(x, '6 weeks')
Valid period.types are: secs, seconds, mins, minutes, hours, days, weeks, months, quarters, and years. As always you can find much more detailed information using ?xts::first.
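For a plain ts object, a start point that lies n observations after the beginning of the series can also be picked with window() from base R. The sketch below uses a made-up monthly series; x, n and sub are illustrative names.

```r
# Hypothetical monthly series starting in January 2007
x <- ts(1:36, frequency = 12, start = c(2007, 1))

# Start n observations after the beginning of the series
n <- 5
sub <- window(x, start = time(x)[n + 1])

start(sub)  # year and month of the new first observation
```

Since n is an ordinary variable, colleagues using a template would only need to change that one value.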
