auto.arima() function does not result in white noise. How else should I go about modeling the data?

Here is the plot of the initial data (after performing a log transformation).
It is evident that there is both a linear trend and a seasonal trend. I can address both of these by taking the first and twelfth (seasonal) difference: diff(diff(data), 12). After doing so, here is the plot of the resulting data.
This data does not look great. While the mean is constant, we see a funneling effect as time progresses. Here are the ACF/PACF.
Any suggestions for possible fits to try? I used the auto.arima() function, which suggested an ARIMA(2,0,2)(1,0,2)[12] model. However, once I took the residuals from the fit, it was clear there was still some sort of structure in them. Here is the plot of the residuals from the fit as well as the ACF/PACF of the residuals.
There does not appear to be a seasonal pattern regarding which lags have spikes in the ACF/PACF of residuals. However, this is still something not captured by the previous steps. What do you suggest I do? How could I go about building a better model that has better model diagnostics (which at this point is just a better looking ACF and PACF)?
Here is my simplified code thus far:
library(TSA)
library(forecast)
beer <- read.csv('beer.csv', header = TRUE)
beer <- ts(beer$Production, start = c(1956, 1), frequency = 12)
# transform data
boxcox <- BoxCox.ar(beer) # 0 in confidence interval
beer.log <- log(beer)
firstDifference <- diff(diff(beer.log), 12) # get rid of linear and
# seasonal trend
acf(firstDifference)
pacf(firstDifference)
eacf(firstDifference)
plot(armasubsets(firstDifference, nar=12, nma=12))
# fitting the model
auto.arima(firstDifference, ic = 'bic') # from the forecast package
modelFit <- arima(firstDifference, order = c(1, 0, 0),
                  seasonal = list(order = c(2, 0, 0), period = 12))
# assessing model
resid <- modelFit$residuals
acf(resid, lag.max = 15)
pacf(resid, lag.max = 15)
Here is the data, if interested (I think you can use an html to csv converter if you would like): https://docs.google.com/spreadsheets/d/1S8BbNBdQFpQAiCA4J18bf7PITb8kfThorMENW-FRvW4/pubhtml

Jane,
There are a few things going on here.
Instead of logs, we used the Tsay variance test, which shows that the variance increased after period 118. Weighted least squares deals with it.
March becomes higher beginning at period 111. An alternative to an AR(12) term or seasonal differencing is to identify seasonal dummies. We found that 7 of the 12 months were unusual, along with a couple of level shifts and an AR(2) with 2 outliers.
Here is the fit and forecasts.
Here are the residuals.
ACF of residuals
Note: I am a developer of the software Autobox. All models are wrong. Some are useful.
Here is Tsay's paper
http://onlinelibrary.wiley.com/doi/10.1002/for.3980070102/abstract
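The seasonal-dummy idea from this answer can be sketched in base R. This is only a rough illustration under stated assumptions: the series is simulated (it is not the beer data), the month effects and AR coefficients are made up, and the level shifts, outlier handling and weighted least squares that the answer mentions are not reproduced here.

```r
# Simulated monthly series standing in for the real data (assumption).
set.seed(1)
n <- 240
month_effect <- c(5, 3, 8, 2, 0, -1, -3, -2, 1, 4, 6, 9)  # made-up values
noise <- as.numeric(arima.sim(list(ar = c(0.5, 0.2)), n = n))
y <- ts(100 + 0.2 * seq_len(n) + rep(month_effect, length.out = n) + noise,
        start = c(1956, 1), frequency = 12)

# Eleven monthly dummies (January is the baseline) plus a linear trend.
month <- factor(cycle(y))
xreg <- cbind(model.matrix(~ month)[, -1], trend = seq_len(n))

# Regression on the dummies and trend with AR(2) errors.
fit <- arima(y, order = c(2, 0, 0), xreg = xreg, method = "ML")

# Ljung-Box test on the residuals; fitdf = 2 accounts for the two AR terms.
lb <- Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 2)
print(fit)
print(lb)
```

A large Ljung-Box p-value gives a numeric check to go alongside the visual inspection of the residual ACF and PACF.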

Related

Incorporating Known Limits into Time-Series Forecasting

I am trying to forecast a time-series with clear seasonality. I believe there is weekly and daily seasonality, and have a time-series sampled at 1-minute intervals over a year. To start, I fit a dynamic harmonic regression model with ARMA errors, by doing:
library("forecast")
fourier_components <- fourier( x = data, K=c(k_value,k_value) )
arima_model_fit <- auto.arima( data, seasonal=FALSE, lambda=0, xreg=fourier_components )
And this, as a first attempt (I can tune K and so on), works reasonably well. If I forecast ahead, I get:
forecast_fourier <- fourier( x = data, K=c(k_value, k_value), h = 1440*7*4 )
seasonal_test_forecast <- forecast( object = arima_model_fit, xreg = forecast_fourier )
autoplot(seasonal_test_forecast, include = 1440*7*4*2, ylab="data")
However, I know due to physical limitations that the time-series cannot go below 0 or above some threshold T. I can stop the prediction intervals going below 0 with the lambda=0 call in auto.arima, as this provides a log-transform. But, is there some way I can account for the upper limit in the typical behaviour? Without explicitly doing it, I clearly get physically unreasonable prediction intervals over longer horizons.
EDIT: As pointed out in the comments, a method to account for this is detailed in otexts.com/fpp2/limits.html using an adjustment to the log-transform.
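A minimal sketch of that adjusted transform (a scaled logit), written in base R rather than with the forecast package. The bounds a and b, the simulated series and the AR(1) model are all assumptions made for illustration; b stands in for the threshold T.

```r
# Hypothetical bounds: the series is known to stay strictly inside (a, b).
a <- 0
b <- 100  # stands in for the threshold T

# Map (a, b) onto the whole real line and back again.
to_unbounded <- function(y, a, b) log((y - a) / (b - y))
to_bounded   <- function(f, a, b) a + (b - a) * plogis(f)

# Simulated bounded series (assumption): an AR(1) pushed through the back-transform.
set.seed(42)
y <- to_bounded(as.numeric(arima.sim(list(ar = 0.8), n = 300)), a, b)

# Model and forecast on the unbounded scale.
z   <- to_unbounded(y, a, b)
fit <- arima(z, order = c(1, 0, 0), method = "ML")
fc  <- predict(fit, n.ahead = 48)

# Back-transform the point forecasts and approximate 95% limits.
point <- to_bounded(as.numeric(fc$pred), a, b)
lower <- to_bounded(as.numeric(fc$pred - 1.96 * fc$se), a, b)
upper <- to_bounded(as.numeric(fc$pred + 1.96 * fc$se), a, b)
range(lower, upper)
```

Because the back-transform is monotone and maps every real number into the interval from a to b, the interval limits cannot leave the physical range, however long the horizon.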

Auto.Arima fits well except for a single spike

I'm an engineering grad student and, as a small part of my thesis, I'm trying to analyze some groundwater data in R using the auto.arima function. The fitted values match my data well except for one spike, and I cannot figure out for the life of me why they go off the rails there. There are no oddities or missing values in the data. The data is the elevation of the groundwater, with one recorded point per day.
My raw unfitted data looks like this:
#load tseries and forecast libraries
library(tseries)
library(forecast) # provides auto.arima()
# RESERVOIR ONLY ANALYSIS#
#Daily Piezometric Data from PS13-01
PS1301 = read.csv("PS13-01.csv",TRUE,",")
#impute missing data from data set
PS1301 = imputeTS::na_interpolation(PS1301)
#Create Time Series
PS1301 = ts(PS1301[,2],frequency = (365),start = c(2013,116))
plot(PS1301, xlab='Time', ylab = 'Piezometric Head')
And then, after running auto.arima, it fits this:
#Auto Arima of only piezometers
#PS1301
AAPS1301 = auto.arima(PS1301)
AAPS1301
summary(AAPS1301)
## Series: PS1301
## ARIMA(2,1,0)(0,1,0)[365]
##
## Coefficients:
## ar1 ar2
## 0.3362 0.5722
## s.e. 0.0643 0.0625
##
## sigma^2 estimated as 0.02372: log likelihood=2779.3
## AIC=-5552.61 AICc=-5552.59 BIC=-5536.39
plot(PS1301,col="red")
lines(fitted(AAPS1301),col="blue")
Any help would be appreciated; I'm pretty unsure what to do from here. I feel like this has to be an error because of how good the fit is (visually) for the rest of the time series. I'm also more than happy to provide the raw data, but I am not sure how to put it in this post other than as a Dropbox link: https://www.dropbox.com/sh/563nu3daeid0agb/AAB6NSddVUKgBCCbQtuqXPsZa?dl=0
The problem here is that the seasonal period is very long (365) and R is trying to fit a diffuse prior to the corresponding state space model -- which becomes increasingly difficult with very long periods. There appears to be some numerical instability as a result, giving inaccurate fitted values at the 366th and 367th observations.
I am not convinced that using a seasonal ARIMA with such a long period makes any sense, but if you want to do it, use the CSS estimation method instead of full likelihood:
fit_css <- auto.arima(PS1301, method='CSS')
It is also much faster.
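For reference, the same estimation method is available in base R's arima(). The sketch below uses a simulated daily series (an assumption, not the PS13-01 data) and fixes the orders to those reported in the question; it only shows the call and how to recover fitted values, and makes no claim to reproduce the spike.

```r
# Simulated daily series with a yearly cycle (assumption, not the real data).
set.seed(7)
n <- 365 * 3
x <- ts(50 + 2 * sin(2 * pi * seq_len(n) / 365) + cumsum(rnorm(n, sd = 0.05)),
        frequency = 365)

# ARIMA(2,1,0)(0,1,0)[365] estimated by conditional sum of squares.
fit_css <- arima(x, order = c(2, 1, 0),
                 seasonal = list(order = c(0, 1, 0), period = 365),
                 method = "CSS")

# Fitted values are the observations minus the residuals.
fitted_css <- x - residuals(fit_css)
plot(x, col = "red")
lines(fitted_css, col = "blue")
```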

Result of nnetar is strangely flat

I'm new to R but have some experience with ARIMA models. Now I wanted to learn a bit about neural networks for forecasting.
I tried to repeat the procedure from Rob's post. It worked great for the data set he used. It also worked great for imaginary datasets I created.
But then I tried to use real-life data (revenue data for 7 years monthly) and the resulting forecasts are strangely flat. My code:
library(forecast)
x <- read.csv("Revenue.csv", header=TRUE)
y <- ts(x, freq=12, start=c(2011,1))
(fit<-nnetar(y))
fcast <- forecast(fit, PI=TRUE, h=20, bootstrap=TRUE)
autoplot(fcast)
The result is an almost straight line (attached as picture 1). That strikes me as odd, because the trend has been positive so far: there was a revenue growth of more than 100% every year. Still the result of nnetar is that the revenue will stabilise. How is that possible?
As a comparison I used Auto.arima for the same data set (picture 2). It shows a clear upward trend.
One suggestion, even if it's hard to help without a data sample.
It appears that nnetar is not capturing the trend in your data very well.
You could try passing a trend as an external regressor (the xreg argument).
For example, for a deterministic trend:
Trend <- seq_along(y)
(fit <- nnetar(y, xreg = Trend))
h <- 20
(f <- forecast(fit, h = h, xreg = length(y) + seq_len(h)))
An alternative would be to use more lags or seasonal lags (the p and P arguments of nnetar).

Convert double differenced forecast into actual value diff() in R

I have already read
Time Series Forecast: Convert differenced forecast back to before difference level
and
How to "undifference" a time series variable
Unfortunately, none of these gives a clear answer on how to convert a forecast from an ARIMA model fitted to a differenced series (using diff() to reach a stationary series) back to the original scale.
Code sample:
## read data and start from 1 jan 2014
dat<-read.csv("rev forecast 2014-23 dec 2015.csv")
val.ts <- ts(dat$Actual,start=c(2014,1,1),freq=365)
##Check how we can get stationary series
plot((diff(val.ts)))
plot(diff(diff(val.ts)))
plot(log(val.ts))
plot(log(diff(val.ts)))
plot(sqrt(val.ts))
plot(sqrt(diff(val.ts)))
##I found that double differencing. i.e.diff(diff(val.ts)) gives stationary series.
#I ran below code to get value of 3 parameters for ARIMA from auto.arima
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE, xreg=diff(diff(xreg)))
#Finally ran ARIMA
fit <- Arima(diff(diff(val.ts)),order=c(5,0,2),xreg = diff(diff(xreg)))
#plot original to see fit
plot(diff(diff(val.ts)),col="orange")
#plot fitted
lines(fitted(fit),col="blue")
This gives me a perfectly fitting time series. However, how do I convert the fitted values from their current form back into the original metric? I mean from the double-differenced form back into actual numbers. For a log I know we can use exp(fitted(fit)) (or 10^fitted(fit) for log10), and for a square root there is a similar solution, but what should I do for differencing, and double differencing at that?
Any help on this in R, please? After days of rigorous exercise, I am stuck at this point.
I ran a test to check whether differencing has any impact on the model fit of the auto.arima function and found that it does. So auto.arima can't handle non-stationary series, and it requires some effort on the part of the analyst to convert the series to stationary.
Firstly, auto.arima without any differencing. Orange is the actual value, blue is the fitted value.
ARIMAfit <- auto.arima(val.ts, approximation=FALSE,trace=FALSE, xreg=xreg)
plot(val.ts,col="orange")
lines(fitted(ARIMAfit),col="blue")
Secondly, I tried differencing.
ARIMAfit <- auto.arima(diff(val.ts), approximation=FALSE,trace=FALSE, xreg=diff(xreg))
plot(diff(val.ts),col="orange")
lines(fitted(ARIMAfit),col="blue")
Thirdly, I differenced twice.
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE,
xreg=diff(diff(xreg)))
plot(diff(diff(val.ts)),col="orange")
lines(fitted(ARIMAfit),col="blue")
A visual inspection suggests that the third graph is the most accurate of all. This I am aware of. The challenge is how to convert these fitted values, which are in double-differenced form, back into the actual metric!
The opposite of diff is kind of cumsum, but you need to know the starting values at each diff.
e.g:
set.seed(1234)
x <- runif(100)
z <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
all.equal(z, x)
[1] TRUE
Share some of your data to make a reproducible example to better help answer the question.
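Base R also ships diffinv(), which does the same bookkeeping as the nested cumsum() calls. The sketch below is an illustration on random data: it rebuilds the series from its second differences, and then treats the last ten second differences as if they were forecasts to show how they would be mapped back to levels. The train/holdout split is hypothetical.

```r
set.seed(1234)
x <- runif(100)

# Second differences, then the inverse using the first two original values.
d2 <- diff(x, differences = 2)
x_back <- diffinv(d2, differences = 2, xi = x[1:2])
print(all.equal(x_back, x))

# Pretend the second differences that produce x[91:100] are forecasts.
# To undifference them, start from the last two observed levels.
train <- x[1:90]
fc_d2 <- d2[89:98]
fc_levels <- diffinv(fc_d2, differences = 2, xi = tail(train, 2))[-(1:2)]
print(all.equal(fc_levels, x[91:100]))
```

The xi argument carries the starting values, which is exactly the information that differencing throws away.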
If you expect that differencing will be necessary to obtain stationarity, then why not simply include the maximum differencing order in the function call? That is, the "I" in ARIMA is the order of differencing prior to fitting an ARMA model, such that if
y = diff(diff(x)) and y is an ARMA(p,q) process,
then
x follows an ARIMA(p,2,q) process.
In auto.arima() you fix the differencing order with the d argument (or D for seasonal differencing), or cap it with max.d (max.D for seasonal). So, you want something like this (for a maximum of 3 differences):
fit <- auto.arima(val.ts, max.d=3, ...)
From this, you can verify that the fitted values do indeed map onto the original data:
plot(val.ts)
lines(fitted(fit), col="blue")
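The claim that an ARIMA(p,2,q) fit already lives on the original scale can be checked with base R's arima(). This is a sketch on simulated data (the series and the AR coefficient are assumptions); fitted values are recovered as observations minus residuals.

```r
# Simulated series that needs two differences to become stationary (assumption).
set.seed(99)
x <- ts(cumsum(cumsum(arima.sim(list(ar = 0.5), n = 200))))

# Differencing is handled inside the model via d = 2.
fit <- arima(x, order = c(1, 2, 0))

# Fitted values and forecasts are both on the scale of x, not of diff(diff(x)).
fitted_levels <- x - residuals(fit)
fc <- predict(fit, n.ahead = 10)

plot(x, col = "orange")
lines(fitted_levels, col = "blue")
```

No manual undifferencing is needed here, because the model integrates the differenced process back to levels itself.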
In the example below containing dummy data, I have double differenced. First, I removed seasonality (lag = 12) and then I removed trend from the differenced data (lag = 1).
library(magrittr) # provides %>%
set.seed(1234)
x <- rep(NA,24)
x <- x %>%
rnorm(mean = 10, sd = 5) %>%
round(.,0) %>%
abs()
yy <- diff(x, lag = 12)
z <- diff(yy, lag = 1)
Using the script that @jeremycg included above, which I include below, how would I remove the double difference? Would I need to add lag specifiers to the two nested diff() commands? If so, which diff() would have the lag = 12 specifier and which would have the lag = 1?
zz <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
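One way to approach this, sketched with base R's diffinv(): undo the differences in the reverse of the order in which they were applied, so the lag = 1 step is inverted first and the lag = 12 step second. A plain cumsum() only inverts a lag-1 difference, so it is not enough for the seasonal step. The dummy data below mirrors the example above but is generated without the pipe.

```r
set.seed(1234)
x  <- abs(round(rnorm(24, mean = 10, sd = 5)))

yy <- diff(x, lag = 12)   # seasonal difference, applied first
z  <- diff(yy, lag = 1)   # first difference of the result, applied second

# Invert in reverse order. Each step needs its own starting values:
# one value of yy for the lag-1 step, twelve values of x for the lag-12 step.
yy_back <- diffinv(z, lag = 1, xi = yy[1])
x_back  <- diffinv(yy_back, lag = 12, xi = x[1:12])

print(all.equal(x_back, x))
```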

Auto.arima is not showing any order

I am trying to fit an ARIMA model using the auto.arima function in R. The result shows order (0,0,0) even though the data is non-stationary.
auto.arima(x,approximation=TRUE)
ARIMA(0,0,0) with non-zero mean
Can someone advise why I am getting such results? By the way, I am running this function on only 10 data points.
10 data points is a very low number of observations for estimating an ARIMA model. I doubt that you can make any sensible estimation based on this. Moreover, the estimated model may depend strongly on which part of a time series you look at, and adding only a few observations can change the characteristics of the estimated model significantly. For example:
When I take a time series with only 10 observations, I also get an ARIMA(0,0,0) model:
library(forecast)
vec1 <- ts(c(10.26063, 10.60462, 10.37365, 11.03608, 11.19136, 11.13591, 10.84063, 10.66458, 11.06324, 10.75535), frequency = 12)
fit1 <- auto.arima(vec1)
summary(fit1)
However, if I use 30 observations, an ARIMA(1,0,0) model is estimated:
vec2 <- ts(c(10.260626, 10.604616, 10.373652, 11.036079, 11.191359, 11.135914, 10.840628, 10.664575, 11.063239, 10.755350,
10.158032, 10.653669, 10.659231, 10.483478, 10.739133, 10.400146, 10.205993, 10.827950, 11.018257, 11.633930,
11.287756, 11.202727, 11.244572, 11.452180, 11.199706, 10.970823, 10.386131, 10.184201, 10.209338, 9.544736), frequency = 12)
fit2 <- auto.arima(vec2)
summary(fit2)
If I use the whole time series (413 observations), the auto.arima function estimates a "ARIMA(2,1,4)(0,0,1)[12] with drift".
Thus, I would think that 10 observations are indeed not enough information for fitting a model.
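A rough way to see how little 10 points can reveal is the usual approximate 95% band for the sample autocorrelation of white noise, 1.96/sqrt(n). This is only an illustration: auto.arima() selects models by information criteria and unit-root tests, not by these bands.

```r
# Approximate 95% white-noise bands for the sample ACF at three sample sizes.
n <- c(10, 30, 413)
bound <- qnorm(0.975) / sqrt(n)
print(data.frame(n = n, bound = round(bound, 3)))
```

With 10 observations, a sample autocorrelation has to exceed roughly 0.62 in absolute value before it stands out from noise, compared with roughly 0.36 for 30 observations and 0.10 for 413.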

Resources