Problems interpreting fitted values from ets() and auto.arima() models in R

I'm stuck on this question and can't solve it.
When I model the AirPassengers data with ets() and auto.arima(), the fitted values seem to follow the observed values reasonably well:
library(forecast)
a <- ts(AirPassengers, start = 1949, frequency = 12)
a <- window(a, start = 1949, end = c(1954,12), frequency = 12)
fit_a_ets <- ets(a)
fit_a_arima <- auto.arima(a)
plot(a)
lines(fit_a_ets$fitted, col = "blue")
lines(fit_a_arima$fitted, col = "red")
Plot from AirPassengers and fitted models
When I tried the same code on my own data, the fitted values seem shifted by one period:
b <- c(1237,1982,1191,1163,1418,1687,2331,2181,1943,1782,177,1871,391,1397,734,712,1006,508,368,767,675,701,989,725,1292,983,1094,1105,928,1246,1604,1163,1390,959,1630,789,1173,910,875,718,655,606,968,716,476,476,655,499,544,1250,359,386,458,947,542,953,1450,1195,1317,957,778,1030,1399,1119,3142,1024,1537,1321,2062,1897,2094,2546,1796,2089,1194,896,727,599,785,674,828,311,375,315,365,314,126,315,372,666,596,589,001,613,498,635,644,1018,873,900,502,121,293,259,311,169,378,153,24,115,250,565,349,201,393,83,327,325,185,307,501,194)
b <- ts(b, start = 1949, frequency = 12)
b <- window(b, start = 1949, end = c(1954,12), frequency = 12)
fit_b_ets <- ets(b)
fit_b_arima <- auto.arima(b)
plot(b)
lines(fit_b_ets$fitted, col = "blue")
lines(fit_b_arima$fitted, col = "red")
Plot from my data and fitted models
Does anyone know why?
I looked at https://otexts.com/fpp2/index.html and still didn't understand why this happens.
I thought it might be because the models don't fit my data well, but the same thing occurs with other data sets. For example, see figure 7.1 in https://otexts.com/fpp2/ses.html.

This is typical.
In the context of forecasting, the "fitted" value is the one-step-ahead forecast. For many different types of series, the best that we can do is something that's close to the latest observation, plus a small adjustment. This makes it look like the "fitted" value lags by 1 period because it is then usually quite close to the previous observed value.
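A minimal sketch of why this happens, using simulated data rather than the series from the question. It assumes a random walk, for which the best one-step-ahead forecast is simply the previous observation; the names y and fitted_naive are made up for illustration.

```r
set.seed(42)

# Simulated random walk (an assumption for illustration, not the asker's data)
y <- ts(cumsum(rnorm(72)), start = 1949, frequency = 12)

# For a random walk, the one-step-ahead forecast of y[t] is y[t - 1],
# so the "fitted" series is the observed series shifted right by one period
fitted_naive <- ts(c(NA, head(as.numeric(y), -1)),
                   start = start(y), frequency = frequency(y))

plot(y)
lines(fitted_naive, col = "blue")  # looks like y delayed by one period
```

Real models add a small adjustment on top of the last observation, so the shift is less exact, but the visual impression is the same.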
Asking why the fitted series lags is like asking "why can't we know the future before it happens?". It's simply not that easy, and it doesn't indicate that the model is necessarily inadequate (it may not be possible to do better).
Plots comparing the observed series with the fitted values are rarely of much use for judging forecasts; they nearly always look like this. The apparent horizontal shift also makes it hard to judge the vertical distance between the lines, which is what you actually care about (the forecasting error). It is better to plot the forecasting error directly.
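One way to plot the error directly, sketched with simulated data and stats::arima so that it runs without extra packages. The series y and the ARIMA(0,1,1) model (which is equivalent to simple exponential smoothing) are assumptions for illustration, not the models from the question.

```r
set.seed(42)

# Simulated series standing in for real data
y <- ts(cumsum(rnorm(72)), start = 1949, frequency = 12)

# ARIMA(0,1,1), fitted with base R to keep the sketch dependency-free
fit <- arima(y, order = c(0, 1, 1))

# One-step-ahead forecast errors (observed minus fitted)
err <- residuals(fit)

plot(err, type = "h", ylab = "One-step forecast error")
abline(h = 0, lty = 2)

# A single-number summary of the same information
rmse <- sqrt(mean(err^2))
```

With the objects from the question, residuals(fit_b_ets) and residuals(fit_b_arima) should give the same kind of error series.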
The AirPassengers series is unusual because it is extremely easy to forecast based on its seasonality. Most series you will encounter in the wild are not quite like this.

Related

Incorporating Known Limits into Time-Series Forecasting

I am trying to forecast a time-series with clear seasonality. I believe there is weekly and daily seasonality, and have a time-series sampled at 1-minute intervals over a year. To start, I fit a dynamic harmonic regression model with ARMA errors, by doing:
library("forecast")
fourier_components <- fourier( x = data, K=c(k_value,k_value) )
arima_model_fit <- auto.arima( data, seasonal=FALSE, lambda=0, xreg=fourier_components )
And this, as a first attempt (I can tune K and so on), works reasonably well. If I forecast ahead, I get:
forecast_fourier <- fourier( x = data, K=c(k_value, k_value), h = 1440*7*4 )
seasonal_test_forecast <- forecast( object = arima_model_fit, xreg = forecast_fourier )
autoplot(seasonal_test_forecast, include = 1440*7*4*2, ylab="data")
However, I know due to physical limitations that the time-series cannot go below 0 or above some threshold T. I can stop the prediction intervals going below 0 with the lambda=0 call in auto.arima, as this provides a log-transform. But, is there some way I can account for the upper limit in the typical behaviour? Without explicitly doing it, I clearly get physically unreasonable prediction intervals over longer horizons.
EDIT: As pointed out in the comments, a method to account for this is detailed in otexts.com/fpp2/limits.html using an adjustment to the log-transform.
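A rough sketch of the adjustment described on that page: a scaled logit transform maps a series bounded between a and b onto the whole real line, and its inverse maps forecasts back inside the bounds. The bounds, the helper names to_unbounded and to_bounded, and the sample values are hypothetical.

```r
# Hypothetical bounds: the series must stay strictly between a and b
a <- 0
b <- 100  # stands in for the threshold T

# Scaled logit: maps (a, b) onto the real line
to_unbounded <- function(x, a, b) log((x - a) / (b - x))

# Inverse: maps any real number back into (a, b);
# plogis(z) is exp(z) / (1 + exp(z)) computed in a numerically stable way
to_bounded <- function(z, a, b) (b - a) * plogis(z) + a

x <- c(5, 40, 75, 99)
z <- to_unbounded(x, a, b)
x_back <- to_bounded(z, a, b)
all.equal(x_back, x)
```

Under this approach the model would be fitted to to_unbounded(data, a, b) without lambda = 0, and the point forecasts and interval limits would then be passed through to_bounded().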

How do I fix the abline warning, only using first two coefficients?

I have been unable to resolve an error when using abline(). I keep getting the warning message: In abline(model): only using the first two of 7 regression coefficients. I've been searching and seen many instances of others with this error but their examples are for multiple linear functions. I'm new to R and below is a simple example I'm using to work with. Thanks for any help!
year = c('2010','2011','2012','2013','2014','2015','2016')
population = c(25244310,25646389,26071655,26473525,26944751,27429639,27862596)
Texas=data.frame(year,population)
plot(population~year,data=Texas)
model = lm(population~year,data=Texas)
abline(model)
You probably want something like the following where we make sure that year is interpreted as a numeric variable in your model:
plot(population ~ year, data = Texas)
model <- lm(population ~ as.numeric(as.character(year)), data = Texas)
abline(model)
This makes lm estimate an intercept (corresponding to year 0) and a slope (the mean increase in population each year), which abline interprets correctly, as can also be seen on the plot.
The reason for the warning is that year becomes a factor with 7 levels, so your lm call estimates the mean value for the reference year 2010 (the intercept) and 6 contrasts to the other years. Hence you get seven coefficients, and abline incorrectly uses only the first two.
Edit: With that said, you probably want to change the way year is stored to numeric. Then your code works, and plot also draws a proper scatter plot with the regression line.
Texas$year <- as.numeric(as.character(Texas$year))
plot(population ~ year, data = Texas, pch = 16)
model <- lm(population ~ year, data = Texas)
abline(model)
Note that as.character is needed in general, because as.numeric applied directly to a factor returns the level codes rather than the years. It works in lm without it only by coincidence (because the years are consecutive).
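A short check of both points, using the data from the question. The names model_factor, model_num, year_num and f are made up for this sketch.

```r
year <- c("2010", "2011", "2012", "2013", "2014", "2015", "2016")
population <- c(25244310, 25646389, 26071655, 26473525,
                26944751, 27429639, 27862596)
Texas <- data.frame(year, population)

# Character (or factor) year: intercept plus one contrast per remaining year
model_factor <- lm(population ~ year, data = Texas)
length(coef(model_factor))  # 7

# Numeric year: intercept and slope only, which is what abline() expects
Texas$year_num <- as.numeric(as.character(Texas$year))
model_num <- lm(population ~ year_num, data = Texas)
length(coef(model_num))     # 2

# Why as.character matters: as.numeric() on a factor returns level codes
f <- factor(c("2010", "2012", "2015"))
as.numeric(f)                # 1 2 3
as.numeric(as.character(f))  # 2010 2012 2015
```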

Match "next day" using forecast() in R

I am working through the "Forecasting Using R" DataCamp course. I have completed the entire thing except for the last part of one particular exercise (link here, if you have an account), where I'm totally lost. The error help it's giving me isn't helping either. I'll put the various parts of the task down with the code I'm using to solve them:
Produce time plots of only the daily demand and maximum temperatures with facetting.
autoplot(elec[, c("Demand", "Temperature")], facets = TRUE)
Index elec accordingly to set up the matrix of regressors to include MaxTemp for the maximum temperatures, MaxTempSq which represents the squared value of the maximum temperature, and Workday, in that order.
xreg <- cbind(MaxTemp = elec[, "Temperature"],
MaxTempSq = elec[, "Temperature"] ^2,
Workday = elec[,"Workday"])
Fit a dynamic regression model of the demand column with ARIMA errors and call this fit.
fit <- auto.arima(elec[,"Demand"], xreg = xreg)
If the next day is a working day (indicator is 1) with maximum temperature forecast to be 20°C, what is the forecast demand? Fill out the appropriate values in cbind() for the xreg argument in forecast().
This is where I'm stuck. The sample code they supply looks like this:
forecast(___, xreg = cbind(___, ___, ___))
I have managed to work out that the first blank is fit, so I'm trying code that looks like this:
forecast(fit, xreg = cbind(elec[,"Workday"]==1, elec[, "Temperature"]==20, elec[,"Demand"]))
But that is giving me the error hint "Make sure to forecast the next day using the inputs given in the instructions." Which... doesn't tell me anything useful. Any ideas what I should be doing instead?
When you are forecasting ahead of time, you use new data that was not included in elec (which is the data set you used to fit your model). The new data was given to you in the question (temperature 20°C and workday 1). Therefore, you do not need elec in your forecast call. Just supply the new values, in the same column order as the xreg used for fitting (MaxTemp, MaxTempSq, Workday):
forecast(fit, xreg = cbind(20, 20^2, 1))
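The same idea can be sketched without the course data, using simulated values and base R's arima() and predict() as stand-ins for auto.arima() and forecast(). Everything here (the simulated temp, workday and demand, and the AR(1) error structure) is an assumption for illustration. The point is only that the future regressors form one row per forecast step, with columns in the same order as at fitting time.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the elec columns (not the real data)
temp <- runif(n, 10, 35)
workday <- rep(c(1, 1, 1, 1, 1, 0, 0), length.out = n)
demand <- 200 - 8 * temp + 0.25 * temp^2 + 15 * workday +
  as.numeric(arima.sim(list(ar = 0.5), n = n))

# Regressors used for fitting: MaxTemp, MaxTempSq, Workday, in that order
xreg <- cbind(MaxTemp = temp, MaxTempSq = temp^2, Workday = workday)
fit <- arima(demand, order = c(1, 0, 0), xreg = xreg)

# One future day: 20 degrees, its square, and a working day
new_xreg <- cbind(MaxTemp = 20, MaxTempSq = 20^2, Workday = 1)
fc <- predict(fit, n.ahead = 1, newxreg = new_xreg)
fc$pred
```

Passing historical columns such as elec[, "Workday"] == 1 instead would supply one row per past day, which is not what a one-day-ahead forecast needs.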

Auto.arima() function does not result in white noise. How else should I go about modeling data

Here is the plot of the initial data (after performing a log transformation).
It is evident there is both a linear trend and a seasonal pattern. I can address both of these by taking the first and twelfth (seasonal) difference: diff(diff(data), 12). After doing so, here is the plot of the resulting data.
This data does not look great. While the mean is constant, we see a funneling effect as time progresses. Here are the ACF/PACF.
Any suggestions for possible fits to try? I used the auto.arima() function, which suggested an ARIMA(2,0,2)xARIMA(1,0,2)(12) model. However, once I took the residuals from the fit, it was clear there was still some sort of structure in them. Here is the plot of the residuals from the fit as well as the ACF/PACF of the residuals.
There does not appear to be a seasonal pattern regarding which lags have spikes in the ACF/PACF of residuals. However, this is still something not captured by the previous steps. What do you suggest I do? How could I go about building a better model that has better model diagnostics (which at this point is just a better looking ACF and PACF)?
Here is my simplified code thus far:
library(TSA)
library(forecast)
beer <- read.csv('beer.csv', header = TRUE)
beer <- ts(beer$Production, start = c(1956, 1), frequency = 12)
# transform data
boxcox <- BoxCox.ar(beer) # 0 in confidence interval
beer.log <- log(beer)
# get rid of linear and seasonal trend
firstDifference <- diff(diff(beer.log), 12)
acf(firstDifference)
pacf(firstDifference)
eacf(firstDifference)
plot(armasubsets(firstDifference, nar=12, nma=12))
# fitting the model
auto.arima(firstDifference, ic = 'bic') # from the forecast package
modelFit <- arima(firstDifference, order = c(1, 0, 0),
                  seasonal = list(order = c(2, 0, 0), period = 12))
# assessing model
resid <- modelFit$residuals
acf(resid, lag.max = 15)
pacf(resid, lag.max = 15)
Here is the data, if interested (I think you can use an html to csv converter if you would like): https://docs.google.com/spreadsheets/d/1S8BbNBdQFpQAiCA4J18bf7PITb8kfThorMENW-FRvW4/pubhtml
Jane,
There are a few things going on here.
Instead of logs, we used the Tsay variance test, which shows that the variance increased after period 118. Weighted least squares deals with it.
March becomes higher beginning at period 111. An alternative to an AR(12) term or seasonal differencing is to identify seasonal dummies. We found that 7 of the 12 months were unusual, with a couple of level shifts, an AR(2) with 2 outliers.
Here is the fit and forecasts.
Here are the residuals.
ACF of residuals
Note: I am a developer of the software Autobox. All models are wrong. Some are useful.
Here is Tsay's paper
http://onlinelibrary.wiley.com/doi/10.1002/for.3980070102/abstract

Convert double differenced forecast into actual value diff() in R

I have already read
Time Series Forecast: Convert differenced forecast back to before difference level
and
How to "undifference" a time series variable
Unfortunately, none of these gives a clear answer on how to convert forecasts from an ARIMA model, fitted to a series that was differenced with diff() to make it stationary, back to the original scale.
Code sample:
## read data and start from 1 jan 2014
dat<-read.csv("rev forecast 2014-23 dec 2015.csv")
val.ts <- ts(dat$Actual, start = c(2014, 1), frequency = 365)
##Check how we can get stationary series
plot((diff(val.ts)))
plot(diff(diff(val.ts)))
plot(log(val.ts))
plot(log(diff(val.ts)))
plot(sqrt(val.ts))
plot(sqrt(diff(val.ts)))
## I found that double differencing, i.e. diff(diff(val.ts)), gives a stationary series.
# I ran the code below to get the values of the 3 ARIMA parameters from auto.arima
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE, xreg=diff(diff(xreg)))
#Finally ran ARIMA
fit <- Arima(diff(diff(val.ts)),order=c(5,0,2),xreg = diff(diff(xreg)))
#plot original to see fit
plot(diff(diff(val.ts)),col="orange")
#plot fitted
lines(fitted(fit),col="blue")
This gives me a perfectly fitted time series. However, how do I convert the fitted values from their current double-differenced form back to the original metric? For a log transform I know we can use exp(fitted(fit)), and there is a similar solution for the square root, but what do I do for differencing, and double differencing at that?
Any help with this in R, please? After days of rigorous effort, I am stuck at this point.
I ran a test to check whether differencing has any impact on the model fit of the auto.arima function and found that it does. So auto.arima can't handle non-stationary series, and it requires some effort on the analyst's part to convert the series to a stationary one.
Firstly, auto.arima without any differencing. Orange is the actual value, blue is the fitted value.
ARIMAfit <- auto.arima(val.ts, approximation=FALSE,trace=FALSE, xreg=xreg)
plot(val.ts,col="orange")
lines(fitted(ARIMAfit),col="blue")
Secondly, I tried differencing once:
ARIMAfit <- auto.arima(diff(val.ts), approximation=FALSE,trace=FALSE, xreg=diff(xreg))
plot(diff(val.ts),col="orange")
lines(fitted(ARIMAfit),col="blue")
Thirdly, I differenced twice:
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE,
xreg=diff(diff(xreg)))
plot(diff(diff(val.ts)),col="orange")
lines(fitted(ARIMAfit),col="blue")
A visual inspection suggests that the third graph is the most accurate of the three. I am aware of this. The challenge is how to convert these fitted values, which are in double-differenced form, back into the actual metric!
The opposite of diff is kind of cumsum, but you need to know the starting values at each diff.
e.g:
set.seed(1234)
x <- runif(100)
z <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
all.equal(z, x)
[1] TRUE
Share some of your data to make a reproducible example to better help answer the question.
If you expect that differencing will be necessary to obtain stationarity, then why not simply include the maximum differencing order in the function call? That is, the "I" in ARIMA is the order of differencing prior to fitting an ARMA model, such that if
y = diff(diff(x)) and y is an ARMA(p,q) process,
then
x follows an ARIMA(p,2,q) process.
In auto.arima() you specify the differencing with the d argument (or D for seasonal differencing). Note that d fixes the order of differencing, whereas max.d only caps the order that auto.arima() may choose. So, you want something like this (for the two differences used in the question):
fit <- auto.arima(val.ts, d=2, ...)
From this, you can verify that the fitted values do indeed map onto the original data:
plot(val.ts)
lines(fitted(fit), col="blue")
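A self-contained sketch of the same point with base R's arima(), assuming a simulated series that needs two differences. The names innov, x and fitted_x are illustrative, and the AR(1) structure is an arbitrary choice.

```r
set.seed(123)

# Simulated series that needs two differences (an assumption for illustration)
innov <- as.numeric(arima.sim(list(ar = 0.6), n = 200))
x <- cumsum(cumsum(innov))

# Let the model do the differencing: d = 2 inside the order
fit <- arima(x, order = c(1, 2, 0))

# stats::arima stores residuals, so fitted = observed - residual,
# already on the original (undifferenced) scale
fitted_x <- x - as.numeric(residuals(fit))

plot(x, type = "l", col = "orange")
lines(fitted_x, col = "blue")
```

Because the differencing happens inside the model, no manual undifferencing of fitted values or forecasts is needed.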
In the example below containing dummy data, I have double differenced. First, I removed seasonality (lag = 12) and then I removed trend from the differenced data (lag = 1).
library(magrittr)  # provides the %>% pipe used below
set.seed(1234)
x <- rep(NA,24)
x <- x %>%
  rnorm(mean = 10, sd = 5) %>%
  round(.,0) %>%
  abs()
yy <- diff(x, lag = 12)
z <- diff(yy, lag = 1)
Using the script that @jeremycg included above (repeated below), how would I remove the double difference? Would I need to add lag specifiers to the two nested diff() commands? If so, which diff() would have the lag = 12 specifier and which would have lag = 1?
zz <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
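One possible way to undo the seasonal-plus-trend differencing is base R's diffinv(), applied in the reverse order of the diff() calls. This is a sketch on dummy data generated without the pipe; the names yy_rec and x_rec are made up. The cumsum() approach only covers lag = 1, whereas diffinv() takes a lag argument and needs as many starting values (xi) as the lag.

```r
set.seed(1234)
x <- abs(round(rnorm(24, mean = 10, sd = 5)))

yy <- diff(x, lag = 12)  # seasonal difference first
z <- diff(yy, lag = 1)   # then the lag-1 difference

# Undo the differences in reverse order:
# the lag-1 step needs the first value of yy,
# the lag-12 step needs the first 12 values of x
yy_rec <- diffinv(z, lag = 1, xi = yy[1])
x_rec <- diffinv(yy_rec, lag = 12, xi = x[1:12])

all.equal(x_rec, x)
```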
