Convert a double-differenced forecast back into actual values (diff()) in R

I have already read
Time Series Forecast: Convert differenced forecast back to before difference level
and
How to "undifference" a time series variable
Unfortunately, neither of these gives a clear answer on how to convert an ARIMA forecast fitted on a diff()-differenced (stationary) series back to the original level.
Code sample:
## read data, starting from 1 Jan 2014
library(forecast)
dat <- read.csv("rev forecast 2014-23 dec 2015.csv")
val.ts <- ts(dat$Actual, start = c(2014, 1), freq = 365)
## check which transformation gives a stationary series
plot(diff(val.ts))
plot(diff(diff(val.ts)))
plot(log(val.ts))
plot(diff(log(val.ts)))   # difference the log, not log the differences (which can be negative)
plot(sqrt(val.ts))
plot(diff(sqrt(val.ts)))
## I found that double differencing, i.e. diff(diff(val.ts)), gives a stationary series.
# run auto.arima on the double-differenced series to get the three ARIMA order parameters
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation = FALSE, trace = FALSE,
                       xreg = diff(diff(xreg)))
# finally, fit the ARIMA with the chosen order
fit <- Arima(diff(diff(val.ts)), order = c(5, 0, 2), xreg = diff(diff(xreg)))
# plot the (double-differenced) series to see the fit
plot(diff(diff(val.ts)), col = "orange")
# overlay the fitted values
lines(fitted(fit), col = "blue")
This gives me a near-perfectly fitting time series. However, how do I convert the fitted values back into their original metric, i.e. from the double-differenced scale into actual numbers? For log() I know we can invert with exp(fitted(fit)), and for sqrt() there is a similar inverse, but what do I do for differencing, and double differencing at that?
Any help on this in R, please? After days of rigorous effort, I am stuck at this point.
I ran a test to check whether differencing has any impact on the model fit from auto.arima and found that it does. So auto.arima can't handle a non-stationary series by itself; it requires some effort on the analyst's part to make the series stationary.
Firstly, auto.arima without any differencing. Orange is the actual value, blue is the fitted value.
ARIMAfit <- auto.arima(val.ts, approximation = FALSE, trace = FALSE, xreg = xreg)
plot(val.ts, col = "orange")
lines(fitted(ARIMAfit), col = "blue")
Secondly, I tried differencing once:
ARIMAfit <- auto.arima(diff(val.ts), approximation = FALSE, trace = FALSE, xreg = diff(xreg))
plot(diff(val.ts), col = "orange")
lines(fitted(ARIMAfit), col = "blue")
Thirdly, I differenced twice:
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation = FALSE, trace = FALSE,
                       xreg = diff(diff(xreg)))
plot(diff(diff(val.ts)), col = "orange")
lines(fitted(ARIMAfit), col = "blue")
Visual inspection suggests that the third fit is the most accurate of all. That much I am aware of. The challenge is how to convert these fitted values, which are on the double-differenced scale, back into the actual metric!

The opposite of diff() is essentially cumsum(), but you need to know the starting value at each level of differencing.
e.g.:
set.seed(1234)
x <- runif(100)
# rebuild x: undo the inner diff first, then the outer one,
# seeding each cumsum with the corresponding first value
z <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
all.equal(z, x)
[1] TRUE
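Applied to the model above, a minimal sketch (assuming fit was estimated on diff(diff(val.ts)) as in the question) seeds the two cumulative sums with the first values of val.ts and diff(val.ts):
fhat <- fitted(fit)                        # fitted double differences, length n - 2
d1hat <- cumsum(c(diff(val.ts)[1], fhat))  # approximate first differences, length n - 1
xhat <- cumsum(c(val.ts[1], d1hat))        # approximate series on the original scale, length n
plot(val.ts, col = "orange")
lines(ts(xhat, start = start(val.ts), frequency = frequency(val.ts)), col = "blue")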
Share some of your data to make a reproducible example; that will make the question easier to answer.

If you expect that differencing will be necessary to obtain stationarity, why not simply include the differencing order in the function call? That is, the "I" in ARIMA is the order of differencing applied before fitting an ARMA model, such that if
y = diff(diff(x)) and y is an ARMA(p,q) process,
then
x follows an ARIMA(p,2,q) process.
In auto.arima() you can fix the order of non-seasonal differencing with the d argument (or the seasonal order with D); max.d sets an upper bound if you would rather let the function choose. So, for the double differencing above, you want something like this:
fit <- auto.arima(val.ts, d = 2, ...)
From this, you can verify that the fitted values map directly onto the original data:
plot(val.ts)
lines(fitted(fit), col = "blue")
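As a quick check on toy data (a sketch, not the asker's series): an ARMA fitted to the double-differenced series and an ARIMA with d = 2 fitted to the raw series estimate essentially the same coefficients, but only the latter returns fitted values on the original scale.
library(forecast)
set.seed(1)
x <- ts(cumsum(cumsum(rnorm(200))))  # a toy I(2) series
fit1 <- Arima(diff(diff(x)), order = c(1, 0, 1), include.mean = FALSE)
fit2 <- Arima(x, order = c(1, 2, 1))
coef(fit1)  # near-identical ARMA coefficients to fit2...
coef(fit2)  # ...but fitted(fit2) lives on the scale of x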

In the example below, containing dummy data, I have double differenced: first I removed seasonality (lag = 12), and then I removed the trend from the differenced data (lag = 1).
library(magrittr)  # for %>%
set.seed(1234)
x <- rnorm(24, mean = 10, sd = 5) %>%
  round(0) %>%
  abs()
yy <- diff(x, lag = 12)
z <- diff(yy, lag = 1)
Using the script that @jeremycg included above, and that I include below, how would I remove the double difference? Would I need to add lag specifiers to the two nested diff() commands? If so, which diff() would take the lag = 12 specifier and which the lag = 1?
zz <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
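One way to invert the two lagged differences is stats::diffinv(), which undoes diff() given the initial values xi. A hedged sketch on the dummy data above, undoing the lag-1 difference first (it was applied last) and then the lag-12 difference:
yy.rec <- diffinv(z, lag = 1, xi = yy[1])         # undo the trend difference
x.rec <- diffinv(yy.rec, lag = 12, xi = x[1:12])  # undo the seasonal difference
all.equal(as.numeric(x.rec), as.numeric(x))       # TRUE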

Related

Problems Interpreting fitted values from ETS() and AUTO.ARIMA() models in R

I'm stuck on this question and can't solve it.
When modelling the AirPassengers data with ets() and auto.arima(), the fitted values seem reasonably well fitted to the observed values:
library(forecast)
a <- ts(AirPassengers, start = 1949, frequency = 12)
a <- window(a, start = 1949, end = c(1954,12), frequency = 12)
fit_a_ets <- ets(a)
fit_a_arima <- auto.arima(a)
plot(a)
lines(fit_a_ets$fitted, col = "blue")
lines(fit_a_arima$fitted, col = "red")
Plot from AirPassengers and fitted models
When I tried the same code on my data, the fitted series seems displaced by one period:
b <- c(1237,1982,1191,1163,1418,1687,2331,2181,1943,1782,177,1871,391,1397,734,712,1006,508,368,767,675,701,989,725,1292,983,1094,1105,928,1246,1604,1163,1390,959,1630,789,1173,910,875,718,655,606,968,716,476,476,655,499,544,1250,359,386,458,947,542,953,1450,1195,1317,957,778,1030,1399,1119,3142,1024,1537,1321,2062,1897,2094,2546,1796,2089,1194,896,727,599,785,674,828,311,375,315,365,314,126,315,372,666,596,589,001,613,498,635,644,1018,873,900,502,121,293,259,311,169,378,153,24,115,250,565,349,201,393,83,327,325,185,307,501,194)
b <- ts(b, start = 1949, frequency = 12)
b <- window(b, start = 1949, end = c(1954,12), frequency = 12)
fit_b_ets <- ets(b)
fit_b_arima <- auto.arima(b)
plot(b)
lines(fit_b_ets$fitted, col = "blue")
lines(fit_b_arima$fitted, col = "red")
Plot from my data and fitted models
Does anyone know why?
I tried https://otexts.com/fpp2/index.html and didn't find out why this happens.
I thought it might be because the model is not well fitted to my data, but the same occurs for other data sets; see, for example, figure 7.1 from https://otexts.com/fpp2/ses.html.
This is typical.
In the context of forecasting, the "fitted" value is the one-step-ahead forecast. For many different types of series, the best that we can do is something that's close to the latest observation, plus a small adjustment. This makes it look like the "fitted" value lags by 1 period because it is then usually quite close to the previous observed value.
Asking why the fitted series lags is like asking "why can't we know the future before it happens?". It's simply not that easy, and it doesn't indicate that the model is necessarily inadequate (it may not be possible to do better).
Plots comparing the time series of observations and fitted values are rarely of any use for forecasting; they always essentially look like this. It also makes it difficult to judge the vertical distance between the lines, which is what you actually care about (the forecasting error). Better to plot the forecasting error directly.
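For instance, a minimal sketch with the model fitted above:
plot(residuals(fit_b_arima), main = "One-step forecast errors")  # plot the errors, not the levels
abline(h = 0, lty = 2)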
The AirPassengers series is unusual because it is extremely easy to forecast based on its seasonality. Most series you will encounter in the wild are not quite like this.

Incorporating Known Limits into Time-Series Forecasting

I am trying to forecast a time series with clear seasonality. I believe there is weekly and daily seasonality, and I have a time series sampled at 1-minute intervals over a year. To start, I fit a dynamic harmonic regression model with ARMA errors:
library("forecast")
fourier_components <- fourier( x = data, K=c(k_value,k_value) )
arima_model_fit <- auto.arima( data, seasonal=FALSE, lambda=0, xreg=fourier_components )
And this, as a first attempt (I can tune K and so on), works reasonably well. If I forecast ahead, I get:
forecast_fourier <- fourier( x = data, K=c(k_value, k_value), h = 1440*7*4 )
seasonal_test_forecast <- forecast( object = arima_model_fit, xreg = forecast_fourier )
autoplot(seasonal_test_forecast, include = 1440*7*4*2, ylab="data")
However, I know from physical limitations that the time series cannot go below 0 or above some threshold T. I can stop the prediction intervals from going below 0 with the lambda=0 argument in auto.arima, as this applies a log transform. But is there some way I can account for the upper limit in the typical behaviour? Without explicitly doing so, I clearly get physically unreasonable prediction intervals over longer horizons.
EDIT: As pointed out in the comments, a method to account for this is detailed at https://otexts.com/fpp2/limits.html, using a scaled-logit adjustment of the log transform.
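For reference, a hedged sketch of that scaled-logit adjustment, replacing the lambda = 0 log transform (upper is a hypothetical stand-in for the threshold T, and the naive back-transform below ignores bias adjustment):
upper <- 100  # hypothetical ceiling; replace with the real physical threshold
scaled <- log(data / (upper - data))  # maps (0, upper) onto the whole real line
arima_scaled <- auto.arima(scaled, seasonal = FALSE, xreg = fourier_components)
seasonal_test_forecast <- forecast(arima_scaled, xreg = forecast_fourier)
inv <- function(y) upper * exp(y) / (1 + exp(y))  # back to (0, upper)
seasonal_test_forecast$mean <- inv(seasonal_test_forecast$mean)
seasonal_test_forecast$lower <- inv(seasonal_test_forecast$lower)
seasonal_test_forecast$upper <- inv(seasonal_test_forecast$upper)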

Auto.arima() function does not result in white noise. How else should I go about modeling data

Here is the plot of the initial data (after performing a log transformation).
It is evident there is both a linear trend and a seasonal pattern. I can address both of these by taking the first and the twelfth (seasonal) difference: diff(diff(data), 12). After doing so, here is the plot of the resulting data.
This data does not look great. While the mean is constant, we see a funneling effect as time progresses. Here are the ACF/PACF:
Any suggestions for possible fits to try? I used the auto.arima() function, which suggested an ARIMA(2,0,2)(1,0,2)[12] model. However, once I took the residuals from the fit, it was clear there was still some structure in them. Here are the plot of the residuals from the fit as well as the ACF/PACF of the residuals.
There does not appear to be a seasonal pattern in which lags have spikes in the residual ACF/PACF. However, something is still not captured by the previous steps. What do you suggest I do? How could I go about building a model with better diagnostics (which at this point just means a better-looking ACF and PACF)?
Here is my simplified code thus far:
library(TSA)
library(forecast)
beer <- read.csv('beer.csv', header = TRUE)
beer <- ts(beer$Production, start = c(1956, 1), frequency = 12)
# transform data
boxcox <- BoxCox.ar(beer)  # 0 is in the confidence interval
beer.log <- log(beer)
# remove the linear and seasonal trends
firstDifference <- diff(diff(beer.log), 12)
acf(firstDifference)
pacf(firstDifference)
eacf(firstDifference)
plot(armasubsets(firstDifference, nar=12, nma=12))
# fitting the model
auto.arima(firstDifference, ic = 'bic')  # from the forecast package
modelFit <- arima(firstDifference, order = c(1, 0, 0),
                  seasonal = list(order = c(2, 0, 0), period = 12))
# assessing model
resid <- modelFit$residuals
acf(resid, lag.max = 15)
pacf(resid, lag.max = 15)
Here is the data, if interested (I think you can use an html to csv converter if you would like): https://docs.google.com/spreadsheets/d/1S8BbNBdQFpQAiCA4J18bf7PITb8kfThorMENW-FRvW4/pubhtml
Jane,
There are a few things going on here.
Instead of logs, we used the Tsay variance test, which shows that the variance increased after period 118; weighted least squares deals with this.
March becomes higher beginning at period 111. An alternative to an AR(12) or seasonal differencing is to identify seasonal dummies. We found that 7 of the 12 months were unusual, along with a couple of level shifts and an AR(2) with two outliers.
Here are the fit and forecasts.
Here are the residuals.
Here is the ACF of the residuals.
Note: I am a developer of the software Autobox. All models are wrong. Some are useful.
Here is Tsay's paper: http://onlinelibrary.wiley.com/doi/10.1002/for.3980070102/abstract

How to use the forecast function for a simple moving average model in R?

I want to predict the future values for my simple moving average model. I used the following procedure:
x <- c(14,10,11,7,10,9,11,19,7,10,21,9,8,16,21,14,6,7)
df <- data.frame(x)
dftimeseries <- ts(df)
library(TTR)
smadf <- SMA(dftimeseries, 4) # n = 4 is the moving-average window
library(forecast)
forecasteddf <- forecast(smadf, 4) # future 4 values
When I run the above code, my forecast values are the same for all of the next 4 days. Am I coding it correctly, or am I conceptually wrong?
The same happens with the exponential moving average, the weighted moving average, and ARIMA.
For a moving average model, you can read here:
"Since the model assumes a constant underlying mean, the forecast for any number of periods in the future is the same...".
So your results are to be expected, given the characteristics of the moving average model.
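As a quick hedged illustration of that quote: forecast::meanf() fits exactly such a constant-mean model, and on the series from the question all of its point forecasts coincide.
library(forecast)
x <- c(14,10,11,7,10,9,11,19,7,10,21,9,8,16,21,14,6,7)
meanf(ts(x), h = 4)$mean  # four identical point forecasts (the sample mean)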
The forecast() function is loaded with the fpp2 package (it comes from forecast), and the moving average function sma() is from the smooth package.
This is an example:
library(smooth)
library(fpp2)
library(readxl)
setwd("C:/Users/lferreira/Desktop/FORECASTING")  # backslashes must be escaped in R strings
data <- read_xlsx("BASE_TESTE.xlsx")
ts <- ts(data$`1740`, start = c(2014, 1), frequency = 4)  # backticks, since the column name is a number
fc <- forecast(sma(ts), h = 3)
Error: The provided model is not Simple Moving Average!
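For comparison, a minimal sketch that should run with a current version of smooth (an assumption on my part) on the toy series from the first question; if this errors too, a package-version mismatch is likely:
library(smooth)
x <- c(14,10,11,7,10,9,11,19,7,10,21,9,8,16,21,14,6,7)
fit <- sma(ts(x), order = 4)  # fixed moving-average window of 4
forecast(fit, h = 4)          # smooth provides its own forecast() method for sma fits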

Negative values in timeseries when removing seasonal values with HoltWinters (R)

I'm new to R, so I'm having trouble with this time series data.
For example (the real data is way larger)
data <- c(7,5,3,2,5,2,4,11,5,4,7,22,5,14,18,20,14,22,23,20,23,16,21,23,42,64,39,34,39,43,49,59,30,15,10,12,4,2,4,6,7)
ts <- ts(data,frequency = 12, start = c(2010,1))
So if I try to decompose the data to adjust it:
library(forecast)
ts.decompose <- decompose(ts)
ts.adjust <- ts - ts.decompose$seasonal
ts.hw <- HoltWinters(ts.adjust)
ts.forecast <- forecast(ts.hw, h = 10)  # forecast.HoltWinters is no longer exported; use the generic
plot(ts.forecast)
But the first forecast values are negative. Why is this happening?
Well, you are forecasting the seasonally adjusted time series, and the deseasonalized series ts.adjust can of course contain negative values by itself; in fact, it actually does.
In addition, even if the original series contained only positive values, Holt-Winters can yield negative forecasts. It is not constrained.
I would suggest modelling your original (not seasonally adjusted) time series directly with ets() from the forecast package; it usually does a good job of detecting seasonality. (It, too, can yield negative forecasts or prediction intervals, though.)
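A minimal sketch of that suggestion on the series above:
library(forecast)
fit.ets <- ets(ts)               # the raw series; ets() handles the seasonality itself
plot(forecast(fit.ets, h = 10))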
I very much recommend this free online forecasting textbook. Given your specific question, this may also be helpful.
