I'm stuck on a problem I can't solve.
When I model the AirPassengers data with ets() and auto.arima(), the fitted values seem reasonably close to the observed values:
library(forecast)
a <- ts(AirPassengers, start = 1949, frequency = 12)
a <- window(a, start = 1949, end = c(1954,12), frequency = 12)
fit_a_ets <- ets(a)
fit_a_arima <- auto.arima(a)
plot(a)
lines(fit_a_ets$fitted, col = "blue")
lines(fit_a_arima$fitted, col = "red")
[Plot of AirPassengers and the fitted models]
When I run the same code on my own data, the fitted values look shifted back by one period:
b <- c(1237,1982,1191,1163,1418,1687,2331,2181,1943,1782,177,1871,391,1397,734,712,1006,508,368,767,675,701,989,725,1292,983,1094,1105,928,1246,1604,1163,1390,959,1630,789,1173,910,875,718,655,606,968,716,476,476,655,499,544,1250,359,386,458,947,542,953,1450,1195,1317,957,778,1030,1399,1119,3142,1024,1537,1321,2062,1897,2094,2546,1796,2089,1194,896,727,599,785,674,828,311,375,315,365,314,126,315,372,666,596,589,001,613,498,635,644,1018,873,900,502,121,293,259,311,169,378,153,24,115,250,565,349,201,393,83,327,325,185,307,501,194)
b <- ts(b, start = 1949, frequency = 12)
b <- window(b, start = 1949, end = c(1954,12), frequency = 12)
fit_b_ets <- ets(b)
fit_b_arima <- auto.arima(b)
plot(b)
lines(fit_b_ets$fitted, col = "blue")
lines(fit_b_arima$fitted, col = "red")
[Plot of my data and the fitted models]
Does anyone know why?
I looked through https://otexts.com/fpp2/index.html but couldn't work out why this happens.
I thought it might be that the models simply fit my data poorly, but the same thing occurs for other data sets too; see, for example, figure 7.1 of https://otexts.com/fpp2/ses.html.
This is typical.
In the context of forecasting, the "fitted" value is the one-step-ahead forecast. For many different types of series, the best that we can do is something that's close to the latest observation, plus a small adjustment. This makes it look like the "fitted" value lags by 1 period because it is then usually quite close to the previous observed value.
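You can see this numerically by comparing the fitted model's one-step training errors with a naive "repeat the last value" forecast. A minimal sketch, using b and fit_b_ets from the question:

accuracy(fit_b_ets)  # one-step-ahead errors of the fitted model
accuracy(naive(b))   # errors from simply repeating the previous value

If the two are close, the model is doing little more than echoing the last observation, which is exactly the lagged look in the plot.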
Asking why the fitted series lags is like asking "why can't we know the future before it happens?". It's simply not that easy, and it doesn't indicate that the model is necessarily inadequate (it may not be possible to do better).
Plots comparing the time series of observations and fitted values are rarely of any use for forecasting; they essentially always look like this. Such a plot also makes it difficult to judge the vertical distance between the lines, which is what you actually care about (the forecast error). It is better to plot the forecast error directly.
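For instance, with the fits from the question, something like this (a minimal sketch) plots the errors directly:

plot(residuals(fit_b_ets), col = "blue", ylab = "One-step forecast error")
lines(residuals(fit_b_arima), col = "red")
abline(h = 0, lty = 2)
# checkresiduals() from the forecast package bundles a time plot,
# ACF, and histogram of the residuals with a Ljung-Box test
checkresiduals(fit_b_ets)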
The AirPassengers series is unusual because it is extremely easy to forecast based on its seasonality. Most series you will encounter in the wild are not quite like this.
I am working through the "Forecasting Using R" DataCamp course. I have completed the entire thing except for the last part of one particular exercise (link here, if you have an account), where I'm totally lost; the error hint it gives me isn't helping either. I'll list each part of the task along with the code I'm using to solve it:
Produce time plots of only the daily demand and maximum temperatures with facetting.
autoplot(elec[, c("Demand", "Temperature")], facets = TRUE)
Index elec accordingly to set up the matrix of regressors to include MaxTemp for the maximum temperatures, MaxTempSq which represents the squared value of the maximum temperature, and Workday, in that order.
xreg <- cbind(MaxTemp = elec[, "Temperature"],
              MaxTempSq = elec[, "Temperature"]^2,
              Workday = elec[, "Workday"])
Fit a dynamic regression model of the demand column with ARIMA errors and call this fit.
fit <- auto.arima(elec[,"Demand"], xreg = xreg)
If the next day is a working day (indicator is 1) with maximum temperature forecast to be 20°C, what is the forecast demand? Fill out the appropriate values in cbind() for the xreg argument in forecast().
This is where I'm stuck. The sample code they supply looks like this:
forecast(___, xreg = cbind(___, ___, ___))
I have managed to work out that the first blank is fit, so I'm trying code that looks like this:
forecast(fit, xreg = cbind(elec[,"Workday"]==1, elec[, "Temperature"]==20, elec[,"Demand"]))
But that is giving me the error hint "Make sure to forecast the next day using the inputs given in the instructions." Which... doesn't tell me anything useful. Any ideas what I should be doing instead?
When you forecast ahead, you use new data that was not included in elec (the data set you used to fit your model). The new data was given to you in the question (temperature 20°C and workday indicator 1). Therefore, you do not need elec in your forecast() call; just use the new values to forecast ahead:
forecast(fit, xreg = cbind(20, 20^2, 1))
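One detail worth adding (my note, not the course solution): naming the cbind() columns to match the names used when fitting avoids column-name warnings from the forecast package, and the point forecast can be read off the result:

fc <- forecast(fit, xreg = cbind(MaxTemp = 20, MaxTempSq = 20^2, Workday = 1))
fc$mean  # the one-day-ahead forecast of demand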
Here is the plot of the initial data (after performing a log transformation).
It is evident there is both a linear trend and a seasonal pattern. I can address both by taking the first and the twelfth (seasonal) difference: diff(diff(data), 12). After doing so, here is the plot of the resulting data.
This does not look great. While the mean is constant, we see a funneling effect as time progresses. Here are the ACF/PACF:
Any suggestions for possible fits to try? I used the auto.arima() function, which suggested an ARIMA(2,0,2)(1,0,2)[12] model. However, once I took the residuals from the fit, it was clear there was still some structure in them. Here is the plot of the residuals from the fit, as well as the ACF/PACF of the residuals.
There does not appear to be a seasonal pattern in which lags have spikes in the residual ACF/PACF, yet something is clearly still not captured by the previous steps. What do you suggest I do? How could I build a model with better diagnostics (which at this point just means a better-looking ACF and PACF)?
Here is my simplified code thus far:
library(TSA)
library(forecast)
beer <- read.csv('beer.csv', header = TRUE)
beer <- ts(beer$Production, start = c(1956, 1), frequency = 12)
# transform data
boxcox <- BoxCox.ar(beer) # 0 is inside the confidence interval, so use a log transform
beer.log <- log(beer)
firstDifference <- diff(diff(beer.log), 12) # remove linear and seasonal trend
acf(firstDifference)
pacf(firstDifference)
eacf(firstDifference)
plot(armasubsets(firstDifference, nar=12, nma=12))
# fitting the model
auto.arima(firstDifference, ic = 'bic') # from the forecast package
modelFit <- arima(firstDifference, order = c(1, 0, 0),
                  seasonal = list(order = c(2, 0, 0), period = 12))
# assessing model
resid <- modelFit$residuals
acf(resid, lag.max = 15)
pacf(resid, lag.max = 15)
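If it helps, a Ljung-Box test is a standard numeric complement to eyeballing the ACF/PACF. A sketch, with fitdf set to the three AR coefficients estimated in modelFit:

# portmanteau test for leftover autocorrelation in the residuals;
# fitdf = number of estimated ARMA coefficients (1 AR + 2 seasonal AR)
Box.test(resid, lag = 24, type = "Ljung-Box", fitdf = 3)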
Here is the data, if interested (I think you can use an html to csv converter if you would like): https://docs.google.com/spreadsheets/d/1S8BbNBdQFpQAiCA4J18bf7PITb8kfThorMENW-FRvW4/pubhtml
Jane,
There are a few things going on here.
Instead of logs, we used the Tsay variance test, which shows that the variance increased after period 118; weighted least squares deals with that.
March becomes higher beginning at period 111. An alternative to an AR(12) or seasonal differencing is to identify seasonal dummies. We found that 7 of the 12 months were unusual, along with a couple of level shifts, an AR(2), and 2 outliers.
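I can't reproduce the Autobox model here, but a rough sketch of the seasonal-dummy idea in R (my illustration, not Autobox output, assuming the beer.log series from the question) would be:

library(forecast)
dummies <- seasonaldummy(beer.log)  # 11 monthly indicator columns
fit_dummy <- auto.arima(beer.log, xreg = dummies)
checkresiduals(fit_dummy)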
[Plot of the fit and forecasts]
[Plot of the residuals]
[ACF of the residuals]
Note: I am a developer of the software Autobox. All models are wrong. Some are useful.
Here is Tsay's paper:
http://onlinelibrary.wiley.com/doi/10.1002/for.3980070102/abstract
I have already read
Time Series Forecast: Convert differenced forecast back to before difference level
and
How to "undifference" a time series variable
Unfortunately, neither of these gives a clear answer on how to convert a forecast fitted on a differenced series (via diff()) back to the level of the original series.
Code sample:
## read data and start from 1 jan 2014
dat<-read.csv("rev forecast 2014-23 dec 2015.csv")
val.ts <- ts(dat$Actual,start=c(2014,1,1),freq=365)
##Check how we can get stationary series
plot((diff(val.ts)))
plot(diff(diff(val.ts)))
plot(log(val.ts))
plot(log(diff(val.ts)))
plot(sqrt(val.ts))
plot(sqrt(diff(val.ts)))
## I found that double differencing, i.e. diff(diff(val.ts)), gives a stationary series.
# I ran the code below to get the three ARIMA orders from auto.arima()
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE, xreg=diff(diff(xreg)))
# Finally, fit the ARIMA model
fit <- Arima(diff(diff(val.ts)),order=c(5,0,2),xreg = diff(diff(xreg)))
# plot the double-differenced series to compare with the fit
plot(diff(diff(val.ts)),col="orange")
#plot fitted
lines(fitted(fit),col="blue")
This gives a very good in-sample fit. However, how do I convert the fitted values back into the original metric from the double-differenced form they are in now? For a log transform I know we can back-transform with exp(fitted(fit)) (or 10^fitted(fit) if log10 was used), and there is a similar inverse for the square root, but what do I do for differencing, let alone double differencing?
Any help with this in R, please? After days of rigorous effort, I am stuck at this point.
I ran a test to check whether differencing has any impact on the model fit from auto.arima() and found that it does, so it seems auto.arima() can't handle a non-stationary series by itself and the analyst has to convert the series to stationary first.
First, auto.arima() without any differencing; orange is the actual series, blue the fitted values.
ARIMAfit <- auto.arima(val.ts, approximation=FALSE,trace=FALSE, xreg=xreg)
plot(val.ts,col="orange")
lines(fitted(ARIMAfit),col="blue")
Second, I tried differencing once:
ARIMAfit <- auto.arima(diff(val.ts), approximation=FALSE,trace=FALSE, xreg=diff(xreg))
plot(diff(val.ts),col="orange")
lines(fitted(ARIMAfit),col="blue")
Third, I differenced twice:
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation = FALSE, trace = FALSE,
                       xreg = diff(diff(xreg)))
plot(diff(diff(val.ts)),col="orange")
lines(fitted(ARIMAfit),col="blue")
Visual inspection suggests the third plot gives the most accurate fit; this much I am aware of. The challenge is how to convert the fitted values, which are on the double-differenced scale, back into the actual metric!
The inverse of diff() is essentially cumsum(), but you need to know the starting value at each level of differencing.
E.g.:
set.seed(1234)
x <- runif(100)
# rebuild diff(x) from its first value plus the second differences,
# then rebuild x from its first value plus the first differences
z <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
all.equal(z, x)
[1] TRUE
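Base R also provides diffinv(), which inverts diff() directly once you supply the initial values via xi, and it handles seasonal lags too:

# reconstruct x from its first difference (xi = the first value)
z2 <- diffinv(diff(x), xi = x[1])
all.equal(z2, x)
# for a lag-12 (seasonal) difference, supply the first 12 values
z3 <- diffinv(diff(x, lag = 12), lag = 12, xi = x[1:12])
all.equal(z3, x)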
Share some of your data to make a reproducible example to better help answer the question.
If you expect that differencing will be necessary to obtain stationarity, then why not simply include the differencing order in the function call? The "I" in ARIMA is the order of differencing applied before fitting an ARMA model, such that if
y = diff(diff(x)) and y is an ARMA(p,q) process,
then
x follows an ARIMA(p,2,q) process.
In auto.arima() you specify the differencing with the d argument (or D if it involves seasons); d fixes the order, while max.d sets an upper bound for the automatic search. Since you double-differenced, you want something like this:
fit <- auto.arima(val.ts, d=2, ...)
From this, you can verify that the fitted values do indeed map onto the original data:
plot(val.ts)
lines(fitted(fit), col="blue")
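A related point (my addition, not part of the original answer): with the differencing handled inside the model, forecasts also come back on the original scale automatically. Assuming a hypothetical matrix future_xreg holding regressor values for the forecast horizon:

fc <- forecast(fit, xreg = future_xreg)  # horizon = nrow(future_xreg)
plot(fc)  # forecasts are on the scale of val.ts, not the differenced scale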
In the example below, using dummy data, I have double differenced: first I removed seasonality (lag = 12), and then I removed trend from the differenced data (lag = 1).
library(magrittr) # for the pipe operator
set.seed(1234)
x <- rnorm(24, mean = 10, sd = 5) %>%
  round(0) %>%
  abs()
yy <- diff(x, lag = 12) # remove seasonality
z <- diff(yy, lag = 1)  # remove trend
Using the script that @jeremycg included above (and which I repeat below), how would I remove the double difference? Would I need to add lag specifiers to the two nested diff() calls? If so, which diff() would take lag = 12 and which lag = 1?
zz <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have trouble really understanding the function, and I do not like using functions without knowing what is happening. I found several how-tos on the subject, but only with the corresponding R code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Now, what I've trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data the interval is for (the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then: what does it mean?
2) interval
Type of interval calculation.
Okay, but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for? If I can specify the term in 3a, it would make no sense to specify it again in 3b, so I suppose I am wrong again, but I cannot figure out why.
I guess some of you might think: why not just try it out? And I would (even if it might not resolve everything here), but right now I do not know how. Since I do not know what newdata is for, I do not know how to use it, and when I try, I do not get the right confidence interval. Somehow the choice of that data matters a great deal, and I just do not understand how!
EDIT: I want to add that my intention is to understand how predict.lm() works, i.e. whether it works the way I think it does: does it calculate y-hat (the predicted values) and then add/subtract the upper/lower interval bounds for each one, yielding points that together trace out the confidence band? That would explain why newdata needs the same length as the data the linear model was fitted on.
Make up some data:
d <- data.frame(x = c(1, 4, 5, 7),
                y = c(0.8, 4.2, 4.7, 8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.