Issues with forecasting model fit - r

I want to forecast rainfall in Albury. I have a data set of rainfall in mm from 2009 to 2016. Besides rainfall, the data set holds other variables: date, windgustspeed, windspeed, humidity, maximum temperature, and pressure. In earlier tests, all of these have shown some influence on rainfall.
I have tried forecasting rainfall for the next year using ARIMA, vector autoregressive, TSLM, SNAIVE, NAIVE, MEAN, and RW models. All of them seem to forecast only a flat mean and do not capture the variation in the data.
Code used for model fitting
albury_fit <- albury %>%
  model(
    naive = NAIVE(Rainfall),
    drift = RW(Rainfall ~ drift()),
    mean = MEAN(Rainfall),
    seasonal_naive = SNAIVE(Rainfall),
    tslm = TSLM(Rainfall ~ trend()),
    arima = ARIMA(Rainfall)
  )
albury_fc <- albury_fit %>%
  forecast(h = '1 year')
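The forecasts in the plot below were presumably drawn with fabletools' autoplot(); the exact plotting call isn't shown, so this is just a guess at what produced it:
albury_fc %>%
  autoplot(albury)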
[plot of model fit]
Code for checking model performance for ARIMA (all models give the same kind of output as ARIMA):
albury %>%
  model(ARIMA(Rainfall)) %>%
  gg_tsresiduals()
[output from the code above]
So the bottom line is that the models do not perform well, but I don't know how to fix it.
I hope that someone can help me :)

Related

How to set bound when running the fable package ETS exponential smoothing state space model?

I have 32 months of data, and I'm trying out different models for forecasting unit transitions to dead state "X" for months 13-32, training on the transitions data for months 1-12. I then compare the forecasts with the actual data for months 13-32. The data represents the migration of a beginning population of units into the dead state over 32 months. Not all beginning units die off, only a portion. I understand that 12 months of data for model training isn't much and that forecasting 20 months ahead from those 12 months should result in a wide distribution of outcomes; those are the real-world limitations I usually grapple with.
I am using the fable package ETS model and would like to know how, or whether, it is possible to set bounds for outputs when running simulations based on ETS. When I go to https://fable.tidyverts.org/reference/ETS.html to research setting bounds, I find the bounds argument shown in the image below (perhaps I misunderstand what is meant by "bounds"), but those instructions don't say how to actually specify the lower and upper bounds:
When I run ETS on my data and plot the forecast I get the following, where the forecast mean (in blue) at least visually hews reasonably close to the actual data for those same months 13-32 (in black). (I have run other tests of residuals and autocorrelations, as well as the benchmark methods recommended in the book, and this Holt's linear method looks fine based on those tests):
However, when I run simulations using that ETS model (code is presented at the bottom, with the simulation function flagged by #), I often get a maximum number of transitions into dead state X over the forecast horizon (aggregate forecast transitions during months 13-32) in excess of the beginning number of units, which totals 60,000. In other words, there is no real-world scenario where transitions to the dead state can exceed the beginning population! Is there a way to set an upper bound on the forecast distribution and the simulations so the total forecast doesn't exceed the cap of 60,000 possible transitions, while maintaining objective statistical integrity and without injecting too much judgment?
I use a log-transformation of the data to prevent the forecast from falling negative. Negative value transitions aren't a real-world possibility for this data.
Below is the code for generating the above, and running simulations, including the dataset:
library(dplyr)
library(fabletools)
library(fable)
library(feasts)
library(ggplot2)
library(tidyr)
library(tsibble)
# my data
data <- data.frame(
  Month = c(1:32),
  StateX = c(
    9416, 6086, 4559, 3586, 2887, 2175, 1945, 1675, 1418, 1259, 1079, 940, 923, 776, 638, 545, 547, 510, 379,
    341, 262, 241, 168, 155, 133, 76, 69, 45, 17, 9, 5, 0
  )
) %>%
  as_tsibble(index = Month)
# fit the model to my data, generate forecast for months 13-32, and plot
fit <- data[1:12,] |> model(ETS(log(StateX) ~ error("A") + trend("A") + season("N")))
fc <- fit |> forecast(h = 20)
fc |>
  autoplot(data) +
  geom_line(aes(y = .fitted), col = "#D55E00",
            data = augment(fit)) +
  labs(y = "Unit transitions", title = "Holt's linear method for transitions to dead state X") +
  guides(colour = "none")
# run simulations and show aggregate nbr of transitions for months 13-32
sim <- fit %>% generate(h = 20, times = 5000, bootstrap = TRUE)
agg_sim <- sim %>% group_by(.rep) %>% summarise(sum_FC = sum(.sim), .groups = 'drop')
max(agg_sim[, "sum_FC"])
I ran the scaled logit transformation, which offers a good solution; see https://otexts.com/fpp3/limits.html (a sketch of that transformation is at the end of this post). However, following Mitchell's second comment ("...likely here is that your model is inappropriate for the structure in the data"), I explored this point further and am getting better results with a log-transformed ETS(M,A,N) model than with the log-transformed ETS(A,A,N) model presented in the OP. The results are also more reasonable with the log-transformed ETS(M,A,N) than with the scaled-logit-transformed ETS(M,A,N). Using log ETS(M,A,N) I get the point and interval forecast below:
Researching ETS(M,A,N) further, there is a view that the conditional distribution from these "class 2" models is not Gaussian, so there are no closed-form formulae for their prediction intervals; in some cases the Normal distribution can be used as an approximation, but simulations should generally be preferred. Running simulations for this data therefore produces the histogram below. I take comfort from it because the simulated forecast mean is a bit higher than the actuals, which is a desired outcome for the end purposes of this simulation, and the extreme outliers and what looks like a lognormal shape are to be expected for this type of data, which can be summarized as: when things go wrong in this process, they can go EXTREMELY BAD (outcomes can be really bad but never really good, sadly):
Below is code for the above (using packages from OP):
testDF <- data.frame(
  Month = c(1:32),
  StateX = c(
    9416, 6086, 4559, 3586, 2887, 2175, 1945, 1675, 1418, 1259, 1079, 940, 923, 776, 638, 545, 547, 510, 379,
    341, 262, 241, 168, 155, 133, 76, 69, 45, 17, 9, 5, 0
  )
) %>% as_tsibble(index = Month)
fit <- testDF[1:12,] |> model(ETS(log(StateX) ~ trend("A")))
fit |>
  forecast(h = 20) |>
  autoplot(testDF) +
  labs(title = "Transition forecast versus actual data",
       y = "Unit transitions",
       x = "Months 13-32 are forecast periods")
sim <- fit |> generate(h = 20, times = 5000, bootstrap = TRUE)
agg_sim <- sim |>
  as.data.frame() |>
  group_by(.rep) |>
  summarise(sumFC = sum(.sim), .groups = 'drop')
agg_sim %>%
  ggplot(aes(x = sumFC)) +
  geom_histogram(bins = 50) +
  geom_vline(
    aes(xintercept = mean(as.data.frame(agg_sim)[, "sumFC"]),
        color = "forecast period simulation mean"),
    linetype = "solid",
    size = 1.5) +
  geom_vline(
    aes(xintercept = sum(as.data.frame(testDF[13:32, "StateX"])),
        color = "forecast period actual data"),
    linetype = "solid",
    size = 1.5) +
  scale_color_manual(
    name = "Vertical lines:",
    values = c(
      'forecast period actual data' = "blue",
      'forecast period simulation mean' = "red"
    )
  ) +
  labs(title = "Histogram of simulations during forecast months 13-32",
       x = 'Cumulative transitions during forecast period',
       y = 'Bin counts') +
  theme(legend.position = c(0.75, 0.85)) +
  xlim(0, 20000)
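For reference, the scaled logit mentioned above can be set up as a custom fabletools transformation along the lines of the fpp3 chapter linked earlier; this is only a sketch, and the 0 to 60,000 bounds are my reading of the cap discussed in the question:
# scaled logit: maps (lower, upper) onto the real line, so back-transformed
# forecasts can never leave that interval (see https://otexts.com/fpp3/limits.html)
scaled_logit <- function(x, lower = 0, upper = 1) {
  log((x - lower) / (upper - x))
}
inv_scaled_logit <- function(x, lower = 0, upper = 1) {
  (upper - lower) * exp(x) / (1 + exp(x)) + lower
}
my_scaled_logit <- new_transformation(scaled_logit, inv_scaled_logit)
# e.g. keep StateX between 0 and the 60,000 cap while fitting the ETS model
fit_bounded <- testDF[1:12, ] |>
  model(ETS(my_scaled_logit(StateX, 0, 60000) ~ error("A") + trend("A") + season("N")))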

(R) discoveries dataset from the faraway package: How can you make predictions with both dependent and independent variables? Poisson regression

##Here I used the 'discoveries' dataset from the faraway package.
library('faraway')
data("discoveries")
#made a variable 'year' from 1860 to 1959
year = 1860:1959
#here I fit a poisson regression model with discoveries as dependent and year as independent.
fit_pois = glm(discoveries ~ year, data = discoveries, family = poisson)
##The question is: what is the probability that there are 4 discoveries in the year 1960 (assuming the model is right and can predict the future)? I tried to do this with:
pred_pr = predict.glm(fit_pois, data.frame(discoveries = 4, year = 1960, type = 'response'))
##However, when I predict on the data, it gives numbers that are not probabilities. Please help!
Let's start by replicating your model:
library('faraway')
data("discoveries")
year = 1860:1959
fit_pois <- glm(discoveries ~ year, data = discoveries, family = poisson)
Now if we use predict, our fit_pois model will tell us the predicted rate of discoveries for any given year(s). It will completely ignore any discoveries in the data frame passed to the newdata parameter of predict, because our model predicts discoveries based solely on the year variable.
Note also that in your example you have included type = "response" as a variable in the newdata data frame, rather than passing it as a parameter to predict. So the prediction line should look like this:
pred_pr = predict.glm(fit_pois, newdata = data.frame(year = 1960), type = 'response')
And the result we get is:
pred_pr
#> 1
#> 2.336768
Let's think what this means. Since we are doing a Poisson regression, this number represents the expected value of discoveries in the year 1960. This means that we can estimate the probability of there being exactly 4 discoveries if we examine the Poisson distribution with an expected value (also known as a lambda) of 2.336768. Let's use dpois to see the probabilities of getting 0 to 6 discoveries if the lambda is 2.336768:
plot(0:6, dpois(0:6, pred_pr), type = "h")
So the probability of there being exactly 4 discoveries in 1960 is:
dpois(4, pred_pr)
#> [1] 0.1200621
i.e. almost exactly 12%
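As a cross-check, the same value comes straight from the Poisson probability mass function, and ppois() gives the tail probability if the question were instead "at least 4 discoveries":
# P(X = 4) = exp(-lambda) * lambda^4 / 4!
exp(-pred_pr) * pred_pr^4 / factorial(4)
#> [1] 0.1200621
# P(X >= 4): probability of four or more discoveries in 1960
ppois(3, pred_pr, lower.tail = FALSE)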

ARIMA modelling, prediction and plotting with CO2 dataset in R

I am working with arima0() and the co2 dataset. I would like to plot the arima0() model over my data. I have tried fitted() and curve() with no success.
Here is my code:
###### Time Series
# format: time series
data(co2)
# format: matrix
dmn <- list(month.abb, unique(floor(time(co2))))
co2.m <- matrix(co2, 12, dimnames = dmn)
co2.dt <- pracma::detrend(co2.m, tt = 'linear')
co2.dt <- ts(as.numeric(co2.dt), start = c(1959,1), frequency=12)
# first diff
co2.dt.dif <- diff(co2.dt,lag = 12)
# Second diff
co2.dt.dif2 <- diff(co2.dt.dif,lag = 1)
With the data prepared, I ran the following arima0:
results <- arima0(co2.dt.dif2, order = c(2,0,0), method = "ML")
resultspredict <- predict(results, n.ahead = 36)
I would like to plot the fitted model and the predictions, ideally in base R.
Session 1: To begin with...
To be honest, I am rather worried about the way you are modelling the co2 time series. Something has already gone wrong when you de-trended co2. Why use tt = "linear"? You fit a linear trend within each period (i.e., year) and take the residuals for further inspection. This is often not recommended, as it tends to introduce artificial effects into the residual series. I would be inclined to use tt = "constant", i.e., simply dropping the yearly average. This would at least preserve the within-season correlation of the original data.
Perhaps you want to see some evidence here. Consider using ACF to help you diagnose.
data(co2)
## de-trend by dropping yearly average (no need to use `pracma::detrend`)
yearlymean <- ave(co2, gl(39, 12), FUN = mean)
co2dt <- co2 - yearlymean
## de-trend by dropping within season linear trend
co2.m <- matrix(co2, 12)
co2.dt <- pracma::detrend(co2.m, tt = "linear")
co2.dt <- ts(as.numeric(co2.dt), start = c(1959, 1), frequency = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt); acf(co2dt)
ts.plot(co2.dt); acf(co2.dt)
Both de-trended series have a strong seasonal effect, so a further seasonal differencing is required.
## seasonal differencing
co2dt.dif <- diff(co2dt, lag = 12)
co2.dt.dif <- diff(co2.dt, lag = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt.dif); acf(co2dt.dif)
ts.plot(co2.dt.dif); acf(co2.dt.dif)
The ACF for co2.dt.dif has more significant negative correlations. This is a sign of over-de-trending, so we prefer co2dt. co2dt is already stationary, and no more differencing is needed (otherwise you would just over-difference it and introduce more negative autocorrelation).
The big negative spike at lag 1 (one seasonal period) in the ACF of co2dt.dif suggests that we want a seasonal MA term. Also, the positive spike within the season implies a mild AR process in general. So consider:
## we exclude mean because we found estimation of mean is 0 if we include it
fit <- arima0(co2dt.dif, order = c(1,0,0), seasonal = c(0,0,1), include.mean = FALSE)
Whether this model is doing good, we need to inspect ACF of residuals:
acf(fit$residuals)
Looks like this model is decent (actually pretty great).
For prediction purposes, it is actually a better idea to fold the seasonal differencing of co2dt into the model itself, rather than fitting to co2dt.dif separately. Let's do:
fit <- arima0(co2dt, order = c(1,0,0), seasonal = c(0,1,1), include.mean = FALSE)
This gives exactly the same estimates for the AR and MA coefficients as the two-stage approach above, but now prediction is easily handled with a single predict call.
## 3 years' ahead prediction (no prediction error; only mean)
predco2dt <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot co2dt, fitted model and prediction together:
fittedco2dt <- co2dt - fit$residuals
ts.plot(co2dt, fittedco2dt, predco2dt, col = 1:3)
The result looks very promising!
Now the final stage is to map this back to the original co2 series. For fitted values, we just add back the yearly mean we dropped earlier:
fittedco2 <- fittedco2dt + yearlymean
But for prediction it is more difficult, because we don't know what the yearly means will be in the future. In this regard, our modelling, though it looks good, is not practically useful. I will talk about a better idea in another answer. To finish this session, we plot co2 with its fitted values only:
ts.plot(co2, fittedco2, col = 1:2)
Session 2: A better idea for time series modelling
In previous session, we have seen the difficulty in prediction if we separate de-trending and modelling of de-trended series. Now, we try to combine those two stages in one go.
The seasonal pattern of co2 is really strong, so we need a seasonal differencing anyway:
data(co2)
co2dt <- diff(co2, lag = 12)
par(mfrow = c(1,2)); ts.plot(co2dt); acf(co2dt)
After this seasonal differencing, co2dt does not look stationary. So we need further a non-seasonal differencing.
co2dt.dif <- diff(co2dt)
par(mfrow = c(1,2)); ts.plot(co2dt.dif); acf(co2dt.dif)
The negative spikes within the season and between seasons suggest that an MA process is needed for both. I will not work with co2dt.dif; we can work with co2 directly:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
acf(fit$residuals)
Now the residuals are perfectly uncorrelated! So we have an ARIMA(0,1,1)(0,1,1)[12] model for co2 series.
As usual, fitted values are obtained by subtracting residuals from data:
co2fitted <- co2 - fit$residuals
Predictions are made by a single call to predict:
co2pred <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot them together:
ts.plot(co2, co2fitted, co2pred, col = 1:3)
Oh, this is just gorgeous!
Session 3: Model selection
The story should have finished by now, but I would like to make a comparison with auto.arima from the forecast package, which can automatically decide on the "best" model.
library(forecast)
autofit <- auto.arima(co2)
#Series: co2
#ARIMA(1,1,1)(1,1,2)[12]
#
#Coefficients:
# ar1 ma1 sar1 sma1 sma2
# 0.2569 -0.5847 -0.5489 -0.2620 -0.5123
#s.e. 0.1406 0.1204 0.5880 0.5701 0.4819
#
#sigma^2 estimated as 0.08576: log likelihood=-84.39
#AIC=180.78 AICc=180.97 BIC=205.5
auto.arima has chosen ARIMA(1,1,1)(1,1,2)[12], which is much more complicated: on top of the same seasonal and non-seasonal differencing, it adds a non-seasonal AR term, a seasonal AR term, and a second seasonal MA term.
Our model based on step-by-step investigation suggests an ARIMA(0,1,1)(0,1,1)[12]:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
#Call:
#arima0(x = co2, order = c(0, 1, 1), seasonal = c(0, 1, 1))
#
#Coefficients:
# ma1 sma1
# -0.3495 -0.8515
#s.e. 0.0497 0.0254
#
#sigma^2 estimated as 0.08262: log likelihood = -85.98, aic = 177.96
The AIC values suggest our model is better. So does the BIC:
BIC = -2 * loglik + log(n) * p
We have n <- length(co2) observations and p <- length(fit$coef) + 1 parameters (the additional one for sigma^2), so our model has BIC:
n <- length(co2)
p <- length(fit$coef) + 1
-2 * fit$loglik + log(n) * p
# [1] 196.5503
So auto.arima has over-fitted the data.
In fact, as soon as we see ARIMA(1,1,1)(1,1,2)[12], we strongly suspect over-fitting, because different effects can "cancel out" each other. This happens with the additional seasonal MA and non-seasonal AR terms introduced by auto.arima, as AR introduces positive autocorrelation while MA introduces negative autocorrelation.
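A tiny simulation illustrates the cancellation (my own toy example): with R's parameterisation, ar = 0.5 and ma = -0.5 give (1 - 0.5B) x_t = (1 - 0.5B) e_t, so the common factor drops out and the series is just white noise; the extra terms buy nothing.
set.seed(1)
# AR and MA factors cancel, so the simulated series is effectively white noise
x <- arima.sim(n = 500, model = list(ar = 0.5, ma = -0.5))
acf(x)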

Forecasting ARIMA with xreg

I'm trying to forecast time in, time out ("TiTo") for someone ordering food at a restaurant using the code below. TiTo is the total time it takes someone from the moment they walk through the door to the moment they get their food. TimeTT is the time the customer spends talking to the waiter. I believe TimeTT is a predictor of TiTo and I would like to use it as a covariate in the forecast for TiTo. I've read a bit about ARIMA, and as I understand it you add the predictors to the model via the xreg parameter. I'm thinking of the xreg parameter as something like the independent variable in a regression model, like lm(TiTo ~ TimeTT). Is this the correct way to think of xreg? Also, what does the error message below mean? Do I need to convert TimeTT into a time series to use it in xreg? I'm new to forecasting, so all help is much appreciated.
Forecast Attempt:
OV<-zoo(SampleData$TiTo, order.by=SampleData$DateTime)
eData <- ts(OV, frequency = 24)
Train <-eData[1:15000]
Test <- eData[15001:20809]
Arima.fit <- auto.arima(Train)
Acast<-forecast(Arima.fit, h=5808, xreg = SampleData$TimeTT)
Error:
Error in if (ncol(xreg) != ncol(object$call$xreg)) stop("Number of regressors does not match fitted model") :
argument is of length zero
Data:
dput(Train[1:5])
c(1152L, 1680L, 1680L, 968L, 1680L)
dput(SampleData[1,]$TimeTT)
structure(1156L, .Label = c("0.000000", "0.125000", "0.142857",
"96.750000", "97.800000", "99.000000", "99.600000", "NULL"), class = "factor")
You need to supply xreg when you estimate the model itself, and the regressors need to be available (forecast ahead, if necessary) for the forecast horizon as well. So this will look something like:
Arima.fit <- auto.arima(Train, xreg = SampleData$TimeTT)
forecast(Arima.fit, h = 508, xreg = NewData$TimeTT)
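Spelling this out with the variable names from the question (just a sketch, assuming SampleData is in the same row order as the Train/Test split; note from the dput above that TimeTT is stored as a factor, so it has to be converted via character first):
# TimeTT is a factor, so convert through character to get numeric values
TimeTT_num <- as.numeric(as.character(SampleData$TimeTT))
xreg_train <- TimeTT_num[1:15000]
xreg_test  <- TimeTT_num[15001:20809]
Arima.fit <- auto.arima(Train, xreg = xreg_train)     # regressor supplied at fit time
Acast <- forecast(Arima.fit, h = length(xreg_test),   # future regressor values are needed too
                  xreg = xreg_test)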
Here is an example using Arima and xreg from Rob Hyndman (here is the link to the example; to read more about using contemporaneous covariates in ARIMA models, go here). This is analogous to auto.arima.
n <- 2000
m <- 200
y <- ts(rnorm(n) + (1:n)%%100/30, f=m)
library(forecast)
fit <- Arima(y, order=c(2,0,1), xreg=fourier(y, K=4))
plot(forecast(fit, h=2*m, xreg=fourierf(y, K=4, h=2*m)))
Hope this helps.

Error in arima of R: too few non-missing observations

I am using arima() and auto.arima() of R to get the prediction of sales. The data is at week level for three years.
my code looks like:
x<-c(1571,1501,895,1335,2306,930,2850,1380,975,1080,990,765,615,585,838,555,1449,615,705,465,165,630,330,825,555,720,615,360,765,1080,825,525,885,507,884,1230,342,615,1161,
1585,723,390,690,993,1025,1515,903,990,1510,1638,1461.67,1082,1075,2315,1014,2140,1572,794,1363,1184,1248,1344,1056,816,720,896,608,624,560,512,304,640,640,704,1072,768,
816,640,272,1168,736,1003,864,658.67,768,841,1727,944,848,432,704,850.67,1205,592,1104,976,629,814,1626,933.33,1100.33,1730,2742,1552,1038,826,1888,1440,1372,824,1824,1392,1424,768,464,
960,320,384,512,478,1488,384,338.67,176,624,464,528,592,288,544,418.67,336,752,400,1232,477.67,416,810.67,1256,1040,823,240,1422,704,718,1193,1541,1008,640,752,
1008,864,1507,4123,2176,899,1717,935)
length_data<-length(x)
length_train<-round(length_data*0.80)
forecast_period<-length_data-length_train
train_data<-x[1:length_train]
train_data<-ts(train_data,frequency=52,start=c(1,1))
validation_data<-x[(length_train+1):length_data]
validation_data<-ts(validation_data,frequency=52,start=c(ceiling((length_train)/52),((length_train)%%52+1)))
arima_output<-auto.arima(train_data) # fit the ARIMA Model
arima_validate <- Arima(x=validation_data,model=arima_output)
Error:
Error in stats::arima(x = x, order = order, seasonal = seasonal, include.mean = include.mean, :
too few non-missing observations
What am I doing wrong?
What does "too few non-missing observations" mean? I have searched the net, but did not find a better explanation.
Thanks for any kind of help!
arima_output is a seasonal ARIMA model:
> arima_output
Series: train_data
ARIMA(1,0,1)(0,1,0)[52]
Arima() then attempts to refit this particular model to validation_data. But to fit a seasonal model to a time series, you need at least one full year of observations, since seasonal ARIMA depends on seasonal differencing.
As an illustration, note that Arima() will happily and without errors refit a time series that is twice as long as validation_data:
validation_data <- x[(length_train+1):length_data]
validation_data<-ts(rep(validation_data,2),frequency=52,
start=c(ceiling((length_train)/52),((length_train)%%52+1)))
arima_validate <- Arima(x=validation_data,model=arima_output)
One way of dealing with this would be to force auto.arima() to use a nonseasonal model, by specifying D=0:
validation_data <- x[(length_train+1):length_data]
validation_data<-ts(validation_data,frequency=52,
start=c(ceiling((length_train)/52),((length_train)%%52+1)))
arima_output<-auto.arima(train_data, D=0) # fit the ARIMA Model
arima_validate <- Arima(x=validation_data,model=arima_output)
So this did turn out to be more of a CrossValidated question...
Your chosen model is ARIMA(1,0,1)(0,1,0)[52]. That is, it has a seasonal difference of lag 52. Your validation data has 32 observations. So you cannot take the seasonal differences on the validation data without knowing what the training data is.
One way around this is to fit the model to the full time series, and then extract what you want (presumably residuals from the validation portion).
You can also improve the readability of your code:
x <- ts(x, frequency=52, start=c(1,1))
length_data <- length(x)
length_train <- round(length_data*0.80)
train_data <- ts(head(x, length_train),
frequency=frequency(x), start=start(x))
validation_data <- ts(tail(x, length_data-length_train),
frequency=frequency(x), end=end(x))
library(forecast)
arima_train <- auto.arima(train_data)
arima_full <- Arima(x, model=arima_train)
res <- window(residuals(arima_full), start=start(validation_data))
