Time series prediction with and without NAs (ARIMA and Forecast package) in R - r

This is my first question on stack overflow.
Situation: I have 2 time series. Both series have the same values but the second series has 5 NAs at the start. Hence, first series has 105 observations, where 2nd series has 110 observations. I have fitted an ARIMA(0,1,0) using the Arima function to both series separately. And then I used the forecast package to predict 10 steps to the future.
Issue: Even though the ARIMA coefficient for both series are the same, the projections (10 steps) appear to be different. I am uncertain why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting seed, creating index manually, and using auto.ARIMA for the model fitting. However, none of the steps has helped me to reconcile the difference.
I have added a picture to show you what I see. Please note I have hidden the mid part of the series so that you can see the start and the end of the series. The yellow highlighted cells are the projection outputs from the 'Forecast' package. I have manually added the index to be years after extracting the results from R.
Time series projected and base in excel
Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)
#ARIMA with NA
Simple_Arima <- Arima(
ts(Rates$Rates1),
order = c(0,1,0),
include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
#ARIMA Without NA
Rates2 <- as.data.frame(Rates$Rates2)
##Remove the final spaces from the CSV
Rates2 <- Rates2[-c(106,107,108,109,110),]
Simple_Arima2 <- Arima(
ts(Rates2),
order = c(0,1,0),
include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
The link to data is here, CSV format

Could you share your data and code such that others can see if there is any issue with it?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)
library(xts)
set.seed(123)
ts1 <- arima.sim(model = list(0, 1, 0), n = 105)
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE

Related

Implementation of time series cross-validation

I am working with time series 551 of the monthly data of the M3 competition.
So, my data is :
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data<- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can i achieve that ? Whats means the in-sample interval ? Which is the timeseries i must evaluate?
Im quite confused , any help in order to light up this would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, .
..)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is
applied successively to the time series y[1:t], for t=1,...,T-h,
making predictions f[t+h]. The errors are given by e[t+h] =
y[t+h]-f[t+h].
That is first tsCV fit a model to the y[1] and then forecast y[1 + h], next fit a model to y[1:2] and forecast y[2 + h] and so on for T-h steps.
The tsCV function returns the forecast errors.
Applying this to the training data of the ts.data
# function to fit a model and forecast
fmodel <- function(x, h){
forecast(Arima(x, order=c(1,1,1), seasonal = c(0, 0, 2)), h=h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it maybe that you are supposed to
fit a model to ts.data$x and then forecast ts.data$xx[1]
fit mode the c(ts.data$x, ts.data$xx[1]) and forecast(ts.data$xx[2]),
so on.

How to forecast and fit the optimal model for multiple time series?

I want to do batch forecasting among multiple series, for example, if I want to forecast time series with IDs that end with 1(1,11,21,31...), how can I do that?
Since you did not provide detailed information, I was not sure which forecasting method you want to use hence I give here an example of a univariate time series model:
Load required packages:
library(forecast)
library(dplyr)
We use example data from Rob Hyndman:
dta <- read.csv("https://robjhyndman.com/data/ausretail.csv")
Now change the column names:
colnames(dta) <- c("date", paste0("tsname_", seq_len(ncol(dta[,-1]))))
Select timeseries which end with 1:
dta_ends_with1 <- dplyr::select(dta, dplyr::ends_with("1"))
Create a ts object:
dta_ends_with1 <- ts(dta_ends_with1, start = c(1982,5), frequency = 12)
Specify how many steps ahead you want to forecast, here I set it to 6 steps ahead,
h <- 6
Now we prepare a matrix to save the forecast:
fc <- matrix(NA, ncol = ncol(dta_ends_with1), nrow = h)
Forecasting loop.
for (i in seq_len(ncol(dta_ends_with1))) {
fc[,i] <- forecast::forecast(forecast::auto.arima(dta_ends_with1[,i]),
h = h)$mean
}
Set the column names:
colnames(fc) <- colnames(dta_ends_with1)
head(fc)

Forecasting multiple variable time series in R

I am trying to forecast three variables using R, but I am running into issues on how to deal with correlation.
The three variables I am trying to forecast are Revenue, Subscriptions and Price.
My initial approach was to do two independent time series forecast of subscriptions and price and multiply the outcomes to generate the revenue forecast.
I wanted to understand if this approach makes sense, as there is an inherent correlation between the price and the subscribers, and this is the part I do not know how to deal with.
# Load packages.
library(forecast)
# Read data
data <- read.csv("data.csv")
data.train <- data[0:57,]
data.test <- data[58:72,]
# Create time series for variables of interest
data.subs <- ts(data.train$subs, start=c(2014,1), frequency = 12)
data.price <- ts(data.train$price, start=c(2014,1), frequency = 12)
#Create model
subs.stlm <- stlm(data.subs)
price.stlm <- stlm(data.price)
#Forecast
subs.pred <- forecast(subs.stlm, h = 15, level = c(0.6, 0.75, 0.9))
price.pred <- forecast(price.stlm, h = 15, level = c(0.6, 0.75, 0.9))
Any help is greatly appreciated!
Looks like you can use the vector autoregression (VAR) model. Take a look at the description and the code provided here:
https://otexts.org/fpp2/VAR.html

CausalImpact R package - How to calculate the counter-factual time series from the output?

I am using the CasualImpact R package and I would like to get the counter-factual/control time series from the output after estimation. I run the following code which is basically the same as the example code on the website of the package.
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
The the local linear trend is in
impact$model$bsts.model$state.contributions
while the coefficient draws are supposed to be in
impact$model$bsts.model$coefficients
so I run
trend=colMeans(impact$model$bsts.model$state.contributions[1:1000,1,1:100])
trend+mean(impact$model$bsts.model$coefficients[1:1000,2])*x1
to get the counter-factual time series, however this is far from the actual counter-factual time series when plotting the results with
plot(impact)
Can somebody tell me how I can get back the counter-factual time series?
Thanks in advance!
The point predictions for the entire time series (both pre and post intervention) can be found at
impact$series
in the point.pred column. The counter-factual is the part of the point predictions that occur in the post.period portion of that column. impact$series provides the data for all three graphs in plot(impact).

Interpolate missing values in a time series with a seasonal cycle

I have a time series for which I want to intelligently interpolate the missing values. The value at a particular time is influenced by a multi-day trend, as well as its position in the daily cycle.
Here is an example in which the tenth observation is missing from myzoo
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- zoo(obs, index)
myzoo[10] <- NA
If I had to implement this, I'd use some kind of weighted mean of close times on nearby days, or add a value for the day to a function line fitted to the larger trend, but I hope there already exist some package or functions that apply to this situation?
EDIT: Modified the code slightly to clarify my problem. There are na.* methods that interpolate from nearest neighbors, but in this case they do not recognize that the missing value is at the time that is the lowest value of the day. Maybe the solution is to reshape the data to wide format and then interpolate, but I wouldn't like to completely disregard the contiguous values from the same day. It is worth noting that diff(myzoo, lag = 4) returns a vector of 10's. The solution may lie with some combination of reshape, na.spline, and diff.inv, but I just can't figure it out.
Here are three approaches that don't work:
EDIT2. Image produced using the following code.
myzoo <- zoo(obs, index)
myzoo[10] <- NA # knock out the missing point
plot(myzoo, type="o", pch=16) # plot solid line
points(na.approx(myzoo)[10], col = "red")
points(na.locf(myzoo)[10], col = "blue")
points(na.spline(myzoo)[10], col = "green")
myzoo[10] <- 31 # replace the missing point
lines(myzoo, type = "o", lty=3, pch=16) # dashed line over the gap
legend(x = "topleft",
legend = c("na.spline", "na.locf", "na.approx"),
col=c("green","blue","red"), pch = 1)
Try this:
x <- ts(myzoo,f=4)
fit <- ts(rowSums(tsSmooth(StructTS(x))[,-2]))
tsp(fit) <- tsp(x)
plot(x)
lines(fit,col=2)
The idea is to use a basic structural model for the time series, which handles the missing value fine using a Kalman filter. Then a Kalman smooth is used to estimate each point in the time series, including any omitted.
I had to convert your zoo object to a ts object with frequency 4 in order to use StructTS. You may want to change the fitted values back to zoo again.
In this case, I think you want a seasonality correction in the ARIMA model. There's not enough date here to fit the seasonal model, but this should get you started.
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- myzoo.orig <- zoo(obs, index)
myzoo[10] <- NA
myzoo.fixed <- na.locf(myzoo)
myarima.resid <- arima(myzoo.fixed, order = c(3, 0, 3), seasonal = list(order = c(0, 0, 0), period = 4))$residuals
myzoo.reallyfixed <- myzoo.fixed
myzoo.reallyfixed[10] <- myzoo.fixed[10] + myarima.resid[10]
plot(myzoo.reallyfixed)
points(myzoo.orig)
In my tests the ARMA(3, 3) is really close, but that's just luck. With a longer time series you should be able to calibrate the seasonal correction to give you good predictions. It would be helpful to have a good prior on what the underlying mechanisms for both the signal and the seasonal correction to get better out of sample performance.
forecast::na.interp is a good approach. From the documentation
Uses linear interpolation for non-seasonal series and a periodic stl decomposition with seasonal series to replace missing values.
library(forecast)
fit <- na.interp(myzoo)
fit[10] # 32.5, vs. 31.0 actual and 32.0 from Rob Hyndman's answer
This paper evaluates several interpolation methods against real time series, and finds that na.interp is both accurate and efficient:
From the R implementations tested in this paper, na.interp from the forecast package and na.StructTS from the zoo package showed the best overall results.
The na.interp function is also not that much slower than
na.approx [the fastest method], so the loess decomposition seems not to be very demanding in terms of computing time.
Also worth noting that Rob Hyndman wrote the forecast package, and included na.interp after providing his answer to this question. It's likely that na.interp is an improvement upon this approach, even though it performed worse in this instance (probably due to specifying the period in StructTS, where na.interp figures it out).
Package imputeTS has a method for Kalman Smoothing on the state space representation of an ARIMA model - which might be a good solution for this problem.
library(imputeTS)
na_kalman(myzoo, model = "auto.arima")
Also works directly with zoo time series objects. You could also use your own ARIMA models in this function. If you think you can do better then "auto.arima". This would be done this way:
library(imputeTS)
usermodel <- arima(myts, order = c(1, 0, 1))$model
na_kalman(myts, model = usermodel)
But in this case you have to convert the zoo onject back to ts, since arima() only accepts ts.

Resources