Do we need to difference exogenous variables before passing them to the xreg argument of Arima() in R?

I am trying to build a forecasting model using ARIMAX in R and need some guidance on how covariates are handled in the xreg argument.
I understand that the auto.arima function takes care of differencing the covariates when fitting the model (on training-period data), and that I also don't need to difference the covariates when generating forecasts for the test period (future values).
However, when fitting the model using Arima() with custom (p, d, q) and (P, D, Q)[m] values where d or D is greater than 0, do we need to difference the covariates manually?
If I difference them, the differenced covariate matrix has fewer rows than the dependent variable has data points.
How should one handle this?
Should I pass the covariate matrix as it is, i.e. without differencing?
Should I difference it but omit the first few observations, for which differenced covariate data are not available?
Should I keep the actual values for the first few rows, where differenced covariate values are not available, and use differenced values for the remaining rows?
If I have to pass flag variables (1/0) in the xreg matrix, should I difference those as well, or cbind the actual values of the flag variables with the differenced values of the remaining variables?
Also, when generating forecasts for a future period, how do I pass the covariate values (as they are, or after differencing)?
I am using the following code:
use_auto <- identical(pdq_order, "auto") || identical(PDQ_order, "auto")
ndiff <- if (use_auto) ndiffs(ts_train_PowerTransformed) else pdq_order[2]
nsdiff <- if (use_auto) nsdiffs(ts_train_PowerTransformed) else PDQ_order$order[2]
# Creating the appropriate covariates matrix after doing differencing
xreg_differenced <- ts_CovariatesData_TrainingPeriod
if (nsdiff >= 1) {
  # seasonal differencing first
  xreg_differenced <- diff(xreg_differenced, lag = PDQ_order$period, differences = nsdiff)
}
if (ndiff >= 1) {
  # then ordinary differencing
  xreg_differenced <- diff(xreg_differenced, lag = 1, differences = ndiff)
}
# Fitting the model
model_arimax <- Arima(ts_train_PowerTransformed, order = pdq_order, seasonal = PDQ_order, xreg = xreg_differenced)
# Generating Forecast for the test period
fit.test <- model_arimax %>% forecast(h = length(ts_test),
  xreg = as.matrix(diff(diff(ts_CovariatesData_TestPeriod, lag = PDQ_order$period, differences = nsdiff), lag = 1, differences = ndiff)))
Kindly suggest.

Arima will difference both the response variable and the xreg variables as specified in the order and seasonal arguments. You should never need to do the differencing yourself.
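A minimal sketch of what this means in practice, using simulated data and base R's stats::arima(), which forecast::Arima() wraps. The series y and x and all the coefficient values are made up for illustration. Fitting with d = 1 and the covariate in levels should give roughly the same regression coefficient as differencing both series by hand and fitting with d = 0, and the future covariate values are also passed in levels.

```r
set.seed(42)
n <- 300
x <- cumsum(rnorm(n))                                       # hypothetical covariate (random walk)
eta <- cumsum(as.numeric(arima.sim(list(ar = 0.5), n = n))) # ARIMA(1,1,0) regression errors
y <- 2 * x + eta                                            # true regression coefficient is 2

# Let arima() do the differencing: pass y and x in levels
fit_levels <- arima(y, order = c(1, 1, 0), xreg = cbind(x = x))

# Difference by hand and fit the corresponding d = 0 model
fit_manual <- arima(diff(y), order = c(1, 0, 0), xreg = cbind(x = diff(x)),
                    include.mean = FALSE)

beta_levels <- unname(coef(fit_levels)["x"])
beta_manual <- unname(coef(fit_manual)["x"])
print(round(c(levels = beta_levels, manual = beta_manual), 3))

# Forecasting: future covariate values are also supplied in levels
x_future <- cbind(x = x[n] + cumsum(rnorm(5)))  # made-up future values
fc <- predict(fit_levels, n.ahead = 5, newxreg = x_future)
```

With the forecast package the same pattern would be Arima(y, order = ..., xreg = x) followed by forecast(fit, xreg = x_future), with both matrices undifferenced, so flag variables can simply be left as 1/0 columns.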

Related

Implementation of time series cross-validation

I am working with time series 551 of the monthly data of the M3 competition.
So, my data is :
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data<- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which time series must I evaluate?
I'm quite confused; any help to shed light on this would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is
applied successively to the time series y[1:t], for t=1,...,T-h,
making predictions f[t+h]. The errors are given by e[t+h] =
y[t+h]-f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1 + h], then fits a model to y[1:2] and forecasts y[2 + h], and so on for T-h steps.
The tsCV function returns the forecast errors.
Applying this to the training data of ts.data:
# function to fit a model and forecast
fmodel <- function(x, h){
forecast(Arima(x, order=c(1,1,1), seasonal = c(0, 0, 2)), h=h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it may be that you are supposed to
fit a model to ts.data$x and then forecast ts.data$xx[1],
fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2],
and so on.
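The rolling-origin scheme described above can also be sketched by hand. This is only an illustration: it uses a simulated monthly series and base R's arima() rather than the M3 data, and the model order and the choice of 18 one-step origins are assumptions.

```r
set.seed(1)
# Simulated monthly series standing in for the real data
y <- ts(100 + cumsum(rnorm(120)), frequency = 12, start = c(2000, 1))
n <- length(y)
n_origins <- 18                       # evaluate on the last 18 observations
errors <- rep(NA_real_, n_origins)

for (i in seq_len(n_origins)) {
  end_train <- n - n_origins + i - 1  # last index of the current training window
  train <- ts(y[1:end_train], frequency = 12, start = start(y))
  fit <- arima(train, order = c(0, 1, 1))
  fc <- predict(fit, n.ahead = 1)$pred[1]   # one-step-ahead forecast
  errors[i] <- y[end_train + 1] - fc        # forecast error at the next observation
}

rmse <- sqrt(mean(errors^2))
print(rmse)
```

Each pass extends the training window by one observation and refits the model, so every forecast uses only data that would have been available at that origin.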

Standardization and inclusion of intercept in sparse lasso GLM

I ran into some problems while practicing the sparse group lasso method using the cvSGL function from the SGL package.
My questions are as follows:
Looking at the code for SGL:::center_scale, it doesn't seem to consider the sample size of the data.
SGL:::center_scale
#Output
function (X, standardize) {
means <- apply(X, 2, mean)
X <- t(t(X) - means)
X.transform <- list(X.means = means)
if (standardize == TRUE) {
var <- apply(X, 2, function(x) (sqrt(sum(x^2))))
X <- t(t(X)/var)
X.transform$X.scale <- var
}
else {
X.transform$X.scale <- 1
}
return(list(x = X, X.transform = X.transform))
}
As a result, the scale it uses (the L2 norm of each centered column) is somewhat larger than the sample standard deviation of the predictor.
Is my understanding correct that this may cause the coefficients to be excessively large?
Can the model be estimated by the SGL package with an intercept (constant) term?
The SGL package does not seem to provide an option for including an intercept term in the estimation.
In cvFit[["fit"]], I can see only the betas of the predictor variables for each lambda, with no constant term. The value of cvFit[["fit"]][["intercept"]] is the mean of the y variable.
An intercept could be estimated by adding a column of 1s as the first column of the predictor matrix X, but I expect this to cause problems when the predictors are centered and standardized.
In addition, the SGL package seems to penalize all predictor variables, so even if a column of 1s is added to X as described above, the constant term may be estimated as 0.
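To make the scaling point concrete, here is a small check in base R that repeats the steps of the center_scale code quoted above on a made-up matrix X (the data and dimensions are assumptions). It shows that the scale factor is the sample standard deviation multiplied by sqrt(n - 1), so the columns end up with unit length rather than unit variance.

```r
set.seed(123)
n <- 50
X <- matrix(rnorm(n * 3, mean = 5, sd = 2), nrow = n)  # made-up predictors

# Same steps as the center_scale code shown above
means <- apply(X, 2, mean)
Xc <- t(t(X) - means)
l2_norm <- apply(Xc, 2, function(x) sqrt(sum(x^2)))

# The scale factor equals the sample sd times sqrt(n - 1)
sample_sd <- apply(X, 2, sd)
print(all.equal(l2_norm, sample_sd * sqrt(n - 1)))  # TRUE

# After scaling, every column has unit L2 norm
X_scaled <- t(t(Xc) / l2_norm)
print(colSums(X_scaled^2))
```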

acf() function at lag.max = 0

This might be a very simple question, but what exactly is calculated when acf lag.max = 0?
When lag.max = 1, I am assuming it is only calculating the autocovariance (when type = "covariance") given the previous observation, such that given an observation at time t, it is checking covariance with observation at t-1, for all observations. So what is the number generated when lag.max = 0? I notice it is very close to the actual variance of the data, but not precisely the same.
With type = "covariance", the acf function computes the autocovariance of your data at lag 0 up to lag = lag.max. If lag.max is 0, the output of acf(your_data, lag.max = 0, type = "covariance") is the lag-0 autocovariance, i.e. the variance of your data as computed by cov(your_data, your_data), except that acf divides by n whereas cov divides by n - 1. That divisor accounts for the small numerical difference. For higher lags, acf with type = "covariance" computes something close to cov with the start point of the second argument shifted (not identical, because acf centres both parts on the overall mean and still divides by n), like this:
n <- length(your_data)
cov(your_data[1:(n-nlag)],your_data[(1+nlag):n]) # for lag nlag
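The relationship between the acf output and var()/cov() can be checked numerically. The sketch below uses a made-up vector x; it relies on acf() centring on the overall mean and dividing by n, whereas var() and cov() divide by n - 1.

```r
set.seed(7)
x <- rnorm(40)   # made-up data
n <- length(x)

# Lag 0: acf gives the variance with divisor n instead of n - 1
acf0 <- acf(x, lag.max = 0, type = "covariance", plot = FALSE)$acf[1]
print(all.equal(acf0, var(x) * (n - 1) / n))   # TRUE
print(all.equal(acf0, mean((x - mean(x))^2)))  # TRUE

# Lag k: cross-products around the overall mean, still divided by n
k <- 3
acfk <- acf(x, lag.max = k, type = "covariance", plot = FALSE)$acf[k + 1]
manual_k <- sum((x[1:(n - k)] - mean(x)) * (x[(1 + k):n] - mean(x))) / n
print(all.equal(acfk, manual_k))               # TRUE
```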

Vector Autoregressive Models for Multivariate Time Series : Trend and Seasonality

I have 3 time series and I want to predict future values for each of them.
I am using the vars package in R.
So this is the approach:
Decompose multiplicative time series and take out the trend, seasonality, and Random part.
time_series1_components = decompose(time_series1,type="mult")
Do this for all the time series.
Apply the VAR Model on the random parts and predict the future values:
random_part1 = time_series1_components$random
random_part2 = time_series2_components$random
random_part3 = time_series3_components$random
merged_df = ts.union(random_part1, random_part2,random_part3, dframe = TRUE)
merged_mat <- data.matrix(merged_df)
merged_mat = na.exclude(merged_mat)
checklag = VARselect(merged_mat)
EstimateModel=VAR(merged_mat, p = 2, type = "const", season = NULL, exogen = NULL)
summary(EstimateModel)
roots(EstimateModel)
predict(EstimateModel)
Now I should combine the predicted values of the random part with the trend and seasonality, and plot a graph showing the past values and the predicted values (highlighted separately).
How can I achieve this?
Any pointers will be helpful.
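As a starting point, the recombination step for a multiplicative decomposition can be sketched as below. This uses the built-in AirPassengers series as a stand-in for time_series1 (an assumption), and only shows that observed = trend * seasonal * random holds wherever the trend is defined. decompose() does not itself extrapolate the trend, so future trend values would have to come from a separate model before predicted random values could be multiplied back in the same way.

```r
# Multiplicative decomposition of a stand-in monthly series
components <- decompose(AirPassengers, type = "multiplicative")

# Recombine the three parts; the trend is NA at both ends of the series
recombined <- components$trend * components$seasonal * components$random
ok <- !is.na(recombined)

# Wherever the trend is defined, the product reproduces the observed values
print(all.equal(as.numeric(recombined[ok]), as.numeric(AirPassengers[ok])))  # TRUE

# Seasonal factors for one full cycle, reusable for future periods
seasonal_figure <- components$figure
print(round(seasonal_figure, 3))
```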

How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to get a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code returns an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all other variables.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
library(plm)   # plm.data(), plm()
library(plyr)  # ddply()
# Set panel data structure
alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm +0
# Create 1 month time lag for TRx
alldata <- ddply(
alldata, .(DMA), transform,
# This assumes that the data is sorted
Mkt_TRx_norm_prev = c(NA,Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <-function(decay, period, data_vector, pool_vector=0){
data_vector <-alldata$var
pool_vector <- alldata$DMA
data2<-data_vector
l<-length(data_vector)
#if no pool apply zero to vector
if(length(pool_vector)==1)pool_vector<-rep(0,l)
#outer loop: extract data to decay from observation i
for( i in 1:l){
x<-data_vector[i]
#inner loop: apply decay onto following observations after i
for(j in 1:min(period,l)){
#constrain decay to same pool (if data is pooled)
if( pool_vector[i]==pool_vector[min(i+j,l)]){data2[(i+j)]<- data2[(i+j)]+(x*(decay)^j)}
}
}
#reduce length of edited data to equal length of initial data
data2<-data2[1:l]
#regression - excludes NA values
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata , model = "within", na.action = na.exclude)
se <- summary(var_fe)$coefficients["data2","Std. Error"]
return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)
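For reference, the adstock transformation on its own (without the regression step) can be written compactly and checked on a toy example. This is only a sketch of the same finite-window geometric decay as in the loop above; the function name adstock, the toy data frame panel, and the assumption that rows are sorted by month within each DMA are all illustrative.

```r
# Finite-window geometric adstock: each value carries over to the next
# `period` observations with weights decay^1, ..., decay^period
adstock <- function(x, decay, period) {
  n <- length(x)
  out <- x
  for (j in seq_len(min(period, n - 1))) {
    # add decay^j times the value observed j periods earlier
    out[(j + 1):n] <- out[(j + 1):n] + decay^j * x[1:(n - j)]
  }
  out
}

spend <- c(100, 0, 0, 0, 50, 0)
print(adstock(spend, decay = 0.5, period = 2))
# 100  50  25   0  50  25

# Apply within each pool (DMA) so carry-over never crosses groups
panel <- data.frame(DMA = rep(c("A", "B"), each = 3),
                    spend = c(100, 0, 0, 10, 0, 0))
panel$spend_adstock <- ave(panel$spend, panel$DMA,
                           FUN = function(v) adstock(v, decay = 0.5, period = 2))
print(panel)
```

Keeping the transformation separate from the model fit makes it easier to inspect the adstocked values for a few decay settings before handing the objective to optimize().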
