Prediction of VAR model with exogenous variables in R - r

I'm building a very simple VAR model of inflation and a fuel price index. My data run from 1998 to September 2021. The plain VAR model gives me a prediction for inflation without any problems, but I can't get a prediction once I include a "crisis dummy" as an exogenous variable. This variable is 0 for most of the sample and 1 in the periods where I want to impose a crisis assumption on the data.
library(urca)
library(vars)
library(mFilter)
library(tseries)
library(TSstudio)
library(forecast)
library(tidyverse)
#-----Loading the Dataset (283 rows x 4 columns[date, inflation, fuel price index & crisis dummy])
data <- read_csv2(file.choose())
head(data)
nrow(data)
#-----Turning the dataset's variables into time series in order to apply the model
inf <- ts(data$inflation, start = c(1998, 3), frequency = 12)
fuel_index <- ts(data$`fuel_ndex_lag2`, start = c(1998, 3), frequency = 12)
crisis <- ts(data$`crisis`, start = c(1998, 3), frequency = 12)
#Building the Model
v <- cbind(inf, fuel_index)
lagselect <- VARselect(v, lag.max = 12, type ="both" )
lagselect$selection
lagselect$criteria
#Building the VAR using the (283x1) timeseries object "crisis" as my exogenous variable
var <- VAR(v, p=5, type = "const", season = NULL, exog = crisis)
summary(var)
#-----Loading a CSV file that I've already prepared and turning it into a time series object in order to use it
#as the new exogenous variable in the prediction block of the model. Since I will predict for only 1 period ahead,
#this new object is exactly the same as "crisis" but with an additional row (284x1). I fill that row with a 1
#(I also tried leaving the extra row blank)
new_crisis <- read_csv2(file.choose())
new_crisis <- ts(new_crisis$crisis, start = c(1998, 3), frequency = 12)
#VAR forecasting
var_forecast <- predict(var, dumvar = new_crisis, n.ahead = 1, ci = 0.95)
When I run the last command I get the following error message:
"Error in predict.varest(var, dumvar = new_crisis, n.ahead = 1, ci = 0.95) : Row number of dumvar is unequal to n.ahead."
Since I am predicting for 1 additional period, shouldn't my dumvar value be just one extra row longer than my initial exogenous variable? What am I missing? Also, should this extra row be blank or should I make a decision of whether to fill it with a 1 or 0?
I'll appreciate any help on this.
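A possible explanation, offered as a sketch rather than a confirmed answer: as far as I know, predict() for a varest object expects dumvar to hold only the future values of the exogenous variables, as a matrix with exactly n.ahead rows and the same column names as the matrix given to exogen in VAR(). A 284-row series fails that row check, and a plain ts vector has no rows at all. The extra value cannot be left blank either, because the forecast needs a number; choosing 0 or 1 is the scenario assumption. The data below are simulated stand-ins, and the names crisis_mat and n_ahead as well as the lag order p = 2 are assumptions made for illustration.

```r
# Simulated stand-in data; the real series come from the CSV in the question.
set.seed(1)
n <- 120
inf <- ts(rnorm(n), start = c(1998, 3), frequency = 12)
fuel_index <- ts(rnorm(n), start = c(1998, 3), frequency = 12)
v <- cbind(inf, fuel_index)

# Exogenous dummy as a one-column matrix with an explicit column name.
crisis_mat <- matrix(rep(c(0, 1, 0), times = c(60, 12, 48)), ncol = 1,
                     dimnames = list(NULL, "crisis"))

# Future values only: one row per forecast step, same column name as above.
# The 1 encodes the assumption that the next month is a crisis month.
n_ahead <- 1
new_crisis <- matrix(1, nrow = n_ahead, ncol = 1,
                     dimnames = list(NULL, "crisis"))

# Fit and forecast only if the vars package is installed.
if (requireNamespace("vars", quietly = TRUE)) {
  fit <- vars::VAR(v, p = 2, type = "const", exogen = crisis_mat)
  var_forecast <- predict(fit, n.ahead = n_ahead, ci = 0.95,
                          dumvar = new_crisis)
  print(var_forecast$fcst$inf)
}
```

For a longer horizon, new_crisis would need one row per forecast month, for example matrix(c(1, 1, 0), ncol = 1, dimnames = list(NULL, "crisis")) with n.ahead = 3.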

Related

Error when adding fourth regressor to VAR - R

I am building a vector autoregressive model and got stuck on some problem.
My regressors are some sentiment and financial values. For testing robustness I wanted to add multiple other economic variables to the model.
The problem I encounter is: when adding a fourth regressor I only get an error message in R.
I can use any combination of three, but as soon as I add a fourth one, it won't work (see error message below)...
My Code:
library(dplyr)
library(readr)
library(tidyverse)
library(urca)
library(vars)
library(tseries)
library(forecast)
library(stargazer)
tr <- ts(TR$tr, start = c(2011, 1), frequency = 4) #4 because quarterly
Index1 <- ts(Index1$Value, start = c(2011, 1), frequency = 4)
Index2 <- ts(Index2$Value, start = c(2011, 1), frequency = 4)
Control1 <- ts(CPI$Value, start = c(2011, 1), frequency = 4)
Control2 <- ts(Spread$Value, start = c(2011, 1), frequency = 4)
# for finding optimal lags
tr.bv <- cbind(TR$tr, Index1$Value, Index2$Value, CPI$Value, Spread$Value)
colnames(tr.bv) <- c("Total Return", "Index1", "Index2", "CPI", "Spread")
lagselect <- VARselect(tr.bv, lag.max = 10, type = "const")
lagselect$selection
# Building the model
Model <- VAR(tr.bv, p = 10, type = "const", season = NULL, exog = NULL)
summary(Model)
The error message I get:
Error in solve.default(Sigma) :
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
In addition: Warning message:
In cor(resids) : Standarddeviation equals zero
I also built the same model in Python using the statsmodels VAR function; there I only get 0's or NaN's as p-values.
Hopefully someone can help me?
The problem likely lies with your data and the final parameter you have added to your model (possibly multicollinearity or overfitting). A reproducible example would be helpful here.
See: https://stats.stackexchange.com/questions/446707/var-model-error-in-solve-defaultsigma-system-is-computationally-singular-r
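The overfitting explanation can be checked with simple counting. Each equation of a VAR with K variables, p lags and a constant estimates K * p + 1 coefficients from the observations left after dropping the first p. The sketch below does that arithmetic; the helper name var_degrees_of_freedom and the sample size of 40 quarters are assumptions for illustration, since the question does not state how many observations there are.

```r
# Residual degrees of freedom per equation in a VAR(p) with K variables.
# n_det is the number of deterministic terms (1 for type = "const").
var_degrees_of_freedom <- function(n_obs, K, p, n_det = 1) {
  usable <- n_obs - p          # the first p rows are lost to lagging
  params <- K * p + n_det      # coefficients estimated per equation
  c(usable_obs = usable, params_per_eq = params, resid_df = usable - params)
}

# Hypothetical sample: 40 quarterly observations, 5 variables, 10 lags.
res <- var_degrees_of_freedom(n_obs = 40, K = 5, p = 10)
print(res)
```

When resid_df is zero or negative, each equation can fit the data exactly, the residuals are all zero, and the residual covariance matrix cannot be inverted, which would be consistent with both the solve() error and the zero standard deviation warning. A smaller p, for instance the one suggested by lagselect$selection, reduces the parameter count.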

How to Create a loop (when levels do not overlap the reference)

I have written some code in R. This code takes some data and splits it into a training set and a test set. Then, I fit a "survival random forest" model on the training set. After, I use the model to predict observations within the test set.
Due to the type of problem I am dealing with ("survival analysis"), a confusion matrix has to be made for each unique time (stored in "unique.death.times"). For each of these confusion matrices, I am interested in the corresponding "sensitivity" value (e.g. sensitivity_1001, sensitivity_2005, etc.). I am trying to get all these sensitivity values: I would like to plot them against the unique death times and determine the average sensitivity.
In order to do this, I need to repeatedly calculate the sensitivity for each time point in "unique.death.times". I tried doing this manually and it is taking a long time.
Could someone please show me how to do this with a "loop"?
I have posted my code below:
#load libraries
library(survival)
library(data.table)
library(pec)
library(ranger)
library(caret)
#load data
data(cost)
#split data into train and test
ind <- sample(1:nrow(cost),round(nrow(cost) * 0.7,0))
cost_train <- cost[ind,]
cost_test <- cost[-ind,]
#fit survival random forest model
ranger_fit <- ranger(Surv(time, status) ~ .,
                     data = cost_train,
                     mtry = 3,
                     verbose = TRUE,
                     write.forest = TRUE,
                     num.trees = 1000,
                     importance = 'permutation')
#optional: plot training results
plot(ranger_fit$unique.death.times, ranger_fit$survival[1,], type = 'l', col = 'red') # for first observation
lines(ranger_fit$unique.death.times, ranger_fit$survival[21,], type = 'l', col = 'blue') # for twenty first observation
#predict observations test set using the survival random forest model
ranger_preds <- predict(ranger_fit, cost_test, type = 'response')$survival
ranger_preds <- data.table(ranger_preds)
colnames(ranger_preds) <- as.character(ranger_fit$unique.death.times)
From here, another user (Justin Singh) from a previous post (R: how to repeatedly "loop" the results from a function?) suggested how to create a loop:
sensitivity <- list()
for (time in names(ranger_preds)) {
  prediction <- ranger_preds[which(names(ranger_preds) == time)] > 0.5
  real <- cost_test$time >= as.numeric(time)
  confusion <- confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
  sensitivity[[time]] <- confusion$byClass[1]
}
But due to some of the observations used in this loop, I get the following error:
Error in confusionMatrix.default(as.factor(prediction), as.factor(real), :
The data must contain some levels that overlap the reference.
Does anyone know how to fix this?
Thanks
For some time points, prediction and/or real contain only one unique value, so the two factors end up with different levels. Make sure both factors have the same levels:
result <- sapply(names(ranger_preds), function(x) {
  prediction <- factor(ranger_preds[[x]] > 0.5, levels = c(TRUE, FALSE))
  real <- factor(cost_test$time >= as.numeric(x), levels = c(TRUE, FALSE))
  confusion <- caret::confusionMatrix(prediction, real, positive = 'TRUE')
  confusion$byClass[1]
}, USE.NAMES = FALSE)
result
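To see why fixing the levels matters, and to get the average the question asks for, here is a minimal base R sketch that computes sensitivity as TP / (TP + FN) from a 2x2 table instead of calling caret. The toy predictions, the observed times and the helper name sensitivity_at are all made up for illustration; they are not taken from the cost data.

```r
# Sensitivity at one time point, using factors with both levels fixed so the
# 2x2 table always has the same shape, even if one class never occurs.
sensitivity_at <- function(pred_prob, observed_time, t, cutoff = 0.5) {
  lv <- c("TRUE", "FALSE")
  prediction <- factor(pred_prob > cutoff, levels = lv)
  real <- factor(observed_time >= t, levels = lv)
  tab <- table(prediction, real)
  # True positives divided by all actual positives (NaN if there are none).
  tab["TRUE", "TRUE"] / sum(tab[, "TRUE"])
}

# Toy data: 4 test observations, predicted survival at 3 time points.
times <- c(1, 3, 5)
observed_time <- c(2, 4, 5, 1)
preds <- cbind(c(0.90, 0.95, 0.99, 0.60),
               c(0.40, 0.80, 0.90, 0.20),
               c(0.10, 0.30, 0.40, 0.05))  # all below 0.5: one-level case

sens <- vapply(seq_along(times),
               function(j) sensitivity_at(preds[, j], observed_time, times[j]),
               numeric(1))
avg_sens <- mean(sens, na.rm = TRUE)
print(sens)
print(avg_sens)
```

With the real objects, the same vector could be plotted with plot(ranger_fit$unique.death.times, result, type = "l") and averaged with mean(result, na.rm = TRUE).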

R: IRFs in a SVAR model, can't display specified model

I am doing a SVAR (structural vector auto regression) analysis in which I want to plot IRFs (impulse response functions). My time series have length 137 and I only use 3 variables, furthermore I select 1 lag when specifying the VAR model.
Specifying the VAR model works fine, but when I want to summarize it I get the following error message
VAR_reduced <- VAR(VAR_data_1, p = 1, type = "both")
summary(VAR_reduced)
Error in solve.default(Sigma) :
system is computationally singular: reciprocal condition number = 1.03353e-16
From what I read in another question, this error usually comes up when there are too few observations, which leads to overfitting, but in my example this should not be a problem, as I have enough observations.
As R does not display an error message if I don't run the summary command it is still possible to calculate the IRFs using:
plot(irf(VAR_reduced, n.ahead = 40))
But, the plot seems rather counter-intuitive, as there is no reaction from any variable other than assets. Therefore, my guess is that the error message hints at something I got wrong but haven't realised yet.
Is this correct, that is do I need to solve that error, or do my IRFs have nothing to do with this?
For completeness here is all the code:
library(quantmod)
library(urca)
library(vars)
library(tseries)
getSymbols('CPILFESL',src='FRED')
getSymbols('INDPRO',src='FRED')
getSymbols('WALCL',src='FRED')
CPI <- ts(CPILFESL, frequency = 12, start = c(1957,1))
output <- ts(INDPRO, frequency = 12, start = c(1919,1))
assets <- as.xts(WALCL)
assets <- to.monthly(assets, indexAt='yearmon', drop.time = TRUE)
assets <- ts(assets[,4], frequency = 12, start = c(2002,12))
assets <- window(assets, start = c(2008,9), end = c(2020,1))
CPI <- window(CPI, start = c(2008,9), end = c(2020,1))
output <- window(output, start = c(2008,9), end = c(2020,1))
loutput <- log(output)
lCPI <- log(CPI)
data_0 <- cbind(loutput, lCPI, assets)
plot(data_0)
VAR_data_1 <- ts.intersect(diff(loutput), diff(lCPI), diff(assets, differences = 2))
VAR_reduced <- VAR(VAR_data_1, p = 1, type = "both")
summary(VAR_reduced)
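One possible cause, which the thread does not confirm, is a scale mismatch: two of the series are differenced logs (very small numbers) while assets is a twice-differenced level in raw units (very large numbers). A residual covariance matrix with variances that many orders of magnitude apart can be numerically singular even though nothing is wrong with the sample size. The sketch below reproduces that effect with simulated data only; the standard deviations are assumptions chosen to make the point.

```r
set.seed(42)
n <- 137
small <- rnorm(n, sd = 1e-3)  # stand-in for a differenced log series
large <- rnorm(n, sd = 1e6)   # stand-in for a differenced level in raw units

# Covariance matrix with variances of roughly 1e-6 and 1e12.
Sigma <- cov(cbind(small, large))
print(rcond(Sigma))

# solve() refuses to invert it; capture the message instead of stopping.
res <- tryCatch(solve(Sigma), error = function(e) conditionMessage(e))
print(res)

# Rescaling the large series brings the variances closer together.
Sigma_scaled <- cov(cbind(small, large / 1e6))
inv_scaled <- solve(Sigma_scaled)
print(rcond(Sigma_scaled))
```

If this applies here, expressing assets in logs or in larger units before differencing would be one way to test it. Rescaling a variable changes the units of the impulse responses, so the plots would need to be read accordingly.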

Imputed predictions for missing time-series data nearly stationary (flat line)

I have time-series data of player counts in which the counts are missing for several intervals spread over several years. I'm trying to fill in/predict the missing player counts over these intervals.
Data available here: https://1drv.ms/u/s!AvEZ_QPY7OZuhJAlKJN89rH185SUhA
I'm following the instructions below, which use KalmanRun to impute the missing values. I've tried 3 different approaches to transforming the data: using an xts object, and 2 approaches to converting it into time series data
https://stats.stackexchange.com/questions/104565/how-to-use-auto-arima-to-impute-missing-values
require(forecast)
library(xts)
library(anytime)
library(DescTools)
df_temp = read.csv("r_share.csv")
df_temp[['DateTime']] <- as.Date(strptime(df_temp[['DateTime']], format='%Y-%m-%d %H:%M:%S'))
3 approaches to convert data; xts seems to work best by returning non-zero data that is interpretable.
#Convert df_temp to TimeSeries object
df_temp = xts(df_temp$Players, df_temp$DateTime)
#df_temp = as.ts(log(df_temp$Players), start = start(df_temp$DateTime), end = end(df_temp$DateTime), frequency = 365)
#df_temp = ts(df_temp$Players, start = c(2013, 02, 02), end = c(2016, 01, 31), frequency = 365)
Fitting and plotting:
fit <- auto.arima(df_temp, seasonal = TRUE)
id.na <- which(is.na(df_temp))
kr <- KalmanRun(index(df_temp), fit$model, update = FALSE)
#?KalmanRun$tol
for (i in id.na)
df_temp[i] <- fit$model$Z %*% kr$states[i,]
plot(df_temp)
The expected output is data that mimics the variability seen in the actual data and is different for each interval, whereas the actual output is relatively stationary and unchanging (both intervals have nearly the same prediction).
Does it have to be an ARIMA model?
You could try another model, Prophet, developed by Facebook.
Here you can find the guide and GitHub page.
If I understood correctly, you want something like this:
# Import library
library(prophet)
# Read data
df = read.csv("C:/Users/Downloads/r_share.csv",sep = ";")
# Transform to date
df["DateTime"] = as.Date(df$DateTime,format = "%d/%m/%Y")
# Change names for the model
colnames(df) = c("ds","y")
# call model
m = prophet(df)
# make "future" just one day greater than past
future = make_future_dataframe(m,periods = 1)
# predict the points
forecast = predict(m,future)
# plot results
plot(m,forecast)
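If staying with ARIMA is preferred, the Kalman approach from the linked answer can be sketched with base R only. Two points are worth noting, both stated here as my understanding rather than as the thread's conclusion: the data series itself (not its index) is what gets passed to the Kalman routine, and when arima() estimates an intercept the state-space model in fit$model describes the series with that intercept removed. The example uses a simulated AR(1) series with an artificial gap; the series, the gap and the AR(1) order are assumptions for illustration, and KalmanSmooth is used here in place of KalmanRun.

```r
set.seed(123)
n <- 200
y <- 10 + as.numeric(arima.sim(list(ar = 0.7), n = n))

# Knock out a block of observations to imitate a gap in the data.
gap <- 81:100
y_missing <- y
y_missing[gap] <- NA

# arima() tolerates NAs when fitted by maximum likelihood.
fit <- arima(y_missing, order = c(1, 0, 0), method = "ML")
mu <- coef(fit)[["intercept"]]

# Smooth the mean-adjusted series, then map states back to observations.
ks <- KalmanSmooth(y_missing - mu, fit$model)
y_filled <- y_missing
y_filled[gap] <- as.numeric(ks$smooth[gap, , drop = FALSE] %*% fit$model$Z) + mu

print(round(y_filled[gap], 2))
```

The imputed values are conditional expectations, so across a long gap they drift smoothly toward the series mean and will not reproduce the day-to-day variability of the observed data. That behaviour is expected from this kind of model and may explain part of the flat-looking output.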

Run gbm in a loop and calculate predicted values for each model in r

I'm trying to make gbm models in a loop in R with different learning rates.
I want to calculate a few statistics for each model and combine them with the original data set.
But I'm getting an error because each time a statistic is calculated, it is saved under the same name as the previous one.
I get the following error at the end of the loop:
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(dates = c(14824, 14825, :
duplicate subscripts for columns
The train data is basically stock price data with dates, open, high, close, etc.
Following is the code:
learningRateList <- as.numeric(7:9)*0.01
for (i in learningRateList){
modelNames <- paste("gbmModel", i, sep = "")
gbmModels <-gbm.step(data=train, gbm.x = reqCol, gbm.y = CloseCol,tree.complexity =9,learning.rate = i,bag.fraction = 0.75,family ="laplace",step.size=100 )
assign(modelNames, gbmModels)
#training data
#predict values for the training data set
predTrainGbm<-paste("gbmTrainPrediction", i, sep = "")
gbmTrainPrediction <- predict.gbm(gbmModels,train,n.trees=gbmModels$gbm.call$best.trees)
assign(predTrainGbm,gbmTrainPrediction)
#calculate mape for the predictions
mapeTrain<-paste("mapeGbmTrain", i, sep = "")
mapeTrainGbm<-regr.eval(train$Close,gbmTrainPrediction,stats = "mape")
assign(mapeTrain,mapeTrainGbm)
train<-cbind(train,predTrainGbm,mapeTrain)
#creating plots of actual vs predicted values
imageGbmName<-paste(fileCalculated,"Gbm Prediction",i,".png")
png(imageGbmName)
par(mfrow=c(2,1))
plot(train$Close,type="l",col="red",main = "Training set")
lines(gbmTrainPrediction,col="green")
plot(test$Close,type="l",col="red",main = "Test Set")
lines(gbmTestPrediction,col="green")
dev.off()
}
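A likely source of the error is the line train <- cbind(train, predTrainGbm, mapeTrain): it binds the two name strings rather than the values, and the resulting columns are called predTrainGbm and mapeTrain on every pass, so the second iteration hands gbm.step a data frame with duplicated column names. One way around it is to leave train untouched inside the loop, collect results in named containers, and bind them once at the end. The sketch below shows that pattern with made-up data and a placeholder function fit_and_predict standing in for gbm.step() and predict.gbm(); it is not a boosted model, and mape is a hand-written helper rather than regr.eval.

```r
set.seed(7)
train <- data.frame(Open = runif(50, 90, 110))
train$Close <- train$Open + rnorm(50)

# Placeholder for gbm.step() + predict.gbm(); NOT a boosted model.
fit_and_predict <- function(data, rate) {
  fit <- lm(Close ~ Open, data = data)
  centre <- mean(data$Close)
  as.numeric(centre + (1 - rate) * (predict(fit, data) - centre))
}

# Mean absolute percentage error.
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))

learning_rates <- c(0.07, 0.08, 0.09)
predictions <- list()
mape_by_rate <- numeric(0)

for (rate in learning_rates) {
  key <- paste0("gbmTrainPrediction", rate)  # unique name per learning rate
  predictions[[key]] <- fit_and_predict(train, rate)
  mape_by_rate[key] <- mape(train$Close, predictions[[key]])
}

# Bind once, after the loop, so the model always sees the original columns.
results <- cbind(train, as.data.frame(predictions))
print(names(results))
print(mape_by_rate)
```

The plotting code in the loop also refers to gbmTestPrediction, which is never created; a second list filled from the test set in the same way would cover that.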
