Rolling regression forecast, DM test, CW test - r

I have a linear model with the exchange rate as the dependent variable and 7 other independent variables (e.g. inflation, interest rate, etc.). I have quarterly data from 1993Q1 to 2011Q4.
I would like to run a rolling window regression (with the model above) with window size 60 (1993Q1-2007Q4) and use the estimated regression to forecast the rest of the sample. I would also like to compare this model with the random walk model (the exchange rate follows a R.W.). Finally, I would like to run dm.test and the Clark-West test (clarkwest, which does not run). Is my code right?
X = embed(data)
X = as.data.frame(X)
install.packages("foreach")
library(foreach)
w_size=60
n_windows = nrow(X) - 60 #until 2007Q4
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
return(c(f1, f2))
}
e1 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,2]
library(tseries)
library(forecast)
dm.test(e1,e2, "l") #p-value is more than 5% for all the cases( two.sided, greater, less)
clarkwest(e1,e2)

It seems the clarkwest() function is no longer supported, so I recently wrote my own function, CW. Note that it uses ordinary standard errors, not Newey-West corrected ones.
To investigate your loop you could try:
i=1
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
This shows exactly what the loop builds when i=1: the in-sample window, the out-of-sample row, and the two forecasts.
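For reference, the Clark-West MSPE-adjusted statistic can be computed by hand. The sketch below is not the CW function mentioned above; cw_test and its argument names are hypothetical, and the data are simulated. It assumes the random walk is the smaller (null) model and the regression is the larger one, and, like the answer, it uses ordinary rather than Newey-West standard errors.

```r
# Hypothetical helper: Clark-West test with ordinary standard errors.
# e_small, f_small: errors and forecasts of the small (null) model, e.g. the random walk
# e_large, f_large: errors and forecasts of the larger model, e.g. the regression
cw_test <- function(e_small, e_large, f_small, f_large) {
  # Adjusted loss differential: squared-error gap plus the adjustment term
  adj <- e_small^2 - (e_large^2 - (f_small - f_large)^2)
  # t-statistic of the mean of adj (same as regressing adj on a constant)
  stat <- mean(adj) / (sd(adj) / sqrt(length(adj)))
  # One-sided test: large positive values favour the larger model
  list(statistic = stat, p.value = pnorm(stat, lower.tail = FALSE))
}

# Simulated illustration (made-up numbers, not the exchange rate data)
set.seed(123)
n <- 16
y <- rnorm(n)                       # realised values
f_rw <- rnorm(n, sd = 0.5)          # stand-in for the random walk forecasts
f_model <- y + rnorm(n, sd = 0.3)   # stand-in for the regression forecasts
res <- cw_test(y - f_rw, y - f_model, f_rw, f_model)
print(res)
```

With the objects from the question, the call would presumably be cw_test(e2, e1, forecasts[, 2], forecasts[, 1]), since column 1 holds the regression forecasts and column 2 the random walk forecasts.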

Related

How to predict Brown's Exponential Double Smoothing in a rolling window approach of 60 observations from a dataset

I have a task to forecast with Brown's Double Exponential Smoothing in a rolling window of 60 observations, using a dataset of 228 observations and 17 variables. I will only predict one of those 17 variables.
Also need to plot the original data and the predicted data.
How can this be done?
First, I create this function to calculate the prediction:
estwo_brown = function(x ,m, lambda) {
# Number of observations
TT = length(x)
# Looking at the lecture - we need M1,t
so = c();
# We need M2,t
st = c();
# We need the level component, Lt
at = c();
# We need the trend component, Tt
bt = c();
# We need the predicted values, Y_t+h
fs = c();
# We set up initial values
so[1] = x[1]
st[1] = x[1]
# Now as before, we loop from the second to the last observation
# and use equations from the lecture (2:TT)
for (i in 2:TT) {
so[i] = lambda * x[i-1] + (1-lambda) * so[i-1]
st[i] = lambda * so[i] + (1-lambda) * st[i-1]
at[i] = 2*so[i] - st[i]
bt[i] = lambda / (1 - lambda) * (so[i] - st[i])
fs[i+m] = at[i] + bt[i] * m
}
# Combining results
res = matrix(NA, nrow = TT, ncol = 6)
colnames(res) = c('Y', 'Mt1', 'Mt2', 'at', 'bt', 'Ft')
res[,'Y'] = x
res[,'Mt1'] = so
res[,'Mt2'] = st
res[,'at'] = at
res[,'bt'] = bt
res[,'Ft'] = fs[1:TT]
results = list()
results[['results']] = as.data.frame(res)
results[['outpred']] = fs[length(fs)]
return(results)
}
and then calculating the model using this:
b_double_exp = estwo_brown(x = dataset$column7, m = 1, lambda = 0.5)
But this predicts over the whole dataset. How can I predict in a rolling window of 60 observations instead? The rolling window should slide one observation ahead at a time.
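One possible approach, sketched under assumptions: refit on each block of 60 consecutive observations and keep only the out-of-sample forecast. brown_outpred and rolling_brown are hypothetical names; brown_outpred is a compact stand-in that follows the same recursion as estwo_brown above (including the lagged x[i-1]) and returns what that function stores in outpred. The series is simulated.

```r
# Compact stand-in for estwo_brown(): returns only the m-step-ahead forecast.
# It follows the recursion from the question, including the lagged x[i - 1].
brown_outpred <- function(x, m = 1, lambda = 0.5) {
  so <- st <- x[1]
  for (i in 2:length(x)) {
    so <- lambda * x[i - 1] + (1 - lambda) * so
    st <- lambda * so + (1 - lambda) * st
  }
  a <- 2 * so - st                         # level
  b <- lambda / (1 - lambda) * (so - st)   # trend
  a + b * m
}

# Slide a window of `width` observations one step at a time.
rolling_brown <- function(x, width = 60, m = 1, lambda = 0.5) {
  starts <- seq_len(length(x) - width - m + 1)
  pred <- vapply(starts, function(s) {
    brown_outpred(x[s:(s + width - 1)], m = m, lambda = lambda)
  }, numeric(1))
  target <- starts + width - 1 + m         # index of the observation being forecast
  data.frame(t = target, actual = x[target], pred = pred)
}

# Simulated series of 228 observations (stand-in for dataset$column7)
set.seed(1)
x <- cumsum(rnorm(228))
rb <- rolling_brown(x, width = 60, m = 1, lambda = 0.5)
head(rb)
```

Plotting the original and predicted series then amounts to plot(x, type = "l") followed by lines(rb$t, rb$pred, col = "red").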

How to predict GAM with smooth terms and basis functions with independent data?

I attempt to fit a GAM model with interactions between days (tt variable) and lagged predictors (k=2) using k basis functions.
library(mgcv)
# Example data
data=data.frame(
tt=1:107, # days
pol=(sample.int(101,size=107,replace=TRUE)-1)/100,
at_rec=sample.int(101,size=107,replace=TRUE),
w_cas=sample.int(2000,size=107,replace=TRUE)
)
# model
gam1<-gam(pol ~ s(tt, k = 10) +
s(tt, by = Lag(at_rec, k = 2), k = 10)+
s(tt, by = Lag(w_cas, k = 2), k = 10),
data=data,method="GACV.Cp")
summary(gam1)
# while making newdata
newdata=data.frame(tt=c(12,22),at_rec=c(44,34), w_cas=c(2011,2455))
# and prediction
predict(gam1,newdata=newdata,se.fit=TRUE)
I got this error
"Error in PredictMat(object$smooth[[k]], data) : Can't find by variable"
How to predict such a model with new data?
I'm 99.9% sure that the predict method can't find the by terms because they are functions of variables and it's looking for variables with exactly the names you provided: "Lag(at_rec, k = 2)".
Try adding those lagged variables to your data frame as explicit variables and refit the model and it should work:
data <- transform(data,
lag_at_rec = Lag(at_rec, k=2),
lag_w_cas = Lag(w_cas, k=2))
gam1 <- gam(pol ~ s(tt, k = 10) +
s(tt, by = lag_at_rec, k = 10)+
s(tt, by = lag_w_cas, k = 10),
data = data, method = "GACV.Cp")
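Once the lags are ordinary columns, newdata has to supply those same column names. The following is a sketch with simulated data; lag2 is a hypothetical base-R helper used here in place of Lag() so the example has no dependency beyond mgcv, and the newdata values are simply the ones from the question reused as already-lagged values.

```r
library(mgcv)

set.seed(1)
n <- 107
dat <- data.frame(
  tt     = 1:n,
  pol    = (sample.int(101, size = n, replace = TRUE) - 1) / 100,
  at_rec = sample.int(101, size = n, replace = TRUE),
  w_cas  = sample.int(2000, size = n, replace = TRUE)
)

# Hypothetical helper: shift a vector by two positions, padding with NA
lag2 <- function(x) c(NA, NA, head(x, -2))
dat$lag_at_rec <- lag2(dat$at_rec)
dat$lag_w_cas  <- lag2(dat$w_cas)

gam1 <- gam(pol ~ s(tt, k = 10) +
              s(tt, by = lag_at_rec, k = 10) +
              s(tt, by = lag_w_cas, k = 10),
            data = dat, method = "GACV.Cp")

# newdata must contain the explicit lag columns used in the formula
newdata <- data.frame(tt = c(12, 22),
                      lag_at_rec = c(44, 34),
                      lag_w_cas  = c(2011, 2455))
pred <- predict(gam1, newdata = newdata, se.fit = TRUE)
pred$fit
```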

Simulating a mixed linear model and evaluating it with lmerTest in R

I am trying to understand how to use mixed linear models to analyse my data by simulating a model, but I can't reproduce the input parameters. What am I missing?
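I want to start by simulating a model with a random intercept for each subject. Here is the formula of what I want to simulate and reproduce, as implemented in the code below (subject j, object i): y_ij = gamma00 + gamma01*w_j + u0_j + beta1*x_ij + e_ij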
If beta1 is small (<11), I find gamma00 as the intercept in the fixed-effects section, but I am completely unable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
for(i in 1:numObj){
y[index] <- gamma00 + gamma01*w[j]+ uo[j] + beta1*x[i] + e[index]
index <- index + 1
}
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x +
(1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed up index value: you should be using index rather than i to index x in your data generation loop.
Basically due to the mix-up you were using the first subject's x values for generating data for all the subjects, but using the individual x values in the model.
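A sketch of the corrected data generation, written in vectorised form so that the indexing cannot drift (equivalent to using x[index] inside the loop). The variable subj is a name introduced here, and the quick lm() check is only a base-R sanity check on the fixed effects, not a replacement for the lmer() call from the question, which can be rerun unchanged on this data frame.

```r
set.seed(1)
numObj <- 20
numSub <- 100
n <- numObj * numSub

e <- rnorm(n, mean = 0, sd = 0.1)
x <- as.numeric(scale(runif(n, min = -100, max = 100)))

gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w  <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)

# Subject index for every row: 1,1,...,1,2,2,...  (same row order as the loop)
subj <- rep(seq_len(numSub), each = numObj)

# Every row uses its own x value, which is what x[index] does in the loop
y <- gamma00 + gamma01 * w[subj] + uo[subj] + beta1 * x + e

dataFrame2 <- data.frame(y = y, x = x, subNo = factor(subj))

# Quick sanity check on the fixed effects
fe <- coef(lm(y ~ x, data = dataFrame2))
fe
```

The slope should now come out close to beta1 and the intercept close to meanBeta0.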

Predict Logistf

I'm using an R package called logistf to fit a logistic regression. I saw that the package has no predict function for new data, and the generic predict does not work with it, so I found some code that shows how to do this with new data:
fit<-logistf(Tax ~ L20+L24+L28+L29+L31+L32+L33+L36+S10+S15+S16+S17+S20, data=trainData)
betas <- coef(fit)
X <- model.matrix(fit, data=testData)
probs <- 1 / (1 + exp(-X %*% betas))
I want to build a cross-validation version of this using fit$predict and the probabilities that probs gives me. Has anyone done something like this before?
The other thing I want to know is about fit$predict. I'm fitting a binary logistic regression, and this component returns many values. Are these values for class 0 or class 1, and how can I tell? Thanks
While the code you wrote works perfectly, there seems to be a more concise way of getting the same results:
library(brglm)
brglm_model <- brglm(formula = response ~ predictor , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")
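On the question of which class the probabilities refer to, here is a small sketch that uses base glm() as a stand-in (not logistf or brglm) on simulated data. It shows that the manual inverse-logit calculation from the question reproduces predict(type = "response"), and that for a 0/1 response these are probabilities of class 1. Whether logistf follows the same convention is an assumption worth checking against its documentation.

```r
set.seed(42)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- rbinom(200, 1, plogis(0.5 + 1.2 * train$x1 - 0.8 * train$x2))
test <- data.frame(x1 = rnorm(50), x2 = rnorm(50))

fit <- glm(y ~ x1 + x2, family = binomial, data = train)

# Manual prediction: design matrix for the new data times the coefficients
X <- model.matrix(~ x1 + x2, data = test)
probs_manual <- as.numeric(1 / (1 + exp(-X %*% coef(fit))))

# Built-in prediction on the probability scale
probs_predict <- as.numeric(predict(fit, newdata = test, type = "response"))

all.equal(probs_manual, probs_predict)

# Fitted probabilities are higher, on average, for observed 1s than for 0s,
# consistent with them being P(y = 1)
tapply(fitted(fit), train$y, mean)
```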
About the CV, you have to write a few lines of code I guess:
library(dplyr)   # for %>% and filter()
library(ggplot2) # for the plot at the end
# Setting the number of folds, and number of instances in each fold
n_folds <- 5
fold_size <- nrow(dataset) %/% n_folds
residual <- nrow(dataset) %% n_folds
# label the instances based on the number of folds; leftover rows go to the last fold
cv_labels <- c(rep(seq_len(n_folds), each = fold_size), rep(n_folds, residual))
# the error term would differ based on each threshold value
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = (n_folds+1) , nrow = length(t_seq))
index_mat[,1] <- t_seq
# the main loop for calculation of the CV error on each fold
for (i in 1:n_folds){
train <- dataset %>% filter(cv_labels != i)
test <- dataset %>% filter(cv_labels == i )
brglm_cv_model <- brglm(formula = response_var ~ . , family = "binomial", data = train )
# predict with the model fitted on this fold's training data
brglm_cv_pred <- predict(object = brglm_cv_model, newdata = test , type = "response")
# error formula that you want, e.g. misclassification
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( factor(test$response_var) , factor(brglm_cv_pred>treshold, levels = c("FALSE","TRUE") ))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
# other indices can be computed as follows
#spec <- conf_mat[1,1]/sum(conf_mat[1,])
#prec <- conf_mat[2,2]/sum(conf_mat[,2])
#F1 <- (2*prec * sen)/(prec+sen)
#accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
#here I am only interested in sensitivity
index_mat[counter,(i+1)] <- sen
}
}
# final data.frame would be the mean of sensitivity over each threshold value
final_mat <- matrix(nrow = length(t_seq), ncol = 2 )
final_mat[,1] <- t_seq
final_mat[,2] <- apply(X = index_mat[,-1] , MARGIN = 1 , FUN = mean)
final_mat <- data.frame(final_mat)
colnames(final_mat) <- c("treshold","sensitivity")
#why not having a look at the CV-sensitivity of the model over threshold values?
ggplot(data = final_mat) +
geom_line(aes(x = treshold, y = sensitivity ), color = "blue")

Cross validate seasonal linear model

I'm trying to perform a CV on my linear model, which has seasonal dummy variables, so I can't take a random sample.
y = rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12)
x = months(ISOdate(2012,1:12,1))
reg.data = data.frame(y, x)
model = lm(y ~ x, data = reg.data)
My CV function is:
cross.valid = function(model, min.fit = as.integer(nrow(model$model)*0.7), h = 1)
{
dados = model$model
n.rows = nrow(dados)
results = data.frame(pred = numeric(), actual = numeric())
for (i in seq(1, n.rows - min.fit - h + 1, by = h))
{
dados.train = dados[1:(i + min.fit - 1), ]
model <- update(model, data = dados.train)
dados.pred = dados[(i + min.fit):(i + min.fit + h - 1), -1, drop = FALSE]
predic = predict(model, newdata = dados.pred, interval = 'prediction')
actual = dados[(i + min.fit):(i + min.fit + h - 1), 1]
results = rbind(results, data.frame(pred = predic[1:h, 'fit'], actual = actual))
}
results
}
Example:
cv1 = cross.valid(model, h = 1)
mae = with(cv1, mean(abs(actual - pred )))
print(mae)
The MAE values for different horizons (h) are too close. Is the code itself valid? Is there a better solution/package for doing this?
Thanks!
I don't think there is anything incorrect about your function. Investigate the forecast package; I suspect that it will provide many functions that you need.
I have rewritten your function concisely:
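set.seed(1)
y = rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12)
x = months(ISOdate(2012,1:12,1))
reg.data = data.frame(y, x)
# globals used by the functions below (same defaults as in your function)
nrows = nrow(reg.data)
min.fit = as.integer(nrows*0.7)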
pred.set<-function(i,h) {
train<-reg.data[1:(i + min.fit - 1),]
test<-reg.data[(i + min.fit):(i + min.fit + h - 1),]
pred<-predict(lm(y~x, data=train), newdata=test)
abs(test$y - pred)
}
pred.by.horiz<-function(h)
mean(sapply(seq(1, nrows - min.fit - h + 1, by = h),pred.set,h=h))
pred.by.horiz matches the output of your function (and post-processing) exactly.
As you mentioned, the horizon does not appear to affect the MAE:
mae.by.h<-sapply(seq(nrows-min.fit),pred.by.horiz)
plot(mae.by.h,type='l',col='red',lwd=2,xlab='Horizon',ylab='Mean absolute error')
Perhaps you expected that the mean error would increase as the prediction horizon increases. For many time-series models this would be true, but in your linear model of months, adding more data doesn't help you predict the next point in the series (unless you add 12 months or more).
For example, consider what happens when h is 1. You begin with 84 months of data, 7 points of data for each month. Now you add one point of data, which will be the next January, and attempt to predict the result for February. But your additional point of data will only help you predict the next January; that is how your linear function works. Look at the summary of the model:
lm(y ~ x, data = reg.data)
Coefficients:
(Intercept)      xAugust    xDecember    xFebruary     xJanuary
   17.11380    -32.74962    -17.81076     -0.03235     -6.63998
      xJuly        xJune       xMarch         xMay    xNovember
  -26.69203    -17.41170      2.96735     -7.11166    -25.43532
   xOctober   xSeptember
  -33.56517    -36.93474
Each prediction is made solely on the basis of two coefficients: the intercept and the coefficient of the predicted month. So predicting one point ahead isn't any easier than predicting five points ahead. That is why the MAE isn't rising as the horizon increases: the problem is in the way you modeled the data, not in the MAE function.
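A quick way to see this, as a sketch on the same simulated data: fit once on the first 84 months and compare the predictions for the following year with those for the year after. The names fit, near and far are introduced here for illustration.

```r
set.seed(1)
y <- rnorm(120, 0, 3) + 20 * sin(2 * pi * (1:120) / 12)
x <- rep(months(ISOdate(2012, 1:12, 1)), times = 10)
reg.data <- data.frame(y, x)

# Fit on the first 84 months only
fit <- lm(y ~ x, data = reg.data[1:84, ])

near <- predict(fit, newdata = reg.data[85:96, ])    # 1 to 12 steps ahead
far  <- predict(fit, newdata = reg.data[97:108, ])   # 13 to 24 steps ahead

# The forecast for a given month is the same at either distance
all.equal(unname(near), unname(far))
```

Because the forecast depends only on the month label, the distance between the training data and the forecast target plays no role.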
One thing I didn't completely understand about your function is why you decided to increment the size of the train set by h on each iteration. It is revealing to look at what happens when you try to increment by 1:
# Code to increment by 1
pred.by.horiz2<-
function(h) mean(sapply(seq(1, nrows - min.fit - h + 1, by = 1),pred.set,h=h))
mae.by.h2<-sapply(seq(nrows-min.fit),pred.by.horiz2)
plot(mae.by.h2,type='l',col='red',lwd=2,xlab='Horizon',ylab='Mean absolute error')
The pattern here is complex, but you'll note that the MAE starts to decrease at 12, when the horizon is large enough that the next point can be used.
