Nowcasting Y with known X, using VECM in R - r

I am trying to nowcast a time series data (Y) using another time series (X) as a predictor. X and Y are cointegrated. Y is a monthly data from Jan 2012 to Oct 2016 and X runs from Jan 2012 to Feb 2017.
So, I ran VECM as it shown in this video: https://www.youtube.com/watch?v=x9DcUA9puY0
Than, to obtain a predicted values, I transformed it in VAR by vec2var command, following information from this topic: https://stats.stackexchange.com/questions/223888/how-to-forecast-from-vecm-in-r
But I can not forecast Y with known X, how it can be made using predict function with a linear regression model. Also, I can not obtain modelled Y (Y hat) values.
This is my code:
# Cointegrated_series is a ZOO object, which contains two time series X and Y
library("zoo")
library("xts")
library("urca")
library("vars")
# Obtain lag length
Lagl <- VARselect(Cointegrated_series)$selection[[1]]
#Conduct Eigen test
cointest <- ca.jo(Cointegrated_series,K=Lagl,type = "eigen", ecdet = "const",
spec = "transitory")
#Fit VECM
vecm <- cajorls(cointest)
#Transform VECM to VAR
var <- vec2var(cointest)
Than I'm trying to use predict function in different ways: predict(var), predict(var, newdata = 50), predict(var, newdata = 1000) - result is the same.
Tried to use tsDyn package and newdata argument in predict method, as it mentioned here: https://stats.stackexchange.com/questions/261849/prediction-from-vecm-in-r-using-external-forecasts-of-regressors?rq=1
Not working. My newdata is a ZOO object, where X series has values from Nov 2016 to Feb 2017, and Y series are NAs. So, the method returns NAs in forecast:
# Cointegrated_series is a ZOO object, which contains
#two time series X and Y from Jan 2012 to Oct 2016. Both X and Y are values.
# newDat is a ZOO object, which contains two time series
#X and Y from Nov 2016 to Feb 2017. X are values, Y are NAs.
library(tsDyn)
vecm <-VECM(Cointegrated_series, lag=2)
predict(vecm,newdata = newDat, n.ahead=5)
This is a result:
Y X
59 NA NA
60 NA NA
61 NA NA
62 NA NA
63 NA NA
For example, this is what I get after calling predict whithout newdata argument:
predict(vecm, n.ahead=5)
Y X
59 65.05233 64.78006
60 70.54545 73.87368
61 75.65266 72.06513
62 74.76065 62.97242
63 70.03992 55.81045
So, my main questions are:
How to nowcast Y with known X, using VEC model in R?
How to obtain modelled Y (Y hat) values?
Besides that, I also couldn't find an answer on these questions:
How to call Akaike criteria (AIC) for VECM in R?
Does vars and urca packages provide F and t statistics for VECM?
UPD 10.04.2017
I slightly edited the question. Noticed, that my problem applies to a "ragged edge" problem, and it's incorrect to call it "forecasting" - it is "nowcasting".
UPD 11.04.2017
Thank you for answering!
Here is the full code:
library("lubridate")
library("zoo")
library("xts")
library("urca")
library("vars")
library("forecast")
Dat <- dget(file = "https://getfile.dokpub.com/yandex/get/https://yadi.sk/d/VJpQ75Rz3GsDKN")
NewDat <- dget(file = "https://getfile.dokpub.com/yandex/get/https://yadi.sk/d/T7qxxPUq3GsDLc")
Lagl <- VARselect(Dat)$selection[[1]]
#vars package
cointest_e <- ca.jo(Dat,K=Lagl,type = "eigen", ecdet = "const",
spec = "transitory")
vecm <- cajorls(cointest_e)
var <- vec2var(cointest_e)
Predict1 <- predict(var)
Predict2 <- predict(var, newdata = NewDat)
Predict1$fcst$Y
Predict2$fcst$Y
Predict1$fcst$Y == Predict2$fcst$Y
Predict1$fcst$X == Predict2$fcst$X
#As we see, Predict1 and Predict2 are similar, so the information in NewDat
#didn't came into account.
library("tsDyn")
vecm2 <-VECM(Dat, lag=3)
predict(vecm2)
predict(vecm2, newdata=NewDat)
If dget will return an error, please, download my data here:
https://yadi.sk/d/VJpQ75Rz3GsDKN - for Dat
https://yadi.sk/d/T7qxxPUq3GsDLc - for NewDat
About nowcasting
Saying Nowcasting I mean current-month or previous-month forecasts of unavailible data with currently availible data. Here are some referenses:
Gianonne, Reichlin, Small: Nowcasting: The real-time informational content of macroeconomic data (2008)
Now-Casting and the Real-time Data Flow (2013)
Marcellino, Schumacher: Factor MIDAS for Nowcasting and Forecasting with Ragged-Edge Data: A Model Comparison for German GDP (2010)

I feel your question is more about how to do nowcasting for cointegrated variables, then let's see later how to implement it in R.
In general, according to Granger's representation theorem, cointegrated variables can be represented in multiple forms:
Long term relationship: contemporaneous values of y and x
VECM representation: (diff of) y and x explained by (diff of) lags, and error-correction term at previous period.
So I am not sure how you would do nowcasting in the VECM representation, since it includes only past values? I can see two possibilities:
Do nowcasting based on the long-term relationship. So you just run standard OLS, and predict from there.
Do nowcasting based on a structural VECM, where you add contemporaneous values of the variables you know (X). In R, you would do this package urca, you need though to check whether the predict function will allow you to add know X values.
Regarding the long-term relationship approach, what is interesting is that you can obtain forecasts for X and Y based on the VECM (without known X) and from the LT with known X. This gives you a way to have an idea of the accuracy of your model (comparing known and predicted X), which you could use to create a forecast averaging scheme for your Y?

First of all, thank you so much #Matifou for your awesome package.
I am late in responding, but I was also struggling to figure out the same question and I did not find a solution. That is why I implemented the following function, I hope it will be useful for some people:
#' #title Special predict method for VECM models
#' #description Predict method for VECM models given some known endogenus
#' variables are known but one. It is just valid for one cointegration equation by the moment
#' #param object, an object of class ‘VECM’
#' #param new_data, a dataframe containing the forecast of all the endogenus variables
#' but one, if there are exogenus variables, its forecast must be provided.
#' #param predicted_var, a string with the desired endogenus variable to be predicted.
#' #return A list with the predicted variable, predicted values and a dataframe with the
#' detailed values used for the construction of the forecast.
#' #examples
#' data(zeroyld)
#' # Fit a VECM with Johansen MLE estimator:
#' vecm.jo<-VECM(zeroyld, lag=2, estim="ML")
#' predict.vecm(vecm.jo, new_data = data.frame("long.run" = c(7:10)), predicted_var = "short.run")
predict.vecm <- function(object, new_data, predicted_var){
if (inherits(object, "VECM")) {
# Just valid for VECM models
# Get endogenus and exogenus variables
summary_vecm <- summary(object)
model_vars <- colnames(object$model)
endovars <- sub("Equation ", "", rownames(summary_vecm$bigcoefficients))
if (!(predicted_var %in% endovars)) {
stop("You must provide a valid endogenus variable.")
}
exovars <- NULL
if (object$exogen) {
ind_endovars <-
unlist(sapply(endovars, function(x) grep(x, model_vars), simplify = FALSE))
exovars <- model_vars[-ind_endovars]
exovars <- exovars[exovars != "ECT"]
}
# First step: join new_data and (lags + 1) last values from the calibration data
new_data <- data.frame(new_data)
if (!all(colnames(new_data) %in% c(endovars, exovars))) {
stop("new_data must have valid endogenus or exogenus column names.")
}
# Endovars but the one desired to be predicted
endovars2 <- endovars[endovars != predicted_var]
# if (!all(colnames(new_data) %in% c(endovars2, exovars))) {
# stop("new_data must have valid endogenus (all but the one desired to predict) or exogenus column names.")
# }
new_data <- new_data[, c(endovars2, exovars), drop = FALSE]
new_data <- cbind(NA, new_data)
colnames(new_data) <- c(predicted_var, endovars2, exovars)
# Previous values to obtain lag values and first differences (lags + 1)
dt_tail <- data.frame(tail(object$model[, c(endovars, exovars), drop = FALSE], object$lag + 1))
new_data <- rbind(dt_tail, new_data)
# Second step: get long rung relationship forecast (ECT term)
ect_vars <- rownames(object$model.specific$beta)
if ("const" %in% ect_vars) {
new_data$const <- 1
}
ect_coeff <- object$model.specific$beta[, 1]
new_data$ECT_0 <-
apply(sweep(new_data[, ect_vars], MARGIN = 2, ect_coeff, `*`), MARGIN = 1, sum)
# Get ECT-1 (Lag 1)
new_data$ECT <- as.numeric(quantmod::Lag(new_data$ECT_0, 1))
# Third step: get differences of the endogenus and exogenus variables provided in new_data
diff_data <- apply(new_data[, c(endovars, exovars)], MARGIN = 2, diff)
colnames(diff_data) <- paste0("DIFF_", c(endovars, exovars))
diff_data <- rbind(NA, diff_data)
new_data <- cbind(new_data, diff_data)
# Fourth step: get x lags of the endogenus and exogenus variables
for (k in 1:object$lag) {
iter <- myLag(new_data[, paste0("DIFF_", endovars)], k)
colnames(iter) <- paste0("DIFF_", endovars, " -", k)
new_data <- cbind(new_data, iter)
}
# Fifth step: recursive calculatioon
vecm_vars <- colnames(summary_vecm$bigcoefficients)
if ("Intercept" %in% vecm_vars) {
new_data$Intercept <- 1
}
vecm_vars[!(vecm_vars %in% c("ECT", "Intercept"))] <-
paste0("DIFF_", vecm_vars[!(vecm_vars %in% c("ECT", "Intercept"))])
equation <- paste("Equation", predicted_var)
equation_coeff <- summary_vecm$coefficients[equation, ]
predicted_var2 <- paste0("DIFF_", predicted_var)
for (k in (object$lag + 2):nrow(new_data)) {
# Estimate y_diff
new_data[k, predicted_var2] <-
sum(sweep(new_data[k, vecm_vars], MARGIN = 2, equation_coeff, `*`))
# Estimate y_diff lags
for (j in 1:object$lag) {
new_data[, paste0(predicted_var2, " -", j)] <-
as.numeric(quantmod::Lag(new_data[, predicted_var2], j))
}
# Estimate y
new_data[k, predicted_var] <-
new_data[(k - 1), predicted_var] + new_data[k, predicted_var2]
# Estimate ECT
new_data[k, "ECT_0"] <- sum(sweep(new_data[k, ect_vars], MARGIN = 2, ect_coeff, `*`))
if (k < nrow(new_data)) {
new_data[k + 1, "ECT"] <- new_data[k, "ECT_0"]
}
}
predicted_values <- new_data[(object$lag + 2):nrow(new_data), predicted_var]
} else {
stop("You must provide a valid VECM model.")
}
return(
list(
predicted_variable = predicted_var,
predicted_values = predicted_values,
data = new_data
)
)
}
# Lag function applied to dataframes
myLag <- function(data, lag) data.frame(unclass(data[c(rep(NA, lag), 1:(nrow(data)-`lag)),]))`
#Andrey Goloborodko, in your example, you should apply:
NewDat <- NewDat[-1,] #Just new data is necessary to be provided
predict.vecm(vecm2, new_data=NewDat, predicted_var = "Y")
# $predicted_variable
# [1] "Y"
#
# $predicted_values
# [1] 65.05233 61.29563 59.45109
#
# $data
# Y X ECT_0 ECT DIFF_Y DIFF_X DIFF_Y -1 DIFF_X -1 DIFF_Y -2
# jul. 2016 92.40506 100 -29.0616718 NA NA NA NA NA NA
# ago. 2016 94.03255 78 -0.7115037 -29.0616718 1.627486 -22 NA NA NA
# sep. 2016 78.84268 53 14.4653067 -0.7115037 -15.189873 -25 1.627486 -22 NA
# oct. 2016 67.99277 52 4.8300645 14.4653067 -10.849910 -1 -15.189873 -25 1.627486
# nov. 2016 65.05233 51 3.1042967 4.8300645 -2.940435 -1 -10.849910 -1 -15.189873
# dic. 2016 61.29563 50 0.5622618 3.1042967 -3.756702 -1 -2.940435 -1 -10.849910
# ene. 2017 59.45109 55 -7.3556104 0.5622618 -1.844535 5 -3.756702 -1 -2.940435
# DIFF_X -2 DIFF_Y -3 DIFF_X -3 Intercept
# jul. 2016 NA NA NA 1
# ago. 2016 NA NA NA 1
# sep. 2016 NA NA NA 1
# oct. 2016 -22 NA NA 1
# nov. 2016 -25 1.627486 -22 1
# dic. 2016 -1 -15.189873 -25 1
# ene. 2017 -1 -10.849910 -1 1

Related

Manually created formula seems to not work...but why?

Let's consider data following :
library(pglm)
data('UnionWage', package = 'pglm')
df1 <- data.frame(UnionWage[1:2],'wage' = UnionWage$wage, 'exper' = UnionWage$exper)
head(df1)
id year wage exper
1 13 1980 1.197540 1
2 13 1981 1.853060 2
3 13 1982 1.344462 3
4 13 1983 1.433213 4
5 13 1984 1.568125 5
6 13 1985 1.699891 6
I want to write a function which will fit logistic random panel regression to data :
Function
fit_panel_binary <- function(y, x) {
x <- cbind(x, y)
#Taking varnames since 3 (first and second are respectively id and year)
varnames <- names(x)[3:(length(x))]
#Excluding dependent variable from varnames
varnames <- varnames[!(varnames == names(y))]
#Creating formula of independent variables (sum)
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
#Performing panel random regression
model <- pglm(as.formula(paste0(names(y), "~", form)), data = as.matrix(x_copy),
model = 'random', family = binomial(link = 'logit'))
}
And error occrus :
fit_panel_binary(UnionWage['union'],df1)
Error in if (!id.name %in% names(x)) stop(paste("variable ", id.name, :
argument is of length zero
I though that I put formula incorrectly so I changed my function to outputs formula paste0(names(y), "~", form)
and this is what I got :
"union~wage+exper"
It seems that formula is correct, but regression says something different. DO you know where is the problem ?
Unfortunately, because of the way pglm captures its arguments using non-standard evaluation, it doesn't like being passed formulas that have to be evaluated (i.e. formulas that aren't typed in situ). You can get round this by calling it with do.call, but even then you have to be careful that the data is available in the frame in which pglm is called. So you have to do something like:
fit_panel_binary <- function(y, x) {
x <- cbind(x, y)
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste(names(y), "~", form))
params <- list(formula = form, data = x_copy, model = "random",
family = binomial(link = "logit"))
pglm_env <- list2env(params, envir = new.env())
do.call("pglm", params, envir = pglm_env)
}
Allowing:
fit_panel_binary(UnionWage['union'], df1)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 6 iterations
#> Return code 1: gradient close to zero
#> Log-Likelihood: -1655.081 (4 free parameter(s))
#> Estimate(s): -3.428254 0.8305558 -0.06590541 4.2625
Created on 2020-12-08 by the reprex package (v0.3.0)

Why would my residuals for a linear model in R have one less value than the data set I used to create it?

I created a linear model in R using county data from the ACS. There are 3140 entries in my data set, and they all have their corresponding fips codes. I'm trying to make a map of the residuals from my linear model, but I only have 3139 residuals. Does anyone know if there's something R does when creating a linear model that is responsible for this, and how I can fix it so that I can create this map? Thanks!
In response to the suggestion of checking for NAs, I ran this:
which(completedata$fipscode == NA)
integer(0)
R code if it helps:
sectorcodes <- read.csv("sectorcodes1.csv") #ruralubrancode, median hh income
sectorcodesdf <- data.frame(sectorcodes)
religion <- read.csv("Religion2.csv")
religiondf <- data.frame(religion)
merge1 <- merge(sectorcodesdf,religiondf, by = c('fipscode'))
merge1df <- data.frame(merge1)
family <- read.csv("censusdataavgfamsize.csv") #avgfamilysize
familydf <- data.frame(family)
merge2 <- merge(merge1df, familydf, by = c('fipscode'))
merge2df <- data.frame(merge2)
gradrate <- read.csv("censusdatahsgrad.csv")
gradratedf <- data.frame(gradrate)
evenmoredata2 <- merge(gradrate,merge2df, by=c("fipscode"))
#write.csv(evenmoredata2, file = "completedataset.csv")
completedata <- read.csv("completedataset.csv")
completedatadf <- data.frame(completedata)
lm8 <- lm(completedatadf$hsgrad ~ completedatadf$averagefamilysize*completedatadf$Rural_urban_continuum_code_2013*completedatadf$TOTADH*completedatadf$Median_Household_Income_2016)
summary(lm8)
library(blscrapeR)
library(RgoogleMaps)
library(choroplethr)
library(acs)
attach(acs)
require(choroplethr)
dataframe1 <- data.frame(completedatadf$fipscode,completedatadf$averagefamilysize)
names(dataframe1) <- c("region","value")
dataframe2 <- data.frame(completedata$fipscode,completedata$hsgrad)
names(dataframe2) <- c("region","value")
residdf <- data.frame(lm8$residuals)
dataframe3 <- data.frame(completedata$fipscode,lm8$residuals)
names(dataframe3) <- c("region","value")
county_choropleth(dataframe1)
county_choropleth(dataframe2)
county_choropleth(dataframe3)
When I try to run dataframe3, the error message is:
dataframe3 <- data.frame(completedata$fipscode,lm8$residuals)
Error in data.frame(completedata$fipscode, lm8$residuals) :
arguments imply differing number of rows: 3140, 3139
This can be caused by an NA in the response. For example, using the builtin BOD data frame note that there are 5 residuals in this example but 6 rows in b:
b <- BOD
b[3, 2] <- NA
nrow(b)
## [1] 6
fm <- lm(demand ~ Time, b)
resid(fm)
## 1 2 4 5 6
## -0.3578947 -0.2657895 1.6184211 -0.6894737 -0.3052632
We can handle that by specifying na.action = na.exclude when running lm. Note that now there are 6 residuals with the extra one being NA.
fm <- lm(demand ~ Time, b, na.action = na.exclude)
resid(fm)
## 1 2 3 4 5 6
## -0.3578947 -0.2657895 NA 1.6184211 -0.6894737 -0.3052632
Try
data.frame(na.omit(completedata$fipscode), lm8$residuals)
Probably, you're data has NA values.

Rolling regression and prediction with lm() and predict()

I need to apply lm() to an enlarging subset of my dataframe dat, while making prediction for the next observation. For example, I am doing:
fit model predict
---------- -------
dat[1:3, ] dat[4, ]
dat[1:4, ] dat[5, ]
. .
. .
dat[-1, ] dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used subset argument for selecting a subset to fit
I have used model = FALSE to not save model frame hence we save workspace
Overall, there is no obvious loop, but sapply is used.
Fitting starts from p, the minimum number of data required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we at least need the final column for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying code above gives results (27 rows in total, truncated output for 5 rows)
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted-R.squared value for fitted model, while the second column is the prediction. The first value for adj.r2 is NaN, because the first model we fit has 3 coefficients for 3 data points, hence no sensible statistics is available. The same happens to se as well, as the fitted line has no 0 residuals, so prediction is done without uncertainty.
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
(Efficient) Solution
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for(i in 3:nrow(data)){
nam <- paste("predict", i, sep = "")
nam1 <- paste("fit", i, sep = "")
nam2 <- paste("summary_fit", i, sep = "")
fit = lm(clicks ~ v1 + v2, data=data[1:i,])
tmp <- predict(fit, newdata=data2, se.fit=TRUE)
tmp1 <- fit
tmp2 <- summary(fit)
assign(nam, tmp)
assign(nam1, tmp1)
assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output files like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
tmp <- get(frames.fit[i])
fit.table <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28

Rolling regression return multiple objects

I am trying to build a rolling regression function based on the example here, but in addition to returning the predicted values, I would like to return the some rolling model diagnostics (i.e. coefficients, t-values, and mabye R^2). I would like the results to be returned in discrete objects based on the type of results. The example provided in the link above sucessfully creates thr rolling predictions, but I need some assistance packaging and writing out the rolling model diagnostics:
In the end, I would like the function to return three (3) objects:
Predictions
Coefficients
T values
R^2
Below is the code:
require(zoo)
require(dynlm)
## Create Some Dummy Data
set.seed(12345)
x <- rnorm(mean=3,sd=2,100)
y <- rep(NA,100)
y[1] <- x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int <- 1:100
dummydata <- data.frame(int=int,x=x,y=y)
zoodata <- as.zoo(dummydata)
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted <- predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted) <- NULL
c(predicted=predicted,square.res <-(predicted-zoodata[nextOb,'y'])^2)
# 2) Extract coefficients
#coefficients <- coef(mod)
# 3) Extract rolling coefficient t values
#tvalues <- ????(mod)
# 4) Extract rolling R^2
#rsq <-
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
So after figuring out how to extract t values from model (i.e. mod) , what do I need to do to make the function return three (3) seperate objects (i.e. Predictions, Coefficients, and T-values)?
I am fairly new to R, really new to functions, and extreemly new to zoo, and I'm stuck.
Any assistance would be greatly appreciated.
I hope I got you correctly, but here is a small edit of your function:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Solution 1; Quicker to write
# c(predicted=predicted,
# square.res=(predicted-zoodata[nextOb,'y'])^2,
# summary(mod)$coef[, 1],
# summary(mod)$coef[, 3],
# AdjR = summary(mod)$adj.r.squared)
#Solution 2; Get column names right
c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
coef_intercept = summary(mod)$coef[1, 1],
coef_Ly = summary(mod)$coef[2, 1],
coef_Lx = summary(mod)$coef[3, 1],
tValue_intercept = summary(mod)$coef[1, 3],
tValue_Ly = summary(mod)$coef[2, 3],
tValue_Lx = summary(mod)$coef[3, 3],
AdjR = summary(mod)$adj.r.squared)
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
head(results.z)
predicted square.res coef_intercept coef_Ly coef_Lx tValue_intercept tValue_Ly tValue_Lx AdjR
20 10.849344 0.721452 0.26596465 0.5798046 1.049594 0.38309211 7.977627 13.59831 0.9140886
21 12.978791 2.713053 0.26262820 0.5796883 1.039882 0.37741499 7.993014 13.80632 0.9190757
22 9.814676 11.719999 0.08050796 0.5964808 1.073941 0.12523824 8.888657 15.01353 0.9340732
23 5.616781 15.013297 0.05084124 0.5984748 1.077133 0.08964998 9.881614 16.48967 0.9509550
24 3.763645 6.976454 0.26466039 0.5788949 1.068493 0.51810115 11.558724 17.22875 0.9542983
25 9.433157 31.772658 0.38577698 0.5812665 1.034862 0.70969330 10.728395 16.88175 0.9511061
To see how it works, make a small example with a regression:
x <- rnorm(1000); y <- 2*x + rnorm(1000)
reg <- lm(y ~ x)
summary(reg)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02694322 0.03035502 0.8876033 0.374968
x 1.97572544 0.03177346 62.1816310 0.000000
As you can see, calling summary first and then getting the coefficients of it (coef(summary(reg)) works as well) gives you a table with estimates, standard errors, and t-values. So estimates are saved in column 1 of that table, t-values in column 3. And that's how I obtain them in the updated rolling.regression function.
EDIT
I updated my solution; now it also contains the adjusted R2. If you just want the normal R2, get rid of the .adj.
EDIT 2
Quick and dirty hack how to name the columns:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Get variable names
strVar <- c("Intercept", paste0("L", 1:(nrow(summary(mod)$coef)-1)))
vec <- c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
AdjR = summary(mod)$adj.r.squared,
summary(mod)$coef[, 1],
summary(mod)$coef[, 3])
names(vec)[4:length(vec)] <- c(paste0("Coef_", strVar), paste0("tValue_", strVar))
vec
}
}

Block bootstrap from subject list

I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
Each firm can potentially be selected multiple times, so I need to include it data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that for my real data frame, n, and the number iterations is much larger than the example below.
My thoughts initially are to break the existing data frame into a list by subject using the split() command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps implement quickdf from the plyr package to construct a new data frame.
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
v = sample(length(firms),n,replace=TRUE)
newdata = NULL
for(i in 1:n){
newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
}
reg1 = lm(value ~ inv + capital, data = newdata)
mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
myfit <- function(x, i) {
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
firms <- unique(Grunfeld$firm)
b0 <- boot(firms, myfit, 999)
You can also use the tsboot function in the boot package with fixed block resampling scheme.
require(plm)
require(boot)
data(Grunfeld)
### each firm is of length 20
table(Grunfeld$firm)
## 1 2 3 4 5 6 7 8 9 10
## 20 20 20 20 20 20 20 20 20 20
blockboot <- function(data)
{
coefficients(lm(value ~ inv + capital, data = data))
}
### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")
boot.1
## Bootstrap Statistics :
## original bias std. error
## t1* 410.81557 -25.785972 174.3766
## t2* 5.75981 0.451810 2.0261
## t3* -0.61527 0.065322 0.6330
dim(boot.1$t)
## [1] 99 3
head(boot.1$t)
## [,1] [,2] [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283 0.031324
## [3,] 479.74 3.2531 0.637298
## [4,] 557.79 4.5284 0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.
# get function for boot
myIndex <- function(x, i) {
# select the observations to subset. Likely repeated observations
blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
# run regression for given replicate, return estimated coefficients
coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}
now, bootstrap
# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)
Run the accepted answer
set.seed(1234)
b0 <- boot(firms, myfit, 200)
Let's eyeball a comparison
using indexing
b1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myIndex, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
Original version
b0
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myfit, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
These look pretty close. Now, a bit more checking
identical(b0$t, b1$t)
[1] TRUE
and
identical(summary(b0), summary(b1))
[1] TRUE
Finally, we'll do a quick benchmark
library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)},
rbind={b0 <- boot(firms, myfit, 200)})
On my computer, this returns
Unit: milliseconds
expr min lq mean median uq max neval
index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866 100
rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759 100
So, direct indexing is more than 2 times faster at every level of the distribution.
note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this dominant case, there is no (or very little) harm in restricting the returned result of the myIndex or myfit function to only include the variables of interest in the returned vector.
The solution needs to be modified to manage fixed effects.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")
## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)
## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
## join the sampled firms to their firm-year data
mydata <- left_join(X[i,], Grunfeld, by="firm")
## Distinguish between multiple resamples of the same firm
## Otherwise they have the same id in the fixed effects regression
## And trouble ensues
mydata <- mutate(group_by(mydata,firm,year),
firm_uniq4boot = paste(firm,"+",row_number())
)
## Run regression with and without firm fixed effects
c(coefficients(lm(value ~ inv + capital, data = mydata)),
coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}
set.seed(1)
system.time(b <- boot(firms, myfit, 1000))
summary(b)
summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
# First get the data
data("Grunfeld", package="plm")
firms <- unique(Grunfeld$firm)
myfit1 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
myfit2 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
coefficients(lm(value ~ inv + capital, data = mydata))
}
# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
## user system elapsed
## 13.51 0.01 13.62
# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
## user system elapsed
## 8.16 0.02 8.26
b1
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
b2
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191

Resources