bootstrapping with lme4 model and missing values - r

I am working through an example from Aguinis, Gottfredson, & Culpepper (2013). They have provided some R code to perform a bootstrapping procedure in R to estimate confidence intervals for slope variances. This is their original R code:
library(RLRsim)
#STEP 3: Random Intercept and Random Slope model
lmm.fit3=lmer(Y ~ (Xc|l2id) + Xc + I(Wj-mean(Wj)), data=exdata, REML=F)
# Nonparametric Bootstrap Function
REMLVC=VarCorr(lmer(Y ~Xc+(Xc|l2id)+I(Wj-mean(Wj) ),data=exdata,REML=T))$l2id[1:2,1:2]
U.R=chol(REMLVC)
REbootstrap=function(Us,es,X,gs){
nj=nrow(Us)
idk=sample(1:nj,size=nj,replace=T)
Usk=as.matrix(Us[idk,])
esk=sample(es,size=length(es),replace=T)
S=t(Usk)%*%Usk/nj
U.S = chol(S)
A=solve(U.S)%*%U.R
Usk = Usk%*%A
datk=expand.grid(l1id = 1:6,l2id = 1:nj)
colnames(X)=c('one','Xc','Wjc')
datk=cbind(datk,X)
datk$yk = X%*%gs + Usk[datk$l2id,1]+Usk[datk$l2id,2]*X[,2]+esk
lmm.fitk=lmer(yk ~Xc+(Xc|l2id)+Wjc,data=datk,REML=F)
tau11k = VarCorr(lmm.fitk)$l2id[2,2]
tau11k
}
# Implementing Bootstrap
bootks=replicate(1500,REbootstrap(Us=ranef(lmm.fit3)$l2id,es=resid(lmm.fit3),X=model.matrix(lmm.fit3),gs=fixef(lmm.fit3)))
quantile(bootks,probs=c(.025,.975))
I have been trying to adapt the code to my own data and model, so far without success, because (a) I do not fully understand every line of the code and (b) I have missing data points in one of my predictors. Here is what I have so far:
#reproducible code
set.seed(855)
exdf <- data.frame(
ID= c(rep(1:105, 28)),
content= sort(c(rep(1:28, 105))),
PrePost= sample(0:1, 105*28, replace=TRUE),
eyeFRF= sort(rep(rnorm(28), 105)),
APMs= sample(0:1, 105*28, replace=TRUE),
Gf= rep(rnorm(105), 28)
)
exdf[which(exdf$ID==62), "eyeFRF"] <- NA
RandomMissing <- sample(rownames(exdf[-which(exdf$ID==62), ]), 17)
exdf[RandomMissing, "eyeFRF"] <- NA
View(exdf)
#model
M03b <- glmer(APMs ~ PrePost + Gf + eyeFRF + (1|content) + (eyeFRF|ID), data=exdf, family=binomial("logit"))
#own adaptation
REMLVC=VarCorr(M03b)$ID[1:2,1:2]
U.R=chol(REMLVC)
REbootstrap=function(Us, es, X, gs){
#Us = random effects
#es = residuals
#X = design matrix
#gs = fixed effects
nj = nrow(Us) #104 in this case, one is excluded (#62) b/c no eye-data
idk = sample(1:nj, size=nj, replace=TRUE) #104 IDs
Usk = as.matrix(Us[idk,]) #104 intercepts and slopes
esk = sample(es, size=length(es), replace=TRUE) #2895 datapoints called 'x' (errors?)
S = t(Usk)%*%Usk/nj #?
U.S = chol(S) #?
A = solve(U.S)%*%U.R #?
Usk = Usk%*%A #?
datk = expand.grid(content=1:28, ID=1:nj)
colnames(X) = c('one', 'PrePost', 'Gf', 'eyeFRF')
datk = cbind(datk, X)
datk$APMsk = X%*%gs + Usk[datk$ID,1] + Usk[datk$ID,2]*X[ ,'eyeFRF'] + esk #random slope multiplies eyeFRF, as in M03b
lmm.fitk = glmer(APMsk ~ PrePost + Gf + eyeFRF + (1|content) + (eyeFRF|ID), data=datk, family=binomial("logit"))
tau11k = VarCorr(lmm.fitk)$ID[2,2] #slope variance for the ID grouping factor
tau11k
}
# Implementing Bootstrap
bootks <- replicate(1500, REbootstrap(Us=ranef(M03b)$ID, es=resid(M03b), X=model.matrix(M03b), gs=fixef(M03b)))
quantile(bootks, probs=c(.025,.975))
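On the four lines marked `#?`: they rescale the resampled random effects so that their second-moment matrix equals the estimated random-effects covariance matrix (`REMLVC`). A minimal base-R sketch of that step, using a made-up 2x2 covariance matrix `R` and simulated effects `Us` (both are assumptions for illustration, not values from the paper or from my data):

```r
set.seed(1)

# Hypothetical target covariance of (intercept, slope) random effects
R <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
U.R <- chol(R)                      # upper triangular, t(U.R) %*% U.R == R

# Hypothetical resampled random effects (100 groups, 2 columns)
Us <- matrix(rnorm(200), ncol = 2)
nj <- nrow(Us)

S <- crossprod(Us) / nj             # same as t(Us) %*% Us / nj
U.S <- chol(S)                      # t(U.S) %*% U.S == S
A <- solve(U.S) %*% U.R             # maps "covariance S" onto "covariance R"
Us.scaled <- Us %*% A

# After rescaling, the second-moment matrix matches R up to rounding error
S.scaled <- crossprod(Us.scaled) / nj
all.equal(S.scaled, R)
```

The reason it works: `t(A) %*% S %*% A` simplifies to `t(U.R) %*% U.R`, which is `R`, because `S` factors as `t(U.S) %*% U.S`.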

(upgrading comment to an answer)
If you're trying to get confidence intervals via parametric bootstrapping, would confint(M03b,method="boot") work for you? (I think these methods may be new or better developed since that paper was written ...)
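A minimal sketch of that call, shown on lme4's built-in `sleepstudy` data rather than on your model (the model `fm` and the choice of `nsim = 200` are only for illustration; more replicates are usually advisable):

```r
library(lme4)
set.seed(101)

# Random intercept and random slope model on a built-in data set
fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Parametric bootstrap intervals for the random-effects parameters only;
# these are reported on the standard deviation / correlation scale
ci <- confint(fm, parm = "theta_", method = "boot", nsim = 200,
              oldNames = FALSE)
ci
```

Because the intervals are on the standard deviation scale, squaring the endpoints of a slope SD interval should give an interval for the slope variance. The same call pattern should apply to a `glmer` fit such as `M03b`, although it will take longer to run.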

Related

Rolling regression forecast , DM test, CW test

I have a linear model with the exchange rate as the dependent variable and 7 independent variables (e.g. inflation, interest rate, etc.). I have quarterly data from 1993Q1-2011Q4.
I would like to run a rolling-window regression (with the model above) with a window size of 60 (1993Q1-2007Q4) and use the estimated regression to forecast the rest of the sample. I would also like to compare this model with a random walk model (the exchange rate follows a random walk). Finally, I would like to perform dm.test and the Clark-West test (clarkwest, which does not run). Is my code right?
X = embed(data)
X = as.data.frame(X)
install.packages("foreach")
library(foreach)
w_size=60
n_windows = nrow(X) - 60 #until 2007Q4
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
return(c(f1, f2))
}
e1 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"V1"], nrow(forecasts)) - forecasts[ ,2]
library(tseries)
library(forecast)
dm.test(e1,e2, "l") #p-value is more than 5% for all the cases( two.sided, greater, less)
clarkwest(e1,e2)
It seems the clarkwest() function is no longer supported. I recently wrote my own function, CW. Note that I used ordinary standard errors, not Newey-West corrected ones.
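For reference, a minimal base-R sketch of a Clark-West style statistic in its usual MSPE-adjusted form, with ordinary (not Newey-West) standard errors. The function name `cw_test` and its arguments are hypothetical, and the demo data are simulated; this is not the CW function mentioned above:

```r
# actual:  realised values over the forecast period
# f_small: forecasts from the parsimonious model (here: random walk)
# f_large: forecasts from the larger, nesting model (here: the regression)
cw_test <- function(actual, f_small, f_large) {
  e_small <- actual - f_small
  e_large <- actual - f_large
  # MSPE difference, adjusted for the noise the larger model adds
  f_adj <- e_small^2 - (e_large^2 - (f_small - f_large)^2)
  n <- length(f_adj)
  # t-statistic from regressing f_adj on a constant
  stat <- mean(f_adj) / (sd(f_adj) / sqrt(n))
  # One-sided test: large positive values favour the larger model
  c(statistic = stat, p.value = pnorm(stat, lower.tail = FALSE))
}

# Simulated demo with 16 forecast periods
set.seed(123)
actual  <- rnorm(16)
f_rw    <- rep(0, 16)
f_model <- 0.5 * actual + rnorm(16, sd = 0.5)
res <- cw_test(actual, f_small = f_rw, f_large = f_model)
res
```

With the objects from the question, the call would presumably be `cw_test(tail(X[, "V1"], nrow(forecasts)), forecasts[, 2], forecasts[, 1])`, since column 2 holds the random walk forecasts.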
To investigate your loop you could try:
i=1
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(V1 ~ V2+V3+V4+V5+V6+V7+V8, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$V1, 1)
This shows you exactly what the loop builds when i=1.
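If you want something you can run end to end while debugging, here is a self-contained sketch of the same rolling scheme in base R. The data are simulated (76 quarters, two regressors instead of seven), so the numbers mean nothing; only the indexing is the point:

```r
set.seed(42)

# Simulated stand-in for the embedded data: 76 quarters (1993Q1-2011Q4)
n <- 76
X <- data.frame(V1 = cumsum(rnorm(n)), V2 = rnorm(n), V3 = rnorm(n))

w_size <- 60
n_windows <- nrow(X) - w_size        # 16 one-step-ahead forecasts

forecasts <- t(sapply(seq_len(n_windows), function(i) {
  X_in  <- X[i:(w_size + i - 1), ]   # estimation window
  X_out <- X[w_size + i, ]           # next quarter, out of sample
  m1 <- lm(V1 ~ V2 + V3, data = X_in)
  c(model = unname(predict(m1, X_out)),  # regression forecast
    rw    = tail(X_in$V1, 1))            # random walk: last observed value
}))

# Forecast errors, one column per method
actual <- tail(X$V1, n_windows)
errors <- actual - forecasts
colMeans(errors^2)                   # mean squared prediction errors
```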

Modeling beta-binomial distributed data using glmmTMB

I'm trying to fit a mixed-effects model to assess effects on the rate of germinated pollen grains. I started with a binomial distribution and a model structure like this:
glmer(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database, family="binomial",
control = glmerControl(optimizer="bobyqa"))
Here NGG is the number of successes (germinated grains per stigma, which can vary from 0 to e.g. 55) and NGNG is the number of failures (non-germinated grains, from 0 to e.g. 80). The issue is that, judging from the results, the data seem to be overdispersed, as indicated by this function (found at http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html):
overdisp_fun <- function(model) {
vpars <- function(m) {
nrow(m)*(nrow(m)+1)/2
}
model.df <- sum(sapply(VarCorr(model), vpars)) + length(fixef(model))
rdf <- nrow(model.frame(model))-model.df
rp <- residuals(model, type = "pearson") # computes pearson residuals
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
The output was:
chisq = 1.334567e+04, ratio = 1.656201e+00, rdf = 8.058000e+03, p = 3.845911e-268
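The ratio and p-value follow directly from the first and third numbers. A quick check using the rounded values printed above (so the last digits may differ slightly):

```r
chisq <- 1.334567e+04   # sum of squared Pearson residuals
rdf   <- 8058           # residual degrees of freedom

ratio <- chisq / rdf    # values well above 1 suggest overdispersion
p <- pchisq(chisq, df = rdf, lower.tail = FALSE)
c(ratio = ratio, p = p)
```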
So I decided to try a beta-binomial model in glmmTMB as follows (it's important to keep this hierarchical structure):
glmmTMB(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database,
family=betabinomial(link = "logit"), na.action = na.omit, weights=NGT)
When I run it, I get:
Error in nlminb(start = par, objective = fn, gradient = gr, control = control$optCtrl) : (converted from warning) NA/NaN function evaluation
Is there something wrong with how the model is written? I have already checked for possible issues in (http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html) but have not found a solution yet.
Thanks

Simulating a mixed linear model and evaluating it with lmerTest in R

I am trying to understand how to use linear mixed models to analyse my data by simulating from a model, but I can't recover the input parameters. What am I missing?
I want to start by simulating a model with a random intercept for each subject. Here is the formula of what I want to simulate and recover (as implemented in the code below):
y_ij = beta0_j + beta1 * x_ij + e_ij
beta0_j = gamma00 + gamma01 * w_j + u0_j
If beta1 (<11) is small I find gamma00 as the intercept in the fixed-effects section, but I am completely unable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
for(i in 1:numObj){
y[index] <- gamma00 + gamma01*w[j]+ uo[j] + beta1*x[i] + e[index]
index <- index + 1
}
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x +
(1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed up index value: you should be using index rather than i to index x in your data generation loop.
Basically due to the mix-up you were using the first subject's x values for generating data for all the subjects, but using the individual x values in the model.
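A sketch of the corrected data generation, vectorised so that there is no index to mix up. It uses lme4 directly rather than lmerTest, and the names `subj` and `dat` are mine; the parameter values are the ones from the question:

```r
library(lme4)
set.seed(1)

numObj <- 20
numSub <- 100
n <- numObj * numSub

gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100

x  <- as.numeric(scale(runif(n, min = -100, max = 100)))
w  <- runif(numSub, min = -3, max = 3)
u0 <- rnorm(numSub, mean = 0, sd = 0.1)
e  <- rnorm(n, mean = 0, sd = 0.1)

# Subject index for every row: 1,1,...,1,2,2,...
subj <- rep(seq_len(numSub), each = numObj)

# Every observation uses its own x value (x[index] in the loop version)
y <- gamma00 + gamma01 * w[subj] + u0[subj] + beta1 * x + e

dat <- data.frame(y = y, x = x, subNo = factor(subj))
fit <- lmer(y ~ x + (1 | subNo), data = dat)
fixef(fit)   # the slope should now be close to beta1
```

In the loop from the question, the equivalent one-character fix is to write `beta1*x[index]` instead of `beta1*x[i]`.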

How to estimate the Kalman Filter with 'KFAS' R package, with an AR(1) transition equation?

I am using 'KFAS' package from R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances work well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficients. Does anyone know how I can do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation as ARIMA models. Although I am able to estimate the AR(1) coef. and Q_t this way, I still cannot estimate the \eps_t variance (H_t). Moreover, I am migrating my Kalman filter codes from EViews to R, so I need to learn SSMcustom for other models that are more complicated.
Thanks!
It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM for estimating general state space models, you need to provide your own model updating function. The default behaviour can only handle NA's in covariance matrices H and Q. The main goal of fitSSM is just to get started with simple stuff. For complex models and/or large data, I would recommend using your self-written objective function (with help of logLik method) and your favourite numerical optimization routines manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
if (estimate) {
-logLik(model)
} else {
model
}
}
opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)

Cross validate seasonal linear model

I'm trying to cross-validate my linear model, which has seasonal dummy variables, so I can't take a random sample.
y = rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12)
x = months(ISOdate(2012,1:12,1))
reg.data = data.frame(y, x)
model = lm(y ~ x, data = reg.data)
My CV function is:
cross.valid = function(model, min.fit = as.integer(nrow(model$model)*0.7), h = 1)
{
dados = model$model
n.rows = nrow(dados)
results = data.frame(pred = numeric(), actual = numeric())
for (i in seq(1, n.rows - min.fit - h + 1, by = h))
{
dados.train = dados[1:(i + min.fit - 1), ]
model <- update(model, data = dados.train)
dados.pred = dados[(i + min.fit):(i + min.fit + h - 1), -1, drop = FALSE]
predic = predict(model, newdata = dados.pred, interval = 'prediction')
actual = dados[(i + min.fit):(i + min.fit + h - 1), 1]
results = rbind(results, data.frame(pred = predic[1:h, 'fit'], actual = actual))
}
results
}
Example:
cv1 = cross.valid(model, h = 1)
mae = with(cv1, mean(abs(actual - pred )))
print(mae)
The MAE values for different horizons (h) are too close. Is the code itself valid? Is there a better solution/package for doing this?
Thanks!
I don't think there is anything incorrect about your function. Investigate the forecast package; I suspect that it will provide many functions that you need.
I have rewritten your function concisely:
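set.seed(1)
y = rnorm(120,0,3) + 20*sin(2*pi*(1:120)/12)
x = months(ISOdate(2012,1:12,1))
reg.data = data.frame(y, x)
nrows = nrow(reg.data) # number of observations
min.fit = as.integer(nrows*0.7) # minimum training size, same default as in your cross.valid()
# Absolute errors for the train/test split starting at offset i with horizon h
pred.set<-function(i,h) {
train<-reg.data[1:(i + min.fit - 1),]
test<-reg.data[(i + min.fit):(i + min.fit + h - 1),]
pred<-predict(lm(y~x, data=train), newdata=test)
abs(test$y - pred)
}
# Mean absolute error over all splits for horizon h
pred.by.horiz<-function(h)
mean(sapply(seq(1, nrows - min.fit - h + 1, by = h),pred.set,h=h))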
pred.by.horiz matches the output of your function (and post-processing) exactly.
As you mentioned, the horizon does not appear to affect the MAE:
mae.by.h<-sapply(seq(nrows-min.fit),pred.by.horiz)
plot(mae.by.h,type='l',col='red',lwd=2,xlab='Horizon',ylab='Mean absolute error')
Perhaps you expected that the mean error would increase as the prediction horizon increases. For many time-series models this would be true, but in your linear model of months, adding more data doesn't help you predict the next point in the series (unless you add 12 months or more).
For example, consider what happens when h is 1. You begin with 84 months of data, 7 data points for each month. Now you add one data point, which will be the next January, and attempt to predict February. But the additional data point only helps you predict the next January; that is how your linear model works. Look at the summary of the model:
lm(y ~ x, data = reg.data)
Coefficients:
(Intercept)      xAugust    xDecember    xFebruary     xJanuary
   17.11380    -32.74962    -17.81076     -0.03235     -6.63998
      xJuly        xJune       xMarch         xMay    xNovember
  -26.69203    -17.41170      2.96735     -7.11166    -25.43532
   xOctober   xSeptember
  -33.56517    -36.93474
Each prediction is based solely on two coefficients: the intercept and the coefficient of the month being predicted. So predicting one point ahead isn't any easier than predicting five points ahead. That is why the MAE isn't rising as the horizon increases: the problem is in the way you modeled the data, not in the MAE function.
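To make this concrete, here is a small sketch showing that the fitted model simply returns the training-set mean of each month. It uses the built-in `month.name` as a locale-independent stand-in for `months(ISOdate(...))`, and the names `month.means` and `same` are mine:

```r
set.seed(1)
y <- rnorm(120, 0, 3) + 20 * sin(2 * pi * (1:120) / 12)
x <- rep(month.name, length.out = 120)   # January, February, ... repeated
reg.data <- data.frame(y, x)

train <- reg.data[1:84, ]                # first 7 years
fit <- lm(y ~ x, data = train)

# Average of the 7 training observations for each month
month.means <- tapply(train$y, train$x, mean)

# Predictions for one observation of each month
pred <- predict(fit, newdata = data.frame(x = month.name))

# The predictions are exactly the per-month training means
same <- all.equal(as.numeric(pred), as.numeric(month.means[month.name]))
same
```

Since a prediction for February only depends on past Februaries, it does not matter whether the most recent training point is one month or five months before the target.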
One thing I didn't completely understand about your function is why you decided to increment the size of the train set by h on each iteration. It is revealing to look at what happens when you try to increment by 1:
# Code to increment by 1
pred.by.horiz2<-
function(h) mean(sapply(seq(1, nrows - min.fit - h + 1, by = 1),pred.set,h=h))
mae.by.h2<-sapply(seq(nrows-min.fit),pred.by.horiz2)
plot(mae.by.h2,type='l',col='red',lwd=2,xlab='Horizon',ylab='Mean absolute error')
The pattern here is complex, but you'll note that the MAE starts to decrease at 12, when the horizon is large enough that the next point can be used.
