I would like to test the overall significance of my model. I have read that because I am using cluster-robust standard errors the usual F-test doesn't hold, and that I should use a Wald test instead.
My script currently looks like this, and all of these different options give me cluster-robust SEs:
Option 1:
library(lmtest)   # coeftest()
library(sandwich) # vcovCL()
m <- lm(y_var ~ var1 + poly(var2, 2) + quartier, data = df)
m_robust_clustered <- coeftest(m, vcov = vcovCL,
                               type = "HC1",
                               df = 9,               # there are 10 quartiers, so 10 - 1 = 9
                               cluster = ~ quartier) # retrieve cluster-robust SEs
Option 2: (using miceadds)
library(miceadds)
m <- lm.cluster(y_var ~ var1 + poly(var2, 2) + quartier,
                cluster = 'quartier',
                data = df)
Option 3: (using estimatr)
library(estimatr)
m <- lm_robust(y_var ~ var1 + poly(var2, 2) + quartier, cluster = quartier, data = df)
My issue is that from here I cannot figure out how to perform a Wald test. I have looked at both the waldtest() and Wald_test() functions, but neither of these works:
waldtest(m)
Wald_test(m)
==> What am I missing here? Which syntax should I use for the Wald test with each of the model fits above?
Thanks for the help!
The details of functions with names like "Wald test" can differ among packages. Some are designed for testing nested models and won't work on a single model as shown in the question (which doesn't specify which packages provided these waldtest() or Wald_test() functions).
A safe (if not the easiest to use) choice would be the wald.test() function in the R aod package. Usage is:
wald.test(Sigma, b, Terms = NULL, L = NULL, H0 = NULL, df = NULL, verbose = FALSE)
where Sigma is the variance-covariance matrix from the model and b is the vector of model coefficients. Note that it doesn't accept a model object; you supply the coefficient vector and the variance-covariance matrix directly. You choose between Terms and L to specify what to test: use Terms for a joint test on a set of coefficients, and use L to test linear combinations of coefficients. The default null hypothesis H0 is that all tested quantities equal 0. For the overall model Wald test, you would specify Terms as an integer vector covering all of the coefficients, for example 1:length(coef(model)) if coef(model) returns a vector.
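As a minimal sketch for Option 1 above (assuming the m, df, and quartier objects from the question, with sandwich and lmtest loaded):
library(aod)      # wald.test()
library(sandwich) # vcovCL()

V <- vcovCL(m, cluster = ~ quartier, type = "HC1") # cluster-robust vcov
b <- coef(m)

# joint Wald test that all coefficients except the intercept are zero
# (the analogue of the overall F-test; use Terms = 1:length(b) to include the intercept)
wald.test(Sigma = V, b = b, Terms = 2:length(b))
For Option 3, lm_robust already stores the cluster-robust matrix, so vcov(m) and coef(m) can be passed in directly.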
Related
I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling the results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, and B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (I don't see the random effects in the pooled output, although I've found this page, which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                    family = "poisson",
                    data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table

# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
  phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
  qctab = within(as.data.frame(ctab),
                 {`Std. Error` = `Std. Error`*sqrt(phi)
                  `z value` = Estimate/`Std. Error`
                  `Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
                 })
  return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
                     expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                                  family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether it makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the pooled summary table, using a variant of the machinery you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
                 function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
                 FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)

ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")

## adjust
qctab <- within(as.data.frame(ss),
                { std.error <- std.error*sqrt(phi_mean)
                  statistic <- estimate/std.error
                  p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
                })
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
I have a multivariate model with this (approximate) form:
library(MCMCglmm)
mod.1 <- MCMCglmm(
  cbind(OFT1, MIS1, PC1, PC2) ~
    trait - 1 +
    trait:sex +
    trait:date,
  random = ~us(trait):squirrel_id + us(trait):year,
  rcov = ~us(trait):units,
  family = c("gaussian", "gaussian", "gaussian", "gaussian"),
  data = final_MCMC,
  prior = prior.invgamma,
  verbose = FALSE,
  pr = TRUE,      # this saves the BLUPs
  nitt = 103000,  # number of iterations
  thin = 100,     # interval at which the Markov chain is stored
  burnin = 3000)
For publication purposes, I've been asked to report the Gelman-Rubin statistic to indicate that the model has converged.
I have been trying to run:
gelman.diag(mod.1)
But, I get this error:
Error in mcmc.list(x) : Arguments must be mcmc objects
Any suggestions on the proper approach? I assume the error means I can't pass my mod.1 output to gelman.diag() directly, but I am not sure what I am supposed to pass instead. My knowledge is quite limited here, so I'd appreciate any and all help!
Note that I haven't added the data here, but I suspect the answer is more code syntax and not data related.
gelman.diag() requires an mcmc.list. If we run more than one chain of the model, we can extract the 'Sol' component (the posterior samples of the fixed effects) from each fit and place them in an mcmc.list (below, the same model is simply fitted twice):
library(MCMCglmm)
model1 <- MCMCglmm(PO ~ 1, random = ~FSfamily, data = PlodiaPO, verbose = FALSE,
                   nitt = 1300, burnin = 300, thin = 1)
model2 <- MCMCglmm(PO ~ 1, random = ~FSfamily, data = PlodiaPO, verbose = FALSE,
                   nitt = 1300, burnin = 300, thin = 1)
mclist <- mcmc.list(model1$Sol, model2$Sol)
gelman.diag(mclist)
# gelman.diag(mclist)
# Potential scale reduction factors:
#             Point est. Upper C.I.
# (Intercept)          1          1
According to the documentation, it seems to be applicable only when there is more than one MCMC chain:
Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run with starting values that are overdispersed relative to the posterior distribution.
The input x here is
x - An mcmc.list object with more than one chain, and with starting values that are overdispersed with respect to the posterior distribution.
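Applied to the original mod.1, a hedged sketch (assuming final_MCMC and prior.invgamma from the question are available) is to refit the model several times, ideally with overdispersed starting values as the documentation quoted above suggests, and pass the Sol chains to gelman.diag():
library(MCMCglmm) # attaches coda, which provides mcmc.list() and gelman.diag()

# refit the same model three times to obtain separate chains
chains <- lapply(1:3, function(i) {
  set.seed(i) # different RNG state per chain
  MCMCglmm(cbind(OFT1, MIS1, PC1, PC2) ~ trait - 1 + trait:sex + trait:date,
           random = ~us(trait):squirrel_id + us(trait):year,
           rcov = ~us(trait):units,
           family = rep("gaussian", 4),
           data = final_MCMC, prior = prior.invgamma,
           nitt = 103000, thin = 100, burnin = 3000, verbose = FALSE)
})

# Gelman-Rubin diagnostic on the fixed-effect samples from all chains
gelman.diag(mcmc.list(lapply(chains, function(m) m$Sol)),
            multivariate = FALSE)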
I am using the mice package and lmer from lme4 for my analyses. However, pool.r.squared() won't work on this output. I am looking for suggestions on how to include the computation of the adjusted R squared in the following workflow.
library(lme4)
library(mice)

imp <- mice(nhanes)
imp2 <- mice::complete(imp, "all") # this step is necessary in my analyses to include other variables/covariates following the multiple imputation
fit <- lapply(imp2, lme4::lmer,
              formula = bmi ~ (1|age) + hyp + chl,
              REML = TRUE)
est <- pool(fit)
summary(est)
You have two separate problems here.
First, there are several opinions about what an R-squared for multilevel/mixed-model regressions actually is. This is why pool.r.squared does not work for you: it does not accept results from anything other than lm(). I do not have an answer for how to calculate something R-squared-ish for your model, and since that is a statistics question – not a programming one – I will not go into detail. However, a quick search indicates that for some kinds of multilevel R-squared, there are functions available in R, e.g. mitml::multilevelR2.
Second, in order to pool a statistic across imputation samples, it should be normally distributed. Therefore, you have to transform R-squared into Fisher's Z and back-transform it after pooling. See https://stefvanbuuren.name/fimd/sec-pooling.html
In the following I assume that you have a way (or several options) to calculate your (adjusted) R-squared. Assuming that you use mitml::multilevelR2 and choose the method by LaHuis et al. (2014), you can compute and pool it across your imputations with the following steps:
# what you did before:
imp <- mice::mice(nhanes)
imp2 <- mice::complete(imp, "all")
fit_l <- lapply(imp2, lme4::lmer,
                formula = bmi ~ (1|age) + hyp + chl,
                REML = TRUE)

# get your R-squareds in a vector (replace `mitml::multilevelR2` with your preferred function for this)
Rsq <- lapply(fit_l, mitml::multilevelR2, print = "MVP")
Rsq <- as.double(Rsq)

# convert the R-squareds into Fisher's Z-scores
Zrsq <- 1/2*log( (1+sqrt(Rsq)) / (1-sqrt(Rsq)) )

# get the variance of Fisher's Z (same for all imputation samples)
Var_z <- 1 / (nrow(imp2$`1`) - 3)
Var_z <- rep(Var_z, imp$m)

# pool the Zs (the sample size n is taken from the original data)
Z_pool <- pool.scalar(Zrsq, Var_z, n = nrow(imp$data))$qbar

# back-transform the pooled Z to R-squared
Rsq_pool <- ( (exp(2*Z_pool) - 1) / (exp(2*Z_pool) + 1) )^2
Rsq_pool # done
The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(30),
                 fac1 = as.factor(sample(c("A","B","C","D","E"), 30, replace = TRUE)),
                 fac2 = as.factor(sample(c("NY","NC","CA"), 30, replace = TRUE)),
                 x = rnorm(30))
The lme model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrap, and I can successfully obtain the coefficients (intercepts) and SEs for the fixed & random effects:
mixed_boot_sum <- function(data) {
  s <- sigma(data)
  c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100,
                      type = "parametric", use.u = FALSE)
My first question is: how do I obtain the coefficients (slopes) for each individual level of the two random effects from the bootstrapping result mixed_boot?
I have no problem extracting the coefficients (slopes) from the mixed model by using the augment function from the broom package; see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with objects of class "boot", so I can't use the above functions directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both fixed & random effects by specifying FUN as well?
Now I am very confused about the correct specification of FUN to achieve my needs. Any help regarding my question would be greatly appreciated!
My first question is: how do I obtain the coefficients (slopes) for each individual level of the two random effects from the bootstrapping result mixed_boot?
I'm not sure what you mean by "coefficients (slopes) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level, I would try
mixed_boot_coefs <- function(fit) {
  unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
                                 outer(rownames(cc), colnames(cc),
                                       function(x, y) paste0(y, x)))
mixed_boot_coefs <- function(fit) {
  unlist(lapply(coef(fit), flatten))
}
When run through bootMer/confint/boot::boot.ci these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups because the model assumes random variation in the intercept only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict] ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.
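As a concrete sketch of that last point (not from the original answer, and with nsim kept small purely for speed), percentile confidence intervals for every flattened coefficient can be pulled out of the bootMer result with boot::boot.ci(), reusing mixed and mixed_boot_coefs from above:
library(boot) # boot.ci()

boot_coefs <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 200,
                      type = "parametric")

# percentile CI for a single component, e.g. the first one
boot.ci(boot_coefs, index = 1, type = "perc")

# percentile CIs for all components at once
ci <- t(sapply(seq_along(boot_coefs$t0), function(i)
  boot.ci(boot_coefs, index = i, type = "perc")$percent[4:5]))
dimnames(ci) <- list(names(boot_coefs$t0), c("lwr", "upr"))
ci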
I'm using the caret package with the 'neuralnet' model to find the best tuning parameters for a neural network, based on a data set whose predictors have been transformed by PCA. This data set also contains two numeric output variables, and I want to model both variables against the predictors; thus, I'm performing regression.
When using the package 'neuralnet', I get the desired output: a network whose output layer consists of two neurons, corresponding to the two output variables that I want to model, as you can see from the following code.
library(neuralnet)
neuralnet.network <- neuralnet(x + y ~ PC1 + PC2, train.pca.groundTruth, hidden=2, rep=5, algorithm = "rprop+", linear.output=T)
> head(compute(neuralnet.network, test.pca[,c(1,2)])$net.result)
[,1] [,2]
187 0.5890781796 0.3481661367
72 0.7182396668 0.4330461404
107 0.5854193907 0.3446555435
228 0.6114171607 0.3648684296
262 0.6727465772 0.4035759540
135 0.5559830113 0.3288717153
However, when using the same model with the train function from the caret package, the output consists of just one variable, named '.outcome', which is in fact the sum of the two variables. This is the code:
library(caret)

paramGrid <- expand.grid(.layer1 = c(2), .layer2 = 0, .layer3 = 0)
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
set.seed(23)
caret.neuralnet <- train(x + y ~ PC1 + PC2, data = train.pca.groundTruth,
                         method = "neuralnet", metric = "RMSE",
                         tuneGrid = paramGrid, trControl = ctrl,
                         algorithm = "rprop+", linear.output = TRUE)
> head(predict(caret.neuralnet, test.pca[,c(1,2)]))
[1] 0.9221328635 1.1953289038 1.0333353272 0.9561434406 1.0409961115 0.8834807926
Is there any way to prevent caret's train function from interpreting the symbol '+' in a formula as summation, and instead have it treat the terms as a specification of several output variables, just as neuralnet does? I've tried the x-y form, though it doesn't work.
I would like to know whether there is any way to do this without training separate models for each output variable.
Thank you so much!
train() doesn't support multiple outcomes, so the intended symbolic formula x + y is resolved literally: the outcome becomes the arithmetic sum of x and y.
Max
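A hedged workaround sketch (not from the answer above; it reuses paramGrid, ctrl, and the data objects from the question, which aren't shown here): since train() accepts a single outcome, fit one model per output variable and bind the predictions:
# one caret model per outcome, same tuning setup as in the question
fit_x <- train(x ~ PC1 + PC2, data = train.pca.groundTruth,
               method = "neuralnet", metric = "RMSE", tuneGrid = paramGrid,
               trControl = ctrl, algorithm = "rprop+", linear.output = TRUE)
fit_y <- train(y ~ PC1 + PC2, data = train.pca.groundTruth,
               method = "neuralnet", metric = "RMSE", tuneGrid = paramGrid,
               trControl = ctrl, algorithm = "rprop+", linear.output = TRUE)

# two-column prediction matrix, analogous to neuralnet's net.result
preds <- cbind(x = predict(fit_x, test.pca[, c(1, 2)]),
               y = predict(fit_y, test.pca[, c(1, 2)]))
head(preds)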