Wald Testing Bootstrapped Estimates in R - r

I've performed multiple regression (specifically, quantile regression with multiple predictors using quantreg in R). I have estimated the standard errors and confidence intervals by bootstrapping the estimates. Now I want to test whether the estimates at different quantiles differ significantly from one another (a Wald test would be preferable). How can I do this?
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit) # multiple taus give an object of class "rqs"
q.Summary <- summary(q.Result, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
From q.Summary I've extracted the bootstrapped estimates (i.e. a vector of 10000 bootstrapped B values per quantile).
Note: I'm not especially interested in comparing the estimates for all the covariates in FML; I'm primarily interested in comparing the estimates for VAR. What is the best way to proceed?

I consulted with a colleague, and we resolved that estimates from different taus could be compared using a Wald test as follows.
From the summary object produced by
q.Summary <- summary(q.Result, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest for each tau, in this case VAR, the first covariate in FML (column 2 of B, after the intercept)
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
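B0 <- coef(summary(lm(FML, data)))[2,1] # Extract the linear (OLS) estimate for VAR
Then compute the Wald statistic, which measures how far the bootstrap mean at each tau lies from the linear estimate B0, and get the p-value using the number of quantiles as the degrees of freedom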
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower.tail=FALSE)
You also want to verify that the bootstrapped betas are normally distributed. If you're running many taus it can be cumbersome to check all those QQ plots, so just sum them by row
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working. If anyone can think of anything wrong with my solution, please share.
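The statistic above compares each tau's estimate with the linear estimate B0. To test directly whether the VAR estimates differ from one another across taus, one option is a Wald test on contrasts between adjacent quantiles. The following is only a sketch under stated assumptions: boot.Bs is replaced by simulated stand-in draws, the bootstrap means are used as point estimates (in practice the coefficients from the original fit would be the natural choice), and the draws are assumed to be paired across taus so that their covariance is meaningful. If each tau was bootstrapped independently, the off-diagonal covariances will be close to zero and the dependence between quantile estimates is ignored.

```r
set.seed(42)
R <- 2000

# Hypothetical stand-in for boot.Bs (rows = bootstrap draws, columns = taus)
boot.Bs <- cbind(tau0.25 = rnorm(R, 0.40, 0.10),
                 tau0.50 = rnorm(R, 0.55, 0.10),
                 tau0.75 = rnorm(R, 0.90, 0.10))

theta <- colMeans(boot.Bs)   # point estimate per tau (bootstrap mean as stand-in)
V     <- cov(boot.Bs)        # bootstrap covariance of the estimates across taus

# Contrast matrix: each row is the difference between adjacent taus
D <- diff(diag(ncol(boot.Bs)))

d    <- D %*% theta
Wald <- drop(t(d) %*% solve(D %*% V %*% t(D)) %*% d)
df   <- nrow(D)              # number of contrasts = number of taus - 1
Pvalue <- pchisq(Wald, df, lower.tail = FALSE)

c(Wald = Wald, df = df, p = Pvalue)
```

Under the null hypothesis that the VAR coefficient is the same at every tau, this statistic is approximately chi-squared with (number of taus - 1) degrees of freedom, provided the bootstrap distribution is close to normal.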

Related

Adjusted R squared using 'mice'

I am using the mice package and lmer from lme4 for my analyses. However, pool.r.squared() won't work on this output. I am looking for suggestions on how to include the computation of the adjusted R squared in the following workflow.
library(lme4)
library(mice)
imp <- mice(nhanes)
imp2 <- mice::complete(imp, "all") # This step is necessary in my analyses to include other variables/covariates following the multiple imputation
fit <- lapply(imp2, lme4::lmer,
formula = bmi ~ (1|age) + hyp + chl,
REML = TRUE)
est <- pool(fit)
summary(est)
You have two separate problems here.
First, there are several opinions about what an R-squared for multilevel/mixed-model regressions actually is. This is the reason why pool.r.squared does not work for you, as it does not accept results from anything other than lm(). I do not have an answer on how to calculate something R-squared-ish for your model, and since that is a statistics question rather than a programming one, I am not going into detail. However, a quick search indicates that for some kinds of multilevel R-squared there are functions available in R, e.g. mitml::multilevelR2.
Second, in order to pool a statistic across imputation samples, it should be normally distributed. Therefore, you have to transform R-squared into Fisher's Z and back-transform it after pooling. See https://stefvanbuuren.name/fimd/sec-pooling.html
In the following I assume that you have a way (or several options) to calculate your (adjusted) R-squared. Assuming that you use mitml::multilevelR2 and choose the method by LaHuis et al. (2014), you can compute and pool it across your imputations with the following steps:
# what you did before:
imp <- mice::mice(nhanes)
imp2 <- mice::complete(imp, "all")
fit_l <- lapply(imp2, lme4::lmer,
formula = bmi ~ (1|age) + hyp + chl,
REML = TRUE)
# get your R-squareds in a vector (replace `mitml::multilevelR2` with your preferred function for this)
Rsq <- lapply(fit_l, mitml::multilevelR2, print="MVP")
Rsq <- as.double(Rsq)
# convert the R-squareds into Fisher's Z-scores
Zrsq <- 1/2*log( (1+sqrt(Rsq)) / (1-sqrt(Rsq)) )
# get the variance of Fisher's Z (same for all imputation samples)
Var_z <- 1 / (nrow(imp2$`1`)-3)
Var_z <- rep(Var_z, imp$m)
# pool the Zs
Z_pool <- mice::pool.scalar(Zrsq, Var_z, n=nrow(imp$data))$qbar
# back-transform pooled Z to Rsquared
Rsq_pool <- ( (exp(2*Z_pool) - 1) / (exp(2*Z_pool) + 1) )^2
Rsq_pool #done
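The transform and back-transform steps above can also be written with atanh and tanh, which are algebraically the same as the log and exp expressions. A minimal sketch in base R, assuming a made-up vector of per-imputation R-squared values (the numbers and the helper names r2_to_z and z_to_r2 are hypothetical), and using the plain mean of the Z values as the pooled point estimate:

```r
# Hypothetical per-imputation R-squared values (stand-ins for Rsq above)
Rsq <- c(0.31, 0.28, 0.35, 0.30, 0.33)

r2_to_z <- function(r2) atanh(sqrt(r2))  # same as 1/2 * log((1 + r) / (1 - r))
z_to_r2 <- function(z) tanh(z)^2         # same as ((exp(2z) - 1) / (exp(2z) + 1))^2

Z        <- r2_to_z(Rsq)
Z_pool   <- mean(Z)                      # pooled point estimate: mean over imputations
Rsq_pool <- z_to_r2(Z_pool)
Rsq_pool
```

The pooled value always lies between the smallest and largest per-imputation R-squared, which makes for a quick sanity check.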

How to validate performance of generalized linear regression model

I'm trying to validate the performance of a generalized linear model that has a continuous output. Through research I found that the most effective way to validate a model with a continuous outcome is to use R-squared, adjusted R-squared and RMSE (correct me if I'm wrong), rather than the confusion-matrix metrics (accuracy, precision, F1, etc.) used for binomial models.
How do I find the R-squared value for my model, based on the actual vs. predicted values? Below is the code for my glm model; the data has been split into train and test sets.
I'm quite new to this, so I'm open to suggestions.
#GENERALISED LINEAR MODEL
LR_swim <- glm(racetime_mins ~ event_month +gender + place +
clocktime_mins +handicap_mins +
Wind_Speed_knots+
Air_Temp_Celsius +Water_Temp_Celsius +Wave_Height_m,
data = SwimmingTrain,
family=gaussian(link = "identity"))
summary(LR_swim)
#Predict Race_Time
pred_LR <- predict(LR_swim, SwimmingTest, type ="response")
pred_LR
Such performance measures can be implemented with a simple line of R code. So, for some dummy data:
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
the mean squared error (MSE) is simply
mean((preds-actuals)^2)
# [1] 0.09
while the mean absolute error (MAE), is
mean(abs(preds-actuals))
# [1] 0.2333333
and the root mean squared error (RMSE) is simply the square root of the MSE, i.e.:
sqrt(mean((preds-actuals)^2))
# [1] 0.3
The last two measures have an additional advantage of being in the same scale as your original data (not the case for MSE).
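R-squared on held-out data can be computed in the same spirit. A minimal sketch, using the same dummy data and assuming the usual definition of 1 minus the ratio of the residual sum of squares to the total sum of squares; the helper adj_rsq and its arguments n (number of observations) and p (number of predictors) are illustrative names, not part of any package:

```r
preds   <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)

sse <- sum((actuals - preds)^2)           # residual sum of squares
sst <- sum((actuals - mean(actuals))^2)   # total sum of squares around the mean
rsq <- 1 - sse / sst
rsq

# Adjusted R-squared for n observations and p predictors
adj_rsq <- function(rsq, n, p) 1 - (1 - rsq) * (n - 1) / (n - p - 1)
```

For the model in the question, preds would be pred_LR and actuals would be SwimmingTest$racetime_mins. Adjusted R-squared needs n to exceed p + 1, so it is not meaningful for this three-point dummy example.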

AIC in R: differences in manual vs. internal value when using weighted data

I am attempting to use R for model selection based on the AIC statistic. When comparing linear models with or without weighting, my code in R informs me that weighting is preferable compared to no-weighting, and these results are confirmed in other software (GraphPad Prism). I have sample code using real data from a standard curve:
#Linear Curve Fitting
a <- c(0.137, 0.412, 1.23, 3.7, 11.1 ,33.3)
b <- c(0.00198, 0.00359, 0.00816, 0.0220, 0.0582, 0.184)
m1 <- lm(b ~ poly(a,1))
m2 <- lm(b ~ poly(a,1), weights=1/a)
n1 <- 6 #Number of observations
k1 <- 2 #Number of parameters
When I calculate AIC using either the internal function in R or via manual calculation in which:
AIC = n + n log 2π + n log(RSS/n) + 2(k + 1) with n observations and k parameters
I get equivalent AIC values for the non-weighted model. When I analyze the effect of weighting, the manual AIC value is lower, however the end result is that both the internal and manual AIC suggest that weighting is preferred.
> AIC(m1); n1+(n1*log(2*pi))+n1*(log(deviance(m1)/n1))+(2*(k1+1))
[1] -54.83171
[1] -54.83171
> AIC(m2); n1+(n1*log(2*pi))+n1*(log(deviance(m2)/n1))+(2*(k1+1))
[1] -64.57691
[1] -69.13025
When I try the same analysis using a nonlinear model, the difference in AIC between the internal function and manual calculation is more pronounced. Below is code for exemplar Michaelis-Menten kinetic data:
c <- c(0.5, 1, 5, 10, 30, 100, 300)
d <- c(3, 5, 20, 50, 75, 200, 250)
m3 <- nls(d ~ (V * c)/(K + c), start=list(V=10, K=1))
m4 <- nls(d ~ (V * c)/(K + c), start=list(V=10, K=1), weights=1/d^2)
n2 <- 7
k2 <- 2
The AIC are calculated as indicated for the first two models:
> AIC(m3); n2+(n2*log(2*pi))+n2*(log(deviance(m3)/n2))+(2*(k2+1))
[1] 58.48839
[1] 58.48839
> AIC(m4); n2+(n2*log(2*pi))+n2*(log(deviance(m4)/n2))+(2*(k2+1))
[1] 320.7105
[1] 0.1538546
Similar to the linear example, the internal AIC and manual AIC values are the same when data are not weighted (m3). The problem occurs with weighting (m4) as the manual AIC estimate is much lower. This situation is similar to what was asked in a related problem AIC with weighted nonlinear regression (nls).
I earlier mentioned GraphPad Prism, which for both the models and datasets given above showed lower AICs when weighting was used. My question then is why is there such a difference in the internal vs. manual AIC estimates in R when weighting the data (for which the outcome is different for nonlinear model compared to a linear one)? Ultimately, should I regard the internal AIC value or the manual value as being more correct, or am I using a wrong equation?
The discrepancy you are seeing is from using the unweighted log-likelihood formula in the manual calculations for a weighted model. For example, you can replicate the AIC results for m2 and m4 with the following adjustments:
In the case of m2, you simply need to subtract sum(log(m2$weights)) from your calculation:
AIC(m2); n1+(n1*log(2*pi))+n1*(log(deviance(m2)/n1))+(2*(k1+1)) - sum(log(m2$weights))
[1] -64.57691
[1] -64.57691
In the case of m4, you would have to swap the deviance call with a weighted residuals calculation, and subtract n2 * sum(log(m4$weights)) from your results:
AIC(m4); n2+(n2*log(2*pi))+n2*(log(sum(m4$weights * m4$m$resid()^2)/n2))+(2*(k2+1)) - n2 * sum(log(m4$weights))
[1] 320.7105
[1] 320.7105
I believe the derivation of the formula used by logLik for m2 is pretty straightforward and correct, but I am not as sure about m4. From reading some other threads about logLik.nls() (example 1, example 2), it seems there is some confusion about the correct approach for the nls estimate. To summarize, I believe AIC is correct for m2. I was not able to verify the math for the weighted nls model and would lean towards using the m2 formula again in that case (but replacing the deviance calculation with weighted residuals), or (maybe better) not using AIC for the nls model.
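The m2 adjustment can be wrapped in a small helper that works for both weighted and unweighted linear fits. This is a sketch only: manual_aic_lm is a hypothetical name, it assumes a Gaussian lm with strictly positive weights, and it fits b ~ a directly, which gives the same fitted values as poly(a,1).

```r
# Standard-curve data from the question
a <- c(0.137, 0.412, 1.23, 3.7, 11.1, 33.3)
b <- c(0.00198, 0.00359, 0.00816, 0.0220, 0.0582, 0.184)

# Hypothetical helper: AIC of a Gaussian lm from its (weighted) residual sum of squares
manual_aic_lm <- function(fit) {
  res <- residuals(fit)
  n   <- length(res)
  k   <- length(coef(fit))            # regression coefficients; sigma adds one more
  w   <- weights(fit)
  if (is.null(w)) w <- rep(1, n)      # unweighted fit: all weights equal 1
  rss_w <- sum(w * res^2)             # weighted RSS, equal to deviance(fit)
  n + n * log(2 * pi) + n * log(rss_w / n) - sum(log(w)) + 2 * (k + 1)
}

m1 <- lm(b ~ a)
m2 <- lm(b ~ a, weights = 1 / a)

rbind(manual   = c(m1 = manual_aic_lm(m1), m2 = manual_aic_lm(m2)),
      internal = c(m1 = AIC(m1),           m2 = AIC(m2)))
```

With all weights equal to 1 the sum(log(w)) term vanishes, which is why the manual and internal values already agreed for the unweighted model.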

Performing Anova on Bootstrapped Estimates from Quantile Regression

So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
i.no <- which(quantiles==i)
q.Result[[i.no]] <- rq(FML, tau=i, data, method="fn", na.action=na.omit)
}
Then I call anova.rq, which runs a Wald test on all the models and outputs a p-value for each covariate, telling me whether the effect of each covariate varies significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
That works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their errors is preferable, which I do with a slight modification of the code above.
q.Result <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit) # multiple taus give an object of class "rqs"
q.Summary <- summary(q.Result, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
Here's where I get stuck. quantreg currently cannot perform the anova (Wald) test on bootstrapped estimates. The package documentation specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the bootstrapping method.
Looking at the details of the anova.rq method, I can see that it requires 2 components not present in the quantile model when bootstrapping.
1) Hinv (Inverse Hessian Matrix). The package documentation specifically states "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J which, according to the documentation, is "Unscaled Outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv. as for the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)"
Can I calculate or estimate Hinv and J from the bootstrapped estimates? If not, what is the best way to proceed?
Any help on this is much appreciated. This is my first time posting a question here, though I've greatly benefited from the answers to other people's questions in the past.
For question 2: you can use the R argument to set the number of resampling replications. For example:
anova(object, ..., test = "Wald", joint = TRUE, score =
"tau", se = "nid", R = 10000, trim = NULL)
Where R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
Just a heads up, you'll probably get a better response to your questions if you only include 1 question per post.
I consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse' computed from bootstrapped estimates. However, we resolved that estimates from different taus could be compared using a Wald test as follows.
From the summary object produced by
q.Summary <- summary(q.Result, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest for each tau, in this case VAR, the first covariate in FML (column 2 of B, after the intercept)
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
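B0 <- coef(summary(lm(FML, data)))[2,1] # Extract the linear (OLS) estimate for VAR
Then compute the Wald statistic, which measures how far the bootstrap mean at each tau lies from the linear estimate B0, and get the p-value using the number of quantiles as the degrees of freedom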
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower.tail=FALSE)
You also want to verify that the bootstrapped betas are normally distributed. If you're running many taus it can be cumbersome to check all those QQ plots, so just sum them by row
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working. If anyone can think of anything wrong with my solution, please share.
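Summing by row gives a single QQ plot, but a sum can look normal even when one column is not. As a complement, a per-tau numeric summary avoids inspecting many plots. The sketch below uses simulated stand-in draws in place of boot.Bs (the third column is deliberately skewed) and reports sample skewness and excess kurtosis, both of which should be near 0 for normally distributed draws:

```r
set.seed(1)
R <- 2000

# Hypothetical stand-in for boot.Bs; the last column is deliberately non-normal
boot.Bs <- cbind(tau0.25 = rnorm(R, 0.5, 0.1),
                 tau0.50 = rnorm(R, 0.6, 0.1),
                 tau0.75 = rexp(R, rate = 10))

# Sample skewness and excess kurtosis for each tau
shape <- apply(boot.Bs, 2, function(x) {
  z <- (x - mean(x)) / sd(x)
  c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
})
round(shape, 2)
```

Columns with values far from 0 are the ones worth a closer look with an individual qqnorm plot.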

Finding critical values for the Pearson correlation coefficient

I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
cor(x,y)
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function( n, alpha = .05 ) {
df <- n - 2
critical.t <- qt(alpha/2, df, lower.tail = FALSE)
critical.r <- sqrt( (critical.t^2) / ( (critical.t^2) + df ) )
return(critical.r)
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r( 100 )
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
methods(cor.test)
getAnywhere(cor.test.default)
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative, less = c(-Inf, z + sigma *
qnorm(conf.level)), greater = c(z - sigma * qnorm(conf.level),
Inf), two.sided = z + c(-1, 1) * sigma * qnorm((1 +
conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
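One way to build such a table in base R is to evaluate the t-to-r conversion over a grid of sample sizes and significance levels. A sketch, assuming a two-sided test; the name critical_r and the particular grid values are illustrative choices:

```r
# Two-sided critical Pearson r for sample size n at significance level alpha
critical_r <- function(n, alpha = 0.05) {
  df     <- n - 2
  t_crit <- qt(alpha / 2, df, lower.tail = FALSE)
  sqrt(t_crit^2 / (t_crit^2 + df))
}

n_values     <- c(5, 10, 20, 30, 50, 100)
alpha_values <- c(0.10, 0.05, 0.01)

# qt() is vectorised, so outer() can evaluate every (n, alpha) pair directly
crit_table <- outer(n_values, alpha_values, critical_r)
dimnames(crit_table) <- list(n = as.character(n_values),
                             alpha = as.character(alpha_values))
round(crit_table, 3)
```

Each row gives the smallest absolute correlation that reaches significance at that sample size, so the values shrink as n grows and grow as alpha gets stricter.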
I would do the same. But if you are using a Spearman correlation you need to convert t into r using a different formula.
Just change the last line before the return in the function to this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)
