I've run a multiple regression (specifically, quantile regression with multiple predictors using quantreg in R). I have estimated the standard errors and confidence intervals by bootstrapping the estimates. Now I want to test whether the estimates at different quantiles differ significantly from one another (a Wald test would be preferable). How can I do this?
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- rq(FML, tau=quantiles, data=data, method="fn", na.action=na.omit)
q.Summary <- summary(q.Result, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
From q.Summary I've extracted the bootstrapped estimates (i.e. a vector of 10000 bootstrapped B values).
Note: In reality I'm not especially interested in comparing the estimates for all the covariates in FML; I'm primarily interested in comparing the estimates for VAR. What is the best way to proceed?

I consulted with a colleague, and we concluded that estimates from different taus can be compared using a Wald test as follows.
From the summary object produced by
q.Summary <- summary(q.Result, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract, for each tau, the bootstrapped beta values for the variable of interest (here VAR, the first covariate in FML, hence column 2 after the intercept):
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # Extract the linear (OLS) estimate for VAR
Then compute the Wald statistic and get the p-value, using the number of quantiles as the degrees of freedom:
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower.tail=FALSE)
You also want to verify that the bootstrapped betas are normally distributed. If you're running many taus, checking a separate QQ plot for each one is cumbersome, so just sum them by row:
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working. If anyone can see anything wrong with my solution, please share.
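The computation in this answer can be wrapped in a small function. The following is a minimal sketch, not part of the original workflow: `wald_vs_reference` is a hypothetical helper name, and the bootstrap matrix is simulated rather than extracted from a quantreg summary. The chi-squared reference distribution assumes the per-quantile estimates are approximately normal and independent of one another, which may not hold for quantiles fitted to the same data. Note also that, as written, the statistic compares each quantile estimate with the OLS estimate rather than the quantile estimates directly with each other.

```r
# Hypothetical helper: Wald-type statistic comparing the bootstrap mean at each
# quantile with a reference estimate b0 (the answer above uses the OLS slope).
wald_vs_reference <- function(boot_mat, b0) {
  stopifnot(is.matrix(boot_mat), ncol(boot_mat) >= 1)
  est <- colMeans(boot_mat)          # bootstrap mean per quantile
  v   <- apply(boot_mat, 2, var)     # bootstrap variance per quantile
  stat <- sum((est - b0)^2 / v)      # sum of squared standardized differences
  df   <- ncol(boot_mat)             # one degree of freedom per quantile
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated stand-in for boot.Bs: 2000 draws for each of three quantiles
set.seed(1)
sim_Bs <- cbind(rnorm(2000, mean = 0.8, sd = 0.1),
                rnorm(2000, mean = 1.0, sd = 0.1),
                rnorm(2000, mean = 1.2, sd = 0.1))
res <- wald_vs_reference(sim_Bs, b0 = 1.0)
print(res)
```

With real data, `sim_Bs` would be replaced by `boot.Bs` and `b0` by `B0` from the steps above.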


Adjusted R squared using 'mice'

I am using the mice package and lmer from lme4 for my analyses. However, pool.r.squared() won't work on this output. I am looking for suggestions on how to include the computation of the adjusted R squared in the following workflow.
library(lme4)
library(mice)
imp <- mice(nhanes)
imp2 <- mice::complete(imp, "all") # This step is necessary in my analyses to include other variables/covariates following the multiple imputation
fit <- lapply(imp2, lme4::lmer,
              formula = bmi ~ (1|age) + hyp + chl)
est <- pool(fit)
You have two separate problems here.
First, there are several opinions about what an R-squared for multilevel/mixed-model regressions actually is. This is the reason why pool.r.squared does not work for you, as it does not accept results from anything other than lm(). I do not have an answer for you how to calculate something R-squared-ish for your model and since it is a statistics question – not a programming one – I am not going into detail. However, a quick search indicates that for some kinds of multilevel R-squares, there are functions available for R, e.g. mitml::multilevelR2.
Second, in order to pool a statistic across imputation samples, it should be normally distributed. Therefore, you have to transform R-squared into Fisher's Z and back-transform it after pooling. See https://stefvanbuuren.name/fimd/sec-pooling.html
In the following I assume that you have a way (or several options) to calculate your (adjusted) R-squared. Assuming that you use mitml::multilevelR2 and choose the method by LaHuis et al. (2014), you can compute and pool it across your imputations with the following steps:
# what you did before:
imp <- mice::mice(nhanes)
imp2 <- mice::complete(imp, "all")
fit_l <- lapply(imp2, lme4::lmer,
                formula = bmi ~ (1|age) + hyp + chl)
# get your R-squareds in a vector (replace `mitml::multilevelR2` with your preferred function for this)
Rsq <- lapply(fit_l, mitml::multilevelR2, print="MVP")
Rsq <- as.double(Rsq)
# convert the R-squareds into Fisher's Z-scores
Zrsq <- 1/2*log( (1+sqrt(Rsq)) / (1-sqrt(Rsq)) )
# get the variance of Fisher's Z (same for all imputation samples)
Var_z <- 1 / (nrow(imp2$`1`)-3)
Var_z <- rep(Var_z, imp$m)
# pool the Zs
Z_pool <- pool.scalar(Zrsq, Var_z, n=nrow(imp$data))$qbar
# back-transform pooled Z to Rsquared
Rsq_pool <- ( (exp(2*Z_pool) - 1) / (exp(2*Z_pool) + 1) )^2
Rsq_pool #done
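The transform and back-transform above are equivalent to base R's atanh() and tanh(). Here is a compact sketch, assuming only the pooled point estimate is needed (under Rubin's rules the pooled estimate is the mean of the per-imputation values). `pool_rsq` is a hypothetical helper name and the input values are made up for illustration:

```r
# Hypothetical helper: pool R-squared values across imputations via Fisher's z.
pool_rsq <- function(rsq) {
  stopifnot(all(rsq >= 0 & rsq < 1))
  z <- atanh(sqrt(rsq))   # Fisher's z of R: 1/2 * log((1 + R) / (1 - R))
  tanh(mean(z))^2         # average on the z scale, then back to R-squared
}

# Made-up R-squared values, one per imputed data set
rsq_example <- c(0.30, 0.34, 0.28, 0.33, 0.31)
pooled <- pool_rsq(rsq_example)

# Same calculation written out with the explicit formulas used above
z_explicit <- 1/2 * log((1 + sqrt(rsq_example)) / (1 - sqrt(rsq_example)))
z_bar <- mean(z_explicit)
pooled_explicit <- ((exp(2 * z_bar) - 1) / (exp(2 * z_bar) + 1))^2

print(c(pooled = pooled, pooled_explicit = pooled_explicit))
```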

How to validate performance of generalized linear regression model

I'm trying to validate the performance of a generalized linear model that has a continuous output. From my research, the usual way to validate a model with a continuous outcome is to use R-squared, adjusted R-squared and RMSE (correct me if I'm wrong) rather than the confusion-matrix measures (accuracy, precision, F1 etc.) used for binomial models.
How do I find the R-squared value for my model, based on the actual vs. predicted values? Below is the code for my glm model; the data has been split into train and test sets.
I'm quite new to this, so I'm open to suggestions.
LR_swim <- glm(racetime_mins ~ event_month +gender + place +
clocktime_mins +handicap_mins +
Air_Temp_Celsius +Water_Temp_Celsius +Wave_Height_m,
data = SwimmingTrain,
family=gaussian(link = "identity"))
#Predict Race_Time
pred_LR <- predict(LR_swim, SwimmingTest, type ="response")
Such performance measures can be implemented with a simple line of R code. So, for some dummy data:
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
the mean squared error (MSE) is simply
mean((preds - actuals)^2)
# [1] 0.09
while the mean absolute error (MAE) is
mean(abs(preds - actuals))
# [1] 0.2333333
and the root mean squared error (RMSE) is simply the square root of the MSE, i.e.:
sqrt(mean((preds - actuals)^2))
# [1] 0.3
The last two measures have the additional advantage of being on the same scale as your original data (which is not the case for MSE).
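The question also asks about R-squared, which can be computed from actual and predicted values in the same way. A minimal sketch using the same dummy data; `r_squared` and `adj_r_squared` are hypothetical helper names, and the number of predictors `p = 1` in the adjusted version is an arbitrary choice for illustration (with the swimming model it would be the number of estimated slope coefficients):

```r
preds   <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Adjusted R-squared for n observations and p predictors (requires n > p + 1)
adj_r_squared <- function(actual, predicted, p) {
  n <- length(actual)
  stopifnot(n > p + 1)
  1 - (1 - r_squared(actual, predicted)) * (n - 1) / (n - p - 1)
}

r2     <- r_squared(actuals, preds)
adj_r2 <- adj_r_squared(actuals, preds, p = 1)
print(c(r2 = r2, adj_r2 = adj_r2))
```

On held-out test data, R-squared computed this way can be negative if the model predicts worse than the mean of the test outcomes.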

AIC in R: differences in manual vs. internal value when using weighted data

I am attempting to use R for model selection based on the AIC statistic. When comparing linear models with or without weighting, my code in R informs me that weighting is preferable compared to no-weighting, and these results are confirmed in other software (GraphPad Prism). I have sample code using real data from a standard curve:
#Linear Curve Fitting
a <- c(0.137, 0.412, 1.23, 3.7, 11.1 ,33.3)
b <- c(0.00198, 0.00359, 0.00816, 0.0220, 0.0582, 0.184)
m1 <- lm(b ~ poly(a,1))
m2 <- lm(b ~ poly(a,1), weight=1/a)
n1 <- 6 #Number of observations
k1 <- 2 #Number of parameters
When I calculate AIC using either the internal function in R or via manual calculation in which:
AIC = n + n log 2π + n log(RSS/n) + 2(k + 1) with n observations and k parameters
I get equivalent AIC values for the non-weighted model. When I analyze the effect of weighting, the manual AIC value is lower, however the end result is that both the internal and manual AIC suggest that weighting is preferred.
> AIC(m1); n1+(n1*log(2*pi))+n1*(log(deviance(m1)/n1))+(2*(k1+1))
[1] -54.83171
[1] -54.83171
> AIC(m2); n1+(n1*log(2*pi))+n1*(log(deviance(m2)/n1))+(2*(k1+1))
[1] -64.57691
[1] -69.13025
When I try the same analysis using a nonlinear model, the difference in AIC between the internal function and the manual calculation is more pronounced. Below is code with exemplar Michaelis-Menten kinetic data:
c <- c(0.5, 1, 5, 10, 30, 100, 300)
d <- c(3, 5, 20, 50, 75, 200, 250)
m3 <- nls(d ~ (V * c)/(K + c), start=list(V=10, K=1))
m4 <- nls(d ~ (V * c)/(K + c), start=list(V=10, K=1), weight=1/d^2)
n2 <- 7
k2 <- 2
The AIC are calculated as indicated for the first two models:
> AIC(m3); n2+(n2*log(2*pi))+n2*(log(deviance(m3)/n2))+(2*(k2+1))
[1] 58.48839
[1] 58.48839
> AIC(m4); n2+(n2*log(2*pi))+n2*(log(deviance(m4)/n2))+(2*(k2+1))
[1] 320.7105
[1] 0.1538546
Similar to the linear example, the internal AIC and manual AIC values are the same when data are not weighted (m3). The problem occurs with weighting (m4) as the manual AIC estimate is much lower. This situation is similar to what was asked in a related problem AIC with weighted nonlinear regression (nls).
I earlier mentioned GraphPad Prism, which for both the models and datasets given above showed lower AICs when weighting was used. My question then is why is there such a difference in the internal vs. manual AIC estimates in R when weighting the data (for which the outcome is different for nonlinear model compared to a linear one)? Ultimately, should I regard the internal AIC value or the manual value as being more correct, or am I using a wrong equation?
The discrepancy you are seeing is from using the unweighted log-likelihood formula in the manual calculations for a weighted model. For example, you can replicate the AIC results for m2 and m4 with the following adjustments:
In the case of m2, you simply need to subtract sum(log(m2$weights)) from your calculation:
AIC(m2); n1+(n1*log(2*pi))+n1*(log(deviance(m2)/n1))+(2*(k1+1)) - sum(log(m2$weights))
[1] -64.57691
[1] -64.57691
In the case of m4, you would have to swap the deviance call with a weighted residuals calculation, and subtract n2 * sum(log(m4$weights)) from your results:
AIC(m4); n2+(n2*log(2*pi))+n2*(log(sum(m4$weights * m4$m$resid()^2)/n2))+(2*(k2+1)) - n2 * sum(log(m4$weights))
[1] 320.7105
[1] 320.7105
I believe the derivation of the formula used by logLik for m2 is pretty straightforward and correct, but I am not as sure about m4. From reading some other threads about logLik.nls() (example 1, example 2), it seems there is some confusion about the correct approach for the nls estimate. To summarize, I believe AIC is correct for m2; I was not able to verify the math for the weighted nls model and would lean towards using the m2 formula again in that case (but replacing the deviance calculation with weighted residuals), or (maybe better) not using AIC for the nls model.
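For the linear case, the weight correction can be folded into a single function that covers both weighted and unweighted fits. A minimal sketch, assuming strictly positive weights; `manual_aic_lm` is a hypothetical helper name, and the models are refitted here with `b ~ a` (equivalent to `poly(a, 1)` for the residual sum of squares) so the block runs on its own:

```r
a <- c(0.137, 0.412, 1.23, 3.7, 11.1, 33.3)
b <- c(0.00198, 0.00359, 0.00816, 0.0220, 0.0582, 0.184)
m1 <- lm(b ~ a)                   # unweighted
m2 <- lm(b ~ a, weights = 1 / a)  # weighted

# Hypothetical helper: manual AIC for an lm fit, including the weight term
manual_aic_lm <- function(fit) {
  n <- length(residuals(fit))
  k <- length(coef(fit))                  # regression coefficients
  w <- weights(fit)                       # NULL for an unweighted fit
  if (is.null(w)) w <- rep(1, n)
  # deviance() is the weighted residual sum of squares for weighted fits
  n + n * log(2 * pi) + n * log(deviance(fit) / n) +
    2 * (k + 1) - sum(log(w))             # k + 1 counts the error variance
}

print(c(AIC(m1), manual_aic_lm(m1)))
print(c(AIC(m2), manual_aic_lm(m2)))
```

For the unweighted model the extra term is zero, which is why the original manual formula already agreed with AIC(m1).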

Finding critical values for the Pearson correlation coefficient

I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function( n, alpha = .05 ) {
  df <- n - 2
  critical.t <- qt(alpha/2, df, lower.tail = FALSE)
  critical.r <- sqrt( (critical.t^2) / ( (critical.t^2) + df ) )
  return( critical.r )
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r( 100 )
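One way to sanity-check this conversion is to turn the critical r back into a t-statistic and confirm that its two-sided p-value equals alpha. A minimal sketch; the name `critical_r` (with an underscore) is used only to keep this example self-contained:

```r
# Critical Pearson r for a two-sided test at level alpha with sample size n
critical_r <- function(n, alpha = 0.05) {
  df <- n - 2
  t_crit <- qt(alpha / 2, df, lower.tail = FALSE)
  sqrt(t_crit^2 / (t_crit^2 + df))
}

n <- 100
r_crit <- critical_r(n)

# Convert r back to t and compute the two-sided p-value
t_back <- r_crit * sqrt(n - 2) / sqrt(1 - r_crit^2)
p_back <- 2 * pt(t_back, df = n - 2, lower.tail = FALSE)

print(c(r_crit = r_crit, p_back = p_back))
```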
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative, less = c(-Inf, z + sigma *
qnorm(conf.level)), greater = c(z - sigma * qnorm(conf.level),
Inf), two.sided = z + c(-1, 1) * sigma * qnorm((1 +
conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
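As a rough illustration of building such a table, the t-to-r conversion can be applied over a grid of degrees of freedom and significance levels with outer(). This is only a sketch for two-sided tests; `critical_r_df` is a hypothetical helper name and the grid values are arbitrary:

```r
# Critical Pearson r for a two-sided test, parameterized by degrees of freedom
critical_r_df <- function(df, alpha) {
  t_crit <- qt(alpha / 2, df, lower.tail = FALSE)
  sqrt(t_crit^2 / (t_crit^2 + df))
}

dfs    <- c(5, 10, 20, 50, 100)
alphas <- c(0.10, 0.05, 0.01)

# Rows are degrees of freedom, columns are significance levels
tab <- outer(dfs, alphas, critical_r_df)
dimnames(tab) <- list(df = as.character(dfs), alpha = as.character(alphas))
print(round(tab, 3))
```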
I would do the same. But if you are using a Spearman correlation, you need to convert t into r using a different formula.
Just replace the last line before the return in the function with this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)
