I have used the following R packages: mice, mitools, and pROC.
Basic design: 3 predictor measures with missing data rates between 5% and 70% on n~1,000. 1 binary target outcome variable.
Analytic Goal: Determine the AUROC of each of the 3 predictors.
I used the mice package to impute data and now have m datasets of imputed data.
Using the following commands, I am able to get an AUROC curve for each of the m datasets:
fit1 <- with(imp2, roc(target, symptom1, ci = TRUE))
fit2 <- with(imp2, roc(target, symptom2, ci = TRUE))
fit3 <- with(imp2, roc(target, symptom3, ci = TRUE))
I can see the estimates for each of the m datasets without any problems.
fit1
fit2
fit3
To combine the parameters across the m datasets, I attempted to use mitools:
summary(pool(fit1))
summary(pool(fit2))
summary(pool(fit3))
I get the following error message:
"Error in pool(fit): Object has no vcov() method".
When combining coefficient estimates from m datasets, my understanding is that this is a simple average of the coefficients. However, the error term is more complex.
My question: How do I pool the "m" ROC parameter estimates (AUROC and 95% C.I. or S.E.) to get an accurate estimate of the error term for significance testing/95% Confidence Intervals?
Thank you for any help in advance.
I think the following works to combine the estimates.
pROC produces a point estimate for the AUROC as well as a 95% Confidence Interval.
To combine the AUROCs from the m imputed datasets, one simply averages them.
To create an appropriate standard error estimate and then a 95% C.I., I first converted the 95% C.I.s into S.E.s. Using the standard formulas (see the Multiple Imputation FAQ), I computed the within-, between-, and total variance for the estimate. Once I had the pooled standard error, I converted it back into a 95% C.I.
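For reference, a minimal sketch of that calculation, assuming fit1 is the mira object created by with() above so that fit1$analyses holds the m pROC objects; the per-imputation S.E. is recovered from the half-width of each 95% C.I., and the within-, between-, and total variances follow Rubin's rules:
library(pROC)

# Pool m AUROC estimates with Rubin's rules (fits would be fit1$analyses)
pool_auc <- function(fits, conf = 0.95) {
  z    <- qnorm(1 - (1 - conf) / 2)
  est  <- sapply(fits, function(f) as.numeric(auc(f)))             # AUROC per imputed dataset
  se   <- sapply(fits, function(f) (f$ci[3] - f$ci[1]) / (2 * z))  # S.E. from the C.I. half-width
  m    <- length(est)
  qbar <- mean(est)                 # pooled point estimate
  ubar <- mean(se^2)                # within-imputation variance
  b    <- var(est)                  # between-imputation variance
  tvar <- ubar + (1 + 1 / m) * b    # total variance
  c(AUC = qbar, lower = qbar - z * sqrt(tvar), upper = qbar + z * sqrt(tvar))
}

pool_auc(fit1$analyses)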
If anyone has any better suggestions, I would very much appreciate it.
I would use bootstrapping with the boot package to assess the different sources of variance. For instance, for the variance due to imputation, you could use something like this:
library(boot)
library(pROC)

bootstrap.imputation <- function(d, i, symptom){
  sampled.data <- d[i, ]
  imputed.data <- ... # here the code you use to generate one imputed dataset, but apply it to sampled.data
  auc(roc(imputed.data$target, imputed.data[[symptom]]))
}

boot.n <- 2000
boot(dataset, bootstrap.imputation, R = boot.n, symptom = "symptom1") # symptom is passed through ... of boot() to bootstrap.imputation
boot(dataset, bootstrap.imputation, R = boot.n, symptom = "symptom2")
boot(dataset, bootstrap.imputation, R = boot.n, symptom = "symptom3")
Then you can do the same to assess the sampling variance of the AUC itself: impute your data once, and apply the bootstrap again (or do it with the built-in bootstrap functions of pROC).
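For that second piece, a minimal sketch using pROC's built-in bootstrap on one completed dataset (imputed1 here is a hypothetical completed dataset taken from the mice object):
library(mice)
library(pROC)

imputed1 <- complete(imp2, 1)                    # hypothetical: first completed dataset
r1 <- roc(imputed1$target, imputed1$symptom1)
ci.auc(r1, method = "bootstrap", boot.n = 2000)  # bootstrap 95% C.I. for the AUC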
I've created 12 imputed samples using the MICE package and wish to run paired-sample t-tests and a Cohen's d calculation on the imputed data, but I'm not sure how to do this. My end goal is to compare parameter estimates, t-test results, and effect size estimates from the complete-case analysis and from the adjusted (via MICE) analysis, but while I have no issue with the parameter estimates, I can't figure out the t-tests and Cohen's d.
I'm a bit confused as to how to approach this, and searching online and in the mice package documentation has not led to much progress. I did find mi.t.test from the MKmisc package, but this appears to be for datasets imputed using Amelia, not MICE, and I can't quite figure it out. Would anyone have any advice or resources here, please?
So far I have:
Identified auxiliary variables
Created Predictor Matrix
Imputed missing data m times
Fit and pooled estimates for linear models using with(), extracting the parameter estimates with summary()
Is there perhaps a way I can create an object of an imputed dataset that is usable with other analyses or am I looking at this in the wrong way?
I used multiple imputation for the first time in my own research, but maybe I can help by passing on the tips I received.
Perform the t-test on every imputed dataset.
Use mice's pool.scalar() function (the documentation is available online). For Q fill in the mean difference, and for U the squared standard error of the difference (the within-imputation variance).
Then your pooled t-value is: qbar / sqrt(t)
You can find the values of qbar and t in the output of pool.scalar.
And your pooled p-value is: 2 * (1 - pt(abs(statistic), pmax(DF, 0.001)))
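For example, a minimal sketch of those steps, assuming imp is the mids object from mice and the paired difference is score2 - score1 (hypothetical variable names):
library(mice)

# Paired t-test in each of the m completed datasets, then pool with Rubin's rules
ests <- sapply(1:imp$m, function(i) {
  d  <- complete(imp, i)
  tt <- t.test(d$score2 - d$score1)              # one-sample t-test on the paired difference
  c(Q = unname(tt$estimate),                     # mean difference
    U = unname((tt$estimate / tt$statistic)^2))  # squared standard error of the difference
})

pooled <- pool.scalar(Q = ests["Q", ], U = ests["U", ], n = nrow(complete(imp, 1)))

t.stat <- pooled$qbar / sqrt(pooled$t)                       # pooled t-value
p.val  <- 2 * (1 - pt(abs(t.stat), pmax(pooled$df, 0.001)))  # pooled p-value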
Hope this helps!
When using the GAMLSS package in R, there are many different ways to fit a distribution to a set of data. My data is a single vector of values, and I am fitting a distribution over these values.
My question is this: what is the main difference between using fitDist() and gamlss() since they give similar but different answers for parameter values, and different worm plots?
Also, using the function confint() works for gamlss() fitted objects but not for objects fitted with fitDist(). Is there any way to produce confidence intervals for parameters fitted with the fitDist() function? Is there an accuracy difference between the two procedures? Thanks!
m1 <- fitDist()
fits many distributions and chooses the best according to a generalized Akaike information criterion, GAIC(k), with penalty k for each fitted parameter in the distribution, where k is specified by the user, e.g. k = 2 for AIC, k = log(n) for BIC, and k = 4 for a Chi-squared test (rounded from 3.84, the 5% critical value of a Chi-squared distribution with 1 degree of freedom), which is my preference.
m1$fits
gives the full results from the best to worst distribution according to GAIC(k).
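For what it's worth, a minimal sketch of the two routes on a single vector y (hypothetical data; the family chosen by fitDist() is refitted with gamlss() so that confint() and a worm plot are available):
library(gamlss)

set.seed(1)
y   <- rgamma(500, shape = 2, scale = 3)      # hypothetical data vector
dat <- data.frame(y = y)

# Route 1: fitDist() searches over a set of families and ranks them by GAIC(k)
m1 <- fitDist(y, k = 2, type = "realplus")    # k = 2 corresponds to AIC
m1$fits                                       # GAIC(k) of every fitted family, best first

# Route 2: refit the chosen family with gamlss() to get the usual methods
m2 <- gamlss(y ~ 1, family = m1$family[1], data = dat)
confint(m2)                                   # Wald C.I.s for the (link-scale) parameters
wp(m2)                                        # worm plot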
I'm a bit of a newbie with stats and R, so I need a bit of direction to find a suitable post-hoc test for my glmer model.
The model has a binary dependent variable (absent/present) and the predictor variables are interaction terms between a continuous variable (e.g. temp) and a categorical variable (species, n = 3). Only the interaction terms, rather than the continuous variables in isolation, produce significant results when an anova is run on the model. Species by itself has a large effect because one species is much rarer than the others. I'm trying to tease apart how the presence of these species varies across pH and between species.
I've tried lsmeans and emmeans with a Tukey adjustment, and Firth's bias-reduced logistic regression (logistf). I ran the effects function on the interaction terms, so I had a rough expectation of what a post hoc could show, but the results logistf produced were not what I was expecting. lsmeans and emmeans both gave the same results and ignored the continuous variable, I assume because it is not a factor.
When I run Firth's regression it produces chi-squared and p values that are either infinite (the chi-squared values) or astronomically small (the p values), even though what I saw through effects suggested no significant difference. I can't tell from the interaction term whether there truly is an effect of the environmental variable or whether the significance is driven by the difference between species. Based on what I had seen of the logistf function, I didn't think it would produce a chi-square score. Is this an issue in my coding or is it because of my data?
If I wasn't clear enough about something please let me know and if anyone has any suggestions or advice, they would be massively appreciated. Thanks!
The model and test code I used are below:
### glmer model
library(lme4)
Large <- glmer(Abs.Pres ~ Species:Q.Depth + Species:Conductivity + Species:Temp +
                 Species:pH + Species:DO.P + (1 | QID),
               nAGQ = 0,
               family = binomial,
               data = Stacked_Pref)
anova(Large)
Output: Analysis of Variance Table
npar Sum Sq Mean Sq F value
Species:Q.Depth 3 234.904 78.301 78.3014
Species:Conductivity 3 32.991 10.997 10.9970
Species:Temp 3 39.001 13.000 13.0004
Species:pH 3 25.369 8.456 8.4562
Species:DO.P 3 34.930 11.643 11.6434
### Firth's
library(logistf)
Lp <- logistf(Abs.Pres ~ Species:pH, data = Stacked_Pref,
              contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
> Lp
logistf(formula = Abs.Pres ~ Species:pH, data = Stacked_Pref,
contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood
coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) 1.9711411 0.57309880 0.8552342 3.1015114 12.09107 5.066380e-04
SpeciesGoby:pH -0.3393185 0.07146049 -0.4804047 -0.2003108 23.31954 1.371993e-06
SpeciesMosquito:pH -0.3001385 0.07127771 -0.4408186 -0.1614419 18.24981 1.937453e-05
SpeciesRFBE:pH -0.4771393 0.07232469 -0.6200179 -0.3365343 45.73750 1.352096e-11
Likelihood ratio test=267.0212 on 3 df, p=0, n=3945
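For reference, one possible post-hoc along these lines compares the within-species pH slopes directly with emtrends() from the emmeans package; this is only a hedged sketch against the model Large above, not a confirmed solution:
library(emmeans)

# Estimated pH slope (on the logit scale) for each species, plus pairwise comparisons
pH.trends <- emtrends(Large, ~ Species, var = "pH")
pH.trends
pairs(pH.trends, adjust = "tukey")   # Tukey-adjusted differences between the species' slopes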
Sorry for quite a stupid question. I am doing multiple comparisons of morphological traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons are impacting my level of inference, as well as about the effect of the potential multicollinearity in my data. Perhaps a reasonable option would be to use my bootstraps to generate maximum-likelihood fits and then AICc values to compare all of my parameters, to see what comes out as most important. The problem is that although the approach is (more or less) clear to me, I don't know how to implement it in R. Can anybody be so kind as to throw some light on this for me?
So far, here is an example (using R, but not my data):
library(boot)
data(iris)
head(iris)
# The function
pearson <- function(data, indices){
  dt <- data[indices, ]
  c(
    cor(dt[, 1], dt[, 2], method = 'p'),
    median(dt[, 1]),
    median(dt[, 2])
  )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate the Pearson correlation (and the two medians) with 1000 replications
set.seed(12345)
dat <- iris[,c(1,2)]
dat <- na.omit(dat)
results <- boot(dat, statistic=pearson, R=1000)
# 95% CIs
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% (-0.2490, 0.0423 )
Calculations and Intervals on Original Scale
plot(results)
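Note that, since the statistic returns three values (the correlation and the two medians), boot.ci and plot take an index argument to pick which one; the interval shown above is for index = 1, the correlation. For example:
boot.ci(results, type = "bca", index = 1)  # BCa interval for the correlation (same as above)
boot.ci(results, type = "bca", index = 2)  # BCa interval for the median of Sepal.Length
plot(results, index = 3)                   # diagnostics for the median of Sepal.Width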
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case correlation). Multicollinearity only becomes an issue when you fit a model, e.g. multiple regression, with several highly correlated predictors.
Multiple comparisons are always a problem, though, because they inflate your type-I error rate. The way to address that is to apply a multiple-comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides, especially if you have a lot of predictors and few observations: it may lower your power so much that you won't be able to find any effect, no matter how big it is.
In a high-dimensional setting like this, your best bet may be some sort of regularized regression. With regularization, you put all predictors into your model at once, similarly to multiple regression; the trick is that you constrain the model so that all of the regression slopes are pulled towards zero, and only the ones with big effects "survive". The machine-learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted using the glmnet package. There are also Bayesian equivalents, the so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression using the brms package.
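A minimal sketch of both suggestions; the p-values and the predictor matrix here are stand-ins (using iris again), not your data:
# Multiple-comparison correction on the existing correlation p-values
pvals <- c(0.001, 0.03, 0.04, 0.20)   # hypothetical raw p-values from the pairwise tests
p.adjust(pvals, method = "holm")      # Bonferroni-Holm
p.adjust(pvals, method = "fdr")       # Benjamini-Hochberg FDR

# Regularized regression: all predictors at once, slopes shrunk towards zero
library(glmnet)
X  <- as.matrix(iris[, 2:4])          # hypothetical predictor matrix
y  <- iris$Sepal.Length               # hypothetical outcome
cv <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 gives the LASSO, alpha = 0 ridge
coef(cv, s = "lambda.min")            # slopes that "survive" the shrinkage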
I am trying to determine confidence intervals for predicted probabilities from a binomial logistic regression in R. The model is estimated using lrm (from the package rms) to allow for clustering standard errors on survey respondents (each respondent appears up to 3 times in the data):
library(rms)
model1<-lrm(outcome~var1+var2+var3,data=mydata,x=T,y=T,se.fit=T)
model.rob<-robcov(model1,cluster=respondent.id)
I am able to estimate a predicted probability for the outcome using predict.lrm:
predicted.prob<-predict(model.rob,newdata=data.frame(var1=1,var2=.33,var3=.5),
type="fitted")
What I want to determine is a 95% confidence interval for this predicted probability. I have tried specifying se.fit=T, but this is not permissible in predict.lrm when type="fitted".
I have spent the last few hours scouring the Internet for how to do this with lrm to no avail (obviously). Can anyone point me toward a method for determining this confidence interval? Alternatively, if it is impossible or difficult with lrm models, is there another way to estimate a logit with clustered standard errors for which confidence intervals would be more easily obtainable?
The help file for predict.lrm has a clear example. Here is a slight modification of it:
L <- predict(fit, newdata=data.frame(...), se.fit=TRUE)
plogis(with(L, linear.predictors + 1.96*cbind(- se.fit, se.fit)))
For some problems you may want to use the gendata or Predict functions, e.g.
L <- predict(fit, gendata(fit, var1=1), se.fit=TRUE) # leave other vars at median/mode
Predict(fit, var1=1:2, var2=3) # leave other vars at median/mode; gives CLs
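Applied to the question's setup, a sketch assuming the robcov-adjusted fit model.rob and the newdata values from above:
L <- predict(model.rob, newdata = data.frame(var1 = 1, var2 = .33, var3 = .5),
             se.fit = TRUE)
plogis(L$linear.predictors)                                         # predicted probability
plogis(with(L, linear.predictors + 1.96 * cbind(-se.fit, se.fit)))  # 95% C.I. on the probability scale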