Plot Multiple Imputation Results in R

I have successfully completed a multiple imputation on the missing data of my questionnaire research using the mice package in R and performed a linear regression on the pooled imputed variables. I can't work out how to extract single pooled variables and plot them in a graph. Any ideas?
e.g.
> imp <- mice(questionnaire)
> fit <- with(imp, lm(APE ~ TMAS + APB + APA + FOAP))
> summary(pool(fit))
I want to plot pooled APE by TMAS.
Reproducible Example using nhanes:
> library(mice)
> nhanes
> imp <- mice(nhanes)
> fit <- with(imp, lm(bmi ~ chl + hyp))
> fit
> summary(pool(fit))
I would like to plot pooled chl against pooled bmi (for example).
The best I have been able to achieve is:
> mat <- complete(imp, "long")
> plot(mat$chl ~ mat$bmi)
which, I believe, gives a plot of all 5 imputed datasets stacked together, and is not quite what I am looking for (I think).

The underlying with.mids() function carries the regression out on each imputed data frame separately, so it is not one regression but 5 regressions that happened. pool() then averages the estimated coefficients and adjusts the variances for the statistical inference to account for the between-imputation variability (Rubin's rules).
So there are no single pooled variables to plot. What you can do is average the 5 imputed datasets and recreate some kind of "regression line" based on the pooled coefficients, e.g.:
# average each variable over the imputations, per original row (.id)
combchl <- tapply(mat$chl, mat$.id, mean)
combbmi <- tapply(mat$bmi, mat$.id, mean)
combhyp <- tapply(mat$hyp, mat$.id, mean)
# pooled coefficients
coefs <- pool(fit)$qbar
# evaluate the pooled regression over a grid of predictor values
x <- data.frame(
  int = rep(1, 25),
  chl = seq(min(combchl), max(combchl), length.out = 25),
  hyp = seq(min(combhyp), max(combhyp), length.out = 25)
)
y <- as.matrix(x) %*% coefs
# plot the averaged data with the pooled regression line
plot(combbmi ~ combchl)
lines(x$chl, y, col = "red")
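A version note (an assumption about your installed mice, not part of the original answer): in mice 3.x, `pool()` no longer exposes `$qbar`; the pooled estimates live in the `$pooled` data frame instead:
# mice >= 3.0: extract the pooled coefficients from the mipo object
coefs <- pool(fit)$pooled$estimate
# equivalently: summary(pool(fit))$estimate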

Related

IPW-adjusted Kaplan-Meier analysis and IPW-adjusted RMST analysis after multiple imputation

I would like to do the following analyses on a dataset with missing values. Because the mice and MatchThem packages do not support pooling the results of a Kaplan-Meier analysis, I am trying to do it manually, as follows:
1. Do multiple imputation using mice.
2. Calculate inverse probability weights in each imputed dataset using WeightIt.
3. Estimate IPW-adjusted Kaplan-Meier curves in each imputed dataset using survfit.
4. Pool the results of #3 and plot the pooled IPW-adjusted KM curves.
5. Calculate the difference in IPW-adjusted restricted mean survival time (the area under the KM curve up to a specific timepoint) according to akm-rmst (https://github.com/s-conner/akm-rmst) within each imputed dataset.
6. Pool the results of #5.
7. Get descriptive statistics of the baseline characteristics in the imputed datasets using tbl_summary from the gtsummary package.
Here is my code:
pacman::p_load(survival, survey, survminer, WeightIt, tidyverse, mice)
df # sample dataset
m <- 10 # number of imputations
dimp <- mice::mice(df, m = m, seed = 123)
for (i in 1:m) {
  dcomp <- mice::complete(dimp, i) # extract the i-th imputed dataset
  # estimate weights
  wgt <- weightit(
    treatment ~ age + sex + smoking,
    data = dcomp, method = "ps", estimand = "ATE", stabilize = TRUE
  )
  # add weights and propensity scores to the dataset
  # (use a new name here; reassigning to `dimp` would clobber the mids object)
  dwtd <- tibble(dcomp, wgt = wgt[["weights"]], pscores = wgt[["ps"]])
  assign(paste0("df", i), dwtd) # save i-th weighted imputed dataset
  # calculate the IPW-adjusted Kaplan-Meier estimate
  surv <- survival::survfit(Surv(time, event) ~ treatment, data = dwtd, weights = wgt)
  assign(paste0("surv", i), surv) # save i-th IPW-adjusted KM curves
}
This code performs analyses #1 to #3. Although I read the reference (https://stefvanbuuren.name/fimd/sec-pooling.html), I could not work out how to do analyses #4 to #7. Can anyone give me some advice on those steps?
I believe this is not a duplicate of any posted question, so I'd appreciate any assistance you can provide.
Regarding your point #7: when your imputation presumably needs a high number of datasets (m = 20, 40, or > 50), you cannot just pick one dataset at random; you risk a type I error and lose the benefit of your imputation. I had the same concerns as you. This thread could help you (but only for a summary of imputed descriptive data): Björn's answer on StackExchange.
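For step #4, one possible approach (a sketch, not from this thread; the 0-60 time grid `tgrid` is an assumption to adapt to your follow-up period) is to evaluate each imputation's survfit object on a common time grid and average the survival probabilities across the m imputations:
# minimal sketch: pool the KM curves by averaging survival probabilities
# across imputations on a common time grid
tgrid <- seq(0, 60, by = 1)
surv_list <- mget(paste0("surv", 1:m)) # the survfit objects saved in the loop
# survival probability per imputation at the common times;
# extend = TRUE keeps every column the same length
km_mat <- sapply(surv_list, function(s)
  summary(s, times = tgrid, extend = TRUE)$surv)
km_pooled <- rowMeans(km_mat) # point estimate: mean over the m imputations
# caveats: with Surv(time, event) ~ treatment the $surv vector stacks both
# treatment strata, so split it by summary(...)$strata before averaging;
# the FIMD chapter linked above also discusses transforming probabilities
# (e.g. complementary log-log) before pooling for better-behaved inference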

Adjusted R squared using 'mice'

I am using the mice package and lmer from lme4 for my analyses. However, pool.r.squared() won't work on this output. I am looking for suggestions on how to include the computation of the adjusted R squared in the following workflow.
library(lme4)
library(mice)
imp <- mice(nhanes)
# this step is necessary in my analyses to include other variables/covariates
# following the multiple imputation
imp2 <- mice::complete(imp, "all")
fit <- lapply(imp2, lme4::lmer,
              formula = bmi ~ (1 | age) + hyp + chl,
              REML = TRUE)
est <- pool(fit)
summary(est)
You have two separate problems here.
First, there are several opinions about what an R-squared for multilevel/mixed-model regressions actually is. This is why pool.r.squared does not work for you: it does not accept results from anything other than lm(). I do not have an answer for how to calculate something R-squared-ish for your model, and since that is a statistics question rather than a programming one, I will not go into detail. However, a quick search indicates that for some kinds of multilevel R-squared there are functions available in R, e.g. mitml::multilevelR2.
Second, in order to pool a statistic across imputation samples, it should be normally distributed. You therefore have to transform the R-squared into Fisher's Z and back-transform it after pooling. See https://stefvanbuuren.name/fimd/sec-pooling.html
In the following I assume that you have a way (or several options) to calculate your (adjusted) R-squared. Assuming that you use mitml::multilevelR2 and choose the method by LaHuis et al. (2014), you can compute and pool it across your imputations with the following steps:
# what you did before:
imp <- mice::mice(nhanes)
imp2 <- mice::complete(imp, "all")
fit_l <- lapply(imp2, lme4::lmer,
                formula = bmi ~ (1 | age) + hyp + chl,
                REML = TRUE)
# get your R-squareds in a vector (replace `mitml::multilevelR2` with your preferred function)
Rsq <- lapply(fit_l, mitml::multilevelR2, print = "MVP")
Rsq <- as.double(Rsq)
# convert the R-squareds into Fisher's Z-scores
Zrsq <- 1/2 * log((1 + sqrt(Rsq)) / (1 - sqrt(Rsq)))
# variance of Fisher's Z (the same for all imputation samples)
Var_z <- 1 / (nrow(imp2$`1`) - 3)
Var_z <- rep(Var_z, imp$m)
# pool the Zs (note: a mids object has no $n component, so use the sample size)
Z_pool <- pool.scalar(Zrsq, Var_z, n = nrow(imp$data))$qbar
# back-transform the pooled Z to R-squared
Rsq_pool <- ((exp(2 * Z_pool) - 1) / (exp(2 * Z_pool) + 1))^2
Rsq_pool # done

What is the correct way to use weights in a logistic regression in R?

My data includes survey data on car buyers. It has a weight column that I used in SPSS to get sample sizes; the weight column is affected by demographic factors and vehicle sales. Now I am trying to put together a logistic regression model for a car segment which includes a few vehicles. I want to use the weight column in the logistic regression model, and I tried to do so using "weights" in the glm function. But the results are horrific: deviances are too high and the McFadden R-squared is too low. My dependent variable is binary, the independent variables are on a 1 to 5 scale, and the weight column is numeric, ranging from 32 to 197. Could that be a reason the results are poor? Do I need values in the weight column below 1?
The format of the input file to R is:
WGT output I1 I2 I3 I4 I5
 67      1  1  3  1  5  4
with I1, I2, I3, ... being the independent variables.
logr <- glm(output ~ 1, data = data1, weights = WGT, family = "binomial")
logrstep <- step(logr, direction = "both", scope = formula(data1))
logr1 <- glm(output ~ (formula from final iteration), weights = WGT, data = data1, family = "binomial")
hl <- hoslem.test(data1$output, fitted(logr1), g = 10)
I want a logistic regression model with better accuracy and a better understanding of how to use weights with logistic regression.
I would check out the survey package. It lets you specify the weights for the survey design using the svydesign function, and you can then use svyglm to perform your weighted logistic regression. (Note that glm() treats its weights argument as frequency/precision weights rather than sampling weights, which inflates the apparent sample size; that is likely why your deviances look so bad.) See http://r-survey.r-forge.r-project.org/survey/
Something like the following, assuming your data is in a data frame called df:
my_svy <- svydesign(ids = ~1, data = df, weights = ~WGT)
Then you can do the following:
my_fit <- svyglm(output ~ 1, my_svy, family = "binomial")
For a full reprex, see the example below:
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
As far as the stepwise regression goes, this typically isn't a great methodology from a statistical point of view. It produces biased-high R-squared values and causes other problems for inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).

Pooling sandwich variance estimator over multiply imputed datasets

I am running a Poisson regression on multiply imputed data to predict a common binary outcome. After running mice, I have obtained a stacked data frame comprising the raw data and five imputed datasets. Here is a toy example:
df <- mice::nhanes
imp <- mice(df) # impute the data
com <- complete(imp, "long", TRUE) # stacked data frame; include = TRUE keeps the raw data
I now want to:
1. Run the regression on each imputed dataset
2. Calculate robust standard errors using a sandwich variance estimator
3. Combine / pool the results of both analyses
I can run the regression on the mids object using the with and pool commands:
fit.pois.mids <- with(imp, glm(hyp ~ age + bmi + chl, family = poisson))
summary(pool(fit.pois.mids))
I can also run the regression on each of the imputed datasets before combining them:
# create a named list of data frames: the raw data plus each imputed dataset
imp.df <- split(com, com$.imp)
names(imp.df) <- c("raw", "imp1", "imp2", "imp3", "imp4", "imp5")
fit.pois <- lapply(imp.df, function(x) {
  glm(hyp ~ age + bmi + chl, data = x, family = poisson)
})
summary(MIcombine(fit.pois)) # MIcombine() comes from the mitools package
Similarly, I can calculate the robust standard errors for each imputed dataset (coeftest() is from the lmtest package, sandwich() from the sandwich package):
sand <- lapply(fit.pois, function(x) {
  coeftest(x, vcov = sandwich)
})
Unfortunately, MIcombine does not seem to return p-values. This post suggests using Zelig, but for that matter, I may as well just use mice. Further, it does not appear to be possible to combine the estimates of the standard errors:
summary(MIcombine(sand))
Error in UseMethod("vcov") :
  no applicable method for 'vcov' applied to an object of class "coeftest"
For the sake of simplicity, it seems that mice is the better option for pooling the results of the regression; however, I am wondering how I would go about pooling and combining the robust standard errors. What are some ways this could be addressed?
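One possible route (a sketch, not an answer from this thread): mitools::MIcombine also accepts a list of coefficient vectors plus a list of variance matrices, so you can feed it the sandwich covariance from each imputed fit and compute Wald p-values from the pooled result:
library(sandwich)
library(mitools)
# drop the fit on the raw (incomplete) data; keep the five imputed fits
fits <- fit.pois[names(fit.pois) != "raw"]
betas <- lapply(fits, coef)       # coefficient vector per imputation
vcovs <- lapply(fits, sandwich)   # robust (sandwich) vcov per imputation
pooled <- MIcombine(betas, vcovs) # Rubin's rules applied to the robust variances
summary(pooled)
# Wald p-values from the pooled estimates and robust SEs
# (a normal approximation; for small samples consider a t reference
# with Barnard-Rubin degrees of freedom)
est <- coef(pooled)
se <- sqrt(diag(vcov(pooled)))
2 * pnorm(-abs(est / se))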

GLM BACI analysis in R

I am trying to conduct a BACI analysis in R using logistic regression. Due to the use of reference levels in the output of GLMs, I am having difficulty interpreting my results. Has anyone had any luck retrieving a summary of all pairwise interactions?
(Depth is a continuous predictor variable, but I can convert it to a categorical if necessary.)
Towards <- c(4, 7, 9, 0, 15, 10, 11, 23, 1, 4)
Total <- c(6, 14, 10, 7, 15, 12, 20, 41, 5, 8)
Depth <- c(-.3, -.25, -.21, -.17, -.05, 0, 0, .25, .5, .56)
DPM <- c("Pre", "Post", "Pre", "Pre", "Post", "Pre", "Post", "Post", "Post", "Pre")
Proximity <- c("Far", "Near", "East", "East", "East", "Near", "Far", "Far", "Near", "Far")
Area <- c("DPM", "Control", "Control", "DPM", "Control", "Control",
          "DPM", "DPM", "Control", "DPM")
Data <- data.frame(Towards, Total, Depth, DPM, Proximity, Area)
# the original post referenced `Site` and `LogReg`, which are not defined above;
# `Area` and `Data` are substituted here (an assumption) so the example runs
mod <- glm(cbind(Towards, Total - Towards) ~ DPM * Area * Depth,
           data = Data, family = binomial('logit'))
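Not from this thread, but one common way to get all pairwise comparisons from a GLM with interactions is the emmeans package, which computes estimated marginal means for each factor combination at a fixed value of the continuous covariate (holding Depth at its mean is an assumption here) and tests all pairwise contrasts:
library(emmeans)
# estimated marginal means for every DPM x Area cell, Depth held at its mean,
# followed by Tukey-adjusted pairwise contrasts on the log-odds scale
emm <- emmeans(mod, ~ DPM * Area, at = list(Depth = mean(Data$Depth)))
pairs(emm)
# or test whether the Pre/Post effect differs between Areas (the BACI contrast)
contrast(emm, interaction = "pairwise")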
