Which MICE imputed data set to use in succeeding analysis? - r

I am not sure if I need to provide a reproducible example for this, as it is more of a general question. After running mice(), the package returns m imputed data sets, which can be extracted with the complete() function.
I am confused, however, about which data set I should use for my subsequent analysis (descriptive estimation, model building, etc.).
Questions:
1. Do I need to extract a specific data set, e.g. complete(imp, 1), or should I use the whole imputed data set, e.g. complete(imp, "long", inc = TRUE)?
If it is the latter, complete(imp, "long", inc = TRUE), how do I compute descriptives such as a mean, a proportion, etc.? For example, suppose I analyze the long data in SPSS. Should I split the data by imputation number (1 to m), compute the statistic in each part, and average the results manually? Is that how it should be done?
Thanks for your help.

You should run your statistical analysis on each of the m imputed data sets individually, then pool the results together. This allows you to take into account the additional uncertainty introduced by the imputation procedure. MICE has this functionality built in. For example, if you wanted to do a simple linear model you would do this:
fit <- with(imp, lm(y ~ x1 + x2))
est <- pool(fit)
summary(est)
Check out ?pool and ?mira
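The same with() / pool() pattern can also give a pooled descriptive such as a mean. A minimal sketch, assuming the nhanes example data that ships with mice and its bmi column (neither is from the question), and using an intercept-only model as a stand-in for "the mean":

```r
library(mice)

# nhanes ships with mice; bmi has missing values
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# An intercept-only model estimates the mean of bmi in each imputed data set
fit_mean <- with(imp, lm(bmi ~ 1))

# pool() combines the m estimates and their standard errors
pooled_mean <- summary(pool(fit_mean), conf.int = TRUE)
pooled_mean
```

The single row of pooled_mean holds the pooled mean of bmi, its standard error, and a confidence interval, so no manual averaging is needed.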

Multiple imputation consists of the following three steps:
1. Imputation
2. Analysis
3. Pooling
In the first step, m imputed datasets are generated. In the second step, the data analysis (such as a regression) is applied to each dataset separately. Finally, in the third step, the analysis results are pooled into a final result. There are various pooling techniques implemented for different parameters.
Here is a nice link describing the pooling in detail - mice Vignettes
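To make the pooling step concrete, here is a rough base-R sketch of Rubin's rules for a single scalar estimate. pool_rubin() is a hypothetical helper written for illustration (it is not part of mice), and the numbers are made up; in practice pool() does this for you and also handles the degrees of freedom.

```r
# Pool m point estimates and their squared standard errors with Rubin's rules.
# pool_rubin() is a hypothetical helper, not a mice function.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)       # pooled point estimate
  w <- mean(variances)           # within-imputation variance
  b <- var(estimates)            # between-imputation variance
  total <- w + (1 + 1 / m) * b   # total variance
  c(estimate = q_bar, std.error = sqrt(total))
}

# Made-up example: a mean and its squared standard error from m = 5 data sets
est <- c(26.1, 26.8, 25.9, 26.4, 26.3)
se2 <- c(0.81, 0.85, 0.79, 0.83, 0.80)
pooled <- pool_rubin(est, se2)
pooled
```

The pooled standard error is larger than the average within-imputation standard error, which is exactly the extra uncertainty that using a single imputed data set would hide.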

Related

Creating a data frame with pooled imputed values

I want to do a regression analysis, and in order to account for missing data in my data set, I imputed the NAs with mice. So far everything is fine: I ran mice() with m=5 and now I have the 5 imputations. The next step according to the documentation would be to do the actual regression and combine the different imputations by using pool(), so:
model <- with (data_imp, exp = ...)
summary(pool(model))
This will create the regression output for my analysis. Seems good so far.
However, I also want to provide some descriptive statistics (namely, a boxplot) on my dependent and independent variables. For that I need a data frame that contains both a) the values that were already given and b) the combined, imputed values that were inserted in place of the NAs and used in the regression model. How can I create this data.frame?
I saw in this tutorial that you can combine the imputed data with the already given values into a new data frame by using data_full <- complete(data_imp), but apparently this only works if you want to choose one specific imputation out of the 5 (for example data_full <- complete(data_imp, 1) to choose the first imputation). If you don't specify any number, I think it will just use the first imputation. However, I want to use the combined, estimated values from all 5 imputations and put them into one data frame. How can I do this?
I'd be really grateful for every piece of advice :) Thanks in advance!
I'm not entirely clear on how you want to display the data, but if you want to look at a box plot for each imputation, you can iterate through each of the 5 imputations run with complete() using lapply() or purrr::map() as per below:
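library(mice)
library(tidyverse)
# nhanes ships with mice; the default m = 5 gives five imputations
imp <- mice(nhanes)
# Stack the five completed data sets, labelling each row with its imputation number
map_dfr(1:5, ~ complete(imp, .x), .id = "imp") |>
  ggplot(aes(x = imp, y = bmi)) +
  geom_boxplot()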

How to proceed with descriptive statistics (median, IQR, frequencies, proportions etc) after multiple imputation using MICE

I have been performing multiple imputation using the mice-package (van Buuren) in R, with m = 50 (50 imputation datasets) and 20 iterations for roughly 9 variables with missing data (MAR = missing at random) ranging from 5-13 %. After this, I want to proceed with estimating descriptive statistics for my dataset (i.e. not use complete case analysis only for descriptive statistics, but also compare the results with the descriptive statistics from my imputation). So my question is now, how to proceed.
I know that the correct procedure for dealing with MICE-data is:
Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting in an object of class mira;
Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class mipo;
Optionally, compare pooled estimates from different scientific models by the D1() or D3() functions.
My problem is that I do not understand how to apply this theory on my data. I have done:
#Load package:
library(mice)
library(dplyr)
#Perform imputation:
Imp_Data <- mice(MY_DATA, m=50, method = "pmm", maxit = 20, seed = 123)
#Make the imputed data in long format:
Imp_Data_Long <- complete(Imp_Data, action = "long", include = FALSE)
I then assumed that this procedure is correct for getting the median of the BMI variable, where the .imp variable is the number of the imputation dataset (i.e. from 1-50):
BMI_Medians_50 <- Imp_Data_Long %>% group_by(.imp, Smoker) %>% summarise(Med_BMI = median(BMI))
BMI_Median_Pooled <- mean(BMI_Medians_50$Med_BMI)
I might have understood things completely wrong, but I have very much tried to understand the right procedure and found very different procedures here on StackOverflow and StatQuest.
In general, you perform multiple imputation (rather than single imputation) because you want to account for the uncertainty that comes with imputing.
The missing data is ultimately lost; we can only estimate what the real data might have looked like. With multiple imputation we generate several such estimates. The goal is to have something like a probability distribution for each imputed value.
For your descriptive statistics you do not need pooling with Rubin's rules (these matter for standard errors and other metrics of linear models). You would calculate your statistics on each of your m = 50 imputed datasets separately and then summarise them with whatever metrics you prefer.
What you want to achieve is to give your reader information about the uncertainty that comes with the imputation (and an estimate of the bounds within which the imputed values most likely lie).
Take the mean as a descriptive statistic, for example. You could report the lowest and the highest mean across the imputed datasets. You could also report the mean of these means and the standard deviation of the mean across the imputed datasets.
Your complete case analysis would just give, e.g., 3.3 as the mean of a variable. But that same mean might vary quite a lot across your m = 50 imputed datasets, e.g. from 1.1 up to 50.3. This gives you the valuable information that you should treat the 3.3 from your complete case analysis with a lot of care, and that there is a lot of uncertainty around this statistic for this dataset.
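As a rough sketch of that idea in base R: the long data frame below is a simulated stand-in for the output of complete(Imp_Data, action = "long") (only 3 imputations, and unlike real imputed data every value differs between imputations), with the .imp, Smoker and BMI columns from the question.

```r
set.seed(1)
# Toy stand-in for complete(imp, action = "long"): 3 imputations of 20 rows each
long <- data.frame(
  .imp   = rep(1:3, each = 20),
  Smoker = rep(c("no", "yes"), times = 30),
  BMI    = rnorm(60, mean = 26, sd = 4)
)

# Median BMI per imputation and smoking group
med <- aggregate(BMI ~ .imp + Smoker, data = long, FUN = median)

# Spread of those medians across imputations, per smoking group
spread <- do.call(rbind, lapply(split(med$BMI, med$Smoker), function(x) {
  c(min = min(x), mean = mean(x), max = max(x), sd = sd(x))
}))
spread
```

Each row of spread then shows, for one smoking group, the average median together with how far the medians range across imputations.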

Unequal Data for Model Compare in R

I'm fairly new to R and am trying to compare two models with the modelCompare function. However, the data set that I am working with is a bit large and has unevenly distributed missing values. When I try the following code for example:
Model_A <- lm(DV~var1*var2 + cont.var, data=df)
Model_C <- lm(DV~ cont.var, data=df)
modelCompare(Model_C,Model_A)
I get an error that the models have different N values and cannot be compared because data is differentially omitted between the two models. Is there an easy way to remove this variation, as I will be running a number of regression analyses with this data set?
What are you looking to compare? If you want to compare intercepts between the models just:
Model_A
Model_C
If you want to compare accuracy of the model, use a training and testing dataset!
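If the aim is simply to get both models onto the same N, one option is to restrict the data to rows that are complete on every variable used by the larger model before fitting either one. A minimal sketch on simulated data (the values are made up, only the variable names follow the question), using base R's anova() in place of modelCompare(). Keep in mind that dropping incomplete rows like this can bias the results if the data are not missing completely at random.

```r
set.seed(1)
# Toy data with missing values scattered over different columns
df <- data.frame(
  DV = rnorm(50), var1 = rnorm(50), var2 = rnorm(50), cont.var = rnorm(50)
)
df$var1[1:5] <- NA
df$cont.var[6:8] <- NA

# Keep only rows that are complete on every variable used by the larger model
vars <- c("DV", "var1", "var2", "cont.var")
df_cc <- df[complete.cases(df[, vars]), ]

Model_A <- lm(DV ~ var1 * var2 + cont.var, data = df_cc)
Model_C <- lm(DV ~ cont.var, data = df_cc)

# Both models now use the same rows, so the nested comparison is possible
anova(Model_C, Model_A)
```

Reusing df_cc for all further regressions keeps N constant across them.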

How to describe data after multiple imputation using Amelia (which dataset should I use)?

I did multiple imputation using Amelia using the following code
binary <- c("Gender", "Diabetes")
exclude.from.IMPUTATION <- c("Serial.ID")
NPvars <- c("age", "HDEF", "BMI")  # skewed (non-parametric) variables
a.out <- Amelia::amelia(x = for.imp.data, m = 10,
                        idvars = exclude.from.IMPUTATION,
                        noms = binary, logs = NPvars)
summary(a.out)
## save imputed datasets ##
Amelia::write.amelia(obj = a.out, file.stem = "impdata", format = "csv")
I got 10 different output CSV files (shown in the picture below), and I know from prior questions that I can use any one of them for descriptive analysis, but:
1. Why should we do MULTIPLE imputation if we will use any SINGLE file of them?
2. Some authors reported using Rubin's rules to summarize across imputations, as shown here. Please advise on how to do that.
You do not use just one of these datasets. As you correctly stated, the whole process of multiple imputation would then be useless.
As jay.sf said, the different datasets express the uncertainty of the imputation. The missing data is ultimately lost; we can only estimate what the real data could look like. With multiple imputation we generate several such estimates. Overall, this can be used to say something like: the missing data most likely lies between ... and ... .
When you generate descriptive statistics, you generate them for each of the imputed datasets separately. Taking the mean as an example, you could report the lowest and the highest mean across the imputed datasets as additional information. You could also report the mean of these means and the standard deviation of the mean across the imputed datasets. This way your readers will know how much uncertainty comes with the imputation.
You can also use your imputed datasets to describe the uncertainty in the output of linear models. You do this by using Rubin's Rules (RR) to pool parameter estimates, such as mean differences, regression coefficients and standard errors, and to derive confidence intervals and p-values (see also https://bookdown.org/mwheymans/bookmi/rubins-rules.html).

predict() method for "mice" package

I want to create an imputation strategy using the mice() function from the mice package. The problem is that I can't seem to find any predict method (or its cousins) for new data in this package.
I want to do something like this:
require(mice)
data(boys)
train_boys <- boys[1:400,]
test_boys <- boys[401:nrow(boys),]
mice_object <- mice(train_boys)
train_complete_boys <- complete(mice_object)
# Here comes a hypothetical method
test_complete_boys <- predict(mice_object, test_boys)
I would like to find some approach that would emulate the code above.
Now, it's totally possible to run mice separately on the train and test datasets, but from a logical point of view that seems incorrect: all the information you have is in the train dataset. Observations from the test dataset shouldn't provide information for each other. That's especially true when dealing with data where observations can be ordered by time of appearance.
One possible approach is to add rows from the test dataset to the train dataset one at a time, running the imputation every time. However, this seems very inelegant.
So here is the question:
Is there a method for the mice package that would be similar to the general predict method? If not, what are the possible workarounds?
Thank you!
I think it could be logically incorrect to "predict" missing values in one dataset from an imputation built on another, since the MICE algorithm builds its models iteratively to estimate the missing values from the observed values in the dataset you give it.
In other words, when you run mice_object <- mice(train_boys), the algorithm estimates and imputes the NAs from the relationships between variables in train_boys. That estimation cannot simply be applied to test_boys, because the relationships between variables in test_boys may differ from those in train_boys. The amount of observed information also differs between the two datasets.
If you believe the relationships between variables are homogeneous across train_boys and test_boys, how about doing NA imputation before splitting the dataset? i.e.:
mice_object <- mice(boys)
complete_boys <- complete(mice_object)
train_boys <- complete_boys[1:400,]
test_boys <- complete_boys[401:nrow(complete_boys),]
You can read Multiple imputation by chained equations: What is it and how does it work? if you need more information of MICE.
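If the worry is that test rows should not inform the imputation model, another possible workaround is the ignore argument of mice(). This is a sketch only, assuming a mice version recent enough to provide ignore, and restricted to four numeric columns of boys purely to keep the example simple (the column subset and the is_test name are my own choices, not from the question):

```r
library(mice)

# Simplification: numeric columns only, so every variable is imputed with pmm
dat <- boys[, c("age", "hgt", "wgt", "hc")]
is_test <- seq_len(nrow(dat)) > 400

# Rows flagged in `ignore` are still imputed, but they are left out
# when the imputation models are estimated
imp <- mice(dat, m = 5, ignore = is_test, seed = 1, printFlag = FALSE)

# One completed data set, split back into train and test rows
completed <- complete(imp, 1)
train_boys <- completed[!is_test, ]
test_boys <- completed[is_test, ]
```

This imputes train and test in one call while keeping the imputation models fitted on the train rows only. complete(imp, 1) is used here just to show the split; any analysis should still be run across all m imputations.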

Resources