predict() method for "mice" package - r

I want to create an imputation strategy using the mice function from the mice package. The problem is I can't seem to find any predict method (or its cousins) for new data in this package.
I want to do something like this:
require(mice)
data(boys)
train_boys <- boys[1:400,]
test_boys <- boys[401:nrow(boys),]
mice_object <- mice(train_boys)
train_complete_boys <- complete(mice_object)
# Here comes a hypothetical method
test_complete_boys <- predict(mice_object, test_boys)
I would like to find some approach that would emulate the code above.
Now, it's entirely possible to run separate mice operations on the train and test datasets, but from a logical point of view that seems incorrect - all the information you have is in the train dataset. Observations from the test dataset shouldn't provide information for each other. That's especially true when dealing with data where observations can be ordered by time of appearance.
One possible approach is to add rows from the test dataset to the train dataset iteratively, running the imputation every time (see the sketch below). However, this seems very inelegant.
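For concreteness, here is a minimal sketch of that iterative workaround (hypothetical object names, reusing train_boys and test_boys from the code above):
imputed_test <- test_boys
for (i in seq_len(nrow(test_boys))) {
  # re-run the imputation on the train data plus a single new test row
  imp_i <- mice(rbind(train_boys, test_boys[i, ]), m = 1, printFlag = FALSE)
  filled <- complete(imp_i)
  # keep only the freshly imputed test row
  imputed_test[i, ] <- filled[nrow(filled), ]
}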
So here is the question:
Is there a method for the mice package that would be similar to the general predict method? If not, what are the possible workarounds?
Thank you!

I think it could be logically incorrect to "predict" missing values in another dataset, since the MICE algorithm iteratively builds models that estimate the missing values from the observed values in your given dataset.
In other words, when you do mice_object <- mice(train_boys), the algorithm estimates and imputes the NAs using the relationships between variables in train_boys. However, such estimation cannot be applied to test_boys, because the relationships between variables in test_boys may differ from those in train_boys. Also, the amount of observed information differs between these two datasets.
If you believe the relationships between variables are homogeneous across train_boys and test_boys, how about doing NA imputation before splitting the dataset? i.e.:
mice_object <- mice(boys)
complete_boys <- complete(mice_object)
train_boys <- complete_boys[1:400,]
test_boys <- complete_boys[401:nrow(complete_boys),]
You can read Multiple imputation by chained equations: What is it and how does it work? if you need more information on MICE.

Related

How to proceed with descriptive statistics (median, IQR, frequencies, proportions etc) after multiple imputation using MICE

I have been performing multiple imputation using the mice package (van Buuren) in R, with m = 50 (50 imputed datasets) and 20 iterations for roughly 9 variables with missing data (MAR = missing at random) ranging from 5-13%. After this, I want to estimate descriptive statistics for my dataset (i.e. not use complete case analysis only for the descriptive statistics, but also compare those results with the descriptive statistics from my imputation). So my question now is how to proceed.
I know that the correct procedure for dealing with MICE-data is:
Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting an object of class mira;
Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class mipo;
Optionally, compare pooled estimates from different scientific models by the D1() or D3() functions.
My problem is that I do not understand how to apply this theory on my data. I have done:
#Load packages:
library(mice)
library(dplyr)
#Perform imputation:
Imp_Data <- mice(MY_DATA, m=50, method = "pmm", maxit = 20, seed = 123)
#Make the imputed data in long format:
Imp_Data_Long <- complete(Imp_Data, action = "long", include = FALSE)
I then assumed that this procedure is correct for getting the median of the BMI variable, where the .imp variable is the number of the imputation dataset (i.e. from 1-50):
BMI_Medians_50 <- Imp_Data_Long %>% group_by(.imp, Smoker) %>% summarise(Med_BMI = median(BMI))
BMI_Median_Pooled <- mean(BMI_Medians_50$Med_BMI)
I might have understood things completely wrong, but I have tried very hard to understand the right procedure, and I have found very different procedures here on StackOverflow and on StatQuest.
Well, in general you perform multiple imputation (as opposed to single imputation) because you want to account for the uncertainty that comes with performing imputation.
The missing data is ultimately lost - we can only estimate what the real data might look like. With multiple imputation we generate multiple estimates of what the real data could look like. Our goal is to have something like a probability distribution for the imputed values.
For your descriptive statistics you do not need pooling with Rubin's rules (those matter for standard errors and other metrics of linear models). You would calculate your statistics on each of your m = 50 imputed datasets separately and then summarize them with your desired metrics.
What you want to achieve is to give your reader information about the uncertainty that comes with the imputation (and an estimate of the bounds within which the imputed values most likely lie).
Look, for example, at the mean as a descriptive statistic. Here you could provide the lowest and the highest mean over these imputed datasets, as well as the mean of these means and the standard deviation of the mean over the imputed datasets.
Your complete case analysis might provide e.g. 3.3 as the mean of a variable. But maybe the same mean varies quite a lot across your m = 50 imputed datasets, e.g. from 1.1 up to 50.3. This gives you the valuable information that you should take the 3.3 from your complete case analysis with a lot of care, and that in general there is a lot of uncertainty in this kind of statistic for this dataset.
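As a minimal sketch of these summaries (reusing Imp_Data from the question above, and assuming a numeric BMI column):
Imp_Data_Long <- complete(Imp_Data, action = "long")
# mean BMI within each of the m = 50 imputed datasets
means_per_imp <- tapply(Imp_Data_Long$BMI, Imp_Data_Long$.imp, mean)
range(means_per_imp)  # lowest and highest mean across imputations
mean(means_per_imp)   # mean of the means
sd(means_per_imp)     # spread of the means across imputations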

How to describe data after multiple imputation using Amelia (which dataset should I use)?

I did multiple imputation using Amelia using the following code
binary <- c("Gender", "Diabetes")
exclude.from.IMPUTATION <- c("Serial.ID")
NPvars <- c("age", "HDEF", "BMI")  # skewed (non-parametric) variables
a.out <- Amelia::amelia(x = for.imp.data, m = 10,
                        idvars = exclude.from.IMPUTATION,
                        noms = binary, logs = NPvars)
summary(a.out)
## save imputed datasets ##
Amelia::write.amelia(obj=a.out, file.stem = "impdata", format = "csv")
I got 10 different output CSV files, one per imputation,
and I know that I can use any one of them for descriptive analysis, as shown in prior questions, but:
Why should we do MULTIPLE imputation if we will use only a SINGLE file
of them?
Some authors report using Rubin's rules to summarize across
imputations, as shown here. Please advise on how to do that.
You do not use just one of these datasets. As you correctly stated, the whole process of multiple imputation would then be useless.
As jay.sf said, the different datasets express the uncertainty of the imputation. The missing data is ultimately lost - we can only estimate what the real data could look like. With multiple imputation we generate multiple estimates of what the real data could look like. Overall, this can be used to say something like: the missing data most likely lies between ... and ... .
When you generate descriptive statistics, you generate them for each of the imputed datasets separately. Looking, for example, at the mean, you could then provide as additional information the lowest and the highest mean over these imputed datasets. You can also provide the mean of these means and the standard deviation of the mean over the imputed datasets. This way your readers will know how much uncertainty comes with the imputation.
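A minimal sketch of this, assuming a.out from the question above and a numeric column such as BMI (Amelia stores the completed datasets in the list a.out$imputations):
means <- sapply(a.out$imputations, function(d) mean(d$BMI))
range(means)  # lowest and highest mean across the 10 imputations
mean(means)   # mean of the means
sd(means)     # spread of the means across the imputations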
You can also use your imputed datasets to describe the uncertainty in the output of linear models. You do this by using Rubin's Rules (RR) to pool parameter estimates, such as mean differences, regression coefficients and standard errors, and to derive confidence intervals and p-values (see also https://bookdown.org/mwheymans/bookmi/rubins-rules.html).
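For a single coefficient, Rubin's rules can also be hand-rolled, as in this sketch (fits is a hypothetical list of lm() fits, one per imputed dataset, and "x" the coefficient of interest):
ests <- sapply(fits, function(f) coef(f)["x"])
ses  <- sapply(fits, function(f) summary(f)$coefficients["x", "Std. Error"])
m    <- length(ests)
qbar <- mean(ests)          # pooled point estimate: mean of the estimates
W    <- mean(ses^2)         # within-imputation variance
B    <- var(ests)           # between-imputation variance
Tvar <- W + (1 + 1/m) * B   # total variance by Rubin's rules
c(estimate = qbar, se = sqrt(Tvar))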

Is it possible to analyse the effect of both factors and numeric variables at the same time in a limma model?

I am trying to perform an analysis of gene expression data with the limma R package. My model includes factors and numerical covariates, and I'm not able to get the results for both types of variables at once.
Here is an example:
design <- model.matrix(~0+Factor+NumericCov, data=sampleData)
fit <- lmFit(geneExprData, design)
cont.matrix <- makeContrasts(Factor1=FactorLevel2-FactorLevel1,
                             Factor2=FactorLevel3-FactorLevel2,
                             Factor3=FactorLevel1-FactorLevel3,
                             NumericCov=NumericCov,
                             levels=design)
fit <- contrasts.fit(fit, cont.matrix)
fit <- eBayes(fit)
topTable(fit, coef="Factor1")
topTable(fit, coef="Factor2")
topTable(fit, coef="Factor3")
topTable(fit, coef="NumericCov")
Is this correct? Or should I just not use a contrast matrix for the analysis of the effect of numeric covariates?
If I do not use the makeContrast function it is more difficult to look at the difference between all the levels of the factor (which I need to do).
So if this is not correct, is there nevertheless a way to define the contrasts in order to do both parts of the analysis at once?
Thanks in advance!

Which MICE imputed data set to use in succeeding analysis?

I am not sure if I need to provide reproducible output for this, as it is more of a general question. Anyway, after running the mice package, it returns m multiple imputed datasets. We can extract the data by using the complete() function.
I am confused, however, about which dataset I should use for my subsequent analysis (descriptive estimation, model building, etc.).
Questions:
1. Do I need to extract a specific dataset, e.g. complete(imp, 1)? Or shall I use the whole imputed dataset, e.g. complete(imp, "long", inc = TRUE)?
2. If it is the latter, complete(imp, "long", inc = TRUE), how do I compute descriptives like means, proportions, etc.? For example, I will analyze the long data using SPSS. Shall I split the data according to the m number of imputed datasets and manually find the average? Is that how it should be done?
Thanks for your help.
You should run your statistical analysis on each of the m imputed data sets individually, then pool the results together. This allows you to take into account the additional uncertainty introduced by the imputation procedure. MICE has this functionality built in. For example, if you wanted to do a simple linear model you would do this:
# fit the model of interest on each of the m imputed datasets
fit <- with(imp, lm(y ~ x1 + x2))
# pool the m sets of estimates with Rubin's rules
est <- pool(fit)
summary(est)
Check out ?pool and ?mira
Multiple imputation is comprised of the following three steps:
1. Imputation
2. Analysis
3. Pooling
In the first step, m imputed datasets are generated; in the second step, a data analysis, such as a regression, is applied to each dataset separately. Finally, in the third step, the analysis results are pooled into a final result. There are various pooling techniques implemented for different parameters.
Here is a nice link describing the pooling in detail - mice Vignettes
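As an illustrative sketch (with imp being a mids object, and y, x1, x2 hypothetical variables), the third step can also compare nested models across imputations with mice's D1() function:
fit1 <- with(imp, lm(y ~ x1 + x2))
fit0 <- with(imp, lm(y ~ x1))
D1(fit1, fit0)  # pooled multivariate Wald test comparing the two models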

Error in missing value imputation using MICE package

I have a huge dataset (4M x 17) that has missing values. Two columns are categorical, and the rest are numerical. I want to use the MICE package for missing value imputation. This is what I tried:
> testMice <- mice(myData[1:100000,]) # runs fine
> testTot <- predict(testMice, myData)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "mids"
Running the imputation on the whole dataset was computationally expensive, so I ran it on only the first 100K observations. I am then trying to use the output to impute the whole dataset.
Is there anything wrong with my approach? If yes, what should I do to make it correct? If no, then why am I getting this error?
Neither mice nor Hmisc provide the parameter estimates from the imputation process. Both Amelia and imputeMulti do. In both cases, you can extract the parameter estimates and use them for imputing your other observations.
Amelia assumes your data are distributed as a multivariate normal (e.g. X \sim N(\mu, \Sigma)).
imputeMulti assumes that your data are distributed as a multivariate multinomial distribution. That is, the complete cell counts are distributed as X \sim M(n, \theta), where n is the number of observations.
Fitting can be done as follows, via example data. Examining parameter estimates is shown further below.
library(Amelia)
library(imputeMulti)
data(tract2221, package= "imputeMulti")
test_dat2 <- tract2221[, c("gender", "marital_status","edu_attain", "emp_status")]
# fitting
IM_EM <- multinomial_impute(test_dat2, "EM", conj_prior = "non.informative", verbose = TRUE)
amelia_EM <- amelia(test_dat2, m = 1, noms = c("gender", "marital_status", "edu_attain", "emp_status"))
The parameter estimates from the amelia function are found in amelia_EM$mu and amelia_EM$theta.
The parameter estimates in imputeMulti are found in the IM_EM@mle_x_y slot and can be accessed via the get_parameters method.
imputeMulti has noticeably higher imputation accuracy for categorical data relative to any of the other three packages, though it only accepts multinomial (i.e. factor) data.
All of this information is in the currently unpublished vignette for imputeMulti. The paper has been submitted to JSS and I am awaiting a response before adding the vignette to the package.
You don't use predict() with mice. It's not a model you're fitting per se. Your imputed results are already there for the 100,000 rows.
If you want imputed data for all rows, then you have to put all rows into mice. I wouldn't recommend it though, unless you set it up on a large cluster with dozens of CPU cores; see the sketch below.
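One way to make that tractable is to spread the m imputations over several cores and then combine the resulting mids objects. A rough sketch (not an official recipe; reuses myData from the question, and assumes a Unix-like system for forked mclapply):
library(parallel)
library(mice)
# run four single-imputation mice() calls in parallel, one per core
fits <- mclapply(1:4, function(s) mice(myData, m = 1, seed = s, printFlag = FALSE),
                 mc.cores = 4)
# combine the four mids objects into one object with m = 4 imputations
imp <- Reduce(ibind, fits)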
