Error in missing value imputation using MICE package - r

I have a huge data (4M x 17) that has missing values. Two columns are categorical, rest all are numerical. I want to use MICE package for missing value imputation. This is what I tried:
> testMice <- mice(myData[1:100000,]) # runs fine
> testTot <- predict(testMice, myData)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "mids"
Running the imputation on whole dataset was computationally expensive, so I ran it on only the first 100K observations. Then I am trying to use the output to impute the whole data.
Is there anything wrong with my approach? If yes, what should I do to make it correct? If no, then why am I getting this error?

Neither mice nor hmisc provide the parameter estimates from the imputation process. Both Amelia and imputeMulti do. In both cases, you can extract the parameter estimates and use them for imputing your other observations.
Amelia assumes your data are distributed as a multivariate normal (eg. X \sim N(\mu, \Sigma).
imputeMulti assumes that your data is distributed as a multivariate multinomial distribution. That is the complete cell counts are distributed (X \sim M(n,\theta)) where n is the number of observations.
Fitting can be done as follows, via example data. Examining parameter estimates is shown further below.
library(Amelia)
library(imputeMulti)
data(tract2221, package= "imputeMulti")
test_dat2 <- tract2221[, c("gender", "marital_status","edu_attain", "emp_status")]
# fitting
IM_EM <- multinomial_impute(test_dat2, "EM",conj_prior = "non.informative", verbose= TRUE)
amelia_EM <- amelia(test_dat2, m= 1, noms= c("gender", "marital_status","edu_attain", "emp_status"))
The parameter estimates from the amelia function are found in amelia_EM$mu and amelia_EM$theta.
The parameter estimates in imputeMulti are found in IM_EM#mle_x_y and can be accessed via the get_parameters method.
imputeMulti has noticeably higher imputation accuracy for categorical data relative to either of the other 3 packages, though it only accepts multinomial (eg. factor) data.
All of this information is in the currently unpublished vignette for imputeMulti. The paper has been submitted to JSS and I am awaiting a response before adding the vignette to the package.

You don't use predict() with mice. It's not a model you're fitting per se. Your imputed results are already there for the 100,000 rows.
If you want data for all rows then you have to put all rows in mice. I wouldn't recommend it though, unless you set it up on a large cluster with dozens of CPU cores.

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I already developed, using a fictive example from pima data from faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply a multiple imputation by chained equation method using MICE package. For the sake of the example, I previously randomly assign missing values to pima dataset using the ampute function from the same package. A number of 20 imputated datasets were generated by setting "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputated datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputated datasets. Inspecting convergence, original and imputated data distributions is skipped for the sake of the example. "Test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputated datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using prediction function from the margins package. This specific function allows to generate predicted values fixed at a specific level (for factors) or values (for continuous variables). In this example, I could chose to generate new predicted probabilites, i.e. P(Y=1), while setting pregnant variable (# of pregnancies) at 3. In other words, it would give me the distribution of the issue in the contra-factual situation where all the observations are set at 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with MICE, the object class is a mipo and not a glm object.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would enjoy any advice that could help me getting closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to #Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
mod <- glm(test~pregnant+glucose+diastolic+
triceps+age+bmi,
data = dat, family = binomial)
out <- predictions(mod, newdata = datagrid(pregnant=3))
return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)

How to handle missing values (NA's) in a column in lmer

I would like to use na.pass for na.action when working with lmer. There are NA values in some observations of the data set in some columns. I just want to control for this variables that contains the NA's. It is very important that the size of the data set will be the same after the control of the fixed effects.
I think I have to work with na.action in lmer(). I am using the following model:
baseline_model_0 <- lmer(formula=log_life_time_income_child ~ nationality_dummy +
sex_dummy + region_dummy + political_position_dummy +(1|Family), data = baseline_df
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
My data: as you see below, there are quite a lot of NA's in all the control variables. So "throwing" away all of these observations is no option!
One example:
nat_dummy
1 : 335
2 : 19
NA's: 252
My questions:
1.) How can I include all of my control variables (expressed in multiple columns) to the model without kicking out observations (expressed in rows)?
2.) How does lmer handle the missing variables in all the columns?
To answer your second question, lmer typically uses maximum likelihood, where it will estimate missing values of the dependent variable and kick out missing values of your predictors. To avoid this, as others have suggested, you can use multiple imputation instead. I demonstrate below an example with the airquality dataset native to R since you don't have your data included in your question. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation and broom.mixed for summarizing the results.
#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
We can inspect the missing patterns with the next code:
#### Missing Patterns ####
md.pattern(airquality)
Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
To fill in the gap, we can impute the data with 5 imputations (the default, so you don't have to include the m=5 part, but I specify explicitly for your understanding.
#### Impute ####
imp <- mice(airquality,
m=5)
After, you run your imputations with the model like below. The with argument takes your imputed data and runs each imputation with the regression model. This model is a bit erroneous and comes back singular, but I just use it because its the quickest dataset I could remember with missing values included.
#### Fit With Imputations ####
fit <- with(imp,
lmer(Solar.R ~ Ozone + (1|Month)))
From there you can pool and summarize your results like so:
#### Pool and Summarise ####
pool <- pool(fit)
summary(pool)
Obviously with the model being singular this would be meaningless, but with a proper fit model, your summary should look something like this:
term estimate std.error statistic df p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2 Ozone 0.8051218 0.2190679 3.675216 135.4051 0.000341446
As Ben already mentioned, you need to also determine why your data is missing. If there are non-random reasons for their missingness, this would require some consideration, as this can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic:
https://www.gerkovink.com/miceVignettes/
Edit
You asked in the comments about adding in random effects estimates. I'm not sure why this isn't already something ported into the respective packages already, but the mitml package can help fill that gap. Here is the code:
#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
extra.pars = T)
Which gives you both fixed and random effects for imputed lmer objects:
Call:
testEstimates(model = as.mitml.result(fit), extra.pars = T)
Final parameter estimates and inferences obtained from 5 imputed data sets.
Estimate Std.Error t.value df P(>|t|) RIV FMI
(Intercept) 146.575 14.528 10.089 68.161 0.000 0.320 0.264
Ozone 0.921 0.254 3.630 90.569 0.000 0.266 0.227
Estimate
Intercept~~Intercept|Month 112.587
Residual~~Residual 7274.260
ICC|Month 0.015
Unadjusted hypothesis test as appropriate in larger samples.
And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead, which gives you just the random effects:
Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual 7.274260e+03
ICC|Month 1.522285e-02
Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart, it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.
There a bunch of questions you should ask/understand the answers to before you embark on any kind of analysis with missing data:
what kind of missingness do I have ("completely at random" [MCAR], "at random" [MAR], "not at random" [MNAR])? Can my missing-data strategy lead to bias if the data are missing not-at-random?
have I explored the pattern of missingness? Are there subsets of rows/columns that I can drop without much loss of information (e.g. if some column(s) or row(s) have mostly missing information, imputation won't help very much)
mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in #ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand what methods mice is using on your data, and that the defaults make sense for your case.
I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get ahold of a copy.

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the package glmnet, I need to run several LASSO analysis for the calibration of a large number of variables (%reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts on the procedure and on the results I wish to solve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwords, I run a cross-validated LASSO model for the training set and extract and writte the non-zero coefficients for lambdamin. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function “predict” and apply the object “cv.lasso1” (the model obtained previously) to the variables of the testing set (x.2) in order to get the prediction of the variable and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.2, type = "response", s =
"lambda.min")
cor.test(x=c(predict.1.2), y=c(y.2))
This is a simplified code and had no problem so far, the point is that I would like to make a loop (of one hundred repetitions) of the whole code and get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs actual values (for the testing set) for each repetition. I've tried but couldn't get any clear results. Can someone give me some hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky. And in your case, may not be necessary the way in which you have outlined it.
If you are trying to find the variables most predictive, you can use PCA, Principal Component Analysis to select variables with the most variation within the a variable AND between variables, but it does not consider your outcome at all, so if you have poor model design it will pick the least correlated data in your repository but it may not be predictive. So you should be very aware of all variables in the set. This would be a way of reducing the dimensionality in your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right, by removing the distance between your various variables setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with much data, try an LDA, Linear Discriminant Analysis which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA =r <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know what a global probability for each class is, or you can leave it out, and R/ lda will assign the probabilities of the actual classes from a training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of data via feature selection in a computationally solid method. In looking to build the most robust model via repeated model building, this is known as crossvalidation. There is a cv.glm method in boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourCVGLM<- cv.glmnet(y = outcomeVariable, x = allPredictorVariables, family="gaussian", K=100) .
Here K=100 specifies that you are creating 100 randomly sampled models from your current data OBSERVATIONS not variables.
So the process is two fold, reduce variables using one of the two methods above, then use cross validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called booting and it is powerful and available in many different model types.
Not as much code and you might hope for, but pointing you in a decent direction.

Which MICE imputed data set to use in succeeding analysis?

I am not sure if I need to provide a reproducible output for this as this is more of a general question. Anyway, after running the mice package, it returns m multiple imputed dataset. We can extract the data by using the complete() function.
I am confuse however which dataset shall I used for my succeeding analysis (descriptive estimation, model building, etc).
Questions:
1. Do I need to extract specific dataset e.g. complete(imp,1)? or shall I use the whole imputed dataset e.g. complete(imp, "long", inc = TRUE)?
If it is the latter complete(imp, "long", inc = TRUE), how do I compute some descriptives like mean, proportion,etc? For example, I will analyze the long data using SPSS. Shall I split the data according to the m number of imputed dataset and manually find the average? is that how it should be done?
Thanks for your help.
You should run your statistical analysis on each of the m imputed data sets individually, then pool the results together. This allows you to take into account the additional uncertainty introduced by the imputation procedure. MICE has this functionality built in. For example, if you wanted to do a simple linear model you would do this:
fit <- with(imp, lm(y ~ x1 + x2))
est <- pool(fit)
summary(est)
Check out ?pool and ?mira
Multiple imputation is comprised of the following three steps:
1. Imputation
2. Analysis
3. Pooling
In the first step, m number of imputed datasets are generated, in the second step data analysis, such as regression is applied to each dataset separately. Finally, in the thirds step, the analysis results are pooled into a final result. There are various pooling techniques implemented for different parameters.
Here is a nice link describing the pooling in detail - mice Vignettes

predict() method for "mice" package

I want to create imputation strategy using mice function from mice package. The problem is I can't seems to find any predict methods (or it's cousins) for new data in this package.
I want to do something like this:
require(mice)
data(boys)
train_boys <- boys[1:400,]
test_boys <- boys[401:nrow(boys),]
mice_object <- mice(train_boys)
train_complete_boys <- complete(train_boys)
# Here comes a hypothetical method
test_complete_boys <- predict(mice_object, test_boys)
I would like to find some approach that would emulate the code above.
Now, it's totally possible to do separate mice operations on train and test datasets separately, but it seems like from logical point of view that would be incorrect - all the information you have is in the train dataset. Observations from test dataset shouldn't provide information for each other. That's especially true when dealing with data when observations can be ordered by time of appearance.
One possible approach is to add rows from test dataset to train dataset iteratively, running imputation every time. However this seems very inelegant.
So here is the question:
Is there a method for the mice package that would be similar to the general predict method? If not, what are the possible workarounds?
Thank you!
I think it could be logically incorrect to "predict" missing values with another imputed dataset, since MICE algorithm is building models iteratively to estimate the missing values by the observed values in your given dataset.
In other words, when you do mice_object <- mice(train_boys), the algorithm estimates and imputes the NAs by the relationships between variables in train_boys. However, such estimation cannot be applied to test_boy because the relationships between variables in test_boy may differ from those in train_boy. Also, the amount of observed information is different between these two datasets.
If you believe the relationships between variables are homogeneous across train_boys and test_boys, how about doing NA imputation before splitting the dataset? i.e.:
mice_object <- mice(boys)
complete_boys <- compete(mice_object)
train_boys <- complete_boys[1:400,]
test_boys <- complete_boys[401:nrow(complete_boys),]
You can read Multiple imputation by chained equations: What is it and how does it work? if you need more information of MICE.

Resources