Multiple Imputation on New/Predictor Data - r

Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R and all seem only to impute the training and test set (at the same time). How do you then handle new unlabeled data to estimate in the same way that you did train/test? Basically, I want to use multiple imputation for missing values in the training/test set and also the same model/method for predictor data. Based on my research of multiple imputation (not an expert), it does not seem feasible to do this with MI? However, for example, using caret, you can easily use the same model that was used in training/test for new data. Any help would be greatly appreciated. Thanks.
** Edit
Basically, My data set contains many missing values. Deletion is not an option as it will discard most of my train/test set. Up to this point, I have encoded categorical variables, removed near zero variance and highl correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation
m=mice(sg.enc)
At this point, I could use the pool command to apply the model against the imputed data sets. That works fine. However, I know that future data will have missing values and would like to somehow apply this MI incrementally?

It does not have multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike Caret preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I did not yet test if factors can be part of the "predictors" for the missing target variables. yaImpute does support other imputation methods besides knn.

Related

In the R mice package, how do I find the number of nodes in the tree used in my imputation?

In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
data(whiteside, package ="MASS")
data <- whiteside
data[1,2] <- NA
library(mice)
impute <- mice(data,m=100,method="cart")
impute2 <- complete(impute,"long")
I guess, whiteside is only used as an example here. So your actual data looks different.
I can't easily get the number of nodes for the tree generated in mice. The first problem is, that it isn't just one tree ... as the package names says mice - Multivariate Imputation by Chained Equations. Which means you are sequentially creating multiple CART trees. Also each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want to get numbers of nodes for each used model, you probably would have to adjust the mice package itself and add some logging there.
Here is how you might approach this:
Calling impute <- mice(data,m=100,method="cart") you get a S3 object of class mids that contains information about the imputation. (but not the number of nodes for each tree)
But you can call impute$formulas, impute$method, impute$nmis to get some more information, which formulas were used and which variables actually had missing values.
From the mice.impute.cart documentation you can see, that mice uses rpart internally for creating the classification and regression trees.
Since the mids object does not contain information about the fitted trees I'd suggest, you use rpart manually with the formula from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
Which will print / give you the nodes/tree. This wouldn't really be the tree used in mice. As I said, mice means multiple chained equations / multiple models after each other - meaning multiple possibly different trees after each other. (take a look at the algortihm description here https://stefvanbuuren.name/fimd/sec-cart.html for the univariate missingness case with CART). But this could at least be an indicator, if applying rpart on your specific data provides a useful model and thus leads to good imputation results.

How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?

I am working on an academic project that involves predicting the house prices based on the Sberbank Russian Housing Market dataset. However, I am stuck in the data cleaning process of a particular column that indicates the date when the property was built. I can't just impute the missing values by replacing it with a mean or median. I was looking for all the possible ways available to impute such a data that are meaningful and not just random numbers. Also, the scope of the project allows me the usage of only linear regression models in R so I would not want models like XGBoost to automatically take care of imputation.
Your question is very broad. There are actually multiple R packages that can help you here:
missForest
imputeR
mice
VIM
simputation
There are even more, there is a whole official TaskView dedicated to listing packages for imputation in R. Look mostly for Single Imputation packages, because these will be a good fit for your task.
Can't tell you, which method performs best for your specific task. This depends on your data and the linear regression model you are using afterwards.
So you have to test, with which combination of imputation algorithm + regression model you get the best overall performance.
So overall you are testing with which feature engineering / preprocessing + imputation algorithm + regression model you archive the best result.
Be careful of leakage in your testing (accidentally sharing information between the test and training datasets). Usually you can combine train+test data and perform the imputation on the complete dataset. But it is important, that the target variable is removed from the test dataset. (because you wouldn't have this for the real data)
Most of the mentioned packages are quite easy to use, here an example for missForest:
library("missForest")
# create example dataset with missing values
missing_data_iris <- prodNA(iris, noNA = 0.1)
# Impute the dataset
missForest(missing_data_iris)
The other packages are equally easy to use. Usually for all these single imputation packages it is just one function, where you give in your incomplete dataset and you get the data back without NAs.

What exactly does complete in mice do?

I am researching how to use multiple imputation results. The following is my understanding, and please let me know if there're mistakes.
Suppose you have a data set with missing values, and you want to conduct a regression analysis. You may perform multiple imputation for m = 5 times, and for each imputed data set (5 imputed data sets now) you run a regression analysis, then "pool" the coefficient estimates from these m = 5 models via Rubin's rules (or use R package "pool").
My question is that, in mice you have a function complete(), and the manual says you can extract completed data set by using complete(object).
But if I use mice for m = 5 times, does it still make sense to use complete()? Which imputation results will complete() get for me?
Also, does it make sense if I only use mice with m = 1? Thank you.
You probably overlooked that mice::complete() in arguments uses action=1 as default, which "returns the first imputed data set" (see ?mice::complete) and actually is worthless.
You should definitely use action="long" to take account for the "multiplicity" of the multiple imputation!
No, it makes no sense at all to use m=1 (apart from debugging), because every imputation is based on a random process and you have to pool the results (using any method whatsoever) to account for the variation. Often m>20 is recommended1.
Basically, multiple imputation works as follows:
Create m imputation processes with a random component, to obtain
m slightly different imputed data sets.
Analyze each imputed data set to get slightly different parameter
estimates.
Combine results, calculating the variation in parameter estimates.
(Also see multiple-imputation-in-a-nutshell for a brief overview.)
When you use mice, you get an object that is not the imputed data set. You cannot perform operations on it directly without using the special functions in mice. If you want to extract that actual imputed datasets, you use complete, the output of which is a data.frame with one row per individual per imputation (if using the "long" format). If you are doing any analysis with your imputed data that cannot be performed within mice, you need to create this dataset first.

Not losing observations when faced with missing data

I have a dataset where I've fitted a linear model and I've tried to use the step function on this linear model. I get an error message "saying number of rows in use has changed: remove missing values?".
I noticed that a few of the observations (not many) in my dataset had NA values for one variable. I've seen similar questions which suggest using na.omit(), but when I do this I lose the observations. I want to keep the observations however, because they contain useful information for the other variables. Is there a way to use step and avoid losing the observations?
You can call the nobs function to check that the number of observations is unchanged, and its use.fallback argument to potentially guess the missing values. The R documentation however recommends omitting the relevant data before running step.
I would discourage you from simply omitting the missing values if they are indeed really missing. You can use multiple imputation via Amelia to impute the data such that you have a full dataset.
see here: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf
also I would recommend reviewing the book "Statistical Analysis With Missing Data" by R. Little and D.B. Rubin.

Imputation in large data

I need to impute missing values. My data set has about 800,000 rows and 92 variables. I tried kNNImpute in the imputation package in r but looks like the data set is too big. Any other packages/method in R? I would prefer not to use mean to replace the missing values.
thank you
1) You might try
library(sos)
findFn("impute")
This shows 400 matches in 113 packages. This shows 400 matches in 113 packages: you could narrow it down per your requirements of the imputation function.
2) Did you see/try Hmisc ?
Description: The Hmisc library contains many functions useful for data
analysis, high-level graphics, utility operations, functions
for computing sample size and power, importing datasets,
imputing missing values, advanced table making, variable
clustering, character string manipulation, conversion of S
objects to LaTeX code, and recoding variables.
3) Possibly mice
Multiple imputation using Fully Conditional Specification (FCS)
implemented by the MICE algorithm. Each variable has its own
imputation model. Built-in imputation models are provided for
continuous data (predictive mean matching, normal), binary data
(logistic regression), unordered categorical data (polytomous logistic
regression) and ordered categorical data (proportional odds). MICE can
also impute continuous two-level data (normal model, pan, second-level
variables). Passive imputation can be used to maintain consistency
between variables. Various diagnostic plots are available to inspect
the quality of the imputations.
MICE is a great package, with strong diagnostic tools, and may be capable of doing the job in such a large dataset.
One thing you should be aware of: MICE is S-L-O-W. Working on such a big dataset, if you intend to use MICE, I would strongly recommend you to use a computing cloud -- otherwise, you're better planning your self in advance because, with a 800k x ~100 matrix, it may take a few days to get the job done, depending on how you specify your model.
MICE offers you a number of different imputation methods to be used according to the type of variable to be imputed. The fastest one is predictive mean matching. PMM was initially intended to be used to impute continuous data but it seems pmm is flexible enough to accomodate other types of variable. Take a look at this Paul Allison's post and Stef van Buuren's response: http://statisticalhorizons.com/predictive-mean-matching
(I see this is a three years old post but I have been using MICE and have been amazed by how powerful -- and oftentimes slow -- it can be!)

Resources