In the R mice package, how do I find the number of nodes in the tree used in my imputation? - r

In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
data(whiteside, package ="MASS")
data <- whiteside
data[1,2] <- NA
library(mice)
impute <- mice(data,m=100,method="cart")
impute2 <- complete(impute,"long")

I guess whiteside is only used as an example here, so your actual data looks different.
I can't easily get the number of nodes for the tree generated in mice. The first problem is that it isn't just one tree: as the package name says, mice stands for Multivariate Imputation by Chained Equations, which means you are sequentially creating multiple CART trees. Also, each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want to get numbers of nodes for each used model, you probably would have to adjust the mice package itself and add some logging there.
Here is how you might approach this:
Calling impute <- mice(data,m=100,method="cart") gives you an S3 object of class mids that contains information about the imputation (but not the number of nodes for each tree).
You can inspect impute$formulas, impute$method and impute$nmis to see which formulas were used and which variables actually had missing values.
From the mice.impute.cart documentation you can see that mice uses rpart internally to create the classification and regression trees.
Since the mids object does not contain the fitted trees, I'd suggest you run rpart manually with the formula from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
This prints the tree with its nodes. It wouldn't really be the tree used in mice. As I said, mice means multiple chained equations, i.e. multiple models fitted one after another, so there are multiple, possibly different, trees. (Take a look at the algorithm description at https://stefvanbuuren.name/fimd/sec-cart.html for the univariate missingness case with CART.) But this could at least indicate whether applying rpart to your specific data provides a useful model and thus leads to good imputation results.
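To get actual counts out of such a manual fit, a minimal sketch could read them from the frame component of the rpart object, which has one row per node. The minbucket = 5 and cp = 1e-4 values are an assumption meant to mirror the defaults documented for mice.impute.cart, and this single tree is only indicative of the trees mice refits internally.

```r
library(rpart)
data(whiteside, package = "MASS")

dat <- whiteside
dat[1, 2] <- NA   # same missing Temp value as in the question

# Assumption: minbucket = 5 and cp = 1e-4 mirror the mice.impute.cart defaults.
# rpart drops the row with the missing response on its own.
fit <- rpart(Temp ~ Insul + Gas, data = dat,
             control = rpart.control(minbucket = 5, cp = 1e-4))

frame     <- fit$frame                  # one row per node of the tree
is_leaf   <- frame$var == "<leaf>"      # terminal nodes are labelled "<leaf>"
n_nodes   <- nrow(frame)                # total number of nodes
n_leaves  <- sum(is_leaf)               # number of terminal nodes
leaf_size <- frame$n[is_leaf]           # observations in each terminal node

n_nodes
n_leaves
leaf_size
```

The leaf sizes are what matter for the imputation: the replacement value is drawn from the observed values in the terminal node that the incomplete case falls into.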

Related

Can I use xgboost global model properly, if I skip step_dummy(all_nominal_predictors(), one_hot = TRUE)?

I wanted to try xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the wmt data with 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? That means step_dummy creates another 200k columns and my PC can't handle it (it can't even handle 14k ids).
I tried to remove step_dummy, but then I end up with xgboost forecasting the same values for all ids.
My question is: how can I forecast 200k time series with a global xgboost model and still get proper forecasts for each one of the 200k ids?
Or is it necessary to keep step_dummy in order to create a proper forecast for all ids?
PS: the code should be the same as the one in the link. Only in my dataset there are 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
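As an illustration of the effect encoding idea, here is a hand-rolled sketch in base R that replaces the id column with a single numeric column holding the per-id mean of the outcome. The toy data, the column names and the encode_id() helper are all hypothetical, and this is not what tidymodels or modeltime do internally.

```r
set.seed(1)

# Hypothetical toy panel: 3 ids with 4 observations each
train <- data.frame(
  id    = rep(c("a", "b", "c"), each = 4),
  value = c(rnorm(4, 10), rnorm(4, 50), rnorm(4, 100))
)

# Effect (target) encoding: one numeric column instead of one dummy per id.
# The means must come from the training window only to avoid leakage.
id_means    <- vapply(split(train$value, train$id), mean, numeric(1))
global_mean <- mean(train$value)

# Hypothetical helper: look up the per-id mean, fall back to the global mean
encode_id <- function(id) {
  enc <- unname(id_means[as.character(id)])
  enc[is.na(enc)] <- global_mean   # ids not seen in training
  enc
}

train$id_enc <- encode_id(train$id)
new_ids      <- encode_id(c("a", "zzz"))
```

With 200k ids this still adds only one column, which is the point of the approach; whether it carries enough information for a good forecast has to be checked on the actual data.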
I think there is no single proper answer to the question of how to forecast 200k time series. Global models are the way to go here, but you need to experiment to find out which series do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, for what you put inside the global model.
Consider using several global models with different feature recipes.
If you want to avoid step_dummy function, use lightgbm from the bonsai package, which is considerably faster and more accurate.

How to impute missing "build_year" column in Sberbank Russian Housing Market dataset on Kaggle?

I am working on an academic project that involves predicting the house prices based on the Sberbank Russian Housing Market dataset. However, I am stuck in the data cleaning process of a particular column that indicates the date when the property was built. I can't just impute the missing values by replacing it with a mean or median. I was looking for all the possible ways available to impute such a data that are meaningful and not just random numbers. Also, the scope of the project allows me the usage of only linear regression models in R so I would not want models like XGBoost to automatically take care of imputation.
Your question is very broad. There are actually multiple R packages that can help you here:
missForest
imputeR
mice
VIM
simputation
There are even more: there is a whole official CRAN Task View dedicated to listing packages for imputation in R. Look mostly for single imputation packages, because these will be a good fit for your task.
I can't tell you which method performs best for your specific task. This depends on your data and the linear regression model you use afterwards.
So you have to test which combination of feature engineering / preprocessing, imputation algorithm and regression model achieves the best overall result.
Be careful of leakage in your testing (accidentally sharing information between the test and training datasets). Usually you can combine train+test data and perform the imputation on the complete dataset. But it is important that the target variable is removed from the test dataset (because you wouldn't have it for the real data).
Most of the mentioned packages are quite easy to use. Here is an example for missForest:
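library("missForest")
# create example dataset with 10% of the values set to NA
missing_data_iris <- prodNA(iris, noNA = 0.1)
# impute the dataset; missForest() returns a list
imputed <- missForest(missing_data_iris)
# the completed data frame is stored in the ximp element
imputed$ximp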
The other packages are equally easy to use. For all these single imputation packages it is usually just one function: you pass in your incomplete dataset and get the data back without NAs.

What exactly does complete in mice do?

I am researching how to use multiple imputation results. The following is my understanding, and please let me know if there're mistakes.
Suppose you have a data set with missing values, and you want to conduct a regression analysis. You may perform multiple imputation with m = 5, run the regression analysis on each of the 5 imputed data sets, and then "pool" the coefficient estimates from these m = 5 models via Rubin's rules (or use the pool() function in the mice package).
My question is that, in mice you have a function complete(), and the manual says you can extract completed data set by using complete(object).
But if I use mice for m = 5 times, does it still make sense to use complete()? Which imputation results will complete() get for me?
Also, does it make sense if I only use mice with m = 1? Thank you.
You probably overlooked that mice::complete() uses action=1 as the default argument, which "returns the first imputed data set" (see ?mice::complete) and is of little use on its own.
You should definitely use action="long" to account for the "multiplicity" of the multiple imputation!
No, it makes no sense at all to use m=1 (apart from debugging), because every imputation is based on a random process and you have to pool the results (using any method whatsoever) to account for the variation. Often m>20 is recommended.
Basically, multiple imputation works as follows:
1) Create m imputation processes with a random component, to obtain m slightly different imputed data sets.
2) Analyze each imputed data set to get slightly different parameter estimates.
3) Combine results, calculating the variation in parameter estimates.
(Also see multiple-imputation-in-a-nutshell for a brief overview.)
When you use mice, you get an object that is not the imputed data set. You cannot perform operations on it directly without using the special functions in mice. If you want to extract the actual imputed datasets, you use complete, the output of which is a data.frame with one row per individual per imputation (if using the "long" format). If you are doing any analysis with your imputed data that cannot be performed within mice, you need to create this dataset first.
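The impute / analyze / pool steps and the two ways of calling complete() can be sketched as follows. This uses the nhanes example data shipped with mice; the regression formula and the seed are arbitrary choices for illustration only.

```r
library(mice)

# Step 1: create m = 5 imputed versions of the nhanes example data
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# complete() with the default action returns only the first imputed data set
first <- complete(imp)

# action = "long" stacks all m data sets and adds .imp and .id columns
long <- complete(imp, "long")

# Step 2: fit the same model on each imputed data set
fit <- with(imp, lm(bmi ~ age + chl))

# Step 3: combine the m sets of estimates with Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

The "long" data frame is useful when an analysis cannot be run through with(); the .imp column then identifies which imputation each row belongs to.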

Multiple Imputation on New/Predictor Data

Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R and all seem only to impute the training and test set (at the same time). How do you then handle new unlabeled data to estimate in the same way that you did train/test? Basically, I want to use multiple imputation for missing values in the training/test set and also the same model/method for predictor data. Based on my research of multiple imputation (not an expert), it does not seem feasible to do this with MI? However, for example, using caret, you can easily use the same model that was used in training/test for new data. Any help would be greatly appreciated. Thanks.
** Edit
Basically, my data set contains many missing values. Deletion is not an option as it will discard most of my train/test set. Up to this point, I have encoded categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
m=mice(sg.enc)
At this point, I could use the pool command to apply the model against the imputed data sets. That works fine. However, I know that future data will have missing values and would like to somehow apply this MI incrementally?
It does not have multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike Caret preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I did not yet test if factors can be part of the "predictors" for the missing target variables. yaImpute does support other imputation methods besides knn.

Imputation in large data

I need to impute missing values. My data set has about 800,000 rows and 92 variables. I tried kNNImpute in the imputation package in r but looks like the data set is too big. Any other packages/method in R? I would prefer not to use mean to replace the missing values.
thank you
1) You might try
library(sos)
findFn("impute")
This shows 400 matches in 113 packages; you could narrow it down per your requirements of the imputation function.
2) Did you see/try Hmisc ?
Description: The Hmisc library contains many functions useful for data
analysis, high-level graphics, utility operations, functions
for computing sample size and power, importing datasets,
imputing missing values, advanced table making, variable
clustering, character string manipulation, conversion of S
objects to LaTeX code, and recoding variables.
3) Possibly mice
Multiple imputation using Fully Conditional Specification (FCS)
implemented by the MICE algorithm. Each variable has its own
imputation model. Built-in imputation models are provided for
continuous data (predictive mean matching, normal), binary data
(logistic regression), unordered categorical data (polytomous logistic
regression) and ordered categorical data (proportional odds). MICE can
also impute continuous two-level data (normal model, pan, second-level
variables). Passive imputation can be used to maintain consistency
between variables. Various diagnostic plots are available to inspect
the quality of the imputations.
MICE is a great package, with strong diagnostic tools, and may be capable of doing the job in such a large dataset.
One thing you should be aware of: MICE is S-L-O-W. If you intend to use MICE on such a big dataset, I would strongly recommend using cloud computing. Otherwise, you had better plan ahead because, with an 800k x ~100 matrix, it may take a few days to get the job done, depending on how you specify your model.
MICE offers you a number of different imputation methods to be used according to the type of variable to be imputed. The fastest one is predictive mean matching. PMM was initially intended to be used to impute continuous data, but it seems pmm is flexible enough to accommodate other types of variable. Take a look at this post by Paul Allison and Stef van Buuren's response: http://statisticalhorizons.com/predictive-mean-matching
(I see this is a three years old post but I have been using MICE and have been amazed by how powerful -- and oftentimes slow -- it can be!)
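Before committing to a multi-day run, it may help to time a small pmm run first. The sketch below uses the tiny nhanes example data from mice as a stand-in; on real data you would pass a random subsample of rows instead, and the m and maxit values here are arbitrary.

```r
library(mice)

set.seed(42)

# Time a short pmm run: few imputations and few iterations
elapsed <- system.time(
  imp <- mice(nhanes, method = "pmm", m = 2, maxit = 3, printFlag = FALSE)
)

imp$method            # method actually used for each variable
elapsed[["elapsed"]]  # wall-clock seconds for this small run
```

The run time on a subsample is only a rough guide, since the cost does not have to grow linearly with the number of rows or with the number of predictors in each imputation model.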
