How to do multiple imputation in Julia?

I've found the package Impute.jl, but it can only use these simple methods:
drop: remove missing values
locf: last observation carried forward
nocb: next observation carried backward
interp: linear interpolation of values in vector
fill: replace with a specific value or a function...
There doesn't seem to be any advanced "multiple imputation" method.
How can I use more advanced methods when I have several variables?
Such as: fully conditional specification (mice), bayesian methods, random forest, multilevel, nested imputation, censored data, categorical data, survival data...
I don't mean writing my own code, but finding a Julia package able to do it automatically. Other software does have it (R, Stata, SAS…).

Related

In the R mice package, how do I find the number of nodes in the tree used in my imputation?

In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
data(whiteside, package = "MASS")
data <- whiteside
data[1, 2] <- NA
library(mice)
impute <- mice(data, m = 100, method = "cart")
impute2 <- complete(impute, "long")
I guess whiteside is only used as an example here, so your actual data looks different.
I can't easily get the number of nodes for the tree generated in mice. The first problem is that it isn't just one tree ... as the package name says, mice stands for Multivariate Imputation by Chained Equations. That means you are sequentially creating multiple CART trees, and each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want to get numbers of nodes for each used model, you probably would have to adjust the mice package itself and add some logging there.
Here is how you might approach this:
Calling impute <- mice(data, m = 100, method = "cart") gives you an S3 object of class mids that contains information about the imputation (but not the number of nodes for each tree).
You can, however, call impute$formulas, impute$method, and impute$nmis to get more information about which formulas were used and which variables actually had missing values.
From the mice.impute.cart documentation you can see that mice uses rpart internally for creating the classification and regression trees.
Since the mids object does not contain information about the fitted trees, I'd suggest you use rpart manually with the formula from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
This will print / give you the nodes of the tree. It wouldn't really be the tree used in mice: as I said, mice means multiple chained equations / multiple models after each other, i.e. multiple, possibly different, trees in sequence (take a look at the algorithm description at https://stefvanbuuren.name/fimd/sec-cart.html for the univariate missingness case with CART). But this could at least be an indicator of whether applying rpart to your specific data provides a useful model and thus leads to good imputation results.
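To actually count the nodes and node sizes from such a manually fitted tree, you can inspect the fitted object's frame component (this is standard rpart structure; the call below reuses the whiteside example from above, and the exact counts will depend on your data and rpart's default control parameters):

```r
library(rpart)

data(whiteside, package = "MASS")
fit <- rpart(Temp ~ 0 + Insul + Gas, data = whiteside)

# Each row of fit$frame is one node of the tree;
# leaves are marked by var == "<leaf>" and n holds the node size.
n_nodes    <- nrow(fit$frame)
n_leaves   <- sum(fit$frame$var == "<leaf>")
leaf_sizes <- fit$frame$n[fit$frame$var == "<leaf>"]

n_nodes
n_leaves
leaf_sizes
```

The leaf sizes always sum to the number of complete observations, since the leaves partition the data.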

Applying a population total variable in R?

I have a weighting variable that I'd like to apply to my dataset so that I have weighted totals. In SPSS, this is straightforward enough. However, in R, I've been multiplying the variable by the weight variable to create a new variable as shown in the following example:
https://stats.stackexchange.com/questions/210697/weighting-variable-based-on-another-variable
Is there a more sophisticated way of applying weights in R?
Thanks.
If you need to work with a weighted dataset and define a complex survey sample, you can use the survey package: https://cran.r-project.org/web/packages/survey/survey.pdf.
You can then use all sorts of summary statistics once you have defined the weights to be applied.
However, I would advise this mainly for complex weighted analyses.
Otherwise, there are several other packages dealing with weights, such as questionr for instance.
It all depends on whether you have to do a simple weighted sum or go on to other types of analysis that require more sophisticated methods.
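As a minimal sketch of the survey-package route (the data frame df and the variables x and wt are hypothetical stand-ins for your variable of interest and your weighting variable):

```r
library(survey)

# Hypothetical data: x is the variable of interest, wt the weighting variable
df <- data.frame(x  = c(10, 20, 30, 40),
                 wt = c(2,  1,  3,  4))

# Declare the design: no clusters (ids = ~1), weights taken from wt
des <- svydesign(ids = ~1, weights = ~wt, data = df)

# Weighted total and weighted mean, with design-based standard errors
svytotal(~x, des)
svymean(~x, des)

# For a simple weighted total without the survey machinery:
sum(df$x * df$wt)   # 10*2 + 20*1 + 30*3 + 40*4 = 290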

R simulations and regression in mice()

I am using the mice package in R to do multiple imputation and trying to understand the algorithm behind it.
From its documentation, http://www.jstatsoft.org/v45/i03/paper, the MICE algorithm is used. From my understanding, it performs MCMC using a Gibbs sampler, where it simulates parameters BETA that define the conditional distribution of Y (the variable with missing values) given Y- (all other variables). With the simulated BETA, the corresponding conditional distribution is defined. It then draws values from that conditional distribution and replaces the missing values with them. It repeats the procedure across all variables with missing values.
However, what I don't understand is where the regression happens. In the mice() function, we need to specify the 'method' parameter, for example 'logreg' for binomially distributed variables and 'polyreg' for factor variables with more than 2 levels. If imputation is done by MCMC, why would we need to specify a regression?
Some documentation indicates that the MICE algorithm runs regressions iteratively across all variables with missing values. Each time, one variable with missing values is the response variable and all others are explanatory variables. Fitted values are then used to replace the missing values before moving on to the next variable with missing values. The next regression includes the imputed data from the last regression. This is the same scheme as a Gibbs sampler, but it seems there is no simulation. Details are here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
Could anyone help me to understand what really happens in mice in R?
For each variable with missing data (Y1, ..., Yj, ..., Yk), the MICE algorithm fits a statistical model conditioning Yj on all other variables (Yj-, or a subset thereof).
The type of statistical model is indicated by method.
This is the "regression".
The fitted model is used to draw replacements for the missing portions of Yj, given Yj-. Afterwards, the algorithm proceeds with the next variable that contains missing values.
Once all variables have been filled, the algorithm starts over.
Note that, when fitting the models, the MICE algorithm regresses the observed portions of Yj on the observed and imputed portions of Yj-.
In other words, at each iteration, the regression models condition on a different set of predictor values (hence the need for usually more than one iteration). This is slightly different from other implementations of MI.
Note also that the MICE algorithm is not formally a Gibbs sampler (see the very well-written discussion by Carpenter and Kenward, 2013).
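To see where the "regression" enters in practice, here is a minimal call on mice's built-in nhanes2 dataset, which mixes continuous and categorical variables so that mice assigns a different default method per column (the seed is arbitrary):

```r
library(mice)

# nhanes2 ships with mice: age is a 3-level factor, hyp a 2-level factor,
# bmi and chl are continuous; hyp, bmi and chl contain missing values.
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)

# One conditional model ("regression") per incomplete variable:
imp$method
# e.g. pmm for bmi/chl, logreg for hyp; complete variables get ""

# Each imputed dataset has the missing values drawn from those fitted models
completed <- complete(imp, 1)
anyNA(completed)
```

So the method argument picks the form of the conditional model for each variable, and the "simulation" is the draw of replacement values from each fitted model at every iteration.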

Imputation in large data

I need to impute missing values. My data set has about 800,000 rows and 92 variables. I tried kNNImpute in the imputation package in R, but it looks like the data set is too big. Are there any other packages/methods in R? I would prefer not to use the mean to replace the missing values.
thank you
1) You might try
library(sos)
findFn("impute")
This shows 400 matches in 113 packages: you could narrow it down per your requirements of the imputation function.
2) Did you see/try Hmisc ?
Description: The Hmisc library contains many functions useful for data
analysis, high-level graphics, utility operations, functions
for computing sample size and power, importing datasets,
imputing missing values, advanced table making, variable
clustering, character string manipulation, conversion of S
objects to LaTeX code, and recoding variables.
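As a quick sketch of Hmisc's simple imputation (this is single imputation with a summary function; Hmisc also offers aregImpute for multiple imputation):

```r
library(Hmisc)

x <- c(1, 2, NA, 4, NA, 6)

# Replace missing values with the median (the default) or any other function
x_med  <- impute(x, median)
x_mean <- impute(x, mean)

as.numeric(x_med)   # the NAs are filled with median(1, 2, 4, 6) = 3
is.imputed(x_med)   # flags which entries were imputed
```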
3) Possibly mice
Multiple imputation using Fully Conditional Specification (FCS)
implemented by the MICE algorithm. Each variable has its own
imputation model. Built-in imputation models are provided for
continuous data (predictive mean matching, normal), binary data
(logistic regression), unordered categorical data (polytomous logistic
regression) and ordered categorical data (proportional odds). MICE can
also impute continuous two-level data (normal model, pan, second-level
variables). Passive imputation can be used to maintain consistency
between variables. Various diagnostic plots are available to inspect
the quality of the imputations.
MICE is a great package, with strong diagnostic tools, and may be capable of doing the job in such a large dataset.
One thing you should be aware of: MICE is S-L-O-W. Working on such a big dataset, if you intend to use MICE, I would strongly recommend using a computing cloud; otherwise, you're better off planning ahead, because with an 800k x ~100 matrix it may take a few days to get the job done, depending on how you specify your model.
MICE offers a number of different imputation methods to be used according to the type of variable to be imputed. The fastest one is predictive mean matching. PMM was initially intended for imputing continuous data, but it seems pmm is flexible enough to accommodate other types of variables. Take a look at this Paul Allison post and Stef van Buuren's response: http://statisticalhorizons.com/predictive-mean-matching
(I see this is a three-year-old post, but I have been using MICE and have been amazed by how powerful -- and oftentimes slow -- it can be!)

"Forward" entry stepwise regression using p-values in R

Note that the question previously flagged as a possible duplicate is not a duplicate, because that question concerns backward elimination and this question concerns forward entry.
I am currently performing a simulation where I want to show how stepwise regression is a biased estimator. In particular, previous researchers seem to have used one of the stepwise procedures in SPSS (or something identical to it). This involves using the p-value of the F statistic for the R-squared change to determine whether an additional variable should be added to the model. Thus, in order for my simulation results to have the most impact, I need to replicate the SPSS stepwise regression procedure in R.
While R has a number of stepwise procedures (e.g., based on AIC), the ones that I have found are not the same as SPSS.
I have found this function by Paul Rubin. It seems to work, but the input and output of the function are a little strange. I've started tweaking it so that it (a) takes a formula as input and (b) returns the best-fitting model. The logic of the function is what I'm after.
I have also found this question on backward stepwise regression. Note that backward elimination is different from forward entry, because backward elimination removes non-significant terms whereas forward entry adds significant terms.
Nonetheless, it would be great if there was another function in an existing R package that could do what I want.
Is there an R function designed to perform forward entry stepwise regression using p-values of the F change?
Ideally, it could take a DV, a set of IVs (either as named variables or as a formula), and a data.frame, and would return the model that the stepwise regression selects as "best". For my purposes, there are no issues with inclusion of interaction terms.
The function two.ways.stepfor in the bioconductor package maSigPro contains a form of forward entry stepwise regression based on p-values.
However, while the alpha-to-enter and alpha-to-remove can be specified, they must be the same. In SPSS the alpha-to-enter and alpha-to-remove can be different.
The package can be installed with:
source("http://bioconductor.org/biocLite.R")
biocLite("maSigPro")
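If you end up rolling your own instead, base R's add1() already computes the F-test p-values needed for forward entry, so a minimal hand-rolled loop might look like the sketch below. The helper name forward_p is made up, it uses a single alpha-to-enter, and it is not a full replication of SPSS's procedure:

```r
# Forward entry by p-value of the F test, using add1() from base R.
# alpha_in is the significance level a variable needs in order to enter.
forward_p <- function(formula_full, data, alpha_in = 0.05) {
  response <- all.vars(formula_full)[1]
  fit <- lm(reformulate("1", response), data = data)  # intercept-only start
  repeat {
    cand  <- add1(fit, scope = formula_full, test = "F")
    pvals <- cand[["Pr(>F)"]][-1]                     # drop the <none> row
    if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) >= alpha_in) break
    best <- rownames(cand)[-1][which.min(pvals)]
    fit  <- update(fit, reformulate(c(".", best)))    # add the best term
  }
  fit
}

# Example on a built-in dataset
fit <- forward_p(mpg ~ cyl + disp + hp + wt, data = mtcars)
names(coef(fit))
```

Each pass refits the current model with every remaining candidate, picks the one with the smallest F-change p-value, and stops when nothing clears alpha_in.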
