R simulations and regression in mice()

I am using the mice package in R to do multiple imputation and trying to understand the algorithm behind it.
From its documentation (http://www.jstatsoft.org/v45/i03/paper), the MICE algorithm is said to be used. From my understanding, it performs MCMC using a Gibbs sampler, where it simulates parameters BETA that define the conditional distribution of Y (a variable with missing values) given Y- (all other variables). With the simulated BETA, the corresponding conditional distribution is defined. It then draws values from that conditional distribution and replaces the missing values with them. It repeats this procedure across all variables with missing values.
However, what I don't understand is where the regression happens. In the mice() function, we need to specify the 'method' parameter, for example 'logreg' for binary variables and 'polyreg' for factor variables with more than 2 levels. If imputation is done by MCMC, why would we need to specify a regression?
Some documentation indicates that the MICE algorithm runs regressions iteratively across all variables with missing values. Each time, one variable with missing values is the response variable and all others are explanatory variables. Fitted values are then used to replace the missing values before moving on to the next variable with missing values, and the next regression includes the imputed data from the previous one. This is the same scheme as a Gibbs sampler, but there seems to be no simulation. Details are here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
Could anyone help me to understand what really happens in mice in R?

For each variable with missing data (Y1,...,Yj,...,Yk), the MICE algorithm fits a statistical model conditioning Yj on all other variables (Yj-, or a subset thereof).
The type of statistical model is indicated by method.
This is the "regression".
The fitted model is used to draw replacements for the missing portions of Yj, given Yj-. Afterwards, the algorithm proceeds with the next variable that contains missing values.
Once all variables have been filled, the algorithm starts over.
Note that, when fitting the models, the MICE algorithm regresses the observed portions of Yj on the observed and imputed portions of Yj-.
In other words, at each iteration the regression models condition on a different set of predictor values (hence the need for usually more than one iteration). This is slightly different from other implementations of MI.
Note also that the MICE algorithm is not formally a Gibbs sampler (see the very well-written discussion by Carpenter and Kenward, 2013).
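To make the role of method concrete, here is a minimal sketch; the data frame df and the column names bin_var and cat_var are placeholders, not from the question. For each incomplete variable, mice fits the model named in method and then draws imputations from it:

library(mice)

meth <- make.method(df)        # mice's default per-column choices, e.g. "pmm"
meth["bin_var"] <- "logreg"    # binary factor -> logistic regression
meth["cat_var"] <- "polyreg"   # factor with >2 levels -> polytomous regression

imp <- mice(df, method = meth, m = 5, maxit = 10, seed = 123)
plot(imp)                      # trace plots: convergence of the chained equations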

Related

Imputing missing data for MLM in R

Maybe someone can help me with this question. I conducted a follow-up study and obviously now have to face missing data. I am considering how best to impute the missing data for my multilevel model (MLM) in R (e.g. some participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet, I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation procedure in a bit more detail? That would be great! Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
You have to specify the imputation model. Basically, that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effects and only changing the random effects (in particular, comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run the model on each of the imputed datasets;
pool the results (typically using Rubin's rules).
So you will need to do this twice, to end up with two sets of pooled results, one for each model. The software should provide functionality for doing all of this; a hedged sketch using the mitml package (which provides panImpute()) is given below.
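In the sketch, the imputation formula and the variable names y, x1, time and the cluster identifier id are placeholders, not from the question; the object data is the one passed to panImpute() above.

library(mitml)   # provides panImpute() and pooling via Rubin's rules
library(lme4)

# Imputation model: incomplete variables on the left, random intercept by cluster
fml <- y + x1 ~ 1 + time + (1 | id)

# 1. perform the imputations (once, for both substantive models)
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
implist <- mitmlComplete(imp, "all")   # list of 5 completed datasets

# 2. run each substantive model on all imputed datasets
fit_ri <- with(implist, lmer(y ~ x1 + time + (1 | id)))          # random intercept
fit_rs <- with(implist, lmer(y ~ x1 + time + (1 + time | id)))   # + random slope

# 3. pool the results (Rubin's rules), once per model
testEstimates(fit_ri)
testEstimates(fit_rs)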
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and there are no other reasons against using a particular method (such as logistic regression) on your data, you might want to study the nature of the "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. See whether the residuals are normally distributed (normality of the model residuals is the actual assumption in linear regression, not normality of each variable; you might well have, say, bimodally distributed data if there are differences between groups), see if there are observations that could be regarded as outliers, study them (see e.g. here), and if justified remove them from the final dataset before re-fitting the model without them (a sketch of this step follows the list);
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or modelling method which has no assumption of normality.
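As a hedged illustration of the first point, assuming a binary outcome y and predictors x1 and x2 in a data frame dat (all placeholder names), base R's plot() method for a fitted glm produces the residual, Q-Q, scale-location and leverage plots mentioned above:

# Fit the desired regression (here logistic regression as an example)
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Standard residual diagnostics: residuals vs fitted (Tukey-Anscombe),
# normal Q-Q plot of the standardized deviance residuals, scale-location,
# and residuals vs leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(fit)

# Flag potentially influential observations for closer inspection
which(cooks.distance(fit) > 4 / nrow(dat))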
Before modelling anything or removing any data, I would always plot the data by treatment/outcome group and inspect the pattern of missing values. After a quick look at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for, e.g., linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values while violating its assumptions.
A good start for getting an overview of all variables, their distributions and their pairwise correlations (provided you don't have more than around 20 variables) is the psych library and its pairs.panels function:
# read the shared data (tab-delimited, no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# scatterplot matrices with histograms on the diagonal and pairwise correlations
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

How to run a truncated and inflated Poisson model in R?

My data doesn't contain any zeros. The minimum value of my outcome, y, is 1, and that is the value that is inflated. My objective is to run a truncated and inflated Poisson regression model in R.
I already know how to run each type of regression separately, zero-truncated and zero-inflated. I want to know how to combine the two conditions into one model.
Thanks for your help.
For zero-inflated or zero-hurdle models, the standard approach is to use the pscl package. I also wrote a package fitting that kind of model here, but it is not yet mature and fully tested. Unless you have voluminous data, I still recommend pscl, which is more flexible, robust and better documented.
For zero-truncated models, you can have a look at the VGAM::vglm function. You might find useful information here.
Note that these are not the same distributional assumptions, so the two approaches are not interchangeable. Given the description of your dataset, I think you are looking for a zero-truncated model (since you do not observe zeros at all). With zero-inflated models, you decompose your observed pattern into zeros generated by a selection model and counts generated by a count-data model; that does not look like a pattern consistent with your dataset.
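A hedged sketch of both options, where the outcome y, predictor x and data frame dat are placeholders:

library(VGAM)   # zero-truncated (positive) Poisson via pospoisson()
library(pscl)   # zero-inflated / hurdle models

# Zero-truncated Poisson: appropriate when zeros cannot occur at all
fit_trunc <- vglm(y ~ x, family = pospoisson(), data = dat)
summary(fit_trunc)

# Zero-inflated Poisson, shown for comparison only: it assumes that some of
# the observed zeros come from a separate selection process
fit_zip <- zeroinfl(y ~ x, data = dat, dist = "poisson")
summary(fit_zip)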

Random Forest vs Logistic Regression

I am working on a dataset for a classification problem. One column of the dataset has around 11,000 missing values out of 300k total observations (it is a categorical variable, so imputation methods for numerical variables, such as mean imputation, are not applicable).
Is it advisable to go ahead with Random Forest rather than Logistic Regression, as Random Forest is not affected by missing values?
Also, do I need to take care of multicollinearity among the independent variables when using RF, or is there no need for that?
Although a random forest can handle noisy data and missing values, it is hard to say that it is better than logistic regression, because logistic regression can also be improved through other pre-processing (PCA or missing-data imputation) or ensemble methods.
I think RF does not have to take multicollinearity into account. This is because variables are randomly selected to build different trees and the results are aggregated; in this process the most important attributes are chosen, which can be interpreted as mitigating the problem of multicollinearity among variables with similar trends.
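As a hedged sketch of one way to compare the two approaches, using mice (discussed earlier in this thread) to impute the categorical column first; the data frame df, the two-level factor outcome target and the column names are placeholders:

library(mice)
library(randomForest)

# mice picks "polyreg" by default for factors with more than two levels,
# so the categorical column with the missing values is imputed directly
imp  <- mice(df, m = 5, seed = 1)
comp <- complete(imp, 1)      # one completed dataset, for illustration only

# Logistic regression vs random forest on the completed data
# ('target' is assumed to be a two-level factor)
fit_lr <- glm(target ~ ., data = comp, family = binomial)
fit_rf <- randomForest(target ~ ., data = comp, ntree = 500)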

Procedure to identify the most significant predictor variables using R when data has tremendous multicollinearity?

I have a dataset of around 36 predictor variables which I am using to predict a target variable. The target is a categorical variable with three classes, whereas the predictors include both numeric and categorical variables.
However, the data suffer from severe multicollinearity. I am trying to build a parsimonious logistic regression model, so I need to reduce the number of variables. When I reduce the number of variables based on VIF values, the results become counter-intuitive. On the other hand, I am not sure that PCR can solve the problem, as I need inferences about the sensitivity to each individual variable.
What is the better option to deal with such problem?
Which R packages can I use?
Will factor analysis solve the problem?
Or can we infer everything from PCR?
You first have to run an ANOVA/Kruskal-Wallis test to check which variables are well suited for your problem. For 36 variables I don't think you will need PCA, as this would make your model lose some explainability.
Remember that PCA reduces dimensionality and explains only part of the data's variance. Factor analysis will group variables into factors, in case you want to run a separate logistic regression for each factor of grouped variables.
If you want to build a parsimonious logistic regression, you can apply some regularization to increase its generalization properties, instead of reducing the number of variables.
You can use the following R packages: caret (logistic regression), ROCR (AUC), ggplot2 (plots), DMwR (outliers), mice (missing values).
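As a concrete, hedged example of the regularization route (glmnet is not on the list above; the data frame dat, its three-class factor outcome target and the predictor columns are placeholders):

library(glmnet)

# Dummy-code categorical predictors into a numeric design matrix
x <- model.matrix(target ~ ., data = dat)[, -1]
y <- dat$target

# Elastic-net-regularized multinomial logistic regression; cross-validation
# chooses the penalty strength, so correlated predictors are shrunk rather
# than manually dropped
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 0.5)
coef(cvfit, s = "lambda.min")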
Also, if you want to apply regularization, you can implement it from scratch, without a library, by adding a penalty term to the logistic cost function; the penalty shrinks the coefficients, adjusting the steepness of the sigmoid so that your classes are still classified correctly.
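The formula itself is not reproduced in the post; assuming the standard L2-regularized (binary) logistic regression cost with sigmoid hypothesis h_theta was meant, it reads:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}, \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}} $$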
