How to handle missing values (NA's) in a column in lmer - r

I would like to use na.pass for na.action when working with lmer. Some observations in the data set have NA values in some columns. I just want to control for the variables that contain the NAs. It is very important that the size of the data set stays the same after controlling for the fixed effects.
I think I have to work with na.action in lmer(). I am using the following model:
baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
  sex_dummy + region_dummy + political_position_dummy + (1|Family), data = baseline_df)
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
My data: as you see below, there are quite a lot of NA's in all the control variables. So "throwing" away all of these observations is no option!
One example:
nat_dummy
1 : 335
2 : 19
NA's: 252
My questions:
1.) How can I include all of my control variables (expressed in multiple columns) to the model without kicking out observations (expressed in rows)?
2.) How does lmer handle the missing variables in all the columns?

To answer your second question: lmer fits by (restricted) maximum likelihood and, with its default na.action (na.omit), drops every row that has a missing value in any variable used in the model, so rows with missing predictors are kicked out. To avoid this, as others have suggested, you can use multiple imputation instead. Below is an example with the airquality dataset that ships with R, since your question doesn't include your data. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation and broom.mixed for summarizing the results.
#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
We can inspect the missing patterns with the next code:
#### Missing Patterns ####
md.pattern(airquality)
Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
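The same counts can be checked without the plot. A minimal base-R sketch, using nothing but is.na() and complete.cases() on the built-in airquality data:

```r
# Count missing values per column of the built-in airquality data
na_per_column <- colSums(is.na(airquality))
print(na_per_column)

# Rows missing both Ozone and Solar.R
both_missing <- sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R))
print(both_missing)

# Rows with no missing values at all
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```

This only reproduces the counts; md.pattern() is still the more convenient view when there are many columns.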
To fill in the gaps, we can impute the data with 5 imputations (the default, so you don't have to include the m=5 part, but I specify it explicitly for your understanding).
#### Impute ####
imp <- mice(airquality,
m=5)
After that, you fit your model to the imputations like below. The with() function takes your imputed data and fits the regression model to each imputed data set. This model is a bit erroneous and comes back singular, but I just use it because it's the quickest dataset I could remember with missing values included.
#### Fit With Imputations ####
fit <- with(imp,
lmer(Solar.R ~ Ozone + (1|Month)))
From there you can pool and summarize your results like so:
#### Pool and Summarise ####
pool <- pool(fit)
summary(pool)
Obviously with the model being singular this would be meaningless, but with a proper fit model, your summary should look something like this:
         term    estimate  std.error statistic       df     p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2       Ozone   0.8051218  0.2190679  3.675216 135.4051 0.000341446
As Ben already mentioned, you need to also determine why your data is missing. If there are non-random reasons for their missingness, this would require some consideration, as this can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic:
https://www.gerkovink.com/miceVignettes/
Edit
You asked in the comments about adding in random effects estimates. I'm not sure why this isn't already something ported into the respective packages already, but the mitml package can help fill that gap. Here is the code:
#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
extra.pars = T)
Which gives you both fixed and random effects for imputed lmer objects:
Call:
testEstimates(model = as.mitml.result(fit), extra.pars = T)
Final parameter estimates and inferences obtained from 5 imputed data sets.
            Estimate Std.Error t.value     df P(>|t|)   RIV   FMI
(Intercept)  146.575    14.528  10.089 68.161   0.000 0.320 0.264
Ozone          0.921     0.254   3.630 90.569   0.000 0.266 0.227
                           Estimate
Intercept~~Intercept|Month  112.587
Residual~~Residual         7274.260
ICC|Month                     0.015
Unadjusted hypothesis test as appropriate in larger samples.
And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead, which gives you just the random effects:
                               Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual         7.274260e+03
ICC|Month                  1.522285e-02

Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart, it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
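To make the difference concrete, here is a small base-R sketch of what na.action does when a model frame is built. It uses model.frame() on airquality rather than lmer itself; as far as I know lmer builds its model frame in the same way, but treat that as an assumption.

```r
# model.frame() is the base-R step where na.action is applied
mf_omit <- model.frame(Solar.R ~ Ozone + Month, data = airquality,
                       na.action = na.omit)
mf_pass <- model.frame(Solar.R ~ Ozone + Month, data = airquality,
                       na.action = na.pass)

print(nrow(airquality))  # all rows in the data
print(nrow(mf_omit))     # rows with an NA in any model variable are dropped
print(nrow(mf_pass))     # all rows kept, NAs included
print(anyNA(mf_pass))    # the NAs are handed on to the fitting code as they are
```

So na.pass keeps the row count only in the sense that the NAs are still sitting in the data that reaches the fitting code.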
For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
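The pooling step can be sketched by hand for a single coefficient using Rubin's rules. The function name pool_rubin and the numbers fed to it are made up for illustration; mice::pool() does this for you and additionally works out degrees of freedom, which this sketch leaves out.

```r
# Pool one coefficient across m imputed-data fits using Rubin's rules.
# `est` and `se` are hypothetical per-imputation estimates and standard errors.
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)              # pooled point estimate
  w <- mean(se^2)                 # within-imputation variance
  b <- var(est)                   # between-imputation variance
  total <- w + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, std.error = sqrt(total))
}

# Made-up results from five imputed data sets
pooled <- pool_rubin(est = c(0.80, 0.92, 0.85, 0.78, 0.90),
                     se  = c(0.22, 0.25, 0.21, 0.23, 0.24))
print(pooled)
```

The between-imputation term is what single imputation ignores, which is why its standard errors come out too small.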
mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.
There are a bunch of questions you should ask/understand the answers to before you embark on any kind of analysis with missing data:
what kind of missingness do I have ("completely at random" [MCAR], "at random" [MAR], "not at random" [MNAR])? Can my missing-data strategy lead to bias if the data are missing not-at-random?
have I explored the pattern of missingness? Are there subsets of rows/columns that I can drop without much loss of information (e.g. if some column(s) or row(s) have mostly missing information, imputation won't help very much)
mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in @ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand what methods mice is using on your data, and that the defaults make sense for your case.
I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get ahold of a copy.
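To see which methods mice would pick before running anything expensive, a dry run with maxit = 0 is one option. A minimal sketch on airquality (the object name setup is just for illustration):

```r
library(mice)

# maxit = 0 sets up the imputation without iterating (a "dry run")
setup <- mice(airquality, maxit = 0, printFlag = FALSE)

# One method per column; "" means the column has nothing to impute
print(setup$method)

# Which columns are used as predictors for which (rows are the targets)
print(setup$predictorMatrix)
```

Both objects can be edited and passed back to mice() through its method and predictorMatrix arguments if the defaults don't suit your data.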

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I already developed, using a fictive example from pima data from faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assigned missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets were generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and the distributions of the original and imputed data is skipped for the sake of the example. The "test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where all observations are set to 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with mice, the object class is mipo and not glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
  mod <- glm(test ~ pregnant + glucose + diastolic +
               triceps + age + bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant = 3))
  return(out)
}
# seed = makes the imputations reproducible
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
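Option (b) from the question can also be sketched in base R: take the pooled coefficients and apply the inverse logit yourself. The coefficient values and the reduced two-predictor model below are made up for illustration; in practice the numbers would come from the estimate column of summary(pool(model)). This gives point predictions only, with no standard errors.

```r
# Hypothetical pooled coefficients (log-odds scale); the numbers are made up.
beta <- c("(Intercept)" = -1.0, pregnant = 0.2, glucose = 0.01)

# Covariate rows for which predictions are wanted, with pregnant fixed at 3
newdata <- data.frame(pregnant = 3, glucose = c(90, 120, 150))

# Build the design matrix, then line the coefficients up by column name
X <- model.matrix(~ pregnant + glucose, data = newdata)

# Linear predictor, then inverse logit to get P(Y = 1)
eta <- drop(X %*% beta[colnames(X)])
prob <- plogis(eta)
print(prob)
```

Indexing beta by colnames(X) guards against the coefficients and the design matrix being in a different order.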

How to proceed with descriptive statistics (median, IQR, frequencies, proportions etc) after multiple imputation using MICE

I have been performing multiple imputation using the mice-package (van Buuren) in R, with m = 50 (50 imputation datasets) and 20 iterations for roughly 9 variables with missing data (MAR = missing at random) ranging from 5-13 %. After this, I want to proceed with estimating descriptive statistics for my dataset (i.e. not use complete case analysis only for descriptive statistics, but also compare the results with the descriptive statistics from my imputation). So my question is now, how to proceed.
I know that the correct procedure for dealing with MICE-data is:
Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting in an object of class mira;
Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class mipo;
Optionally, compare pooled estimates from different scientific models by the D1() or D3() functions.
My problem is that I do not understand how to apply this theory on my data. I have done:
#Load package:
library(mice)
library(dplyr)
#Perform imputation:
Imp_Data <- mice(MY_DATA, m=50, method = "pmm", maxit = 20, seed = 123)
#Make the imputed data in long format:
Imp_Data_Long <- complete(Imp_Data, action = "long", include = FALSE)
I then assumed that this procedure is correct for getting the median of the BMI variable, where the .imp variable is the number of the imputation dataset (i.e. from 1-50):
BMI_Medians_50 <- Imp_Data_Long %>% group_by(.imp, Smoker) %>% summarise(Med_BMI = median(BMI))
BMI_Median_Pooled <- mean(BMI_Medians_50$Med_BMI)
I might have understood things completely wrong, but I have very much tried to understand the right procedure and found very different procedures here on StackOverflow and StatQuest.
Well, in general you perform multiple imputation (rather than single imputation) because you want to account for the uncertainty that comes with imputing.
The missing data is ultimately lost; we can only estimate what the real data might look like. With multiple imputation we generate multiple estimates of what the real data could look like. Our goal is to have something like a probability distribution for the imputed values.
For your descriptive statistics you do not need pooling with Rubin's rules (these are important for standard errors and other metrics for linear models). You would calculate your statistics on each of your m = 50 imputed datasets separately and pool / summarise them with your desired metrics.
What you want to achieve is to give your reader information about the uncertainty that comes with the imputation (and an estimate of the bounds within which the imputed values most likely lie).
Take the mean as an example of a descriptive statistic. Here you could e.g. report the lowest and the highest mean over the imputed datasets. You can also report the mean of these means and the standard deviation of the mean over the imputed datasets.
Your complete case analysis would just provide e.g. 3.3 as the mean for a variable. But the same mean might vary quite a lot across your m = 50 imputed datasets, e.g. from 1.1 up to 50.3. This gives you the valuable information that you should treat the 3.3 from your complete case analysis with a lot of care, and that there is a lot of uncertainty in general around this kind of statistic for this dataset.
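A minimal base-R sketch of that idea, using a made-up long-format data frame in place of the output of complete(Imp_Data, action = "long"). The column names .imp and BMI follow the question; the values are invented.

```r
# Made-up long-format data: 3 imputations (.imp) of 4 BMI values each
imp_long <- data.frame(
  .imp = rep(1:3, each = 4),
  BMI  = c(22, 24, 26, 30,
           22, 25, 27, 30,
           22, 23, 26, 30)
)

# One median per imputed data set
med_by_imp <- tapply(imp_long$BMI, imp_long$.imp, median)

# Summarise how much the statistic moves across imputations
summary_across <- c(
  lowest  = min(med_by_imp),
  highest = max(med_by_imp),
  mean    = mean(med_by_imp),
  sd      = sd(med_by_imp)
)
print(summary_across)
```

A wide gap between lowest and highest is the signal that the statistic depends heavily on the imputed values.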

Error in glsEstimate(object, control = control) : computed "gls" fit is singular, rank 19

First time asking in the forums, this time I couldn't find the solutions in other answers.
I'm just starting to learn to use R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between different insect species (SP) and temperature (T), explanatory variables
and the area of the femur of the resulting adult (Femur.area) response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
No error, but when I want to model variance with gls,
modelo_varPower <- gls(Femur.area ~ T*SP,
weights = varPower(),
data = Datos
)
I get the following errors...
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro test of normality, could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely, I've run this model using another explanatory variable and had no errors. All I can read in the forums has to do with multiple samplings along a period of time, and that's not my case.
Since the only difference is the response variable, I'm uploading an image of how the table looks in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area~SP*T,data=Datos)) you'll see some NA values for the missing interactions). gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than Datos.)
dd3 <- transform(na.omit(dd), SPT=droplevels(interaction(SP,T)))
library(nlme)
gls(Femur.area~SPT,weights=varPower(form=~fitted(.)),data=dd3)
If you want the main effects and the interaction term and the power-law variance that's possible, but it's harder.
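A quick way to check whether a data set has this problem is to cross-tabulate the two factors. The sketch below uses a made-up toy data set (dd_toy) in which one species/temperature combination was never observed; it only shows the diagnosis and the combined factor, not the gls() fit.

```r
# Toy data (made up): species B was never measured at temperature 25
dd_toy <- data.frame(
  SP = factor(c("A", "A", "A", "A", "B", "B")),
  T  = factor(c(20, 20, 25, 25, 20, 20)),
  Femur.area = c(1.0, 1.2, 1.5, 1.7, 2.0, 2.1)
)

# A zero in the cross-tabulation marks an empty SP:T cell
cell_counts <- table(dd_toy$SP, dd_toy$T)
print(cell_counts)

# lm() reports the inestimable interaction as NA instead of failing
fit_lm <- lm(Femur.area ~ SP * T, data = dd_toy)
print(coef(fit_lm))

# Combined factor with the empty cell dropped: 3 levels instead of 4
dd_toy$SPT <- droplevels(interaction(dd_toy$SP, dd_toy$T))
print(nlevels(dd_toy$SPT))
```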

Error in missing value imputation using MICE package

I have a huge data (4M x 17) that has missing values. Two columns are categorical, rest all are numerical. I want to use MICE package for missing value imputation. This is what I tried:
> testMice <- mice(myData[1:100000,]) # runs fine
> testTot <- predict(testMice, myData)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "mids"
Running the imputation on whole dataset was computationally expensive, so I ran it on only the first 100K observations. Then I am trying to use the output to impute the whole data.
Is there anything wrong with my approach? If yes, what should I do to make it correct? If no, then why am I getting this error?
Neither mice nor Hmisc provides the parameter estimates from the imputation process. Both Amelia and imputeMulti do. In both cases, you can extract the parameter estimates and use them for imputing your other observations.
Amelia assumes your data are distributed as a multivariate normal (e.g. X \sim N(\mu, \Sigma)).
imputeMulti assumes that your data are distributed as a multivariate multinomial distribution. That is, the complete cell counts are distributed (X \sim M(n,\theta)), where n is the number of observations.
Fitting can be done as follows, via example data. Examining parameter estimates is shown further below.
library(Amelia)
library(imputeMulti)
data(tract2221, package= "imputeMulti")
test_dat2 <- tract2221[, c("gender", "marital_status","edu_attain", "emp_status")]
# fitting
IM_EM <- multinomial_impute(test_dat2, "EM",conj_prior = "non.informative", verbose= TRUE)
amelia_EM <- amelia(test_dat2, m= 1, noms= c("gender", "marital_status","edu_attain", "emp_status"))
The parameter estimates from the amelia function are found in amelia_EM$mu and amelia_EM$theta.
The parameter estimates in imputeMulti are found in IM_EM@mle_x_y and can be accessed via the get_parameters method.
imputeMulti has noticeably higher imputation accuracy for categorical data relative to any of the other 3 packages, though it only accepts multinomial (e.g. factor) data.
All of this information is in the currently unpublished vignette for imputeMulti. The paper has been submitted to JSS and I am awaiting a response before adding the vignette to the package.
You don't use predict() with mice. It's not a model you're fitting per se. Your imputed results are already there for the 100,000 rows.
If you want data for all rows then you have to put all rows in mice. I wouldn't recommend it though, unless you set it up on a large cluster with dozens of CPU cores.
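One thing that may be worth trying is the ignore argument of mice(), which, as I understand the documentation, builds the imputation model from a subset of rows while still imputing all of them. The sketch below assumes a mice version recent enough to have ignore and uses airquality as a stand-in; I have not checked how much time it would save on 4M rows.

```r
library(mice)

# Build the imputation model on the first 100 rows only; the remaining
# rows are flagged as ignored for model building but still get imputed.
ignore_rows <- seq_len(nrow(airquality)) > 100

imp_all <- mice(airquality, m = 2, ignore = ignore_rows,
                printFlag = FALSE, seed = 1)

# Completed copy of the full data from the first imputation
completed <- complete(imp_all, 1)
print(anyNA(completed))
print(nrow(completed))
```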

Variable selection methods

I have been doing variable selection for a modeling problem.
I have used trial and error for the selection (adding / removing a variable), keeping changes that decrease the error. However, as the number of variables grows into the hundreds, manual variable selection can no longer be performed: the model takes half an hour to compute, rendering the task impossible.
Would you happen to know of any packages other than regsubsets from the leaps package? (When tested with the same trial-and-error variables it produced a higher error, and it did not include some variables which were linearly dependent, excluding some valuable variables.)
You need a better (i.e. not flawed) approach to model selection. There are plenty of options, but one that should be easy to adapt to your situation would be using some form of regularization, such as the Lasso or the elastic net. These apply shrinkage to the sizes of the coefficients; if a coefficient is shrunk from its least squares solution to zero, that variable is removed from the model. The resulting model coefficients are slightly biased but they have lower variance than the selected OLS terms.
Take a look at the lars, glmnet, and penalized packages
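To illustrate how shrinkage can remove a variable, here is a small base-R sketch of soft-thresholding, which is the lasso solution in the textbook special case of orthonormal predictors. That special case is an assumption made for illustration; the packages above handle the general case with iterative algorithms. The function name and the coefficient values are made up.

```r
# Soft-thresholding: the lasso solution for each coefficient when the
# predictors are orthonormal (a textbook special case, assumed here).
soft_threshold <- function(b_ols, lambda) {
  sign(b_ols) * pmax(abs(b_ols) - lambda, 0)
}

b_ols <- c(x1 = 3.0, x2 = -0.5, x3 = 1.2)   # made-up least squares estimates
b_lasso <- soft_threshold(b_ols, lambda = 1)
print(b_lasso)

# Coefficients shrunk exactly to zero drop out of the model
print(names(b_lasso)[b_lasso != 0])
```

Every coefficient moves towards zero by lambda, and any that would cross zero is set to zero, which is how the penalty doubles as variable selection.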
Try using the stepAIC function of the MASS package.
Here is a really minimal example:
library(MASS)
data(swiss)
str(swiss)
lm <- lm(Fertility ~ ., data = swiss)
lm$coefficients
## (Intercept) Agriculture Examination Education Catholic
## 66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
## Infant.Mortality
## 1.0770481
st1 <- stepAIC(lm, direction = "both")
st2 <- stepAIC(lm, direction = "forward")
st3 <- stepAIC(lm, direction = "backward")
summary(st1)
summary(st2)
summary(st3)
You should try the 3 directions and check which model works better with your test data.
Read ?stepAIC and take a look at the examples.
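One caveat on the forward direction: as far as I can tell, forward selection started from the full model has nothing left to add, so it needs a smaller starting model plus a scope. A minimal sketch on the same swiss data (the object names full, null, fwd and bwd are just for illustration):

```r
library(MASS)

full <- lm(Fertility ~ ., data = swiss)
null <- lm(Fertility ~ 1, data = swiss)

# Forward selection: start from the intercept-only model and allow
# terms up to those in the full model to be added.
fwd <- stepAIC(null, scope = list(lower = ~ 1, upper = formula(full)),
               direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop terms.
bwd <- stepAIC(full, direction = "backward", trace = 0)

print(formula(fwd))
print(formula(bwd))
```

trace = 0 just silences the step-by-step printout; leave it at the default to see which term is added or dropped at each step.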
EDIT
It's true stepwise regression isn't the greatest method. As mentioned in GavinSimpson's answer, lasso regression is a better/much more efficient method. It's much faster than stepwise regression and will work with large datasets.
Check out the glmnet package vignette:
http://www.stanford.edu/~hastie/glmnet/glmnet_alpha.html
