Parallel regression assumption on imputed (MICE) data with the Brant test in R

My data are ordinal, so I impute missing values with the polr method from the mice package. I now have multiple imputed datasets on which I can run an ordinal logistic regression. As the title says, I want to perform a Brant test to check the parallel regression assumption. How can I perform such a test on my imputed datasets?
olr <- with(imputed, polr(target ~ var1+var2))
olrsummary <- summary(pool(olr))
> brant(olr)
Error in formula.default(model) : invalid formula
> brant(olrsummary)
Error in temp.data[, name] : incorrect number of dimensions
I know I can take the first dataset with complete(imputed, 1) and use that for my Brant test, but that just doesn't seem right.
Thanks in advance

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a toy example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets are generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions are skipped for the sake of the example. The "test" variable is the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimates from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
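For readers wondering what the pooling step computes, here is a minimal base R sketch of Rubin's rules for a single coefficient. It is an illustration, not the mice implementation: pool_rubin is a hypothetical helper, the data are simulated, and the "imputations" are naive random refills standing in for real mice output.

```r
# Rubin's rules for one coefficient, written in base R.
# est: per-imputation point estimates; se: per-imputation standard errors.
# pool_rubin is a hypothetical helper, not part of mice.
pool_rubin <- function(est, se) {
  m     <- length(est)
  qbar  <- mean(est)                 # pooled point estimate
  ubar  <- mean(se^2)                # within-imputation variance
  b     <- var(est)                  # between-imputation variance
  total <- ubar + (1 + 1 / m) * b    # total variance
  c(estimate = qbar, se = sqrt(total))
}

# Simulated data with some missing predictor values (assumption: toy example).
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))
x_mis <- x
x_mis[sample(n, 40)] <- NA

# Toy stand-in for multiple imputation: refill missing x with random draws
# from the observed x. This is NOT a proper imputation model.
m <- 5
fits <- lapply(seq_len(m), function(i) {
  x_imp <- x_mis
  miss  <- is.na(x_imp)
  x_imp[miss] <- sample(x_mis[!miss], sum(miss), replace = TRUE)
  glm(y ~ x_imp, family = binomial(link = "logit"))
})

est <- sapply(fits, function(f) coef(f)["x_imp"])
se  <- sapply(fits, function(f) sqrt(diag(vcov(f)))["x_imp"])
pooled <- pool_rubin(est, se)
print(pooled)
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation variance is added on top.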
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where all observations are set to 3 for this variable. Normally, I would just pass my model to the x argument of the prediction function (as below), but with a pooled imputation model from mice the object is of class mipo, not glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that gets me closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
# fit the model on one completed dataset and predict with pregnant fixed at 3
fit_reg <- function(dat) {
  mod <- glm(test ~ pregnant + glucose + diastolic +
               triceps + age + bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant = 3))
  return(out)
}
# mice() takes its random seed through the seed argument
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
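Option (b) from the question, rebuilding predictions from pooled coefficients by hand, can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated (two predictors named after the pima columns), imputed_list is a hypothetical stand-in for what complete(newresult, "all") would return, and the sketch gives a point estimate only, with no pooled standard error.

```r
# "Combine then predict" by hand: average coefficients across imputations,
# then compute P(Y = 1) with pregnant fixed at 3.
set.seed(2)
n <- 300
dat <- data.frame(pregnant = rpois(n, 3), glucose = rnorm(n, 120, 30))
dat$test <- rbinom(n, 1, plogis(-6 + 0.1 * dat$pregnant + 0.04 * dat$glucose))

# Toy "imputations" (assumption): refill a fixed set of glucose values with
# random draws from the observed ones. Real code would use mice output.
miss <- sample(n, 60)
imputed_list <- lapply(1:20, function(i) {
  d <- dat
  d$glucose[miss] <- sample(dat$glucose[-miss], length(miss), replace = TRUE)
  d
})

# Fit the same logistic regression on every completed dataset.
fits <- lapply(imputed_list, function(d) {
  glm(test ~ pregnant + glucose, data = d, family = binomial)
})

# Pooled point estimates are the per-coefficient means across imputations.
beta_pooled <- rowMeans(sapply(fits, coef))

# Counterfactual prediction: set pregnant = 3 for every row, apply the pooled
# coefficients, and average the probabilities within and across datasets.
p_by_imp <- sapply(imputed_list, function(d) {
  d$pregnant <- 3
  X <- model.matrix(~ pregnant + glucose, data = d)
  mean(plogis(drop(X %*% beta_pooled)))
})
p_hat <- mean(p_by_imp)
print(p_hat)
```

The marginaleffects code above goes the other way round (predict within each dataset, then pool), which also yields pooled uncertainty estimates; the hand-rolled version here does not.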

Unequal Data for Model Compare in R

I'm fairly new to R and am trying to compare two models with the modelCompare function. However, the data set that I am working with is a bit large and has unevenly distributed missing values. When I try the following code for example:
Model_A <- lm(DV~var1*var2 + cont.var, data=df)
Model_C <- lm(DV~ cont.var, data=df)
modelCompare(Model_C,Model_A)
I get an error that the models have different N values and cannot be compared because data is differentially omitted between the two models. Is there an easy way to remove this variation, as I will be running a number of regression analyses with this data set?
What are you looking to compare? If you want to compare intercepts between the models just:
Model_A
Model_C
If you want to compare accuracy of the model, use a training and testing dataset!
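The differing N in the question arises because lm() silently drops rows with a missing value in any variable the model uses, so the larger model ends up fitted on fewer rows. One way around it is to restrict both models to the rows that are complete for the larger model. The sketch below uses simulated data and base R's anova() in place of modelCompare; whether modelCompare accepts the refitted models in the same way is an assumption.

```r
# Simulated data with unevenly distributed missing values (assumption: toy data
# using the variable names from the question).
set.seed(3)
n <- 100
df <- data.frame(DV = rnorm(n), var1 = rnorm(n),
                 var2 = rnorm(n), cont.var = rnorm(n))
df$var1[sample(n, 10)]    <- NA
df$cont.var[sample(n, 5)] <- NA

# Keep only rows that are complete for every variable used by the larger model.
vars  <- c("DV", "var1", "var2", "cont.var")
df_cc <- df[complete.cases(df[, vars]), ]

# Both models now see exactly the same observations.
Model_A <- lm(DV ~ var1 * var2 + cont.var, data = df_cc)
Model_C <- lm(DV ~ cont.var, data = df_cc)

# Nested-model F test on the common sample.
cmp <- anova(Model_C, Model_A)
print(cmp)
```

Building df_cc once and reusing it keeps the sample fixed across a whole series of regressions, at the cost of discarding rows the smaller models could have used.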

Is there an R function that performs LASSO regression on multiple imputed datasets and pools results together?

I have a dataset with 283 observations of 60 variables. My outcome variable (Diagnosis) is dichotomous and can be either of two diseases. I am comparing two diseases that often show considerable overlap, and I am trying to find the features that help differentiate them. I understand that LASSO logistic regression is the best solution for this problem; however, it cannot be run on an incomplete dataset.
So I imputed my missing data with the mice package in R and found that approximately 40 imputations is adequate for the amount of missing data I have.
Now I want to perform LASSO logistic regression on all 40 imputed datasets, and I am stuck at the part where I need to pool the results across these 40 datasets.
The with() function from mice does not work with glmnet.
# Impute database with missing values using MICE package:
imp<-mice(WMT1, m = 40)
#Fit regular logistic regression on imputed data
imp.fit <- glm.mids(Diagnosis~., data=imp,
family = binomial)
# Pool the results of all the 40 imputed datasets:
summary(pool(imp.fit),2)
The above works fine for logistic regression using glm(), but when I try the same approach for LASSO regression, I run into a problem:
# First perform cross validation to find optimal lambda value:
CV <- cv.glmnet(Diagnosis~., data = imp,
family = "binomial", alpha = 1, nlambda = 100)
When I try to perform cross-validation, I get this error message:
Error in as.data.frame.default(data) :
cannot coerce class ‘"mids"’ to a data.frame
Can somebody help me with this problem?
A thought:
Consider running the analyses on each of the 40 datasets.
Then, storing which variables are selected in each in a matrix.
Then, setting some threshold (e.g., selected in >50% of datasets).
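The three steps above can be sketched as follows. This is a hedged illustration, not a validated pooling method: the data are simulated, imputed_list is a hypothetical stand-in for the list that complete(imp, "all") would return, and the 50% threshold is simply the example value from the answer.

```r
library(glmnet)

# Simulated data: only v1 and v2 truly drive the outcome (assumption: toy data).
set.seed(4)
n <- 200
p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - 1.5 * X[, 2]))
dat <- data.frame(Diagnosis = factor(y), X)

# Toy stand-in for complete(imp, "all"): refill a fixed set of v3 values
# with random draws from the observed ones (NOT a proper imputation).
m <- 5
miss <- sample(n, 30)
imputed_list <- lapply(seq_len(m), function(i) {
  d <- dat
  d$v3[miss] <- sample(dat$v3[-miss], length(miss), replace = TRUE)
  d
})

# Step 1 and 2: run a cross-validated lasso on each dataset and record
# which coefficients are non-zero (rows = variables, columns = datasets).
selected <- sapply(imputed_list, function(d) {
  x  <- model.matrix(Diagnosis ~ ., data = d)[, -1]   # drop intercept column
  cv <- cv.glmnet(x, d$Diagnosis, family = "binomial", alpha = 1)
  b  <- as.matrix(coef(cv, s = "lambda.min"))[-1, 1]  # drop intercept
  b != 0
})

# Step 3: keep variables selected in more than half of the datasets.
sel_freq <- rowMeans(selected)
keep <- names(sel_freq)[sel_freq > 0.5]
print(sel_freq)
print(keep)
```

Note that glmnet takes a numeric matrix and a response vector rather than a formula and a data argument, which is why each completed data frame is passed through model.matrix() first.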

Ridge Regression accuracy in R

I have been stuck on this for some time and need some help. I am new to R and have never done ridge regression using glmnet. I am trying to learn ML via the Fashion-MNIST dataset (https://www.kaggle.com/zalando-research/fashionmnist). To streamline the training (to make sure it works before I attempt to train on the full dataset), I take a stratified random sample, which produces a training dataset of 60 observations (6 per label):
MNIST.sample.train = sample.split(MNIST.train$label, SplitRatio=0.001)
sample.train = MNIST.train[MNIST.sample.train,]
Next, I attempt to run ridge regression, using alpha=1...
x=model.matrix(label ~ . ,data=sample.train)
y=sample.train$label
rr.m <- glmnet(x,y,alpha=1, family="multinomial")
This seems to work. However, when I attempt to run the prediction, I get an error:
predict.rr.m <- predict(rr.m, MNIST.test, type = "class")
Error in cbind2(1, newx) %*% (nbeta[[i]]) : not-yet-implemented
method for %*% :
Ultimately, I am looking to obtain a single measure of the accuracy of the ridge regression. I believe that to do so, I must first obtain a prediction.
Any thoughts on how to fix my code would be greatly appreciated.
Kevin
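Two things in the code above are worth noting. In glmnet, alpha = 1 is the lasso penalty and alpha = 0 is ridge, and predict() expects newx to be a numeric matrix with the same columns as the training matrix, not a data frame. The sketch below shows both points on the built-in iris data, used here as a small stand-in for Fashion-MNIST; the split and the use of cv.glmnet to choose lambda are assumptions, not part of the question.

```r
library(glmnet)

# Small stand-in dataset (assumption): iris instead of Fashion-MNIST.
set.seed(5)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Build train and test design matrices the same way so the columns match.
x_train <- model.matrix(Species ~ ., data = train)[, -1]
x_test  <- model.matrix(Species ~ ., data = test)[, -1]

# alpha = 0 gives ridge; cross-validation picks the penalty strength.
fit <- cv.glmnet(x_train, train$Species, family = "multinomial", alpha = 0)

# newx must be a matrix; type = "class" returns the predicted labels.
pred <- predict(fit, newx = x_test, s = "lambda.min", type = "class")

# Single accuracy figure: share of test rows classified correctly.
accuracy <- mean(as.vector(pred) == as.character(test$Species))
print(accuracy)
```

For the original data, the equivalent step would be to run model.matrix() on MNIST.test with the same formula before calling predict().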

Error in missing value imputation using MICE package

I have a huge dataset (4M x 17) that has missing values. Two columns are categorical; the rest are numerical. I want to use the mice package for missing value imputation. This is what I tried:
> testMice <- mice(myData[1:100000,]) # runs fine
> testTot <- predict(testMice, myData)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "mids"
Running the imputation on whole dataset was computationally expensive, so I ran it on only the first 100K observations. Then I am trying to use the output to impute the whole data.
Is there anything wrong with my approach? If yes, what should I do to make it correct? If no, then why am I getting this error?
Neither mice nor Hmisc provides the parameter estimates from the imputation process. Both Amelia and imputeMulti do. In both cases, you can extract the parameter estimates and use them for imputing your other observations.
Amelia assumes your data are distributed as a multivariate normal (e.g. X \sim N(\mu, \Sigma)).
imputeMulti assumes that your data are distributed as a multivariate multinomial. That is, the complete cell counts are distributed X \sim M(n,\theta), where n is the number of observations.
Fitting can be done as follows, via example data. Examining parameter estimates is shown further below.
library(Amelia)
library(imputeMulti)
data(tract2221, package= "imputeMulti")
test_dat2 <- tract2221[, c("gender", "marital_status","edu_attain", "emp_status")]
# fitting
IM_EM <- multinomial_impute(test_dat2, "EM",conj_prior = "non.informative", verbose= TRUE)
amelia_EM <- amelia(test_dat2, m= 1, noms= c("gender", "marital_status","edu_attain", "emp_status"))
The parameter estimates from the amelia function are found in amelia_EM$mu and amelia_EM$theta.
The parameter estimates in imputeMulti are found in IM_EM@mle_x_y and can be accessed via the get_parameters method.
imputeMulti has noticeably higher imputation accuracy for categorical data relative to either of the other 3 packages, though it only accepts multinomial (eg. factor) data.
All of this information is in the currently unpublished vignette for imputeMulti. The paper has been submitted to JSS and I am awaiting a response before adding the vignette to the package.
You don't use predict() with mice. It's not a model you're fitting per se. Your imputed results are already there for the 100,000 rows.
If you want data for all rows then you have to put all rows in mice. I wouldn't recommend it though, unless you set it up on a large cluster with dozens of CPU cores.
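To make the point about predict() concrete: the imputed values live inside the mids object and are pulled out with complete(). A minimal sketch, using the small nhanes dataset that ships with mice rather than the 4M-row data from the question:

```r
library(mice)

# nhanes is a 25-row example dataset bundled with mice; it contains NAs.
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)

# One completed dataset: the original rows with the first set of imputations.
first <- complete(imp, 1)

# All completed datasets stacked in long format (adds .imp and .id columns).
all_long <- complete(imp, "long")

print(head(first))
print(dim(all_long))
```

Only rows passed to mice() are imputed, so rows outside the 100K subset in the question remain incomplete.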
