Unequal Data for Model Compare in R

I'm fairly new to R and am trying to compare two models with the modelCompare function. However, the data set that I am working with is a bit large and has unevenly distributed missing values. When I try the following code for example:
Model_A <- lm(DV~var1*var2 + cont.var, data=df)
Model_C <- lm(DV~ cont.var, data=df)
modelCompare(Model_C,Model_A)
I get an error saying the models have different N values and cannot be compared, because rows with missing data are dropped differently for each model. Is there an easy way to make both models use the same observations, as I will be running a number of regression analyses on this data set?

What are you looking to compare? If you just want to compare the intercepts between the models, print them:
Model_A
Model_C
If you want to compare the predictive accuracy of the models, use training and testing datasets!
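If the mismatch comes from missing values, one common fix is to restrict both models to the rows that are complete on every variable used, so both are fit to the same N. A minimal sketch, reusing the variable names from the question:
df_cc <- df[complete.cases(df[, c("DV", "var1", "var2", "cont.var")]), ]  # keep only fully observed rows
Model_A <- lm(DV~var1*var2 + cont.var, data=df_cc)
Model_C <- lm(DV~cont.var, data=df_cc)
modelCompare(Model_C, Model_A)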

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I have already developed, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and load data-------------#
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets are generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions is skipped for the sake of the example. The "test" variable is the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model <- with(newresult,
              glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
                  family = binomial(link = "logit")))
Combine the regression estimates from the 20 imputed-data models into a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function lets you generate predicted values with variables fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (# of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation is set to 3 on this variable. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with mice, the object class is mipo, not glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) pregnant
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
    mod <- glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
               data = dat, family = binomial)
    out <- predictions(mod, newdata = datagrid(pregnant = 3))
    return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
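If this pooling behaves as in the marginaleffects documentation, the pooled predictions can then be inspected with summary(mod_imputation). Regarding the asker's option (b), a rough manual sketch is also possible: pool the coefficients with mice, then compute predicted probabilities by hand with pregnant fixed at 3. Note this sketch takes covariate values from the first completed dataset only, so it ignores between-imputation variability in the predictors; an approximation rather than a full answer.
pooled <- summary(pool(model))
beta <- pooled$estimate                        # pooled coefficients, in model term order
newdata <- complete(newresult, 1)              # covariates from the first completed dataset
newdata$pregnant <- 3                          # fix pregnant at 3 for every observation
X <- model.matrix(~ pregnant + glucose + diastolic + triceps + age + bmi, data = newdata)
p_hat <- plogis(X %*% beta)                    # predicted P(Y=1) for each observation
mean(p_hat)                                    # marginally standardized estimate at pregnant = 3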

How to perform Post-Hoc tests (including Tukey and Games-Howell) on imputed data using MICE

I would like to perform post-hoc tests on imputed data using MICE in R.
Typically, MICE imputes the data, which is converted to long format to calculate total scores and can then be converted back into a mids object. The analysis is then run on the mids object, producing a mira object, after which the results can be pooled.
However, I am not able to get it running for post-hoc tests, including Tukey and Games-Howell. Would someone be able to help?
IMP <- mice(data, m=5, maxit=10)
IMP_long <- data.frame(complete(IMP, include=TRUE, action='long'))
IMP_mids <- as.mids(IMP_long, where = NULL, .imp='.imp', .id='.id')
fit <- with(IMP_mids, expr=lm(total_score ~ GroupingVariable))
The grouping variable consists of 3 groups which I would like to compare pairwise. Namely 1vs2, 2vs3 and 1vs3.
summary(pool(fit))
-> this gives comparisons between two of the groups, relative to the intercept (the reference group); the same happens when setting contrasts before creating the model.
Does anyone know how to compare the three groups in one analysis with Tukey and/or Games-Howell?
Thanks in advance!!
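Not a full answer, but a possible starting point: as the emmeans example in the previous question suggests, emmeans appears to accept mira objects produced by with(), so pooled pairwise comparisons with a Tukey adjustment might be obtained along these lines. Games-Howell, which drops the equal-variances assumption, is not covered by this sketch.
library(emmeans)
emm <- emmeans(fit, ~ GroupingVariable)  # marginal means pooled across imputations
pairs(emm, adjust = "tukey")             # pairwise comparisons: 1vs2, 1vs3, 2vs3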

quick way to fit the same mixed model to multiple dataframes or subsets in R?

I have a dataset containing growth data for multiple crops. The experimental treatment is within-crop (i.e. I am not comparing between crops) and has a randomised block design, with repeated measures over time. I would like to run the same mixed model on each different crop in R:
model<-lmer(response~treatment*time+(1|block), data=data)
My dataset contains 20 crops and I want to extract all the fixed intercepts and coefficients, so it is annoying to run the model on each crop separately. Somehow I got the impression from some other posts that lmList, given the formula below, would run a separate but common mixed model on each crop. However, having now checked the help page and the output in more detail, I think it just fits a fixed-effects linear model (lm).
model_set <- lme4::lmList(response~treatment*time+(1|block)|crop,data=data)
I can't find any nice posts on setting up a loop to run the same mixed effect model on multiple datasets and extract coefficients, or if there is any existing function for this. It would be great if someone could recommend a quick simple solution.
As an example dataset I suggest the lme4 sleepstudy data, but let's just make a fake 'group' factor to mimic my 'crop' factor:
library(lme4)
data(sleepstudy)
library(car)
sleepstudy$group<-recode(sleepstudy$Subject,
" c('308','309','310','330','331','332','333','334') = 'group1';
c('335','337','349','350','351','352','369','370','371','372') = 'group2'")
model<-lmer(Reaction~Days+(1|Subject), data=sleepstudy)
How can I automate running the same model on each group separately and then extracting the intercept and coefficients?
Many thanks in advance for any suggestions.
Today I stumbled across the split function, which means I can answer my own question. split subsets a dataframe into a list of dataframes based on the levels of a specified factor. A simple loop can then be set up to run the same model on each dataframe in the list.
list_df <- split(sleepstudy, sleepstudy$group)  # split example dataset by group factor
results_df <- as.data.frame(matrix(ncol=2, nrow=length(list_df)))  # make an empty dataframe
colnames(results_df) <- c("intercept","Days")  # give the dataframe column names
for (i in 1:length(list_df)) {  # loop over the dataframes in the list
  mod <- lmer(Reaction~Days+(1|Subject), data=list_df[[i]])  # mixed model
  results_df[i,] <- fixef(mod)  # extract fixed-effect coefficients to the dataframe
  rownames(results_df)[i] <- names(list_df)[i]  # name the row after the group it came from
}
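The same idea can be written more compactly by applying the model over the split list, assuming the same model as above:
fits <- lapply(split(sleepstudy, sleepstudy$group),
               function(d) lmer(Reaction~Days+(1|Subject), data=d))  # one fit per group
results_df <- t(sapply(fits, fixef))  # one row per group, columns for intercept and Days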

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses to calibrate a large number of variables (% reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to resolve. My provisional code is shown below:
First I split my data into training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set and extract and write out the non-zero coefficients at lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x=x.train, family="gaussian", nfolds=5,
                        standardize=TRUE, alpha=1)
coef(cv.lasso.1, s=cv.lasso.1$lambda.min) # Using lambda.min
install.packages("broom")
library(broom)
tidy_coefs <- tidy(coef(cv.lasso.1, s="lambda.min")) # avoid masking base c()
write.csv(tidy_coefs, file = "results.csv")
Finally, I use the predict function, applying the object cv.lasso.1 (the model obtained previously) to the variables of the testing set (x.test), in order to get predictions of the variable, and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.test, type="response", s="lambda.min")
cor.test(x=c(predict.1.2), y=c(y.test))
This is simplified code and there has been no problem so far. The point is that I would like to make a loop (of one hundred repetitions) over the whole code and get, for each repetition, the non-zero coefficients of the cross-validated model as well as the correlation coefficient between predicted and actual values (for the testing set). I've tried but couldn't get any clear results. Can someone give me a hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables with the most variation within a variable AND between variables. But PCA does not consider your outcome at all, so with a poor model design it will pick the least correlated data in your repository, which may not be predictive; you should therefore be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential for making these models work properly: they put your variables on a common scale by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those settings as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with a lot of data, try LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class; otherwise you can leave them out, and lda will estimate the class probabilities from the training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally sound way. Building the most robust model via repeated model fitting is known as cross-validation. There is a cv.glm function in the boot package which can help you take care of this in a safe way.
You can use the following as a rough guide:
require(boot)
yourGLM <- glm(outcome ~ ., data = yourData, family = "gaussian") # fit once on the full data
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)   # 100-fold cross-validation
Here K=100 specifies 100-fold cross-validation: the folds are random samples of your OBSERVATIONS, not your variables.
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials, without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping; it is powerful and available for many different model types.
Not as much code as you might have hoped for, but this should point you in a decent direction.
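That said, if you do want the literal loop the question asks for, a rough sketch (assuming, as in the question, that the response is in mydata$X1 and the predictors are in columns 3:2153) could look like this:
library(glmnet)
n_rep <- 100
coef_list <- vector("list", n_rep)  # non-zero coefficients from each repetition
cors <- numeric(n_rep)              # test-set correlation from each repetition
for (i in 1:n_rep) {
  train_ind <- sample(seq_len(nrow(mydata)), size = floor(0.70 * nrow(mydata)))
  x.train <- data.matrix(mydata[train_ind, 3:2153])
  x.test <- data.matrix(mydata[-train_ind, 3:2153])
  y.train <- mydata$X1[train_ind]
  y.test <- mydata$X1[-train_ind]
  cv.fit <- cv.glmnet(y=y.train, x=x.train, family="gaussian", nfolds=5, alpha=1)
  cf <- coef(cv.fit, s="lambda.min")  # sparse coefficient matrix at lambda.min
  nz <- which(cf[, 1] != 0)
  coef_list[[i]] <- data.frame(term = rownames(cf)[nz], estimate = cf[nz, 1])
  pred <- predict(cv.fit, newx=x.test, s="lambda.min")
  cors[i] <- cor(as.vector(pred), y.test)
}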

Which MICE imputed data set to use in succeeding analysis?

I am not sure if I need to provide a reproducible example for this, as it is more of a general question. Anyway, after running, the mice package returns m multiply imputed datasets. We can extract the data using the complete() function.
I am confused, however, about which dataset I should use for my subsequent analysis (descriptive estimation, model building, etc.).
Questions:
1. Do I need to extract a specific dataset, e.g. complete(imp, 1)? Or shall I use the whole imputed dataset, e.g. complete(imp, "long", inc = TRUE)?
2. If it is the latter, complete(imp, "long", inc = TRUE), how do I compute descriptives like means, proportions, etc.? For example, I will analyze the long data using SPSS. Shall I split the data according to the m number of imputed datasets and manually average the results? Is that how it should be done?
Thanks for your help.
You should run your statistical analysis on each of the m imputed data sets individually, then pool the results together. This allows you to take into account the additional uncertainty introduced by the imputation procedure. MICE has this functionality built in. For example, if you wanted to do a simple linear model you would do this:
fit <- with(imp, lm(y ~ x1 + x2))
est <- pool(fit)
summary(est)
Check out ?pool and ?mira
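The same fit-then-pool pattern covers simple descriptives too. For example, a pooled mean of a variable y can be obtained from an intercept-only model (a sketch, assuming your mids object is called imp):
fit_mean <- with(imp, lm(y ~ 1))
summary(pool(fit_mean)) # the pooled intercept is the pooled mean of y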
Multiple imputation comprises the following three steps:
1. Imputation
2. Analysis
3. Pooling
In the first step, m imputed datasets are generated; in the second step, a data analysis such as regression is applied to each dataset separately. Finally, in the third step, the analysis results are pooled into a final result. Various pooling techniques are implemented for different parameters.
Here is a nice link describing the pooling in detail - mice Vignettes
