So I want to do a regression analysis, and to account for missing data in my data set I imputed the NAs with mice. So far everything's fine: I ran mice() with m=5 and now have the 5 imputed datasets. The next step according to the documentation would be to run the actual regression on each imputation and combine the results using pool(), so:
model <- with(data_imp, expr = ...)
summary(pool(model))
This will create the regression output for my analysis. Seems good so far.
However, I also want to provide some descriptive statistics (namely, a boxplot) on my dependent and independent variables. For that I need a dataframe that contains both a) the values that were already given and b) the combined, imputed values that were inserted in place of the NAs and used in the regression model. How can I create this data.frame?
I saw in this tutorial that you can combine the imputed data with the already given values into a new dataframe by using data_full <- complete(data_imp), but apparently this only works if you choose one specific imputation out of the 5 (for example data_full <- complete(data_imp, 1) to choose the first imputation). If you don't specify any number, I think it will just use the first imputation. However, I want to use the combined, estimated values from all 5 imputations and put them into one dataframe. How can I do this?
I'd be really grateful for every piece of advice :) Thanks in advance!
I'm not entirely clear on how you want to display the data, but if you want to look at a box plot for each imputation, you can iterate through each of the 5 imputations run with complete() using lapply() or purrr::map() as per below:
library(mice)
library(tidyverse)
imp <- mice(nhanes)
map_dfr(1:5, ~ complete(imp, .x), .id = "imp") |>
ggplot(aes(x = imp, y = bmi)) +
geom_boxplot()
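If you also want the original, incomplete data next to the imputations, a minimal sketch is to stack everything in long format. As far as I know mice does not produce one "combined" value per missing cell (pooling happens on the estimates, not on the data), so this only puts the observed data and the 5 completed datasets side by side. The seed and axis labels are my own choices, not from the question:

```r
library(mice)

# nhanes ships with mice; m = 5 matches the question
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# action = "long" stacks all imputations in one data frame;
# include = TRUE prepends the original data (with its NAs) as .imp == 0
long <- complete(imp, action = "long", include = TRUE)

# one box per dataset: 0 = observed values only, 1..5 = completed datasets
boxplot(bmi ~ .imp, data = long,
        xlab = "imputation (0 = original data)", ylab = "bmi")
```

The .imp column tells you which dataset each row belongs to, so the same long data frame can be filtered or grouped for other descriptives.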
I've imputed my data using the following code:
data_imp <- mice(data, m=5, maxit=50, meth='pmm', seed=500, printFlag=FALSE)
data.impute <- complete(data_imp, action = 1)
I wanted to perform backwards stepwise regression using the stepAIC function in order to find the most parsimonious model. How can I do this using all 5 of my imputed datasets, rather than just 1?
Thank you very much!
You'd have to apply it to each dataset separately; see below for some example code.
However, let me also give you two MASSIVE disclaimers here:
1. Backwards stepwise regression is really really not recommended for variable selection. In addition, there are better ways to do this for imputed datasets.
2. From the code below, you would still have to decide on HOW to pool your results into one interpretable set. One way would be to simply count how often each variable ends up in the final model. However, this procedure implicitly carries a loss of information.
A more extensive discussion of these points can be found here:
https://stats.stackexchange.com/questions/110585/stepwise-regression-modeling-using-multiply-imputed-data-sets
The author of mice also has a subchapter on variable selection in his book:
https://stefvanbuuren.name/fimd/sec-stepwise.html
I would thus consider whether there are better options out there for you.
Example code
library(mice) # mice(), ampute(), complete()
library(MASS) # stepAIC()
## I am using `mtcars`
## Let's ampute it, then impute it
data_imp <- mice(ampute(mtcars, prop = 0.001)$amp)
## Next, we loop over all imputed datasets
out <- lapply(seq_len(data_imp$m), function(i) {
## We create a dataset
data.i <- complete(data_imp, i)
## We run our model
fit <- lm(mpg ~ ., data = data.i)
## We apply `stepAIC`
stepAIC(fit, trace = FALSE)
})
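The "count how often each variable ends up in the final model" idea could look roughly like the sketch below. To keep it self-contained I punch a few NAs into a subset of mtcars by hand instead of using ampute(); the columns, the number of NAs and the seeds are arbitrary assumptions, and the same caveats about stepwise selection apply:

```r
library(mice)
library(MASS)

set.seed(1)
# Hypothetical example data: a few mtcars columns with NAs inserted by hand
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
dat$hp[sample(nrow(dat), 5)]   <- NA
dat$wt[sample(nrow(dat), 5)]   <- NA
dat$qsec[sample(nrow(dat), 5)] <- NA

data_imp <- mice(dat, m = 5, seed = 500, printFlag = FALSE)

# Run stepAIC separately on each completed dataset
out <- lapply(seq_len(data_imp$m), function(i) {
  data.i <- complete(data_imp, i)
  fit <- lm(mpg ~ ., data = data.i)
  stepAIC(fit, trace = FALSE)
})

# Collect the predictors kept in each final model and tally them
selected <- unlist(lapply(out, function(fit) attr(terms(fit), "term.labels")))
counts <- sort(table(selected), decreasing = TRUE)
counts
```

A variable with a count of 5 was selected in every imputed dataset. What cutoff to use from there is a judgement call.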
I'm fairly new to R and am trying to compare two models with the modelCompare function. However, the data set that I am working with is a bit large and has unevenly distributed missing values. When I try the following code for example:
Model_A <- lm(DV~var1*var2 + cont.var, data=df)
Model_C <- lm(DV~ cont.var, data=df)
modelCompare(Model_C,Model_A)
I get an error that the models have different N values and cannot be compared because data is differentially omitted between the two models. Is there an easy way to remove this variation, as I will be running a number of regression analyses with this data set?
What are you looking to compare? If you want to compare intercepts between the models, just print both models:
Model_A
Model_C
If you want to compare the accuracy of the models, use a training and a testing dataset!
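If the goal is just to get rid of the "different N" error, one option is to fit both models on the same rows, i.e. the rows that are complete for every variable in the larger model. Below is a minimal sketch on mtcars with some NAs inserted by hand (the variables and seed are made up for illustration). I use base R's anova() for the comparison instead of modelCompare(), so treat that as a stand-in:

```r
set.seed(42)
# Hypothetical data: mtcars with NAs inserted in two predictors
df <- mtcars
df$hp[sample(nrow(df), 5)] <- NA
df$wt[sample(nrow(df), 4)] <- NA

# Keep only rows that are complete for all variables of the LARGER model
vars  <- all.vars(mpg ~ hp * wt + qsec)
df_cc <- df[complete.cases(df[, vars]), ]

# Both models now see exactly the same observations
Model_A <- lm(mpg ~ hp * wt + qsec, data = df_cc)
Model_C <- lm(mpg ~ qsec, data = df_cc)

# Nested model comparison (F test)
anova(Model_C, Model_A)
```

Note that this is listwise deletion, so it throws away every row with an NA in any of those variables. With a lot of unevenly distributed missingness, that can cost a large part of the data.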
I have a dataset containing growth data for multiple crops. The experimental treatment is within-crop (i.e. I am not comparing between crops) and has a randomised block design, with repeated measures over time. I would like to run the same mixed model on each different crop in R:
model<-lmer(response~treatment*time+(1|block), data=data)
My dataset contains 20 crops and I want to extract all the fixed intercepts and coefficients, so it is annoying to run the model on each crop separately. Somehow I got the impression from some other posts that lmList, given the below formula, would run a separate but common mixed model on each crop. However having now checked the help page and the output in more detail, I think it just does a fixed effect linear model (lm).
model_set <- lme4::lmList(response~treatment*time+(1|block)|crop,data=data)
I can't find any nice posts on setting up a loop to run the same mixed effect model on multiple datasets and extract coefficients, or if there is any existing function for this. It would be great if someone could recommend a quick simple solution.
As an example dataset I suggest the lme4 sleepstudy data, but let's just make a fake 'group' factor to mimic my 'crop' factor:
library(lme4)
data(sleepstudy)
library(car)
sleepstudy$group<-recode(sleepstudy$Subject,
" c('308','309','310','330','331','332','333','334') = 'group1';
c('335','337','349','350','351','352','369','370','371','372') = 'group2'")
model<-lmer(Reaction~Days+(1|Subject), data=sleepstudy)
How can I automate running the same model on each group separately and then extracting the intercept and coefficients?
Many thanks in advance for any suggestions.
Today I have stumbled across the 'split' function, which means I can answer my own question. The split function subsets a dataframe into a list of dataframes based on levels of a specified factor. A simple loop can then be set up to run the same model on each dataframe in the list.
list_df <- split(sleepstudy, sleepstudy$group) #split example dataset by group factor
results_df <- as.data.frame(matrix(ncol=2,nrow=length(list_df))) # make an empty dataframe
colnames(results_df)<-c("intercept","Days") #give the dataframe column names
for (i in 1:length(list_df)){ #run a loop over the dataframes in the list
mod<-lmer(Reaction~Days+(1|Subject),data=list_df[[i]]) #mixed model
results_df[i,]<-fixef(mod) #extract coefficients to dataframe
rownames(results_df)[i]<-names(list_df)[i] #assign rowname to results from data used
}
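The same split-then-fit idea can also be written without a pre-allocated data frame, using lapply(). This is only a sketch of an alternative. The 'group' factor is rebuilt with ifelse() so that the snippet does not depend on car::recode(), and that is my substitution rather than part of the original example:

```r
library(lme4)

# Rebuild the fake 'group' factor from the question without car::recode()
dat <- sleepstudy
group1_ids <- c("308", "309", "310", "330", "331", "332", "333", "334")
dat$group <- ifelse(as.character(dat$Subject) %in% group1_ids, "group1", "group2")

# Fit the same mixed model on each subset
fits <- lapply(split(dat, dat$group), function(d) {
  lmer(Reaction ~ Days + (1 | Subject), data = d)
})

# One row of fixed effects per group
coef_table <- do.call(rbind, lapply(fits, fixef))
coef_table
```

Keeping the fitted models in the list `fits` means you can also pull out other things later (e.g. summary() or VarCorr() per group) without refitting.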
I am not sure if I need to provide a reproducible example for this, as it is more of a general question. Anyway, after running the mice package, it returns m imputed datasets. We can extract the data by using the complete() function.
I am confused, however, about which dataset I should use for my subsequent analysis (descriptive estimation, model building, etc.).
Questions:
1. Do I need to extract a specific dataset, e.g. complete(imp, 1)? Or should I use the whole imputed dataset, e.g. complete(imp, "long", inc = TRUE)?
If it is the latter, complete(imp, "long", inc = TRUE), how do I compute descriptives like the mean, proportions, etc.? For example, I will analyze the long data using SPSS. Should I split the data by imputation number (m datasets) and manually average the results? Is that how it should be done?
Thanks for your help.
You should run your statistical analysis on each of the m imputed data sets individually, then pool the results together. This allows you to take into account the additional uncertainty introduced by the imputation procedure. MICE has this functionality built in. For example, if you wanted to do a simple linear model you would do this:
fit <- with(imp, lm(y ~ x1 + x2))
est <- pool(fit)
summary(est)
Check out ?pool and ?mira
Multiple imputation consists of the following three steps:
1. Imputation
2. Analysis
3. Pooling
In the first step, m imputed datasets are generated. In the second step, a data analysis, such as a regression, is applied to each dataset separately. Finally, in the third step, the analysis results are pooled into a final result. There are various pooling techniques implemented for different parameters.
Here is a nice link describing the pooling in detail - mice Vignettes
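For a simple descriptive like a mean, the three steps can be sketched with an intercept-only model, which is one way (not the only one) to get a pooled mean together with a standard error. The nhanes data and the seed are just for illustration:

```r
library(mice)

# Step 1: imputation
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Step 2: analysis; the intercept of lm(bmi ~ 1) is the mean of bmi
fit <- with(imp, lm(bmi ~ 1))

# Step 3: pooling
pooled <- summary(pool(fit))
pooled

# The pooled point estimate is the average of the per-imputation means
per_imp_means <- sapply(seq_len(imp$m), function(i) mean(complete(imp, i)$bmi))
mean(per_imp_means)
```

So averaging the m means by hand gives the same point estimate, but the pooled standard error additionally accounts for the between-imputation variability, which a manual average does not give you.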
So I have a small data set which should be great for modeling (<1 million records), but one variable is giving me problems. It's a categorical variable with ~98 levels called [store]; this is the name of each store. I am trying to predict each store's sales [sales], which is a continuous numeric variable. The vector size is over 10GB and R crashes with memory errors. Is it possible to make 98 different regression equations, and run them one by one for every level of [store]?
My other idea would be to try and create 10 or 15 clusters of this [store] variable, then use the cluster names as my categorical variable in predicting the [sales] variable (continuous variable).
Sure, this is a pretty common type of analysis. For instance, here is how you would split up the iris dataset by the Species variable and then build a separate model predicting Sepal.Width from Sepal.Length in each subset:
data(iris)
models <- lapply(split(iris, iris$Species), function(df) lm(Sepal.Width~Sepal.Length, data=df))
The result is a list of the species-specific regression models.
To predict, I think it would be most efficient to first split your test set, then call the corresponding prediction function on each subset, and finally recombine:
test.iris <- iris
test.spl <- split(test.iris, test.iris$Species)
predictions <- unlist(lapply(test.spl, function(df) {
predict(models[[as.character(df$Species[1])]], newdata=df) # index by level name, not by the factor's integer code
}))
test.ordered <- do.call(rbind, test.spl) # Test obs. in same order as predictions
Of course, for your problem you'll need to decide how to subset the data. One reasonable approach would be clustering with something like kmeans and then passing the cluster of each point to the split function.
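A rough sketch of that clustering variant, again on iris. The clustering variables, the number of clusters and the seed are arbitrary assumptions here. For the store data you would presumably cluster on store-level features instead:

```r
set.seed(1)
dat <- iris

# Cluster on (scaled) petal measurements; 3 clusters is an arbitrary choice
km <- kmeans(scale(dat[, c("Petal.Length", "Petal.Width")]), centers = 3)
dat$cluster <- factor(km$cluster)

# Split by cluster and fit one model per cluster
cluster_models <- lapply(split(dat, dat$cluster), function(df) {
  lm(Sepal.Width ~ Sepal.Length, data = df)
})

# Coefficients of each cluster-specific model
sapply(cluster_models, coef)
```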