Preface - I really hope this makes sense!
I ran a linear-mixed effect model using an imputed dataset (FYI, the data is a mids object imputed using mice). The model has a three-way interaction with 3 continuous variables. I am now trying to plot the interaction using the interactions::interact_plot function. However, I'm receiving an error when I run the plot code, which I believe is due to the fact that the model came from a mids object and not a data frame. Does anyone know how to address this error or if there's a better way to get the plot that I'm trying to get?
Thanks very much in advance!
MIDmod1 <- with(data = df.mids, exp = lmer(GC ~ Age + Sex + Edu + Stress*Time*HLI + (1|ID)))
summary(pool(MIDmod1))
interact_plot(
model=MIDmod1,
pred = Time,
modx=Stress,
mod2=HLI,
data = df.mids,
interval=TRUE,
y.label='Global cognition composite score',
modx.labels=c('Low Baseline Stress (-1SD)','Moderate Baseline Stress (Mean)', 'High Baseline Stress (+1SD)'),
mod2.labels=c('Low HLI (-1SD)', 'Moderate HLI (Mean)', 'High HLI (+1SD)'),
legend.main='') + ylim(-2,2)
Error:
Error in rep(1, times = nrow(data)) : invalid 'times' argument
Note - I also get an error if I don't include the data argument (optional argument for this function).
Error in formula.default(object, env = baseenv()) : invalid formula
BTW - I am able to generate the plot when the model comes from a data frame - an example of what this should look like is included here: 1
Sorry, but it won’t be that easy. Multiple imputation object will definitely require special treatment, and none of the many R packages which can plot interactions are likely to work out of hte box.
Here’s a minimal example, adapted from the multiple imputation vignette of the marginaleffects package. (Disclaimer: I am the author.)
library(mice)
library(lme4)
library(ggplot2)
library(marginaleffects)
# insert missing data in an existing dataset and impute
iris_miss <- iris
iris_miss$Sepal.Width[sample(1:nrow(iris), 20)] <- NA
iris_mice <- mice(iris_miss, m = 20, printFlag = FALSE, .Random.seed = 1024)
iris_mice <- complete(iris_mice, "all")
# fit a model on 1 imputed datatset and use the `plot_predictions()` function
# with the `draw=FALSE` argument to extract the data that we want to plot
fit <- function(dat) {
mod <- lmer(Sepal.Width ~ Petal.Width * Petal.Length + (1 | Species), data = dat)
out <- plot_predictions(mod, condition = list("Petal.Width", "Petal.Length" = "threenum"), draw = FALSE)
# `mice` requires a unique row identifier called "term"
out$term <- out$rowid
class(out) <- c("custom", class(out))
return(out)
}
# `tidy.custom()` is needed by `mice` to combine datasets, but the output of fit() also has
# the right structure and column names, so it is useless
tidy.custom <- function(x, ...) return(x)
# Fit on each imputation
mod_mice <- lapply(iris_mice, fit)
# Pool
mod_pool <- pool(mod_mice)$pooled
# Merge back some of the covariates
datplot <- data.frame(mod_pool, mod_mice[[1]][, c("Petal.Width", "Petal.Length")])
# Plot
ggplot(datplot, aes(Petal.Width, estimate, color = Petal.Length)) +
geom_line() +
theme_minimal()
Related
I'll illustrate my problem with the iris data set in R. My objective here is to create 5 imputed data sets, fit a regression to each imputed data set, then pool together the results of these regressions into one final model. This is the preferred order of operations for a proper execution of multiple imputation.
library(mice)
df <- iris
# Inject some missingness into the data:
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# Perform the standard steps of multiple imputation with MICE:
imputed_data <- mice(df, method = c(rep("pmm", 5)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
pooled_model <- pool(model)
This leaves me with this pooled_model object which I am hoping to use as a fitted model in the predict command. However, that does not work. When I run:
predict(pooled_model, newdata = iris)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mipo', 'data.frame')"
Disregard the reasoning why I am using the original iris data set in my newly fitted model; I simply want to be able to fit this data, or a subset of it, onto the model I created with my imputation.
I specifically chose a data set with multiple levels of a categorical variable to highlight my problem. I thought about using some matrix multiplication with which I could do this manually, but the presence of a categorical variable makes that tough. In my actual data set, I have over a hundred variables, many of which have multiple categorical levels. I say this because I realize one possible solution would be to re-code my categorical variables into dummy variables, and then I can apply some matrix multiplication to get my answer. But that would be an EXTREME amount of work for me. If there's a way I can somehow get a model object I can use in the predict function, that would make my life 100x easier.
Any suggestions?
You have two issues: 1) how to use stats::predict with pooled data and 2) what to do about your categorical variables.
Your first issue has already been documented on the mice Github page and it seems like there's been a desire to have a predict.mira function for a while. The author of the mice package posted some code on how to simulate a predict.mira-like function. Unfortunately, it only works with lm models, but it seems like that's okay considering your reprex. If you have a Github account, you can comment on that Github issue to demonstrate your interest in the predict.mira function.
Your question also has been posted on StackOverflow before; although the answer was never accepted, the SO user suggested this reading by Miles (2015).
For your second question, have you considered leaving out your current method argument when using mice()? As long as your variables have been classed as factors, then mice will default to the polyreg method for categorical variables and pmm for continuous variables. You can read more about the method argument here.
library(mice)
set.seed(123)
# make missing data
df <- iris
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# specify method
meth <- mice(df, maxit = 0, printFlag = FALSE)$meth
print(meth)
# this is how you would change your methods, if you wanted
# but pmm and polyreg are defaults
meth["Species"] <- "polr"
meth["Sepal.Width"] <- "midastouch"
print(meth)
# impute
imputed_data <- mice(df,
m = 5,
maxit = 5,
method = meth, # new method
printFlag = FALSE)
# make model
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
summary(pool(model))
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(model), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
pred[i, 1] <- pi[["qbar"]]
pred[i, 2] <- sqrt(pi[["t"]])
pred[i, 3] <- pi[["df"]]
}
head(pred)
I am trying to create a function to run a k-fold cross validation on a glmer object.
This is just data I got online (my dataset is quite large) so the model isn't the best but if I can get this to work using this data I should be able to switch it to my dataset quite easily.
I want to do a LOOCV(Leave One Out Cross-Validation)
"LOOCV(Leave One Out Cross-Validation) is a type of cross-validation approach in which each observation is considered as the validation set and the rest (N-1) observations are considered as the training set."
The outline I got was from Caroline's answer on this researchgate thread.
https://www.researchgate.net/post/Does_R_code_for_k-fold_cross_validation_of_a_nested_glmer_model_exist
#load libraries
library(tidyverse)
library(optimx)
library(lme4)
#add example data
Data <- read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
Data <- select(Data, remission, IL6, CRP, DID)
Data
Data$remission<- as.factor(Data$remission)
Data$DID<- as.factor(Data$DID)
#add ROW column
Data <- Data %>% mutate(ROW = row_number())
head(Data)
PTOT=NULL
for (i in 1:8825) { # i in total number of observations in dataset
##Data that will be predicted
DataC1=Data[unique(Data$ROW)==i,]
###To train the model
DataCV=Data[unique(DataC1$ROW)!=i,]
M1 <- glmer(remission ~ 1 + IL6 + CRP + ( 1 | DID ), data = DataCV, family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
P1=predict(M1, DataC1)
names(P1)=NULL
P1
PTOT= c(PTOT, P1)
}
R2cv=1-(sum((remission-PTOT)^2)/(length(PTOT))/(var(remission)))
This is the error I get
"Error: Invalid grouping factor specification, DID"
DataCV is empty.
For example:
i <- 1 ## first time through the loop
DataCV=Data[unique(DataC1$ROW)!=i,]
I think that should have been DataC$ROW), not DataC1$ROW.
A few other comments: a more compact version of your code would look something like this:
## fit the full model
M1 <- glmer(remission ~ 1 + IL6 + CRP + ( 1 | DID ), data = DataC,
family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
res <- numeric(nrow(DataCV))
for (i in 1:nrow(DataCV)) {
new_fit <- update(M1, data = dataC[-i,]
res[i] <- (predict(new_fit, newdata=dataC[i,]) - remission[i])^2
}
For a well-specified model LOOCV is asymptotically equivalent to AIC, so you might be doing a lot of work to get something that's not very different from the AIC (which you can get directly from a single model fit) ...
EDITED:
I'm trying to assess the effect of variables (e.g. presence of severe trauma) on a continous variable (here energy expenditure (=REE) in calories) over time (Day). The dataframe is called my_data. Amongst the variables
Following I would like to display the results using the mixed linear model for each assessed variable in one large file.
General concept:
REE ~ Time*predictor + (1 + Time | Case identifier)
(1) Starting creating the lmer model:
library(tidyverse)
library(ggpmisc)
library(sjPlot)
library(lme4)
mixed.modelloop <- function(x) {
lmer(REE ~ Day*(x) + (1 + Day | Studynumber),
data=my_data,
REML=FALSE,
na.action=na.omit,
control = lmerControl(check.nobs.vs.nRE = "ignore"))
}
(2) Then creating the predictors (x)
cols <- c(colnames(my_data))
(3) And then generating the overall purrr function:
output <- purrr::map(cols, ~ mixed.modelloop(.x) %>% tab_model)
(4) generating the file which should include all separate univariate mixed model analyses:
pdf(file="mixed linear models.pdf" )
output
dev.off()
Unfortunately currently after step (3) I'm getting the following error message:
Error in model.frame.default(data = my_data, na.action = na.omit, drop.unused.levels = TRUE, :
variable lengths differ (found for 'x')
Any idea on how to adapt the function to resolve this issue?
Thanks!
Formulas have special rules, you can't insert a string into them and expect them to work.
This should work, although you haven't given a reproducible example to test with ...
mixed.modelloop <- function(x) {
form <- reformulate(c(sprintf("Day*%s", x), "(1 + Day | Studynumber)"),
response = "REE")
lmer(form,
data=my_data,
REML=FALSE,
na.action=na.omit,
control = lmerControl(check.nobs.vs.nRE = "ignore"))
}
I'm using imputed data to test a series of regression models, including some moderation models.
Imputation
imp_data <- mice(data,m=20,maxit=20,meth='cart',seed=12345)
I then convert this to long format so I can recode / sum variables as needed, beore turning back to mids format
impdatlong_mids<-as.mids(impdat_long)
Example model:
model1 <- with(impdatlong_mids,
lm(Outcome ~ p1_sex + p2 + p3 + p4
+ p5+ p6+ p7+ p8+ p9+ p10
+ p11+ p1_sex*p12+ p1_sex*p13 + p14)
in non-imputed data, to create a graphic representation of the significant ineraction, I'd use (e.g.)
interact_plot (model=model1, pred = p1_sex, modx = p12)
This doesn't work with imputed data / mids objects.
Has anyone plotted an interaction using imputed data, and able to help or share examples?
Thanks
EDIT: Reproducible example
library(tidyverse)
library(interactions)
library(mice)
# library(reprex) does not work with this
set.seed(42)
options(warn=-1)
#---------------------------------------#
# Data preparations
# loading an editing data
d <- mtcars
d <- d %>% mutate_at(c('cyl','am'),factor)
# create missing data and impute it
mi_d <- d
nr_of_NAs <- 30
for (i in 1:nr_of_NAs) {
mi_d[sample(nrow(mi_d),1),sample(ncol(mi_d),1)] <- NA
}
mi_d <- mice(mi_d, m=2, maxit=2)
#---------------------------------------#
# regressions
#not imputed
lm_d <- lm(qsec ~ cyl*am + mpg*disp, data=d)
#imputed dataset
lm_mi <- with(mi_d,lm(qsec ~ cyl*am + mpg*disp))
lm_mi_pool <- pool(lm_mi)
#---------------------------------------#
# interaction plots
# not imputed
#continuous
interactions::interact_plot(lm_d, pred=mpg,modx=disp, interval=T,int.width=0.3)
#categorical
interactions::cat_plot(lm_d, pred = cyl, modx = am)
#---------------------------------------#
# interaction plots
# imputed
#continuous
interactions::interact_plot(lm_mi_pool, pred=mpg,modx=disp, interval=T,int.width=0.3)
# Error in model.frame.default(model) : object is not a matrix
#categorical
interactions::cat_plot(lm_mi_pool, pred = cyl, modx = am)
# Error in model.frame.default(model) : object is not a matrix
The problem seems to be that neither interact_plot, cat_plot or any other available package allows for (at least categorical) interaction plotting with objects of class mipo or pooled regression outputs.
I am using the walking data from the mice package as an example. One way to get the interaction plot (well version of one type of interaction plot) is to use the gtsummary package. Under the hood it will take the model1 use pool() from mice to average over the models and then use a combo of tbl_regression() and plot() to output a plot of the coefficients in the model. The tbl_regression() function is what is calling the pool() function.
library(mice)
library(dplyr)
library(gtsummary)
imp_data <- mice(mice::walking,m=20,maxit=20,meth='cart',seed=12345)
model1 <- with(imp_data,
lm(age ~ sex*YA))
model1 %>%
tbl_regression() %>%
plot()
The package emmeans allows you to extract interaction effects from a mira object. Here is a gentle introduction. After that, the interactions can be plotted with appropriate ggplot. This example is for the categorical variables but could be extended to the continous case - after the emmeans part things get relatively straighforward.
library(ggplot2)
library(ggstance)
library(emmeans)
library(khroma)
library(jtools)
lm_mi <- with(mi_d,lm(qsec ~ gear*carb))
#extracting interaction effects
emcatcat <- emmeans(lm_mi, ~gear*carb)
tidy <- as_tibble(emcatcat)
#plotting
pd <- position_dodge(0.5)
ggplot(tidy, aes(y=gear, x=emmean, colour=carb)) +
geom_linerangeh(aes(xmin=lower.CL, xmax=upper.CL), position=pd,size = 2) +
geom_point(position=pd,size = 4)+
ggtitle('Interactions') +
labs (x = "aggreageted interaction effect") +
scale_color_bright() +
theme_nice()
this can be extended to a three-way interaction plot with facet_grid as long as you have a third categorical interaction term.
I am fitting a model using a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this using the function gam() however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete=T), where nthreads is the number of cores on my machine. However, when I use the discretization option, and then try to make predictions with my model on new data, while ignoring the random effect, I consistent get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.gauranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata and then mgcv looking for random.x and finding it in the global environment. You should really gather that variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.