Extract Model for Specific Factor - r

Say I've fit a model as follows fit = lm(Y ~ X + Dummy1 + Dummy2)
How can I extract the regression for a specific dummy variable?
I'm hoping to do something like the following to plot all the regressions:
plot(...)
abline(extracted.lm.dummy1)
abline(extracted.lm.dummy2)

I would look into the sjPlot package. Here is the documentation for sjp.lm, which can be used to visualize linear models in various ways. The package also has some nice tools for tabular summaries of models.
An example:
library(sjPlot)
library(dplyr)
# add a second categorical variable to the iris dataset
# then generate a linear model
set.seed(123)
fit <- iris %>%
mutate(Category = factor(sample(c("A", "B"), 150, replace = TRUE))) %>%
lm(Sepal.Length ~ Sepal.Width + Species + Category, data = .)
Different kinds of plot include:
Marginal effects plot, probably closest to what you want
sjp.lm(fit, type = "eff", vars = c("Category", "Species"))
"Forest plot" (beta coefficients + confidence interval)
sjp.lm(fit)

Related

Function to update `ggplot2::labs(caption=)` using data passed to `ggplot2::ggplot(data=)`

I've recently written my first ggplot2 stat and geom methods. I want to write another that uses the data passed in ggplot2::ggplot(data=) to add a p-value as a caption to the figure. Is that possible?
For example, I would like to write something like this:
library(ggplot2)
mtcars |>
ggplot(aes(x = mpg, y = cyl)) +
add_pvalue()
Where add_pvalue() would calculate a p-value (e.g. an anova p-value for different mean MPG by the number of cylinders), and add the p-value as a caption, labs(caption = "p = 0.45").
Thank you!
Daniel, it's possible. You can use this example. Hope that help you !
library(ggplot2)
library(glue)
p_value <- 0.05
mtcars |>
ggplot(aes(x = mpg, y = cyl)) +
labs(caption = glue("p = {p_value}"))
You could do something like the following, picking your preferred statistical model, "types" of p-values, and formatting of the p-value. If you wanted to build in lots of functionality to make it useful for a wide variety of models, you'd want to add conditional extractor functions for those models.
# Packages
library(ggplot2)
library(dplyr)
library(rlang)
# Define "add_pvalue()" function
# adds p-value from linear regression of y on x
# note that this assumes x and y are reals or integers
add_pvalue <- function(ggplot_obj) {
# Get x and y variable names from ggplot object
x <- ggplot_obj$mapping$x %>%
rlang::quo_get_expr() %>%
deparse()
y <- ggplot_obj$mapping$y %>%
rlang::quo_get_expr() %>%
deparse()
# Build regression model formula, fit model, return model summary
mod <- paste0(y, "~ ", x) %>%
as.formula() %>%
lm(data = ggplot_obj$data) %>%
summary()
# Extract two-tailed t-test p-value from model object (reformat as desired)
pval <- mod$coefficients[x, "Pr(>|t|)"]
# Add p_value as plot caption
ggplot_obj +
labs(caption = paste0("p = ", pval))
}
# Example with p-value for linear model and 95% confidence intervals
mtcars %>%
ggplot(aes(x = mpg, y = cyl)) %>%
add_pvalue() +
geom_smooth(method = "lm", se = TRUE, level = 0.95)
#> `geom_smooth()` using formula 'y ~ x'
Note that blindly fitting a linear regression or ANOVA to your data is probably not the best decision since x or y may not be real or integer types. If they aren't, this won't really make sense since some models either throw runtime errors or employ one-hot encoding when passed other types of variables.
Similarly, the p-values you obtain may be utterly meaningless if, for example, each row in the data is not an independent observation, you run lots of models that invalidate the sampling assumptions of p-values, your hypothesis doesn't match the test, etc.
Finally, you could also try using the output of stat_smooth() that is produced when you call geom_smooth() to do this. The upside would be that you wouldn't need to fit the model twice to have both that geom and the p-value (using the standard error and coefficients plus normal distribution to get the p-value). That's a bit outside of the scope and would be more limiting since you're stuck with the models it employs and the same issues plague those as well. It's also pretty annoying to extract those: Method to extract stat_smooth line fit

How to map pdp::partial to nested randomForest models?

I would like to map the function pdp::partial to nested randomForest models. I'll then use the output to plot the 3d partial dependency plots for each group via facet_wrap(). When mapping the function to the models I get an error that the predictor variables can not be found in the training data -- but they are there when I check the tibble so I'm at a loss for what to do.
library(tidyverse)
library(pdp)
library(randomForest)
data(boston)
glimpse(boston)
#Make groups, nest data by groups, apply random forest model to nested data
boston %>%
mutate(grp=ifelse(age<80, "young", "old"))%>%
nest(data= -grp)%>%
mutate(fit = map(data, ~ randomForest(cmedv ~ ., data = boston, importance = TRUE)))%>%
{.->>GrpModels}
#Map pdp::partial to fitted models for two predictor variables
GrpModels%>%
mutate(p=map2(fit,data, ~pdp::partial(fit,train=data, pred.var=c("lstat", "rm"))))%>%
unnest(p)%>%{.->>checkpdp}
Error: Problem with mutate() column p. i p = map2(...). x lstat,
rm not found in the training data.
This seems to work, although I'm not sure why plotting with geom_tile() does not quite do what I thought it would. I used geom_point() instead. In short, I needed to get pred.var as a list and then pass the three inputs (fit, data, and predictor variables) to pmap.
GrpModels %>%
mutate(preds = data.table::transpose(as.list(c('lstat','rm')))) %>%
mutate(p = pmap(list(fit, data, preds),
.f = ~pdp::partial(object=..1, train = ..2,
pred.var = ..3)))%>%
select(-data,-fit,-preds)%>%
unnest_wider(p)%>%
unnest(c(yhat,lstat,rm))%>%{.->>checkpdp}%>%
ggplot(.,aes(x=lstat,y=rm,color=yhat))+
#geom_tile()+
geom_point(shape=15, size=2)+
facet_wrap(~grp, scales='free')

plotting an interaction term in moderated regression using MICE imputation

I'm using imputed data to test a series of regression models, including some moderation models.
Imputation
imp_data <- mice(data,m=20,maxit=20,meth='cart',seed=12345)
I then convert this to long format so I can recode / sum variables as needed, beore turning back to mids format
impdatlong_mids<-as.mids(impdat_long)
Example model:
model1 <- with(impdatlong_mids,
lm(Outcome ~ p1_sex + p2 + p3 + p4
+ p5+ p6+ p7+ p8+ p9+ p10
+ p11+ p1_sex*p12+ p1_sex*p13 + p14)
in non-imputed data, to create a graphic representation of the significant ineraction, I'd use (e.g.)
interact_plot (model=model1, pred = p1_sex, modx = p12)
This doesn't work with imputed data / mids objects.
Has anyone plotted an interaction using imputed data, and able to help or share examples?
Thanks
EDIT: Reproducible example
library(tidyverse)
library(interactions)
library(mice)
# library(reprex) does not work with this
set.seed(42)
options(warn=-1)
#---------------------------------------#
# Data preparations
# loading an editing data
d <- mtcars
d <- d %>% mutate_at(c('cyl','am'),factor)
# create missing data and impute it
mi_d <- d
nr_of_NAs <- 30
for (i in 1:nr_of_NAs) {
mi_d[sample(nrow(mi_d),1),sample(ncol(mi_d),1)] <- NA
}
mi_d <- mice(mi_d, m=2, maxit=2)
#---------------------------------------#
# regressions
#not imputed
lm_d <- lm(qsec ~ cyl*am + mpg*disp, data=d)
#imputed dataset
lm_mi <- with(mi_d,lm(qsec ~ cyl*am + mpg*disp))
lm_mi_pool <- pool(lm_mi)
#---------------------------------------#
# interaction plots
# not imputed
#continuous
interactions::interact_plot(lm_d, pred=mpg,modx=disp, interval=T,int.width=0.3)
#categorical
interactions::cat_plot(lm_d, pred = cyl, modx = am)
#---------------------------------------#
# interaction plots
# imputed
#continuous
interactions::interact_plot(lm_mi_pool, pred=mpg,modx=disp, interval=T,int.width=0.3)
# Error in model.frame.default(model) : object is not a matrix
#categorical
interactions::cat_plot(lm_mi_pool, pred = cyl, modx = am)
# Error in model.frame.default(model) : object is not a matrix
The problem seems to be that neither interact_plot, cat_plot or any other available package allows for (at least categorical) interaction plotting with objects of class mipo or pooled regression outputs.
I am using the walking data from the mice package as an example. One way to get the interaction plot (well version of one type of interaction plot) is to use the gtsummary package. Under the hood it will take the model1 use pool() from mice to average over the models and then use a combo of tbl_regression() and plot() to output a plot of the coefficients in the model. The tbl_regression() function is what is calling the pool() function.
library(mice)
library(dplyr)
library(gtsummary)
imp_data <- mice(mice::walking,m=20,maxit=20,meth='cart',seed=12345)
model1 <- with(imp_data,
lm(age ~ sex*YA))
model1 %>%
tbl_regression() %>%
plot()
The package emmeans allows you to extract interaction effects from a mira object. Here is a gentle introduction. After that, the interactions can be plotted with appropriate ggplot. This example is for the categorical variables but could be extended to the continous case - after the emmeans part things get relatively straighforward.
library(ggplot2)
library(ggstance)
library(emmeans)
library(khroma)
library(jtools)
lm_mi <- with(mi_d,lm(qsec ~ gear*carb))
#extracting interaction effects
emcatcat <- emmeans(lm_mi, ~gear*carb)
tidy <- as_tibble(emcatcat)
#plotting
pd <- position_dodge(0.5)
ggplot(tidy, aes(y=gear, x=emmean, colour=carb)) +
geom_linerangeh(aes(xmin=lower.CL, xmax=upper.CL), position=pd,size = 2) +
geom_point(position=pd,size = 4)+
ggtitle('Interactions') +
labs (x = "aggreageted interaction effect") +
scale_color_bright() +
theme_nice()
this can be extended to a three-way interaction plot with facet_grid as long as you have a third categorical interaction term.

sjt.lmer displaying incorrect p-values

I've just noticed that sjt.lmer tables are displaying incorrect p-values, e.g., p-values that do not reflect the model summary. This appears to be a new-ish issue, as this worked fine last month?
Using the provided data and code in the package vignette
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(lme4)
library(sjstats)
load sample data
data(efc)
prepare grouping variables
efc$grp = as.factor(efc$e15relat)
levels(x = efc$grp) <- get_labels(efc$e15relat)
efc$care.level <- rec(efc$n4pstu, rec = "0=0;1=1;2=2;3:4=4",
val.labels = c("none", "I", "II", "III"))
data frame for fitted model
mydf <- data.frame(
neg_c_7 = efc$neg_c_7,
sex = to_factor(efc$c161sex),
c12hour = efc$c12hour,
barthel = efc$barthtot,
education = to_factor(efc$c172code),
grp = efc$grp,
carelevel = to_factor(efc$care.level)
)
fit sample models
fit1 <- lmer(neg_c_7 ~ sex + c12hour + barthel + (1 | grp), data = mydf)
summary(fit1)
p_value(fit1, p.kr =TRUE)
model summary
p_value summary
sjt.lmer output does not show these p-values??
Note that the first summary comes from a model fitted with lmerTest, which computes p-values with df based on Satterthwaite approximation (see first line in output).
p_value(), however, with p.kr = TRUE, uses the Kenward-Roger approximation from package pbkrtest, which is a bit more conservative.
Your output from sjt.lmer() seems to be messed up somehow, and I can't reproduce it with your example. My output looks ok:

Plotting predicted survival curves for continuous covariates in ggplot

How can I plot survival curves for representative values of a continuous covariate in a cox proportional hazards model? Specifically, I would like to do this in ggplot using a "survfit.cox" "survfit" object.
This may seem like a question that has already been answered, but I have searched through everything in SO with the terms 'survfit' and 'newdata' (plus many other search terms). This is the thread that comes closest to answering my question so far: Plot Kaplan-Meier for Cox regression
In keeping with the reproducible example offered in one of the answers to that post:
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
df <- read.table(url, header = TRUE)
library(dplyr)
library(ggplot2)
library(survival)
library(magrittr)
library(broom)
# Identifying the 25th and 75th percentiles for prio (continuous covariate)
summary(df$prio)
# Cox proportional hazards model with other covariates
# 'prio' is our explanatory variable of interest
m1 <- coxph(Surv(week, arrest) ~
fin + age + race + prio,
data = df)
# Creating new df to get survival predictions
# Want separate curves for the the different 'fin' and 'race'
# groups as well as the 25th and 75th percentile of prio
newdf <- df %$%
expand.grid(fin = levels(fin),
age = 30,
race = levels(race),
prio = c(1,4))
# Obtain the fitted survival curve, then tidy
# into a dataframe that can be used in ggplot
survcurv <- survfit(m1, newdata = newdf) %>%
tidy()
The problem is, that once I have this dataframe called survcurv, I cannot tell which of the 'estimate' variables belongs to which pattern because none of the original variables are retained. For example, which of the 'estimate' variables represents the fitted curve for 30 year old, race = 'other', prio = '4', fin = 'no'?
In all other examples i've seen, usually one puts the survfit object into a generic plot() function and does not add a legend. I want to use ggplot and add a legend for each of the predicted curves.
In my own dataset, the model is a lot more complex and there are a lot more curves than I show here, so as you can imagine seeing 40 different 'estimate.1'..'estimate.40' variables makes it hard to understand what is what.
Thanks for providing a well phrased question and a good example. I'm a little surpirsed that tidy does a relatively poor job here of creating sensible output. Please see below for my attempt at creating some plottable data:
library(tidyr)
newdf$group <- as.character(1:nrow(newdf))
survcurv <- survfit(m1, newdata = newdf) %>%
tidy() %>%
gather('key', 'value', -time, -n.risk, -n.event, -n.censor) %>%
mutate(group = substr(key, nchar(key), nchar(key)),
key = substr(key, 1, nchar(key) - 2)) %>%
left_join(newdf, 'group') %>%
spread(key, value)
And the create a plot (perhaps you'd like to use geom_step instead, but there is not step shaped ribbon, unfortunately):
ggplot(survcurv, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high,
col = race, fill = race)) +
geom_line(size = 1) +
geom_ribbon(alpha = 0.2, col = NA) +
facet_grid(prio ~ fin)
Try defining your survcurv like this:
survcurv <-
lapply(1:nrow(newdf),
function(x, m1, newdata){
cbind(newdata[x, ], survfit(m1, newdata[x, ]) %>% tidy)
},
m1,
newdf) %>%
bind_rows()
This will include all of the predictor values as columns with the predicted estimates.

Resources