Marginal effects / interaction plots for an lfe felm regression object in R

I need to create an interaction / marginal effects plot for a fixed effects model including clustered standard errors generated using the lfe "felm" command.
I have already created a function that achieves this. However, before I start using it, I wanted to double-check whether this function is correctly specified. Please find the function and a reproducible example below.
library(lfe)
### defining function
felm_marginal_effects <- function(regression_model, data, treatment, moderator,
                                  treatment_translation, moderator_translation,
                                  dependent_variable_translation, alpha = 0.05, se = NULL) {
  library(ggplot2)
  library(ggthemes)
  library(gridExtra)
  ### function to get average marginal effects
  getmfx <- function(betas, data, treatment, moderator) {
    betas[treatment] + betas[paste0(treatment, ":", moderator)] * data[, moderator]
  }
  ### function to get marginal effects for specific levels of the treatment variable
  getmfx_high_low <- function(betas, data, treatment, moderator, treatment_val) {
    betas[treatment] * treatment_val + betas[paste0(treatment, ":", moderator)] * data[, moderator] * treatment_val
  }
  ### function to analytically derive the variance (not the standard error) of the
  ### marginal effects; the square root is taken below when building the CIs
  getvarmfx <- function(my_vcov, data, treatment, moderator) {
    my_vcov[treatment, treatment] +
      data[, moderator]^2 * my_vcov[paste0(treatment, ":", moderator), paste0(treatment, ":", moderator)] +
      2 * data[, moderator] * my_vcov[treatment, paste0(treatment, ":", moderator)]
  }
  ### constraining data to relevant variables
  data <- data[, c(treatment, moderator)]
  ### getting marginal effects
  data[, "marginal_effects"] <- getmfx(coef(regression_model), data, treatment, moderator)
  ### getting marginal effects for high and low cases of treatment variable
  data[, "marginal_effects_treatment_low"] <- getmfx_high_low(coef(regression_model), data, treatment, moderator, quantile(data[, treatment], 0.05))
  data[, "marginal_effects_treatment_high"] <- getmfx_high_low(coef(regression_model), data, treatment, moderator, quantile(data[, treatment], 0.95))
  ### getting the variance of the marginal effects from the requested vcov matrix
  ### (NULL = homoskedastic, "clustered", or "robust")
  if (is.null(se)) {
    data$me_var <- getvarmfx(regression_model$vcv, data, treatment, moderator)
  } else if (se == "clustered") {
    data$me_var <- getvarmfx(regression_model$clustervcv, data, treatment, moderator)
  } else if (se == "robust") {
    data$me_var <- getvarmfx(regression_model$robustvcv, data, treatment, moderator)
  }
  ### getting CI bounds
  data[, "ci_lower"] <- data[, "marginal_effects"] - abs(qt(alpha/2, regression_model$df, lower.tail = TRUE)) * sqrt(data$me_var)
  data[, "ci_upper"] <- data[, "marginal_effects"] + abs(qt(alpha/2, regression_model$df, lower.tail = TRUE)) * sqrt(data$me_var)
  ### plotting average marginal effects with CI ribbon
  p_1 <- ggplot(data, aes_string(x = moderator)) +
    geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), fill = "grey70", alpha = 0.4) +
    geom_line(aes(y = marginal_effects)) +
    theme_fivethirtyeight() +
    theme(plot.title = element_text(size = 11.5, hjust = 0.5), axis.title = element_text(size = 10)) +
    geom_rug() +
    xlab(moderator_translation) +
    ylab(paste("Marginal effect of", treatment_translation, "on", dependent_variable_translation)) +
    ggtitle("Average marginal effects")
  ### plotting marginal effects at high / low levels of the treatment
  p_2 <- ggplot(data, aes_string(x = moderator)) +
    geom_line(aes(y = marginal_effects_treatment_high, color = paste0("High ", treatment_translation))) +
    geom_line(aes(y = marginal_effects_treatment_low, color = paste0("Low ", treatment_translation))) +
    theme_fivethirtyeight() +
    theme(plot.title = element_text(size = 11.5, hjust = 0.5), axis.title = element_text(size = 10),
          axis.title.y = element_blank(), legend.justification = c(0.95, 0.95),
          legend.position = c(1, 1), legend.direction = "vertical") +
    geom_rug() +
    xlab(moderator_translation) +
    ylab(paste("Marginal effect of", treatment_translation, "on", dependent_variable_translation)) +
    ggtitle("Marginal effects at high / low levels of treatment") +
    scale_color_manual(name = NULL,
                       values = c(rgb(229, 93, 89, maxColorValue = 255), rgb(75, 180, 184, maxColorValue = 255)),
                       labels = c(paste0("High ", treatment_translation), paste0("Low ", treatment_translation)))
  ### returning both plots as a combined grob
  return(grid.arrange(p_1, p_2, ncol = 2))
}
### example:
# example model (just for demonstration, fixed effects and cluster variables make little sense here)
model <- felm(mpg ~ cyl + am + cyl:am | carb | 0 | cyl, data = mtcars)
# creating marginal effects plot
felm_marginal_effects(regression_model = model, data = mtcars, treatment = "cyl", moderator = "am", treatment_translation = "Number of cylinders", moderator_translation = "Transmission", dependent_variable_translation = "Miles per (US) gallon")
The example output looks like this:
I'm happy for any advice on how to make this a better, "well-coded", faster function so that it's more useful for others afterwards. However, I'm mostly looking to confirm whether it's "correct" in the first place.
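For reference, the quantity getvarmfx computes is meant to be the usual delta-method variance of the marginal effect $\partial Y / \partial T = \beta_T + \beta_{T \times M} M$:

$$\operatorname{Var}\!\left(\hat\beta_T + \hat\beta_{T\times M}\,M\right) = \operatorname{Var}(\hat\beta_T) + M^2\,\operatorname{Var}(\hat\beta_{T\times M}) + 2M\,\operatorname{Cov}(\hat\beta_T,\,\hat\beta_{T\times M})$$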
Additionally, I wanted to check back with the community regarding some remaining questions, particularly:
Can I use the standard errors I generated for the average marginal effects for the "high" and "low" treatment cases as well, or do I need to generate different standard errors for these cases? If so, how?
Instead of using the analytically derived standard errors, I could also calculate bootstrapped standard errors by producing many coefficient estimates from repeated sub-samples of the data. How would I generate bootstrapped standard errors for the high / low case? (A rough sketch of what I have in mind appears after these questions.)
Is there something about fixed effects models, or fixed effects models with clustered standard errors, that makes marginal effects plots (or anything else I did in the code) fundamentally inadmissible?
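For concreteness, here is a minimal, untested sketch of the kind of cluster bootstrap I have in mind for the high / low case. The function name boot_mfx_high_low, the n_boot default, and the resample-whole-clusters scheme are illustrative assumptions, not settled choices:
boot_mfx_high_low <- function(model_formula, data, cluster_var, treatment, moderator,
                              moderator_grid, treatment_val, n_boot = 500) {
  clusters <- unique(data[[cluster_var]])
  draws <- replicate(n_boot, {
    ## resample whole clusters with replacement to respect the clustering
    boot_ids <- sample(clusters, length(clusters), replace = TRUE)
    boot_data <- do.call(rbind, lapply(boot_ids, function(cl) data[data[[cluster_var]] == cl, ]))
    b <- coef(lfe::felm(model_formula, data = boot_data))
    ## same linear combination as getmfx_high_low(), one value per grid point
    (b[treatment] + b[paste0(treatment, ":", moderator)] * moderator_grid) * treatment_val
  })
  ## bootstrap SE = SD across replicates, one per value of the moderator grid
  apply(draws, 1, sd)
}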
PS: The above function and questions are an extension of "How to plot marginal effect of an interaction after felm() function".

Related

Binomial logit model with glmer doesn't yield a good fit to sigmoidal data

I am trying to fit a model to my data, which has a dependent variable that can be 0 or 1.
I tried to fit a binomial glmer to the data, but the fit is pretty bad, as you can see below. This puzzles me because the data look quite sigmoidal, so I thought I would get a great fit with that kind of model. Am I using the wrong model?
(color is my data, black is the fit)
Here is the code I used in R:
library(lme4)
library(ggplot2)
exdata <- read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
model <- glmer(VD ~ as.factor(VI2)*VI1 + (1|ID), exdata,
               family = binomial(link = "logit"),
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))
summary(model)
exdata$fit <- predict(model, type = "response")
ggplot(exdata, aes(VI1, VD, color = as.factor(VI2),
                   group = as.factor(VI2))) +
  stat_summary(geom = "line", size = 0.8) +
  stat_summary(aes(y = fit), geom = "line", size = 0.8, color = "black") +
  theme_bw()
I also tried without the random effect to see if it would change anything, but no...
ggplot(exdata, aes(x = VI1, y = VD, color = as.factor(VI2),
                   group = as.factor(VI2))) +
  stat_summary(fun.data = mean_se, geom = "line", size = 1) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial), color = "black")
Here is the data: https://github.com/FlorianLeprevost/dummydata/blob/main/exdata.csv
tl;dr I don't think these data are as sigmoidal as you think. In particular, a logistic regression estimates a sigmoid curve that ranges from 0 to 1, whereas yours levels out (sort of) at 0.9. In much more detail:
slightly streamlined data import/model fitting
library(lme4)
library(ggplot2)
library(dplyr)
exdata <- (read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
  |> mutate(across(VI2, factor))
)
model <- glmer(VD ~ VI2*VI1 + (1|ID),
               exdata,
               family = binomial(link = "logit"))
compute data summary and predictions
This can also be done with stat_summary(), but I like the finer control of doing it myself. In particular, I like to get Clopper-Pearson CIs on the proportions (could also do this with prop.test() to get score-test CIs). I'm also computing predictions across a wider VI1-range than the data (see why below).
ddsum <- (exdata
  |> group_by(VI1, VI2)
  |> summarise(
       num = n(),
       pos = sum(VD),
       VD = mean(VD),
       lwr = binom.test(pos, num)$conf.int[1],
       upr = binom.test(pos, num)$conf.int[2],
       .groups = "drop")
)
pframe <- expand.grid(
  VI1 = seq(-50, 50, length = 101),
  VI2 = unique(exdata$VI2))
pframe$VD <- predict(model, newdata = pframe, re.form = ~0, type = "response")
plot
gg0 <- ggplot(ddsum, aes(x = VI1, y = VD, color = VI2)) +
  geom_pointrange(position = position_dodge(width = 0.3),
                  aes(ymin = lwr, ymax = upr, size = num), alpha = 0.5) +
  scale_size_area(max_size = 4) +
  theme_bw()
gg1 <- gg0 + geom_line(data = pframe)
ggsave(gg1, file = "gglogist1.png")
Conclusion: the sharp increase from x=0 to x=15 combined with the saturation below 1.0 makes it hard to fit with a logistic curve.
We could try a quadratic-logistic fit:
model2 <- update(model, . ~ poly(VI1, 2)*VI2 + (1|ID))
pframe$VD2 <- predict(model2, newdata = pframe, re.form = ~0, type = "response")
gg2 <- gg1 + geom_line(data = pframe, aes(y = VD2), linetype = 2)
ggsave(gg2, file = "gglogist2.png")
This fits better (it couldn't fit worse), but might not make sense for your application.
In principle we could fit a logistic that saturated at a value <1, but it's a bit tricky ...
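For what it's worth, here is a rough, untested sketch of one way to do that: fit a scaled logistic A * plogis(b0 + b1 * x) to the group-level proportions with nls(), ignoring VI2 and the random effect for simplicity. The asymptote parameter A and the starting values are assumptions that may need tuning:
## scaled logistic: estimated asymptote A instead of the usual 1
sat_fit <- nls(VD ~ A * plogis(b0 + b1 * VI1),
               data = ddsum,
               start = list(A = 0.9, b0 = -1, b1 = 0.3))
pframe$VD_sat <- predict(sat_fit, newdata = pframe)
gg3 <- gg1 + geom_line(data = pframe, aes(y = VD_sat), linetype = 3)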

Graphing 95% Confidence Intervals for Regression with Linear AND Quadratic terms

I’m currently producing graphs that map out the parabolic relationship between a continuous predictor and the risk of an outcome, including 95% confidence intervals. My model has linear and quadratic terms for my predictor (in this case blood pressure).
The issue I’m having is that I have two terms for blood pressure: a quadratic term (SBP2) and a linear term (SBP), both grand-mean centered at around 138 mmHg and measured in increments of 5 mmHg (i.e. I divide by 5).
If I were producing 95% confidence intervals for a single linear term it would be very easy, but with two terms, rather than producing one line below the curve and one line above it, my approach produces two lines that both intersect the curve (see R script below).
The way I produced these graphs was to take the linear term's estimate and add/subtract 1.96 * its standard error, and then do the same for the quadratic term. I calculated the upper CI line by adding the linear upper-CI term to the quadratic upper-CI term (see R script below).
I’ve had a read online and quickly discovered that this is clearly the wrong approach.
But I can’t find very much at all on this question. I found an interesting article explaining how I might compute the CIs for a single linear term (https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals) but nothing that might tell me how to do so for a combined linear-and-quadratic equation.
Do you have any ideas on how I should approach this?
Matt
Using R Version - 4.2.1
R CODE
library('readr')
library('dplyr')
library('data.table')
library('survival')
library('ggplot2')
# AD RISK AND SBP
res.cox_SBPs_AD_m2 <- coxph(Surv(AD_timing, AD_binary) ~ Sex + W1_Age_CENT + W1_Age_CENT_2 +
                              W1_Education + W1_HT_hx + W1_DIAB + W1_SMOKING + APOE4 +
                              W1_SBP_CENT5 + W1_SBP_CENT5_2,
                            data = dat2[dat2$Wave == 1 & dat2$W1_SBP_3SD == TRUE, ])
# Graphing out the risk
# Calculating our parameters
summary_table_AD <- summary(res.cox_SBPs_AD_m2)$coefficients
SBP_linear_term_AD <- summary_table_AD['W1_SBP_CENT5', 'coef']
SBP_linear_term_AD_UL <- SBP_linear_term_AD + 1.96 * summary_table_AD['W1_SBP_CENT5', 'se(coef)']
SBP_linear_term_AD_LL <- SBP_linear_term_AD - 1.96 * summary_table_AD['W1_SBP_CENT5', 'se(coef)']
SBP_quadratic_term_AD <- summary_table_AD['W1_SBP_CENT5_2', 'coef']
SBP_quadratic_term_AD_UL <- SBP_quadratic_term_AD + 1.96 * summary_table_AD['W1_SBP_CENT5_2', 'se(coef)']
SBP_quadratic_term_AD_LL <- SBP_quadratic_term_AD - 1.96 * summary_table_AD['W1_SBP_CENT5_2', 'se(coef)']
x_vals_AD <- 74:204
x_cent <- (x_vals_AD - 138.5835) / 5
# these assignments are vectorized, so no loop is needed
# (the original loop recomputed the full vectors on every iteration)
y_vals_AD <- exp(SBP_quadratic_term_AD * x_cent^2 + SBP_linear_term_AD * x_cent)
y_vals_AD_UL <- exp(SBP_quadratic_term_AD_UL * x_cent^2 + SBP_linear_term_AD_UL * x_cent)
y_vals_AD_LL <- exp(SBP_quadratic_term_AD_LL * x_cent^2 + SBP_linear_term_AD_LL * x_cent)
# include the point estimate as well so the plot's data frame is self-contained
total_results <- data.frame(x_vals_AD, y_vals_AD, y_vals_AD_UL, y_vals_AD_LL)
# Basic line plot with points
ggplot(data = total_results) +
  geom_line(aes(x = x_vals_AD, y = y_vals_AD, colour = "AD"), linetype = "solid", size = 1) +
  geom_line(aes(x = x_vals_AD, y = y_vals_AD_UL, colour = "AD UL"), linetype = "dotted", size = 1) +
  geom_line(aes(x = x_vals_AD, y = y_vals_AD_LL, colour = "AD LL"), linetype = "dotted", size = 1) +
  labs(x = 'Baseline SBP (mmHg)', y = 'Hazard Ratios (HR)', title = "SBP and Dementia Risk") +
  theme_light(base_size = 15) +
  theme(axis.line = element_line(size = 1, colour = "black", linetype = 1),
        panel.grid = element_line(color = "grey", size = 0.125, linetype = 1),
        panel.border = element_blank())
ggsave(filename = 'SBP_Coxproportional_HR_AD_ULandLL.pdf')
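For what it's worth, the usual remedy is to treat b1*x + b2*x^2 as a single linear combination and compute its standard error from the model's variance-covariance matrix (the same delta-method idea as in the felm question above). A hedged, untested sketch reusing the question's object and term names:
## joint covariance of the linear and quadratic SBP coefficients
V <- vcov(res.cox_SBPs_AD_m2)[c("W1_SBP_CENT5", "W1_SBP_CENT5_2"),
                              c("W1_SBP_CENT5", "W1_SBP_CENT5_2")]
x <- (74:204 - 138.5835) / 5
lp <- SBP_linear_term_AD * x + SBP_quadratic_term_AD * x^2
## Var(b1*x + b2*x^2) = x^2 Var(b1) + x^4 Var(b2) + 2 x^3 Cov(b1, b2)
se_lp <- sqrt(x^2 * V[1, 1] + x^4 * V[2, 2] + 2 * x^3 * V[1, 2])
y_vals_AD_UL_delta <- exp(lp + 1.96 * se_lp)
y_vals_AD_LL_delta <- exp(lp - 1.96 * se_lp)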

Plot the impact for each variable in linear regression?

I want to create a plot like the one below for an lm model fitted in R.
Is there a simple way of doing it?
The plot above was taken from this page.
Package {caret} offers a convenient method varImp:
Example:
library(caret)
my_model <- lm(mpg ~ disp + cyl, data = mtcars)
## > varImp(my_model)
##
##       Overall
## disp 2.006696
## cyl  2.229809
For different measures of variable importance see ?varImp. Feed values into your plotting library of choice.
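For instance, a minimal sketch (my own, not from the caret docs) of feeding the varImp() scores into ggplot:
library(caret)
library(ggplot2)
vi <- varImp(lm(mpg ~ disp + cyl, data = mtcars))
vi$variable <- rownames(vi)
ggplot(vi, aes(x = reorder(variable, Overall), y = Overall)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Importance")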
Extra: {ggstatsplot} calculates and plots a host of model stats for a plethora of model objects. This includes hypotheses about regression coefficients, for which method ggcoefstats() might serve your purpose (remember to scale predictor variables for meaningful comparison of coefficients though).
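A minimal usage sketch (assuming the current ggcoefstats() interface), with the predictors scaled as advised:
library(ggstatsplot)
ggcoefstats(lm(mpg ~ scale(disp) + scale(cyl), data = mtcars))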
Following the method in the linked article (relative marginal increase in r-squared), you could write your own function that takes a formula and a data frame, then plots the relative importance:
library(ggplot2)
plot_importance <- function(formula, data) {
  lhs <- as.character(as.list(formula)[[2]])
  rhs <- as.list(as.list(formula)[[3]])
  vars <- grep("[+\\*]", rapply(rhs, as.character), invert = TRUE, value = TRUE)
  df <- do.call(rbind, lapply(seq_along(vars), function(i) {
    f1 <- as.formula(paste(lhs, paste(vars[-i], collapse = "+"), sep = "~"))
    f2 <- as.formula(paste(lhs, paste(c(vars[-i], vars[i]), collapse = "+"),
                           sep = "~"))
    r1 <- summary(lm(f1, data = data))$r.squared
    r2 <- summary(lm(f2, data = data))$r.squared
    data.frame(variable = vars[i], importance = r2 - r1)
  }))
  df$importance <- df$importance / sum(df$importance)
  df$variable <- reorder(factor(df$variable), -df$importance)
  ggplot(df, aes(x = variable, y = importance)) +
    geom_col(fill = "deepskyblue4") +
    scale_y_continuous(labels = scales::percent) +
    coord_flip() +
    labs(title = "Relative importance of variables",
         subtitle = deparse(formula)) +
    theme_classic(base_size = 16)
}
We can test this out with the sample data provided in the linked article:
IV <- read.csv(paste0("https://statisticsbyjim.com/wp-content/uploads/",
"2017/07/ImportantVariables.csv"))
plot_importance(Strength ~ Time + Pressure + Temperature, data = IV)
And we see that the plot is the same.
We can also test it out on some built-in datasets to demonstrate that it generalizes:
plot_importance(mpg ~ disp + wt + gear, data = mtcars)
plot_importance(Petal.Length ~ Species + Petal.Width, data = iris)
Created on 2022-05-01 by the reprex package (v2.0.1)
I just ended up using the relaimpo package and plotting with ggplot, as suggested in @Allan Cameron's answer:
library(relaimpo)
relative_importance <- calc.relimp(mymodel, type = "lmg")$lmg
df <- data.frame(
  variable = names(relative_importance),
  importance = round(c(relative_importance) * 100, 2)
)
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
  geom_col(fill = "deepskyblue4") +
  geom_text(aes(label = importance), vjust = .3, hjust = 1.2, size = 3, color = "white") +
  coord_flip() +
  labs(title = "Relative importance of variables") +
  theme_classic(base_size = 16)

Plotting posterior parameter estimates from multiple models with bayesplot

I am using the great plotting library bayesplot to visualize posterior probability intervals from models I am estimating with rstanarm. I want to graphically compare draws from different models by getting the posterior intervals for coefficients onto the same plot.
Imagine, for instance, that I have 1000 draws from the posterior for three parameters beta1, beta2, beta3 for two different models:
# load the plotting library
library(bayesplot)
#> This is bayesplot version 1.6.0
#> - Online documentation and vignettes at mc-stan.org/bayesplot
#> - bayesplot theme set to bayesplot::theme_default()
#> * Does _not_ affect other ggplot2 plots
#> * See ?bayesplot_theme_set for details on theme setting
library(ggplot2)
# generate fake posterior draws from model1
fdata <- matrix(rnorm(1000 * 3), ncol = 3)
colnames(fdata) <- c('beta1', 'beta2', 'beta3')
# fake posterior draws from model 2
fdata2 <- matrix(rnorm(1000 * 3, 1, 2), ncol = 3)
colnames(fdata2) <- c('beta1', 'beta2', 'beta3')
Bayesplot makes fantastic visualizations for individual model draws, and it is ggplot2 'under the hood' so I can customize as I please:
# a nice plot of 1
color_scheme_set("orange")
mcmc_intervals(fdata) + theme_minimal() + ggtitle("Model 1")
# a nice plot of 2
color_scheme_set("blue")
mcmc_intervals(fdata2) + ggtitle("Model 2")
But what I would like to achieve is to plot these two models together on the same plot, such that for each coefficient I have two intervals and can distinguish which interval is which by mapping color to the model. However, I can't figure out how to do this. Some things that don't work:
# doesnt work
mcmc_intervals(fdata) + mcmc_intervals(fdata2)
#> Error: Don't know how to add mcmc_intervals(fdata2) to a plot
# appears to pool
mcmc_intervals(list(fdata, fdata2))
Any ideas on how I could do this? Or how to do it manually given the matrices of posterior draws?
Created on 2018-10-18 by the reprex package (v0.2.1)
Just so the answer is also posted here, I have expanded on the code at the link from @Manny T (https://github.com/stan-dev/bayesplot/issues/232).
# simulate having posteriors for two different models, each with parameters beta[1], ..., beta[4]
posterior_1 <- matrix(rnorm(4000), 1000, 4)
posterior_2 <- matrix(rnorm(4000), 1000, 4)
colnames(posterior_1) <- colnames(posterior_2) <- paste0("beta[", 1:4, "]")
# use bayesplot::mcmc_intervals_data() to get interval data in a format that is easy to pass to ggplot
library(bayesplot)
combined <- rbind(mcmc_intervals_data(posterior_1), mcmc_intervals_data(posterior_2))
combined$model <- rep(c("Model 1", "Model 2"), each = ncol(posterior_1))
# make the plot using ggplot
library(ggplot2)
theme_set(bayesplot::theme_default())
pos <- position_nudge(y = ifelse(combined$model == "Model 2", 0, 0.1))
ggplot(combined, aes(x = m, y = parameter, color = model)) +
  geom_linerange(aes(xmin = l, xmax = h), position = pos, size = 2) +
  geom_linerange(aes(xmin = ll, xmax = hh), position = pos) +
  geom_point(position = pos, color = "black")
If you are like me, you will want 80% and 90% credible intervals (instead of 50% as the inner ones), might want the coordinates flipped, and might want a dashed line at 0 (where the model estimates no change). You can do that like this:
# use bayesplot::mcmc_intervals_data() to get interval data in a format that is easy to pass to ggplot
library(bayesplot)
combined <- rbind(mcmc_intervals_data(posterior_1, prob = 0.8, prob_outer = 0.9),
                  mcmc_intervals_data(posterior_2, prob = 0.8, prob_outer = 0.9))
combined$model <- rep(c("Model 1", "Model 2"), each = ncol(posterior_1))
# make the plot using ggplot
library(ggplot2)
theme_set(bayesplot::theme_default())
pos <- position_nudge(y = ifelse(combined$model == "Model 2", 0, 0.1))
ggplot(combined, aes(x = m, y = parameter, color = model)) +
  geom_linerange(aes(xmin = l, xmax = h), position = pos, size = 2) +
  geom_linerange(aes(xmin = ll, xmax = hh), position = pos) +
  geom_point(position = pos, color = "black") +
  coord_flip() +
  geom_vline(xintercept = 0, linetype = "dashed")
A few things to note on this last one. I added prob_outer = 0.9 even though that is the default, just to show how you might change the outer credible intervals. The dashed line is created with geom_vline and xintercept = here instead of geom_hline and yintercept = because of the coord_flip (everything is reversed). So if you don't flip axes, you will need to do the opposite.
I asked this question on the bayesplot page on GitHub and got a response (Issue #232).
I blew more time than I'd like to admit writing this, so I might as well post it here. Here's a function, incorporating the suggestions above, that (for the moment) works for rstanarm and brms model objects.
compare_posteriors <- function(..., dodge_width = 0.5) {
  dots <- rlang::dots_list(..., .named = TRUE)
  draws <- lapply(dots, function(x) {
    if (class(x)[1] == "stanreg") {
      posterior::subset_draws(posterior::as_draws(x$stanfit),
                              variable = names(fixef(x)))
    } else if (class(x)[1] == "brmsfit") {
      brm_draws <- posterior::subset_draws(posterior::as_draws(x$fit),
                                           variable = paste0("b_", rownames(fixef(x))))
      posterior::variables(brm_draws) <- stringr::str_split(posterior::variables(brm_draws), "_", simplify = TRUE)[, 2]
      posterior::rename_variables(brm_draws, `(Intercept)` = Intercept)
    } else {
      stop(paste0(class(x)[1], " objects not supported."))
    }
  })
  intervals <- lapply(draws, bayesplot::mcmc_intervals_data)
  combined <- dplyr::bind_rows(intervals, .id = "model")
  ggplot(combined, aes(x = m, y = parameter, color = model, group = model)) +
    geom_linerange(aes(xmin = l, xmax = h), size = 2, position = position_dodge(dodge_width)) +
    geom_linerange(aes(xmin = ll, xmax = hh), position = position_dodge(dodge_width)) +
    geom_point(color = "black", position = position_dodge(dodge_width)) +
    geom_vline(xintercept = 0, linetype = "dashed")
}
Usage:
compare_posteriors(mod1, mod2, mod3)

ggplot GLM fitted curve without interaction

I want to add the fitted function from a GLM to a ggplot. By default, it automatically creates the plot with the interaction. I am wondering if I can plot the fitted function from the model without the interaction. For example,
dta <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
dta <- within(dta, {
  prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})
plt <- ggplot(dta, aes(math, num_awards, col = prog)) +
  geom_point(size = 2) +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "poisson"))
print(plt)
gives the plot with interaction,
However, I want the plot from the model,
`num_awards` = β0 + β1*`math` + β2*`prog` + error
I tried to get this as follows:
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
fun.gen <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd)
fun.acd <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd + mod$coef[3])
fun.voc <- function(awd) exp(mod$coef[1] + mod$coef[2] * awd + mod$coef[4])
ggplot(dta, aes(math, num_awards, col = prog)) +
  geom_point() +
  stat_function(fun = fun.gen, col = "red") +
  stat_function(fun = fun.acd, col = "green") +
  stat_function(fun = fun.voc, col = "blue") +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "poisson"), linetype = "dashed")
The output plot is
Is there any simple way in ggplot to do this efficiently?
Ben's idea of plotting predicted values of the response for specific model terms inspired me to improve the type = "y.pc" option of the sjp.glm function. A new update is on GitHub, with version number 1.9.4-3.
Now you can plot predicted values for specific terms, one which is used along the x-axis, and a second one used as grouping factor:
sjp.glm(mod, type = "y.pc", vars = c("math", "prog"))
which gives you the following plot:
The vars argument is needed in case your model has more than two terms, to specify the term for the x-axis range and the term for the grouping.
You can also facet the groups:
sjp.glm(mod, type = "y.pc", vars = c("math", "prog"), show.ci = T, facet.grid = T)
There's no way that I know of to trick geom_smooth() into doing this, but you can do a little better than you've done. You still have to fit the model yourself and add the lines, but you can use the predict() method to generate the predictions and load them into a data frame with the same structure as the original data ...
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
## generate prediction frame
pframe <- with(dta,
               expand.grid(math = seq(min(math), max(math), length = 51),
                           prog = levels(prog)))
## add predicted values (on response scale) to prediction frame
pframe$num_awards <- predict(mod, newdata = pframe, type = "response")
ggplot(dta, aes(math, num_awards, col = prog)) +
  geom_point() +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "poisson"), linetype = "dashed") +
  geom_line(data = pframe) ## use prediction data here
## (inherits aesthetics etc. from main ggplot call)
(the only difference here is that the way I've done it the predictions span the full horizontal range for all groups, as if you had specified fullrange=TRUE in geom_smooth()).
In principle it seems as though the sjPlot package should be able to handle this sort of thing, but it looks like the relevant bit of code for doing this plot type is hard-coded to assume a binomial GLM ... oh well.
I'm not sure, but you wrote "without interaction" - maybe you are looking for effect plots? (If not, excuse me for assuming something completely wrong...)
You can, for instance, use the effects package for this.
dta <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
dta <- within(dta, {
  prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})
mod <- glm(num_awards ~ math + prog, data = dta, family = "poisson")
library(effects)
plot(allEffects(mod))
Another option would be the sjPlot package, as Ben suggested - however, the current version on CRAN only supports logistic regression models properly for effect plots. But in the current development version on GitHub I added support for various model families and link functions, so if you like, you can download that snapshot. The sjPlot package uses ggplot instead of lattice (which is used by the effects package, I think):
sjp.glm(mod, type = "eff", show.ci = T)
Or in non-faceted way:
sjp.glm(mod, type = "eff", facet.grid = F, show.ci = T)
