I am trying to plot the model predictions from a binary choice glm against the empirical probability using data from the titanic. To show differences across class and sex I am using faceting, but I have two things things I can't quite figure out. The first is that I'd like to restrict the loess curve to be between 0 and 1, but if I add the option ylim(c(0,1)) to the end of the plot, the ribbon around the loess curve gets cut off if one side of it is outside the bound. The second thing I'd like to do is draw a line from the minimum x-value (predicted probability from the glm) for each facet, to the maximum x-value (within the same facet) and y = 1 so as to show glm predicted probability.
#info on this data http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt
load(url('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.sav'))
titanic <- titanic3[ ,-c(3,8:14)]; rm(titanic3)
titanic <- na.omit(titanic) #probably missing completely at random
titanic$age <- as.numeric(titanic$age)
titanic$sibsp <- as.integer(titanic$sibsp)
titanic$survived <- as.integer(titanic$survived)
training.df <- titanic[sample(nrow(titanic), nrow(titanic) / 2), ]
validation.df <- titanic[!(row.names(titanic) %in% row.names(training.df)), ]
glm.fit <- glm(survived ~ sex + sibsp + age + I(age^2) + factor(pclass) + sibsp:sex,
family = binomial(link = "probit"), data = training.df)
glm.predict <- predict(glm.fit, newdata = validation.df, se.fit = TRUE, type = "response")
plot.data <- data.frame(mean = glm.predict$fit, response = validation.df$survived,
class = validation.df$pclass, sex = validation.df$sex)
require(ggplot2)
ggplot(data = plot.data, aes(x = as.numeric(mean), y = as.integer(response))) + geom_point() +
stat_smooth(method = "loess", formula = y ~ x) +
facet_wrap( ~ class + sex, scale = "free") + ylim(c(0,1)) +
xlab("Predicted Probability of Survival") + ylab("Empirical Survival Rate")
The answer to your first question is to use coord_cartesian(ylim=c(0,1)) instead of ylim(0,1); this is a moderately FAQ.
For your second question, there may be a way to do it within ggplot but it was easier for me to summarize the data externally:
g0 <- ggplot(data = plot.data, aes(x = mean, y = response)) + geom_point() +
stat_smooth(method = "loess") +
facet_wrap( ~ class + sex, scale = "free") +
coord_cartesian(ylim=c(0,1))+
labs(x="Predicted Probability of Survival",
y="Empirical Survival Rate")
(I shortened your code slightly by eliminating some default values and using labs.)
ss <- ddply(plot.data,c("class","sex"),summarise,minx=min(mean),maxx=max(mean))
g0 + geom_segment(data=ss,aes(x=minx,y=minx,xend=maxx,yend=maxx),
colour="red",alpha=0.5)
Related
In the R statistical package, is there a way to plot a graph of a second order polynomial regression with one continuous variable and one categorical variable?
To generate a linear regression graph with one categorical variable:
library(ggplot2)
library(ggthemes) ## theme_few()
set.seed(1)
df <- data.frame(minutes = runif(60, 5, 15), endtime=60, category="a")
df$category = df$category=letters[seq( from = 1, to = 2 )]
df$endtime = df$endtime + df$minutes^3/180 + df$minutes*runif(60, 1, 2)
ggplot(df, aes(y=endtime, x=minutes, col = category)) +
geom_point() +
geom_smooth(method=lm) +
theme_few()
To plot a polynomial graph with one one continuous variable:
ggplot(df, aes(x=minutes, y=endtime)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) +
xlab('Minutes of warm up') +
ylab('End time')
But I can’t figure out how to plot a polynomial graph with one continuous variable and one categorical variable.
Just add a colour or group mapping. This will make ggplot fit and display separate polynomial regressions for each category. (1) It's not possible to display an additive mixed-polynomial regression (i.e. lm(y ~ poly(x,2) + category)); (2) what's shown here is not quite equivalent to the results of the interaction model lm(y ~ poly(x,2)*col), because the residual variances (and hence the widths of the confidence ribbons) are estimated separately for each group.
ggplot(df, aes(x=minutes, y=endtime, col = category)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,2)) +
labs(x = 'Minutes of warm up', y = 'End time') +
theme_few()
I have a data.frame with observed success/failure outcomes per two groups along with expected probabilities:
library(dplyr)
observed.probability.df <- data.frame(group = c("A","B"), p = c(0.4,0.6))
expected.probability.df <- data.frame(group = c("A","B"), p = qlogis(c(0.45,0.55)))
observed.data.df <- do.call(rbind,lapply(c("A","B"), function(g)
data.frame(group = g, value = c(rep(0,1000*dplyr::filter(observed.probability.df, group != g)$p),rep(1,1000*dplyr::filter(observed.probability.df, group == g)$p)))
)) %>% dplyr::left_join(expected.probability.df)
observed.probability.df$group <- factor(observed.probability.df$group, levels = c("A","B"))
observed.data.df$group <- factor(observed.data.df$group, levels = c("A","B"))
I'm fitting a logistic regression (binomial glm with a logit link function) to these data with the offset term:
fit <- glm(value ~ group + offset(p), data = observed.data.df, family = binomial(link = 'logit'))
Now, I'd like to plot these data as a bar graph using ggplot2's geom_bar, color-coded by group, and to add to that the trend line and shaded standard error area estimated in fit.
I'd use stat_smooth for that but I don't think it can handle the offset term in it's formula, so looks like I need to resort to assembling this figure in an alternative way.
To get the bars and the trend line I used:
slope.est <- function(x, ests) plogis(ests[1] + ests[2] * x)
library(ggplot2)
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
So the question is how to add to that the shaded standard error around the trend line?
Using stat_function I am able to shade the entire area from the upper bound of the standard error all the way down to the X-axis:
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
stat_function(fun = slope.est,args=list(ests=summary(fit)$coefficients[,1]+summary(fit)$coefficients[,2]),geom='area',fill="gray",alpha=0.25) +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
Which is close but not quite there.
Any idea how to subtract from the shaded area above the area that's below the lower bound of the standard error? Perhaps geom_ribbon is the way to go here, but I don't know how to combine it with the slope.est function
I would like to use geom_smooth to get a fitted line from a certain linear regression model.
It seems to me that the formula can only take x and y and not any additional parameter.
To show more clearly what I want:
library(dplyr)
library(ggplot2)
set.seed(35413)
df <- data.frame(pred = runif(100,10,100),
factor = sample(c("A","B"), 100, replace = TRUE)) %>%
mutate(
outcome = 100 + 10*pred +
ifelse(factor=="B", 200, 0) +
ifelse(factor=="B", 4, 0)*pred +
rnorm(100,0,60))
With
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
theme_bw()
I produce fitted lines that, due to the color=factor option, are basically the output of the linear model lm(outcome ~ pred*factor, df)
In some cases, however, I prefer the lines to be the output of a different model fit, like lm(outcome ~ pred + factor, df), for which I can use something like:
fit <- lm(outcome ~ pred+factor, df)
predval <- expand.grid(
pred = seq(
min(df$pred), max(df$pred), length.out = 1000),
factor = unique(df$factor)) %>%
mutate(outcome = predict(fit, newdata = .))
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point() +
geom_line(data = predval) +
theme_bw()
which results in :
My question: is there a way to produce the latter graph exploiting the geom_smooth instead? I know there is a formula = - option in geom_smooth but I can't make something like formula = y ~ x + factor or formula = y ~ x + color (as I defined color = factor) work.
This is a very interesting question. Probably the main reason why geom_smooth is so "resistant" to allowing custom models of multiple variables is that it is limited to producing 2-D curves; consequently, its arguments are designed for handling two-dimensional data (i.e. formula = response variable ~ independent variable).
The trick to getting what you requested is using the mapping argument within geom_smooth, instead of formula. As you've probably seen from looking at the documentation, formula only allows you to specify the mathematical structure of the model (e.g. linear, quadratic, etc.). Conversely, the mapping argument allows you to directly specify new y-values - such as the output of a custom linear model that you can call using predict().
Note that, by default, inherit.aes is set to TRUE, so your plotted regressions will be coloured appropriately by your categorical variable. Here's the code:
# original plot
plot1 <- ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
ggtitle("outcome ~ pred") +
theme_bw()
# declare new model here
plm <- lm(formula = outcome ~ pred + factor, data=df)
# plot with lm for outcome ~ pred + factor
plot2 <-ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm", mapping=aes(y=predict(plm,df))) +
ggtitle("outcome ~ pred + factor") +
theme_bw()
I have data that looks like this:
height <- c(1,2,3,4,2,4,6,8)
weight <- c(12,13,14,15,22,23,24,25)
type <- c("Wheat","Wheat","Wheat","Wheat","Rice","Rice","Rice","Rice")
set <- c(1,1,1,1,2,2,2,2)
dat <- data.frame(set,type,height,weight)
I run a lmer model with set as a random effect in R:
mod <- lmer(weight~height + type + (1|set), data = dat)
Now, I want to plot the estimates of the model and plot a regression, with weight on the x-axis and height on the y-axis, facet(~type)
I use the predict function as follows
dat$pred <- predict(mod, type = "response")
And I want to achieve a ggplot that will look like this:
ggplot(dat,aes(x = weight, y = height)) +
geom_point() + geom_smooth(method="lm", fill=NA) + facet_grid(~ type, scales = "free")
However, I note that the predict function has only a singular output. How do I plot that to achieve the same as above? Or do I have to store two different predict responses, and then plug it into the x,y of ggplot?
I can adapt your plot to show raw vs. predicted values like this:
ggplot(dat,aes(y = height)) +
geom_point(aes(x = weight)) +
geom_line(aes(x = pred)) +
facet_grid(~ type, scales = "free")
In your example plot though you have weight, the outcome variable in your model, on the x-axis, which is confusing. Normally you would have the outcome/predicted variable on the y-axis, so I would have plotted your model predictions like:
ggplot(dat,aes(x = height)) +
geom_point(aes(y = weight)) +
geom_line(aes(y = pred)) +
facet_grid(~ type, scales = "free")
Because I wanted to have the nls model separately, I did a fit to my data inside the geom_smooth function and outside ggplot:
library(ggplot2)
set.seed(1)
data <- data.frame(x=rnorm(100))
a <- 4
b <- -2
data$y <- with(data, exp(a + b * x) + rnorm(100) + 100)
mod <- nls(formula = y ~ (exp(a + b * x)), data = data, start = list(a = a, b = b))
data$fit <- predict(mod, newdata=data)
plot <- ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = "nls", colour = "red", formula=y ~ exp(a + b * x),
method.args = list(start = c(a = a, b = b)), se=F, span=0) +
geom_line(aes(x=x, y=fit), colour="blue") +
scale_y_log10()
I just wondering why both methods, though with the same parameters, give a different fit? Does geom_smooth use some transformation?
geom_smooth doesn't make predictions from the original dataset, but instead makes a dataset for prediction. By default this dataset has 80 rows, but you can change this with the n argument.
To see that the model fit via geom_smooth and the model fit by nls are the same, you need to use the same dataset for prediction. You can pull the one used by geom_smooth out via ggplot_build. The dataset used for prediction is the second in the list.
dat2 = ggplot_build(plot)$data[[2]]
Now use dat2 for making predictions from the nls model and remake the plot.
dat2$fit2 = predict(mod, newdata = dat2)
ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = "nls", colour = "red", formula=y ~ exp(a + b * x),
method.args = list(start = c(a = 4, b = -2)), se = FALSE) +
geom_line(data = dat2, aes(x=x, y=fit2), colour="blue")
Note that if you want to display on the log10 scale when comparing geom_smooth to a predicted line you'll want to use coord_trans(y = "log10") instead of scale_y_log10. Scale transformation happens prior to model fitting, so you would be fitting a model to a log10-transformed y if you use scale_y_log10.