Plotting regression line on scatter plot using ggplot - r

I have a quadratic regression model. I would like to add the model's
fitted regression line to a scatter plot. My preference is to use ggplot2.
I am able to draw the scatter plot but when I use "stat_smooth()"
to specify the formula, I get the following warning and the fitted
line is not drawn on the scatter plot.
Warning messages:
1: 'newdata' had 80 rows but variables found have 24 rows
2: Computation failed in stat_smooth():
arguments imply differing number of rows: 80, 24
My code is below. Can someone please guide me what should I do
differently so that I can get fitted regression line in a scatter
plot using ggplot.
Code:
library(gamair)
library(ggplot2)
data(hubble)
names(hubble)[names(hubble) == "y"] <- c("velocity")
names(hubble)[names(hubble) == "x"] <- c("distance")
hubble$distance.sqr <- hubble$distance^2
model2.formula <- hubble$velocity ~ hubble$distance +
hubble$distance.sqr - 1
model2.hbl <- lm(model2.formula, data = hubble)
summary(model2.hbl)
model2.sp <- ggplot(hubble, aes(x = distance, y = velocity)) +
geom_point() + labs(title = "Scatter Plot between Distance & Velocity",
x = "Distance", y = "Velocity")
model2.sp + stat_smooth(method = "lm", formula = hubble$velocity ~
hubble$distance + hubble$distance.sqr - 1)

I think the issue here is how you specify the quadratic formula. For the squared term you could use I(x^2) or poly(x, 2). For example:
ggplot(hubble, aes(x, y)) +
geom_point() +
stat_smooth(method = "lm",
formula = y ~ x + poly(x, 2) - 1) +
labs(x = "Distance", y = "Velocity")

here is a MWE based on "mpg" dataset:
library(ggplot2)
ggplot(mpg, aes(x = hwy, y = displ)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE)

Related

ggplot - reverse y axis when plotting geom_smooth with gamma error distribution

I am trying to plot a geom_smooth using a gamma error distribution.
library(ggplot)
data <- data.frame(x = 1:100, y = (1:100 + runif(1:100, min = 0, max = 50))^2)
p <- ggplot(data, aes(x, y)) +
geom_point() +
geom_smooth(method = 'glm', method.args = list(family = Gamma(link = "log")))
I also want to reverse the y-axis however using scale_y_reverse, but this causes the Gamma distribution to fail as it can't be applied to negative values. How can I reverse the y-axis for this plot?
p + scale_y_reverse()
Warning message:
Computation failed in `stat_smooth()`:
non-positive values not allowed for the 'Gamma' family
I'm not sure if there are build-in methods to call out the predicted values of geom_smooth for scale_y_reverse to work.
Here's the more conventional method with visualizing of regression models, i.e. construct, predict and plot.
library(broom)
model <- glm(y ~ x, data, family = Gamma(link = "log"))
new <- augment(model, se=TRUE)
ggplot(new, aes(x, y)) +
geom_point() +
geom_line(aes(y=exp(1)^.fitted)) +
geom_line(aes(y=exp(1)^(.fitted + .se.fit)), linetype="dashed") +
geom_line(aes(y=exp(1)^(.fitted - .se.fit)), linetype="dashed") +
scale_y_reverse()

How can a graph of a polynomial regression with a categorical variable be plotted?

In the R statistical package, is there a way to plot a graph of a second order polynomial regression with one continuous variable and one categorical variable?
To generate a linear regression graph with one categorical variable:
library(ggplot2)
library(ggthemes) ## theme_few()
set.seed(1)
df <- data.frame(minutes = runif(60, 5, 15), endtime=60, category="a")
df$category = df$category=letters[seq( from = 1, to = 2 )]
df$endtime = df$endtime + df$minutes^3/180 + df$minutes*runif(60, 1, 2)
ggplot(df, aes(y=endtime, x=minutes, col = category)) +
geom_point() +
geom_smooth(method=lm) +
theme_few()
To plot a polynomial graph with one one continuous variable:
ggplot(df, aes(x=minutes, y=endtime)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) +
xlab('Minutes of warm up') +
ylab('End time')
But I can’t figure out how to plot a polynomial graph with one continuous variable and one categorical variable.
Just add a colour or group mapping. This will make ggplot fit and display separate polynomial regressions for each category. (1) It's not possible to display an additive mixed-polynomial regression (i.e. lm(y ~ poly(x,2) + category)); (2) what's shown here is not quite equivalent to the results of the interaction model lm(y ~ poly(x,2)*col), because the residual variances (and hence the widths of the confidence ribbons) are estimated separately for each group.
ggplot(df, aes(x=minutes, y=endtime, col = category)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,2)) +
labs(x = 'Minutes of warm up', y = 'End time') +
theme_few()

ggplot2 geom_smooth, extended model for method=lm

I would like to use geom_smooth to get a fitted line from a certain linear regression model.
It seems to me that the formula can only take x and y and not any additional parameter.
To show more clearly what I want:
library(dplyr)
library(ggplot2)
set.seed(35413)
df <- data.frame(pred = runif(100,10,100),
factor = sample(c("A","B"), 100, replace = TRUE)) %>%
mutate(
outcome = 100 + 10*pred +
ifelse(factor=="B", 200, 0) +
ifelse(factor=="B", 4, 0)*pred +
rnorm(100,0,60))
With
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
theme_bw()
I produce fitted lines that, due to the color=factor option, are basically the output of the linear model lm(outcome ~ pred*factor, df)
In some cases, however, I prefer the lines to be the output of a different model fit, like lm(outcome ~ pred + factor, df), for which I can use something like:
fit <- lm(outcome ~ pred+factor, df)
predval <- expand.grid(
pred = seq(
min(df$pred), max(df$pred), length.out = 1000),
factor = unique(df$factor)) %>%
mutate(outcome = predict(fit, newdata = .))
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point() +
geom_line(data = predval) +
theme_bw()
which results in :
My question: is there a way to produce the latter graph exploiting the geom_smooth instead? I know there is a formula = - option in geom_smooth but I can't make something like formula = y ~ x + factor or formula = y ~ x + color (as I defined color = factor) work.
This is a very interesting question. Probably the main reason why geom_smooth is so "resistant" to allowing custom models of multiple variables is that it is limited to producing 2-D curves; consequently, its arguments are designed for handling two-dimensional data (i.e. formula = response variable ~ independent variable).
The trick to getting what you requested is using the mapping argument within geom_smooth, instead of formula. As you've probably seen from looking at the documentation, formula only allows you to specify the mathematical structure of the model (e.g. linear, quadratic, etc.). Conversely, the mapping argument allows you to directly specify new y-values - such as the output of a custom linear model that you can call using predict().
Note that, by default, inherit.aes is set to TRUE, so your plotted regressions will be coloured appropriately by your categorical variable. Here's the code:
# original plot
plot1 <- ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
ggtitle("outcome ~ pred") +
theme_bw()
# declare new model here
plm <- lm(formula = outcome ~ pred + factor, data=df)
# plot with lm for outcome ~ pred + factor
plot2 <-ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm", mapping=aes(y=predict(plm,df))) +
ggtitle("outcome ~ pred + factor") +
theme_bw()

Plotting lm curve does not show using ggplot

As per ex18q1 in "R for Data Science" I am trying to find the best model for the data:
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
I've applied linear model and am trying to plot the results on a graph using ggplot:
sim1a_mod <- lm(x ~ y, data = sim1a)
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour= "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red")
coef(sim1a_mod)[[1]] prints -1.14403
coef(sim1a_mod)[[2]] prints 0.4384473
I create the plot with the data points, but the model is not showing. What am I doing wrong?
The nomenclature for typing formulas for model functions like lm(), glm(), lmer() etc. in R is always DV ~ IV1 + IV2 + ... + IVn where DV is your dependent variable and IVn is your list of independent variables. We typically chart the dependent variable on the y-axis and the independent variable on the x-axis, so in your case you'll need to change your sim1a_mod model to lm(y ~ x, data = sim1a).
In your original code, because you were running a different model, your line was being charted, but it was outside of your view. If you attempt to chart again with your original model with the following code you will then see your regression line:
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour= "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red") +
scale_x_continuous(limits = c(-30, 30)) + scale_y_continuous(limits = c(-30, 30))

loess and glm plotting with ggplot2

I am trying to plot the model predictions from a binary choice glm against the empirical probability using data from the titanic. To show differences across class and sex I am using faceting, but I have two things things I can't quite figure out. The first is that I'd like to restrict the loess curve to be between 0 and 1, but if I add the option ylim(c(0,1)) to the end of the plot, the ribbon around the loess curve gets cut off if one side of it is outside the bound. The second thing I'd like to do is draw a line from the minimum x-value (predicted probability from the glm) for each facet, to the maximum x-value (within the same facet) and y = 1 so as to show glm predicted probability.
#info on this data http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt
load(url('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.sav'))
titanic <- titanic3[ ,-c(3,8:14)]; rm(titanic3)
titanic <- na.omit(titanic) #probably missing completely at random
titanic$age <- as.numeric(titanic$age)
titanic$sibsp <- as.integer(titanic$sibsp)
titanic$survived <- as.integer(titanic$survived)
training.df <- titanic[sample(nrow(titanic), nrow(titanic) / 2), ]
validation.df <- titanic[!(row.names(titanic) %in% row.names(training.df)), ]
glm.fit <- glm(survived ~ sex + sibsp + age + I(age^2) + factor(pclass) + sibsp:sex,
family = binomial(link = "probit"), data = training.df)
glm.predict <- predict(glm.fit, newdata = validation.df, se.fit = TRUE, type = "response")
plot.data <- data.frame(mean = glm.predict$fit, response = validation.df$survived,
class = validation.df$pclass, sex = validation.df$sex)
require(ggplot2)
ggplot(data = plot.data, aes(x = as.numeric(mean), y = as.integer(response))) + geom_point() +
stat_smooth(method = "loess", formula = y ~ x) +
facet_wrap( ~ class + sex, scale = "free") + ylim(c(0,1)) +
xlab("Predicted Probability of Survival") + ylab("Empirical Survival Rate")
The answer to your first question is to use coord_cartesian(ylim=c(0,1)) instead of ylim(0,1); this is a moderately FAQ.
For your second question, there may be a way to do it within ggplot but it was easier for me to summarize the data externally:
g0 <- ggplot(data = plot.data, aes(x = mean, y = response)) + geom_point() +
stat_smooth(method = "loess") +
facet_wrap( ~ class + sex, scale = "free") +
coord_cartesian(ylim=c(0,1))+
labs(x="Predicted Probability of Survival",
y="Empirical Survival Rate")
(I shortened your code slightly by eliminating some default values and using labs.)
ss <- ddply(plot.data,c("class","sex"),summarise,minx=min(mean),maxx=max(mean))
g0 + geom_segment(data=ss,aes(x=minx,y=minx,xend=maxx,yend=maxx),
colour="red",alpha=0.5)

Resources