Because I wanted to have the nls model available separately, I fit it to my data both inside the geom_smooth function and outside of ggplot:
library(ggplot2)
set.seed(1)
data <- data.frame(x=rnorm(100))
a <- 4
b <- -2
data$y <- with(data, exp(a + b * x) + rnorm(100) + 100)
mod <- nls(formula = y ~ (exp(a + b * x)), data = data, start = list(a = a, b = b))
data$fit <- predict(mod, newdata=data)
plot <- ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = "nls", colour = "red", formula=y ~ exp(a + b * x),
method.args = list(start = c(a = a, b = b)), se=F, span=0) +
geom_line(aes(x=x, y=fit), colour="blue") +
scale_y_log10()
I'm just wondering why the two methods, though given the same parameters, produce different fits. Does geom_smooth apply some transformation?
geom_smooth doesn't make predictions at the original data points; instead, it constructs its own evenly spaced dataset for prediction. By default this dataset has 80 rows, but you can change that with the n argument.
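For example, you can ask geom_smooth to evaluate the fit on a denser grid (a sketch reusing the nls call from the question):
# evaluate the smooth at 200 evenly spaced x values instead of the default 80
geom_smooth(method = "nls", colour = "red", formula = y ~ exp(a + b * x),
            method.args = list(start = c(a = 4, b = -2)),
            se = FALSE, n = 200)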
To see that the model fit via geom_smooth and the model fit by nls are the same, you need to use the same dataset for prediction. You can pull the one used by geom_smooth out via ggplot_build. The prediction dataset is the second element of the resulting data list, because geom_smooth is the second layer in the plot.
dat2 = ggplot_build(plot)$data[[2]]
Now use dat2 for making predictions from the nls model and remake the plot.
dat2$fit2 = predict(mod, newdata = dat2)
ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = "nls", colour = "red", formula=y ~ exp(a + b * x),
method.args = list(start = c(a = 4, b = -2)), se = FALSE) +
geom_line(data = dat2, aes(x=x, y=fit2), colour="blue")
Note that if you want to display on the log10 scale when comparing geom_smooth to a predicted line you'll want to use coord_trans(y = "log10") instead of scale_y_log10. Scale transformation happens prior to model fitting, so you would be fitting a model to a log10-transformed y if you use scale_y_log10.
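For example, to draw the comparison above on a log10 axis without refitting on transformed data:
ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = "nls", colour = "red", formula=y ~ exp(a + b * x),
method.args = list(start = c(a = 4, b = -2)), se = FALSE) +
geom_line(data = dat2, aes(x=x, y=fit2), colour="blue") +
coord_trans(y = "log10")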
I am trying to plot a geom_smooth using a gamma error distribution.
library(ggplot2)
data <- data.frame(x = 1:100, y = (1:100 + runif(100, min = 0, max = 50))^2)
p <- ggplot(data, aes(x, y)) +
geom_point() +
geom_smooth(method = 'glm', method.args = list(family = Gamma(link = "log")))
However, I also want to reverse the y-axis using scale_y_reverse, but this causes the Gamma fit to fail because the reversed response contains negative values. How can I reverse the y-axis for this plot?
p + scale_y_reverse()
Warning message:
Computation failed in `stat_smooth()`:
non-positive values not allowed for the 'Gamma' family
I'm not sure if there are built-in methods to extract the predicted values from geom_smooth so that scale_y_reverse can work.
Here's the more conventional method of visualizing regression models: construct the model, predict from it, and plot the predictions.
library(broom)
model <- glm(y ~ x, data = data, family = Gamma(link = "log"))
new <- augment(model, se_fit = TRUE)  # se_fit adds the .se.fit column
ggplot(new, aes(x, y)) +
geom_point() +
geom_line(aes(y = exp(.fitted))) +
geom_line(aes(y = exp(.fitted + .se.fit)), linetype = "dashed") +
geom_line(aes(y = exp(.fitted - .se.fit)), linetype = "dashed") +
scale_y_reverse()
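If you'd rather show a shaded band than dashed lines, the same columns work with geom_ribbon (a sketch using the new data frame from above):
ggplot(new, aes(x, y)) +
geom_point() +
geom_ribbon(aes(ymin = exp(.fitted - .se.fit),
                ymax = exp(.fitted + .se.fit)), alpha = 0.2) +
geom_line(aes(y = exp(.fitted))) +
scale_y_reverse()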
I have the following data
df <- data.frame(x= c(0,1,10,100,1000,0,1, 10,100,1000,0,1,10,100,1000),
y=c(7,15,135,1132,6459,-3,11,127,1120,6249,-5,13,126,1208,6208))
After fitting a linear model to the data, I used the model to predict y values from known x values and stored the predictions in a data frame, pred.fits:
fit <- lm(data = df, y ~ x)
pred.fits <- expand.grid(x=seq(1, 2000, length=2001))
pm <- predict(fit, newdata=pred.fits, interval="confidence")
pred.fits$py <- pm[,1]
I plot the data using both geom_smooth() and geom_line(), and the two lines seem to coincide:
ggplot(df, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE, size=1.5) +
geom_line(data=pred.fits, aes(x=x, y=py), size=.2)
However, when I plot the same data with the axes on a log scale, the two regression lines differ drastically:
ggplot(df, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE, size=1.5) +
geom_line(data=pred.fits, aes(x=x, y=py), size=.2) +
scale_x_log10() +
scale_y_log10()
Am I missing something here?
UPDATE
After #Duck pointed me in the right direction, I was able to get it right. The issue was that I wanted the data to stay untransformed but the axes to be on a log10 scale. This is how I did it:
df2 <- df[df$x>=1,] # drop the x = 0 rows, which trigger log10 warnings
fit2 <- lm(data = df2, log10(y) ~ log10(x))
pred.fits2 <- expand.grid(x=seq(10^0, 10^3 , length=200))
pm2 <- predict(fit2, newdata=pred.fits2, interval="confidence")
pred.fits2$py <- 10^pm2[,1] # convert the predicted y values to linear scale
ggplot(df2, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE, size=1.5) +
geom_line(data=pred.fits2, aes(x=x, y=py), size=1.5, linetype = "longdash") +
scale_x_log10() +
scale_y_log10()
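For completeness, the confidence bounds returned by predict(..., interval = "confidence") can be back-transformed the same way:
# pm2 has columns "fit", "lwr" and "upr"; convert the bounds like the fit
pred.fits2$lwr <- 10^pm2[, "lwr"]
pred.fits2$upr <- 10^pm2[, "upr"]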
Thanks everyone for your help.
This code may be useful for your understanding (thanks to #BWilliams for the valuable comment). You want x and y on log scales, so mixing a linear model fit on the original scale with log-scaled axes can mess everything up. If you want the scales to agree, it is better to train a separate model on the log variables and then plot it using the properly transformed values. Here is an approach where we build a log-log model and then plot it (rows with x values of one or below, which include the negative y values, are excluded in a new data frame, df2). Here is the code:
First linear model:
library(ggplot2)
#Data
df <- data.frame(x= c(0,1,10,100,1000,0,1, 10,100,1000,0,1,10,100,1000),
y=c(7,15,135,1132,6459,-3,11,127,1120,6249,-5,13,126,1208,6208))
#Model 1 all obs
fit <- lm(data = df, y ~ x)
pred.fits <- expand.grid(x=seq(1, 2000, length=2001))
pm <- predict(fit, newdata=pred.fits, interval="confidence")
pred.fits$py <- pm[,1]
#Plot 1
ggplot(df, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE, size=1.5) +
geom_line(data=pred.fits, aes(x=x, y=py), size=.2)
Output:
Now the sketch for the log variables; notice how we apply log() to the main variables and how the model is built:
#First remove problematic rows (x <= 1, including the negative y values)
df2 <- df[df$x>1,]
#Train a new model
pred.fits2 <- expand.grid(x=seq(1, 2000, length=2001))
fit2 <- lm(data = df2, log(y) ~ log(x))
pm2 <- predict(fit2, newdata=pred.fits2, interval="confidence")
pred.fits2$py <- pm2[,1]
#Plot 2
ggplot(df2, aes(x=log(x), y=log(y))) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE, size=1.5) +
geom_line(data=pred.fits2, aes(x=log(x), y=py), size=.2)
Output:
I would like to use geom_smooth to get a fitted line from a certain linear regression model.
It seems to me that the formula argument can only involve x and y, not any additional variables.
To show more clearly what I want:
library(dplyr)
library(ggplot2)
set.seed(35413)
df <- data.frame(pred = runif(100,10,100),
factor = sample(c("A","B"), 100, replace = TRUE)) %>%
mutate(
outcome = 100 + 10*pred +
ifelse(factor=="B", 200, 0) +
ifelse(factor=="B", 4, 0)*pred +
rnorm(100,0,60))
With
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
theme_bw()
I produce fitted lines that, due to the color=factor mapping, are essentially the output of the linear model lm(outcome ~ pred*factor, df).
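As a quick check that this is what's happening, the per-group fits reproduce the interaction model:
# per-group simple regressions
fit_A <- lm(outcome ~ pred, data = subset(df, factor == "A"))
fit_B <- lm(outcome ~ pred, data = subset(df, factor == "B"))
# interaction model: group A's line is (Intercept, pred);
# group B's line adds the factorB and pred:factorB terms
fit_int <- lm(outcome ~ pred * factor, data = df)
coef(fit_A); coef(fit_B); coef(fit_int)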
In some cases, however, I prefer the lines to be the output of a different model fit, like lm(outcome ~ pred + factor, df), for which I can use something like:
fit <- lm(outcome ~ pred+factor, df)
predval <- expand.grid(
pred = seq(
min(df$pred), max(df$pred), length.out = 1000),
factor = unique(df$factor)) %>%
mutate(outcome = predict(fit, newdata = .))
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point() +
geom_line(data = predval) +
theme_bw()
which results in:
My question: is there a way to produce the latter graph using geom_smooth instead? I know geom_smooth has a formula argument, but I can't make something like formula = y ~ x + factor or formula = y ~ x + color (as I defined color = factor) work.
This is a very interesting question. Probably the main reason why geom_smooth is so "resistant" to allowing custom models of multiple variables is that it is limited to producing 2-D curves; consequently, its arguments are designed for handling two-dimensional data (i.e. formula = response variable ~ independent variable).
The trick to getting what you requested is using the mapping argument within geom_smooth, instead of formula. As you've probably seen from looking at the documentation, formula only allows you to specify the mathematical structure of the model (e.g. linear, quadratic, etc.). Conversely, the mapping argument allows you to directly specify new y-values - such as the output of a custom linear model that you can call using predict().
Note that, by default, inherit.aes is set to TRUE, so your plotted regressions will be coloured appropriately by your categorical variable. Here's the code:
# original plot
plot1 <- ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
ggtitle("outcome ~ pred") +
theme_bw()
# declare new model here
plm <- lm(formula = outcome ~ pred + factor, data=df)
# plot with lm for outcome ~ pred + factor
plot2 <-ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm", mapping=aes(y=predict(plm,df))) +
ggtitle("outcome ~ pred + factor") +
theme_bw()
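One caveat: geom_smooth still refits its own per-group lm to the mapped predictions, so its ribbons reflect that refit rather than plm's standard errors; within each group the predictions are exactly linear, making the ribbons essentially zero-width. If that is misleading, you can suppress them:
geom_smooth(method = "lm", mapping = aes(y = predict(plm, df)), se = FALSE)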
As per ex18q1 in "R for Data Science" I am trying to find the best model for the data:
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
I've fit a linear model and am trying to plot the results on a graph using ggplot:
sim1a_mod <- lm(x ~ y, data = sim1a)
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour= "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red")
coef(sim1a_mod)[[1]] prints -1.14403
coef(sim1a_mod)[[2]] prints 0.4384473
The plot shows the data points, but the model line does not appear. What am I doing wrong?
The nomenclature for typing formulas in model functions like lm(), glm(), lmer(), etc. in R is always DV ~ IV1 + IV2 + ... + IVn, where DV is your dependent variable and the IVs are your independent variables. We typically chart the dependent variable on the y-axis and the independent variable on the x-axis, so in your case you'll need to change your sim1a_mod model to lm(y ~ x, data = sim1a).
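For example, refitting with the corrected formula and reusing your plotting code:
sim1a_mod <- lm(y ~ x, data = sim1a)
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour = "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red")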
In your original code, because you were fitting a different model, your line was being drawn, but it fell outside the plot's viewing window. If you chart your original model again with expanded axis limits, you will see its regression line:
ggplot(sim1a, aes(x, y)) +
geom_point(size = 2, colour= "gray") +
geom_abline(intercept = coef(sim1a_mod)[[1]], slope = coef(sim1a_mod)[[2]], colour = "red") +
scale_x_continuous(limits = c(-30, 30)) + scale_y_continuous(limits = c(-30, 30))
I am trying to plot the model predictions from a binary choice glm against the empirical probability, using data from the Titanic. To show differences across class and sex I am using faceting, but there are two things I can't quite figure out. The first is that I'd like to restrict the loess curve to lie between 0 and 1, but if I add ylim(c(0,1)) to the plot, the ribbon around the loess curve gets cut off wherever it falls outside those bounds. The second is that I'd like to draw a line in each facet from the minimum x-value (the predicted probability from the glm) to the point at the maximum x-value (within the same facet) and y = 1, so as to show the glm predicted probability.
#info on this data http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt
load(url('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.sav'))
titanic <- titanic3[ ,-c(3,8:14)]; rm(titanic3)
titanic <- na.omit(titanic) #probably missing completely at random
titanic$age <- as.numeric(titanic$age)
titanic$sibsp <- as.integer(titanic$sibsp)
titanic$survived <- as.integer(titanic$survived)
training.df <- titanic[sample(nrow(titanic), nrow(titanic) / 2), ]
validation.df <- titanic[!(row.names(titanic) %in% row.names(training.df)), ]
glm.fit <- glm(survived ~ sex + sibsp + age + I(age^2) + factor(pclass) + sibsp:sex,
family = binomial(link = "probit"), data = training.df)
glm.predict <- predict(glm.fit, newdata = validation.df, se.fit = TRUE, type = "response")
plot.data <- data.frame(mean = glm.predict$fit, response = validation.df$survived,
class = validation.df$pclass, sex = validation.df$sex)
require(ggplot2)
ggplot(data = plot.data, aes(x = as.numeric(mean), y = as.integer(response))) + geom_point() +
stat_smooth(method = "loess", formula = y ~ x) +
facet_wrap( ~ class + sex, scale = "free") + ylim(c(0,1)) +
xlab("Predicted Probability of Survival") + ylab("Empirical Survival Rate")
The answer to your first question is to use coord_cartesian(ylim=c(0,1)) instead of ylim(0,1); this is a fairly frequently asked question.
For your second question, there may be a way to do it within ggplot but it was easier for me to summarize the data externally:
g0 <- ggplot(data = plot.data, aes(x = mean, y = response)) + geom_point() +
stat_smooth(method = "loess") +
facet_wrap( ~ class + sex, scale = "free") +
coord_cartesian(ylim=c(0,1))+
labs(x="Predicted Probability of Survival",
y="Empirical Survival Rate")
(I shortened your code slightly by eliminating some default values and using labs.)
library(plyr)
ss <- ddply(plot.data, c("class","sex"), summarise, minx=min(mean), maxx=max(mean))
g0 + geom_segment(data=ss,aes(x=minx,y=minx,xend=maxx,yend=maxx),
colour="red",alpha=0.5)
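If you prefer dplyr over plyr, an equivalent summary looks like this (a sketch; summarise is namespaced to avoid clashing with plyr):
library(dplyr)
ss <- plot.data %>%
group_by(class, sex) %>%
dplyr::summarise(minx = min(mean), maxx = max(mean), .groups = "drop")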