I'm trying to plot different simple linear regression estimates on the same coordinate plane to understand the differences between the methods. My question, though, is about adding these lines in R code, not about the statistics behind the different lines.
Here I'm using the mtcars dataset, together with the mblm and quantreg packages, to come up with different regression equations or, more specifically, the slope and intercept parameters for different simple linear regression estimates.
The OLS estimate I add using the geom_smooth() function and specifying the method argument. I could add the slope and intercept using geom_abline() after creating a linear model object; that's another option.
For the Theil-Sen and least absolute deviations (median regression) estimates, I first create a model for each using the respective package. Then I add the slope and intercept using geom_abline().
So now I've added the lines manually. But how can I create a key or legend in ggplot to show these different lines? ggplot() adds the key automatically when geom_smooth() is separated into different groups. But I don't think it adds a legend for geom_abline(), and my plot uses a mixture of both anyway. Any ideas? I've never had to add my own key in this way.
library(ggplot2)
library(mblm)
ts_fit <- mblm(mpg ~ wt, data = mtcars)
library(quantreg)
lad_fit <- rq(mpg ~ wt, data = mtcars)
ggplot(mtcars, aes(x = wt, y = mpg)) +
labs(subtitle = "Simple Linear Regressions") +
geom_point() +
geom_smooth(method = 'lm', se = FALSE, color = '#376795') +
geom_abline(intercept = coef(ts_fit)[1], slope = coef(ts_fit)[2], color = '#f7aa58', size = 1) +
geom_abline(intercept = coef(lad_fit)[1], slope = coef(lad_fit)[2], color = '#72bcd5', size = 1)
Rather than adding a separate geom for each model, I would create a dataframe including the intercept and slope for all models. Then you can pass this to a single geom_abline() and map color to the different models.
Note, I don't have {mblm} or {quantreg} installed, so I ran lm() on different subsets of mtcars as an approximation.
library(tidyverse)
# create dataframe with model coefficients
models <- data.frame(
lm = coef(lm(mpg ~ wt, data = mtcars[1:20,])),
ts = coef(lm(mpg ~ wt, data = mtcars[7:26,])),
lad = coef(lm(mpg ~ wt, data = mtcars[11:32,]))
) %>%
t() %>%
as_tibble(rownames = "model") %>%
rename_with(~ c("model", "intercept", "slope"))
models
# # A tibble: 3 x 3
# model intercept slope
# <chr> <dbl> <dbl>
# 1 lm 38.5 -5.41
# 2 ts 38.9 -5.59
# 3 lad 37.6 -5.41
# specify ggplot, passing `mtcars` to `geom_point()` and `models` to `geom_abline()`
ggplot() +
labs(subtitle = "Simple Linear Regressions") +
geom_point(data = mtcars, aes(wt, mpg)) +
geom_abline(
data = models,
aes(intercept = intercept, slope = slope, color = model),
size = 1
)
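If you want the legend to keep the colours from the question, one option is to add scale_color_manual() on top of the approach above. This is only a minimal sketch: coef_table() is a hypothetical helper (it does not come from any package), and, as above, the fits are lm() stand-ins rather than real mblm or quantreg models. Any fitted object whose coef() returns the intercept first and the slope second should slot in the same way.

```r
library(ggplot2)

# Hypothetical helper: one row of intercept/slope per fitted model.
# Assumes coef(fit) returns c(intercept, slope) in that order.
coef_table <- function(fits) {
  data.frame(
    model     = names(fits),
    intercept = vapply(fits, function(f) unname(coef(f)[1]), numeric(1)),
    slope     = vapply(fits, function(f) unname(coef(f)[2]), numeric(1)),
    row.names = NULL
  )
}

# Stand-in fits so the sketch runs without mblm or quantreg installed.
fits <- list(
  "OLS"      = lm(mpg ~ wt, data = mtcars),
  "Subset A" = lm(mpg ~ wt, data = mtcars[1:20, ]),
  "Subset B" = lm(mpg ~ wt, data = mtcars[11:32, ])
)
models <- coef_table(fits)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_abline(
    data = models,
    aes(intercept = intercept, slope = slope, color = model)
  ) +
  # Named vector: each legend entry gets the colour chosen for it.
  scale_color_manual(
    name   = "Estimator",
    values = c("OLS" = "#376795", "Subset A" = "#f7aa58", "Subset B" = "#72bcd5")
  )
p
```

With the real packages installed, replacing the list entries with lm(), mblm() and rq() fits should be the only change needed.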
Related
I have grouped Area values, for each of which I can compute and plot regressions:
set.seed(123)
df <- data.frame(
Group = c(rep("A",8), rep("B",10), rep("C",7)),
Area = c(1,3,2,4,3,5,7,9, rnorm(10), sample(7)),
x = c(1:8,1:10,1:7)
)
library(ggplot2)
ggplot(df,
aes(x = x, y = Area, group = factor(Group))) +
geom_smooth(method = "lm", se = FALSE)
But what I'm looking for is how to compute and plot what could be called a 'grand' regression for all Area groups. Is this possible and how would it be possible?
EDIT:
My guess is that it's not enough to simply disregard the group variable by running a model over all Area and all x values and excluding the group variable. This would treat the different groups as irrelevant. In actual fact, each group represents a distribution in its own right. Consider each group as collecting the values of an independent event. What I need is a model that incorporates the distinction between the groups/events while at the same time summarizing over them.
I would be wary of the answers using stat_smooth/geom_smooth to plot a fitted line for the disaggregated values. This simply draws a best fit line through all of the data, ignoring how they are clustered.
As you say in your edit, what you need is a model that can account for the fact that you have an Area ~ x relationship in each group:
EDIT: My guess is that it's not enough to simply disregard the group variable by running a model over all Area and all x values and excluding the group variable. This would treat the different groups as irrelevant. In actual fact, each group represents a distribution in its own right. Consider each group as collecting the values of an independent event. What I need is a model that incorporates the distinction between the groups/events while at the same time summarizing over them.
Without knowing more about your design, my first recommendation would be a mixed-effects model (e.g., using lme4).
You can fit the model, accounting for the fact that you have unique relationships in each group:
library(lme4)
example_mod <- lmer(Area ~
# Fixed effects
1 + x +
# Random effects (intercept and slope vary by group)
(1 + x | Group),
data = df,
REML = TRUE,
control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 5e5)))
You can then extract the predicted values from this model to plot those, or calculate your own predicted values from the fixed-effects.
fitted(example_mod)
fixef(example_mod)
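To make the last step concrete, here is a rough sketch of drawing the fixed-effects line on top of the per-group fits. It rebuilds the question's data so it runs on its own, and it leaves lmer() at its default settings, which is my simplification. With only three groups the fit may well be reported as singular, so treat the plot as an illustration of the mechanics rather than a recommended analysis.

```r
library(lme4)
library(ggplot2)

set.seed(123)
df <- data.frame(
  Group = c(rep("A", 8), rep("B", 10), rep("C", 7)),
  Area  = c(1, 3, 2, 4, 3, 5, 7, 9, rnorm(10), sample(7)),
  x     = c(1:8, 1:10, 1:7)
)

# Random intercept and slope per group; a "singular fit" message is
# possible with so few groups and is not an error.
mod <- lmer(Area ~ 1 + x + (1 + x | Group), data = df)

# Named vector of population-level coefficients: "(Intercept)" and "x".
fe <- fixef(mod)

p <- ggplot(df, aes(x, Area)) +
  geom_point(aes(colour = Group)) +
  # One OLS line per group, as in the question.
  geom_smooth(aes(colour = Group), method = "lm", formula = y ~ x, se = FALSE) +
  # The "grand" line implied by the fixed effects.
  geom_abline(intercept = fe[["(Intercept)"]], slope = fe[["x"]])
p
```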
Use two geom_smooth() layers and set the grouping aesthetic only in the layer that should be fitted per group:
set.seed(123)
df <- data.frame(
Group = c(rep("A",8), rep("B",10), rep("C",7)),
Area = c(1,3,2,4,3,5,7,9, rnorm(10), sample(7)),
x = c(1:8,1:10,1:7)
)
library(ggplot2)
ggplot(df, aes(x = x, y = Area)) +
geom_smooth(aes(group = factor(Group)), method = "lm", se = FALSE) +
geom_smooth()
#> `geom_smooth()` using formula 'y ~ x'
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2022-06-29 by the reprex package (v2.0.1)
ggplot(df,
aes(x = x, y = Area)) +
geom_smooth(method = "lm", se = FALSE, colour="red") +
geom_smooth(method="lm", se = FALSE, aes(group=factor(Group)))
Edit: Since I've been asked to provide more details, here is what is going on behind the scenes when you run geom_smooth(method = "lm", aes(group = factor(Group))):
library(nlme)
fit1 <- lmList(Area ~ x|Group, data=df)
df$fit1 <- fitted(fit1)
ggplot(df, aes(x, fit1, colour=Group)) + geom_line()
When you add a second geom_smooth without the group factor, you are running a linear regression (method lm) for the whole data set. i.e.
fit2 <- lm(Area ~ x, data=df)
df$fit2 <- fitted(fit2)
ggplot(df, aes(x, fit2)) + geom_line()
I'm trying hard to add a regression line on a ggplot. I first tried with abline but I didn't manage to make it work. Then I tried this...
data = data.frame(x.plot=rep(seq(1,5),10),y.plot=rnorm(50))
ggplot(data,aes(x.plot,y.plot))+stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=data$y.plot~data$x.plot)
But it is not working either.
In general, to provide your own formula you should use arguments x and y that will correspond to values you provided in ggplot() - in this case x will be interpreted as x.plot and y as y.plot. You can find more information about smoothing methods and formula via the help page of function stat_smooth() as it is the default stat used by geom_smooth().
ggplot(data,aes(x.plot, y.plot)) +
stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm', formula= y~x)
If you are using the same x and y values that you supplied in the ggplot() call and need to plot the linear regression line then you don't need to use the formula inside geom_smooth(), just supply the method="lm".
ggplot(data,aes(x.plot, y.plot)) +
stat_summary(fun.data= mean_cl_normal) +
geom_smooth(method='lm')
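The same x/y convention applies when you want something other than a straight line. As an illustration only (the quadratic term is my choice, not something the question asked for), a polynomial fit is written in terms of x and y too, never in terms of the column names:

```r
library(ggplot2)

set.seed(1)
data <- data.frame(x.plot = rep(seq(1, 5), 10), y.plot = rnorm(50))

p <- ggplot(data, aes(x.plot, y.plot)) +
  geom_point() +
  # x and y refer to the aesthetics, i.e. x.plot and y.plot here.
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))
p
```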
As I just found out, if your model is a multiple linear regression, the solution above won't work.
You have to build the line manually from a data frame that contains the model's predicted values for your original data frame (in your case, data).
It would look like this:
# read dataset
df = mtcars
# create multiple linear model
lm_fit <- lm(mpg ~ cyl + hp, data=df)
summary(lm_fit)
# save predictions of the model in the new data frame
# together with variable you want to plot against
predicted_df <- data.frame(mpg_pred = predict(lm_fit, df), hp=df$hp)
# this is the predicted line of multiple linear regression
ggplot(data = df, aes(x = mpg, y = hp)) +
geom_point(color='blue') +
geom_line(color='red',data = predicted_df, aes(x=mpg_pred, y=hp))
# this is predicted line comparing only chosen variables
ggplot(data = df, aes(x = mpg, y = hp)) +
geom_point(color='blue') +
geom_smooth(method = "lm", se = FALSE)
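A variation on this, in case a single straight line is wanted: predict over a grid of one predictor while holding the other fixed. Holding cyl at its sample mean is purely my assumption for the sketch; any other reference value would shift the line up or down without changing its slope.

```r
library(ggplot2)

lm_fit <- lm(mpg ~ cyl + hp, data = mtcars)

# Grid over hp, with cyl held constant (assumption: its sample mean).
grid <- data.frame(
  hp  = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100),
  cyl = mean(mtcars$cyl)
)
grid$mpg_pred <- predict(lm_fit, newdata = grid)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue") +
  # Partial effect of hp from the multiple regression.
  geom_line(data = grid, aes(y = mpg_pred), color = "red")
p
```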
A simple and versatile solution is to draw the line from its slope and intercept with geom_abline(). Example usage with a scatterplot and an lm object:
library(tidyverse)
petal.lm <- lm(Petal.Length ~ Petal.Width, iris)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
geom_point() +
geom_abline(slope = coef(petal.lm)[["Petal.Width"]],
intercept = coef(petal.lm)[["(Intercept)"]])
coef is used to extract the coefficients of the formula provided to lm. If you have some other linear model object or line to plot, just plug in the slope and intercept values similarly.
I found this function on a blog
ggplotRegression <- function (fit) {
require(ggplot2)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}
Once you have loaded the function, you can simply call
ggplotRegression(fit)
You can also fit the model inline, e.g. ggplotRegression(lm(y ~ x + z + Q, data)). Note that only the first predictor is plotted on the x-axis.
Hope this helps.
If you want to fit other types of models, like a dose-response curve using logistic models, you also need to create more data points with predict() if you want a smoother regression line:
# fit: your fitted logistic regression curve
# Create a range of doses:
mm <- data.frame(DOSE = seq(0, max(data$DOSE), length.out = 100))
# Create a new data frame for ggplot using predict and your range of new doses:
fit.ggplot <- data.frame(y = predict(fit, newdata = mm), x = mm$DOSE)
ggplot(data = data, aes(x = log10(DOSE), y = log(viability))) +
geom_point() +
geom_line(data = fit.ggplot, aes(x = log10(x), y = log(y)))
Another way to use geom_line() to add a regression line is to use the broom package to get the fitted values, as shown here:
https://cmdlinetips.com/2022/06/add-regression-line-to-scatterplot-ggplot2/
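I have not reproduced the linked post, but the general idea can be sketched as follows, assuming broom is installed. augment() returns the model's data with a .fitted column added, which geom_line() can then draw. The mtcars model is just a stand-in.

```r
library(ggplot2)
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

# One row per observation, with the fitted values in `.fitted`.
fitted_df <- augment(fit)

p <- ggplot(fitted_df, aes(wt, mpg)) +
  geom_point() +
  geom_line(aes(y = .fitted), colour = "red")
p
```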
I would like to use geom_smooth to get a fitted line from a certain linear regression model.
It seems to me that the formula can only take x and y and not any additional parameter.
To show more clearly what I want:
library(dplyr)
library(ggplot2)
set.seed(35413)
df <- data.frame(pred = runif(100,10,100),
factor = sample(c("A","B"), 100, replace = TRUE)) %>%
mutate(
outcome = 100 + 10*pred +
ifelse(factor=="B", 200, 0) +
ifelse(factor=="B", 4, 0)*pred +
rnorm(100,0,60))
With
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
theme_bw()
I produce fitted lines that, due to the color=factor option, are basically the output of the linear model lm(outcome ~ pred*factor, df)
In some cases, however, I prefer the lines to be the output of a different model fit, like lm(outcome ~ pred + factor, df), for which I can use something like:
fit <- lm(outcome ~ pred+factor, df)
predval <- expand.grid(
pred = seq(
min(df$pred), max(df$pred), length.out = 1000),
factor = unique(df$factor)) %>%
mutate(outcome = predict(fit, newdata = .))
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point() +
geom_line(data = predval) +
theme_bw()
which results in parallel fitted lines, one per factor level (image not shown).
My question: is there a way to produce the latter graph using geom_smooth instead? I know there is a formula = option in geom_smooth, but I can't make something like formula = y ~ x + factor or formula = y ~ x + color (as I defined color = factor) work.
This is a very interesting question. Probably the main reason why geom_smooth is so "resistant" to allowing custom models of multiple variables is that it is limited to producing 2-D curves; consequently, its arguments are designed for handling two-dimensional data (i.e. formula = response variable ~ independent variable).
The trick to getting what you requested is using the mapping argument within geom_smooth, instead of formula. As you've probably seen from looking at the documentation, formula only allows you to specify the mathematical structure of the model (e.g. linear, quadratic, etc.). Conversely, the mapping argument allows you to directly specify new y-values - such as the output of a custom linear model that you can call using predict().
Note that, by default, inherit.aes is set to TRUE, so your plotted regressions will be coloured appropriately by your categorical variable. Here's the code:
# original plot
plot1 <- ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
ggtitle("outcome ~ pred") +
theme_bw()
# declare new model here
plm <- lm(formula = outcome ~ pred + factor, data=df)
# plot with lm for outcome ~ pred + factor
plot2 <-ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm", mapping=aes(y=predict(plm,df))) +
ggtitle("outcome ~ pred + factor") +
theme_bw()
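For comparison, since the predictions from the additive model already lie exactly on a straight line within each factor level, a plain geom_line() over the fitted values should draw the same lines without refitting anything. This is a sketch rather than a replacement for the answer above: the column name fitted_additive is my own, and the data is rebuilt with base R so the block runs on its own.

```r
library(ggplot2)

set.seed(35413)
df <- data.frame(
  pred   = runif(100, 10, 100),
  factor = sample(c("A", "B"), 100, replace = TRUE)
)
df$outcome <- 100 + 10 * df$pred +
  ifelse(df$factor == "B", 200, 0) +
  ifelse(df$factor == "B", 4, 0) * df$pred +
  rnorm(100, 0, 60)

# Additive model: common slope, separate intercepts.
plm <- lm(outcome ~ pred + factor, data = df)
df$fitted_additive <- predict(plm)

p <- ggplot(df, aes(x = pred, y = outcome, color = factor)) +
  geom_point() +
  # Colour also defines the groups, so one line is drawn per level.
  geom_line(aes(y = fitted_additive)) +
  theme_bw()
p
```

The trade-off is that geom_line() draws no confidence band. If one is needed, predict(plm, interval = "confidence") together with geom_ribbon() is one way to add it.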
I am trying to extract intercepts and slopes for 500 variables from a qplot. The code I am using:
qplot(gd, nd, data = test, colour = factor(ENT)) +
geom_smooth(method = "lm", se = FALSE)
Could someone help me extract the intercept and slope for each regression line (500 lines/variables), as plotted in the attached figure?
ggplot will draw the graph, but you extract the coefficients (Intercepts and Slopes) from the lm() objects. One way to do the latter is to use dplyr's group_by() and do() functions. See ?do
I'm using the mtcars data frame here.
library(ggplot2)
library(dplyr)
ggplot(mtcars, aes(mpg, disp, colour = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
mtcars %>%
group_by(cyl) %>%
do({
mod = lm(disp ~ mpg, data = .)
data.frame(Intercept = coef(mod)[1],
Slope = coef(mod)[2])
})
Source: local data frame [3 x 3]
Groups: cyl
cyl Intercept Slope
1 4 233.0674 -4.797961
2 6 125.1225 2.947487
3 8 560.8703 -13.759624
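do() has since been superseded in dplyr, so here is a base R sketch of the same idea that avoids it: split the data by group, fit one lm() per piece, and bind the coefficients into a matrix. The column names Intercept and Slope are just labels I chose to match the output above.

```r
# One data frame per cylinder count.
pieces <- split(mtcars, mtcars$cyl)

# Fit the same simple regression within each piece.
per_group <- lapply(pieces, function(d) coef(lm(disp ~ mpg, data = d)))

# Rows are named after the groups ("4", "6", "8").
coefs <- do.call(rbind, per_group)
colnames(coefs) <- c("Intercept", "Slope")
coefs
```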
How about using the lmList function, which is designed for computing linear regressions across multiple groups?
library("nlme")
coef(lmList(nd~gd|ENT , data = test))