I have grouped Area values, for each of which I can compute and plot regressions:
set.seed(123)
df <- data.frame(
Group = c(rep("A",8), rep("B",10), rep("C",7)),
Area = c(1,3,2,4,3,5,7,9, rnorm(10), sample(7)),
x = c(1:8,1:10,1:7)
)
library(ggplot2)
ggplot(df,
aes(x = x, y = Area, group = factor(Group))) +
geom_smooth(method = "lm", se = FALSE)
But what I'm looking for is how to compute and plot what could be called a 'grand' regression across all Area groups. Is this possible, and if so, how?
EDIT:
My guess is that it's not enough to simply disregard the group variable by running a model over all Area and all x values and excluding the group variable. This would treat the different groups as irrelevant. In actual fact each group represents a distribution in its own right. Consider each group as collecting the values of an independent event. What I need is a model that incorporates the distinction between the groups/events while at the same time summarizing over them.
I would be wary of the answers using stat_smooth/geom_smooth to plot a fitted line for the disaggregated values. This simply draws a best fit line through all of the data, ignoring how they are clustered.
As you say in your edit, what you need is a model that can account for the fact that you have an Area ~ x relationship in each group.
Without knowing more about your design, my first recommendation would be a mixed-effects model (e.g., using lme4).
You can fit the model, accounting for the fact that you have unique relationships in each group:
library(lme4)
example_mod <- lmer(Area ~
                      # Fixed effects
                      1 + x +
                      # Random effects: intercept and slope of x vary by Group
                      (1 + x | Group),
                    data = df,
                    REML = TRUE,
                    control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 5e5)))
You can then extract the predicted values from this model to plot those, or calculate your own predicted values from the fixed-effects.
fitted(example_mod)
fixef(example_mod)
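To make the step above concrete, here is a minimal sketch of how the group-level and population-level predictions could be plotted together. It assumes lme4 and ggplot2 are installed and rebuilds the question's data; the column names group_fit and grand_fit are made up for illustration. With only three groups the random-slope model may well report a singular fit, so treat the output as a demonstration of the mechanics rather than a model recommendation.

```r
library(lme4)
library(ggplot2)

# Rebuild the example data from the question
set.seed(123)
df <- data.frame(
  Group = c(rep("A", 8), rep("B", 10), rep("C", 7)),
  Area  = c(1, 3, 2, 4, 3, 5, 7, 9, rnorm(10), sample(7)),
  x     = c(1:8, 1:10, 1:7)
)

# Random intercept and slope per group (may warn about a singular fit)
mod <- lmer(Area ~ 1 + x + (1 + x | Group), data = df)

# Group-specific predictions (fixed + random effects)
df$group_fit <- fitted(mod)
# Population-level predictions (fixed effects only)
df$grand_fit <- predict(mod, re.form = NA)

ggplot(df, aes(x = x, y = Area)) +
  geom_point(aes(colour = Group)) +
  geom_line(aes(y = group_fit, colour = Group)) +
  geom_line(aes(y = grand_fit), colour = "black", linetype = "dashed")
```

The dashed black line is the 'grand' regression implied by the fixed effects; the coloured lines are the per-group fits after shrinkage towards it.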
Use two geom_smooth layers and put the grouping aesthetic only in the layer that should be fitted per group:
set.seed(123)
df <- data.frame(
Group = c(rep("A",8), rep("B",10), rep("C",7)),
Area = c(1,3,2,4,3,5,7,9, rnorm(10), sample(7)),
x = c(1:8,1:10,1:7)
)
library(ggplot2)
ggplot(df, aes(x = x, y = Area)) +
geom_smooth(aes(group = factor(Group)), method = "lm", se = FALSE) +
geom_smooth()
#> `geom_smooth()` using formula 'y ~ x'
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2022-06-29 by the reprex package (v2.0.1)
ggplot(df,
aes(x = x, y = Area)) +
geom_smooth(method = "lm", se = FALSE, colour="red") +
geom_smooth(method="lm", se = FALSE, aes(group=factor(Group)))
Edit: Since I was asked to provide more details, here is what is going on behind the scenes when you run geom_smooth(aes(group = factor(Group)), method = "lm"):
library(nlme)
fit1 <- lmList(Area ~ x|Group, data=df)
df$fit1 <- fitted(fit1)
ggplot(df, aes(x, fit1, colour=Group)) + geom_line()
When you add a second geom_smooth without the group factor, you are running a linear regression (method lm) for the whole data set. i.e.
fit2 <- lm(Area ~ x, data=df)
df$fit2 <- fitted(fit2)
ggplot(df, aes(x, fit2)) + geom_line()
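To see numerically what the two layers are doing, the following sketch compares the per-group coefficients with the pooled ones using base R only. It rebuilds the question's data; per_group and pooled are illustrative names.

```r
# Rebuild the example data from the question
set.seed(123)
df <- data.frame(
  Group = c(rep("A", 8), rep("B", 10), rep("C", 7)),
  Area  = c(1, 3, 2, 4, 3, 5, 7, 9, rnorm(10), sample(7)),
  x     = c(1:8, 1:10, 1:7)
)

# One lm per group: this is what the grouped geom_smooth draws
per_group <- t(sapply(split(df, df$Group),
                      function(d) coef(lm(Area ~ x, data = d))))

# One lm over all rows, ignoring Group: the ungrouped geom_smooth
pooled <- coef(lm(Area ~ x, data = df))

print(per_group)
print(pooled)
```

The pooled fit is a single line through all points; it is not an average of the group lines, which is why it ignores the clustering.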
I am following the last set of code in https://drsimonj.svbtle.com/plot-some-variables-against-many-others, and have modified the code for my data.
In this code:
t3 %>%
gather(-Border, key = "var", value = "value") %>%
ggplot(aes(x = value, y = Border)) +
geom_point() +
stat_smooth() +
facet_wrap(~ var, scales = "free") +
theme_bw()
I get this error message:
Computation failed in stat_smooth():
x has insufficient unique values to support 10 knots: reduce k.
The code runs without the stat_smooth() command but I need the smooth line.
With the exception of one var with 20 values, every other var has between 5 and 6 unique values. How do I reduce k?
Is a k of 5 reasonable?
The sample size is 1,000.
Thanks
Obviously, we don't have your data, but we can generate some data that reproduces your problem:
library(ggplot2)
set.seed(1)
df <- data.frame(x = rep(1:10, 150), y = rnorm(1500),
group = c(rep("A", 1490), rep("B", 10)))[-1500,]
ggplot(df, aes(x, y, color = group)) + stat_smooth()
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> Warning: Computation failed in `stat_smooth()`
#> Caused by error in `smooth.construct.cr.smooth.spec()`:
#> ! x has insufficient unique values to support 10 knots: reduce k.
The reason for this error is that if any of your groups has 1,000 or more data points, stat_smooth will by default use a generalized additive model on all your groups. From the docs:
For method = NULL the smoothing method is chosen based on the size of the largest group (across all panels). stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")
This means that if one or more of the groups is very small, stat_smooth will end up running a gam regression with these default settings, which will fail due to the number of points being inadequate for the specified model.
We can fix this by specifying method = "loess" in stat_smooth:
ggplot(df, aes(x, y, color = group)) + stat_smooth(method = "loess")
#> `geom_smooth()` using formula = 'y ~ x'
Created on 2022-11-14 with reprex v2.0.2
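If you would rather keep the GAM and reduce k, as the question asks, one option is to pass the formula explicitly. This is a sketch on the simulated data above, assuming mgcv is available (it ships with standard R installations); k = 5 is only an example value and must not exceed the number of unique x values in the smallest group.

```r
library(ggplot2)

# Same simulated data as above: group B has only 9 rows
set.seed(1)
df <- data.frame(x = rep(1:10, 150), y = rnorm(1500),
                 group = c(rep("A", 1490), rep("B", 10)))[-1500, ]

# Spell out the default GAM formula, but with a smaller basis dimension k
p <- ggplot(df, aes(x, y, color = group)) +
  stat_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 5))
p
```

Whether a particular k is reasonable depends on how wiggly you expect the relationship to be; with 5 or 6 unique x values per variable, a straight line (method = "lm") may be just as defensible.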
I'm trying to plot different simple linear regression estimates on the same coordinate plane to understand something of the differences between the methods. But my question is about adding these lines in R code, not about the statistics behind the different lines.
Here I'm using the mtcars dataset, and I'm using the mblm and quantreg packages to come up with different regression equations or, more specifically, the slope and intercept parameters for different simple linear regression estimates.
The OLS estimate I add using the geom_smooth() function and specifying the method argument. I could add the slope and intercept using geom_abline() after creating a linear model object; that's another option.
For the Theil-Sen and least absolute deviations (median regression) estimates, I first create a model for each using the respective packages. Then I add the slope and intercept using geom_abline().
So now I've added the lines manually. But how can I create a key or legend in ggplot to show these different lines? ggplot() adds the key automatically when geom_smooth() is separated into different groups. But I don't think it adds a legend for geom_abline(). And anyway my plot uses a mixture of both. Any ideas? I've never had to add my own key in this way.
library(ggplot2)
library(mblm)
ts_fit <- mblm(mpg ~ wt, data = mtcars)
library(quantreg)
lad_fit <- rq(mpg ~ wt, data = mtcars)
ggplot(mtcars, aes(x = wt, y = mpg)) +
labs(subtitle = "Simple Linear Regressions") +
geom_point() +
geom_smooth(method = 'lm', se = FALSE, color = '#376795') +
geom_abline(intercept = coef(ts_fit)[1], slope = coef(ts_fit)[2], color = '#f7aa58', size = 1) +
geom_abline(intercept = coef(lad_fit)[1], slope = coef(lad_fit)[2], color = '#72bcd5', size = 1)
Rather than adding a separate geom for each model, I would create a dataframe including the intercept and slope for all models. Then you can pass this to a single geom_abline() and map color to the different models.
Note, I don't have {mblm} or {quantreg} installed, so I ran lm() on different subsets of mtcars as an approximation.
library(tidyverse)
# create dataframe with model coefficients
models <- data.frame(
lm = coef(lm(mpg ~ wt, data = mtcars[1:20,])),
ts = coef(lm(mpg ~ wt, data = mtcars[7:26,])),
lad = coef(lm(mpg ~ wt, data = mtcars[11:32,]))
) %>%
t() %>%
as_tibble(rownames = "model") %>%
rename_with(~ c("model", "intercept", "slope"))
models
# # A tibble: 3 x 3
# model intercept slope
# <chr> <dbl> <dbl>
# 1 lm 38.5 -5.41
# 2 ts 38.9 -5.59
# 3 lad 37.6 -5.41
# specify ggplot, passing `mtcars` to `geom_point()` and `models` to `geom_abline()`
ggplot() +
labs(subtitle = "Simple Linear Regressions") +
geom_point(data = mtcars, aes(wt, mpg)) +
geom_abline(
data = models,
aes(intercept = intercept, slope = slope, color = model),
size = 1
)
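For readers who do have the packages from the question installed, the same idea can be sketched with the actual fits instead of the lm() stand-ins. This assumes mblm and quantreg are available; the list name fits and the legend labels are chosen for illustration.

```r
library(ggplot2)
library(mblm)
library(quantreg)

# Fit the three estimators from the question
fits <- list(
  "OLS"       = lm(mpg ~ wt, data = mtcars),
  "Theil-Sen" = mblm(mpg ~ wt, mtcars),       # data frame passed positionally
  "LAD"       = rq(mpg ~ wt, data = mtcars)   # tau = 0.5 by default (median regression)
)

# One row per model: intercept is the first coefficient, slope the second
models <- data.frame(
  model     = names(fits),
  intercept = sapply(fits, function(f) coef(f)[[1]]),
  slope     = sapply(fits, function(f) coef(f)[[2]])
)

# Mapping colour inside aes() is what produces the legend
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_abline(data = models,
              aes(intercept = intercept, slope = slope, colour = model)) +
  labs(subtitle = "Simple Linear Regressions", colour = "Estimator")
```

Because all three lines go through one geom_abline() layer with colour mapped in aes(), ggplot builds the key on its own; specific colours could then be set with scale_colour_manual().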
I'm trying hard to add a regression line on a ggplot. I first tried with abline but I didn't manage to make it work. Then I tried this...
data = data.frame(x.plot=rep(seq(1,5),10),y.plot=rnorm(50))
ggplot(data,aes(x.plot,y.plot))+stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=data$y.plot~data$x.plot)
But it is not working either.
In general, to provide your own formula you should use arguments x and y that will correspond to values you provided in ggplot() - in this case x will be interpreted as x.plot and y as y.plot. You can find more information about smoothing methods and formula via the help page of function stat_smooth() as it is the default stat used by geom_smooth().
ggplot(data,aes(x.plot, y.plot)) +
stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm', formula= y~x)
If you are using the same x and y values that you supplied in the ggplot() call and need to plot the linear regression line then you don't need to use the formula inside geom_smooth(), just supply the method="lm".
ggplot(data,aes(x.plot, y.plot)) +
stat_summary(fun.data= mean_cl_normal) +
geom_smooth(method='lm')
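As a small illustration of the point about x and y in the formula, here is a sketch that fits a quadratic instead of a straight line. The data are simulated as in the question (with a seed added), and geom_point() replaces stat_summary() only to avoid the extra Hmisc dependency of mean_cl_normal.

```r
library(ggplot2)

set.seed(42)
data <- data.frame(x.plot = rep(seq(1, 5), 10), y.plot = rnorm(50))

# In the formula, `x` and `y` refer to the aesthetics,
# not to the column names x.plot and y.plot
p <- ggplot(data, aes(x.plot, y.plot)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))
p
```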
As I just figured, in case you have a model fitted on multiple linear regression, the above mentioned solution won't work.
You have to create your line manually as a dataframe that contains predicted values for your original dataframe (in your case data).
It would look like this:
# read dataset
df = mtcars
# create multiple linear model
lm_fit <- lm(mpg ~ cyl + hp, data=df)
summary(lm_fit)
# save predictions of the model in the new data frame
# together with variable you want to plot against
predicted_df <- data.frame(mpg_pred = predict(lm_fit, df), hp=df$hp)
# this is the predicted line of multiple linear regression
ggplot(data = df, aes(x = mpg, y = hp)) +
geom_point(color='blue') +
geom_line(color='red',data = predicted_df, aes(x=mpg_pred, y=hp))
# this is predicted line comparing only chosen variables
ggplot(data = df, aes(x = mpg, y = hp)) +
geom_point(color='blue') +
geom_smooth(method = "lm", se = FALSE)
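A variation on the approach above, offered as a sketch rather than the answer's own method: predict over a grid of hp values while holding cyl fixed, which gives a clean line for one predictor of the multiple regression. The fixed value cyl = 6 and the name grid are arbitrary choices for illustration.

```r
library(ggplot2)

# Multiple linear model from the answer above
lm_fit <- lm(mpg ~ cyl + hp, data = mtcars)

# Grid over hp with cyl held constant (6 is an arbitrary choice)
grid <- data.frame(
  hp  = seq(min(mtcars$hp), max(mtcars$hp), length.out = 50),
  cyl = 6
)
grid$mpg_pred <- predict(lm_fit, newdata = grid)

# Observed points plus the model's prediction for 6-cylinder cars
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(colour = "blue") +
  geom_line(data = grid, aes(y = mpg_pred), colour = "red")
```

Predicting on the original rows instead produces a jagged line, because neighbouring hp values can belong to cars with different cyl.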
The simple and versatile solution is to draw a line using slope and intercept from geom_abline. Example usage with a scatterplot and lm object:
library(tidyverse)
petal.lm <- lm(Petal.Length ~ Petal.Width, iris)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
geom_point() +
geom_abline(slope = coef(petal.lm)[["Petal.Width"]],
intercept = coef(petal.lm)[["(Intercept)"]])
coef is used to extract the coefficients of the formula provided to lm. If you have some other linear model object or line to plot, just plug in the slope and intercept values similarly.
I found this function on a blog
ggplotRegression <- function (fit) {
  require(ggplot2)
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =", signif(fit$coef[[1]], 5),
                       " Slope =", signif(fit$coef[[2]], 5),
                       " P =", signif(summary(fit)$coef[2, 4], 5)))
}
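Once you have loaded the function, you can simply call
ggplotRegression(fit)
You can also fit the model inline, e.g. ggplotRegression(lm(y ~ x + z + Q, data = data)). Note that the function takes a fitted lm object rather than a formula, and that it plots the response against the first predictor only.
Hope this helps.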
If you want to fit other type of models, like a dose-response curve using logistic models you would also need to create more data points with the function predict if you want to have a smoother regression line:
# fit: your fitted logistic regression curve
#Create a range of doses:
mm <- data.frame(DOSE = seq(0, max(data$DOSE), length.out = 100))
#Create a new data frame for ggplot using predict and your range of new
#doses:
fit.ggplot=data.frame(y=predict(fit, newdata=mm),x=mm$DOSE)
ggplot(data=data,aes(x=log10(DOSE),y=log(viability)))+geom_point()+
geom_line(data=fit.ggplot,aes(x=log10(x),y=log(y)))
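Since the snippet above depends on data that is not shown, here is a self-contained sketch of the same idea on simulated data. Everything about the data (the doses, the binary dead outcome, the true coefficients) is invented for illustration, and a binomial glm() stands in for whatever dose-response model you actually fitted.

```r
library(ggplot2)

# Simulated dose-response data (hypothetical)
set.seed(1)
dat <- data.frame(DOSE = rep(c(0.1, 0.3, 1, 3, 10, 30), each = 20))
dat$dead <- rbinom(nrow(dat), 1, plogis(-1 + 1.5 * log10(dat$DOSE)))

# Logistic regression on log10 dose
fit <- glm(dead ~ log10(DOSE), family = binomial, data = dat)

# Dense grid of doses so the fitted curve looks smooth
grid <- data.frame(DOSE = 10^seq(-1, 1.5, length.out = 100))
grid$p <- predict(fit, newdata = grid, type = "response")

ggplot(dat, aes(x = DOSE, y = dead)) +
  geom_point(alpha = 0.3) +
  geom_line(data = grid, aes(y = p)) +
  scale_x_log10()
```

With only six distinct doses, joining the fitted values at the observed doses would give a visibly kinked line; the 100-point grid is what makes it smooth.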
Another way to use geom_line() to add regression line is to use broom package to get fitted values and use it as shown here
https://cmdlinetips.com/2022/06/add-regression-line-to-scatterplot-ggplot2/
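A minimal sketch of the broom approach, assuming the broom package is installed; it uses the iris model from the earlier answer rather than anything taken from the linked post.

```r
library(ggplot2)
library(broom)

fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# augment() returns the model data plus columns such as .fitted and .resid
aug <- augment(fit)

ggplot(aug, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point() +
  geom_line(aes(y = .fitted), colour = "red")
```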
I would like to use geom_smooth to get a fitted line from a certain linear regression model.
It seems to me that the formula can only take x and y and not any additional parameter.
To show more clearly what I want:
library(dplyr)
library(ggplot2)
set.seed(35413)
df <- data.frame(pred = runif(100,10,100),
factor = sample(c("A","B"), 100, replace = TRUE)) %>%
mutate(
outcome = 100 + 10*pred +
ifelse(factor=="B", 200, 0) +
ifelse(factor=="B", 4, 0)*pred +
rnorm(100,0,60))
With
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
theme_bw()
I produce fitted lines that, due to the color=factor option, are basically the output of the linear model lm(outcome ~ pred*factor, df)
In some cases, however, I prefer the lines to be the output of a different model fit, like lm(outcome ~ pred + factor, df), for which I can use something like:
fit <- lm(outcome ~ pred+factor, df)
predval <- expand.grid(
pred = seq(
min(df$pred), max(df$pred), length.out = 1000),
factor = unique(df$factor)) %>%
mutate(outcome = predict(fit, newdata = .))
ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point() +
geom_line(data = predval) +
theme_bw()
which results in :
My question: is there a way to produce the latter graph exploiting the geom_smooth instead? I know there is a formula = - option in geom_smooth but I can't make something like formula = y ~ x + factor or formula = y ~ x + color (as I defined color = factor) work.
This is a very interesting question. Probably the main reason why geom_smooth is so "resistant" to allowing custom models of multiple variables is that it is limited to producing 2-D curves; consequently, its arguments are designed for handling two-dimensional data (i.e. formula = response variable ~ independent variable).
The trick to getting what you requested is using the mapping argument within geom_smooth, instead of formula. As you've probably seen from looking at the documentation, formula only allows you to specify the mathematical structure of the model (e.g. linear, quadratic, etc.). Conversely, the mapping argument allows you to directly specify new y-values - such as the output of a custom linear model that you can call using predict().
Note that, by default, inherit.aes is set to TRUE, so your plotted regressions will be coloured appropriately by your categorical variable. Here's the code:
# original plot
plot1 <- ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm") +
ggtitle("outcome ~ pred") +
theme_bw()
# declare new model here
plm <- lm(formula = outcome ~ pred + factor, data=df)
# plot with lm for outcome ~ pred + factor
plot2 <-ggplot(df, aes(x=pred, y=outcome, color=factor)) +
geom_point(aes(color=factor)) +
geom_smooth(method = "lm", mapping=aes(y=predict(plm,df))) +
ggtitle("outcome ~ pred + factor") +
theme_bw()
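One caveat with the mapping approach: geom_smooth() is then fitting a line to values that are already exactly linear within each group, so its shaded band should not be read as the uncertainty of the additive model. If the band matters, a possible alternative (a sketch, not part of the answer above) is to take the intervals from predict() and draw them with geom_ribbon(). The data are rebuilt in base R from the question's recipe; df2 and ci are illustrative names.

```r
library(ggplot2)

# Rebuild the question's data without dplyr
set.seed(35413)
df <- data.frame(pred = runif(100, 10, 100),
                 factor = sample(c("A", "B"), 100, replace = TRUE))
df$outcome <- 100 + 10 * df$pred +
  ifelse(df$factor == "B", 200, 0) +
  ifelse(df$factor == "B", 4, 0) * df$pred +
  rnorm(100, 0, 60)

# Additive model: common slope, separate intercepts
plm <- lm(outcome ~ pred + factor, data = df)

# Fitted values with 95% confidence limits (columns fit, lwr, upr)
ci  <- predict(plm, interval = "confidence")
df2 <- cbind(df, ci)

ggplot(df2, aes(x = pred, y = outcome, colour = factor, fill = factor)) +
  geom_point() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, colour = NA) +
  geom_line(aes(y = fit)) +
  ggtitle("outcome ~ pred + factor") +
  theme_bw()
```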