plot lines using ggplot and fit a linear regression line - r

I have generated toy data as follows:
toy.df <- data.frame(ID = rep(paste("S",1:25, sep = "_") ,4) , x = rep(seq(1,5), 20), y = rnorm(100), color_code = rnorm(100)
,group = rep(letters[1:4] , 25) )
I would like to use ggplot to draw multiple lines as well as points in 4 facets, and fit a linear regression line to the group of lines in each facet.
toy.df %>% ggplot(aes(x = x, y = y, group = ID)) +
facet_wrap( . ~ group, scales = 'free', ncol = 2)+
geom_point() +
geom_line(aes(color = color_code)) +
geom_smooth(method = 'lm')
But it does not draw the lines across the x axis (1 to 5).
Do you have any idea how I can fix this?

You are using ID as the group, and within each facet every ID has only one observation. A linear regression cannot be fitted to a single observation, so geom_smooth() draws nothing. Removing group = ID works fine:
library(tidyverse)
toy.df %>%
ggplot(aes(x = x, y = y)) +
facet_wrap( . ~ group, scales = 'free', ncol = 2)+
geom_point() +
geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
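To see why grouping by ID breaks the fit, it can help to count the observations per ID within each facet. This is only a minimal sketch, assuming dplyr is installed; obs_per_line is a name introduced here for illustration, and set.seed() is added purely for reproducibility.

```r
library(dplyr)

set.seed(1)  # not in the question; added so the toy data is reproducible
toy.df <- data.frame(
  ID = rep(paste("S", 1:25, sep = "_"), 4),
  x = rep(seq(1, 5), 20),
  y = rnorm(100),
  color_code = rnorm(100),
  group = rep(letters[1:4], 25)
)

# One row per (facet, ID) combination, with the number of points it contains
obs_per_line <- toy.df %>%
  count(group, ID, name = "n_obs")

# Every combination holds a single point, so no per-ID line or fit is possible
range(obs_per_line$n_obs)
```

If each ID had several x values inside a facet, keeping group = ID in geom_line() only (and not in the top-level aes) would probably give per-ID lines plus one regression per facet.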

Related

How to adjust the position of regression equation on ggplot?

I would like to add the regression line and R^2 to my ggplot. I am fitting the regression line to different categories, and for each category I get a unique equation. I'd like to set the position of the equation for each category manually, i.e., find the maximum value of y for each group and print the equation at ymax + 1.
Here is my code:
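library(ggpmisc) # stat_poly_eq(); also attaches ggplot2
library(dplyr)   # %>%, group_by(), mutate(), do()
library(broom)   # tidy()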
df <- data.frame(x = c(1:100))
df$y <- 20 * c(0, 1) + 3 * df$x + rnorm(100, sd = 40)
df$group <- factor(rep(c("A", "B"), 50))
df <- df %>% group_by(group) %>% mutate(ymax = max(y))
my.formula <- y ~ x
df %>%
group_by(group) %>%
do(tidy(lm(y ~ x, data = .)))
p <- ggplot(data = df, aes(x = x, y = y, colour = group)) +
geom_smooth(method = "lm", se=FALSE, formula = my.formula) +
stat_poly_eq(formula = my.formula,
aes(x = x , y = ymax + 1, label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE) +
geom_point()
p
Any suggestion how to do this?
Also, is there any way to print only the slope of the equation (that is, remove the intercept from the plot)?
Thanks,
I'm pretty sure that adjusting stat_poly_eq() with the geom argument will get what you want. Doing so will center the equations, leaving the left half of each clipped, so we use hjust = 0 to left-adjust the equations. Finally, depending on your specific data, the equations may overlap each other, so we use the position argument to have ggplot attempt to separate them.
This adjusted call should get you started, I hope:
p <- ggplot(data = df, aes(x = x, y = y, colour = group)) +
geom_smooth(method = "lm", se=FALSE, formula = my.formula) +
stat_poly_eq(
formula = my.formula,
geom = "text", # or 'label'
hjust = 0, # left-adjust equations
position = position_dodge(), # in case equations now overlap
aes(x = x , y = ymax + 1, label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE) +
geom_point()
p
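For the follow-up about printing only the slope, one possible route that does not rely on ggpmisc is to compute the slopes per group and draw them with geom_text(). This is a sketch rather than the answer's method: slope_labels, label_x and label_y are names made up here, and the seed is an assumption added for reproducibility.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # assumption: any seed works, this only fixes the noise
df <- data.frame(x = 1:100)
df$y <- 20 * c(0, 1) + 3 * df$x + rnorm(100, sd = 40)
df$group <- factor(rep(c("A", "B"), 50))

# One row per group: fitted slope plus the position where the label should go
slope_labels <- df %>%
  group_by(group) %>%
  summarise(
    slope = unname(coef(lm(y ~ x))[2]),
    label_x = min(x),
    label_y = max(y) + 1,   # "ymax + 1", as requested in the question
    .groups = "drop"
  ) %>%
  mutate(label = sprintf("slope = %.2f", slope))

p_slope <- ggplot(df, aes(x = x, y = y, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  geom_text(data = slope_labels,
            aes(x = label_x, y = label_y, label = label),
            hjust = 0, show.legend = FALSE)
p_slope
```

Because the labels live in an ordinary data frame, their positions can be edited by hand before plotting if two of them end up too close together.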

R - Adding legend to ggplot graph for regression lines

I am doing a multiple linear regression in R and want to add a simple legend to a ggplot graph. The legend should show the points and fitted lines with their corresponding colors. So far it works fine (without a legend):
ggplot() +
geom_point(aes(x = training_set$R.D.Spend, y = training_set$Profit),
col = 'red') +
geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
col = 'blue') +
geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor_sig, newdata = training_set)),
col = 'green') +
ggtitle('Multiple Linear Regression (Training set)') +
xlab('R.D.Spend [k$]') +
ylab('Profit of Venture [k$]')
How can I add a legend here most easily?
I tried the solutions from similar questions, but did not succeed (add legend to ggplot2 | Add legend for multiple regression lines from different datasets to ggplot).
So, I appended my original model like this:
ggplot() +
geom_point(aes(x = training_set$R.D.Spend, y = training_set$Profit),
col = 'p1') +
geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
col = 'p2') +
geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor_sig, newdata = training_set)),
col = 'p3') +
scale_color_manual(
name='My lines',
values=c('blue', 'orangered', 'green')) +
ggtitle('Multiple Linear Regression (Training set)') +
xlab('R.D.Spend [k$]') +
ylab('Profit of Venture [k$]')
But here I get the error "Unknown colour name: p1", which makes some sense, as I do not define p1 anywhere. How can I make ggplot recognise my intended legend?
Move col into the aes and then you can set the color using scale_color_manual:
library(ggplot2)
set.seed(1)
x <- 1:30
y <- rnorm(30) + x
fit <- lm(y ~ x)
ggplot2::ggplot(data.frame(x, y)) +
geom_point(aes(x = x, y = y)) +
geom_line(aes(x = x, y = predict(fit), col = "Regression")) +
scale_color_manual(name = "My Lines",
values = c("blue"))
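The same idea extends to the points and to more than one fitted line. The sketch below uses simulated data and two stand-in models (fit_lin and fit_quad are hypothetical names, not the regressor objects from the question); each layer maps colour to a label inside aes(), and a named values vector ties each label to a colour.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = 1:30)
d$y <- d$x + rnorm(30)

# Two stand-in models; the question's regressor / regressor_sig would go here
fit_lin  <- lm(y ~ x, data = d)
fit_quad <- lm(y ~ poly(x, 2), data = d)
d$pred_lin  <- predict(fit_lin)
d$pred_quad <- predict(fit_quad)

p_legend <- ggplot(d, aes(x = x)) +
  geom_point(aes(y = y, colour = "Observed")) +
  geom_line(aes(y = pred_lin, colour = "Linear fit")) +
  geom_line(aes(y = pred_quad, colour = "Quadratic fit")) +
  scale_colour_manual(
    name = "My lines",
    # Names must match the strings used inside aes() above
    values = c("Observed" = "red", "Linear fit" = "blue", "Quadratic fit" = "green")
  )
p_legend
```

Naming the values avoids relying on the alphabetical order in which ggplot2 sorts the legend entries.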

Can I mimic facet_wrap() with 5 completely separate ggplots?

I like the neatness of using facet_wrap() or facet_grid() with ggplot since the plots are all made to be the same size and are fitted row and column wise automatically.
I have a data frame and I am experimenting with various transformations and their impact on fit, as measured by R-squared:
dm1 <- lm(price ~ x, data = diamonds)
dm1R2 <- summary(dm1)$r.squared #0.78
dm2 <- lm(log(price) ~ x, data = diamonds)
dm2R2 <- summary(dm2)$r.squared # 0.9177831
dm3 <- lm(log(price) ~ x^2, data = diamonds)
dm3R2 <- summary(dm3)$r.squared # also 0.9177831. Aside, why?
ggplot(diamonds, aes(x = x, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3.5, y = 10000, label = paste0('R-Squared: ', round(dm1R2, 3)))
ggplot(diamonds, aes(x = x, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 9, label = paste0('R-Squared: ', round(dm2R2, 3)))
ggplot(diamonds, aes(x = x^2, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 20, label = paste0('R-Squared: ', round(dm3R2, 3)))
This produces 3 completely separate plots. Within an Rmd file they will appear one after the other.
Is there a way to add them to a grid like when using facet_wrap?
You can use ggplot2's built-in faceting if you generate a "long" data frame from the regression model objects. The model object returned by lm includes the data used to fit the model, so we can extract the data and the r-squared for each model, stack them into a single data frame, and generate a faceted plot.
The disadvantage of this approach is that you lose the ability to easily set separate x-axis and y-axis titles for each panel, which is important, because the x and y values have different transformations in different panels. In an effort to mitigate that problem, I've used the model formulas as the facet labels.
Also, the reason you got the same r-squared for the models specified by log(price) ~ x and log(price) ~ x^2 is that R treats them as the same model. To tell R that you literally mean x^2 in a model formula, you need to wrap it in the I() function, making the formula log(price) ~ I(x^2). You could also do log(price) ~ poly(x, 2, raw=TRUE).
library(tidyverse)
theme_set(theme_bw(base_size=14))
# Generate a small subset of the diamonds data frame
set.seed(2)
dsub = diamonds[sample(1:nrow(diamonds), 2000), ]
dm1 <- lm(price ~ x, data = dsub)
dm2 <- lm(log(price) ~ x, data = dsub)
dm3 <- lm(log(price) ~ I(x^2), data = dsub)
# Create long data frame from the three model objects
dat = list(dm1, dm2, dm3) %>%
map_df(function(m) {
tibble(r2=summary(m)$r.squared,
form=as_label(formula(m))) %>%
cbind(m[["model"]] %>% set_names(c("price","x")))
}, .id="Model") %>%
mutate(form=factor(form, levels=unique(form)))
# Create data subset for geom_text
text.dat = dat %>% group_by(form) %>%
summarise(x = quantile(x, 1),
price = quantile(price, 0.05),
r2=r2[1])
dat %>%
ggplot(aes(x, price)) +
geom_point(alpha=0.3, colour="red") +
geom_smooth(method="lm") +
geom_text(data=text.dat, parse=TRUE,
aes(label=paste0("r^2 ==", round(r2, 2))),
hjust=1, size=3.5, colour="grey30") +
facet_wrap(~ form, scales="free")
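The point about x^2 inside a model formula can be checked directly by comparing the coefficient names of the fitted models. A small sketch on the full diamonds data (fit_plain, fit_caret and fit_I are names chosen here):

```r
library(ggplot2)  # only needed for the diamonds data set

fit_plain <- lm(log(price) ~ x, data = diamonds)
fit_caret <- lm(log(price) ~ x^2, data = diamonds)     # ^ is a formula operator here
fit_I     <- lm(log(price) ~ I(x^2), data = diamonds)  # I() protects the arithmetic

names(coef(fit_caret))  # "(Intercept)" "x"
names(coef(fit_I))      # "(Intercept)" "I(x^2)"

# The first two fits are the same model, hence the identical R-squared
all.equal(coef(fit_plain), coef(fit_caret))
```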
ggarrange from the ggpubr package can do this:
p1 = ggplot(diamonds, aes(x = x, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3.5, y = 10000, label = paste0('R-Squared: ', round(dm1R2, 3)))
p2 = ggplot(diamonds, aes(x = x, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 9, label = paste0('R-Squared: ', round(dm2R2, 3)))
p3 = ggplot(diamonds, aes(x = x^2, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 20, label = paste0('R-Squared: ', round(dm3R2, 3)))
ggpubr::ggarrange(p1, p2, p3, ncol = 2, nrow = 2, align = "hv")
Other packages that have been suggested in the comments like cowplot and patchwork also offer good options for this.
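As a rough illustration of the patchwork route, plots are combined with + and arranged with plot_layout(). This sketch assumes the patchwork package is installed and uses a 2000-row sample only to keep drawing fast; combined is a name introduced here.

```r
library(ggplot2)
library(patchwork)

set.seed(2)
dsub <- diamonds[sample(nrow(diamonds), 2000), ]

p1 <- ggplot(dsub, aes(x, price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p2 <- ggplot(dsub, aes(x, log(price))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p3 <- ggplot(dsub, aes(x^2, log(price))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Three independent plots laid out on a two-column grid
combined <- p1 + p2 + p3 + plot_layout(ncol = 2)
combined
```

Unlike faceting, each panel keeps its own axis titles, which matters here because the panels show different transformations.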

Plotting categorical variables OLS in R

I am trying to produce a plot with age in the x-axis, expected serum urate in the y-axis and lines for male/white, female/white, male/black, female/black, using the estimates from the lm() function.
goutdata <- read.table("gout.txt", header = TRUE)
goutdata$sex <- factor(goutdata$sex,levels = c("M", "F"))
goutdata$race <- as.factor(goutdata$race)
fm <- lm(su~sex+race+age, data = goutdata)
summary(fm)
ggplot(fm, aes(x= age, y = su))+xlim(30, 70) + geom_jitter(aes(age,su, colour=age)) + facet_grid(sex~race)
I have tried using the facet_wrap() function with ggplot to address the categorical variables, but I want to create just one plot. I was trying a combination of geom_jitter and geom_smooth, but I am not sure how to use geom_smooth() with categorical variables. Any help would be appreciated.
Data: https://github.com/gdlc/STT465/blob/master/gout.txt
We can use interaction() to create groupings on the fly and perform the OLS right within geom_smooth(). Here they are grouped on one plot:
ggplot(goutdata, aes(age, su, color = interaction(sex, race))) +
geom_smooth(formula = y~x, method="lm") +
geom_point() +
hrbrthemes::theme_ipsum_rc(grid="XY")
and, spread out into facets:
ggplot(goutdata, aes(age, su, color = interaction(sex, race))) +
geom_smooth(formula = y~x, method="lm") +
geom_point() +
facet_wrap(sex~race) +
hrbrthemes::theme_ipsum_rc(grid="XY")
You've now got a partial answer to #1 of https://github.com/gdlc/STT465/blob/master/HW_4_OLS.md :-)
You could probably use geom_smooth() to show regression lines?
dat <- read.table("https://raw.githubusercontent.com/gdlc/STT465/master/gout.txt",
header = T, stringsAsFactors = F)
library(tidyverse)
dat %>%
dplyr::mutate(sex = ifelse(sex == "M", "Male", "Female"),
race = ifelse(race == "W", "Caucasian", "African-American"),
group = paste(race, sex, sep = ", ")
) %>%
ggplot(aes(x = age, y = su, colour = group)) +
geom_smooth(method = "lm", se = F, show.legend = F) +
geom_point(show.legend = F, position = "jitter", alpha = .5, pch = 16) +
facet_wrap(~group) +
ggthemes::theme_few() +
labs(x = "Age", y = "Expected serum urate level")
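Note that geom_smooth() fits a separate regression within each group, which is not the same as the additive model su ~ sex + race + age from the question. If the lines should come from the lm() estimates themselves, one option is to predict over a small grid and draw the result with geom_line(). The sketch below uses simulated data with the same column names (gout_sim, its values and its "B"/"W" race codes are made up); with the real file, goutdata would take its place.

```r
library(ggplot2)

set.seed(10)
# Simulated stand-in for the gout data: same column names, invented values
gout_sim <- data.frame(
  sex  = factor(sample(c("M", "F"), 200, replace = TRUE), levels = c("M", "F")),
  race = factor(sample(c("B", "W"), 200, replace = TRUE)),
  age  = round(runif(200, 30, 70))
)
gout_sim$su <- 5 + 0.03 * gout_sim$age - 1 * (gout_sim$sex == "F") +
  0.5 * (gout_sim$race == "B") + rnorm(200, sd = 1)

fm <- lm(su ~ sex + race + age, data = gout_sim)

# Two ages per sex/race combination are enough to draw a straight line
grid <- expand.grid(age = c(30, 70),
                    sex = levels(gout_sim$sex),
                    race = levels(gout_sim$race))
grid$su_hat <- predict(fm, newdata = grid)

p_fit <- ggplot(gout_sim, aes(age, su, colour = interaction(sex, race))) +
  geom_jitter(alpha = 0.4) +
  geom_line(data = grid, aes(y = su_hat)) +
  labs(x = "Age", y = "Expected serum urate", colour = "Sex.Race")
p_fit
```

Because the model has no interaction terms, the four lines are parallel and differ only in their intercepts.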

How to put variables in legend in ggplot2

I want to get the following plot.
So, how would I put a variable, e.g. cov(x, y), as a string in the legend using ggplot?
I would recommend calculating the covariance in a separate data frame, and customizing the color scale using the values in the covariance data frame.
Sample data:
library(dplyr)
library(ggplot2)
set.seed(999)
d <- data.frame(
x = runif(60, 0, 100),
z = rep(c(0, 1), each = 30)
) %>%
mutate(
y = x + 50 * z + rnorm(60, sd = 50),
z = factor(z)
)
Here is the basic plot, with a separate color for each value of z:
ggplot(d, aes(x = x, y = y, color = z)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
Now create a smaller data frame that contains covariance values:
cov_df <- d %>%
group_by(z) %>%
summarise(covar = round(cov(x, y)))
Extract the covariance values and store as a character vector:
legend_text <- as.character(pull(cov_df, covar))
Control the color scale to achieve your desired outcome:
ggplot(d, aes(x = x, y = y, color = z)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
scale_color_discrete(
"Covariance",
labels = legend_text
)
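A possible refinement is to name the label vector by the levels of z, so the match between legend entry and covariance does not depend on row order, and to spell out what the number means. This is a sketch under the same sample data; legend_labels and p_cov are names introduced here.

```r
library(dplyr)
library(ggplot2)

set.seed(999)
d <- data.frame(
  x = runif(60, 0, 100),
  z = rep(c(0, 1), each = 30)
) %>%
  mutate(
    y = x + 50 * z + rnorm(60, sd = 50),
    z = factor(z)
  )

cov_df <- d %>%
  group_by(z) %>%
  summarise(covar = round(cov(x, y)))

# Named character vector: names are the levels of z, values are the legend text
legend_labels <- setNames(
  paste0("z = ", cov_df$z, ": cov(x, y) = ", cov_df$covar),
  cov_df$z
)

p_cov <- ggplot(d, aes(x = x, y = y, color = z)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_color_discrete("Covariance", labels = legend_labels)
p_cov
```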
