Can I mimic facet_wrap() with 5 completely separate ggplots? - r

I like the neatness of using facet_wrap() or facet_grid() with ggplot since the plots are all made to be the same size and are fitted row and column wise automatically.
I have a data frame and I am experimenting with various transformations and their impact on fit, as measured by R-squared:
dm1 <- lm(price ~ x, data = diamonds)
dm1R2 <- summary(dm1)$r.squared #0.78
dm2 <- lm(log(price) ~ x, data = diamonds)
dm2R2 <- summary(dm2)$r.squared # 0.9177831
dm3 <- lm(log(price) ~ x^2, data = diamonds)
dm3R2 <- summary(dm3)$r.squared # also 0.9177831. Aside, why?
ggplot(diamonds, aes(x = x, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3.5, y = 10000, label = paste0('R-Squared: ', round(dm1R2, 3)))
ggplot(diamonds, aes(x = x, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 9, label = paste0('R-Squared: ', round(dm2R2, 3)))
ggplot(diamonds, aes(x = x^2, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 20, label = paste0('R-Squared: ', round(dm3R2, 3)))
This produces 3 completely separate plots. Within an Rmd file they will appear one after the other.
Is there a way to arrange them in a grid, as facet_wrap does?

You can use ggplot2's built-in faceting if you generate a "long" data frame from the regression model objects. The model object returned by lm includes the data used to fit the model, so we can extract the data and the r-squared for each model, stack them into a single data frame, and generate a faceted plot.
The disadvantage of this approach is that you lose the ability to easily set separate x-axis and y-axis titles for each panel, which is important, because the x and y values have different transformations in different panels. In an effort to mitigate that problem, I've used the model formulas as the facet labels.
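Also, the reason you got the same r-squared for the models specified by log(price) ~ x and log(price) ~ x^2 is that R treats them as the same model: inside a model formula, ^ is the formula crossing operator rather than arithmetic exponentiation, so x^2 reduces to x. To tell R that you literally mean x squared in a model formula, you need to wrap it in the I() function, making the formula log(price) ~ I(x^2). You could also use log(price) ~ poly(x, 2, raw=TRUE), although that fits both the linear and the quadratic term.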
library(tidyverse)
theme_set(theme_bw(base_size=14))
# Generate a small subset of the diamonds data frame
set.seed(2)
dsub = diamonds[sample(1:nrow(diamonds), 2000), ]
dm1 <- lm(price ~ x, data = dsub)
dm2 <- lm(log(price) ~ x, data = dsub)
dm3 <- lm(log(price) ~ I(x^2), data = dsub)
# Create long data frame from the three model objects
dat = list(dm1, dm2, dm3) %>%
map_df(function(m) {
tibble(r2=summary(m)$r.squared,
form=as_label(formula(m))) %>%
cbind(m[["model"]] %>% set_names(c("price","x")))
}, .id="Model") %>%
mutate(form=factor(form, levels=unique(form)))
# Create data subset for geom_text
text.dat = dat %>% group_by(form) %>%
summarise(x = quantile(x, 1),
price = quantile(price, 0.05),
r2=r2[1])
dat %>%
ggplot(aes(x, price)) +
geom_point(alpha=0.3, colour="red") +
geom_smooth(method="lm") +
geom_text(data=text.dat, parse=TRUE,
aes(label=paste0("r^2 ==", round(r2, 2))),
hjust=1, size=3.5, colour="grey30") +
facet_wrap(~ form, scales="free")
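The point about x^2 in a model formula can be checked directly. The following is a minimal sketch on simulated data (the data frame d and the model names are made up for illustration): it compares the terms R extracts from each formula and the resulting fits.

```r
set.seed(1)
# Simulated data with a genuinely quadratic relationship (illustrative only)
d <- data.frame(x = runif(50, 1, 10))
d$y <- 2 + 0.5 * d$x^2 + rnorm(50, sd = 1)

m_plain <- lm(y ~ x, data = d)       # linear in x
m_caret <- lm(y ~ x^2, data = d)     # ^ is formula crossing, so this is y ~ x again
m_I     <- lm(y ~ I(x^2), data = d)  # I() protects the arithmetic, so this is x squared

# Terms that R actually extracts from each formula
caret_terms <- attr(terms(y ~ x^2), "term.labels")
I_terms     <- attr(terms(y ~ I(x^2)), "term.labels")

# Identical coefficients for the first two models, a different fit for the third
same_fit <- isTRUE(all.equal(coef(m_plain), coef(m_caret)))
r2 <- c(plain = summary(m_plain)$r.squared,
        caret = summary(m_caret)$r.squared,
        I     = summary(m_I)$r.squared)
print(caret_terms)
print(I_terms)
print(same_fit)
print(r2)
```

Comparing the term labels is usually the quickest way to confirm what a formula really means before fitting anything.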

ggarrange from the ggpubr package can do this:
p1 = ggplot(diamonds, aes(x = x, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3.5, y = 10000, label = paste0('R-Squared: ', round(dm1R2, 3)))
p2 = ggplot(diamonds, aes(x = x, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 9, label = paste0('R-Squared: ', round(dm2R2, 3)))
p3 = ggplot(diamonds, aes(x = x^2, y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = F) +
geom_text(x = 3, y = 20, label = paste0('R-Squared: ', round(dm3R2, 3)))
ggpubr::ggarrange(p1, p2, p3, ncol = 2, nrow = 2, align = "hv")
Other packages that have been suggested in the comments like cowplot and patchwork also offer good options for this.
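As a rough sketch of the patchwork option mentioned above (assuming the patchwork package is installed; the thinned subset dsub is only there to keep the example fast and is not part of the original question), separate plots can be combined with + and laid out with plot_layout():

```r
library(ggplot2)
library(patchwork)

# Thin the data so the example renders quickly (illustrative choice)
dsub <- diamonds[seq(1, nrow(diamonds), by = 50), ]

p1 <- ggplot(dsub, aes(x = x, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p2 <- ggplot(dsub, aes(x = x, y = log(price))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p3 <- ggplot(dsub, aes(x = x^2, y = log(price))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Arrange the three independent plots in a two-column grid
combined <- p1 + p2 + p3 + plot_layout(ncol = 2)
combined
```

Unlike facet_wrap, each panel keeps its own axis titles, which suits panels that show different transformations.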

Related

How to adjust the position of regression equation on ggplot?

I would like to add the regression line and R^2 to my ggplot. I am fitting a regression line to each category, so each category gets its own equation. I'd like to set the position of the equation for each category manually, for example by finding the maximum y for each group and printing the equation at ymax + 1.
Here is my code:
library(ggpmisc)
library(dplyr) # for %>%, group_by() and mutate()
library(broom) # for tidy()
df <- data.frame(x = c(1:100))
df$y <- 20 * c(0, 1) + 3 * df$x + rnorm(100, sd = 40)
df$group <- factor(rep(c("A", "B"), 50))
df <- df %>% group_by(group) %>% mutate(ymax = max(y))
my.formula <- y ~ x
df %>%
group_by(group) %>%
do(tidy(lm(y ~ x, data = .)))
p <- ggplot(data = df, aes(x = x, y = y, colour = group)) +
geom_smooth(method = "lm", se=FALSE, formula = my.formula) +
stat_poly_eq(formula = my.formula,
aes(x = x , y = ymax + 1, label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE) +
geom_point()
p
Any suggestions on how to do this?
Also, is there any way to print only the slope of the equation (that is, remove the intercept from the plot)?
Thanks,

I'm pretty sure that adjusting stat_poly_eq() with the geom argument will get you what you want. Doing so will center the equations, leaving the left half of each clipped, so we use hjust = 0 to left-adjust the equations. Finally, depending on your specific data, the equations may overlap each other, so we use the position argument to have ggplot attempt to separate them.
This adjusted call should get you started, I hope:
p <- ggplot(data = df, aes(x = x, y = y, colour = group)) +
geom_smooth(method = "lm", se=FALSE, formula = my.formula) +
stat_poly_eq(
formula = my.formula,
geom = "text", # or 'label'
hjust = 0, # left-adjust equations
position = position_dodge(), # in case equations now overlap
aes(x = x , y = ymax + 1, label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE) +
geom_point()
p
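For the second part of the question (printing only the slope), one possible alternative that does not rely on stat_poly_eq() is to fit the models per group yourself and draw the labels with geom_text(). This is only a sketch: the labels data frame, its column names, and the label format are choices made here for illustration, not something taken from ggpmisc.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 20 * c(0, 1) + 3 * df$x + rnorm(100, sd = 40)
df$group <- factor(rep(c("A", "B"), 50))

# One row per group: the label text and where to draw it (at ymax + 1)
labels <- do.call(rbind, lapply(split(df, df$group), function(d) {
  fit <- lm(y ~ x, data = d)
  data.frame(
    group = d$group[1],
    x     = min(d$x),
    y     = max(d$y) + 1,
    label = sprintf("slope = %.2f, R2 = %.2f",
                    coef(fit)[["x"]], summary(fit)$r.squared)
  )
}))

p <- ggplot(df, aes(x = x, y = y, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  # hjust = 0 left-aligns each label at its group's smallest x
  geom_text(data = labels, aes(label = label), hjust = 0, show.legend = FALSE)
p
```

Because the label positions live in an ordinary data frame, they can be adjusted by hand if the two labels end up too close together.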

plot lines using ggplot and fit a linear regression line

I have generated toy data as follows:
toy.df <- data.frame(ID = rep(paste("S", 1:25, sep = "_"), 4),
                     x = rep(seq(1, 5), 20),
                     y = rnorm(100),
                     color_code = rnorm(100),
                     group = rep(letters[1:4], 25))
I would like to use ggplot to draw multiple lines as well as points in 4 facets, and to fit a linear regression line to the group of lines in each facet.
toy.df %>% ggplot(aes(x = x, y = y, group = ID)) +
facet_wrap( . ~ group, scales = 'free', ncol = 2)+
geom_point() +
geom_line(aes(color = color_code)) +
geom_smooth(method = 'lm')
But it does not draw the regression lines across the x axis (1 to 5).
Do you have any idea how I can fix this?

You are using ID as the group, and within each facet each ID has only one observation, so a linear regression cannot be fitted per group. Removing group = ID works:
library(tidyverse)
toy.df %>%
ggplot(aes(x = x, y = y)) +
facet_wrap( . ~ group, scales = 'free', ncol = 2)+
geom_point() +
geom_smooth(method=lm, se=F, formula = y ~ x)
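The diagnosis above can be confirmed by counting observations per ID within each facet. The sketch below rebuilds the toy data (with an arbitrary seed added here for reproducibility) and then shows one way to keep the per-ID lines while still fitting a single regression per facet: put group = ID in the geom_line() layer only.

```r
library(ggplot2)

set.seed(42) # seed chosen arbitrarily for this illustration
toy.df <- data.frame(ID = rep(paste("S", 1:25, sep = "_"), 4),
                     x = rep(seq(1, 5), 20),
                     y = rnorm(100),
                     color_code = rnorm(100),
                     group = rep(letters[1:4], 25))

# Number of observations for every ID within every facet
counts <- table(toy.df$ID, toy.df$group)
one_per_cell <- all(counts == 1)
print(one_per_cell)

# Grouping by ID only where it is wanted; geom_smooth() then sees one group per facet
p <- ggplot(toy.df, aes(x = x, y = y)) +
  facet_wrap(~ group, scales = "free", ncol = 2) +
  geom_point() +
  geom_line(aes(group = ID, colour = color_code)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

With this particular toy data, geom_line() still has nothing to connect, because each ID contributes a single point per facet; with several observations per ID the lines would appear.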

Exclude a particular area from geom_smooth fit automatically

I am plotting different plots in my shiny app.
By using geom_smooth(), I am fitting a smoothing curve on a scatterplot.
I am plotting these plots with ggplot() and rendering with ggplotly().
Is there any way to exclude a particular part of the data from geom_smooth()?
For example, in my plot a particular area of points disturbs the fit, which is not desirable. I have tried plotly_click(), plotly_brush(), and plotly_select(), but I don't want the user to intervene when plotting this fit, because that makes the process much slower and less accurate.
Here is my code to plot this:
#plot
g <- ggplot(data = d_f4, aes_string(x = d_f4$x, y = d_f4$y)) + theme_bw() +
geom_point(colour = "blue", size = 0.1)+
geom_smooth(formula = y ~ splines::bs(x, df = 10), method = "lm", color = "green3", level = 1, size = 1)
Unfortunately, I cannot include my dataset in my question because it is quite big.

You can make an extra data.frame without the "outliers" and use this as the input for geom_smooth:
set.seed(8)
test_data <- data.frame(x = 1:100)
test_data$y <- sin(test_data$x / 10) + rnorm(100, sd = 0.1)
test_data[60:65, "y"] <- test_data[60:65, "y"] + 1
data_plot <- test_data[-c(60:65), ]
library(ggplot2)
ggplot(data = test_data, aes(x = x, y = y)) + theme_bw() +
geom_point(colour = "blue", size = 0.1) +
geom_smooth(formula = y ~ splines::bs(x, df = 10), method = "lm", color = "green3", level = 1, size = 1)
ggplot(data = test_data, aes(x = x, y = y)) + theme_bw() +
geom_point(colour = "blue", size = 0.1) +
geom_smooth(data = data_plot, formula = y ~ splines::bs(x, df = 10), method = "lm", color = "green3", level = 1, size = 1)
Created on 2020-11-27 by the reprex package (v0.3.0)
BTW: you don't need aes_string() (which is deprecated) or d_f4$x; you can just use aes(x = x, y = y).
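Since the question asks for the exclusion to happen automatically, one possible heuristic (an assumption of this sketch, not part of the answer above) is to do a first, smoother pass, flag points whose residuals are far from the bulk, and fit the final curve without them. The threshold of 3 median absolute deviations and the df = 4 first pass are arbitrary illustrative choices, and the rule may not catch every disturbed point.

```r
library(ggplot2)

set.seed(8)
test_data <- data.frame(x = 1:100)
test_data$y <- sin(test_data$x / 10) + rnorm(100, sd = 0.1)
test_data[60:65, "y"] <- test_data[60:65, "y"] + 1

# First pass: a deliberately stiff spline that cannot follow a short bump
first_fit <- lm(y ~ splines::bs(x, df = 4), data = test_data)
res <- residuals(first_fit)

# Keep points whose residual lies within 3 MADs of the median residual
keep <- abs(res - median(res)) <= 3 * mad(res)
data_plot <- test_data[keep, ]

# Final plot: all points are shown, but the smoother only sees the kept rows
p <- ggplot(data = test_data, aes(x = x, y = y)) + theme_bw() +
  geom_point(colour = "blue", size = 0.5) +
  geom_smooth(data = data_plot, formula = y ~ splines::bs(x, df = 10),
              method = "lm", colour = "green3", se = FALSE)
p
```

It is worth inspecting which rows were dropped (for example with which(!keep)) before trusting the result on real data.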

Multiple Polynomial fits in ggplot using facetwrap

I would like to fit multiple polynomial curves to two-dimensional data.
I am able to plot one polynomial function, but I would like to use, for example, four, and then plot all of them at the same time using facet_wrap.
Right now I am using a simple second-order polynomial:
library(ggplot2)
library(polynom) # provides as.polynomial()
df <- mtcars
df <- data.frame("x"=df$mpg, "y"=df$hp)
my.formula <- y ~ x + I(x^2)
p <- ggplot(df, aes(x, y)) +
geom_point(shape=21, fill="blue", colour="black", size=2, alpha = 0.7) +
geom_smooth(method = "lm", se = F,
formula = my.formula,
colour = "red")
m <- lm(my.formula, df)
my.eq <- as.character(signif(as.polynomial(coef(m)), 3))
label.text <- paste(gsub("x", "~italic(x)", my.eq, fixed = TRUE),
paste("italic(R)^2",
format(summary(m)$r.squared, digits = 2),
sep = "~`=`~"),
sep = "~~~~")
p + annotate(geom = "text", label = label.text,
family = "serif", hjust = 0, parse = TRUE, size = 4)
Let's say we would like to use other formulas, such as:
my.formula2 <- y ~ x + I(x^2) + I(x^3)
my.formula4 <- y ~ x + I(x^2) + I(x^3) + I(x^4)
my.formula5 <- y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5)
And plot them on the base plot above using facet_wrap, so that we would have 4 separate plots, each with its own label text and annotation.

Here is an answer that first fits the polynomial regressions and gets the predicted values, then plots them all with geom_line rather than geom_smooth.
library(ggplot2)
df <- mtcars
df <- data.frame("x"=df$mpg, "y"=df$hp)
tmp <- sapply(2:5, function(d){
predict(lm(y ~ poly(x, d), df))
})
df2 <- df
df2 <- cbind(df2, tmp)
rm(tmp)
names(df2)[-(1:2)] <- paste0("degree", 2:5)
long <- reshape2::melt(df2, id.vars = c("x", "y"))
ggplot(long, aes(x, y)) +
geom_point(shape=21, fill="blue", colour="black", size=2, alpha = 0.7) +
geom_line(aes(y = value), colour = "red") +
facet_wrap(~ variable)
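Edit.
Another way, which lets geom_smooth fit the models instead of fitting them beforehand, is the following, inspired by an RStudio community post. It adds one geom_smooth layer per polynomial degree, each restricted to the rows of its own facet.
library(tidyverse)
degrees <- 2:5
# One copy of the data per polynomial degree, so each facet has its own rows
df_deg <- map(degrees, ~ mutate(df, degree = .x)) %>% bind_rows()
# Start from the base plot and add one smoother per degree
p <- reduce(degrees, function(prev, d) {
  force(d) # make sure d is evaluated before the formula captures it
  prev + geom_smooth(
    data = filter(df_deg, degree == d),
    method = "lm",
    se = FALSE,
    formula = y ~ poly(x, d))
}, .init = ggplot(df_deg, aes(x = x, y = y)))
p +
  geom_point(shape = 21, fill = "blue", colour = "black", size = 2, alpha = 0.7) +
  facet_wrap(~ degree)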
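The question also asks for a separate label in each panel, which the faceted plots above do not add. A minimal sketch of one way to do that, assuming an R-squared label is what is wanted: build a small data frame with one row per facet and pass it to geom_text(). The names fits, labels, and the label format are illustrative choices.

```r
library(ggplot2)

df <- data.frame(x = mtcars$mpg, y = mtcars$hp)
degrees <- 2:5

# Fit one polynomial model per degree
fits <- lapply(degrees, function(d) lm(y ~ poly(x, d), data = df))

# Long data frame: original points plus fitted values, one block per degree
long <- do.call(rbind, lapply(seq_along(degrees), function(i) {
  data.frame(df, value = fitted(fits[[i]]),
             variable = paste0("degree", degrees[i]))
}))

# One label row per facet, written as a plotmath expression
r2 <- sapply(fits, function(f) summary(f)$r.squared)
labels <- data.frame(variable = paste0("degree", degrees),
                     label = sprintf("R^2 == %.2f", r2))

p <- ggplot(long, aes(x, y)) +
  geom_point(shape = 21, fill = "blue", colour = "black", size = 2, alpha = 0.7) +
  geom_line(aes(y = value), colour = "red") +
  # x = Inf, y = Inf pins the label to the top-right corner of each panel
  geom_text(data = labels, aes(x = Inf, y = Inf, label = label),
            hjust = 1.1, vjust = 1.5, parse = TRUE) +
  facet_wrap(~ variable)
p
```

Because labels has a variable column matching the facetting variable, each row is drawn only in its own panel.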

It's pretty simple with the stat_function function.
I know you said you wanted to use facet_wrap, but I would suggest using ggarrange from the ggpubr package.
library(tidyverse)
df <- mtcars # this answer uses the original mtcars column names (mpg, hp)
mylm1 <- lm(hp ~ mpg + I(mpg^2), data = df)
mylm2 <- lm(hp ~ mpg + I(mpg^2) + I(mpg^3), data = df)
mylm3 <- lm(hp ~ mpg + I(mpg^2) + I(mpg^3) + I(mpg^4), data = df)
mylm4 <- lm(hp ~ mpg + I(mpg^2) + I(mpg^3) + I(mpg^4) + I(mpg^5), data = df)
b1 <- coef(mylm1)
b2 <- coef(mylm2)
b3 <- coef(mylm3)
b4 <- coef(mylm4)
p1 <- df %>%
ggplot() +
geom_point(aes(x = mpg, y = hp)) +
stat_function(fun = function(x) b1[1] + b1[2]*x + b1[3]*x^2)
p2 <- df %>%
ggplot() +
geom_point(aes(x = mpg, y = hp)) +
stat_function(fun = function(x) b2[1] + b2[2]*x + b2[3]*x^2 + b2[4]*x^3)
p3 <- df %>%
ggplot() +
geom_point(aes(x = mpg, y = hp)) +
stat_function(fun = function(x) b3[1] + b3[2]*x + b3[3]*x^2 + b3[4]*x^3 + b3[5]*x^4)
p4 <- df %>%
ggplot() +
geom_point(aes(x = mpg, y = hp)) +
stat_function(fun = function(x) b4[1] + b4[2]*x + b4[3]*x^2 + b4[4]*x^3 + b4[5]*x^4 + b4[6]*x^5)
library(ggpubr)
ggarrange(p1,p2,p3,p4)

Coefficients per facet with output.type="numeric" in ggpmisc::stat_poly_eq

ggpmisc::stat_poly_eq has an option output.type = "numeric" that makes it possible to get the estimates of the parameters of the fitted model. Below is my attempt to use it with facet_wrap. I get a different R² per facet, but the coefficients are the same in the two facets. Am I doing something wrong, or is it a bug?
library(ggpmisc)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
y = y,
group = c("A", "B"))
my.data[my.data$group=="A",]$y <- my.data[my.data$group=="A",]$y + 200000
formula <- y ~ poly(x, 1, raw = TRUE)
myformat <- "Intercept: %s\nSlope: %s\nR²: %s"
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, output.type = "numeric",
mapping = aes(label =
sprintf(myformat,
formatC(stat(coef.ls)[[1]][[1, "Estimate"]]),
formatC(stat(coef.ls)[[1]][[2, "Estimate"]]),
formatC(stat(r.squared)))))
Edit
We have to catch the panel number. Strangely, formatC(stat(as.integer(PANEL))) returns the panel number per facet,
but formatC(stat(coef.ls)[[stat(as.integer(PANEL))]][[1, "Estimate"]]) does not work, because here PANEL = c(1,2).
OK, I figured it out.
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(
formula = formula, output.type = "numeric",
mapping = aes(label =
sprintf(myformat,
c(formatC(stat(coef.ls)[[1]][[1, "Estimate"]]),
formatC(stat(coef.ls)[[2]][[1, "Estimate"]])),
c(formatC(stat(coef.ls)[[1]][[2, "Estimate"]]),
formatC(stat(coef.ls)[[2]][[2, "Estimate"]])),
formatC(stat(r.squared)))))

Version 0.3.2 of 'ggpmisc' is now on CRAN; I submitted it earlier this week. In the documentation I now give some examples of using geom_debug() from my package 'gginnards' to look at the data frame returned by a stat (it is usable with any ggplot stat or by itself). For your example, it would work like this:
library(ggpmisc)
library(gginnards)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
y = y,
group = c("A", "B"))
my.data[my.data$group=="A",]$y <- my.data[my.data$group=="A",]$y + 200000
formula <- y ~ poly(x, 1, raw = TRUE)
myformat <- "Intercept: %s\nSlope: %s\nR²: %s"
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, output.type = "numeric",
aes(label = ""),
geom = "debug")
This prints two tibbles to the console, one for each panel.
The example below was added to address a comment:
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, output.type = "numeric",
aes(label = ""),
geom = "debug",
summary.fun = function(x) {x[["coef.ls"]][[1]]})
This prints just the coef.ls.
I added the "numeric" option recently in response to a suggestion and with this example I noticed a bug: aes(label = "") should not have been needed, but is needed because the default mapping for the label aesthetic is wrong. I will fix this for the next release.
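When it is unclear whether the coefficients shown in each facet are right, a cross-check outside ggplot can help. This sketch fits the same model separately for each group with base R (the per_group matrix and its column names are chosen here for illustration) so the values can be compared with what the plot displays.

```r
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
                      y = y,
                      group = c("A", "B"))
my.data[my.data$group == "A", ]$y <- my.data[my.data$group == "A", ]$y + 200000

# Fit the same first-degree model separately within each group
per_group <- t(sapply(split(my.data, my.data$group), function(d) {
  coef(lm(y ~ poly(x, 1, raw = TRUE), data = d))
}))
colnames(per_group) <- c("Intercept", "Slope")
print(per_group)
```

If the two rows differ here but the facets show identical numbers, the labels are being built from the wrong element of coef.ls rather than from a wrong fit.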
