I built a simple linear regression model and used it to produce some predicted values. I would now like to visualize them on a graph, but I do not know how to add a legend that marks the original values as "black" and the new predicted values as "red".
The data used in this example is the mtcars dataset from the datasets package.
library(ggplot2)
library(datasets)
library(broom)
# Build a simple linear model between hp and mpg
m1<-lm(hp~mpg,data=mtcars)
# Predict `hp` for the new `mpg` values below
new_mpg = data.frame(mpg=c(23,21,30,28))
new_hp<- augment(m1,newdata=new_mpg)
# plot new predicted values in the graph along with original mpg values
ggplot(data=mtcars,aes(x=mpg,y=hp)) + geom_point(color="black") + geom_smooth(method="lm",col=4,se=F) +
geom_point(data=new_hp,aes(y=.fitted),color="red")
Here is one idea. You can combine the predicted and observed data in the same data frame and then create the scatter plot to generate the legend. The following code is an extension of your existing code.
# Prepare the dataset
library(dplyr)
new_hp2 <- new_hp %>%
select(mpg, hp = .fitted) %>%
# Add a label to show it is predicted data
mutate(Type = "Predicted")
dt <- mtcars %>%
select(mpg, hp) %>%
# Add a label to show it is observed data
mutate(Type = "Observed") %>%
# Combine predicted data and observed data
bind_rows(new_hp2)
# plot the data
ggplot(data = dt, aes(x = mpg, y = hp, color = factor(Type))) +
geom_smooth(method="lm", col = 4, se = F) +
geom_point() +
scale_color_manual(name = "Type", values = c("Black", "Red"))
Here is another way of doing it without dplyr:
ggplot() +
geom_point(data = mtcars, aes(x = mpg, y = hp, colour = "Obs")) +
geom_point(data = new_hp, aes(x = mpg, y = .fitted, colour = "Pred")) +
scale_colour_manual(name = "Type",
values = c(Obs = "black", Pred = "red")) +
geom_smooth(data = mtcars, aes(x = mpg, y = hp),
method = "lm", col = 4, se = FALSE)
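As a further sketch (not taken from the answers above), the same legend can be built with base predict() instead of broom::augment(), and with a named colour vector so that each legend label is tied to its colour explicitly rather than by alphabetical order. The names new_pts and type_cols are made up for this illustration.

```r
library(ggplot2)

# Fit the same model as in the question and predict hp for the new mpg values
m1 <- lm(hp ~ mpg, data = mtcars)
new_pts <- data.frame(mpg = c(23, 21, 30, 28))
new_pts$hp <- unname(predict(m1, newdata = new_pts))

# Named vector: each legend label is matched to its colour by name
type_cols <- c(Observed = "black", Predicted = "red")

p <- ggplot(mapping = aes(x = mpg, y = hp)) +
  geom_smooth(data = mtcars, method = "lm", formula = y ~ x,
              se = FALSE, colour = "blue") +
  geom_point(data = mtcars, aes(colour = "Observed")) +
  geom_point(data = new_pts, aes(colour = "Predicted")) +
  scale_colour_manual(name = "Type", values = type_cols)
p
```

Mapping colour to a constant string inside aes() is what makes ggplot2 draw a legend entry; setting color = "red" outside aes() does not.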
Related
I have an empty panel in my faceted ggplot. I would like to insert my standalone plot into it. Is this possible? See below for example code.
I found a possible solution here, but can't get it to 'look nice'. To 'look nice', I want the standalone plot to have the same dimensions as one of the faceted plots.
library(ggplot2)
library(plotly)
data("mpg")
first_plot = ggplot(data = mpg, aes(x = trans, y = cty)) +
geom_point(size= 1.3)
facet_plot = ggplot(data = mpg, aes(x = year, y = cty)) +
geom_point(size = 1.3) +
facet_wrap(~manufacturer)
facet_plot # there is room for one more panel, where I want first_plot to go
# try to merge, but this makes the first plot huge compared with the faceted plots
subplot(first_plot, facet_plot, which_layout = 2)
Besides manipulating the gtable or using patchwork, one approach to achieve your desired result is to do some data wrangling so that the standalone plot becomes an additional facet. I'm not sure whether this will work for your real data, but at least for mpg you could do:
library(ggplot2)
library(dplyr)
mpg_bind <- list(standalone = mpg, facet = mpg) %>%
bind_rows(.id = "id") %>%
mutate(x = ifelse(id == "standalone", trans, year),
facet = ifelse(id == "standalone", "all", manufacturer),
facet = forcats::fct_relevel(facet, "all", after = 1000))
ggplot(data = mpg_bind, aes(x = x, y = cty)) +
geom_point(size = 1.3) +
facet_wrap(~facet, scales = "free_x")
I have a figure with 16 regression lines and I need to be able to identify them. Using a color gradient, symbols, or different line types does not really help.
My idea, therefore, is to just (haha) annotate every line.
To do this, I built a dataset (hpAnnotatedLines) with the maximum x value of each line. This is the position where the text should start. However, I have no idea how to automatically extract the respective y values of the predicted regression lines at those maximum x values, which differ for each line.
Please find below a smaller example using mtcars:
library(ggplot2)
library(dplyr)
library(ggrepel)
#just select the data I need
mtcars1 <- select(mtcars, disp,cyl,hp)
mtcars1$cyl <- as.factor(mtcars1$cyl)
#extract max values
mtcars2 <- mtcars1 %>%
group_by(cyl) %>%
summarise(Max.disp= max(disp))
#build dataset for the annotation layer
#note that hp was done by hand. Here I need help
hpAnnotatedLines <- data.frame(cyl=levels(mtcars2$cyl),
disp=mtcars2$Max.disp,
hp=c(90,100,210))
#example plot
ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm)+
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50)) +
geom_text_repel(
data = hpAnnotatedLines,
aes(label = cyl),
size = 3,
nudge_x = 1)
Instead of extracting the fitted values you could add the labels via geom_text by switching the stat to smooth and setting the label aesthetic via after_stat such that only the last point of each regression line gets labelled:
library(ggplot2)
library(dplyr)
myfun <- function(x, color) {
data.frame(x = x, color = color) %>%
group_by(color) %>%
mutate(label = ifelse(x %in% max(x), as.character(color), "")) %>%
pull(label)
}
ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm) +
geom_text(aes(label = after_stat(myfun(x, color))),
stat = "smooth", method = "lm", hjust = 0, size = 3, nudge_x = 1, show.legend = FALSE) +
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))
It's a bit of a hack, but you can extract the data from the compiled plot object. For example, first make the plot without the labels:
myplot <- ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm)+
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))
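Then use ggplot_build to get the data from the second layer (the geom_smooth layer) and transform it back into the names used by your data. Here we find the largest x value per group and take the corresponding y value. Because the plot maps factor(cyl), the group index is translated back through levels(factor(mtcars$cyl)); calling levels() on the numeric cyl column directly would return NULL.
pobj <- ggplot_build(myplot)
hpAnnotatedLines <- pobj$data[[2]] %>% group_by(group) %>%
top_n(1, x) %>%
transmute(disp=x, hp=y, cyl=levels(factor(mtcars$cyl))[group])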
Then add an additional layer to your plot
myplot +
geom_text_repel(
data = hpAnnotatedLines,
aes(label = cyl),
size = 3,
nudge_x = 1)
If your data set is not too large, you can extract the predictions using augment() from broom and keep the row with the largest disp value in each group:
library(broom)
library(dplyr)
library(ggplot2)
library(ggrepel)
hpAnn = mtcars %>% group_by(cyl) %>%
do(augment(lm(hp ~ disp,data=.))) %>%
top_n(1,disp) %>%
select(cyl,disp,.fitted) %>%
rename(hp = .fitted)
hpAnn
# A tibble: 3 x 3
# Groups:   cyl [3]
#     cyl  disp    hp
#   <dbl> <dbl> <dbl>
# 1     4  147.  96.7
# 2     6  258   99.9
# 3     8  472  220.
Then plot:
ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm)+
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))+
geom_text_repel(
data = hpAnn,
aes(label = cyl),
size = 3,
nudge_x = 1)
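For comparison, here is a minimal base-R sketch of the same idea that does not need broom or dplyr: fit one lm per cyl group and call predict() at that group's largest disp. The name line_ends is made up for this illustration, and plain geom_text is used in place of geom_text_repel to keep the dependencies down. The resulting hp values should agree with the table printed above.

```r
library(ggplot2)

# One row per cyl group: the largest disp and the fitted hp at that point
line_ends <- do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(d) {
  fit <- lm(hp ~ disp, data = d)
  x_max <- max(d$disp)
  data.frame(
    cyl  = d$cyl[1],
    disp = x_max,
    hp   = unname(predict(fit, newdata = data.frame(disp = x_max)))
  )
}))

p <- ggplot(mtcars, aes(x = disp, y = hp, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Labels start just to the right of the end of each regression line
  geom_text(data = line_ends, aes(label = cyl),
            hjust = 0, nudge_x = 5, size = 3, show.legend = FALSE) +
  coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))
p
```

Because the label positions come from predict() rather than from the rendered plot, this only matches the drawn lines as long as geom_smooth uses the same model (here a straight-line lm per group).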
I'm hoping to recreate the gridExtra output below with ggplot's facet_grid, but I'm unsure of what variable ggplot identifies with the layers in the plot. In this example, there are two geoms...
library(tidyverse)
a <- ggplot(mpg)
b <- geom_point(aes(displ, cyl, color = drv))
c <- geom_smooth(aes(displ, cyl, color = drv))
d <- a + b + c
# output below
gridExtra::grid.arrange(
a + b,
a + c,
ncol = 2
)
# Equivalent with gg's facet_grid
# needs a categorical var to iter over...
d$layers
#d + facet_grid(. ~ d$layers??)
The gridExtra output that I'm hoping to recreate is:
A hacky way of doing this is to make as many copies of the existing data frame as you need (two, three, ...), each tagged with a value that is used later for faceting and filtering. Union (or rbind) the copies into one data frame. Then set up the ggplot and its geoms, and filter the data of each geom for the desired tag. For the facet, use the same tag to split the plots.
This can be seen below:
library(ggplot2)
library(dplyr)
df1 <- data.frame(
graph = "point_plot",
mpg
)
df2 <- data.frame(
graph = "spline_plot",
mpg
)
df <- rbind(df1, df2)
ggplot(df, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point(data = filter(df, graph == "point_plot")) +
geom_smooth(data = filter(df, graph == "spline_plot"), se=FALSE) +
facet_grid(. ~ graph)
If you really want to show different plots on different facets, one hacky way would be to make separate copies of the data and subset those...
mpg2 <- mpg %>% mutate(facet = 1) %>%
bind_rows(mpg %>% mutate(facet = 2))
ggplot(mpg2, aes(displ, cyl, color = drv)) +
geom_point(data = subset(mpg2, facet == 1)) +
geom_smooth(data = subset(mpg2, facet == 2)) +
facet_wrap(~facet)
I have a dataset with numeric values and a categorical variable. The distribution of the numeric variable differs for each category. I want to plot "density plots" for each category so that they are visually below the density plot of the entire dataset.
This is similar to the components of a mixture model, without fitting the mixture model (as I already know the categorical variable that splits the data).
If I let ggplot group according to the categorical variable, each of the four densities is a real density and integrates to one.
library(ggplot2)
ggplot(iris, aes(x = Sepal.Width)) + geom_density() + geom_density(aes(x = Sepal.Width, group = Species, colour = Species))
What I want is to have the densities of each category as a sub-density (not integrating to 1). Similar to the following code (which I only implemented for two of the three iris species)
library(data.table)
myIris <- as.data.table(iris)
# calculate density for entire dataset
dens_entire <- density(myIris[, Sepal.Width], cut = 0)
dens_e <- data.table(x = dens_entire[[1]], y = dens_entire[[2]])
# calculate density for dataset with setosa
dens_setosa <- density(myIris[Species == 'setosa', Sepal.Width], cut = 0)
dens_sa <- data.table(x = dens_setosa[[1]], y = dens_setosa[[2]])
# calculate density for dataset with versicolor
dens_versicolor <- density(myIris[Species == 'versicolor', Sepal.Width], cut = 0)
dens_v <- data.table(x = dens_versicolor[[1]], y = dens_versicolor[[2]])
# plot densities as mixture model
ggplot(dens_e, aes(x=x, y=y)) + geom_line() + geom_line(data = dens_sa, aes(x = x, y = y/2.5, colour = 'setosa')) +
geom_line(data = dens_v, aes(x = x, y = y/1.65, colour = 'versicolor'))
resulting in
Above I hard-coded the number to reduce the y values. Is there any way to do it with ggplot? Or to calculate it?
Thanks for your ideas.
Do you mean something like this? You need to change the scale though, since the y-axis now shows counts instead of densities.
ggplot(iris, aes(x = Sepal.Width)) +
geom_density(aes(y = after_stat(count))) +
geom_density(aes(x = Sepal.Width, y = after_stat(count),
group = Species, colour = Species))
Another option, which works here because the three species have the same number of observations, may be
ggplot(iris, aes(x = Sepal.Width)) +
geom_density(aes(y = after_stat(density))) +
geom_density(aes(x = Sepal.Width, y = after_stat(density) / 3,
group = Species, colour = Species))
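When the categories do not all have the same size, dividing by a fixed number no longer works. A hedged sketch of a more general version is to scale each category's density by its share of the observations, so that each curve integrates to that share. The names props and sub_dens are made up for this illustration, and density() is called with its default bandwidth, which is an assumption rather than something the question specifies.

```r
library(ggplot2)

# Share of observations in each species (1/3 each for iris)
props <- prop.table(table(iris$Species))

# Per-species density scaled by that share, so each curve integrates to its share
sub_dens <- do.call(rbind, lapply(names(props), function(sp) {
  d <- density(iris$Sepal.Width[iris$Species == sp])
  data.frame(Species = sp, x = d$x, y = d$y * props[[sp]])
}))

p <- ggplot(iris, aes(x = Sepal.Width)) +
  geom_density() +
  geom_line(data = sub_dens, aes(x = x, y = y, colour = Species))
p
```

Within ggplot2 alone, the same scaling can probably be had with geom_density(aes(y = after_stat(count) / nrow(iris), colour = Species)), since the computed count is the density multiplied by the number of points in the group. Note that the scaled curves will not sum exactly to the overall curve, because each density is estimated with its own bandwidth.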
How would I add a text annotation (e.g. sd = sd_value) of the standard deviation in each panel of the following plot using ggplot2 in R?
library(ggplot2)
library(datasets)
data(mtcars)
ggplot(data = mtcars, aes(x = hp)) +
geom_dotplot(binwidth = 1) +
geom_density() +
facet_grid(. ~ cyl) +
theme_bw()
I'd post an image of the plot, but I don't have enough rep.
I think "geom_text" or "annotate" might be useful, but I'm not quite sure how.
If you want to vary the text label in each facet, you will want to use geom_text. If you want the same text to appear in each facet, you can use annotate.
p <- ggplot(data = mtcars, aes(x = hp)) +
geom_dotplot(binwidth = 1) +
geom_density() +
facet_grid(. ~ cyl)
mylabels <- data.frame(cyl = c(4, 6, 8),
label = c("first label", "second label different", "and another"))
p + geom_text(x = 200, y = 0.75, aes(label = label), data = mylabels)
### compare that to this way with annotate
p + annotate("text", x = 200, y = 0.75, label = "same label everywhere")
Now, if you really want standard deviation by cyl in this example, I'd probably use dplyr to do the calculation first and then complete this with geom_text like so:
library(ggplot2)
library(dplyr)
df.sd.hp <- mtcars %>%
group_by(cyl) %>%
summarise(hp.sd = round(sd(hp), 2))
ggplot(data = mtcars, aes(x = hp)) +
geom_dotplot(binwidth = 1) +
geom_density() +
facet_grid(. ~ cyl) +
geom_text(x = 200, y = 0.75,
aes(label = paste0("SD: ", hp.sd)),
data = df.sd.hp)
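The hard-coded x = 200, y = 0.75 only works as long as the axes happen to cover that point. As a sketch under that concern, the label can instead be pinned to the top-right corner of every panel with x = Inf and y = Inf plus hjust/vjust offsets. The summary here uses base aggregate() rather than dplyr, and the names sd_df and hp_sd are made up for this illustration.

```r
library(ggplot2)

# Standard deviation of hp per cyl, one row per facet
sd_df <- aggregate(hp ~ cyl, data = mtcars, FUN = sd)
names(sd_df)[2] <- "hp_sd"
sd_df$label <- sprintf("sd = %.2f", sd_df$hp_sd)

p <- ggplot(mtcars, aes(x = hp)) +
  geom_density() +
  facet_grid(. ~ cyl) +
  # Inf places the text at the panel edge; hjust/vjust pull it back inside
  geom_text(data = sd_df, aes(label = label),
            x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
            inherit.aes = FALSE)
p
```

The label data frame must contain the faceting variable (cyl) so that each row is drawn only in its own panel.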
I prefer the appearance of the graph when the statistic appears within the facet label itself. I made the following script, which allows the choice of displaying the standard deviation, mean or count. Essentially it calculates the summary statistic then merges this with the name so that you have the format CATEGORY (SUMMARY STAT = VALUE).
#' Function will update the name with the statistic of your choice
AddNameStat <- function(df, category, count_col, stat = c("sd","mean","count"), dp= 0){
# Resolve the default so that `stat` is always a single value
stat <- match.arg(stat)
# Create temporary data frame for analysis
temp <- data.frame(ref = df[[category]], comp = df[[count_col]])
# Aggregate the variables and calculate statistics
# (plyr functions are prefixed so the package does not need to be attached)
agg_stats <- plyr::ddply(temp, "ref", plyr::summarize,
sd = sd(comp),
mean = mean(comp),
count = length(comp))
# Dictionary used to replace stat name with correct symbol for plot
labelName <- plyr::mapvalues(stat, from=c("sd","mean","count"), to=c("\u03C3", "x", "n"), warn_missing = FALSE)
# Updates the name based on the selected variable
agg_stats$join <- paste0(agg_stats$ref, " \n (", labelName," = ",
round(agg_stats[[stat]], dp), ")")
# Map the names
name_map <- setNames(agg_stats$join, as.factor(agg_stats$ref))
return(name_map[as.character(df[[category]])])
}
Using this script with your original question:
library(ggplot2)
library(datasets)
data(mtcars)
# Update the variable name
mtcars$cyl <- AddNameStat(mtcars, "cyl", "hp", stat = "sd")
ggplot(data = mtcars, aes(x = hp)) +
geom_dotplot(binwidth = 1) +
geom_density() +
facet_grid(. ~ cyl) +
theme_bw()
The script should be easy to alter to include other summary statistics. I am also sure it could be rewritten in parts to make it a bit cleaner!
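A lighter-weight sketch of the same idea, which leaves the cyl column untouched, is to pass a named character vector to the facet through ggplot2's as_labeller(). The names df, sd_by_cyl and facet_labels are made up for this illustration, and the label format is only one possible choice.

```r
library(ggplot2)

# Work on a fresh copy so earlier modifications of mtcars do not interfere
df <- datasets::mtcars

# Standard deviation of hp per cyl, then one label per facet value
sd_by_cyl <- tapply(df$hp, df$cyl, sd)
facet_labels <- setNames(
  sprintf("%s (sd = %.1f)", names(sd_by_cyl), as.numeric(sd_by_cyl)),
  names(sd_by_cyl)
)

p <- ggplot(df, aes(x = hp)) +
  geom_density() +
  # The names of facet_labels are matched against the values of cyl
  facet_grid(. ~ cyl, labeller = as_labeller(facet_labels)) +
  theme_bw()
p
```

Since only the strip text changes, the data keep their original cyl values and can still be filtered or summarised by cyl afterwards.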