ggplot2 - a custom histogram with a rug plot - r

I am trying to create a custom histogram with a rug plot showing the original values on the X axis.
I am going to use the mtcars dataset to illustrate. It's not the best dataset for this question, but hopefully the reader will understand what I am trying to achieve.
Below shows the basic histogram, without any rug plot attempt.
I want to create the histogram using geom_bar as this allows for more flexibility with custom bins.
I also want a small gap between the histogram bars (i.e. width = 0.95), which adds to this problem's complexity.
library(dplyr)
library(ggplot2)
# create custom bins
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)
mtcars$bin <- cut(mtcars$mpg, breaks = vct_seq)
# create data.frame for the ggplot graph..using bins above
df_mtcars_count <- mtcars %>% group_by(bin) %>% summarise(count = n())
# indicative labels
vct_labels <- c("bin 1", "bin 2", "bin 3", "bin 4")
# attempt 1 - basic plot -- no rug plot
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p
Next, try and add a basic rug plot on the X axis. This obviously doesn't work as the geom_bar and geom_rug have completely different scales.
# attempt 2 with no scaling.... doesn't work as x scale for ordinal (bins) and
# x scale for continuous (mpg) do not match
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg), inherit.aes = F, alpha = 0.3)
p
Now, try and rescale the mpg column to match with the ordinal scale....
First define a linear mapping function...
fn_linear_map <- function(vct_existing_val, vct_new_range) {
# example....converts 1:20 into the range 1 to 10 like this:
# fn_linear_map(1:20, c(1, 10))
fn_r_diff <- function(x) x %>% range() %>% diff()
flt_ratio <- fn_r_diff(vct_new_range) / fn_r_diff(vct_existing_val)
vct_old_min_offset <- vct_existing_val - min(vct_existing_val)
vct_new_range_val <- (vct_old_min_offset * flt_ratio) + min(vct_new_range)
return(vct_new_range_val)
}
Now apply the function...we try and map mpg to the range 1 to 4 (which is an attempt to match
the ordinal scale)
mtcars$mpg_remap <- fn_linear_map(mtcars$mpg, c(1, 4))
Try the plot again.... getting closer ... but not really accurate...
# attempt 3: getting closer but doesn't really match the ordinal scale
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg_remap), inherit.aes = F, alpha = 0.3)
p
The graph above is getting close to what I want, but the rug plot does not line up with the actual data. For example, the max observation (33.9) should be displayed almost aligned with the right-hand side of the last bar; see below:
mtcars %>% filter(bin == "(25,34]") %>% arrange(mpg) %>% dplyr::select(mpg, mpg_remap)
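The mismatch has a simple cause: a single linear map assumes every bin covers the same mpg range, but the last bin spans 9 mpg while the others span 5. A piecewise-linear map, with one segment per bin, is one way to line the rug up with equally wide bars. The sketch below assumes the bars sit at discrete positions 1 to 4 with bin edges at 0.5, 1.5, ..., 4.5; `vct_pos` and `mpg_piecewise` are illustrative names, not part of the original code.

```r
# Bin edges from the question
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)

# Assumed positions of those edges on the discrete axis: 0.5, 1.5, ..., 4.5
vct_pos <- seq_along(vct_seq) - 0.5

# Interpolate linearly within each bin (base R approx)
mpg_piecewise <- approx(x = vct_seq, y = vct_pos, xout = mtcars$mpg)$y

range(mpg_piecewise)
```

Stored as a column of mtcars and mapped in geom_rug in place of mpg_remap, these values should put 33.9 just inside the right edge of the last bar. With width = 0.95 the drawn bars are slightly narrower than a full unit, so the alignment with the bar edges is approximate.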

Your scale makes no sense to me, as you are showing a bin that is almost twice as wide using the same bar width. Doing that in combination with a rug strikes me as confusing at best and misleading at worst. I suggest you plot the bars with their correct widths, after which the rug is trivial.
I think the best solution is to just use geom_histogram:
ggplot(mtcars, aes(mpg)) +
geom_histogram(breaks = vct_seq, col = 'grey80') +
geom_rug(aes(mpg, y = NULL))
If you really want the gaps between the bars you'll have to do more work:
library(tidyr)
d <- mtcars %>%
count(bin) %>%
separate(bin, c('min', 'max'), sep = ',', remove = FALSE) %>%
mutate_at(vars('min', 'max'), readr::parse_number) %>%
mutate(
middle = min + (max - min) / 2,
width = 0.9 * (max - min)
)
ggplot(d, aes(middle, n)) +
geom_col(width = d$width) +
geom_rug(aes(mpg, y = NULL), mtcars)
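The bin limits can also be taken straight from the numeric breaks instead of being parsed back out of the factor labels. The following is a minimal sketch of that variant, assuming the same `vct_seq` breaks; `bins` and its columns are names introduced here for illustration, and `geom_rect` is swapped in for `geom_col` so that each bar's edges are stated explicitly.

```r
library(ggplot2)

vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)

# One row per bin: lower edge, upper edge and number of observations
bins <- data.frame(
  min = head(vct_seq, -1),
  max = tail(vct_seq, -1),
  n   = as.vector(table(cut(mtcars$mpg, breaks = vct_seq)))
)
bins$middle <- (bins$min + bins$max) / 2
bins$width  <- 0.9 * (bins$max - bins$min)  # 10% of each bin left as a gap

p <- ggplot(bins) +
  geom_rect(aes(xmin = middle - width / 2, xmax = middle + width / 2,
                ymin = 0, ymax = n)) +
  geom_rug(data = mtcars, aes(x = mpg), inherit.aes = FALSE, alpha = 0.3) +
  labs(x = "mpg", y = "count")
p
```

Because the x axis stays continuous, the rug needs no rescaling at all.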

Related

Define custom transformation of ggplot axis labels with trans_new function

I am working on percentage changes between periods and struggling with a logarithmic transformation of the axis labels. Here is an example based on the storms dataset:
library(dplyr)
library(ggplot2)
library(scales)
df <- storms |>
group_by(year) |>
summarise(wind = mean(wind)) |>
mutate(lag = lag(wind, n = 1)) |>
mutate(perc = (wind / lag) - 1) |>
tidyr::drop_na()
I want to visualize the distribution of percentages, making the percentage change symmetrical (log difference) with log1p.
ggplot(df, aes(x = log1p(perc))) +
geom_histogram(bins = 5)
x-axis with log1p values
At this point I wanted to transform the x-axis label back to the original percentage value.
I tried to create my own transformation with trans_new, and applied it to the labels in scale_x_continuous, but I can't make it work.
trans_perc <- trans_new(
name = "trans_perc",
transform = log1p_trans(),
inverse = function(x)
expm1(x),
breaks = breaks_log(),
format = percent_format(),
domain = c(-Inf, Inf)
)
ggplot(df, aes(x = log1p(perc))) +
geom_histogram(bins = 5) +
scale_x_continuous(labels = trans_perc)
Currently, the result is:
Error in get_labels():
! breaks and labels are different lengths
Run rlang::last_error() to see where the error occurred.
Thanks!
EDIT
I am adding details on the different output I am getting from Alan's first answer:
trans_perc <- trans_new(
name = "trans_perc",
transform = log1p,
inverse = expm1,
breaks = pretty_breaks(5),
format = percent_format(),
domain = c(-Inf, Inf)
)
library(ggpubr)
a <- ggplot(df, aes(x = log1p(perc))) +
geom_histogram(bins = 5)
b <- ggplot(df, aes(x = log1p(perc))) +
geom_histogram(bins = 5) +
scale_x_continuous(trans = trans_perc)
c <- ggplot(df, aes(x = perc)) +
geom_histogram(bins = 5) +
scale_x_continuous(trans = trans_perc)
ggarrange(a, b, c,
ncol = 3,
labels = c("Log on Value only",
"Log on Value and X",
"Log on X only"))
[different outcomes](https://i.stack.imgur.com/dCW2m.png)
If I understand you correctly, you want to keep the shape of the histogram, but change the labels so that they reflect the value of the perc column rather than the transformed log1p(perc) value. If that is the case, there is no need for a transformer object. You can simply put the reverse transformation (plus formatting) as a function into the labels argument of scale_x_continuous.
ggplot(df, aes(x = log1p(perc))) +
geom_histogram(bins = 5) +
scale_x_continuous("Percentage Change",
breaks = log1p(pretty(df$perc, 5)),
labels = ~ percent(expm1(.x)))
Note that although the histogram remains symmetrical in shape, the axis labels represent the back-transformed values of the original axis labels.
The point of a transformer object is to do all this for you without having to pass a transformed data set (i.e. without having to pass log1p(perc)). So in your case, you could do:
trans_perc <- trans_new(
name = "trans_perc",
transform = log1p,
inverse = expm1,
format = percent_format(),
domain = c(-Inf, Inf)
)
ggplot(df, aes(x = perc)) +
geom_histogram(bins = 5) +
scale_x_continuous(trans = trans_perc)
This gives essentially the same result.
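Both approaches rely on `expm1` being the inverse of `log1p`, and on `log1p` making a change and its reciprocal symmetric around zero. A quick check in base R, using made-up percentage changes (the `perc` and `up_then_down` values are illustrative, not taken from `storms`):

```r
perc  <- c(-0.10, -0.05, 0, 0.05, 0.10)  # illustrative percentage changes
x_log <- log1p(perc)                     # scale the histogram is drawn on
back  <- expm1(x_log)                    # values the labels are computed from
all.equal(back, perc)

# A +10% change and the change that undoes it cancel on the log1p scale
up_then_down <- c(0.10, 1 / 1.10 - 1)
sum(log1p(up_then_down))
```

The error in the question most likely arises because `labels` expects a function or a vector of labels, whereas `trans_perc` is a transformer object; transformer objects belong in the `trans` argument, as in the second approach.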

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
annotate("text",
x = 1:2 + xoff,
y = df_text$avg + yoff,
label = df_text$n) +
coord_flip() +
theme_classic()
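A variant of the same idea passes the summary data frame to `geom_text` through its `data` argument, so each label is matched to its `Group` by name rather than by the hard-coded positions `1:2`. This is a sketch under the assumption that a fixed label position of `y = 1` is acceptable; `mean_cl_boot` needs the Hmisc package installed when the plot is drawn.

```r
library(ggplot2)

set.seed(123)
df_x <- data.frame(Group = c(rep("A", 1000), rep("B", 2)),
                   Value = rnorm(1002))

# One row per group, holding the count to display
df_text <- aggregate(Value ~ Group, data = df_x, FUN = length)
names(df_text)[names(df_text) == "Value"] <- "n"

p <- ggplot(df_x, aes(x = Group, y = Value)) +
  stat_summary(fun.data = "mean_cl_boot", size = 1.2) +
  # Labels come from df_text only, so exactly one label is drawn per group
  geom_text(data = df_text,
            aes(x = Group, y = 1, label = paste0("n=", n)),
            inherit.aes = FALSE) +
  coord_flip() +
  theme_classic()
```

Since the text layer has only one row per group, nothing is overplotted, and the summary can be recomputed whenever the input changes.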
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
p <- ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
coord_flip() +
theme_classic()
p + stat_summary(geom="text", fun.data=counts_df)

Handle ggplot2 axis text face programmatically

(x-posted to community.rstudio.com)
I'm wondering if it's possible to change the axis text in ggplot2 programmatically or if there is some native way to do this in ggplot2. In this reprex, the idea is that I want to bold the axis text of a variable y that has an absolute value of x over 1.5. I can add it in manually via theme(), and that works fine:
library(ggplot2)
library(dplyr)
library(forcats)
set.seed(2939)
df <- data.frame(x = rnorm(15), y = paste0("y", 1:15), group = rep(1:3, 5))
df <- mutate(df, big_number = abs(x) > 1.5, face = ifelse(big_number, "bold",
"plain"))
p <- ggplot(df, aes(x = x, y = fct_inorder(y), col = big_number)) + geom_point() +
theme(axis.text.y = element_text(face = df$face))
p
Plot 1 with no facets
But if I facet it by group, y gets reordered and ggplot2 has no idea how face is connected to df and thus y, so it just bolds in the same order as the first plot.
p + facet_grid(group ~ .)
Plot 2 with facets
And it's worse if I use a different scale for each.
p + facet_grid(group ~ ., scales = "free")
Plot 3 with facets and different scales
What do you think? Is there a general way to handle this that would work consistently here?
Idea: Don't change theme, change y-axis labels. Create a call for every y with if/else condition and parse it with parse.
Not the most elegant solution (using for loop), but works (need loop as bquote doesn't work with ifelse). I always get confused when trying to work with multiple expressions (more on that here).
Code:
# Create data
library(tidyverse)
set.seed(2939)
df <- data.frame(x = rnorm(15), y = paste0("y", 1:15), group = rep(1:3, 5)) %>%
mutate(yF = fct_inorder(y),
big_number = abs(x) > 1.5)
# Expressions for y-axis
# ifelse doesn't work
# ifelse(df$big_number, bquote(bold(1)), bquote(plain(2)))
yExp <- c() # Ignore terrible way of concatenating
for(i in 1:nrow(df)) {
if (df$big_number[i]) {
yExp <- c(yExp, bquote(bold(.(as.character(df$yF[i])))))
} else {
yExp <- c(yExp, bquote(plain(.(as.character(df$yF[i])))))
}
}
# Plot with facets
ggplot(df, aes(x, yF, col = big_number)) +
geom_point() +
scale_y_discrete(breaks = levels(df$yF),
labels = parse(text = yExp)) +
facet_grid(group ~ ., scales = "free")
Result:
Inspired by @PoGibas, I also used a function in scale_y_discrete(), which works, too.
bold_labels <- function(breaks) {
big_nums <- filter(df, y %in% breaks) %>%
pull(big_number)
labels <- purrr::map2(
breaks, big_nums,
~ if (.y) bquote(bold(.(.x))) else bquote(plain(.(.x)))
)
parse(text = labels)
}
ggplot(df, aes(x, fct_inorder(y), col = big_number)) +
geom_point() +
scale_y_discrete(labels = bold_labels) +
facet_grid(group ~ ., scales = "free")
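One thing to watch in `bold_labels`: `filter()` returns rows in the order they appear in `df`, not in the order of `breaks`, so the labels only line up when the two orders agree. Below is a sketch of a lookup by name with `match()`, which does not depend on row order. `bold_labels_by_name` is a name introduced here, and the data is rebuilt with base R so the block runs on its own.

```r
library(ggplot2)

set.seed(2939)
df <- data.frame(x = rnorm(15), y = paste0("y", 1:15), group = rep(1:3, 5))
df$big_number <- abs(df$x) > 1.5
df$yF <- factor(df$y, levels = unique(df$y))  # keep order of appearance

bold_labels_by_name <- function(breaks) {
  # Look each break up by name, so the result follows the order of `breaks`
  is_big <- df$big_number[match(breaks, df$y)]
  parse(text = ifelse(is_big,
                      sprintf('bold("%s")', breaks),
                      sprintf('plain("%s")', breaks)))
}

p <- ggplot(df, aes(x, yF, col = big_number)) +
  geom_point() +
  scale_y_discrete(labels = bold_labels_by_name) +
  facet_grid(group ~ ., scales = "free")
```

The function still reads `df` from the enclosing environment, as in the answer above, so it has to be redefined or parameterised if the data changes.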

showing count on x-axis for dot plot

I'd like to have a dot plot that shows the count on the x-axis. How can you get the dotplot below to show the count on the x-axis?
Thank you.
date = seq(as.Date("2016/1/5"), as.Date("2016/1/11"), "day")
value = c(11,11,12,12,13,14,14)
dat =data.frame(date = date, value = value)
dat
library(ggplot2)
ggplot(dat, aes(x = value)) + geom_dotplot(binwidth = .8) +
scale_y_discrete(breaks= seq(1,max(table(dat$value))+2,1),
labels = seq(1,max(table(dat$value))+2,1) ) #tried using scale_y discrete but it does nothing
ylim(0, A) gives what you want, where A is the number of stacked dots needed to reach a density of 1.00. A can be calculated exactly, although the calculation is a little involved; trial and error gives an approximate value.
(I referred to post1, post2, and post3.)
library(ggplot2); library(grid)
date = seq(as.Date("2016/1/5"), as.Date("2016/1/12"), "day")
value = c(11,11,12,12,13,14,14,14)
dat =data.frame(date = date, value = value)
### base plot
g <- ggplot(dat, aes(x = value)) + geom_dotplot(binwidth = 0.8) + coord_flip()
g # output to read parameter
### calculation of width and height of panel
grid.ls(view=TRUE,grob=FALSE)
seekViewport('panel.3-4-3-4')
real_width <- convertWidth(unit(1,'npc'), 'inch', TRUE)
real_height <- convertHeight(unit(1,'npc'), 'inch', TRUE)
### calculation of other values
height_coordinate_range <- diff(ggplot_build(g)$panel$ranges[[1]]$y.range)
real_binwidth <- real_height / height_coordinate_range * 0.8 # 0.8 is the argument binwidth
num_balls <- real_width / 1.1 / real_binwidth # the number of stacked balls. 1.1 is expanding value.
g + ylim(0, num_balls)
# The dirty balls border probably comes from my environment.
You can add coord_flip() to switch the x and y axes in ggplot. Here's an example with your script:
date = seq(as.Date("2016/1/5"), as.Date("2016/1/11"), "day")
value = c(11,11,12,12,13,14,14)
dat =data.frame(date = date, value = value)
dat
Edit, count on x-axis:
This will give a dotplot with simplified commands, and the counts as labels on the x-axis. Note: The binwidth has been changed from 0.8 to 1 to accommodate the use of ylim rather than scales.
library(ggplot2)
ggplot(dat, aes(x = value)) +
geom_dotplot(binwidth = 1) +
coord_flip() +
ylim(0,max(table(dat$value))+2)
Edit, count on y-axis:
library(ggplot2)
ggplot(dat, aes(x = value)) +
geom_dotplot(binwidth = 1) +
ylim(0,max(table(dat$value))+2)
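If the exact look of `geom_dotplot` is not required, an alternative is to compute each dot's position in its stack and draw the dots with `geom_point`, so that the count axis is an ordinary continuous scale. This swaps in a different technique from the answers above; `stack` is a helper column introduced for the sketch.

```r
library(ggplot2)

dat <- data.frame(value = c(11, 11, 12, 12, 13, 14, 14))

# Position of each observation within its group of equal values: 1, 2, ...
dat$stack <- ave(dat$value, dat$value, FUN = seq_along)

p <- ggplot(dat, aes(x = value, y = stack)) +
  geom_point(size = 5) +
  scale_y_continuous("count",
                     breaks = seq_len(max(dat$stack)),
                     limits = c(0, max(dat$stack) + 1))

# p + coord_flip() puts the count on the x-axis instead
```

The axis values are true counts here, so no binwidth or ylim tuning is needed to make the labels agree with the dots.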

Ggplot Heatmap - customized colors for customized count ranges

I want to make a heatmap that creates a group of clarity & color combinations as the X axis and cut as the Y axis. The heatmap would color based upon the counts of clarity+color and its intersection with the cut.
library(ggplot2)
library(dplyr)
## rename diamonds df
# 1. Generate a count for the frequency of cut+clarity
# 2. Make a heatmap of this using the following bins
# 3. Red <= 100 Frequency
#    Yellow = between (100 and 500)
#    Green > 500
# place counts inside the cell:
df = diamonds %>%
select( cut, clarity) %>%
group_by(cut,clarity)%>%
mutate(count = n())
myplot = ggplot(df, aes(x = clarity, y=cut)) +
geom_bin2d( bins = c(100,500,50000), col='orange') +
geom_text( aes(label = count),col='red')
myplot
Try this:
df$col <- cut(df$count,breaks = c(-Inf,100,500,Inf),right = TRUE)
df$color<-df$col
levels(df$color) <- c("<=100","100<#<=500",">500")
ggplot(data = df, aes(x = clarity, y = cut)) +
geom_tile(aes(fill = color), colour = "white") +
scale_fill_brewer("Count",palette = "Set1")+
geom_text(aes(label = count),col='yellow',size=3)
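The `Set1` palette does not produce the red, yellow and green asked for in the question, and `df` still holds one row per diamond, so every tile and label is drawn many times over. Here is a sketch that summarises to one row per cell and sets the colours by hand; `df_counts` and `range` are names introduced here, and the bin labels are reused from the answer above.

```r
library(ggplot2)

# One row per cut/clarity combination with its frequency
df_counts <- as.data.frame(
  table(cut = diamonds$cut, clarity = diamonds$clarity),
  responseName = "count"
)

# Assign each count to one of the three requested ranges
df_counts$range <- cut(df_counts$count,
                       breaks = c(-Inf, 100, 500, Inf),
                       labels = c("<=100", "100<#<=500", ">500"))

p <- ggplot(df_counts, aes(x = clarity, y = cut)) +
  geom_tile(aes(fill = range), colour = "white") +
  scale_fill_manual("Count",
                    values = c("<=100" = "red",
                               "100<#<=500" = "yellow",
                               ">500" = "green"),
                    drop = FALSE) +
  geom_text(aes(label = count), size = 3)
p
```

`drop = FALSE` keeps all three ranges in the legend even if one of them happens to be empty.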
