using y-axis values to create secondary x-axis in ggplot2 - r

I would like to create a dot plot with percentiles, which looks something like this-
Here is the ggplot2 code I used to create the dot plot. There are two things I'd like to change:
1. I can plot the percentile values on the y-axis, but I want these values on the x-axis (as shown in the graph above). Note that the coordinates are flipped.
2. The axes don't display a label for the minimum value (for example, the percentile axis labels start at 25 when they should start at 0).
# loading needed libraries
library(tidyverse)
library(ggstatsplot)
# creating dataframe with mean mileage per manufacturer
cty_mpg <- ggplot2::mpg %>%
dplyr::group_by(.data = ., manufacturer) %>%
dplyr::summarise(.data = ., mileage = mean(cty, na.rm = TRUE)) %>%
dplyr::rename(.data = ., make = manufacturer) %>%
dplyr::arrange(.data = ., mileage) %>%
dplyr::mutate(.data = ., make = factor(x = make, levels = .$make)) %>%
dplyr::mutate(
.data = .,
percent_rank = (trunc(rank(mileage)) / length(mileage)) * 100
) %>%
tibble::as_data_frame(x = .)
# plot
ggplot2::ggplot(data = cty_mpg, mapping = ggplot2::aes(x = make, y = mileage)) +
ggplot2::geom_point(col = "tomato2", size = 3) + # Draw points
ggplot2::geom_segment(
mapping = ggplot2::aes(
x = make,
xend = make,
y = min(mileage),
yend = max(mileage)
),
linetype = "dashed",
size = 0.1
) + # Draw dashed lines
ggplot2::scale_y_continuous(sec.axis = ggplot2::sec_axis(trans = ~(trunc(rank(.)) / length(.)) * 100, name = "percentile")) +
ggplot2::coord_flip() +
ggplot2::labs(
title = "City mileage by car manufacturer",
subtitle = "Dot plot",
caption = "source: mpg dataset in ggplot2"
) +
ggstatsplot::theme_ggstatsplot()
Created on 2018-08-17 by the reprex package (v0.2.0.9000).

I am not 100% sure I have understood what you want, but below is my attempt to reproduce the first picture with the mpg data:
require(ggplot2)
data <- aggregate(cty~manufacturer, mpg, FUN = mean)
data <- data.frame(data[order(data$cty), ], rank=1:nrow(data))
g <- ggplot(data, aes(y = rank, x = cty))
g <- g + geom_point(size = 2)
g <- g + scale_y_continuous(name = "Manufacturer", labels = data$manufacturer, breaks = data$rank,
sec.axis = dup_axis(name = element_blank(),
breaks = seq(1, nrow(data), (nrow(data)-1)/4),
labels = 25 * 0:4))
g <- g + scale_x_continuous(name = "Mileage", limits = c(10, 25),
sec.axis = dup_axis(name = element_blank()))
g <- g + theme_classic()
g <- g + theme(panel.grid.major.y = element_line(color = "black", linetype = "dotted"))
print(g)
That produces:
data <- aggregate(cty~manufacturer, mpg, FUN = mean)
data <- data.frame(data[order(data$cty), ], rank=1:nrow(data))
These two lines generate the data for the graph. Basically we need the manufacturers, the mileage (average of cty by manufacturer) and the rank.
g <- g + scale_y_continuous(name = "Manufacturer", labels = data$manufacturer, breaks = data$rank,
sec.axis = dup_axis(name = element_blank(),
breaks = seq(1, nrow(data), (nrow(data)-1)/4),
labels = 25 * 0:4))
Note that here the scale uses rank and not the manufacturer column. To display the manufacturer names, you must use the labels argument and force a break at every value (see the breaks argument).
The second y-axis is generated using the sec.axis argument. This is straightforward with dup_axis, which duplicates the axis. By replacing the labels and the breaks, you can display the percentage values.
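To see where the secondary-axis breaks land, the arithmetic can be run on its own. This is only a sketch: it assumes a hypothetical table of n = 15 ranked rows, and the names pct_breaks and pct_labels are illustrative rather than part of the code above.

```r
# Hypothetical number of ranked rows (one per manufacturer); an assumption.
n <- 15

# Five evenly spaced positions between rank 1 and rank n,
# mirroring seq(1, nrow(data), (nrow(data) - 1) / 4) in the answer.
pct_breaks <- seq(1, n, (n - 1) / 4)
pct_labels <- 25 * 0:4

# Pair each break position with the percentage label it will show.
data.frame(rank_position = pct_breaks, label = pct_labels)
```

The lowest rank is labelled 0 and the highest 100, so the labels are a linear rescaling of rank rather than rank divided by n.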
g <- g + theme(panel.grid.major.y = element_line(color = "black", linetype = "dotted"))
The horizontal lines are just the major grid. This is much easier to manipulate than geom_segments in my opinion.
Regarding your question 1, you can flip the coordinates easily using coord_flip, with minor adjustments. Replace the following line:
g <- g + theme(panel.grid.major.y = element_line(color = "black", linetype = "dotted"))
By the following two lines:
g <- g + coord_flip()
g <- g + theme(panel.grid.major.x = element_line(color = "black", linetype = "dotted"),
axis.text.x = element_text(angle = 90, hjust = 1))
Which produces:
Regarding your question 2, the problem is that the value 0% lies outside the axis limits. You can solve this either by changing the way you calculate the percentage (starting from zero rather than from one) or by extending the limits of your plot to include zero, although then no point will be associated with 0%.
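The difference between counting the percentage from one and from zero can be checked with a few numbers. This is a minimal sketch using made-up mileage values (not the mpg data); pct_from_zero is one possible zero-based definition, not necessarily the one you will want.

```r
# Made-up mean mileages for six manufacturers (illustrative values only).
mileage <- c(11.3, 12.1, 13.5, 16.0, 18.9, 24.4)
n <- length(mileage)

# Percentile counted from one: the smallest value gets 100 / n, never 0.
pct_from_one <- rank(mileage) / n * 100

# Percentile counted from zero: the smallest value gets 0, the largest 100.
pct_from_zero <- (rank(mileage) - 1) / (n - 1) * 100

range(pct_from_one)
range(pct_from_zero)
```

With the zero-based version, a tick at 0% corresponds to an actual data point, so it falls inside the plotted range.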

Related

ggplot get a color for each value

I have a spatial dataset, containing values from 0 to 10. I want every number (11 numbers) to have a unique color from a gradient. The simple plot function does the trick (assigning one color to one value) but my default is ggplot, which I also want to use here. ggplot only uses ten colors for some reason and I cannot figure out why. I think I might just be using the wrong scale_x_y function.
Reproducible example:
library(raster)
library(ggplot2)
#Colors
cols <- colorRampPalette(c("yellow", "red", "darkred", "black"))
# Create Raster
r <- raster(ncol=100, nrow=100)
r[] <- sample(0:10, 10000, replace = T)
# Plot simple
plot(r, col=cols(11)) # 11 colors seen here
# Convert to df
r <- as.data.frame(r, xy=T)
# Plot with ggplot
X <- ggplot(data = r) + geom_raster(aes(x = x, y = y, fill = layer), interpolate = F) +
scale_fill_stepsn(colors=cols(11), breaks=seq(0,10,1), show.limits=T)
print(X) # only 10 colors seen here
In scale_fill_stepsn the breaks are at the limits of each bin. If you have a sequence of 11 breaks, then you only have ten bins (if you have 11 fence posts you only have 10 stretches of fence between them). You need to add one to your sequence of breaks, otherwise the level 10 will be excluded:
ggplot(data = r) +
geom_raster(aes(x = x, y = y, fill = layer), interpolate = FALSE) +
scale_fill_stepsn(colors = cols(11), breaks = seq(0, 11, 1),
show.limits = TRUE) +
coord_equal()
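The fence-post arithmetic can be checked without plotting. The sketch below uses base R's cut() as a stand-in for binning; that is an assumption made for illustration, and it does not claim to reproduce how scale_fill_stepsn bins values internally.

```r
values <- 0:10

# 11 breaks (0..10) delimit only 10 intervals, so 9 and 10 share the last one.
bins_11_breaks <- cut(values, breaks = seq(0, 10, 1),
                      right = FALSE, include.lowest = TRUE)

# 12 breaks (0..11) delimit 11 intervals, one per value.
bins_12_breaks <- cut(values, breaks = seq(0, 11, 1), right = FALSE)

nlevels(bins_11_breaks)
nlevels(bins_12_breaks)
```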
An alternative is to use a manual scale, which I think makes more sense here. As I understand it, you are treating the fill color as a discrete variable, so the labels should correspond to the levels themselves rather than to the breaks between them, as scale_fill_stepsn implies:
ggplot(data = r) +
geom_raster(aes(x = x, y = y, fill = factor(layer, 10:0))) +
scale_fill_manual(values = rev(cols(11)), name = 'layer') +
coord_equal()
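The reason the palette is reversed together with the factor levels can be seen by pairing them up by hand. This is a sketch in base R; level_to_colour is a hypothetical helper name, and it relies on unnamed manual-scale values being matched to levels in order.

```r
cols <- colorRampPalette(c("yellow", "red", "darkred", "black"))
palette_11 <- cols(11)

# Levels in descending order, as in factor(layer, 10:0).
layer_levels <- levels(factor(0:10, levels = 10:0))

# Unnamed values are matched to levels in order, so reversing the
# palette keeps level "0" on the first (yellow) colour.
level_to_colour <- setNames(rev(palette_11), layer_levels)

level_to_colour[c("0", "10")]
```

Passing a named vector such as level_to_colour to scale_fill_manual(values = ...) makes the pairing explicit and independent of level order.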
EDIT
To get the legend at the bottom, try:
ggplot(data = r) +
geom_raster(aes(x = x, y = y, fill = factor(layer, 0:10))) +
scale_fill_manual(values = cols(11), name = 'layer ') +
coord_equal() +
guides(fill = guide_legend(label.position = 'top', nrow = 1)) +
theme(legend.position = 'bottom',
legend.spacing.x = unit(0, 'mm'),
legend.title = element_text(hjust = 3, vjust = 0.25))

How to create a bar graph with standard deviation and with facets with different X axes in R ggplot?

My data looks like:
I would like to generate barplots of the samples (Sample2) in relation to the counts (log) and create facets for each genus, as:
I used the following codes:
library (ggplot2)
library (ggpubr)
p <- ggbarplot (dataset, x = "Sample2", y = "log", add = c("mean_se", "jitter"), add.params = list(shape = "Origin"), fill = "Origin", palette = c("orange", "blue", "red"), ylim = c(5, 10))
p <- p + facet_grid(~Genera, scales = "free_x", space = "free_x")
But I obtained:
I couldn't find the answer to this issue. If anyone can help me, I'd appreciate it.
I don't know a lot about ggpubr, but here is an outline of how you can construct a similar bar chart with vanilla ggplot2.
library(ggplot2)
# I'm guessing this is roughly the shape of your data
set.seed(0)
df <- data.frame(
Sample2 = rep(c(1:18), each = 5),
log = rnorm(90, rep(rnorm(18), each = 5)) + 5,
Genera = rep(c("Halomonas", "Pseudoalteromonas", "Psychrobacter"), each = 30)
)
df$Origin <- sample(c("Cheese", "Environmental", "Food"), 18, replace = TRUE)[df$Sample2]
ggplot(df, aes(as.factor(Sample2), log, fill = Origin)) +
geom_bar(stat = "summary", fun = mean) +
geom_errorbar(stat = "summary", fun.data = mean_se) +
geom_jitter(width = 0.2) +
facet_grid(~ Genera, scales = "free_x")
Created on 2020-10-04 by the reprex package (v0.3.0)
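For reference, the bar heights and error bars in a plot like this can be reproduced by hand. The sketch below assumes mean_se is used with its default multiplier of 1, so the error bars span the mean plus or minus one standard error; the data and the names sample_id, value and summary_df are made up for illustration.

```r
set.seed(0)
# Made-up data: three samples with five replicates each.
df <- data.frame(sample_id = rep(1:3, each = 5),
                 value = rnorm(15, mean = 5))

# Standard error of the mean.
se <- function(x) sd(x) / sqrt(length(x))

# One row per sample: bar height (mean) and error-bar limits (mean +/- se).
summary_df <- data.frame(
  sample_id = sort(unique(df$sample_id)),
  mean = as.numeric(tapply(df$value, df$sample_id, mean)),
  se   = as.numeric(tapply(df$value, df$sample_id, se))
)
summary_df$ymin <- summary_df$mean - summary_df$se
summary_df$ymax <- summary_df$mean + summary_df$se

summary_df
```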

Adding a label to a straight line in each facet of a facet wrapped ggplot

I've been struggling with one last bit of code to make this graph I'm working on really work for me and my audience. I have a bar chart with two lines (one acts as a rolling average, the other as the peak of that rolling average). What I want to do is label that peak line with a number, once per facet, where the number differs from facet to facet. Here's some stripped down data and code:
tdf <- data.frame(a=as.POSIXct(c("2019-10-15 08:00:00","2019-10-15 09:00:00","2019-10-15 10:00:00","2019-10-15 08:00:00","2019-10-15 09:00:00","2019-10-15 10:00:00")),
b=as.Date(c("2019-09-02","2019-09-02","2019-09-02","2019-09-03","2019-09-03","2019-09-03")),
m1=c(0.2222222,0.3636364, 0.2307692, 0.4000000, 0.3428571, 0.3529412),
m2=c(0.2222222,0.2929293, 0.2972028, 0.3153846, 0.3714286, 0.3529412),
m3=c(0.2929293, 0.2929293, 0.2929293, 0.3529412,0.3529412,0.3529412))
g <- ggplot(data = tdf, aes(x = a, y = m1)) +
geom_bar(stat = "identity", alpha = 0.75, fill = 352) +
xlab("time of day") +
ylab("metric name") +
ggtitle("Graph Title") +
scale_x_datetime(breaks = scales::date_breaks("1 hours"),
date_labels = "%H")+
scale_y_continuous(breaks = c(0,.10,.20,.30,.40,.50,.60,.70,.80,.90,1.0),
labels = scales::percent) +
theme_minimal()
# add line for m2
g <- g +
geom_line(data = tdf,
aes(x = a, y = m2),
color = "blue",
size = 1.2)
# add line for m3
g <- g + geom_line(data=tdf,
aes(x = a, y = m3),
color = "#d95f02",
size = 0.6,
linetype = "dashed")
# last attempt to label the line results in an error: Invalid input: time_trans works with objects of class POSIXct
#g <- g+geom_text(aes(x=-Inf, y=Inf, label=median(tdf$m3)), size=2, hjust=-0.5, vjust= 1.4,inherit.aes=FALSE)
# facet wrap
g <- g + facet_wrap(~b, ncol = 5, scales = "fixed")
I've seen a few techniques, but none of them seem to deal with having a time on the x-axis of the facets, with each facet showing a different date. I'm reasonably certain the error is related to the date, but I have no clue how to make the text block appear in each facet anyway.
You just need to pass a different dataset to the labeling layer that still preserves your faceting variable. This will work using dplyr:
library(dplyr)
g <- g +
geom_text(data = tdf %>%
group_by(b) %>%
summarize(median = median(m3)),
aes(x = as.POSIXct(-Inf, origin="1970-01-01"),
y = Inf,
label = median),
size = 2,
hjust = -0.5,
vjust = 1.4,
inherit.aes = FALSE)
We also have to explicitly convert the x to a date/time value for the axis to work.
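The label data passed to geom_text is just a table with one row per facet. The sketch below builds an equivalent table in base R from the b and m3 columns of the question's data; labels_df and its x column are illustrative names, not part of the answer's code.

```r
# The faceting variable (b) and the peak line (m3) from the question's data.
tdf <- data.frame(
  b  = as.Date(rep(c("2019-09-02", "2019-09-03"), each = 3)),
  m3 = c(0.2929293, 0.2929293, 0.2929293, 0.3529412, 0.3529412, 0.3529412)
)

# One row per facet: the faceting column survives, so each label
# is drawn only in its own panel.
labels_df <- aggregate(m3 ~ b, data = tdf, FUN = median)
names(labels_df)[2] <- "median"

# -Inf expressed as a date-time, so a datetime x scale accepts it.
labels_df$x <- as.POSIXct(-Inf, origin = "1970-01-01")

labels_df[, c("b", "median")]
```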

How do I add a legend to identify vertical lines in ggplot?

I have a chart that shows mobile usage by operating system. I'd like to add vertical lines to identify when those operating systems were released. I'll go through the chart and then the code.
The chart -
The code -
dev %>%
group_by(os) %>%
mutate(monthly_change = prop - lag(prop)) %>%
ggplot(aes(month, monthly_change, color = os)) +
geom_line() +
geom_vline(xintercept = as.numeric(ymd("2013-10-01"))) +
geom_text(label = "KitKat", x = as.numeric(ymd("2013-10-01")) + 80, y = -.5)
Instead of adding the text in the plot, I'd like to create a legend to identify each of the lines. I'd like to give each of them its own color and then have a legend to identify each. Something like this -
Can I make my own custom legend like that?
1) Define a data frame that contains the line data and then use geom_vline with it. Note that BOD is a data frame that comes with R.
line.data <- data.frame(xintercept = c(2, 4), Lines = c("lower", "upper"),
color = c("red", "blue"), stringsAsFactors = FALSE)
ggplot(BOD, aes( Time, demand ) ) +
geom_point() +
geom_vline(aes(xintercept = xintercept, color = Lines), line.data, size = 1) +
scale_colour_manual(values = line.data$color)
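In the code above the colours are matched to the Lines values by position. A small variation, sketched here as an assumption rather than as part of the original answer, is to name the colour vector so the pairing no longer depends on level order; line_colours is a hypothetical name.

```r
line.data <- data.frame(xintercept = c(2, 4), Lines = c("lower", "upper"),
                        color = c("red", "blue"), stringsAsFactors = FALSE)

# Named vector: each legend entry (Lines) is tied to its own colour.
line_colours <- setNames(line.data$color, line.data$Lines)

line_colours
```

The named vector can then be passed as scale_colour_manual(values = line_colours).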
2) Alternatively, put the labels right on the plot itself to avoid an extra legend, using the line.data frame above. This also has the advantage of avoiding possible multiple legends with the same aesthetic.
ggplot(BOD, aes( Time, demand ) ) +
geom_point() +
annotate("text", line.data$xintercept, max(BOD$demand), hjust = -.25,
label = line.data$Lines) +
geom_vline(aes(xintercept = xintercept), line.data, size = 1)
3) If the real problem is that you want two color legends then there are two packages that can help.
3a) ggnewscale Any color geom that appears after invoking new_scale_color will get its own scale.
library(ggnewscale)
BOD$g <- gl(2, 3, labels = c("group1", "group2"))
line.data <- data.frame(xintercept = c(2, 4), Lines = c("lower", "upper"),
color = c("red", "blue"), stringsAsFactors = FALSE)
ggplot(BOD, aes( Time, demand ) ) +
geom_point(aes(colour = g)) +
scale_colour_manual(values = c("red", "orange")) +
new_scale_color() +
geom_vline(aes(xintercept = xintercept, colour = line.data$color), line.data,
size = 1) +
scale_colour_manual(values = line.data$color)
3b) relayer The experimental relayer package (only on GitHub) allows one to define two color aesthetics, color and color2, say, and then have separate scales for each one.
library(dplyr)
library(relayer)
BOD$g <- gl(2, 3, labels = c("group1", "group2"))
ggplot(BOD, aes( Time, demand ) ) +
geom_point(aes(colour = g)) +
geom_vline(aes(xintercept = xintercept, colour2 = line.data$color), line.data,
size = 1) %>% rename_geom_aes(new_aes = c("colour" = "colour2")) +
scale_colour_manual(aesthetics = "colour", values = c("red", "orange")) +
scale_colour_manual(aesthetics = "colour2", values = line.data$color)
You can definitely make your own custom legend, but it is a bit complicated, so I'll take you through it step-by-step with some fake data.
The fake data contains 100 samples from a normal distribution (monthly_change in your data), 5 groupings (similar to the os variable in your data) and a sequence of dates from an arbitrary starting point.
library(tidyverse)
library(lubridate)
y <- rnorm(100)
df <- tibble(y) %>%
mutate(os = factor(rep_len(1:5, 100)),
date = seq(from = ymd('2013-01-01'), by = 1, length.out = 100))
You already use the colour aes for your call to geom_line, so you will need to choose a different aes to map onto the calls to geom_vline. Here, I use linetype and a call to scale_linetype_manual to manually edit the linetype legend to how I want it.
ggplot(df, aes(x = date, y = y, colour = os)) +
geom_line() +
# set `xintercept` to your date and `linetype` to the name of the os which starts
# at that date in your `aes` call; set colour outside of the `aes`
geom_vline(aes(xintercept = min(date),
linetype = 'os 1'), colour = 'red') +
geom_vline(aes(xintercept = median(date),
linetype = 'os 2'), colour = 'blue') +
# in the call to `scale_linetype_manual`, `name` will be the legend title;
# set `values` to 1 for each os to force a solid vertical line;
# use `guide_legend` and `override.aes` to change the colour of the lines in the
# legend to match the colours in the calls to `geom_vline`
scale_linetype_manual(name = 'lines',
values = c('os 1' = 1,
'os 2' = 1),
guide = guide_legend(override.aes = list(colour = c('red',
'blue'))))
And there you go, a nice custom legend. Please do remember next time that if you can provide your data, or a minimally reproducible example, we can better answer your question without having to generate fake data.

ggplot2: histogram with normal curve

I've been trying to superimpose a normal curve over my histogram with ggplot2.
My formula:
data <- read.csv (path...)
ggplot(data, aes(V2)) +
geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)
I tried several things:
+ stat_function(fun=dnorm)
....didn't change anything
+ stat_density(geom = "line", colour = "red")
...gave me a straight red line on the x-axis.
+ geom_density()
doesn't work for me because I want to keep my frequency values on the y-axis, and want no density values.
Any suggestions?
Solution found!
+geom_density(aes(y=0.045*..count..), colour="black", adjust=4)
Think I got it:
library(ggplot2)
set.seed(1)
df <- data.frame(PF = 10*rnorm(1000))
ggplot(df, aes(x = PF)) +
geom_histogram(aes(y =..density..),
breaks = seq(-50, 50, by = 10),
colour = "black",
fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))
This has been answered here and partially here.
The area under a density curve equals 1, and the area under the histogram equals the width of the bars times the sum of their heights, i.e. the binwidth times the total number of non-missing observations. To fit both on the same graph, one or the other needs to be rescaled so that their areas match.
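The area argument can be verified numerically. This is a minimal sketch in base R with simulated data; the names hist_area, scaled_curve and curve_area are illustrative, and the integration limits of 8 standard deviations either side of the mean are an arbitrary choice.

```r
set.seed(1)
x  <- rnorm(350, mean = 20, sd = 5)  # simulated data, illustrative only
bw <- 2                              # binwidth
n  <- length(x)                      # number of non-missing observations

# Bin edges on a grid of width bw that covers the whole data range.
breaks <- seq(floor(min(x) / bw) * bw, ceiling(max(x) / bw) * bw, by = bw)
h <- hist(x, breaks = breaks, plot = FALSE)

# Area of the count histogram: bar width times the sum of bar heights.
hist_area <- sum(h$counts) * bw

# Area under the normal density after multiplying it by bw * n.
scaled_curve <- function(v) dnorm(v, mean(x), sd(x)) * bw * n
curve_area <- integrate(scaled_curve,
                        mean(x) - 8 * sd(x), mean(x) + 8 * sd(x))$value

c(histogram = hist_area, curve = curve_area)
```

Both areas come out at bw * n (up to numerical integration error), which is why multiplying the density by bw * n puts the curve on the count scale.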
If you want the y-axis to have frequency counts, there are a number of options:
First simulate some data.
library(ggplot2)
set.seed(1)
dat_hist <- data.frame(
group = c(rep("A", 200), rep("B",150)),
value = c(rnorm(200, 20, 5), rnorm(150,25,10)))
# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))
Option 1: Plot both histogram and density curve as density and then rescale the y axis
This is perhaps the easiest approach for a single histogram.
Using the approach suggested by Carlos, plot both histogram and density curve as density
g <- ggplot(dat_hist, aes(value)) +
geom_histogram(aes(y = ..density..), binwidth = bw, colour = "black") +
stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))
And then rescale the y axis.
ybreaks = seq(0,50,5)
## On primary axis
g + scale_y_continuous("Counts", breaks = round(ybreaks / (bw * n_obs),3), labels = ybreaks)
## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))
Option 2: Rescale the density curve using stat_function
With code tidied as per PatrickT's answer.
ggplot(dat_hist, aes(value)) +
geom_histogram(colour = "black", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)
Option 3: Create an external dataset and plot using geom_line.
Unlike the above options, this one works with facets. (EDITED to provide dplyr rather than plyr based solution). Note, the summarised dataset is being used as the primary, and the raw passed in for the histogram only.
library(tidyverse)
dat_hist %>%
group_by(group) %>%
nest(data = c(value)) %>%
mutate(y = map(data, ~ dnorm(
.$value, mean = mean(.$value), sd = sd(.$value)
) * bw * sum(!is.na(.$value)))) %>%
unnest(c(data,y)) %>%
ggplot(aes(x = value)) +
geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
geom_line(aes(y = y)) +
facet_wrap(~ group)
Option 4: Create external functions to edit the data on the fly
A bit over the top perhaps, but might be useful for someone?
## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
.x <- na.omit(data[,x])
if(is.null(xlim))
xlim = c(min(.x), max(.x))
x_range = seq(xlim[1], xlim[2], length.out = 101)
setNames(
data.frame(
x = x_range,
y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
c(x, "y"))
}
## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
dat_hists <- lapply(
split(data, data[, group]), dnorm_scaled,
x = x, binwidth = binwidth, xlim = xlim)
for(g in names(dat_hists))
dat_hists[[g]][, "group"] <- g
setNames(do.call(rbind, dat_hists), c(x, "y", group))
}
## Single histogram
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = bw, colour = "black") +
geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw),
aes(y = y))
## With a single faceting variable
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = 2, colour = "black") +
geom_line(data = ~ dnorm_scaled_group(
., x = "value", group = "group", binwidth = 2, xlim = c(0,50)),
aes(y = y)) +
facet_wrap(~ group)
This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around I discovered a way to simplify the code. I'm not saying it is a better way, but I thought I would mention it.
Note that JWilliman's answer provides the count on the y-axis and a "hack" to scale the corresponding density normal approximation (which otherwise would cover a total area of 1 and have therefore a much lower peak).
Main point of this comment: simpler syntax inside stat_function, by passing the needed parameters to the aesthetics function, e.g.
aes(x = x, mean = 0, sd = 1, binwidth = 0.3, n = 1000)
This avoids having to pass args = to stat_function and is therefore more user-friendly. Okay, it's not very different, but hopefully someone will find it interesting.
# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))
ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
theme_bw() +
geom_histogram(binwidth = binwidth,
colour = "white", fill = "cornflowerblue", size = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
color = "darkred", size = 1)
This code should do it:
set.seed(1)
z <- rnorm(1000)
qplot(z, geom = "blank") +
geom_histogram(aes(y = ..density..)) +
stat_density(geom = "line", aes(colour = "bla")) +
stat_function(fun = dnorm, aes(x = z, colour = "blabla")) +
scale_colour_manual(name = "", values = c("red", "green"),
breaks = c("bla", "blabla"),
labels = c("kernel_est", "norm_curv")) +
theme(legend.position = "bottom", legend.direction = "horizontal")
Note: I used qplot but you can use the more versatile ggplot.
Here's a tidyverse informed version:
Setup
library(tidyverse)
Some data
d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")
Preparing data
We'll use a "total" histogram for the whole sample. To that end, we need to remove the grouping information from the data.
d2 <-
d |>
select(-gender)
Here's a data set with summary data:
d_summary <-
d %>%
group_by(gender) %>%
summarise(height_m = mean(height, na.rm = T),
height_sd = sd(height, na.rm = T))
d_summary
Plot it
d %>%
ggplot() +
aes() +
geom_histogram(aes(y = ..density.., x = height, fill = gender)) +
facet_wrap(~ gender) +
geom_histogram(data = d2, aes(y = ..density.., x = height),
alpha = .5) +
stat_function(data = d_summary %>% filter(gender == "female"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "female")$height_m,
sd = filter(d_summary,
gender == "female")$height_sd)) +
stat_function(data = d_summary %>% filter(gender == "male"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "male")$height_m,
sd = filter(d_summary,
gender == "male")$height_sd)) +
theme(legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(title = "Facetted histograms with overlaid normal curves",
caption = "The grey histograms show the whole distribution over both groups, i.e. females and males") +
scale_fill_brewer(type = "qual", palette = "Set1")

Resources