Trying to plot a stacked histogram using ggplot:
set.seed(1)
my.df <- data.frame(param = runif(10000,0,1),
x = runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param, breaks = 5)
require(ggplot2)
Without logging the y-axis:
ggplot(my.df,aes_string(x = "x", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey()
gives:
But I want to log10+1 transform the y-axis to make it easier to read:
ggplot(my.df, aes_string(x = "x", y = "..count..+1", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey() +
scale_y_log10()
which gives:
The tick marks on the y-axis don't make sense.
I get the same behavior if I log10 transform rather than log10+1:
ggplot(my.df, aes_string(x = "x", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey() +
scale_y_log10()
Any idea what is going on?
It looks like invoking scale_y_log10 with a stacked histogram causes ggplot to plot the product of the counts for each component of the stack within each x bin. Below is a demonstration. We create a data frame called product.of.counts that contains, for each x bin, the product of the counts across the param.range bins. We use geom_text to add those values to the plot and see that they coincide with the top of each stack of histogram bars.
At first I thought this was a bug, but after a bit of searching, I was reminded of the way ggplot does the log transformation. As described in the linked answer, "scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense."
As a simpler example, say each of five components of a stacked bar have a count of 100. Then log10(100) = 2 for all five and the sum of the logs will be 10. Then ggplot takes the anti-log for the scale, which gives 10^10 for the total height of the bar (which is 100^5), even though the actual height is 100x5=500. This is exactly what's happening with your plot.
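As a quick check of the arithmetic above, here is a minimal base-R sketch. The vector counts is just the hypothetical five-component bar from the example, not data taken from the plot.

```r
# Five stacked components, each with a count of 100 (hypothetical values)
counts <- rep(100, 5)

# Stacking on the log scale effectively adds the logs ...
stacked_logs <- sum(log10(counts))   # 2 + 2 + 2 + 2 + 2

# ... and the axis then displays the anti-log of that sum
displayed_height <- 10^stacked_logs

# The height you actually wanted is the plain sum of the counts
true_height <- sum(counts)

displayed_height   # same value as prod(counts)
true_height
```

The displayed height matches prod(counts) rather than sum(counts), which is the same pattern the geom_text labels reveal in the full plot.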
library(dplyr)
library(ggplot2)
# Data
set.seed(1)
my.df <- data.frame(param=runif(10000,0,1),x=runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param,breaks=5)
# Calculate product of counts within each x bin
product.of.counts = my.df %>%
group_by(param.range, breaks=cut(x, breaks=seq(-0.05, 1.05, 0.1), labels=seq(0,1,0.1))) %>%
tally %>%
group_by(breaks) %>%
summarise(prod = prod(n),
param.range=NA) %>%
ungroup %>%
mutate(breaks = as.numeric(as.character(breaks)))
ggplot(my.df, aes(x, fill=param.range)) +
geom_histogram(binwidth = 0.1, colour="grey30") +
scale_fill_grey() +
scale_y_log10(breaks=10^(0:14)) +
geom_text(data=product.of.counts, size=3.5,
aes(x=breaks, y=prod, label=format(prod, scientific=TRUE, digits=3)))
I am working on a boxplot with points overlaid and lines connecting the points between two time sets; example data is provided below.
I have two questions:
I would like the points to look like this, with just a little height jitter and more width jitter. However, I want the points to be symmetrically centered around the middle of the boxplot at each y-axis value (to make the plots more visually pleasing). For example, I would like the 6 data points at y = 4 and x = "after" to be placed 3 to the right of the boxplot center and 3 to the left of the center, at symmetrical distances from the center.
Also, I want the lines to connect with the correct points, but now the lines start and end in the wrong places. I know I can use position = position_dodge() in geom_point() and geom_line() to get the correct positions, but I want to be able to adjust the points by height as well (why do the points and lines align with position_dodge() but not with position_jitter()?).
Are these two things possible to achieve?
Thank you!
examiner <- rep(1:15, 2)
time <- rep(c("before", "after"), each = 15)
result <- c(1,3,2,3,2,1,2,4,3,2,3,2,1,3,3,3,4,4,5,3,4,3,2,2,3,4,3,4,4,3)
data <- data.frame(examiner, time, result)
ggplot(data, aes(time, result, fill=time)) +
geom_boxplot() +
geom_point(aes(group = examiner),
position = position_jitter(width = 0.2, height = 0.03)) +
geom_line(aes(group = examiner),
position = position_jitter(width = 0.2, height = 0.03), alpha = 0.3)
I'm not sure that you can satisfy both of your questions together.
You can have a more "symmetric" jitter by using a geom_dotplot, as per:
ggplot(data, aes(time, result, fill=time)) +
geom_boxplot() +
geom_dotplot(binaxis="y", aes(x=time, y=result, group = time),
stackdir = "center", binwidth = 0.075)
The problem is that when you add the lines, they will join at the original, un-jittered points.
To join jittered points with lines that map to the jittered points, the jitter can be added to the data before plotting. As you saw, jittering both ends up with points and lines that don't match. See Connecting grouped points with lines in ggplot for a better explanation.
library(dplyr)
data <- data %>%
mutate(result_jit = jitter(result, amount=0.1),
time_jit = jitter(case_when(
time == "before" ~ 2,
time == "after" ~ 1
), amount=0.1)
)
ggplot(data, aes(time, result, fill=time)) +
geom_boxplot() +
geom_point(aes(x=time_jit, y=result_jit, group = examiner)) +
geom_line(aes(x=time_jit, y=result_jit, group = examiner), alpha=0.3)
Result
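On the question of why position_dodge() lines up but position_jitter() does not: dodging is deterministic, while jittering draws fresh random offsets every time it is applied, so the point layer and the line layer each get their own noise. A rough base-R illustration follows. It calls jitter() directly on made-up positions, which is an assumption standing in for what happens separately in each layer.

```r
x <- c(1, 1, 2, 2)   # hypothetical x positions of four points

set.seed(1)
points_x <- jitter(x, amount = 0.2)  # offsets the point layer might get
lines_x  <- jitter(x, amount = 0.2)  # a second, independent random draw

# Resetting the random state reproduces the first draw exactly
set.seed(1)
again_x <- jitter(x, amount = 0.2)

isTRUE(all.equal(points_x, lines_x))  # independent draws do not match
identical(points_x, again_x)          # same seed, same offsets
```

Storing the jittered coordinates in the data once, as in the code above, sidesteps the problem because both layers read the same columns.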
It is possible to extract the transformed points from the geom_dotplot using ggplot_build() - see Is it possible to get the transformed plot data? (e.g. coordinates of points in dot plot, density curve)
These points can be merged onto the original data, to be used as the anchor points for the geom_line.
Putting it all together:
library(dplyr)
library(ggplot2)
examiner <- rep(1:15, 2)
time <- rep(c("before", "after"), each = 15)
result <- c(1,3,2,3,2,1,2,4,3,2,3,2,1,3,3,3,4,4,5,3,4,3,2,2,3,4,3,4,4,3)
# Create a numeric version of time
data <- data.frame(examiner, time, result) %>%
mutate(group = case_when(
time == "before" ~ 2,
time == "after" ~ 1)
)
# Build a ggplot of the dotplot to extract data
dotpoints <- ggplot(data, aes(time, result, fill=time)) +
geom_dotplot(binaxis="y", aes(x=time, y=result, group = time),
stackdir = "center", binwidth = 0.075)
# Extract values of the dotplot
dotpoints_dat <- ggplot_build(dotpoints)[["data"]][[1]] %>%
mutate(key = row_number(),
x = as.numeric(x),
newx = x + 1.2*stackpos*binwidth/2) %>%
select(key, x, y, newx)
# Join the extracted values to the original data
data <- arrange(data, group, result) %>%
mutate(key = row_number())
newdata <- inner_join(data, dotpoints_dat, by = "key") %>%
select(-key)
# Create final plot
ggplot(newdata, aes(time, result, fill=time)) +
geom_boxplot() +
geom_dotplot(binaxis="y", aes(x=time, y=result, group = time),
stackdir = "center", binwidth = 0.075) +
geom_line(aes(x=newx, y=result, group = examiner), alpha=0.3)
Result
I'm aware there are similar posts but I could not get those answers to work in my case.
e.g. Here and here.
Example:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut))
Returns a plot:
Since I used scale(), the numbers at the axis breaks are z-scores, i.e. the number of standard deviations away from the mean.
I would like to add, as a row underneath each break, the equivalent unscaled raw value.
Tried:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(aes(label = price))
Gives:
Error: geom_text requires the following missing aesthetics: y
My primary question is how can I add the raw values underneath -3:3 of each break? I don't want to change those breaks, I still want 6 breaks between -3:3.
Secondary question, how can I get -3 and 3 to actually show up in the chart? They have been trimmed.
[edit]
I've been trying to make it work with geom_text but keep hitting errors:
diamonds %>%
ggplot(aes(x = scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(label = price)
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomText, :
object 'price' not found
I then tried changing my call to geom_text()
geom_text(data = diamonds, aes(price), label = price)
This results in the same error message.
You can make a custom labeling function for your axis. It receives each break value on the axis and returns the label to display. In your case you could paste together the z-score, a line break, and the z-score times the standard deviation plus the mean. Because of the distribution of prices in the diamonds data set, this means that z-scores below about -1 represent negative prices. This may not be a problem in your own data. For clarity I have drawn in a vertical line representing $0.
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(diamonds$price) * x + mean(diamonds$price)))
}
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
geom_vline(aes(xintercept = -0.98580251364833), linetype = 2) +
facet_wrap(vars(cut)) +
scale_x_continuous(labels = labeller, limits = c(-3, 3)) +
xlab("price")
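The labelling function above relies on undoing scale(): raw value = z * sd + mean. Here is a small sketch of that round trip, using a made-up price vector rather than the diamonds data. It also shows that the z-score where the raw value is 0 equals -mean/sd, which appears to be how a hard-coded xintercept like the one above can be derived.

```r
price <- c(326, 950, 2401, 5324, 18823)   # hypothetical prices

z <- as.vector(scale(price))              # z-scores, as used on the x-axis
back <- z * sd(price) + mean(price)       # undo the scaling

all.equal(back, price)                    # round trip recovers the prices

# z-score at which the raw value is exactly 0
z_zero <- -mean(price) / sd(price)
z_zero * sd(price) + mean(price)          # numerically zero
```

With the real data, replacing the literal in geom_vline() with -mean(diamonds$price) / sd(diamonds$price) would avoid the magic number.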
We can use the sec_axis functionality in scale_x_continuous. To use this functionality we need to manually scale your data. This will add a secondary axis at the top of the plot, not underneath. So it's not quite exactly what you're looking for.
library(tidyverse)
# manually scale the data
mean_price <- mean(diamonds$price)
sd_price <- sd(diamonds$price)
diamonds$price_scaled <- (diamonds$price - mean_price) / sd_price
# make the plot
ggplot(diamonds, aes(price_scaled))+
geom_density()+
facet_wrap(~cut)+
scale_x_continuous(sec.axis = sec_axis(~ mean_price + (sd_price * .)),
limits = c(-3, 4), breaks = -3:3)
You could cheat a bit by passing some dummy data to geom_text:
geom_text(data = tibble(label = round(((-3:3) * sd_price) + mean_price),
y = -0.25,
x = -3:3),
aes(x, y, label = label))
I'd like to add an "id" annotation to certain observations in a histogram.
So far, I'm able to add the annotation with no problem, but I'd like the 'y' position of my annotations to be the count of the bin + 1 (for aesthetic reasons).
This is what I have so far:
library(tidyverse)
library(ggrepel)
selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
values <- rnorm(1000)
plot_df <- tibble(id = paste0("S", 1:1000),
values = values) %>%
mutate(obs_labels = ifelse(id %in% selected_obs, id, NA))
ggplot(plot_df, aes(values)) +
geom_histogram(binwidth = 0.3, color = "white") +
geom_label_repel(aes(label = obs_labels, y = 100))
I've seen multiple answers dealing with annotating the count for each bin using geom_text(stat = "count", aes(y = ..count.., label = ..count..)).
Based on that, I've tried these two work-arounds, but no success:
geom_label_repel(stat = "count", aes(label = obs_labels, y = ..count..)) yields:
"Error: geom_label_repel requires the following missing aesthetics: label"
geom_label_repel(aes(label = obs_labels, y = ..count..)) yields "Error: Aesthetics must be valid computed stats. Problematic aesthetic(s): y = ..count...
Did you map your stat in the wrong layer?".
Anybody that can shed some light here?
That may be a mildly misleading visualisation, because you are labelling a unique ID, but with the positioning of this label to the count height you are suggesting that this ID was counted that often. Anyways.
The most straightforward option is to manually calculate the bin to which each ID belongs, count the observations in that bin, and then use this data to set the x and y for your labels.
Unfortunately, I have to use R online and cannot create a nice reprex, therefore including a screenshot. But the code should be reproducible, as it is running online
library(tidyverse)
library(ggrepel)
selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
values <- rnorm(1000)
plot_df <- tibble(id = paste0("S", 1:1000),
values = values) %>%
mutate(obs_labels = ifelse(id %in% selected_obs, id, NA),
bins = as.factor( as.numeric( cut(values, 30)))) # cutting into 30 bins
label_df<- plot_df %>% filter(id %in% selected_obs) %>% left_join(plot_df, by = 'bins') %>%
group_by(values = values.x, obs_labels = obs_labels.x) %>% count
ggplot(plot_df, aes(values)) +
geom_histogram(color = "white") + # removed your bin argument, as to default to 30
geom_label(data = label_df, aes(label = obs_labels, y = n))
The label positions are not quite perfect - this is because I chose to cut into 30 equal bins and the binning may be slightly different between cut and histogram. This may need some tweaking, depending on the size of your bins, and if you include upper/lower margins.
P.S. Credit to cut into equal bins goes to this answer by user pedrosaurio
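One possible way around the mismatch between cut() and the histogram's own binning is to read the bin edges and counts back from the built plot with ggplot_build(), so the labels use exactly the bins that were drawn. This is only a sketch under a few assumptions: the histogram is the first layer, bins are closed on the right (the default), and plain geom_label() is used instead of ggrepel to keep the example small. The names bins, idx and label_df are illustrative.

```r
library(ggplot2)

selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
plot_df <- data.frame(id = paste0("S", 1:1000), values = rnorm(1000))

p <- ggplot(plot_df, aes(values)) +
  geom_histogram(binwidth = 0.3, color = "white")

# Bin edges and counts exactly as the histogram layer computed them
bins <- ggplot_build(p)$data[[1]][, c("xmin", "xmax", "count")]

# Look up the bin each selected observation falls into
label_df <- plot_df[plot_df$id %in% selected_obs, ]
idx <- findInterval(label_df$values, bins$xmin, left.open = TRUE)
label_df$n <- bins$count[idx]

# Place each label one unit above the top of its bin
p + geom_label(data = label_df, aes(label = id, y = n + 1))
```

Swapping geom_label() for ggrepel::geom_label_repel() should work the same way, since the label data frame already carries explicit x and y values.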
I’m totally new to ggplot, relatively fresh with R and want to make a smashing “before-and-after” scatterplot with connecting lines to illustrate the movement in percentages of different subgroups before and after a special training initiative. I’ve tried some options, but have yet to:
show each individual observation separately (now same values are overlapping)
connect the related before and after measures (x=0 and x=1) with lines to more clearly illustrate the direction of variation
subset the data along class and id using shape and colors
How can I best create a scatter plot using ggplot (or other) fulfilling the above demands?
Main alternative: geom_point()
Here is some sample data and example code using geom_point()
x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of "feelings of peace"
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual
df <- data.frame(x,y,class,id)
ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()
Alternative: scale_size()
I have explored stat_sum() to summarize the frequencies of overlapping observations, but then I cannot distinguish subgroups using colors and shapes because of the overlap.
ggplot(df, aes(x=x, y=y)) +
stat_sum()
Alternative: geom_dotplot()
I have also explored geom_dotplot() to separate the overlapping observations that arise from using geom_point(), as in the example below; however, I have yet to understand how to combine the before and after measures in the same plot.
df1 <- df[1:10,] # data before
df2 <- df[11:20,] # data after
p1 <- ggplot(df1, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
p2 <- ggplot(df2, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
grid.arrange(p1,p2, nrow=1) # requires library(gridExtra)
Or maybe it is better to summarize data by x, id, class as mean/median of y, filter out ids producing NAs (e.g. ids 3 and 6), and connect the points by lines? So in case if you don't really need to show variability for some ids (which could be true if the plot only illustrates tendencies) you can do it this way:
library(ggplot2)
library(dplyr)
#library(ggthemes)
df <- df %>%
group_by(x, id, class) %>%
summarize(y = median(y, na.rm = TRUE)) %>%
ungroup() %>%
mutate(
id = factor(id),
x = factor(x, labels = c("before", "after")),
class = factor(class, labels = c("multiple days", "one day")) # 0 = multiple days, 1 = one day
) %>%
group_by(id) %>%
mutate(nas = any(is.na(y))) %>%
ungroup() %>%
filter(!nas) %>%
select(-nas)
ggplot(df, aes(x = x, y = y, col = id, group = id)) +
geom_point(aes(shape = class)) +
geom_line(show.legend = FALSE) +
#theme_few() +
#theme(legend.position = "none") +
ylab("Feelings of peace, %") +
xlab("")
Here's one possible solution for you.
First - to get the color and shapes determined by variables, you need to put these into the aes function. I turned several into factors, so the labs function fixes the labels so they don't appear as "factor(x)" but just "x".
To address multiple points, one solution is to use geom_smooth with method = "lm". This plots the regression line, instead of connecting all the dots.
The option se = FALSE prevents confidence intervals from being plotted - I don't think they add a lot to your plot, but play with it.
Connecting the dots is done by geom_line - feel free to try that as well.
Within geom_point, the option position = position_jitter(width = .1) adds random noise to the x-axis so points do not overlap.
ggplot(df, aes(x=factor(x), y=y, color=factor(id), shape=factor(class), group = id)) +
geom_point(position = position_jitter(width = .1)) +
geom_smooth(method = 'lm', se = FALSE) +
labs(
x = "x",
color = "ID",
shape = 'Class'
)
Sometimes you want to limit the axis range of a plot to a region of interest so that certain features (e.g. location of the median & quartiles) are emphasized. Nevertheless, it may be of interest to make it clear how many/what proportion of values lie outside the (truncated) axis range.
I am trying to show this when using ggplot2 in R and am wondering whether there is some built-in way of doing this in ggplot2 (or alternatively some sensible solution some of you may have used). I am not particularly wedded to any one way of displaying this (e.g. jittered points with a different symbol at the edge of the plot, a little bar outside the plot whose fill level shows the proportion outside the range, or some other display that conveys the information).
Below is some example code that creates some mock data and the kind of plot I have in mind (shown below the code), but without any clear indication exactly how much data is outside the y-axis range.
library(ggplot2)
set.seed(seed=123)
group <- rep(c(0,1),each=500)
y <- rcauchy(1000, group, 10)
mockdata <- data.frame(group,y)
ggplot(mockdata, aes(factor(group),y)) + geom_boxplot(aes(fill = factor(group))) + coord_cartesian(ylim = c(-40,40))
You may compute these values in advance and display them via e.g. geom_text:
library(dplyr)
upper_lim <- 40
lower_lim <- -40
mockdata$upper_cut <- mockdata$y > upper_lim
mockdata$lower_cut <- mockdata$y < lower_lim
mockdata$group <- as.factor(mockdata$group)
mockpts <- mockdata %>%
group_by(group) %>%
summarise(upper_count = sum(upper_cut),
lower_count = sum(lower_cut))
ggplot(mockdata, aes(group, y)) +
geom_boxplot(aes(fill = group)) +
coord_cartesian(ylim = c(lower_lim, upper_lim)) +
geom_text(y = lower_lim, data = mockpts,
aes(label = lower_count, x = group), hjust = 1.5) +
geom_text(y = upper_lim, data = mockpts,
aes(label = upper_count, x = group), hjust = 1.5)
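Since the question also mentions proportions, the same idea can be adapted to show percentages rather than raw counts. Below is a base-R sketch of that summary step only. The names mockprops, upper_label and lower_label are illustrative and not part of the answer above.

```r
set.seed(123)
group <- rep(c(0, 1), each = 500)
y <- rcauchy(1000, group, 10)
mockdata <- data.frame(group = factor(group), y)

upper_lim <- 40
lower_lim <- -40

# The mean of a logical vector is the share of TRUE values per group
prop_above <- tapply(mockdata$y > upper_lim, mockdata$group, mean)
prop_below <- tapply(mockdata$y < lower_lim, mockdata$group, mean)

# One row per group, with ready-made percentage labels
mockprops <- data.frame(
  group = factor(names(prop_above)),
  upper_label = sprintf("%.1f%%", 100 * as.numeric(prop_above)),
  lower_label = sprintf("%.1f%%", 100 * as.numeric(prop_below))
)
mockprops
```

These label columns could then be passed to geom_text() in place of upper_count and lower_count.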