I understand that the difference between a scale and a coord transform in ggplot2 is that scale transforms are applied before the statistic is computed and coordinate transforms are applied after the statistic is computed. However, I'm having trouble seeing this in an actual example.
Packages
library(ggplot2)
library(gapminder)
Create base plot
base <- ggplot(data = gapminder,
               mapping = aes(x = year,
                             y = gdpPercap * pop,
                             color = continent)) +
  geom_line(stat = "summary")
Scale transform
base +
scale_y_continuous(trans = "log10")
Coordinate transform
base +
coord_trans(y = "log10")
The coord_trans() results in the correct depiction of the data, but I do not understand why.
Note: I have seen this question and it did not fully help
Here's a simpler example that should help explain the differences. Suppose we have two values in a data frame, 1 and 10. The mean of these is 11 / 2 = 5.5.
my_data = data.frame(y = c(1, 10))
mean(my_data$y)
#[1] 5.5
If we take the log (base 10) of those, we get 0 and 1. The average of the logs is (0+1)/2 = 0.5. If we transform that back to the original scale, we get 10^0.5 = 3.162. So we can see that ten to the mean of the logs is not the same as the mean; the log "squishes" the large values so they have less of an impact on the average.
log10(my_data$y)
#[1] 0 1
mean(log10(my_data$y))
#[1] 0.5
10^mean(log10(my_data$y))
#[1] 3.162278
We'll see the same thing if we plot this. Using a coord transformation will control the viewport and the spatial position of the data points (e.g. note that the vertical distance in pixels from 5.00 to 5.25 is a smidge bigger than the distance from 5.75 to 6.00, due to the log scale), but it doesn't change the data points -- we still get an average of 5.5:
ggplot(my_data, aes(y = y, x = 1)) +
geom_point(stat = "summary", fun = "mean") +
coord_trans(y = "log10")
But if we switch to scale_y_log10, the transformation is applied upstream of the mean calculation, so the value we get is ten to the mean of the logs, which we saw is not the same as the arithmetic mean.
ggplot(my_data, aes(y = y, x = 1)) +
geom_point(stat = "summary", fun = "mean") +
scale_y_log10()
I have searched SO and other online sources to no avail.
Is there a way to scale an axis such that z-scores will better reflect the actual difference from 0 to 1 and from 1 to 2 (or any other equally spaced score)?
If I have an x-axis with z-scores ranging from -3 to 3 and axis ticks at every integer between, is there a way to have those axis ticks which are closer to 0 be spaced smaller than those that are farther?
Example:
-3 -2 -1 0 1 2 3
|----------|------|--|--|------|----------|
Am I missing some axis scaling method which accepts both the breaks as values but also the position of the breaks relative to the entire scale?
EDIT:
Maybe not quite a reprex, but this is the structure of the data and basic method of visualization:
library(ggplot2)
library(dplyr) # for the %>% pipe
df <-
  data.frame(
    metric = c('metric1', 'metric2', 'metric3'),
    z_score = c(2, -1.5, 2.8)
  )
df %>%
  ggplot(aes(x = metric, y = z_score)) +
  geom_col() +
  coord_flip() +
  ylim(-4, 4)
The code above produces a plot where the z_score axis has evenly spaced breaks, whereas I would like the breaks to be "pulled" toward zero like I attempted to draw above.
What you describe seems to correspond to a modulus transformation, but I don't know how to choose the correct parameters to get the exact transformation that you want.
Here is an example:
library(ggplot2)
library(scales)
df <- data.frame(
  metric = c('metric1', 'metric2', 'metric3'),
  z_score = c(2, -1.5, 2.8)
)
ggplot(df, aes(x = metric, y = z_score)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(trans = modulus_trans(2),
                     limits = c(-4, 4),
                     breaks = -3:3)
Created on 2020-05-28 by the reprex package (v0.3.0)
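If you are unsure which power to pass to modulus_trans(), one quick check (a sketch added here, not part of the original answer) is to apply the transformation's transform() function to the breaks and inspect the spacing: gaps that widen away from zero mean the breaks nearest zero end up closest together on the drawn axis, which is the "pull" the question asks for.
# Sketch: compare the transformed positions of the integer breaks.
scales::modulus_trans(2)$transform(-3:3)
#> roughly -7.5 -4.0 -1.5  0.0  1.5  4.0  7.5  (gaps widen away from zero)
scales::modulus_trans(3)$transform(-3:3)
#> the spread away from zero is even stronger for larger powers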
The trick to this is to use a new transformation object. There are several already defined in scales::, and the closest I found (though it is opposite, in a sense) is:
ggplot(df, aes(x = metric, y = z_score)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(trans = scales::pseudo_log_trans(0.2, 2),
                     limits = c(-3, 3), breaks = -3:3)
But that expands in the opposite direction from what I think you want. Since the opposite of a pseudo_log would be something like a pseudo_exp, and I didn't find one in scales, here's an attempt:
pseudo_exp_trans <- function(pow = 2) {
  scales::trans_new(
    "pseudo_exp",
    # abs(x)^pow (rather than abs(x^pow)) keeps negative values valid for any pow
    function(x) sign(x) * abs(x)^pow,
    function(x) sign(x) * abs(x)^(1/pow))
}
ggplot(df, aes(x = metric, y = z_score)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(trans = pseudo_exp_trans(),
                     limits = c(-3, 3), breaks = -3:3)
Just play with the pow = argument to find the growth rate you want for the axis.
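For instance (a small sketch using the pseudo_exp_trans() defined above), you can compare how different pow values space the integer breaks before committing to a plot:
# Sketch: transformed break positions for two values of pow.
pseudo_exp_trans(2)$transform(-3:3)
#> -9 -4 -1  0  1  4  9    (gaps grow away from zero, so breaks near zero sit closer together)
pseudo_exp_trans(3)$transform(-3:3)
#> -27 -8 -1  0  1  8 27   (an even stronger pull toward zero)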
I have data with around 25,000 rows, myData, with a column attr that takes values from 0 to 45,600. I am not sure how to make a simplified or reproducible version of the data...
Anyway, I am plotting the density of attr as below, and I also find the attr value at which the density is at its maximum:
library(ggplot2)
max <- which.max(density(myData$attr)$y)
density(myData$attr)$x[max]
ggplot(myData, aes(x=attr))+
geom_density(color="darkblue", fill="lightblue")+
geom_vline(xintercept = density(myData$attr)$x[max])+
xlab("attr")
Here is the plot I have got with the x-intercept at maximum point:
Since the data is skewed, I then attempted to draw the x-axis on a log scale by adding scale_x_log10() to the ggplot; here is the new graph:
My questions now are:
1. Why does it have 2 maximum points now? Why is my x-intercept no longer at the maximum point(s)?
2. How do I find the intercepts for the 2 new maximum points?
Finally, I attempt to convert the y-axis to count instead:
ggplot(myData, aes(x=attr)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3)+
xlab("attr")+
scale_x_log10()
I got the following plot:
3. How do I find the count of the 2 peaks?
Why the density shapes are different
To put my comments into a fuller context, ggplot is taking the log before doing the density estimation, which is causing the difference in shape because the binning covers different parts of the domain. For example,
(bins <- seq(1, 10, length.out = 10))
#> [1] 1 2 3 4 5 6 7 8 9 10
(bins_log <- 10^seq(log10(1), log10(10), length.out = 10))
#> [1] 1.000000 1.291550 1.668101 2.154435 2.782559 3.593814 4.641589
#> [8] 5.994843 7.742637 10.000000
library(ggplot2)
ggplot(data.frame(x = c(bins, bins_log),
                  trans = rep(c('identity', 'log10'), each = 10)),
       aes(x, y = trans, col = trans)) +
  geom_point()
This binning can affect the resulting density shape. For example, compare an untransformed density:
d <- density(mtcars$disp)
plot(d)
to one which is logged beforehand:
d_log <- density(log10(mtcars$disp))
plot(d_log)
Note that the height of the modes flips! I believe what you are asking for is the first one, but with the log transformation applied after the density, i.e.
d_x_log <- d
d_x_log$x <- log10(d_x_log$x)
plot(d_x_log)
Here the modes are similar, just compressed.
Moving to ggplot
When moving to ggplot, to do the density estimation before the log transformation it's easiest to do it outside of ggplot beforehand:
library(ggplot2)
d <- density(mtcars$disp)
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'burlywood', alpha = 0.3) +
scale_x_log10()
Finding modes
Finding modes when there's a single one is relatively easy; it's just d$x[which.max(d$y)]. But when you have multiple modes, that's not good enough, since it will only show you the highest one. A solution is to effectively take the derivative and look for where the slope changes from positive to negative. We can do this numerically with diff, and since we only care about whether the result is positive or negative, call sign on that to turn everything into -1 and 1.* If we call diff on that, everything will be 0 except the maximums and minimums, which will be -2 and 2, respectively. We can then look for which values are less than 0, which we can use to subset. (Because each diff shortens the vector by one, the indices are shifted, so you'll have to add one to them.) Altogether, designed to work on a density object,
d <- density(mtcars$disp)
modes <- function(d) {
  i <- which(diff(sign(diff(d$y))) < 0) + 1
  data.frame(x = d$x[i], y = d$y[i])
}
modes(d)
#> x y
#> 1 128.3295 0.003100294
#> 2 305.3759 0.002204658
d$x[which.max(d$y)] # double-check
#> [1] 128.3295
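If the diff(sign(diff())) step feels opaque, here is a tiny worked example on a toy vector (an illustration added here, not the density data) showing the intermediate values:
v <- c(1, 3, 5, 4, 2, 3, 5)
diff(v)              #>  2  2 -1 -2  1  2
sign(diff(v))        #>  1  1 -1 -1  1  1
diff(sign(diff(v)))  #>  0 -2  0  2  0    (the -2 flags the interior peak)
which(diff(sign(diff(v))) < 0) + 1
#> [1] 3             # v[3] == 5 is the local maximum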
We can add them to our plot, and they'll get transformed nicely:
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'mistyrose', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
Plotting counts instead of density
To turn the y-axis into counts instead of density, multiply y by the number of observations, which is stored in the density object as n:
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'thistle', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
In this case it looks a little silly because there are only 32 observations spread over a wide domain, but with a larger n and smaller domain, it is more interpretable:
d <- density(diamonds$carat, n = 2048)
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'papayawhip', alpha = 0.3) +
geom_point(data = modes(d), aes(y = y * d$n)) +
scale_x_log10()
* Or 0 if the value is exactly 0, but that's unlikely here and will work fine regardless.
Sometimes you want to limit the axis range of a plot to a region of interest so that certain features (e.g. location of the median & quartiles) are emphasized. Nevertheless, it may be of interest to make it clear how many/what proportion of values lie outside the (truncated) axis range.
I am trying to show this when using ggplot2 in R and am wondering whether there is some built-in way of doing this in ggplot2 (or alternatively some sensible solution some of you may have used). I am not wedded to any particular way of displaying this (e.g. jittered points with a different symbol at the edge of the plot, a little bar outside the plot whose fill shows the proportion outside the range, or some other display that conveys the information).
Below is some example code that creates some mock data and the kind of plot I have in mind (shown below the code), but without any clear indication exactly how much data is outside the y-axis range.
library(ggplot2)
set.seed(seed=123)
group <- rep(c(0,1),each=500)
y <- rcauchy(1000, group, 10)
mockdata <- data.frame(group,y)
ggplot(mockdata, aes(factor(group), y)) +
  geom_boxplot(aes(fill = factor(group))) +
  coord_cartesian(ylim = c(-40, 40))
You may compute these values in advance and display them via e.g. geom_text:
library(dplyr)
upper_lim <- 40
lower_lim <- -40
mockdata$upper_cut <- mockdata$y > upper_lim
mockdata$lower_cut <- mockdata$y < lower_lim
mockdata$group <- as.factor(mockdata$group)
mockpts <- mockdata %>%
  group_by(group) %>%
  summarise(upper_count = sum(upper_cut),
            lower_count = sum(lower_cut))
ggplot(mockdata, aes(group, y)) +
  geom_boxplot(aes(fill = group)) +
  coord_cartesian(ylim = c(lower_lim, upper_lim)) +
  geom_text(y = lower_lim, data = mockpts,
            aes(label = lower_count, x = group), hjust = 1.5) +
  geom_text(y = upper_lim, data = mockpts,
            aes(label = upper_count, x = group), hjust = 1.5)
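Since the question also mentions proportions, a minimal variation (a sketch building on the same summary step; the scales package is assumed for the percent formatting) is to summarise the share of points outside the limits and pass those labels to geom_text() in exactly the same way:
mockprops <- mockdata %>%
  group_by(group) %>%
  summarise(upper_label = scales::percent(mean(upper_cut)),
            lower_label = scales::percent(mean(lower_cut)))
# then use mockprops and its *_label columns in the geom_text() calls above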
I'm using stat_summary to display the mean and, based on my calculations, "type1, G-" should have a mean of ~10^7.3. That's the value I get when plotting it without a log10 axis. But when I add the log10 axis, "type1, G-" suddenly shows a value of 10^6.5.
What's going on?
library(ggplot2)
library(reshape2) # melt() comes from reshape2
#Data
Type = rep(c("type1", "type2"), each = 6)
Gen = rep(rep(c("G-", "G+"), each = 3), 2)
A = c(4.98E+05, 5.09E+05, 1.03E+05, 3.08E+05, 5.07E+03, 4.22E+04, 6.52E+05, 2.51E+04, 8.66E+05, 8.10E+04, 6.50E+06, 1.64E+06)
B = c(6.76E+07, 3.25E+07, 1.11E+07, 2.34E+06, 4.10E+04, 1.20E+06, 7.50E+07, 1.65E+05, 9.52E+06, 5.92E+06, 3.11E+08, 1.93E+08)
df = melt(data.frame(Type, Gen, A, B))
#Correct, non-log10 version ("type1 G-" has a value over 1e+07)
ggplot(data = df, aes(x =Type,y = value)) +
stat_summary(fun.y="mean",geom="bar",position="dodge",aes(fill=Gen))+
scale_x_discrete(limits=c("type1"))+
coord_cartesian(ylim=c(10^7,10^7.5))
#Incorrect, log10 version ("type1 G-" has a value under 1e+07)
ggplot(data = df, aes(x =Type,y = value)) +
stat_summary(fun.y="mean",geom="bar",position="dodge",aes(fill=Gen))+
scale_y_log10()
You want coord_trans. As its documentation says:
# The difference between transforming the scales and
# transforming the coordinate system is that scale
# transformation occurs BEFORE statistics, and coordinate
# transformation afterwards.
However, you cannot make a barplot with this, since bars start at 0 and log10(0) is not defined. But barplots are usually not a good visualization anyway.
ggplot(data = df, aes(x =Type,y = value)) +
stat_summary(fun.y="mean",geom="point",position="identity",aes(color=Gen))+
coord_trans(y = "log10", limy = c(1e5, 1e8)) +
scale_y_continuous(breaks = 10^(5:8))
Obviously you should plot some kind of uncertainty information. I'd recommend a boxplot.
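As a quick numeric check (a sketch reusing df from the question), you can see where the 10^6.5 comes from: scale_y_log10() logs the values before stat_summary() averages them, so the bar ends up at the mean of the logs (the geometric mean) rather than at the arithmetic mean.
v <- df$value[df$Type == "type1" & df$Gen == "G-"]
log10(mean(v))  # ~7.27: log of the arithmetic mean, matching the non-log plot
mean(log10(v))  # ~6.47: mean of the logs, which is what scale_y_log10() displays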
I am plotting the results of 50 - 100 experiments.
Each experiment results in a time series.
I can plot a spaghetti plot of all time series, but
what I'd like to have is sort of a density map for the time series plume.
(something similar to the gray shading in the lower panel
in this figure: http://www.ipcc.ch/graphics/ar4-wg1/jpg/fig-6-14.jpg)
I can 'sort of' do this with 2d binning or binhex but the result could be prettier (see example below).
Here is code that reproduces a plume plot for mock data (it uses ggplot2 and reshape2).
library(ggplot2)
library(reshape2)
# mock data: random walk plus a sine curve.
# two envelopes for added contrast.
tt=10*sin(c(1:100)/(3*pi))
rr=apply(matrix(rnorm(5000),100,50),2,cumsum) +tt
rr2=apply(matrix(rnorm(5000),100,50),2,cumsum)/1.5 +tt
# stuff data into a dataframe and melt it.
df=data.frame(c(1:100),cbind(rr,rr2) )
names(df)=c("step",paste("ser",c(1:100),sep=""))
dfm=melt(df,id.vars = 1)
# ensemble average
ensemble_av=data.frame(step=df[,1],ensav=apply(df[,-1],1,mean))
ensemble_av$variable=as.factor("Mean")
ggplot(dfm,aes(step,value,group=variable))+
stat_binhex(alpha=0.2) + geom_line(alpha=0.2) +
geom_line(data=ensemble_av,aes(step,ensav,size=2))+
theme(legend.position="none")
Does anyone know of a nice way to get a shaded envelope with gradients? I have also tried geom_ribbon, but that did not give any indication of density changes along the plume. binhex does that, but not with aesthetically pleasing results.
Compute quantiles:
qs = data.frame(
  do.call(
    rbind,
    tapply(
      dfm$value, dfm$step, function(i){quantile(i)})),
  t = 1:100)
head(qs)
X0. X25. X50. X75. X100. t
1 -0.8514179 0.4197579 0.7681517 1.396382 2.883903 1
2 -0.6506662 1.2019163 1.6889073 2.480807 5.614209 2
3 -0.3182652 2.0480082 2.6206045 4.205954 6.485394 3
4 -0.1357976 2.8956990 4.2082762 5.138747 8.860838 4
5 0.8988975 3.5289219 5.0621513 6.075937 10.253379 5
6 2.0027973 4.5398120 5.9713921 7.015491 11.494183 6
Plot ribbons:
ggplot() +
geom_ribbon(data=qs, aes(x=t, ymin=X0., ymax=X100.),fill="gray30", alpha=0.2) +
geom_ribbon(data=qs, aes(x=t, ymin=X25., ymax=X75.),fill="gray30", alpha=0.2)
This is for two quantile intervals, (0-100) and (25-75). You'll need more args to quantile and more ribbon layers for more quantiles, and need to adjust the colours too.
Based on the idea of Spacedman, I found a way to add more intervals in an automatic way: I first compute the quantiles for each step, group them by pairs of symmetric values and then use geom_ribbon in the right order...
library(tidyr)
library(dplyr)
condquant <- dfm %>% group_by(step) %>%
do(quant = quantile(.$value, probs = seq(0,1,.05)), probs = seq(0,1,.05)) %>%
unnest() %>%
mutate(delta = 2*round(abs(.5-probs)*100)) %>%
group_by(step, delta) %>%
summarize(quantmin = min(quant), quantmax= max(quant))
ggplot() +
geom_ribbon(data = condquant, aes(x = step, ymin = quantmin, ymax = quantmax,
                                  group = reorder(delta, -delta), fill = as.numeric(delta)),
            alpha = .5) +
scale_fill_gradient(low = "grey10", high = "grey95") +
geom_line(data = dfm, aes(x = step, y = value, group=variable), alpha=0.2) +
geom_line(data=ensemble_av,aes(step,ensav),size=2)+
theme(legend.position="none")
Thanks Erwan and Spacedman.
Avoiding 'tidyr' ('dplyr' and 'magrittr'), my version of Erwan's answer becomes:
library(plyr)     # for ddply()
library(reshape2) # for melt()
probs=c(0:10)/10 # use fewer quantiles than Erwan
arr=t(apply(df[,-1],1,quantile,prob=probs))
dfq=data.frame(step=df[,1],arr)
names(dfq)=c("step",colnames(arr))
dfqm=melt(dfq,id.vars=c(1))
# add inter-quantile (per) range as delta
dfqm$delta=dfqm$variable
levels(dfqm$delta)=abs(probs-rev(probs))*100
dfplot=ddply(dfqm,.(step,delta),summarize,
             quantmin=min(value),
             quantmax=max(value) )
ggplot() +
geom_ribbon(data = dfplot, aes(x = step, ymin = quantmin,
                               ymax = quantmax, group = rev(delta),
                               fill = as.numeric(delta)),
            alpha = .5) +
scale_fill_gradient(low = "grey25", high = "grey75") +
geom_line(data=ensemble_av,aes(step,ensav),size=2) +
theme(legend.position="none")