Is there a way to make geom_rect extend to infinity on the y axis of a ggplot object when a transformation is applied to that axis?
The code below does not plot the interval rectangles unless you comment out the scale_y_continuous line. When using the transformed scale, I have to put in actual data limits. I could probably write a function to find the min/max of the other data being plotted to avoid hard coding values but I'm looking for something closer to the Inf approach. I tried using NA instead of Inf but no luck.
library(tidyverse)
data(economics)
ints<-data.frame(start=as.Date(paste0(seq(1970,2020,by=10),"-01-01"))) %>%
mutate(end=start+1785)
plt<-ggplot(economics,aes(date,unemploy)) + theme_bw() +
scale_y_continuous(trans="sqrt") +
geom_rect(data=ints,inherit.aes=F,aes(xmin=start,xmax=end,ymin=-Inf,ymax=Inf)) + geom_line()
plt
library(tidyverse)
data(economics)
ints<-data.frame(start=as.Date(paste0(seq(1970,2020,by=10),"-01-01"))) %>%
mutate(end=start+1785)
plt<-ggplot(economics,aes(date,unemploy)) + theme_bw() +
scale_y_continuous(trans="sqrt") +
geom_rect(data=ints,inherit.aes=F,aes(xmin=start,xmax=end,ymin=0,ymax=Inf)) + geom_line()+
coord_cartesian(ylim=c(2000,20000)) # This will allow you to control how zoomed in you want the plot
plt
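A likely reason the -Inf version draws nothing: the scale transformation is applied to ymin and ymax as well, and the square root of -Inf is not a number. The check below is a sketch of the arithmetic only, in base R, and does not reproduce ggplot2's internals:

```r
# sqrt() is the function behind trans = "sqrt".
lims <- c(-Inf, 0, Inf)

# suppressWarnings() hides the "NaNs produced" warning for -Inf.
transformed <- suppressWarnings(sqrt(lims))
print(transformed)
# [1] NaN   0 Inf

# Only the first limit is unusable after the transformation.
print(is.nan(transformed))
```

This is why ymin=0 works here: 0 is the smallest value the square-root scale can represent, while Inf passes through unchanged.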
I'm studying the example of coord_trans() of ggplot2:
library(ggplot2)
library(scales)
set.seed(4747)
df <- data.frame(a = abs(rnorm(26)),letters)
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + coord_trans(x = "log10")
plot + coord_trans(x = "sqrt")
I modified the code plot + coord_trans(x = "log10") as follows and got what I expected:
plot + scale_x_log10(breaks=trans_breaks("log10", function(x) 10^x),
labels=trans_format("log10", math_format(10^.x)))
I modified the code plot + coord_trans(x = "sqrt") as follows and got a strange x-axis:
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) sqrt(x)),
labels=trans_format("sqrt", math_format(.x^0.5)))
How could I fix the problem?
I get why you said it was a strange / terrible axis. The documentation for trans_breaks even warns you about this in its first line:
These often do not produce very attractive breaks.
To make it less unattractive, I would use round(,2) so that the axis labels have only 2 decimal places instead of the default 8 or 9, which clutter up the axis. Then I would set a sensible range, in your case 0 to 5 (c(0,5)).
Finally, you can specify the number of ticks for your axis using n in the trans_breaks call.
So putting it together, here's how you can format your x-axis and its tick label in the scale_x_sqrt(x) format:
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) round(sqrt(x),2), n=5)(c(0, 5)))
Produces this:
The c(0,5) is passed to pretty(), a lesser-known base R function. From the documentation, pretty does the following:
Compute a sequence of about n+1 equally spaced "round" values which cover the range of the values in x.
pretty(c(0,5)) simply produces [1] 0 1 2 3 4 5 in our case.
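To see where the breaks come from, it can help to reproduce the computation by hand. The sketch below assumes that trans_breaks(trans, inv, n) builds a function equivalent to inv(pretty(trans(range), n)). manual_breaks is a hypothetical helper written for illustration and is not part of the scales package:

```r
# pretty() on its own: "round" values covering the range.
print(pretty(c(0, 5)))
# [1] 0 1 2 3 4 5

# Hypothetical helper mirroring the assumed behaviour of trans_breaks():
# transform the range, pick pretty values there, then apply `inv`.
manual_breaks <- function(trans, inv, n = 5) {
  function(range) inv(pretty(trans(range), n))
}

# Same arguments as in the scale_x_sqrt() call above.
brk <- manual_breaks(sqrt, function(x) round(sqrt(x), 2), n = 5)(c(0, 5))
print(brk)
```

If that assumption holds, brk contains the values that end up in the breaks argument, so they can be inspected before plotting.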
You can fine-tune the axis further by changing the parameters. Here the code uses 3 decimal places (round(x,3)) and asks for 3 ticks (n=3):
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) round(sqrt(x),3), n=3)(c(0, 5)))
Produces this:
EDIT based on OP's additional comments:
To get round integer values, floor() or round(x,0) works, so the following code:
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) round(sqrt(x),0), n=5)(c(0, 5)))
Produces this:
I'm trying to create a horizontal boxplot with a logarithmic axis using ggplot2, but the whisker lengths are wrong.
A minimal reproducible example:
Some data
library(ggplot2)
library(reshape2)
set.seed(1234)
my.df <- data.frame(a = rnorm(1000,150,50), b = rnorm(1000,500,150))
my.df$a[which(my.df$a < 5)] <- 5
my.df$b[which(my.df$b < 5)] <- 5
If I plot this using base R boxplot(), everything is fine
boxplot(my.df, log="x", horizontal=T)
But with ggplot,
my.df.long <- melt(my.df, value.name = "vals")
ggplot(my.df.long, aes(x=variable, y=vals)) +
geom_boxplot() +
scale_y_log10(breaks=c(5,10,20,50,100,200,500,1000), limits=c(5,1000)) +
theme_bw() + coord_flip()
I get this plot, in which the whiskers are the wrong length (see for example how there are many additional outliers below the whiskers and none above).
Note that, without log axes, ggplot has the whiskers the correct length
ggplot(my.df.long, aes(x=variable, y=vals)) +
geom_boxplot() +
theme_bw() + coord_flip()
How do I produce a horizontal logarithmic boxplot using ggplot with the correct length whiskers? Preferably with the whiskers extending to 1.5 times the IQR.
N.B. As explained here, it is possible to use coord_trans(y = "log10") instead of scale_y_log10, which causes the stats to be calculated before the data are transformed. However, coord_trans cannot be used in combination with coord_flip, so this does not solve the issue of creating horizontal boxplots with a log axis.
You can have ggplot use boxplot.stats (the same function used by base boxplot) to set the y-values for the box-and-whiskers and the outliers. For example:
# Function to use boxplot.stats to set the box-and-whisker locations
mybxp = function(x) {
bxp = boxplot.stats(x)[["stats"]]
names(bxp) = c("ymin","lower", "middle","upper","ymax")
return(bxp)
}
# Function to use boxplot.stats for the outliers
myout = function(x) {
data.frame(y=boxplot.stats(x)[["out"]])
}
Now we use those functions in stat_summary to draw the boxplot, as in the example below:
ggplot(my.df.long, aes(x=variable, y=vals)) +
stat_summary(fun.data=mybxp, geom="boxplot") +
stat_summary(fun.data=myout, geom="point") +
theme_bw() + coord_flip()
Now for the log transformation issue: The plots below show, respectively, no coordinate transformation, scale_y_log10, and coord_trans(y="log10"). In addition, I've used geom_hline to add dotted lines at each of the box-and-whisker values and I've added text to show the actual values. To reduce clutter, I've removed the outlier points, and I've faded out the boxplots a bit so that the other components will show up better.
# Set up common plot elements
p = ggplot(my.df.long, aes(x=variable, y=vals)) +
geom_hline(yintercept=mybxp(my.df$a), colour="red", lty="11", size=0.3) +
geom_hline(yintercept=mybxp(my.df$b), colour="blue", lty="11", size=0.3) +
stat_summary(fun.data=mybxp, geom="boxplot", colour="#000000A0", fatten=0.5) +
#stat_summary(fun.data=myout, geom="point") +
theme_bw() + coord_flip()
br = c(5,10,20,50,100,200,500,1000)
## Create plots
# Without log transformation
p1 = p + scale_y_continuous(breaks=br, limits=c(5,1000)) +
stat_summary(fun.y=mybxp, aes(label=round(..y..)), geom="text", size=3, colour="red") +
ggtitle("No Transformation")
# With scale_y_log10
p2 = p + scale_y_log10(breaks=br, limits=c(5,1000)) + ggtitle("scale_y_log10") +
stat_summary(fun.y=mybxp, aes(label=round(..y..,2)), geom="text", size=3, colour="red") +
stat_summary(fun.y=mybxp, aes(label=round(10^(..y..))), geom="text", size=3,
colour="blue", position=position_nudge(x=0.3))
# With coord_trans
p3 = p + scale_y_continuous(breaks=br, limits=c(5,1000)) +
stat_summary(fun.y=mybxp, aes(label=round(..y..)), geom="text", size=3, colour="red") +
coord_trans(y="log10") + ggtitle("coord_trans(y='log 10')")
The three plots are shown below. Note that the last plot, using coord_trans is not flipped, because coord_trans overrides coord_flip. You can probably use something like the code in this SO answer to flip the plot, but I haven't done that here.
The first plot, with no transformations, shows the correct values.
The third plot, using coord_trans also has everything in the correct locations. Note that coord_trans is actually changing the y-coordinate system of the plot without changing the values of the plotted points. It's the space itself that's been "distorted" to a log scale.
Now, note that in the second plot, using scale_y_log10, the boxes are in the correct locations but the ends of the whiskers are in the wrong locations. On the other hand, comparison with the other two plots shows that the location of all the geom_hlines is correct. Also note that, unlike coord_trans, scale_y_log10 takes the log of the points themselves and just relabels the y-axis breaks with the unlogged values, while leaving the "space" in which the points are plotted unchanged. You can see this by looking at the values in red text. The values in blue text are the unlogged values.
See @dww's answer for an explanation of why scale_y_log10 results only in the whisker ends being transformed incorrectly, while the box values are plotted in the right place.
The problem is that scale_y_log10 transforms the data before calculating the stats. This does not matter for the median and percentile points, because e.g. 10^log10(median) is still the median value, which will be plotted in the correct location. But it does matter for the whiskers, which are calculated using 1.5 * IQR, because 10^(1.5 * IQR(log10(x))) is not equal to 1.5 * IQR(x). So the calculation fails for the whiskers.
This error becomes evident if we compare
boxplot.stats(my.df$b)$stats
# [1] 117.4978 407.3983 502.0460 601.2937 873.0992
10^boxplot.stats(log10(my.df$b))$stats
# [1] 231.1603 407.3983 502.0459 601.2935 975.1906
Here we see that the median and percentile points are identical, but the whisker ends (the first and last elements of the stats vector) differ.
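The same effect shows up when the whisker limits (the 1.5 * IQR fences) are computed directly. This is a rough sketch: it uses quantile() rather than the hinges that boxplot.stats relies on, and the variable names are my own, so the numbers should be close to, but not identical with, the whisker ends above (whiskers stop at the most extreme data point inside the fence).

```r
set.seed(1234)
a <- rnorm(1000, 150, 50)   # drawn first, as in the question, so b should match
b <- rnorm(1000, 500, 150)
b[b < 5] <- 5

# Fences computed on the raw data.
q <- quantile(b, c(0.25, 0.75), names = FALSE)
fence_raw <- c(q[1] - 1.5 * diff(q), q[2] + 1.5 * diff(q))

# Fences computed on the log10 data, then transformed back.
lq <- quantile(log10(b), c(0.25, 0.75), names = FALSE)
fence_log <- 10^c(lq[1] - 1.5 * diff(lq), lq[2] + 1.5 * diff(lq))

print(rbind(raw = fence_raw, log = fence_log))
```

Both log-scale fences sit above their raw-scale counterparts, which is consistent with the plot in the question: extra outliers below the whiskers and none above.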
This detailed and useful answer by @eipi10 shows how to calculate the stats yourself and force ggplot to use these user-defined stats rather than its internal (and incorrect) algorithm. Using this approach, it becomes relatively simple to calculate the correct statistics and use these instead.
# Function to use boxplot.stats to set the box-and-whisker locations
mybxp = function(x) {
bxp = log10(boxplot.stats(10^x)[["stats"]])
names(bxp) = c("ymin","lower", "middle","upper","ymax")
return(bxp)
}
# Function to use boxplot.stats for the outliers
myout = function(x) {
data.frame(y=log10(boxplot.stats(10^x)[["out"]]))
}
ggplot(my.df.long, aes(x=variable, y=vals)) + theme_bw() + coord_flip() +
scale_y_log10(breaks=c(5,10,20,50,100,200,500,1000), limits=c(5,1000)) +
stat_summary(fun.data=mybxp, geom="boxplot") +
stat_summary(fun.data=myout, geom="point")
Which produces the correct plot
A note on using coord_trans as an alternative approach:
Using coord_trans(y = "log10") instead of scale_y_log10 causes the stats to be calculated (correctly) on the untransformed data. However, coord_trans cannot be used in combination with coord_flip, so this does not solve the issue of creating horizontal boxplots with a log axis. The suggestion here to use ggdraw(switch_axis_position()) from the cowplot package to flip the axes after using coord_trans did not work; it throws an error (cowplot v0.4.0 with ggplot2 v2.1.0):
Error in Ops.unit(gyl$x, grid::unit(0.5, "npc")) : both operands
must be units
In addition: Warning message: axis.ticks.margin is
deprecated. Please set margin property of axis.text instead
I think that the easiest answer, if you don't need to make the boxplots horizontal, is to transform the coordinate system instead of changing the scale, using coord_trans(y = "log10") instead of scale_y_log10().
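With a more recent ggplot2 (3.3.0 or later, as far as I know), geom_boxplot can be drawn horizontally by mapping the continuous variable to x, so coord_flip is not needed and coord_trans can be applied to the x axis. A minimal sketch under that assumption, using the data from the question but reshaped with stack() instead of melt() (my choice, to avoid the reshape2 dependency):

```r
library(ggplot2)

set.seed(1234)
my.df <- data.frame(a = rnorm(1000, 150, 50), b = rnorm(1000, 500, 150))
my.df$a[my.df$a < 5] <- 5
my.df$b[my.df$b < 5] <- 5

# Long format: stack() returns the columns `values` and `ind`.
my.df.long <- stack(my.df)

# The continuous variable on x gives horizontal boxes. The stats are
# computed on the untransformed data; only the coordinate system is logged.
p <- ggplot(my.df.long, aes(x = values, y = ind)) +
  geom_boxplot() +
  scale_x_continuous(breaks = c(5, 10, 20, 50, 100, 200, 500, 1000)) +
  coord_trans(x = "log10") +
  theme_bw()
p
```

On older versions such as the ggplot2 v2.1.0 mentioned above this will not work, and the stat_summary approach remains the way to go.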
Why do the following plots look different? Both methods appear to use Gaussian kernels.
How does ggplot2 compute a density?
library(ggplot2)
library(fueleconomy)
d <- density(vehicles$cty, n=2000)
ggplot(NULL, aes(x=d$x, y=d$y)) + geom_line() + scale_x_log10()
ggplot(vehicles, aes(x=cty)) + geom_density() + scale_x_log10()
UPDATE:
A solution to this question already appears on SO here; however, the specific parameters ggplot2 passes to the R stats density function remain unclear.
An alternate solution is to extract the density data straight from the ggplot2 plot, as shown here
In this case, it is not the density calculation that is different but how
the log10 transform is applied.
First check the densities are similar without transform
library(ggplot2)
library(fueleconomy)
d <- density(vehicles$cty, from=min(vehicles$cty), to=max(vehicles$cty))
ggplot(data.frame(x=d$x, y=d$y), aes(x=x, y=y)) + geom_line()
ggplot(vehicles, aes(x=cty)) + stat_density(geom="line")
So the issue seems to be the transform. In the stat_density call below, it seems as
if the log10 transform is applied to the x variable before the density calculation.
So to reproduce the results manually you have to transform the variable prior to
calculating the density, e.g.:
d2 <- density(log10(vehicles$cty), from=min(log10(vehicles$cty)),
to=max(log10(vehicles$cty)))
ggplot(data.frame(x=d2$x, y=d2$y), aes(x=x, y=y)) + geom_line()
ggplot(vehicles, aes(x=cty)) + stat_density(geom="line") + scale_x_log10()
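The effect of the order of operations can also be checked numerically without ggplot2. In this sketch the data are synthetic (a log-normal sample standing in for vehicles$cty) and trapz is a small helper defined here. The point is only that a density estimated on log10(x) integrates to about 1 along the log axis, whereas a density estimated on x and merely drawn against a log axis does not.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 3, sdlog = 0.3)  # synthetic stand-in for vehicles$cty

# Trapezoidal rule for the area under a curve given as (x, y) points.
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Density estimated on the raw values.
d_raw <- density(x, from = min(x), to = max(x))

# Density estimated after the log10 transform.
d_log <- density(log10(x), from = min(log10(x)), to = max(log10(x)))

# Areas measured along the log10 axis in both cases.
area_log <- trapz(d_log$x, d_log$y)
area_raw_on_log_axis <- trapz(log10(d_raw$x), d_raw$y)

print(c(log_first = area_log, raw_on_log_axis = area_raw_on_log_axis))
```

The two curves are therefore different objects, not the same density drawn two ways, which is why the plots in the question do not match.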
PS: To see how ggplot prepares the data for the density, you can look at the code: as.list(StatDensity) leads to StatDensity$compute_group, which in turn leads to ggplot2:::compute_density.