Check if y-axis begins at zero - r

I want to determine if a plot generated by ggplot begins at zero.
I am generating a couple hundred reports that each have thirty or more charts in them. I am content with ggplot's defaults for when a plot starts at zero and when it doesn't, but I want to add a caption that draws the reader's attention to this fact.
Something like:
labs(caption = ifelse(XXXXX, "Note: y-axis does not begin at zero", ""))
But I have no clue what my test should be.

Try this:
library(ggplot2)
g <- ggplot(data.frame(x=1:10, y=0:9), aes(x=x,y=y)) + geom_point()
yrange <- ggplot_build(g)$layout$panel_params[[1]]$y.range
if(yrange[1] > 0) g <- g + labs(caption = "Note: y-axis does not begin at zero")
plot(g)
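For a few hundred reports it may be convenient to wrap the test in a function. The following is a minimal sketch, assuming ggplot2 3.0.0 or later (where the built plot exposes `layout$panel_params`) and a single-panel plot; `zero_caption` is a hypothetical helper name, not part of ggplot2.

```r
library(ggplot2)

# Hypothetical helper: returns the caption text when the lower end of the
# first panel's y-range lies above zero, and NULL (no caption) otherwise.
zero_caption <- function(p) {
  y_range <- ggplot_build(p)$layout$panel_params[[1]]$y.range
  if (y_range[1] > 0) "Note: y-axis does not begin at zero" else NULL
}

# One plot whose data start at zero and one whose data start well above it
p_zero  <- ggplot(data.frame(x = 1:10, y = 0:9),   aes(x, y)) + geom_point()
p_above <- ggplot(data.frame(x = 1:10, y = 50:59), aes(x, y)) + geom_point()

p_zero  <- p_zero  + labs(caption = zero_caption(p_zero))   # no caption added
p_above <- p_above + labs(caption = zero_caption(p_above))  # caption added
```

Because of the default axis expansion, a plot whose data start exactly at zero has a slightly negative lower limit, so `> 0` is the comparison that flags a truncated axis.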

Related

I need ggplot scale_x_log10() to give me both negative and positive numbers as output

I generate a fine histogram here with both positive and negative numbers.
x <- rnorm(5000,0,1000)
library(ggplot2)
df <- data.frame(x)
ggplot(df, aes(x = x)) + geom_histogram()
What I want is to have a logged x-axis. When I do this for only positive numbers with scale_x_log10(), it works like a charm. But here it does not: it either removes my negative numbers or adds them to the positive numbers.
ggplot(df, aes(x = x)) + geom_histogram() + scale_x_log10()
All I really want is for the ticks and the spacing between the ticks to follow the log pattern and for either side of 0 on the x-axis to be mirror images of each other but I cannot seem to get that.
This is possible to do by defining a new transformation (a "signed log", sign(x)*log(abs(x)); the asinh transformation suggested by Histogram with "negative" logarithmic scale in R might be more principled, or a signed square root as suggested in the comments above), but I question whether it's a good idea or not. Nevertheless ... ("Teach a man to fish and you feed him for a lifetime; give him a rope, and he can go hang himself ...") ... you can define your own axis transformations via trans_new as shown below.
Setup:
library(ggplot2); theme_set(theme_bw())
set.seed(101)
df <- data.frame(x=rnorm(5000,0,1000))
Set up the new transformation:
weird <- scales::trans_new("signed_log",
transform=function(x) sign(x)*log(abs(x)),
inverse=function(x) sign(x)*exp(abs(x)))
Try it out -- first on the raw points:
ggplot(df,aes(x,x))+geom_point()+
scale_y_continuous(trans=weird)
Now on the histogram:
ggplot(df, aes(x = x)) + geom_histogram()+
scale_x_continuous(trans=weird)
Things you should worry about:
this transformation is going to be nonsensical when you have values between -1 and 1
you might have to worry about transforming the axis of a histogram without scaling bin height appropriately: it may give you a misleading impression of the probability density -- although in this case ggplot(df, aes(x = weird$transform(x))) + geom_histogram() looks about the same as the plot above ...
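The first caveat can be checked numerically in base R. This sketch compares the signed log with a signed log1p variant on a few values around zero; the function names and `is_increasing` are illustrative only.

```r
signed_log   <- function(x) sign(x) * log(abs(x))
signed_log1p <- function(x) sign(x) * log1p(abs(x))

x <- c(-100, -0.5, 0, 0.5, 100)

signed_log(x)    # -0.5 maps above 0.5, and 0 maps to NaN
signed_log1p(x)  # order is preserved and 0 maps to 0

# TRUE only if the values are strictly increasing and contain no NaN
is_increasing <- function(v) isTRUE(all(diff(v) > 0))

is_increasing(signed_log(x))    # FALSE
is_increasing(signed_log1p(x))  # TRUE
```

An axis transformation has to be monotone to be meaningful, which is why the plain signed log only makes sense when no values fall between -1 and 1.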
This works with small numbers!
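# plot_model() is presumably sjPlot::plot_model(); fit.me.model is the
# poster's own fitted mixed model and is not defined here
if (require("lme4") && require("glmmTMB")) {
  # signed log1p: defined at 0 and monotone for values between -1 and 1
  weird <- scales::trans_new("signed_log",
    transform=function(x) sign(x)*log1p(abs(x)),
    inverse=function(x) sign(x)*expm1(abs(x)))
  p <- plot_model(fit.me.model)
  p + scale_y_continuous(trans=weird)
}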
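The asinh transformation mentioned in the first answer can be set up the same way. This is only a sketch, assuming the scales package is installed; the name `signed_asinh` is made up for illustration.

```r
library(scales)

# asinh behaves like a signed log for large |x| and is roughly linear near 0;
# its inverse is sinh
signed_asinh <- trans_new("signed_asinh", transform = asinh, inverse = sinh)

x <- c(-1000, -1, -0.5, 0, 0.5, 1, 1000)
y <- signed_asinh$transform(x)   # strictly increasing, 0 maps to 0
back <- signed_asinh$inverse(y)  # recovers x up to floating-point error
```

It would be used as `scale_x_continuous(trans = signed_asinh)`. The default breaks may be sparse on such a scale, so explicit `breaks` are probably needed.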

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame making multiple complex plots that each include a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 that obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below, but what I really need to figure out is how to define the y-axis limits so that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
# for each column, sort the bin densities and take the second highest
# (use $counts instead of $density to match a count axis)
myYlim <- sapply(df, function(i) sort(hist(i, plot = FALSE)$density, decreasing = TRUE)[2])
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks=10)), decreasing=TRUE)[2]
Working from the inside out:
cut will bin the data (not really needed with integer data like you have, but probably needed with real data)
table then creates a table with the count of each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
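The bin counts can also be read from the built plot itself, which avoids any mismatch between the bins chosen by hist or cut and those chosen by geom_histogram. A minimal sketch, assuming the histogram is the first layer of the plot; `second_tallest` is a hypothetical helper and the 10% headroom is an arbitrary choice.

```r
library(ggplot2)

set.seed(5)
vals <- data.frame(v = sample(1:10, 100, replace = TRUE,
                              prob = c(0.8, rep(0.2 / 9, 9))))

# Hypothetical helper: height of the second-tallest bar in the first layer
second_tallest <- function(p) {
  counts <- ggplot_build(p)$data[[1]]$count
  sort(counts, decreasing = TRUE)[2]
}

p1 <- ggplot(vals, aes(v)) + geom_histogram(bins = 10)

# coord_cartesian() zooms the view and keeps the tall bar (clipped),
# whereas scale_y_continuous(limits = ...) drops bars that exceed the limit
p2 <- p1 + coord_cartesian(ylim = c(0, second_tallest(p1) * 1.1))
```

Inside the loop from the question, the same two lines could be applied to each `p1` so that every histogram gets its own limit.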

Storing parameters from a graph and applying to other graphs

I would like to store the xmin and xmax parameters from one geom_histogram and apply them to a second geom_histogram.
I am putting both graphs on the same page using grid.arrange and would like them to have the same x range, while allowing the first graph to establish the range based on its data. The second graph is produced from a subset of the first graph's data, so it will not have data that falls outside of the x-range established by the first. But I don't want the range to shrink to fit the second graph. Using the example below I want the second graph to have the same x dimensions as the first graph.
library(ggplot2)
library(scales)
library(gdata)
library(grid)
library(gridExtra)
a<-(ggplot(mpg, aes(x = hwy)) + geom_histogram() + labs(title = "All Cars"))
b<-(ggplot(subset(mpg, cyl == 4), aes(x = hwy)) + geom_histogram() + labs(title = "Just 4 Cylinders"))
grid.arrange(a,b, ncol = 2)
Faceting would clearly be cleaner. This is just a hack to show you how to look inside a ggplot-object.
Try adding this to the b plot commands:
... +xlim( range( a$data[ , a$labels$x] )*c(0.9,1.1) )
The multiplication by c(0.9, 1.1) is needed to make the limits expand in "both directions". If those limits had spanned negative and positive values, a different strategy would have been needed. For interest you might want to look at:
names(a)
str(a)
range( a$data[ , a$labels$x] )
#[1] 12 44
Notice that there were no xlimits set in the coordinates-element and that we needed to use the original data that was stored along with the plot.
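An alternative to reading `a$data` is to take the x-range that ggplot actually computed for the first plot and pass it to the second. This is a sketch that assumes ggplot2 3.0.0 or later (for `layout$panel_params`); the shared `binwidth = 1` is an assumption added here so that the bars are comparable across the two plots.

```r
library(ggplot2)

a <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 1) +
  labs(title = "All Cars")

# x-range of the first panel after the default axis expansion
x_range <- ggplot_build(a)$layout$panel_params[[1]]$x.range

# expand = FALSE stops ggplot from padding the supplied limits again
# (it also removes the padding on the y axis)
b <- ggplot(subset(mpg, cyl == 4), aes(x = hwy)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Just 4 Cylinders") +
  coord_cartesian(xlim = x_range, expand = FALSE)
```

Both plots can then be passed to `grid.arrange(a, b, ncol = 2)` as in the question.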

Scatterplot with ugly margins when using log scale

I have a somewhat "weird" two-dimensional distribution (not normal with some uniform values, but it kinda looks like this.. this is just a minimal reproducible example), and want to log-transform the values and plot them.
library("ggplot2")
library("scales")
df <- data.frame(x = c(rep(0,200),rnorm(800, 4.8)), y = c(rnorm(800, 3.2),rep(0,200)))
Without the log transformation, the scatterplot (incl. rug plot which I need) works (quite) well, apart from a marginally narrower rug plot on the x axis:
p <- ggplot(df, aes(x, y)) + geom_point() + geom_rug(alpha = I(0.5)) + theme_minimal()
p
When plotting the same with a log10-transform though, the points at the margin (at x = 0 and y = 0, respectively) are plotted outside the rug plot or just on the axis (with other data, and only one half side of a point is visible).
p + scale_x_log10() + scale_y_log10()
How can I "rescale" the axes so that all the points are contained fully within the grid and the rug plots are unaffected, as in the first example?
Maybe you want
p + scale_x_log10(oob=squish_infinite) + scale_y_log10(oob=squish_infinite)
I don't really know what you expect to happen for those values that can be negative or infinite, but one general piece of advice when transformations don't do what you want is to perform them outside of ggplot2. Something like this might be useful:
library(plyr)
df2 <- colwise(log10)(df) # log transform columns
df2 <- colwise(squish_infinite)(df2) # do something with infinites
p %+% df2 # plot the transformed data
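To see what happens to the zeros, here is a small sketch on a plain vector, assuming the scales package. The `range` argument is set explicitly to the finite range of the logged values, which is a choice made for this illustration.

```r
library(scales)

x <- c(0, 1, 10, 100)
logged <- log10(x)  # log10(0) is -Inf

# squish_infinite() replaces -Inf and Inf with the ends of `range`
finite_range <- range(logged[is.finite(logged)])
squished <- squish_infinite(logged, range = finite_range)
```

Here the zero ends up at the same position as the value 1 (both become 0 on the log scale), so squishing moves the points onto the edge of the finite data rather than giving them a position of their own.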

How to plot stacked point histograms?

What's the ggplot2 equivalent of "dotplot" histograms? With stacked points instead of bars? Similar to this solution in R:
Plot Histogram with Points Instead of Bars
Is it possible to do this in ggplot2? Ideally with the points shown as stacks and a faint line showing the smoothed line "fit" to these points (which would make a histogram shape.)
ggplot2 does dotplots with geom_dotplot (see the manual).
Here is an example:
library(ggplot2)
set.seed(789); x <- data.frame(y = sample(1:20, 100, replace = TRUE))
ggplot(x, aes(y)) + geom_dotplot()
In order to make it behave like a simple dotplot, we should do this:
ggplot(x, aes(y)) + geom_dotplot(binwidth=1, method='histodot')
You should get this:
To address the density issue, you'll have to add another term, ylim(), so that your plot call will have the form ggplot() + geom_dotplot() + ylim()
More specifically, you'll write ylim(0, A), where A will be the number of stacked dots necessary to count 1.00 density. In the example above, the best you can do is see that 7.5 dots reach the 0.50 density mark. From there, you can infer that 15 dots will reach 1.00.
So your new call looks like this:
ggplot(x, aes(y)) + geom_dotplot(binwidth=1, method='histodot') + ylim(0, 15)
Which will give you this:
Usually, this kind of eyeball estimate will work for dotplots, but of course you can try other values to fine-tune your scale.
Notice how changing the ylim values doesn't affect how the data is displayed, it just changes the labels in the y-axis.
As @joran pointed out, we can use geom_dotplot
require(ggplot2)
ggplot(mtcars, aes(x = mpg)) + geom_dotplot()
Edit: (moved useful comments into the post):
The label "count" is misleading because this is actually a density estimate; maybe you could suggest that this label be changed to "density" by default. The ggplot implementation of dotplot follows the original one by Leland Wilkinson, so if you want to understand clearly how it works, take a look at this paper.
On making the y axis show actual counts, i.e. "number of observations", the help page says:
When binning along the x axis and stacking along the y axis, the numbers on y axis are not meaningful, due to technical limitations of ggplot2. You can hide the y axis, as in one of the examples, or manually scale it to match the number of dots.
So you can use this code to hide y axis:
ggplot(mtcars, aes(x = mpg)) +
geom_dotplot(binwidth = 1.5) +
scale_y_continuous(name = "", breaks = NULL)
I introduce an exact approach using @Waldir Leoncio's latter method.
library(ggplot2); library(grid)
set.seed(789)
x <- data.frame(y = sample(1:20, 100, replace = TRUE))
g <- ggplot(x, aes(y)) + geom_dotplot(binwidth=0.8)
g # output to read parameter
### calculation of width and height of panel
grid.ls(view=TRUE, grob=FALSE)
real_width <- convertWidth(unit(1,'npc'), 'inch', TRUE)
real_height <- convertHeight(unit(1,'npc'), 'inch', TRUE)
### calculation of other values
width_coordinate_range <- diff(ggplot_build(g)$panel$ranges[[1]]$x.range)
real_binwidth <- real_width / width_coordinate_range * 0.8 # 0.8 is the argument binwidth
num_balls <- real_height / 1.1 / real_binwidth # the number of stacked balls. 1.1 is expanding value.
# num_balls is the value of A
g + ylim(0, num_balls)
Apologies : I don't have enough reputation to 'comment'.
I like cuttlefish44's "exact approach", but to make it work (with ggplot2 [2.2.1]) I had to change the following line from :
### calculation of other values
width_coordinate_range <- diff(ggplot_build(g)$panel$ranges[[1]]$x.range)
to
### calculation of other values
width_coordinate_range <- diff(ggplot_build(g)$layout$panel_ranges[[1]]$x.range)
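In ggplot2 3.0.0 and later the same information lives under `panel_params`, the accessor also used in the first answer on this page. A sketch of that line for current versions, with the rest of the calculation unchanged:

```r
library(ggplot2)

set.seed(789)
x <- data.frame(y = sample(1:20, 100, replace = TRUE))
g <- ggplot(x, aes(y)) + geom_dotplot(binwidth = 0.8)

### calculation of other values (ggplot2 >= 3.0.0)
width_coordinate_range <- diff(ggplot_build(g)$layout$panel_params[[1]]$x.range)
```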
