Weird behavior with ggplot2 geom_histogram - r

I've run into a weird issue regarding geom_histogram and it can easily be seen by plotting the uniform distribution.
library(tidyverse)
u <- runif(10000)
ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram()
Generates the following count histogram:
The histogram is rather asymmetrical at the edges. I've read somewhere that this is because the leftmost (and rightmost) bin is centered around 0 (and 1), and since no values are generated below 0 (or above 1), those bins come out roughly half as full as the others. So I fiddled with the boundary parameter and got:
ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram(boundary = 0)
Similarly, setting boundary = 1 made it asymmetric on the left.
My questions are: what is the preferred way to fix this behavior? Also, what exactly does the boundary argument do? It is not very clear from the docs.
Thank you.

As far as I can tell, boundary specifies a value that must fall on the split between two bins. The remaining bin edges follow from the number of bins (or the bin width) or from supplied break points. If the supplied boundary is outside the range of the data, it is shifted by an integer multiple of the bin width, according to the documentation. Maybe the following examples make it clearer what boundary does.
Workaround
If you set limits for the x axis, you can circumvent the issue, although it is not a very elegant solution.
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.1.0
set.seed(123)
u <- runif(1000)
p1 <- ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram(boundary = 0)
p2 <- ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram(boundary = 0) +
scale_x_continuous(limits = c(0, 1))
cowplot::plot_grid(p1, p2, nrow = 2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Boundary examples
In the third plot (p3), boundary is set to 0.5, and you can see that two bins are split exactly at 0.5. The same holds for the fourth plot at 0.75. "Boundary" is arguably not the best name for what it does; it simply states that the given number should be an edge between two bins.
u <- runif(10)
p3 <- ggplot(data = as_tibble(u), aes(x = value)) +
geom_histogram(bins = 22, boundary = .5, binwidth = 0.1) +
scale_x_continuous(limits = c(0, 1))
p4 <- ggplot(data = as_tibble(u), aes(x = value)) +
geom_histogram(bins = 22, boundary = .75, binwidth = 0.1) +
scale_x_continuous(limits = c(0, 1))
cowplot::plot_grid(p3, p4, nrow = 2)
#> Warning: Removed 2 rows containing missing values (geom_bar).
Created on 2021-06-13 by the reprex package (v2.0.0)
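To make the idea concrete without plotting, here is a minimal base R sketch of how a boundary can pin down the bin edges. The helper edges_with_boundary is hypothetical (it is not a ggplot2 function) and only approximates what stat_bin does: it shifts the boundary by a whole number of bin widths until the first edge sits at or below the smallest value.

```r
# Hypothetical helper (not part of ggplot2): fixed-width bin edges that are
# aligned so that `boundary` coincides with one of the edges.
edges_with_boundary <- function(x, binwidth, boundary) {
  # Shift the boundary by a whole number of bin widths so the first edge
  # lands at or just below the smallest value.
  shift <- floor((min(x) - boundary) / binwidth)
  origin <- boundary + shift * binwidth
  # Enough bins to cover the largest value.
  n_bins <- ceiling((max(x) - origin) / binwidth)
  origin + binwidth * seq(0, n_bins)
}

set.seed(123)
u <- runif(10)

edges_05 <- edges_with_boundary(u, binwidth = 0.1, boundary = 0.5)
edges_075 <- edges_with_boundary(u, binwidth = 0.1, boundary = 0.75)

print(edges_05)
print(edges_075)
```

Both sets of edges are 0.1 apart; only their alignment differs, which is the shift visible between p3 and p4.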

Related

Histogram not showing correct count/values? (Histogram vs Geom Freqpoly)

I have a dataset for the 2002 NYC Marathon and the places of each person. I also have the gender for each person.
When I plot a histogram, grouping by gender, the counts for female are off!
When I plot a FreqPoly plot, the distribution is as expected based on the data.
Can anyone explain this discrepancy? The red bars are for females and the blue bars are for males. The same colors apply to the freq_poly graph.
The red line is where the female racers' counts should be, but the histogram shows them at much higher values. Why?
To elaborate what teunbrand states in the comment, the problem is that your histogram bars are being stacked on top of each other. This is because the default position argument for geom_histogram is position = "stack". This is in contradistinction to geom_freqpoly where the default is position = "identity".
Thus, all you need to do is add position = "identity":
data(nym.2002, package = "UsingR")
ggplot(nym.2002, aes(x = place)) +
geom_freqpoly(aes(color = gender)) +
geom_histogram(aes(fill = gender),
alpha = 0.2,
position = "identity")
If you check out help(geom_freqpoly), you can find the default arguments for yourself.
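The arithmetic behind the two position settings can be sketched in base R. This is only an illustration on simulated data with bin edges I picked myself (ggplot2 would choose its own 30 bins), not the marathon data: with "identity" each group's bar top is its own count, while with "stack" one group's bar sits on top of the other, so its top is the sum of both.

```r
set.seed(1)
x <- rnorm(100)
g <- rep(1:2, 50)

# Illustrative bin edges (an assumption; not what ggplot2 picks by default).
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)

# Rows are bins, columns are the two groups.
counts <- table(cut(x, breaks, include.lowest = TRUE), g)

# position = "identity": the top of group 2's bar is its own count.
identity_top <- counts[, "2"]

# position = "stack": group 2's bar is drawn on top of group 1's bar,
# so its top sits at the sum of both counts.
stacked_top <- counts[, "1"] + counts[, "2"]

print(cbind(identity_top, stacked_top))
```

Reading the stacked top as if it were the group's own count is what makes the histogram look too high.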
Not an answer but a visualisation of the different position options as discussed in Ian Campbell's and teunbrand's answers
library(ggplot2)
set.seed(1)
p1 <- ggplot()+
geom_histogram(data = data.frame(x = rnorm(100), g = rep(1:2, 50)), aes(x, fill = factor(g)), position = "dodge")+
ggtitle("position = dodge")
set.seed(1)
p2 <- ggplot()+
geom_histogram(data = data.frame(x = rnorm(100), g = rep(1:2, 50)), aes(x, fill = factor(g)), position = "identity")+
ggtitle("position = identity")
set.seed(1)
p3 <- ggplot()+
geom_histogram(data = data.frame(x = rnorm(100), g = rep(1:2, 50)), aes(x, fill = factor(g)))+
ggtitle("position = stack")
library(patchwork)
p1/p2/p3
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2020-07-11 by the reprex package (v0.3.0)

ggplot2: space axis ticks unevenly between equidistant values

I have searched SO and other online sources to no avail.
Is there a way to scale an axis such that z-scores will better reflect the actual difference from 0 to 1 and from 1 to 2 (or any other equally spaced score)?
If I have an x-axis with z-scores ranging from -3 to 3 and axis ticks at every integer between, is there a way to have those axis ticks which are closer to 0 be spaced smaller than those that are farther?
Example:
-3         -2     -1 0  1      2          3
|----------|------|--|--|------|----------|
Am I missing some axis scaling method which accepts both the breaks as values but also the position of the breaks relative to the entire scale?
EDIT:
Maybe not quite a reprex, but this is the structure of the data and basic method of visualization:
df <-
data.frame(
metric = c('metric1', 'metric2', 'metric3'),
z_score = c(2, -1.5, 2.8)
)
df %>%
ggplot(aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
ylim(-4,4)
The code above produces a plot where the z_score axis has evenly spaced breaks, whereas I would like the breaks to be "pulled" toward zero like I attempted to draw above.
What you describe seems to correspond to a modulus transformation, but I don't know how to choose the correct parameters to get the exact transformation that you want.
Here is an example:
library(ggplot2)
library(scales)
df <- data.frame(
metric = c('metric1', 'metric2', 'metric3'),
z_score = c(2, -1.5, 2.8)
)
ggplot(df, aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
scale_y_continuous(trans = modulus_trans(2),
limits = c(-4, 4),
breaks = c(-3:3))
Created on 2020-05-28 by the reprex package (v0.3.0)
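For choosing the parameter, it may help to look at where the breaks end up for a few values of p. The sketch below is a hand-written version of the modulus formula for p != 0, based on my reading of the scales documentation (treat the exact formula and the default offset of 1 as assumptions); it prints the gaps between consecutive breaks on the transformed axis.

```r
# Assumed formula for the modulus transformation with p != 0
# (my reading of the scales documentation, not copied from its source).
modulus <- function(x, p, offset = 1) {
  sign(x) * ((abs(x) + offset)^p - 1) / p
}

breaks <- -3:3
for (p in c(1, 2, 3)) {
  # Distance between neighbouring breaks after the transformation.
  gaps <- diff(modulus(breaks, p))
  cat("p =", p, "gaps:", format(gaps), "\n")
}
```

With p = 1 the gaps are all equal; larger p makes the gaps grow faster away from zero.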
The trick to this is to use a new transformation object. There are several already defined in scales::, and the closest I found (though it is opposite, in a sense) is:
ggplot(df, aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
scale_y_continuous(trans=scales::pseudo_log_trans(0.2, 2),
limits = c(-3, 3), breaks = -3:3)
But that has the opposite expansion I think you want. Since one way to see the opposite of pseudo_log would be pseudo_exp, and I didn't find one, here's an attempt:
pseudo_exp_trans <- function(pow = 2) {
scales::trans_new(
"pseudo_exp",
function(x) sign(x) * abs(x)^pow,
function(x) sign(x) * abs(x)^(1/pow))
}
ggplot(df, aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
scale_y_continuous(trans=pseudo_exp_trans(),
limits = c(-3, 3), breaks = -3:3)
Just play with the pow= argument to find the growth-rate you want in the axis.
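To see what the pow argument does to the tick spacing without drawing a plot, the forward and inverse functions can be applied to the breaks directly. This is just a sketch using the same signed power idea as above, with pow = 2.

```r
pow <- 2
forward <- function(x) sign(x) * abs(x)^pow
inverse <- function(x) sign(x) * abs(x)^(1 / pow)

breaks <- -3:3

# Where each break lands on the transformed axis.
positions <- forward(breaks)
print(positions)        # -9 -4 -1 0 1 4 9

# Gaps between neighbouring ticks: small near 0, larger further out.
print(diff(positions))  # 5 3 1 1 3 5

# The inverse maps the positions back to the original breaks.
print(all.equal(inverse(positions), as.numeric(breaks)))
```

A larger pow widens the outer gaps faster; pow = 1 gives back the evenly spaced axis.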

ggplot2 grids overlap out of boundary

I create points uniformly in [0,1], and each point has an observation. But ggplot shows some observations larger than 1, which is outside of the boundary. How come this can happen even though coordinates are within 0 and 1 range? Do you have any idea how to avoid this?
x=runif(10^6)
y=runif(10^6)
z=rnorm(10^6)
new.data=data.frame(x,y,z)
library(ggplot2)
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z))
It’s an issue related to the grid used for binning.
Let’s use a smaller example.
set.seed(42)
x=runif(10^3)
y=runif(10^3)
z=rnorm(10^3)
new.data=data.frame(x,y,z)
library(ggplot2)
(g <- ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z)) +
geom_point(aes(x, y)))
Now let’s zoom in on that box in the upper left corner
g + coord_cartesian(xlim = c(0.02, 0.075), ylim = c(0.99, 1.035),
expand = FALSE)
As you can see, that box starts below y = 1 but extends above that value
because you are binning observations according to some binwidth.
The same phenomenon can occur if you use a histogram.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In geom_histogram this can be avoided by setting the boundary argument
to 0 and choosing a binwidth that divides the total range evenly.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram(boundary = 0, binwidth = 0.1)
So the solution in your case is to set binwidth to 1/n where n is
an integer
ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), binwidth = 0.1) +
geom_point(aes(x, y))
Created on 2018-11-04 by the reprex package (v0.2.1.9000)
You have:
set.seed(1)
x=runif(10^6)
Here's what's going on behind the scenes:
bins <- 30L
range <- range(x)
origin <- 0L
binwidth <- diff(range)/bins
breaks <- seq(origin, range[2] + binwidth, binwidth)
bins <- cut(x, breaks, include.lowest = TRUE, right = TRUE, dig.lab = 7)
table(bins)
# ...
# (0.8999984,0.9333317] (0.9333317,0.9666649] (0.9666649,0.9999982]
# 33217 33039 33297
# (0.9999982,1.033331]
# 1
max(x)
# [1] 0.9999984
How come this can happen even though coordinates are within 0 and 1 range?
Binning starts at 0 (not at the minimum value).
Each bin has a size of binwidth.
There's a final bin that ends at the maximum value + binwidth, which gets the maximum value.
Do you have any idea how to avoid this?
One way would be to define your own breaks:
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), breaks = seq(0, 1, .1))
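To compare the two sets of breaks without drawing anything, here is a small base R check. default_like mimics the computation shown above and is only a sketch of what happens behind the scenes; explicit is the seq(0, 1, .1) suggested as the fix.

```r
set.seed(1)
x <- runif(10^4)

# Breaks built the same way as in the walkthrough above (30 bins from 0).
binwidth <- diff(range(x)) / 30
default_like <- seq(0, max(x) + binwidth, binwidth)

# Breaks supplied by hand.
explicit <- seq(0, 1, by = 0.1)

# The derived breaks run past 1, the hand-made ones stop at 1.
print(max(default_like))
print(max(explicit))

# Every observation still falls into one of the hand-made bins.
print(table(cut(x, explicit, include.lowest = TRUE)))
```

Because the last explicit break is 1, no tile or bar can extend past the range of the data.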

ggplot2 - How do I get bins of two different histograms to match?

I have the following histogram which uses the default binwidth,
x <- rnorm(100)
p1 <- ggplot() + geom_histogram(aes(x=x))
I want the following histogram to have the exact same bins as p1,
x <- rnorm(100)/2
p2 <- ggplot() + geom_histogram(aes(x=x))
In other words, I want p2 to use the same default bins as p1. How do I do this?
Something that we can do is to extract breaks from the first plot:
x1 <- rnorm(100)
p1 <- ggplot() +
geom_histogram(aes(x = x1))
breaks <- unique(unlist(ggplot_build(p1)$data[[1]][, c("xmin", 'xmax')]))
x2 <- rnorm(100) / 2
p2 <- ggplot() +
geom_histogram(aes(x = x2), breaks = breaks)
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)
I think the easiest way to force the same bins is to facet the plots. Just setting binwidth might start the bins at different places on the two plots, and manually setting limits with boundary and breaks has to be done for the particular data, which could be annoying. Faceting also makes the plots directly comparable on their axes as well as their bins, which is presumably the point of putting them on the same bins in the first place.
library(tidyverse)
tbl <- tibble(
x = rnorm(100),
y = rnorm(100) / 2
)
tbl %>%
gather(var, val, x, y) %>%
ggplot() +
geom_histogram(aes(x = val)) +
facet_wrap(~var)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2018-10-31 by the reprex package (v0.2.0).
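A third option, sketched here as an assumption rather than something either answer proposes, is to compute one set of breaks from the pooled range of both samples and hand it to both histograms. The example uses base R's hist() so the counts can be inspected directly; the same shared_breaks vector could be passed as the breaks argument of geom_histogram in both plots.

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100) / 2

# One set of 30 equal-width bins covering both samples.
pooled <- range(c(x1, x2))
shared_breaks <- seq(pooled[1], pooled[2], length.out = 31)

# Bin both samples on the shared breaks without drawing anything.
h1 <- hist(x1, breaks = shared_breaks, plot = FALSE)
h2 <- hist(x2, breaks = shared_breaks, plot = FALSE)

# Both histograms report exactly the same bin edges.
print(identical(h1$breaks, h2$breaks))
print(rbind(x1 = h1$counts, x2 = h2$counts))
```

Unlike extracting the breaks from p1, this guarantees that the second sample cannot fall outside the bins, at the cost of no longer using p1's default bins.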

How to get a really periodic polar surface plot with ggplot

Sample data:
mydata="theta,rho,value
0,0.8400000,0.0000000
40,0.8400000,0.4938922
80,0.8400000,0.7581434
120,0.8400000,0.6675656
160,0.8400000,0.2616592
200,0.8400000,-0.2616592
240,0.8400000,-0.6675656
280,0.8400000,-0.7581434
320,0.8400000,-0.4938922
360,0.8400000,0.0000000
0,0.8577778,0.0000000
40,0.8577778,0.5152213
80,0.8577778,0.7908852
120,0.8577778,0.6963957
160,0.8577778,0.2729566
200,0.8577778,-0.2729566
240,0.8577778,-0.6963957
280,0.8577778,-0.7908852
320,0.8577778,-0.5152213
360,0.8577778,0.0000000
0,0.8755556,0.0000000
40,0.8755556,0.5367990
80,0.8755556,0.8240077
120,0.8755556,0.7255612
160,0.8755556,0.2843886
200,0.8755556,-0.2843886
240,0.8755556,-0.7255612
280,0.8755556,-0.8240077
320,0.8755556,-0.5367990
360,0.8755556,0.0000000
0,0.8933333,0.0000000
40,0.8933333,0.5588192
80,0.8933333,0.8578097
120,0.8933333,0.7553246
160,0.8933333,0.2960542
200,0.8933333,-0.2960542
240,0.8933333,-0.7553246
280,0.8933333,-0.8578097
320,0.8933333,-0.5588192
360,0.8933333,0.0000000
0,0.9111111,0.0000000
40,0.9111111,0.5812822
80,0.9111111,0.8922910
120,0.9111111,0.7856862
160,0.9111111,0.3079544
200,0.9111111,-0.3079544
240,0.9111111,-0.7856862
280,0.9111111,-0.8922910
320,0.9111111,-0.5812822
360,0.9111111,0.0000000
0,0.9288889,0.0000000
40,0.9288889,0.6041876
80,0.9288889,0.9274519
120,0.9288889,0.8166465
160,0.9288889,0.3200901
200,0.9288889,-0.3200901
240,0.9288889,-0.8166465
280,0.9288889,-0.9274519
320,0.9288889,-0.6041876
360,0.9288889,0.0000000
0,0.9466667,0.0000000
40,0.9466667,0.6275358
80,0.9466667,0.9632921
120,0.9466667,0.8482046
160,0.9466667,0.3324593
200,0.9466667,-0.3324593
240,0.9466667,-0.8482046
280,0.9466667,-0.9632921
320,0.9466667,-0.6275358
360,0.9466667,0.0000000
0,0.9644444,0.0000000
40,0.9644444,0.6512897
80,0.9644444,0.9997554
120,0.9644444,0.8803115
160,0.9644444,0.3450427
200,0.9644444,-0.3450427
240,0.9644444,-0.8803115
280,0.9644444,-0.9997554
320,0.9644444,-0.6512897
360,0.9644444,0.0000000
0,0.9822222,0.0000000
40,0.9822222,0.6751215
80,0.9822222,1.0363380
120,0.9822222,0.9125230
160,0.9822222,0.3576658
200,0.9822222,-0.3576658
240,0.9822222,-0.9125230
280,0.9822222,-1.0363380
320,0.9822222,-0.6751215
360,0.9822222,0.0000000
0,1.0000000,0.0000000
40,1.0000000,0.6989533
80,1.0000000,1.0729200
120,1.0000000,0.9447346
160,1.0000000,0.3702890
200,1.0000000,-0.3702890
240,1.0000000,-0.9447346
280,1.0000000,-1.0729200
320,1.0000000,-0.6989533
360,1.0000000,0.0000000"
Read it into a data frame:
foobar <- read.csv(text = mydata)
You can check (if you really want to!) that the data are periodic in the theta direction, i.e., for each given rho, the point at theta=0 and theta=360 are precisely the same. I would like to plot a nice polar surface plot, in other words an annulus colored according to value. I tried the following:
library(viridis) # just because I very much like viridis: if you don't want to install it, just comment this line and uncomment the scale_fill_distiller line
library(ggplot2)
p <- ggplot(data = foobar, aes(x = theta, y = rho, fill = value)) +
geom_tile() +
coord_polar(theta = "x") +
scale_x_continuous(breaks = seq(0, 360, by = 45), limits=c(0,360)) +
scale_y_continuous(limits = c(0, 1)) +
# scale_fill_distiller(palette = "Oranges")
scale_fill_viridis(option = "plasma")
I'm getting:
Yuck! Why the nasty hole in the annulus? If I generate a foobar data frame with more rows (more theta and rho values), the hole gets smaller. This isn't a viable solution, both because computing data at more rho/theta values is costly and time-consuming, and because even with 100x100=10^4 rows I still get a hole. Also, with a bigger data frame, ggplot takes forever to render the plot: the combination of geom_tile and coord_polar is incredibly inefficient. Isn't there a way to get a nice-looking polar plot without unnecessarily wasting memory & CPU time?
Edit: all rows with theta=360 were removed, since they repeat the values at theta=0.
ggplot(data = foobar, aes(x = theta, y = rho, fill = value)) +
geom_tile() +
coord_polar(theta = "x",start=-pi/9) +
scale_y_continuous(limits = c(0, 1))+
scale_x_continuous(breaks = seq(0, 360, by = 45))
I just removed limits from scale_x_continuous
That gives me:
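The row removal itself is not shown above, so here is a minimal sketch of it on a stand-in grid. The grid and its value column (rho * sin(theta)) are hypothetical and only mimic the shape of foobar. My understanding, which is an assumption rather than something stated in the answer, is that geom_tile centres each tile on its theta value, so with limits = c(0, 360) the tiles at theta = 0 and theta = 360 reach outside the limits and are dropped, which leaves the hole.

```r
# Hypothetical stand-in for foobar: value = rho * sin(theta) is an
# assumption for illustration, not the formula behind the original data.
grid <- expand.grid(theta = seq(0, 360, by = 40),
                    rho = seq(0.84, 1, length.out = 10))
grid$value <- grid$rho * sin(grid$theta * pi / 180)

# theta = 0 and theta = 360 carry the same values (up to rounding) ...
first <- grid$value[grid$theta == 0]
last <- grid$value[grid$theta == 360]
print(all.equal(first, last))

# ... so the theta = 360 rows can be dropped before plotting.
grid_open <- grid[grid$theta != 360, ]

# Each tile is 40 degrees wide and centred on its theta value, so the
# remaining tiles span -20 to 340: one full turn with no overlap.
tile_span <- range(grid_open$theta) + c(-20, 20)
print(tile_span)
```

The start = -pi/9 in coord_polar is a rotation of 20 degrees, which matches the half-tile offset so that the tile centred on theta = 0 starts at the top of the circle.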
