I create points uniformly in [0,1], and each point has an observation. But ggplot shows some bins extending beyond 1, which is outside the boundary. How can this happen even though the coordinates are within the 0 to 1 range? Do you have any idea how to avoid this?
x=runif(10^6)
y=runif(10^6)
z=rnorm(10^6)
new.data=data.frame(x,y,z)
library(ggplot2)
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z))
It’s an issue related to the grid used for binning.
Let’s use a smaller example.
set.seed(42)
x=runif(10^3)
y=runif(10^3)
z=rnorm(10^3)
new.data=data.frame(x,y,z)
library(ggplot2)
(g <- ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z)) +
geom_point(aes(x, y)))
Now let's zoom in on the box in the upper left corner:
g + coord_cartesian(xlim = c(0.02, 0.075), ylim = c(0.99, 1.035),
expand = FALSE)
As you can see, that box starts below y = 1 but extends above that value
because you are binning observations according to some binwidth.
The same phenomenon can occur if you use a histogram.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In geom_histogram this can be avoided by setting the boundary argument
to 0 and choosing a binwidth such that the total range is a multiple of it.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram(boundary = 0, binwidth = 0.1)
So the solution in your case is to set binwidth to 1/n where n is
an integer
ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), binwidth = 0.1) +
geom_point(aes(x, y))
Created on 2018-11-04 by the reprex package (v0.2.1.9000)
You have:
set.seed(1)
x=runif(10^6)
Here's what's going on behind the scenes:
bins <- 30L                # stat_summary_2d uses 30 bins by default
range <- range(x)
origin <- 0L
binwidth <- diff(range) / bins
# the break grid starts at 0 and runs one binwidth past the maximum
breaks <- seq(origin, range[2] + binwidth, binwidth)
bins <- cut(x, breaks, include.lowest = TRUE, right = TRUE, dig.lab = 7)
table(bins)
# ...
# (0.8999984,0.9333317] (0.9333317,0.9666649] (0.9666649,0.9999982]
# 33217 33039 33297
# (0.9999982,1.033331]
# 1
max(x)
# [1] 0.9999984
How can this happen even though the coordinates are within the 0 to 1 range?
binning starts at 0 (not at the minimum value)
each bin has a width of binwidth
there's a final bin ending at the maximum value + binwidth, and that is the bin the maximum value falls into
Do you have any idea how to avoid this?
One way would be to define your own breaks:
ggplot(data = new.data) +
  stat_summary_2d(fun = mean, aes(x = x, y = y, z = z), breaks = seq(0, 1, 0.1))
Related
Look at the data and graph in this example:
df <- data.frame(x = round(rnorm(10000, mean=100, sd=15)))
df$x <- ifelse(df$x < 50, 50, df$x)
df$x <- ifelse(df$x > 150, 150, df$x)
library(ggplot2)
ggplot(df) +
aes(x = x) +
geom_histogram(aes(y = ..density..),
binwidth = 10,
fill="#69b3a2",
color="#e9ecef", alpha=0.9) +
stat_function(fun = dnorm, args = list(mean = mean(df$x),
sd = sd(df$x)))
The resulting graph is:
Note that the histogram goes outside the data bounds. The data is explicitly set to be limited to the 50 to 150 range, but the histogram seems to represent data from 45 to 155. In other words, the binning seems to be wrong.
Also note that the normal curve stops at the correct limits.
Is there a way to change the binning so that the bins go in the correct boundaries?
comment:
I have found workarounds such as this:
ggplot axis ticks fall at center of bin value rather than at the bin limits
but if I understand correctly, the idea there is to shift the data by half the bin width, which would be wrong on the other side in this case (and it would also ruin the normal curve).
The default binning isn't "wrong" per se, but you can control the bin alignment by fixing the edge of one bin with the boundary argument of geom_histogram():
library(ggplot2)
ggplot(df) +
aes(x = x) +
geom_histogram(aes(y = ..density..),
binwidth = 10,
fill="#69b3a2",
color="#e9ecef",
boundary = 50,
alpha=0.9) +
stat_function(fun = dnorm, args = list(mean = mean(df$x),
sd = sd(df$x)))
You could alternatively set center = 55 for the same result.
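For completeness, here is a sketch of the center variant, reusing the same df (only the alignment argument changes; center = 55 puts the bin centers at 55, 65, ..., so the edges again fall on 50, 60, ..., 150):
library(ggplot2)
ggplot(df) +
  aes(x = x) +
  geom_histogram(aes(y = ..density..),
                 binwidth = 10,
                 center = 55,
                 fill = "#69b3a2",
                 color = "#e9ecef",
                 alpha = 0.9) +
  stat_function(fun = dnorm, args = list(mean = mean(df$x),
                                         sd = sd(df$x)))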
I've run into a weird issue regarding geom_histogram and it can easily be seen by plotting the uniform distribution.
library(tidyverse)
u <- runif(10000)
ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram()
Generates the following count histogram:
It can be seen that it is rather asymmetrical. I've read somewhere that this is because the leftmost (and rightmost) bin is centered around 0, and there are no values being generated from -0.5 up to 0, which creates the discrepancy seen. So I've fiddled with the boundary parameter, and it got me:
ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram(boundary = 0)
Similarly, setting boundary = 1 made it asymmetric on the left.
My questions are: what is the preferred way to fix this behavior? Also, what exactly does the boundary argument do? It is not very clear from the docs.
Thank you.
As far as I can tell, boundary specifies a spot that should be a split between two bins. The rest of the bins are then laid out according to the number of bins or the supplied break points. If the supplied boundary is outside the range of the data, some clever shifting is done, according to the documentation. Maybe the following examples make it clearer what boundary does.
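Conceptually, the bin edges sit on a grid that passes through boundary and is spaced by binwidth. A simplified sketch of that idea (not the exact ggplot2 internals; the binwidth, boundary and data below are made up for illustration):
set.seed(1)
x <- runif(100)
binwidth <- 0.1
boundary <- 0.5
# extend the grid boundary + k * binwidth far enough to cover the data
k <- floor((min(x) - boundary) / binwidth):ceiling((max(x) - boundary) / binwidth)
breaks <- boundary + k * binwidth
breaks  # 0.0, 0.1, ..., 1.0 -- 0.5 is always one of the bin edges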
Workaround
If you set limits for the x axis, you can circumvent the issue, although it's not a very elegant solution.
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.1.0
set.seed(123)
u <- runif(1000)
p1 <- ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram(boundary = 0)
p2 <- ggplot(data = as_tibble(u), aes(x = value)) + geom_histogram(boundary = 0) +
scale_x_continuous(limits = c(0, 1))
cowplot::plot_grid(p1, p2, nrow = 2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Boundary examples
In the third plot (p3), boundary is set to 0.5, and you can see that two bins split exactly at 0.5. The same goes for the fourth plot at 0.75. boundary is perhaps not the best name for what it does, but it basically states that the given number should be an edge between two bins.
u <- runif(10)
p3 <- ggplot(data = as_tibble(u), aes(x = value)) +
geom_histogram(bins = 22, boundary = .5, binwidth = 0.1) +
scale_x_continuous(limits = c(0, 1))
p4 <- ggplot(data = as_tibble(u), aes(x = value)) +
geom_histogram(bins = 22, boundary = .75, binwidth = 0.1) +
scale_x_continuous(limits = c(0, 1))
cowplot::plot_grid(p3, p4, nrow = 2)
#> Warning: Removed 2 rows containing missing values (geom_bar).
Created on 2021-06-13 by the reprex package (v2.0.0)
How would I ignore outliers in a ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size = 0); I want them to be ignored such that the y axis scales to show the 1st/3rd quartiles. My outliers are causing the "box" to shrink so small it's practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
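Putting the two together might look like this, as a sketch built on the dfr example above (the 10%/90% quantiles and the use of pretty() for the break positions are arbitrary choices):
p_zoom <- ggplot(dfr, aes(f, y)) +
  geom_boxplot(outlier.shape = NA) +
  # zoom without dropping data from the quartile computation
  coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9))) +
  # pick tick positions that fall inside the zoomed range
  scale_y_continuous(breaks = pretty(quantile(dfr$y, c(0.1, 0.9))))
p_zoom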
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for ymin, lower (Q1), middle (median), upper (Q3) and ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
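If you have several groups, the same trick can be applied per group before plotting. A sketch in base R (the data frame dat and the grouping column g are made up for illustration):
library(ggplot2)
set.seed(1)
dat <- data.frame(g = rep(c("a", "b"), each = 100), y = rnorm(200))
# five numbers per group: lower whisker, Q1, median, Q3, upper whisker
box_df <- do.call(rbind, lapply(split(dat$y, dat$g), function(v) {
  s <- boxplot.stats(v)$stats
  data.frame(ymin = s[1], lower = s[2], middle = s[3], upper = s[4], ymax = s[5])
}))
box_df$g <- rownames(box_df)
ggplot(box_df, aes(x = g, ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")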
One idea would be to winsorize the data in a two-pass procedure:
run a first pass to learn what the bounds are, e.g. a cut-off at a given percentile, or N standard deviations above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques, but you still come across it a lot.
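A minimal sketch of that two-pass idea in base R (the 5th/95th percentile cut-offs are an arbitrary choice):
set.seed(1)
y <- c(-100, rnorm(100), 100)
# pass 1: learn the bounds
bounds <- quantile(y, probs = c(0.05, 0.95))
# pass 2: clamp anything beyond a bound to that bound
y_wins <- pmin(pmax(y, bounds[1]), bounds[2])
range(y)       # still contains -100 and 100
range(y_wins)  # limited to the learned bounds
The winsorized vector can then be passed to geom_boxplot() as usual.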
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. The default value of coef is 1.5 (i.e. the whiskers extend to the most extreme data point within 1.5 times the IQR of the hinges).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
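As a sketch of how that might look in full (the dummy df mirrors Ramnath's example above, and the coord_cartesian() zoom is my own addition, since hiding the points alone does not shrink the axis):
library(ggplot2)
df <- data.frame(y = c(-100, rnorm(100), 100))
# outlier.alpha = 0 only makes the outlier points invisible; they still
# stretch the y axis, hence the extra zoom
ggplot(df, aes(x = factor(1), y = y)) +
  geom_boxplot(outlier.alpha = 0) +
  coord_cartesian(ylim = boxplot.stats(df$y)$stats[c(1, 5)] * 1.05)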
The coef option of the geom_boxplot function lets you change the outlier cutoff in terms of interquartile ranges. This option is documented for stat_boxplot. To deactivate outliers (in other words, to treat them as regular data), one can specify a very high cutoff value instead of the default of 1.5:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)
I have a data frame myData with around 25,000 rows and a column attr whose values range from 0 to 45,600. I am not sure how to make a simplified or reproducible dataset...
Anyway, I am plotting the density of attr as below, and I also find the attr value where the density is at its maximum:
library(ggplot2)
max <- which.max(density(myData$attr)$y)
density(myData$attr)$x[max]
ggplot(myData, aes(x=attr))+
geom_density(color="darkblue", fill="lightblue")+
geom_vline(xintercept = density(myData$attr)$x[max])+
xlab("attr")
Here is the plot I got, with the x-intercept at the maximum point:
Since the data is skewed, I then attempted to draw the x-axis on a log scale by adding scale_x_log10() to the ggplot. Here is the new graph:
My questions now are:
1. Why does it have 2 maximum points now? Why is my x-intercept no longer at the maximum point(s)?
2. How do I find the intercepts for the 2 new maximum points?
Finally, I attempted to convert the y-axis to counts instead:
ggplot(myData, aes(x=attr)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3)+
xlab("attr")+
scale_x_log10()
I got the following plot:
3. How do I find the count of the 2 peaks?
Why the density shapes are different
To put my comments into a fuller context, ggplot is taking the log before doing the density estimation, which is causing the difference in shape because the binning covers different parts of the domain. For example,
(bins <- seq(1, 10, length.out = 10))
#> [1] 1 2 3 4 5 6 7 8 9 10
(bins_log <- 10^seq(log10(1), log10(10), length.out = 10))
#> [1] 1.000000 1.291550 1.668101 2.154435 2.782559 3.593814 4.641589
#> [8] 5.994843 7.742637 10.000000
library(ggplot2)
ggplot(data.frame(x = c(bins, bins_log),
trans = rep(c('identity', 'log10'), each = 10)),
aes(x, y = trans, col = trans)) +
geom_point()
This binning can affect the resulting density shape. For example, compare an untransformed density:
d <- density(mtcars$disp)
plot(d)
to one which is logged beforehand:
d_log <- density(log10(mtcars$disp))
plot(d_log)
Note that the height of the modes flips! I believe what you are asking for is the first one, but with the log transformation applied after the density, i.e.
d_x_log <- d
d_x_log$x <- log10(d_x_log$x)
plot(d_x_log)
Here the modes are similar, just compressed.
Moving to ggplot
When moving to ggplot, the easiest way to do the density estimation before the log transformation is to compute it outside of ggplot beforehand:
library(ggplot2)
d <- density(mtcars$disp)
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'burlywood', alpha = 0.3) +
scale_x_log10()
Finding modes
Finding modes when there's a single one is relatively easy; it's just d$x[which.max(d$y)]. But when you have multiple modes, that's not good enough, since it will only show you the highest one. A solution is to effectively take the derivative and look for where the slope changes from positive to negative. We can do this numerically with diff, and since we only care about whether the result is positive or negative, call sign on that to turn everything into -1 and 1.* If we call diff on that, everything will be 0 except the maxima and minima, which will be -2 and 2, respectively. We can then look for which values are less than 0, and use those to subset. (Because diff does not insert NAs at the ends, you'll have to add one to the indices.) Altogether, designed to work on a density object:
d <- density(mtcars$disp)
modes <- function(d){
i <- which(diff(sign(diff(d$y))) < 0) + 1
data.frame(x = d$x[i], y = d$y[i])
}
modes(d)
#> x y
#> 1 128.3295 0.003100294
#> 2 305.3759 0.002204658
d$x[which.max(d$y)] # double-check
#> [1] 128.3295
We can add them to our plot, and they'll get transformed nicely:
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'mistyrose', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
Plotting counts instead of density
To turn the y-axis into counts instead of density, multiply y by the number of observations, which is stored in the density object as n:
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'thistle', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
In this case it looks a little silly because there are only 32 observations spread over a wide domain, but with a larger n and smaller domain, it is more interpretable:
d <- density(diamonds$carat, n = 2048)
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'papayawhip', alpha = 0.3) +
geom_point(data = modes(d), aes(y = y * d$n)) +
scale_x_log10()
* Or 0 if the value is exactly 0, but that's unlikely here and will work fine regardless.