I have a data frame myData with around 25,000 rows and a column attr whose values range from 0 to 45,600. I am not sure how to make a simplified or reproducible version of the data...
Anyway, I am plotting the density of attr like below, and I also find the attr value where density is maximum:
library(ggplot2)
max <- which.max(density(myData$attr)$y)
density(myData$attr)$x[max]
ggplot(myData, aes(x=attr))+
geom_density(color="darkblue", fill="lightblue")+
geom_vline(xintercept = density(myData$attr)$x[max])+
xlab("attr")
Here is the plot I have got with the x-intercept at maximum point:
Since the data is skewed, I then attempted to draw x-axis in log scale by adding scale_x_log10() to the ggplot, here is the new graph:
My questions now are:
1. Why does it have 2 maximum points now? Why is my x-intercept no longer at the maximum point(s)?
2. How do I find the intercepts for the 2 new maximum points?
Finally, I attempt to convert the y-axis to count instead:
ggplot(myData, aes(x=attr)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3)+
xlab("attr")+
scale_x_log10()
I got the following plot:
3. How do I find the count of the 2 peaks?
Why the density shapes are different
To put my comments into a fuller context, ggplot is taking the log before doing the density estimation, which is causing the difference in shape because the binning covers different parts of the domain. For example,
(bins <- seq(1, 10, length.out = 10))
#> [1] 1 2 3 4 5 6 7 8 9 10
(bins_log <- 10^seq(log10(1), log10(10), length.out = 10))
#> [1] 1.000000 1.291550 1.668101 2.154435 2.782559 3.593814 4.641589
#> [8] 5.994843 7.742637 10.000000
library(ggplot2)
ggplot(data.frame(x = c(bins, bins_log),
trans = rep(c('identity', 'log10'), each = 10)),
aes(x, y = trans, col = trans)) +
geom_point()
This binning can affect the resulting density shape. For example, compare an untransformed density:
d <- density(mtcars$disp)
plot(d)
to one which is logged beforehand:
d_log <- density(log10(mtcars$disp))
plot(d_log)
Note that the height of the modes flips! I believe what you are asking for is the first one, but with the log transformation applied after the density, i.e.
d_x_log <- d
d_x_log$x <- log10(d_x_log$x)
plot(d_x_log)
Here the modes are similar, just compressed.
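One way to see why the mode heights can flip: a density on the log10 scale is the original density multiplied by a change-of-variables factor x * ln(10), which gives more weight to large x. The sketch below checks this identity for a log-normal distribution. The distribution and the parameter values mu and sigma are chosen purely for illustration and are not taken from the question's data.

```r
# Assumption for illustration: X is log-normal, i.e. log(X) ~ N(mu, sigma)
mu <- 1
sigma <- 0.5

x <- c(0.5, 1, 2, 5, 10)   # points on the original scale
y <- log10(x)              # the same points on the log10 scale

# Density of X on the original scale
f_x <- dlnorm(x, meanlog = mu, sdlog = sigma)

# Density of Y = log10(X) via the change-of-variables factor x * ln(10)
f_y_jacobian <- f_x * x * log(10)

# Density of Y computed directly: log10(X) is normal with rescaled parameters
f_y_direct <- dnorm(y, mean = mu / log(10), sd = sigma / log(10))

all.equal(f_y_jacobian, f_y_direct)
```

Because the factor grows with x, a peak at large x gains height relative to a peak at small x once the density is estimated on the log scale.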
Moving to ggplot
When moving to ggplot, to do the density estimation before the log transformation it's easiest to do it outside of ggplot beforehand:
library(ggplot2)
d <- density(mtcars$disp)
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'burlywood', alpha = 0.3) +
scale_x_log10()
Finding modes
Finding modes when there's a single one is relatively easy; it's just d$x[which.max(d$y)]. But when you have multiple modes, that's not good enough, since it will only show you the highest one. A solution is to effectively take the derivative and look for where the slope changes from positive to negative. We can do this numerically with diff, and since we only care about whether the result is positive or negative, call sign on that to turn everything into -1 and 1.* If we call diff on that, everything will be 0 except the maximums and minimums, which will be -2 and 2, respectively. We can then look for which values are less than 0, which we can use to subset. (Because diff returns a vector one element shorter than its input and does not pad it with NAs, you'll have to add one to the indices.) Altogether, designed to work on a density object,
d <- density(mtcars$disp)
modes <- function(d){
i <- which(diff(sign(diff(d$y))) < 0) + 1
data.frame(x = d$x[i], y = d$y[i])
}
modes(d)
#> x y
#> 1 128.3295 0.003100294
#> 2 305.3759 0.002204658
d$x[which.max(d$y)] # double-check
#> [1] 128.3295
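To make the double-diff trick concrete, here is the same computation traced step by step on a small hand-made vector. The vector y is invented for illustration; it has local peaks at positions 3 and 7.

```r
# Toy heights with local peaks at index 3 (value 5) and index 7 (value 6)
y <- c(1, 3, 5, 4, 2, 3, 6, 2)

slope <- diff(y)                  #  2  2 -1 -2  1  3 -4
slope_sign <- sign(slope)         #  1  1 -1 -1  1  1 -1
turning <- diff(slope_sign)       #  0 -2  0  2  0 -2

# Negative entries mark a switch from rising to falling; add 1 because
# each diff() shortens the vector by one element
peak_idx <- which(turning < 0) + 1
peak_idx                          # 3 7
y[peak_idx]                       # 5 6
```

The positive entry in turning (the 2 at position 4) marks the local minimum at index 5, which is why the sign of the comparison matters.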
We can add them to our plot, and they'll get transformed nicely:
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'mistyrose', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
Plotting counts instead of density
To turn the y-axis into counts instead of density, multiply y by the number of observations, which is stored in the density object as n:
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'thistle', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
In this case it looks a little silly because there are only 32 observations spread over a wide domain, but with a larger n and smaller domain, it is more interpretable:
d <- density(diamonds$carat, n = 2048)
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'papayawhip', alpha = 0.3) +
geom_point(data = modes(d), aes(y = y * d$n)) +
scale_x_log10()
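For the question about the height of the peaks on the count scale, a minimal sketch is to scale the mode heights by d$n as well. It reuses the modes() helper defined above and uses mtcars$disp as stand-in data. Note that the scaled values are counts per unit of x rather than tallies of observations, so it is the area under the scaled curve, not its height, that adds up to roughly the number of observations.

```r
d <- density(mtcars$disp)

# Same helper as above: local maxima of a density object
modes <- function(d) {
  i <- which(diff(sign(diff(d$y))) < 0) + 1
  data.frame(x = d$x[i], y = d$y[i])
}

# Peak heights on the count scale (density * number of observations)
peaks <- modes(d)
peaks$count <- peaks$y * d$n
peaks

# Sanity check: trapezoid-rule area under the density is close to 1,
# so the area under the scaled curve is close to the sample size
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area * d$n
```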
* Or 0 if the value is exactly 0, but that's unlikely here and will work fine regardless.
Related
I understand the difference between a scale and coord transform in ggplot2 is that scale transforms are done before the statistic is computed and coordinate transforms are done after the statistic is computed. However, I'm having trouble understanding this with an actual example.
Packages
library(ggplot2)
library(gapminder)
Create base plot
base <- ggplot(data = gapminder,
mapping = aes(x = year,
y = gdpPercap * pop,
color = continent)) +
geom_line(stat = "summary")
Scale transform
base +
scale_y_continuous(trans = "log10")
Coordinate transform
base +
coord_trans(y = "log10")
The coord_trans() results in the correct depiction of the data, but I do not understand why.
Note: I have seen this question and it did not fully help
Here's a simpler example that should help explain the differences. Suppose we have two values in a data frame, 1 and 10. The mean of these is 11 / 2 = 5.5.
my_data = data.frame(y = c(1, 10))
mean(my_data$y)
#[1] 5.5
If we take the log (base 10) of those, we get 0 and 1. The average of the logs is (0+1)/2 = 0.5. If we transform that back to the original scale, we get 10^0.5 = 3.162. So we can see that ten to the mean of the logs is not the same as the mean; the log "squishes" the large values so they have less of an impact on the average.
log10(my_data$y)
#[1] 0 1
mean(log10(my_data$y))
#[1] 0.5
10^mean(log10(my_data$y))
#[1] 3.162278
We'll see the same thing if we plot this. Using a coord transformation will control the viewport and the spatial position of the data points (e.g. note that the vertical distance in pixels from 5.00 to 5.25 is a smidge bigger than the distance from 5.75 to 6.00, due to the log scale), but it doesn't change the data points -- we still get an average of 5.5:
ggplot(my_data, aes(y = y, x = 1)) +
geom_point(stat = "summary", fun = "mean") +
coord_trans(y = "log10")
But if we switch to scale_y_log10, the transformation is applied upstream of the mean calculation, so the value we get is ten to the mean of the logs, which we saw is not the same as the arithmetic mean.
ggplot(my_data, aes(y = y, x = 1)) +
geom_point(stat = "summary", fun = "mean") +
scale_y_log10()
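If you'd rather check the numbers than read them off the axis, one option is to pull the computed summary out of each plot with layer_data(). This is a sketch that assumes a ggplot2 version in which stat = "summary" accepts the fun argument (3.3.0 or later); the object names are made up for the example.

```r
library(ggplot2)

my_data <- data.frame(y = c(1, 10))

p <- ggplot(my_data, aes(y = y, x = 1)) +
  geom_point(stat = "summary", fun = "mean")

# Coordinate transform: the mean is computed on the raw values
y_coord <- layer_data(p + coord_trans(y = "log10"))$y

# Scale transform: the mean is computed on log10(y), so the stored
# value is in log10 units and must be back-transformed for comparison
y_scale <- layer_data(p + scale_y_log10())$y

y_coord      # arithmetic mean on the original scale
10^y_scale   # ten to the mean of the logs
```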
How would I ignore outliers in a ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size=0), but I want them to be ignored such that the y axis scales to show the 1st/3rd quartiles. My outliers are causing the "box" to shrink so small it's practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
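The reason the order matters can be shown without plotting. With the default settings, scale limits turn out-of-range values into NA before the boxplot statistics are computed, whereas coord_cartesian only zooms the finished plot. The base R sketch below mimics the first behaviour by filtering the data by hand; the simulated data and variable names are assumptions for illustration.

```r
set.seed(1)
y <- rlnorm(1e4)                      # right-skewed stand-in data

# Limits like the ones used above: the 10th and 90th percentiles
lims <- quantile(y, c(0.1, 0.9))

# What the statistic sees after scale limits have dropped the rest
y_kept <- y[y >= lims[1] & y <= lims[2]]

probs <- c(0.25, 0.5, 0.75)
rbind(all_data      = quantile(y, probs),
      within_limits = quantile(y_kept, probs))
```

The quartiles of the filtered data sit closer to the median than those of the full data, so a box drawn after setting scale limits is narrower than the box of the real distribution.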
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for Q1, median, Q3, ymin, ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
run a first pass, learn what the bounds are, e.g. cut off at a given percentile, or N standard deviations above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
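A minimal sketch of that two-pass idea in base R, applied to the example vector from the question. The helper name winsorize and the 5%/95% cut-offs are assumptions chosen for illustration, not a recommendation.

```r
# Hypothetical helper: clamp values outside the given quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  # First pass: learn the bounds
  bounds <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # Second pass: pull values beyond a bound back to that bound
  pmin(pmax(x, bounds[1]), bounds[2])
}

y <- c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
y_w <- winsorize(y)

range(y)     # original extremes
range(y_w)   # extremes after clamping
```

Unlike dropping the outliers, this keeps the number of observations unchanged, which is why the box and whiskers are still based on all eleven points.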
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. Default value for coef is 1.5 (i.e. default length of the whiskers is 1.5 times the IQR).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
The "coef" option of the geom_boxplot function lets you change the outlier cutoff, expressed as a multiple of the interquartile range. This option is documented under stat_boxplot. To deactivate outlier detection (in other words, to treat outliers as regular data), specify a very high cutoff value instead of the default of 1.5:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)
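The effect of a large coef can be checked numerically with base R's boxplot.stats, which applies an analogous 1.5 * IQR rule. This is only a sketch: ggplot2 computes its own statistics, so the numbers may differ slightly from what the plot shows.

```r
set.seed(1)
y <- c(-10, rnorm(100), 10)

default_stats <- boxplot.stats(y)               # coef = 1.5
wide_stats    <- boxplot.stats(y, coef = 1e30)  # effectively no cutoff

default_stats$out             # points flagged as outliers by default
wide_stats$out                # nothing is flagged any more
wide_stats$stats[c(1, 5)]     # whisker ends now reach the data extremes
```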
I have searched SO and other online sources to no avail.
Is there a way to scale an axis such that z-scores will better reflect the actual difference from 0 to 1 and from 1 to 2 (or any other equally spaced score)?
If I have an x-axis with z-scores ranging from -3 to 3 and axis ticks at every integer between, is there a way to have those axis ticks which are closer to 0 be spaced smaller than those that are farther?
Example:
-3 -2 -1 0 1 2 3
|----------|------|--|--|------|----------|
Am I missing some axis scaling method which accepts both the breaks as values but also the position of the breaks relative to the entire scale?
EDIT:
Maybe not quite a reprex, but this is the structure of the data and basic method of visualization:
df <-
data.frame(
metric = c('metric1', 'metric2', 'metric3'),
z_score = c(2, -1.5, 2.8)
)
df %>%
ggplot(aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
ylim(-4,4)
The code above produces a plot where the z_score axis has evenly spaced breaks, whereas I would like the breaks to be "pulled" toward zero like I attempted to draw above.
What you describe seems to correspond to a modulus transformation, but I don't know how to choose the correct parameters to get the exact transformation that you want.
Here is an example:
library(ggplot2)
library(scales)
df <- data.frame(
metric = c('metric1', 'metric2', 'metric3'),
z_score = c(2, -1.5, 2.8)
)
ggplot(df, aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
scale_y_continuous(trans = modulus_trans(2),
limits = c(-4, 4),
breaks = c(-3:3))
Created on 2020-05-28 by the reprex package (v0.3.0)
The trick to this is to use a new transformation object. There are several already defined in scales::, and the closest I found (though it is opposite, in a sense) is:
ggplot(df, aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
scale_y_continuous(trans=scales::pseudo_log_trans(0.2, 2),
limits = c(-3, 3), breaks = -3:3)
But that has the opposite expansion I think you want. Since one way to see the opposite of pseudo_log would be pseudo_exp, and I didn't find one, here's an attempt:
pseudo_exp_trans <- function(pow = 2) {
scales::trans_new(
"pseudo_exp",
function(x) sign(x) * abs(x)^pow,
function(x) sign(x) * abs(x)^(1/pow))
}
ggplot(df, aes(x = metric, y = z_score)) +
geom_col() +
coord_flip() +
scale_y_continuous(trans=pseudo_exp_trans(),
limits = c(-3, 3), breaks = -3:3)
Just play with the pow= argument to find the growth-rate you want in the axis.
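Before wiring a custom transformation into a scale, it can help to check that the forward and inverse functions really undo each other and to see where the breaks will land. The sketch below mirrors the two functions above as plain R functions, without using scales; the names fwd and inv are made up for the example.

```r
pow <- 2

fwd <- function(x) sign(x) * abs(x)^pow        # transform
inv <- function(x) sign(x) * abs(x)^(1 / pow)  # inverse

z <- as.numeric(-3:3)

fwd(z)                      # -9 -4 -1  0  1  4  9
diff(fwd(z))                #  5  3  1  1  3  5
all.equal(inv(fwd(z)), z)   # round trip recovers the z-scores
```

The gaps between consecutive breaks grow away from zero, which matches the spacing sketched in the question.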
I create points uniformly in [0,1] and each point has an observation. But ggplot shows some tiles extending beyond 1, which is outside the boundary. How can this happen even though the coordinates are within the 0 to 1 range? Do you have any idea how to avoid this?
x=runif(10^6)
y=runif(10^6)
z=rnorm(10^6)
new.data=data.frame(x,y,z)
library(ggplot2)
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z))
It’s an issue related to the grid used for binning.
Let’s use a smaller example.
set.seed(42)
x=runif(10^3)
y=runif(10^3)
z=rnorm(10^3)
new.data=data.frame(x,y,z)
library(ggplot2)
(g <- ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z)) +
geom_point(aes(x, y)))
Now let’s zoom at that box on the upper left corner
g + coord_cartesian(xlim = c(0.02, 0.075), ylim = c(0.99, 1.035),
expand = FALSE)
As you can see, that box starts below y = 1 but extends above that value
because you are binning observations according to some binwidth.
The same phenomenon can occur if you use a histogram.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In geom_histogram this can be avoided by setting the boundary argument
to 0 and the binwidth to a value that divides the total range evenly.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram(boundary = 0, binwidth = 0.1)
So the solution in your case is to set binwidth to 1/n where n is
an integer
ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), binwidth = 0.1) +
geom_point(aes(x, y))
Created on 2018-11-04 by the reprex package (v0.2.1.9000)
You have:
set.seed(1)
x=runif(10^6)
Here's what's going on behind the scenes:
bins <- 30L
range <- range(x)
origin <- 0L
binwidth <- diff(range)/bins
breaks <- seq(origin, range[2] + binwidth, binwidth)
bins <- cut(x, breaks, include.lowest = TRUE, right = TRUE, dig.lab = 7)
table(bins)
# ...
# (0.8999984,0.9333317] (0.9333317,0.9666649] (0.9666649,0.9999982]
# 33217 33039 33297
# (0.9999982,1.033331]
# 1
max(x)
# [1] 0.9999984
How come this can happen even though coordinates are within 0 and 1
range
binning starts at 0 (not at the minimum value)
each bin has a size of binwidth
there's a final bin that ends at the maximum value + binwidth, which gets the maximum value
Do you have any idea how to avoid this?
One way would be to define your own breaks:
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), breaks = seq(0, 1, .1))
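To confirm that hand-picked breaks keep every bin inside the unit interval, the same check can be done in base R with cut(), in the spirit of the reconstruction above. The sample size and variable names are arbitrary choices for this sketch.

```r
set.seed(1)
x <- runif(1e4)

# Breaks that start at 0, end at 1, and divide the interval evenly
breaks <- seq(0, 1, by = 0.1)

bin <- cut(x, breaks, include.lowest = TRUE)

range(breaks)   # no bin edge falls outside [0, 1]
anyNA(bin)      # every observation lands in some bin
table(bin)
```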