I'm trying to plot isoclines under a scatterplot using ggplot, but I can't figure out how to use stat_function correctly.
The isoclines are based on the distance formula:
sqrt((x1-x2)^2 + (y1-y2)^2)
and would look like concentric circles, except that the center would be at the origin of the plot.
What I've tried so far is calling the distance function within ggplot like so (note: I use x1 = 1 and y1 = 1 because in my real problem I also have fixed values):
distance <- function(x, y) {sqrt((x - 1)^2 + (y - 1)^2)}
ggplot(my_data, aes(x, y))+
geom_point()+
stat_function(fun=distance)
but R returns the error:
Computation failed in 'stat_function()': argument "y" is missing, with no default
How do I correctly feed x and y values to stat_function so that it plots a generic plot of the distance formula, with the center at the origin?
For anything a bit complicated, I avoid the stat functions. They are mostly aimed at quick calculations and are usually limited to calculating y based on x. I would just pre-calculate the data and then plot it with stat_contour instead:
distance <- function(x, y) {sqrt((x - 1)^2 + (y - 1)^2)}
d <- expand.grid(x = seq(0, 2, 0.02), y = seq(0, 2, 0.02))
d$dist <- mapply(distance, x = d$x, y = d$y)
ggplot(d, aes(x, y)) +
geom_raster(aes(fill = dist), interpolate = TRUE) +
stat_contour(aes(z = dist), col = 'white') +
coord_fixed() +
viridis::scale_fill_viridis(direction = -1)
How would I ignore outliers in a ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size = 0); I want them to be ignored so that the y axis scales to show the 1st/3rd quartiles. My outliers are causing the "box" to shrink so small it's practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
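For example, a minimal sketch combining the two with the dfr data from above (the object name p3 and the break choice are only illustrative):
p3 <- ggplot(dfr, aes(f, y)) +
  geom_boxplot() +
  # zoom after the boxplot statistics are computed from the full data
  coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9))) +
  # explicit breaks so the zoomed axis still gets sensible labels
  scale_y_continuous(breaks = pretty(quantile(dfr$y, c(0.1, 0.9)), n = 8))
p3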
Here is a solution using boxplot.stats:
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for the median, the lower and upper hinges (Q1, Q3), and ymin/ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
run a first pass to learn what the bounds are, e.g. a cutoff at a given percentile, or N standard deviations above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
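As a rough sketch of that two-pass idea (the df_w data, the y_wins column, and the 5th/95th-percentile cutoffs are purely illustrative):
library(ggplot2)
set.seed(1)
df_w <- data.frame(y = c(-100, rnorm(100), 100))
# first pass: learn the bounds
bounds <- quantile(df_w$y, probs = c(0.05, 0.95))
# second pass: clamp anything beyond a bound to the bound itself
df_w$y_wins <- pmin(pmax(df_w$y, bounds[1]), bounds[2])
ggplot(df_w, aes(factor(1), y_wins)) + geom_boxplot()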
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. The default value of coef is 1.5, i.e. the whiskers extend to at most 1.5 times the IQR beyond the hinges.
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
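For example, a minimal sketch reusing the dfr data from the earlier answer (the zoom limits are illustrative); fully transparent outliers still count toward the default y range, so coord_cartesian() is used to zoom in:
ggplot(dfr, aes(f, y)) +
  geom_boxplot(outlier.alpha = 0) +                      # outliers drawn fully transparent
  coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))   # zoom without changing the stats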
The "coef" option of the geom_boxplot function allows to change the outlier cutoff in terms of interquartile ranges. This option is documented for the function stat_boxplot. To deactivate outliers (in other words they are treated as regular data), one can instead of using the default value of 1.5 specify a very high cutoff value:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)
I've been trying to use the trans_new function from the scales package, but I can't get it to display the labels correctly.
# percent to fold change
fun1 <- function(x) (x/100) + 1
# fold change to percent
inv_fun1 <- function(x) (x - 1) * 100
percent_to_fold_change_trans <- trans_new(name = "transform", transform = fun1, inverse = inv_fun1)
plot_data <- data.frame(x = 1:10,
y = inv_fun1(1:10))
# Plot raw data
p1 <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point()
# This doesn't really change the plot
p2 <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
coord_trans(y = percent_to_fold_change_trans)
p1 and p2 are identical, whereas I'm expecting p2 to be a diagonal line, since we are reversing the inverse function. If I replace the inverse parameter in trans_new with another function (like function(x) x) I can see the transformation take effect, but the labels are completely off. Any ideas on how to define the inverse parameter so as to get the right label positions?
You wouldn't expect a linear function like fun1 to change the appearance of the y axis. Remember, you are not transforming the data, you are transforming the y axis. This means that you are effectively changing the positions of the horizontal gridlines, but not the values they represent.
Any function that produces a linear transformation will result in fixed spacing between the horizontal grid lines, which is what you have already. The plot therefore won't change.
Let's take a simple example:
plot_data <- data.frame(x = 1:10, y = 1:10)
p <- ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
scale_y_continuous(breaks = 1:10)
p
Now let's create a straightforward non-linear transformation:
little_trans <- trans_new(name = "transform",
transform = function(x) x^2,
inverse = function(x) sqrt(x))
p + coord_trans(y = little_trans)
Note the values on the y axis are the same, but because we applied a non-linear transformation, the distances between the gridlines now vary.
In fact, if we plot a transformed version of our data, we would get the same shape:
ggplot(plot_data, aes(x = x, y = y^2)) +
geom_point() +
scale_y_continuous(breaks = (1:10)^2)
In a sense, this is all that the transform does, except it applies the inverse transform to the axis labels. We could do that manually here:
ggplot(plot_data, aes(x = x, y = y^2)) +
geom_point() +
scale_y_continuous(breaks = (1:10)^2, labels = sqrt((1:10)^2))
Now, suppose I instead use a more complicated, but still linear, function of x:
little_trans <- trans_new(name = "transform",
transform = function(x) (0.1 * x + 20) / 3,
inverse = function(x) (x * 3 - 20) / 0.1)
ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
coord_trans(y = little_trans)
It's unchanged from before. We can see why if we again apply our transform directly:
ggplot(plot_data, aes(x = x, y = (0.1 * y + 20) / 3)) +
geom_point() +
scale_y_continuous(breaks = (0.1 * (1:10) + 20) / 3)
Obviously, if we do the inverse transform on the axis labels we will have 1:10, which means we will just have the original plot back.
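To make that concrete, here is a small sketch: plot the transformed data with transformed breaks, then label those breaks with their inverse, and the original plot reappears.
ggplot(plot_data, aes(x = x, y = (0.1 * y + 20) / 3)) +
  geom_point() +
  scale_y_continuous(breaks = (0.1 * (1:10) + 20) / 3,  # transformed break positions
                     labels = 1:10)                     # inverse-transformed labels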
The same holds true for any linear transform, and therefore the results you are getting are exactly what are to be expected.
How can I fill a geom_violin plot in ggplot2 with different colors based on a fixed cutoff?
For instance, given the setup:
library(ggplot2)
set.seed(123)
dat <- data.frame(x = rep(1:3,each = 100),
y = c(rnorm(100,-1),rnorm(100,0),rnorm(100,1)))
dat$f <- with(dat,ifelse(y >= 0,'Above','Below'))
I'd like to take this basic plot:
ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y))
and simply have each violin colored differently above and below zero. The naive thing to try, mapping the fill aesthetic, splits and dodges the violin plots:
ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y, fill = f))
which is not what I want. I'd like a single violin plot at each x value, but with the interior filled with different colors above and below zero.
Here's one way to do this.
library(ggplot2)
library(plyr)
#Data setup
set.seed(123)
dat <- data.frame(x = rep(1:3,each = 100),
y = c(rnorm(100,-1),rnorm(100,0),rnorm(100,1)))
First we'll use ggplot2::ggplot_build to capture all the calculated variables that go into plotting the violin plot:
p <- ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y))
p_build <- ggplot2::ggplot_build(p)$data[[1]]
Next, if we take a look at the source code for geom_violin we see that it does some specific transformations of this computed data frame before handing it off to geom_polygon to draw the actual outlines of the violin regions.
So we'll mimic that process and simply draw the filled polygons manually:
#This comes directly from the source of geom_violin
p_build <- transform(p_build,
xminv = x - violinwidth * (x - xmin),
xmaxv = x + violinwidth * (xmax - x))
p_build <- rbind(plyr::arrange(transform(p_build, x = xminv), y),
plyr::arrange(transform(p_build, x = xmaxv), -y))
I'm omitting a small detail from the source code about duplicating the first row in order to ensure that the polygon is closed.
Now we do two final modifications:
#Add our fill variable
p_build$fill_group <- ifelse(p_build$y >= 0,'Above','Below')
#This is necessary to ensure that instead of trying to draw
# 3 polygons, we're telling ggplot to draw six polygons
p_build$group1 <- with(p_build,interaction(factor(group),factor(fill_group)))
And finally plot:
#Note the use of the group aesthetic here with our computed version,
# group1
p_fill <- ggplot() +
geom_polygon(data = p_build,
aes(x = x,y = y,group = group1,fill = fill_group))
p_fill
Note that in general this will clobber nice handling of any categorical x axis labels, so you will often need to build the plot using a continuous x axis and then, if you need categorical labels, add them back manually.
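For example, one possible sketch (assuming the labels you want are the factor levels of dat$x):
p_fill +
  scale_x_continuous(breaks = 1:3, labels = levels(factor(dat$x)))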
I'm having serious problems trying to get my head around stat_function in R's ggplot2. I started off with this trivial example:
ggplot(data.frame(x = c(1, 1e4)), aes(x)) + stat_function(fun = function(x) x)
which works as expected. Unfortunately, when I add log scales for both x and y axes so:
ggplot(data.frame(x = 1:1e4), aes(x)) +
scale_x_log10() +
scale_y_log10() +
stat_function(fun = function(x) x)
I get the following result, which is a pretty nasty violation of the identity function.
Is there something very basic that I'm missing? What is then the correct and least hacky way to plot a function on log scale?
EDIT:
Inspired by the answers I went on and experimented with scales and the aesthetics parameter. I was even more puzzled to find out that I got what I expected using the code below:
ggplot(data.frame(x = 1:1e4, y = 1:1e4), aes(x, y)) +
scale_x_log10() +
scale_y_log10() +
stat_function(fun = function(x) x)
with an apparently unused vector of y values (unused by stat_function that is). Do the axis transformations depend on the availability of data?
When you use scale_x_log10(), the x values are log transformed, and only then used by stat_function() to calculate the y values. The x values are then back-transformed to the original values to draw the scale, while the y values remain as calculated from the log-transformed x. You can check this by plotting without scale_y_log10(); the plot shows a straight line.
ggplot(data.frame(x=1:1e4), aes(x)) +
stat_function(fun = function(x) x) +
scale_x_log10()
If you then apply scale_y_log10(), you log transform the already-calculated y values, so a curve is plotted.
In ggplot2, the rule is that scale transformation precedes statistical transformation which in turn precedes coordinate transformation. In this context, the function (via stat_function()) is the statistical transformation.
If you use a scale_x/y_*() function in a ggplot2 call, it will apply the scale transformation(s) first before computing the function.
Case 0: Plot in the original scales of x and y.
ggplot(data.frame(x = 1:1e4, y = 1:1e4), aes(x, y)) +
stat_function(fun = function(x) x)
Case 1a: Both x and y are log transformed before the function is computed because of the presence of scale_x/y_log10(). You can see this from the values on their respective scales (compare to Case 0).
ggplot(data.frame(x = 1:1e4, y = 1:1e4), aes(x, y)) +
stat_function(fun = function(x) x) +
scale_x_log10() +
scale_y_log10()
Case 1b: x is log transformed in the original data frame. Consequently, the function actually operates on the log10(x) values, so will still be a straight line, but on the log10 scale in both x and y.
ggplot(data.frame(x = log10(seq(1e4)), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x)
Case 1c: The same as 1b, with one exception: the x-scale is in the original units but the y-scale is in log10(x) units, because the scale transformation on x occurs before the statistical transformation f(y) = y is computed, where y = log10(x).
ggplot(data.frame(x = seq(1e4), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x) +
scale_x_log10()
Case 2: By contrast, coordinate transformations take place after statistical transformation; i.e., the function is computed in the original units first and then the coordinate transformation on x takes place, which warps the function:
ggplot(data.frame(x = seq(1e4), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x) +
coord_trans(xtrans = "log10")
...unless, of course, you apply the same transformation to both x and y:
ggplot(data.frame(x = seq(1e4), y = seq(1e4)), aes(x, y)) +
stat_function(fun = function(x) x) +
coord_trans(xtrans = "log10", ytrans = "log10")
I have a contour plot in ggplot2 that I want to add a single point to.
My contour plot looks like this:
v = ggplot(pts, aes(theta_1, theta_2, z = z))
v + stat_contour(aes(colour = ..level..), bins = 50) +
  xlab(expression(Theta[1])) + ylab(expression(Theta[2]))
and I have a point that looks like this:
p = ggplot(ts,aes(x,y))
p + geom_point()
Unfortunately, the second plot overwrites the first.
Is there a way to get them to show up on the same plot, similar to MATLAB's "hold on;"?
Thanks!
You can provide the points directly to geom_point():
set.seed(1000)
x = rnorm(1000)
g = ggplot(as.data.frame(x), aes(x = x))
g + stat_bin() + geom_point(data = data.frame(x = -1, y = 40), aes(x=x,y=y))
Not sure if this is still of interest, but I think you just needed to save the updated v object and then add the point to that, rather than creating a new ggplot2 object. For example:
v <- ggplot(pts, aes(theta_1, theta_2, z = z))
v <- v + stat_contour(aes(colour = ..level..), bins = 50) +
  xlab(expression(Theta[1])) + ylab(expression(Theta[2]))
v <- v + geom_point(aes(x=ts$x, y=ts$y))
v # to display
ggplot2 is very good at adding layers incrementally; not all of them have to be based on the dataset specified in the initial ggplot() call.
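For instance, a slightly more idiomatic sketch of the same idea passes the point data to the layer itself rather than reaching into ts with $ inside aes() (pts and ts are the data frames from the question):
v <- ggplot(pts, aes(theta_1, theta_2, z = z)) +
  stat_contour(aes(colour = ..level..), bins = 50) +
  xlab(expression(Theta[1])) + ylab(expression(Theta[2])) +
  geom_point(data = ts, aes(x, y), inherit.aes = FALSE)  # layer-specific data
v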