ggplot2 custom stat not shown when facetting - r

I'm trying to write a custom stat_* for ggplot2, where I would like to color a 2D loess surface using tiles. When I start from the extension guide, I can write a stat_chull like they do:
stat_chull = function(mapping = NULL, data = NULL, geom = "polygon",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
chull = ggproto("chull", Stat,
compute_group = function(data, scales) {
data[chull(data$x, data$y), , drop = FALSE]
},
required_aes = c("x", "y")
)
layer(
stat = chull, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
This work for both the simple call and facet wrapping:
ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
stat_chull()
# optionally + facet_wrap(~ class)
When I write my stat_loess2d, I can also visualize all classes or an individual class:
stat_loess2d = function(mapping = NULL, data = NULL, geom = "tile",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
loess2d = ggproto("loess2d", Stat,
compute_group = function(data, scales) {
dens = MASS::kde2d(data$x, data$y)
lsurf = loess(fill ~ x + y, data=data)
df = data.frame(x = rep(dens$x, length(dens$y)),
y = rep(dens$y, each=length(dens$x)),
dens = c(dens$z))
df$fill = predict(lsurf, newdata=df[c("x", "y")])
df
},
required_aes = c("x", "y", "fill")
)
layer(
stat = loess2d, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
ggplot(mpg, aes(x=displ, y=hwy, fill=year)) +
geom_point(aes(color=year)) +
stat_loess2d()
ggplot(mpg[mpg$class == "compact",], aes(x=displ, y=hwy, fill=year)) +
geom_point(aes(color=year)) +
stat_loess2d()
However, when I try to facet the above, tiles are no longer shown:
ggplot(mpg, aes(x=displ, y=hwy, fill=year)) +
geom_point(aes(color=year)) +
stat_loess2d() +
facet_wrap(~ class)
Can someone tell me what I'm doing wrong here?

Explanation
The main issue I see here actually lies beyond what you have done, & is related to how geom_tile handles tile creation across different facets, when the specific x / y axis values differ significantly. An older question demonstrated a similar phenomenon: geom_tile works fine with each facet's data on its own, but put them together, and the tiles shrink to match the smallest difference between different facets' values. This leaves vast amounts of white space in the plot layer, and usually gets progressively worse with every additional facet, until the tiles themselves become practically invisible.
To get around this, I would add a data processing step after the density / loess calculations for each facet, to standardize the range of x and y values across all facets.
Some elaboration in case you are not very familiar with the relationship between compute_layer, compute_panel, and compute_group (I certainly wasn't when I started messing around with ggproto objects...):
Essentially, all Stat* objects have these three functions to bridge the gap between a given dataframe (mpg in this case), and what's received by the Geom* side of things.
Of the three, compute_layer is the top-level function, and typically triggers compute_panel to calculate a separate dataframe for each facet / panel (the terminology used in exported functions is facet, but the underlying package code refers to the same as panel; I'm not sure why either). In turn, compute_panel triggers compute_group to calculate a separate dataframe for each group (as defined by the group / colour / fill / etc. aesthetic parameters).
The results from compute_group are returned to compute_panel and combined into one dataframe. Likewise, compute_layer receives one dataframe from each facet's compute_panel, and combines them together again. The combined dataframe is then passed on to Geom* to draw.
(Above is the generic set-up as defined in the top-level Stat. Other Stat* objects inheriting from Stat may override the behaviour in any of the steps. For example, StatIdentity's compute_layer returns the original dataframe as-is, without triggering compute_panel / compute_group at all, because there is no need to do so for unchanged data.)
For this use case, we can modify the code in compute_layer, after the results have been returned from compute_panel / compute_group and combined together, to interpolate values associated with each facet into common bins. Because common bins = nice large tiles without white space in between.
Modification
Here's how I would have written the loess2d ggproto object, with an additional definition for compute_layer:
loess2d = ggproto("loess2d", Stat,
compute_group = function(data, scales) {
dens = MASS::kde2d(data$x, data$y)
lsurf = loess(fill ~ x + y, data=data)
df = data.frame(x = rep(dens$x, length(dens$y)),
y = rep(dens$y, each=length(dens$x)),
dens = c(dens$z))
df$fill = predict(lsurf, newdata=df[c("x", "y")])
df
},
compute_layer = function(self, data, params, layout) {
# no change from Stat$compute_layer in this chunk, except
# for liberal usage of `ggplot2:::` to utilise un-exported
# functions from the package
ggplot2:::check_required_aesthetics(self$required_aes,
c(names(data), names(params)),
ggplot2:::snake_class(self))
data <- remove_missing(data, params$na.rm,
c(self$required_aes, self$non_missing_aes),
ggplot2:::snake_class(self),
finite = TRUE)
params <- params[intersect(names(params), self$parameters())]
args <- c(list(data = quote(data), scales = quote(scales)), params)
df <- plyr::ddply(data, "PANEL", function(data) {
scales <- layout$get_scales(data$PANEL[1])
tryCatch(do.call(self$compute_panel, args),
error = function(e) {
warning("Computation failed in `", ggplot2:::snake_class(self),
"()`:\n", e$message, call. = FALSE)
data.frame()
})
})
# define common x/y grid range across all panels
# (length = 25 chosen to match the default value for n in MASS::kde2d)
x.range <- seq(min(df$x), max(df$x), length = 25)
y.range <- seq(min(df$y), max(df$y), length = 25)
# interpolate each panel's data to a common grid,
# with NA values for regions where each panel doesn't
# have data (this can be changed via the extrap
# parameter in akima::interp, but I think
# extrapolating may create misleading visuals)
df <- df %>%
tidyr::nest(-PANEL) %>%
mutate(data = purrr::map(data,
~akima::interp(x = .x$x, y = .x$y, z = .x$fill,
xo = x.range, yo = y.range,
nx = 25, ny = 25) %>%
akima::interp2xyz(data.frame = TRUE) %>%
rename(fill = z))) %>%
tidyr::unnest()
return(df)
},
required_aes = c("x", "y", "fill")
)
Usage:
ggplot(mpg,
aes(x=displ, y=hwy, fill=year)) +
stat_loess2d() +
facet_wrap(~ class)
# this does trigger warnings (not errors) because some of the facets contain
# really very few observations. if we filter for facets with more rows of data
# in the original dataset, this wouldn't be an issue
ggplot(mpg %>% filter(!class %in% c("2seater", "minivan")),
aes(x=displ, y=hwy, fill=year)) +
stat_loess2d() +
facet_wrap(~ class)
# no warnings triggered

Related

How to align scale transformation across geoms?

I have a geom_foo() which will do some transformation of the input data and I have a scale transformation. My problem is that these work not as I would expect together with other geom_*s in terms of scaling.
To illustrate the behavior, consider foo() which will be used in the setup_data method of GeomFoo, defined at the end of the question.
foo <- function(x, y) {
data.frame(
x = x + 2,
y = y + 2
)
}
foo(1, 1)
The transformer is:
foo_trans <- scales::trans_new(
name = "foo",
transform = function(x) x / 5,
inverse = function(x) x * 5
)
Given this input data:
df1 <- data.frame(x = c(1, 2), y = c(1, 2))
Here is a basic plot:
library(ggplot2)
ggplot(df1, aes(x = x, y = y)) +
geom_foo()
When I apply the transformation to the vertical scale, I get this
ggplot(df1, aes(x = x, y = y)) +
geom_foo() +
scale_y_continuous(trans = foo_trans)
What I can say is that the y-axis limits are calculate as 11 = 1 + (2*5) and 12 = 2 + (2*5), where 1 and 2 are df1$y, and (2 * 5) are taken from the setup_data method and from trans_foo.
My real problem is, that I would like add a text layer with labels. These labels and their coordinates come from another dataframe, as below.
df_label <- foo(df1$x, df1$y)
df_label$label <- c("A", "B")
Label and point layers are on same x-y positions without the scale transformation
p <- ggplot(df1, aes(x = x, y = y)) +
geom_foo(color = "red", size = 6) +
geom_text(data = df_label, aes(x, y, label = label))
p
But when I apply the transformation, the coordinates do not match anymore
p +
scale_y_continuous(trans = foo_trans)
How do I get the to layer to match in x-y coordinates after the transformation? Thanks
ggproto object:
GeomFoo <- ggproto("GeomFoo", GeomPoint,
setup_data = function(data, params) {
cols_to_keep <- setdiff(names(data), c("x", "y"))
cbind(
foo(data$x, data$y),
data[, cols_to_keep]
)
}
)
geom constructor:
geom_foo <- function(mapping = NULL, data = NULL, ...,
na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = "identity",
geom = GeomFoo,
position = "identity",
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
...
)
)
}
Doing data transformations isn't really the task of a geom, but a task of a stat instead. That said, the larger issue is that scale transformations are applied before the GeomFoo$setup_data() method is called. There are two ways one could accomplish this task that I could see.
Apply foo() before scale transformation. I don't think geoms or stats ever have access to the data before scale transformation. A possible place for this is in the ggplot2:::Layer$setup_layer() method. However, this isn't exported, which probably means the devs would like to discourage this even before we make an attempt.
Inverse the scale transformation, apply foo(), and transform again. For this, you need a method with access to the scales. AFAIK, no geom method has this access. However Stat$compute_panel() does have access, so we can use this.
To give an example of (2), I think you could get away with the following:
StatFoo <- ggproto(
"StatFoo", Stat,
compute_panel = function(self, data, scales) {
cols_to_keep <- setdiff(names(data), c("x", "y"))
food <- foo(scales$x$trans$inverse(data$x),
scales$y$trans$inverse(data$y))
cbind(
data.frame(x = scales$x$trans$transform(food$x),
y = scales$y$trans$transform(food$y)),
data[, cols_to_keep]
)
}
)
geom_foo <- function(mapping = NULL, data = NULL, ...,
na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = StatFoo,
geom = GeomPoint,
position = "identity",
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
...
)
)
}
If someone else has brighter ideas to do this, I'd also like to know!

Creating geom / stat from scratch

I just started working with R not long ago, and I am currently trying to strengthen my visualization skills. What I want to do is to create boxplots with mean diamonds as a layer on top (see picture in the link below). I did not find any functions that does this already, so I guess I have to create it myself.
What I was hoping to do was to create a geom or a stat that would allow something like this to work:
ggplot(data, aes(...))) +
geom_boxplot(...) +
geom_meanDiamonds(...)
I have no idea where to start in order to build this new function. I know which values are needed for the mean diamonds (mean and confidence interval), but I do not know how to build the geom / stat that takes the data from ggplot(), calculates the mean and CI for each group, and plots a mean diamond on top of each boxplot.
I have searched for detailed descriptions on how to build these type of functions from scratch, however, I have not found anything that really starts from the bottom. I would really appreciate it, if anyone could point me towards some useful guides.
Thank you!
I'm currently learning to write geoms myself, so this is going to be a rather long & rambling post as I go through my thought processes, untangling the Geom aspects (creating polygons & line segments) from the Stats aspects (calculating where these polygons & segments should be) of a geom.
Disclaimer: I'm not familiar with this kind of plot, and Google didn't throw up many authoritative guides. My understanding of how the confidence interval is calculated / used here may be off.
Step 0. Understand the relationship between a geom / stat and a layer function.
geom_boxplot and stat_boxplot are examples of layer functions. If you enter them into the R console, you'll see that they are (relatively) short, and does not contain actual code for calculating the box / whiskers of the boxplot. Instead, geom_boxplot contains a line that says geom = GeomBoxplot, while stat_boxplot contains a line that says stat = StatBoxplot (reproduced below).
> stat_boxplot
function (mapping = NULL, data = NULL, geom = "boxplot", position = "dodge2",
..., coef = 1.5, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
{
layer(data = data, mapping = mapping, stat = StatBoxplot,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = list(na.rm = na.rm,
coef = coef, ...))
}
GeomBoxplot and StatBoxplot are ggproto objects. They are where the magic happens.
Step 1. Recognise that ggproto()'s _inherit parameter is your friend.
Don't reinvent the wheel. Since we want to create something that overlaps nicely with a boxplot, we can take reference from the Geom / Stat used for that, and only change what's necessary.
StatMeanDiamonds <- ggproto(
`_class` = "StatMeanDiamonds",
`_inherit` = StatBoxplot,
... # add functions here to override those defined in StatBoxplot
)
GeomMeanDiamonds <- ggproto(
`_class` = "GeomMeanDiamonds",
`_inherit` = GeomBoxplot,
... # as above
)
Step 2. Modify the Stat.
There are 3 functions defined within StatBoxplot: setup_data, setup_params, and compute_group. You can refer to the code on Github (link above) for the details, or view them by entering for example StatBoxplot$compute_group.
The compute_group function calculates the ymin / lower / middle / upper / ymax values for all the y values associated with each group (i.e. each unique x value), which are used to plot the box plot. We can override it with one that calculates the confidence interval & mean values instead:
# ci is added as a parameter, to allow the user to specify different confidence intervals
compute_group_new <- function(data, scales, width = NULL,
ci = 0.95, na.rm = FALSE){
a <- mean(data$y)
s <- sd(data$y)
n <- sum(!is.na(data$y))
error <- qt(ci + (1-ci)/2, df = n-1) * s / sqrt(n)
stats <- c("lower" = a - error, "mean" = a, "upper" = a + error)
if(length(unique(data$x)) > 1) width <- diff(range(data$x)) * 0.9
df <- as.data.frame(as.list(stats))
df$x <- if(is.factor(data$x)) data$x[1] else mean(range(data$x))
df$width <- width
df
}
(Optional) StatBoxplot has provision for the user to include weight as an aesthetic mapping. We can allow for that as well, by replacing:
a <- mean(data$y)
s <- sd(data$y)
n <- sum(!is.na(data$y))
with:
if(!is.null(data$weight)) {
a <- Hmisc::wtd.mean(data$y, weights = data$weight)
s <- sqrt(Hmisc::wtd.var(data$y, weights = data$weight))
n <- sum(data$weight[!is.na(data$y) & !is.na(data$weight)])
} else {
a <- mean(data$y)
s <- sd(data$y)
n <- sum(!is.na(data$y))
}
There's no need to change the other functions in StatBoxplot. So we can define StatMeanDiamonds as follows:
StatMeanDiamonds <- ggproto(
`_class` = "StatMeanDiamonds",
`_inherit` = StatBoxplot,
compute_group = compute_group_new
)
Step 3. Modify the Geom.
GeomBoxplot has 3 functions: setup_data, draw_group, and draw_key. It also includes definitions for default_aes() and required_aes().
Since we've changed the upstream data source (the data produced by StatMeanDiamonds contain the calculated columns "lower" / "mean" / "upper", while the data produced by StatBoxplot would have contained the calculated columns "ymin" / "lower" / "middle" / "upper" / "ymax"), do check whether the downstream setup_data function is affected as well. (In this case, GeomBoxplot$setup_data makes no reference to the affected columns, so no changes required here.)
The draw_group function takes the data produced by StatMeanDiamonds and set up by setup_data, and produces multiple data frames. "common" contains the aesthetic mappings common to all geoms. "diamond.df" for the mappings that contribute towards the diamond polygon, and "segment.df" for the mappings that contribute towards the horizontal line segment at the mean. The data frames are then passed to the draw_panel functions of GeomPolygon and GeomSegment respectively, to produce the actual polygons / line segments.
draw_group_new = function(data, panel_params, coord,
varwidth = FALSE){
common <- data.frame(colour = data$colour,
size = data$size,
linetype = data$linetype,
fill = alpha(data$fill, data$alpha),
group = data$group,
stringsAsFactors = FALSE)
diamond.df <- data.frame(x = c(data$x, data$xmax, data$x, data$xmin),
y = c(data$upper, data$mean, data$lower, data$mean),
alpha = data$alpha,
common,
stringsAsFactors = FALSE)
segment.df <- data.frame(x = data$xmin, xend = data$xmax,
y = data$mean, yend = data$mean,
alpha = NA,
common,
stringsAsFactors = FALSE)
ggplot2:::ggname("geom_meanDiamonds",
grid::grobTree(
GeomPolygon$draw_panel(diamond.df, panel_params, coord),
GeomSegment$draw_panel(segment.df, panel_params, coord)
))
}
The draw_key function is used to create the legend for this layer, should the need arise. Since GeomMeanDiamonds inherits from GeomBoxplot, the default is draw_key = draw_key_boxplot, and we don't have to change it. Leaving it unchanged will not break the code. However, I think a simpler legend such as draw_key_polygon offers a less cluttered look.
GeomBoxplot's default_aes specifications look fine. But we need to change the required_aes since the data we expect to get from StatMeanDiamonds is different ("lower" / "mean" / "upper" instead of "ymin" / "lower" / "middle" / "upper" / "ymax").
We are now ready to define GeomMeanDiamonds:
GeomMeanDiamonds <- ggproto(
"GeomMeanDiamonds",
GeomBoxplot,
draw_group = draw_group_new,
draw_key = draw_key_polygon,
required_aes = c("x", "lower", "upper", "mean")
)
Step 4. Define the layer functions.
This is the boring part. I copied from geom_boxplot / stat_boxplot directly, removing all references to outliers in geom_meanDiamonds, changing to geom = GeomMeanDiamonds / stat = StatMeanDiamonds, and adding ci = 0.95 to stat_meanDiamonds.
geom_meanDiamonds <- function(mapping = NULL, data = NULL,
stat = "meanDiamonds", position = "dodge2",
..., varwidth = FALSE, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE){
if (is.character(position)) {
if (varwidth == TRUE) position <- position_dodge2(preserve = "single")
} else {
if (identical(position$preserve, "total") & varwidth == TRUE) {
warning("Can't preserve total widths when varwidth = TRUE.", call. = FALSE)
position$preserve <- "single"
}
}
layer(data = data, mapping = mapping, stat = stat,
geom = GeomMeanDiamonds, position = position,
show.legend = show.legend, inherit.aes = inherit.aes,
params = list(varwidth = varwidth, na.rm = na.rm, ...))
}
stat_meanDiamonds <- function(mapping = NULL, data = NULL,
geom = "meanDiamonds", position = "dodge2",
..., ci = 0.95,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {
layer(data = data, mapping = mapping, stat = StatMeanDiamonds,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ci = ci, ...))
}
Step 5. Check output.
# basic
ggplot(iris,
aes(Species, Sepal.Length)) +
geom_boxplot() +
geom_meanDiamonds()
# with additional parameters, to see if they break anything
ggplot(iris,
aes(Species, Sepal.Length)) +
geom_boxplot(width = 0.8) +
geom_meanDiamonds(aes(fill = Species),
color = "red", alpha = 0.5, size = 1,
ci = 0.99, width = 0.3)

Constrict ggplot ellips to realistic/possible values

When plotting an ellips with ggplot is it possible to constrain the ellips to values that are actually possible?
For example, the following reproducible code and data plots Ele vs. Var for two species. Var is a positive variable and cannot be negative. Nonetheless, negative values are included in the resulting ellips. Is it possible to bound the ellips by 0 on the x-axis (using ggplot)?
More specifically, I am picturing a flat edge with the ellipsoids truncated at 0 on the x-axis.
library(ggplot2)
set.seed(123)
df <- data.frame(Species = rep(c("BHS", "MTG"), each = 100),
Ele = c(sample(1500:3000, 100), sample(2500:3500, 100)),
Var = abs(rnorm(200)))
ggplot(df, aes(Var, Ele, color = Species)) +
geom_point() +
stat_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2)
You could edit the default stat to clip points to a particular value. Here we change the basic stat to trim x values less than 0 to 0
StatClipEllipse <- ggproto("StatClipEllipse", Stat,
required_aes = c("x", "y"),
compute_group = function(data, scales, type = "t", level = 0.95,
segments = 51, na.rm = FALSE) {
xx <- ggplot2:::calculate_ellipse(data = data, vars = c("x", "y"), type = type,
level = level, segments = segments)
xx %>% mutate(x=pmax(x, 0))
}
)
Then we have to wrap it in a ggplot stat that is identical to stat_ellipe except that it uses our custom Stat object
stat_clip_ellipse <- function(mapping = NULL, data = NULL,
geom = "path", position = "identity",
...,
type = "t",
level = 0.95,
segments = 51,
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = StatClipEllipse,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
type = type,
level = level,
segments = segments,
na.rm = na.rm,
...
)
)
}
then you can use it to make your plot
ggplot(df, aes(Var, Ele, color = Species)) +
geom_point() +
stat_clip_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2)
This was inspired by the source code for stat_ellipse.
Based on my comment above, I created a less-misleading option for visualization. This is ignoring the problem with y being uniformly distributed, since that's a somewhat less egregious problem than the heavily skewed x variable.
Both these options use the ggforce package, which is an extension of ggplot2, but just in case, I've also included the source for the particular function I used.
library(ggforce)
library(scales)
# power_trans <- function (n)
# {
# scales::trans_new(name = paste0("power of ", fractions(n)), transform = function(x) {
# x^n
# }, inverse = function(x) {
# x^(1/n)
# }, breaks = scales::extended_breaks(), format = scales::format_format(),
# domain = c(0, Inf))
# }
Option 1:
ggplot(df, aes(Var, Ele, color = Species)) +
geom_point() +
stat_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2) +
scale_x_sqrt(limits = c(-0.1,3.5),
breaks = c(0.0001,1:4),
labels = 0:4,
expand = c(0.00,0))
This option stretches the x-axis along a square-root transform, spreading out the points clustered near zero. Then it computes an ellipse over this new space.
Advantage: looks like an ellipse still.
Disadvantage: in order to get it to play nice and label the Var=0 point on the x axis, you have to use expand = c(0,0), which clips the limits exactly, and so requires a bit more fiddling with manual limits/breaks/labels, including choosing a very small value (0.0001) to be represented as 0.
Disadvantage: the x values aren't linearly distributed along the axis, which requires a bit more cognitive load when reading the figure.
Option 2:
ggplot(df, aes(sqrt(Var), Ele, color = Species)) +
geom_point() +
stat_ellipse() +
coord_trans(x = ggforce::power_trans(2)) +
scale_x_continuous(breaks = sqrt(0:4), labels = 0:4,
name = "Var")
This option plots the pre-transformed sqrt(Var) (notice the aes(...)). It then calculates the ellipses based on this new approximately normal value. Then it stretches out the x-axis so that the values of Var are once again linearly spaced, which distorts the ellipse in the same transformation.
Advantage: looks cool.
Advantage: values of Var are easy to interpret on the x-axis.
Advantage: you can see the density near Var=0 with the points and the wide flat end of the "egg" easily.
Advantage: the pointy end shows you how low the density is at those values.
Disadvantage: looks unfamiliar and requires explanation and additional cognitive load to interpret.

Can we access all data columns in a custom ggplot2's stat?

I would like to implement diagnostics for Cox proportional hazards model with ggplot2, by creating new stats functions and ggproto objects. The idea is to benefit from grouping (by color, facet_grid, etc.) for conditional computation of desired statistics (e.g. martingale residuals).
In the example below, the null model is refitted and martingale's computed for each cofactor level. The issues are:
Fit is done based on data argument provided internally for the ggproto's compute_group method, not the actual dataset. This means, that dataframe is stripped of columns not explicitly defined in aes so it's up to the user to provide at least "time to event" and "event indicator" columns each time. (unless they're hardcoded in a wrappper for stat_... function). Can compute_group somehow access corresponding rows from the original dataset?
Rownames for compute_group's data are unique per PANEL, which means they do not reflect the actual rownames of original dataset supplited to ggplot, and we can no longer exclude e.g. incomplete cases omitted by full model, unless explicitly specifying/hardcoding some id variable. The question is same as above.
Can a geom layer access the y's computed by another custom stat, which is also plotted? Consider a scatterplot of residuals with a smoother, such as. ggplot(data, aes(x = covariate, time = time, event = event)) + stat_funcform() + geom_smooth(aes(y = ..martingales..)), where ..martingales.. are actually computed by stat_funcform.
.
# Load & Set data ---------------------------------------------------------
require(data.table)
require(dplyr)
require(ggplot2)
require(survival)
set.seed(1)
dt <- data.table(factor = sample(c(0, 2), 1e3, T),
factor2 = sample(c(0, 2, NA), 1e3, T),
covariate = runif(1e3, 0, 1))
dt[, `:=`(etime = rexp(1e3, .01 * (.5 + factor + covariate)),
ctime = runif(1e3, 0, 1e2))]
dt[, `:=`(event = ifelse(etime < ctime, 1, 0),
time = ifelse(etime < ctime, etime, ctime))]
survfit(Surv(time, event) ~ factor + I(covariate > .5), data = dt) %>%
autoplot(fun = 'cloglog')
# Fit coxph model ---------------------------------------------------------
fit <- coxph(Surv(time, event) ~ factor + covariate, data = dt)
# Define custom stat ------------------------------------------------------
StatFuncform <- ggproto(
"StatFuncform", Stat,
compute_group = function(data, scales, params) {
head(data) %>% print
data$y <- update(params$fit, ~ 1, data = data) %>%
resid(type = 'martingale')
return(data)
}
)
stat_funcform <- function(mapping = NULL, data = NULL, geom = "point",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
layer(
stat = StatFuncform, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
# Plot --------------------------------------------------------------------
# computation unsuccessfull - no 'time' and 'event' variables
ggplot(data = dt,
mapping = aes(x = covariate,
color = factor(event))) +
stat_funcform(params = list(fit = fit)) +
facet_grid(~ factor, labeller = label_both, margins = TRUE)
# ok - 'time' and 'event' specified explicitly
ggplot(data = dt,
mapping = aes(x = covariate, time = time, event = event,
factor2 = factor2,
color = factor(event))) +
stat_funcform(params = list(fit = fit)) +
facet_grid(~ factor, labeller = label_both, margins = TRUE)
I'm aware there are already a few packages implementing diagnostics for survival analysis (e.g. survminer) or ggplot2's autoplot methods, but they're rather providing wrappers instead of taking advantage of default ggplot2() + stat_ semantics.

ggplot2 2.0 new stat_ function: setting default scale for given aesthetics

I try to use the new functionality of ggplot2 in R that allows creating our own stat_ functions. I'm creating a simple one to compute and plot an interpolated surface between points arranged on a 2d array.
I would like to create a stat_topo() requiring x, y, and val aesthetics, plotting a simple geom_raster of interpolated val mapped to fill.
library(ggplot2)
library(dplyr)
library(akima)
cpt_grp <- function(data, scales) {
#interpolate data in 2D
itrp <- akima::interp(data$x,data$y,data$val,linear=F,extrap=T)
out <- expand.grid(x=itrp$x, y=itrp$y,KEEP.OUT.ATTRS = F)%>%
mutate(fill=as.vector(itrp$z))
# str(out)
return(out)
}
StatTopo <- ggproto("StatTopo", Stat,
compute_group = cpt_grp,
required_aes = c("x","y","val")
)
stat_topo <- function(mapping = NULL, data = NULL, geom = "raster",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
layer(
stat = StatTopo, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
set.seed(1)
nchan <- 30
d <- data.frame(val = rnorm(nchan), # some random values to be mapped to fill color
x = 1:nchan*cos(1:nchan), # the x and y position of the points to interpolate
y = 1:nchan*sin(1:nchan))
plot(d$x,d$y)
ggplot(d,aes(x=x,y=y,val=val)) +
stat_topo() +
geom_point()
When I run this, I get the following error:
Error: numerical color values must be >= 0, found -1
I understand that this is because somehow the scale of the fill aesthetic is set to discrete.
If I enter this:
ggplot(d,aes(x=x,y=y,val=val)) +
stat_topo() +
scale_fill_continuous() +
geom_point()
I get what I wanted: the expected raster with a continuous color scale, which I want the stat_ to do by default...
So I guess the question is:
How can I prevent ggplot from setting a discrete scale here, and ideally set a default scale within the call to my new stat_ function.
Apparently, when creating a new variable inside a stat_ function, one needs to explicitly associate it to the aesthetic it will be mapped to with the parameter default_aes = aes(fill = ..fill..) within the ggproto definition.
This is telling ggplot that it is a calculated aesthetic and it will pick a scale based on the data type.
So here we need to define the stat_ as follows:
cpt_grp <- function(data, scales) {
# interpolate data in 2D
itrp <- akima::interp(data$x,data$y,data$val,linear=F,extrap=T)
out <- expand.grid(x=itrp$x, y=itrp$y,KEEP.OUT.ATTRS = F)%>%
mutate(fill=as.vector(itrp$z))
# str(out)
return(out)
}
StatTopo <- ggproto("StatTopo", Stat,
compute_group = cpt_grp,
required_aes = c("x","y","val"),
default_aes = aes(fill = ..fill..)
)
stat_topo <- function(mapping = NULL, data = NULL, geom = "raster",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
layer(
stat = StatTopo, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
Then the following code:
set.seed(1)
nchan <- 30
d <- data.frame(val = rnorm(nchan),
x = 1:nchan*cos(1:nchan),
y = 1:nchan*sin(1:nchan))
ggplot(d,aes(x=x,y=y,val=val)) +
stat_topo() +
geom_point()
Produces as expected:
Without the need to specify a scale_ manually, but leaving the possibility to adapt the scale easily as usual with e.g. scale_fill_gradient2(low = 'blue',mid='white',high='red')
I got this answer here: https://github.com/hadley/ggplot2/issues/1481
Okay, slept on it, and had an idea, and I think this might do what you want. In your stat_topo layer function instead of the ggproto I returned a list with it as the first element and then added to that list another ggproto with a call to scale_fill_continuous().
library(ggplot2)
library(dplyr)
library(akima)
cpt_grp <- function(data, scales) {
#interpolate data in 2D
itrp <- akima::interp(data$x,data$y,data$val,linear=F,extrap=T)
out <- expand.grid(x=itrp$x, y=itrp$y,KEEP.OUT.ATTRS = F)%>%
mutate(fill=as.vector(itrp$z))
return(out)
}
StatTopo <- ggproto("StatTopo", Stat,
compute_group = cpt_grp,
required_aes = c("x","y","val")
)
stat_topo <- function(mapping = NULL, data = NULL, geom = "raster",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
list(
layer(
stat = StatTopo, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm )
),
scale_fill_continuous()
)
}
set.seed(1)
nchan <- 30
d <- data.frame(val = rnorm(nchan), # some random values to be mapped to fill color
x = 1:nchan*cos(1:nchan), # the x and y position of interp points
y = 1:nchan*sin(1:nchan))
ggplot(d,aes(x=x,y=y,val=val)) +
stat_topo() +
geom_point()
yielding the same picture as above.

Resources