I'm trying to use stat_ecdf() to plot cumulative successes as a function of a rank score created by a predictive model.
#libraries
require(ggplot2)
require(scales)
# fake data for reproducibility
set.seed(123)
n <- 200
df <- data.frame(model_score= rexp(n=n,rate=1:n),
obs_set= sample(c("training","validation"),n,replace=TRUE))
df$model_rank <- rank(df$model_score)/n
df$target_outcome <- rbinom(n,1,1-df$model_rank)
# Plot Gain Chart using stat_ecdf()
ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
stat_ecdf(aes(colour = obs_set), size=1) +
scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
xlab("Model Percentile") + ylab("Percent of Target Outcome") +
scale_y_continuous(limits=c(0,1), labels=percent) +
geom_segment(aes(x=0,y=0,xend=1,yend=1),
colour = "gray", linetype="longdash", size=1) +
ggtitle("Gain Chart")
All I want to do is force the ECDF to start at (0,0) and end at (1,1) so that there are no gaps at the beginning or end of the curve. If possible, I'd like to do it within the syntax of ggplot2, but I'd settle for a clever workaround.
@Henrik, this is NOT a duplicate of this question: I have already defined my limits with scale_x_continuous() and scale_y_continuous(), and adding expand_limits() doesn't do anything. It is not the origin of the PLOT but the endpoints of the stat_ecdf() curve that need to be fixed.
Unfortunately, the definition of stat_ecdf gives no wiggle room here; it determines the endpoints internally.
There is a somewhat advanced solution. With the latest version of ggplot2 (devtools::install_github("hadley/ggplot2")), the extensibility is improved, to the point where it is possible to override this behavior, but not without some boilerplate.
stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
position = "identity", n = NULL, show.legend = NA,
inherit.aes = TRUE, minval=NULL, maxval=NULL,...) {
layer(
data = data,
mapping = mapping,
stat = StatEcdf2,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
stat_params = list(n = n, minval=minval,maxval=maxval),
params = list(...)
)
}
StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
calculate = function(data, scales, n = NULL, minval=NULL, maxval=NULL, ...) {
df <- StatEcdf$calculate(data, scales, n, ...)
if (!is.null(minval)) { df$x[1] <- minval }
if (!is.null(maxval)) { df$x[length(df$x)] <- maxval }
df
}
)
Now, stat_ecdf2 will behave the same as stat_ecdf, but with optional minval and maxval parameters. So this will do the trick:
ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
stat_ecdf2(aes(colour = obs_set), size=1, minval=0, maxval=1) +
scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
xlab("Model Percentile") + ylab("Percent of Target Outcome") +
scale_y_continuous(limits=c(0,1), labels=percent) +
geom_segment(aes(x=0,y=0,xend=1,yend=1),
colour = "gray", linetype="longdash", size=1) +
ggtitle("Gain Chart")
The big caveat here is that I don't know if the current extensibility model will be supported in the future; it has changed several times in the past, and the change to use "ggproto" is recent -- like July 15th 2015 recent.
As a plus, this gave me a chance to really dig into ggplot's internals, which is something that I've been meaning to do for a while.
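As a lower-tech workaround that does not depend on ggplot2 internals, the step coordinates can also be computed by hand and drawn with geom_step(). The sketch below is an alternative under stated assumptions, not the approach from the answer above: ecdf_with_endpoints() is a hypothetical helper name, the input values are made up, and the helper simply prepends (lower, 0) and appends (upper, 1) to the sorted values.

```r
# Hypothetical helper (not part of ggplot2): ECDF step coordinates with
# explicit start and end points added.
ecdf_with_endpoints <- function(x, lower = 0, upper = 1) {
  x <- sort(x)
  n <- length(x)
  data.frame(
    x = c(lower, x, upper),      # start at `lower`, end at `upper`
    y = c(0, seq_len(n) / n, 1)  # cumulative share, forced to begin at 0 and end at 1
  )
}

# Small illustrative input (made-up values)
pts <- ecdf_with_endpoints(c(0.2, 0.5, 0.9))

# Draw the steps only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(pts, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_step()
}
```

For one curve per obs_set, the helper would have to be applied to each subset and the results bound together with a group column. It also assumes that lower and upper bracket the data. Recent ggplot2 releases additionally document a pad argument on stat_ecdf(), which may be worth checking before writing any custom code.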
Related
I am trying to display data that includes non-detects. For the ND I want to have a circular outline at different sizes so that the lines do not overlap each other. I pretty much have what I want, but for the parameter cis-DCE the circular outline just makes the point look bigger instead of being a distinct outline. How do I attribute size to the parameter and also make the starting size larger?
I will include all of the code I am using for the graphing, but I am specifically working on this bit right now.
geom_point(aes(x= date, y = lrl, group = parm_nmShort, size = parm_nmShort), shape = 1) + #marking lower limit
I also know that I could use facet_wrap, and I've done that previously. Historically, though, this data has been shown in one graph without identifying the NDs, and I do not want to drastically alter the display of the data and confuse anyone.
{
#graphing
library(ggplot2) # plotting
library(scales)  # date_format()
library(egg)     # theme_article()
# folder where you want the graphs to be saved:
results <- 'C:/Users/cbuckley/OneDrive - DOI/Documents/Projects/New Haven/Data/Graphs/'
{
VOC.graph <- function(df, na.rm = TRUE, ...){
df$parm_nmShort <- factor(df$parm_nm, levels = c("cis.1.2.Dichloroethene_77093",
"Trichloroethene_34485",
"Tetrachloroethene_34475"),
labels = c("cis-DCE", "TCE", "PCE"))
# create list of sites in data to loop over
site_list <- unique(df$site_nm)
# create for loop to produce ggplot2 graphs
for (i in seq_along(site_list)) {
# create plot for each site in df
plot <-
ggplot(subset(df, df$site_nm==site_list[i]),
aes(x = date, y = result,
group = parm_nmShort,
color = parm_nmShort)) +
geom_point() + #add data point plot
geom_line() + #add line plot
#geom_point(aes(y = lrl, group = parm_nmShort, shape = parm_nmShort)) +
geom_point(aes(x= date, y = lrl, group = parm_nmShort, size = parm_nmShort), shape = 1) + #marking lower limit
#scale_shape_manual(values = c("23","24","25")) + #create outlier shapes
#facet_wrap(~parm_nmShort) +
ggtitle(site_list[i]) + #name graphs well names
# theme(legend.position="none") + #removed legend
labs(x = "Year", y = expression(paste("Value, ug/L"))) + #add x and y label titles
theme_article() + #remove grey boxes, outline graph in black
theme(legend.title = element_blank()) + #removes legend title
scale_x_date(labels = date_format("%y"),
limits = as.Date(c("2000-01-01","2021-01-01"))) #+ # set x axis for all graphs
# geom_hline(yintercept = 5) #+ #add 5ug/L contaminant limit horizontal line
# theme(axis.text.x = element_text(angle = 45, size = 12, vjust = 1)) + #angles x axis titles 45 deg
# theme(aspect.ratio = 1) +
# scale_color_hue(labels = c("cic-DCE", "PCE", "TCE")) + #change label names
# scale_fill_discrete(breaks = c("PCE", "TCE", "cic-DCE"))
# Code below will let you block out below the resolution limit
# geom_ribbon(aes(ymin = 0, ymax = ###LRL###), fill ="white", color ="grey3") +
# geom_line(color ="black", lwd = 1)
#ggsave(plot,
# file=paste(results, "", site_list[i], ".png", sep=''),
# scale=1)
# print plots to screen
print(plot)
}
}
#run graphing function with long data set
VOC.graph(data)
}}
Well after a lot of playing around, I figured out the answer to my own question. I figured I'd leave the question up because none of the solutions I found online worked for me but this code did.
geom_point(aes(x= date, y = lrl, group = parm_nmShort, shape = parm_nmShort, size = parm_nmShort)) + #identify non detects
scale_shape_manual(values = c(1,1,1)) +
scale_size_manual(values = c(3,5,7)) +
I'm not very good at R, but for some reason, when I didn't include both group and shape in the aes as parm_nmShort, I couldn't manually change the values. I don't know if it's because I have more than one geom_point in my whole script, so maybe it didn't know which one to change.
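To illustrate the idea of tying outline size to the parameter, here is a minimal sketch with made-up data. The data frame nd and its values are assumptions for illustration only. It maps size to the factor inside aes(), keeps shape = 1 (an open circle) as a fixed setting, and then picks the actual sizes with scale_size_manual().

```r
library(ggplot2)

# Made-up non-detect rows: three parameters sharing one date and reporting limit
nd <- data.frame(
  date = as.Date(rep("2010-01-01", 3)),
  lrl  = c(0.5, 0.5, 0.5),
  parm = factor(c("cis-DCE", "TCE", "PCE"), levels = c("cis-DCE", "TCE", "PCE"))
)

p <- ggplot(nd, aes(x = date, y = lrl)) +
  geom_point(aes(size = parm), shape = 1) +  # open circles, size driven by parameter
  scale_size_manual(values = c("cis-DCE" = 3, "TCE" = 5, "PCE" = 7))

# Sizes actually assigned to the points after the scale is applied
built_sizes <- layer_data(p)$size
```

Naming the values in scale_size_manual() ties each size to a specific factor level, so the smallest starting size can be raised just by changing the first number.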
I am plotting a box-plot to see the distribution of the variable. I am also interested in seeing the number of observations in each quartile. Is there any way to add the number of observations in each quartile to the boxplot along with the values of quartiles?
I included some code below which can generate box-plot with the values of quartiles.
df <- datasets::iris
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label_repel", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
I expect the values of quartiles on the left-hand side of the plot and the number of observations on the right-hand side of the plot if possible.
This would be one possibility. I always prefer to keep my additional data in a separate data frame, because that gives me more control over what is calculated and how.
Counting made with some inspiration from https://stackoverflow.com/a/54451575
quantile_counts=function(x){
df= data.frame(label=table(cut(x, quantile(x))),
label_pos=diff(quantile(x))/2+quantile(x)[1:4])
return(df)
}
df_quantile_counts=quantile_counts(df$Sepal.Length)
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
geom_text(data=df_quantile_counts,aes(x="",y=label_pos,label = label.Freq),
position = position_nudge(x = +0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
HTH, Tobi
@TobiO's answer is correct. However, my data was somewhat skewed and some cut points were identical (for example, the first and second cut points were the same), so I needed to take the unique values to calculate the number of observations in each quartile. Another point concerns the cut function, which by default does not include the starting point: its intervals are (low bound, high bound]. To include the starting point, I used the cut2 function from the Hmisc package. I also added a label_pos_extension vector to prevent the labels from overlapping for quartiles whose cut points are very close to each other; geom_text_repel did not prevent the overlaps.
library(Hmisc) # provides cut2()
quantile_counts2 <- function(x){
  label_pos_extension <- c(0,3,4,0)
  q <- quantile(x)
  k <- length(unique(q)) # number of distinct cut points
  if(k < 5){
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = c(0, diff(unique(q)) / 2 + q[seq_len(k - 1)]) + label_pos_extension[1:k])
  } else {
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = diff(q) / 2 + q[1:4] + label_pos_extension)
  }
  return(df)
}
PS. I tried to put my edited function in comment but, it did not work.
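For readers who would rather avoid the extra package, base R's cut() can also close the first interval on the left through its include.lowest argument. The following is a minimal sketch, not the method used in either answer: quartile_counts_base() is a hypothetical helper name, and it assumes the data has at least two distinct quantile values.

```r
# Hypothetical helper: count observations per quartile bin using base R only.
quartile_counts_base <- function(x) {
  breaks <- unique(quantile(x))  # drop duplicated cut points (skewed data)
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)  # first bin becomes [low, high]
  as.data.frame(table(bins), responseName = "n")
}

counts <- quartile_counts_base(iris$Sepal.Length)

# For comparison: without include.lowest the minimum value falls outside every bin
n_default <- sum(table(cut(iris$Sepal.Length, quantile(iris$Sepal.Length))))
```

Comparing sum(counts$n) with n_default shows how many observations the default cut() call leaves out.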
I want to add a table with some information that will be different in each panel of a facetted plot.
I'm using ggplot2 and facet_grid.
Say I want to add some kind of descriptive statistics to each panel, and they are not necessarily the same.
These statistics are stored in a data frame I made for that purpose.
I found a few ways to add such tables to the graphs, but:
as far as I can tell, annotate() will give me the same table for all the panels in the facet.
I would really like to use facet_wrap for its simplicity, rather than gridExtra...
library(ggplot2) # the mpg data set used below ships with ggplot2
ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
facet_wrap( ~ cyl,scales="free_y")
The position of the table is not that important to me, but I don't want it to overlap the graph.
My objective is a kind of mixture of the two answers in this thread:
Adding table to ggplot with facets
The first answer (with annotate) won't work for me, since I want the table in each panel to be unique.
The second answer is better, but I do not want the table to overlap or hide details in the graph, and the lines/points are located in a different place in each panel, so I can't use it as is. I would like the table to be attached to the panel, just as with annotate.
try this
library(ggplot2)
library(tibble)
library(gridExtra)
library(grid)
GeomCustom <- ggproto(
"GeomCustom",
Geom,
setup_data = function(self, data, params) {
data <- ggproto_parent(Geom, self)$setup_data(data, params)
data
},
draw_group = function(data, panel_scales, coord) {
vp <- grid::viewport(x=data$x, y=data$y)
g <- grid::editGrob(data$grob[[1]], vp=vp)
ggplot2:::ggname("geom_custom", g)
},
required_aes = c("grob","x","y")
)
geom_custom <- function(mapping = NULL,
data = NULL,
stat = "identity",
position = "identity",
na.rm = FALSE,
show.legend = NA,
inherit.aes = FALSE,
...) {
layer(
geom = GeomCustom,
mapping = mapping,
data = data,
stat = stat,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
gl <- list(tableGrob(iris[1:2,1:3]),
tableGrob(iris[1:4,1:3]),
tableGrob(iris[1:3,1:3]),
tableGrob(iris[1:2,1:2]))
dummy <- tibble(f=letters[1:4], grob = gl )
d <- tibble(x=rep(1:3, 4), f=rep(letters[1:4], each=3))
ggplot(d, aes(x,x)) +
facet_wrap(~f) +
theme_bw() +
geom_custom(data=dummy, aes(grob=grob), x = 0.5, y = 0.5)
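The reason each panel receives its own table in the code above is that the layer's data (dummy) carries the facetting variable f, so ggplot2 sends each row to the matching panel. The same mechanism can be sketched with plain text labels, which is a simpler stand-in for a full table. Here, panel_stats and its label column are hypothetical names, and the summary statistic is chosen only for illustration.

```r
library(ggplot2)

# One row of summary statistics per facet level (cyl)
panel_stats <- aggregate(hwy ~ cyl, data = mpg, FUN = mean)
panel_stats$label <- sprintf("mean hwy = %.1f", panel_stats$hwy)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # The layer data contains `cyl`, so each label lands in its own panel;
  # x = Inf / y = Inf pins it to the top-right corner of that panel.
  geom_text(data = panel_stats, aes(x = Inf, y = Inf, label = label),
            hjust = 1.1, vjust = 1.5, inherit.aes = FALSE) +
  facet_wrap(~ cyl, scales = "free_y")

# Panels that received a label
label_panels <- layer_data(p, 2)$PANEL
```

Pinning the label to a corner does not guarantee that it avoids the points. That still depends on where the data falls in each panel.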
I am trying to use geom_ribbon to mimic the behavior of geom_area,
but I have not been successful. Would you have any hint as to why the following does not work?
I used Hadley's statement from ggplot2 geom_area web pages :
"An area plot is a special case of geom_ribbon, where the minimum of the range is fixed to 0, and the position adjustment defaults to position_stacked."
test <- expand.grid(Param = LETTERS[1:3], x = 1:5)
test$y <- test$x
# Ok
p <- ggplot(test)
p <- p + geom_area(aes(x = x, y = y, group = Param, fill = Param), alpha = 0.3)
p
# not ok - initial idea
p <- ggplot(test)
p <- p + geom_ribbon(aes(x = x, ymin = 0, ymax = y, group = Param, fill = Param), alpha = 0.3, position = position_stack())
p
Further, how can I look at the code of functions written the way the geom_XXX functions are?
My usual approach gives the following, which is not very useful:
> geom_ribbon
function (mapping = NULL, data = NULL, stat = "identity", position = "identity",
na.rm = FALSE, ...)
GeomRibbon$new(mapping = mapping, data = data, stat = stat, position = position,
na.rm = na.rm, ...)
Thanks for your help
Regards
Pascal
You just didn't map a variable to y in your geom_ribbon call. Adding y = y causes it to work for me. In general, geom_ribbon doesn't require a y aesthetic, but I believe it does in the case of stacking. I presume there's a well-thought out reasoning for why that is, but you never know...
Also, all the source code for ggplot2 is on github.
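Besides browsing the repository, the code can usually be inspected from the console. The sketch below assumes a ggplot2 release built on ggproto (the printout in the question comes from an older, proto-based version), and the exact method names on the object may differ between versions.

```r
library(ggplot2)

# The constructor is a thin wrapper that builds a layer
print(geom_ribbon)

# The actual data preparation and drawing code lives on the ggproto object
print(class(GeomRibbon))
print(ls(GeomRibbon))          # members defined directly on GeomRibbon
print(GeomRibbon$draw_group)   # drawing method, if present in this version
```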
I am new to R and am trying to plot 3 histograms onto the same graph.
Everything worked fine, but my problem is that you don't see where 2 histograms overlap - they look rather cut off.
When I make density plots, it looks perfect: each curve is surrounded by a black frame line, and colours look different where curves overlap.
Can someone tell me if something similar can be achieved with the histograms in the 1st picture? This is the code I'm using:
lowf0 <-read.csv (....)
mediumf0 <-read.csv (....)
highf0 <-read.csv(....)
lowf0$utt<-'low f0'
mediumf0$utt<-'medium f0'
highf0$utt<-'high f0'
histogram<-rbind(lowf0,mediumf0,highf0)
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
Using @joran's sample data,
ggplot(dat, aes(x=xx, fill=yy)) + geom_histogram(alpha=0.2, position="identity")
Note that the default position of geom_histogram is "stack".
See "position adjustment" on this page:
geom_histogram documentation
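The difference between the two position settings can be checked on the computed bars themselves. The following sketch rebuilds sample data in the same shape as @joran's (the seed and the bins value are assumptions) and compares where the bars start: with "identity" every bar starts at zero, while with the default "stack" some bars sit on top of others.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(
  xx = c(runif(100, 20, 50), runif(100, 40, 80), runif(100, 0, 30)),
  yy = rep(letters[1:3], each = 100)
)

p_stack <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(alpha = 0.2, bins = 30)  # default position is "stack"

p_identity <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(alpha = 0.2, bins = 30, position = "identity")

# Lower edge of every bar in each version
stack_ymin    <- layer_data(p_stack)$ymin
identity_ymin <- layer_data(p_identity)$ymin
```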
Your current code:
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
is telling ggplot to construct one histogram using all the values in f0 and then color the bars of this single histogram according to the variable utt.
What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:
ggplot(histogram, aes(f0)) +
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)
Here's a concrete example with some output:
dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))
ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)
which produces something like this:
Edited to fix typos; you wanted fill, not colour.
While only a few lines are required to plot multiple/overlapping histograms in ggplot2, the results aren't always satisfactory. There needs to be proper use of borders and coloring to ensure the eye can differentiate between histograms.
The following functions balance border colors, opacities, and superimposed density plots to enable the viewer to differentiate among distributions.
Single histogram:
plot_histogram <- function(df, feature) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)))) +
geom_histogram(aes(y = ..density..), alpha=0.7, fill="#33AADE", color="black") +
geom_density(alpha=0.3, fill="red") +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
print(plt)
}
Multiple histogram:
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
Usage:
Simply pass your data frame into the above functions along with desired arguments:
plot_histogram(iris, 'Sepal.Width')
plot_multi_histogram(iris, 'Sepal.Width', 'Species')
The extra parameter in plot_multi_histogram is the name of the column containing the category labels.
We can see this more dramatically by creating a dataframe with many different distribution means:
a <-data.frame(n=rnorm(1000, mean = 1), category=rep('A', 1000))
b <-data.frame(n=rnorm(1000, mean = 2), category=rep('B', 1000))
c <-data.frame(n=rnorm(1000, mean = 3), category=rep('C', 1000))
d <-data.frame(n=rnorm(1000, mean = 4), category=rep('D', 1000))
e <-data.frame(n=rnorm(1000, mean = 5), category=rep('E', 1000))
f <-data.frame(n=rnorm(1000, mean = 6), category=rep('F', 1000))
many_distros <- do.call('rbind', list(a,b,c,d,e,f))
Passing data frame in as before (and widening chart using options):
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, 'n', 'category')
To add a separate vertical line for each distribution:
plot_multi_histogram <- function(df, feature, label_column, means) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(xintercept=means, color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
The only change over the previous plot_multi_histogram function is the addition of means to the parameters, and changing the geom_vline line to accept multiple values.
Usage:
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, "n", 'category', c(1, 2, 3, 4, 5, 6))
Result:
Since I set the means explicitly when building many_distros, I can simply pass them in. Alternatively, you can calculate them inside the function and use them that way.
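One way to calculate the means rather than typing them in is sketched below. group_means() is a hypothetical helper name, not part of the functions above. It takes the same column-name strings as plot_multi_histogram and returns one mean per category.

```r
# Hypothetical helper: per-category means of a column, looked up by column name.
group_means <- function(df, feature, label_column) {
  groups <- split(df[[feature]], df[[label_column]])  # one numeric vector per category
  vapply(groups, mean, numeric(1))                    # named vector of means
}

means <- group_means(iris, "Sepal.Width", "Species")
```

The result could then be passed along as the means argument, for example plot_multi_histogram(iris, 'Sepal.Width', 'Species', unname(means)).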