I would like to put perpendicular lines at the ends of the whiskers like the boxplot function automatically gives.
As hinted but not implemented by @Roland, you can use stat_boxplot to implement this. The trick is to call stat_boxplot twice and set the geom to errorbar for one of the calls.
Note that, as R uses a pen and paper approach, it is advisable to draw the error bars first and then draw the traditional boxplot over the top.
Using @Roland's dummy data df
ggplot(df, aes(x=cond, y = value)) +
stat_boxplot(geom ='errorbar') +
geom_boxplot() # shorthand for stat_boxplot(geom='boxplot')
The help for stat_boxplot (?stat_boxplot) details the various values computed and saved in a data.frame.
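One way to look at those computed values for an actual plot is layer_data(), which returns the data frame a layer ends up with after its stat has run. A minimal sketch, assuming two-group dummy data like the df used in this thread (the seed and the object names p and box_stats are only illustrative):

```r
library(ggplot2)

# Dummy data in the same shape as the thread's df (values are an assumption)
set.seed(42)
df <- data.frame(
  cond  = factor(rep(c("A", "B"), each = 500)),
  value = c(rnorm(500, mean = 1, sd = 0.2), rnorm(500, mean = 1.5, sd = 0.1))
)

p <- ggplot(df, aes(x = cond, y = value)) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot()

# Layer 2 is the boxplot layer; its data holds the five summary values per group
box_stats <- layer_data(p, 2)
box_stats[, c("x", "ymin", "lower", "middle", "upper", "ymax")]
```

ymin and ymax are the whisker ends, which is what the errorbar layer draws.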
To resize the whisker end lines, we can pass the argument width = 0.5 to stat_boxplot:
set.seed(42)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 500)),
value = c(rnorm(500, mean = 1, sd = 0.2),
rnorm(500, mean = 1.5, sd = 0.1)))
library(ggplot2)
ggplot(df, aes(x = cond, y = value)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot()
It might be possible to use stat_boxplot to calculate the whisker ends, but I am not enough of a ggplot2 wizard, so I use the base function for that.
set.seed(42)
df <- data.frame(cond = factor( rep(c("A","B"), each=500) ),
value = c(rnorm(500,mean=1,sd=0.2),rnorm(500, mean=1.5,sd=0.1)))
whisk <- function(df,cond_col=1,val_col=2) {
require(reshape2)
condname <- names(df)[cond_col]
names(df)[cond_col] <- "cond"
names(df)[val_col] <- "value"
b <- boxplot(value~cond,data=df,plot=FALSE)
df2 <- cbind(as.data.frame(b$stats),c("min","lq","m","uq","max"))
names(df2) <- c(levels(df$cond),"pos")
df2 <- melt(df2,id="pos",variable.name="cond")
df2 <- dcast(df2,cond~pos)
names(df2)[1] <- condname
df2
}
library(ggplot2)
plot1 <- ggplot(df, aes(x=cond))
plot1 <- plot1 + geom_errorbar(aes(ymin=min,ymax=max),data=whisk(df),width = 0.5)
plot1 <- plot1 + geom_boxplot(aes(y=value))
plot1
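If you would rather avoid the reshape2 dependency, the whisker ends can also be collected with boxplot.stats() alone. This is only a sketch of the same idea, not a replacement for the function above; whisk_df, lo and hi are made-up names, and boxplot.stats() uses hinges that can differ very slightly from the quartiles ggplot2 computes.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(
  cond  = factor(rep(c("A", "B"), each = 500)),
  value = c(rnorm(500, mean = 1, sd = 0.2), rnorm(500, mean = 1.5, sd = 0.1))
)

# Whisker ends per group: elements 1 and 5 of boxplot.stats()$stats
ends <- lapply(split(df$value, df$cond),
               function(v) boxplot.stats(v)$stats[c(1, 5)])

# One row per group, with the lower and upper whisker end
whisk_df <- data.frame(
  cond = factor(names(ends), levels = levels(df$cond)),
  lo   = vapply(ends, `[`, numeric(1), 1),
  hi   = vapply(ends, `[`, numeric(1), 2)
)

# Error bars first, boxes drawn on top
p <- ggplot(df, aes(x = cond)) +
  geom_errorbar(aes(ymin = lo, ymax = hi), data = whisk_df, width = 0.5) +
  geom_boxplot(aes(y = value))
```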
This should seem relatively straightforward but I can't find an argument which would allow me to do this and I've searched Google and Stack for an answer.
Sample code:
library(ggplot2)
library(plotly)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)), rating = c(rnorm(200),rnorm(200, mean=.8)))
p <- ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot()
p <- ggplotly(p)
This outputs the first graph; I would like something like the second.
I tried including colour=cond but that gets rid of the median.
Two possible hacks for consideration, using the same dataset as Marco Sandri's answer.
Hack 1. If you don't really need it to work in plotly, just static ggplot image:
ggplot(dat, aes(x=cond, y=rating, fill=cond)) +
geom_boxplot() +
geom_boxplot(aes(color = cond),
fatten = NULL, fill = NA, coef = 0, outlier.alpha = 0,
show.legend = FALSE)
This overlays the original boxplot with a version that's essentially an outline of the outer box, hiding the median (fatten = NULL), fill colour (fill = NA), whiskers (coef = 0) & outliers (outlier.alpha = 0).
However, it doesn't appear to work well with plotly. I've tested it with the dev version of ggplot2 (as recommended by plotly) to no avail. See output below:
Hack 2. If you need it to work in plotly:
library(dplyr) # provides %>%, group_by(), mutate() and case_when()
ggplot(dat %>%
group_by(cond) %>%
mutate(rating.IQR = case_when(rating <= quantile(rating, 0.3) ~ quantile(rating, 0.25),
TRUE ~ quantile(rating, 0.75))),
aes(x=cond, y=rating, fill=cond)) +
geom_boxplot() +
geom_boxplot(aes(color = cond, y = rating.IQR),
fatten = NULL, fill = NA)
(ggplot output is same as above)
plotly doesn't seem to understand the coef = 0 & outlier.alpha = 0 arguments, so this hack creates a modified version of the y variable, such that everything at or below P30 is set to P25, and everything above is set to P75. This creates a boxplot with no outliers and no whiskers, where the median sits together with the upper box limit at P75.
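To see what the recoding does to the data, here is a small base-R sketch of the same transformation using ave() instead of dplyr. recode_iqr is a made-up helper name, and the data is the seeded dataset from Marco Sandri's answer.

```r
set.seed(1)
dat <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                  rating = c(rnorm(200), rnorm(200, mean = 0.8)))

# Within one group: values at or below P30 become P25, all others become P75
recode_iqr <- function(r) {
  q <- quantile(r, c(0.25, 0.30, 0.75), names = FALSE)
  ifelse(r <= q[2], q[1], q[3])
}

# Apply the recoding separately within each level of cond
dat$rating.IQR <- ave(dat$rating, dat$cond, FUN = recode_iqr)

# Number of distinct recoded values per group
tapply(dat$rating.IQR, dat$cond, function(v) length(unique(v)))
```

Because only two distinct values are left per group, the overlaid boxplot has nothing to draw as whiskers or outliers.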
It's more cumbersome, but it works in plotly:
Here is an inelegant solution based on grobs:
set.seed(1)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
rating = c(rnorm(200),rnorm(200, mean=.8)))
library(ggplot2)
library(plotly)
p <- ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot()
# Generate a ggplot2 plot grob
g <- ggplotGrob(p)
# The first box-and-whiskers grob
box_whisk1 <- g$grobs[[6]]$children[[3]]$children[[1]]
pos.box1 <- which(grepl("geom_crossbar",names(box_whisk1$children)))
g$grobs[[6]]$children[[3]]$children[[1]]$children[[pos.box1]]$children[[1]]$gp$col <-
g$grobs[[6]]$children[[3]]$children[[1]]$children[[pos.box1]]$children[[1]]$gp$fill
# The second box-and-whiskers grob
box_whisk2 <- g$grobs[[6]]$children[[3]]$children[[2]]
pos.box2 <- which(grepl("geom_crossbar",names(box_whisk2$children)))
g$grobs[[6]]$children[[3]]$children[[2]]$children[[pos.box2]]$children[[1]]$gp$col <-
g$grobs[[6]]$children[[3]]$children[[2]]$children[[pos.box2]]$children[[1]]$gp$fill
library(grid)
grid.draw(g)
P.S. To my knowledge, the above code cannot be used for generating plotly graphs.
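The hard-coded position g$grobs[[6]] depends on the ggplot2 version and on the plot layout. A slightly more defensive sketch looks the panel up by name in the gtable layout; this assumes a single, non-faceted panel, and the nesting below the panel still has to be inspected by hand.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                  rating = c(rnorm(200), rnorm(200, mean = 0.8)))
p <- ggplot(dat, aes(x = cond, y = rating, fill = cond)) + geom_boxplot()
g <- ggplotGrob(p)

# Locate the panel by its layout name rather than by a fixed position
panel_idx <- which(grepl("^panel", g$layout$name))
panel <- g$grobs[[panel_idx]]

# Inspect the children to find the grob that holds the boxplot layer
names(panel$children)
```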
I would like to make the y-axis of a bar chart symmetric around zero, so that it's easier to see whether the positive or the negative changes are bigger; with the default limits the comparison is a bit distorted. I have working code, but it's a bit clumsy, and it would be great if I could do this directly in the first ggplot() call, that is, tell ylim to be symmetric.
library(ggplot2)
library(data.table)
set.seed(123)
my.plot <- ggplot( data = data.table(x = 1:10,
y = rnorm(10,0, 2)), aes(x=x, y=y)) +
geom_bar(stat="identity")
rangepull <- layer_scales(my.plot)$y
newrange <- max(abs(rangepull$range$range))
my.plot +
ylim(newrange*-1, newrange)
What about this:
library(ggplot2)
library(data.table)
set.seed(123)
my.data = data.table(x = 1:10, y = rnorm(10,0, 2))
my.plot <- ggplot(data = my.data)+aes(x=x, y=y) +
geom_bar(stat="identity")+ylim(-max(abs(my.data$y)), max(abs(my.data$y)))
my.plot
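Another option that keeps everything inside the plot call is to pass a function as the limits argument of scale_y_continuous(). The function receives the automatic limits and returns new ones. This is a sketch that assumes a ggplot2 version where limits accepts a function; symmetric_limits is a made-up helper name.

```r
library(ggplot2)

set.seed(123)
d <- data.frame(x = 1:10, y = rnorm(10, 0, 2))

# Hypothetical helper: mirror the larger absolute limit around zero
symmetric_limits <- function(lims) c(-1, 1) * max(abs(lims))

my.plot <- ggplot(d, aes(x = x, y = y)) +
  geom_col() +
  scale_y_continuous(limits = symmetric_limits)
```

This way the limits follow the data automatically, without computing the range beforehand.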
You may want to consider using ceiling:
set.seed(123)
library(ggplot2)
library(data.table)
dT <- data.table(x = 1:10, y = rnorm(10,0, 2))
my.plot <- ggplot(dT, aes(x=x, y=y)) +
geom_bar(stat="identity") +
ylim(-ceiling(max(abs(dT$y))), ceiling(max(abs(dT$y))))
This will give you:
> my.plot
I'm trying to plot something like this, where alpha is intended to be a mean of dat$c per bin:
library(ggplot2)
set.seed(1)
dat <- data.frame(a = rnorm(1000), b = rnorm(1000), c = 1/rnorm(1000),
d = as.factor(sample(c(0, 1), 1000, replace=TRUE)))
# plot
p <- ggplot(dat, environment = environment()) +
geom_bin2d(aes(x=a, y=b, alpha=c, fill=d),
binwidth = c(1.0/10, 1.0/10))
but it doesn't look like alpha is correct. Please help
I'm not sure what you're expecting to see, but this will calculate mean(dat$c) in each bin and plot the result.
library(ggplot2)
brks <- seq(-5,5,0.1)
lbls <- brks[-1]-0.05
gg.df <- aggregate(c~cut(a, brks, lbls)+cut(b,brks, lbls)+d,dat,FUN=mean)
names(gg.df)[1:2] <- c("a","b")
gg.df$a <- as.numeric(as.character(gg.df$a))
gg.df$b <- as.numeric(as.character(gg.df$b))
ggplot(gg.df, aes(x=a, y=b, alpha=c, fill=d)) + geom_raster() + coord_fixed()
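To make the binning step easier to follow, here is the same cut() plus aggregate() pattern on a tiny made-up data frame (toy and binned are illustrative names, and only one dimension is binned):

```r
# Three points: two fall in the bin (0, 0.1], one in (0.1, 0.2]
toy <- data.frame(a = c(0.01, 0.04, 0.12), c = c(2, 4, 10))

brks <- seq(0, 0.2, 0.1)   # bin edges
lbls <- brks[-1] - 0.05    # bin midpoints, used as labels

# Mean of c within each bin of a
binned <- aggregate(c ~ cut(a, brks, lbls), toy, FUN = mean)
names(binned)[1] <- "a"

# cut() returns a factor, so convert the midpoint labels back to numbers
binned$a <- as.numeric(as.character(binned$a))
binned
```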
Edit: Response to OP's comment.
You could try:
dat$c <- with(dat,1/(a^2+b^2))
This makes dat$c inversely proportional to the squared radius (the squared distance from (0,0) to the point). Now running the same code as above:
gg.df <- aggregate(c~cut(a, brks, lbls)+cut(b,brks, lbls)+d,dat,FUN=mean)
names(gg.df)[1:2] <- c("a","b")
gg.df$a <- as.numeric(as.character(gg.df$a))
gg.df$b <- as.numeric(as.character(gg.df$b))
ggplot(gg.df, aes(x=a, y=b, alpha=c, fill=d)) + geom_raster() + coord_fixed() +
scale_alpha_continuous(trans="log",breaks=10^(0:3))
Produces this, as expected: a plot having tiles with higher alpha (less transparent) near the center.
I needed to use a log scale for alpha because the values range over several orders of magnitude.
How would I ignore outliers in a ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size=0); I want them to be ignored so that the y axis scales to show the 1st/3rd quartile. My outliers are causing the "box" to shrink so small it's practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
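The difference between the two approaches can be checked on the computed statistics. In this sketch (made-up data with one extreme value, illustrative object names), scale limits drop the out-of-range point before the boxplot statistics are computed, while coord_cartesian only zooms:

```r
library(ggplot2)

set.seed(1)
d <- data.frame(f = factor(1), y = c(rnorm(100), 50))
lims <- c(-3, 3)

# Scale limits: the point at 50 is removed before the stat runs
p_scale <- ggplot(d, aes(f, y)) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = lims)

# Coord limits: the stat sees all the data, the view is cropped afterwards
p_coord <- ggplot(d, aes(f, y)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = lims)

# Medians as computed by each plot (the scale version warns about removed rows)
med_scale <- suppressWarnings(layer_data(p_scale)$middle)
med_coord <- layer_data(p_coord)$middle
c(scale = med_scale, coord = med_coord, full_data = median(d$y))
```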
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
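With several groups on the x axis, the same idea can be applied per group and the overall range of the whiskers used for the zoom. A sketch on made-up skewed data (dfr, whiskers and ylim_all are illustrative names):

```r
library(ggplot2)

set.seed(1)
dfr <- data.frame(y = exp(rlnorm(1000)), f = gl(2, 500))

# Lower and upper whisker for each group
whiskers <- tapply(dfr$y, dfr$f, function(v) boxplot.stats(v)$stats[c(1, 5)])

# Zoom range that covers the whiskers of every group
ylim_all <- range(unlist(whiskers))

p <- ggplot(dfr, aes(f, y)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = ylim_all)
```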
I had the same problem and precomputed the values for Q1, the median, Q3, ymin and ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
run a first pass to learn what the bounds are, e.g. cut off at a given percentile, or N standard deviations above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
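A minimal sketch of those two passes, using percentile bounds. The winsorize helper and the 10%/90% cut-offs are illustrative choices, and the data is the example vector from the question.

```r
# Hypothetical helper: clamp values to the given percentile bounds
winsorize <- function(x, probs = c(0.05, 0.95)) {
  # First pass: learn the bounds
  bounds <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # Second pass: set values beyond a bound to the value of that bound
  pmin(pmax(x, bounds[1]), bounds[2])
}

y <- c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
y_w <- winsorize(y, probs = c(0.1, 0.9))
range(y_w)
```

The winsorized y_w can then be plotted in place of y.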
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. The default value for coef is 1.5 (i.e. by default the whiskers reach at most 1.5 times the IQR beyond the box).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
The "coef" option of the geom_boxplot function lets you change the outlier cutoff in terms of interquartile ranges. This option is documented for the function stat_boxplot. To deactivate outliers (in other words, to treat them as regular data), you can specify a very high cutoff value instead of the default of 1.5:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)
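The effect of coef can be checked without plotting, using base R's boxplot.stats(), which takes the same kind of cutoff. This is only an illustration; ggplot2 computes its quartiles slightly differently from the hinges used here.

```r
set.seed(1)
y <- c(-10, rnorm(100), 10)

# Default cutoff: points beyond 1.5 * IQR from the box are flagged as outliers
out_default <- boxplot.stats(y)$out

# Very large cutoff: nothing is flagged, and the whiskers reach the extremes
stats_huge <- boxplot.stats(y, coef = 1e30)
out_huge   <- stats_huge$out
whisk_huge <- stats_huge$stats[c(1, 5)]
```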
I'm trying to use ggplot2 to create and label a scatterplot. The variables that I am plotting are both scaled such that the horizontal and the vertical axis are plotted in units of standard deviation (1, 2, 3, 4, etc. from the mean). What I would like to be able to do is label ONLY those elements that are beyond a certain limit of standard deviations from the mean. Ideally, this labeling would be based on another column of data.
Is there a way to do this?
I've looked through the online manual, but I haven't been able to find anything about defining labels for plotted data.
Help is appreciated!
Thanks!
BEB
Use subsetting:
library(ggplot2)
x <- data.frame(a=1:10, b=rnorm(10))
x$lab <- letters[1:10]
ggplot(data=x, aes(a, b, label=lab)) +
geom_point() +
geom_text(data = subset(x, abs(b) > 0.2), vjust=0)
The labeling can be done in the following way:
library("ggplot2")
x <- data.frame(a=1:10, b=rnorm(10))
x$lab <- rep("", 10) # create empty labels
x$lab[c(1,3,4,5)] <- LETTERS[1:4] # some labels
ggplot(data=x, aes(x=a, y=b, label=lab)) + geom_point() + geom_text(vjust=0)
Subsetting outside of the ggplot function:
library(ggplot2)
set.seed(1)
x <- data.frame(a = 1:10, b = rnorm(10))
x$lab <- letters[1:10]
x$lab[!(abs(x$b) > 0.5)] <- NA
ggplot(data = x, aes(a, b, label = lab)) +
geom_point() +
geom_text(vjust = 0)
Using qplot:
qplot(a, b, data = x, label = lab, geom = c('point','text'))
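Closer to the original question (standardised variables, labels taken from another column), the label column can be blanked for every point inside the limit with ifelse(). A sketch with made-up data; the name column and the 1.5 SD threshold are assumptions.

```r
library(ggplot2)

set.seed(1)
x <- data.frame(
  a    = as.numeric(scale(rnorm(50))),   # in units of standard deviation
  b    = as.numeric(scale(rnorm(50))),
  name = paste0("id", 1:50),             # column that supplies the labels
  stringsAsFactors = FALSE
)

threshold <- 1.5
# TRUE for points beyond the threshold on either axis
far_out <- abs(x$a) > threshold | abs(x$b) > threshold

# Keep the label only for those points, empty string otherwise
x$lab <- ifelse(far_out, x$name, "")

p <- ggplot(x, aes(a, b)) +
  geom_point() +
  geom_text(aes(label = lab), vjust = 0)
```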