Illustrate standard deviation in histogram - r

Consider the following simple example:
# E. Musk in Grunheide
# generate random numbers
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)
# empirical sd
sd(randomNumbers)
#> [1] 10.34369
# histogram
hist(randomNumbers, probability = TRUE, main = "", breaks = 50)
# just for illustration purposes
# empirical density
lines(density(randomNumbers), col = 'black', lwd = 2)
# theoretical density
curve(dnorm(x, mean = 10, sd = 10), col = "blue", lwd = 2, add = TRUE)
Is there a nice way to illustrate the empirical standard deviation (sd) in the histogram by colour?
E.g. representing the inner bars by a different color, or indicating the range of the sd by an interval, i.e., [mean +/- sd], on the x-axis?
If ggplot2 provides an easy solution, that would also be much appreciated.

This is a ggplot solution similar to Benson's answer, except that we precompute the histogram and use geom_col, so we don't get any of the unwelcome stacking at the sd boundary:
library(ggplot2)
# E. Musk in Grunheide
# generate random numbers
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)
# precompute the histogram so that each bar can be coloured individually
h <- hist(randomNumbers, breaks = 50, plot = FALSE)
lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)
df <- data.frame(x = h$mids, y = h$density,
                 fill = h$mids > lower & h$mids < upper)
ggplot(df) +
  geom_col(aes(x, y, fill = fill), width = 1, color = 'black') +
  geom_density(data = data.frame(x = randomNumbers),
               aes(x = x, color = 'Actual density'),
               key_glyph = 'path') +
  geom_function(fun = function(x) {
    dnorm(x, mean = mean(randomNumbers), sd = sd(randomNumbers)) },
    aes(color = 'theoretical density')) +
  scale_fill_manual(values = c(`TRUE` = '#FF374A', 'FALSE' = 'gray'),
                    name = 'within 1 SD') +
  scale_color_manual(values = c('black', 'blue'), name = 'Density lines')
Here is a ggplot solution. First calculate the mean and sd and store them in separate variables. Then use an ifelse statement to categorise the values into "Within range" and "Outside range", and fill the two groups with different colours.
The blue line is the normal distribution stated in your question, and the black line is the density estimate of the data shown in the histogram.
library(ggplot2)
# generate random numbers
randomNumbers <- rnorm(n=1000, mean=10, sd=10)
randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)
ggplot(data.frame(randomNumbers = randomNumbers), aes(randomNumbers)) +
  geom_histogram(aes(
    fill = ifelse(
      randomNumbers > randomNumbers_mean + randomNumbers_sd |
        randomNumbers < randomNumbers_mean - randomNumbers_sd,
      "Outside range",
      "Within range"
    )),
    binwidth = 1, col = "gray") +
  geom_density(aes(y = ..count..)) +
  stat_function(fun = function(x) dnorm(x, mean = 10, sd = 10) * 1000,
                color = "blue") +
  labs(fill = "Data")
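data.frame(rand = randomNumbers,
           cut = {
             sd <- sd(randomNumbers)
             mn <- mean(randomNumbers)
             cut(randomNumbers, c(-Inf, mn - sd, mn + sd, Inf))
           }) |>
  ggplot(aes(x = rand, fill = cut)) +
  geom_histogram()  # assumption: the snippet is cut off after the `+`; a histogram layer is the likely continuation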
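The question also asks for a base-graphics option. A minimal sketch, assuming it is acceptable to classify each bar by its midpoint (the same rule the geom_col answer uses): compute the histogram without drawing it, build a colour vector with one entry per bar, and pass that to plot(). The seed, the colour names and the variable names `m`, `s`, `inside` and `bar_cols` are illustrative choices, not part of the original post.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)
m <- mean(randomNumbers)
s <- sd(randomNumbers)

# compute the bins without drawing, then colour each bar by its midpoint
h <- hist(randomNumbers, breaks = 50, plot = FALSE)
inside <- h$mids > m - s & h$mids < m + s
bar_cols <- ifelse(inside, "tomato", "grey80")

# plot() on a histogram object accepts one colour per bar
plot(h, freq = FALSE, col = bar_cols, main = "", xlab = "randomNumbers")
# mark the interval [mean - sd, mean + sd] on the x-axis
abline(v = c(m - s, m + s), lty = 2)
segments(m - s, 0, m + s, 0, lwd = 4, col = "tomato")
# empirical and theoretical densities, as in the question
lines(density(randomNumbers), lwd = 2)
curve(dnorm(x, mean = 10, sd = 10), col = "blue", lwd = 2, add = TRUE)
```

Bars that straddle the boundary are coloured as a whole, so the coloured region only approximates the interval; the dashed lines show its exact limits.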


Given the probability find y for gamma distribution

Looking for some assistance using R. I know that there is invgamma, but I am not sure whether it will work or how to use it correctly. If X has a Gamma distribution with shape parameter r = 3 and scale parameter ρ = 6, is there a way to calculate y such that Prob(X < y) = .95? Thanks!
In R there are four families of functions for each distribution:
r[name of the distribution]- Random number generator
q[name of the distribution]- Quantile function
d[name of the distribution]- Density function
p[name of the distribution]- Distribution function
Since you have the probability, you need to use qgamma.
P(X < x) = 0.95
shape <- 3
rate <- 6
x_95 <- qgamma(p = 0.95, shape = shape, rate = rate)
x_95
[1] 1.049299
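One thing worth checking: the question calls ρ = 6 a scale parameter, while the code above passes it as a rate. In R's gamma functions scale = 1 / rate, so the two readings give different quantiles. The sketch below computes both and uses pgamma and rgamma to confirm each result; which reading is intended is an assumption the asker would need to settle, and the names `y_rate` and `y_scale` are illustrative.

```r
shape <- 3
p <- 0.95

# reading the 6 as a rate parameter (as in the answer above)
y_rate <- qgamma(p, shape = shape, rate = 6)
# reading the 6 as a scale parameter (scale = 1 / rate)
y_scale <- qgamma(p, shape = shape, scale = 6)

# pgamma is the inverse of qgamma, so both should give back 0.95
pgamma(y_rate, shape = shape, rate = 6)
pgamma(y_scale, shape = shape, scale = 6)

# empirical check with the random number generator
set.seed(1)  # arbitrary seed
mean(rgamma(1e5, shape = shape, scale = 6) < y_scale)
```

The last line should come out close to 0.95, which is a quick way to verify any quantile calculation against simulated draws.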
Plot code:
library(dplyr)
library(ggplot2)
df <-
  data.frame(
    x = seq(0, 3, l = 1000)
  ) %>%
  mutate(y = dgamma(x = x, shape = shape, rate = rate))
ggplot(df, aes(x, y)) +
  geom_function(fun = dgamma, args = list(shape = shape, rate = rate)) +
  geom_vline(xintercept = x_95, linetype = "dashed") +
  scale_x_continuous(breaks = x_95) +
  geom_area(data = df %>% filter(x <= x_95),
            alpha = .7, fill = "chocolate2") +
  scale_y_continuous(expand = c(0, 0))


Draw line through 2d density plot

I have a large dataset of gene expression from ~10,000 patient samples (TCGA), and I'm plotting a predicted expression value (x) and the actual observed value (y) of a certain gene signature. For my downstream analysis, I need to draw a precise line through the plot and calculate different parameters in samples above/below the line.
No matter how I draw a line through the data (geom_smooth(method = 'lm', 'glm', 'gam', or 'loess')), the line always seems imperfect - it doesn't cut through the data to my liking (red line is lm in figure).
After playing around for a while, I realized that the 2d kernel density lines (geom_density2d) actually do a good job of showing the slope/trends of my data, so I manually drew a line that kind of cuts through the density lines (black line in figure).
My question: how can I automatically draw a line that cuts through the kernel density lines, as for the black line in the figure? (Rather than manually playing with different intercepts and slopes till something looks good).
The best approach I can think of is to somehow calculate intercept and slope of the longest diameter for each of the kernel lines, take an average of all those intercepts and slopes and plot that line, but that's a bit out of my league. Maybe someone here has experience with this and can help?
A more hacky approach may be getting the x,y coords of each kernel density line from ggplot_build, and going from there, but it feels too hacky (and is also out of my league).
EDIT: Changed a few details to make the figure/analysis easier. (Density lines are smoother now).
library(MASS)     # mvrnorm
library(ggplot2)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1] # marginally normal (mu=0, variance=2)
y <- data[, 2] # marginally normal (mu=0, variance=2)
test.df <- data.frame(x = x, y = y)
lm(y ~ x, test.df)
ggplot(test.df, aes(x, y)) +
geom_point(color = 'grey') +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) + ### EDIT: h = c(2,2)
geom_smooth(method = "glm", se = F, lwd = 1, color = 'red') +
geom_abline(intercept = 0, slope = 0.7, lwd = 1, col = 'black') ## EDIT: slope to 0.7
I generally agree with @Hack-R.
However, it was kind of a fun problem and looking into ggplot_build is not such a big deal.
library(dplyr)  # %>%, select, mutate
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2))
#basic version of your plot
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),] %>%
select(x,y) # extracts the x/y coordinates of the points on the largest ellipse from your 2d-density contour
Now this answer helped me to find the points on this ellipse which are furthest apart.
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2)) #extra column specifying the distance of each point to the mean of those points
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
# gives the coordinates of the point farthest away from the mean point
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
#now this looks which of the points is the farthest from the point farthest from the mean point :D
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) +
# geom_segment using the coordinates of the points farthest apart
geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
xend = coord_fff['x'], yend = coord_fff['y']))) +
geom_smooth(method = "glm", se = F, lwd = 1, color = 'red') +
# as per your request with your geom_smooth line
coord_equal()
coord_equal is super important, because otherwise you will get super weird results - it messed up my brain too. If the coordinates are not set equal, the line will seemingly not pass through the point furthest from the mean...
I leave it to you to build this into a function in order to automate it. I'll also leave it to you to calculate the y-intercept and slope from the two points.
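For the last step, a minimal sketch of getting slope and intercept from two points. The helper name `line_through` is hypothetical; it assumes each point is a vector whose first element is x and second is y (as `coord_farthest` and `coord_fff` are above), and that the two points do not share the same x value.

```r
# slope and intercept of the straight line through two (x, y) points
line_through <- function(p1, p2) {
  p1 <- unname(p1)
  p2 <- unname(p2)
  slope <- (p2[2] - p1[2]) / (p2[1] - p1[1])
  intercept <- p1[2] - slope * p1[1]
  c(intercept = intercept, slope = slope)
}

# made-up points, just to show the call
ab <- line_through(c(x = 0, y = 1), c(x = 2, y = 5))
ab
```

With the objects from the answer this would be `line_through(coord_farthest, coord_fff)`, and the result can be passed to `geom_abline(intercept = ab["intercept"], slope = ab["slope"])`.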
Tjebo's approach looked good initially, but on closer inspection it finds the longest distance between two points on a contour. While this is close to what I wanted, it fails when the contour has an irregular shape or when the points along it are sparse. What I really wanted is the longest diameter of a fitted ellipse, i.e. its major axis (twice the semi-major axis). See image below for examples/details.
To find/draw density contours of a specific density/percentage, see:
R - How to find points within specific Contour
To get the longest diameter (major axis) of an ellipse:
For a function that returns intercept and slope (as in OP), see the last piece of code.
The two pieces of code and images below compare Tjebo's approach with my new approach, which is based on the above posts.
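#### Reprex from OP
library(MASS)     # mvrnorm, kde2d
library(cluster)  # ellipsoidhull
library(ggplot2)
library(dplyr)    # loaded after MASS so that select() is dplyr's
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1] # marginally normal (mu=0, variance=2)
y <- data[, 2] # marginally normal (mu=0, variance=2)
test.df <- data.frame(x = x, y = y)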
#### From Tjebo
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 2)
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2)) #extra column specifying the distance of each point to the mean of those points
p_maxring = p_maxring[round(seq(1, nrow(p_maxring), nrow(p_maxring)/23)),] #### Make a small ellipse to illustrate flaws of approach
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
# gives the coordinates of the point farthest away from the mean point
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
#now this looks which of the points is the farthest from the point farthest from the mean point :D
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
farthest_2_points = data.frame(t(cbind(coord_farthest, coord_fff)))
plot(p_maxring[,1:2], asp=1)
lines(farthest_2_points, col = 'blue', lwd = 2)
#### From answer in another post
d = cbind(p_maxring[,1], p_maxring[,2])
r = ellipsoidhull(d)
exy = predict(r) ## the ellipsoid boundary
me = colMeans((exy))
dist2center = sqrt(rowSums((t(t(exy)-me))^2))
max(dist2center) ## length of the semi-major axis
lines(exy[dist2center == max(dist2center),], col = 'red', lwd = 2)
#### The plot here is made from the data in the reprex in OP, but with h = 0.5
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1] # marginally normal (mu=0, variance=2)
y <- data[, 2] # marginally normal (mu=0, variance=2)
test.df <- data.frame(x = x, y = y)
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 0.5) ## NOTE h = 0.5
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2))
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
## h = 0.5
## Given the highly irregular shape of the contours, I will use only the largest contour line (0.95) for drawing the line.
## Thus, average = 1. See function below for details.
ln = long.diam("x", "y", test.df, h = 0.5, average = 1) ## NOTE h = 0.5
ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 0.5) + ## NOTE h = 0.5
geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
xend = coord_fff['x'], yend = coord_fff['y'])), col = 'blue', lwd = 2) +
geom_abline(intercept = ln[1], slope = ln[2], color = 'red', lwd = 2)
Finally, I came up with the following function to deal with all this. Sorry for the lack of comments/clarity.
#### Requires MASS (kde2d), cluster (ellipsoidhull) and stringr (str_sub).
#### This will return the intercept and slope of the longest diameter (major axis).
#### If average = TRUE, it will average the intercept and slope across the different density contours;
#### otherwise `average` is used as an index to pick a single contour.
#### NOTE: `fit` and `coefs` are placeholder names; the original identifiers are missing from the post.
long.diam = function(x, y, df, probs = c(0.95, 0.5, 0.1), average = T, h = 2) {
  fun.df = data.frame(cbind(df[,x], df[,y]))
  colnames(fun.df) = c("x", "y")
  dens = kde2d(fun.df$x, fun.df$y, n = 200, h = h)
  dx <- diff(dens$x[1:2])
  dy <- diff(dens$y[1:2])
  sz <- sort(dens$z)
  c1 <- cumsum(sz) * dx * dy
  ## density level that encloses each requested probability mass
  levels <- sapply(probs, function(x) {
    approx(c1, sz, xout = 1 - x)$y
  })
  names(levels) = paste0("L", str_sub(formatC(probs, 2, format = 'f'), -2))
  #plot(fun.df$x,fun.df$y, asp = 1)
  #contour(dens, levels = levels, labels=probs, add=T, col = c('red', 'blue', 'green'), lwd = 2)
  #contour(dens, add = T, col = 'red', lwd = 2)
  ls <- contourLines(dens, levels = levels)
  names(ls) = names(levels)
  coefs = list()
  for (i in 1:length(ls)) {
    d = cbind(ls[[i]]$x, ls[[i]]$y)
    exy = predict(ellipsoidhull(d)) ## the ellipsoid boundary
    colnames(exy) = c("x", "y")
    me = colMeans((exy)) ## center of the ellipse
    dist2center = sqrt(rowSums((t(t(exy)-me))^2))
    ## the two boundary points farthest from the center define the line
    max.dist = data.frame(exy[rev(order(dist2center))[1:2],])
    fit = lm(max.dist$y ~ max.dist$x)
    coefs[[i]] = c(as.numeric(fit$coefficients[1]), as.numeric(fit$coefficients[2]))
  }
  names(coefs) = names(ls)
  #plot(fun.df$x,fun.df$y, asp = 1)
  #contour(dens, levels = levels, labels=probs, add=T, col = c('red', 'blue', 'green'), lwd = 2)
  #abline(coefs[[1]], col = 'red', lwd = 2)
  #abline(coefs[[2]], col = 'blue', lwd = 2)
  #abline(coefs[[3]], col = 'green', lwd = 2)
  #abline(apply(simplify2array(coefs), 1, mean), col = 'black', lwd = 4)
  if (isTRUE(average)) {
    apply(simplify2array(coefs), 1, mean)
  } else {
    coefs[[average]]
  }
}
Finally, here's the final implementation of the different answers:
samples = 10000
r = 0.9
data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x = data[, 1] # marginally normal (mu=0, variance=2)
y = data[, 2] # marginally normal (mu=0, variance=2)
#plot(x, y)
test.df = data.frame(x = x, y = y)
#### Find furthest two points of contour
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 2, contour = T, h = 2)
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2))
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
#### Find the average intercept and slope of 3 contour lines (0.95, 0.5, 0.1), as in my long.diam function above.
## RED
ln = long.diam("x", "y", test.df)
#### Plot everything. Black line is GLM
ggplot(test.df, aes(x, y)) +
geom_point(color = 'grey') +
geom_density2d(color = 'red', lwd = 1, contour = T, h = 2) +
geom_smooth(method = "glm", se = F, lwd = 1, color = 'black') +
geom_abline(intercept = ln[1], slope = ln[2], col = 'red', lwd = 1) +
geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
xend = coord_fff['x'], yend = coord_fff['y'])), col = 'blue', lwd = 1)
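A different technique that may be worth comparing against the contour-based lines: the first principal axis of the data (the eigenvector of the covariance matrix with the largest eigenvalue, also known as the total least squares or orthogonal regression line). This is not the method used in the answers above, just a sketch of an alternative that needs no contour extraction; the seed and the names `xy`, `eig`, `v`, `centre` are illustrative. For the simulated data in the question, where both variances are equal, the theoretical principal axis has slope 1.

```r
library(MASS)  # mvrnorm

set.seed(42)  # arbitrary seed
r <- 0.9
xy <- mvrnorm(n = 10000, mu = c(0, 0), Sigma = matrix(c(2, r, r, 2), nrow = 2))

# principal axis = eigenvector of the covariance matrix with the largest eigenvalue
eig <- eigen(cov(xy))
v <- eig$vectors[, 1]
slope <- v[2] / v[1]

# the axis passes through the centroid of the data
centre <- colMeans(xy)
intercept <- centre[2] - slope * centre[1]

c(intercept = unname(intercept), slope = unname(slope))
# to draw it: plot(xy, col = "grey", asp = 1); abline(intercept, slope, lwd = 2)
```

Unlike lm(y ~ x), which minimises vertical distances only and therefore gives a flatter line, this axis treats x and y symmetrically. It is sensitive to outliers and assumes one roughly elliptical cloud.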


Plotting the 95% confidence interval for means drawn from a normal distribution

I have drawn 100 samples of size 10 from a normal distribution with a mean of 10 and standard deviation of two. Code below:
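n <- 10
nreps <- 100
sample.mean <- numeric(nreps)
for (i in 1:nreps) {
  sample <- rnorm(n = n, mean = 10, sd = 2)
  sample.mean[i] <- mean(sample)
}
a <- qnorm(0.975) * 2 / sqrt(n)  # half-width of a 95% interval for known sd = 2
ci <- a
plot(sample.mean, 1:100)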
I want to create a graph that looks like this
This is what I currently have
I know I need to interpret the left hand and right hand bounds of each mean and then insert a horizontal line between them. Means that fall outside the 95% confidence interval are supposed to be colored differently than the rest. I am just beginning to learn R, so a helpful walk-through would be very appreciated.
Try it this way:
library(ggplot2)
n <- 10
sd <- 2
n.reps <- 100
my.mean <- 10
alpha <- 0.05
mydata <- matrix(rnorm(n = n.reps * n, mean = my.mean, sd =sd), ncol = n.reps)
sample.means <- apply(mydata, 2, mean)
error <- apply(mydata, 2, function(x) qt(p=1-alpha/2,df=length(x)-1)*sd (x)/sqrt(length(x)))
dfx <- data.frame(sample.means, error, lcl = sample.means-error, ucl = sample.means+error, trial = 1:n.reps)
dfx$miss <- dfx$ucl < my.mean | dfx$lcl > my.mean
ggplot(dfx, aes(x = sample.means, y = trial, xmin = lcl, xmax = ucl, color = miss)) +
  geom_errorbarh() +
  geom_point(pch = 1) +
  geom_vline(aes(xintercept = my.mean), lty = 2, color = "blue") +
  xlab("True Mean in Blue and 95% Confidence Intervals") +
  ylab("Trial") +
  ggtitle(paste("Successful CI's:", 100 * mean(!dfx$miss), "%")) +
  scale_color_manual(values = c("green", "red"))
Or use base graphics:
oldpar <- par(xpd=FALSE)
par(mar=c(8.1, 3.1, 3.1, 4.1))
with(subset(dfx, !miss), plot(sample.means, trial,
xlab = "Sample Mean",
ylab = "Trial",
col = "forestgreen",
xlim=c(min(dfx$lcl), max(dfx$ucl))))
with(subset(dfx, miss), points(sample.means, trial,
col = "red"))
with(subset(dfx, miss), segments(lcl, trial, ucl, trial, col = "red"))
with(subset(dfx, !miss), segments(lcl, trial, ucl, trial, col = "forestgreen"))
abline(v = my.mean, lty = 2, lwd = 2, col = "blue")
legend("bottomright", c("Successful CI", "Miss"), lty = c(1,1), col = c("forestgreen", "red"))
title(main = paste("Successful CI's:", 100*mean(!dfx$miss), "%"))
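The answer above builds t-based intervals from each sample's own standard deviation. Since the question states that the population standard deviation is known (2), a z-based interval is another option, and it is what the asker's `qnorm` attempt was aiming for. A minimal sketch under that assumption; the seed and the names `means`, `half_width`, `lcl`, `ucl`, `miss` are illustrative.

```r
set.seed(123)  # arbitrary seed
n <- 10
sigma <- 2
mu <- 10
n.reps <- 100

# one sample mean per replicate
means <- replicate(n.reps, mean(rnorm(n, mean = mu, sd = sigma)))

# with known sigma the half-width is the same for every sample
half_width <- qnorm(0.975) * sigma / sqrt(n)
lcl <- means - half_width
ucl <- means + half_width

# an interval "misses" when the true mean lies outside it
miss <- ucl < mu | lcl > mu
mean(!miss)  # observed coverage, expected to be near 0.95
```

The vectors `means`, `lcl`, `ucl` and `miss` can be put into a data frame and plotted with either the ggplot2 or the base code above in place of `dfx`.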


Scatter plot in R with large overlap and 3000+ points

I am making a scatter plot in R with ggplot2, comparing the fraction of votes Hillary and Bernie received in the primary against education level. There is a lot of overlap and way too many points. I tried to use transparency so I could see the overlap, but it still looks bad.
library(dplyr)
library(ggplot2)
demanalyze <- function(infocode, n = 1){
  infoname <- filter(infolookup, column_name == infocode)$description
  infocolumn <- as.vector(as.matrix(mydata[infocode]))
  ggplot(mydata) +
    aes(x = infocolumn) +
    ggtitle(infoname) +
    xlab(infoname) +
    ylab("Fraction of votes each candidate received") +
    geom_point(aes(y = sanders_vote_fraction, colour = "Bernie Sanders")) +#, color = alpha("blue",0.02), size=I(1)) +
    stat_smooth(aes(y = sanders_vote_fraction), method = "lm", formula = y ~ poly(x, n), size = 1, color = "darkblue", se = F) +
    geom_point(aes(y = clinton_vote_fraction, colour = "Hillary Clinton")) +#, color = alpha("red",0.02), size=I(1)) +
    stat_smooth(aes(y = clinton_vote_fraction), method = "lm", formula = y ~ poly(x, n), size = 1, color = "darkred", se = F) +
    scale_colour_manual(
      values = c("Bernie Sanders" = alpha("blue",0.02), "Hillary Clinton" = alpha("red",0.02))
    ) +
    guides(colour = guide_legend(override.aes = list(alpha = 1)))
}
What could I change to make the overlap areas look less messy?
A standard way to plot a large number of points in two dimensions is to use 2D density plots.
With a reproducible example:
library(ggplot2)
x1 <- rnorm(1000, mean=10)
x2 <- rnorm(1000, mean=10)
y1 <- rnorm(1000, mean= 5)
y2 <- rnorm(1000, mean = 7)
mydat <- data.frame(xaxis=c(x1, x2), yaxis=c(y1, y2), lab=rep(c("H","B"),each=1000))
##Dots and density plots (kinda messy, but can play with alpha)
p1 <-ggplot(mydat) + geom_point(aes(x=xaxis, y = yaxis, color=lab),alpha=0.4) +
stat_density2d(aes(x=xaxis, y = yaxis, color=lab))
## just density
p2 <-ggplot(mydat) + stat_density2d(aes(x=xaxis, y = yaxis, color=lab))
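Another option along the same lines is to bin the points and colour each bin by its count, with one panel per group so the two clouds do not hide each other. The sketch below uses `geom_bin2d` and `facet_wrap` from ggplot2 on data shaped like the example above; the seed, the number of bins and the name `p3` are arbitrary choices.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed
mydat <- data.frame(
  xaxis = c(rnorm(1000, mean = 10), rnorm(1000, mean = 10)),
  yaxis = c(rnorm(1000, mean = 5), rnorm(1000, mean = 7)),
  lab   = rep(c("H", "B"), each = 1000)
)

# one panel per group; each tile is coloured by the number of points in it
p3 <- ggplot(mydat, aes(x = xaxis, y = yaxis)) +
  geom_bin2d(bins = 30) +
  facet_wrap(~ lab)
p3
```

Compared with contour lines, binned counts show where the data are sparse as well as where they are dense, at the cost of a fixed grid resolution.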


dot plot different indicators, depending on the value, in R

I am visualising odds ratios.
You can find fake data and a plot below
library(lattice)  # provides dotplot()
Data <- data.frame(
odds = sample(0:9),
pvalue = c(0.1,0.04,0.02,0.03,0.2,0.5,0.03,
Y = sample(c("a", "b"), 5, replace = TRUE),
letters = letters[1:10]
dotplot(letters ~ odds| Y, data =Data,
aspect=0.5, layout = c(1,2), ylab=NULL)
I would like to show solid circles for p-values greater than 0.05, and empty circles if values are less than 0.05.
We could specify the pch with values 1/20 for empty/solid circles based on the 'pvalue' column.
dotplot(letters ~ odds| Y, data=Data, aspect= 0.5, layout= c(1,2),
ylab=NULL, pch= ifelse(Data$pvalue > 0.05, 20, 1))
The groups argument together with pch should also do the job:
dotplot(letters ~ odds| Y, data =Data,
aspect=0.5, layout = c(1,2), ylab=NULL,
groups = pvalue <= 0.05,
pch = c(19, 21))
This is easy to create with ggplot2:
Data$significant <- Data$pvalue > 0.05
ggplot(Data, aes(x=odds, y=letters, shape=significant)) +
geom_point(size=4) +
scale_x_continuous(breaks = seq(from=0, to= 8, by=2)) +
scale_shape_manual(values=c(1, 16)) +
ylab("")
