ggplot2: Density plot with mean / 95% confidence interval line

I know that there is a way to draw a density plot with a box plot, as follows. In this plot, the median and quartiles are used.
However, I was not able to find out how to show the mean and confidence interval of each density plot. I am wondering if there is a way to plot a "mean & confidence interval" line on the x-axis (instead of the box plot with median & quartiles) using ggplot2.
I tried to use geom_errorbarh, but failed to generate what I wanted to see.
Here is the R code with mean and 95% confidence interval calculation saved in sum_stat.
library(ggplot2)
library(ggridges)
library(grid)
library(reshape2)
library(ggstance)
library(dplyr)
# Generating the dataset
x <- data.frame(v1=rnorm(5000, mean = -0.02, sd = 0.022),
v2=rnorm(5000, mean = 0.02, sd = 0.022),
v3=rnorm(5000, mean = 0.04, sd = 0.022))
colnames(x) <- c("A", "B", "C")
# Summary statistics
mean_vec <- colMeans(x)
sd_vec <- apply(x, 2, sd)
n <- nrow(x)
error <- qnorm(0.975)*sd_vec/sqrt(n)
left <- mean_vec - error
right <- mean_vec + error
sum_stat <- cbind(left, mean_vec, right)
# Melting the data
data <- melt(x)
# head(data); str(data)
ggplot(data, aes(x = value, y = variable)) +
geom_density_ridges(aes(fill = variable), alpha=0.2, scale=0.8) +
geom_boxploth(aes(fill = variable), width = 0.06, outlier.shape = NA)
I look forward to hearing anything from you all!
Thank you.

To use geom_errorbarh, you will have to pass inherit.aes = FALSE in order to be able to plot the mean and CI. (NB: I also transform your sum_stat into a data frame and add a variable column to make the plotting easier.)
sum_stat <- data.frame(sum_stat)
sum_stat$variable = rownames(sum_stat)
ggplot(data, aes(x = value, y = variable)) +
geom_density_ridges(aes(fill = variable), alpha=0.2, scale=0.8) +
geom_point(inherit.aes = FALSE, data = sum_stat,
aes(x= mean_vec, y = variable, color = variable),show.legend = FALSE)+
geom_errorbarh(inherit.aes = FALSE, data = sum_stat,
aes(xmin = left, xmax = right, y = variable, color = variable),
height = 0.1, show.legend = FALSE)
Is this what you are looking for?
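The summary table used above depends on the row names of a matrix. As a minimal sketch (not part of the original answer), the same mean and normal-approximation 95% interval can be computed directly from long-format data in base R. The helper name ci_by_group and the toy data df_long are hypothetical; the output column names mirror sum_stat so the result could be passed to the geom_point and geom_errorbarh layers in the same way.

```r
# Hypothetical helper: mean and normal-approximation CI per group (long data).
ci_by_group <- function(df, value_col = "value", group_col = "variable",
                        level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)            # 1.96 for a 95% interval
  groups <- split(df[[value_col]], df[[group_col]])
  out <- do.call(rbind, lapply(names(groups), function(g) {
    v <- groups[[g]]
    half <- z * sd(v) / sqrt(length(v))      # half-width of the interval
    data.frame(variable = g, mean_vec = mean(v),
               left = mean(v) - half, right = mean(v) + half)
  }))
  rownames(out) <- NULL
  out
}

# Toy data with the same shape as the melted data in the question (assumption).
set.seed(42)
df_long <- data.frame(
  variable = rep(c("A", "B", "C"), each = 5000),
  value = c(rnorm(5000, -0.02, 0.022),
            rnorm(5000, 0.02, 0.022),
            rnorm(5000, 0.04, 0.022))
)
sum_stat_long <- ci_by_group(df_long)
print(sum_stat_long)
```

With 5000 observations per group the interval is very narrow compared with the spread of the density, so the error bar may look almost like a single point on the plot.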

Related

ggplot: transparency of histogram as function of stat(count)

I'm trying to make a scaled histogram in such a way that the transparency of each "column" (bin?) depends on the number of observations in a given range of x. Here is my code:
set.seed(1)
test = data.frame(x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0,1), replace=TRUE, size=100)))
threshold = 20
ggplot(test,
aes(x = x))+
geom_histogram(aes(fill = y, alpha = stat(count) > threshold),
position = "fill", bins = 10)
Basically I want to make plots that look like this:
However, my code generates plots where transparency is applied based on the count after grouping, which ends up with hanging columns like this:
For this example, to simulate a "proper" plot I just adjusted the threshold, but I need alpha to consider the sum of counts from both groups in a given "column" (bin).
UPDATE:
I also want it to work with faceted plots in such a way that the highlighted area in each facet is independent of the other facets. The approach proposed by @Stefan works perfectly for an individual plot, but in a faceted plot it highlights the same area in all facets.
library(ggplot2)
set.seed(1)
test = data.frame(x = rnorm(1000, mean = 0, sd = 10),
y = as.factor(sample(c(0,1), replace=TRUE, size=1000)),
n = as.factor(sample(c(0,1,2), replace=TRUE, size=1000)),
m = as.factor(sample(c(0,1,3,4), replace=TRUE, size=1000)))
f = function(..count.., ..x..) tapply(..count.., factor(..x..), sum)[factor(..x..)]
threshold = 10
ggplot(test,
aes(x = x))+
geom_histogram(aes(fill = y, alpha = f(..count.., ..x..) > threshold),
position = "fill", bins = 10)+
facet_grid(rows = vars(n),
cols = vars(m))
This could be achieved like so:
As the count computed by stat_bin (the stat behind geom_histogram) is the number of observations after grouping, we have to manually aggregate the count over groups to get the total count per bin.
To aggregate the counts per bin I use tapply, where I make use of the .. notation to get the variables computed by stat_bin.
As the grouping variable I make use of the computed variable ..x.., which to the best of my knowledge is not documented. By default ..x.. contains the midpoints of the bins and as such can be used as an identifier for the bins. However, as these are continuous values we have to convert them to a factor.
Finally, to make the code more readable I use an auxiliary function to compute the aggregate counts. Additionally, I double the threshold value to 20.
library(ggplot2)
set.seed(1)
test <- data.frame(
x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0, 1), replace = TRUE, size = 100))
)
threshold <- 20
f <- function(..count.., ..x..) tapply(..count.., factor(..x..), sum)[factor(..x..)]
p <- ggplot(
test,
aes(x = x)
) +
geom_histogram(aes(fill = y, alpha = f(..count.., ..x..) > threshold),
position = "fill", bins = 10
)
p
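To see what the helper does in isolation, here is a small base R sketch on made-up numbers (the vectors count and x are illustrative stand-ins for the values the stat computes, not real ggplot2 output). tapply sums the counts per bin, and indexing the result by the factor expands each bin total back to one value per row.

```r
# Made-up per-group counts: two groups, three bins each.
count <- c(3, 5, 7, 2, 4, 1)
# Bin midpoints, repeated once per group; they identify the bin.
x <- c(-10, 0, 10, -10, 0, 10)

bin <- factor(x)
totals <- tapply(count, bin, sum)      # one total per bin
per_row <- as.numeric(totals[bin])     # total mapped back to every row
print(per_row)                         # 5 9 8 5 9 8

highlight <- per_row > 6               # same comparison as in the alpha mapping
print(highlight)                       # FALSE TRUE TRUE FALSE TRUE TRUE
```

Because both rows of a bin receive the same total, both groups in that bin end up with the same alpha value.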
EDIT: To allow for faceting we have to pass the ..PANEL.. identifier to the function as an additional argument. Instead of using tapply I now use dplyr::add_count, grouped by x and PANEL, to compute the total count per bin and facet panel:
library(ggplot2)
library(dplyr)
set.seed(1)
test <- data.frame(
x = rnorm(200, mean = 0, sd = 10),
y = as.factor(sample(c(0, 1), replace = TRUE, size = 100)),
type = rep(c("A", "B"), each = 100)
)
threshold <- 20
f <- function(count, x, PANEL) {
data.frame(count, x, PANEL) %>%
add_count(x, PANEL, wt = count) %>%
pull(n)
}
p <- ggplot(
test,
aes(x = x)
) +
geom_histogram(aes(fill = y, alpha = f(..count.., ..x.., ..PANEL..) > threshold),
position = "fill", bins = 10
) +
facet_wrap(~type)
p
#> Warning: Using alpha for a discrete variable is not advised.
#> Warning: Removed 2 rows containing missing values (geom_bar).
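For readers who prefer to avoid the dplyr dependency, a rough base R equivalent of the per-bin, per-panel total is ave() with two grouping vectors. This is only a sketch on made-up numbers; the vectors below are illustrative and do not come from an actual plot.

```r
# Made-up counts for two panels, two bins, two fill groups.
count <- c(3, 5, 2, 4, 6, 1, 7, 8)
x     <- c(-5, 5, -5, 5, -5, 5, -5, 5)   # bin midpoints
PANEL <- c(1, 1, 1, 1, 2, 2, 2, 2)       # facet panel identifier

# Sum of count within each (bin, panel) combination, one value per row.
total <- ave(count, x, PANEL, FUN = sum)
print(total)                             # 5 9 5 9 13 9 13 9
```

The same bin can therefore pass the threshold in one panel and fail it in another, which is the behaviour asked for in the update.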

Use stat_summary to label median line on boxplot

I have a function wherein I'm trying to use stat_summary() to plot the value of the median just above the median line on a geom_boxplot(). I've reduced my problem and created a toy example to simplify but retain context.
library(ggplot2)
set.seed(20191120)
dat <- data.frame(var = sample(c("a", "b"),
50,
replace = TRUE),
value = rpois(50, 5))
lims <- c(0, 10)
myplot <- function(DATA, YLIMS) {
ggplot(data = DATA,
aes(x = var)) +
geom_boxplot(aes(y = value),
outlier.shape = NA,
coef = 0) +
stat_summary(aes(y = ifelse(value > (YLIMS[2]*0.9), # if median in top10% of plot window
(value - (YLIMS[2]/10)), # put it below bar
(value + (YLIMS[2]/10))), # else put it above
label = round(..y.., 2)), #round(median(value), 2))
fun.y = median,
geom = "text") +
coord_cartesian(ylim = YLIMS)
}
myplot(dat, lims)
My actual plots have several facets, a variety of ranges, and some of the medians are at the top or bottom of the range. As you can see, I've excluded whiskers and outliers. This is where the YLIMS argument comes in to zoom and focus on the boxes and exclude unused plot space. I've used these YLIMS values to also position the label at +/- 10% of the range which works out perfectly.
I tried using the ..y.. value to get the value of the median for the label argument of stat_summary(aes()) but it is instead taking the new value. As you can see from the plot, we'd expect both labels to be "5" but they are instead "6" as that 10% of 10 has been added.
I also tried recalculating the median (as you can see commented out) but that takes a simple median of all the data and doesn't control for groupings/facets/etc.
I know of ways to refactor my code to compute the label values and positions in the data, or to aggregate first and use identity with the boxplot, but I'm wondering if there is a way to calculate this in-line, as my attempt comes close to doing.
The key to solving this problem is not to adjust the value, but to use the position = position_nudge() option to move the label.
library(ggplot2)
set.seed(20191120)
dat <- data.frame(var = sample(c("a", "b"), 50, replace = TRUE),
value = rpois(50, 5))
lims <- c(0, 10)
myplot <- function(DATA, YLIMS) {
ggplot(data = DATA, aes(x = var)) +
geom_boxplot(aes(y = value), outlier.shape = NA, coef = 0) +
stat_summary(aes(y = value , label = round(..y.., 2)),
fun.y = median, geom = "text",
position=position_nudge(y = ifelse(value > (YLIMS[2]*0.9), #if median in top 10% of plot window
(-YLIMS[2]/10), #put it below bar
(YLIMS[2]/10)), x = 0)) +
coord_cartesian(ylim = YLIMS)
}
myplot(dat, lims)
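For comparison, the refactoring route mentioned in the question (computing label values and positions up front) can be sketched in base R as follows. The helper name label_positions and the toy data are hypothetical; the 10% offset and the "top 10% of the window" rule are taken from the question.

```r
# Hypothetical helper: median per group plus a label position that flips
# below the median line when the median sits in the top 10% of the window.
label_positions <- function(DATA, YLIMS) {
  med <- aggregate(value ~ var, data = DATA, FUN = median)
  names(med)[names(med) == "value"] <- "med"
  offset <- YLIMS[2] / 10
  med$label_y <- ifelse(med$med > YLIMS[2] * 0.9,
                        med$med - offset,   # put the label below the line
                        med$med + offset)   # otherwise put it above
  med$label <- round(med$med, 2)
  med
}

# Toy data: group "a" has median 5, group "b" has median 9.5.
toy <- data.frame(var = c("a", "a", "a", "b", "b", "b"),
                  value = c(4, 5, 6, 9, 9.5, 10))
labs <- label_positions(toy, c(0, 10))
print(labs)
```

The resulting data frame keeps the true median in label while label_y carries the shifted position, so it could be drawn with something like geom_text(data = labs, aes(x = var, y = label_y, label = label)). Grouping by additional facet variables would need those columns added to the aggregate formula.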

Advice on how to plot side-by-side histograms with a line graph going through in ggplot2

I'm currently finishing off my Masters project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 by 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like it was one graph, but I have had issues with the endpoints matching up since each facet is a separate graphical device. Can anyone give me some advice on how to do this? I am not particularly partial to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would work just to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val))  # plyr is not attached, so qualify summarise
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))
You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
geom = "line", fun.y = "median")
This produces a violin for each value of rho, and joins the medians for each violin.

Comparing mean values by group between several variables

I am trying to reproduce a graph from Stata in R. I have several variables and want to display their mean in each treatment group of which there are two. The Stata graph is as follows:
This coefficient plot is not actually a plot of coefficients, but of the mean values by each treatment for each separate variable. The df basically looks something like.
workable data
It is difficult to answer your question without reproducible data.
However, this might get what you desire just with mean:
library(dplyr)
library(ggplot2)  # provides the mpg dataset and ggplot()
mpg %>%
select(manufacturer, cty, trans) %>%
group_by(manufacturer, trans) %>%
summarize(cty_mean = mean(cty)) %>%
ggplot(aes(x=cty_mean, y=reorder(manufacturer, cty_mean), color=trans)) +
geom_point()
If you also wish to include the coefficients or standard errors, you could achieve this by adding the corresponding function calls inside summarize().
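As a rough illustration of adding a standard error next to the mean, here is a base R sketch. Note the substitutions: aggregate() stands in for summarize() and the built-in mtcars data stands in for mpg, so that the example runs without extra packages. The helper se is a hypothetical name.

```r
# Hypothetical helper: standard error of the mean.
se <- function(v) sd(v) / sqrt(length(v))

# Mean and standard error of mpg for each cyl/am combination.
stats <- aggregate(mpg ~ cyl + am, data = mtcars,
                   FUN = function(v) c(mean = mean(v), se = se(v)))
# aggregate() returns a matrix column here; flatten it into mpg.mean / mpg.se.
stats <- do.call(data.frame, stats)

# Interval bounds of the kind geom_pointrange() or geom_errorbarh() expect.
stats$lower <- stats$mpg.mean - qnorm(0.975) * stats$mpg.se
stats$upper <- stats$mpg.mean + qnorm(0.975) * stats$mpg.se
print(stats)
```

In the dplyr pipeline above, the analogous step would be a second expression inside summarize(), for example cty_se = sd(cty) / sqrt(n()).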
I figured out geom_pointrange() is probably what you are looking for:
library("ggplot2")
set.seed(111018)
interval1 <- -qnorm((1-0.9)/2)
means_treatment_1 <- rnorm(2)
se_treatment_1 <- rnorm(2)
df_treatment_1 <- data.frame("Mean" = means_treatment_1,
"lower" = means_treatment_1 - se_treatment_1*interval1,
"upper" = means_treatment_1 + se_treatment_1*interval1,
"Variable" = c("medicare_spending_dummy",
"job_training_dummy"),
"Treatment" = "a")
means_treatment_2 <- rnorm(2)
se_treatment_2 <- rnorm(2)
df_treatment_2 <- data.frame("Mean" = means_treatment_2,
"lower" = means_treatment_2 - se_treatment_2*interval1,
"upper" = means_treatment_2 + se_treatment_2*interval1,
"Variable" = c("medicare_spending_dummy",
"job_training_dummy"),
"Treatment" = "b")
df_tot<-rbind(df_treatment_1, df_treatment_2)
# Plot
ggplot(df_tot, aes(colour = Treatment)) +
geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
geom_pointrange(aes(x = Variable, y = Mean, ymin = lower, ymax = upper ),lwd = 1, position = position_dodge(width = 1/2)) +
coord_flip() +
theme_bw()

R ggplot2::geom_density with a constant variable

I recently came across a problem with ggplot2::geom_density that I am not able to solve. I am trying to visualise the density of some variable and compare it to a constant. To plot the density, I am using ggplot2::geom_density. The variable for which I am plotting the density, however, happens to be a constant (this time):
df <- data.frame(matrix(1,ncol = 1, nrow = 100))
colnames(df) <- "dummy"
dfV <- data.frame(matrix(5,ncol = 1, nrow = 1))
colnames(dfV) <- "latent"
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.2, position = "identity") +
geom_vline(data = dfV, aes(xintercept = latent, color = 'ls'), size = 2)
This is OK and something I would expect. But, when I shift this distribution to the far right, I get a plot like this:
df <- data.frame(matrix(71,ncol = 1, nrow = 100))
colnames(df) <- "dummy"
dfV <- data.frame(matrix(75,ncol = 1, nrow = 1))
colnames(dfV) <- "latent"
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.2, position = "identity") +
geom_vline(data = dfV, aes(xintercept = latent, color = 'ls'), size = 2)
which probably means that the kernel estimation is still taking 0 as the centre of the distribution (right?).
Is there any way to circumvent this? I would like to see a plot like the one above, only with the centre of the kernel density at 71 and the vline at 75.
Thanks
Well I am not sure what the code does, but I suspect the geom_density primitive was not designed for a case where the values are all the same, and it is making some assumptions about the distribution that are not what you expect. Here is some code and a plot that sheds some light:
# Generate 10 data sets with 100 constant values from 0 to 90
# and then merge them into a single dataframe
dfs <- list()
for (i in 1:10){
v <- 10*(i-1)
dfs[[i]] <- data.frame(dummy=rep(v,100),facet=v)
}
df <- do.call(rbind,dfs)
# facet plot them
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.5, position = "identity") +
facet_wrap( ~ facet,ncol=5 )
Yielding:
So it is not doing what you thought it was, but it is also probably not doing what you want. You could of course make it "translation-invariant" (almost) by adding some noise like this for example:
set.seed(1234)
noise <- +rnorm(100,0,1e-3)
dfs <- list()
for (i in 1:10){
v <- 10*(i-1)
dfs[[i]] <- data.frame(dummy=rep(v,100)+noise,facet=v)
}
df <- do.call(rbind,dfs)
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.5, position = "identity") +
facet_wrap( ~ facet,ncol=5 )
Yielding:
Note that there is apparently a random component to the geom_density function, and I can't see how to set the seed before each instance, so the estimated density is a bit different each time.
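One likely contributor to the behaviour seen in the question is bandwidth selection rather than the location of the kernel. geom_density documents bw = "nrd0" as its default, and stats::bw.nrd0 falls back to the absolute value of the first observation when both the standard deviation and the IQR are zero. For constant data the bandwidth therefore grows with the value itself. The base R sketch below shows this, and shows that a fixed bandwidth (0.5 is an arbitrary choice for illustration) keeps the estimate centred on the data.

```r
n <- 100

# Default bandwidth for constant data: 0.9 * |x[1]| * n^(-1/5).
bw_1  <- bw.nrd0(rep(1, n))
bw_71 <- bw.nrd0(rep(71, n))
print(c(bw_1 = bw_1, bw_71 = bw_71))   # the second is 71 times the first

# With an explicit bandwidth the density is a narrow bump around 71.
d <- density(rep(71, n), bw = 0.5)
peak <- d$x[which.max(d$y)]
print(peak)
```

Assuming this is what drives the plot, passing the bandwidth explicitly, for example geom_density(bw = 0.5), should give the shifted distribution the same shape as the one around 1 without having to add noise.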
