While working on a project to understand the central limit theorem for the exponential distribution, I ran into an annoying error message when plotting simulated vs. theoretical distributions. When I run the code below, I get an error: 'mapping' is not used by stat_function().
By mapping, I assume the error is referring to the aes parameter, which I later map to the color red using scale_color_manual in order to show it in a legend.
My question is two-fold: why is this error happening, and is there a more efficient way to create a legend without using scale_color_manual?
Thank you!
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
It's not an error, it's a warning:
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
#> Warning: `mapping` is not used by stat_function()
Created on 2020-05-01 by the reprex package (v0.3.0)
You can suppress the warning by calling geom_line(stat = "function") rather than stat_function():
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
geom_line(stat = "function", fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
Created on 2020-05-01 by the reprex package (v0.3.0)
In my opinion, the warning is erroneous, and an issue has been filed about this problem: https://github.com/tidyverse/ggplot2/issues/3611
However, it's not that easy to solve, and therefore as of now the warning is there.
I'm unable to recreate your issue: when I run your code, a plot is generated (below), which suggests the issue likely has to do with your environment. A general 'solution' is to clear your workspace using the menu dropdown or similar (Session -> Clear workspace...), then re-run your code.
For refactoring the color issue, you can simplify scale_color_manual to scale_color_manual("Legend", values = c('blue','red')), but the named version you have now is a bit better in my view. Anything beyond that has more to do with changing the data structure and mapping.
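To see why the unnamed form still lines up with the right labels, here is a minimal sketch in base R. It assumes that the legend keys for character values are ordered alphabetically, which is how ggplot2 orders them on a discrete scale; `legend_keys`, `unnamed_values` and `matched` are illustrative names, not part of the original code.

```r
# Keys produced by aes(color = 'Theoretical') and aes(color = 'Simulation')
legend_keys <- c("Theoretical", "Simulation")

# Unnamed values are assigned in key order (alphabetical for strings),
# not in the order the layers were added to the plot.
unnamed_values <- c("blue", "red")
matched <- setNames(unnamed_values, sort(legend_keys))
print(matched)

# Named values do not depend on ordering at all.
named_values <- c("Theoretical" = "red", "Simulation" = "blue")
stopifnot(identical(matched[["Simulation"]], named_values[["Simulation"]]))
```

The unnamed version silently changes meaning if a label is renamed so that the alphabetical order flips, which is one reason to prefer the named vector.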
Apologies, I don't have the rep to make a comment.
First of all, some data similar to what I am working with.
rawdata <- data.frame(Score = rnorm(1000, seq(1, 0, length.out = 10), sd = 1),
Group = rep(LETTERS[1:3], 10000))
rawdata$Score <- ifelse(rawdata$Group == "A", rawdata$Score+2,rawdata$Score)
rawdata$Score <- ifelse(rawdata$Group == "C", rawdata$Score-2,rawdata$Score)
stdev <- c(10.78,10.51,9.42)
col <- c("#004d8d", "#cc2701", "#e5b400")
Now, the code for my geom_density_ridges plot with quantile lines, which in this case are white.
p <- ggplot(rawdata, aes(x = Score, y = Group)) +
scale_y_discrete() +
geom_rect(inherit.aes = FALSE, mapping = aes(ymin = 0, ymax = Inf, xmin = -0.1 * min(stdev), xmax = 0.1 * max(stdev)),
fill = "grey", alpha = 0.5) +
geom_density_ridges(scale = -0.5, size = 1, alpha=0.5, show.legend = FALSE,
quantile_lines = TRUE, quantiles = c(0.025, 0.975),
vline_color = "white", aes(fill = Group)) +
scale_color_manual(values = col) +
scale_fill_manual(values = col) +
labs(title="Toy Graph", y="Group", x="Value") +
coord_flip(xlim = c(-8, 8), ylim = NULL, expand = TRUE, clip = "on")
p
And we obtain the following plot, which matches expectations perfectly.
Now I was wondering if there is a way to make only this little white quantile line transparent to the background. I first tried setting vline_color = "transparent" and leaving aes(fill = Group) at the end of geom_density_ridges, on the logic that options are drawn in order. However, the line becomes transparent not to the grey shades of the background but to the density fill (so the quantile line disappears), which is not what I am trying to achieve.
Thanks in advance for your ideas!
Colors can be modified with scales::alpha. This can be passed to your color argument.
library(ggridges)
library(ggplot2)
rawdata <- data.frame(Score = rnorm(1000, seq(1, 0, length.out = 10), sd = 1),
Group = rep(LETTERS[1:3], 10000))
rawdata$Score <- ifelse(rawdata$Group == "A", rawdata$Score+2,rawdata$Score)
rawdata$Score <- ifelse(rawdata$Group == "C", rawdata$Score-2,rawdata$Score)
stdev <- c(10.78,10.51,9.42)
col <- c("#004d8d", "#cc2701", "#e5b400")
ggplot(rawdata, aes(x = Score, y = Group)) +
scale_y_discrete() +
geom_rect(inherit.aes = FALSE, mapping = aes(ymin = 0, ymax = Inf, xmin = -0.1 * min(stdev), xmax = 0.1 * max(stdev)),
fill = "grey", alpha = 0.5) +
geom_density_ridges(scale = -0.5, size = 1, alpha=0.5, show.legend = FALSE,
quantile_lines = TRUE, quantiles = c(0.025, 0.975),
### The only change is here
vline_color = alpha("white", .5), aes(fill = Group)) +
scale_color_manual(values = col) +
scale_fill_manual(values = col) +
labs(title="Toy Graph", y="Group", x="Value") +
coord_flip(xlim = c(-8, 8), ylim = NULL, expand = TRUE, clip = "on")
#> Picking joint bandwidth of 0.148
#> Warning: Using the `size` aesthietic with geom_segment was deprecated in ggplot2 3.4.0.
#> ℹ Please use the `linewidth` aesthetic instead.
Created on 2022-11-14 with reprex v2.0.2
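For reference, a small sketch of what the alpha() call in that answer produces. It only assumes that the scales package is installed; `semi_white` is an illustrative name.

```r
library(scales)

# alpha() takes a colour and an opacity in [0, 1] and returns a hex string
# with an alpha channel appended (#RRGGBBAA).
semi_white <- alpha("white", 0.5)
print(semi_white)

# Base R's adjustcolor() is a dependency-free alternative.
print(adjustcolor("white", alpha.f = 0.5))
```

Either string can be passed anywhere a colour is accepted, including vline_color.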
No, if you make something transparent you will see what's underneath, which is the density plot.
However, you can replicate the visual effect of "seeing through to the background" by simply setting the line colour to the same as the background.
Your grey rectangle is currently plotted underneath the density plots, therefore the "background" doesn't have a single colour. This can be solved by plotting it on top instead. Instead of a 50% grey with 50% alpha, you can replicate the same effect with a 0% grey (aka black) with a 25% alpha. Move the geom_rect later than the density plots and it will be layered on top.
Finally, your geom_rect is being drawn once for each row of rawdata, since it inherits the same data as the main plot. You probably don't want that, so specify a (dummy) data source instead.
ggplot(rawdata, aes(x = Score, y = Group)) +
scale_y_discrete() +
geom_density_ridges(scale = -0.5, size = 1, alpha=0.5, show.legend = FALSE,
quantile_lines = TRUE, quantiles = c(0.025, 0.975),
vline_color = "grey90", aes(fill = Group)) +
scale_color_manual(values = col) +
scale_fill_manual(values = col) +
labs(title="Toy Graph", y="Group", x="Value") +
geom_rect(data=data.frame(), inherit.aes = FALSE, mapping = aes(
ymin = 0, ymax = Inf, xmin = -0.1 * min(stdev), xmax = 0.1 * max(stdev)
), fill = "black", alpha = 0.25) +
coord_flip(xlim = c(-8, 8), ylim = NULL, expand = TRUE, clip = "on")
Note: I'm not sure the background colour is really "grey90", I've eyeballed it. You may want to specify it explicitly with theme if you want to be exact.
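The claim that a 50% grey at 50% alpha can be replaced by black at 25% alpha can be checked with a little arithmetic. The sketch below assumes standard "over" alpha compositing and treats colours as single grey levels between 0 (black) and 1 (white); `blend` is a hypothetical helper, and the 0.92 background level is an assumed value for a light grey panel.

```r
# Standard "over" compositing of one grey level on top of another.
# Grey levels run from 0 (black) to 1 (white).
blend <- function(fg, alpha, bg) alpha * fg + (1 - alpha) * bg

white_bg <- 1
grey50_half <- blend(fg = 0.5, alpha = 0.5, bg = white_bg)
black_quarter <- blend(fg = 0, alpha = 0.25, bg = white_bg)
print(c(grey50_half = grey50_half, black_quarter = black_quarter))

# On a darker background the two are close but not identical.
grey_bg <- 0.92
print(c(blend(0.5, 0.5, grey_bg), blend(0, 0.25, grey_bg)))
```

Under these assumptions the two overlays match exactly on a white background and only approximately on anything darker, so the substitution is a visual approximation rather than an identity.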
If you want literal see-through portions of your density curves, you will need to make the gaps yourself:
library(tidyverse)
rawdata %>%
mutate(GroupNum = as.numeric(as.factor(Group))) %>%
group_by(GroupNum, Group) %>%
summarise(yval = first(GroupNum) - density(Score)$y,
xval = density(Score)$x,
q025 = quantile(Score, 0.025),
q975 = quantile(Score, 0.975)) %>%
mutate(Q = ifelse(xval < q025, 'low', ifelse(xval > q975, 'hi', 'mid'))) %>%
ggplot(aes(xval, yval, group = interaction(Group, Q))) +
geom_line(size = 1) +
geom_ribbon(aes(ymax = GroupNum, ymin = yval, fill = Group),
color = NA, alpha = 0.5, outline.type = 'full',
data = . %>% filter(abs(q025 - xval) > 0.03 &
abs(q975 - xval) > 0.03)) +
coord_flip() +
scale_fill_manual(values = col) +
scale_y_continuous(breaks = 1:3, labels = levels(factor(rawdata$Group)),
name = 'Group') +
labs(x = 'Score')
I am working through a class problem to test if the central limit theorem applies to medians as well. I've written the code, and as far as I can tell, it is working just fine. But my dnorm stat to plot the normal distribution is not showing up. It just creates a flat line when it should create a bell curve. Here is the code:
set.seed(14)
median_clt <- rnorm(1000, mean = 10, sd = 2)
many_sample_medians <- function(vec, n, reps) {
rep_vec <- replicate(reps, sample(vec, n), simplify = "vector")
median_vec <- apply(rep_vec, 2, median)
return(median_vec)
}
median_clt_test <- many_sample_medians(median_clt, 500, 1000)
median_clt_test_df <- data.frame(median_clt_test)
bw_clt <- 2 * IQR(median_clt_test_df$median_clt_test) / length(median_clt_test_df$median_clt_test)^(1/3)
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..), fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2), col = "darkorchid1", lwd = 2) +
theme_classic()
As far as I can tell, the rest of the code is working properly - it just doesn't plot the dnorm stat function correctly. The exact same stat line worked for me before, so I'm not sure what's gone wrong.
The line isn't quite flat; it's just very stretched out compared to the histogram. We can see this more clearly if we zoom out on the x axis and zoom in on the y axis:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2),
col = "darkorchid1",
lwd = 2) +
xlim(c(5, 15)) +
coord_cartesian(xlim = c(5, 15), ylim = c(0, 1)) +
theme_classic()
But why is this?
It's because you are using dnorm to plot the distribution of the random variable from which the medians were drawn, but your histogram is a sample of the medians themselves. So you are plotting the wrong dnorm curve. The sd should not be the standard deviation of the random variable, but the standard deviation of the sample medians:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x,
mean = mean(median_clt_test),
sd = sd(median_clt_test)),
col = "darkorchid1",
lwd = 2) +
theme_classic()
If you prefer you could use the theoretical standard error of the mean instead of the measured standard deviation of your medians - these will be very similar.
# Theoretical SEM
2/sqrt(500)
#> [1] 0.08944272
# SD of medians
sd(median_clt_test)
#> [1] 0.08850221
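As a hedged aside on the theory: for independent draws from a normal distribution, the large-sample standard error of the median is sqrt(pi/2) * sigma / sqrt(n), roughly 25% larger than the standard error of the mean. The simulation below is a sketch under that assumption. It draws a fresh rnorm sample for every replicate, which differs from the question's setup of resampling 500 values without replacement from one fixed vector of 1000; `n_reps` and the seed are arbitrary choices.

```r
set.seed(1)
n <- 500
sigma <- 2
n_reps <- 2000

# Median of a fresh normal sample, repeated n_reps times
sim_medians <- replicate(n_reps, median(rnorm(n, mean = 10, sd = sigma)))

sem <- sigma / sqrt(n)                       # standard error of the mean
se_median <- sqrt(pi / 2) * sigma / sqrt(n)  # asymptotic SE of the median

print(c(sd_of_medians = sd(sim_medians), sem = sem, se_median = se_median))
```

Whichever value is used, the point of the answer stands: the curve needs the spread of the sample medians, not the sd = 2 of the underlying variable.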
I'm trying to plot a histogram with ggplot2.
I wrote some simple code for this in R:
dnorm.count <- function(x, mean = 0, sd = 1, log = FALSE, n = 1, binwidth = 1){
n * binwidth * dnorm(x = x, mean = mean, sd = sd, log = log)
}
mtcars %>%
ggplot(aes(x = mpg)) +
geom_histogram(bins =60,color = "white", fill = "#9FE367",boundary = 0.5) +
geom_vline(aes(xintercept = mean(mpg)),
linetype="dashed",
size = 1.6,
color = "#FF0000")+
geom_text(aes(label = ..count..), stat= "count",vjust = -0.6)+
stat_function(fun = dnorm.count, color = "#6D67E3",
args = list(mean= mean(mtcars$mpg),
sd = sd(mtcars$mpg),
n = nrow(mtcars)),
lwd = 1.2) +
scale_y_continuous(labels = comma, name = "Frequency") +
scale_x_continuous(breaks=seq(0,max(mtcars$mpg)))+
geom_text(aes(label = paste0("mean = ", round(mean(mtcars$mpg), 2)),
x = mean(mtcars$mpg)*1.2,
y = mean(mtcars$mpg)/5))+
geom_vline(aes(xintercept = sd(mpg)), linetype="dashed",size = 1.6, color = "#FF0000")
What I got is this!
The question is: how do I plot a histogram similar to this one using ggplot2, and is it possible to convert the code into an R function?
Edit: For the better explanation of what I'm trying to do:
I wanna create a Histogram exactly the same as the one attached for reference using ggplot2 and then I wanna create a function for the same to reduce the coding. Use any package+ggplot2 you like. The histograms should have lines depicting the standard deviation & mean like the one in reference. If possible depict the standard deviation in the plot as the reference image, that's what I'm trying to achieve.
If your question is how to plot histograms like the one you attached in your last figure, these 9 lines of code produce a very similar result.
library(magrittr) ; library(ggplot2)
set.seed(42)
data <- rnorm(1e5)
p <- data %>%
as.data.frame() %>%
ggplot(., aes(x = data)) +
geom_histogram(fill = "white", col = "black", bins = 30 ) +
geom_density(aes( y = 0.3 *..count..)) +
labs(x = "Statistics", y = "Probability/Density") +
theme_bw() + theme(axis.text = element_blank())
You could use annotate() to add symbols or text and geom_segment to show the intervals on the plot like this:
p + annotate(x = sd(data)/2 , y = 8000, geom = "text", label = "σ", size = 10) +
annotate(x = sd(data) , y = 6000, geom = "text", label = "2σ", size = 10) +
annotate(x = sd(data)*1.5 , y = 4000, geom = "text", label = "3σ", size = 10) +
geom_segment(x = 0, xend = sd(data), y = 7500, yend = 7500) +
geom_segment(x = 0, xend = sd(data)*2, y = 5500, yend = 5500) +
geom_segment(x = 0, xend = sd(data)*3, y = 3500, yend = 3500)
This chunk of code would give you something like this:
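On the second part of the question (wrapping the plot in a function), one possible sketch follows. `hist_with_normal` is a hypothetical helper, not something from the thread or from ggplot2. It assumes the histogram shows counts, so the normal curve is scaled by n * binwidth as in the question's dnorm.count, with the same binwidth passed to both the bars and the curve.

```r
library(ggplot2)

# Hypothetical helper: histogram of counts with a normal curve scaled to
# counts, plus dashed lines at the mean and at mean +/- 1 SD.
hist_with_normal <- function(values, binwidth) {
  m <- mean(values)
  s <- sd(values)
  n <- length(values)
  ggplot(data.frame(value = values), aes(x = value)) +
    geom_histogram(binwidth = binwidth, color = "white", fill = "#9FE367") +
    stat_function(
      # density * n * binwidth puts the curve on the count scale
      fun = function(x) n * binwidth * dnorm(x, mean = m, sd = s),
      color = "#6D67E3"
    ) +
    geom_vline(xintercept = c(m - s, m, m + s),
               linetype = "dashed", color = "#FF0000") +
    labs(x = "Value", y = "Frequency")
}

p <- hist_with_normal(mtcars$mpg, binwidth = 2)
```

Printing `p` draws the plot. Annotations such as the σ labels from the answer above can be added to the returned object with `+` in the usual way.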
I was trying to plot some predicted vs. actual data, something that resembles the following:
# Some random data
x <- seq(1: 10)
y_pred <- runif(10, min = -10, max = 10)
y_obs <- y_pred + rnorm(10)
# Faking a CI
Lo.95 <- y_pred - 1.96
Hi.95 <- y_pred + 1.96
my_df <- data.frame(x, y_pred, y_obs, Lo.95, Hi.95)
ggplot(my_df, aes(x = x, y = y_pred)) +
geom_line(aes(colour = "Forecasted Data"), size = 1.2) +
geom_point(aes(x = x, y = y_obs, colour = "Actual Data"), size = 3) +
geom_ribbon(aes(ymin=Lo.95, ymax=Hi.95, x=x, linetype = NA, colour = "Confidence Interval"), alpha=0.2) +
theme_grey() +
scale_colour_manual(
values = c("gray30", "blue", "red"),
guide = guide_legend(override.aes = list(
border=c(NA, NA, NA),
fill=c("gray30", "white", "white"),
linetype = c("blank", "blank", "solid"),
shape = c(NA, 19, NA))))
The plot looks like this:
The only issue I have with this plot is the red border surrounding the legend item symbol for the line (i.e. the forecasted data). Is there any way I can remove it without breaking the rest of my plot?
I think geom_ribbon was the problem. If we take its color & fill out of aes, everything looks fine
library(ggplot2)
# Some random data
x <- seq(1: 10)
y_pred <- runif(10, min = -10, max = 10)
y_obs <- y_pred + rnorm(10)
# Faking a CI
Lo.95 <- y_pred - 1.96
Hi.95 <- y_pred + 1.96
my_df <- data.frame(x, y_pred, y_obs, Lo.95, Hi.95)
m1 <- ggplot(my_df, aes(x = x, y = y_pred)) +
geom_point(aes(x = x, y = y_obs, colour = "Actual"), size = 3) +
geom_line(aes(colour = "Forecasted"), size = 1.2) +
geom_ribbon(aes(x = x, ymin = Lo.95, ymax = Hi.95),
fill = "grey30", alpha = 0.2) +
scale_color_manual("Legend",
values = c("blue", "red"),
labels = c("Actual", "Forecasted")) +
guides( color = guide_legend(
order = 1,
override.aes = list(
color = c("blue", "red"),
fill = c("white", "white"),
linetype = c("blank", "solid"),
shape = c(19, NA)))) +
theme_bw() +
# remove legend key border color & background
theme(legend.key = element_rect(colour = NA, fill = NA),
legend.box.background = element_blank())
m1
As we leave Confidence Interval out of aes, we no longer have its legend. One workaround is to create an invisible point and take one unused geom to manually create a legend key. Here we can use size/shape (credit to this answer)
m2 <- m1 +
geom_point(aes(x = x, y = y_obs, size = "Confidence Interval", shape = NA)) +
guides(size = guide_legend(NULL,
order = 2,
override.aes = list(shape = 15,
color = "lightgrey",
size = 6))) +
# Move legends closer to each other
theme(legend.title = element_blank(),
legend.justification = "center",
legend.spacing.y = unit(0.05, "cm"),
legend.margin = margin(0, 0, 0, 0),
legend.box.margin = margin(0, 0, 0, 0))
m2
Created on 2018-03-19 by the reprex package (v0.2.0).
A better way to address this question would be to specify show.legend = F option in the geom_ribbon(). This will eliminate the need for the second step for adding and merging the legend key for the confidence interval. Here is the code with slight modifications.
ggplot(my_df, aes(x = x, y = y_pred)) +
geom_line(aes(colour = "Forecasted Data"), size = 1) +
geom_point(aes(x = x, y = y_obs, colour = "Actual Data"), size = 1) +
geom_ribbon(aes(ymin=Lo.95, ymax=Hi.95, x=x, linetype = NA, colour = "Confidence Interval"), alpha=0.2, show.legend = F) +
theme_grey() +
scale_colour_manual(
values = c("blue", "gray30", "red"))+
guides(color = guide_legend(
override.aes = list(linetype = c(1, 1, 0)),
shape = c(1, NA, NA),
reverse = T))
My plot
Credit to https://stackoverflow.com/users/4282026/marblo for their answer to a similar question.
I'm trying to learn R from scratch and I just delivered a college assignment for hypothesis testing a binomial distribution (proportion test for one sample) that I used R to solve and plot. But I ran into some problems.
My sample size is 130, success cases are 68.
H0: π = 50%
H1: π > 50%
This is the code I used (plenty of copy-paste and trial/error)
library(ggplot2)
library(ggthemes)
library(scales)
#data
n = 130
p = 1/2
stdev = sqrt(n*p*(1-p))
mean_binon = n*p
cases = 68
ztest = (cases-mean_binon)/stdev
pvalor = pnorm(-abs(ztest))
zcrit = qnorm(0.975)
#normal curve
xvalues <- data.frame(x = c(-4, 4))
#first plots and lines
p1 <- ggplot(xvalues, aes(x = x))
p2 <- p1 + stat_function(fun = dnorm) + xlim(c(-4, 4)) +
geom_vline(xintercept = ztest, linetype="solid", color="blue",
size=1) +
geom_vline(xintercept = zcrit, linetype="solid", color="red",
size=1)
#z area function
area_z <- function(x){
norm_z <- dnorm(x)
norm_z[x < ztest] <- NA
return(norm_z)
}
#critical z area function
area_zc <- function(x){
norm_zc <- dnorm(x)
norm_zc[x < zcrit] <- NA
return(norm_zc)
}
#area value
valor_area_z <- round(pnorm(4) - pnorm(ztest), 3)
valor_area_zc <- round(pnorm(4) - pnorm(zcrit), 3)
#final plot
p3 <- p2 + stat_function(fun = dnorm) +
stat_function(fun = area_z, geom = "area", fill = "blue", alpha = 0.3) +
geom_text(x = 1.13, y = 0.1, size = 5, fontface = "bold",
label = paste0(valor_area_z * 100, "%")) +
stat_function(fun = area_zc, geom = "area", fill = "red", alpha = 0.5) +
geom_text(x = 2.27, y = 0.015, size = 3, fontface = "bold",
label = paste0(valor_area_zc * 100, "%")) +
scale_x_continuous(breaks = c(-3:3)) +
labs(x = "\n z", y = "f(z) \n", title = "Distribuição Normal \n") +
theme_fivethirtyeight()
p3
Here's the plot
There is a gap between my geom_vline lines and the shaded areas. I'm not sure if I'm making a mistake in my statistics or if this is an R-related problem. Maybe both? Sorry if this is elementary. I'm not good at either, but I'm trying to improve.
A solution is to use the option xlim inside stat_function which defines the range of the function. You can also replace area_z and area_zc with dnorm.
p3 <- p2 + stat_function(fun = dnorm) +
stat_function(fun = dnorm, geom = "area", fill = "blue", alpha = 0.3,
xlim = c(ztest,zcrit)) +
geom_text(x = 1.13, y = 0.1, size = 5, fontface = "bold",
label = paste0(valor_area_z * 100, "%")) +
stat_function(fun = dnorm, geom = "area", fill = "red", alpha = 0.5,
xlim = c(zcrit,xvalues$x[2])) +
geom_text(x = 2.27, y = 0.015, size = 3, fontface = "bold",
label = paste0(valor_area_zc * 100, "%")) +
scale_x_continuous(breaks = c(-3:3)) +
labs(x = "\n z", y = "f(z) \n", title = "Distribuição Normal \n") +
theme_fivethirtyeight()
p3
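As for why the gap appears in the original version: stat_function evaluates the function on a grid of n points (101 by default), so a function that returns NA below a cutoff only starts shading at the first grid point at or above that cutoff. The sketch below reproduces this in base R, assuming the grid spans -4 to 4 as in the question; `grid`, `first_shaded_z` and `first_shaded_zc` are illustrative names.

```r
n <- 130
p <- 1 / 2
cases <- 68
ztest <- (cases - n * p) / sqrt(n * p * (1 - p))
zcrit <- qnorm(0.975)

# Evaluation grid, assuming stat_function's default n = 101 over [-4, 4]
grid <- seq(-4, 4, length.out = 101)

# First grid point where area_z / area_zc stop returning NA
first_shaded_z <- min(grid[grid >= ztest])
first_shaded_zc <- min(grid[grid >= zcrit])

print(c(gap_z = first_shaded_z - ztest, gap_zc = first_shaded_zc - zcrit))
```

Passing xlim to stat_function, as in the answer, makes the grid start exactly at the cutoff, which is why the gap disappears. Under the same assumption, raising n in the original version should shrink the gap without removing it entirely.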