How to add name labels to a graph using ggplot2 in R?

I have the following code:
plot <- ggplot(data = df_sm)+
geom_histogram(aes(x=simul_means, y=..density..), binwidth = 0.20, fill="slategray3", col="black", show.legend = TRUE)
plot <- plot + labs(title="Density of 40 Means from Exponential Distribution", x="Mean of 40 Exponential Distributions", y="Density")
plot <- plot + geom_vline(xintercept=sampl_mean,size=1.0, color="black", show.legend = TRUE)
plot <- plot + stat_function(fun=dnorm,args=list(mean=sampl_mean, sd=sampl_sd),color = "dodgerblue4", size = 1.0)
plot <- plot+ geom_vline(xintercept=th_mean,size=1.0,color="indianred4",linetype = "longdash")
plot <- plot + stat_function(fun=dnorm,args=list(mean=th_mean, sd=th_mean_sd),color = "darkmagenta", size = 1.0)
plot
I want to show a legend entry for each layer. I tried show.legend = TRUE, but it does nothing.
My data frame contains the means from simulations of an exponential distribution. I also have some theoretical values for the distribution (the mean and the standard deviation of the mean), which I call th_mean and th_mean_sd.
The code for my simulation is the following:
lambda <- 0.2
th_mean <- 1/lambda
th_sd <- 1/lambda
th_var <- th_sd^2
n <- 40
th_mean_sd <- th_sd/sqrt(n)
th_mean_var <- th_var/n # variance of the mean is sigma^2 / n
simul <- 1000
simul_means <- NULL
for(i in 1:simul) {
simul_means <- c(simul_means, mean(rexp(n, lambda)))
}
sampl_mean <- mean(simul_means)
sampl_sd <- sd(simul_means)
df_sm<-data.frame(simul_means)

If you want to get a legend, you have to map on aesthetics instead of setting color, fill, ... as parameters, i.e. move color=... inside aes(...) and use scale_color_manual / scale_fill_manual to set the color values. Personally I find it helpful to use meaningful labels, e.g. for your histogram I map the label "hist" to fill, but you could use whatever label you like:
set.seed(123)
lambda <- 0.2
th_mean <- 1 / lambda
th_sd <- 1 / lambda
th_var <- th_sd^2
n <- 40
th_mean_sd <- th_sd / sqrt(n)
th_mean_var <- th_var / n # variance of the mean is sigma^2 / n
simul <- 1000
simul_means <- NULL
for (i in 1:simul) {
simul_means <- c(simul_means, mean(rexp(n, lambda)))
}
sampl_mean <- mean(simul_means)
sampl_sd <- sd(simul_means)
df_sm <- data.frame(simul_means)
library(ggplot2)
ggplot(data = df_sm) +
geom_histogram(aes(x = simul_means, y = ..density.., fill = "hist"), binwidth = 0.20, col = "black") +
labs(title = "Density of 40 Means from Exponential Distribution", x = "Mean of 40 Exponential Distributions", y = "Density") +
stat_function(fun = dnorm, args = list(mean = sampl_mean, sd = sampl_sd), aes(color = "sampl_dens"), size = 1.0) +
stat_function(fun = dnorm, args = list(mean = th_mean, sd = th_mean_sd), aes(color = "th_dens"), size = 1.0) +
geom_vline(size = 1.0, aes(xintercept = sampl_mean, color = "sampl_mean")) +
geom_vline(size = 1.0, aes(xintercept = th_mean, color = "th_mean"), linetype = "longdash") +
scale_fill_manual(values = c(hist = "slategray3")) +
scale_color_manual(values = c(sampl_dens = "dodgerblue4", th_dens = "darkmagenta", th_mean = "indianred4", sampl_mean = "black"))
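The keys mapped inside aes() ("hist", "th_mean", ...) are what the legend prints by default. Here is a minimal sketch of giving them readable names through the name, breaks and labels arguments of the manual scales. The toy data, the helper data frame lines_df and the label texts are assumptions made for illustration, not part of the original code, and after_stat() assumes ggplot2 3.3.0 or later:

```r
library(ggplot2)

set.seed(123)
# Toy stand-in for the simulated means (assumption, not the original data)
df <- data.frame(simul_means = rnorm(200, mean = 5, sd = 0.8))

# One row per vertical line; `key` is the value mapped to colour
lines_df <- data.frame(
  xint = c(mean(df$simul_means), 5),
  key  = c("sampl_mean", "th_mean")
)

p <- ggplot(df) +
  geom_histogram(aes(x = simul_means, y = after_stat(density), fill = "hist"),
                 binwidth = 0.2, colour = "black") +
  geom_vline(data = lines_df, aes(xintercept = xint, colour = key)) +
  # `values` picks the colours, `breaks`/`labels` pick the legend text
  scale_fill_manual(name = NULL,
                    values = c(hist = "slategray3"),
                    breaks = "hist",
                    labels = "Simulated means") +
  scale_colour_manual(name = "Lines",
                      values = c(sampl_mean = "black", th_mean = "indianred4"),
                      breaks = c("sampl_mean", "th_mean"),
                      labels = c("Sample mean", "Theoretical mean"))
p
```

Keeping the vertical lines in their own small data frame also means each line is drawn once, rather than once per row of the histogram data.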

Related

Stat function dnorm failure

I am working through a class problem to test if the central limit theorem applies to medians as well. I've written the code, and as far as I can tell, it is working just fine. But my dnorm stat to plot the normal distribution is not showing up. It just creates a flat line when it should create a bell curve. Here is the code:
set.seed(14)
median_clt <- rnorm(1000, mean = 10, sd = 2)
many_sample_medians <- function(vec, n, reps) {
rep_vec <- replicate(reps, sample(vec, n), simplify = "vector")
median_vec <- apply(rep_vec, 2, median)
return(median_vec)
}
median_clt_test <- many_sample_medians(median_clt, 500, 1000)
median_clt_test_df <- data.frame(median_clt_test)
bw_clt <- 2 * IQR(median_clt_test_df$median_clt_test) / length(median_clt_test_df$median_clt_test)^(1/3)
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..), fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2), col = "darkorchid1", lwd = 2) +
theme_classic()
As far as I can tell, the rest of the code is working properly - it just doesn't plot the dnorm stat function correctly. The exact same stat line worked for me before, so I'm not sure what's gone wrong.
The line isn't quite flat; it's just very stretched out compared to the histogram. We can see this more clearly if we zoom out on the x axis and zoom in on the y axis:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2),
col = "darkorchid1",
lwd = 2) +
xlim(c(5, 15)) +
coord_cartesian(xlim = c(5, 15), ylim = c(0, 1)) +
theme_classic()
But why is this?
It's because you are using dnorm to plot the distribution of the random variable from which the medians were drawn, but your histogram is a sample of the medians themselves. So you are plotting the wrong dnorm curve. The sd should not be the standard deviation of the random variable, but the standard deviation of the sample medians:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x,
mean = mean(median_clt_test),
sd = sd(median_clt_test)),
col = "darkorchid1",
lwd = 2) +
theme_classic()
If you prefer you could use the theoretical standard error of the mean instead of the measured standard deviation of your medians - these will be very similar.
# Theoretical SEM
2/sqrt(500)
#> [1] 0.08944272
# SD of medians
sd(median_clt_test)
#> [1] 0.08850221
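To see why the original curve looked flat, it can help to compare the two spreads and the resulting peak heights directly. This is a small base-R sketch that re-creates the setup with shorter names (population and medians are names chosen here, not from the question); the peak of a normal density is 1 / (sd * sqrt(2 * pi)), so a much larger sd gives a much lower peak:

```r
set.seed(14)
# Same setup as in the question: 1000 draws, then medians of samples of 500
population <- rnorm(1000, mean = 10, sd = 2)
medians <- replicate(1000, median(sample(population, 500)))

# Spread of the underlying variable vs spread of the sample medians
sd_population <- sd(population)
sd_medians    <- sd(medians)

# Peak height of each normal curve, evaluated at its own mean
peak_wide   <- dnorm(10, mean = 10, sd = 2)
peak_narrow <- dnorm(mean(medians), mean = mean(medians), sd = sd_medians)

print(c(sd_population = sd_population, sd_medians = sd_medians))
print(c(peak_wide = peak_wide, peak_narrow = peak_narrow))
```

On the density scale of the histogram, the curve with sd = 2 therefore sits close to zero across the narrow range that the medians occupy.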

How can I plot a line on a bar chart in R

H <- c(1,2,4,1,0,0,3,1,3)
M <- c("one","two","three","four","five")
barplot(H,names.arg=M,xlab="number",ylab="random",col="blue",
main="bar chart",border="blue")
I want to add a line to the bar chart, like the one in blue, but I don't know how to do it.
Maybe you want this?
hist(H, breaks=-1:4, freq=FALSE, xaxt="n")
axis(side=1, at=seq(-0.5, 3.5), labels=M)
lines(density(H))
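The graph in the question can be made with code along the lines of:
1. Tabulate the vector H.
tbl <- table(H)
df1 <- as.data.frame(tbl) # columns are named H (a factor) and Freq
2. With package ggplot2, built-in ways of fitting a line can be used.
library(ggplot2)
# convert the factor back to its numeric values, not its level codes
ggplot(df1, aes(as.integer(as.character(H)), Freq)) +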
geom_bar(stat = "identity", fill = "red", alpha = 0.5) +
geom_smooth(method = stats::loess,
formula = y ~ x,
color = "blue", fill = "blue",
alpha = 0.5)
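One thing to watch when a table is turned into a data frame: the tabulated values come back as a factor, and as.integer() on a factor returns the level codes rather than the original numbers. A short base-R check, using the H vector from the question:

```r
H <- c(1, 2, 4, 1, 0, 0, 3, 1, 3)
df1 <- as.data.frame(table(H))

# The first column takes its name from the tabulated variable
print(names(df1))

# Level codes (1, 2, 3, ...) vs the values that were actually tabulated
codes  <- as.integer(df1$H)
values <- as.integer(as.character(df1$H))
print(data.frame(codes, values, Freq = df1$Freq))
```

Here the two differ by one because the data contain 0. With gaps in the data (say, no observation equal to 2) they would drift further apart, which shifts the bars along the x axis.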
Test data creation code.
set.seed(2020)
f <- function(x) sin(x)^2*exp(x)
p <- f(seq(0, 2.5, by = 0.05))
p <- p/sum(p)
H <- sample(51, size = 1e3, prob = p, replace = TRUE)
Edit
Here is a new graph, with the new data posted in comment. The data is at the end of this answer.
library(ggplot2)
library(scales)
Mdate <- as.Date(paste0(M, "/2020"), format = "%d/%m/%Y")
df1 <- data.frame(H, M = Mdate)
ggplot(df1, aes(M, H)) +
geom_bar(stat = "identity", fill = "red", alpha = 0.5) +
geom_smooth(method = stats::loess,
formula = y ~ x,
color = "blue", fill = "blue", alpha = 0.25,
level = 0.5, span = 0.1) +
scale_x_date(labels = date_format("%d/%m"))
New data
H <- c(1,2,4,1,0,0,3,1,3,3, 6,238,0,
58,17,64,38,3,10,8, 10,11,13,
7,25,11,12,13,28,44)
M <- c("29/02","01/03","02/03","03/03",
"04/03","05/03","06/03","07/03",
"08/03", "09/03","10/03","11/03",
"12/03","13/03","14/03","15/03",
"16/03", "17/03","18/03","19/03",
"20/03","21/03","22/03","23/03",
"24/03", "25/03","26/03","27/03",
"28/03","29/03")

Warning message 'mapping' is not used by stat_function() in R

While completing a project for understanding central limit theorem for exponential distribution, I ran into an annoying error message when plotting simulated vs theoretical distributions. When I run the code below, I get an error: 'mapping' is not used by stat_function().
By mapping I assume the error is referring to the aes parameter, which I later map to color red using scale_color_manual in order to show it in a legend.
My question is two-fold: why is this error happening, and is there a more efficient way to create a legend without using scale_color_manual?
Thank you!
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
It's not an error, it's a warning:
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
#> Warning: `mapping` is not used by stat_function()
Created on 2020-05-01 by the reprex package (v0.3.0)
You can suppress the warning by calling geom_line(stat = "function") rather than stat_function():
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
geom_line(stat = "function", fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
Created on 2020-05-01 by the reprex package (v0.3.0)
In my opinion, the warning is erroneous, and an issue has been filed about this problem: https://github.com/tidyverse/ggplot2/issues/3611
However, it's not that easy to solve, and therefore as of now the warning is there.
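Another way to sidestep the warning, sketched here as an alternative rather than as what the answer above does, is to evaluate dnorm yourself and plot the result with an ordinary geom_line. The grid limits (2 to 8) and the names curve_df and source are assumptions chosen for this example:

```r
library(ggplot2)

lambda <- 0.2

# Evaluate the theoretical density on a grid (grid limits are an assumption)
curve_df <- data.frame(x = seq(2, 8, length.out = 200))
curve_df$density <- dnorm(curve_df$x, mean = 1 / lambda,
                          sd = 1 / lambda / sqrt(40))
curve_df$source <- "Theoretical"

# A regular geom_line layer accepts the colour mapping without complaint
p <- ggplot(curve_df, aes(x = x, y = density, colour = source)) +
  geom_line() +
  scale_colour_manual("Legend", values = c(Theoretical = "red"))
p
```

In the full plot the same layer would be added as geom_line(data = curve_df, aes(x = x, y = density, colour = source), inherit.aes = FALSE) alongside the histogram and geom_density layers.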
I'm unable to recreate your issue: when I run your code a plot is generated, which suggests the issue likely has to do with your environment. A general 'solution' is to clear your workspace using the menu dropdown or similar (Session -> Clear workspace...), then re-run your code.
To simplify the color handling, you can reduce scale_color_manual to
scale_color_manual("Legend", values = c('blue','red')), but the named version you have now is a bit better in my view, because it does not depend on the order of the values. Anything beyond that has more to do with changing the data structure and mapping.
Apologies, I don't have the rep to make a comment.

How to align layers (density plot and vertical line) in a single ggplot2

I am trying to adjust the layers of a plot that uses both stat_function and geom_vline. My problem is that the vertical line is not perfectly aligned with the green area:
Density plot with a vertical line (not aligned)
In this post I saw a solution to align two separate plots; however, in my case I want to align them in the same plot.
all_mean <- mean(mtcars$wt,na.rm = T)%>% round(2)
all_sd <- sd(mtcars$wt,na.rm = T)%>% round(2)
my_score <- mtcars[1,"wt"]
dd <- function(x) { dnorm(x, mean=all_mean, sd=all_sd) }
z <- (my_score - all_mean)/all_sd
pc <- round(100*(pnorm(z)), digits=0)
t1 <- paste0(as.character(pc),"th percentile")
p33 <- all_mean + (qnorm(0.3333) * all_sd)
p67 <- all_mean + (qnorm(0.6667) * all_sd)
funcShaded <- function(x, lower_bound) {
y = dnorm(x, mean = all_mean, sd = all_sd)
y[x < lower_bound] <- NA
return(y)
}
greenShaded <- function(x, lower_bound) {
y = dnorm(x, mean = all_mean, sd = all_sd)
y[x > (all_mean*2)] <- NA
return(y)
}
ggplot(data.frame(x=c(min(mtcars$wt-2), max(mtcars$wt+2))), aes(x=x)) +
stat_function(fun=dd, colour="black") +
stat_function(fun = greenShaded, args = list(lower_bound = pc),
geom = "area", fill = "green", alpha = 1)+
stat_function(fun = funcShaded, args = list(lower_bound = my_score),
geom = "area", fill = "white", alpha = .9)+
geom_vline(aes(xintercept=my_score), colour="black")
stat_function chooses n points along your range, by default 101. This means you only have limited resolution for your curve. Simply increase n for the funcShaded layer.
ggplot(data.frame(x=c(min(mtcars$wt-2), max(mtcars$wt+2))), aes(x=x)) +
stat_function(fun=dd, colour="black") +
stat_function(fun = greenShaded, args = list(lower_bound = pc),
geom = "area", fill = "green", alpha = 1)+
stat_function(fun = funcShaded, args = list(lower_bound = my_score),
geom = "area", fill = "white", alpha = .9, n = 1e3)+
geom_vline(aes(xintercept=my_score), colour="black")
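The size of the misalignment can be estimated in base R. This sketch assumes the function is evaluated on n evenly spaced points across the x-range given to ggplot(); gap_to_first_point is a helper name introduced here. It measures how far the first evaluated point at or above my_score lies from my_score itself, which is roughly where the white area starts:

```r
# x-range used in the plot above: data range padded by 2 on each side
x_range  <- range(mtcars$wt) + c(-2, 2)
my_score <- mtcars[1, "wt"]

# Gap between my_score and the first grid point at or above it,
# assuming n evenly spaced evaluation points across x_range
gap_to_first_point <- function(n) {
  grid <- seq(x_range[1], x_range[2], length.out = n)
  min(grid[grid >= my_score]) - my_score
}

gap_101  <- gap_to_first_point(101)   # default resolution
gap_1000 <- gap_to_first_point(1000)  # resolution used in the fix
print(c(gap_101 = gap_101, gap_1000 = gap_1000))
```

The gap can never exceed the grid spacing, so raising n shrinks the worst-case offset between the shaded edge and the vertical line.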

What causes different behaviour between stats and ggplot2 when writing histograms, normal curves and qqplots to .pdf?

I need to produce plots for statistical analyses and I am stumped by a difference in behaviour between stats and ggplot. Who can help out?
I am trying to produce a pdf with histograms, including normal curves, side-by-side with qqplots, with the next plot continuing on the same page. Preferably using ggplot (because prettier plots). I have a large number of variables in my real dataset, so I am using a 'for' loop.
library(ggplot2)
library(stats)
library(datasets)
This piece of ggplot code does what I want it to do.
ggplot(airquality, aes(Wind)) +
geom_histogram(aes(y = ..density..),colour = "black", fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(airquality$Wind), sd = sd(airquality$Wind)), colour = "red", size = 1) +
xlab("Wind")
qplot(sample = airquality$Wind, stat = "qq")
I am fine with the binwidth warning: I want that picked automatically, and I will build in a suppression for that message later on. I am not sure what to do, though, about '"stat" is deprecated'. Anyone?
If I try to work this into a 'for' loop, I cannot get it to work. It keeps putting every plot on a new page and it leaves out the normal curves:
Variablesairquality<-c("Wind", "Temp", "Month", "Day")
pdf(file = "Normality.pdf", 4, 5)
par(mfrow = c(2,2))
for(i in Variablesairquality){
plot(ggplot(airquality, aes(airquality[,i])) +
geom_histogram(aes(y = ..density..),colour = "black", fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(airquality[,i]), sd = sd(airquality[,i])), colour = "red", size = 1) +
xlab(i)
)
plot(qplot(sample = airquality[,i], stat = "qq" )
)
}
dev.off()
Which I don’t get, because if I try it using stats, it does exactly what I want:
pdf(file = "Normality2.pdf", 4, 5)
par(mfrow = c(2,2))
for(i in Variablesairquality){
h <- hist(airquality[,i], col = "white", cex.axis=0.50, xlab = i, cex.lab=0.75, main = paste("Distribution"), cex.main= 0.75)
xfit<-seq(min(airquality[,i]),max(airquality[,i]),length=length(airquality[,i]))
yfit<-dnorm(xfit,mean=mean(airquality[,i]),sd=sd(airquality[,i]))
yfit <- yfit*diff(h$mids[1:2])*length(airquality[,i])
lines(xfit, yfit, col="red", lwd=1)
qqnorm(airquality[,i], cex = 0.5, cex.axis=0.50, cex.lab=0.75, main = expression("Q-Q plot for"~paste(i)), cex.main= 0.75)
qqline(airquality[,i], col = "red")
}
dev.off()
(Except for the main label, which I still need to figure out. Does anyone have any tips?)
I would be most grateful if someone could point out the mistake in my ggplot code or otherwise explain this behaviour. Thanks!
I use R v3.2.3 and RStudio v0.99.891. (And yes, I read every similar item here, scoured the internet and read the help files; that did not get me where I need to go.)
On `stat` is deprecated, see Deprecated features in the ggplot2 2.0.0 release notes. Use instead:
ggplot(airquality, aes(sample = Wind)) +
stat_qq()
If you don't wish to use gridExtra::grid.arrange, here's an approach that uses facets. Begin by wrangling the data into a new dataframe with the values we want for x, y, plot type, and geom variables:
d <- as.data.frame(qqnorm(airquality$Wind, plot.it = F))
d$plot <- "QQ plot"
d$geom <- "point"
d <- rbind(d, data.frame(x = airquality$Wind, y = NA,
plot = "Histogram", geom = "bar"))
d <- rbind(d, with(airquality, data.frame(
x = seq(min(Wind), max(Wind), l = 100),
y = dnorm(seq(min(Wind), max(Wind), l = 100),
mean = mean(Wind), sd = sd(Wind)),
plot = "Histogram", geom = "line")))
Then call ggplot, subsetting the data as appropriate for each geom:
ggplot(d, aes(x = x, y = y)) + facet_wrap(~plot, scales = "free") +
geom_histogram(data = subset(d, plot == "Histogram" & geom == "bar"),
aes(y = ..density..),
colour = "black", fill = "white") +
geom_line(data = subset(d, plot == "Histogram" & geom == "line"),
colour = "red", size = 1) +
geom_point(data = subset(d, plot == "QQ plot")) +
labs(x = "Wind")
To do multiple plots, you can wrap the code above into a for loop, making sure to wrap ggplot inside print:
pdf("path/to/pdf/out.pdf")
Variablesairquality <- c("Wind", "Temp", "Month", "Day")
for (i in rev(Variablesairquality)) {
x <- airquality[[i]]
d <- as.data.frame(qqnorm(x, plot.it = F))
d$plot <- "QQ plot"
d$geom <- "point"
d <- rbind(d, data.frame(x = x, y = NA, plot = "Histogram", geom = "bar"))
d <- rbind(d, data.frame(x = seq(min(x), max(x), l = 100),
y = dnorm(seq(min(x), max(x), l = 100),
mean = mean(x), sd = sd(x)),
plot = "Histogram", geom = "line"))
print(
ggplot(d, aes(x = x, y = y)) + facet_wrap(~plot, scales = "free") +
geom_histogram(data = subset(d, plot == "Histogram" & geom == "bar"),
aes(y = ..density..),
colour = "black", fill = "white") +
geom_line(data = subset(d, plot == "Histogram" & geom == "line"),
colour = "red", size = 1) +
geom_point(data = subset(d, plot == "QQ plot")) +
labs(x = i)
)
}
dev.off()
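On the side question about the main label in the base-graphics loop: expression() does not evaluate i, so the title shows the symbol instead of the variable name. A minimal sketch that builds the title as a plain string with paste(); the name qq_titles and the temporary output file are assumptions made for this example:

```r
vars <- c("Wind", "Temp", "Month", "Day")

# Build one title string per variable; paste() inserts the actual name
qq_titles <- vapply(vars, function(v) paste("Q-Q plot for", v), character(1))

# Write to a temporary file so the sketch does not overwrite anything
pdf(tempfile(fileext = ".pdf"), width = 4, height = 5)
par(mfrow = c(2, 2))
for (i in vars) {
  qqnorm(airquality[[i]], main = qq_titles[[i]], cex = 0.5, cex.main = 0.75)
  qqline(airquality[[i]], col = "red")
}
dev.off()
```

If plotmath formatting is needed in the title, bquote("Q-Q plot for" ~ .(i)) substitutes the value of i in a similar way.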

Resources