I'm trying to show data for two groups. I am using the ggplot2 package to graph the data and stat_summary() to obtain a point estimate (mean) and 90% CI within the plot. What I'd like is for the mean and confidence interval to be positioned off to the right of the points representing the distribution of the data. Currently, stat_summary() simply overlays the mean and CI on top of the distribution.
Here is an example of data that I am working with:
set.seed(9909)
Subjects <- 1:100
values <- c(rnorm(n = 50, mean = 30, sd = 5), rnorm(n = 50, mean = 35, sd = 8))
data <- cbind(Subjects, values)
group1 <- rep("group1", 50)
group2 <- rep("group2", 50)
group <- c(group1, group2)
data <- data.frame(data, group)
data
And this is what my current ggplot2 code looks like (distribution as points with the mean and 90% CI overlaid on top for each group):
ggplot(data, aes(x = group, y = values, group = 1)) +
geom_point() +
stat_summary(fun.y = "mean", color = "red", size = 5, geom = "point") +
stat_summary(fun.data = "mean_cl_normal", color = "red", size = 2, geom = "errorbar", width = 0, fun.args = list(conf.int = 0.9)) + theme_bw()
Is it possible to get the mean and confidence intervals to position_dodge to the right of their respective groups?
You can use position_nudge:
library(ggplot2)

ggplot(data, aes(x = group, y = values, group = 1)) +
  geom_point() +
  # `fun` replaces `fun.y`, which was deprecated in ggplot2 3.3.0
  stat_summary(fun = "mean", color = "red", size = 5, geom = "point",
               position = position_nudge(x = 0.1, y = 0)) +
  # mean_cl_normal needs the Hmisc package to be installed
  stat_summary(fun.data = "mean_cl_normal", color = "red", size = 2,
               geom = "errorbar", width = 0, fun.args = list(conf.int = 0.9),
               position = position_nudge(x = 0.1, y = 0)) +
  theme_bw()
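For reference, here is a minimal sketch of the interval that mean_cl_normal is understood to compute: a t-based confidence interval around the mean. The helper name mean_ci is hypothetical (it is not part of ggplot2 or Hmisc), and the data are regenerated with the same seed as in the question.

```r
# Hypothetical helper (not part of ggplot2 or Hmisc):
# t-based confidence interval around the mean.
mean_ci <- function(x, conf = 0.9) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  half_width <- qt((1 + conf) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(y = m, ymin = m - half_width, ymax = m + half_width)
}

# Same data-generating code as in the question
set.seed(9909)
values <- c(rnorm(n = 50, mean = 30, sd = 5), rnorm(n = 50, mean = 35, sd = 8))
group <- rep(c("group1", "group2"), each = 50)

# One column per group, rows y / ymin / ymax
ci_by_group <- sapply(split(values, group), mean_ci)
print(round(ci_by_group, 2))
```

A function of this shape could also be passed to stat_summary(fun.data = ...), since it returns y, ymin and ymax, though the built-in mean_cl_normal is the simpler choice.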
I am working through a class problem to test if the central limit theorem applies to medians as well. I've written the code, and as far as I can tell, it is working just fine. But my dnorm stat to plot the normal distribution is not showing up. It just creates a flat line when it should create a bell curve. Here is the code:
set.seed(14)
median_clt <- rnorm(1000, mean = 10, sd = 2)
many_sample_medians <- function(vec, n, reps) {
rep_vec <- replicate(reps, sample(vec, n), simplify = "vector")
median_vec <- apply(rep_vec, 2, median)
return(median_vec)
}
median_clt_test <- many_sample_medians(median_clt, 500, 1000)
median_clt_test_df <- data.frame(median_clt_test)
bw_clt <- 2 * IQR(median_clt_test_df$median_clt_test) / length(median_clt_test_df$median_clt_test)^(1/3)
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..), fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2), col = "darkorchid1", lwd = 2) +
theme_classic()
As far as I can tell, the rest of the code is working properly - it just doesn't plot the dnorm stat function correctly. The exact same stat line worked for me before, so I'm not sure what's gone wrong.
The line isn't quite flat; it's just very stretched out compared to the histogram. We can see this more clearly if we zoom out on the x axis and zoom in on the y axis:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2),
col = "darkorchid1",
lwd = 2) +
xlim(c(5, 15)) +
coord_cartesian(xlim = c(5, 15), ylim = c(0, 1)) +
theme_classic()
But why is this?
It's because you are using dnorm to plot the distribution of the random variable from which the medians were drawn, but your histogram is a sample of the medians themselves. So you are plotting the wrong dnorm curve. The sd should not be the standard deviation of the random variable, but the standard deviation of the sample medians:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x,
mean = mean(median_clt_test),
sd = sd(median_clt_test)),
col = "darkorchid1",
lwd = 2) +
theme_classic()
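To see numerically why the original curve looked flat, compare the peak heights of the two normal densities. This is only a rough sketch: the value 0.0885 is the approximate SD of the sample medians reported in this answer, hard-coded here for illustration.

```r
# The peak height of a normal density is 1 / (sd * sqrt(2 * pi)),
# so a larger sd gives a lower, wider curve.
peak_population <- dnorm(10, mean = 10, sd = 2)

# 0.0885 is roughly the SD of the sample medians reported in the answer
# (hard-coded here as an assumption for illustration).
peak_medians <- dnorm(10, mean = 10, sd = 0.0885)

print(c(population = peak_population, medians = peak_medians))

# The ratio of the peaks equals the inverse ratio of the SDs: 2 / 0.0885
print(peak_medians / peak_population)
```

On the density scale of the histogram, the curve with sd = 2 is therefore more than twenty times lower than the histogram's peak, which is why it reads as a flat line.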
If you prefer, you could use the theoretical standard error of the mean instead of the measured standard deviation of your medians. The two are very similar here, although strictly the formula is for the mean rather than the median.
# Theoretical SEM
2/sqrt(500)
#> [1] 0.08944272
# SD of medians
sd(median_clt_test)
#> [1] 0.08850221
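As a rough cross-check, the sketch below simulates medians of samples drawn directly from the normal distribution. Note this is an assumption that differs from the question, which resamples without replacement from one fixed vector of 1000 values and so gets a smaller spread. For independent normal samples, the asymptotic standard error of the median is sqrt(pi / 2) * sigma / sqrt(n), somewhat larger than the standard error of the mean.

```r
set.seed(14)
sigma <- 2
n <- 500
reps <- 1000

# Median of each of `reps` independent normal samples of size n
sample_medians <- replicate(reps, median(rnorm(n, mean = 10, sd = sigma)))

se_mean <- sigma / sqrt(n)                   # standard error of the mean
se_median <- sqrt(pi / 2) * sigma / sqrt(n)  # asymptotic SE of the median

print(c(simulated = sd(sample_medians),
        se_mean = se_mean,
        se_median = se_median))
```

Either way, the practical advice stands: use the spread of the medians themselves, not the SD of the underlying variable, when overlaying the normal curve.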
I have two dataframes, one which I want to make a stat_density_2d plot using a 'raster' geom and one in which I want to use a 'point' geom. For the point geom I want to remove any point where there is no data though, as measured by a point size of 0.
The following is my code:
library(tidyverse)
set.seed(1)
#tibble for raster density plot
df <- tibble(x = runif(1000000, min = -7, max = 5),
y = runif(1000000, min = 0, max = 1000))
#tibble for point density plot
df2 <- tibble(x = runif(20000, min = -2, max = 2),
y = runif(20000, min = 0, max = 500))
#create the density plot
p1 <- ggplot(NULL, aes(x=x, y=y) ) +
stat_density_2d(data = df, aes(fill = stat(density)), geom = "raster", contour = FALSE) +
scale_fill_gradient(low="transparent", high="red") +
stat_density_2d(data = df2, geom = "point", aes(size = ..density..), n = 40, contour = FALSE) +
theme_bw() +
theme(text=element_text(size=18)) +
ylim(0, 1000) + xlim(-7, 5)
p1
which returns:
But where the points are smallest (outside the bounds specified in the df2 tibble) I don't want any density points to be shown. Is there any way to remove these?
Here's a hack, though I don't know how robust it is to differences in data.
BLUF: add scale_radius(range=c(-1,6)).
I reduced your data a lot so that it doesn't take 5 minutes to render.
set.seed(1)
df <- tibble(x = runif(1000, min = -7, max = 5),
y = runif(1000, min = 0, max = 1000))
df2 <- tibble(x = runif(20, min = -2, max = 2),
y = runif(20, min = 0, max = 500))
Four plots:
Your code (my data), no other change;
scale_radius();
scale_radius(range = c(-0.332088004, 6)); and
scale_radius(range = c(-1, 6)).
This is surely a hack, and I don't know how to find a more precise way of filtering out specific levels.
The modified code:
p1 <- ggplot(NULL, aes(x=x, y=y) ) +
stat_density_2d(data = df, aes(fill = stat(density)), geom = "raster", contour = FALSE) +
scale_fill_gradient(low="transparent", high="red") +
stat_density_2d(data = df2, geom = "point", aes(size = ..density..), n = 40, contour = FALSE) +
theme_bw() +
# scale_radius() +
# scale_radius(range = c(-0.332088004, 6)) +
scale_radius(range = c(-1, 6)) +
theme(text=element_text(size=18)) +
ylim(0, 1000) + xlim(-7, 5)
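A rough sketch of why the negative lower bound seems to work: scale_radius() maps the data range linearly onto the requested range, so with range = c(-1, 6) the lowest densities end up with a radius at or below zero, which is presumably why they vanish. The function rescale_linear and the density values below are hypothetical stand-ins for illustration, not ggplot2 internals.

```r
# Hypothetical stand-in for the linear mapping a radius scale applies
rescale_linear <- function(x, to) {
  from <- range(x)
  to[1] + (x - from[1]) / (from[2] - from[1]) * (to[2] - to[1])
}

# Made-up density values spanning 0 to 1
density <- c(0, 0.05, 0.1, 0.2, 0.5, 1)
radius <- rescale_linear(density, to = c(-1, 6))

# Points whose mapped radius is not positive are the ones that disappear
print(data.frame(density, radius, positive = radius > 0))
```

Under this mapping, everything below one seventh of the density range gets a non-positive radius, so the cut-off depends on the data and on the chosen lower bound.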
I have created a graph that overlays a normally distributed density plot on top of a previous density plot using the dnorm() function. However, I am having a difficult time adding a legend. Below is the code to create the plot with one of my attempts at adding a legend.
library(tidyverse)
my.data = rnorm(1000, 3, 10)
ggplot(enframe(my.data), aes(value)) +
geom_density(fill = "mediumseagreen", alpha = 0.1) +
geom_area(stat = "function", fun = function(x) dnorm(x, mean = 0, sd = 5), fill = "red", alpha = .5)+
theme(legend.position="right")+
scale_color_manual("Line.Color", values=c(red="red",green="green"),
labels=paste0("Plot",1:2))
To summarize, I am trying to add a legend to this plot with the labels "Plot1" and "Plot2".
There might be better answers. This is what I have achieved after several attempts:
library(tidyverse)
my.data = rnorm(1000, 3, 10)
ggplot(enframe(my.data), aes(value)) +
geom_density(aes(color = "Plot1", fill = "Plot1"), alpha = 0.1) +
geom_area(aes(color = "Plot2", fill = "Plot2"), stat = "function",
fun = function(x) dnorm(x, mean = 0, sd = 5), alpha = .5)+
theme(legend.position="right") +
scale_color_manual(" ", values=c(Plot1="green", Plot2="red")) +
scale_fill_manual(" ", values=c(Plot1 ="green", Plot2="red"))
I'm plotting the relationships between speed and time for four different species (each in a different facet). For each species, I have a range of speeds I'm interested in, and would like to shade the area between the min and max values. However, these ranges are different for the 4th species compared to the first three.
#data to plot as points
species <- sample(letters[1:4], 40, replace = TRUE)
time <- runif(40, min = 1, max = 100)
speed <- runif(40, min = 1, max = 20)
df <- data.frame(species, time, speed)
#ranges of key speeds
sp <- letters[1:4]
minspeed <- c(5, 5, 5, 8)
maxspeed <- c(10, 10, 10, 13)
df.range <- data.frame(sp, minspeed, maxspeed)
ggplot() +
geom_hline(data = df.range, aes(yintercept = minspeed),
colour = "red") +
geom_hline(data = df.range, aes(yintercept = maxspeed),
colour = "red") +
geom_point(data=df, aes(time, speed),
shape = 1) +
facet_wrap(~species) +
theme_bw()
How do I:
get geom_hline to only plot the max and min ranges for the correct species, and
shade the area between the two lines?
For the latter part, I've tried adding geom_ribbon to my plot, but I keep getting an error message that I'm unsure how to address.
geom_ribbon(data = df,
aes(ymin = minspeed, ymax = maxspeed,
x = c(0.0001, 100)),
fill = "grey",
alpha = 0.5) +
Error: Aesthetics must be either length 1 or the same as the data
(40): x, ymin, ymax
As per my comment, the following should work. Perhaps there are other unobserved differences between your actual use case & the example in your question?
colnames(df.range)[which(colnames(df.range) == "sp")] <- "species"
ggplot() +
geom_hline(data = df.range, aes(yintercept = minspeed),
colour = "red") +
geom_hline(data = df.range, aes(yintercept = maxspeed),
colour = "red") +
geom_point(data = df, aes(time, speed),
shape = 1) +
geom_rect(data = df.range,
aes(xmin = -Inf, xmax = Inf, ymin = minspeed, ymax = maxspeed),
fill = "grey", alpha = 0.5) +
facet_wrap(~species) +
theme_bw()
Data used:
df <- data.frame(species = sample(letters[1:4], 40, replace = TRUE),
time = runif(40, min = 1, max = 100),
speed = runif(40, min = 1, max = 20))
df.range <- data.frame(sp = letters[1:4],
minspeed = c(5, 5, 5, 8),
maxspeed = c(10, 10, 10, 13))
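The rename from sp to species is what makes the lines and rectangles facet-specific: a layer whose data lacks the faceting variable is repeated in every panel. The sketch below, which assumes ggplot2 is installed and uses a hypothetical helper count_hlines with deterministic stand-in data, counts the rows ggplot2 draws for the geom_hline layer in each case.

```r
library(ggplot2)

# Deterministic stand-in for the question's point data
df <- data.frame(species = rep(letters[1:4], each = 10),
                 time = seq(1, 100, length.out = 40),
                 speed = seq(1, 20, length.out = 40))

# Range data with the original column name, and with it renamed
ranges_sp <- data.frame(sp = letters[1:4],
                        minspeed = c(5, 5, 5, 8),
                        maxspeed = c(10, 10, 10, 13))
ranges_species <- setNames(ranges_sp, c("species", "minspeed", "maxspeed"))

# Hypothetical helper: number of hline rows drawn across all panels
count_hlines <- function(range_data) {
  p <- ggplot() +
    geom_hline(data = range_data, aes(yintercept = minspeed)) +
    geom_point(data = df, aes(time, speed)) +
    facet_wrap(~species)
  nrow(layer_data(p, 1))
}

print(c(sp = count_hlines(ranges_sp),
        species = count_hlines(ranges_species)))
```

With the sp column, every one of the four lines should appear in each of the four panels; with species, each panel should get only its own line.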
I'm trying to plot a histogram with ggplot2.
I wrote some simple code for this in R:
dnorm.count <- function(x, mean = 0, sd = 1, log = FALSE, n = 1, binwidth = 1){
n * binwidth * dnorm(x = x, mean = mean, sd = sd, log = log)
}
mtcars %>%
ggplot(aes(x = mpg)) +
geom_histogram(bins =60,color = "white", fill = "#9FE367",boundary = 0.5) +
geom_vline(aes(xintercept = mean(mpg)),
linetype="dashed",
size = 1.6,
color = "#FF0000")+
geom_text(aes(label = ..count..), stat= "count",vjust = -0.6)+
stat_function(fun = dnorm.count, color = "#6D67E3",
args = list(mean= mean(mtcars$mpg),
sd = sd(mtcars$mpg),
n = nrow(mtcars)),
lwd = 1.2) +
scale_y_continuous(labels = comma, name = "Frequency") +
scale_x_continuous(breaks=seq(0,max(mtcars$mpg)))+
geom_text(aes(label = paste0("mean = ", round(mean(mtcars$mpg), 2)),
x = mean(mtcars$mpg)*1.2,
y = mean(mtcars$mpg)/5))+
geom_vline(aes(xintercept = sd(mpg)), linetype="dashed",size = 1.6, color = "#FF0000")
What I got is this!
The question is: how do I plot a histogram similar to the reference image using ggplot2, and is it possible to convert the code into an R function?
Edit: For the better explanation of what I'm trying to do:
I want to create a histogram exactly like the one attached for reference using ggplot2, and then wrap it in a function to reduce the coding. Use any package + ggplot2 you like. The histogram should have lines depicting the standard deviation and mean like the one in the reference. If possible, depict the standard deviation in the plot as in the reference image; that's what I'm trying to achieve.
If your question is how to plot histograms like the one you attached in your last figure, the following lines of code produce a very similar result.
library(magrittr) ; library(ggplot2)
set.seed(42)
data <- rnorm(1e5)
p <- data %>%
as.data.frame() %>%
ggplot(., aes(x = data)) +
geom_histogram(fill = "white", col = "black", bins = 30 ) +
geom_density(aes( y = 0.3 *..count..)) +
labs(x = "Statistics", y = "Probability/Density") +
theme_bw() + theme(axis.text = element_blank())
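The factor 0.3 in geom_density(aes(y = 0.3 * ..count..)) appears to play the role of the bin width: a density multiplied by n and by the bin width lands on the same scale as histogram counts, which is also the idea behind dnorm.count in the question. The base R sketch below checks this under the assumption of a fixed bin width of 0.3 (ggplot2's actual width for bins = 30 will differ slightly).

```r
set.seed(42)
x <- rnorm(1e5)
n <- length(x)

# Assumed bin width, chosen to match the 0.3 factor in the answer
binwidth <- 0.3
breaks <- seq(-6, 6, by = binwidth)

# Histogram counts without plotting
h <- hist(x, breaks = breaks, plot = FALSE)

# Normal density rescaled to the count scale: n * binwidth * density
expected_counts <- n * binwidth * dnorm(h$mids, mean = mean(x), sd = sd(x))

# Largest gap between observed and expected counts, as a fraction of n
worst_gap <- max(abs(h$counts - expected_counts)) / n
print(worst_gap)
```

If the bin width changes, the multiplier has to change with it, so passing the bin width explicitly to both the histogram and the curve is the more robust design for a reusable function.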
You could use annotate() to add symbols or text and geom_segment to show the intervals on the plot like this:
p + annotate(x = sd(data)/2 , y = 8000, geom = "text", label = "σ", size = 10) +
annotate(x = sd(data) , y = 6000, geom = "text", label = "2σ", size = 10) +
annotate(x = sd(data)*1.5 , y = 4000, geom = "text", label = "3σ", size = 10) +
geom_segment(x = 0, xend = sd(data), y = 7500, yend = 7500) +
geom_segment(x = 0, xend = sd(data)*2, y = 5500, yend = 5500) +
geom_segment(x = 0, xend = sd(data)*3, y = 3500, yend = 3500)
This chunk of code would give you something like this: