How to use geom_errorbar with facet_wrap in ggplot2 - r

I am facing a problem adding error bars to my plots. I have a data frame like this:
> str(bank1)
'data.frame': 24 obs. of 4 variables:
$ site : Factor w/ 12 levels "BED","BEU","EB",..: 8 9 10 3 11 1 6 7 5 4 ...
$ canopy : Factor w/ 3 levels "M_Closed","M_Open",..: 3 3 3 3 2 2 2 2 1 1 ...
$ variable: Factor w/ 2 levels "depth5","depth10": 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 200 319 103 437 33 51 165 38 26 29 ...
I plot it like this:
gs1 <- ggplot(bank1, aes(x = canopy, y= value , fill = variable)) +
geom_bar(stat='identity', position = 'dodge', fill = 'darkgray')+
xlab("Canopy cover")+ylab("Seed Bank")+
facet_wrap(~variable,nrow=1)
gs1
This gives a plot like this:
My problem is that when I want to add error bars (standard deviation), the code does not run. I use this code:
bank2 <- bank1
bank2.mean = ddply(bank2, .(canopy, variable), summarize,
plant.mean = mean(value), plant.sd = sd(value))
gs1 <- ggplot(bank1, aes(x = canopy, y= value , fill = variable)) +
geom_bar(stat='identity', position = 'dodge', fill = 'darkgray')+
geom_errorbar(aes(ymin=plant.mean-plant.sd, ymax = plant.mean +
plant.sd), width = 0.5)+
xlab("Canopy cover")+ylab("Seed Bank")+
facet_wrap(~variable,nrow=1)
gs1
I searched for help in several related questions but did not understand how to proceed.
Kindly help!
Here I reproduce an example:
> set.seed(1)
> Data1 <- data.frame(
+ site= c("KOA","KOB","KOO","EB","PNS","BED","KB","KER","KAU","KAD","RO","BEU"),
+ variable = sample(c("depth5", "depth10"), 12, replace = TRUE),
+ canopy=sample(c("open", "M_open", "M_closed"), 12, replace = TRUE),
+ value=sample(c(100,500,50,20,112,200,230,250,300,150,160,400))
+ )
> Data1
site variable canopy value
1 KOA depth5 M_closed 20
2 KOB depth5 M_open 112
3 KOO depth10 M_closed 100
4 EB depth10 M_open 400
5 PNS depth5 M_closed 230
6 BED depth10 M_closed 50
7 KB depth10 M_open 250
8 KER depth10 M_closed 200
9 KAU depth10 M_closed 500
10 KAD depth5 open 150
11 RO depth5 M_open 300
12 BEU depth5 open 160
> gs1 <- ggplot(Data1, aes(x = canopy, y= value , fill = variable)) +
+ geom_bar(stat='identity', position = 'dodge', fill = 'darkgray')+
+ xlab("Canopy cover")+ylab("Seed Bank")+
+ facet_wrap(~variable,nrow=1)
> gs1
> Data2 <- Data1
> data2.mean = ddply(Data2, .(canopy, variable), summarize,
+ plant.mean = mean(value), plant.sd = sd(value))
> gs1 <- ggplot(Data2, aes(x = canopy, y= value , fill = variable)) +
+ geom_bar(stat='identity', position = 'dodge', fill = 'darkgray')+
+ geom_errorbar(aes(ymin=plant.mean-plant.sd, ymax = plant.mean +
+ plant.sd), width = 0.5)+
+ xlab("Canopy cover")+ylab("Seed Bank")+
+ facet_wrap(~variable,nrow=1)
> gs1
Error in FUN(X[[i]], ...) : object 'plant.mean' not found
I get the same error with my original data.
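A note on why this fails, added here since neither the question nor the answer spells it out: plant.mean and plant.sd exist only in the ddply summary (bank2.mean / data2.mean), not in the data frame passed to ggplot(), so the error-bar layer cannot find them. A minimal sketch of one possible fix, assuming the reproducible Data1 example above, is to give geom_errorbar its own data argument:
library(ggplot2)
library(plyr)

data2.mean <- ddply(Data1, .(canopy, variable), summarize,
                    plant.mean = mean(value), plant.sd = sd(value))

ggplot(Data1, aes(x = canopy, y = value, fill = variable)) +
  geom_bar(stat = 'identity', position = 'dodge', fill = 'darkgray') +
  # Point this layer at the summary data frame, where the columns exist,
  # and stop it inheriting y = value from the main ggplot() call
  geom_errorbar(data = data2.mean, inherit.aes = FALSE,
                aes(x = canopy,
                    ymin = plant.mean - plant.sd,
                    ymax = plant.mean + plant.sd),
                width = 0.5) +
  xlab("Canopy cover") + ylab("Seed Bank") +
  facet_wrap(~variable, nrow = 1)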

Here is the solution to my problem, the way I wanted it. You need these packages:
library(ggplot2)
library(dplyr)
My data frame bank1 was piped into a new data frame cleandata to calculate the mean, sd, and se, and to summarise the results:
cleandata <- bank1 %>%
group_by(canopy, variable) %>%
summarise(mean.value = mean(value),
sd.value = sd(value), count = n(),
se.mean = sd.value/sqrt(count))
The summarized results look like this:
> head(cleandata)
# A tibble: 6 x 6
# Groups: canopy [3]
canopy variable mean.value sd.value count se.mean
<fct> <fct> <dbl> <dbl> <int> <dbl>
1 Open depth5 265. 145. 4 72.4
2 Open depth10 20.5 12.8 4 6.41
3 M_Open depth5 71.8 62.6 4 31.3
4 M_Open depth10 6.5 4.20 4 2.10
5 M_Closed depth5 20 8.98 4 4.49
6 M_Closed depth10 0.5 1 4 0.5
Finally, the plotting was done with this piece of code:
gs1 <- ggplot(cleandata, aes(x=canopy, y=mean.value)) +
geom_bar(stat = "identity", color = "black", position = position_dodge())+
geom_errorbar(aes(ymin = mean.value - sd.value, ymax = mean.value + sd.value),
width=0.2)+
xlab("Canopy cover")+ylab("Seed Bank")+
facet_wrap(~variable,nrow=1)
gs1
This gives a graph with error bars (standard deviation), as shown below.
Problem solved! Cheers!
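A small follow-up of mine, not part of the original answer: since cleandata already contains se.mean, switching from standard-deviation to standard-error bars only means changing the columns used in geom_errorbar:
# Same plot, but with mean ± standard error instead of mean ± sd
gs1_se <- ggplot(cleandata, aes(x = canopy, y = mean.value)) +
  geom_bar(stat = "identity", color = "black", position = position_dodge()) +
  geom_errorbar(aes(ymin = mean.value - se.mean, ymax = mean.value + se.mean),
                width = 0.2) +
  xlab("Canopy cover") + ylab("Seed Bank") +
  facet_wrap(~variable, nrow = 1)
gs1_se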

Related

Geom_spoke not plotting properly

I am trying to plot the length and angle of a bursting event for a marine animal. However, I don't think it is plotting properly: the event occurring at 75 m has an ascent of 31, but this is not translating to the graph.
I would also like to cut the x-axis between 75 and 95.
ggplot(depth, aes(y = Hour, x = Startdepth)) +
geom_point(size = 3) +
geom_spoke(aes(angle = Pitchavg, radius = ascent), size = 1) +
theme_classic() +
scale_color_manual(values = c("#E7B800","#FC4E07", "#00AFBB", '#004E54')) +
annotate("rect", ymin = 5, ymax = 8.5, xmin = -1, xmax = 100,
alpha = .1, fill = "blue") +
annotate("rect", ymin = 16.5, ymax = 21.5, xmin = -1, xmax =100,
alpha = .1, fill = "blue") +
ylim(0, 24)
Below is a sample of my data:
Shark Startdepth Pitchavg ascent Hour
3 5.84887 15.170846 3.34844 19
4 0.424778 -1.737293333 0.014004 22
5 4.1233 7.2578775 0.02645 19
5 4.95108 17.91744 1.07518 19
5 4.77525 23.388734 0.95225 20
6 5.57968 -20.51986667 0 22
6 10.3285 37.778 4.48275 0
6 6.73732 17.5535036 5.52834 1
6 7.01584 27.8316 2.28104 1
6 0.977145 -24.41713333 0 4
6 55.3021 31.6784 9.9084 15
7 8.04433 24.8095 0 10
7 18.712 45.95876667 0 10
7 17.3225 26.36621125 16.374918 10
7 14.0488 37.84178571 0 11
7 20.24 35.13114667 0 11
7 25.5022 9.78095025 2.4459 13
The problem seems to be that geom_spoke expects the angle in radians while you have angles in degrees. Convert degrees to radians in aes() and the plot looks right.
I have commented out scale_color_manual() since the variable mapped to colour was removed from the data, as noted in the OP's comment.
x <- 'Shark Startdepth Pitchavg ascent Hour
3 5.84887 15.170846 3.34844 19
4 0.424778 -1.737293333 0.014004 22
5 4.1233 7.2578775 0.02645 19
5 4.95108 17.91744 1.07518 19
5 4.77525 23.388734 0.95225 20
6 5.57968 -20.51986667 0 22
6 10.3285 37.778 4.48275 0
6 6.73732 17.5535036 5.52834 1
6 7.01584 27.8316 2.28104 1
6 0.977145 -24.41713333 0 4
6 55.3021 31.6784 9.9084 15
7 8.04433 24.8095 0 10
7 18.712 45.95876667 0 10
7 17.3225 26.36621125 16.374918 10
7 14.0488 37.84178571 0 11
7 20.24 35.13114667 0 11
7 25.5022 9.78095025 2.4459 13'
depth <- read.table(textConnection(x), header = TRUE)
library(ggplot2)
ggplot(depth, aes(x = Startdepth, y = Hour)) +
geom_point(size = 3) +
geom_spoke(aes(angle = Pitchavg * pi/180, radius = ascent), size = 1) +
#scale_color_manual(values = c("#E7B800","#FC4E07", "#00AFBB", '#004E54')) +
annotate("rect", ymin = c(5, 16.5), ymax = c(8.5, 21.5), xmin = c(-1, -1), xmax = c(100, 100),
alpha = 0.1, fill = "blue") +
ylim(0, 24) +
theme_classic()
Created on 2022-05-12 by the reprex package (v2.0.1)

geom_text & geom_jitter are removing all the rows of my df, even when limits are modified

I'm doing a simple word analysis with this data frame:
# A tibble: 22,959 x 4
word n proportion.A proportion.B
<chr> <int> <dbl> <dbl>
1 globe 1100 0.00792 NA
2 people 954 0.00687 NA
3 world 900 0.00648 NA
4 flag 719 0.00518 NA
5 american 646 0.00465 NA
6 program 634 0.00456 NA
7 travel 609 0.00438 NA
8 time 561 0.00404 NA
9 economic 556 0.00400 NA
10 sociology 529 0.00381 NA
# ... with 22,949 more rows
I'm trying to create a plot with geom_abline and geom_jitter to show similar word frequencies in texts A and B, using this code:
ggplot(frequency, aes(x = proportion.A, y = proportion.B, color = abs(proportion.B - proportion.A))) +
geom_abline(color = "yellow", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
xlim(0,0.1) +
ylim(0,0.1) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_color_gradient(limits = c(0, 0.01), low = "blue", high = "red")
When I plot this I only get the plot with the abline, plus this warning message:
Warning messages:
1: Removed 22959 rows containing missing values (geom_point).
2: Removed 22959 rows containing missing values (geom_text).
I know that this warning can occur when there's a limits issue; I've tried scale_*_continuous and the log10 scales, but to no avail. Any idea where I need to look?
Thank you
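A likely explanation, added here as a hedged note since it is not part of the original thread: the warning counts rows dropped because proportion.B is NA for many words (every row shown above) and because xlim()/ylim() remove observations outside the limits entirely rather than just zooming. A sketch of how to check and work around both, assuming the frequency tibble above:
library(dplyr)
library(ggplot2)

# How many rows have both proportions available at all?
frequency %>%
  summarise(complete = sum(!is.na(proportion.A) & !is.na(proportion.B)))

# Keep only complete rows and zoom with coord_cartesian(), which does not
# drop out-of-range points the way xlim()/ylim() do
frequency %>%
  filter(!is.na(proportion.A), !is.na(proportion.B)) %>%
  ggplot(aes(proportion.A, proportion.B,
             color = abs(proportion.B - proportion.A))) +
  geom_abline(color = "yellow", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_color_gradient(limits = c(0, 0.01), low = "blue", high = "red") +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 0.1))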

Why don't the two means match when computed manually and with stat_summary?

I'm doing a project and analysis on the SO data and wanted to check whether the badges (gold, silver and bronze) show significant differences between the top users within those groups, i.e. that the CIs don't overlap. To this end I compute Student's t confidence intervals and error bars, but the mean computed in that summary doesn't match the mean shown by stat_summary (apologies for not having a reproducible example; the data set is huge):
str(comp)
'data.frame': 4500 obs. of 10 variables:
$ userId : num 51 58 61 79 101 122 136 142 233 238 ...
$ reputation : num 35198 39731 41299 38596 38689 ...
$ creationDate : POSIXct, format: "2008-08-01 13:31:13" "2008-08-01 13:56:33" "2008-08-01 14:21:00" "2008-08-01 16:05:09" ...
$ lastAccessDate: POSIXct, format: "2019-11-30 16:40:08" "2019-10-31 15:55:12" "2019-12-01 01:41:04" "2018-04-06 01:48:22" ...
$ location : chr "Yad Binyamin, Israel" "Indianapolis, IN" "Auckland, New Zealand" "New York, NY" ...
$ views : int 3086 1825 1771 1404 1845 2936 2199 874 1655 780 ...
$ upvotes : int 2753 1049 1322 411 550 517 553 106 1734 216 ...
$ downvotes : int 44 55 219 38 64 51 98 3 211 18 ...
$ class : Factor w/ 3 levels "bronze","gold",..: 1 1 1 1 1 1 1 1 1 1 ...
$ badge : Factor w/ 91 levels "Altruist","Analytical",..: 52 52 52 52 52 52 52 52 52 52 ..
summaryRep <- comp %>%
group_by(class) %>%
summarise(n=n(), mean=mean(reputation), sd=sd(reputation), se=sd/sqrt(n), ci=qt(.975,n-1)*se)
> summaryRep
# A tibble: 3 x 6
class n mean sd se ci
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 bronze 1500 37494. 5513. 142. 279.
2 gold 1500 145712. 117260. 3028. 5939.
3 silver 1500 54451. 13118. 339. 664.
colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
comp %>%
left_join(summaryRep, by="class") %>%
ggplot(aes(badge, reputation, colour=class, group=class)) +
geom_boxplot(notch=T) +
stat_summary(fun.y=mean, geom="point", shape=20, size=10) +
geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.3) +
scale_y_log10() +
scale_colour_manual(values = colorSpec) +
geom_jitter(alpha=0.3)
Note that the mean inside the error bars doesn't match the mean per class produced by stat_summary.
PS: the data is very far from normally distributed, so I'd need to use a different CI such as a bootstrapped CI (BCI), but I'm still very curious why the means don't match.
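On the bootstrapped CI mentioned in the PS, a minimal percentile-bootstrap sketch (my addition, assuming the comp data frame above) could look like this:
library(dplyr)

boot_ci <- function(x, n_boot = 2000, probs = c(0.025, 0.975)) {
  # Percentile bootstrap: resample with replacement and take the mean each time
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  qs <- quantile(boot_means, probs)
  data.frame(mean = mean(x), ci_lower = unname(qs[1]), ci_upper = unname(qs[2]))
}

set.seed(1)
comp %>%
  group_by(class) %>%
  group_modify(~ boot_ci(.x$reputation))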
UPDATE: this demonstrates that no matter which column ggplot groups by, badge or class, it should end up with the same means:
identical(comp %>%
group_by(class) %>%
summarise(avgReputation=mean(reputation)) %>%
select(avgReputation) %>%
arrange(avgReputation),
comp %>%
group_by(badge) %>%
summarise(avgReputation=mean(reputation)) %>%
select(avgReputation) %>%
arrange(avgReputation))
[1] TRUE
It is the log transformation that's causing the problem. When you apply scale_y_log10(), stat_summary takes the mean and sd of the log10 values, which is different from the log10 of the mean or of the sd. Ideally you should transform the data before making these calculations.
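A quick numeric illustration of this point (my addition, not part of the original answer):
# The mean of logged values is not the log of the mean
x <- c(10, 1000)
mean(log10(x))   # 2, which back-transforms to 100 (the geometric mean)
log10(mean(x))   # about 2.70, the log of the arithmetic mean 505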
Simulate some data:
comp <- data.frame(
class = sample(c("bronze","silver","gold"),1000,replace=TRUE),
reputation = rnbinom(1000,mu=100,size=1)+1
)
rep2badge = c("silver"="Good Answer","gold"="Great Answer","bronze"="Nice Answer")
comp$badge = rep2badge[comp$class]
We make a function for your plot:
boxplot_func = function(DF,LOG,TITLE){
if(LOG){DF <- DF %>% mutate(reputation=log10(reputation))}
colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
summaryRep <- DF %>%
group_by(class) %>%
summarise(n=n(), mean=mean(reputation),
median=median(reputation),sd=sd(reputation),
se=sd/sqrt(n), ci=qt(.975,n-1)*se)
DF %>%
left_join(summaryRep, by="class") %>%
ggplot(aes(badge, reputation, colour=class, group=class)) +
geom_boxplot(notch=T) +
stat_summary(fun.y=mean, geom="point", shape=20, size=3) +
geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.3) +
scale_colour_manual(values = colorSpec) +
geom_jitter(alpha=0.3) +
ggtitle(TITLE)
}
Then we plot with and without log transformation on reputation
library(ggplot2)
library(dplyr)
library(gridExtra)
p1= boxplot_func(comp,TRUE,"log10scale")
p2= boxplot_func(comp,FALSE,"normal scale")
grid.arrange(p1,p2,ncol=2)

How to use mse function - babynames example

So I am using the 'babynames' package in RStudio and am trying to get the 35 most common unisex names. I want to rank the names by their mean squared error from the 50-50 line, but I am not sure how to do this. Any help would be greatly appreciated! (Below is the 'reference code' we were given, which lists the top 35 unisex names.)
Reference Code:
actual_names <- c("Jessie", "Marion", "Jackie", "Alva", "Ollie",
"Jody", "Cleo", "Kerry", "Frankie", "Guadalupe",
"Carey", "Tommie", "Angel", "Hollis", "Sammie",
"Jamie", "Kris", "Robbie", "Tracy", "Merrill",
"Noel", "Rene", "Johnnie", "Ariel", "Jan",
"Devon", "Cruz", "Michel", "Gale", "Robin",
"Dorian", "Casey", "Dana", "Kim", "Shannon")
I think there are a few ways to answer the question as posed, since there's a tradeoff between "most popular" and "most unisex."
Here's a way to prep the data to collect some stats for each name.
library(babynames)
library(tidyverse)
babynames_share <-
babynames %>%
filter(year >= 1930, year <= 2012) %>%
count(name, sex, wt = n) %>%
spread(sex, n, fill = 0) %>%
mutate(Total = F + M,
F_share = F / Total,
MS_50 = ((F_share-0.5)^2 +
(0.5-F_share)^2) / 2)
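A side note of mine, not from the original answer: the two squared terms in MS_50 are identical, so the whole expression collapses to a single squared deviation of the female share from 0.5, which is why it tops out at 0.25 for fully gendered names.
# The average of two identical squared deviations is just that squared deviation:
# MS_50 = ((F_share - 0.5)^2 + (0.5 - F_share)^2) / 2 == (F_share - 0.5)^2
F_share <- 0.411  # roughly Casey's female share, used only as an example
((F_share - 0.5)^2 + (0.5 - F_share)^2) / 2
#> [1] 0.007921
(F_share - 0.5)^2
#> [1] 0.007921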
It looks like around 100 names have perfect gender parity -- but they're all quite uncommon:
babynames_share %>%
filter(F == M) %>%
arrange(-Total)
# A tibble: 100 x 6
name F M Total F_share MS_50
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tyjae 157 157 314 0.5 0
2 Callaway 128 128 256 0.5 0
3 Avyn 100 100 200 0.5 0
4 Zarin 92 92 184 0.5 0
5 Tkai 72 72 144 0.5 0
6 Rayen 57 57 114 0.5 0
7 Meco 43 43 86 0.5 0
8 Pele 40 40 80 0.5 0
9 Nijay 35 35 70 0.5 0
10 Mako 27 27 54 0.5 0
# … with 90 more rows
Or we might pick some arbitrary threshold for what counts as unisex. In the example above, I've calculated the mean squared error of the female and male shares from a 50-50 split. We can plot that to see very gendered names at the top (MS_50 tops out at 0.25 by this measure) and unisex names toward the bottom. But it isn't obvious to me how far down we should go to count a name as unisex. Is Casey unisex, given that it is 58.9% male and therefore has a squared error of 0.089^2 ≈ 0.0079? Or do we need to go further down, to Jessie, which is 50.8% male?
babynames_share %>%
ggplot(data = .,
aes(Total, MS_50, label = name)) +
geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
geom_text(data = . %>% filter(Total > 10000),
check_overlap = TRUE, size = 3) +
scale_x_log10(breaks = c(10^(1:7)),
labels = scales::comma)
At the "Casey" level of gender parity, here are the top 35:
unisex_names <- babynames_share %>%
filter(MS_50 <= 0.00796) %>%
arrange(-Total) %>%
top_n(35, wt = Total)
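As a quick cross-check against the reference list from the question (my addition; actual_names is defined above):
# Which of the reference names made the computed top 35, and which did not?
intersect(actual_names, unisex_names$name)
setdiff(actual_names, unisex_names$name)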
It's also interesting to see the whole spectrum of names, with the most male names at the bottom, the most female at the top, and unisex names in the middle:
babynames_share %>%
ggplot(data = .,
aes(Total, F_share, label = name)) +
geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
geom_text(data = . %>% filter(Total > 10000),
check_overlap = TRUE, size = 2) +
scale_x_log10(breaks = c(10^(1:7)),
labels = scales::comma)

ggplot2 geom_area overlay area plots in front of each other

I am trying to make an area plot where the different areas are overlaid on one another rather than stacked.
I have a dataframe that looks like this:
r variable value
1 45.0 Cat 1 4.057250e+03
2 52.5 Cat 1 3.537323e+03
3 56.1 Cat 1 3.429861e+03
4 57.3 Cat 1 3.395330e+03
5 57.6 Cat 1 3.389983e+03
6 45.0 Cat 2 4.545455e-03
7 52.5 Cat 2 4.509400e+01
8 56.1 Cat 2 3.525753e+02
9 57.3 Cat 2 4.185094e+02
10 57.6 Cat 2 4.336622e+02
11 45.0 Cat 3 4.074432e+03
12 52.5 Cat 3 3.630504e+03
13 56.1 Cat 3 3.919076e+03
14 57.3 Cat 3 3.957039e+03
15 57.6 Cat 3 3.970083e+03
16 45.0 Cat 4 1.718182e+01
17 52.5 Cat 4 9.318133e+01
18 56.1 Cat 4 4.892154e+02
19 57.3 Cat 4 5.617087e+02
20 57.6 Cat 4 5.801001e+02
I am trying to get area plots for each category. My code for that is:
p <- ggplot(reshaped_data, aes(r, value))
p <- p + labs(x = "X Axis", y = "Y Axis") + ggtitle(title)
p <- p + geom_area(aes(colour = variable, fill= variable), position = 'stack')
p
And the result I am getting looks like this:
How can I make it so that the area graphs aren't stacked on each other, but the smallest are overlaid in front of the bigger ones?
Thanks
Using tidyverse:
library(forcats)
p + geom_area(aes(colour = variable,
fill= fct_reorder(variable, value, .desc = TRUE)), position = 'identity')
Remove .desc = TRUE if it does the opposite of what you want.
As Nathan wrote, you have to use geom_area(position = "identity", ...).
But before this you should reorder the levels of variable:
df$variable <- factor(df$variable, unique(df[order(df$value, decreasing = T),"variable"]) )
or
df$variable <- reorder(df$variable, df$value, function(x) -max(x) )
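Putting the two answers together, here is a sketch of mine that assumes the reshaped_data frame from the question; adding some transparency also keeps the smaller areas visible in front of the larger ones:
library(ggplot2)
library(forcats)

ggplot(reshaped_data, aes(r, value)) +
  geom_area(aes(colour = variable,
                fill = fct_reorder(variable, value, .desc = TRUE)),
            position = "identity", alpha = 0.6) +
  labs(x = "X Axis", y = "Y Axis", fill = "variable")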
