Adding group-specific text/data to faceted plot in R/ggplot2

I am comparing the intra-group correlation between duplicate samples within a large gene expression experiment, where I have multiple separate biological groups - the idea being to see if any of the groups is much less well-correlated than the others, indicating a potential sample mixup or other error.
I am using ggplot to plot the expression values of each duplicate pair against each other. I would like to also be able to add the correlation coefficient and p-value to each panel of the plot, which I obtain through summarize and cor.test. You can use this code to get the general idea: in exp1, the duplicates are correlated, but not in exp2.
library(tidyverse)
df <- data.frame(exp=c(rep('exp1', 100), rep('exp2', 100)), a=rnorm(200, 1000, 200))
df <- mutate(df, b=ifelse(exp=='exp1', a*rnorm(100,1,0.05), rnorm(100, 1000, 200)))
head(df)
tail(df)
df %>% ggplot(aes(x=a, y=b))+
geom_point() +
facet_wrap(~exp)
group_by(df, exp) %>%
summarize(corr=cor.test(a,b)$estimate, pval=cor.test(a,b)$p.value)
This is the plot I generated via ggplot, and I've manually added the R and p-values that I got at the end. But of course, if I have a lot of sample pairs to analyze, it would be nice to be able to add these automatically from within the ggplot call. I'm just not sure how to do it.

If, for whatever reason, you want to build this yourself instead of using the ggpubr functions, you can create your summary data, format labels, and place the labels with geom_text.
I'm formatting the stats so that R has a fixed 3 significant digits and p has 3 digits, falling back on scientific notation. I changed the names of those columns in summarise to R and p to make the labels below. Reshaping to long data and creating a new column with unite gets this:
library(tidyverse)
...
group_by(df, exp) %>%
summarize(R = cor.test(a, b)$estimate, p = cor.test(a, b)$p.value) %>%
mutate(R = formatC(R, format = "fg", digits = 3),
p = formatC(p, format = "g", digits = 3)) %>%
gather(key = measure, value = value, -exp) %>%
unite("stat", measure, value, sep = " = ")
#> # A tibble: 4 x 2
#> exp stat
#> <chr> <chr>
#> 1 exp1 R = 0.965
#> 2 exp2 R = 0.0438
#> 3 exp1 p = 1.14e-58
#> 4 exp2 p = 0.665
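To make the two formatC calls above concrete, here is a small sketch of how the "fg" and "g" formats behave. The input values are made up for illustration and are not computed from df; only formatC itself comes from the answer.

```r
# Illustrative inputs only; these numbers are assumptions, not results from df.
r_values <- c(0.965, 0.0438)
p_values <- c(1.14e-58, 0.665)

# format = "fg": digits counts significant digits, printed in fixed notation
r_labels <- formatC(r_values, format = "fg", digits = 3)

# format = "g": 3 significant digits, switching to scientific notation
# when the exponent is very small or very large
p_labels <- formatC(p_values, format = "g", digits = 3)

print(r_labels)
print(p_labels)
```

Both calls return character vectors, which is why the columns can be passed straight to unite afterwards.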
Then for each of the groups, I want to collapse both labels, separated by a newline \n. This step scales well: you might have more summary stats to display, but it should still work.
summ <- group_by(df, exp) %>%
summarize(R = cor.test(a, b)$estimate, p = cor.test(a, b)$p.value) %>%
mutate(R = formatC(R, format = "fg", digits = 3),
p = formatC(p, format = "g", digits = 3)) %>%
gather(key = measure, value = value, -exp) %>%
unite("stat", measure, value, sep = " = ") %>%
group_by(exp) %>%
summarise(both_stats = paste(stat, collapse = "\n"))
summ
#> # A tibble: 2 x 2
#> exp both_stats
#> <chr> <chr>
#> 1 exp1 "R = 0.965\np = 1.14e-58"
#> 2 exp2 "R = 0.0438\np = 0.665"
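The pipeline above runs cor.test twice per group. As a minimal sketch of an alternative, assuming a hypothetical helper cor_label() that is not part of the original answer, the test can be run once per group and the finished label returned directly, so the gather/unite step is not needed:

```r
library(dplyr)

set.seed(123)
df <- data.frame(exp = rep(c("exp1", "exp2"), each = 100),
                 a = rnorm(200, 1000, 200))
df <- mutate(df, b = ifelse(exp == "exp1",
                            a * rnorm(100, 1, 0.05),
                            rnorm(100, 1000, 200)))

# Hypothetical helper: one cor.test call, both statistics in one label
cor_label <- function(x, y) {
  ct <- cor.test(x, y)
  paste0("R = ", formatC(unname(ct$estimate), format = "fg", digits = 3),
         "\np = ", formatC(ct$p.value, format = "g", digits = 3))
}

summ <- df %>%
  group_by(exp) %>%
  summarise(both_stats = cor_label(a, b))

print(summ)
```

The resulting summ has the same two columns as before (exp and both_stats), so it can be used as the data argument of geom_text in the same way.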
In geom_text, I'm setting the x coordinate to -Inf, which puts the label at the left edge of the panel, and the y coordinate to Inf, which puts it at the top edge. That places the label in the top-left corner, regardless of the values in the data.
The one thing I don't like here is then hacking the hjust and vjust outside their intended ranges of 0 to 1. But nudge_x/nudge_y won't do anything because of the values being set to infinity.
df %>%
ggplot(aes(x = a, y = b)) +
geom_point() +
geom_text(aes(x = -Inf, y = Inf, label = both_stats), data = summ,
hjust = -0.1, vjust = 1.1, lineheight = 1) +
facet_wrap(~ exp)
Created on 2018-11-14 by the reprex package (v0.2.1)

We can use the stat_cor function from the ggpubr package.
set.seed(123)
library(dplyr)
library(ggplot2)
library(ggpubr)
df <- data.frame(exp=c(rep('exp1', 100), rep('exp2', 100)), a=rnorm(200, 1000, 200))
df <- mutate(df, b=ifelse(exp=='exp1', a*rnorm(100,1,0.05), rnorm(100, 1000, 200)))
ggplot(df, aes(x=a, y=b))+
geom_point() +
facet_wrap(~exp) +
stat_cor(method = "pearson")

Similar to camille's answer, but you can do it all in one go:
library(tidyverse)
set.seed(123)
df %>%
group_by(exp) %>%
mutate(p = cor.test(a, b)$p.value,
rho = cor.test(a, b)$estimate) %>%
mutate_at(vars(p, rho), signif, 2) %>%
ggplot(aes(x=a, y=b)) +
geom_point() +
geom_text(data = . %>% distinct(p, rho, exp),
aes(x = -Inf, y = Inf,label = paste("p=",p,"\nrho=",rho)),
hjust = -0.1, vjust = 1.1, lineheight = 1) +
facet_wrap(~exp)

Related

ggplot: add mean value to a stacked barplot (secondary axis)

I am dealing with survey data and now I am trying to add text-labels to a stacked bar plot. What am I doing wrong?
# Sample Data
n <- 100
df <- data.frame(item = sample(paste("Item", 1:4), size=n, replace=TRUE),
value = sample(1:5, size=n, replace=TRUE))
# Create stacked barplot
df %>% group_by(item) %>%
count(value) %>%
mutate(percent = 1 / sum(n) * n,
answer = factor(value, ordered=TRUE)) %>%
ggplot(aes(x = item, y = percent, fill = fct_rev(answer))) +
geom_col() +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = round(percent, 1))) +
labs(fill = "Answer")
I am supposed to add additional mean values for every item. Is there a way to add a secondary axis ranging from 1 to 5 and add the mean values for each item as points to the plot? (even though I know, that statistically this is somewhat questionable as 100% does not really correspond to the maximum value of 5)

You need to specify the position of the labels. At the moment your code places each label at the y position equal to its own value (i.e. 0.2 is placed at 0.2 on the y axis, and 0.3 at 0.3) rather than inside its bar segment. Adding position = position_stack(vjust = 0.5) centres each label within its segment, which should solve your first problem, e.g.
library(tidyverse)
n <- 100
df <- data.frame(item = sample(paste("Item", 1:4), size=n, replace=TRUE),
value = sample(1:5, size=n, replace=TRUE))
# Create stacked barplot
df %>% group_by(item) %>%
count(value) %>%
mutate(percent = 1 / sum(n) * n,
answer = factor(value, ordered=TRUE)) %>%
ggplot(aes(x = item, y = percent, fill = fct_rev(answer))) +
geom_col() +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = round(percent, 1)),
position = position_stack(vjust = 0.5)) +
labs(fill = "Answer")
Created on 2022-11-30 with reprex v2.0.2
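
The answer above covers the label positions but not the secondary axis. As a rough sketch, assuming the goal is only to read item means off a 1 to 5 axis, the means can be rescaled onto the 0 to 1 percent axis and labelled with sec_axis(). The names shares, item_means and y_pos are made up for this example, and the linear mapping (mean - 1) / 4 is an assumption about how the two axes should line up:

```r
library(dplyr)
library(ggplot2)

set.seed(1)
n <- 100
df <- data.frame(item = sample(paste("Item", 1:4), size = n, replace = TRUE),
                 value = sample(1:5, size = n, replace = TRUE))

# Share of each answer within each item
shares <- df %>%
  count(item, value) %>%
  group_by(item) %>%
  mutate(percent = n / sum(n),
         answer = factor(value, levels = 5:1)) %>%
  ungroup()

# Mean answer per item, rescaled from the 1..5 range to the 0..1 range
item_means <- df %>%
  group_by(item) %>%
  summarise(mean_value = mean(value)) %>%
  mutate(y_pos = (mean_value - 1) / 4)

p <- ggplot(shares, aes(x = item, y = percent)) +
  geom_col(aes(fill = answer)) +
  geom_point(data = item_means, aes(y = y_pos), size = 3) +
  scale_y_continuous(labels = scales::percent,
                     # inverse of the rescaling: 0 -> 1, 1 -> 5
                     sec.axis = sec_axis(~ . * 4 + 1, name = "Mean value")) +
  labs(fill = "Answer")

print(p)
```

The secondary axis is only a relabelling of the primary one, so the points have to be placed in percent units; the formula passed to sec_axis() is the inverse of the rescaling used for y_pos.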

Visualising diagonal in asymmetric matrix plot

I have a number of symmetric matrices of the same dimensionality, and I wish to visualise the mean and variance of the values in each cell across these matrices in an elegant way (which I will make more precise below) that makes use of the symmetric character.
Let me start by making some data to illustrate. The following creates 10 9x9 matrices, aggregates the mean and variance, and transforms to long format in preparation for plotting:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
make_matrix <- function(n) {
m <- matrix(NA, nrow = n, ncol = n)
m[lower.tri(m)] <- runif((n^2 - n) / 2)
m <- pmax(m, t(m), na.rm = TRUE)
diag(m) <- runif(n)
rownames(m) <- colnames(m) <- letters[1:n]
m
}
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "mean")
vars <- apply(matrices, 1:2, var) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "var")
df <- full_join(means, vars, by = c("row", "col"))
head(df)
#> # A tibble: 6 x 4
#> row col mean var
#> <chr> <chr> <dbl> <dbl>
#> 1 a a 0.548 0.111
#> 2 a b 0.507 0.0914
#> 3 a c 0.374 0.105
#> 4 a d 0.350 0.0976
#> 5 a e 0.525 0.0752
#> 6 a f 0.452 0.0887
Now, I could simply use geom_tile to make one plot of the means, and one plot of the variances. However, considering that both of these are symmetric, this wastes quite a lot of space, and also fails to communicate the symmetric character to the audience.
To address this problem, I have been playing around with the ggasym package to create an asymmetric matrix plot. The following is a slight modification from the ggasym vignette:
library(ggasym)
library(ggplot2)
ggplot(df, aes(x = col, y = row)) +
geom_asymmat(aes(fill_diag = mean, fill_tl = mean, fill_br = var)) +
scale_fill_diag_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_tl_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_br_gradient(low = "lightblue1", high = "dodgerblue") +
geom_text(data = filter(df, row == col), aes(label = signif(var, 2)))
Created on 2020-06-27 by the reprex package (v0.3.0)
What bothers me about this is the diagonal. In the above, I have mapped the fill of the diagonal to the means, and overlaid the variance by text, which works, but doesn't seem great. Specifically, I would like to map all the information here to fill, so as to get rid of the text. I see a couple of options for how to do this, but I am not sure how to implement any of them:
1. Split the fill of the diagonal cells, so that (in the example above) the lower right of each cell on the diagonal is an appropriate shade of blue, while the upper left is some shade of red.
2. Plot the upper and lower matrices separately (each with the diagonal), and then somehow "overlay" these plots so that they end up next to each other in an appropriate way. In other words, this would plot the diagonal twice.
I am open to other suggestions for how to accomplish this in a clean way. Let me emphasise that I do not require a solution building on ggasym, this was simply the closest I have been able to get so far. However, I would like some kind of ggplot-based solution.

So here is my take on the 'split-the-fill' strategy. You can plot most of the things you would want in ggplot if you don't mind parameterising your stuff as polygons. We let the ggnewscale package handle the double fill mapping for us.
First off, we no longer autoname the matrices, as we will not use the dimnames.
suppressPackageStartupMessages({
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggnewscale)
})
make_matrix <- function(n) {
m <- matrix(NA, nrow = n, ncol = n)
m[lower.tri(m)] <- runif((n^2 - n) / 2)
m <- pmax(m, t(m), na.rm = TRUE)
diag(m) <- runif(n)
# rownames(m) <- colnames(m) <- letters[1:n]
m
}
Below is a function that takes a matrix, parameterises it as a polygon and cuts off one half.
halfmat <- function(mat, side) {
side <- match.arg(side, c("upper", "lower", "both"))
# Convert to long format
dat <- data.frame(
x = as.vector(row(mat)),
y = as.vector(col(mat)),
id = seq_along(mat),
value = as.vector(mat)
)
# Parameterise as polygon
poly <- with(dat, data.frame(
x = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
y = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5),
id = rep(id, 4),
value = rep(value, 4)
))
# Slice off one of the triangles
if (side == "upper") {
poly <- filter(poly, y >= x)
} else if (side == "lower") {
poly <- filter(poly, x >= y)
}
poly
}
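To see what the slicing does, here is a small check on a 2 x 2 matrix. halfmat is restated in condensed form, with base subsetting instead of dplyr::filter so the snippet runs on its own; the toy matrix m is an assumption for illustration.

```r
# Condensed restatement of halfmat using base R subsetting
halfmat <- function(mat, side = c("upper", "lower", "both")) {
  side <- match.arg(side)
  dat <- data.frame(x = as.vector(row(mat)), y = as.vector(col(mat)),
                    id = seq_along(mat), value = as.vector(mat))
  # Four corners per cell, each cell centred on its (x, y) index
  poly <- with(dat, data.frame(
    x = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
    y = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5),
    id = rep(id, 4), value = rep(value, 4)
  ))
  if (side == "upper") poly <- poly[poly$y >= poly$x, ]
  if (side == "lower") poly <- poly[poly$x >= poly$y, ]
  poly
}

m <- matrix(1:4, nrow = 2)
upper <- halfmat(m, "upper")

# Number of corners that survive the cut, per cell id
corners_per_cell <- table(upper$id)
print(corners_per_cell)
```

The diagonal cells (ids 1 and 4) keep three corners and so become triangles, the cell with y > x (id 3) keeps all four, and the cell on the other side (id 2) keeps a single corner, which has no area when drawn as a polygon.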
Then we generate the data, compute the means and variances and reparameterise them.
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>% halfmat("upper")
vars <- apply(matrices, 1:2, var) %>% halfmat("lower")
Then we put in the means and variances as two separate polygon layers, since we need to separate the fill mappings with new_scale_fill(). There is a bit of extra fiddling with the scales, as these are now continuous instead of discrete, but it is not that bad.
ggplot(means, aes(x, y, fill = value, group = id)) +
geom_polygon() +
scale_fill_distiller(palette = "Reds", name = "Mean") +
# Be sure to call new_scale_fill() only after you've set up a fill scale
# for the upper part
new_scale_fill() +
geom_polygon(data = vars, aes(fill = value)) +
scale_fill_distiller(palette = "Blues", name = "Variance") +
scale_x_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "col") +
scale_y_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "row")
Created on 2020-06-27 by the reprex package (v0.3.0)

How is the binning done in stat_summary_bin in ggplot2?

I'm trying to add some custom features to a bin-scatter plot using ggplot2. The original way that I was doing the bin-scatter was with stat_summary_bin(fun.y="mean"). This seems to produce a reasonable binning, but when I try to reproduce it by binning manually, I keep getting slightly different results -- especially at the right tail.
Can anyone help me figure out how the binning in stat_summary_bin is done? I need to figure out if this is a reliable form of bin-scattering that I can use...
library(tidyverse)
library(mltools)
#>
#> Attaching package: 'mltools'
#> The following object is masked from 'package:tidyr':
#>
#> replace_na
x = runif(1000, 0, 10)
y = x + rnorm(1000, 0.5, 2)
plot(x,y)
df <- data.frame(x = x, y = y)
p <- df %>%
ggplot(aes(x = x, y = y)) +
stat_summary_bin(aes(color ="stat summary"),fun.y = "mean", size = 2.5, geom="point", bins=20)
p
## Attempt 1 at binning
df$x_bin <- mltools::bin_data(df$x, bins=20, binType = "explicit")
df_binned <- df %>%
group_by(x_bin) %>%
mutate(
x_binned = mean(x),
y_binned = mean(y)
) %>%
ungroup()
p <- p + geom_point(aes(x = df_binned$x_binned, y = df_binned$y_binned, color = "manual bin"), size = 2.5)
p
## Attempt 2 at binning
xbreaks = quantile(df$x, probs = seq(0,1,0.05))
df_binned$x_bin_2 <- cut(df$x, xbreaks, include.lowest = T)
df_binned <- df_binned %>%
group_by(x_bin_2) %>%
mutate(
x_binned2 = mean(x),
y_binned2 = mean(y)
) %>%
ungroup()
p <- p + geom_point(aes(x = df_binned$x_binned2, y = df_binned$y_binned2, color = "2nd manual bin"), size = 2.5)
p
Created on 2018-09-09 by the reprex package (v0.2.0).
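
One way to investigate is to ask ggplot2 what it actually computed. The sketch below uses layer_data() to pull out the summarised points of a stat_summary_bin layer and sets them next to a manual equal-width binning done with cut(). It assumes ggplot2 3.3.0 or later (where the argument is fun rather than fun.y), and the manual binning is only a guess at a comparable scheme, not a statement of ggplot2's internals:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = runif(1000, 0, 10))
df$y <- df$x + rnorm(1000, 0.5, 2)

p <- ggplot(df, aes(x = x, y = y)) +
  stat_summary_bin(fun = mean, geom = "point", bins = 20)

# Data frame computed for the first layer: one row per non-empty bin
binned <- layer_data(p, 1)
print(head(binned[, c("x", "y")]))

# Manual comparison: 20 equal-width bins over the observed range of x,
# summarised by the mean of x and the mean of y in each bin
df$bin <- cut(df$x, breaks = seq(min(df$x), max(df$x), length.out = 21),
              include.lowest = TRUE)
manual <- aggregate(cbind(x, y) ~ bin, data = df, FUN = mean)
print(head(manual))
```

If the x column from layer_data() is evenly spaced while the manual mean of x is not, the difference lies in where each bin is drawn (a fixed position per bin versus the mean of x) rather than in the y means. Differences in the bin edges themselves would instead show up as different y values.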

overlay/superimpose grouped bar plots in ggplot2

I'd like to make a bar plot featuring an overlay of data from two time points, 'before' and 'after'.
At each time point, participants were asked two questions ('pain' and 'fear'), which they would answer by stating a score of 1, 2, or 3.
My existing code plots the counts for the data from the 'before' time point nicely, but I can't seem to add the counts for the 'after' data.
This is a sketch of what I'd like the plot to look like with the 'after' data added, with the black bars representing the 'after' data:
I'd like to make the plot in ggplot2() and I've tried to adapt code from How to superimpose bar plots in R? but I can't get it to work for grouped data.
Many thanks!
#DATA PREP
library(dplyr)
library(ggplot2)
library(tidyr)
df <- data.frame(before_fear=c(1,1,1,2,3),before_pain=c(2,2,1,3,1),after_fear=c(1,3,3,2,3),after_pain=c(1,1,2,3,1))
df <- df %>% gather("question", "answer_option")
# Get the counts for each answer of each question
df2 <- df %>%
group_by(question,answer_option) %>%
summarise (n = n())
df2 <- as.data.frame(df2)
df3 <- df2 %>% mutate(time = factor(ifelse(grepl("before", question), "before", "after"),
c("before", "after")))
# change classes and split data into two data frames
df3$n <- as.numeric(df3$n)
df3$answer_option <- as.factor(df3$answer_option)
df3after <- df3[ which(df3$time=='after'), ]
df3before <- df3[ which(df3$time=='before'), ]
# CODE FOR 'BEFORE' DATA ONLY PLOT - WORKS
ggplot(df3before, aes(fill=answer_option, y=n, x=question)) + geom_bar(position="dodge", stat="identity")
# CODE FOR 'BEFORE' AND 'AFTER' DATA PLOT - DOESN'T WORK
ggplot(mapping = aes(x, y,fill)) +
geom_bar(data = data.frame(x = df3before$question, y = df3before$n, fill= df3before$index_value), width = 0.8, stat = 'identity') +
geom_bar(data = data.frame(x = df3after$question, y = df3after$n, fill=df3after$index_value), width = 0.4, stat = 'identity', fill = 'black') +
theme_classic() + scale_y_continuous(expand = c(0, 0))

I think the key is to set the width of the "after" bars, but to dodge them as if their width were 0.9 (i.e. the same (default) width as the "before" bars). In addition, because we don't map the fill of the "after" bars, we need to use the group aesthetic instead to achieve the dodging.
I prefer to have only one data set and just subset it in each call to geom_col.
ggplot(mapping = aes(x = question, y = n, fill = factor(ans))) +
geom_col(data = d[d$t == "before", ], position = "dodge") +
geom_col(data = d[d$t == "after", ], aes(group = ans),
fill = "black", width = 0.5, position = position_dodge(width = 0.9))
Data:
set.seed(2)
d <- data.frame(t = rep(c("before", "after"), each = 6),
question = rep(c("pain", "fear"), each = 3),
ans = 1:3, n = sample(12))
Alternative data preparation using data.table, starting with your original 'df':
library(data.table)
d <- melt(setDT(df), measure.vars = names(df), value.name = "ans")
d[ , c("t", "question") := tstrsplit(variable, "_")]
Either pre-calculate the counts and proceed as above with geom_col
# d2 <- d[ , .N, by = .(t, question, ans)]
Or let geom_bar do the counting:
ggplot(mapping = aes(x = question, fill = factor(ans))) +
geom_bar(data = d[d$t == "before", ], position = "dodge") +
geom_bar(data = d[d$t == "after", ], aes(group = ans),
fill = "black", width = 0.5, position = position_dodge(width = 0.9))
Data:
df <- data.frame(before_fear = c(1,1,1,2,3), before_pain = c(2,2,1,3,1),
after_fear = c(1,3,3,2,3),after_pain = c(1,1,2,3,1))

My solution is very similar to @Henrik's, but I wanted to point out a few things.
First, you're building your data frames inside your geom_cols, which is probably messier than you need it to be. If you've already created df3after, etc., you might as well use it inside your ggplot.
Second, I had a hard time following your tidying. I think there are a couple tidyr functions that might make this task easier on you, so I went a different route, such as using separate to create the columns of time and measure, rather than essentially searching for them manually, making it more scalable. This also lets you put "pain" and "fear" on your x-axis, rather than still having "before_pain" and "before_fear", which are no longer accurate representations once you have "after" values on the plot as well. But feel free to disregard this and stick with your own method.
library(tidyverse)
df <- data.frame(before_fear = c(1,1,1,2,3),
before_pain = c(2,2,1,3,1),
after_fear = c(1,3,3,2,3),
after_pain = c(1,1,2,3,1))
df_long <- df %>%
gather(key = question, value = answer_option) %>%
mutate(answer_option = as.factor(answer_option)) %>%
count(question, answer_option) %>%
separate(question, into = c("time", "measure"), sep = "_", remove = F)
df_long
#> # A tibble: 12 x 5
#> question time measure answer_option n
#> <chr> <chr> <chr> <fct> <int>
#> 1 after_fear after fear 1 1
#> 2 after_fear after fear 2 1
#> 3 after_fear after fear 3 3
#> 4 after_pain after pain 1 3
#> 5 after_pain after pain 2 1
#> 6 after_pain after pain 3 1
#> 7 before_fear before fear 1 3
#> 8 before_fear before fear 2 1
#> 9 before_fear before fear 3 1
#> 10 before_pain before pain 1 2
#> 11 before_pain before pain 2 2
#> 12 before_pain before pain 3 1
I split this into before & after datasets, as you did, then plotted them with 2 geom_cols. I still put df_long into ggplot, treating it almost as a dummy to get uniform x and y aesthetics. Like @Henrik said, you can use different widths in geom_col and in its position_dodge to dodge the bars at a width of 90% but give the bars themselves a width of only 40%.
df_before <- df_long %>% filter(time == "before")
df_after <- df_long %>% filter(time == "after")
ggplot(df_long, aes(x = measure, y = n)) +
geom_col(aes(fill = answer_option),
data = df_before, width = 0.9,
position = position_dodge(width = 0.9)) +
geom_col(aes(group = answer_option),
data = df_after, fill = "black", width = 0.4,
position = position_dodge(width = 0.9))
What you could do instead of making the two separate data frames is filter inside each geom_col. This is generally my preference unless the filtering is more complex. This code will get the same plot as above.
ggplot(df_long, aes(x = measure, y = n)) +
geom_col(aes(fill = answer_option),
data = . %>% filter(time == "before"), width = 0.9,
position = position_dodge(width = 0.9)) +
geom_col(aes(group = answer_option),
data = . %>% filter(time == "after"), fill = "black", width = 0.4,
position = position_dodge(width = 0.9))
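The data = . %>% filter(...) form works because a pipeline that starts with `.` builds a function rather than a data frame, and a ggplot2 layer that is given a function as data applies it to the plot's default data. A tiny sketch of the first half, with a made-up data frame toy and a hypothetical name keep_before:

```r
library(dplyr)

# A pipeline that starts with `.` is a function waiting for its input
keep_before <- . %>% filter(time == "before")

toy <- data.frame(time = c("before", "after", "before"), n = 1:3)

print(is.function(keep_before))
print(keep_before(toy))
```

Inside geom_col, the same function receives df_long, so each layer sees only its own time point.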

Apply MASS::fitdistr to multiple data by a factor

My question is at the end in bold.
I know how to fit the beta distribution to some data. For instance:
library(Lahman)
library(dplyr)
# clean up the data and calculate batting averages by playerID
batting_by_decade <- Batting %>%
filter(AB > 0) %>%
group_by(playerID, Decade = round(yearID - 5, -1)) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
ungroup() %>%
filter(AB > 500) %>%
mutate(average = H / AB)
# fit the beta distribution
library(MASS)
m <- MASS::fitdistr(batting_by_decade$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
# plot the histogram of data and the beta distribution
library(ggplot2)
ggplot(batting_by_decade) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
Which yields:
Now I want to calculate different beta parameters alpha0 and beta0 for each value of batting_by_decade$Decade, so I end up with 15 parameter sets and 15 beta distributions that I can overlay on this ggplot of batting averages faceted by Decade:
batting_by_decade %>%
ggplot() +
geom_histogram(aes(x=average)) +
facet_wrap(~ Decade)
I can hard code this by filtering for each decade and passing that decade's worth of data into the fitdistr function, repeating this for all decades, but is there a way of calculating all beta parameters per decade quickly and reproducibly, perhaps with one of the apply functions?

You can leverage summarise together with two custom functions for this:
getAlphaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[1]}
getBetaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[2]}
batting_by_decade %>%
group_by(Decade) %>%
summarise(alpha = getAlphaEstimate(average),
beta = getBetaEstimate(average)) -> decadeParameters
However, you will not be able to plot it with stat_summary according to Hadley's post here: https://stackoverflow.com/a/1379074/3124909

Here's an example of how you'd go from generating dummy data all the way through to plotting.
temp.df <- data_frame(yr = 10*187:190,
al = rnorm(length(yr), mean = 4, sd = 2),
be = rnorm(length(yr), mean = 10, sd = 2)) %>%
group_by(yr, al, be) %>%
do(data_frame(dats = rbeta(100, .$al, .$be)))
First I made up some shape parameters for four years, grouped by each combination, and then used do to create a dataframe with 100 samples from each distribution. Aside from knowing the "true" parameters, this dataframe should look a lot like your original data: a vector of samples with an associated year.
temp.ests <- temp.df %>%
group_by(yr, al, be) %>%
summarise(ests = list(MASS::fitdistr(dats, dbeta, start = list(shape1 = 1, shape2 = 1))$estimate)) %>%
unnest %>%
mutate(param = rep(letters[1:2], length(ests)/2)) %>%
spread(key = param, value = ests)
This is the bulk of your question here, very much solved the way you solved it. If you step through this snippet line by line, you'll see you have a dataframe with a column of type list, containing <dbl [2]> in each row. When you unnest() it splits those two numbers into separate rows, so then we identify them by adding a column that goes "a, b, a, b, ..." and spread them back apart to get two columns with one row for each year. Here you can also see how closely fitdistr matched the true population we sampled from, looking at a vs al and b vs be.
temp.curves <- temp.ests %>%
group_by(yr, al, be, a, b) %>%
do(data_frame(prop = 1:99/100,
trueden = dbeta(prop, .$al, .$be),
estden = dbeta(prop, .$a, .$b)))
Now we turn that process inside out to generate the data to plot the curves. For each row, we use do to make a dataframe with a sequence of values prop, and calculate the beta density at each value for both the true population parameters and our estimated sample parameters.
ggplot() +
geom_histogram(data = temp.df, aes(dats, y = ..density..), colour = "black", fill = "white") +
geom_line(data = temp.curves, aes(prop, trueden, color = "population"), size = 1) +
geom_line(data = temp.curves, aes(prop, estden, color = "sample"), size = 1) +
geom_text(data = temp.ests,
aes(1, 2, label = paste("hat(alpha)==", round(a, 2))),
parse = T, hjust = 1) +
geom_text(data = temp.ests,
aes(1, 1, label = paste("hat(beta)==", round(b, 2))),
parse = T, hjust = 1) +
facet_wrap(~yr)
Finally we put it together, plotting a histogram of our sample data. Then a line from our curve data for the true density. Then a line from our curve data for our estimated density. Then some labels from our parameter estimate data to show the sample parameters, and facets by year.

This is an apply solution, but I prefer @CMichael's dplyr solution.
calc_beta <- function(decade){
dummy <- batting_by_decade %>%
dplyr::filter(Decade == decade) %>%
dplyr::select(average)
m <- fitdistr(dummy$average, dbeta, start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
return(c(alpha0,beta0))
}
decade <- seq(1870, 2010, by =10)
params <- sapply(decade, calc_beta)
colnames(params) <- decade
Re: @CMichael's comment about avoiding a double fitdistr, we could rewrite the function to getAlphaBeta.
getAlphaBeta = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate}
batting_by_decade %>%
group_by(Decade) %>%
summarise(params = list(getAlphaBeta(average))) -> decadeParameters
decadeParameters$params[1] # it works!
Now we just need to unlist the second column in a nice way....
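One possible way to finish this, sketched under some assumptions: the data below is simulated rather than taken from Lahman, and fit_beta_mom is a hypothetical method-of-moments stand-in for the fitdistr call, used only so the snippet runs quickly on its own. The unpacking step itself (row-binding the list column into a matrix) should apply to any list of equally named numeric vectors, such as the estimate returned by fitdistr.

```r
library(dplyr)

set.seed(42)
# Simulated stand-in for batting_by_decade (assumption, not the real data)
sim <- data.frame(Decade = rep(c(1990, 2000), each = 200),
                  average = c(rbeta(200, 80, 220), rbeta(200, 70, 200)))

# Hypothetical stand-in for getAlphaBeta: method-of-moments beta estimates,
# returned as a named vector like fitdistr(...)$estimate
fit_beta_mom <- function(x) {
  m <- mean(x)
  v <- var(x)
  k <- m * (1 - m) / v - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

decadeParameters <- sim %>%
  group_by(Decade) %>%
  summarise(params = list(fit_beta_mom(average)))

# Unpack the list column: one row per decade, one column per parameter
param_matrix <- do.call(rbind, decadeParameters$params)
unpacked <- data.frame(Decade = decadeParameters$Decade, param_matrix)

print(unpacked)
```

The result has one row per decade with columns shape1 and shape2, which can then be joined back to the data or used to compute density curves per facet.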
