Plot beta distribution in R - r

Using the dataset Lahman::Batting I've estimated parameters for the beta distribution. Now I want to plot this empirically derived beta distribution onto the histogram that I estimated it from.
library(dplyr)
library(tidyr)
library(Lahman)
library(ggplot2)
career <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)
I can plot the distribution of batting averages as:
career %>%
filter(AB > 500) %>%
ggplot(aes(x = average)) +
geom_histogram() +
geom_freqpoly(color = "red")
And obtain:
I know I can use + geom_freqpoly to obtain:
But I want the smooth beta distribution. I can estimate beta parameters by:
career_filtered <- career %>%
filter(AB >= 500)
m <- MASS::fitdistr(career_filtered$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1] # parameter 1
beta0 <- m$estimate[2] # parameter 2
Now that I have parameters alpha0 and beta0, how do I plot the beta distribution so that I obtain something like this:
This question is based on a post I'm reading here.

All code, including the code for the plots, can be found here. The following code is used to get the requested plot:
ggplot(career_filtered) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
Hope this helps.
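As a sanity check on the fitted alpha0 and beta0, the method of moments gives a quick closed-form estimate that should land close to the maximum-likelihood values. The sketch below is not part of the original answer: it uses simulated data (the shapes 80 and 220 are made up) as a stand-in for career_filtered$average, and the names alpha_mm and beta_mm are hypothetical.

```r
# Simulated stand-in for career_filtered$average (assumed shapes, not from the post)
set.seed(42)
x <- rbeta(5000, shape1 = 80, shape2 = 220)

# Method of moments: match the sample mean and variance to a beta distribution
m <- mean(x)
v <- var(x)
common <- m * (1 - m) / v - 1
alpha_mm <- m * common
beta_mm <- (1 - m) * common

# Evaluate the implied density on a grid, e.g. to overlay on a density-scaled histogram
grid <- seq(min(x), max(x), length.out = 200)
dens <- dbeta(grid, alpha_mm, beta_mm)
```

If the two sets of estimates differ a lot, that may be a hint that the optimizer behind MASS::fitdistr did not converge well from its start values.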

Related

Caterpillar plot of posterior brms samples: Order factors in a ggdist plot (stat_slab)

I ran a bayesian linear mixed model with brms and can plot the estimates nicely but I can't figure out how to order the single-subject estimates based on the mean of the posterior samples (so as to get a caterpillar plot). This is what I've done.
Toy data:
library(brms)
library(tidybayes)
library(tidyverse)
n = 20
n_condition = 6
ABC =
tibble(
condition = rep(c("A","B","C","D","E","F"), n),
response = rnorm(n * 6, c(0,1,2,1,-1,-2), 0.5),
treatment = rnorm(n * 6, c(0,1,2,1,-1,-2), 0.5),
subject = c(rep("X",(n_condition*n)/3),rep("Y",(n_condition*n)/3),rep("Z",(n_condition*n)/3))
)
Add a shift for some subjects
ABC$response[ABC$subject == "X"] = 20 + ABC$response[ABC$subject == "X"]
ABC$response[ABC$subject == "Y"] = -20 + ABC$response[ABC$subject == "Y"]
Run the model
m = brm(
response ~ treatment + (1|condition) + (1|subject),
data = ABC,
cores = 4, chains = 1,
iter = 500, warmup = 50
)
Plot
m %>%
spread_draws(b_treatment, r_subject[subject,]) %>%
mutate(subject_estimate = b_treatment + r_subject) %>%
mutate(subject = reorder(subject, sort(subject_estimate))) %>%
ggplot(aes(y = subject, x = subject_estimate)) +
stat_slab()
Gives me this:
The line mutate(subject = reorder(subject, sort(subject_estimate))) doesn't do anything, which might be fine as I probably need to reorder based on the mean of the posteriors, but when I try mutate(subject_order = reorder(subject, sort(mean(subject_estimate)))) I get the error message:
Error: Problem with mutate() input subject_order.
x arguments must have same length
ℹ Input subject_order is reorder(subject, sort(mean(subject_estimate))).
ℹ The error occurred in group 1: subject = "X".
Any pointers welcome
Two points for consideration:
Ungroup the result from spread_draws, otherwise you won't be able to reorder the levels of subject;
Use fct_reorder from the forcats package in tidyverse. It's designed for this exact purpose.
m %>%
spread_draws(b_treatment, r_subject[subject,]) %>%
ungroup() %>%
mutate(subject_estimate = b_treatment + r_subject) %>%
mutate(subject = fct_reorder(subject, subject_estimate, mean)) %>%
ggplot(aes(y = subject, x = subject_estimate)) +
stat_slab()
Result (data generated with set.seed(123)):
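The same idea can be tried without fitting a model. This is a minimal base R sketch, assuming a made-up data frame named draws that mimics the ungrouped output of spread_draws(); reorder() with FUN = mean plays the role of fct_reorder() here.

```r
# Hypothetical stand-in for the ungrouped posterior draws (values are invented)
draws <- data.frame(
  subject = rep(c("X", "Y", "Z"), each = 4),
  subject_estimate = c(20, 21, 19, 20, -20, -21, -19, -20, 0, 1, -1, 0)
)

# Order the subject levels by the mean of their draws.
# One summary value per level is computed from all rows of that level,
# which is why the data must not be grouped by subject at this point.
draws$subject <- reorder(draws$subject, draws$subject_estimate, FUN = mean)

print(levels(draws$subject))
```

With these invented values the levels come out ordered from the lowest mean to the highest, which is the order a caterpillar plot needs on the y axis.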

Trying to use tidy for a power analysis and using clmm2

I'm trying to do a power analysis on a clmm2 analysis that I'm doing.
This is the code for the particular statistical model:
test <- clmm2(risk_sensitivity ~ treat + sex + dispersal +
sex*dispersal + treat*dispersal + treat*sex,random = id, data = datasocial, Hess=TRUE)
Now, I have the following function:
sim_experiment_power <- function(rep) {
s <- sim_experiment(n_sample = 1000,
prop_disp = 0.10,
prop_fem = 0.35,
disp_probability = 0.75,
nondisp_probability = 0.90,
fem_probability = 0.75,
mal_probability = 0.90)
broom.mixed::tidy(s) %>%
mutate(rep = rep)
}
my_power <- map_df(1:10, sim_experiment_power)
The details of the function sim_experiment are not relevant because it is working as expected. The important thing to know is that it returns a fitted clmm2 model. My objective with the function above is to do a power analysis. However, I get the following error:
Error: No tidy method for objects of class clmm2
I'm a bit new to R, but I guess it means that tidy doesn't work with clmm2. Does anyone know a work-around for this issue?
EDIT: This is what follows the code that I posted above, which is ultimately what I'm trying to get.
You can then plot the distribution of estimates across your simulations.
ggplot(my_power, aes(estimate, color = term)) +
geom_density() +
facet_wrap(~term, scales = "free")
You can also just calculate power as the proportion of p-values less than your alpha.
my_power %>%
group_by(term) %>%
summarise(power = mean(p.value < 0.05))
For what you need, you can write a function to return the coefficients with the same column name:
library(ordinal)
library(dplyr)
library(purrr)
tidy_output_clmm = function(fit){
results = as.data.frame(coefficients(summary(fit)))
colnames(results) = c("estimate","std.error","statistic","p.value")
results %>% tibble::rownames_to_column("term")
}
Then we apply it using an example where I sample the wine dataset in ordinal:
sim_experiment_power <- function(rep) {
idx = sample(nrow(wine),replace=TRUE)
s <- clmm2(rating ~ temp, random=judge, data=wine[idx,], nAGQ=10,Hess=TRUE)
tidy_output_clmm(s) %>% mutate(rep=rep)
}
my_power <- map_df(1:10, sim_experiment_power)
Plotting works:
ggplot(my_power, aes(estimate, color = term)) +
geom_density() +
facet_wrap(~term, scales = "free")
And so does power:
my_power %>% group_by(term) %>% summarise(power = mean(p.value < 0.05))
# A tibble: 5 x 2
  term     power
  <chr>    <dbl>
1 1|2        0.9
2 2|3        0.1
3 3|4        1
4 4|5        1
5 tempwarm   1
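To make the power calculation itself concrete, here is a small base R sketch on an invented table of simulation results. The data frame sims and its p-values are hypothetical; it only mirrors the term / p.value / rep layout that tidy_output_clmm() produces.

```r
# Hypothetical simulation results: two terms, four replicates each (invented p-values)
sims <- data.frame(
  term    = rep(c("treat", "sex"), each = 4),
  rep     = rep(1:4, times = 2),
  p.value = c(0.01, 0.20, 0.03, 0.04, 0.50, 0.60, 0.01, 0.70)
)

# Power per term = share of replicates whose p-value falls below alpha
alpha <- 0.05
power <- tapply(sims$p.value < alpha, sims$term, mean)

print(power)
```

This is the same computation as the group_by() / summarise() pipeline, just without dplyr, so it can be used to check the pipeline on a few rows.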

Having several fits in one plot (in R)

I was wondering how I can modify the following code to have a plot something like
data(airquality)
library(quantreg)
library(ggplot2)
library(data.table)
library(devtools)
# source Quantile LOESS
source("https://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
airquality2 <- na.omit(airquality[ , c(1, 4)])
#'' quantreg::rq
rq_fit <- rq(Ozone ~ Temp, 0.95, airquality2)
rq_fit_df <- data.table(t(coef(rq_fit)))
names(rq_fit_df) <- c("intercept", "slope")
#'' quantreg::lprq
lprq_fit <- lapply(1:3, function(bw){
fit <- lprq(airquality2$Temp, airquality2$Ozone, h = bw, tau = 0.95)
return(data.table(x = fit$xx, y = fit$fv, bw = paste0("bw=", bw), fit = "quantreg::lprq"))
})
#'' Quantile LOESS
ql_fit <- Quantile.loess(airquality2$Ozone, jitter(airquality2$Temp), window.size = 10,
the.quant = .95, window.alignment = c("center"))
ql_fit_df <- data.table(x = ql_fit$x, y = ql_fit$y.loess, bw = "bw=1", fit = "Quantile LOESS")
I want to have all these fits in a plot.
geom_quantile can calculate quantiles using the rq method internally, so we don't need to create the rq_fit_df separately. However, the lprq and Quantile LOESS methods aren't available within geom_quantile, so I've used the data frames you provided and plotted them using geom_line.
In addition, to include the rq line in the color and linetype mappings and in the legend we add aes(colour="rq", linetype="rq") as a sort of "artificial" mapping inside geom_quantile.
library(dplyr) # For bind_rows()
ggplot(airquality2, aes(Temp, Ozone)) +
geom_point() +
geom_quantile(quantiles=0.95, formula=y ~ x, aes(colour="rq", linetype="rq")) +
geom_line(data=bind_rows(lprq_fit, ql_fit_df),
aes(x, y, colour=paste0(gsub("q.*:","",fit),": ", bw),
linetype=paste0(gsub("q.*:","",fit),": ", bw))) +
theme_bw() +
scale_linetype_manual(values=c(2,4,5,1,1)) +
labs(colour="Method", linetype="Method",
title="Different methods of estimating the 95th percentile by quantile regression")
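The gsub() call inside the colour and linetype mappings is a little cryptic, so the sketch below shows what legend labels it builds. The vectors fit and bw here are just a hand-typed sample of the values in the question's data tables.

```r
# A few of the fit / bw combinations found in lprq_fit and ql_fit_df
fit <- c("quantreg::lprq", "quantreg::lprq", "Quantile LOESS")
bw  <- c("bw=1", "bw=2", "bw=1")

# "q.*:" greedily matches from the first lowercase q up to the last colon,
# so the "quantreg::" prefix is dropped; strings without a colon are left alone
label <- paste0(gsub("q.*:", "", fit), ": ", bw)

print(label)
```

Each distinct label becomes one legend entry, which, together with the "rq" entry, is why scale_linetype_manual() is given five values.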

Visualizing multiple curves in ggplot from bootstrapping, curve fitting

I have time series data that is well modeled using a sinusoidal curve. I'd like to visualize the uncertainty in the fitted model using bootstrapping.
I adapted the approach from here. I am also interested in this approach, using nlsBoot. I can get the first approach to run, but the curves in the resulting plot are jagged rather than smooth.
library(dplyr)
library(broom)
library(ggplot2)
xdata <- c(-35.98, -34.74, -33.46, -32.04, -30.86, -29.64, -28.50, -27.29, -26.00,
-24.77, -23.57, -22.21, -21.19, -20.16, -18.77, -17.57, -16.47, -15.35,
-14.40, -13.09, -11.90, -10.47, -9.95,-8.90,-7.77,-6.80, -5.99,
-5.17, -4.21, -3.06, -2.29, -1.04)
ydata <- c(-4.425, -4.134, -5.145, -5.411, -6.711, -7.725, -8.087, -9.059, -10.657,
-11.734, NA, -12.803, -12.906, -12.460, -12.128, -11.667, -10.947, -10.294,
-9.185, -8.620, -8.025, -7.493, -6.713, -6.503, -6.316, -5.662, -5.734, -4.984,
-4.723, -4.753, -4.503, -4.200)
data <- data.frame(xdata,ydata)
bootnls_aug <- data %>% bootstrap(100) %>%
do(augment(nls(ydata ~ A*cos(2*pi*((xdata-x_0)/z))+M, ., start=list(A=4,M=-7,x_0=-10,z=30)), .))
ggplot(bootnls_aug, aes(xdata, ydata)) +
geom_line(aes(y=.fitted, group=replicate), alpha=.1, color="blue") +
geom_point(size=3) +
theme_bw()
Can anyone offer help? Why are the displayed curves not smooth? Is there a better way to implement?
broom::augment is merely returning fitted values for each of the available data points. Therefore, the resolution of x is limited to the resolution of the data. You can predict values from the model with a much higher resolution:
x_range <- seq(min(xdata), max(xdata), length.out = 1000)
fitted_boot <- data %>%
bootstrap(100) %>%
do({
m <- nls(ydata ~ A*cos(2*pi*((xdata-x_0)/z))+M, ., start=list(A=4,M=-7,x_0=-10,z=30))
f <- predict(m, newdata = list(xdata = x_range))
data.frame(xdata = x_range, .fitted = f)
} )
ggplot(data, aes(xdata, ydata)) +
geom_line(aes(y=.fitted, group=replicate), fitted_boot, alpha=.1, color="blue") +
geom_point(size=3) +
theme_bw()
Some more work is needed to add the mean and 95% confidence interval:
quants <- fitted_boot %>%
group_by(xdata) %>%
summarise(mean = mean(.fitted),
lower = quantile(.fitted, 0.025),
upper = quantile(.fitted, 0.975)) %>%
tidyr::gather(stat, value, -xdata)
ggplot(mapping = aes(xdata)) +
geom_line(aes(y = .fitted, group = replicate), fitted_boot, alpha=.05) +
geom_line(aes(y = value, lty = stat), col = 'red', quants, size = 1) +
geom_point(aes(y = ydata), data, size=3) +
scale_linetype_manual(values = c(lower = 2, mean = 1, upper = 2)) +
theme_bw()
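In case bootstrap() is not available in your version of broom, the resampling step can be sketched in base R. This is an assumption-laden rewrite rather than the answer's code: the helper fit_one is hypothetical, rows with a missing ydata are dropped up front, and resamples on which nls() fails to converge are simply skipped.

```r
xdata <- c(-35.98, -34.74, -33.46, -32.04, -30.86, -29.64, -28.50, -27.29, -26.00,
           -24.77, -23.57, -22.21, -21.19, -20.16, -18.77, -17.57, -16.47, -15.35,
           -14.40, -13.09, -11.90, -10.47, -9.95, -8.90, -7.77, -6.80, -5.99,
           -5.17, -4.21, -3.06, -2.29, -1.04)
ydata <- c(-4.425, -4.134, -5.145, -5.411, -6.711, -7.725, -8.087, -9.059, -10.657,
           -11.734, NA, -12.803, -12.906, -12.460, -12.128, -11.667, -10.947, -10.294,
           -9.185, -8.620, -8.025, -7.493, -6.713, -6.503, -6.316, -5.662, -5.734,
           -4.984, -4.723, -4.753, -4.503, -4.200)
dat <- data.frame(xdata, ydata)
dat <- dat[!is.na(dat$ydata), ]

# Fine grid on which every bootstrap fit is evaluated
x_range <- seq(min(dat$xdata), max(dat$xdata), length.out = 200)

# Hypothetical helper: resample rows, refit, predict on the grid
fit_one <- function(i) {
  resampled <- dat[sample(nrow(dat), replace = TRUE), ]
  m <- tryCatch(
    nls(ydata ~ A * cos(2 * pi * ((xdata - x_0) / z)) + M, data = resampled,
        start = list(A = 4, M = -7, x_0 = -10, z = 30)),
    error = function(e) NULL  # skip resamples where the fit does not converge
  )
  if (is.null(m)) return(NULL)
  data.frame(replicate = i, xdata = x_range,
             .fitted = predict(m, newdata = data.frame(xdata = x_range)))
}

set.seed(1)
fitted_boot <- do.call(rbind, lapply(1:50, fit_one))
```

The resulting fitted_boot has the same replicate / xdata / .fitted columns as above, so the plotting and quantile code should work on it unchanged.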

R Language - Sorting data into ranges; averaging; ignore outliers

I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of two columns, WindSpeed and Power. So far I have imported the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is to sort the data into ranges (for example, all data where WindSpeed is between x and y), then find the average power generated for each range and graph the resulting curve.
From this average I want to recalculate the average using only data that falls within one or two standard deviations of it (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but like I said, the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
Throw this version, similar in motivation to @hadley's, into the mix. It uses an additive model with an adaptive smoother from package mgcv:
Dummy data first, as used by @hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything and the Loess fit for comparison
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
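noOutliers = function(x, power, w_sp, bin_incr) {
# power values that fall in the bin [x, x + bin_incr)
d = power[w_sp >= x & w_sp < (x + bin_incr)]
m_d = mean(d)
s_d = sd(d)
# keep only values within two standard deviations of the bin mean
d_trim = d[d > (m_d - 2*s_d) & d < (m_d + 2*s_d)]
return(mean(d_trim))
}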
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
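The binning step can also be written with cut() and tapply(), which avoids the explicit loop over bin edges. This is a sketch on a tiny invented data set, so the numbers are illustrative only; right = FALSE makes the bins closed on the left, matching the [0,5), [5,10) convention used above.

```r
# Tiny invented data set: five readings of wind speed and power
w_sp  <- c(1, 2, 6, 7, 12)
power <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Assign each reading to a bin [0,5), [5,10), [10,15)
bin <- cut(w_sp, breaks = seq(0, 15, by = 5), right = FALSE)

# Mean power per bin
bin_mean <- tapply(power, bin, mean)

print(bin_mean)
```

Swapping mean for median, or for a trimming function like noOutliers' inner logic, only changes the last argument of tapply().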
Here are some examples of fitted curves (Weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 20)
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
ggtitle("Wind speed and power")
