ggplot to show confidence intervals from bootstrapping curve fitting - r

Thanks to the help I got here, I was able to get a spaghetti plot of curve fits using bootstrapping. I am trying to derive confidence bands from these fitted models. I've had no luck getting something like
quants <- apply(fitted_boot, 1, quantile, c(0.025, 0.5, 0.975))
to work with the following:
library(dplyr)
library(broom)
library(ggplot2)
xdata <- c(-35.98, -34.74, -33.46, -32.04, -30.86, -29.64, -28.50, -27.29, -26.00,
-24.77, -23.57, -22.21, -21.19, -20.16, -18.77, -17.57, -16.47, -15.35,
-14.40, -13.09, -11.90, -10.47, -9.95,-8.90,-7.77,-6.80, -5.99,
-5.17, -4.21, -3.06, -2.29, -1.04)
ydata <- c(-4.425, -4.134, -5.145, -5.411, -6.711, -7.725, -8.087, -9.059, -10.657,
-11.734, NA, -12.803, -12.906, -12.460, -12.128, -11.667, -10.947, -10.294,
-9.185, -8.620, -8.025, -7.493, -6.713, -6.503, -6.316, -5.662, -5.734, -4.984,
-4.723, -4.753, -4.503, -4.200)
data <- data.frame(xdata,ydata)
x_range <- seq(min(xdata), max(xdata), length.out = 1000)
fitted_boot <- data %>%
bootstrap(100) %>%
do({
m <- nls(ydata ~ A*cos(2*pi*((xdata-x_0)/z))+M, ., start=list(A=4,M=-7,x_0=-10,z=30))
f <- predict(m, newdata = list(xdata = x_range))
data.frame(xdata = x_range, .fitted = f)
} )
ggplot(data, aes(xdata, ydata)) +
geom_line(aes(y=.fitted, group=replicate), fitted_boot, alpha=.1, color="blue") +
geom_point(size=3) +
theme_bw()
I thought perhaps geom_ribbon() would be a nice way to go, but I just don't know where to go from here.
Thank you to Axeman for helping on the other post!

One approach would be to calculate the confidence interval at each x-value and then just plot that. Here, I am using the first values at or beyond the 2.5th and 97.5th percentiles, though you could adjust the code as needed.
First, I change to group_by the xdata locations (instead of replicates). Then, I arrange by the .fitted values so that I can slice out the values I want (the first outside the percentile cutoffs). Finally, I tag them with which bound I am getting (they always go lower then upper because we sorted).
forConfInt <-
fitted_boot %>%
ungroup() %>%
group_by(xdata) %>%
arrange(.fitted) %>%
slice(c(floor(0.025 * n() )
, ceiling(0.975 * n() ) ) ) %>%
mutate(range = c("lower", "upper"))
This gives:
replicate xdata .fitted range
<int> <dbl> <dbl> <chr>
1 9 -35.98000 -4.927462 lower
2 94 -35.98000 -4.249348 upper
3 9 -35.94503 -4.927248 lower
4 94 -35.94503 -4.257776 upper
5 9 -35.91005 -4.927228 lower
6 94 -35.91005 -4.266334 upper
7 9 -35.87508 -4.927401 lower
8 94 -35.87508 -4.275020 upper
9 9 -35.84010 -4.927766 lower
10 94 -35.84010 -4.283836 upper
# ... with 1,990 more rows
And we can then just add an additional line to the ggplot call:
ggplot(data, aes(xdata, ydata)) +
geom_line(aes(y=.fitted, group=replicate), fitted_boot, alpha=.1, color="blue") +
# Added confidence interval:
geom_line(aes(y=.fitted, group=range), forConfInt, color="red") +
geom_point(size=3) +
theme_bw()
Gives this plot:
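Since the question mentions geom_ribbon(), a band could also be drawn from per-x quantiles. The sketch below is not the method used above; it swaps slice() for quantile() and uses a simulated stand-in for fitted_boot (an assumption, so the block runs without the nls fits) with the same three columns: replicate, xdata and .fitted.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for fitted_boot: 100 replicate curves on a shared x grid
set.seed(1)
x_grid <- seq(-36, -1, length.out = 50)
fitted_boot <- bind_rows(lapply(1:100, function(i) {
  amp <- 4 + rnorm(1, 0, 0.3)  # amplitude varies between replicates
  data.frame(replicate = i,
             xdata = x_grid,
             .fitted = amp * cos(2 * pi * (x_grid + 10) / 30) - 7)
}))

# Pointwise 2.5%, 50% and 97.5% quantiles at each x value
bands <- fitted_boot %>%
  group_by(xdata) %>%
  summarise(lower  = quantile(.fitted, 0.025),
            median = quantile(.fitted, 0.5),
            upper  = quantile(.fitted, 0.975),
            .groups = "drop")

# Shaded band plus the median curve
p <- ggplot(bands, aes(xdata)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "blue", alpha = 0.2) +
  geom_line(aes(y = median), color = "blue") +
  theme_bw()
```

With the real fitted_boot, only the summarise() and ggplot() steps would be needed, and the original points could be added with geom_point(aes(y = ydata), data).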

Related

Calculating odds ratios between deciles of data in R

Similar to: How to calculate and plot odds-ratios and their standard errors from a logistic regression in R?
But I would like to plot the Phenotypes separately in the plot.
Data (subset of 20,000 similar lines):
ID PHENO SCORE
1 1 -0.001
2 1 0.132
3 1 0.023
4 0 -0.20032
5 1 -0.002
6 0 0.012
7 1 -0.23
8 0 0.321
9 0 -0.21
10 0 -0.497
I have then run a glm logistic model on this data.
I would like to put the scores into deciles or some other meaningful division and then work out the odds ratio of having the phenotype (1 is having the disease, 0 is controls) per division of score, ideally between cases and controls, using R.
To decile I do:
library(dplyr)
#place each value into a decile
data$decile <- ntile(data$SCORE, 10)
I then follow the question above but wanted the plot to show the cases and controls separately.
I would like to end up with an image like the one below, with case (1) vs control (0) from the PHENO column:
Any help would be appreciated.
First, I generated some random data to make the example reproducible. You can convert the target and the deciles to factors, then extract the odds ratios and confidence intervals by applying exp to the output of coef and confint. After that, you can take the mean of the results for each decile and PHENO. To create the graph, you can use geom_pointrange like this:
# Generate random data
set.seed(7)
data <- data.frame(ID = rep(c(1:10), 2000),
PHENO = sample(c(0,1), replace=TRUE, size=20000),
SCORE = rnorm(20000, 0, 1))
library(dplyr)
library(ggplot2)
#place each value into a decile
data <- data %>% mutate(decile = ntile(SCORE, 10))
# convert PHENO and decile to factor
data$PHENO <- as.factor(data$PHENO)
data$decile <- as.factor(data$decile)
# model
fit <- glm(PHENO ~ decile, data=data, family='binomial')
# Extract odds ratio with intervals
results <- as.data.frame(exp(cbind(coef(fit), confint(fit))))
#> Waiting for profiling to be done...
# Change columnames results dataframe
colnames(results) <- c('odds_ratio', '2.5', '97.5')
# Add id column
results$ID <- c(1:10)
# Join data and results dataframe
data <- left_join(data, results, by = 'ID')
# Take mean
data_sum <- data %>%
group_by(decile, PHENO) %>%
summarise(odds_ratio = mean(odds_ratio),
`2.5` = mean(`2.5`),
`97.5` = mean(`97.5`))
#> `summarise()` has grouped output by 'decile'. You can override using the
#> `.groups` argument.
# plot
ggplot(data_sum, aes(x = decile, y = odds_ratio, ymin = `2.5`, ymax = `97.5`, color = PHENO, shape = PHENO)) +
geom_pointrange(position = position_dodge(width = 0.4)) +
scale_color_manual(values = c('blue', 'green')) +
scale_shape_manual(values = c(18, 16)) +
guides(shape = 'none') +
theme_classic() +
labs(x = 'Decile', y = 'Odds ratio', color = '')
Created on 2022-10-29 with reprex v2.0.2
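One thing to watch in the code above: the rows of results correspond to the intercept and decile2 to decile10, while the join is keyed on ID. A hedged alternative sketch is shown below, keyed on decile instead, with decile 1 as the reference level. It uses Wald intervals from confint.default rather than profile intervals (a substitution made here for speed), and the simulated data and column names lower and upper are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Simulated data (assumption): random phenotype, so true odds ratios are near 1
set.seed(7)
data <- data.frame(PHENO = sample(c(0, 1), 2000, replace = TRUE),
                   SCORE = rnorm(2000))
data$decile <- factor(ntile(data$SCORE, 10))

fit <- glm(PHENO ~ decile, data = data, family = binomial)

# Wald intervals on the log-odds scale, exponentiated to the odds-ratio scale
or <- as.data.frame(exp(cbind(coef(fit), confint.default(fit))))
colnames(or) <- c("odds_ratio", "lower", "upper")

# Drop the intercept (baseline odds in decile 1) and label the rest by decile
or <- or[-1, ]
or$decile <- factor(2:10)

p <- ggplot(or, aes(decile, odds_ratio, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(x = "Decile (reference: decile 1)", y = "Odds ratio")
```

Each point is then the odds of being a case in that decile relative to decile 1, so there is one estimate per decile rather than one per phenotype.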

Is it possible to recreate the functionality of bayesplot's "mcmc_areas" plot in ggplot in R

There is a package supported by Stan called bayesplot that can produce nice density area plots, with the area under each density curve partitioned by credible intervals computed from the posterior parameter samples drawn through MCMC. This results in a plot that looks like the following:
I am looking to make a similar style of plot given 1D lists of sampled values using ggplot, that I can pass any generic list of values to without it having to be a Stan fit etc. Does anyone know how to do this? The density part is clear via geom_density, but I am struggling with the fill partitioning.
Here's a function that generates a plot similar to bayesplot::mcmc_areas. It plots credible intervals (equal-tailed by default, or highest density) with optional setting of the probability width of the interval:
library(tidyverse)
library(ggridges)
library(bayestestR)
theme_set(theme_classic(base_size=15))
# Create ridgeplots with credible intervals
# ARGUMENTS
# data A data frame
# FUN A function that calculates credible intervals
# ci The width of the credible interval
# ... For passing optional arguments to geom_ridgeline.
# For example, change the scale parameter to control overlap of ridge lines.
# geom_ridgeline's default is scale=1.
plot_density_ridge = function(data, FUN=c("eti", "hdi"), ci=0.89, ...) {
# Determine whether to use eti or hdi function
FUN = match.arg(FUN)
FUN = match.fun(FUN)
# Get kernel density estimate as a data frame
dens = map_df(data, ~ {
d = density(.x, na.rm=TRUE)
tibble(x=d$x, y=d$y)
}, .id="name")
# Set relative width of median line
e = diff(range(dens$x)) * 0.006
# Get credible interval width and median
cred.int = data %>%
pivot_longer(cols=everything()) %>%
group_by(name) %>%
summarise(CI=list(FUN(value, ci=ci)),
m=median(value, na.rm=TRUE)) %>%
unnest_wider(CI)
dens %>%
left_join(cred.int) %>%
ggplot(aes(y=name, x=x, height=y)) +
geom_vline(xintercept=0, colour="grey70") +
geom_ridgeline(data= . %>% group_by(name) %>%
filter(between(x, CI_low, CI_high)),
fill=hcl(230,25,85), ...) +
geom_ridgeline(data=. %>% group_by(name) %>%
filter(between(x, m - e, m + e)),
fill=hcl(240,30,60), ...) +
geom_ridgeline(fill=NA, ...) +
geom_ridgeline(fill=NA, aes(height=0), ...) +
labs(y=NULL, x=NULL)
}
Now let's try out the function:
# Fake data
set.seed(2)
d = data.frame(a = rnorm(1000, 0.6, 1),
b = rnorm(1000, 1.3, 0.5),
c = rnorm(1000, -1.2, 0.7))
plot_density_ridge(d)
plot_density_ridge(d, ci=0.5, scale=1.5)
plot_density_ridge(iris %>% select(-Species))
plot_density_ridge(iris %>% select(-Species), FUN="hdi")
Use the ggridges package:
library(tidyverse)
library(ggridges)
tibble(data_1, data_2, data_3) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name, group = name)) +
geom_density_ridges()
Data:
set.seed(123)
n <- 15
data_1 <- rnorm(n)
data_2 <- data_1 - 1
data_3 <- data_1 + 2
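For the fill partitioning itself, a minimal sketch using only ggplot2 and base R is to compute the kernel density once and shade just the part of the curve that falls inside the interval. The 89% equal-tailed interval and the simulated draws below are assumptions for illustration, not output from a Stan fit.

```r
library(ggplot2)

# Any plain numeric vector of samples works here
set.seed(2)
draws <- rnorm(1000, 0.6, 1)

# Kernel density estimate as a data frame
d <- density(draws)
dens <- data.frame(x = d$x, y = d$y)

# 89% equal-tailed interval from the sample quantiles
ci <- quantile(draws, c(0.055, 0.945))

p <- ggplot(dens, aes(x, y)) +
  # shade only the part of the density inside the interval
  geom_area(data = subset(dens, x >= ci[1] & x <= ci[2]), fill = "lightblue") +
  # outline of the full density
  geom_line() +
  # median marker
  geom_vline(xintercept = median(draws), linetype = "dashed") +
  labs(x = NULL, y = NULL)
```

The same idea extends to several parameters by computing one density per vector and faceting or stacking them, which is roughly what the ridge-line function above does.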

Trying to use tidy for a power analysis and using clmm2

I'm trying to do a power analysis on a clmm2 analysis that I'm doing.
This is the code for the particular statistical model:
test <- clmm2(risk_sensitivity ~ treat + sex + dispersal +
sex*dispersal + treat*dispersal + treat*sex,random = id, data = datasocial, Hess=TRUE)
Now, I have the following function:
sim_experiment_power <- function(rep) {
s <- sim_experiment(n_sample = 1000,
prop_disp = 0.10,
prop_fem = 0.35,
disp_probability = 0.75,
nondisp_probability = 0.90,
fem_probability = 0.75,
mal_probability = 0.90)
broom.mixed::tidy(s) %>%
mutate(rep = rep)
}
my_power <- map_df(1:10, sim_experiment_power)
The details of the function sim_experiment are not relevant because it works as expected. The important thing to know is that it returns a fitted clmm2 model. My objective with the function above is to do a power analysis. However, I get the following error:
Error: No tidy method for objects of class clmm2
I'm a bit new to R, but I guess it means that tidy doesn't work with clmm2. Does anyone know a work-around for this issue?
EDIT: This is what follows the code that I posted above, which is ultimately what I'm trying to get.
You can then plot the distribution of estimates across your simulations.
ggplot(my_power, aes(estimate, color = term)) +
geom_density() +
facet_wrap(~term, scales = "free")
You can also just calculate power as the proportion of p-values less than your alpha.
my_power %>%
group_by(term) %>%
summarise(power = mean(p.value < 0.05))
For what you need, you can write a function that returns the coefficient table with the column names tidy would normally produce:
library(ordinal)
library(dplyr)
library(purrr)
tidy_output_clmm = function(fit){
results = as.data.frame(coefficients(summary(fit)))
colnames(results) = c("estimate","std.error","statistic","p.value")
results %>% tibble::rownames_to_column("term")
}
Then we apply it to an example in which I resample, with replacement, the wine dataset from the ordinal package:
sim_experiment_power <- function(rep) {
idx = sample(nrow(wine),replace=TRUE)
s <- clmm2(rating ~ temp, random=judge, data=wine[idx,], nAGQ=10,Hess=TRUE)
tidy_output_clmm(s) %>% mutate(rep=rep)
}
my_power <- map_df(1:10, sim_experiment_power)
Plotting works:
ggplot(my_power, aes(estimate, color = term)) +
geom_density() +
facet_wrap(~term, scales = "free")
And so does power:
my_power %>% group_by(term) %>% summarise(power = mean(p.value < 0.05))
# A tibble: 5 x 2
term power
<chr> <dbl>
1 1|2 0.9
2 2|3 0.1
3 3|4 1
4 4|5 1
5 tempwarm 1
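The "proportion of p-values below alpha" idea does not depend on clmm2. As a minimal sketch under simplified assumptions (a two-sample t-test stands in for the mixed model, and sim_p is a hypothetical helper), it can be written in base R as:

```r
set.seed(42)

# One simulated experiment: two groups of size n with a true mean difference delta
sim_p <- function(n = 30, delta = 0.5) {
  x <- rnorm(n)
  y <- rnorm(n, mean = delta)
  t.test(x, y)$p.value
}

# Repeat the experiment many times and keep each p-value
p_values <- replicate(500, sim_p())

# Estimated power: share of simulations that reject at alpha = 0.05
power <- mean(p_values < 0.05)
```

With only 10 replicates, as in the example above, the power estimates move in steps of 0.1, so a larger number of simulations is usually worth the run time.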

Plot regression coefficient with confidence intervals

Suppose I have 2 data frames, one for 2015 and one for 2016. I want to run a regression on each data frame and plot one of the coefficients from each regression with its confidence interval. For example:
set.seed(1020022316)
library(dplyr)
library(stargazer)
df16 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.5 * x1 + 2 * t + e) %>%
select(-e)
df15 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.75 * x1 + 2.5 * t + e) %>%
select(-e)
lm16 <- lm(y ~ x1 + t, data = df16)
lm15 <- lm(y ~ x1 + t, data = df15)
stargazer(lm15, lm16, type="text", style = "aer", ci = TRUE, ci.level = 0.95)
I want to plot t=1.558, x=2015, and t=2.797, x=2016 with their respective .95 CI. What is the best way of doing this?
I could do it 'by hand', but I hope there is a better way.
library(ggplot2)
df.plot <-
data.frame(
y = c(lm15$coefficients[['t']], lm16$coefficients[['t']]),
x = c(2015, 2016),
lb = c(
confint(lm15, 't', level = 0.95)[1],
confint(lm16, 't', level = 0.95)[1]
),
ub = c(
confint(lm15, 't', level = 0.95)[2],
confint(lm16, 't', level = 0.95)[2]
)
)
df.plot %>% ggplot(aes(x, y)) + geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
geom_hline(aes(yintercept=0), linetype="dashed")
By "best" I mean: figure quality (it looks nice), code elegance, and ease of extending it to more than 2 regressions.
This is a bit too long for a comment, so I post it as a partial answer.
It is unclear from your post whether your main problem is getting the data into the right shape or the plotting itself. But just to follow up on one of the comments, let me show you how to run several models using dplyr and broom in a way that makes plotting easy. Consider the mtcars dataset:
library(dplyr)
library(broom)
models <- mtcars %>% group_by(cyl) %>%
do(data.frame(tidy(lm(mpg ~ disp, data = .),conf.int=T )))
head(models) # I have abbreviated the following output a bit
cyl term estimate std.error statistic p.value conf.low conf.high
(dbl) (chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
4 (Intercept) 40.8720 3.5896 11.39 0.0000012 32.752 48.99221
4 disp -0.1351 0.0332 -4.07 0.0027828 -0.210 -0.06010
6 (Intercept) 19.0820 2.9140 6.55 0.0012440 11.591 26.57264
6 disp 0.0036 0.0156 0.23 0.8259297 -0.036 0.04360
You see that this gives you all coefficients and confidence intervals in one nice data frame, which makes plotting with ggplot easier. For instance, if your datasets have identical columns, you could add a year identifier to each of them (e.g. df1$year <- 2000; df2$year <- 2001 etc.) and bind them together afterwards (e.g. using bind_rows, or you can use the .id option of bind_rows). Then you can use the year identifier instead of cyl in the above example.
The plotting then is simple. To use the mtcars data again, let's plot the coefficients for disp only (though you could also use faceting, grouping, etc):
ggplot(filter(models, term=="disp"), aes(x=cyl, y=estimate)) +
geom_point() + geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
To use your data:
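# Name the inputs so that the "years" column holds 2015/2016 rather than 1/2
df <- bind_rows(`2015` = df15, `2016` = df16, .id = "years")
models <- df %>% group_by(years) %>%
do(data.frame(tidy(lm(y ~ x1+t, data = .),conf.int=T )))
# Plot the coefficient on t for each year
models %>% filter(term == "t") %>%
ggplot(aes(x=years, y=estimate)) + geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high))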
Note that you can easily add more models just by binding more data to the main data frame. You can also use faceting, grouping or position dodging to adjust the look of the plot if you want to show more than one coefficient.
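If the models are already fitted (as with lm15 and lm16 in the question), a similar result can be reached without refitting by tidying a named list of models. This is a sketch under assumptions: make_df is a hypothetical helper that simulates data loosely resembling the question's, and the list names are used as the year labels.

```r
library(dplyr)
library(broom)
library(ggplot2)

# Hypothetical helper: simulate one year of data with a given effect of t
set.seed(1)
make_df <- function(b_t) {
  d <- data.frame(x1 = rnorm(200, 0, 2),
                  t  = sample(c(0, 1), 200, replace = TRUE))
  d$y <- 0.5 * d$x1 + b_t * d$t + rnorm(200, 0, 10)
  d
}

# Named list of fitted models; the names become the x-axis labels
fits <- list(`2015` = lm(y ~ x1 + t, data = make_df(2.5)),
             `2016` = lm(y ~ x1 + t, data = make_df(2.0)))

# One tidy data frame with a "year" column taken from the list names
coefs <- lapply(fits, tidy, conf.int = TRUE) %>%
  bind_rows(.id = "year") %>%
  filter(term == "t")

p <- ggplot(coefs, aes(year, estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  geom_hline(yintercept = 0, linetype = "dashed")
```

Adding a third regression then only means adding one more named element to the list.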
This is the solution I have right now:
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
df.plot <- lapply(list(lm15,lm16), gen_df_plot, coef_name = 't')
df.plot <- data.table::rbindlist(df.plot)
df.plot$x <- as.factor(c(2015, 2016))
df.plot %>% ggplot(aes(x, y)) + geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") + theme_bw()
I don't love it, but it works.
Here is what might be generalized code. I have made a change to how "x" is defined so that you don't have to worry about alphabetic reordering of the factor.
#
# Paul Gronke and Paul Manson
# Early Voting Information Center at Reed College
#
# August 27, 2019
#
#
# Code to plot a single coefficient from multiple models, provided
# as an easier alternative to "coefplot" and "dotwhisker". Some users
# may find those packages more capable
#
# Code adapted from https://stackoverflow.com/questions/35582052/plot-regression-coefficient-with-confidence-intervals
# gen_df_plot function will create a tidy data frame for your plot
# Currently set up to display 95% confidence intervals
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
# Populate the data frame with a list of your model results.
df.plot <- lapply(list(model1, # List your models here
model2),
gen_df_plot,
coef_name = 'x1') # Coefficient name
# Convert the list to a tidy data frame
df.plot <- data.table::rbindlist(df.plot)
# Provide the coefficient or regression labels below, in the
# order that you want them to appear. The "levels=unique(.)" parameter
# overrides R's desire to order the factor alphabetically
df.plot$x <- c("Group 1",
"Group 2") %>%
factor(., levels = unique(.),
ordered = TRUE)
# Create your plot
df.plot %>% ggplot(aes(x, y)) +
geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") +
theme_bw() +
ggtitle("Comparing Coefficients") +
ylab("Coefficient Value")

How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")

I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.
Example:
library(plyr)
library(ggplot2)
set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
val = c(1:10, 100, 101))
df
# fact val
# 1 b 1
# 2 a 2
# 3 a 3
# 4 a 4
# 5 b 5
# 6 a 6
# 7 b 7
# 8 b 8
# 9 b 9
# 10 a 10
# 11 b 100
# 12 a 101
by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))
by.fact.df
# fact 0% 25% 50% 75% 100%
# 1 a 2 3.25 5.0 9.00 101
# 2 b 1 5.50 7.5 8.75 100
# What I can do...with faults (a) and (b) above
ggplot(by.fact.df,
aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity")
# What I want...
ggplot(df, aes(x = fact, y = val)) +
geom_boxplot()
What I can do...with faults (a) and (b) mentioned above:
What I would like to obtain, but still leverage pre-computation via plyr (or other method):
Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them as geom_point()?
Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them rather than having ggplot2 do the calculations.
UPDATE
I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:
df %>%
group_by(fact) %>%
do(ldply(boxplot.stats(.$val), data.frame))
Source: local data frame [6 x 3]
Groups: fact
fact .id X..i..
1 a stats 2
2 a stats 4
3 a stats 10
4 a stats 13
5 a stats 16
6 a n 9
Here's my answer, using built-in functions quantile and boxplot.stats.
geom_boxplot does the calculations for a boxplot slightly differently than boxplot.stats. Read ?geom_boxplot and ?boxplot.stats to understand my implementation below.
# Function to calculate boxplot stats to match ggplot's implementation in geom_boxplot.
my_boxplot.stats <- function(x){
quantiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
labels <- names(quantile(x))
# replace both whisker ends with the most extreme non-outlier values from boxplot.stats
quantiles[c(1, 5)] <- boxplot.stats(x)$stats[c(1, 5)]
res <-data.frame(rbind(quantiles))
names(res) <-labels
res$out <-boxplot.stats(x)$out
return(res)
}
Code to calculate the stats and plot it
library(dplyr)
df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>%
ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity") + geom_point()
To get the correct statistics, you have to do some more calculations than just finding the quantiles. The geom_boxplot function with stat = "identity" does not draw the outliers. So you have to calculate the statistics without the outliers and then use geom_point to draw the outliers separately. The following code (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:
box.df <- df %>% group_by(fact) %>% do({
stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
iqr <- diff(stats[c(2, 4)])
coef <- 1.5
outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
if (any(outliers)) {
stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm=TRUE)
}
outlier_values = .$val[outliers]
if (length(outlier_values) == 0) outlier_values <- NA_real_
res <- as.list(t(stats))
names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
res$out <- outlier_values
as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
##
## fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1 a 2 3.25 5.0 9.00 10 101
## 2 b 1 5.50 7.5 8.75 9 100
ggplot(box.df, aes(x = fact, y = out, middle = median,
ymin = lower.whisker, ymax = upper.whisker,
lower = lower.hinge, upper = upper.hinge)) +
geom_boxplot(stat = "identity") +
geom_point()
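A variation on the same idea, sketched here as one possible layout rather than a replacement for the code above, keeps the box statistics and the outliers in two separate data frames. That avoids repeating a box row for every outlier and copes with groups that have no outliers or several. The data frame below holds the same values as the question's df, written out explicitly; the names flagged, box and outliers are illustrative.

```r
library(dplyr)
library(ggplot2)

# Same values per group as the question's example data
df <- data.frame(fact = rep(c("a", "b"), each = 6),
                 val  = c(2, 3, 4, 6, 10, 101, 1, 5, 7, 8, 9, 100))

# Flag outliers within each group using the 1.5 * IQR rule
flagged <- df %>%
  group_by(fact) %>%
  mutate(q1 = quantile(val, 0.25),
         q3 = quantile(val, 0.75),
         is_out = val < q1 - 1.5 * (q3 - q1) | val > q3 + 1.5 * (q3 - q1)) %>%
  ungroup()

# One row per group: whiskers end at the most extreme non-outlier values
box <- flagged %>%
  group_by(fact) %>%
  summarise(ymin = min(val[!is_out]),
            lower = first(q1),
            middle = median(val),
            upper = first(q3),
            ymax = max(val[!is_out]),
            .groups = "drop")

# Outliers kept separately, one row per point
outliers <- filter(flagged, is_out)

p <- ggplot(box, aes(x = fact)) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax), stat = "identity") +
  geom_point(data = outliers, aes(y = val))
```

For large datasets, only box and outliers need to be passed to ggplot, which is the point of pre-computing the statistics.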

Resources