Calculating odds ratios between deciles of data in R - r

Similar to: How to calculate and plot odds-ratios and their standard errors from a logistic regression in R?
But I would like to plot the Phenotypes separately in the plot.
Data (subset of 20,000 similar lines):
ID PHENO SCORE
1 1 -0.001
2 1 0.132
3 1 0.023
4 0 -0.20032
5 1 -0.002
6 0 0.012
7 1 -0.23
8 0 0.321
9 0 -0.21
10 0 -0.497
I have then run a glm logistic model on this data
I would like to put the scores into deciles or some meaningful division and then work out the Odds ratio of having the phenotype (1 is having the disease, 0 is controls) per division of score , ideally between cases and control, using R.
To decile I do:
library(dplyr)
#place each value into a decile
data$decile <- ntile(data, 10)
I then follow the question above but wanted the plot to show the cases and controls separately.
I would like to end up with an image like below (with case(1) vs control(0) from the PHENO column:
Any help would be appreciated.

First of all, I generated some random data to make it more reproducible. First, you could make your target and deciles a factor. To extract the odds ratios and confidence intervals, you could use coef and confint with exp. After you can take the mean of each ID and PHENO of your results. To create the graph you can use geom_pointrange like this:
# Generate random data
set.seed(7)
data <- data.frame(ID = rep(c(1:10), 2000),
PHENO = sample(c(0,1), replace=TRUE, size=20000),
SCORE = rnorm(20000, 0, 1))
library(dplyr)
library(ggplot2)
#place each value into a decile
data <- data %>% mutate(decile = ntile(SCORE, 10))
# convert PHENO and decile to factor
data$PHENO <- as.factor(data$PHENO)
data$decile <- as.factor(data$decile)
# model
fit <- glm(PHENO ~ decile, data=data, family='binomial')
# Extract odds ratio with intervals
results <- as.data.frame(exp(cbind(coef(fit), confint(fit))))
#> Waiting for profiling to be done...
# Change columnames results dataframe
colnames(results) <- c('odds_ratio', '2.5', '97.5')
# Add id column
results$ID <- c(1:10)
# Join data and results dataframe
data <- left_join(data, results, by = 'ID')
# Take mean
data_sum <- data %>%
group_by(decile, PHENO) %>%
summarise(odds_ratio = mean(odds_ratio),
`2.5` = mean(`2.5`),
`97.5` = mean(`97.5`))
#> `summarise()` has grouped output by 'decile'. You can override using the
#> `.groups` argument.
# plot
ggplot(data_sum, aes(x = decile, y = odds_ratio, ymin = `2.5`, ymax = `97.5`, color = PHENO, shape = PHENO)) +
geom_pointrange(position = position_dodge(width = 0.4)) +
scale_color_manual(values = c('blue', 'green')) +
scale_shape_manual(values = c(18, 16)) +
guides(shape = 'none') +
theme_classic() +
labs(x = 'Decile', y = 'Odds ratio', color = '')
Created on 2022-10-29 with reprex v2.0.2

Related

Explain the code underlying a linear model in R visualised with ggplot

I am trying to understand how linear modelling can be used to as an alternative to the t-test when analysing gene expression data. For a single gene, I have a dataframe of 20 gene expression values altogether in group 1 (n=10) and group 2 (n=10).
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
group = c(rep(1, 10), rep(2, 10)))
The data can be (box)plotted using ggplot as shown below:
plot <- gexp %>%
ggplot(aes(x = group, y = expression)) +
geom_boxplot() +
geom_point()
plot
I wish to model the expression in groups 1 and 2 using the regression formula:
Y = Beta0 + (Beta1 x X) + e where Y is the expression I want to model and X represents the two groups that are encoded as 0 and 1 respectively. Therefore, the expression in group 1 (when x = 0) is equal to Beta0; and the expression in group 2 (when x = 1) is equal to Beta0 + Beta1.
If this is modelled with:
mod1 <- lm(expression ~ group, data = gexp)
mod1
The above code outputs an intercept of 2.75 and a slope of 1.58. It is the visualisation of the linear model that I don't understand. I would be grateful for a clear explanation of the below code:
plot +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
I get why the data.frame values are the ones chosen (the value of 4.33 is the sum of the intercept, Beta0 and the slope, Beta1) , but it is the geom_abline arguments I do not understand. Why is the intercept calculation as shown? In the text I am using it states, '...we need to subtract the slope from the intercept when plotting the linear model because groups 1 and 2 are encoded as 0 and 1 in the model, but plotted as 1 and 2 on the figure.' I don't follow this point and would be grateful for an explanation, without getting too technical.
I believe your code is correct if the group variable was encoded as a factor.
library(ggplot2)
gexp = data.frame(expression = c(2.7,0.4,1.8,0.8,1.9,5.4,5.7,2.8,2.0,4.0,3.9,2.8,3.1,2.1,1.9,6.4,7.5,3.6,6.6,5.4),
group = factor(c(rep(1, 10), rep(2, 10))))
plot <-
ggplot(gexp, aes(x = group, y = expression)) +
geom_boxplot() +
geom_point()
mod1 <- lm(expression ~ group, data = gexp)
plot +
geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
aes(x = x, y = y),
colour = "red", size = 5) +
geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
slope = coefficients(mod1)[2])
Created on 2022-03-30 by the reprex package (v2.0.1)
To understand the difference between factors and integers in specifying linear models, you can have a look at the model matrix.
model.matrix(y ~ f, data = data.frame(f = 1:3, y = 1))
#> (Intercept) f
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> attr(,"assign")
#> [1] 0 1
model.matrix(y ~ f, data = data.frame(f = factor(1:3), y = 1))
#> (Intercept) f2 f3
#> 1 1 0 0
#> 2 1 1 0
#> 3 1 0 1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$f
#> [1] "contr.treatment"
Created on 2022-03-30 by the reprex package (v2.0.1)
In the first model matrix, what you specify is what you get: you're modelling something as a function of the intercept and the f variable. In this model, you account for that f = 2 is twice as much as f = 1.
This works a little bit differently when f is a factor. A k-level factor gets split up in k-1 dummy variables, where each dummy variable encodes with 1 or 0 whether it deviates from the reference level (the first factor level). By modelling it in this way, you don't consider that the 2nd factor level might be twice the 1st factor level.
Because in ggplot2, the first factor level is displayed at position = 1 and not at position = 0 (how it is modelled), your calculated intercept is off. You need to subtract 1 * slope from the calculated intercept to get it to display right in ggplot2.

Plotting a graph with sample sizes and power estimates

I have simulated a linear model 1000 times using a randomly generated height and weight values, and randomly assigned each participant to a treatment or non-treatment (factor of 1 and 0). Let's say the model was:
lm(bmi~height + weight + treatment, data = df)
I am now struggling for the following:
The model now needs to cycle through the sample sizes between 300 and 500 in steps of 10 for each of the 1000 replications and store the proportion of simulated experiments with p values less than 0.05 for the purpose of estimating the power that can detect a change of 0.5 in bmi between two treatment groups at 5% significance level.
After doing the above, I then need to create a figure that best depicts the sample sizes on x-axis, and the estimated power on the y-axis, and also reflect the smallest sample size to achieve a 80% power estimate by a distinct color.
Any ideas how and where to go from here?
Thanks,
Chris
I would do it something like this:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
# use assumed data generating process to simulate data and add error
data <- tibble(height = rnorm(n, 69, 0.1),
weight = rnorm(n, 197.8, 1.9),
treatment = sample(c(0, 1), n, replace = TRUE),
error = rnorm(n, sd = 1.75),
bmi = 703 * weight / height^2 + 0.5 * treatment + error)
# model the data
mdl <- lm(bmi ~ height + weight + treatment, data = data)
# extract p-value for treatment effect
summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
p_values <- replicate(r, do_simulate(n))
power <- mean(p_values < alpha)
return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
power = mx[2, ])
# fourth, make a note of the minimum sample size to achieve your desired power
plot_data %>%
filter(power > 0.80) %>%
top_n(-1, n) %>%
pull(n) -> min_n
# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) +
geom_smooth(method = "loess", se = FALSE) +
geom_vline(xintercept = min_n)

ggplot to show confidence intervals from bootstrapping curve fitting

Thanks to the help I got here, I was able to get a spaghetti plot of curve fits using bootstrapping. I am trying to derive confidence bands from these fitted models. I've had no luck getting something like
quants <- apply(fitted_boot, 1, quantile, c(0.025, 0.5, 0.975))
to work with the following:
library(dplyr)
library(broom)
library(ggplot2)
xdata <- c(-35.98, -34.74, -33.46, -32.04, -30.86, -29.64, -28.50, -27.29, -26.00,
-24.77, -23.57, -22.21, -21.19, -20.16, -18.77, -17.57, -16.47, -15.35,
-14.40, -13.09, -11.90, -10.47, -9.95,-8.90,-7.77,-6.80, -5.99,
-5.17, -4.21, -3.06, -2.29, -1.04)
ydata <- c(-4.425, -4.134, -5.145, -5.411, -6.711, -7.725, -8.087, -9.059, -10.657,
-11.734, NA, -12.803, -12.906, -12.460, -12.128, -11.667, -10.947, -10.294,
-9.185, -8.620, -8.025, -7.493, -6.713, -6.503, -6.316, -5.662, -5.734, -4.984,
-4.723, -4.753, -4.503, -4.200)
data <- data.frame(xdata,ydata)
x_range <- seq(min(xdata), max(xdata), length.out = 1000)
fitted_boot <- data %>%
bootstrap(100) %>%
do({
m <- nls(ydata ~ A*cos(2*pi*((xdata-x_0)/z))+M, ., start=list(A=4,M=-7,x_0=-10,z=30))
f <- predict(m, newdata = list(xdata = x_range))
data.frame(xdata = x_range, .fitted = f)
} )
ggplot(data, aes(xdata, ydata)) +
geom_line(aes(y=.fitted, group=replicate), fitted_boot, alpha=.1, color="blue") +
geom_point(size=3) +
theme_bw()
I thought perhaps geom_ribbon() would be a nice way to go, but I just don't know where to go from here.
Thank you to Axeman for helping on the other post!
One approach would be to calculate the confidence interval at each x-value and then just plot that. Here, I am using the first value outside of the 2.5th percentile and the 97.5th percentiles, though you could adjust the code as needed.
First, I change to group_by the xdata locations (instead of replicates). Then, I arrange by the .fitted values so that I can slice out the values I want (the first outside the percentile cutoffs). Finally, I tag them with which bound I am getting (they always go lower then upper because we sorted).
forConfInt <-
fitted_boot %>%
ungroup() %>%
group_by(xdata) %>%
arrange(.fitted) %>%
slice(c(floor(0.025 * n() )
, ceiling(0.975 * n() ) ) ) %>%
mutate(range = c("lower", "upper"))
This gives:
replicate xdata .fitted range
<int> <dbl> <dbl> <chr>
1 9 -35.98000 -4.927462 lower
2 94 -35.98000 -4.249348 upper
3 9 -35.94503 -4.927248 lower
4 94 -35.94503 -4.257776 upper
5 9 -35.91005 -4.927228 lower
6 94 -35.91005 -4.266334 upper
7 9 -35.87508 -4.927401 lower
8 94 -35.87508 -4.275020 upper
9 9 -35.84010 -4.927766 lower
10 94 -35.84010 -4.283836 upper
# ... with 1,990 more rows
And we can then just add an additional line to the ggplot call:
ggplot(data, aes(xdata, ydata)) +
geom_line(aes(y=.fitted, group=replicate), fitted_boot, alpha=.1, color="blue") +
# Added confidence interval:
geom_line(aes(y=.fitted, group=range), forConfInt, color="red") +
geom_point(size=3) +
theme_bw()
Gives this plot:

Plot regression coefficient with confidence intervals

Suppose I have 2 data frames, one for 2015 and one for 2016. I want to run a regression for each data frame and plot one of the coefficient for each regression with their respective confidence interval. For example:
set.seed(1020022316)
library(dplyr)
library(stargazer)
df16 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.5 * x1 + 2 * t + e) %>%
select(-e)
df15 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.75 * x1 + 2.5 * t + e) %>%
select(-e)
lm16 <- lm(y ~ x1 + t, data = df16)
lm15 <- lm(y ~ x1 + t, data = df15)
stargazer(lm15, lm16, type="text", style = "aer", ci = TRUE, ci.level = 0.95)
I want to plot t=1.558, x=2015, and t=2.797, x=2016 with their respective .95 CI. What is the best way of doing this?
I could do it 'by hand', but I hope there is a better way.
library(ggplot2)
df.plot <-
data.frame(
y = c(lm15$coefficients[['t']], lm16$coefficients[['t']]),
x = c(2015, 2016),
lb = c(
confint(lm15, 't', level = 0.95)[1],
confint(lm16, 't', level = 0.95)[1]
),
ub = c(
confint(lm15, 't', level = 0.95)[2],
confint(lm16, 't', level = 0.95)[2]
)
)
df.plot %>% ggplot(aes(x, y)) + geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
geom_hline(aes(yintercept=0), linetype="dashed")
Best: The figure quality (looks nice), code elegance, easy to expand (more than 2 regressions)
This is a bit too long for a comment, so I post it as a partial answer.
It is unclear from your post if your main problem is to get the data into the right shape, or if it is the plotting itself. But just to follow up on one of the comments, let me show you how to do run several models using dplyr and broom that makes plotting easy. Consider the mtcars-dataset:
library(dplyr)
library(broom)
models <- mtcars %>% group_by(cyl) %>%
do(data.frame(tidy(lm(mpg ~ disp, data = .),conf.int=T )))
head(models) # I have abbreviated the following output a bit
cyl term estimate std.error statistic p.value conf.low conf.high
(dbl) (chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
4 (Intercept) 40.8720 3.5896 11.39 0.0000012 32.752 48.99221
4 disp -0.1351 0.0332 -4.07 0.0027828 -0.210 -0.06010
6 (Intercept) 19.0820 2.9140 6.55 0.0012440 11.591 26.57264
6 disp 0.0036 0.0156 0.23 0.8259297 -0.036 0.04360
You see that this gives you all coefficients and confidence intervals in one nice dataframe, which makes plotting with ggplot easier. For instance, if your datasets have identical content, you could add a year identifier to them (e.g. df1$year <- 2000; df2$year <- 2001 etc), and bind them together afterwards (e.g. using bind_rows, of you can use bind_rows's .id option). Then you can use the year identifer instead of cyl in the above example.
The plotting then is simple. To use the mtcars data again, let's plot the coefficients for disp only (though you could also use faceting, grouping, etc):
ggplot(filter(models, term=="disp"), aes(x=cyl, y=estimate)) +
geom_point() + geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
To use your data:
df <- bind_rows(df16, df15, .id = "years")
models <- df %>% group_by(years) %>%
do(data.frame(tidy(lm(y ~ x1+t, data = .),conf.int=T ))) %>%
filter(term == "t") %>%
ggplot(aes(x=years, y=estimate)) + geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
Note that you can easily add more and more models just by binding more and more data to the main dataframe. You can also easily use faceting, grouping or position-dodgeing to adjust the look of the corresponding plot if you want to plot more than one coefficient.
This is the solution I have right now:
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
df.plot <- lapply(list(lm15,lm16), gen_df_plot, coef_name = 't')
df.plot <- data.table::rbindlist(df.plot)
df.plot$x <- as.factor(c(2015, 2016))
df.plot %>% ggplot(aes(x, y)) + geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") + theme_bw()
I don't love it, but it works.
Here is what might be generalized code. I have made a change to how "x" is defined so that you don't have to worry about alphabetic reordering of the factor.
#
# Paul Gronke and Paul Manson
# Early Voting Information Center at Reed College
#
# August 27, 2019
#
#
# Code to plot a single coefficient from multiple models, provided
# as an easier alternative to "coefplot" and "dotwhisker". Some users
# may find those packages more capable
#
# Code adapted from https://stackoverflow.com/questions/35582052/plot-regression-coefficient-with-confidence-intervals
# gen_df_plot function will create a tidy data frame for your plot
# Currently set up to display 95% confidence intervals
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
# Populate the data frame with a list of your model results.
df.plot <- lapply(list(model1, # List your models here
model2),
gen_df_plot,
coef_name = 'x1') # Coefficient name
# Convert the list to a tidy data frame
df.plot <- data.table::rbindlist(df.plot)
# Provide the coefficient or regression labels below, in the
# order that you want them to appear. The "levels=unique(.)" parameter
# overrides R's desire to order the factor alphabetically
df.plot$x <- c("Group 1",
"Group 2") %>%
factor(., levels = unique(.),
ordered = TRUE)
# Create your plot
df.plot %>% ggplot(aes(x, y)) +
geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") +
theme_bw() +
ggtitle("Comparing Coefficients") +
ylab("Coefficient Value")```

How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")

I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.
Example:
library(plyr)
library(ggplot2)
set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
val = c(1:10, 100, 101))
df
# fact val
# 1 b 1
# 2 a 2
# 3 a 3
# 4 a 4
# 5 b 5
# 6 a 6
# 7 b 7
# 8 b 8
# 9 b 9
# 10 a 10
# 11 b 100
# 12 a 101
by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))
by.fact.df
# fact 0% 25% 50% 75% 100%
# 1 a 2 3.25 5.0 9.00 101
# 2 b 1 5.50 7.5 8.75 100
# What I can do...with faults (a) and (b) above
ggplot(by.fact.df,
aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity")
# What I want...
ggplot(df, aes(x = fact, y = val)) +
geom_boxplot()
What I can do...with faults (a) and (b) mentioned above:
What I would like to obtain, but still leverage pre-computation via plyr (or other method):
Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them as geom_point()?
Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them rather than having ggplot2 to the calculations.
UPDATE
I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:
df %>%
group_by(fact) %>%
do(ldply(boxplot.stats(.$val), data.frame))
Source: local data frame [6 x 3]
Groups: fact
fact .id X..i..
1 a stats 2
2 a stats 4
3 a stats 10
4 a stats 13
5 a stats 16
6 a n 9
Here's my answer, using built-in functions quantile and boxplot.stats.
geom_boxplot does the calcualtions for boxplot slightly differently than boxplot.stats. Read ?geom_boxplot and ?boxplot.stats to understand my implementation below
#Function to calculate boxplot stats to match ggplot's implemention as in geom_boxplot.
my_boxplot.stats <-function(x){
quantiles <-quantile(x, c(0, 0.25, 0.5, 0.75, 1))
labels <-names(quantile(x))
#replacing the upper whisker to geom_boxplot
quantiles[5] <-boxplot.stats(x)$stats[5]
res <-data.frame(rbind(quantiles))
names(res) <-labels
res$out <-boxplot.stats(x)$out
return(res)
}
Code to calculate the stats and plot it
library(dplyr)
df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>%
ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`,
upper = `75%`, ymax = `100%`)) +
geom_boxplot(stat = "identity") + geom_point()
To get the correct statistics, you have to do some more calculations than just finding the quantiles. The geom_boxplot function with stat = "identity" does not draw the outliers. So you have to calculate the statistics without the outliers and then use geom_point to draw the outliers seperately. The following function (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:
box.df <- df %>% group_by(fact) %>% do({
stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
iqr <- diff(stats[c(2, 4)])
coef <- 1.5
outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
if (any(outliers)) {
stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm=TRUE)
}
outlier_values = .$val[outliers]
if (length(outlier_values) == 0) outlier_values <- NA_real_
res <- as.list(t(stats))
names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
res$out <- outlier_values
as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
##
## fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1 a 2 3.25 5.0 9.00 10 101
## 2 b 1 5.50 7.5 8.75 9 100
ggplot(box.df, aes(x = fact, y = out, middle = median,
ymin = lower.whisker, ymax = upper.whisker,
lower = lower.hinge, upper = upper.hinge)) +
geom_boxplot(stat = "identity") +
geom_point()

Resources