Maximum likelihood with order statistics in R

I'm testing the maximum likelihood method when only the maximum value of a sample is provided. I'm assuming the sample comes from a Gaussian distribution.
First I generate 10,000 random numbers with mean = 2.45 and sd = 1:
library(tidyverse)
set.seed(91)
n <- 10000
mean <- 2.45
sd <- 1
random_numbers <- rnorm(n, mean, sd)
Then I extract the max value, and I assume that's the only value I know.
maximum <- random_numbers[which.max(random_numbers)]
Then I evaluate the density of that maximum for a grid of candidate mean values. The density of the largest of n i.i.d. draws is f_max(x) = n * F(x)^(n-1) * f(x), which gives:
mean_space <- seq(0, 10, by = 0.01)
densities <- n * pnorm(maximum, mean_space, sd)^(n - 1) * dnorm(maximum, mean_space, sd)
df <- data.frame(x = mean_space, y = densities)
g <- ggplot(df, aes(x = x, y = y)) +
geom_line() +
geom_vline(xintercept = mean)
print(g)
df %>% filter(y == max(y))
However, I'm getting density values higher than 1 which I think are not correct.
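For what it's worth, the same order-statistic likelihood can also be maximized numerically on the log scale, which avoids working with very small products; a minimal sketch reusing n, sd, maximum, and the 0-10 search range from above:
# log-likelihood of observing `maximum` as the largest of n N(mu, sd) draws
loglik_max <- function(mu) {
  log(n) + (n - 1) * pnorm(maximum, mu, sd, log.p = TRUE) +
    dnorm(maximum, mu, sd, log = TRUE)
}
# maximize over the same range as mean_space
optimize(loglik_max, interval = c(0, 10), maximum = TRUE)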

How to estimate the density (empirical pdf) from quantiles (the empirical CDF) in R

Say I have an unknown density a.
All I know is a probability grid (cum_probs) of quantiles (quants).
How can I generate random samples from the unknown density?
This is what I have so far.
I am giving rejection sampling a try, but I am not tied to this method. Here I fit a polynomial (degree 6) to the quantiles. The purpose of this is to convert the discrete quantiles into a smooth, continuous function, which gives me an empirical CDF. I then use rejection sampling to get actual samples from the CDF. Is there a convenient way in R to turn samples from the CDF into samples from the density, or did I go about this in a convoluted way when there is a better alternative?
# unknown and probably not normal, but I use rnorm here because it is easy
a <- c(exp(rnorm(200, 5, .8)))
cum_probs <- seq(0.05, 0.95, 0.05)
quants <- quantile(a, cum_probs)
df_quants <- tibble::tibble(cum_probs, quants)
fit <- lm(quants ~ poly(cum_probs, 6), df_quants)
df_quants$fit <- predict(fit, df_quants)
p <- df_quants %>%
ggplot(aes(x = cum_probs, y = quants))+
geom_line(aes(y = quants), color = "black", size = 1) +
geom_line(aes(y = fit), color = "red", size = 1)
CDF
count <- 1
accept <- c()
X <- runif(50000, 0, 1)
U <- runif(50000, 0, 1)
# evaluate the fitted polynomial CDF at a probability value
estimate <- function(x){
  predict(fit, data.frame(cum_probs = x))
}
# rejection sampling against the fitted curve
while(count <= 50000 & length(accept) < 40000){
  test_u <- U[count]
  test_x <- estimate(X[count]) / (1000 * dunif(X[count], 0, 1))
  if(test_u <= test_x){
    accept <- c(accept, X[count])
  }
  count <- count + 1
}
p2 <- tibble::tibble(V1 = accept) %>%
  ggplot(aes(x = V1)) +
  geom_histogram(bins = 45)
CDF Samples
I don't think rejection sampling is needed; with a B-spline fit I was able to generate sensible samples via the inverse transform method, but I also needed a higher-resolution grid of quantiles. The tails are a little off.
The assumption I am making here is that a B-spline fit to a tight grid of quantiles approximates the inverse CDF. Once this curve is fit, I can simply push random uniforms U[0,1] through it.
library(splines2)
a <- c(exp(rnorm(200, 5, .8)))
cum_probs <- seq(0.01, 0.99, 0.01)
quants <- quantile(a, cum_probs)
df_quants <- tibble::tibble(cum_probs, quants)
fit_spline <- lm(quants ~ bSpline(cum_probs, df = 9), df_quants)
df_quants$fit_spline <- predict(fit_spline, df_quants)
estimate <- function(x){
new_x <- predict(fit_spline, data.frame(cum_probs = c(x)))
return(new_x)
}
e <- runif(10000, 0, 1)
y <- estimate(e)
df_density <- tibble(y)
df_densitya <- tibble(a)
py <- df_density %>%
ggplot(aes(x = y)) +
geom_histogram()
pa <- df_densitya %>%
ggplot(aes(x = a)) +
geom_histogram(bins = 45)
original density
Inverse Transformation samples
summary stats
original dist a
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.36 80.84 145.25 195.72 241.22 1285.24
generated from quantiles y
Min. 1st Qu. Median Mean 3rd Qu. Max.
28.09 81.78 149.53 189.07 239.62 667.27
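As a rough alternative sketch (reusing cum_probs and quants from the spline example above): base R's approxfun() can linearly interpolate the quantile grid into an approximate inverse CDF directly, with no model fitting, and random uniforms can then be pushed through it.
# linear interpolation of the quantile grid as an approximate inverse CDF
inv_cdf <- approxfun(cum_probs, quants, rule = 2)
y_approx <- inv_cdf(runif(10000))
summary(y_approx)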

Plotting a graph with sample sizes and power estimates

I have simulated a linear model 1000 times using randomly generated height and weight values, and randomly assigned each participant to a treatment or non-treatment group (a factor with levels 1 and 0). Let's say the model was:
lm(bmi~height + weight + treatment, data = df)
I am now struggling for the following:
The model now needs to cycle through sample sizes from 300 to 500 in steps of 10, with 1000 replications at each sample size, and store the proportion of simulated experiments with p-values less than 0.05, in order to estimate the power to detect a change of 0.5 in bmi between the two treatment groups at the 5% significance level.
After doing the above, I then need to create a figure with the sample sizes on the x-axis and the estimated power on the y-axis, and mark the smallest sample size that achieves 80% power with a distinct color.
Any ideas how and where to go from here?
Thanks,
Chris
I would do it something like this:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
# use assumed data generating process to simulate data and add error
data <- tibble(height = rnorm(n, 69, 0.1),
weight = rnorm(n, 197.8, 1.9),
treatment = sample(c(0, 1), n, replace = TRUE),
error = rnorm(n, sd = 1.75),
bmi = 703 * weight / height^2 + 0.5 * treatment + error)
# model the data
mdl <- lm(bmi ~ height + weight + treatment, data = data)
# extract p-value for treatment effect
summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
p_values <- replicate(r, do_simulate(n))
power <- mean(p_values < alpha)
return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
power = mx[2, ])
# fourth, make a note of the minimum sample size to achieve your desired power
plot_data %>%
filter(power > 0.80) %>%
top_n(-1, n) %>%
pull(n) -> min_n
# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) +
geom_smooth(method = "loess", se = FALSE) +
geom_vline(xintercept = min_n)
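The question also asks for the smallest sample size reaching 80% power to be shown in a distinct color; one way to do that, sketched here on top of the plot_data and min_n objects defined above, is to overlay a highlighted point and reference lines:
ggplot(plot_data, aes(x = n, y = power)) +
  geom_line() +
  geom_point(data = filter(plot_data, n == min_n), color = "red", size = 3) +
  geom_hline(yintercept = 0.80, linetype = "dashed") +
  geom_vline(xintercept = min_n, color = "red", linetype = "dashed")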

Difference between two geom_smooth() lines

I made a plot for my data and would now like to have the difference in y for every x that was estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how can I get the differences for the following plot (data below)?
EDIT
Two suggestions were made but I still don't know how to calculate the differences.
First suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
pb[["data"]][[1]]
That approach kind of works, but the two groups don't share the same x values. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group, so I cannot calculate the difference in y at -3.21318853 between the two groups.
Second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, hence gam is used by default. I am not familiar with gam; can anyone give a short answer on how to calculate the difference between the two lines, given what's described above?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi and welcome to Stack Overflow.
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
Output:
As I mentioned in the comments above, you really are better off doing this outside of ggplot and instead do it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit). For this model we also need to include g as a parametric effect: the group-specific smooths are both centred about 0, so the group means have to enter through the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
g = c(0,1)))
pdat <- transform(pdat, g = factor(g))
Then we use this prediction data to generate the Xp matrix, which maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next, some code to identify which rows and columns in xp belong to the smooths for the respective levels of g. As there are only two levels and only a single smooth term in the model, this is entirely trivial here, but for more complex models it is needed, and it is important to get the smooth component names right for the grepl() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for other groups
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference and hence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here with the vcov() call we're using the empirical Bayesian covariance matrix but not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via argument unconditional = TRUE.
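For reference, that correction corresponds to the unconditional argument of vcov() for mgcv models; a minimal illustration using the m and X objects defined above:
## covariance matrix that also accounts for smoothness-parameter uncertainty
Vb <- vcov(m, unconditional = TRUE)
se_u <- sqrt(rowSums((X %*% Vb) * X))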
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
geom_line()
This produces
This is consistent with an assessment showing that the model with group-specific smooths doesn't provide a substantially better fit than a model with different group means but only a single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
r$> AIC(m0, m)
df AIC
m0 9.68355 30277.93
m 14.70675 30285.02
r$> anova(m0, m, test = 'F')
Analysis of Deviance Table
Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 4990.1 124372
2 4983.9 124298 6.1762 73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
unconditional = FALSE) {
xp <- predict(model, newdata = newdata, type = 'lpmatrix')
c1 <- grepl(f1, colnames(xp))
c2 <- grepl(f2, colnames(xp))
r1 <- newdata[[var]] == f1
r2 <- newdata[[var]] == f2
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
## zero out cols of X related to splines for other groups
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
dif <- X %*% coef(model)
se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
data.frame(pair = paste(f1, f2, sep = '-'),
diff = dif,
se = se,
upper = upr,
lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
out)
ggplot(out, aes(x = x, y = diff)) +
geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

Using regression parameter as mean in rnorm

I want to test a model where the distribution of a random variable, assumed normal, is conditional on the regime of another random variable that switches state according to a Markov chain. The first step would be:
Assuming the simple linear model:
lm(y~x, data=data)
I want to find the parameters of the distribution assuming that x switches regime.
For example:
mkt.bull <- rnorm(150, 2, 1.5)
mkt.bear <- rnorm(150, -1, 2.5)
x <- c(mkt.bear,mkt.bull)
portfolio.bull <- rnorm(150, 1.75, 1.6)
portfolio.bear <- rnorm(150, -0.5, 2.3)
y <- c(portfolio.bear,portfolio.bull)
In the example above, x can be modelled as a Markov switching model (msmFit) with two states, one bull and one bear. Instead of approaching the problem with a lm,
lm(y~x)
since the two series are clearly non-linear, I want to run a regression where the parameters are conditional on the regime. This can be done with maximum likelihood, but the first step is to define the distribution of y as:
y_i | x, S_t ~ N(alpha + beta_{i,S_t}, sigma^2)
How can I code the above formula? I guess this cannot be done using rnorm. Is there another way?
Thanks
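As a small illustration of the formula itself: rnorm() is vectorized over its mean and sd arguments, so once a regime path is known (or simulated), a regime-dependent normal can be drawn in a single call. The parameter values below are purely illustrative, chosen to mirror the data-generating code above; they are not estimates.
# hypothetical regime path: 1 = bear, 2 = bull
S <- sample(1:2, 300, replace = TRUE)
alpha <- -0.5                 # mean in the bear regime
beta  <- c(0, 2.25)           # regime-specific shift added to alpha
sigma <- c(2.3, 1.6)          # regime-specific standard deviation
y_sim <- rnorm(300, mean = alpha + beta[S], sd = sigma[S])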
Data
Here I prepared and visualized the data.
# Load packages
library(tidyverse)
library(rjags)
# Set seed for reproducibility
set.seed(199)
mkt.bull <- rnorm(150, 2, 1.5)
mkt.bear <- rnorm(150, -1, 2.5)
x <- c(mkt.bear,mkt.bull)
portfolio.bull <- rnorm(150, 1.75, 1.6)
portfolio.bear <- rnorm(150, -0.5, 2.3)
y <- c(portfolio.bear,portfolio.bull)
# Create example data frame
dat <- data.frame(x = x, y = y, regime = c(rep("bear", 150), rep("bull", 150)),
stringsAsFactors = FALSE)
# Plot the sample distribution
dat$regime <- factor(dat$regime, levels = c("bear", "bull"))
# Create a plot
ggplot(dat, aes(x = y, color = regime)) +
geom_density()
There are two regimes, bear and bull. The y for these regimes are both normally distributed. It seems like the OP wants to estimate the mean and standard deviation of y conditioned on these states.
Maximum Likelihood
Here is one way to use maximum likelihood to estimate the parameters using the stats4 package.
# Load the stats4 package
library(stats4)
# Split the data
y_bull <- dat %>% filter(regime %in% "bull") %>% pull("y")
y_bear <- dat %>% filter(regime %in% "bear") %>% pull("y")
# Define the negative log-likelihood functions (to be minimized by mle)
LogLike_bull <- function(Mean, Sigma){
R <- suppressWarnings(dnorm(y_bull, Mean, Sigma))
return(-sum(log(R)))
}
LogLike_bear <- function(Mean, Sigma){
R <- suppressWarnings(dnorm(y_bear, Mean, Sigma))
return(-sum(log(R)))
}
mle(minuslogl = LogLike_bull, start = list(Mean = 1, Sigma = 1))
# Call:
# mle(minuslogl = LogLike_bull, start = list(Mean = 1, Sigma = 1))
#
# Coefficients:
# Mean Sigma
# 1.703099 1.482619
mle(minuslogl = LogLike_bear, start = list(Mean = 1, Sigma = 1))
# Call:
# mle(minuslogl = LogLike_bear, start = list(Mean = 1, Sigma = 1))
#
# Coefficients:
# Mean Sigma
# -0.616106 2.340852
The parameters for bull are mean = 1.703 and standard deviation = 1.483. The parameters for bear are mean = -0.616 and standard deviation = 2.341. They are close to the true values.
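As a quick cross-check: for a normal sample the closed-form MLEs are the sample mean and the uncorrected sample standard deviation, so the output of mle() can be verified directly (shown here for the bull regime, reusing y_bull from above).
# closed-form normal MLEs for the bull regime; should match the mle() fit
c(mean = mean(y_bull),
  sigma = sqrt(mean((y_bull - mean(y_bull))^2)))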
Bayesian Analysis
Here is an attempt to use Bayesian analysis to solve this question with JAGS and the rjags package.
I ran a Bayesian model to estimate alpha (the mean of y in the bear regime), beta (the difference in the mean of y between bear and bull), and s (the regime-specific standard deviations of y), using 10000 iterations.
# Define the Bayesian model
model <- "model{
for(i in 1:length(Y)) {
Y[i] ~ dnorm(Mean[i], s[X[i]]^(-2))
Mean[i] <- alpha + beta[X[i]]
}
alpha ~ dnorm(0, 5^(-2))
beta[1] <- 0
beta[2] ~ dnorm(0, 5^(-2))
s[1] ~ dunif(0, 10)
s[2] ~ dunif(0, 10)
}"
# Compile the model
jags_model <- jags.model(
textConnection(model),
data = list(Y = dat$y, X = dat$regime),
n.chains = 3,
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10)
)
# Simulate the posterior
jags_sim <- coda.samples(model = jags_model,
variable.names = c("alpha", "beta", "s"),
n.iter = 10000)
# Plot the posterior
plot(jags_sim)
The plot shows that the chains are well mixed.
# See the summary
summary(jags_sim)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
alpha -0.614 0.19436 0.0011222 0.0027201
beta[1] 0.000 0.00000 0.0000000 0.0000000
beta[2] 2.315 0.23099 0.0013336 0.0032666
s[1] 2.369 0.13768 0.0007949 0.0010393
s[2] 1.500 0.08836 0.0005102 0.0006727
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
alpha -0.9838 -0.7471 -0.6147 -0.4857 -0.2225
beta[1] 0.0000 0.0000 0.0000 0.0000 0.0000
beta[2] 1.8582 2.1622 2.3174 2.4760 2.7586
s[1] 2.1225 2.2722 2.3632 2.4564 2.6611
s[2] 1.3368 1.4390 1.4959 1.5579 1.6813
The mean of alpha is -0.614, which is close to the actual bear value of -0.5. The mean of beta[2] is 2.315; adding alpha and beta[2] gives 1.701, while the actual bull value is 1.75. We also get s[1] and s[2] as 2.369 and 1.500, which are close to 2.3 and 1.6, respectively.
Bootstrapping
Here is another approach, using the bootstrap to estimate the regime-specific means and standard deviations, based on the infer package.
# Load the infer package
library(infer)
set.seed(199)
# Split the data
dat_bull <- dat %>% filter(regime %in% "bull")
dat_bear <- dat %>% filter(regime %in% "bear")
# Calculate the values in bull
dat_bull2 <- dat_bull %>%
  # Specify the response variable
  specify(response = y) %>%
  # Generate 10000 bootstrap samples
  generate(reps = 10000, type = "bootstrap")
summary_bull <- dat_bull2 %>%
  summarise(mean_y = mean(y), sd_y = sd(y))
# Calculate the values in bear
dat_bear2 <- dat_bear %>%
  # Specify the response variable
  specify(response = y) %>%
  # Generate 10000 bootstrap samples
  generate(reps = 10000, type = "bootstrap")
summary_bear <- dat_bear2 %>%
  summarise(mean_y = mean(y), sd_y = sd(y))
Now we can print the results. They are all similar to the true values.
# The mean of bull
mean(summary_bull$mean_y)
# [1] 1.702693
# The standard deviation of bull
mean(summary_bull$sd_y)
# [1] 1.480158
# The mean of bear
mean(summary_bear$mean_y)
# [1] -0.6165585
# The standard deviation of bear
mean(summary_bear$sd_y)
# [1] 2.337042

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups," and "r threshold minimize misclassification," but my google-foo seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So Bo = 1.1857435 and B1 = -0.0911762
From Wikipedia, I see that F(x) = 1/(1+e^-(Bo + B1x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - Bo)/B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
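As a sanity check on these numbers: with equal group sizes and equal standard deviations, the population-level optimal cutoff is simply the midpoint of the two group means, which is consistent with the sample-based threshold found above (and with the logistic-regression value further below).
# theoretical optimal threshold for two equal-size, equal-sd Gaussian groups
(5 + 10) / 2
# [1] 7.5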
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial
The dose.p() function in MASS will do the work for me given a glm fit
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
x <- (log(p / (1 - p)) - b0)/b1
return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!
