How to automatically fit data with several normal cumulative distribution functions in R - r

I have several data sets (hundreds of them, actually) that I know can be fitted with a sum of several normal cumulative distribution functions (see here).
Here is one example of such a data set, with two cumulative distribution functions:
library(pracma)      # for erf()
library(minpack.lm)  # for nlsLM()
library(ggplot2)
x <- seq(1, 1000, length.out = 50)
k1 <- 0.5
mu1 <- 500
sigma1 <- 100
y1 <- k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1)))
k2 <- 0.5
mu2 <- 300
sigma2 <- 50
y2 <- k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2)))
my.df <- data.frame(x, y = y1 + y2, type = "data")
ggplot(my.df, aes(x, y)) + geom_line()
Now I want to fit those curves, so I use nlsLM (the Levenberg-Marquardt variant of nls from minpack.lm):
model <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1))) +
                   k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2))),
               start = c(mu1 = 500, sigma1 = 50, k1 = 0.5,
                         mu2 = 300, sigma2 = 50, k2 = 0.5),
               data = my.df,
               control = nls.lm.control(maxiter = 500))
tmp <- data.frame(x, y = predict(model), type = "fit")
combined <- rbind(my.df, tmp)
ggplot(combined, aes(x, y, colour = type, shape = type)) + geom_line() + geom_point()
Here is what I get:
The fit is great. However, I helped nls a lot:
I gave it a perfectly fitting curve as input, not raw data
I told it my curve was the sum of two functions (not one or three)
And I almost gave it the solution by providing very close starting parameter values
To address the first point, I fit three models, with one, two, and three component functions, and choose the one with the minimum deviance.
For the second point: unfortunately, across my hundreds of data sets the parameters change quite a bit, and I get disappointing results when I give the same starting parameters to all sets.
Is there a better way to select those starting values?
I have heard of the mixtools package, but I'm not sure it works for CDFs (cumulative distribution functions).
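For reference, here is a minimal sketch of the deviance comparison described above, fitting one- and two-component models to the simulated data (the three-component case follows the same pattern; the single-component starting values below are arbitrary guesses chosen only for illustration):
# Hedged sketch: compare one- and two-component fits by deviance
# (uses x, my.df, erf() from pracma and nlsLM() from minpack.lm defined above)
fit1 <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1))),
              start = c(mu1 = 400, sigma1 = 100, k1 = 1),
              data = my.df,
              control = nls.lm.control(maxiter = 500))
fit2 <- nlsLM(y ~ k1 * (1 + erf((x - mu1) / (sqrt(2) * sigma1))) +
                  k2 * (1 + erf((x - mu2) / (sqrt(2) * sigma2))),
              start = c(mu1 = 500, sigma1 = 50, k1 = 0.5,
                        mu2 = 300, sigma2 = 50, k2 = 0.5),
              data = my.df,
              control = nls.lm.control(maxiter = 500))
fits <- list(one = fit1, two = fit2)   # a three-component fit would be added here
sapply(fits, deviance)                 # residual sum of squares for each model
best <- fits[[which.min(sapply(fits, deviance))]]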

Related

How to add an equation to my plot using text function in R? [duplicate]

This question already has answers here:
Add regression line equation and R^2 on graph
(10 answers)
Closed 2 years ago.
So I am trying to add a simple equation to my graph using the text function in R, but for some reason it's not working.
This is how I generated my random simulated data:
a <- 5
b <- 7
sigma <- 3
x <- runif(100, min = 0, max = 50)
n <- length(x)
y <- a + b * x + rnorm(n, 0, sigma)
fakeData <- data.frame(x, y)
Then I plugged in the data frame and added a regression line like so.
ggplot(data = fakeData, aes(y, x)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(x = "random uniform distribution from range 0 - 50.",
y = "Linear model points based on the Uniform dist",
title = "Fake data simulation")
Which results in a graph like this
Now all I want to do is add the equation of the line to the graph using the text function, but for some reason R just keeps throwing errors at me. I tried googling the issue but wasn't able to resolve it. I also tried this method, but the documentation is so sparse that I have no idea what the author was doing in the function.
Try this with annotate(), but you first have to obtain the model coefficients using lm():
library(ggplot2)
#Code
a <- 5
b <- 7
sigma <- 3
x <- runif(100, min = 0, max = 50)
n <- length(x)
y <- a + b * x + rnorm(n, 0, sigma)
fakeData <- data.frame(x, y)
#lm
mod <- lm(y ~ x, data = fakeData)
lab <- paste0(round(coef(mod)[1], 3), '+', round(coef(mod)[2], 3), 'x')
#plot
ggplot(data = fakeData, aes(y, x)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "random uniform distribution from range 0 - 50.",
       y = "Linear model points based on the Uniform dist",
       title = "Fake data simulation") +
  annotate(geom = 'text', x = 50, y = 30, label = lab, fontface = 'bold')
Output:
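Not part of the original answer, but since the linked duplicate also puts R^2 on the graph, a hedged variation of the same label construction (summary(mod)$r.squared is base R) would be:
r2 <- round(summary(mod)$r.squared, 3)
lab2 <- paste0(round(coef(mod)[1], 3), '+', round(coef(mod)[2], 3), 'x, R^2 = ', r2)
# then pass lab2 to annotate() exactly as above:
# annotate(geom = 'text', x = 50, y = 30, label = lab2, fontface = 'bold')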

Difference between two geom_smooth() lines

I made a plot of my data and now I would like to have the difference in y, for every x, between the curves estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how to get the differences for the following plot (data below):
EDIT
Two suggestions were made but I still don't know how to calculate the differences.
First suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
pb[["data"]][[1]]
That approach kind of works, but the data doesn't use the same x values for the two groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group; hence, I cannot calculate the difference in y at -3.21318853 between the two groups.
The second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, so gam is used by default. I am not familiar with gam; can anyone provide a short answer on how to calculate the difference between the two lines, considering the things just described?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi and welcome to Stack Overflow.
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
Output:
As I mentioned in the comments above, you really are better off doing this outside of ggplot, instead using a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit). For this model we also need to include g as a parametric effect: the group-specific smooths are both centred about 0, so the group means have to enter through the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
                             g = c(0, 1)))
pdat <- transform(pdat, g = factor(g))
Then we use this prediction data to generate the Xp matrix, a matrix that maps values of the covariates to values of the basis expansions for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next, some code to identify which rows and columns of xp belong to the smooths for the respective levels of g. With only two levels and a single smooth term in the model this is entirely trivial, but for more complex models it is needed, and it is important to get the smooth component names right for the grepl() calls to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for other lochs
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here, with the vcov() call, we're using the empirical Bayes covariance matrix, not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via the argument unconditional = TRUE.
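As an aside (not part of the original walk-through), the corrected standard error at this point would use the same line with mgcv's smoothness-parameter-corrected covariance matrix, which is available here because the model was fitted with REML:
## se of difference, accounting for smoothing parameter uncertainty
se2 <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))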
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
                  dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
  geom_line()
This produces
This is consistent with an assessment showing that the model with the group-level smooths doesn't provide a substantially better fit than a model with different group means but only a single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")

r$> AIC(m0, m)
         df      AIC
m0  9.68355 30277.93
m  14.70675 30285.02

r$> anova(m0, m, test = 'F')
Analysis of Deviance Table

Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
  Resid. Df Resid. Dev     Df Deviance      F Pr(>F)
1    4990.1     124372
2    4983.9     124298 6.1762   73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
                        unconditional = FALSE) {
    xp <- predict(model, newdata = newdata, type = 'lpmatrix')
    c1 <- grepl(f1, colnames(xp))
    c2 <- grepl(f2, colnames(xp))
    r1 <- newdata[[var]] == f1
    r2 <- newdata[[var]] == f2
    ## difference rows of xp for data from comparison
    X <- xp[r1, ] - xp[r2, ]
    ## zero out cols of X related to splines for other lochs
    X[, ! (c1 | c2)] <- 0
    ## zero out the parametric cols
    X[, !grepl('^s\\(', colnames(xp))] <- 0
    dif <- X %*% coef(model)
    se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
    crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
    upr <- dif + (crit * se)
    lwr <- dif - (crit * se)
    data.frame(pair = paste(f1, f2, sep = '-'),
               diff = dif,
               se = se,
               upper = upr,
               lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
             out)
ggplot(out, aes(x = x, y = diff)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
  geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

How to calculate x-values of the convolution of two distributions?

(This question may be better suited for https://stats.stackexchange.com/, but I think my question is really about how to calculate what I want in R.)
I'm trying to add multiple distributions together, and then look at the resulting distribution. I'll illustrate my problem with a simple example using normally distributed random variables, p1 and p2.
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
Which we can plot:
library(ggplot2)
library(dplyr)   # for the pipe %>%
library(tidyr)   # for gather()
data.frame(p1, p2) %>%
  gather(key = "dist", value = "value") %>%
  ggplot(aes(value, color = dist)) + geom_density()
I can add these distributions together using convolve. Okay, so that's fine. But what I can't figure out is how to plot the summation of the distributions with the appropriate x-values. In the examples I've seen, it looks like the x-values are manually added in a way that doesn't seem "accurate", for lack of a better word. See this Example.
I can "add" them together and plot:
pdf.c <- convolve(pdf1.y, pdf2.y, type = "open")
plot(pdf.c, type="l")
My question is how to get the corresponding x-values of the new distribution. I'm sure I'm missing something from a foundational statistics point of view.
Appendix for pdf1 and pdf2:
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
pdf1.x <- density(p1)$x
pdf2.x <- density(p2)$x
pdf1.y <- density(p1)$y / sum(density(p1)$y)
pdf2.y <- density(p2)$y / sum(density(p2)$y)
df1 <- data.frame(pdf.x = pdf1.x, pdf.y = pdf1.y, dist = "1", stringsAsFactors = FALSE)
df2 <- data.frame(pdf.x = pdf2.x, pdf.y = pdf2.y, dist = "2", stringsAsFactors = FALSE)
df <- bind_rows(df1, df2)
Assume that p1 and p2 are discretized uniformly, with the same interval dx between successive x values. (I see that you have discretized p1 and p2 at random points -- that's not the same, and without thinking about it some more I don't have an answer for that case.) Let x1_k = x1_0 + (k - 1)*dx, k = 1, 2, ..., n1, be the points at which p1 is discretized, and x2_k = x2_0 + (k - 1)*dx, k = 1, 2, ..., n2, be the points at which p2 is discretized.
Each point xi_k = xi_0 + (k - 1)*dx represents the center of a bar of width dx and height pi(xi_k), i = 1, 2. Thus the mass of the bar is dx*pi(xi_k), and the total mass over all bars approaches 1 as dx approaches 0. These masses are the values which are convolved. If the discretized masses are normalized to sum to 1, then their convolution will also be normalized to 1.
To be very careful, the range over which each distribution is discretized is xi_0 - dx/2 to xi_0 + (ni - 1)*dx + dx/2. After computing the convolution, the range of the result likewise extends dx/2 below its first point and dx/2 above its last point.
The convolution has n = n1 + n2 - 1 points, namely x1_0 + x2_0 + (k - 1)*dx, k = 1, 2, ..., n1 + n2 - 1. The first point is x1_0 + x2_0 (i.e. the first point for p1 plus the first point for p2) and the last point is x1_0 + x2_0 + (n1 + n2 - 2)*dx = (x1_0 + (n1 - 1)*dx) + (x2_0 + (n2 - 1)*dx) (i.e. the last point for p1 plus the last point for p2). From this you can construct the x values corresponding to the convolution via the seq function or something like that.
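A hedged sketch of that construction for the OP's example follows; the step size dx and the grid ranges are assumptions made for illustration, and rev() is used so that convolve() computes the usual mathematical convolution:
set.seed(21)
N <- 1000
p1 <- rnorm(N, mean = 0, sd = 1)
p2 <- rnorm(N, mean = 10, sd = 1)
# discretize both densities on grids that share the same spacing dx
dx <- 0.01
x1 <- seq(-5, 5, by = dx)    # grid for p1 (x1_0 = -5)
x2 <- seq(5, 15, by = dx)    # grid for p2 (x2_0 = 5)
f1 <- density(p1, from = min(x1), to = max(x1), n = length(x1))$y
f2 <- density(p2, from = min(x2), to = max(x2), n = length(x2))$y
m1 <- f1 * dx                # bar masses, each summing to roughly 1
m2 <- f2 * dx
# the convolution has n1 + n2 - 1 points starting at x1_0 + x2_0
m.c <- convolve(m1, rev(m2), type = "open")
x.c <- seq(min(x1) + min(x2), by = dx, length.out = length(m.c))
plot(x.c, m.c / dx, type = "l")                          # back on the density scale
lines(x.c, dnorm(x.c, mean = 10, sd = sqrt(2)), lty = 2) # exact density of p1 + p2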

Log-likelihood calculation given estimated parameters

In general: I want to calculate the (log) likelihood of data N given the estimated model parameters from data O.
More specifically, I want to know if my ll_given_modPars function below exists in one of the many R packages dealing with data modeling (lme4, glmm, etc.), as shown in this abstract example (not run):
library(lme4)
o_model <- lmer(observed ~ fixed.id + (1|random.id), data = O, REML = F)
n_logLik <- ll_given_modPars(model.estimates = o_model, data = N)
The fictional example above uses a linear mixed model for simplicity, but I would eventually like to do this for a generalized linear mixed model that deals with the Poisson family or directly the negative binomial (for lme4: glmer(..., family = "poisson") or glmer.nb).
From what I can see, most packages deal with parameter estimation (great, I need that) but then compare models fitted to the same data with different combinations of fixed and random effects using anova or something to that effect, which is not what I want to do.
I want the log likelihood for the same parameters on different data.
The main attempts made:
After not finding a function that seems to do this, I thought of 'simply' tweaking the lme4 code for my purposes: it calculates the log-likelihood for parameters given the data, so I thought I could use the same framework but, instead of having it optimize over different parameters, isolate the likelihood-calculation function and just give it the parameters and the data. Unfortunately the code is a bit above my current skills: https://github.com/lme4/lme4/blob/master/R/nbinom.R (I get a bit lost in how they use the objects over which they optimize).
I thought of doing the likelihood calculation myself, starting with a linear mixed model and then working my way up to more involved ones. But already with this example I'm having a hard time following the math, and even when using the formula as specified, the log-likelihood I obtain is still different (I don't know why; see the code in the appendix). I fear it will take me too long before I'll be able to do it for the more involved models (such as Poisson or negative binomial).
At this point I'm not sure what avenue is best to pursue and would appreciate any input you might have.
Appendix: Trying to calculate the log-likelihood (or finding a closed form approximation) based on How does lmer (from the R package lme4) compute log likelihood?. lmer (from lme4) gives a log-likelihood of -17.8 and I get -45.56
library(lme4)
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
sim.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
sim.data$EXPECT.Y <- sim.data$GROUP.EFFECT + sim.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
sim.data$OBS.Y <- rnorm(nrow(sim.data), mean=sim.data$EXPECT.Y, sigma)
model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = sim.data, REML=F)
summary(model)
toy.model.var <- VarCorr(model)
toy.model.sigma <- attr(toy.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
toy.model.tau.squared <- toy.model.var[[1]][1] # corresponds to variance of random effects
toy.model.betas <- model@beta
# left product, spread within groups
toy.data <- rbind(sim.data$OBS.Y[1:4], sim.data$OBS.Y[5:8])
toy.mean.adj <- rbind(toy.data[1,] - mean(unlist(toy.data[1,])), toy.data[2,] - mean(unlist(toy.data[2,])))
toy.mean.adj.prod1 <- prod(dnorm(unlist(toy.mean.adj[1,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.prod2 <- prod(dnorm(unlist(toy.mean.adj[2,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.final.prod <- toy.mean.adj.prod1 * toy.mean.adj.prod2
# right product, spread between groups
toy.mean.beta.adj <- rbind(mean(unlist(toy.data[1,])) - toy.model.betas, mean(unlist(toy.data[2,])) - toy.model.betas)
toy.mean.beta.adj[1,] <- toy.mean.beta.adj[1,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj[2,] <- toy.mean.beta.adj[2,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj.prod1 <- prod(dnorm(unlist(toy.mean.beta.adj[1,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.prod2 <- prod(dnorm(unlist(toy.mean.beta.adj[2,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.final.prod <- toy.mean.beta.adj.prod1 * toy.mean.beta.adj.prod2
toy.total.prod <- toy.mean.adj.final.prod * toy.mean.beta.adj.final.prod
log(toy.total.prod)
EDIT: A helpful link was provided in the comments (https://stats.stackexchange.com/questions/271903/understand-marginal-likelihood-of-mixed-effects-models). Converting my example from above, I can replicate the log-likelihood:
library(mvtnorm)
z = getME(model, "Z")
zt = getME(model, "Zt")
psi = bdiag(replicate(2, toy.model.tau.squared, simplify=FALSE))
betw = z%*%psi%*%zt
err = Diagonal(8, sigma(model)^2)
v = betw + err
dmvnorm(sim.data$OBS.Y, predict(model, re.form=NA), as.matrix(v), log=TRUE)
While I did not manage to come up with a closed-form solution for all of them, I did manage to reproduce the log-likelihoods using numerical integration. I have posted below small examples of how this works in the LMM setting (assuming normal residuals and normally distributed random effects) as well as for the GLMM with Poisson and negative-binomial responses. Note that especially the latter one tends to differ ever so slightly when you increase the sample size. My guess is that there is some rounding happening somewhere, but for my purposes the precision achieved here is good enough. I will for now accept my own answer, but if someone posts a closed form for the Poisson or the negative binomial I will happily accept your answer :)
library(lme4)
library(mvtnorm)
################################################################################
# LMM numerical integration
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
normal.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
normal.data$EXPECT.Y <- normal.data$GROUP.EFFECT + normal.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
normal.data$OBS.Y <- rnorm(nrow(normal.data), mean=normal.data$EXPECT.Y, sigma)
normal.model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = normal.data, REML=F)
summary(normal.model)
normal.model.var <- VarCorr(normal.model)
normal.model.sigma <- attr(normal.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
normal.model.tau.squared <- normal.model.var[[1]][1] # corresponds to variance of random effects
normal.model.betas <- normal.model@beta
normal.group.tau <- sqrt(normal.model.tau.squared)
normal.group.sigma <- sigma(normal.model)
normal.group.beta <- predict(normal.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
p1 <- dnorm(normal.data$OBS.Y[1] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
p2 <- dnorm(normal.data$OBS.Y[2] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
p3 <- dnorm(normal.data$OBS.Y[3] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
p4 <- dnorm(normal.data$OBS.Y[4] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
p_out <- p1 * p2 * p3 * p4
p_out
}
normal.group1.integration <- integrate(integrate_group1, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
p1 <- dnorm(normal.data$OBS.Y[5] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
p2 <- dnorm(normal.data$OBS.Y[6] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
p3 <- dnorm(normal.data$OBS.Y[7] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
p4 <- dnorm(normal.data$OBS.Y[8] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
p_out <- p1 * p2 * p3 * p4
p_out
}
normal.group2.integration <- integrate(integrate_group2, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(normal.group1.integration) + log(normal.group2.integration)
#################################
# Poisson numerical integration
set.seed(13) #13
n <- 2 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
poisson.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
poisson.data$EXPECT.Y <- exp(poisson.data$GROUP.EFFECT + poisson.data$INSTANCE.EFFECT)
# now observe Y value, drawn from a Poisson distribution with rate EXPECT.Y
poisson.data$OBS.Y <- rpois(nrow(poisson.data), poisson.data$EXPECT.Y)
poisson.model <- glmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = poisson.data, family="poisson")
summary(poisson.model)
poisson.model.var <- VarCorr(poisson.model)
poisson.model.sigma <- attr(poisson.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
poisson.model.tau.squared <- poisson.model.var[[1]][1] # corresponds to variance of random effects
poisson.model.betas <- poisson.model@beta
poisson.group.tau <- sqrt(poisson.model.tau.squared)
poisson.group.sigma <- sigma(poisson.model)
poisson.group.beta <- predict(poisson.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
p1 <- dpois(poisson.data$OBS.Y[1], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
p2 <- dpois(poisson.data$OBS.Y[2], lambda = exp(poisson.group.beta[2] + x))
p3 <- dpois(poisson.data$OBS.Y[3], lambda = exp(poisson.group.beta[3] + x))
p4 <- dpois(poisson.data$OBS.Y[4], lambda = exp(poisson.group.beta[4] + x))
p_out <- p1 * p2 * p3 * p4
p_out
}
poisson.group1.integration <- integrate(integrate_group1, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
p1 <- dpois(poisson.data$OBS.Y[5], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
p2 <- dpois(poisson.data$OBS.Y[6], lambda = exp(poisson.group.beta[2] + x))
p3 <- dpois(poisson.data$OBS.Y[7], lambda = exp(poisson.group.beta[3] + x))
p4 <- dpois(poisson.data$OBS.Y[8], lambda = exp(poisson.group.beta[4] + x))
p_out <- p1 * p2 * p3 * p4
p_out
}
poisson.group2.integration <- integrate(integrate_group2, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(poisson.group1.integration) + log(poisson.group2.integration)
#############
# Negative-Binomial numerical integration
set.seed(13) #13
n <- 100 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
theta <- 0.5
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
nb.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
nb.data$EXPECT.Y <- exp(nb.data$GROUP.EFFECT + nb.data$INSTANCE.EFFECT)
# now observe Y value, drawn from a negative binomial with mean EXPECT.Y and size theta
nb.data$OBS.Y <- rnbinom(nrow(nb.data), mu = nb.data$EXPECT.Y, size = theta)
nb.model <- glmer.nb(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = nb.data)
summary(nb.model)
nb.model.var <- VarCorr(nb.model)
nb.model.sigma <- attr(nb.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
nb.model.tau.squared <- nb.model.var[[1]][1] # corresponds to variance of random effects
nb.model.betas <- nb.model@beta
nb.group.tau <- sqrt(nb.model.tau.squared)
nb.group.beta <- predict(nb.model, re.form=NA)[1:4]
nb.group.dispersion <- getME(nb.model, "glmer.nb.theta")
integration_function_generator <- function(input.obs, input.beta, input.dispersion, input.tau){
function(x){
p1 <- dnbinom(input.obs[1], mu = exp(input.beta[1] + x), size = input.dispersion) * dnorm(x, mean = 0, sd = input.tau)
p2 <- dnbinom(input.obs[2], mu = exp(input.beta[2] + x), size = input.dispersion)
p3 <- dnbinom(input.obs[3], mu = exp(input.beta[3] + x), size = input.dispersion)
p4 <- dnbinom(input.obs[4], mu = exp(input.beta[4] + x), size = input.dispersion)
p_out <- p1 * p2 * p3 * p4
p_out
}
}
nb.all.group.integrations <- c()
for(i in 1:n){
temp.obs <- nb.data$OBS.Y[(1:4)+(i-1)*4]
temp_integrate_function <- integration_function_generator(temp.obs, nb.group.beta, nb.group.dispersion, nb.group.tau)
temp.integration <- integrate(temp_integrate_function, lower = -10*nb.group.tau, upper = 10*nb.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
nb.all.group.integrations <- c(nb.all.group.integrations, temp.integration)
}
sum(log(nb.all.group.integrations))

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups," and "r threshold minimize misclassification," but my google-foo seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So B0 = 1.1857435 and B1 = -0.0911762
From Wikipedia, I see that F(x) = 1 / (1 + e^-(B0 + B1*x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - B0) / B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial
The dose.p() function in MASS will do the work for me given a glm fit
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
x <- (log(p / (1 - p)) - b0)/b1
return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!
