Simulating datasets in R for model selection - r

I made a code to simulate a dataset in R to see how backward selection works in machine learning. And I generated poly() function to write polynomial function and then wanted to choose the suitable polynomial using Cp, BIC, adjusted R^2.
The code is:
###Generating dataset
set.seed(1)
X = rnorm(100)
eps = rnorm(100)
beta0 = 3
beta1 = 2
beta2 = -3
beta3 = 0.3
Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
library(leaps)
data.full = data.frame(y = Y, x = X)
mod.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
mod.summary = summary(mod.full)
### Find the model size for best cp, BIC and adjr2
which.min(mod.summary$cp)
For cp, BIC and adjusted R^2 I get model with polynomial 3 as it should be
However, now I want to simulate 100 datasets and see in how many datasets do I get the right model. I simulated 100 datasets but now I am not getting polynomial 3 for each of the measures. And I don't quite understand what I'm doing wrong. My code for simulation is:
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
###Using polynomial terms
mod.bwd = regsubsets(data.full$y.1 ~ poly(data.full$x.1, 10, raw = T), data = data.full, nvmax = 10,
method = "backward")
bwd.summary = summary(mod.bwd)
which.min(bwd.summary$cp)
which.min(bwd.summary$bic)
which.max(bwd.summary$adjr2)
For a given subset cp, Bic, adjr2 are giving me different results. For example, using y.1 and x.1 (first dataset in simulation) gives following results:
which.min(bwd.summary$cp): 7
which.min(bwd.summary$bic): 4
which.max(bwd.summary$adjr2): 9
Can someone help me what I'm doing wrong in simulating these 100 datasets.

If I've read your code correctly you run the model on the same simulated dataset 100 times instead of all 100 simulated datasets, this should do the trick:
set.seed(42)
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
res <- lapply(1:100, function(i){
###Using polynomial terms
mod.bwd = regsubsets(data.full[[i]] ~ poly(data.full[[100+i]], 10, raw = T), data = data.full, nvmax = 10,
method = "backward")
bwd.summary = summary(mod.bwd)
c(
which.min(bwd.summary$cp),
which.min(bwd.summary$bic),
which.max(bwd.summary$adjr2)
)
})
res <- do.call(rbind, res)
With this rng-seed this gives some lines where all cirteria select the correct model.

Related

How to estimate the Kalman Filter with 'KFAS' R package, with an AR(1) transition equation and covariates?

I am using 'KFAS' package from R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = b_0 + b_1xx_t + Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t),
where xx_t are covariates. I have read this question and wrote the following code
library(KFAS)
set.seed(100)
xx <- rnorm(200)
beta0 <- 0.1
beta1 <- 0.1
eps <- rt(200, 4, 1)
y <- as.matrix(beta0 + beta1*xx + (arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(y ~ xx + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
I get the error
Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)), :
System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07
I have tried to change the initial vector to c(1, 0.5, 1, 1, 1) but it returns the same message. Does anyone know how can I do this?
Thanks!

Data is too long Error in R FlexmixNL package

I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:
n = 10000
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(x1, x2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)
modelNL <- flexmix(y~x, k =2,
model = FLXMRnlm(formula = y ~ a*x/(b+x),
family = "gaussian",
start = list(start1, start2)))
plot(x, y, col = clusters(modelNL))
and before the plot, it gives me this error:
Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long
I checked google for similar errors, but I don't quite understand what is wrong with my own code that results in this error.
As you can already tell, I am very new to R, so please explain it in the most layman terms possible. Thank you in advance.
Ironically (in the context of an error message saying data is "too long") I think the proximate cause of that error is no data argument. If you give it the data in the form of a dataframe, you still get an error but its not the same one as you are experiencing. When you plot the data, you get a rather bizarre set of values at least from a statistical distribution standpoint and it's not clear why you are trying to model this with this formula. Nonetheless, with those starting values and a dataframe argument to data, one sees results.
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x/(b+x),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
6664 13336
convergence after 20 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.436 6664 20000 0.333
Comp.2 0.564 13336 16306 0.818
'log Lik.' -91417.03 (df=7)
AIC: 182848.1 BIC: 182903.4
Most R regression functions first check for the matchng names in formulae within the data= argument. Apparently this function fails when it needs to go out to the global environment to match formula tokens.
I tried a formula suggested by the plot of the data and get convergent results:
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x*cos(x+b),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
9395 10605
convergence after 17 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.521 9395 18009 0.522
Comp.2 0.479 10605 13378 0.793
'log Lik.' -78659.85 (df=7)
AIC: 157333.7 BIC: 157389
The reduction in AIC seems huge compare to the first formula.

Calculating RSS manually with given pairs of beta0 and beta1

I am trying to manually calculate the RSS for a dataset with given pairs of beta0 and beta1. For each (beta_0,beta_1) pair of values, I need to calculate the residual sum of squares. Store it as a vector in data called RSS. Here's the code provided.
x = pinotnoir$Aroma
y = pinotnoir$Quality
fit = lm(y ~ x)
summary(fit)
b0s <- seq(0, 10, .1)
b1s <- seq(0, 4, .01)
data <- expand.grid(beta0=b0s, beta1=b1s)
Here's what I have so far. I think the residual calculation is wrong but I'm not sure how to fix it.
rows = length(b1s)
rsd <- rep(NA,rows)
for (i in 1:rows){
residual = (y - (b0s[i] + b1s[i] * x))^2
rsd[i] <- residual
}
data <- expand.grid(beta0=b0s, beta1=b1s, RSS=rsd)
Any help would be appreciated. Thanks in advance!
I am not sure this is exactly what you aim but adapting your code slightly you can get the sum of squared residuals and which betas minimizes them. (using mtcars data for the example)
mtcars
x = mtcars$drat
y = mtcars$wt
(fit = lm(y ~ x))
summary(fit)
grid_len <- 20
b0s <- seq(5, 10, length.out = grid_len)
b1s <- seq(-3, -1, length.out = grid_len)
(data <- expand.grid(beta0=b0s, beta1=b1s))
rows = nrow(data)
resids <- rep(NA,rows)
for (i in 1:rows) {
fitted <- (data$beta0[i] + (data$beta1[i] * x))
squared_resid <- (y - fitted)^2
SSR <- sum(squared_resid)
resids[i] <- SSR
cat(i, ": ", SSR, "\n")
}
data[which.min(resids), ]
fit
results:
> data[which.min(resids), ]
beta0 beta1
332 7.894737 -1.315789
> fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
7.906 -1.304

Log-likelihood calculation given estimated parameters

In general: I want to calculate the (log) likelihood of data N given the estimated model parameters from data O.
More specifically, I want to know if my ll_given_modPars function below exists in one of the may R packages dealing with data modeling (lme4, glmm, etc.) as shown in this abstract example (not run):
library(lme4)
o_model <- lmer(observed ~ fixed.id + (1|random.id), data = O, REML = F)
n_logLik <- ll_given_modPars(model.estimates = o_model, data = N)
The fictional example above is on a linear mixed model for simplicity but I would like to eventually do this in a generalized linear mixed model which deals with the Poisson family or directly the negative binomial (for lme4: glmer(..., family="poisson") or glmer.nb ).
From what I could see most packages deal with parameter estimation (great, I need that) but then compare models on the same data with different combinations of fixed and random effects using anova or something to that extent which is not what I want to do.
I want the log likelihood for the same parameters on different data.
The main attempts made:
After not finding a function which seems to be doing that I thought of 'simply' tweaking the lme4 code to my purposes: it calculates the log likelihood for parameters given the data so I thought I could use the same framework but not have it optimize over different parameters but isolate the likelihood calculation function and just give it the parameters and the data. Unfortunately the code is a bit above my current skills https://github.com/lme4/lme4/blob/master/R/nbinom.R (I get a bit lost in how they use the objects over which they optimize).
I thought of doing the likelihood calculation myself, starting with a linear mixed model and then working my way up to more involved ones. But already with this example I'm having a hard time following the math and even when using the formula as specified the obtained log-likelihood is still different (I don't know why, see code in appendix) and I fear it will take me too long before I'll be able to do it for the more involved models (such as Poisson or negative binomial)
At this point I'm not sure what avenue is best to pursue and would appreciate any input you might have.
Appendix: Trying to calculate the log-likelihood (or finding a closed form approximation) based on How does lmer (from the R package lme4) compute log likelihood?. lmer (from lme4) gives a log-likelihood of -17.8 and I get -45.56
library(lme4)
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
sim.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
sim.data$EXPECT.Y <- sim.data$GROUP.EFFECT + sim.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
sim.data$OBS.Y <- rnorm(nrow(sim.data), mean=sim.data$EXPECT.Y, sigma)
model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = sim.data, REML=F)
summary(model)
toy.model.var <- VarCorr(model)
toy.model.sigma <- attr(toy.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
toy.model.tau.squared <- toy.model.var[[1]][1] # corresponds to variance of random effects
toy.model.betas <- model#beta
# left product, spread within gropus
toy.data <- rbind(sim.data$OBS.Y[1:4], sim.data$OBS.Y[5:8])
toy.mean.adj <- rbind(toy.data[1,] - mean(unlist(toy.data[1,])), toy.data[2,] - mean(unlist(toy.data[2,])))
toy.mean.adj.prod1 <- prod(dnorm(unlist(toy.mean.adj[1,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.prod2 <- prod(dnorm(unlist(toy.mean.adj[2,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.final.prod <- toy.mean.adj.prod1 * toy.mean.adj.prod2
# right product, spread between gropus
toy.mean.beta.adj <- rbind(mean(unlist(toy.data[1,])) - toy.model.betas, mean(unlist(toy.data[2,])) - toy.model.betas)
toy.mean.beta.adj[1,] <- toy.mean.beta.adj[1,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj[2,] <- toy.mean.beta.adj[2,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj.prod1 <- prod(dnorm(unlist(toy.mean.beta.adj[1,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.prod2 <- prod(dnorm(unlist(toy.mean.beta.adj[2,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.final.prod <- toy.mean.beta.adj.prod1 * toy.mean.beta.adj.prod2
toy.total.prod <- toy.mean.adj.final.prod * toy.mean.beta.adj.final.prod
log(toy.total.prod)
EDIT: A helpful link was provided in the comments (https://stats.stackexchange.com/questions/271903/understand-marginal-likelihood-of-mixed-effects-models). Converting my example from above I can replicate the log-likelihood
library(mvtnorm)
z = getME(model, "Z")
zt = getME(model, "Zt")
psi = bdiag(replicate(2, toy.model.tau.squared, simplify=FALSE))
betw = z%*%psi%*%zt
err = Diagonal(8, sigma(model)^2)
v = betw + err
dmvnorm(sim.data$OBS.Y, predict(model, re.form=NA), as.matrix(v), log=TRUE)
While I did not manage to come up with a closed form solution for all of them, I did manage to reproduce the log-likelihoods using numerical integration. I have posted below small examples for how it works in the LMM setting (assuming normal residuals random effects) as well as the GLMM with Poisson and Negative-Binomial. Note that especially the latter one tends so differ ever so slightly when you increase the sample size. My guess is that there is some rounding happening somewhere but for my purposes the precision achieved here is good enough. I will for now accept my own answer but if someone posts a closed form for the Poisson or the Negative-Binomial I will happily accept your answer :)
library(lme4)
library(mvtnorm)
################################################################################
# LMM numerical integration
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
normal.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
normal.data$EXPECT.Y <- normal.data$GROUP.EFFECT + normal.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
normal.data$OBS.Y <- rnorm(nrow(normal.data), mean=normal.data$EXPECT.Y, sigma)
normal.model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = normal.data, REML=F)
summary(normal.model)
normal.model.var <- VarCorr(normal.model)
normal.model.sigma <- attr(normal.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
normal.model.tau.squared <- normal.model.var[[1]][1] # corresponds to variance of random effects
normal.model.betas <- normal.model#beta
normal.group.tau <- sqrt(normal.model.tau.squared)
normal.group.sigma <- sigma(normal.model)
normal.group.beta <- predict(normal.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
p1 <- dnorm(normal.data$OBS.Y[1] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
p2 <- dnorm(normal.data$OBS.Y[2] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
p3 <- dnorm(normal.data$OBS.Y[3] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
p4 <- dnorm(normal.data$OBS.Y[4] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
p_out <- p1 * p2 * p3 * p4
p_out
}
normal.group1.integration <- integrate(integrate_group1, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
p1 <- dnorm(normal.data$OBS.Y[5] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
p2 <- dnorm(normal.data$OBS.Y[6] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
p3 <- dnorm(normal.data$OBS.Y[7] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
p4 <- dnorm(normal.data$OBS.Y[8] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
p_out <- p1 * p2 * p3 * p4
p_out
}
normal.group2.integration <- integrate(integrate_group2, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(normal.group1.integration) + log(normal.group2.integration)
#################################
# Poisson numerical integration
set.seed(13) #13
n <- 2 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
poisson.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
poisson.data$EXPECT.Y <- exp(poisson.data$GROUP.EFFECT + poisson.data$INSTANCE.EFFECT)
# now observe Y value, assuming normally distributed with fixed std. deviation
poisson.data$OBS.Y <- rpois(nrow(poisson.data), poisson.data$EXPECT.Y)
poisson.model <- glmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = poisson.data, family="poisson")
summary(poisson.model)
poisson.model.var <- VarCorr(poisson.model)
poisson.model.sigma <- attr(poisson.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
poisson.model.tau.squared <- poisson.model.var[[1]][1] # corresponds to variance of random effects
poisson.model.betas <- poisson.model#beta
poisson.group.tau <- sqrt(poisson.model.tau.squared)
poisson.group.sigma <- sigma(poisson.model)
poisson.group.beta <- predict(poisson.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
p1 <- dpois(poisson.data$OBS.Y[1], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
p2 <- dpois(poisson.data$OBS.Y[2], lambda = exp(poisson.group.beta[2] + x))
p3 <- dpois(poisson.data$OBS.Y[3], lambda = exp(poisson.group.beta[3] + x))
p4 <- dpois(poisson.data$OBS.Y[4], lambda = exp(poisson.group.beta[4] + x))
p_out <- p1 * p2 * p3 * p4
p_out
}
poisson.group1.integration <- integrate(integrate_group1, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
p1 <- dpois(poisson.data$OBS.Y[5], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
p2 <- dpois(poisson.data$OBS.Y[6], lambda = exp(poisson.group.beta[2] + x))
p3 <- dpois(poisson.data$OBS.Y[7], lambda = exp(poisson.group.beta[3] + x))
p4 <- dpois(poisson.data$OBS.Y[8], lambda = exp(poisson.group.beta[4] + x))
p_out <- p1 * p2 * p3 * p4
p_out
}
poisson.group2.integration <- integrate(integrate_group2, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(poisson.group1.integration) + log(poisson.group2.integration)
#############
# Negative-Binomial numerical integration
set.seed(13) #13
n <- 100 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
theta <- 0.5
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
nb.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
nb.data$EXPECT.Y <- exp(nb.data$GROUP.EFFECT + nb.data$INSTANCE.EFFECT)
# now observe Y value, assuming normally distributed with fixed std. deviation
nb.data$OBS.Y <- rnbinom(nrow(nb.data), mu = nb.data$EXPECT.Y, size = theta)
nb.model <- glmer.nb(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = nb.data)
summary(nb.model)
nb.model.var <- VarCorr(nb.model)
nb.model.sigma <- attr(nb.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
nb.model.tau.squared <- nb.model.var[[1]][1] # corresponds to variance of random effects
nb.model.betas <- nb.model#beta
nb.group.tau <- sqrt(nb.model.tau.squared)
nb.group.beta <- predict(nb.model, re.form=NA)[1:4]
nb.group.dispersion <- getME(nb.model, "glmer.nb.theta")
integration_function_generator <- function(input.obs, input.beta, input.dispersion, input.tau){
function(x){
p1 <- dnbinom(input.obs[1], mu = exp(input.beta[1] + x), size = input.dispersion) * dnorm(x, mean = 0, sd = input.tau)
p2 <- dnbinom(input.obs[2], mu = exp(input.beta[2] + x), size = input.dispersion)
p3 <- dnbinom(input.obs[3], mu = exp(input.beta[3] + x), size = input.dispersion)
p4 <- dnbinom(input.obs[4], mu = exp(input.beta[4] + x), size = input.dispersion)
p_out <- p1 * p2 * p3 * p4
p_out
}
}
nb.all.group.integrations <- c()
for(i in 1:n){
temp.obs <- nb.data$OBS.Y[(1:4)+(i-1)*4]
temp_integrate_function <- integration_function_generator(temp.obs, nb.group.beta, nb.group.dispersion, nb.group.tau)
temp.integration <- integrate(temp_integrate_function, lower = -10*nb.group.tau, upper = 10*nb.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
nb.all.group.integrations <- c(nb.all.group.integrations, temp.integration)
}
sum(log(nb.all.group.integrations))

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups," and "r threshold minimize misclassification," but my google-foo seems to be off today.
EDIT:
Responding to #Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So Bo = 1.1857435 and B1 = -0.0911762
From Wikipedia, I see that F(x) = 1/(1+e^-(Bo + B1x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - Bo)/B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
I've gotten the answer I need thanks to help from #Thomas and #BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial
The dose.p() function in MASS will do the work for me given a glm fit
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- -0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
x <- (log(p / (1 - p)) - b0)/b1
return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!

Resources