Simulate negative binomial data from a fitted glmmTMB model (family = nbinom1)

I fitted a glmmTMB model using family = nbinom1. Now I would like to perform a simulation of data based on predicted values and the dispersion. However, from the help files, it looks like the go-to rnbinom function uses the family=nbinom2 parameterization where variance is equal to mu + mu^2/size.
1) Can anyone help me figure out how to simulate family=nbinom1 data (where variance is equal to mu + mu*size)?
2) Also, is my extraction / use of the dispersion value as size correct?
Thanks so much!
Current code (data not provided, because it doesn't matter here), using stats::rnbinom despite the mismatch in the variance definition:
library(glmmTMB)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = size)

We can try to simulate nbinom1-type data, where the variance is mu + mu*k:
set.seed(111)
k = 2
x = runif(100,min=1,max=3)
y = rnbinom(100,mu=exp(2*x),size=exp(2*x)/k)
ID = sample(1:2,100,replace=TRUE)
df = data.frame(x,y,ID)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
sigma(mod)
[1] 1.750076
In the above, for every mean mu, I specified a size of mu / k, so that the expected variance is mu + mu^2/(mu/k) = mu + mu*k. This shows that as long as you parameterize rnbinom() correctly, you can simulate nbinom1-type data.
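As a quick sanity check of this mapping (with arbitrary values of mu and k): if size = mu/k, R's rnbinom() variance mu + mu^2/size reduces to mu + mu*k.
# check that size = mu/k gives Var = mu + mu*k (here 20 + 20*2 = 60)
mu <- 20; k <- 2
samp <- rnbinom(1e6, mu = mu, size = mu / k)
mean(samp)  # close to 20
var(samp)   # close to 60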
Now, with this model, simulating data is just a matter of using the same parameterization as above:
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = preds/size)
plot(sim,df$y)
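As an additional cross-check, glmmTMB fits also have a simulate() method, which draws new responses using the fitted family's own parameterization; a minimal sketch:
# built-in simulation from the fitted model; returns a data frame of nsim columns
sim2 <- simulate(mod, nsim = 1)[[1]]
plot(sim2, df$y)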

There are a variety of issues here, including:
sigma(mod) gives the estimated standard deviation of the residuals; it is not a variance but the square-root of a variance, so you might want to square it.
there are many parametrisations of the negative binomial distribution beyond R's version, but in R's version, if the mean is mean(dat) and the variance is var(dat), then you can estimate size with mean(dat)^2/(var(dat)-mean(dat)) and the probability prob with mean(dat)/var(dat)
rnbinom() will tolerate size being non-integer or infinite despite this being theoretical nonsense; it will not tolerate size being negative, which can happen if var(dat) is less than mean(dat). It will also have problems if the mean is negative or if size is zero.
So perhaps you could consider adapting your simulation lines to something like
sizes <- ifelse(sigma(mod) ^ 2 > preds, preds ^ 2 / (sigma(mod) ^ 2 - preds), Inf)
sim <- ifelse(preds > 0, rnbinom(nrow(df), mu = preds, size = sizes), 0)
You might still get problems (warnings or NA values) when sigma(mod)^2 is less than or equal to preds.
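A small worked check of the moment formulas mentioned above, using arbitrary true values just to illustrate the R parameterization:
# with mu = 8 and size = 4, Var = mu + mu^2/size = 24, so prob = mu/Var = 1/3
dat <- rnbinom(1e5, mu = 8, size = 4)
m <- mean(dat); v <- var(dat)
m^2 / (v - m)  # estimate of size, close to 4
m / v          # estimate of prob, close to 1/3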

Related

how to obtain p-value (or CIs) for correlation of random effects in a GLMM (lme4)

I want to test for a correlation between the random effects of a GLMM calculated in lme4. It has already been suggested to me that I conduct a likelihood ratio comparison of a model with and without the random correlation. That is indeed significant, but I wanted to ask whether there is any way to get confidence intervals or p-values for this correlation from the model.
(Specifically, I have compared a model with the random effects structure (1 + X1 + X2 || group) against (1 + X1 + X2 | group), but the problem is that the second model also includes the correlations with the intercept, and I want to specifically test for the significance of the correlation between X1 and X2. Unfortunately, a model with (1 + X1 | group) + (1 + X2 | group) does not converge.)
Any help would be appreciated.
You can use confint() to get likelihood profile confidence intervals. P-values would be harder; you could do parametric bootstrapping but it would be slow.
set.seed(101)
dd <- data.frame(x = rnorm(1000), y = rnorm(1000),
                 g = factor(sample(1:20, size = 1000, replace = TRUE)))
library(lme4)
# simulate a response from a model with known fixed effects (beta),
# residual SD (sigma), and random-effect covariance parameters (theta)
dd$z <- simulate(~ x + y + (1 + x + y | g),
                 newdata = dd,
                 newparams = list(beta = rep(1, 3),
                                  sigma = 1,
                                  theta = rep(1, 6)))[[1]]
m <- lmer(z ~ x + y + (1 + x + y | g), data = dd)
In the confint() call below, parm = "theta_" means "all covariance parameters". You could use parm = c(2, 3, 5) to select only the correlation parameters, but you'd have to read ?profile.merMod and think carefully to figure out the correct indices ...
cc <- confint(m, parm = "theta_", oldNames = FALSE)
Results give you 95% (by default) CIs for all of the covariance parameters. In this example, the x/y slope correlation is significant but the correlations between (intercept and x) and (intercept and y) aren't. (Note that the correlations aren't necessarily invariant to reparameterizing the model, in particular centering or otherwise shifting the predictors will change the answers ...)
cc
                          2.5 %    97.5 %
sd_(Intercept)|g     0.38142602 0.7451417
cor_x.(Intercept)|g -0.15990774 0.6492967
cor_y.(Intercept)|g -0.01148283 0.7294138
sd_x|g               0.67205037 1.2800681
cor_y.x|g            0.53404483 0.9116571
sd_y|g               0.83378353 1.5742580
sigma                0.94201110 1.0311559
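For the p-value part, a rough parametric-bootstrap sketch is below. This is only a sketch: it tests all the correlations jointly by comparing against the || model rather than isolating the x/y correlation, the object names m_full, m_null and the value of nsim are my own choices, and it will be slow.
# parametric bootstrap of the LRT for the random-effect correlations
m_full <- lmer(z ~ x + y + (1 + x + y | g),  data = dd, REML = FALSE)
m_null <- lmer(z ~ x + y + (1 + x + y || g), data = dd, REML = FALSE)
obs_lrt <- as.numeric(2 * (logLik(m_full) - logLik(m_null)))
nsim <- 200  # small, for illustration only
lrt_sim <- replicate(nsim, {
  ysim <- simulate(m_null)[[1]]   # simulate under the no-correlation model
  as.numeric(2 * (logLik(refit(m_full, ysim)) - logLik(refit(m_null, ysim))))
})
mean(lrt_sim >= obs_lrt)  # parametric bootstrap p-value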

GLMM with beta distribution and lots of zeros in y variable

I am trying to run a GLMM with a beta distribution using the glmmTMB function (package glmmTMB). My response variable has a lot of zero observations, so I get this error when running the model:
Error in eval(family$initialize) : y values must be 0 < y < 1
I have attached what my response variable looks like, both raw and normalized (see image).
Zero values cannot occur in data that are truly Beta-distributed (the probability density of y==0 is either zero or infinite unless the first shape parameter is exactly 1.0). You can fit a zero-inflated Beta response by specifying ziformula. For example:
# simulate data
set.seed(101)
y <- rbeta(1000, shape1 = 1, shape2 = 5)
y[sample(1000, replace = FALSE, size = 100)] <- 0
dd <- data.frame(y)
# fit
library(glmmTMB)
glmmTMB(y ~ 1, ziformula = ~1, data = dd, family = beta_family)
This example doesn't have a random-effects component, but that doesn't change anything important.
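If useful, the estimated zero-inflation probability can be pulled out of the fit afterwards; a small sketch, storing the model in a (hypothetical) object fit_zib:
fit_zib <- glmmTMB(y ~ 1, ziformula = ~1, data = dd, family = beta_family())
plogis(fixef(fit_zib)$zi)  # estimated zero probability, roughly 0.1 here (100 of 1000 zeros)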

my location-scale estimator function not working with polynomial mean

I'm building my own maximum likelihood estimator that estimates the parameters associated with the mean and standard deviation. On simulated data my function works when the true mean is a linear function and the standard deviation is constant. However, if the mean structure is polynomial my function cannot recover the true parameters. Can anybody point me to a solution?
I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them; I'm interested in why my own function is not working.
Below is a reproducible example where my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123).
Edit: added library().
library(tidyverse)
my_poly_loglik <- function(pars,        # parameters
                           outcome,     # outcome variable
                           poly_mean){  # matrix of polynomial terms
  # modelling the mean - adding intercept column
  mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]
  # modelling the standard deviation on the exponential scale
  sd_mdl = exp(pars[length(pars)])
  # computing the log likelihood
  sum_log_likelihood <- sum(dnorm(outcome,
                                  mean = mean_mdl,
                                  sd = sd_mdl,
                                  log = TRUE),
                            na.rm = TRUE)
  # since optim() minimizes, we return the negative log likelihood
  return(-sum_log_likelihood)
}
#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2 #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean)
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)
#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2)
#Optimising my likelihood function
optim_res <- optim(pars,
                   fn = my_poly_loglik,
                   outcome = mydf$y,
                   poly_mean = polymean)
if (optim_res$convergence != 0) stop("optim_res$convergence is not 0!")
#comparing my function to the real parameters
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
                     "truemean" = mean,
                     "z" = z)
#my mean (black) and true mean (red)
plot_df %>%
  ggplot(aes(x = z, y = mymean)) +
  geom_line() +
  geom_line(aes(y = truemean), color = "red")
#Works!
#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd
This is not a complete solution, but it might help others find the correct answer.
The code looks good overall, and the issue emerges only with a high range of the z values. In fact, scaling them or generating data from a considerably lower range leads to the correct solution. Furthermore, checking the Hessian shows that the covariance matrix of the estimates is not positive semidefinite, and slightly reducing the range results in correlations of the mean parameters close to 1. (This is a bit puzzling, since a normal linear model with the same parametrization does not suffer from the same issue -- I know it does not optimize the likelihood directly, but it is still a bit unintuitive to me.)
So a temporary solution might be rescaling the predictors or using an orthogonal parametrization? That does not really explain the core of the issue, though.
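A rough sketch of the two checks mentioned above, reusing the objects from the question. The Hessian inspection and the standardization via scale(), along with the optimizer settings, are my own choices here, so treat this as a sketch rather than a verified fix.
# 1) curvature check: non-positive eigenvalues of the Hessian at the point where
#    optim stops imply the implied covariance matrix is not positive definite
optim_hess <- optim(pars, fn = my_poly_loglik,
                    outcome = mydf$y, poly_mean = polymean,
                    hessian = TRUE)
eigen(optim_hess$hessian)$values
# 2) refit with standardized polynomial columns (the rescaling idea above);
#    whether this fully recovers exp(0.5) may still depend on optimizer settings
polymean_scaled <- cbind(scale(z), scale(z^2))
optim_scaled <- optim(rep(0, ncol(polymean_scaled) + 2),
                      fn = my_poly_loglik,
                      outcome = mydf$y, poly_mean = polymean_scaled,
                      control = list(maxit = 5000))
exp(optim_scaled$par[length(optim_scaled$par)])  # compare against the true exp(0.5)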

Other than burn-in increase and priors, how can I help my multiple change point (mcp package in R) models converge?

I would like to identify changepoints in my data and an associated error term for their estimates. The mcp package seems to do a good job (visually) of identifying changepoints in my data, but the model parameters generally have rhat values > 1.1. From my understanding, I cannot trust any Bayesian parameter estimates unless all rhat values in the model are <= 1.1. Aside from increasing my burn-in period with the adapt argument or using priors (see note below), how else can I improve these models?
Alternatively, can I force mcp to fit a 'best' two- and three-segment model and return those parameter estimates with error? Ideally I would be able to provide changepoint estimates with an error term associated with each estimate, but packages like segmented and strucchange generally fail to identify changepoints in my data.
The code looks like this:
set.seed(42)
x <- c(227,227,228,228,228,228,228,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,232,233,233,233,233,233,233,233,233,236,236,236,236,237,237,237,238,238,238,238,238,238,238,238,239,239,239,239,239,239,243,244,244,244,244,244,244,244,244,244,245,245,245,246,246,246,246,247,250,250,250,250,251,251,251,251,251,251,251,251,253,253,253,257,257,260,260,260,260,260,260,260,264,264,264,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,267,267,267,267,267,267,267,267,267,267,267,267,267,271,271,271,271,271,271,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,274,274,274,274,274,274,274)
y <- c(8.43,6.9,8.93,7.33,7.28,7.24,6.62,8.36,8.17,8.07,8.07,7.63,7.6,7.54,7.37,7.31,7.21,7.03,6.93,6.88,6.82,6.78,6.7,6.5,8.35,10.97,7.48,7.46,7.28,7.17,6.72,6.68,6.08,7.42,7.14,6.92,6.68,7.49,7.28,6.67,9.4,7.54,7.04,6.89,6.88,6.52,6.45,6.39,8.48,8.04,7.52,7.35,6.9,6.57,6.86,7.46,7.39,7.16,7.08,6.83,6.83,6.7,6.54,6.47,9.75,7.38,5.96,10.49,8.32,7.22,7.05,8.55,10.34,8.23,7.9,7.31,8.18,7.8,7.31,7.18,7.17,7.13,7.02,6.84,10.62,10.09,9.26,10.8,10.37,10.9,10.52,10.23,9.28,9.18,8.85,8.81,11.03,8.84,6.29,11.36,10.91,10.87,10.4,10.17,9.61,9.5,9.36,9.17,9.13,8.88,8.73,8.55,8.37,8.33,8.25,7.82,6.9,9.77,9.53,9.39,9.1,8.93,8.68,8.64,8.47,8.41,8.38,8.28,8.18,7.74,10.67,10.64,10.54,10.36,10.35,7.03,9.51,9.37,9.24,9.22,9.18,8.96,8.95,8.94,8.89,8.82,8.79,8.72,8.35,8.22,8.13,8.07,7.91,7.85,7.79,8.82,8.59,8.44,8.42,8.37,8.06,7.34)
df <- data.frame(x, y)
#Writing the formula for a three-segment line
three_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 0 + x, #segment 2, specifying a changepoint and joined slope
y ~ 1 ~ 0 + x #segment 3, specifying a changepoint and joined slope
)
#Writing the formula for a two-segment line
two_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 0 + x #segment 2, specifying a changepoint and joined slope
)
#Disjointing the slopes of two segments
test_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 1 + x #segment 2, specifying a changepoint and disjoined slope
)
#Disjointing the slopes of the three expected segments
test_three_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 1 + x, #segment 2, specifying a changepoint and disjoined slope
y ~ 1 ~ 1 + x #segment 3, specifying a changepoint and disjoined slope
)
These are the models I've tried, but none achieve consistently tolerable rhat values. The code I use to test them is as follows:
#An example run, although each of these models fails to provide suitable rhat values
fit1 <- mcp(three_segment_model, df, chains = 4, iter = 10000, cores = 3)
plot(fit1)
fit1_summary <- data.frame(summary(fit1))
Given the literature, I have also tried priors (though admittedly I cannot make heads or tails of the truncation and distribution syntax in mcp), which do not seem to help with the patterns that are poorly identified by this modelling approach.
If your issue is primarily a practical one, upping the number of iterations and chains reveals that the posterior is quite reproducible across chains for the "problematic" parameters:
fit1 <- mcp(three_segment_model, df, chains = 6, iter = 50000, cores = 6)
plot_pars(fit1, c("cp_1", "cp_2", "x_1"))
And you have good rhat values for the non-three-segment models; all of this points to the three-segment model being unidentifiable with these data. Without knowing the underlying process, I did try some fairly informative priors:
prior = list(
x_1 = 0, # fixed horizontal!
x_2 = "dnorm(0, 1) T(0, )" # Positive slope
)
but it did not improve rhat values for the change point parameters.
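For reference, a prior list like this is passed through mcp()'s prior argument; a minimal sketch reusing the same model and data objects as above (fit1_priors is a name of my choosing):
fit1_priors <- mcp(three_segment_model, df, prior = prior,
                   chains = 6, iter = 50000, cores = 6)
plot_pars(fit1_priors, c("cp_1", "cp_2", "x_1"))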

R nls singular gradient starting value

I am having a problem while fitting a function via nls.
This is the data:
size<-c(0.0020,0.0063,0.0200,0.0630,0.1250,0.2000,0.6300,2.0000)
cum<-c(6.4,7.1,7.6,37.5,83.0,94.5,99.9,100.0)
I want to fit a Gompertz model to it. Therefore I tried:
start<-c(alpha =100, beta = 10, k = 0.03)
fit<-nls(cum~ alpha*exp(-beta*exp(-k*size)),start=start)
The error says: singular gradient.
Some posts suggest choosing better starting values.
Can you help me with this problem?
The starting values are too far away from the optimal ones. First take logs of both sides: log(cum) = log(alpha) - beta*exp(-k*size), which is linear in log(alpha) and beta, so k is the only non-linear parameter. Only k needs a starting value if we use the plinear algorithm, since plinear estimates the linear coefficients itself. Then use the k from that fit as the starting value for k and refit using the original formula.
fit.log <- nls(log(cum) ~ cbind(1, exp(-k*size)), alg = "plinear", start = c(k = 0.03))
start <- list(alpha = 100, beta = 10, k = coef(fit.log)[["k"]])
fit <- nls(cum ~ alpha*exp(-beta*exp(-k*size)), start = start)
fit
giving:
Nonlinear regression model
model: cum ~ alpha * exp(-beta * exp(-k * size))
data: parent.frame()
alpha beta k
100.116 3.734 22.340
residual sum-of-squares: 45.87
Number of iterations to convergence: 11
Achieved convergence tolerance: 3.351e-06
We can show the fit on a graph
plot(cum ~ size, pch = 20)
lines(fitted(fit) ~ size, col = "red")
giving: (plot of cum versus size with the fitted curve in red)
