I would like to identify changepoints in my data, each with an associated error term for its estimate. The mcp package seems to do a good job (visually) of identifying changepoints in my data, but the model parameters generally have rhat values > 1.1. From my understanding, I cannot trust any Bayesian parameter estimates unless all rhat values in the model are ≤ 1.1. Aside from increasing my burn-in period with the adapt argument or using priors (see note below), how else can I improve these models?
Alternatively, can I force mcp to fit a 'best' two- and three-segment model and return those parameter estimates with error? Ideally I would be able to provide changepoint estimates with an error term associated with each estimate, but packages like segmented and strucchange generally fail to identify changepoints in my data.
The code looks like this:
library(mcp)
set.seed(42)
x <- c(227,227,228,228,228,228,228,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,232,233,233,233,233,233,233,233,233,236,236,236,236,237,237,237,238,238,238,238,238,238,238,238,239,239,239,239,239,239,243,244,244,244,244,244,244,244,244,244,245,245,245,246,246,246,246,247,250,250,250,250,251,251,251,251,251,251,251,251,253,253,253,257,257,260,260,260,260,260,260,260,264,264,264,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,267,267,267,267,267,267,267,267,267,267,267,267,267,271,271,271,271,271,271,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,274,274,274,274,274,274,274)
y <- c(8.43,6.9,8.93,7.33,7.28,7.24,6.62,8.36,8.17,8.07,8.07,7.63,7.6,7.54,7.37,7.31,7.21,7.03,6.93,6.88,6.82,6.78,6.7,6.5,8.35,10.97,7.48,7.46,7.28,7.17,6.72,6.68,6.08,7.42,7.14,6.92,6.68,7.49,7.28,6.67,9.4,7.54,7.04,6.89,6.88,6.52,6.45,6.39,8.48,8.04,7.52,7.35,6.9,6.57,6.86,7.46,7.39,7.16,7.08,6.83,6.83,6.7,6.54,6.47,9.75,7.38,5.96,10.49,8.32,7.22,7.05,8.55,10.34,8.23,7.9,7.31,8.18,7.8,7.31,7.18,7.17,7.13,7.02,6.84,10.62,10.09,9.26,10.8,10.37,10.9,10.52,10.23,9.28,9.18,8.85,8.81,11.03,8.84,6.29,11.36,10.91,10.87,10.4,10.17,9.61,9.5,9.36,9.17,9.13,8.88,8.73,8.55,8.37,8.33,8.25,7.82,6.9,9.77,9.53,9.39,9.1,8.93,8.68,8.64,8.47,8.41,8.38,8.28,8.18,7.74,10.67,10.64,10.54,10.36,10.35,7.03,9.51,9.37,9.24,9.22,9.18,8.96,8.95,8.94,8.89,8.82,8.79,8.72,8.35,8.22,8.13,8.07,7.91,7.85,7.79,8.82,8.59,8.44,8.42,8.37,8.06,7.34)
df <- data.frame(x, y)
#Writing the formula for a three-segment line
three_segment_model = list(
  y ~ x,          # segment 1: intercept + slope
  y ~ 1 ~ 0 + x,  # segment 2: changepoint, joined slope
  y ~ 1 ~ 0 + x   # segment 3: changepoint, joined slope
)
#Writing the formula for a two-segment line
two_segment_model = list(
  y ~ x,          # segment 1: intercept + slope
  y ~ 1 ~ 0 + x   # segment 2: changepoint, joined slope
)
#Disjointing the slopes of two segments
test_segment_model = list(
  y ~ x,          # segment 1: intercept + slope
  y ~ 1 ~ 1 + x   # segment 2: changepoint, disjoined intercept and slope
)
#Disjointing the slopes of the three expected segments
test_three_segment_model = list(
  y ~ x,          # segment 1: intercept + slope
  y ~ 1 ~ 1 + x,  # segment 2: changepoint, disjoined intercept and slope
  y ~ 1 ~ 1 + x   # segment 3: changepoint, disjoined intercept and slope
)
These are the models I've tried, but none achieve consistently tolerable rhat values. The code I use to test them is as follows:
#An example run, although each of these models fails to provide suitable rhat values
fit1 <- mcp(three_segment_model, df, chains = 4, iter = 10000, cores = 3)
plot(fit1)
fit1_summary <- data.frame(summary(fit1))
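For reference, this is how I pull each changepoint's point estimate and credible interval from that summary (a sketch; I am assuming mcp's summary data frame has the columns name, mean, lower, upper, and Rhat):
# Changepoint estimates with credible bounds, plus rhat as a
# convergence check (column names assumed from mcp's summary output)
fit1_summary[grep("^cp_", fit1_summary$name),
             c("name", "mean", "lower", "upper", "Rhat")]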
Following the literature, I have also tried priors (though admittedly I cannot make heads or tails of mcp's truncation and distribution syntax), but they do not seem to resolve the poorly identified patterns in this modelling approach.
If your issue is primarily a practical one, upping the number of iterations and chains reveals that the posterior is quite reproducible across chains for the "problematic" parameters:
fit1 <- mcp(three_segment_model, df, chains = 6, iter = 50000, cores = 6)
plot_pars(fit1, c("cp_1", "cp_2", "x_1"))
You also get good rhat values for the non-three-segment models, all of which points to the three-segment model being unidentifiable with these data. Without knowing the underlying process, I did try some fairly informative priors:
prior = list(
  x_1 = 0,                    # fixed horizontal!
  x_2 = "dnorm(0, 1) T(0, )"  # positive slope
)
but it did not improve rhat values for the change point parameters.
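If instead you want to ask which segmentation the data actually support, one route is approximate leave-one-out cross-validation; here is a sketch, assuming mcp's loo() method and the loo package's loo_compare() (both of which I believe are available):
# Compare the two- and three-segment models by approximate LOO-CV
fit2 <- mcp(two_segment_model, df, chains = 6, iter = 50000, cores = 6)
fit1$loo <- loo(fit1)
fit2$loo <- loo(fit2)
loo::loo_compare(fit1$loo, fit2$loo)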
I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which distinguishes zero-inflated gamma models from what they call "zero-altered" gamma models; the latter they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcomes.
I have been exploring the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why, since the gamma distribution cannot take on the value zero, so I would suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
set.seed(1)  # for reproducibility
id <- 1:75000
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000),
          rgamma(n = 37500, shape = 5000, rate = 1),
          sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(id, disease, age, gender, cost, time)
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
data = subset(df, cost > 0),
family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
family = ziGamma(link = "log"),
ziformula = ~ disease + gender + age + time,
data = df)
model_parameters(zigamma_model, exponentiate = T)
# This doesn't work: the pipe passes df as predict()'s first argument,
# hence "no applicable method for 'predict' applied to an object of
# class 'data.frame'" (and the argument is newdata, not "new data"):
# df <- df %>% predict(zigamma_model, new data = df, type = "response")
df$cost_pred_zi <- predict(zigamma_model, newdata = df, type = "response")
The coefficients from the gamma component of my hurdle model and the fixed-effects (conditional) component of the ziGamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients of the zero-inflation component are different, and I also noticed that its z values are the negatives of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) while glmmTMB presumably models the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
The glmmTMB package can do this:
glmmTMB(formula, family = ziGamma(link = "log"), ziformula = ~1, data = ...)
ought to do it. Perhaps something in VGAM would work as well.
To answer the questions about coefficients and standard errors:
The change in sign of the binomial coefficients is exactly what you suspected: the difference between estimating the probability of zero (glmmTMB) vs. the probability of not-zero (your/Zuur's code).
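A quick way to see this from the fitted objects (a sketch using the question's models; fixef() on a glmmTMB fit returns $cond, $zi, and $disp components):
# The zero-inflation coefficients should be approximately the negated
# hurdle-binomial coefficients, since the two parts model opposite outcomes.
cbind(zi = fixef(zigamma_model)$zi,
      neg_hurdle = -coef(hurdle_binomial))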
The standard errors on the binomial part of the model are close but not identical. Using broom.mixed::tidy (with zigamma_model as fitted in the question):
library(broom.mixed)
round(1 - abs(tidy(zigamma_model, component = "zi")$statistic) /
          abs(tidy(hurdle_binomial)$statistic), 3)
## [1] 0.057 0.001 0.000 0.000 0.295
That is about 6% for the intercept, and up to 30% for the effect of age.
The nearly twofold difference in the standard errors of the conditional (cost > 0) component is definitely puzzling me. It persists if we simply fit the Gamma/log-link model in both glmmTMB and glm, and it's hard to know which is right, or what the gold standard should be for this case. I might distrust the Wald p-values here and instead get p-values from a likelihood ratio test (via drop1), as sketched below.
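A minimal sketch of that check (assuming drop1's default refitting machinery works for glmmTMB objects, which I believe it does):
# Likelihood-ratio tests for each fixed effect in the conditional model,
# as an alternative to the Wald z-statistics reported by summary()
drop1(zigamma_model, test = "Chisq")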
In this example the model is also badly misspecified (much of the positive cost is uniformly distributed, nothing like a Gamma); I wonder if that could be making things harder/worse.
I fitted a glmmTMB model using family = nbinom1. Now I would like to simulate data based on the predicted values and the dispersion. However, from the help files, it looks like the go-to rnbinom() function uses the family = nbinom2 parameterization, where the variance equals mu + mu^2/size.
1) Can anyone help me figure out how to simulate family = nbinom1 data (where the variance equals mu + mu*phi, with phi the dispersion)?
2) Also, is my extraction and use of the dispersion value as size correct?
Thanks so much!
My current code (data not provided, as it doesn't matter), using stats::rnbinom despite the mismatch in variance definition:
library(glmmTMB)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = size)
We can simulate nbinom1 data with rnbinom() directly: if the desired variance is mu + mu*k, then setting size = mu/k in the nbinom2 parameterization gives a variance of mu + mu^2/(mu/k) = mu + mu*k:
set.seed(111)
k = 2
x = runif(100,min=1,max=3)
y = rnbinom(100,mu=exp(2*x),size=exp(2*x)/k)
ID = sample(1:2,100,replace=TRUE)
df = data.frame(x,y,ID)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
sigma(mod)
[1] 1.750076
In the above, for every mean mu I specified size = mu/k, which gives an expected variance of mu + mu*k. This shows that as long as you parameterize rnbinom() correctly, you recover nbinom1 data: sigma(mod) estimates k, and here it is close to the true value of 2.
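A quick numeric check of that claim, with fixed illustrative values rather than model output:
set.seed(1)
mu <- 10; k <- 2
check <- rnbinom(1e5, mu = mu, size = mu / k)
mean(check)  # close to mu = 10
var(check)   # close to mu + mu * k = 30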
Now, to simulate data from this fitted model, we just use the same parameterization as above:
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = preds/size)
plot(sim,df$y)
There are a variety of issues here, including:
sigma(mod) gives the estimated standard deviation of the residuals; it is not a variance but the square-root of a variance, so you might want to square it.
there are many parametrisations of the negative binomial distribution beyond R's version, but in R's version, if the mean is mean(dat) and the variance is var(dat), then you can estimate size with mean(dat)^2/(var(dat) - mean(dat)) and the probability prob with mean(dat)/var(dat) (see the quick check after this list)
rnbinom() will tolerate size being non-integer or infinite, despite this being theoretical nonsense; it will not tolerate size being negative, which can happen if var(dat) is less than mean(dat). It will also have problems if the mean is negative or if size is zero.
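A quick check of those moment estimates on simulated data (illustrative values only):
set.seed(2)
dat <- rnbinom(1e5, mu = 10, size = 5)  # true size = 5
mean(dat)^2 / (var(dat) - mean(dat))    # should be close to 5
mean(dat) / var(dat)                    # prob = size / (size + mu) = 1/3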
So perhaps you could consider adapting your simulation lines to something like
sizes <- ifelse(sigma(mod) ^ 2 > preds, preds ^ 2 / (sigma(mod) ^ 2 - preds), Inf)
sim <- ifelse(preds > 0, rnbinom(nrow(df), mu = preds, size = sizes), 0)
You might still get errors in the branch where sigma(mod)^2 is less than or equal to preds, though.
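As an aside, glmmTMB fits also have a simulate() method that draws new responses from the fitted family with the estimated dispersion, which sidesteps the manual rnbinom() parameterization entirely (a sketch, using the mod and df objects from above):
# Each column of sims is one simulated response vector from the fitted model
sims <- simulate(mod, nsim = 10, seed = 101)
plot(sims[[1]], df$y)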
There is a great post about the interpretation of predict.coxph() output. However, I keep getting different results when comparing the output of predict.coxph(), simPH, and the relative-risk formula. Since my hypothesis involves a quadratic effect, I include a second-degree polynomial in my example.
I use the example data from that post.
data("lung")
Predicting relative risk with predict()
# Defining the quadratic predictor
lung$meal.cal_q <- lung$meal.cal^2
# conduct a cox regression with the predictor meal.cal, its quadratic version and some covariates.
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + meal.cal_q,
data = lung)
# a vector of fitted values to predict for
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE), max(lung$meal.cal,
na.rm= TRUE), by= 1)
# a vector of fitted values to predict for, the quadratic effect
meal.cal_q_new <- meal.cal_new^2
# the length of the vector with the values to predict for
n <- length(meal.cal_new)
# a dataframe with all the values to predict for
lung_new <- data.frame(ph.karno= rep(mean(lung$ph.karno, na.rm= TRUE), n),
pat.karno= rep(mean(lung$pat.karno, na.rm= TRUE), n),
meal.cal= meal.cal_new,
meal.cal_q = meal.cal_q_new)
# predict the relative risk
lung_new$rel_risk <- predict(cox_mod, lung_new, type= "risk")
Predicting the relative risk with the formula (see the post mentioned above)
# Defining the quadratic predictor
lung$meal.cal_q <- lung$meal.cal^2
# run a cox regression with the predictor meal.cal, its quadratic version and some covariates.
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + meal.cal_q,
data = lung)
# a vector of fitted values to predict for
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE), max(lung$meal.cal,
na.rm= TRUE), by= 1)
# a vector of fitted values to predict for, the quadratic effect
meal.cal_q_new <- meal.cal_new^2
# length of the vector to predict for
n <- length(meal.cal_new)
# A dataframe with the values to make the prediction for
lung_new2 <- data.frame(
ph.karno= rep(mean(lung$ph.karno, na.rm= TRUE), n),
pat.karno= rep(mean(lung$pat.karno, na.rm= TRUE), n),
meal.cal= meal.cal_new,
meal.cal_q = meal.cal_q_new)
# A dataframe with the values to compare the prediction with
lung_new_mean <- data.frame(
ph.karno= rep(mean(lung$ph.karno, na.rm= TRUE), n),
pat.karno= rep(mean(lung$pat.karno, na.rm= TRUE), n),
meal.cal= rep(mean(lung$meal.cal, na.rm= TRUE), n),
meal.cal_q = rep(mean(lung$meal.cal_q, na.rm= TRUE), n))
# extract the coefficients
coefCPH <- coef(cox_mod)
# make the prediction for the values of interest
cox_risk <-
exp(coefCPH["ph.karno"] * lung_new2[ , "ph.karno"] +
coefCPH["pat.karno"] * lung_new2[ , "pat.karno"] +
coefCPH["meal.cal"] * lung_new2[ , "meal.cal"] +
coefCPH["meal.cal_q"] * lung_new2[ , "meal.cal_q"])
# make the predictions for the values to compare with
cox_risk_mean <-
exp(coefCPH["ph.karno"] * lung_new_mean[ , "ph.karno"] +
coefCPH["pat.karno"] * lung_new_mean[ , "pat.karno"] +
coefCPH["meal.cal"] * lung_new_mean[ , "meal.cal"] +
coefCPH["meal.cal_q"] * lung_new_mean[ , "meal.cal_q"])
# calculate the relative risk
lung_new2$rel_risk <- unlist(cox_risk)/ unlist(cox_risk_mean)
Now the plot with the predicted relative risk using predict() and using the formula:
ggplot(lung_new, aes(meal.cal, rel_risk)) +
geom_smooth() +
geom_smooth(data= lung_new2, col= "red")
The plot shows that the two sets of predictions differ. I do not understand why, as the post mentioned above indicates that the predict function and the formula should give the same result.
Because of this confusion I tried to solve the problem with the simPH package. Here is what I did:
# Defining the quadratic predictor
lung$meal.cal_q <- lung$meal.cal^2
# run a cox regression with the predictor, its quadratic version and some covariates.
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + meal.cal_q,
data = lung)
# a vector of fitted values to predict for
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE),
max(lung$meal.cal, na.rm= TRUE), by= 1)
# length of the vector to predict for
n <- length(meal.cal_new)
# A vector with the values to compare the prediction with
meal.cal_new_mean <- rep(mean(lung$meal.cal, na.rm= TRUE), n)
# running 100 simulations per predictor value with coxsimPoly
Sim <- coxsimPoly(obj= cox_mod, b = "meal.cal", pow = 2,
qi = "Relative Hazard",
Xj = meal.cal_new,
Xl = meal.cal_new_mean,
ci = .95,
nsim = 100,
extremesDrop = FALSE)
# plot the result
simGG(Sim)
This produces an empty plot with the warnings
Warning messages:
1: In min(obj$sims[, x]) : no non-missing arguments to min; returning Inf
2: In max(obj$sims[, x]) : no non-missing arguments to max; returning -Inf
And the Sim$sims object does indeed appear to be empty.
My questions are:
Why do the results from predict() and the use of the formula differ?
Why does the simPH package not calculate the relative risk?
Which method should I choose? My hypothesis is a quadratic effect in a Cox regression, and I need a plot of this predictor with its relative risk (compared to the predictor at its mean value), just like in the example.
Quick answer to the simPH issue: the polynomial terms need to be specified in the coxph call using the I function, e.g.:
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + I(meal.cal^2),
data = lung)
(simPH's error handling in this use case is admittedly poor.)
Using this modification (and 1000 simulations) with your code above should return a plot of the simulated relative hazard across meal.cal values (figure omitted).
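Concretely, rerunning the simulation with the refitted model above (a sketch reusing the question's Xj/Xl vectors, with nsim raised to 1000):
meal.cal_new <- seq(min(lung$meal.cal, na.rm = TRUE),
                    max(lung$meal.cal, na.rm = TRUE), by = 1)
meal.cal_new_mean <- rep(mean(lung$meal.cal, na.rm = TRUE),
                         length(meal.cal_new))
Sim <- coxsimPoly(obj = cox_mod, b = "meal.cal", pow = 2,
                  qi = "Relative Hazard",
                  Xj = meal.cal_new, Xl = meal.cal_new_mean,
                  ci = .95, nsim = 1000, extremesDrop = FALSE)
simGG(Sim)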
Differences between simPH and predict
My guess about the differences: simPH does not create confidence intervals around transformed point estimates the way predict() does. It draws simulations of the coefficients from the multivariate normal distribution implied by the fitted model, then shows the central 50% and 95% of the resulting distribution of the quantity of interest; the central line is just the median of the simulations. This is explicitly a different logic from predict(). For very non-monotonic quantities of interest, like this one, predict() point estimates can give highly substantively misleading results compared with simPH; there is little evidence for such a form based on 4 observations.
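To make that logic concrete, here is a minimal sketch of the simulation approach described above (my own illustration, not simPH's actual internals; it assumes the question's original cox_mod with the explicit meal.cal_q term, plus the meal.cal_new and meal.cal_new_mean vectors):
library(MASS)  # for mvrnorm

set.seed(1)
# Draw coefficient vectors from the fitted model's multivariate normal
draws <- mvrnorm(1000, mu = coef(cox_mod), Sigma = vcov(cox_mod))

# Relative hazard of each new meal.cal value vs. its mean, per draw
# (the other covariates cancel because they are held at equal values)
lp_j <- outer(draws[, "meal.cal"], meal.cal_new) +
        outer(draws[, "meal.cal_q"], meal.cal_new^2)
lp_l <- outer(draws[, "meal.cal"], meal.cal_new_mean) +
        outer(draws[, "meal.cal_q"], meal.cal_new_mean^2)
rel_haz <- exp(lp_j - lp_l)

# Median and central 95% interval across draws, per predictor value
qi <- apply(rel_haz, 2, quantile, probs = c(0.025, 0.5, 0.975))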