I would like to identify changepoints in my data and an associated error term for each estimate. The mcp package seems to do a good job (visually) of identifying changepoints in my data, but the model parameters generally have rhat values > 1.1. From my understanding, I cannot trust any Bayesian parameter estimates unless all rhat values in the model are <= 1.1. Aside from increasing my burn-in period with the adapt argument or using priors (see note below), how else can I improve these models?
Alternatively, can I force mcp to fit a 'best' two- and three-segment model and return those parameter estimates with error? Ideally I would be able to provide changepoint estimates with an error term associated with each estimate, but packages like segmented and strucchange generally fail to identify changepoints in my data.
The code looks like this:
set.seed(42)
x <- c(227,227,228,228,228,228,228,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,232,233,233,233,233,233,233,233,233,236,236,236,236,237,237,237,238,238,238,238,238,238,238,238,239,239,239,239,239,239,243,244,244,244,244,244,244,244,244,244,245,245,245,246,246,246,246,247,250,250,250,250,251,251,251,251,251,251,251,251,253,253,253,257,257,260,260,260,260,260,260,260,264,264,264,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,267,267,267,267,267,267,267,267,267,267,267,267,267,271,271,271,271,271,271,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,274,274,274,274,274,274,274)
y <- c(8.43,6.9,8.93,7.33,7.28,7.24,6.62,8.36,8.17,8.07,8.07,7.63,7.6,7.54,7.37,7.31,7.21,7.03,6.93,6.88,6.82,6.78,6.7,6.5,8.35,10.97,7.48,7.46,7.28,7.17,6.72,6.68,6.08,7.42,7.14,6.92,6.68,7.49,7.28,6.67,9.4,7.54,7.04,6.89,6.88,6.52,6.45,6.39,8.48,8.04,7.52,7.35,6.9,6.57,6.86,7.46,7.39,7.16,7.08,6.83,6.83,6.7,6.54,6.47,9.75,7.38,5.96,10.49,8.32,7.22,7.05,8.55,10.34,8.23,7.9,7.31,8.18,7.8,7.31,7.18,7.17,7.13,7.02,6.84,10.62,10.09,9.26,10.8,10.37,10.9,10.52,10.23,9.28,9.18,8.85,8.81,11.03,8.84,6.29,11.36,10.91,10.87,10.4,10.17,9.61,9.5,9.36,9.17,9.13,8.88,8.73,8.55,8.37,8.33,8.25,7.82,6.9,9.77,9.53,9.39,9.1,8.93,8.68,8.64,8.47,8.41,8.38,8.28,8.18,7.74,10.67,10.64,10.54,10.36,10.35,7.03,9.51,9.37,9.24,9.22,9.18,8.96,8.95,8.94,8.89,8.82,8.79,8.72,8.35,8.22,8.13,8.07,7.91,7.85,7.79,8.82,8.59,8.44,8.42,8.37,8.06,7.34)
df <- data.frame(x, y)
#Writing the formula for a three-segment line
three_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 0 + x, #segment 2, specifying a changepoint and joined slope
y ~ 1 ~ 0 + x #segment 3, specifying a changepoint and joined slope
)
#Writing the formula for a two-segment line
two_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 0 + x #segment 2, specifying a changepoint and joined slope
)
#Disjoining the slopes of the two segments
test_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 1 + x #segment 2, specifying a changepoint and disjoined slope
)
#Disjoining the slopes of the three expected segments
test_three_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 1 + x, #segment 2, specifying a changepoint and disjoined slope
y ~ 1 ~ 1 + x #segment 3, specifying a changepoint and disjoined slope
)
These are the models I've tried, but none achieve consistently tolerable rhat values. The code I use to test them is as follows:
#An example run, although each of these models fails to provide suitable rhat values
library(mcp)
fit1 <- mcp(three_segment_model, df, chains = 4, iter = 10000, cores = 3)
plot(fit1)
fit1_summary <- data.frame(summary(fit1))
Following the literature, I have also tried priors (though admittedly I cannot make heads or tails of the truncation and distribution syntax in mcp), but the resulting fits do not seem to reflect the patterns in my data, which this modelling approach identifies poorly.
If your issue is primarily a practical one, upping the number of iterations and chains reveals that the posterior is quite reproducible across chains for the "problematic" parameters:
fit1 <- mcp(three_segment_model, df, chains = 6, iter = 50000, cores = 6)
plot_pars(fit1, c("cp_1", "cp_2", "x_1"))
You also get good rhat values for the models that do not have three segments, all of which points to the three-segment model being unidentifiable with these data. Without knowing the underlying process, I did try some fairly informative priors:
prior = list(
x_1 = 0, # fixed horizontal!
x_2 = "dnorm(0, 1) T(0, )" # Positive slope
)
but they did not improve the rhat values for the changepoint parameters.
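For intuition about what the rhat values discussed here measure, below is a minimal base-R sketch of the classic Gelman-Rubin statistic. It is not necessarily the exact variant that mcp reports, and the function name rhat_sketch and the simulated chains are assumptions made purely for illustration; neither is part of mcp.

```r
# Sketch of the classic Gelman-Rubin potential scale reduction factor.
# `chains` is a matrix with one column per chain and one row per iteration.
# rhat_sketch is a hypothetical helper, not an mcp function.
rhat_sketch <- function(chains) {
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  W <- mean(apply(chains, 2, var))     # mean within-chain variance
  B <- n * var(chain_means)            # between-chain variance
  var_hat <- (n - 1) / n * W + B / n   # pooled estimate of posterior variance
  sqrt(var_hat / W)
}

set.seed(1)
# Four simulated chains that all sample the same distribution
good_chains <- matrix(rnorm(4000), ncol = 4)
# Same chains, but the first one is stuck around a different value
bad_chains <- good_chains
bad_chains[, 1] <- bad_chains[, 1] + 3

rhat_sketch(good_chains)  # close to 1
rhat_sketch(bad_chains)   # well above 1.1
```

A chain that settles on a different changepoint location from the others inflates the between-chain variance in the same way as the shifted chain here, which is one way a weakly identified changepoint can produce high rhat values even when each individual chain looks stable.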
I am working on predicting from a gam model with a random effect to produce a 3D surface plot with plot_ly.
Here is my code:
library(mgcv)
x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1,times=100) # random effect
r[51:100] <- 2 # replace 1 into 2, making two groups
df <- data.frame(x, y, z, r)
gam_fit <- gam(z ~ s(x) + s(y) + s(r,bs="re"), data = df) # fit
#create matrix data for `add_surface` function in `plot_ly`
newx <- seq(0, 1, len=20)
newy <- seq(0, 1, len=30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as matrix
However, the last line results in an error:
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
not all required variables have been supplied in newdata!
Thanks to a previous answer, I know that the code above works without the random effect, as shown here.
How can I predict from gam models with a random effect?
Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.
The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.
Option 1:
newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
z1 <- matrix(z1, 20, 30)
Option 2:
z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
newdata.guaranteed=TRUE)
z2 <- matrix(z2, 20, 30)
These produce the same result:
> all.equal(z1, z2)
[1] TRUE
A couple of notes:
Which you use will depend on how complex the rest of your model is. I would generally use the first option as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates, it seems safe enough to trust that newdata is OK.
Your example uses a random slope (was that intended?), not a random intercept as r is not a factor. If your real example uses a factor random effect then you'll need to be a little more careful when creating the newdata as you need to get the levels of the factor right. For example:
expand.grid(x = newx, y = newy,
r = with(df, factor(2, levels = levels(r))))
should get the right set-up for a factor r
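As a sketch of the factor case mentioned in the last note, the example below refits a model like the one in the question with r stored as a factor (so that s(r, bs = "re") is a random intercept) and then predicts the surface with the random effect excluded. The object names (df_f, fit_f, newd, z_a, z_b) and the level labels "a" and "b" are assumptions for illustration only.

```r
library(mgcv)

set.seed(1)
# Simulated data in the spirit of the question, but with r as a factor
df_f <- data.frame(x = runif(100), y = runif(100),
                   r = factor(rep(c("a", "b"), each = 50)))
df_f$z <- df_f$x^2 + df_f$y + rnorm(100)

fit_f <- gam(z ~ s(x) + s(y) + s(r, bs = "re"), data = df_f, method = "REML")

newx <- seq(0, 1, length.out = 20)
newy <- seq(0, 1, length.out = 30)
newd <- expand.grid(x = newx, y = newy)

# Supply a valid level of r, carrying over the full set of levels
newd$r <- factor("b", levels = levels(df_f$r))
z_b <- matrix(predict(fit_f, newdata = newd, exclude = "s(r)"), 20, 30)

# Because s(r) is excluded, the chosen level should not matter
newd$r <- factor("a", levels = levels(df_f$r))
z_a <- matrix(predict(fit_f, newdata = newd, exclude = "s(r)"), 20, 30)

all.equal(z_a, z_b)
```

Since expand.grid() varies x fastest, the rows of each matrix correspond to newx and the columns to newy.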
I fitted a glmmTMB model using family = nbinom1. Now I would like to perform a simulation of data based on predicted values and the dispersion. However, from the help files, it looks like the go-to rnbinom function uses the family=nbinom2 parameterization where variance is equal to mu + mu^2/size.
1) Can anyone help me figure out how to simulate family=nbinom1 data (where variance is equal to mu + mu*size)?
2) Also, is my extraction / use of the dispersion value as size correct?
Thanks so much!
Current code (data not provided, because it doesn't matter), using the stats::rnbinom function despite the mismatch in the variance definition:
library(glmmTMB)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = size)
We can try to simulate nbinom1 data, where the variance is mu + mu*k:
set.seed(111)
k = 2
x = runif(100,min=1,max=3)
y = rnbinom(100,mu=exp(2*x),size=exp(2*x)/k)
ID = sample(1:2,100,replace=TRUE)
df = data.frame(x,y,ID)
mod <- glmmTMB(y ~ x + (1 | ID), data = df, family = nbinom1)
sigma(mod)
[1] 1.750076
In the above, for every mean, mu, I specified a size of mu / k so that the expected variance is mu + mu*k (since rnbinom has variance mu + mu^2/size). The estimate from sigma(mod) is close to the k = 2 used in the simulation, which shows that as long as you parameterize rnbinom correctly, you get back nbinom1.
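The parameterization described above can be checked directly with base R, without fitting any model. This is only a sketch: the values mu = 10 and k = 2 and the object name sim_check are arbitrary choices for illustration. With size = mu / k, the rnbinom variance mu + mu^2/size becomes mu + mu*k, which is 30 here.

```r
set.seed(111)
k  <- 2    # nbinom1-style dispersion (arbitrary illustrative value)
mu <- 10   # a single fixed mean (arbitrary illustrative value)

# rnbinom() has variance mu + mu^2 / size, so size = mu / k gives mu + mu * k
sim_check <- rnbinom(1e5, mu = mu, size = mu / k)

mean(sim_check)     # should be close to mu = 10
var(sim_check)      # should be close to mu + mu * k = 30
mu + mu^2 / (mu / k)  # theoretical variance under this parameterization
```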
Now with this model, if we need to simulate data, it's just using the same parameterization as above:
preds <- predict(mod, type = "response")
size <- sigma(mod)
sim <- rnbinom(nrow(df), mu = preds, size = preds/size)
plot(sim,df$y)
There are a variety of issues here, including:
sigma(mod) gives the estimated standard deviation of the residuals; it is not a variance but the square-root of a variance, so you might want to square it.
there are many parametrisations of a negative binomial distribution beyond R's version, but in R's version, if the mean is mean(dat) and the variance var(dat) then you can estimate size with mean(dat)^2/(var(dat)-mean(dat)) and the probability prob with mean(dat)/var(dat)
rnbinom() will tolerate size being non-integer or infinite despite this being a theoretical nonsense; it will not tolerate size being negative, which can happen if var(dat) is less than mean(dat). It will also have problems if the mean is negative or if size is zero.
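The moment estimators in the second point can be tried on simulated data. This is a minimal sketch in base R; the true values size = 5 and prob = 0.25 and the object names (dat, size_hat, prob_hat) are assumptions chosen for illustration.

```r
set.seed(42)
# Simulate from R's size/prob parameterization with known values
dat <- rnbinom(1e5, size = 5, prob = 0.25)

m <- mean(dat)  # theoretical mean: size * (1 - prob) / prob = 15
v <- var(dat)   # theoretical variance: size * (1 - prob) / prob^2 = 60

# Method-of-moments estimates; only sensible when v > m (overdispersion)
size_hat <- m^2 / (v - m)
prob_hat <- m / v

size_hat  # should be close to 5
prob_hat  # should be close to 0.25
```

If v were less than or equal to m, size_hat would be negative or undefined, which is the situation the third point warns about.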
So perhaps you could consider adapting your simulation lines to something like
sizes <- ifelse(sigma(mod) ^ 2 > preds, preds ^ 2 / (sigma(mod) ^ 2 - preds), Inf)
sim <- ifelse(preds > 0, rnbinom(nrow(df), mu = preds, size = sizes), 0)
though you might still get errors when sigma(mod) is less than or equal to preds.