I have been implementing some negative binomial hurdle models in the R package glmmTMB and have come across something perplexing about the truncated negative binomial family.
Examining the source for that family, I found:
truncated_nbinom2 <- function(link = "log") {
    r <- list(family = "truncated_nbinom2",
              variance = function(mu, theta) {
                  stop("variance for truncated nbinom2 family not yet implemented")
              })
    return(make_family(r, link))
}
Is this family still under development, as the stop() call in the variance function suggests?
It is documented as working in the vignette, and I am getting reasonable estimates from the models I have fit using this family (e.g., data simulated from the model look sensible). I know many of the package authors are on this forum, so I hoped someone might be able to clarify.
The truncated_nbinom2 family should work fine for most purposes. Looking through the glmmTMB source code (grep "\$variance" R/*.R), the $variance component of the family object is used only for:
- computing Pearson residuals
- creating objects to be used by the effects package
You may run into trouble elsewhere in the pipeline if you're using downstream packages that need the expected variance of an object to compute something, but everything else should be fine.
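To illustrate, here is a minimal sketch with simulated data (object and data names are mine, not from the question): fitting and simulating work, and only a call that needs the variance function, such as Pearson residuals, should hit the stop().

library(glmmTMB)
set.seed(101)
dat <- data.frame(x = rnorm(100),
                  y = rpois(100, lambda = 3) + 1)  ## strictly positive counts
fit <- glmmTMB(y ~ x, family = truncated_nbinom2, data = dat)
summary(fit)         ## works
head(simulate(fit))  ## works
## residuals(fit, type = "pearson")  ## would hit the stop() in $variance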
PS I found an expression for this variance and created an issue to remind us to implement it: https://github.com/glmmTMB/glmmTMB/issues/606
PPS this is in the development version now. (Unfortunately, I'm pretty sure the paper I found only covers truncated NB2, so truncated NB1 may have to wait a while. However, the answer above still applies: the absence of a variance function will only cause trouble in a few circumstances, and should never cause subtle trouble ...)
I am trying to optimize the averaged prediction of two logistic regressions in a classification task, using a superlearner.
My measure of interest is classif.auc.
The mlr3 help file tells me (?mlr_learners_avg)
Predictions are averaged using weights (in order of appearance in the
data) which are optimized using nonlinear optimization from the
package "nloptr" for a measure provided in measure (defaults to
classif.acc for LearnerClassifAvg and regr.mse for LearnerRegrAvg).
Learned weights can be obtained from $model. Using non-linear
optimization is implemented in the SuperLearner R package. For a more
detailed analysis the reader is referred to LeDell (2015).
I have two questions regarding this information:
When I look at the source code, I think LearnerClassifAvg$new() defaults to "classif.ce" — is that true?
I think I could set it to classif.auc with param_set$values <- list(measure="classif.auc",optimizer="nloptr",log_level="warn")
The help file refers to the SuperLearner package and LeDell (2015). If I understand it correctly, the "AUC-Maximizing Ensembles through Metalearning" solution proposed in that paper is, however, not implemented in mlr3? Or am I missing something? Could this solution be applied in mlr3? In the mlr3 book I found a paragraph about calling an external optimization function; would that be possible for SuperLearner?
As far as I understand it, LeDell (2015) proposes and evaluates a general strategy that optimizes AUC as a black-box function by learning optimal weights. The paper does not really propose a best strategy or any concrete defaults, so I looked into the defaults of the SuperLearner package's AUC optimization strategy.
Assuming I understood the paper correctly:
The LearnerClassifAvg basically implements what is proposed in LeDell (2015): it optimizes the weights for any metric using non-linear optimization, while LeDell (2015) focuses on the special case of optimizing AUC. As you rightly pointed out, by setting the measure to "classif.auc" you get a meta-learner that optimizes AUC. The default choice of optimization routine differs between mlr3pipelines and the SuperLearner package: we use NLOPT_LN_COBYLA, while SuperLearner "... uses the Nelder-Mead method via the optim function to minimize rank loss" (from its documentation).
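As a minimal sketch (assuming the mlr3pipelines API at the time of writing; this just puts your snippet in context):

library(mlr3)
library(mlr3pipelines)

lrn_avg <- LearnerClassifAvg$new()
lrn_avg$predict_type <- "prob"  ## AUC needs probability predictions
lrn_avg$param_set$values <- list(measure   = "classif.auc",
                                 optimizer = "nloptr",
                                 log_level = "warn")
## lrn_avg can then serve as the weight-learning step in a stacking pipeline.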
So in order to get exactly the same behaviour, you would need to implement a Nelder-Mead bbotk::Optimizer (similar to here) that simply wraps stats::optim with method "Nelder-Mead", and carefully compare settings and stopping criteria. I am fairly confident that NLOPT_LN_COBYLA delivers somewhat comparable results; LeDell (2015) has a comparison of the different optimizers for further reference.
Thanks for spotting the error in the documentation. I agree that the description is a little unclear, and I will try to improve it!
Good afternoon, all--thank you in advance for your help! I'm somewhat new to R, so my apologies if this is a trivial or otherwise inappropriate question.
TL;DR: I'm trying to determine variable importance (VIM) for factor variables with a random forest model built in RandomForestSRC, which is not a built-in feature of that package. Using both the LIME and DALEX packages, I encounter the same error: cannot coerce class 'c("rfsrc", "predict", "class")' to a data.frame. Any assistance resolving this error, or alternate approaches, would be greatly appreciated!
I have a random forest model I've built in R using the RandomForestSRC package. The model seems to work great: training and testing went fine, I got the predicted output I needed, and the results are in line with what I would expect. Unfortunately, one of the requirements is that I need to be able to indicate how the model arrived at its conclusions (e.g., I need to also include variable importance as part of the output), for both continuous and factor variables.
This doesn't seem to be a built-in feature of the RandomForestSRC package, so I've looked into both the LIME and DALEX packages, both of which should be able to break out VIM from the existing RF model. Unfortunately, neither has native support for the RFSRC package, which means I've needed to build the prediction functions myself, as recommended by this vignette: https://uc-r.github.io/dalex
model_type.rfsrc <- function(x, ...) {
  return("classification")
}

predict_model.rfsrc <- function(x, newdata, type, ...) {
  as.data.frame(predict(x, newdata, ...))  ## this coercion is what fails below
}
Unfortunately, when running the VIM section (in both LIME and DALEX), I'm asked to pass both the predicted output and the model that created it. In doing so, it hits an error in the predict_model function above:
Error in as.data.frame.default(predict(model, (newdata))) :
  cannot coerce class 'c("rfsrc", "predict", "class")' to a data.frame
And, like...of course it can't; it's trying to coerce the whole rfsrc predict object into a data frame. Unfortunately, while I think I understand why R is giving me that error, that's about as far as I've been able to figure out on my own.
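My best guess so far (untested) is that I need to hand back just the prediction matrix rather than the whole predict object — RandomForestSRC seems to store the class probabilities in the $predicted component — something like:

## Untested guess: coerce only the probability matrix stored in
## $predicted, rather than the whole rfsrc predict object.
predict_model.rfsrc <- function(x, newdata, type, ...) {
  res <- predict(x, newdata, ...)
  as.data.frame(res$predicted)
}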
Additionally, I'm using the RandomForestSRC package for two reasons: it doesn't put a limit on the number of factor variables, and it can handle imbalanced data. I'm working with medical data, so both of these are necessary (e.g., there are ~100,000 different medical codes that can be encoded in a single data variable, and the ratio of "people-who-don't-have-this-condition" to "people-who-do-have-this-condition" is frequently 100 to 1). If anyone has suggestions for alternative packages that handle these issues and have built-in VIM functionality (or integrate with DALEX / LIME), that would be fantastic as well.
Thank you all very much for your help!
I am creating a table with tab_model from the package sjPlot (https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html).
However, when I use a negative binomial rstanarm model object, tab_model re-runs MCMC chains.
My actual model takes many hours to run, so it is not ideal for tab_model to be doing this; it doesn't seem to do it for other models (such as those from glmer in lme4).
library(rstanarm)
library(lme4)

dat.nb <- data.frame(x = rnorm(200),
                     z = rep(c("A","B","C","D"), 50),
                     y = rnbinom(200, size = 1, prob = 0.5))
mod1 <- glmer.nb(y ~ x + (1|z), data = dat.nb)

options(mc.cores = parallel::detectCores())
mod2 <- stan_glmer.nb(y ~ x + (1|z), data = dat.nb)
Now to create the model tables:
library(sjPlot)
tab_model(mod1)
The output is quick and as expected (although the original model also ran quickly, so it's possible that tab_model is re-running the model here too).
Now when I try
tab_model(mod2)
it begins re-running MCMC. Is this normal behavior? And if so, is anyone familiar with a way to turn it off and just use the model object already created, rather than re-running the model?
tl;dr I think this is going to be hard to avoid without hacking both the insight package and this one, or asking the package maintainer for an edit, unless you want to forgo printing the ICC, R^2, and the random-effects variances. tab_model() calls insight::get_variance(), which tries to compute variances for the null model so that it can compute the ICC and R^2; computing those variances requires re-running the model. (When insight does this for the glmer.nb fit, it goes via lme4:::update.merMod() and is quick enough that you don't notice the computation time.)
So
tab_model(mod2,show.r2=FALSE,show.icc=FALSE,show.re.var=FALSE)
doesn't recompute anything. In principle it should be possible to skip the resampling/recomputation step with just show.r2 = FALSE, show.icc = FALSE (i.e., it shouldn't be necessary for getting the RE variances), but this would take some hacking/participation by the maintainer.
Digging in (by using debug(rstan::sampling) to stop inside the Stan sampling function, then where to see the call stack; a sketch of this recipe follows the list below), the chain of calls is:
- tab_model() calls insight::get_variance() here
- the insight::get_variance.stanreg() method calls insight:::.compute_variances()
- ... which calls insight:::.compute_variance_distribution()
- ... which (for a log-link count distribution) calls insight:::.variance_distributional()
- ... which calls null_model()
- ... which calls .null_model_mixed()
- ... which calls stats::update()
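For reference, the debugging recipe mentioned above looks like this (mod2 as in the question):

## Sketch of the debugging recipe described above.
debug(rstan::sampling)  ## drop into the browser when Stan sampling starts
tab_model(mod2)         ## triggers the re-run
## at the Browse> prompt, type `where` to print the call stack, then Q to quit
undebug(rstan::sampling)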
I know this is a kinda old question, but I just wonder if this has a solution now.
I usually fit mixed-effects models with the lmer function from the lme4 package. However, this function does not allow me to include negative variance components in the model.
I would really like to allow negative variances in a model in R. Does anyone have suggestions for which packages I could use? Or does the new lme4 allow it?
New lme4 doesn't allow it, nor does nlme. It looks like ASReml might do it, if you set the IGU argument as described here -- but ASReml is commercial, so you'd have to buy a license.
The more statistically sound way to deal with this situation is typically to fit a compound symmetry variance structure, which will allow negative correlations within groups. You can do this via nlme, or in several somewhat more experimental frameworks (e.g. the "flexLambda" development branch of lme4).
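A minimal sketch of the compound-symmetry approach in nlme (using the Orthodont data that ships with nlme; the model here is illustrative, not from the question):

## Minimal sketch: compound symmetry via nlme; the estimated "Rho" is
## allowed to be negative, unlike a random-intercept variance in lmer.
library(nlme)
fit_cs <- gls(distance ~ age, data = Orthodont,
              correlation = corCompSymm(form = ~ 1 | Subject))
summary(fit_cs)  ## look for "Rho" under the correlation structure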
More discussion from the mailing list here.
Using lmer I get the following warning:
Warning messages:
1: In optwrap(optimizer, devfun, x@theta, lower = x@lower) :
  convergence code 3 from bobyqa: bobyqa -- a trust region step failed to reduce q
This warning is generated after calling anova(model1, model2). I tried to make it reproducible, but if I dput the data and try again, the error does not reproduce on the dput data, even though the original and new data frames have the exact same str().
I have tried again in a clean session: the error reproduces, and again it is lost after a dput.
I know I am not giving people much to work with here; like I said, I would love to reproduce the problem. Can anyone shed light on this warning?
(I'm not sure whether this is a comment or an answer, but it's a bit long and might be an answer.)
The proximal cause of your difficulty with reproducing the result is that lme4 uses both environments and reference classes: these are tricky to "serialize", i.e. to translate to a linear stream that can be saved via dput() or save(). (Can you please try save() and see if it works better than dput()?)
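That is, something like this (object names match your anova() call; a sketch, not tested against your data):

## Sketch: save()/load() serialize environments, which dput() cannot.
save(model1, model2, file = "repro.RData")
## ... then, in a fresh R session:
load("repro.RData")
anova(model1, model2)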
In addition, both environments and reference classes use "pass-by-reference" semantics, so operating on the saved model can change it. anova() automatically refits the model, which makes some tiny but non-zero changes in the internal structure of the saved model object (we are still trying to track this down).
@alexkeil's comment is wrong: the nonlinear optimizers used within lme4 make no calls to the pseudo-random number generator. They are deterministic (but the two points above explain why things might look a bit weird).
To allay your concerns about the fit, I would check it by computing the gradient and Hessian at the final parameter values, e.g.
library(lme4)
library(numDeriv)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
dd <- update(fm1, devFunOnly = TRUE)
params <- getME(fm1, "theta")  ## also need beta for glmer fits
grad(dd, params)
## all values 'small', say < 1e-3
## [1] 0.0002462423 0.0003276917 0.0003415010
eigen(solve(hessian(dd, params)), only.values = TRUE)$values
## all values positive and of similar magnitude
## [1] 0.029051631 0.002757233 0.001182232
We are in the process of implementing similar checks to run automatically within lme4.
That said, I would still love to see your example, if there's a way to reproduce it relatively easily.
PS: in order to be using bobyqa, you must either be using glmer or have used lmerControl to modify the default optimizer choice ... ??