I am creating a table with tab_model from the package sjPlot (https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html).
However, when I use a negative binomial rstanarm model object, tab_model re-runs MCMC chains.
My actual model takes many hours to run, so this is not ideal for tab_model to be doing this, but it doesn't seem to do it for other models (such as with glmer in lme4).
library(rstanarm)
library(lme4)
dat.nb<-data.frame(x=rnorm(200),z=rep(c("A","B","C","D"),50),
y=rnbinom(200,size=1,prob = .5))
mod1<-glmer.nb(y~x+(1|z),data=dat.nb)
options(mc.cores = parallel::detectCores())
mod2<-stan_glmer.nb(y~x+(1|z),data=dat.nb)
Now to create the model tables:
library(sjPlot)
tab_model(mod1)
The output is quick, and as expected (although the original model also ran quick, so it is possible that tab_model is re-running the model here too).
Now when I try
tab_model(mod2)
It begins re-running MCMC. Is this normal behavior, and if so, is anyone familiar with a way to turn this off, and just use the model object already created, rather than re-running the model?
tl;dr I think this is going to be hard to avoid without hacking both the insight package and this one, or asking the package maintainer for an edit, unless you want to forgo printing the ICC, R^2, and the random-effects variance. Here, tab_model() calls insight::get_variance(), which tries to compute variances for the null model so it can compute the ICC and R^2. Computing these variances requires re-running the model. (When it does it for the glmer.nb, it goes via lme4:::update.merMod() and is quick enough that you don't notice the computation time.)
So
tab_model(mod2,show.r2=FALSE,show.icc=FALSE,show.re.var=FALSE)
doesn't recompute anything. In theory I think it should be possible to skip the resampling/recomputation step with just show.r2=FALSE, show.icc=FALSE (i.e. it shouldn't be necessary to get the RE var), but this would take some hacking/participation by the maintainer.
Digging in (by using debug(rstan::sampling) to stop inside the Stan sampling function, then where to see the call stack ...
tab_model() calls insight::get_variance() here
the insight::get_variance.stanreg() method calls insight:::.compute_variances()
... which calls insight:::.compute_variance_distribution()
... which (for a log-link count distribution) calls insight:::.variance_distributional()
... which calls null_model
... which calls .null_model_mixed()
... which calls stats::update()
Related
I have been implementing some negative binomial hurdle models in the R package glmmTMB and have come across something perplexing about the truncated negative binomial family.
In examining the source for that family argument I have found:
truncated_nbinom2 <- function(link="log") {
r <- list(family="truncated_nbinom2",
variance=function(mu,theta) {
stop("variance for truncated nbinom2 family not yet implemented")
})
return(make_family(r,link))
}
I am wondering if this family is still under development (as indicated by the stop command in the variance)?
It is documented as working in the vignette, and I am getting reasonable estimates from the models I have fit using this family (e.g. simulated data from the model seem sensible). I know many of the authors of the package are on this forum so I hoped someone might be able to clarify.
The truncated_nbinom2 family should work fine for most purposes. Looking through the glmmTMB source code (grep "\$variance" R/*.R) the $variance component of the family object is used only:
computing Pearson residuals
in creating objects to be used by the effects package
You may run into trouble somewhere else in the pipeline, if you're using downstream packages that need to use the expected variance of an object to compute something. But everything else should be fine.
PS I found an expression for this variance and created an issue to remind us to implement it: https://github.com/glmmTMB/glmmTMB/issues/606
PPS this is in the development version now (unfortunately, I'm pretty sure the paper I found only covers truncated NB2, so truncated NB1 may have to wait a while. However, the answer still applies - the absence of a variance function will only cause trouble in a few circumstances, and should never cause subtle trouble ...)
I am using glmulti to select a set of candidate generalized linear models and my variable importance values and 'best' model keep changing each time I run the model.
I am struggling to understand why this is, does glmulti need a set.seed value to make results reproducible?
Thanks.
I believe the glmulti function is in the glmulti package. (You should state this in your question...) If so, the help page says that sometimes it uses a genetic algorithm to find the best model. Those are indeed random algorithms, so you can expect to get an answer that depends on the random number seed.
I'm using this LDA package for R. Specifically I am trying to do supervised latent dirichlet allocation (slda). In the linked package, there's an slda.em function. However what confuses me is that it asks for alpha, eta and variance parameters. As far as I understand, I thought these parameters are unknowns in the model. So my question is, did the author of the package mean to say that these are initial guesses for the parameters? If yes, there doesn't seem to be a way of accessing them from the result of running slda.em.
Aside from coding the extra EM steps in the algorithm, is there a suggested way to guess reasonable values for these parameters?
Since you are trying to generate a supervised model, the typical approach would be to use cross validation to determine the model parameters. So you hold out some of the data as your test set, train the a model on the remaining data, and evaluate the model performance, repeating k times. You then continue to repeat with different model parameters to determine which result in the best model performance.
In the specific case of slda, I would run demo(slda) to see the author's implementation of it. When you run the demo, you'll see that he sets alpha=1.0, eta=0.1, and variance=0.25. I'd suggest using these as your starting point, and then use cross validation to determine better parameters if you need to improve model performance.
I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that results from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
I may have found this a couple of years later. But you could do something like this:
xm <- maxent(predictors, pres_train) # basically the maxent model
px <- predict(predictors, xm, ext=ext, progress= '' ) #prediction
px2 <- predict(predictors, xm2, ext=ext, progress= '' ) #prediction #02
models <- stack(px,px2) # create a stack of prediction from all the models
final_map <- mean(px,px2) # Take a mean of all the prediction
plot(final_map) #plot the averaged map
xm1,xm2,.. would be the maxent models for each partitions in cross-validation, and px, px2,.. would be the predicted maps.
I am trying to calculate Gelman and Rubin's convergence diagnostic for a JAGS analysis I am currently running in R using the R package rjags.
For example, I would like to assess the convergence diagnostic for my parameter beta. To do this, I am using the library coda and the command:
library(coda)
gelman.diag(out_2MCMC$beta)
with out_2MCMC being an MCMC list object with more than one chain, resulting in correct output with no error messages or whatsoever. However, as I used a high amount of iterations as burn-in, I would like to calculate the convergence diagnostic for only a subset of iterations (the part after the burn in only!).
To do this, I tried:
gelman.diag(out_2MCMC$beta[,10000:15000,])
This resulted in the following error:
Error in mcmc.list(x) : Arguments must be mcmc objects
Therefore, I tried:
as.mcmc(out_2MCMC$beta[,10000:15000,]))
But, surprisingly this resulted in the following error:
Error in gelman.diag(as.mcmc(out_2MCMC$beta[, 10000:15000,]))
You need at least two chains
As this is the same MCMC list object as I get from the JAGS analysis and the same as I am using when I am assessing the convergence diagnostic for all iterations (which works just perfect), I don't see the problem here.
The function itself only provides the option to use the second half of the series (iterations) in the computation of the convergence diagnostic. As my burn in phase is longer than that, this is unfortunately not enough for me.
I guess that it is something very obvious that I am just missing. Does anyone have any suggestions or tipps?
As it is a lot of code, I did not provide R code to run a full 2MCMC-JAGS analysis. I hope that the code above illustrates the problem well enough, maybe someone encountered the same problem before or recognizes any mistakes in my syntax. If you however feel that the complete code is necessary to understand my problem I can still provide example code that runs a 2MCM JAGS analysis.
I was looking for a solution to the same problem and found that the window() function in the stats package will accomplish the task. For your case:
window(out_2MCMC, 100001,15000)
will return an mcmc.list object containing the last 5,000 samples for all monitored parameters and updated attributes for the new list.