I am developing an iterative algorithm that uses quantile regression models at each iteration. For that I use the rq function from the quantreg package in R. So far it has worked fine. However, I have found a dataset where, at one of the iterations, the rq function simply gets stuck. No error message, no warning. It simply goes on as if still working, but never finishes computation.
Here is a minimal code example. You can download the problematic data from this link:
https://www.dropbox.com/s/yrlotit1ovk9yzd/r555.RData?dl=0
library(quantreg)
load('~/r555.RData')
dependent = r$dependent
independent = r$independent
quantreg::rq(dependent ~ -1 + independent, tau=0.1)
If you execute the code above, the rq function will get stuck and never finish. Be aware that the data provided is part of the iterative process I am developing, so it has no direct interpretation by itself. I am writing to ask about possible reasons for this behaviour and possible solutions.
I don't know if it matters, but I have tested this on two different computers running Windows 10, using different versions of the quantreg package.
Changing the default method="br" to method="fn" fixes the problem.
quantreg::rq(dependent ~ -1 + independent, tau=0.1, method="fn")
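To compare the two algorithms on a case where both terminate, here is a minimal sketch on synthetic data. The variables X and y and their distributions are assumptions made for illustration; they are not the problematic dataset.

```r
library(quantreg)

set.seed(1)
n <- 500
X <- cbind(rnorm(n), rnorm(n))            # synthetic predictors (illustrative only)
y <- drop(X %*% c(1, -2)) + rnorm(n)      # synthetic response

# Default method = "br" (simplex-type algorithm)
fit_br <- rq(y ~ -1 + X, tau = 0.1)
# Interior-point (Frisch-Newton) algorithm
fit_fn <- rq(y ~ -1 + X, tau = 0.1, method = "fn")

# Side-by-side coefficient estimates
cbind(br = coef(fit_br), fn = coef(fit_fn))
```

On well-behaved data the two sets of estimates should agree closely, so switching to method="fn" is a reasonable workaround when "br" does not terminate.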
I am creating a table with tab_model from the package sjPlot (https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html).
However, when I use a negative binomial rstanarm model object, tab_model re-runs MCMC chains.
My actual model takes many hours to run, so it is not ideal for tab_model to do this, and it doesn't seem to do it for other models (such as with glmer in lme4).
library(rstanarm)
library(lme4)
dat.nb<-data.frame(x=rnorm(200),z=rep(c("A","B","C","D"),50),
y=rnbinom(200,size=1,prob = .5))
mod1<-glmer.nb(y~x+(1|z),data=dat.nb)
options(mc.cores = parallel::detectCores())
mod2<-stan_glmer.nb(y~x+(1|z),data=dat.nb)
Now to create the model tables:
library(sjPlot)
tab_model(mod1)
The output is quick, and as expected (although the original model also ran quickly, so it is possible that tab_model is re-running the model here too).
Now when I try
tab_model(mod2)
It begins re-running MCMC. Is this normal behavior, and if so, is anyone familiar with a way to turn this off, and just use the model object already created, rather than re-running the model?
tl;dr I think this is going to be hard to avoid without hacking both the insight package and this one, or asking the package maintainer for an edit, unless you want to forgo printing the ICC, R^2, and the random-effects variance. Here, tab_model() calls insight::get_variance(), which tries to compute variances for the null model so it can compute the ICC and R^2. Computing these variances requires re-running the model. (When it does it for the glmer.nb, it goes via lme4:::update.merMod() and is quick enough that you don't notice the computation time.)
So
tab_model(mod2,show.r2=FALSE,show.icc=FALSE,show.re.var=FALSE)
doesn't recompute anything. In theory I think it should be possible to skip the resampling/recomputation step with just show.r2=FALSE, show.icc=FALSE (i.e. it shouldn't be necessary to get the RE var), but this would take some hacking/participation by the maintainer.
Digging in (by using debug(rstan::sampling) to stop inside the Stan sampling function, then where to see the call stack):
tab_model() calls insight::get_variance()
the insight::get_variance.stanreg() method calls insight:::.compute_variances()
... which calls insight:::.compute_variance_distribution()
... which (for a log-link count distribution) calls insight:::.variance_distributional()
... which calls null_model
... which calls .null_model_mixed()
... which calls stats::update()
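The same kind of inspection can be tried on a toy example without Stan. Base R's sys.calls() returns the calls currently on the stack, which is the information that where prints inside the debugger. The functions outer_fn, middle_fn and inner_fn are hypothetical names used only for this sketch.

```r
# Innermost function: capture the call stack at this point
inner_fn <- function() {
  as.list(sys.calls())
}
middle_fn <- function() inner_fn()
outer_fn <- function() middle_fn()

stack <- outer_fn()

# Turn each call into a readable string, outermost call first
calls <- vapply(stack, function(cl) deparse(cl)[1], character(1))
print(calls)
```

The last entries of calls show the chain outer_fn() -> middle_fn() -> inner_fn(), in the same way the list above was read off from the debugger.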
I have been implementing some negative binomial hurdle models in the R package glmmTMB and have come across something perplexing about the truncated negative binomial family.
In examining the source for that family argument I have found:
truncated_nbinom2 <- function(link="log") {
r <- list(family="truncated_nbinom2",
variance=function(mu,theta) {
stop("variance for truncated nbinom2 family not yet implemented")
})
return(make_family(r,link))
}
I am wondering if this family is still under development (as indicated by the stop command in the variance)?
It is documented as working in the vignette, and I am getting reasonable estimates from the models I have fit using this family (e.g. simulated data from the model seem sensible). I know many of the authors of the package are on this forum so I hoped someone might be able to clarify.
The truncated_nbinom2 family should work fine for most purposes. Looking through the glmmTMB source code (grep "\$variance" R/*.R), the $variance component of the family object is used only for:
computing Pearson residuals
creating objects to be used by the effects package
You may run into trouble somewhere else in the pipeline, if you're using downstream packages that need to use the expected variance of an object to compute something. But everything else should be fine.
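If you want to check up front whether a family's variance function is usable before handing a model to downstream code, one option is to probe it inside tryCatch(). This is only a sketch: variance_available() is a hypothetical helper, and mock_family is a stand-in list that mimics the stop() shown in the question rather than the real glmmTMB object.

```r
# Hypothetical helper: TRUE if calling fam$variance(...) does not raise an error
variance_available <- function(fam, ...) {
  out <- tryCatch(fam$variance(...), error = function(e) NULL)
  !is.null(out)
}

# Stand-in for a family whose variance function is not implemented
mock_family <- list(
  family = "mock_truncated",
  variance = function(mu, theta) {
    stop("variance for this family not yet implemented")
  }
)

variance_available(poisson(), 2)       # stats::poisson() has a working variance function
variance_available(mock_family, 2, 1)  # the stand-in raises an error when called
```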
PS I found an expression for this variance and created an issue to remind us to implement it: https://github.com/glmmTMB/glmmTMB/issues/606
PPS this is in the development version now. Unfortunately, I'm pretty sure the paper I found only covers truncated NB2, so truncated NB1 may have to wait a while. However, the answer still applies: the absence of a variance function will only cause trouble in a few circumstances, and should never cause subtle trouble.
I'm trying to fit approximately 70,000 values as a function of two variables using the loess() function several times. I want to use this fit to de-trend the data. My problem is that once I start the loess function, the R session takes up all available cores on the system, which is inconsiderate towards other users on the same computing cluster.
The relevant code would be analogous to the following:
# Approximation of the data
df <- data.frame(y = rpois(70000, rnorm(70000, 10, 2)), # y is count data
x = 50000 - rpois(70000, 100),
z = runif(70000))
# The problematic operation
fit <- loess(y ~ x + z, data = df)
When I run this example on my local machine, it only takes up 1 core, but on the cluster it takes as many cores as it can get (up to 48). Ideally, I would like loess() to run on only 1 core.
I've tried to trace any multicore parameters in the code of loess, which I couldn't find. I know that loess calls stats:::simpleLoess, which in turn calls C code, which in turn calls Fortran code. I have no experience in C or Fortran and I haven't been able to figure out how I can restrict the CPU usage for this function.
Does anyone have any suggestions on how I can limit the CPU usage of the loess function?
I am not knowledgeable enough to comment on the specifics of how all of this works, but I know that the compiled C++ and Fortran code used by R is often built with the OpenMP framework for multi-threaded programming. Empirically, I do know that your issue can be resolved if you set the OMP_NUM_THREADS environment variable before you launch R, or if you set it within an R session.
Let's say you wanted to use 2 threads for the loess function. Before you launch R, you would do this ($ to signify typing this in a shell session):
$ OMP_NUM_THREADS=2 R [whatever other options you use to launch R]
Here's how to do it from within R (> to indicate an interactive R session):
> Sys.setenv("OMP_NUM_THREADS" = 2)
If you ever need to check the variable from within R, you can do the following (this will return a character vector with the number):
> Sys.getenv("OMP_NUM_THREADS")
# The result in our example will be "2"
For completeness, be sure to use ?Sys.setenv or ?Sys.getenv if you wish to get more information about those functions, and check out this site for details about OMP_NUM_THREADS.
Hope that helps!
So McG led me down a path that eventually gave me the ability to control the number of cores, which I'll post as another answer.
There were a few details I foolishly neglected to mention, namely that I was working on an RStudio server. For all other purposes, I indeed think that McG's answer would be excellent.
That answer helped me get the correct terms to google, and strolling around the search results I stumbled upon this thread that suggested that the RhpcBLASctl package has a function to set the number of cores as follows:
blas_set_num_threads(2)
Setting this in an RMarkdown document before running loess kept my CPU usage at 200% during the loess call that had been problematic before.
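Put together, a minimal sketch of this approach might look as follows. It assumes the RhpcBLASctl package is installed (the calls are skipped otherwise) and uses a much smaller synthetic data frame than the original 70,000 rows, purely so that it runs quickly.

```r
# Restrict BLAS and OpenMP threads to 1 before the heavy computation,
# if RhpcBLASctl is available
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(1)
  RhpcBLASctl::omp_set_num_threads(1)
}

set.seed(42)
n <- 2000                                  # reduced size, for illustration only
df <- data.frame(x = runif(n), z = runif(n))
df$y <- rpois(n, 10)                       # count response

fit <- loess(y ~ x + z, data = df)
detrended <- residuals(fit)                # de-trended values
```

Whether the BLAS or the OpenMP setting is the one that matters depends on how R was built on the cluster, so setting both is a cautious choice here rather than a requirement.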
I am using the stepAIC function in R to run a stepwise regression on a dataset with 28 predictor variables. The backwards method is working perfectly; however, the forward method has been running for the past half an hour with no output whatsoever thus far. I believe it has to do with how I'm defining the scope attribute. Without the scope it runs instantaneously, but only one step is run and no changes are made to the initial model.
Bstep.NNN.top<-step(lm(Response~X1+X2+X3+X4+.....+X28,data=df),
scope=list(upper=Response~X1*X2*X3*.....*X28,lower=Response~1),direction=c("forward"))
Does anybody know of a method that is quicker to run? Or if there is a way to simplify the scope attribute to a point where the run time will decrease?
Thanks
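For context, an upper scope of X1*X2*...*X28 expands to every interaction up to order 28, that is 2^28 - 1 candidate terms, which goes a long way towards explaining the run time. When no scope is given, step() uses the initial model as the upper model, so a forward search that starts from the full main-effects model has nothing to add. A minimal sketch of a cheaper forward search, assuming main effects are enough, starts from the intercept-only model; the data below are synthetic with 5 predictors instead of 28.

```r
set.seed(123)
n <- 200
df <- as.data.frame(matrix(rnorm(n * 5), n, 5,
                           dimnames = list(NULL, paste0("X", 1:5))))
df$Response <- 2 * df$X1 - 3 * df$X2 + rnorm(n)   # only X1 and X2 matter

# Start from the intercept-only model and let step() add terms
null_fit <- lm(Response ~ 1, data = df)
upper <- reformulate(paste0("X", 1:5), response = "Response")

fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = upper),
            direction = "forward",
            trace = 0)                            # suppress per-step printing
formula(fwd)
```

If some interactions are needed, restricting the upper scope to a limited order (for example two-way terms only) keeps the candidate set far smaller than the full product.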
My model failed with the following error:
Compiling rjags model...
Error: The following error occured when compiling and adapting the model using rjags:
Error in rjags::jags.model(model, data = dataenv, inits = inits, n.chains = length(runjags.object$end.state), :
Error in node Y[34,10]
Observed node inconsistent with unobserved parents at initialization.
Try setting appropriate initial values.
I have done some diagnosis and found that there was a problem with the initial values in chain 3. However, this can happen from time to time. Is there any way to tell run.jags or JAGS itself to re-try and re-run the model in such cases? For example, to tell it to make another N attempts to initialize the model properly. That would be a very logical thing to do instead of just failing. Or do I have to do it manually with some tryCatch thing?
P.S.: note that I am currently using run.jags to run JAGS from R.
There is no facility for that provided within runjags, but it would be fairly simple to write yourself like so:
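success <- FALSE
while(!success){
  # try() returns an object of class 'try-error' if run.jags() fails
  s <- try(results <- run.jags(...))
  # inherits() is safe even when the result has more than one class
  success <- !inherits(s, 'try-error')
}
results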
[Note that if this model NEVER works, the loop will never stop!]
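A bounded variant of the loop above avoids the infinite-loop risk by giving up after a fixed number of attempts. This is only a sketch: retry() is a hypothetical helper, and flaky() is a toy stand-in for a call such as run.jags(...) that fails on its first two attempts.

```r
# Hypothetical helper: call fun() up to max_attempts times, return the first success
retry <- function(fun, max_attempts = 5) {
  for (attempt in seq_len(max_attempts)) {
    result <- try(fun(), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
  }
  stop("all ", max_attempts, " attempts failed; last error: ",
       conditionMessage(attr(result, "condition")))
}

# Toy stand-in for the model call: fails twice, then succeeds
counter <- 0
flaky <- function() {
  counter <<- counter + 1
  if (counter < 3) stop("initialization failed")
  "ok"
}

res <- retry(flaky, max_attempts = 5)
```

Retrying can only help if something differs between attempts, for example randomly generated initial values.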
A better idea might be to specify an initial values function/list that provides initial values that are guaranteed to work (if possible).
In runjags version 2, it will be possible to recover successful simulations when some simulations have crashed, so if you ran (say) 5 chains in parallel then if 1 or 2 crashed you would still have 3 or 4. That should be released within the next couple of weeks, and contains a large number of other improvements.
Usually when this error occurs it is an indication of a serious underlying problem. I don't think a strategy of "try again" is useful in general (and especially because default initial values are deterministic).
The default initial values generated by JAGS are given by a "typical" value from the prior distribution (e.g. mean, median, or mode). If it turns out that this is inconsistent with the data then there are typically two possible causes:
A posteriori constraints that need to be taken into account, such as when modelling censored survival data with the dinterval distribution
Prior-data conflict, e.g. the prior mean is so far away from the value supported by the data that it has zero likelihood.
These problems remain the same when you are supplying your own initial values.
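For the censoring case, a common pattern is to supply initial values for the unobserved times that already satisfy the constraint. The sketch below assumes right-censoring with a model line of the form is.censored[i] ~ dinterval(t[i], t.cen[i]); all data values and variable names here are hypothetical.

```r
# Hypothetical data: observed times are NA where the observation is right-censored
t_obs <- c(2.3, NA, 4.1, NA, 0.7)
t_cen <- c(5, 3, 5, 4, 5)                 # censoring times
is_censored <- as.integer(is.na(t_obs))   # 1 = censored, 0 = observed

# Initial values: NA for observed nodes, a value beyond the censoring
# time for the unobserved (censored) nodes
t_init <- ifelse(is.na(t_obs), t_cen + 1, NA_real_)
inits <- list(t = t_init)
```

The resulting inits list would then be passed as the initial values for each chain, so that every censored node starts on the side of the censoring time that the data require.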
If you think you can generate good initial values most of the time, with occasional failures, then it might be worth repeated attempts inside a call to try() but I think this is an unusual case.