brms model not converging - r

I have this brms model
library(brms)
library(dplyr)
x = rep( c(-20:20,-20:20), 5)
y = c(x[1:41]^2, (x[42:82]+5)^2)
group = c(rep("A",41), rep("B",41) )
data = data.frame( x= x, y = y , group = group)
f = brm(y~ gp(x, cov ="exp_quad") +(1|group), data = data, control = list( adapt_delta = .95) )
f
and the model does not converge. I get these warnings:
Warning messages:
1: The model has not converged (some Rhats are > 1.1). Do not analyse the results!
We recommend running more iterations and/or setting stronger priors.
2: There were 1644 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help.
See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Any idea how to get this to fit ?

Brian is most likely correct: your test data contain no noise, since y is an exact function of x within each group. Assuming this was only a toy example and your real dataset does have noise, follow the directions in the warning messages. I would try calling brm with these changes:
f = brm(y~ gp(x, cov ="exp_quad") + (1|group), data = data, control = list( adapt_delta = .99), iter = 6000)
adapt_delta must lie strictly between 0 and 1, so if a warning tells you to raise it above 0.99, you could try 0.999. You didn't specify the number of iterations in your call, so brm used the default of iter = 2000; I have tripled that. Also, if your computer has multiple cores, set cores = 4 in the call so that each of the four default chains can run on its own core.
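Putting these suggestions together, here is a minimal sketch. It assumes the toy data are rebuilt with some Gaussian noise (sd = 10 is an arbitrary choice, not something from the question), and fit_gp is a hypothetical helper name. The helper is only defined, not run, because sampling needs brms, a Stan toolchain and several minutes.

```r
# Toy data as in the question, but with Gaussian noise added so that the
# residual standard deviation is not zero (sd = 10 is an arbitrary choice).
set.seed(1)
x <- rep(-20:20, times = 2)
group <- rep(c("A", "B"), each = 41)
y <- ifelse(group == "A", x^2, (x + 5)^2) + rnorm(length(x), sd = 10)
data <- data.frame(x = x, y = y, group = group)

# Hypothetical helper wrapping the suggested call. It assumes library(brms)
# has been attached; it is defined here but not called.
fit_gp <- function(data) {
  brm(
    y ~ gp(x, cov = "exp_quad") + (1 | group),
    data = data,
    chains = 4, cores = 4,             # one core per chain
    iter = 6000,                       # three times the default of 2000
    control = list(adapt_delta = 0.99)
  )
}
# f <- fit_gp(data)
```

If divergent transitions remain after this, the warning's other suggestion (stronger priors) is probably the next thing to try.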

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So, one solution may be simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # Weights rescaled to the sum of the sample size.
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with survey:::svyglm.survey.design, it seems like there may be a bug. I could be wrong, but by my reading, when rescale is FALSE, .survey.prob.weights never gets assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
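To see what the rescaling line in that excerpt does, here is a small base-R check. The sampling probabilities are made up for illustration and stand in for design$prob.

```r
# Hypothetical sampling probabilities for five respondents.
prob <- c(0.5, 0.5, 0.25, 0.1, 0.2)

w <- 1 / prob              # raw sampling weights
w_rescaled <- w / mean(w)  # same expression as in the excerpt above

sum(w)                     # 23
sum(w_rescaled)            # 5, i.e. the sample size
```

The rescaled weights keep the same ratios to one another; only their total changes, which matches the documentation's "rescales weights to sum to the sample size".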
There may be a workaround: assign a numeric vector to .survey.prob.weights in the global environment. I have no idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs one value per row of the data, which is 2000 here, because rep(1:2, 1000) in the sample data makes data.frame() recycle the 1000-element columns to 2000 rows.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)

MLR - calculating feature importance for bagged, boosted trees (XGBoost)

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with the MLR package in R. I am using XGBoost to make predictions and bagging to estimate prediction uncertainty. My data set is relatively large: approximately 10k features and observations. The predictions work perfectly (see code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session. I saw some related Python code where people calculate the importance for each of the bagged models, but I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access the individual models within the object produced by MLR (the mb object in the code below). In Python, this seems to be trivial. In R, I can't seem to extract mb$learner.model, which seems logically closest to what I need. So I'm wondering if anyone has had any experience with this issue?
Please see the code below
learn1 <- makeRegrTask(data = train.all , target= "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list( objective="reg:squarederror", eval_metric="error", nrounds=300, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1, it might be feasible to average the feature importance values.
Basically, you just have to apply getFeatureImportance() over all the single models stored in the HomogeneousEnsembleModel. Some extra care is necessary because the order of the features gets mixed up by the feature sampling, even though we set it to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list( objective="reg:squarederror", eval_metric="error", nrounds=50, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
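The averaging step can also be written without the merge loop. This is only a base-R sketch on made-up importance tables; it assumes each table has the columns variable and importance, which is what the merge by "variable" above suggests but which may differ between mlr versions.

```r
# Hypothetical importance tables from three bagged models; each has the same
# features but in a different row order.
fimps <- list(
  data.frame(variable = c("x1", "x2", "x3"), importance = c(0.3, 0.5, 0.2)),
  data.frame(variable = c("x3", "x1", "x2"), importance = c(0.1, 0.3, 0.6)),
  data.frame(variable = c("x2", "x3", "x1"), importance = c(0.4, 0.3, 0.3))
)

# Stack all tables, then average the importance within each feature name,
# so the row order of the individual tables no longer matters.
stacked <- do.call(rbind, fimps)
avg <- aggregate(importance ~ variable, data = stacked, FUN = mean)
avg
```

Unlike rowMeans(), this keeps the feature names next to the averaged values.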

MLR, example dependent cost of misclassification, makeCostSensWeightedPairsWrapper

This question has been seen 74 times and has received only one response (as of noon (PDT) Wed, Aug-14).
I've rewritten the question to make it as clear as possible and I'll appreciate any help.
In summary, I need a small but complete example, on a dataset with a binary response, of how to use MLR's makeCostSensWeightedPairsWrapper to obtain prediction probabilities on a test set.
In the MLR tutorial in the part on cost-sensitive classification
https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
there is a paragraph on "Example-dependent misclassification costs" and an example is given based on the iris dataset.
In the code snippet below, I modified the iris data set so as to contain only two classes as I'm interested in binary classification only.
library( mlr )
set.seed( 12347 )
n1 = 100; ntrain = 70
df = iris[ 1:n1, ] # 100 points in df so as to have two classes only (setosa and versicolor)
df$Species = factor( df$Species ) # refactor the response
# partition df into a training set (70 points) and test set (30 points)
#
ix = sample( 1:n1, ntrain, replace=FALSE )
xtest = df[ setdiff( 1:n1, ix ), ] ## test set
ntest = nrow( xtest )
xtrain = df[ ix, ] # this is the training set
# create cost matrix, same as in the MLR example
#
cost = matrix(runif(ntrain * 2, 0, 2000), ntrain) * (1 - diag(2))[xtrain$Species,] + runif(ntrain, 0, 10)
colnames(cost) = levels(xtrain$Species)
rownames(cost) = rownames(xtrain)
xtrain$Species = NULL # this is done according to the MLR example
# cost-sensitive task
#
costsens.task = makeCostSensTask(id = "xtrain", data = xtrain, cost = cost )
costsens.task
##lrn = makeLearner("classif.multinom", trace = FALSE, predict.type="prob" )
lrn = makeLearner( "classif.gbm", predict.type="prob" )
lrn = makeCostSensWeightedPairsWrapper( lrn ); lrn
mod = train(lrn, costsens.task ); mod
pred = predict( mod, newdata = xtest, pred.type="prob" );
perf = performance( pred, measures = list(auc), task = costsens.task)
# I get the following error:
# Error in FUN(X[[i]], ...) :
# You need to have a 'truth' column in your pred object for measure auc!
My original project is to do a binary classification which incorporates example-dependent misclassification costs.
The goal is to do a prediction on a test dataset, obtain the probabilities and show the performance (using ROCR, for which there are MLR-mapping functions).
NOTE: The learners I've tried are 'classif.multinom' and (I guessed) 'classif.gbm' as the two that might be compatible with the weighted pairs wrapper.
My questions are:
Q1: Where in the code snippet and how to specify that I want probabilities as an output of the cost-sensitive classifier?
Q2: Which learner can be used so as to produce classification probabilities?
Q3: How to avoid the error above and get the class probabilities?
Once again, I'd really appreciate any help, even more so if there is anyone who can answer promptly.
OK, after a number of days and almost a hundred views of this question (about 20 are mine :) there has been only one comment and no answers.
From what I could understand from the available MLR documentation, it seems that the output of the example-dependent cost-sensitive method (makeCostSensWeightedPairsWrapper) is labels only, with no prediction probabilities.
In other words, no probabilities are available from a cost-sensitive task; only the new labels are given, which are in turn computed from the probabilities of the base classifier.
So this is the answer I'll accept.
As for MLR errors, in this case at least, it would be helpful to get an explicit error message instead of a spurious one, or simply to note this in the documentation.
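With labels only, ROC-type measures such as auc are out, but predicted labels can still be scored against a cost matrix. Below is a minimal base-R sketch of that idea. The costs and predictions are made up, it assumes per-example costs are also known for the test rows, and nothing in it calls mlr.

```r
# Hypothetical per-example costs for four test observations: one row per
# observation, one column per class, laid out like the cost matrix above.
cost <- matrix(c( 0, 50,
                 80,  0,
                  0, 20,
                 10,  0),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("setosa", "versicolor")))

# Hypothetical predicted labels, standing in for the wrapper's response.
pred <- c("setosa", "setosa", "versicolor", "versicolor")

# For each row, look up the cost of the class that was predicted.
incurred <- cost[cbind(seq_along(pred), match(pred, colnames(cost)))]
mean(incurred)  # (0 + 80 + 20 + 0) / 4 = 25
```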

vegan::ordiR2step() doesn't find best-fit model

The vegan package includes the ordiR2step() function for model building, which can be used to identify the most important variables using the R2 and the p-value as goodness-of-fit measures. However, for the dataset I was recently working with, the function doesn't return the best-fit model.
library(vegan) # must be loaded before rda() and ordiR2step() are called
# data
RIKZ <- read.table("http://www.uni-koblenz-landau.de/en/campus-landau/faculty7/environmental-sciences/landscape-ecology/Teaching/RIKZ_data/at_download/file", header = TRUE)
# data preparation: drop sites where none of the species occur
Species <- RIKZ[ , 2:5]
ExplVar <- RIKZ[ , 9:15]
Species_fin <- Species[rowSums(Species) > 0, ]
ExplVar_fin <- ExplVar[rowSums(Species) > 0, ]
# rda with all explanatory variables (used as the scope below)
RIKZ_rda <- rda(Species_fin ~ . , data = ExplVar_fin, scale = TRUE)
# stepwise model building: ordiR2step(), starting from salinity only
step_both_R2 <- ordiR2step(rda(Species_fin ~ salinity, data = ExplVar_fin, scale = TRUE),
                           scope = formula(RIKZ_rda),
                           direction = "both", R2scope = TRUE, Pin = 0.05,
                           steps = 1000)
Why does ordiR2step() not add the variable exposure to the model, although it would increase the explained variance?
If R2scope is set to FALSE and the p-value criterion is increased (Pin = 0.15), it adds the variable exposure correctly but throws the following error:
Error in terms.formula(tmp, simplify = TRUE) :
invalid model formula in ExtractVars
If R2scope is set to TRUE (Pin = 0.15), exposure is not added.
Note: This might seem more like a statistics question and therefore more suitable for CV. However, I think the problem is rather technical and better off here on SO.
Please read the ordiR2step documentation: it will tell you why exposure is not added to the model. The help page says that ordiR2step has three stopping criteria. The second criterion is that "the adjusted R2 of the ‘scope’ is exceeded". This happens with exposure, and therefore it was not added. This second criterion is ignored if you set R2scope = FALSE (also documented). So the function works as documented.
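The stopping logic can be sketched in a few lines of base R. This is not vegan's code: accept_term is a hypothetical helper, the numbers are made up, and the first and third checks reflect my reading of the help page (the adjusted R2 must still increase, and the permutation p-value must not exceed Pin).

```r
# Hypothetical helper: should a forward step accept the best candidate term?
#   adj_new:   adjusted R2 of the model with the candidate added
#   adj_cur:   adjusted R2 of the current model
#   adj_scope: adjusted R2 of the full scope model
#   p_value:   permutation p-value of the candidate term
accept_term <- function(adj_new, adj_cur, adj_scope, p_value,
                        Pin = 0.05, R2scope = TRUE) {
  if (adj_new <= adj_cur) return(FALSE)              # adjusted R2 stops rising
  if (R2scope && adj_new > adj_scope) return(FALSE)  # scope's adjusted R2 exceeded
  if (p_value > Pin) return(FALSE)                   # term not significant
  TRUE
}

# Made-up numbers: the candidate pushes adjusted R2 above the scope's value.
accept_term(0.30, 0.25, adj_scope = 0.28, p_value = 0.01)                   # FALSE
accept_term(0.30, 0.25, adj_scope = 0.28, p_value = 0.01, R2scope = FALSE)  # TRUE
```

With R2scope = TRUE the term is rejected even though it raises the adjusted R2 and is significant, which is the behaviour described for exposure.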

Set G in prior using MCMCglmm, with categorical response and phylogeny

I am new to the MCMCglmm package in R, and rather new to glm models in general. I have a dataset of species traits and whether or not they have been introduced outside of their native range.
I would like to test whether being introduced (as a binary 0/1 response variable) can be explained by any of the species traits. I would also like to correct for phylogeny between species.
I was told that for a binary response I could use family = "threshold" and that I should fix the residual variance at 1. But I am having some trouble with the other parameters needed for the prior.
I've specified the R value for the random effects, but if I specify R I must also specify G and it is not clear to me how to decide the values for this parameter. I've tried putting default values but I get error messages:
Error in MCMCglmm(fixed, random = ~species, data = data2, family = "threshold", :
prior$G has the wrong number of structures
I have read the help vignettes and course notes but have not found an example with a binary response, and it is not clear to me how to decide the values for the priors. This is what I have so far:
fixed=Intro_binary ~ Trait1+ Trait2 + Trait3
Ainv=inverseA(redTree1)$Ainv
binary_model = MCMCglmm(fixed, random=~species, data = data, family = "threshold", ginverse=list(species=Ainv),
 prior = list(
    G = list(),    #not sure about the parameters for random effects.
    R = list(V = 1, fix = 1)),  #to fix the residual variance at one
  nitt = 60000, burnin = 10000)
Any help or feedback would be greatly appreciated!
This one is a bit tricky with the information you provide. I'd say you can define G as a "weak" prior using:
# R: residual variance fixed at 1, as needed for family = "threshold"
# G: one named element (G1) per random term; here a weak prior for species
priors <- list(R = list(V = 1, fix = 1),
               G = list(G1 = list(V = 1, nu = 0.002)))
binary_model <- MCMCglmm(fixed, random = ~species, data = data,
                         family = "threshold",
                         ginverse = list(species = Ainv),
                         prior = priors,
                         nitt = 60000, burnin = 10000)
However, without more information on your analysis, I strongly suggest you plot your posteriors to check whether anything looks wrong. See the MCMCglmm package Course Notes for more on how to set these priors (especially section 1.5 on what not to do; you can also find more specific guidance on tuning them if your model fits one of the tutorial's categories).
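A minimal sketch of that posterior check follows. It assumes the fitted object stores the fixed-effect samples in $Sol and the variance-component samples in $VCV, and plot_posteriors is a hypothetical helper name.

```r
# Hypothetical helper; `model` is assumed to be a fitted MCMCglmm object.
plot_posteriors <- function(model) {
  plot(model$Sol)  # fixed effects: look for stable, well-mixed traces
  plot(model$VCV)  # variance components (species, and units fixed at 1)
  invisible(model)
}
# plot_posteriors(binary_model)
```

Traces that trend or stick, or a species variance piling up against zero, would be a reason to revisit the prior or run the chain for longer.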
