Set G in prior using MCMCglmm, with categorical response and phylogeny - r

I am new to the MCMCglmm package in R, and rather new to glm models in general. I have a dataset of species traits and whether or not they have been introduced outside of their native range.
I would like to test whether being introduced (as a binary 0/1 response variable) can be explained by any of the species traits. I would also like to correct for phylogeny between species.
I was told that for a binary response I could use family = "threshold" and that I should fix the residual variance at 1. But I am having some trouble with the other parameters needed for the prior.
I've specified R (the residual structure), but if I specify R I must also specify G (the random-effects structure), and it is not clear to me how to choose the values for this parameter. I've tried putting default values but I get an error message:
Error in MCMCglmm(fixed, random = ~species, data = data2, family = "threshold", :
prior$G has the wrong number of structures
I have read the help vignettes and course but have not found an example with a binary response, and it is not clear to me how to decide the values for the priors. This is what I have so far:
fixed=Intro_binary ~ Trait1+ Trait2 + Trait3
Ainv=inverseA(redTree1)$Ainv
binary_model = MCMCglmm(fixed, random=~species, data = data, family = "threshold", ginverse=list(species=Ainv),
 prior = list(
    G = list(),    #not sure about the parameters for random effects.
    R = list(V = 1, fix = 1)),  #to fix the residual variance at one
  nitt = 60000, burnin = 10000)
Any help or feedback would be greatly appreciated!

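This one is a bit tricky with the information you provide. The error arises because G must be a list with one element per random term (G1, G2, ...), and your model has a single random term, species. With family = "threshold" the residual variance stays fixed at 1 in R, as in your attempt. I'd say you can define G as a "weak" prior using:
priors <- list(R = list(V = 1, fix = 1),
               G = list(G1 = list(V = 1, nu = 0.002)))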
binary_model <- MCMCglmm(fixed, random = ~species, data = data,
family = "threshold",
ginverse = list(species = Ainv),
prior = priors,
nitt = 60000, burnin = 10000)
However, without more information on your analysis, I strongly suggest you plot your posteriors to have a look at the results and see if anything looks wrong. Have a look at the MCMCglmm package Course Notes for more info on how to set these priors (especially on what not to do in section 1.5 - you can also find more specific info on how to tune it to your model if it fits in the categories of the tutorial).
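As a rough illustration of the "wrong number of structures" error, here is a minimal base-R sketch that compares the number of random terms with the number of elements in prior$G. The helper count_random_terms is hypothetical (it is not part of MCMCglmm), and the check only mimics the idea behind the error, not the package's actual validation code.

```r
# Hypothetical helper (not an MCMCglmm function): count the terms in a
# one-sided random-effects formula.
count_random_terms <- function(random) {
  length(attr(terms(random), "term.labels"))
}

random <- ~species

# Prior from the question: G is an empty list.
bad_prior <- list(G = list(),
                  R = list(V = 1, fix = 1))

# Prior with one G structure (G1) for the single random term.
good_prior <- list(G = list(G1 = list(V = 1, nu = 0.002)),
                   R = list(V = 1, fix = 1))

n_terms <- count_random_terms(random)
length(bad_prior$G)  == n_terms  # FALSE: zero structures for one random term
length(good_prior$G) == n_terms  # TRUE: G1 matches the single term ~species
```

With a second random term you would add a G2 element in the same way.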

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So one solution may be simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # Weights rescaled to the sum of the sample size.
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
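To see what the rescaling described in the documentation does, here is a small base-R sketch with made-up weights. It uses plain glm() rather than the survey package, so it only illustrates the arithmetic: the rescaled weights sum to the sample size, and multiplying all weights by a constant should leave the point estimates unchanged.

```r
set.seed(1)
n <- 1000
w <- sample(c(0.5, 2), n, replace = TRUE)  # made-up sampling weights
prob <- 1 / w                              # implied inclusion probabilities

# Same arithmetic as the rescale step: divide the weights by their mean.
w_rescaled <- (1 / prob) / mean(1 / prob)
all.equal(sum(w_rescaled), n)  # TRUE: weights now sum to the sample size

# Point estimates of a weighted GLM are unaffected by a constant rescaling.
x <- rep(0:1, n / 2)
y <- rbinom(n, 1, 0.4)
fit_raw      <- glm(y ~ x, family = quasibinomial(), weights = w)
fit_rescaled <- glm(y ~ x, family = quasibinomial(), weights = w_rescaled)
all.equal(coef(fit_raw), coef(fit_rescaled), tolerance = 1e-6)
```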
Otherwise, looking at the code for this function's method with 'survey:::svyglm.survey.design', it seems like there may be a bug. I could be wrong, but by my read, when 'rescale' is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a workaround if you assign a vector of numeric values to .survey.prob.weights in the global environment. I have no idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs one value per row of the data: 2000 here, because rep(1:2, 1000) has length 2000 and data.frame() recycles the shorter columns to match.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)

MLR - calculating feature importance for bagged, boosted trees (XGBoost)

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with the MLR package in R. I am using XGBoost to make predictions and bagging to estimate prediction uncertainty. My data set is relatively large: approximately 10k features and observations. The predictions work perfectly (see code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session. I saw some related Python code where people calculate the importance for each of the bagged models. I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access the individual models within the object produced by MLR (the mb object in the code below). In Python, this seems to be trivial. In R, I can't seem to extract mb$learner.model, which seems logically closest to what I need. So I'm wondering if anyone has had any experience with this issue?
Please see the code below
learn1 <- makeRegrTask(data = train.all , target= "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list( objective="reg:squarederror", eval_metric="error", nrounds=300, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1, every bagged model sees all the features, so it might be feasible to average the feature importance values.
Basically you just have to apply getFeatureImportance over all the single models that are stored in the HomogeneousEnsembleModel. Some extra care is necessary because the feature sampling mixes up the order of the features, even though we set it to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list( objective="reg:squarederror", eval_metric="error", nrounds=50, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
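The merge-and-average step can also be sketched in base R on toy tables, independent of mlr. The data below are made up and assume each per-model table has a variable column and an importance column. Renaming the importance column per model before merging keeps merge() from having to disambiguate duplicated column names.

```r
# Toy stand-ins for per-model importance tables (one row per feature),
# deliberately in different row orders.
fimps <- list(
  data.frame(variable = c("x1", "x2", "x3"), importance = c(0.3, 0.5, 0.2)),
  data.frame(variable = c("x3", "x1", "x2"), importance = c(0.1, 0.4, 0.5)),
  data.frame(variable = c("x2", "x3", "x1"), importance = c(0.6, 0.2, 0.2))
)

# Give each importance column a unique name (model1, model2, ...).
for (i in seq_along(fimps)) {
  names(fimps[[i]])[names(fimps[[i]]) == "importance"] <- paste0("model", i)
}

# Merge on the feature name, then average across models.
merged <- Reduce(function(a, b) merge(a, b, by = "variable"), fimps)
avg <- data.frame(variable = merged$variable,
                  mean_importance = unname(rowMeans(merged[, -1])))
avg
```

Keeping the variable column next to the averages makes it clear which mean belongs to which feature.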

How to use covariates in rddtools rdd_reg_lm function?

I am trying to run a parametric RD regression using the rddtools R package. However, the package documentation is not very clear to me.
First: the function to define an RD object is:
rdd_data(y, x, covar, cutpoint, z, labels, data)
where covar, in the help file, means only "Exogeneous variables" . But what type? A data frame? A list?
Second: The function rdd_reg_lm again demands informing covariates in this way:
rdd_reg_lm(rdd_object, covariates = NULL, order = 1, bw = NULL,
slope = c("separate", "same"), covar.opt = list(strategy = c("include",
"residual"), slope = c("same", "separate"), bw = NULL),
covar.strat = c("include", "residual"), weights)
Where, according to the help file, the covariates argument means simply "Formula to include covariates". Again, it is not clear to me what is exactly the correct way of applying these covariates.
Moreover, is it possible to include multiple covariates in this function rdd_data() and rdd_reg_lm()?
I appreciate some help here. I have already read the help and vignette files again and again, searched in many blogs and still nothing.
I have already checked this topic below
How to include a linear trend in a regression discontinuity design using rddtools
which showed me the following example:
rd.medic <- rdd_data(y = er ,x = ageyrs, covar = ageyrs, cutpoint=65, data = medicare)
rd.reg <- rdd_reg_lm(rdd_object=rd.medic, covariates = 'ageyrs', slope =
("same"), covar.opt = list("include"))
Even so, the syntax is still not clear to me, as I am trying to add multiple covariates without success
Thanks!
You can create a data frame with your covariates and then include it in rdd_data.
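# build the covariate data frame from columns of medicare (ageyrs2 is assumed to exist there or in your workspace)
covariates <- with(medicare, data.frame(z1 = ageyrs, z2 = ageyrs2))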
rd.medic <- rdd_data(y = er ,x = ageyrs, covar = covariates, cutpoint=65, data = medicare)
rd.reg <- rdd_reg_lm(rdd_object=rd.medic, covariates =TRUE, slope =("same"))
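For intuition about what including several covariates means in a parametric RD regression, here is a plain lm() sketch on simulated data. It is not rddtools' internal code, and the data, the covariates z1 and z2, and the true jump of 2 are all made up. The covariates enter additively next to the treatment indicator and the centered running variable.

```r
set.seed(42)
n  <- 2000
x  <- runif(n, 55, 75)      # running variable (e.g. age)
z1 <- rnorm(n)              # hypothetical covariate
z2 <- rbinom(n, 1, 0.5)     # hypothetical covariate
D  <- as.numeric(x >= 65)   # treatment indicator at the cutpoint 65

# Simulated outcome with a true discontinuity of 2 at the cutpoint.
y <- 1 + 2 * D + 0.1 * (x - 65) + 0.5 * z1 - 0.3 * z2 + rnorm(n)

dat <- data.frame(y, D, xc = x - 65, z1, z2)

# D * xc gives separate slopes on each side; z1 and z2 are added as controls.
fit <- lm(y ~ D * xc + z1 + z2, data = dat)
coef(fit)["D"]  # estimated jump at the cutpoint
```

Dropping the D:xc interaction (y ~ D + xc + z1 + z2) would correspond to a common slope on both sides.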

vegan::ordiR2step() doesn't find best-fit model

The vegan package includes the ordiR2step() function for model building, which can be used to identify the most important variables using the adjusted R2 and the p-value as goodness-of-fit measures. However, for the dataset I was recently working with, the function doesn't provide the best-fit model.
# data
RIKZ <- read.table("http://www.uni-koblenz-landau.de/en/campus-landau/faculty7/environmental-sciences/landscape-ecology/Teaching/RIKZ_data/at_download/file", header = TRUE)
# data preparation
Species <- RIKZ[ ,2:5]
ExplVar <- RIKZ[ , 9:15]
Species_fin <- Species[ rowSums(Species) > 0, ]
ExplVar_fin <- ExplVar[ rowSums(Species) > 0, ]
# load vegan before calling rda()
require(vegan)
# rda
RIKZ_rda <- rda(Species_fin ~ . , data = ExplVar_fin, scale = TRUE)
# stepwise model building: ordiR2step()
step_both_R2 <- ordiR2step(rda(Species_fin ~ salinity, data = ExplVar_fin, scale = TRUE),
scope = formula(RIKZ_rda),
direction = "both", R2scope = TRUE, Pin = 0.05,
steps = 1000)
Why does ordiR2step() not add the variable exposure to the model, although it would increase the explained variance?
If R2scope is set to FALSE and the p-value criterion is increased (Pin = 0.15), it adds the variable exposure correctly but throws the following error:
Error in terms.formula(tmp, simplify = TRUE) :
invalid model formula in ExtractVars
If R2scope is set to TRUE (Pin = 0.15), exposure is not added.
Note: This might seem more like a statistics question and therefore more suitable for CV. However, I think the problem is rather technical and better off here on SO.
Please read the ordiR2step documentation: it explains why exposure is not added to the model. The help page says that ordiR2step has three stopping criteria. The second criterion is that "the adjusted R2 of the ‘scope’ is exceeded". This happens with exposure, and therefore it was not added. This second criterion is ignored if you set R2scope = FALSE (also documented). So the function works as documented.
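It may seem odd that a smaller model can exceed the adjusted R2 of the full scope, so here is a base-R sketch of the idea. It uses lm() with a single response, not an RDA, and the data and variable names are made up. Adjusted R2 penalizes every extra term, so a scope that contains a noise variable can score lower than a sub-model without it.

```r
# Adjusted R2 of a linear model (base R only).
adj_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$adj.r.squared
}

set.seed(1)
n <- 40
d <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
d$y <- d$a + 0.3 * d$b + rnorm(n)

scope_r2 <- adj_r2(y ~ a + b + noise, d)  # all candidate terms: the "scope"
cand_r2  <- adj_r2(y ~ a + b, d)          # current model plus candidate b

# Sketch of the stopping rule: do not add the candidate if the resulting
# adjusted R2 would exceed that of the full scope.
add_candidate <- cand_r2 <= scope_r2
c(scope = scope_r2, candidate = cand_r2)
add_candidate
```

In vegan itself the adjusted R2 of an ordination is available through RsquareAdj().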

How to ensemble forecasts in R using weights

I have a couple forecasts and was trying to figure out how to merge the two according to some criterion that is widely used.
In part one, I split the data and compared the forecasts against the actual values using Forecast_comb.
library(forecast)
library(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h=10)$mean  # h is the forecast horizon
forc2 = forecast(fit2, h=10)$mean
forc_all = cbind(forc1,forc2)
forc_all
?Forecast_comb
fitted <- Forecast_comb(obs = test ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
In part two, I rebuilt the whole model on the entire data and forecast out by ten values. How can one merge the two values together according to some criterion?
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h=10)$mean  # h is the forecast horizon: ten values ahead
forc4 = forecast(fit4, h=10)$mean
forc_all = cbind(forc3,forc4)
forc_all
fitted <- Forecast_comb(obs = y1[91:100] ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
Thanks for the help
The reason that I am using ForecastCombinations is that it includes procedures for popular combination strategies. I thought that perhaps that function could be modified to perform the desired ensembling.
Based on a lot of Kaggle competitions where people share/discuss their scripts, I'd say that by far the most common and most effective way is simply to manually weight and add your predictions.
pacman::p_load(forecast)
pacman::p_load(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h=10)$mean  # h is the forecast horizon
forc2 = forecast(fit2, h=10)$mean
forc_all = cbind(forc1,forc2)
forc_all
?Forecast_comb
fitted_1 <- Forecast_comb(obs = test ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_1
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h=10)$mean  # h is the forecast horizon: ten values ahead
forc4 = forecast(fit4, h=10)$mean
forc_all = cbind(forc3,forc4)
forc_all
fitted_2 <- Forecast_comb(obs = y1[91:100] ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_2
# By far the most common way to combine/weight is simply:
fitted <- fitted_2*.5+fitted_1*.5
fitted
One might ask if you should use equal weights or how to know what to make the weights. This is usually determined by
(a) naive, equal weighting if that's all you have time for and it seems to work fine
(b) iterating with a holdout or cross-validation sample(s), being careful not to overfit
Some people take fancier approaches. These are easy to get wrong, but if you get them right they can give you a better forecast.
The model-based and other fancier approaches include adding another stage to the modeling process in which your predictions on a holdout sample form the X matrix and the outcome variable is the actual y for that sample.
Also, check out Erin LeDell's approach in h2oEnsemble.
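To make the weighting options concrete, here is a base-R sketch on made-up holdout data. It shows equal weights, inverse-MSE weights estimated on the holdout, and a simple second-stage regression of the actuals on the forecasts. All data, names and the helper combine are illustrative, and inverse-MSE weighting is just one common choice, not something prescribed above.

```r
set.seed(123)
n_hold <- 50
actual <- rnorm(n_hold, mean = 10)       # made-up holdout actuals
f1 <- actual + rnorm(n_hold, sd = 0.5)   # made-up forecast 1 (more accurate)
f2 <- actual + rnorm(n_hold, sd = 1.5)   # made-up forecast 2 (less accurate)

# (a) naive equal weights
w_equal <- c(0.5, 0.5)

# (b) weights estimated on the holdout: inverse of each forecast's MSE
mse <- c(mean((actual - f1)^2), mean((actual - f2)^2))
w_inv <- (1 / mse) / sum(1 / mse)

# Second modeling stage: forecasts are the X matrix, actuals are y.
stack <- lm(actual ~ f1 + f2)

# Hypothetical helper: weighted combination of a matrix of forecasts.
combine <- function(fc, w) as.numeric(fc %*% w)

new_fc <- cbind(f1 = c(10.2, 9.7), f2 = c(11.0, 9.1))  # hypothetical new forecasts
combine(new_fc, w_equal)
combine(new_fc, w_inv)
predict(stack, newdata = as.data.frame(new_fc))
```

Whichever scheme you pick, estimate the weights on data that the individual models were not fitted to, otherwise the combination will overfit.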
