The vegan package includes the ordiR2step() function for model building, which can be used to identify the most important variables using the adjusted R2 and the p-value as goodness-of-fit measures. However, for the dataset I was recently working with, the function doesn't return the best-fit model.
# data
RIKZ <- read.table("http://www.uni-koblenz-landau.de/en/campus-landau/faculty7/environmental-sciences/landscape-ecology/Teaching/RIKZ_data/at_download/file", header = TRUE)
# data preparation
Species <- RIKZ[ ,2:5]
ExplVar <- RIKZ[ , 9:15]
Species_fin <- Species[ rowSums(Species) > 0, ]
ExplVar_fin <- ExplVar[ rowSums(Species) > 0, ]
# rda (vegan must be loaded before rda() is called)
library(vegan)
RIKZ_rda <- rda(Species_fin ~ . , data = ExplVar_fin, scale = TRUE)
# stepwise model building: ordiR2step()
step_both_R2 <- ordiR2step(rda(Species_fin ~ salinity, data = ExplVar_fin, scale = TRUE),
scope = formula(RIKZ_rda),
direction = "both", R2scope = TRUE, Pin = 0.05,
steps = 1000)
Why does ordiR2step() not add the variable exposure to the model, although it would increase the explained variance?
If R2scope is set to FALSE and the p-value criterion is relaxed (Pin = 0.15), it adds the variable exposure correctly but throws the following error:
Error in terms.formula(tmp, simplify = TRUE) :
invalid model formula in ExtractVars
If R2scope is set to TRUE (Pin = 0.15), exposure is not added.
Note: This might seem more like a statistics question and therefore more suitable for CV. However, I think the problem is rather technical and better off here on SO.
Please read the ordiR2step documentation: it will tell you why exposure is not added to the model. The help page says that ordiR2step has three stopping criteria. The second criterion is that "the adjusted R2 of the ‘scope’ is exceeded". This happens with exposure, and therefore it was not added. This second criterion is ignored if you set R2scope = FALSE (also documented). So the function works as documented.
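To make the second stopping criterion concrete, here is a minimal sketch. It uses the dune data shipped with vegan rather than the RIKZ data from the question, so the numbers are only illustrative. It computes the adjusted R2 of the full scope and checks that the model selected with R2scope = TRUE does not exceed it.

```r
library(vegan)

# Built-in example data from vegan (an assumption; not the RIKZ data)
data(dune)
data(dune.env)

full_model <- rda(dune ~ ., data = dune.env)  # defines the scope
null_model <- rda(dune ~ 1, data = dune.env)  # starting model

# Adjusted R2 of the scope: the ceiling applied when R2scope = TRUE
r2_scope <- RsquareAdj(full_model)$adj.r.squared

set.seed(1)  # the permutation tests are random
selected <- ordiR2step(null_model, scope = formula(full_model),
                       R2scope = TRUE, Pin = 0.05, trace = FALSE)
r2_selected <- RsquareAdj(selected)$adj.r.squared

# A term that would push r2_selected above r2_scope is never added
print(c(scope = r2_scope, selected = r2_selected))
```

Comparing RsquareAdj() for the candidate model including exposure against the value for the full scope should show the same pattern on the RIKZ data.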
Sorry this is crossposting from https://stats.stackexchange.com/questions/593717/nlme-regression-with-weights-syntax-in-r, but I thought it might be more appropriate to post it here.
I am trying to fit a power curve to some observations with nlme. However, I know some observations to be less reliable than others (the reliability of each OBSID is reflected in WEIV in the dummy data), relatively independently of the variance. I quantified this beforehand and wish to include it as weights in my model. Moreover, I know that part of my variance is correlated with my independent variable, so I cannot use the variance directly as weights.
This is my model:
library(nlme)
library(dplyr)  # provides filter()
coeffs_start = lm(log(DEPV)~log(INDV), filter(testdummy10,DEPV!=0))$coefficients
nlme_fit <- nlme(DEPV ~ a*INDV^b,
data = testdummy10,
fixed=a+b~ 1,
random = a~ 1,
groups = ~ PARTID,
start = c(a=exp(coeffs_start[1]), b=coeffs_start[2]),
verbose = F,
method="REML",
weights=varFixed(~WEIV))
This is some sample dummy data (I know it is not a great fit but it's fake data anyway) : https://github.com/FlorianLeprevost/dummydata/blob/main/testdummy10.csv
This runs well without the "weights" argument, but when I add it I get this error and I am not sure why because I believe it is the correct syntax:
Error in recalc.varFunc(object[[i]], conLin) :
dims [product 52] do not match the length of object [220]
In addition: Warning message:
In conLin$Xy * varWeights(object) :
longer object length is not a multiple of shorter object length
Thanks in advance!
This looks like a very long-standing bug in nlme. I have a patched version on GitHub, which you can install via remotes::install_github() as below:
remotes::install_github("bbolker/nlme")
testdummy10 <- read.csv("testdummy10.csv") |> subset(DEPV>0 & INDV>0)
coeffs_start <- coef(lm(log(DEPV)~log(INDV), testdummy10))
library(nlme)
nlme_fit <- nlme(DEPV ~ a*INDV^b,
data = testdummy10,
fixed=a+b~ 1,
random = a~ 1,
groups = ~ PARTID,
start = c(a=exp(coeffs_start[1]),
b=coeffs_start[2]),
verbose = FALSE,
method="REML",
weights=varFixed(~WEIV))
packageVersion("nlme") ## 3.1.160.9000
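As a side note on what varFixed() encodes: varFixed(~v) declares the residual variance to be proportional to v, so observations with a larger v get less weight. The sketch below uses made-up data and the simpler gls() function (both are assumptions for illustration, not the model from the question). It shows that this is equivalent to weighted least squares with weights 1/v.

```r
library(nlme)

set.seed(1)
# Hypothetical data: v plays the role of a known variance covariate
d <- data.frame(x = runif(50, 1, 10), v = runif(50, 0.5, 4))
d$y <- 2 + 3 * d$x + rnorm(50, sd = sqrt(d$v))

# Residual variance proportional to v
fit_gls <- gls(y ~ x, data = d, weights = varFixed(~v))

# Equivalent weighted least squares: weights are the inverse variances
fit_lm <- lm(y ~ x, data = d, weights = 1 / v)

# The fixed-effect estimates agree up to numerical tolerance
same_coefs <- all.equal(unname(coef(fit_gls)), unname(coef(fit_lm)),
                        tolerance = 1e-6)
print(same_coefs)
```

If WEIV measures reliability rather than unreliability, it would need to be inverted before being passed to varFixed().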
New to Stack Overflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and a binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data, I was able to get your model to run by setting rescale = TRUE. The documentation states:
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So one solution may be simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # Weights rescaled to the sum of the sample size.
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with 'survey:::svyglm.survey.design', it seems like there may be a bug. I could be wrong, but by my reading, when 'rescale' is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
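The rescaling in the last line of that excerpt can be checked by hand. This is a small base-R sketch with made-up sampling weights (the vector w is hypothetical). Dividing the weights by their mean makes them sum to the sample size, which is what the documentation describes.

```r
# Hypothetical sampling weights for five respondents
w <- c(0.5, 2, 2, 0.5, 2)
prob <- 1 / w  # inclusion probabilities, as stored in design$prob

# Same expression as in the svyglm excerpt
rescaled <- (1 / prob) / mean(1 / prob)

print(rescaled)
print(sum(rescaled))  # equals the sample size, length(w)
```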
There may be a workaround if you assign a vector of numeric values to .survey.prob.weights in the global environment. I have no idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs one value per row of the data. That is 2000 here, double the 1000 values supplied for most columns, because data.frame() recycles the shorter columns to match SO2.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)
This question has been seen 74 times and has received only one response (as of noon (PDT) Wed, Aug-14).
I've rewritten the question to make it as clear as possible and I'll appreciate any help.
As a summary, I need a small but complete example on a dataset with binary response on how to use MLR's makeCostSensWeightedPairsWrapper to obtain prediction probabilities on a test set.
In the MLR tutorial in the part on cost-sensitive classification
https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
there is a paragraph on "Example-dependent misclassification costs" and an example is given based on the iris dataset.
In the code snippet below, I modified the iris data set so as to contain only two classes as I'm interested in binary classification only.
library( mlr )
set.seed( 12347 )
n1 = 100; ntrain = 70
df = iris[ 1:n1, ] # 100 points in df so as to have two classes only (setosa and versicolor)
df$Species = factor( df$Species ) # refactor the response
# partition df into a training set (70 points) and test set (30 points)
#
ix = sample( 1:n1, ntrain, replace=FALSE )
xtest = df[ setdiff( 1:n1, ix ), ] ## test set
ntest = nrow( xtest )
xtrain = df[ ix, ] # this is the training set
# create cost matrix, same as in the MLR example
#
cost = matrix(runif(ntrain * 2, 0, 2000), ntrain) * (1 - diag(2))[xtrain$Species,] + runif(ntrain, 0, 10)
colnames(cost) = levels(xtrain$Species)
rownames(cost) = rownames(xtrain)
xtrain$Species = NULL # this is done according to the MLR example
# cost-sensitive task
#
costsens.task = makeCostSensTask(id = "xtrain", data = xtrain, cost = cost )
costsens.task
##lrn = makeLearner("classif.multinom", trace = FALSE, predict.type="prob" )
lrn = makeLearner( "classif.gbm", predict.type="prob" )
lrn = makeCostSensWeightedPairsWrapper( lrn ); lrn
mod = train(lrn, costsens.task ); mod
pred = predict( mod, newdata = xtest, pred.type="prob" );
perf = performance( pred, measures = list(auc), task = costsens.task)
# I get the following error:
# Error in FUN(X[[i]], ...) :
# You need to have a 'truth' column in your pred object for measure auc!
My original project is to do a binary classification which incorporates example-dependent misclassification costs.
The goal is to do a prediction on a test dataset, obtain the probabilities and show the performance (using ROCR, for which there are MLR-mapping functions).
NOTE: The learners I've tried are 'classif.multinom' and (I guessed) 'classif.gbm' as the two that might be compatible with the weighted pairs wrapper.
My questions are:
Q1: Where in the code snippet and how to specify that I want probabilities as an output of the cost-sensitive classifier?
Q2: Which learner can be used so as to produce classification probabilities?
Q3: How to avoid the error above and get the class probabilities?
Once again, I'd really appreciate any help, even more so if there is anyone who can answer promptly.
OK, after a number of days and almost a hundred views of this question (about 20 are mine :) there has been only one comment and no answers.
From what I could understand from exploring the available MLR documentation, it seems that the output of the example-dependent cost-sensitive method (makeCostSensWeightedPairsWrapper) is labels only, with no prediction probabilities.
In other words, no probabilities are available from a cost-sensitive task, only the new labels are given, which are in turn computed based on the probabilities of the base classifier.
So, this is the answer which I'll accept.
As for MLR errors, in this case at least, it would be helpful to get an explicit error message instead of a spurious one, or simply to note this in the documentation.
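If class probabilities and an AUC are needed, one possible fallback is to fit the base learner on an ordinary classification task. This ignores the example-dependent costs, so it is a different model from the wrapped one and only a sketch. The learner classif.rpart and the task id are my choices for illustration, not something taken from the question.

```r
library(mlr)
set.seed(12347)

# Two-class subset of iris, as in the question
df <- droplevels(iris[1:100, ])
train_idx <- sample(nrow(df), 70)
test_set <- df[-train_idx, ]  # keeps Species, so 'truth' is available

# Ordinary (cost-insensitive) classification task
task <- makeClassifTask(id = "iris2", data = df, target = "Species")

# predict.type is set on the learner, not in predict()
lrn <- makeLearner("classif.rpart", predict.type = "prob")
mod <- train(lrn, task, subset = train_idx)

pred <- predict(mod, newdata = test_set)
probs <- getPredictionProbabilities(pred)  # probability of the positive class
perf <- performance(pred, measures = auc)
print(perf)
```

Because test_set still contains the Species column, the prediction object has a truth column and the auc measure can be computed.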
I am new to the MCMCglmm package in R, and rather new to glm models in general. I have a dataset of species traits and whether or not they have been introduced outside of their native range.
I would like to test whether being introduced (as a binary 0/1 response variable) can be explained by any of the species traits. I would also like to correct for phylogeny between species.
I was told that for a binary response I could use family = "threshold" and that I should fix the residual variance at 1. But I am having some trouble with the other parameters needed for the prior.
I've specified the R value for the random effects, but if I specify R I must also specify G and it is not clear to me how to decide the values for this parameter. I've tried putting default values but I get error messages:
Error in MCMCglmm(fixed, random = ~species, data = data2, family = "threshold", :
prior$G has the wrong number of structures
I have read the help vignettes and course but have not found an example with a binary response, and it is not clear to me how to decide the values for the priors. This is what I have so far:
fixed=Intro_binary ~ Trait1+ Trait2 + Trait3
Ainv=inverseA(redTree1)$Ainv
binary_model = MCMCglmm(fixed, random=~species, data = data, family = "threshold", ginverse=list(species=Ainv),
prior = list(
G = list(), #not sure about the parameters for random effects.
R = list(V = 1, fix = 1)), #to fix the residual variance at one
nitt = 60000, burnin = 10000)
Any help or feedback would be greatly appreciated!
This one is a bit tricky with the information you provide. I'd say you can define G as a "weak" prior using:
priors <- list(R = list(V = 1, fix = 1),                # residual variance fixed at 1
               G = list(G1 = list(V = 1, nu = 0.002)))  # one weak prior per random term
binary_model <- MCMCglmm(fixed, random = ~species, data = data,
family = "threshold",
ginverse = list(species = Ainv),
prior = priors,
nitt = 60000, burnin = 10000)
However, without more information on your analysis, I strongly suggest you plot your posteriors to have a look at the results and see if anything looks wrong. Have a look at the MCMCglmm package Course Notes for more info on how to set these priors (especially on what not to do in section 1.5 - you can also find more specific info on how to tune it to your model if it fits in the categories of the tutorial).
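The error in the question ("prior$G has the wrong number of structures") is about the shape of the list rather than the values in it. Here is a minimal base-R sketch. It assumes the usual MCMCglmm convention that prior$G holds one element (G1, G2, ...) per random term. The variable names are hypothetical.

```r
# Random-effects formula from the question: a single term
random_formula <- ~species
n_random_terms <- length(attr(terms(random_formula), "term.labels"))

# One G structure per random term; residual variance fixed at 1
priors <- list(
  R = list(V = 1, fix = 1),
  G = list(G1 = list(V = 1, nu = 0.002))
)

# G = list() has length 0 and would not match the single random term
print(n_random_terms)
print(length(priors$G) == n_random_terms)
```

With two random terms, for example ~species + site, prior$G would need both G1 and G2.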
I'm using glmulti for model averaging in R. There are ~10 variables in my model, making exhaustive screening impractical - I therefore need to use the genetic algorithm (GA) (call: method = "g").
I need to include random effects so I'm using glmulti as a wrapper for lme4. Methods for doing this are available here http://www.inside-r.org/packages/cran/glmulti/docs/glmulti and there is also a pdf included with the glmulti package that goes into more detail. The problem is that when telling glmulti to use GA in this setting it runs indefinitely, even after the best model has been found.
This is the example taken from the pdf included in the glmulti package:
library(lme4)
library(glmulti)
# create a function for glmulti to act as a wrapper for lmer:
lmer.glmulti <- function (formula, data, random = "", ...) {
lmer(paste(deparse(formula), random), data = data, REML=F, ...)
}
# set some random variables:
y = runif(30,0,10) # mock dependent variable
a = runif(30) # dummy covariate
b = runif(30) # another dummy covariate
c = runif(30) # an another one
x = as.factor(round(runif(30),1))# dummy grouping factor
# run exhaustive screening with lmer:
bab <- glmulti(y~a*b*c, level = 2, fitfunc = lmer.glmulti, random = "+(1|x)")
This works fine. The problem is when I tell it to use the genetic algorithm:
babs <- glmulti(y~a*b*c, level = 2, fitfunc = lmer.glmulti, random = "+(1|x)", method = "g")
It just keeps running indefinitely and the AIC does not change:
...
After 19550 generations:
Best model: y~1
Crit= 161.038899734164
Mean crit= 164.13629335762
Change in best IC: 0 / Change in mean IC: 0
After 19560 generations:
Best model: y~1
Crit= 161.038899734164
Mean crit= 164.13629335762
Change in best IC: 0 / Change in mean IC: 0
After 19570 generations:
Best model: y~1
Crit= 161.038899734164
Mean crit= 164.13629335762
... etc.
I have tried using arguments that tell glmulti when to stop (deltaB = 0, deltaM = 0.01, conseq = 6), but nothing seems to work. I think the problem must lie with how the function is set up (?). It may be something really obvious, but I'm new to R and can't work it out.
Any help with this would be much appreciated.
I received the solution from the package maintainer. The issue is that the number of models explored is set by the argument confsetsize. The default value is 100.
According to ?glmulti, this argument is:
The number of models to be looked for, i.e. the size of the returned confidence set.
The solution is to set confsetsize so that it is less than or equal to the total number of models.
Starting with the example from the OP that did not stop:
babs <- glmulti(y~a*b*c, level = 2, fitfunc = lmer.glmulti,
random = "+(1|x)", method = "g")
glmulti will report the total number of candidate models when run with method = "d":
babs <- glmulti(y~a*b*c, level = 2, fitfunc = lmer.glmulti,
random = "+(1|x)", method = "d")
Initialization...
TASK: Diagnostic of candidate set.
Sample size: 30
0 factor(s).
3 covariate(s).
...
Your candidate set contains 64 models.
Thus, setting confsetsize to less than or equal to 64 will result in the desired behavior.
babs <- glmulti(y~a*b*c, level = 2, fitfunc = lmer.glmulti,
random = "+(1|x)", method = "g", confsetsize = 64)
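The count of 64 can be reproduced by hand, assuming every main effect and every pairwise interaction can be included or excluded independently (that is, no marginality constraint). The helper below is hypothetical and is not part of glmulti.

```r
# Number of candidate models with n_covariates main effects plus all
# pairwise interactions, each term switched on or off independently
n_candidates <- function(n_covariates) {
  n_terms <- n_covariates + choose(n_covariates, 2)
  2^n_terms
}

print(n_candidates(3))  # 64, matching the diagnostic output above
print(n_candidates(4))  # 1024
```

Under that assumption, three covariates give fewer models than the default confsetsize of 100, while four covariates give more. That would be consistent with the genetic algorithm terminating once a fourth covariate is added.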
However, for small models it may be sufficient to use the exhaustive search (method = "h"):
babs <- glmulti(y~a*b*c, level = 2, fitfunc = lmer.glmulti,
random = "+(1|x)", method = "h")
Right, I've worked this one out - the problem is that the example (above) I was using to test run this package only contains 3 variables. When you add in a fourth it works fine:
d = runif(30)
And run again telling it to use GA:
babs <- glmulti(y~a*b*c*d, level = 2, fitfunc = lmer.glmulti, random = "+(1|x)", method = "g")
Returns:
...
After 190 generations:
Best model: y~1
Crit= 159.374382952181
Mean crit= 163.380382861026
Improvements in best and average IC have been below the specified goals.
Algorithm is declared to have converged.
Completed.
Using glmulti out-of-the-box with a GLM gives the same result if you try to use GA with three or fewer variables. This is not really an issue, however, since with only three variables an exhaustive search is feasible. The problem was the example.