Retain several best models during model dredging in R

Is there a way to retain the best models, for example those within two Akaike Information Criterion (AIC) units of the best-fitting model, during a model-dredging approach in R? I am using the glmulti package, which returns the AIC of the best models but does not let me see the models associated with those values.
Thanks in advance.
Here is my example (data here):
results <- read.csv("gameresults.csv")
require(glmulti)
M <- glmulti(result ~ speed*svl*tailsize*strategy,
             data = results, name = "glmulti.analysis",
             intercept = TRUE, marginality = FALSE,
             level = 2, minsize = 0, maxsize = -1, minK = 0, maxK = -1,
             fitfunction = Multinom, method = "h", crit = "aic",
             confsetsize = 100, includeobjects = TRUE)
summary(M)

The function glmulti::glmulti returns an S4 object whose slots can be accessed like list elements. All of your models, not just the best one, can be accessed. Since I don't have your fitting function and some of the other optional inputs, I fit a simplified version of your model as a demonstration:
results <- read.csv("gameresults.csv")
library(glmulti)
M <- glmulti(result~speed*svl*strategy, data=results, crit = "aic", plotty = TRUE)
Here is a list of all models, accessed via the @ slot operator:
M@formulas
# [[1]]
# result ~ 1 + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[2]]
# result ~ 1 + speed + svl + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[3]]
# result ~ 1 + strategy + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
## (remaining 36 - 3 = 33 models omitted)
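To answer the original question directly, the models within two AIC units of the best one can be pulled out of the confidence set. A minimal sketch using glmulti's weightable(), which returns a data frame of formulas and criterion values (the criterion column name is assumed to be "aic" here, matching crit = "aic"):
top <- weightable(M)                 # data frame of model formulas + AIC values
top[top$aic <= min(top$aic) + 2, ]   # keep models within 2 AIC units of the best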
You can plot them individually based on the formula, using base graphics or any package that supports model formulas. For example, I randomly selected one from the list:
plot(result ~ 1 + speed + svl, data=results)
## Hit <Return> to see next plot:
## Hit <Return> to see next plot:

Related

Removing certain parts of modelsummary output in R for specific statistics

I am using gamm4::gamm4 to model longitudinal change.
I am trying to use the modelsummary::modelsummary function to create an output table of the following results.
I would like to add t-values and standard errors to the output for the fixed effects, and remove the empty values from the random effects:
model_lmer <- gamm4(Y ~ Tract + s(Age, by = Tract, k = 10) + Sex,
                    data = DF1,
                    random = ~ (0 + Tract | ID))
modelsummary(model_lmer$mer,
             statistic = c("s.e. = {std.error}",
                           "t = {statistic}"))
But I am struggling to write the correct syntax to remove the "t" and "s.e." from the random effects output.
This is kind of tricky, actually. The issue is that modelsummary() automatically drops empty rows when they are filled with NA or an empty string "". However, since glue strings can include arbitrary text, it is hard to think of a general way to figure out whether a row is empty, because modelsummary() cannot know ex ante what constitutes an empty string.
If you have an idea of how this check could be implemented, please report it on GitHub: https://github.com/vincentarelbundock/modelsummary
In the meantime, you can use the powerful tidy_custom.CLASSNAME mechanism to customize the statistic and p.value columns directly instead of through a glue string:
library(gamm4)
library(modelsummary)
# simulate
x <- runif(100)
fac <- sample(1:20, 100, replace = TRUE)
eta <- x^2 * 3 + fac / 20; fac <- as.factor(fac)
y <- rpois(100, exp(eta))
# fit
mod <- gamm4(y ~ s(x), family = poisson, random = ~ (1 | fac))
# customize: format the statistic and p.value columns as labelled strings
tidy_custom.glmerMod <- function(x) {
  out <- parameters::parameters(x)
  out <- insight::standardize_names(out, style = "broom")
  out$statistic <- sprintf("t = %.3f", out$statistic)
  out$p.value <- sprintf("p = %.3f", out$p.value)
  out
}
# summarize
modelsummary(mod$mer,
             statistic = c("{statistic}", "{p.value}"))
              Model 1
X(Intercept)  1.550
              t = 17.647
              p = 0.000
Xs(x)Fx1      0.855
              t = 4.445
              p = 0.000
Num.Obs.      100
RMSE          2.49
Note that I used bare glue strings such as statistic = "{p.value}"; otherwise the values would be wrapped in parentheses, as is the default for standard errors.
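Building on the snippet above (and assuming the empty random-effect cells come through as NA from parameters()), a variation that formats only the non-missing values keeps those cells NA, so modelsummary() can drop the blank rows automatically:
tidy_custom.glmerMod <- function(x) {
  out <- parameters::parameters(x)
  out <- insight::standardize_names(out, style = "broom")
  # format only non-missing values; NA cells stay NA and get dropped
  out$statistic <- ifelse(is.na(out$statistic), NA,
                          sprintf("t = %.3f", out$statistic))
  out$p.value <- ifelse(is.na(out$p.value), NA,
                        sprintf("p = %.3f", out$p.value))
  out
}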

Kfold CV in brms

I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob, type = 'l')
sim_data <-
  expand_grid(fish_id = seq_len(n_fish_per_length),
              length = length) %>%
  mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
  mutate(is_carp = rbinom(n = n(), size = 1, prob = prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method = "glm", formula = y ~ x + I(x^2),
              method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
                          data = sim_data,
                          family = bernoulli(link = "logit"),
                          warmup = 2500,
                          iter = 5000,
                          chains = 4,
                          inits = "0",
                          cores = 4,
                          seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0), and my reading suggests that I need to use stratified k-fold CV to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect, but I'm not sure of the best way to move forward with the CV process from here. I saw this post, https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify it to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error, likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing or misinterpreting something here? I'm assuming there is a very simple solution that I am overlooking, but I appreciate any thoughts.
After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits is a list containing the model fit and the indices of the observations held out (omitted) as the test set for each fold.
From this information, I created a loop to print the proportion of observations with is_carp = 1 in each training data set (one could do the same for each test set) with the following code.
for (i in 1:10) {
  print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
          nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
It is then easy to compare these proportions with the proportion of observations where is_carp = 1 in the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891
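If stratified folds are still preferred, one option (an untested sketch; it assumes brms::kfold accepts a vector of fold indices through its folds argument) is to pass the precomputed assignment from loo::kfold_split_stratified() directly:
fold_ids <- loo::kfold_split_stratified(K = 10, x = sim_data$is_carp)
kfold_strat <- kfold(Bayes_Model_Binary, folds = fold_ids, save_fits = TRUE)
Each fold then contains roughly the same proportion of is_carp = 1 observations by construction.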

R tree doesn't use all variables (why?)

Hi, I'm working on a decision tree.
tree1 <- tree(League.binary ~ TME.factor + APM.factor + Wmd.factor, starcraft)
The tree shows a partitioning based solely on APM.factor, and the leaves aren't pure. Here's a screenshot:
I tried creating a tree with a subset of 300 of the 3395 observations, and it used more than one variable. What went wrong in the first case? Did it not need the extra two variables, so it used only one?
Try playing with the tree.control() parameters, for example setting minsize = 2 so that you can end up with a single observation in each leaf (overfitting), e.g.:
model = tree(y ~ X1 + X2, data = data, control = tree.control(nobs = n, minsize = 2, mindev = 0))
Also, try the same thing with the rpart package, which is the "new" version of tree, and see what results you get. You can also plot the importance of the variables. Here is a syntax example:
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
## fit tree
### alt1: class
model = rpart(y ~ X1 + X2, data=data, method = "class")
### alt2: reg
model = rpart(y ~ X1 + X2, data=data, control = rpart.control(maxdepth = 30, minsplit = 1, minbucket = 1, cp=0))
## show model
print(model)
rpart.plot(model, cex=0.5)
## importance
model$variable.importance
Note that since trees make binary splits, it is possible that a single variable explains most or all of the SSR (for regression). Try plotting the response against each regressor to see whether there is any significant relation to anything but the variable you're getting.
In case you want to run the examples above, here is a data simulation (put it at the beginning of your code):
n <- 12000
X1 <- runif(n, -100, 100)
X2 <- runif(n, -100, 100)
## 1. SQUARE DATA
# y <- ifelse((X1 < -50) | (X1 > 50) | (X2 < -50) | (X2 > 50), 1, 0)
## 2. CIRCLE DATA
y <- ifelse(sqrt(X1^2 + X2^2) <= 50, 0, 1)
## 3. LINEAR BOUNDARY DATA
# y <- ifelse(X2 <= -X1, 0, 1)
# Create data frame
color <- ifelse(y == 0, "red", "green")
data <- data.frame(y, X1, X2, color)
# Plot
data$color <- as.character(data$color)
plot(data$X2 ~ data$X1, col = data$color, type = 'p', pch = 15)
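As a quick end-to-end check (assuming the simulation block above has been run with the circle data), the fitted tree should now use both variables, since neither X1 nor X2 alone separates the classes:
model <- rpart(y ~ X1 + X2, data = data, method = "class")
rpart.plot(model, cex = 0.5)
model$variable.importance  # both X1 and X2 should carry importance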

How to do the RESET test on an AR model?

> library("lmtest")
> a = arima.sim(list(ar = c(.05, -.05)), 1000)
> b = arima(a, order = c(2, 0, 0))
> resettest(b)
Error in terms.default(formula) : no terms component nor attribute
Question 1. What I am doing is shown above. What should I do about that error?
(I have tried passing the type, data, and power parameters to resettest(); the result is the same.)
Question 2. If I want to do the same thing on the model below,
r_t = 0.5 + 0.5 r_{t-1} - 0.5 r_{t-2} + 0.1 r_{t-1}^2 + ε_t,
which is an AR(2) model plus the 0.1 r_{t-1}^2 term, how do I fit this nonlinear model (using R, thank you!)?
The issue is that the first argument of resettest is documented as: formula, "a symbolic description for the model to be tested (or a fitted 'lm' object)". So passing an Arima object is not going to work. Instead, we can manually define the lagged variables and provide an lm object or just the formula:
la1 <- Hmisc::Lag(a, 1)
la2 <- Hmisc::Lag(a, 2)
resettest(a ~ la1 + la2)
#
# RESET test
#
# data: a ~ la1 + la2
# RESET = 0.10343, df1 = 2, df2 = 993, p-value = 0.9018
Now your second model is nonlinear in variables but linear in parameters, so the same estimation methods still apply. (I'm assuming that the true DGP remains the same and you just want to test a new specification.) In particular,
resettest(a ~ la1 + la2 + I(la2^2))
#
# RESET test
#
# data: a ~ la1 + la2 + I(la2^2)
# RESET = 0.089211, df1 = 2, df2 = 992, p-value = 0.9147
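The question's model can also be fitted directly by OLS, since it is linear in parameters (a sketch; note that the specification squares the first lag, hence I(la1^2) rather than the I(la2^2) used in the test above):
fit <- lm(a ~ la1 + la2 + I(la1^2))  # AR(2) plus squared first lag
summary(fit)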

R: can't get an lme{nlme} to fit when using self-constructed interaction variables

I'm trying to get an lme with self-constructed interaction variables to fit. I need those for post-hoc analysis.
library(nlme)
# construct fake dataset
obsr <- 100
dist <- rep(rnorm(36), times = obsr)
meth <- dist + rnorm(length(dist), mean = 0, sd = 0.5); rm(dist)
meth <- meth / dist(range(meth)); meth <- meth - min(meth)
main <- data.frame(meth = meth,
                   cpgl = as.factor(rep(1:36, times = obsr)),
                   pbid = as.factor(rep(1:obsr, each = 36)),
                   agem = rep(rnorm(obsr, mean = 30, sd = 10), each = 36),
                   trma = as.factor(rep(sample(c(TRUE, FALSE), size = obsr, replace = TRUE), each = 36)),
                   depr = as.factor(rep(sample(c(TRUE, FALSE), size = obsr, replace = TRUE), each = 36)))
# check if all factor combinations are present
# TRUE for my real dataset; Naturally TRUE for the fake dataset
with(main, all(table(depr, trma, cpgl) >= 1))
# construct interaction variables
main$depr_trma <- interaction(main$depr, main$trma, sep=":", drop=TRUE)
main$depr_cpgl <- interaction(main$depr, main$cpgl, sep=":", drop=TRUE)
main$trma_cpgl <- interaction(main$trma, main$cpgl, sep=":", drop=TRUE)
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl, sep=":", drop=TRUE)
# model WITHOUT preconstructed interaction variables
form1 <- list(fixd = meth ~ agem + depr + trma + depr*trma + cpgl +
                     depr*cpgl + trma*cpgl + depr*trma*cpgl,
              rndm = ~ 1 | pbid,
              corr = ~ cpgl | pbid)
modl1 <- nlme::lme(fixed = form1[["fixd"]],
                   random = form1[["rndm"]],
                   correlation = corCompSymm(form = form1[["corr"]]),
                   data = main)
# model WITH preconstructed interaction variables
form2 <- list(fixd = meth ~ agem + depr + trma + depr_trma + cpgl +
                     depr_cpgl + trma_cpgl + depr_trma_cpgl,
              rndm = ~ 1 | pbid,
              corr = ~ cpgl | pbid)
modl2 <- nlme::lme(fixed = form2[["fixd"]],
                   random = form2[["rndm"]],
                   correlation = corCompSymm(form = form2[["corr"]]),
                   data = main)
The first model fits without any problems, whereas the second model gives me the following error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
Nothing I have found about this error so far has helped me solve the problem. However, the solution is probably pretty easy.
Can someone help me? Thanks in advance!
EDIT 1:
When I run:
modl3 <- lm(form1[["fixd"]], data=main)
modl4 <- lm(form2[["fixd"]], data=main)
the summaries reveal that modl4 (with the self-constructed interaction variables), in contrast to modl3, has many more predictors. All those that are in modl4 but not in modl3 show NA as coefficients. The problem therefore definitely lies in the way I create the interaction variables...
EDIT 2:
In the meantime I created the interaction variables "by hand" (mainly paste() and grepl()), and it seems to work now. However, I would still be interested in how I could have done it using the interaction() function.
I should have constructed only the largest of the interaction variables (the one combining all three simple variables).
If I do so, the model fits. The likelihoods are then very close to each other, and the number of coefficients matches exactly.
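A sketch of that fix (it reuses the depr_trma_cpgl variable built earlier; form3 and modl5 are names I introduce here): drop the lower-order self-constructed interactions and keep only the largest one, whose 2 x 2 x 36 = 144 levels span the same column space as the full factorial in form1:
form3 <- list(fixd = meth ~ agem + depr_trma_cpgl,
              rndm = ~ 1 | pbid,
              corr = ~ cpgl | pbid)
modl5 <- nlme::lme(fixed = form3[["fixd"]],
                   random = form3[["rndm"]],
                   correlation = corCompSymm(form = form3[["corr"]]),
                   data = main)
# the log-likelihoods should be very close and the coefficient counts equal
logLik(modl1); logLik(modl5)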