Suppose I want to fit all possible combinations of variables in a GLMM (using lme4), but I don't want certain variables to appear together in the same model. How do I do that? For instance, I want to consider 3 fixed effects and 3 random effects, but I don't want any two of the random effects (or any two of the fixed effects) to appear in the same model. If I construct the model this way:
model1 <- glmer(x~var1+var2+var3+(1|var4)+(1|var5)+(1|var6),
data=data1)
and I use MuMIn::dredge() function (to perform model averaging later), I will get all possible combinations between them, but I don't want (1|var4) to be in the same model as (1|var5).
So, is it possible to limit model combinations? This way I would avoid unnecessary models and save computing time.
Just in case anyone is looking for an updated solution to this... you can now do this more easily with the subset argument of dredge:
#full model
model1 <- glmer(x~var1+var2+var3+(1|var4)+(1|var5)+(1|var6),data=data1)
#exclude models containing both (1|var4) & (1|var5) at the same time
dredge(model1, subset = !((1|var4) && (1|var5)))
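If, as in the question, no two of the random-effect terms should ever appear together, the same idea should extend to all pairs (an untested sketch along the same lines):
#keep at most one random-effect term per model by excluding every pair
dredge(model1, subset = !((1|var4) && (1|var5)) &&
                        !((1|var4) && (1|var6)) &&
                        !((1|var5) && (1|var6)))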
I don't know how to do this within MuMIn::dredge() (but see my attempts below).
set.seed(101)
dd <- data.frame(x=rnorm(1000),
var1=rnorm(1000),
var2=rnorm(1000),
var3=rnorm(1000),
var4=sample(factor(sample(1:20,size=1000,replace=TRUE))),
var5=sample(factor(sample(1:20,size=1000,replace=TRUE))),
var6=sample(factor(sample(1:20,size=1000,replace=TRUE))))
library(lme4)
m0 <- lmer(x~var1+var2+var3+(1|var4)+(1|var5)+(1|var6),dd,REML=FALSE,
na.action=na.fail)
If we try to use the m.lim argument, it subsets only the fixed effects but leaves in all the random-effect terms:
dredge(m0,m.lim=c(0,1))
## Model selection table
## (Intrc) var1 var2 var3 df logLik AICc delta weight
## 1 0.02350 5 -1417.485 2845.0 0.00 0.412
## 3 0.02389 -0.03256 6 -1416.981 2846.0 1.02 0.248
## 5 0.02327 0.02168 6 -1417.254 2846.6 1.56 0.189
## 2 0.02349 -0.002981 6 -1417.480 2847.0 2.02 0.151
## Models ranked by AICc(x)
## Random terms (all models):
## ‘1 | var4’, ‘1 | var5’, ‘1 | var6’
Following demo(dredge.subset), I tried this as an example:
dredge(m0,
subset=expression(!( (var1 && var2) || ((1|var4) && (1|var5)))))
but got
Error in dredge(m0, subset = expression(!((var1 && var2) || ((1 | var4) && :
unrecognized names in 'subset' expression: "var4" and "var5"
I can't find any documentation on how to do dredging/model averaging with MuMIn::dredge() across models with different random effects (indeed, I'm not convinced this is a good idea). If you wanted to fit all models with exactly one fixed-effect term and exactly one random-effect term, you could do it as follows:
Set up all combinations:
fvars <- paste0("var",1:3)
gvars <- paste0("(1|var",4:6,")")
combs <- as.matrix(expand.grid(fvars,gvars))
Now fit them:
mList <- list()
for (i in 1:nrow(combs)) {
    mList[[i]] <- update(m0,
                         formula=reformulate(combs[i,], response="x"))
}
Now you can use lapply or sapply to operate on the elements of the list, e.g.:
lapply(mList,formula)
## [[1]]
## x ~ var1 + (1 | var4)
##
## [[2]]
## x ~ var2 + (1 | var4)
##
## [[3]]
## x ~ var3 + (1 | var4)
##
## [[4]]
## x ~ var1 + (1 | var5)
## ... et cetera ...
bbmle::AICtab(mList,weights=TRUE)
## dAIC df weight
## model5 0.0 4 0.344
## model6 0.5 4 0.262
## model4 1.0 4 0.213
## model8 4.1 4 0.044
## ... et cetera ...
... but you'll have to work a bit harder to do model averaging. You might try r-sig-mixed-models@r-project.org, r-sig-ecology@r-project.org, or e-mail the maintainer of MuMIn (maintainer("MuMIn")) ...
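For a flavor of that by-hand work, here is a minimal sketch computing Akaike weights for mList (mirroring what bbmle::AICtab reports above); fully averaging the coefficients would additionally require aligning terms across models, since each model contains a different fixed effect:
## Akaike weights: w_i = exp(-delta_i/2), normalized to sum to 1
aic <- sapply(mList, AIC)
delta <- aic - min(aic)
w <- exp(-delta/2)/sum(exp(-delta/2))
round(w, 3)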
How do I remove the intercept from the prediction when using predict.glm? I'm not talking about the model itself, just in the prediction.
For example, I want to get the difference and standard error between x=1 and x=3
I tried passing newdata=list(x=2) and intercept=NULL to predict.glm, but it doesn't work.
So for example:
m <- glm(speed ~ dist, data=cars, family=gaussian(link="identity"))
prediction <- predict.glm(m, newdata=list(dist=c(2)), type="response", se.fit=T, intercept=NULL)
I'm not sure whether this is somehow implemented in predict, but you could use the following trick.
Add a manual intercept column (i.e. a vector of 1s) to the data and use it in the model while adding 0 to RHS of formula (to remove the "automatic" intercept).
cars$intercept <- 1L
m <- glm(speed ~ 0 + intercept + dist, family=gaussian, data=cars)
This gives us an intercept column in the model.frame, internally used by predict,
model.frame(m)
# speed intercept dist
# 1 4 1 2
# 2 4 1 10
# 3 7 1 4
# 4 7 1 22
# ...
which allows us to set it to an arbitrary value such as zero.
predict.glm(m, newdata=list(dist=2, intercept=0), type="response", se.fit=TRUE)
# $fit
# 1
# 0.3311351
#
# $se.fit
# [1] 0.03498896
#
# $residual.scale
# [1] 3.155753
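As an aside, for the difference between predictions at x=1 and x=3 (with its standard error) that the question asks about: with a single continuous predictor and an identity link, that difference is simply twice the slope, so it can be read straight off the coefficient table (a sketch using the cars model above):
# fit(dist=3) - fit(dist=1) = 2*beta for the dist slope, so its SE is 2*SE(beta)
b <- coef(summary(m))["dist", ]
2 * b["Estimate"]    # difference in predicted speed
2 * b["Std. Error"]  # its standard error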
I am trying to use MuMIn's model.avg function with model formulae that are pasted together and indexed by a loop variable, rather than typed in directly, for example:
m1<-gls(as.formula(paste(response,"~",paste(combns[,j], collapse="+"))), data=dat)
'combns' is a 2D array created by combn() containing combinations of predictor variables. This produces model-averaged coefficients and AICc values identical to those produced when the gls calls contain the formulae directly, e.g.:
m1<-gls(median_Ta ~ day_of_season + hour_of_day + pct_grey_cover +
foliage_height_diversity + tree_shannon_diversity + median_patch_size, data=dat)
However, the relative variable importance is not computed correctly. I believe this has to do with the use of a for loop, or with using a variable to index the list in which the models are stored, somehow causing the component model terms not to be read properly (see the term codes for the models):
Component models:
df logLik AICc delta weight
1234567b 7 -233.08 481.43 0.00 0.59
1234567f 3 -237.97 482.21 0.78 0.40
1234567e 4 -241.32 491.08 9.65 0.00
1234567a 9 -241.15 502.39 20.96 0.00
1234567c 6 -248.37 509.68 28.25 0.00
1234567d 5 -250.22 511.11 29.68 0.00
Term codes:
day_of_season foliage_height_diversity hour_of_day
1 2 3
median_patch_size pct_grey_cover tree_shannon_diversity
4 5 6
urban_boundary_distance
7
This results in the relative variable importance being given as:
Relative variable importance:
day_of_season foliage_height_diversity hour_of_day
Importance: 1 1 1
N containing models: 6 6 6
median_patch_size pct_grey_cover tree_shannon_diversity
Importance: 1 1 1
N containing models: 6 6 6
urban_boundary_distance
Importance: 1
N containing models: 6
Whereas if I use model.avg over the same models with the formulae typed individually, I get the following, correct output:
Component models:
df logLik AICc delta weight
23456 7 -233.08 481.43 0.00 0.59
1 3 -237.97 482.21 0.78 0.40
57 4 -241.32 491.08 9.65 0.00
1234567 9 -241.15 502.39 20.96 0.00
1467 6 -248.37 509.68 28.25 0.00
147 5 -250.22 511.11 29.68 0.00
Relative variable importance:
pct_grey_cover median_patch_size tree_shannon_diversity
Importance: 0.6 0.59 0.59
N containing models: 3 4 3
foliage_height_diversity hour_of_day day_of_season
Importance: 0.59 0.59 0.4
N containing models: 2 2 4
urban_boundary_distance
Importance: <0.01
N containing models: 4
How can I make model.avg read the predictor variables in the formulae properly? I've only included six models as an example here but I want to compare the full set of 128 models (and I have other response variables with larger numbers of predictor variables), so typing them out individually isn't feasible.
Thanks in advance.
Edit: reproducible example
It took me a while to narrow down the problem. The first example, m.ave, shows the problem in action with a for loop. The second example, m.ave2, shows it working when the indices are typed out rather than supplied via a variable. Obviously this is just a small subset of the predictor variables.
require(nlme)
require(MuMIn)
dat<-data.frame(median_Ta=rnorm(100), day_of_season=runif(100), hour_of_day=runif(100), pct_grey_cover=rnorm(100),
foliage_height_diversity=rnorm(100), urban_boundary_distance=runif(100), tree_shannon_diversity=rnorm(100),
median_patch_size=rnorm(100))
f1<-"median_Ta ~ day_of_season + hour_of_day + pct_grey_cover + foliage_height_diversity +
urban_boundary_distance + tree_shannon_diversity + median_patch_size"
f1<-gsub("\\s", "", f1) # remove whitespace
f1split <- strsplit(f1, split="~") # split predictors and response
response <- f1split[[1]][1]
predictors <- strsplit(f1split[[1]][2], split="+", fixed=TRUE)[[1]]
modelslist<-list()
combns <- combn(predictors, 6)
for (j in 1:7) {
    modelslist[[j]] <- gls(as.formula(paste(response,"~",paste(combns[,j], collapse="+"))), data=dat)
}
m.ave <- model.avg(modelslist[[1]], modelslist[[2]], modelslist[[3]], modelslist[[4]],
                   modelslist[[5]], modelslist[[6]], modelslist[[7]])
summary(m.ave)
#compare....
modelslist2<-list()
modelslist2[[1]]<-gls(as.formula(paste(response,"~",paste(combns[,1], collapse="+"))), data=dat)
modelslist2[[2]]<-gls(as.formula(paste(response,"~",paste(combns[,2], collapse="+"))), data=dat)
modelslist2[[3]]<-gls(as.formula(paste(response,"~",paste(combns[,3], collapse="+"))), data=dat)
modelslist2[[4]]<-gls(as.formula(paste(response,"~",paste(combns[,4], collapse="+"))), data=dat)
modelslist2[[5]]<-gls(as.formula(paste(response,"~",paste(combns[,5], collapse="+"))), data=dat)
modelslist2[[6]]<-gls(as.formula(paste(response,"~",paste(combns[,6], collapse="+"))), data=dat)
modelslist2[[7]]<-gls(as.formula(paste(response,"~",paste(combns[,7], collapse="+"))), data=dat)
m.ave2<-model.avg(modelslist2[[1]], modelslist2[[2]], modelslist2[[3]], modelslist2[[4]],
modelslist2[[5]], modelslist2[[6]], modelslist2[[7]])
summary(m.ave2)
This is a bug in the formula method for gls (in package nlme). Since the actual formula is not stored anywhere in the object, formula.gls evaluates the "model" argument from the function call. In the case of the elements of modelslist, these calls are all the same; for example:
modelslist[[1]]$call$model
modelslist[[7]]$call$model
both return
> as.formula(paste(response, "~", paste(combns[, j], collapse = "+")))
which, when evaluated, uses the current (last) value of j, so formula(modelslist[[N]]) returns the last model's formula for every N.
all.equal(formula(modelslist[[1]]), formula(modelslist[[7]]))
returns
> TRUE
Needless to say, all this confuses model.avg, which uses the formulas to build the model selection table (this is a fallback, because gls objects lack a terms component as well).
Edit: possible workarounds
A much easier way to get what you want:
model.avg(dredge(..., m.lim = c(6,6)))
or, if you want to make predictions:
modellist <- lapply(dredge(..., m.lim = c(6,6), evaluate = FALSE), eval)
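With the question's objects, that might look like this (an untested sketch; note method = "ML", since models with different fixed effects should not be compared under the default REML):
global <- gls(as.formula(f1), data = dat, method = "ML")
ms <- dredge(global, m.lim = c(6, 6))
summary(model.avg(ms))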
But, if you want to use an arbitrary set of models, replace the $call$model element in each gls model object with a proper formula, e.g.
combns <- combn(1:7, 6)
modellist <- vector("list", 7)
for (j in 1:7) {
f <- reformulate(predictors[combns[, j]], response = response)
fm <- gls(f, data = dat)
fm$call$model <- f # assign the actual formula
modellist[[j]] <- fm
}
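A quick check that the workaround behaves as intended (assuming the loop above has run):
# the formulas now differ between models
all.equal(formula(modellist[[1]]), formula(modellist[[7]]))  # no longer TRUE
summary(model.avg(modellist))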
I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order compared to the data and to the midpoint predictions from lme4's standard predict() method. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variables as character and enforcing the order of the rows (both before fitting the model and between fitting and predicting), but without success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example, or see if you can create one?
Or can you test the issue by first coercing the factors to characters and seeing whether the same re-ordering occurs?
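A quick version of that coercion test might look like this (hypothetical, reusing the question's lmeFit and predData):
# coerce the factor columns to character before predicting
predData2 <- transform(predData, z = as.character(z), id = as.character(id))
head(predictInterval(lmeFit, newdata = predData2, n.sims = 1000))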
Looking to create an AIC selection table for a publication in LaTeX format, but I cannot seem to get the form I want. I have googled this to death and was VERY surprised I couldn't find an answer; I've found answers to much more obscure questions in R.
Below is a bit of code, a few tables I made that I'm not overly keen on, and at the bottom is the general structure I'd like to make, but in nice latex table format as the stargazer package does.
I tried to use extra arguments for both packages to attain what I wanted but was unsuccessful.
##Create dummy variables
a<-1:10
b<-c(10:3,1,2)
c<-c(1,4,5,3,7,3,6,2,4,5)
##Create df
df<-data.frame(a,b,c)
##Build models
m1<-lm(a~b,data=df)
summary(m1)
m2<-lm(a~c,data=df)
m3<-lm(a~b+c,data=df)
m4<-lm(a~b*c,data=df)
##View list of AIC values
AIC(m1,m2,m3,m4)
########################CREATE AIC SELECTION TABLE
##Using MuMIn Package
library(MuMIn)
modelTABLE <- model.sel(m1,m2,m3,m4)
View(modelTABLE) ##No AIC values, just AICc; no R-squared; and model names (e.g., a~b) not present
##Using stargazer Package
library(stargazer)
test<-stargazer(m1,m2,m3,m4 ,
type = "text",
title="Regression Results",
align=TRUE,
style="default",
dep.var.labels.include=TRUE,
flip=FALSE
## ,out="models.htm"
)
View(test) ##More of a table depicting individual covariate attributes, bottom of table doesn't have AIC
###Would like a table similar to the following
Model ModelName df logLik AIC delta AICweight R2
m1 a ~ b 3 -6.111801 18.2 0 0.95 0.976
m3 a ~ b + c 4 -5.993613 20 1.8 0.05 0.976
m4 a ~ b * c 5 -5.784843 21.6 3.4 0.00 0.977
m2 a ~ c 3 -24.386821 54.8 36.6 0.00 0.068
The model.sel result is a data.frame, so you can modify it (add model names, round numbers, etc.) and export it to LaTeX using, e.g., latex from the Hmisc package.
# include R^2:
R2 <- function(x) summary(x)$r.squared
ms <- model.sel(m1, m2, m3, m4, extra = "R2")
i <- 1:4 # indices of columns with model terms
response <- "a"
res <- as.data.frame(ms)
v <- names(ms)[i]
v[v == "(Intercept)"] <- 1
# create formula-like model names:
mnames <- apply(res[, i], 1, function(x)
deparse(simplify.formula(reformulate(v[!is.na(x)], response = response))))
## OR, take the formulas from the stored model list:
# mnames <- sapply(attr(ms, "modelList"), function(x) deparse(formula(x)))
res <- cbind(model = mnames, res[, -i])
Hmisc::latex(res, file = "")
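For example, rounding the numeric columns before export (a small convenience sketch):
num <- sapply(res, is.numeric)
res[num] <- lapply(res[num], round, 3)
Hmisc::latex(res, file = "")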
I am new to R and I am stuck on a problem. I am trying to read a set of data into a table and I want to perform linear modeling. Below is how I read my data and my variable names:
>data =read.table(datafilename,header=TRUE)
>names(data)
[1] "price" "model" "size" "year" "color"
What I want to do is create several linear models using different combinations of the variables (price being the target), such as:
> attach(data)
> model1 = lm(price~model+size)
> model2 = lm(price~model+year)
> model3 = lm(price~model+color)
> model4 = lm(price~size+year+color)
#... and so on for all different combination...
My main aim is to compare the different models. Is there a more clever way to generate these models instead of hard coding the variables, especially that the number of my variables in some cases will increase to 13 or so.
If your goal is model selection there are several tools available in R which attempt to automate this process. Read the documentation on dredge(...) in the MuMIn package.
# dredge: example of use
library(MuMIn)
df <- mtcars[,c("mpg","cyl","disp","hp","wt")] # subset of mtcars
full.model <- lm(mpg ~ cyl+disp+hp+wt,df) # model for predicting mpg
dredge(full.model)
# Global model call: lm(formula = mpg ~ cyl + disp + hp + wt, data = df)
# ---
# Model selection table
# (Intrc) cyl disp hp wt df logLik AICc delta weight
# 10 39.69 -1.5080 -3.191 4 -74.005 157.5 0.00 0.291
# 14 38.75 -0.9416 -0.01804 -3.167 5 -72.738 157.8 0.29 0.251
# 13 37.23 -0.03177 -3.878 4 -74.326 158.1 0.64 0.211
# 16 40.83 -1.2930 0.011600 -0.02054 -3.854 6 -72.169 159.7 2.21 0.096
# 12 41.11 -1.7850 0.007473 -3.636 5 -73.779 159.9 2.37 0.089
# 15 37.11 -0.000937 -0.03116 -3.801 5 -74.321 161.0 3.46 0.052
# 11 34.96 -0.017720 -3.351 4 -78.084 165.6 8.16 0.005
# 9 37.29 -5.344 3 -80.015 166.9 9.40 0.003
# 4 34.66 -1.5870 -0.020580 4 -79.573 168.6 11.14 0.001
# 7 30.74 -0.030350 -0.02484 4 -80.309 170.1 12.61 0.001
# 2 37.88 -2.8760 3 -81.653 170.2 12.67 0.001
# 8 34.18 -1.2270 -0.018840 -0.01468 5 -79.009 170.3 12.83 0.000
# 6 36.91 -2.2650 -0.01912 4 -80.781 171.0 13.55 0.000
# 3 29.60 -0.041220 3 -82.105 171.1 13.57 0.000
# 5 30.10 -0.06823 3 -87.619 182.1 24.60 0.000
# 1 20.09 2 -102.378 209.2 51.68 0.000
You should consider these tools to help you make intelligent decisions. Do not let the tool make the decision for you!!!
For example, in this case dredge(...) suggests that the "best" model for predicting mpg, based on the AICc criterion, includes cyl and wt. But note that the AICc for this model is 157.5, whereas the second-best model has an AICc of 157.8, so these are basically the same. In fact, the first 5 models in this list are not significantly different in their ability to predict mpg. It does, however, narrow things down a bit. Among these 5, I would want to look at the distribution of the residuals (should be normal), trends in the residuals (there should be none), and leverage (do some points have undue influence?) before picking a "best" model, as in the sketch below.
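R's built-in diagnostic plots cover exactly these checks; a sketch for the top model from the table above:
# residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
best <- lm(mpg ~ cyl + wt, data = df)
par(mfrow = c(2, 2))
plot(best)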
Here's one way to get all of the combinations of variables using the combn function. It's a bit messy and uses a loop (perhaps someone can improve on this with mapply; a loop-free alternative is sketched after the code):
vars <- c("price","model","size","year","color")
N <- list(1,2,3,4)
COMB <- sapply(N, function(m) combn(x=vars[2:5], m))
COMB2 <- list()
k <- 0
for(i in seq(COMB)){
  tmp <- COMB[[i]]
  for(j in seq(ncol(tmp))){
    k <- k + 1
    COMB2[[k]] <- formula(paste("price", "~", paste(tmp[,j], collapse=" + ")))
  }
}
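A more compact equivalent without the explicit loops, using combn's FUN argument (a sketch):
COMB3 <- unlist(lapply(1:4, function(m)
  combn(vars[2:5], m, FUN = function(v)
    reformulate(v, response = "price"), simplify = FALSE)),
  recursive = FALSE)
length(COMB3)  # 15 formulas, the same set as COMB2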
Then, you can call these formulas and store the model objects using a list or possibly give unique names with the assign function:
res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
  res[[i]] <- lm(COMB2[[i]], data=data)
}
You can use stepwise multiple regression to determine which variables make sense to include. To get started, write one lm() statement with all variables, such as:
library(MASS)
fit <- lm(price ~ model + size + year + color, data=data)
Then you continue with:
step <- stepAIC(fit, direction="both")
Finally, you can use the following to show the results:
step$anova
Hope this gives you some inspiration for advancing your script.