I am new to R and I am stuck on a problem. I am trying to read a set of data into a table so that I can perform linear modeling on it. Below is how I read my data and my variable names:
>data =read.table(datafilename,header=TRUE)
>names(data)
[1] "price" "model" "size" "year" "color"
What I want to do is create several linear models using different combinations of the variables (price being the target), such as:
> attach(data)
> model1 = lm(price~model+size)
> model2 = lm(price~model+year)
> model3 = lm(price~model+color)
> model4 = lm(price~size+year)
> model5 = lm(price~size+year+color)
#... and so on for all the different combinations...
My main aim is to compare the different models. Is there a cleverer way to generate these models instead of hard-coding the variables, especially since the number of variables will in some cases increase to 13 or so?
If your goal is model selection, there are several tools available in R which attempt to automate this process. Read the documentation for dredge(...) in the MuMIn package.
# dredge: example of use
library(MuMIn)
df <- mtcars[,c("mpg","cyl","disp","hp","wt")] # subset of mtcars
full.model <- lm(mpg ~ cyl+disp+hp+wt,df) # model for predicting mpg
dredge(full.model)
# Global model call: lm(formula = mpg ~ cyl + disp + hp + wt, data = df)
# ---
# Model selection table
# (Intrc) cyl disp hp wt df logLik AICc delta weight
# 10 39.69 -1.5080 -3.191 4 -74.005 157.5 0.00 0.291
# 14 38.75 -0.9416 -0.01804 -3.167 5 -72.738 157.8 0.29 0.251
# 13 37.23 -0.03177 -3.878 4 -74.326 158.1 0.64 0.211
# 16 40.83 -1.2930 0.011600 -0.02054 -3.854 6 -72.169 159.7 2.21 0.096
# 12 41.11 -1.7850 0.007473 -3.636 5 -73.779 159.9 2.37 0.089
# 15 37.11 -0.000937 -0.03116 -3.801 5 -74.321 161.0 3.46 0.052
# 11 34.96 -0.017720 -3.351 4 -78.084 165.6 8.16 0.005
# 9 37.29 -5.344 3 -80.015 166.9 9.40 0.003
# 4 34.66 -1.5870 -0.020580 4 -79.573 168.6 11.14 0.001
# 7 30.74 -0.030350 -0.02484 4 -80.309 170.1 12.61 0.001
# 2 37.88 -2.8760 3 -81.653 170.2 12.67 0.001
# 8 34.18 -1.2270 -0.018840 -0.01468 5 -79.009 170.3 12.83 0.000
# 6 36.91 -2.2650 -0.01912 4 -80.781 171.0 13.55 0.000
# 3 29.60 -0.041220 3 -82.105 171.1 13.57 0.000
# 5 30.10 -0.06823 3 -87.619 182.1 24.60 0.000
# 1 20.09 2 -102.378 209.2 51.68 0.000
You should use these tools to help you make intelligent decisions; do not let the tool make the decision for you!
For example, in this case dredge(...) suggests that the "best" model for predicting mpg, based on the AICc criterion, includes cyl and wt. But note that the AICc for this model is 157.5 whereas the second-best model has an AICc of 157.8, so these are essentially equivalent. In fact, the first 5 models in this list are not significantly different in their ability to predict mpg. The table does, however, narrow things down a bit. Among these 5, I would want to look at the distribution of the residuals (it should be normal), trends in the residuals (there should be none), and leverage (do some points have undue influence?) before picking a "best" model.
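A quick way to run those diagnostics on one of the candidates is base R's plot method for lm objects; here is a sketch for the top-ranked model from the table above:

## Sketch: standard diagnostics for the top-ranked model (mpg ~ cyl + wt)
fit <- lm(mpg ~ cyl + wt, data = df)
par(mfrow = c(2, 2))  # show all four diagnostic panels at once
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))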
Here's one way to get all of the combinations of variables using the combn function. It's a bit messy, and uses a loop (perhaps someone can improve on this with mapply):
vars <- c("price","model","size","year","color")
N <- list(1,2,3,4)                                   # model sizes: 1 to 4 predictors
COMB <- sapply(N, function(m) combn(x=vars[2:5], m)) # all predictor combinations of each size
COMB2 <- list()
k <- 0
for(i in seq(COMB)){
    tmp <- COMB[[i]]
    for(j in seq(ncol(tmp))){
        k <- k + 1
        COMB2[[k]] <- formula(paste("price", "~", paste(tmp[,j], collapse=" + ")))
    }
}
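As an aside, combn() also takes FUN and simplify arguments, so the same list of formulas can be built without the explicit double loop. A sketch producing the same COMB2 (reformulate() is the stats helper that builds a formula from term labels):

## Same result without the explicit double loop
COMB2 <- unlist(lapply(1:4, function(m)
    combn(vars[2:5], m,
          FUN = function(v) reformulate(v, response = "price"),
          simplify = FALSE)),
    recursive = FALSE)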
Then, you can fit these formulas and store the model objects in a list, or possibly give them unique names with the assign function:
res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
    res[[i]] <- lm(COMB2[[i]], data=data)
}
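The fitted models can then be compared in bulk, for example by tabulating AIC (a sketch; the `comparison` name is just for illustration):

## rank all candidate models by AIC
comparison <- data.frame(formula = sapply(res, function(m) deparse(formula(m))),
                         AIC = sapply(res, AIC))
comparison[order(comparison$AIC), ]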
You can use stepwise multiple regression to determine which variables make sense to include. To get started, write one lm() statement with all variables, such as:
library(MASS)
fit <- lm(price ~ model + size + year + color, data = data)
Then you continue with:
step <- stepAIC(fit, direction="both")
Finally, you can use the following to show the results:
step$anova
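Note that the object returned by stepAIC() is itself the final fitted model, so you can also inspect the selected fit directly:

summary(step)  # coefficients of the selected model
formula(step)  # its formula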
Hope this gives you some inspiration for advancing your script.
These are three different ways to run an individual fixed-effects model, which give more or less the same results (see below). My main question is how to get predicted probabilities or average marginal effects using the second model (model_plm) or the third model (model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when I have a small sample.
As I have over a million individuals, my model only works using model_plm and model_felm. If I use model_lm, it takes a very long time to run with one million individuals, since they are controlled for in the model. I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on Stack Overflow to work around that error but nothing seems to solve it.
I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predicted probabilities or average marginal effects using one of these packages: ggeffects, emmeans or margins.
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence+health + residence*union +factor(nr)-1, data=Males)
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr", "year"), data=Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data= Males)
pred_ggeffects <- ggpredict(model_lm, c("residence","union"),
                            vcov.fun = "vcovCL",
                            vcov.type = "HC1",
                            vcov.args = list(cluster = Males$nr))
I tried adjusting the formula/datasets to get emmeans and plm to play nicely. Let me know if there's something useful here. After some testing, I realized the biglm answer wasn't going to cut it for a million individuals.
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)
## expanded dataset
Males2 <- data.frame(wage=Males[complete.cases(Males),"wage"],
                     model.matrix(wage ~ exper + residence + health + residence*union, Males),
                     nr=Males[complete.cases(Males),"nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
                  model = "within",
                  index=c("nr"),
                  data=Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))
plm2_rg <- qdrg(fmla2_rg,
                data = Males2,
                coef = coef(model_plm2),
                vcov = vcov(model_plm2),
                df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
Which gives, after some row-deletion:
> ### when all 3 residences are 0, that's `rural area`
> ### then just pick the rows when one of the residences are 1
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
residencenorth_east residencenothern_central residencesouth unionyes emmean SE df lower.CL upper.CL
0 0 0 0 0.3777 0.0335 2677 0.31201 0.443
1 0 0 0 0.3301 0.1636 2677 0.00929 0.651
0 1 0 0 0.1924 0.1483 2677 -0.09834 0.483
0 0 1 0 0.2596 0.1514 2677 -0.03732 0.557
0 0 0 1 0.2875 0.1473 2677 -0.00144 0.576
1 0 0 1 0.3845 0.1647 2677 0.06155 0.708
0 1 0 1 0.3326 0.1539 2677 0.03091 0.634
0 0 1 1 0.3411 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
           model = "within",
           index = "nr",
           data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
           data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that things line up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
residence union emmean SE df lower.CL upper.CL
rural_area no 0.378 0.0335 2677 0.31201 0.443
north_east no 0.330 0.1636 2677 0.00929 0.651
nothern_central no 0.192 0.1483 2677 -0.09834 0.483
south no 0.260 0.1514 2677 -0.03732 0.557
rural_area yes 0.287 0.1473 2677 -0.00144 0.576
north_east yes 0.385 0.1647 2677 0.06155 0.708
nothern_central yes 0.333 0.1539 2677 0.03091 0.634
south yes 0.341 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
In general, the key is to identify where the added column occurs. It's going to be the position of the first level of the first factor in the model formula. You can check it by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.
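For instance, with the model above, the added column can be found programmatically. A sketch (per the output above, it should return "residencerural_area"; `form` is just an illustrative name):

## Which model-matrix column has no matching coefficient?
form <- ~ -1 + exper + residence + health + residence*union
setdiff(colnames(model.matrix(form, data = Males)), names(coef(mod)))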
Update: a general function
Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.
plmrg = function(object, ...) {
    form = formula(formula(object))
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    sel = which(colnames(mmat) %in% names(coef(object)))
    k = ncol(mmat)
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data,
                  coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
residence emmean SE df lower.CL upper.CL
rural_area 0.313 0.0791 2677 0.1579 0.468
north_east 0.338 0.1625 2677 0.0190 0.656
nothern_central 0.243 0.1494 2677 -0.0501 0.536
south 0.281 0.1514 2677 -0.0161 0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with a nuisance factor specified. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels has NAs.
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 + exper + residence + health + residence*union + factor(nr),
                     data=Males[!is.na(Males$residence),])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a "[q]uick and [d]irty [r]eference [g]rid and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence + health + residence*union + factor(nr),
            data = Males,
            coef = coef(model_biglm),
            vcov = vcov(model_biglm),
            df = model_biglm$df.resid,
            nuisance="nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
Which gives:
> emmeans(rg1, c("residence","union"))
residence union emmean SE df lower.CL upper.CL
rural_area no 1.72 0.1417 2677 1.44 2.00
north_east no 1.67 0.0616 2677 1.55 1.79
nothern_central no 1.53 0.0397 2677 1.45 1.61
south no 1.60 0.0386 2677 1.52 1.68
rural_area yes 1.63 0.2011 2677 1.23 2.02
north_east yes 1.72 0.0651 2677 1.60 1.85
nothern_central yes 1.67 0.0503 2677 1.57 1.77
south yes 1.68 0.0460 2677 1.59 1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95
I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here is an example set:
(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);
(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);
(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);
(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);
(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);
(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);
(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);
(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);
In this example there are 2 unique topologies, as R's ape::unique.multiPhylo shows (assuming the example above is saved to a file tree.fn):
tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)
> length(tree)
[1] 8
> length(unique.tree)
[1] 2
My question is: how do I get a list of trees, one representing each unique topology in the input list, where the branch lengths are a summary statistic, such as the mean or median, across all trees with the same topology?
In the example above, it would return the first tree as-is, because its topology is unique, plus one tree with the topology shared by the other trees and mean or median branch lengths.
If I understand correctly, you want to sort the trees into groups by unique topology (e.g. in your example, the first group contains one tree, etc.) and then compute some statistics for each group?
You can do that by first grouping the topologies into a list:
library(ape)
set.seed(5)
## Generating 20 four-tip trees (hopefully some will share identical topologies!)
tree_list <- rmtree(20, 4)
## How many unique topologies?
length(unique(tree_list))
## Sorting the trees by topologies
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
    counter <- counter+1
    ## Is the first tree equal to any of the trees in the list?
    equal_to_tree_one <- unlist(lapply(tree_list_tmp,
                                       function(x, base) all.equal(x, base, use.edge.length = FALSE),
                                       base = tree_list_tmp[[1]]))
    ## Saving the identical trees
    sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
    ## Removing them from the list
    tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
    ## Repeat while there are still some trees!
}
## The list of topologies should be equal to the number of unique trees
length(sorted_tree_list) == length(unique(tree_list))
## Giving them names for fancyness
names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))
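As an optional sanity check (a small sketch), the group sizes should add up to the total number of input trees:

## Number of trees in each topology group; these should sum to length(tree_list)
lengths(sorted_tree_list)
sum(lengths(sorted_tree_list)) == length(tree_list)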
Then, for all the trees in each unique-topology group, you can extract summary statistics by writing a function. Here, for example, I compute the mean, the standard deviation, and the 5% and 95% quantiles of the branch lengths.
## function for getting some stats
get.statistics <- function(unique_topology_group) {
    ## Extract the branch lengths of all the trees
    branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
    ## Apply some statistics
    return(c(n = length(unique_topology_group),
             mean = mean(branch_lengths),
             sd = sd(branch_lengths),
             quantile(branch_lengths, prob = c(0.05, 0.95))))
}
## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)
## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
# n mean sd 5% 95%
# topology1 3 0.559 0.315 0.113 0.962
# topology2 2 0.556 0.259 0.201 0.889
# topology3 4 0.525 0.378 0.033 0.989
# topology4 2 0.489 0.291 0.049 0.855
# topology5 2 0.549 0.291 0.062 0.882
# topology6 1 0.731 0.211 0.443 0.926
# topology7 3 0.432 0.224 0.091 0.789
# topology8 1 0.577 0.329 0.115 0.890
# topology9 1 0.473 0.351 0.108 0.833
# topology10 1 0.439 0.307 0.060 0.795
Of course you can tweak it to get your own desired stats, or even get the stats per tree per group (using a double lapply: lapply(sorted_tree_list, lapply, get.statistics) or something like that).
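For instance, the per-tree version of that double lapply might look like this (a sketch computing a couple of per-tree statistics; `per_tree_stats` is just an illustrative name):

## Per-tree summary statistics within each topology group
per_tree_stats <- lapply(sorted_tree_list, lapply, function(tree)
    c(mean = mean(tree$edge.length), sd = sd(tree$edge.length)))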
I am trying to use MuMIn's model.avg function with model formulae that are pasted together and accessed by index, rather than typed in directly, for example:
m1<-gls(as.formula(paste(response,"~",paste(combns[,j], collapse="+"))), data=dat)
'combns' is a 2D array created by combn() containing combinations of predictor variables. This produces model-averaged coefficients and AICc values identical to those produced when the gls calls contain the formulae directly, e.g.:
m1<-gls(median_Ta ~ day_of_season + hour_of_day + pct_grey_cover +
          foliage_height_diversity + tree_shannon_diversity + median_patch_size, data=dat)
However, the relative variable importance is not computed correctly, and I believe this has to do with the use of a for loop, or with using a variable to index the list in which the models are stored, somehow causing the component model terms not to be 'read' properly (see the term codes for the models):
Component models:
df logLik AICc delta weight
1234567b 7 -233.08 481.43 0.00 0.59
1234567f 3 -237.97 482.21 0.78 0.40
1234567e 4 -241.32 491.08 9.65 0.00
1234567a 9 -241.15 502.39 20.96 0.00
1234567c 6 -248.37 509.68 28.25 0.00
1234567d 5 -250.22 511.11 29.68 0.00
Term codes:
day_of_season foliage_height_diversity hour_of_day
1 2 3
median_patch_size pct_grey_cover tree_shannon_diversity
4 5 6
urban_boundary_distance
7
This results in the relative variable importance being given as:
Relative variable importance:
day_of_season foliage_height_diversity hour_of_day
Importance: 1 1 1
N containing models: 6 6 6
median_patch_size pct_grey_cover tree_shannon_diversity
Importance: 1 1 1
N containing models: 6 6 6
urban_boundary_distance
Importance: 1
N containing models: 6
Whereas if I use model.avg over the same models with the formulae typed individually, I get the following, correct output:
Component models:
df logLik AICc delta weight
23456 7 -233.08 481.43 0.00 0.59
1 3 -237.97 482.21 0.78 0.40
57 4 -241.32 491.08 9.65 0.00
1234567 9 -241.15 502.39 20.96 0.00
1467 6 -248.37 509.68 28.25 0.00
147 5 -250.22 511.11 29.68 0.00
Relative variable importance:
pct_grey_cover median_patch_size tree_shannon_diversity
Importance: 0.6 0.59 0.59
N containing models: 3 4 3
foliage_height_diversity hour_of_day day_of_season
Importance: 0.59 0.59 0.4
N containing models: 2 2 4
urban_boundary_distance
Importance: <0.01
N containing models: 4
How can I make model.avg read the predictor variables in the formulae properly? I've only included six models as an example here but I want to compare the full set of 128 models (and I have other response variables with larger numbers of predictor variables), so typing them out individually isn't feasible.
Thanks in advance.
Edit: reproducible example
It took me a while to narrow down the problem. The first example, m.ave, shows the problem in action with a for loop. The second example, m.ave2, shows it working with the indices typed out rather than supplied via a variable. Obviously this is just a small subset of the predictor variables.
require(nlme)
require(MuMIn)
dat <- data.frame(median_Ta=rnorm(100), day_of_season=runif(100), hour_of_day=runif(100),
                  pct_grey_cover=rnorm(100), foliage_height_diversity=rnorm(100),
                  urban_boundary_distance=runif(100), tree_shannon_diversity=rnorm(100),
                  median_patch_size=rnorm(100))
f1<-"median_Ta ~ day_of_season + hour_of_day + pct_grey_cover + foliage_height_diversity +
urban_boundary_distance + tree_shannon_diversity + median_patch_size"
f1<-gsub("\\s", "", f1) # remove whitespace
f1split <- strsplit(f1, split="~") # split predictors and response
response <- f1split[[1]][1]
predictors <- strsplit(f1split[[1]][2], split="+", fixed=TRUE)[[1]]
modelslist<-list()
combns <- combn(predictors, 6)
for (j in 1:7) {
    modelslist[[j]] <- gls(as.formula(paste(response,"~",paste(combns[,j], collapse="+"))), data=dat)
}
m.ave <- model.avg(modelslist[[1]], modelslist[[2]], modelslist[[3]], modelslist[[4]],
                   modelslist[[5]], modelslist[[6]], modelslist[[7]])
summary(m.ave)
#compare....
modelslist2<-list()
modelslist2[[1]]<-gls(as.formula(paste(response,"~",paste(combns[,1], collapse="+"))), data=dat)
modelslist2[[2]]<-gls(as.formula(paste(response,"~",paste(combns[,2], collapse="+"))), data=dat)
modelslist2[[3]]<-gls(as.formula(paste(response,"~",paste(combns[,3], collapse="+"))), data=dat)
modelslist2[[4]]<-gls(as.formula(paste(response,"~",paste(combns[,4], collapse="+"))), data=dat)
modelslist2[[5]]<-gls(as.formula(paste(response,"~",paste(combns[,5], collapse="+"))), data=dat)
modelslist2[[6]]<-gls(as.formula(paste(response,"~",paste(combns[,6], collapse="+"))), data=dat)
modelslist2[[7]]<-gls(as.formula(paste(response,"~",paste(combns[,7], collapse="+"))), data=dat)
m.ave2 <- model.avg(modelslist2[[1]], modelslist2[[2]], modelslist2[[3]], modelslist2[[4]],
                    modelslist2[[5]], modelslist2[[6]], modelslist2[[7]])
summary(m.ave2)
This is a bug in the formula method for gls (in package nlme). Since the actual formula is not stored anywhere in the object, the method evaluates the "model" argument from the function call. In the case of the elements of modelslist, these calls are all the same; for example:
modelslist[[1]]$call$model
modelslist[[7]]$call$model
both return
> formula(paste(response, "~", paste(combns[, j], collapse = "+")))
which, when evaluated, uses the current (last) value of j, so formula(modelslist[[N]]) returns the last model's formula for every N.
all.equal(formula(modelslist[[1]]), formula(modelslist[[7]]))
returns
> TRUE
Needless to say, all this confuses model.avg, which uses the formulas to build the model selection table (this is a fallback because gls objects lack a terms component as well).
Edit: possible workarounds
A much easier way to get what you want:
model.avg(dredge(..., m.lim = c(6,6)))
or, if you want to make predictions:
modellist <- lapply(dredge(..., m.lim = c(6,6), evaluate = FALSE), eval)
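Concretely, with the reproducible example above, that could look something like this (a sketch; note that dredge() requires na.action = na.fail, and ML fitting is assumed so that models with different fixed effects are comparable):

full <- gls(as.formula(f1), data = dat, method = "ML", na.action = na.fail)
m.ave3 <- model.avg(dredge(full, m.lim = c(6, 6)))
summary(m.ave3)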
But if you want to use an arbitrary set of models, replace the $call$model element in each gls model object with the proper formula, e.g.:
combns <- combn(1:7, 6)
modellist <- vector("list", 7)
for (j in 1:7) {
    f <- reformulate(predictors[combns[, j]], response = response)
    fm <- gls(f, data = dat)
    fm$call$model <- f  # assign the actual formula
    modellist[[j]] <- fm
}
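You can check that the fix worked: each element now reports its own formula, and model.avg() accepts the list directly (a sketch):

formula(modellist[[1]])  # now returns the first model's own formula
formula(modellist[[7]])  # ...and this returns the seventh's
summary(model.avg(modellist))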
I have the following data (dat):
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform Tukey's HSD pairwise test on the above data set.
library(tidyr)
dat1 <- gather(dat) # convert to long form: columns 'key' and 'value'
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, when I try to use the wide/long format, I run into an error that says:
Error in UseMethod("TukeyHSD") :
  no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
We need 'x' to be dat1$value. As the arguments are not named, the first argument is taken as 'x' and the second as 'g':
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we can specify the arguments by name and use them in any order we want:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding TukeyHSD():
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
Let's assume I want to explore all possible combinations of variables in a GLMM (using lme4), but I don't want certain pairs of variables to appear together in the same model. How do I do that? For instance, I want to consider 3 fixed effects and 3 random effects, but there are particular random (or fixed) effects that should not be considered at the same time in one model. If I construct the model this way:
model1 <- glmer(x~var1+var2+var3+(1|var4)+(1|var5)+(1|var6),
                data=data1)
and I use MuMIn::dredge() function (to perform model averaging later), I will get all possible combinations between them, but I don't want (1|var4) to be in the same model as (1|var5).
So, is it possible to limit model combinations? This way I would avoid unnecessary models and save computing time.
Just in case anyone is looking for an updated solution to this: you can now do this more easily with the subset argument of dredge:
#full model
model1 <- glmer(x~var1+var2+var3+(1|var4)+(1|var5)+(1|var6),data=data1)
#exclude models containing both (1|var4) & (1|var5) at the same time
dredge(model1, subset = !((1|var4) && (1|var5)))
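Since the goal here is model averaging, the restricted candidate set can then be passed straight to model.avg() (a sketch):

#model-average over the restricted set of models
ma <- model.avg(dredge(model1, subset = !((1|var4) && (1|var5))))
summary(ma)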
I don't know how to do this within MuMIn::dredge() (but see my attempts below).
set.seed(101)
dd <- data.frame(x=rnorm(1000),
                 var1=rnorm(1000),
                 var2=rnorm(1000),
                 var3=rnorm(1000),
                 var4=sample(factor(sample(1:20,size=1000,replace=TRUE))),
                 var5=sample(factor(sample(1:20,size=1000,replace=TRUE))),
                 var6=sample(factor(sample(1:20,size=1000,replace=TRUE))))
library(lme4)
m0 <- lmer(x~var1+var2+var3+(1|var4)+(1|var5)+(1|var6),dd,REML=FALSE,
           na.action=na.fail)
If we try to use the m.lim argument, it subsets only the fixed effects but leaves all the random-effect terms in every model:
dredge(m0,m.lim=c(0,1))
## Model selection table
## (Intrc) var1 var2 var3 df logLik AICc delta weight
## 1 0.02350 5 -1417.485 2845.0 0.00 0.412
## 3 0.02389 -0.03256 6 -1416.981 2846.0 1.02 0.248
## 5 0.02327 0.02168 6 -1417.254 2846.6 1.56 0.189
## 2 0.02349 -0.002981 6 -1417.480 2847.0 2.02 0.151
## Models ranked by AICc(x)
## Random terms (all models):
## ‘1 | var4’, ‘1 | var5’, ‘1 | var6’
Following demo(dredge.subset), I tried this as an example:
dredge(m0,
       subset=expression(!( (var1 && var2) || ((1|var4) && (1|var5)))))
but got
Error in dredge(m0, subset = expression(!((var1 && var2) || ((1 | var4) && :
unrecognized names in 'subset' expression: "var4" and "var5"
I can't find any documentation on how to do dredging/model averaging with MuMIn::dredge() across models with different random effects (indeed, I'm not convinced this is a good idea). If you wanted to fit all models with exactly one fixed-effect and exactly one random-effect term, you could do it as follows:
Set up all combinations:
fvars <- paste0("var",1:3)
gvars <- paste0("(1|var",4:6,")")
combs <- as.matrix(expand.grid(fvars,gvars))
Now fit them:
mList <- list()
for (i in 1:nrow(combs)) {
    mList[[i]] <- update(m0,
                         formula=reformulate(combs[i,],response="x"))
}
Now you can use lapply or sapply to operate on the elements of the list, e.g.:
lapply(mList,formula)
## [[1]]
## x ~ var1 + (1 | var4)
##
## [[2]]
## x ~ var2 + (1 | var4)
##
## [[3]]
## x ~ var3 + (1 | var4)
##
## [[4]]
## x ~ var1 + (1 | var5)
## ... et cetera ...
bbmle::AICtab(mList,weights=TRUE)
## dAIC df weight
## model5 0.0 4 0.344
## model6 0.5 4 0.262
## model4 1.0 4 0.213
## model8 4.1 4 0.044
## ... et cetera ...
... but you'll have to work a bit harder to do model averaging. You might try r-sig-mixed-models@r-project.org, r-sig-ecology@r-project.org, or e-mail the maintainer of MuMIn (maintainer("MuMIn")) ...