I'm doing some exploratory work using dredge{MuMIn}. There are two variables that I want to allow together ONLY when the interaction between them is present, i.e. they cannot appear together as main effects alone.
Using sample data: I want to dredge the model fm1 (disregarding that it probably doesn't make much sense). If the variables GNP and Population appear together, they must also include the interaction between them.
require(stats); require(graphics); require(MuMIn)
## give the data set in the form it is used in S-PLUS:
longley.x <- data.matrix(longley[, 1:6])
longley.y <- longley[, "Employed"]
pairs(longley, main = "longley data")
names(longley)
fm1 <- lm(Employed ~ GNP * Population * Armed.Forces, data = longley, na.action = na.fail)
summary(fm1)
None of these attempts does what I want:
dredge(fm1, subset = !((GNP:Population) & !(GNP + Population)))
dredge(fm1, subset = !((GNP:Population) && !(GNP + Population)))
dredge(fm1, subset = dc(GNP + Population, GNP:Population))
dredge(fm1, subset = dc(GNP + Population, GNP*Population))
How can I specify in dredge() that it should disregard all models where GNP and Population are present, but not the interaction between them?
If I understand correctly, you want the two main effects (say, a and b) to appear together only with their interaction (a:b). So how about: subset = !b | (xor(a, b) | `a:b`) (enclose a:b in backticks (`), not straight quotes), e.g.:
library(MuMIn)
data(Cement)
fm <- lm(y ~ X1 * X2, Cement, na.action = na.fail)
dredge(fm, subset = !X2 | (xor(X1, X2) | `X1:X2`))
or wrap the condition in a function to make the code clearer:
test <- function(a, b, c) !a | (xor(a, b) | c)
dredge(fm, subset = test(X1, X2, `X1:X2`))
which produces the models: null, X1, X2, and X1*X2 (and excludes X1 + X2).
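Applied back to the longley example from the question, a sketch (assuming fm1 was fitted with na.action = na.fail, which dredge requires of the global model):
# allow GNP and Population together only when GNP:Population is present
dredge(fm1, subset = !Population | (xor(GNP, Population) | `GNP:Population`))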
R> data("FoodExpenditure", package = "betareg")
R> fe_lm <- lm(I(food/income) ~ income + persons, data = FoodExpenditure)
From what I understand, I(food/income) creates a new variable which is the ratio of food over income; is that correct? Are there any other combinations (functions) possible?
Observe that these two results are the same
# transformation in formula
lm(I(food/income) ~ income + persons, data = FoodExpenditure)
# Call:
# lm(formula = I(food/income) ~ income + persons, data = FoodExpenditure)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
# transformation in data
dd <- transform(FoodExpenditure, ratio=food/income)
lm(ratio ~ income + persons, data = dd)
# Call:
# lm(formula = ratio ~ income + persons, data = dd)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
The I() function in a formula with lm() lets you apply any function you like to the variables. (Just make sure the function doesn't change the number of rows, otherwise you can't fit the model properly.)
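For instance, a minimal sketch with the same data (I() protects arithmetic operators such as ^; ordinary function calls like log() don't need it):
data("FoodExpenditure", package = "betareg")
lm(I(food/income) ~ I(income^2) + log(persons), data = FoodExpenditure)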
Yes and yes.
Other possible combinations and operators are given in the documentation for formulas, ?formula. What follows is mostly taken from it.
: denotes the interaction between terms: ‘a:b’.
* denotes factor crossing: ‘a*b’ is interpreted as ‘a + b + a:b’.
^ indicates crossing to the specified degree. For example, ‘(a+b+c)^2’ is identical to ‘(a+b+c)*(a+b+c)’, which in turn expands to a formula containing the main effects for ‘a’, ‘b’ and ‘c’ together with their second-order interactions.
%in% indicates that the terms on its left are nested within those on the right. For example, ‘a + b %in% a’ expands to the formula ‘a + a:b’.
- removes the specified terms, so that ‘(a+b+c)^2 - a:b’ is identical to ‘a + b + c + b:c + a:c’. It can also be used to remove the intercept term: when fitting a linear model, ‘y ~ x - 1’ specifies a line through the origin. A model with no intercept can also be specified as ‘y ~ x + 0’ or ‘y ~ 0 + x’.
arithmetic expressions: while formulae usually involve just variable and factor names, they can also involve arithmetic expressions; the formula ‘log(y) ~ a + log(x)’ is quite legal.
I(): because operators like ‘+’ and ‘^’ have a special meaning inside a formula, the function ‘I()’ can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula ‘y ~ a + I(b+c)’, the term ‘b+c’ is interpreted as the sum of ‘b’ and ‘c’.
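A quick way to see such expansions for yourself is to inspect the term labels R derives from a formula (a sketch with toy variable names; terms() works symbolically, no data needed):
# how ^ expands, and how - removes a term
attr(terms(y ~ (a + b + c)^2), "term.labels")
# "a" "b" "c" "a:b" "a:c" "b:c"
attr(terms(y ~ (a + b + c)^2 - a:b), "term.labels")
# "a" "b" "c" "a:c" "b:c"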
I currently have a problem in that I have to pre-specify my formulas before passing them to a regression function. For example, using the stan_gamm4 function in R (from the rstanarm package), we have the following example:
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us leave everything the same, BUT use a pre-built formula for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, doing the same thing as before, but with formula.rand taking its place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW, we have:
br$call$random
> formula.rand
instead of the original formula. A lot of Bayesian packages rely on accessing br$call$random, so is there a way to use a variable for the formula, have it passed in, AND retain the original relation when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
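To see why this is a problem, try re-evaluating the stored call outside model(): there, the symbol formula no longer refers to mpg ~ disp (in the global environment it resolves to the stats::formula function), so the call breaks:
eval(m$call)  # errors: 'formula' does not refer to the formula object here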
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
stan_call <- substitute(
    stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = .rf,
               chains = 1, iter = 200),
    list(.rf = formula.rand))
br <- eval(stan_call)
If I understand correctly, your problem is not, that stan_gamm4 could be computing incorrect results (which is not the case, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call internally to find the call, I don't know of a way to specify the model differently so as to obtain a "correct" br$call$random up front. But you can simply modify it after the fact:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So it is not as if $call were just some list with stuff in it. To access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
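A minimal sketch of the same point, using lm instead of stan_gamm4 so it runs without Stan:
f <- mpg ~ disp
m <- lm(f, data = mtcars)
class(m$call$formula)        # "name": the symbol 'f', not the formula itself
class(eval(m$call$formula))  # "formula" once evaluated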
I have constructed an lme4 model for model selection in dredge, but I am having trouble aligning the random effects with the relevant fixed effects. The structure of my full model is as follows:
fullModel <- glmer(y ~ x1 + x2 + (0 + x1 | Year) + (0 + x1 | Country) + (0 + x2 | Year) + (0 + x2 | Country) + (1 | Year) + (1 | Country), family = binomial('logit'), data = alldata)
With this structure, model selection in dredge produces three combinations of fixed effects, i.e. x1, x2, and x1 + x2; however, the random effects structure remains the same as in the full model, so that even when the only fixed effect is x1, the random effects still include (0 + x2 | Year) + (0 + x2 | Country). For example, the model with only x1 as a fixed effect will still have x2 within the random effects structure:
y ~ x1 + (0 + x1 | Year) + (0 + x1 | Country) + (0 + x2 | Year) + (0 + x2 | Country) + (1 | Year) + (1 | Country), family = binomial('logit')
Is there a way to configure dredge so that it does not retain random-slope terms for fixed effects that are absent from a candidate model? I have about x1, …, x50.
You cannot do that out of the box, as dredge currently leaves all (x|g) expressions untouched, but you can make a "wrapper" around (g)lmer that replaces the "|" terms in the formula with something else (e.g. re(x, g)), so that dredge treats them as fixed effects. Example:
glmerwrap <-
function(formula) {
    cl <- origCall <- match.call()
    cl[[1L]] <- as.name("glmer")  # replace 'glmerwrap' with 'glmer'
    # replace "re" with "|" in the formula:
    f <- as.formula(do.call("substitute", list(formula, list(re = as.name("|")))))
    environment(f) <- environment(formula)
    cl$formula <- f
    x <- eval.parent(cl)  # evaluate the modified call
    # store the original call and formula in the result (S4 slots):
    x@call <- origCall
    attr(x@frame, "formula") <- formula
    x
}
formals(glmerwrap) <- formals(lme4::glmer)
Following example(glmer):
# note the use of re(x,group) instead of (x|group)
(fm <- glmerwrap(cbind(incidence, size - incidence) ~ period +
re(1, herd) + re(1, obs), family = binomial, data = cbpp))
Now,
dredge(fm)
manipulates both fixed and random effects.
In this link they show how to use lm() with a data frame:
Right way to use lm in R
However (being completely new to R) I'm still a little unclear on the syntax.
Is there more to this addition of the . to y ~, or does it simply denote that you have moved from a vector input to a data frame input?
The . notation in a formula is commonly taken to mean "all other variables in data that do not already appear in the formula". Consider the following:
df <- data.frame(y = rnorm(10), A = runif(10), B = rnorm(10))
mod <- lm(y ~ ., data = df)
R> coef(mod)
(Intercept) A B
-0.8389 0.5635 -0.2160
Ignore the values above; what is important is that there are two terms in the model (plus the intercept), taken from the set of names(df) that do not include y. This is exactly the same as writing out the full formula
mod <- lm(y ~ A + B, data = df)
but involves less typing. It is a convenient shortcut when the model formula might include many variables.
The other place this crops up is in update(), where the second argument is a formula and one uses . to indicate "what was already there". For example:
R> coef(update(mod, . ~ . - B))
(Intercept) A
-0.8156 0.5919
Hence the first ., to the left of ~, expands to "keep the existing response variable y", whilst the second ., to the right of ~, expands to A + B; hence we have A + B - B, which cancels to A.
I'm trying to predict a simple lagged time series regression with the dyn library in R. This question was a helpful starting point, but I'm getting some weird behaviour that I'm hoping someone can explain.
Here's a minimum working example.
library(dyn)
# Initial data
y.orig <- arima.sim(model=list(ar=c(.9)),n=10)
x1.orig <- rnorm(10)
data <- cbind(y=y.orig, x1=x1.orig)
# This model, with a single lag term, predicts from t=2
mod1 <- dyn$lm(y ~ lag(y, -1), data)
y.new <- window(y.orig, end=end(y.orig) + c(5,0), extend=TRUE)
newdata1 <- cbind(y=y.new)
predict(mod1, newdata1)
# This one, with a lag plus another predictor, predicts from t=1 on
mod2 <- dyn$lm(y ~ lag(y, -1) + x1, data)
y.new <- window(y.orig, end=end(y.orig) + c(5,0), extend=TRUE)
x1.new <- c(x1.orig, rnorm(5))
newdata2 <- cbind(y=y.new, x1=x1.new)
predict(mod2, newdata2)
Why is there a difference between the two? And can anyone suggest how to predict my mod1 using dyn? Thanks in advance.
Both mod1 and mod2 start predicting at t=2; the prediction vector for mod2 starts at t=1, but its first element is NA. Regarding why one starts at 2 and the other at 1: predict merges together the variables on the right-hand side of the formula. In the case of mod1, lag(y, -1) starts at t=2, since y starts at t=1. On the other hand, in the case of mod2, when we merge lag(y, -1) and x1 we get a series that starts at t=1 (since x1 starts at t=1). Try this, which does not involve dyn:
> start(with(as.list(newdata1), merge.zoo(lag(y, -1))))
[1] 2
> start(with(as.list(newdata2), merge.zoo(lag(y, -1), x1)))
[1] 1
If we wanted predict(mod1, newdata1) to start at t=1, we could add our own Intercept column and remove the default intercept to avoid duplication. That forces the prediction to start at t=1, since the right-hand side now contains a series that starts at t=1:
data.b <- cbind(y=y.orig, x1=x1.orig, Intercept = 1)
mod.b <- dyn$lm(y ~ Intercept + lag(y, -1) - 1, data.b)
newdata.b <- cbind(Intercept = 1, y = y.new)
predict(mod.b, newdata.b)
Regarding the second question: if you want in-sample predictions for mod1, use fitted(mod1).
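For example (the fitted values align with the observations used in the fit, i.e. from t=2):
fitted(mod1)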
It seems there is some third question lurking about how it all basically works, so maybe this clarifies it. All dyn does is align the time series in the formula; then lm and predict can be run as usual. For example, if we create an aligned model frame using dyn$model.frame, everything else can be done using just ordinary lm and ordinary predict, and dyn is not involved from that point onwards. Below, mod1a is similar to mod1 from the question, except that it runs an ordinary lm on the aligned model frame. If you understand mod1a's lm and its predict, then mod1 and its predict are similar.
## mod1 and mod1a are similar
# from code in the question
mod1 <- dyn$lm(y ~ lag(y, -1), data = data)
mod1
# redo it using a plain lm by applying dyn to model.frame
mf <- dyn$model.frame(y ~ lag(y, -1), data = data)
mod1a <- lm(y ~ `lag(y, -1)`, mf)
mod1a
## the two predicts below are similar
# the 1 ensures it's an mts rather than a ts but is otherwise not used
newdata1 <- cbind(y=y.new, 1)
predict(mod1, newdata1)
newdata1a <- cbind(1, `lag(y, -1)` = lag(y.new, -1))
predict(mod1a, newdata1a)