When to use `y ~ .` with `lm()` - r

In this link they show how to use lm() with a data frame:
Right way to use lm in R
However (being completely new to R) I'm still a little unclear on the syntax.
Is there more to this addition of the . to y ~, or does it simply denote that you have moved from a vector input to a data frame input?

The . notation in a formula is commonly taken to mean "all other variables in data that do not already appear in the formula". Consider the following:
df <- data.frame(y = rnorm(10), A = runif(10), B = rnorm(10))
mod <- lm(y ~ ., data = df)
coef(mod)
R> coef(mod)
(Intercept) A B
-0.8389 0.5635 -0.2160
Ignore the values above; what is important is that there are two terms in the model (plus the intercept), taken from the set of names(df) that do not include y. This is exactly the same as writing out the full formula
mod <- lm(y ~ A + B, data = df)
but involves less typing. It is a convenient shortcut when the model formula might include many variables.
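To see the expansion directly, here is a minimal sketch (the seed and the object names are arbitrary choices for illustration). It asks terms() to expand the dot against the data frame and checks that both spellings give the same fit:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(y = rnorm(10), A = runif(10), B = rnorm(10))

# terms() expands the dot against the columns of `data`
expanded <- formula(terms(y ~ ., data = df))
print(expanded)

# Both spellings fit the same model
mod_dot  <- lm(y ~ ., data = df)
mod_full <- lm(y ~ A + B, data = df)
same_fit <- isTRUE(all.equal(coef(mod_dot), coef(mod_full)))

# The dot combines with other operators, e.g. "everything except B"
no_B <- attr(terms(y ~ . - B, data = df), "term.labels")
```

Here same_fit should be TRUE, and no_B should contain only "A".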
The other place this crops up is in update(), where the second argument is a formula and one uses . to indicate "what was already there". For example:
coef(update(mod, . ~ . - B))
R> coef(update(mod, . ~ . - B))
(Intercept) A
-0.8156 0.5919
Hence the first ., to the left of ~ expands to "keep the existing response variable y", whilst the second ., to the right of ~ expands to A + B and hence we have A + B - B which cancels to A.
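The same dot semantics can be tried on a bare formula, without fitting anything. A small sketch, assuming the formula y ~ A + B from above (the names f_drop, f_add and f_log are made up for the example):

```r
old <- y ~ A + B

# "." on each side stands for what was already on that side of `old`
f_drop <- update(old, . ~ . - B)     # drop a term
f_add  <- update(old, . ~ . + A:B)   # add an interaction
f_log  <- update(old, log(.) ~ .)    # transform the response, keep the right-hand side

print(f_drop)
print(f_add)
print(f_log)
```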

Related

How to correctly pass formulas associated with a variable name with random effects into fitted regression models in `R`?

I currently have a problem in that I have to pre-specify my formulas before sending them into a regression function. For example, using the stan_gamm4 function in R, we have the following example:
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and the random formula were specified explicitly, if we call:
br$call$random
> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us then leave everything the same, BUT use an expression for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we do the same thing as before, but with formula.rand taking its place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW: we have that:
br$call$random
> formula.rand
instead of the original. A lot of Bayesian packages rely on accessing br$call$random, so is there a way to use a variable for the formula, pass it in, AND retain the original formula when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
# Build the call with the formula itself in place of the symbol, then evaluate it
stan_call <- substitute(stan_gamm4(y ~ s(x0) + x1 + s(x2), data=dat, random=.rf,
                                   chains=1, iter=200),
                        list(.rf=formula.rand))
br <- eval(stan_call)
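A possible alternative to substitute(), not taken from the answer above, is do.call(), which also builds the call from already-evaluated arguments. The sketch below uses lm as a stand-in because it runs quickly; model3 is a hypothetical name, and whether the same pattern suits stan_gamm4 is an assumption that would need checking:

```r
model3 <- function(formula) {
  # The formula is passed by value, so the stored call contains the formula itself;
  # quote(mtcars) keeps the data argument as a symbol instead of inlining the data
  do.call("lm", list(formula = formula, data = quote(mtcars)))
}

m3 <- model3(mpg ~ disp)
print(m3$call)
```

The printed call should show the formula mpg ~ disp and the symbol mtcars.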
If I understand correctly, your problem is not that stan_gamm4 computes incorrect results (which is not the case, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So, it is not as if $call is just some list with stuff in it. In order to access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
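The behaviour described in the last two answers can be reproduced without Stan. This is a sketch using lm as a stand-in (an assumption made so the example runs quickly); form plays the role of formula.rand:

```r
form <- mpg ~ disp
m <- lm(form, data = mtcars)

# The call stores the symbol `form`, not the formula it points to
sym <- m$call$formula
print(class(sym))

# Evaluating the symbol recovers the formula, as long as `form` is still in scope
stored <- eval(m$call$formula)

# Overwriting the slot after the fact, as suggested above
m$call$formula <- form
print(m$call)
```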

passing model parameters to R's predict() function robustly

I am trying to use R to fit a linear model and make predictions. My model includes some constant side parameters that are not in the data frame. Here's a simplified version of what I'm doing:
dat <- data.frame(x=1:5,y=3*(1:5))
b <- 1
mdl <- lm(y~I(b*x),data=dat)
Unfortunately the model object now suffers from a dangerous scoping issue: lm() does not save b as part of mdl, so when predict() is called, it has to reach back into the environment where b was defined. Thus, if subsequent code changes the value of b, the predict value will change too:
y1 <- predict(mdl,newdata=data.frame(x=3)) # y1 == 9
b <- 5
y2 <- predict(mdl,newdata=data.frame(x=3)) # y2 == 45
How can I force predict() to use the original b value instead of the changed one? Alternatively, is there some way to control where predict() looks for the variable, so I can ensure it gets the desired value? In practice I cannot include b as part of the newdata data frame, because in my application, b is a vector of parameters that does not have the same size as the data frame of new observations.
Please note that I have greatly simplified this relative to my actual use case, so I need a robust general solution and not just ad-hoc hacking.
We can use eval(substitute(...)) to substitute the value of b into the quoted expression:
mdl <- eval(substitute(lm(y~I(b*x),data=dat), list(b=b)))
mdl
# Call:
# lm(formula = y ~ I(1 * x), data = dat)
# ...
We could also use bquote
mdl <- eval(bquote(lm(y~I(.(b)*x), data=dat)))
mdl
#Call:
#lm(formula = y ~ I(1 * x), data = dat)
#Coefficients:
#(Intercept) I(1 * x)
# 9.533e-15 3.000e+00
According to the description in ?bquote:
‘bquote’ quotes its argument except that terms wrapped in ‘.()’ are evaluated in the specified ‘where’ environment.
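A quick check that inlining the value really decouples predict() from later changes to b. The second model is an alternative that is not from the answers above: it creates the formula inside a function (fit_with_b is a hypothetical helper), so the formula's environment keeps its own copy of b. This may suit the case where b is a vector of parameters:

```r
dat <- data.frame(x = 1:5, y = 3 * (1:5))

# Approach from the answers: inline the current value of b into the call
b <- 1
mdl_inline <- eval(bquote(lm(y ~ I(.(b) * x), data = dat)))

# Alternative sketch: the formula is created inside the function, so predict()
# later looks up b in this function's environment, not in the global one
fit_with_b <- function(b, data) {
  force(b)  # evaluate the argument now instead of lazily
  lm(y ~ I(b * x), data = data)
}
mdl_closure <- fit_with_b(1, dat)

b <- 5  # changing the global b afterwards should not matter
p_inline  <- predict(mdl_inline,  newdata = data.frame(x = 3))
p_closure <- predict(mdl_closure, newdata = data.frame(x = 3))
```

Both predictions should stay at 9 (up to floating-point error) even though b is now 5.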

Subsetting in dredge (MuMIn) - must include interaction if main effects are present

I'm doing some exploratory work where I use dredge{MuMIn}. In this procedure there are two variables that I want to set to be allowed together ONLY when the interaction between them is present, i.e. they can not be present together only as main effects.
Using sample data: I want to dredge the model fm1 (disregarding that it probably doesn't make sense). If the variables GNP and Population appear together, they must also include the interaction between them.
require(stats); require(graphics)
## give the data set in the form it is used in S-PLUS:
longley.x <- data.matrix(longley[, 1:6])
longley.y <- longley[, "Employed"]
pairs(longley, main = "longley data")
names(longley)
fm1 <- lm(Employed ~GNP*Population*Armed.Forces, data = longley)
summary(fm1)
dredge(fm1, subset=!((GNP:Population) & !(GNP + Population)))
dredge(fm1, subset=!((GNP:Population) && !(GNP + Population)))
dredge(fm1, subset=dc(GNP+Population,GNP:Population))
dredge(fm1, subset=dc(GNP+Population,GNP*Population))
How can I specify in dredge() that it should disregard all models where GNP and Population are present, but not the interaction between them?
If I understand correctly, you want to model the two main effects (say, a and b) only together with their interaction (a:b). So how about: subset = !a | (xor(a, b) | `a:b`) (note that a:b must be enclosed in backticks (`), not straight quotes), e.g.:
library(MuMIn)
data(Cement)
fm <- lm(y ~ X1 * X2, Cement, na.action = na.fail)
dredge(fm, subset = !X2 | (xor(X1, X2) | `X1:X2`))
or wrap this condition into a function to make the code clearer:
test <- function(a, b, c) !a | (xor(a, b) | c)
dredge(fm, subset = test(X1, X2, `X1:X2`))
This produces the models: null, X1, X2, X1*X2 (and excludes X1 + X2).
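The logic of the condition can be checked in base R without running dredge(). This is only a sketch of the truth table: the column int stands for the X1:X2 interaction, and the filter assumes an interaction can only appear when both main effects are present:

```r
test <- function(a, b, c) !a | (xor(a, b) | c)

# All combinations of X1, X2 and their interaction being in the model
cand <- expand.grid(X1 = c(FALSE, TRUE), X2 = c(FALSE, TRUE), int = c(FALSE, TRUE))

# Keep only combinations that respect marginality (interaction needs both main effects)
cand <- cand[!cand$int | (cand$X1 & cand$X2), ]

# TRUE means the model passes the subset condition
cand$kept <- test(cand$X1, cand$X2, cand$int)
print(cand)
```

Only the row with both main effects and no interaction should come out as FALSE.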

How to force a regression through the origin R

I am using R to do some multiple regression. I know that if you input, for instance,
reg <- lm(y ~ 0 + x1 + x2, data) you will force the regression model through the origin.
My problem is that I have a lot of independent variables (+/- 100) and R does not seem to read all of them if I input it this way:
lm(y~ 0 + x1 + x2 + ... + x100, data)
The code I use is as follows:
[1] data <- read.csv("Test.csv")
[2] reg <- lm(data)
[3] summary(reg)
What do I need to put in line 2 so that I can force the model through the origin?
reg <- lm(0 + data) does not work.
Put your variables in a dataframe and use .:
lm(y ~ 0 + ., data)
See documentation:
There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’: see terms.formula. In the context of update.formula, only, it means ‘what was previously in this part of the formula’.
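A small self-contained sketch of this, using simulated data in place of Test.csv (the column names y, x1, x2, x3 and the seed are assumptions made for the example). It also shows the equivalent - 1 spelling:

```r
set.seed(1)  # arbitrary seed for reproducibility
data <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))

# "0 +" removes the intercept; "." stands for every column except y
reg <- lm(y ~ 0 + ., data = data)
print(names(coef(reg)))

# Equivalent way of writing the same model
reg2 <- lm(y ~ . - 1, data = data)
```

Neither fit should list an (Intercept) coefficient. This assumes the response column is called y; with a different name, put that name on the left of ~.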

Regression Summaries in R

I've been using the glm function to do regression analysis, and it's treating me quite well. I'm wondering though, some of the things I want to regress involve a large amount of regression factors. I have two main questions:
Is it possible to give a text vector for the regressors?
Can the p-value portion of summary(glm) be sorted at all? Preferably by the p-values of each regressor.
Ex.
A # sample data frame
names(A)
[1] Dog Cat Human Limbs Tail Height Weight Teeth.Count
a = names(A)[4:7]
glm( Dog ~ a, data = A, family = "binomial")
For your first question, see as.formula. Basically you want to do the following:
x <- names(A)[4:7]
regressors <- paste(x,collapse=" + ")
form <- as.formula(paste("Dog ~", regressors))  # paste into a single string before converting
glm(form, data = A, family = "binomial")
If you want interaction terms in your model, you need to make the structure somewhat more complex by using different collapse= arguments. That argument specifies which symbols are placed between the elements of your vector. For instance, if you specify "*" in the code above, you will have a saturated model with all possible interactions. If you just need some interactions, but not all, you will want to create the part of the formula containing all interactions first (using "*" as collapse argument), and then add the remaining terms in the separate paste function (using "+" as collapse argument). All in all, you want to create a character string that is identical to your formula, and then convert it to the formula class.
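As a sketch of the "some interactions, but not all" case described above: the column names come from the question, but which two of them interact is an arbitrary choice made for illustration:

```r
x <- c("Limbs", "Tail", "Height", "Weight")

# Part of the formula with all interactions among the chosen variables
interacting <- paste(x[1:2], collapse = " * ")
# Remaining terms enter additively
additive <- paste(x[3:4], collapse = " + ")

# Build one string that looks exactly like the formula, then convert it
form <- as.formula(paste("Dog ~", interacting, "+", additive))
print(form)

term_labels <- attr(terms(form), "term.labels")
print(term_labels)
```

The term labels should contain the four main effects plus Limbs:Tail.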
For your second question, you need to convert the output of summary to a data structure that can be sorted. For instance, a data frame. Let's say that the name of your glm model is model:
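library(plyr)
coef <- summary(model)$coefficients  # extract by name rather than by position
coef.sort <- as.data.frame(coef)
names(coef.sort) <- c("Estimate","SE","Tval","Pval")
coef.sort$Term <- rownames(coef.sort)  # arrange() drops row names, so keep the term labels in a column
arrange(coef.sort,Pval)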
Assign the result of arrange() to a variable, and continue with it as you like.
An example data frame:
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
a <- names(A)[2:3]
Firstly, you can use the character vector a to create a model formula with reformulate:
# glm(Dog ~ a, data = A, family = "binomial") fails: a is a character vector, not a model term
form <- reformulate(a, "Dog")
# Dog ~ b + c
model <- glm(form, data = A, family = "binomial")
Secondly, this is a way to sort the model summary by the p-values:
modcoef <- summary(model)[["coefficients"]]
modcoef[order(modcoef[ , 4]), ]
# Estimate Std. Error z value Pr(>|z|)
# b 0.23902684 0.2212345 1.0804232 0.2799538
# (Intercept) 0.20855908 0.2025642 1.0295951 0.3032001
# c -0.09287769 0.2191231 -0.4238608 0.6716673
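Both steps can be wrapped into a small base-R helper. This is only a sketch: sorted_coefs is a hypothetical name, and it assumes the p-values sit in the last column of the coefficient table, as they do for glm and lm summaries:

```r
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
model <- glm(reformulate(c("b", "c"), "Dog"), data = A, family = "binomial")

sorted_coefs <- function(model) {
  tab <- as.data.frame(summary(model)$coefficients)
  # Keep the term names as a column so they survive later reshaping
  tab <- cbind(term = rownames(tab), tab)
  # Order by the last column, which holds the p-values
  tab <- tab[order(tab[[ncol(tab)]]), ]
  rownames(tab) <- NULL
  tab
}

res <- sorted_coefs(model)
print(res)
```

The exact numbers depend on the R version's random number generator, so they may differ from the output shown above.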
