I have a set of a couple of dozen numeric variables and am trying to figure out how to compactly express a quadratic form in those variables. I also want to include the variables themselves. The idea here is that we are fitting a response surface, rather than interacting a group of treatments, as the standard R formula notation seems to assume. I am trying to get appropriate expressions turned into an R formula, suitable for estimation by different techniques, with different data sets, or over different periods.
If there is an explicit statement of how R's formula notation works, anywhere, I have not been able to find it. There is an ancient paper from which R supposedly copied the notation, but it is by no means identical to current R usage. Every other description I have found just gives examples that do not cover every case -- not even close to every case.
So, just as an example, here I try to construct a quadratic form in three variables, without writing out all the pairs by hand with an I() around each pair.
library(tidyverse)
A <- B <- C <- 1:10
LHS <- 1:10 * 600
tb <- tibble(LHS, A, B, C)
my_eq <- as.formula(LHS ~ I(A + B + C)*I(A + B + C))
I have not found any way to tell whether I have succeeded. Neither printing my_eq nor terms(my_eq) seems at all enlightening.
For example, can one predict whether
identical(as.formula(LHS ~ I(A + B + C)*I(A + B + C)), as.formula(LHS ~ I((A + B + C)*(A + B + C))))
is true or false? I cannot even guess. Or, to take an even simpler case, is ~ A * I(A) equal to ~ A, to ~ I(A^2), or to something else? And how would you know?
To restate my question, I would like either a full statement of how R's formula notation works, adequate to cover every case and predict what each would mean, or, failing that, a straightforward way of producing an expansion of any existing formula into all the atomic terms for which coefficients will be estimated.
This may not answer your question, but I'll post this anyway since I think it may help a little.
The I function inhibits the interpretation of operators such as "+", so your formula is probably not going to do what you expect it to do. For example, the results of lm(my_eq) will be the same as the results of doing the following:
D <- A + B + C
lm(LHS ~ D * D)
And then you may as well just do lm(LHS ~ D).
For your question, I believe John Maindonald wrote a good book that explains R formulas for many situations. But it's in my office and today is a Sunday.
Edit: For the expansion, I believe you have to fit the model and then look at the call or the terms:
> my_eq <- as.formula(LHS ~ (A + B + C) * (A + B + C))
> my_formula <- lm(my_eq)
> attr(terms(my_formula), "term.labels")
[1] "A" "B" "C" "A:B" "A:C" "B:C"
Question regarding the syntax of a mixed-effects model in R.
I have run the following code to examine the simple slope to determine the effect of one of my variables (variability) within another one of my variables (ambiguity):
lmer.E1.v2 <- lmer(logRT ~ Variability.c / Ambiguity.c + (Variability.c + Ambiguity.c|ID),
data=data %>% filter(Experiment == "E1"),
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(lmer.E1.v2)
When I reverse these two variables, so that the code looks like this:
lmer.E1.v2 <- lmer(logRT ~ Ambiguity.c / Variability.c + (Ambiguity.c + Variability.c|ID),
data=data %>% filter(Experiment == "E1"),
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(lmer.E1.v2)
... and I get different output from the first block of code than from the second. What is the difference in interpretation when reversing the order of the two variables in the syntax?
The primary issue is that the / operator is not commutative (i.e. a/b != b/a): a/b expands to a + a:b, while b/a expands to b + a:b. You should get the same overall fit (predictions, likelihood, etc.), at least up to some degree of numeric fuzz, but the model parameterization will be different.
There do exist cases where (a+b|g) gives different answers from (b+a|g) (see here), but this is unusual.
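If it helps, the asymmetry is easy to see by expanding each side with terms() (variable names here are just placeholders):
attr(terms(y ~ a/b), "term.labels") ## main effect of a, plus the a:b interaction
attr(terms(y ~ b/a), "term.labels") ## main effect of b, plus the same interaction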
I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spit them into a character-string formula, e.g. formula = y~x1+x2+x3+... However, there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now, I could easily add this variable to the formula manually after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My thought for achieving this so far was to nest the grep() function inside an ifelse() statement, e.g. ifelse(grep("variable I'm concerned with",formula)!=1, formula=formula, formula=paste0(formula,'variable I'm concerned with',collapse="+")), but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?
According to the glmnet documentation:
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
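For example, something along these lines (a sketch; x, y, and the column name "keep_me" are stand-ins for your own predictor matrix, response, and forced-in variable):
library(glmnet)
## penalty factor 0 = that column is never penalized, so it always stays in the model
pf <- rep(1, ncol(x))
pf[colnames(x) == "keep_me"] <- 0
cv_fit <- cv.glmnet(x, y, penalty.factor = pf)
coef(cv_fit, s = "lambda.min")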
A formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are a really powerful way of doing variable subsetting and selection. But you really need to explore them, because the formula language was not really meant to be automated easily; rather, it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 + x2
t <- terms(f)
## drop 'x2'
i.x2 <- match('x2', attr(t, 'term.labels'))
t <- drop.terms(t, i.x2, keep.response = TRUE) ## drop the term, keeping the response
## t is still a "terms" object, but `lm` and related functions have implicit methods for interpreting it as a "formula" object (assuming y and x1 are available in your data or workspace)
lm(t)
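Alternatively, if what you have is just a character vector of selected variable names, reformulate() is a clean way to build the formula and force your variable in (the names below are made up for illustration):
selected <- c("x1", "x3")   ## what LASSO kept, say
must_have <- "x2"           ## the variable to force in
f <- reformulate(union(selected, must_have), response = "y")
f
## y ~ x1 + x3 + x2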
Currently, you are attempting to adjust the character value of a formula as if it were a formula object, which will not work given the different types. Instead, consider stats::update, which will not duplicate a term that is already included:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
Should formula be a character string, be sure to use grepl (not grep) inside ifelse. And do not assign with = inside ifelse: it is a function that returns a value and is not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"
I'm running an LMEM (linear mixed-effects model) on some data and comparing the models (in pairs) with the anova function. However, on a particular subset of data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
                  (1 + cond | subj) + (1 + psource + cond | object),
                data, REML=FALSE)
And this is the model I'm comparing it to (basically dropping one of the main effects):
m3_psource <- lmer(totfix ~ psource + cond + psource:cond - psource +
                     (1 + cond | subj) + (1 + psource + cond | object),
                   data, REML=FALSE)
Running anova(m3_full, m3_psource) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine; it's just this particular response variable that gives me the weird chi-square and probability values. Does anyone have an idea why, and how I can fix it? Any help will be much appreciated!
This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of parameters (model-matrix columns) either way.
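To back up the claim about identical overall fits, here is a minimal check with a made-up response (the replication and seed are arbitrary):
dd2 <- expand.grid(psource = c("A","B"), cond = c("a","b"), rep = 1:5)
set.seed(101)
dd2$y <- rnorm(nrow(dd2))
f1 <- lm(y ~ psource + cond + psource:cond, data = dd2)
f2 <- lm(y ~ psource + cond + psource:cond - psource, data = dd2)
all.equal(fitted(f1), fitted(f2)) ## TRUE: same predictions
logLik(f1); logLik(f2)            ## identical log-likelihoods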
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g. in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R (and of S before it) were from school of thought #1 and figured that, generally, when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.
The way I am currently using neuralnet is that it predicts one output point from many input points. More specifically, I run the code below.
nn <- neuralnet(
  as.formula(a ~ c + d),
  data=Z, hidden=c(3,2), err.fct="sse", act.fct=custom,
  linear.output=TRUE, rep=5)
Here, if Z is a matrix with columns named a, c, and d, it will predict a point in each row of column a from the corresponding points in columns c and d. (The vertical dimension is used as samples for training.)
Suppose there's also a column b. I am wondering if there's a way to predict both, a and b, from c and d? I've tried
as.formula(a+b ~ c+d)
but that does not appear to work.
Any ideas?
My bad, it works nicely using a + b ~ c + d. I thought the function did not accept this input (as it crashed many times), but there must have been another problem, which is gone now that I cleaned it all up.
nn <- neuralnet(as.formula(a + b ~ c + d),
                data=Z, hidden=c(3,2), err.fct="sse", act.fct=custom,
                linear.output=TRUE, rep=5)
Works beautifully and returns two-point (i.e. two-column) output! Neat.
Examples from the neuralnet documentation show that the format works :)
AND <- c(rep(0,7),1)
OR <- c(0,rep(1,7))
binary.data <- data.frame(expand.grid(c(0,1), c(0,1), c(0,1)), AND, OR)
print(net <- neuralnet(AND+OR~Var1+Var2+Var3, binary.data, hidden=0,
rep=10, err.fct="ce", linear.output=FALSE))
To my knowledge, there are three possible ways to code for second-order (and higher-order) terms in a formula:
we can use the function I(), use the function poly(), or construct the second-degree variable ourselves. My question is: how do these approaches work?
set.seed(23)
A = rnorm(12)
B = 1:12
C = factor(rep(c(1,2,3),4))
B2=B^2
What is the equivalent of lm(A ~ poly(B,2)*C) when using I() or when using the variable B2?
The use of raw=TRUE in poly() does not change anything in the results, correct?
lm(A~B2*C)
or
lm(A~I(B^2)*C)
give you the result of squaring column B and then doing the regression. Using
poly(B,2)
does something completely different - see ?poly.
Edit to add:
poly() calculates orthogonal polynomials, which are not the same as the raw polynomials obtained by simply squaring, cubing, etc. a number.
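You can see what poly() actually returns by inspecting its columns directly; a small sketch:
B <- 1:12
P <- poly(B, 2)               ## orthogonal polynomials of degree 1 and 2
round(colSums(P), 10)         ## each column is centred (sums to zero)
round(crossprod(P), 10)       ## and the columns are orthonormal
head(poly(B, 2, raw = TRUE))  ## raw = TRUE gives the plain powers B and B^2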
Does it mean that poly(B,2,raw=T) is equivalent to I(B^2) or to B+I(B^2)?
Try:
x = 0:99
df = data.frame(x=x,y=rnorm(100)+0.1*x + 0.04*x*x)
lm(y~poly(x,2),data=df)
lm(y~poly(x,2,raw=TRUE),data=df)
lm(y~x+I(x^2),data=df)
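If I have the details right, the last two fits should return the same coefficients (only the names differ), i.e. poly(x, 2, raw = TRUE) corresponds to x + I(x^2) rather than to I(x^2) alone, while the orthogonal-polynomial fit in the first call gives different coefficients but the same fitted values. So, to answer the follow-up, poly(B, 2, raw = TRUE) is equivalent to B + I(B^2).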