Formula quadratic term as interaction of the variable with itself [duplicate]

Duplicate of: R: How can I use R's formula notation to compactly produce all but a selected subset of quadratic terms?
y ~ x + x:x seems to be equivalent to just y ~ x, while y ~ x + I(x^2) correctly includes the quadratic term in the model.
Why can't I write quadratic terms as an interaction of a variable with itself?

Great question. A start at an answer is that while :, in practice, typically multiplies the numeric columns associated with the variable (e.g. if x and y are both numeric, x:y creates an interaction column that is the product of x and y), the root meaning of : in R's formula syntax is not "multiply columns" but "form an interaction". The interaction of a variable with itself is just itself.
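A quick way to see this in action (a minimal check using model.matrix() on a toy vector):
x <- 1:5
colnames(model.matrix(~ x + x:x))     # "(Intercept)" "x"          -- x:x collapses to x
colnames(model.matrix(~ x + I(x^2)))  # "(Intercept)" "x" "I(x^2)" -- quadratic column kept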
I would love to have a complete formal description of R's version of Wilkinson-Rogers syntax (which is what this is), but I don't know that one exists. The original framing of the formula language is in Wilkinson and Rogers (1973) [where . rather than : was used for the "interaction" operator]; I believe there's a description in the "White Book" (Chambers and Hastie 1992); but other than that, I think the only full definition is the source code of model.matrix() itself (which is not all that nice to look at ...)
Chambers, J. M., and T. Hastie, eds. Statistical Models in S. Wadsworth & Brooks/Cole, 1992.
Wilkinson, G. N., and C. E. Rogers. “Symbolic Description of Factorial Models for Analysis of Variance.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 22, no. 3 (1973): 392–99. https://doi.org/10.2307/2346786.

To easily formulate polynomials, we can use poly.
lm(Petal.Length ~ Sepal.Length + I(Sepal.Length^2) + I(Sepal.Length^3), iris)$coef
#       (Intercept)      Sepal.Length I(Sepal.Length^2) I(Sepal.Length^3) 
#        19.8028068       -13.5808046         2.8767502        -0.1742277 
lm(Petal.Length ~ poly(Sepal.Length, 3, raw=TRUE), iris)$coef
#                        (Intercept) poly(Sepal.Length, 3, raw = TRUE)1 
#                         19.8028068                        -13.5808046 
# poly(Sepal.Length, 3, raw = TRUE)2 poly(Sepal.Length, 3, raw = TRUE)3 
#                          2.8767502                         -0.1742277 
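Note that without raw=TRUE, poly() uses orthogonal polynomials: the individual coefficients differ, but the fitted model is the same. A quick check:
fit_raw  <- lm(Petal.Length ~ poly(Sepal.Length, 3, raw=TRUE), iris)
fit_orth <- lm(Petal.Length ~ poly(Sepal.Length, 3), iris)
all.equal(fitted(fit_raw), fitted(fit_orth))
# [1] TRUE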

Related

Do lm object coefficients always list intercept first?

In coef(l), where l is an object of class "lm", is (Intercept) always listed first?
R's source code for lm() is not so straightforward. lm() appears to call lm.fit(), which gets coefficients by calling a C function with .Call(C_Cdqrls, x, y, tol, FALSE), which ultimately calls a least-squares fitting routine written in Fortran. I'm not familiar enough with R's internals, or with actual least-squares code, to answer my question.
No, only when you have an intercept. The intercept is implicit in a formula, but you can specify a model without one using - 1 or 0 +:
x <- rnorm(20)
y <- rnorm(20, 10)
coef(lm(y ~ x + I(x^2)))
# (Intercept)           x      I(x^2) 
#  10.3035412  -0.1506304  -0.3092836 
coef(lm(y ~ I(x^3) + x - 1))
#     I(x^3)          x 
# -0.5094851 -0.6598634 
The coefficients are listed in the order they appear in the formula; if there is an intercept, it comes first. But as in many other situations in R, if you need the value of a specific component (the intercept or any other), it is good practice to access it by name. That returns NA if the object doesn't have it:
intercept <- coef(model)["(Intercept)"]
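For example, a minimal sketch reusing x and y from above:
m0 <- lm(y ~ 0 + x)       # no intercept
coef(m0)["(Intercept)"]   # NA, since the component is absent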

I(variable_1/variable_2) in R regression

R> data("FoodExpenditure", package = "betareg")
R> fe_lm <- lm(I(food/income) ~ income + persons, data = FoodExpenditure)
From what I understand, I(food/income) creates a new variable which is the ratio of food to income. Is that correct? Are there any other combinations (functions) possible?
Observe that these two results are the same:
# transformation in formula
lm(I(food/income) ~ income + persons, data = FoodExpenditure)
# Call:
# lm(formula = I(food/income) ~ income + persons, data = FoodExpenditure)
#
# Coefficients:
# (Intercept)       income      persons  
#    0.341740    -0.002469     0.025767  
# transformation in data
dd <- transform(FoodExpenditure, ratio = food/income)
lm(ratio ~ income + persons, data = dd)
# Call:
# lm(formula = ratio ~ income + persons, data = dd)
#
# Coefficients:
# (Intercept)       income      persons  
#    0.341740    -0.002469     0.025767  
The I() function in a formula with lm() lets you apply any function of the variables that you like. (Just make sure the function doesn't change the number of rows, otherwise you can't fit the model properly.)
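For instance, these are all legal with the same FoodExpenditure data (the transformations are only illustrative, not a recommended model):
lm(I(food/income) ~ log(income) + I(persons^2), data = FoodExpenditure)
lm(log(food) ~ income + sqrt(persons), data = FoodExpenditure)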
Yes and Yes.
Other possible combinations and operators are given in the documentation for formula (see ?formula). The list below is mostly taken from it.
: denotes interactions between terms.
* denotes factor crossing: ‘a*b’ is interpreted as ‘a + b + a:b’.
^ indicates crossing to the specified degree. For example, ‘(a+b+c)^2’ is identical to ‘(a+b+c)*(a+b+c)’, which in turn expands to a formula containing the main effects for ‘a’, ‘b’ and ‘c’ together with their second-order interactions.
%in% indicates that the terms on its left are nested within those on the right. For example, ‘a + b %in% a’ expands to the formula ‘a + a:b’.
- removes the specified terms, so that ‘(a+b+c)^2 - a:b’ is identical to ‘a + b + c + b:c + a:c’. It can also be used to remove the intercept term: when fitting a linear model, ‘y ~ x - 1’ specifies a line through the origin. A model with no intercept can also be specified as ‘y ~ x + 0’ or ‘y ~ 0 + x’.
Arithmetic expressions: while formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula ‘log(y) ~ a + log(x)’ is quite legal.
I(): to avoid confusion between the formula and arithmetic meanings of these operators, the function ‘I()’ can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula ‘y ~ a + I(b+c)’, the term ‘b+c’ is interpreted as the sum of ‘b’ and ‘c’.
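You can inspect these expansions directly from the terms of a formula; no data is needed for this symbolic check:
attr(terms(y ~ (a + b + c)^2 - a:b), "term.labels")
# [1] "a"   "b"   "c"   "a:c" "b:c"
attr(terms(y ~ a + b %in% a), "term.labels")
# [1] "a"   "a:b"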

Interpreting linear model formula in limma package

Here is my code for building a design matrix for a linear modelling function:
f <- factor(targets$Sample.Name, levels = unique(targets$Sample.Name))
design <- model.matrix(~0 + f)
colnames(design) <- levels(f)
I am not sure how to interpret the formula "~0". I looked up ?lm and found that if a formula has an implied intercept term, one can remove this using either y ~ x - 1 or y ~ 0 + x, but I am not sure if this is the same case here.
Yes, it's the same case here. You can examine the output of design to check.
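A small illustration with a toy factor (not the targets data):
f <- factor(c("a", "a", "b", "b"))
colnames(model.matrix(~ 0 + f))  # "fa" "fb"          -- one indicator per level, no intercept
colnames(model.matrix(~ f))      # "(Intercept)" "fb" -- intercept plus a treatment contrast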

R: multivariate orthogonal regression without having to write the variable names explicitly

I have a dataframe train (21 predictors, 1 response, 1012 observations), and I suspect that the response is a nonlinear function of the predictors. Thus, I would like to perform a multivariate polynomial regression of the response on all the predictors, and then try to understand which are the most important terms. To avoid the collinearity problems of standard multivariate polynomial regression, I'd like to use multivariate orthogonal polynomials with polym(). However, I have quite a lot of predictors, and their names do not follow a simple rule. For example, in train I have predictors named X2, X3 and X5, but not X1 and X4. The response is X14. Is there a way to write the formula in lm without having to write the names of all the predictors explicitly? Writing
OrthoModel=lm(X14~polym(.,2),data=train)
returns the error
Error in polym(., 2) : object '.' not found
EDIT: the model I wanted to fit contains about 3.5 billion terms, so it's useless. It's better to fit a model with only main effects, interactions and second-degree terms, about 231 terms. I wrote the formula for a standard (non-orthogonal) second-degree polynomial:
as.formula(paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", collapse=""))
where Xtrain is obtained from train by deleting the response column X14. However, when I try to express the polynomial in an orthogonal basis, I get a parse error:
as.formula(
  paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", "+",
        paste("poly(", paste0(names(Xtrain), ", degree=2)", collapse="+"),
              collapse="")
  )
)
There are a couple of problems with that approach, one of which you already see. Even if the dot could be expanded within polym, you would still have faced an error when it came time for the 2 to be evaluated, because degree comes after the "dots" in polym's argument list, and it therefore must be supplied as a named parameter rather than positionally.
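A small sketch of the naming requirement: because degree follows the dots in polym(..., degree = 1, ...), an unnamed 2 would be picked up by the dots as one more variable.
polym(1:5, 6:10, degree = 2)  # works: two variables, full quadratic basis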
An approach using as.formula succeeds with the 'Orthodont' dataframe in pkg:nlme (although using 'Sex' as the dependent variable is statistically nonsense). I took the "Subject" column out of the data and also took "Sex" out of the names passed to paste:
data(Orthodont, package="nlme")
lm(as.formula(paste("Sex~polym(",
                    paste(names(Orthodont[-(3:4)]), collapse=","),
                    ",degree=2)")),
   data = Orthodont[-3])

Call:
lm(formula = as.formula(paste("Sex~polym(", paste(names(Orthodont[-(3:4)]),
    collapse = ","), ",degree=2)")), data = Orthodont[-3])

Coefficients:
                        (Intercept)  polym(distance, age, degree = 2)1.0  
                             1.4433                              -2.5849  
polym(distance, age, degree = 2)2.0  polym(distance, age, degree = 2)0.1  
                             0.4651                               1.3353  
polym(distance, age, degree = 2)1.1  polym(distance, age, degree = 2)0.2  
                            -7.6514  
Formula objects can be created from text input with as.formula. This is essentially an application of the last example in ?as.formula.
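Applied back to the original question, the same recipe would look something like this (a sketch; Xtrain is the training frame without the response column X14, as in the question's edit):
form <- as.formula(paste("X14 ~ polym(",
                         paste(names(Xtrain), collapse = ", "),
                         ", degree = 2)"))
OrthoModel <- lm(form, data = train)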

What does the R formula y~1 mean?

I was reading the documentation on R formulas, trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, sample formulas tend to be something like y ~ 1.
For a simple case like y ~ x, the formula defines a relationship between input x and output y, so I get that it is similar to y = a*x + b, where a is the slope and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context, if you look at the depmixS4 documentation, there is one example below:
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 'one' symbol indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common mathematical meaning of model terms, you wrap them in I().) Intercepts are usually assumed, so it is most common to see 1 in the context of explicitly stating a model without an intercept.
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
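A quick check that the implicit and explicit intercept forms agree:
x <- rnorm(20); y <- rnorm(20)
all.equal(coef(lm(y ~ x)), coef(lm(y ~ 1 + x)))
# [1] TRUE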
In the specific case you mention (y ~ 1), y is predicted by no other variable, so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city)
> r <- lm(x ~ 1, data = city)
> r

Call:
lm(formula = x ~ 1, data = city)

Coefficients:
(Intercept)  
       97.3  

> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data = city)
> r

Call:
lm(formula = x ~ -1, data = city)

No coefficients
formula() is a function for extracting formulas from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit + 1 in the above formula. So really, the formula above is y ~ 1 + x1 + x2.
We could have a very simple formula in which y does not depend on any other variable. This is the formula you are referencing:
y ~ 1, which roughly equates to
y = β0(1) = β0
As @Paul points out, when you solve this simple model, you get β0 = mean(y).
Here is an example:
# Let's make a small sample data frame
dat <- data.frame(y = (-2):3, x = 3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data = dat)
# Compare the coefficient of the model to mean(y)
simpleModel$coef
# (Intercept) 
#         0.5 
mean(dat$y)
# [1] 0.5
In general, such a formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side contains the dependent variables, the right-hand side the independent ones. The independent variables are used to calculate the trend component of the linear model; the residuals are then assumed to have some kind of distribution. When the independent part is just one (~ 1), the trend component is a single value, e.g. the mean value of the data, i.e. the linear model only has an intercept.
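To make that concrete (a small sketch reusing dat from the example above), with ~ 1 the trend component is a single number and the residuals are the deviations from it:
m <- lm(y ~ 1, data = dat)
all.equal(unname(resid(m)), dat$y - mean(dat$y))
# [1] TRUE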
