Naturally, I had little success searching Google for "R I", "I in R", and "R language I".
The R help says "Change the class of an object to indicate that it should be treated ‘as is’.".
My O'Reilly R book doesn't have an entry for it in its index.
My Cambridge book essentially says, about "I(logdist^2)": "ensures that is taken as the square, rather than as an interaction of logdist".
Can someone explain the "interaction" comment? Can someone explain why "logdist^2" wouldn't be interpreted in the traditional way?
From p89 of R in a Nutshell
Caret(^) [is] Used to indicate crossing to a specific degree. For example:
y~(u+w)^2
is equivalent to
y~(u+w)*(u+w)
Identity function (I()) Used to indicate that the enclosed expression should be interpreted by its arithmetic meaning. For example
a+b
means that both a and b should be included in the formula. The formula:
I(a+b)
means that "a plus b" should be included in the formula. See also ?AsIs
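To check the claim about the caret for yourself, you can ask terms() how it expands a formula. This is only a small sketch; the variable names u and w are taken from the book excerpt and do not need to exist as data for terms() to work.

```r
# Expand both formulas symbolically and compare the resulting term labels.
crossed <- attr(terms(y ~ (u + w)^2), "term.labels")
starred <- attr(terms(y ~ (u + w) * (u + w)), "term.labels")

print(crossed)                      # main effects plus the u:w interaction
print(identical(crossed, starred))  # both spellings expand to the same terms
```

Neither formula produces a squared term; the caret only controls the degree of crossing.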
I think your confusion is that I is very rarely used as a standalone operator. As the help page states, it's most often used to stop operator characters (^, +, *, etc.) from being interpreted as they are in a formula. As user2633645's answer says, these characters have specific meanings in a formula. Quoting from the help page for stats::formula,
The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
While formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula log(y) ~ a + log(x) is quite legal. When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use.
To avoid this confusion, the function I() can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula y ~ a + I(b+c), the term b+c is to be interpreted as the sum of b and c.
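As a rough illustration of that last paragraph, here is a sketch with simulated data (the data frame d and its columns are made up for the example). Without I(), b and c each get their own coefficient; with I(b + c), their sum enters the model as a single predictor.

```r
# Simulated data; column names a, b, c, y are hypothetical.
set.seed(1)
d <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
d$y <- 1 + d$a + 2 * (d$b + d$c) + rnorm(20)

symbolic   <- lm(y ~ a + b + c, data = d)     # "+" adds terms to the model
arithmetic <- lm(y ~ a + I(b + c), data = d)  # "+" inside I() is ordinary addition

print(names(coef(symbolic)))    # one coefficient each for a, b and c
print(names(coef(arithmetic)))  # one coefficient for a, one for the sum b + c
```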
I understand that in a formula like y ~ x I am looking at "y" as a function of "x". In maths this would be something like y = f(x).
In R, functions like xtabs can take formula objects without a left side, e.g. xtabs( ~ x). From my understanding of formulas, I am now looking at nothing as a function of "x", in maths something like " = f(x)", but that is obviously not how R understands the formula (it returns a contingency table of a factor, for example).
So how can I understand the meaning of an empty left hand argument?
I'm sure this has been explained somewhere, but I have a hard time googling for "R ~".
Formulas only have meaning in the context of the particular functions that work with them. The same formula may mean something completely different to one function vs. another function.
In the case of xtabs it sums the left hand side over the levels of the right hand side and if there is no left hand side it gives the counts. That is, the default left hand side can be regarded as a vector of ones. e.g. these each give the same result
x <- c(1, 1, 2, 2, 2)
# 1
xtabs(~ x)
# 2
ones <- rep(1, length.out = length(x))
xtabs(ones ~ x)
This also gives a similar result but in this case the result is an array rather than a table:
# 3
tapply(ones, x, sum)
The use of a formula is not hard-wired in R; while there are tools that make parsing a formula easier, for example to create contrasts, it is up to the package author to do something useful with what comes out of the parsing.
You will often find ~x without the left side in context with counts, e.g. in lattice barplots or histograms. Often, you can think of the empty left side as "the count of".
In the meantime I have learned the following and would like to add it to the answers already given:
A two-sided formula, such as in plot(y ~ x) or lm(y ~ x), is a symbolic representation of an asymmetrical question regarding the dependence between (groups of) dependent and independent variables. Dependent variables stand on the left side of the formula and you can read the formula as "(left side) as a function of (right side)".
A one-sided formula, such as in xtabs(~ x + y) or cor.test(~ x + y) is a symbolic representation of a symmetrical question regarding the correlation (in the broad everyday sense) between two "equal" variables (e.g. both dependent, both independent, or of unknown dependence).
Feel free to correct my bad English.
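If it helps, the difference between one-sided and two-sided formulas is also visible in the objects themselves. The following is just a sketch for poking at them; x and y are placeholder names and no data is needed.

```r
two_sided <- y ~ x
one_sided <- ~ x + y

# A formula is a call to `~`: the operator plus either two or one argument.
print(length(two_sided))  # 3: `~`, left-hand side, right-hand side
print(length(one_sided))  # 2: `~`, right-hand side only

# terms() records whether a response is present (1) or absent (0).
print(attr(terms(two_sided), "response"))
print(attr(terms(one_sided), "response"))

# Both variables of the one-sided formula sit on the right-hand side.
print(all.vars(one_sided))
```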
I have seen how to use ~ operator in formula. For example y~x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why does just model.matrix(a) not work? Why is the result of model.matrix(~a) different from model.matrix(~0+a)? And finally, what is the meaning of the ~ operator here?
~ creates a formula - it separates the righthand and lefthand sides of a formula
From ?`~`
Tilde is used to separate the left- and right-hand sides in model formula
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
So, regarding your specific issue with ~a+0:
You are creating a model matrix without an intercept. As a is a factor, model.matrix(~a) will return an intercept column that takes the place of a1 (you need n-1 indicators to fully specify n classes).
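A quick way to see the difference is to compare the column names of the two model matrices. This sketch reuses the factor a from the question.

```r
a <- factor(1:3)

with_intercept <- model.matrix(~ a)      # intercept column plus indicators for levels 2 and 3
no_intercept   <- model.matrix(~ 0 + a)  # one indicator column per level, no intercept

print(colnames(with_intercept))
print(colnames(no_intercept))
```

Both matrices have three columns; they just parameterise the three groups differently.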
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object
From the help for model.matrix
object an object of an appropriate class. For the default method, a
model formula or a terms object.
R is looking for a particular class of object; by passing a formula ~a you are passing an object that is of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
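The following sketch tries all three calls. The tryCatch() wrapper is only there so the failing call does not stop the script; the exact error message may differ between R versions, so it is not shown.

```r
a <- factor(1:3)

# Passing the bare factor fails: it is neither a formula nor a terms object.
res <- tryCatch(model.matrix(a), error = function(e) e)
print(inherits(res, "error"))

# A formula and its corresponding terms object are both accepted.
from_formula <- model.matrix(~ a)
from_terms   <- model.matrix(terms(~ a))
print(identical(colnames(from_formula), colnames(from_terms)))
```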
general note
As @BenBolker helpfully notes in his comment, this is a modified version of Wilkinson-Rogers notation.
There is a good description in the Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until I recently found this excellent book chapter.
In mathematics 0+a is equal to a and writing a term like 0+a is very strange. However we are here dealing with linear models: A simple high-school equation such as y=ax+b that uncovers the relationship between the predictor variable (x) and the observation (y).
So we can think of ~0+x or equally ~x+0 as an equation of the form: y=ax+b. By adding 0 we are forcing b to be zero, which means that we are looking for a line passing through the origin (no intercept). If we specified a model like ~x+1 or just ~x, the fitted equation could possibly contain a non-zero term b. Equally, we may restrict b with a formula ~x-1 or ~-1+x, which both mean: no intercept (the same way we exclude a row or column in R by a negative index). However, something like ~x-2 or ~x+3 is meaningless.
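Here is a small sketch with simulated data (the numbers are made up) showing which coefficients lm() estimates for each spelling of the formula.

```r
# Simulated line y = 3x + 5 plus noise; values are purely illustrative.
set.seed(42)
x <- 1:20
y <- 3 * x + 5 + rnorm(20)

with_b    <- lm(y ~ x)      # estimates both slope a and intercept b
zero_plus <- lm(y ~ 0 + x)  # b forced to zero
minus_one <- lm(y ~ x - 1)  # same model, intercept removed with "- 1"

print(names(coef(with_b)))
print(names(coef(zero_plus)))
print(all.equal(coef(zero_plus), coef(minus_one)))
```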
Thanks to @mnel for the useful comment. Finally, what's the reason to use ~ and not =? In standard mathematical terminology / symbology y~x denotes that y is equivalent to x; it is somewhat weaker than y=x. When you are fitting a linear model, you aren't really saying y=x, but rather that you can model y as a linear function of x (y = ax+b for example).
To answer part of your question, tilde is used to separate the left- and right-hand sides in model formula. See ?"~" for more help.
I've seen two basic approaches to generic formulas for within-subjects designs in R/aov() (R = random, X = dependent, W? = within, B? = between):
# Pure within:
X ~ Error(R/W1*W2...)
# or
X ~ (W1*W2...) + Error(R/(W1*W2...))
# Mixed:
X ~ B1*B2*... + Error(R/W1*W2...)
# or
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...))
That is, some advise never putting W factors outside the error term or B factors inside, while others put all (B, W) factors outside and inside, indicating in the error term which are nested within R.
Are these simply notational variants? Is there any reason to prefer one to the other as a default for performing ANOVA using aov()?
I would always recommend putting all within-subjects variables inside and outside of the error term.
For pure within-subject analysis this means using the following formula:
X ~ (W1*W2...) + Error(R/(W1*W2...))
Here, all within-subjects effects are tested with respect to their appropriate error term.
In contrast, the formula X ~ Error(R/W1*W2...) does not allow you to test the effects of your variables.
The same principle holds for mixed designs (including between- and within-subjects variables). The correct formula is:
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...))
There is no need to use the between-variables twice in the formula. The model above is actually identical to X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...)).
This formula allows you to test both between- and within-subject effects with the correct error terms.
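For the pure within-subjects case, a minimal sketch with simulated data might look like this. The data frame d, the factor levels and the effect size are all invented for illustration; only the formula layout follows the recommendation above.

```r
# Hypothetical balanced design: 8 subjects (R), two 2-level within factors.
set.seed(123)
d <- expand.grid(R  = factor(1:8),
                 W1 = factor(c("a", "b")),
                 W2 = factor(c("p", "q")))
d$X <- rnorm(nrow(d)) + 0.5 * (d$W1 == "b")

# Within factors appear both outside and inside the Error() term.
fit <- aov(X ~ W1 * W2 + Error(R / (W1 * W2)), data = d)

print(names(fit))  # one stratum per error term
summary(fit)       # each within effect is tested in its own stratum
```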
For more information, read this ANOVA tutorial.
I once saw the GLMM modeling building process using the following script:
dative.glmm8 <- lmer(RealizationOfRecipient ~ AnimacyOfRec + DefinOfRec +
PronomOfRec * PronomOfTheme + I(AccessOfRec=="given") + AnimacyOfTheme + DefinOfTheme +
I(AccessOfTheme=="given") + log(RatioOfLengthsThemeOverRecipient) + (1|Verb),
family="binomial")
I do not understand the argument I(AccessOfTheme=="given"). What does this kind of term actually mean?
This question is not actually lmer-specific, but applies to all model formulas in R. In a formula context, I() stands for "insulate": from http://cran.r-project.org/doc/manuals/R-intro.pdf ,
I(M) Insulate M. Inside M all operators have their normal arithmetic meaning, and that term appears in the model matrix.
This is essentially creating a dummy (0/1) variable on the fly for AccessOfRec being equal to "given" (1) or anything else (0).
You could also do this by creating the variable beforehand, e.g. AccessOfRec_given <- (AccessOfRec=="given"), and then using the derived variable in the formula.
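To see what ends up in the model matrix, here is a small sketch. The vector access and its values other than "given" are made up for the example; the real data set is not reproduced here.

```r
# Hypothetical stand-in for a column such as AccessOfRec.
access <- c("given", "new", "accessible", "given")

# Dummy variable created on the fly inside the formula.
on_the_fly <- model.matrix(~ I(access == "given"))

# Same thing with the variable created beforehand.
access_given <- access == "given"
precomputed  <- model.matrix(~ access_given)

print(on_the_fly[, 2])   # 1 where access is "given", 0 otherwise
print(all(on_the_fly[, 2] == precomputed[, 2]))
```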
By the way, I would strongly recommend using the data argument to lmer, rather than either using variables from the global workspace or attach()ing data frames.
I'm trying to get my head around the use of the tilde operator, and associated functions. My 1st question is why does I() need to be used to specify arithmetic operators? For example, these 2 plots generate different results (the former having a straight line, and the latter the expected curve)
x <- c(1:100)
y <- seq(0.1,10,0.1)
plot(y~x^3)
plot(y~I(x^3))
further, both of the following plots also generate the expected result
plot(x^3, y)
plot(I(x^3), y)
My second question is, perhaps the examples I've been using are too simple, but I don't understand where ~ should actually be used.
The tilde operator is actually a function that returns an unevaluated expression, a type of language object. The expression then gets interpreted by modeling functions in a manner that is different than the interpretation of operators operating on numeric objects.
The issue here is how formulas, and specifically the "+", ":", and "^" operators in them, are interpreted. (A side note: the correct statistical procedure would be to use the function poly when attempting to make higher order terms in a regression formula.) Within R formulas the infix operators "+", "*", ":" and "^" have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (~) separates the left hand side from the right hand side. The ^ and : operators are used to construct interactions, so x = x^2 = x^3 rather than becoming the perhaps expected mathematical powers. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2 the R interpreter would have produced (for its own good internal use), not a mathematical: x^2 + 2xy + y^2, but rather a symbolic: x + y + x:y where x:y is an interaction term without its main effects. (The ^ gives you both main effects and interactions.)
?formula
The I() function acts to mark its argument "as is", i.e. to be evaluated the way you would expect outside a formula. So I(x^2) would return a vector of values raised to the second power.
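You can see the effect without plotting anything by looking at the model matrix each formula produces. This is only a sketch with a tiny made-up x.

```r
x <- 1:5

plain     <- model.matrix(~ x^3)     # "^" is read symbolically, so the term is just x
protected <- model.matrix(~ I(x^3))  # "^" inside I() is ordinary exponentiation

print(unname(plain[, 2]))      # 1 2 3 4 5
print(unname(protected[, 2]))  # 1 8 27 64 125
```

This is why plot(y ~ x^3) in the question draws y against x itself, while plot(y ~ I(x^3)) draws it against the cubed values.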
The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. The ~ is an infix function in its own right. You can see that LHS ~ RHS is almost shorthand for formula(LHS, RHS) by typing this at the console:
`~`(LHS,RHS)
#LHS ~ RHS
class( `~`(LHS,RHS) )
#[1] "formula"
identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE # cannot use `formula` since it interprets its first argument
In regression functions the error term in model descriptions will be in whatever form that regression function presumes or is specifically called for in the parameters for family. The mean for the base level will generally be labelled (Intercept). The function context and arguments may also further determine a link function such as log() or logit() from the family value, and it is also possible to have a non-canonical family/link combination.
The "+" symbol in a formula is not really adding two variables but is usually an implicit request to calculate a regression coefficient(s) for that variable in the context of the rest of the variables that are on the RHS of a formula. The regression functions use model.matrix, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
In plot()-ting functions it basically reverses the usual ( x, y ) order of arguments that the plot function usually takes. There was a plot.formula method written so that formulas could be used as a more "mathematical" mode of communicating with R. In the graphics::plot.formula, curve, and 'lattice' and 'ggplot' functions, it governs how multiple factors or numeric vectors are displayed and "facetted".
The overloading of the "+" operator is discussed in the comments below and is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method which uses "+" as an "arrangement" and grouping operator.
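As one example of "+" used for grouping, here is a sketch with the formula method of aggregate() and the built-in ToothGrowth data set. The choice of data set and summary function is mine, not something from the discussion above.

```r
# "+" on the right-hand side lists the grouping variables; nothing is added.
res <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

print(names(res))  # grouping columns first, then the summarised variable
print(nrow(res))   # one row per supp/dose combination
```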