I've seen two basic approaches to generic formulas for within-subjects designs in R/aov() (R = random, X = dependent, W? = within, B? = between):
# Pure within:
X ~ Error(R/W1*W2...)
# or
X ~ (W1*W2...) + Error(R/(W1*W2...))
# Mixed:
X ~ B1*B2*... + Error(R/W1*W2...)
# or
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...))
That is, some advise never putting W factors outside the error term or B factors inside, while others put all (B, W) factors outside and inside, indicating in the error term which are nested within R.
Are these simply notational variants? Is there any reason to prefer one to the other as a default for performing ANOVA using aov()?
I would always recommend putting all within-subjects variables inside and outside of the error term.
For pure within-subject analysis this means using the following formula:
X ~ (W1*W2...) + Error(R/(W1*W2...))
Here, all within-subjects effects are tested with respect to their appropriate error term.
In contrast, the formula X ~ Error(R/W1*W2...) does not allow you to test the effects of your variables.
The same principle holds for mixed designs (including both between- and within-subjects variables). The correct formula is:
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...))
There is no need to use the between-variables twice in the formula. The model above is actually identical to X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...)).
This formula allows you to test both between- and within-subject effects with the correct error terms.
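As a minimal sketch of the mixed-design formula, the following uses simulated data with one between-subjects factor and one within-subjects factor. The data frame `d` and its values are made up for illustration; only the column names follow the notation used in the question (R, B, W, X).

```r
# Simulated data: 8 subjects (R), each measured at 3 levels of W.
# All values are illustrative, not from any real study.
set.seed(1)
d <- expand.grid(W = factor(c("w1", "w2", "w3")), R = factor(1:8))
# Between-subjects factor: each subject belongs to exactly one group.
d$B <- factor(ifelse(as.integer(d$R) <= 4, "b1", "b2"))
d$X <- rnorm(nrow(d))

# All factors outside the error term; only the within factor nested in R.
fit <- aov(X ~ B * W + Error(R / W), data = d)

# One table per error stratum: B appears in the R stratum,
# W and B:W appear in the R:W stratum.
summary(fit)
```

The fit is a multi-stratum object, so `names(fit)` lists the error strata that the effects are tested against.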
For more information, read this ANOVA tutorial.
Let me clarify that I am a complete beginner at R.
I'm working on a problem and having a bit of trouble understanding a formula I'm supposed to use in a linear mixed effects model analysis of a dataset, more specifically this formula,
ModelName <- lmer(outcome ~ predictor1 + predictor2 + predictor1:predictor2 + (random_structure), data = DatasetName)
I don't know what the predictor1:predictor2 part of it means, could anyone please help me understand or link to something I can read to understand?
I've run the code, and it gives an additional output for the predictor1:predictor2 part of the formula, which doesn't happen when you don't include that part.
Wow! You may be new to R, but you ask a great question!
As you probably know already, the + operator separates terms in a model.
Y ~ a + b + c means that the response is modeled by a linear combination of a, b, and c.
The colon operator denotes interaction between the items it separates, for example:
Y ~ a + b + a:b means that the response is modeled by a linear combination of a, b, and the interaction between a and b.
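For two numeric predictors, the interaction term is simply the elementwise product of the two columns. A small sketch with made-up values (the data frame `d` is purely illustrative) shows this by inspecting the design matrix:

```r
# Made-up numeric predictors, for illustration only.
d <- data.frame(a = c(1, 2, 3, 4), b = c(2, 0, 1, 5))

# Design matrix for main effects plus the a:b interaction.
mm <- model.matrix(~ a + b + a:b, data = d)
colnames(mm)

# The a:b column equals a * b row by row.
interaction_is_product <- all(mm[, "a:b"] == d$a * d$b)

# a * b is shorthand for a + b + a:b, so it yields the same columns.
same_columns <- identical(colnames(model.matrix(~ a * b, data = d)), colnames(mm))
```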
I hope this helps!
Rose Hartman explains how interactions affect linear models, and why it’s important to consider them in Understanding Interactions in Linear Models https://education.arcus.chop.edu/understanding-interactions/
I had to transform a response variable (e.g. Variable1) to satisfy the assumptions of linear models in lmer, using an approach suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ for heavy-tailed data and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation, as suggested in the question "Specifying that model is logit transformed to plot backtransformed trends":
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
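The loss of information can be seen with a few made-up numbers (the vector below is illustrative, not the asker's data): values equally far above and below the median collapse to the same transformed value.

```r
# Illustrative values only; the median is 5.
Variable1 <- c(1, 3, 5, 7, 9)
m <- median(Variable1)

# The transformation from the question.
tr <- sqrt(abs(Variable1 - m))
tr

# 1 and 9 are both 4 units from the median, so both map to 2.
# Given tr == 2 alone, there is no way to tell which one it came from.
collapsed <- tr[1] == tr[5] && Variable1[1] != Variable1[5]
```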
I would suggest using a different model, or at least a different response variable.
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
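To see why the symmetric version can be undone, here is a hand-written base-R stand-in for the transformation and its inverse. The functions `sym_sqrt` and `sym_sqrt_inv` are hypothetical helpers for illustration, not the emmeans internals, and the data are made up.

```r
# Illustrative values only; the median is 5.
Variable1 <- c(1, 3, 5, 7, 9)
m <- median(Variable1)

# Symmetric square root: keeps the sign of the deviation from the median.
sym_sqrt <- function(v) sign(v - m) * sqrt(abs(v - m))

# Its inverse: square the magnitude, restore the sign, shift back.
sym_sqrt_inv <- function(t) m + sign(t) * t^2

tr <- sym_sqrt(Variable1)     # strictly increasing in Variable1
back <- sym_sqrt_inv(tr)      # recovers the original values
```

Because the transformed values are strictly increasing, each one corresponds to exactly one original value, which is what back-transformation requires.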
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.
Using machine learning in R
When writing a model formula such as ~ . together with a data argument, what does the . indicate?
For example:
fit <- svm(factor(outcome)~., data= train, probability= T)
pre <- predict(fit, test, decision.value= T, probability= T)
The dot means "everything else". That is, if your dataset has the variables x, y and z, then y ~ . gets translated to y ~ x + z.
The help page (?formula) can shed some light regarding . interpretation :
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
However, note that . is used differently by the reshape and reshape2 packages:
?cast
There are a couple of special variables: "..." represents all other
variables not used in the formula and "." represents no variable
It means "all other variables", that are present in the dataset.
Here the dot (.) means "everything else". Suppose the Strong dataset has four variables named age, height, weight and strength, where strength is the response variable. We want to fit a linear model in which strength is the response and the other three variables are the predictors. We can write the model as:
model = lm(strength ~ height + weight + age , data = Strong)
This model can be written more concisely as:
model = lm(strength ~., data = Strong)
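A quick way to check what the dot expands to is to ask `terms()` for the term labels. The sketch below builds a simulated stand-in for the Strong dataset (all values are made up) and confirms that both ways of writing the model give the same fit.

```r
# Simulated stand-in for the Strong dataset; values are illustrative only.
set.seed(42)
Strong <- data.frame(
  age    = rnorm(10, mean = 30,  sd = 5),
  height = rnorm(10, mean = 170, sd = 10),
  weight = rnorm(10, mean = 70,  sd = 8)
)
Strong$strength <- 0.5 * Strong$height + 0.3 * Strong$weight -
  0.2 * Strong$age + rnorm(10)

# The dot expands to every column of `data` not already in the formula.
dot_terms <- attr(terms(strength ~ ., data = Strong), "term.labels")
dot_terms

# Both formulas describe the same model, so the fitted values agree.
fit_dot  <- lm(strength ~ ., data = Strong)
fit_full <- lm(strength ~ height + weight + age, data = Strong)
same_fit <- isTRUE(all.equal(fitted(fit_dot), fitted(fit_full)))
```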
ref: http://www.r-bloggers.com/r-for-ecologists-putting-together-a-piecewise-regression/
In this post, I am confused by this argument:
y ~ x*(x < breaks[i]) + x*(x>=breaks[i])
in lm().
I know that * in lm() means main effects plus interactions, so does this mean that the predictors are x, (x < breaks[i]), (x >= breaks[i]), and their interactions with x?
This is a method of doing "segmented" regression. You are essentially creating two different models, one for the section where x < breaks[i] and another where the opposite is true. In this case the * will be functioning as a multiplier rather than as an interaction operator because the values are {0,1} so there won't be a two level result. The webpage seems to do a pretty nice job of illustrating this, so it's unclear what is missing. The model formula might be more clear if it were written as:
y ~ x*I(x < breaks[i]) + x*I(x>=breaks[i])
It essentially means that there are two predictors: the first one being x and the second one being a logical vector that is 1 in the region less than breaks[i] and 0 in the other region. In fact you probably would not need two terms in the model if you just used:
y ~ x*I(x < breaks[i])
I thought the predictions would be the same, but they were slightly different, perhaps because the two term model implicitly allowed completely independent intercepts.
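As a rough sketch of how the single-term version encodes two line segments, the following fits noise-free, made-up data with a known break. The break point `brk` and the segment equations are assumptions chosen for illustration; the region x >= brk is the reference level, and the other coefficients are offsets from it.

```r
# Noise-free illustrative data with a known break at x = 5:
#   left segment  (x <  5): y = 2 + 3x
#   right segment (x >= 5): y = 10 - x
x <- seq(0, 10, by = 0.5)
brk <- 5  # hypothetical break point
y <- ifelse(x < brk, 2 + 3 * x, 10 - 1 * x)

fit <- lm(y ~ x * I(x < brk))
cf <- unname(coef(fit))

# Right segment is the baseline: intercept and slope come straight out.
right_line <- c(intercept = cf[1], slope = cf[2])
# Left segment adds the indicator offset and the interaction offset.
left_line <- c(intercept = cf[1] + cf[3], slope = cf[2] + cf[4])

right_line
left_line
```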
There also are segmented and strucchange packages that support segmented regression.
I have seen how to use ~ operator in formula. For example y~x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why does plain model.matrix(a) not work? Why is the result of model.matrix(~a) different from that of model.matrix(~0+a)? And finally, what is the meaning of the ~ operator here?
~ creates a formula - it separates the righthand and lefthand sides of a formula
From ?`~`
Tilde is used to separate the left- and right-hand sides in model formula
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
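The expansions described in the help text can be inspected directly, since `terms()` works on a formula symbolically without needing any data. The helper `expand` below is a hypothetical convenience wrapper written for this sketch.

```r
# Hypothetical helper: list the terms a formula expands to.
expand <- function(f) attr(terms(f), "term.labels")

crossed  <- expand(~ a * b)                 # "a" "b" "a:b"
squared  <- expand(~ (a + b + c)^2)         # main effects + all 2-way interactions
dropped  <- expand(~ (a + b + c)^2 - a:b)   # same, with a:b removed

crossed
squared
dropped
```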
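So, regarding the specific issue with ~0+a:
You are creating a model matrix without an intercept. As a is a factor, model.matrix(~a) returns an intercept column, which represents the first level a1, plus indicator columns for the remaining levels (you need only n-1 indicators alongside an intercept to fully specify n classes). With ~0+a, the intercept is dropped and every level gets its own indicator column.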
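Using the factor from the question, the two design matrices can be compared side by side. This sketch assumes R's default treatment contrasts.

```r
a <- factor(1:3)

# With an intercept: level 1 is absorbed into the intercept column.
with_int <- model.matrix(~ a)
colnames(with_int)   # "(Intercept)" "a2" "a3"

# Without an intercept: one indicator column per level.
no_int <- model.matrix(~ 0 + a)
colnames(no_int)     # "a1" "a2" "a3"
```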
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object
From the help for model.matrix
object an object of an appropriate class. For the default method, a
model formula or a terms object.
R is looking for a particular class of object; by passing the formula ~a you are passing an object of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
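A short check of the classes involved, as a sketch: the bare factor is rejected, while the formula and its terms object give the same matrix.

```r
a <- factor(1:3)

class(a)     # "factor"
class(~ a)   # "formula"

# Passing the bare factor fails; capture the error instead of stopping.
res <- tryCatch(model.matrix(a), error = function(e) e)
factor_fails <- inherits(res, "error")

# A formula or its terms object both work and agree with each other.
mm_formula <- model.matrix(~ a)
mm_terms   <- model.matrix(terms(~ a))
same_matrix <- isTRUE(all.equal(mm_formula, mm_terms, check.attributes = FALSE))
```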
general note
As @BenBolker helpfully notes in his comment, this is a modified version of Wilkinson-Rogers notation.
There is a good description in the Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until I recently found this excellent book chapter.
In mathematics, 0+a is equal to a, so writing a term like 0+a looks very strange. However, here we are dealing with linear models: think of a simple high-school equation such as y=ax+b, which describes the relationship between the predictor variable (x) and the observation (y).
So we can think of ~0+x, or equivalently ~x+0, in terms of an equation of the form y=ax+b. By adding 0 we force b to be zero, which means that we are looking for a line passing through the origin (no intercept). If we specified a model like ~x+1 or just ~x, the fitted equation could possibly contain a non-zero term b. Equally, we may restrict b with a formula such as ~x-1 or ~-1+x, both of which mean no intercept (the same way we exclude a row or column in R with a negative index). However, something like ~x-2 or ~x+3 is meaningless.
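The effect on a fitted line can be checked with a few made-up points (the vectors `x` and `y` below are illustrative only): the default formula estimates both a and b, while the 0 and -1 forms estimate only the slope.

```r
# Illustrative data, roughly y = 2x.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

with_b     <- coef(lm(y ~ x))       # intercept b and slope a
plus_one   <- coef(lm(y ~ x + 1))   # same model, intercept written explicitly
no_b_zero  <- coef(lm(y ~ 0 + x))   # slope only, line through the origin
no_b_minus <- coef(lm(y ~ x - 1))   # same as above

names(with_b)
names(no_b_zero)
```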
Thanks to @mnel for the useful comment. Finally, what is the reason to use ~ and not =? In standard mathematical notation, y~x denotes that y is equivalent to x, which is somewhat weaker than y=x. When you are fitting a linear model, you aren't really saying y=x, but rather that you can model y as a linear function of x (y = ax+b, for example).
To answer part of your question, tilde is used to separate the left- and right-hand sides in model formula. See ?"~" for more help.