what does a colon (:) do in a linear mixed effects model analysis?

Let me clarify that I am a complete beginner at R.
I'm working on a problem and having a bit of trouble understanding a formula I'm supposed to use in a linear mixed effects model analysis of a dataset, more specifically this formula,
ModelName <- lmer(outcome ~ predictor1 + predictor2 + predictor1:predictor2 + (random_structure), data = DatasetName)
I don't know what the predictor1:predictor2 part of it means, could anyone please help me understand or link to something I can read to understand?
I've run the code, and it gives an additional output for the predictor2 part of the formula, which doesn't happen when you don't include that part.

Wow! You may be new to R, but you ask a great question!
As you probably know already, the + operator separates terms in a model.
Y ~ a + b + c means that the response is modeled by a linear combination of a, b, and c.
The colon operator denotes interaction between the items it separates, for example:
Y ~ a + b + a:b means that the response is modeled by a linear combination of a, b, and the interaction between a and b.
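Here's a minimal sketch you can run to see it concretely, using plain lm() on the built-in mtcars data (the colon means the same thing in lm and lmer formulas):
fit_main <- lm(mpg ~ wt + hp, data = mtcars)          # main effects only
fit_int  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # adds the wt:hp interaction
coef(fit_int)  # note the extra "wt:hp" coefficient - that's your "additional output"
# Shorthand: mpg ~ wt * hp expands to wt + hp + wt:hp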
I hope this helps!
Rose Hartman explains how interactions affect linear models, and why it's important to consider them, in Understanding Interactions in Linear Models: https://education.arcus.chop.edu/understanding-interactions/

Related

linear models for ANOVA

I'm a bit desperate, as my exam is tomorrow.
Say I have the data for an ANOVA with two independent factors. According to my teacher, I would write the linear model in RStudio as:
lm(score ~ 1+ A + B + A:B, data=mydata1, contrasts=list(A=contr.sum, B=contr.sum))
I've faced an exercise in which he asks, essentially, would this model be correct for a two-way ANOVA?
lm(score ~ A + B + A:B, data=mydata1, contrasts=list(A=contr.sum, B=contr.sum))
I'm not sure what difference the "1 +" makes in the linear model. I assumed it changed the value of y at x = 0, but I'm really not sure. Would it be appropriate to use?
Could anyone help me? Sorry if my terms are wrong; English isn't my first language.
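For reference, a quick check (a sketch using the built-in ToothGrowth data as stand-ins for A and B) shows that the leading 1 is just the implicit intercept, so both formulas fit the same model:
# Sketch: the intercept is implicit, so these two fits are identical.
d  <- transform(ToothGrowth, A = supp, B = factor(dose))
m1 <- lm(len ~ 1 + A + B + A:B, data = d, contrasts = list(A = contr.sum, B = contr.sum))
m2 <- lm(len ~     A + B + A:B, data = d, contrasts = list(A = contr.sum, B = contr.sum))
all.equal(coef(m1), coef(m2))  # TRUE: the "1 +" changes nothing here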

Having issues transforming my data for further analysis in R

I have a dataset here:
(dataset omitted)
I want to perform linear and multiple regression. MoralRelationship and SkeletalP are both dependent variables, while the others are independent. I tried all the transformation methods I know, but none of them yielded a meaningful result in my diagnostic plots.
I did this:
lm1 <- lm(MoralRelationship ~ RThumb + RTindex + RTmid + RTFourth + RTFifth + Lthumb + Lindex + LTMid + LTFourth + LTfifth + BldGRP1 + BlDGR2, data = data)
I did the same for SkeletalP.
I made diagnostic plots for both, then tried to normalize the variables because there is neither correlation nor linearity. I took the square, log, and square root of all the independent variables, and also tried 1/x, but got no better output.
I also tried
`lm(SkeletalP ~ RThumb + I(RThumb^2), data = data)`
to see if I would get a better result with one variable.
The independent variables are right skewed except for ANB which is normally distributed.
Is there a method I can use to transform my data? Most importantly, to make it uniformly distributed so that I can perform other statistical tests.
Your dataset is kind of small. You could try dimensionality reduction like PCA, but I don't think it's appropriate here; it's also harder to interpret.
Have you tried other models? Regularization might help the fit of your regression models (e.g. lasso/ridge, i.e. L1/L2 penalties).
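If you want to try that, here is a minimal sketch using the glmnet package (assuming your data frame is called data, as in your code, and treating every other column as a predictor):
# Sketch: lasso (L1) / ridge (L2) regression via glmnet.
library(glmnet)
x <- model.matrix(MoralRelationship ~ . - SkeletalP, data = data)[, -1]  # predictor matrix, intercept column dropped
y <- data$MoralRelationship
cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 is lasso; alpha = 0 is ridge
coef(cv_fit, s = "lambda.min")         # coefficients at the best cross-validated lambda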

R: Understanding formulas

I'm trying to get a better understanding of what R formulas mathematically mean.
For example: lm(y ~ x) would fit a line to y = Ax + B
Would lm(y ~ x + z) be fitting to the plane y = Ax + Bz + C?
Would lm(y ~ x + z + x:z) be fitting the surface y = Ax + Bz + Cxz + D?
Your understanding is correct! Though it may be helpful to understand it a bit more abstractly. Your linear model (lm) only means that it's fitting parameters on a one-dimensional, linear dependence (Ax, not Ax^2 or A sin(x) or anything fancier than that).
But that does not mean it only fits 1 to 3 parameters. Imagine that foods represent dimensions: grains, fruits, vegetables, meats, and dairy make up our 5 "dimensions of food". These things are clearly related, and maybe not even independent, but still not totally describable in exactly the same ways. We can think of our model as the tool that gauges our coefficients, which in this food example we can imagine as "flavors": sweet, spicy, sour, etc.
Our model then takes points in different dimensions (food groups) and attempts to relate them by their coefficient values (flavors) for a function. This model then allows us to describe other foods/flavors. This is really what most models "do": they "train" themselves on annotated data and build a relationship; linear models just treat flavors as directly proportional to the amount of each food group.
I hope this explanation was helpful. If there's anything that's unclear, please let me know. Also, I would have made this as a comment but have not yet accumulated the required 50 pts. Sorry!
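As a concrete check of the reading in the question, here's a small simulation (a sketch) that generates data from y = 1 + 2x + 3z + 4xz and recovers the coefficients:
# Sketch: lm(y ~ x + z + x:z) recovers A, B, C, D from simulated data.
set.seed(1)
n <- 500
x <- rnorm(n); z <- rnorm(n)
y <- 1 + 2*x + 3*z + 4*x*z + rnorm(n, sd = 0.1)
coef(lm(y ~ x + z + x:z))  # approx: (Intercept) 1, x 2, z 3, x:z 4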

R tilde operator: What does ~0+a mean?

I have seen how to use the ~ operator in a formula. For example, y ~ x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why doesn't just model.matrix(a) work? Why is the result of model.matrix(~a) different from model.matrix(~0+a)? And finally, what is the meaning of the ~ operator here?
~ creates a formula: it separates the right-hand and left-hand sides of a formula.
From ?`~`
Tilde is used to separate the left- and right-hand sides in a model formula.
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b is interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also be used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
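You can verify these expansions yourself with terms() (a quick sketch using placeholder variables a, b, c; one-sided formulas are purely symbolic, so the variables need not exist):
# Sketch: inspecting how formula operators expand.
attr(terms(~ a * b), "term.labels")                # "a" "b" "a:b"
attr(terms(~ (a + b + c)^2), "term.labels")        # a, b, c plus all two-way interactions
attr(terms(~ (a + b + c)^2 - a:b), "term.labels")  # as above, minus a:b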
So, regarding your specific issue with ~0+a:
You are creating a model matrix without an intercept. Because a is a factor, model.matrix(~a) will return an intercept column plus indicator columns a2 and a3; the intercept plays the role of a1 (you only need n-1 indicators plus an intercept to fully specify n classes). With ~0+a you instead get one indicator column per level.
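You can see the difference directly with the code from the question:
a <- factor(1:3)
model.matrix(~ a)      # columns: (Intercept), a2, a3 (a1 is absorbed by the intercept)
model.matrix(~ 0 + a)  # columns: a1, a2, a3 (one indicator per level, no intercept)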
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object.
From the help for model.matrix
object: an object of an appropriate class. For the default method, a model formula or a terms object.
R is looking for a particular class of object; by passing the formula ~a, you are passing an object of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
general note
@BenBolker helpfully notes in his comment that this is a modified version of Wilkinson-Rogers notation.
There is a good description in the Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until recently, when I found this excellent book chapter.
In mathematics, 0+a is equal to a, so writing a term like 0+a looks very strange. However, here we are dealing with linear models: a simple high-school equation such as y = ax + b that captures the relationship between the predictor variable (x) and the observation (y).
So we can think of ~0+x, or equally ~x+0, as an equation of the form y = ax + b in which we force b to be zero; that means we are looking for a line passing through the origin (no intercept). If we specified a model like ~x+1, or just ~x, the fitted equation could possibly contain a non-zero term b. Equally, we may remove the intercept with the formula ~x-1 or ~-1+x, which both mean: no intercept (the same way we exclude a row or column in R by a negative index). However, something like ~x-2 or ~x+3 is meaningless.
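A quick sketch showing the equivalent spellings side by side:
x <- 1:5
model.matrix(~ x)       # intercept column plus x
model.matrix(~ 0 + x)   # no intercept: the line passes through the origin
model.matrix(~ x - 1)   # same design matrix as ~ 0 + x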
Thanks to @mnel for the useful comment: finally, what's the reason to use ~ and not =? In standard mathematical notation, y ~ x denotes that y is equivalent to x, which is somewhat weaker than y = x. When you fit a linear model, you aren't really saying y = x, but rather that you can model y as a linear function of x (y = ax + b, for example).
To answer part of your question, tilde is used to separate the left- and right-hand sides in a model formula. See ?"~" for more help.

aov formula error term: contradictory examples

I've seen two basic approaches to generic formulas for within-subjects designs in R/aov() (R = random, X = dependent, W? = within, B? = between):
# Pure within:
X ~ Error(R/W1*W2...)
# or
X ~ (W1*W2...) + Error(R/(W1*W2...))
# Mixed:
X ~ B1*B2*... + Error(R/W1*W2...)
# or
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...))
That is, some advise never putting W factors outside the error term or B factors inside, while others put all (B, W) factors outside and inside, indicating in the error term which are nested within R.
Are these simply notational variants? Is there any reason to prefer one to the other as a default for performing ANOVA using aov()?
I would always recommend putting all within-subjects variables inside and outside of the error term.
For pure within-subject analysis this means using the following formula:
X ~ (W1*W2...) + Error(R/(W1*W2...))
Here, all within-subjects effects are tested with respect to their appropriate error term.
In contrast, the formula X ~ Error(R/W1*W2...) only partitions the error strata and does not include the effects themselves, so it does not allow you to test them.
The same principle holds for mixed designs (including both between- and within-subjects variables). The correct formula is:
X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...))
There is no need to use the between-variables twice in the formula. The model above is actually identical to X ~ (B1*B2*...W1*W2...) + Error(R/(W1*W2...)+(B1*B2...)).
This formula allows you to test both between- and within-subject effects with the correct error terms.
For more information, read this ANOVA tutorial.
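As a runnable illustration, here is a toy mixed design with one between factor B and one within factor W (data simulated, names purely illustrative):
# Sketch: recommended formula shape for a mixed design, on simulated data.
set.seed(42)
d <- expand.grid(R = factor(1:10), W = factor(c("w1", "w2")))  # 10 subjects x 2 within levels
d$B <- factor(ifelse(as.integer(d$R) <= 5, "b1", "b2"))        # between-subjects grouping
d$X <- rnorm(nrow(d)) + as.integer(d$W) + as.integer(d$B)      # dependent variable
summary(aov(X ~ B * W + Error(R/W), data = d))                 # B, W, and B:W tested in the correct strata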
