I once saw a GLMM model-building process that used the following script:
dative.glmm8 <- lmer(RealizationOfRecipient ~ AnimacyOfRec + DefinOfRec +
PronomOfRec * PronomOfTheme + I(AccessOfRec=="given") + AnimacyOfTheme + DefinOfTheme +
I(AccessOfTheme=="given") + log(RatioOfLengthsThemeOverRecipient) + (1|Verb),
family="binomial")
I do not understand the argument I(AccessOfTheme=="given"). What does this kind of term mean?
This question is not actually lmer-specific, but applies to all model formulas in R. In a formula context, I() stands for "insulate": from http://cran.r-project.org/doc/manuals/R-intro.pdf ,
I(M) Insulate M. Inside M all operators have their normal arithmetic meaning, and that term appears in the model matrix.
This is essentially creating a dummy (0/1) variable on the fly for AccessOfRec being equal to "given" (1) or anything else (0).
You could also do this by creating the variable beforehand, e.g. AccessOfRec_given <- (AccessOfRec=="given"), and then using the derived variable in the formula.
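To make the equivalence concrete, here is a minimal sketch in base R. The data frame d and its values are made up for illustration; only the column name AccessOfRec comes from the question.

```r
# Hypothetical toy data; only the column name AccessOfRec is from the question.
d <- data.frame(
  AccessOfRec = c("given", "new", "accessible", "given"),
  stringsAsFactors = FALSE
)

# Inside I(), == is an ordinary comparison, so the term is a logical vector
# that model.matrix() turns into a 0/1 column.
mm_inline <- model.matrix(~ I(AccessOfRec == "given"), data = d)

# Equivalent: build the indicator first, then use it as a plain variable.
d$AccessOfRec_given <- d$AccessOfRec == "given"
mm_derived <- model.matrix(~ AccessOfRec_given, data = d)

print(mm_inline)
print(mm_derived)
```

Both matrices should carry the same indicator column next to the intercept; only the column label differs.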
By the way, I would strongly recommend using the data argument to lmer, rather than either using variables from the global workspace or attach()ing data frames.
I had to transform a response variable (e.g. Variable1) to fulfil the assumptions of linear models in lmer, using an approach suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ for heavy-tailed data and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I updated the reference grid to account for the transformation, as suggested in the question "Specifying that model is logit transformed to plot backtransformed trends":
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
I would suggest using a different model, or at least a different response variable.
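A small base-R sketch of why the back-transformation is impossible. The values of Variable1 here are hypothetical, chosen to be symmetric about their median so that the collisions are easy to see.

```r
# Hypothetical values, symmetric about their median (10).
Variable1 <- c(4, 7, 10, 13, 16)
m <- median(Variable1)

# The transformation from the question: it discards the sign of the deviation.
TransformVariable1 <- sqrt(abs(Variable1 - m))

# Values on opposite sides of the median collapse to the same transformed value:
# 4 and 16 both map to sqrt(6), 7 and 13 both map to sqrt(3).
print(data.frame(Variable1, TransformVariable1))
```

Given only a transformed value such as sqrt(6), there is no way to tell whether it came from 4 or from 16.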
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- that is, the transformation shown, multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
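For intuition only, here is a hand-written base-R sketch of a symmetric square root and its inverse. The function names sym_sqrt and sym_sqrt_inv and the data are hypothetical stand-ins, not the emmeans internals; the point is just that keeping the sign makes the transformation monotone and therefore invertible.

```r
# Hand-written stand-ins (hypothetical names), not what emmeans runs internally.
sym_sqrt <- function(y, center) sign(y - center) * sqrt(abs(y - center))
sym_sqrt_inv <- function(eta, center) center + sign(eta) * eta^2

Variable1 <- c(4, 7, 10, 13, 16)  # hypothetical data
m <- median(Variable1)

eta  <- sym_sqrt(Variable1, m)    # strictly increasing in Variable1
back <- sym_sqrt_inv(eta, m)      # round trip back to the original scale

print(all.equal(back, Variable1))
```

In the real workflow you would let make.tran() supply these pieces rather than writing them yourself.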
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.
I am using machine learning in R. When a formula is written as ~ . together with a data argument, what does the . indicate? For example:
library(e1071)  # provides svm()
fit <- svm(factor(outcome) ~ ., data = train, probability = TRUE)
pre <- predict(fit, test, decision.values = TRUE, probability = TRUE)
The dot means "everything else". That is, if your dataset has the variables x, y and z, then y ~ . gets translated to y ~ x + z.
The help page (?formula) can shed some light regarding . interpretation :
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
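The second interpretation (inside update.formula) is easy to try out. A minimal sketch; the variable names y, x and z are placeholders and no data is needed:

```r
f <- y ~ x

# On each side, "." stands for what was previously on that side.
f2 <- update(f, . ~ . + z)        # keeps y on the left, adds z on the right
f3 <- update(f2, log(.) ~ . - x)  # wraps the old response, drops x

print(f2)
print(f3)
```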
However, note that . is used differently by reshape and reshape2 packages:
?cast
There are a couple of special variables: "..." represents all other
variables not used in the formula and "." represents no variable
It means "all other variables" that are present in the dataset.
Here the dot (.) stands for "everything else". Suppose the Strong dataset has four variables named age, height, weight and strength, where strength is the response variable. We want a linear model in which strength is the response and the other three variables are the predictors (independent variables). We can write the model as
model = lm(strength ~ height + weight + age , data = Strong)
This model can be written more concisely as
model = lm(strength ~., data = Strong)
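You can check that the two calls fit the same model. The sketch below simulates a stand-in for the Strong data (the numbers are invented, only the column names follow the example) and compares the coefficients:

```r
set.seed(1)
# Simulated stand-in for the Strong dataset; values are made up.
Strong <- data.frame(
  age    = round(runif(30, 20, 60)),
  height = rnorm(30, 170, 10),
  weight = rnorm(30, 70, 12)
)
Strong$strength <- 10 + 0.5 * Strong$weight + 0.2 * Strong$height -
  0.3 * Strong$age + rnorm(30)

model_long <- lm(strength ~ height + weight + age, data = Strong)
model_dot  <- lm(strength ~ ., data = Strong)

# The dot expands in column order, so match coefficients by name.
labels_dot <- attr(terms(model_dot), "term.labels")
same_fit <- all.equal(coef(model_dot),
                      coef(model_long)[names(coef(model_dot))])
print(labels_dot)
print(same_fit)
```

Note that the dot picks up every other column in the data frame, so drop ID columns and the like before using it.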
Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm,testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm,tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and not least causing you the problem that you are hitting. When you now call predict, it will be looking for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict()
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them, is legibility. Getting the object name out of the formula allows you to more easily see what the model is.
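Here is a self-contained sketch of the whole round trip. The data are simulated (142 training rows and 34 test rows, to mirror the warning in the question); the column names dependent and independent follow the question.

```r
set.seed(42)
# Simulated stand-ins for the question's data frames.
trainingset <- data.frame(independent = runif(142))
trainingset$dependent <- 2 + 3 * trainingset$independent + rnorm(142, sd = 0.1)
testset <- data.frame(independent = runif(34))

# Bare column names in the formula; the data frame goes in `data`.
c_lm <- lm(dependent ~ independent, data = trainingset)

# predict() looks up `independent` in newdata by name.
c_pred <- predict(c_lm, newdata = testset)
print(length(c_pred))  # one prediction per row of testset
```

Because the formula only mentions column names, any data frame with an independent column can be passed as newdata, whatever the data frame itself is called.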
I have seen how to use ~ operator in formula. For example y~x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why does plain model.matrix(a) not work? Why is the result of model.matrix(~a) different from model.matrix(~0+a)? And finally, what is the meaning of the ~ operator here?
~ creates a formula - it separates the righthand and lefthand sides of a formula
From ?`~`
Tilde is used to separate the left- and right-hand sides in a model formula
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
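So, regarding your specific issue with ~0+a: you are creating a model matrix without an intercept. Because a is a factor, model.matrix(~a) returns an intercept column that represents the first level, a1, plus indicator columns for the remaining levels (you need n-1 indicators to fully specify n classes). With ~0+a there is no intercept, so every level gets its own indicator column.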
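Comparing the column names of the two matrices makes the difference visible (base R only, using the same a as in the question):

```r
a <- factor(1:3)

with_int    <- model.matrix(~ a)      # intercept + indicators for levels 2 and 3
without_int <- model.matrix(~ 0 + a)  # one indicator per level, no intercept

print(colnames(with_int))     # "(Intercept)" "a2" "a3"
print(colnames(without_int))  # "a1" "a2" "a3"
```

Both matrices have three columns and span the same space; they are just two parameterisations of the same three group means.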
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object
From the help for model.matrix
object an object of an appropriate class. For the default method, a
model formula or a terms object.
R is looking for a particular class of object; by passing ~a you are passing an object of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
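A quick way to see the class issue for yourself; this sketch only uses base R, and the tryCatch is there so the failing call does not stop the script:

```r
a <- factor(1:3)

print(class(a))    # "factor"
print(class(~ a))  # "formula"

# A bare factor is neither a formula nor a terms object, so this fails.
res <- tryCatch(model.matrix(a), error = function(e) "error")
print(res)

# A formula and its terms object give the same model matrix.
same <- all.equal(model.matrix(terms(~ a)), model.matrix(~ a))
print(same)
```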
general note
As @BenBolker helpfully notes in his comment, this is a modified version of Wilkinson-Rogers notation.
There is a good description in the Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until I recently found this excellent book chapter.
In mathematics 0+a is equal to a, so writing a term like 0+a looks very strange. However, here we are dealing with linear models: a simple high-school equation such as y=ax+b that describes the relationship between the predictor variable (x) and the observation (y).
So we can think of ~x as an equation of the form y=ax+b, and of ~0+x (or equally ~x+0) as the same equation with b forced to be zero; that is, we are looking for a line passing through the origin (no intercept). If we specify a model like ~x+1 or just ~x, the fitted equation could contain a non-zero term b. Equally, we may remove b with the formula ~x-1 or ~-1+x, both of which mean no intercept (much as a negative index excludes a row or column in R). However, something like ~x-2 or ~x+3 is meaningless.
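A small sketch with lm() shows which coefficients each spelling produces. The data are simulated here purely for illustration:

```r
set.seed(1)
x <- 1:20
y <- 2 * x + 5 + rnorm(20)  # simulated data with a true intercept of 5

coef_default <- names(coef(lm(y ~ x)))      # intercept b and slope a
coef_zero    <- names(coef(lm(y ~ 0 + x)))  # slope only, line through origin
coef_minus   <- names(coef(lm(y ~ x - 1)))  # same model as y ~ 0 + x

print(coef_default)
print(coef_zero)
print(coef_minus)
```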
Thanks to @mnel for the useful comment. Finally, why use ~ and not =? In standard mathematical notation, y~x denotes that y is equivalent to x, which is somewhat weaker than y=x. When you are fitting a linear model, you aren't really saying y=x, but rather that you can model y as a linear function of x (y = ax+b, for example).
To answer part of your question, tilde is used to separate the left- and right-hand sides in model formula. See ?"~" for more help.
I'm trying to get my head around the use of the tilde operator, and associated functions. My 1st question is why does I() need to be used to specify arithmetic operators? For example, these 2 plots generate different results (the former having a straight line, and the latter the expected curve)
x <- c(1:100)
y <- seq(0.1,10,0.1)
plot(y~x^3)
plot(y~I(x^3))
further, both of the following plots also generate the expected result
plot(x^3, y)
plot(I(x^3), y)
My second question is, perhaps the examples I've been using are too simple, but I don't understand where ~ should actually be used.
The tilde operator is actually a function that returns an unevaluated expression, a type of language object. The expression then gets interpreted by modeling functions in a manner that is different than the interpretation of operators operating on numeric objects.
The issue here is how formulas, and specifically the "+", ":", and "^" operators in them, are interpreted. (A side note: the correct statistical procedure would be to use the function poly when attempting to make higher-order terms in a regression formula.) Within R formulas the infix operators "+", "*", ":" and "^" have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (~) separates the left-hand side from the right-hand side. The ^ and : operators are used to construct interactions, so x = x^2 = x^3 rather than becoming the perhaps expected mathematical powers. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2, the R interpreter would have produced (for its own good internal use) not the mathematical x^2 + 2xy + y^2, but rather the symbolic x + y + x:y, where x:y is an interaction term without its main effects. (The ^ gives you both main effects and interactions.)
?formula
The I() function acts to convert the argument to "as.is", i.e. what you expect. So I(x^2) would return a vector of values raised to the second power.
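You can inspect how R reads these formulas without fitting anything, by asking terms() for the term labels. A minimal sketch; x, y and z are just symbols here, no data is required:

```r
# ^ is a crossing operator in a formula, so x^3 collapses to plain x.
labels_plain <- attr(terms(y ~ x^3), "term.labels")

# I() protects the expression, so the term really is x cubed.
labels_asis <- attr(terms(y ~ I(x^3)), "term.labels")

# (x + y)^2 expands to main effects plus the two-way interaction.
labels_cross <- attr(terms(z ~ (x + y)^2), "term.labels")

print(labels_plain)  # "x"
print(labels_asis)   # "I(x^3)"
print(labels_cross)  # "x" "y" "x:y"
```

This is why plot(y ~ x^3) draws y against x, while plot(y ~ I(x^3)) draws y against the cubed values.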
The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. The ~ is an infix function in its own right. You can see that LHS ~ RHS is almost shorthand for formula(LHS, RHS) by typing this at the console:
`~`(LHS,RHS)
#LHS ~ RHS
class( `~`(LHS,RHS) )
#[1] "formula"
identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE # cannot use `formula` since it interprets its first argument
In regression functions, the error term in model descriptions will take whatever form that regression function presumes or is specifically called for by the family argument. The mean for the base level will generally be labelled (Intercept). The function context and arguments may also further determine a link function such as log() or logit() from the family value, and it is also possible to have a non-canonical family/link combination.
The "+" symbol in a formula is not really adding two variables; it is usually an implicit request to calculate regression coefficient(s) for that variable in the context of the rest of the variables on the RHS of the formula. The regression functions use model.matrix, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
In plot()-ting functions it basically reverses the usual ( x, y ) order of arguments that the plot function usually takes. There was a plot.formula method written so that formulas could be used as a more "mathematical" mode of communicating with R. In the graphics::plot.formula, curve, and 'lattice' and 'ggplot' functions, it governs how multiple factors or numeric vectors are displayed and "facetted".
The overloading of the "+" operator is discussed in the comments below and is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method that uses "+" as an "arrangement" and grouping operator.
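As one example of the grouping use, the formula method of aggregate() in base R treats the right-hand side as a set of grouping variables. This sketch uses the built-in mtcars data; the choice of columns is arbitrary:

```r
# "+" here does not add cyl and gear; it lists the grouping variables.
agg <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

# One row per observed (cyl, gear) combination, with the mean mpg in each.
print(agg)
```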