Using machine learning in R:
when a model formula is written as ~ . together with a data argument,
what does the . indicate?
For example:
fit <- svm(factor(outcome) ~ ., data = train, probability = TRUE)
pre <- predict(fit, test, decision.values = TRUE, probability = TRUE)
The dot means "everything else". That is, if your dataset has the variables x, y and z, then y ~ . is translated to y ~ x + z.
The help page (?formula) sheds some light on how . is interpreted:
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
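Both interpretations quoted above can be checked in base R. This is a minimal sketch; the data frame df and its values are made up for illustration.

```r
# Hypothetical data frame with the three variables x, y and z.
df <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 8.1, 9.8), z = c(1, 0, 1, 0, 1))

# With a data argument, "." expands to every column not already in the formula.
expanded <- formula(terms(y ~ ., data = df))
print(expanded)

# In update(), "." means "what was previously in this part of the formula".
updated <- update(y ~ x, . ~ . + z)
print(updated)
```

Both calls should print a formula with y on the left and x and z on the right.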
However, note that . is used differently by reshape and reshape2 packages:
?cast
There are a couple of special variables: "..." represents all other
variables not used in the formula and "." represents no variable
It means "all other variables" that are present in the dataset.
Here the dot (.) means "everything else". Suppose the Strong dataset has four variables named age, height, weight and strength, where strength is the response variable. We want to fit a linear model with strength as the response and the other three variables as predictors. We can write the model as:
model = lm(strength ~ height + weight + age , data = Strong)
This model can be written more compactly as:
model = lm(strength ~., data = Strong)
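To see that the two calls fit the same model, here is a small sketch. The Strong data frame below is simulated purely for illustration; it is not the dataset from the answer.

```r
# Simulated stand-in for the Strong dataset (values are made up).
set.seed(1)
Strong <- data.frame(
  age    = rnorm(30, mean = 40, sd = 10),
  height = rnorm(30, mean = 170, sd = 8),
  weight = rnorm(30, mean = 70, sd = 12)
)
Strong$strength <- 0.5 * Strong$weight - 0.2 * Strong$age + rnorm(30)

explicit <- lm(strength ~ height + weight + age, data = Strong)
compact  <- lm(strength ~ ., data = Strong)

# "." expands in column order, so sort by name before comparing coefficients.
same_fit <- all.equal(
  coef(explicit)[sort(names(coef(explicit)))],
  coef(compact)[sort(names(coef(compact)))]
)
print(same_fit)
```

The only difference is the order of the terms: the dot follows the column order of the data frame.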
I had to transform a response variable (e.g. Variable1) to fulfil the assumptions of linear models in lmer, using an approach suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ for heavy-tailed data and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation, as suggested in "Specifying that model is logit transformed to plot backtransformed trends":
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
I would suggest using a different model, or at least a different response variable.
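The non-monotonicity described above is easy to check with a few made-up numbers (a minimal sketch; the vector v is hypothetical):

```r
# Hypothetical response values; the median is 5.
v <- c(2, 4, 5, 6, 8)
transformed <- sqrt(abs(v - median(v)))
print(transformed)

# Values equally far below and above the median collapse to the same
# transformed value, so the transformation cannot be inverted.
collapsed <- isTRUE(all.equal(transformed[1], transformed[5])) &&
  isTRUE(all.equal(transformed[2], transformed[4]))
print(collapsed)
```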
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
library(lme4)
library(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.
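For intuition only, here is a base-R sketch of the symmetric square-root transformation mentioned above and its inverse. The helper names sym_sqrt and sym_sqrt_inv are hypothetical and hand-written; they are not emmeans code, and in practice make.tran("sympower", ...) is what supplies these functions.

```r
# Signed square root of the distance from a center (assumed form of the
# symmetric square-root transformation).
sym_sqrt <- function(v, center) sign(v - center) * sqrt(abs(v - center))

# Inverse: square the value, restore its sign, and shift back by the center.
sym_sqrt_inv <- function(u, center) center + sign(u) * u^2

v <- c(2, 4, 5, 6, 8)   # hypothetical response values
m <- median(v)
u <- sym_sqrt(v, m)

print(all(diff(u) > 0))                  # strictly increasing, hence monotone
print(all.equal(sym_sqrt_inv(u, m), v))  # round trip recovers the original values
```

Because the sign is kept, values below and above the median no longer collapse together, which is what makes back-transformation possible.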
I have:
myColCI<-function(colName){
predictorvariable <- glm(death180 ~ nepalData[,colName], data=nepalData, family="binomial")
summary(predictorvariable)
confint(predictorvariable)
}
One of the column names is parity, but after defining my function, when I call myColCI(parity), it says:
object "parity" is not found
Can anyone give me a pointer to what's wrong with my code?
Your formula is wrong here, hence the error. The right-hand side after the tilde shouldn't be a reference to a data frame. It should be column names separated by plus signs:
From the documentation of ?formula
The models fit by, e.g., the lm and glm functions are specified in a
compact symbolic form. The ~ operator is basic in the formation of
such models. An expression of the form y ~ model is interpreted as a
specification that the response y is modelled by a linear predictor
specified symbolically by model. Such a model consists of a series of
terms separated by + operators. The terms themselves consist of
variable and factor names separated by : operators. Such a term is
interpreted as the interaction of all the variables and factors
appearing in the term.
dependent_variable ~ Independent_variable1 + Independent_variable2 etc
From data mtcars, I have written a glm formula as :
glm(am ~ mpg + disp + hp, data=mtcars, family="binomial")
So, your formula should be something like this:
glm(death180 ~ column1 + column2 +column3, data= nepalData, family="binomial")
To invoke this inside a function, since it seems you have only one dependent variable, you can use the code below (note that it removes the data frame reference and uses as.formula to convert the pasted string into a valid formula):
myColCI<-function(colName){
predictorvariable <- glm(as.formula(paste0("death180 ~", colName)), data=nepalData, family="binomial")
summary(predictorvariable)
confint(predictorvariable)
}
myColCI("parity")
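An alternative to pasting strings together is stats::reformulate(), which builds the formula from character names. The sketch below uses the built-in mtcars data because nepalData is not available here; the function name fit_one_predictor and its arguments are hypothetical.

```r
# Fit a one-predictor logistic regression, with the column given as a string.
fit_one_predictor <- function(col_name, response, data) {
  f <- reformulate(col_name, response = response)   # e.g. am ~ mpg
  fit <- glm(f, data = data, family = binomial)
  # Return both pieces; inside a function only the last value would print.
  list(formula = f, summary = summary(fit), confint = confint(fit))
}

res <- fit_one_predictor("mpg", "am", mtcars)
print(res$formula)
print(res$confint)
```

Returning a list also addresses the fact that, in the original function, the result of summary() is silently discarded and only confint() is returned.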
This is what you have to do
myColCI("parity")
This should give you the answer.
I don't think your code is wrong. You are trying to subset a column of a data frame using the column name. In R, when you do this, you have to pass the name as a string. eg
dataframe[,"column_name"]
For your function, if you want to have both the confidence intervals and summary output, you would have to modify a little bit. Just like this:
myColCI<-function(colName){
predictorvariable <- glm(death180 ~ nepalData[,colName],
data=nepalData, family="binomial")
return(list(summary(predictorvariable),confint(predictorvariable)))
}
Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and not least causing you the problem that you are hitting. When you now call predict, it will be looking for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict()
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them, is legibility. Getting the object name out of the formula allows you to more easily see what the model is.
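The difference can be reproduced with built-in data. In this sketch mtcars is split arbitrarily into hypothetical training and test sets, with mpg and wt standing in for dependent and independent.

```r
trainingset <- mtcars[1:22, ]
testset     <- mtcars[23:32, ]

# Recommended: bare column names in the formula plus a data argument.
c_lm   <- lm(mpg ~ wt, data = trainingset)
c_pred <- predict(c_lm, newdata = testset)
print(length(c_pred))    # one prediction per row of testset

# Problematic: the formula hard-codes trainingset, so predict() cannot find
# the variable in newdata, falls back to the training data and warns.
bad_lm   <- lm(trainingset$mpg ~ trainingset$wt)
bad_pred <- suppressWarnings(predict(bad_lm, newdata = testset))
print(length(bad_pred))  # one value per row of trainingset instead
```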
I have seen how to use the ~ operator in formulas. For example, y~x means: y is distributed as x.
However, I am really confused about what ~0+a means in this code:
require(limma)
a = factor(1:3)
model.matrix(~0+a)
Why just model.matrix(a) does not work? Why the result of model.matrix(~a) is different from model.matrix(~0+a)? And finally what is the meaning of ~ operator here?
~ creates a formula - it separates the righthand and lefthand sides of a formula
From ?`~`
Tilde is used to separate the left- and right-hand sides in model formula
Quoting from the help for formula
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x.
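The expansions described in the quoted help text can be inspected without fitting anything, by asking terms() for the term labels. This is a small sketch; labels_of is a hypothetical helper name.

```r
# Return the expanded term labels of a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

print(labels_of(y ~ a * b))                  # crossing: main effects plus interaction
print(labels_of(y ~ (a + b + c)^2))          # main effects plus all two-way interactions
print(labels_of(y ~ (a + b + c)^2 - a:b))    # same, with a:b removed

# The intercept attribute is 1 when an intercept is present and 0 when removed.
print(attr(terms(y ~ x), "intercept"))
print(attr(terms(y ~ x - 1), "intercept"))
print(attr(terms(y ~ 0 + x), "intercept"))
```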
So, regarding the specific issue with ~0+a:
You are creating a model matrix without an intercept. As a is a factor, model.matrix(~a) returns an intercept column, which takes the place of a1 (you need n-1 indicators to fully specify n classes).
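Comparing the column names of the two matrices makes the difference concrete (a minimal sketch using the factor from the question):

```r
a <- factor(1:3)

with_intercept    <- model.matrix(~ a)      # intercept plus indicators for levels 2 and 3
without_intercept <- model.matrix(~ 0 + a)  # one indicator column per level

print(colnames(with_intercept))
print(colnames(without_intercept))
```

With the intercept, level 1 is the baseline and has no column of its own; without it, every level gets its own indicator.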
The help files for each function are well written, detailed and easy to find!
why doesn't model.matrix(a) work
model.matrix(a) doesn't work because a is a factor variable, not a formula or terms object
From the help for model.matrix
object an object of an appropriate class. For the default method, a
model formula or a terms object.
R is looking for a particular class of object; by passing ~a you are passing an object of class formula. model.matrix(terms(~a)) would also work (passing the terms object corresponding to the formula ~a).
general note
As @BenBolker helpfully notes in his comment, this is a modified version of Wilkinson-Rogers notation.
There is a good description in the Introduction to R.
After reading several manuals, I was confused by the meaning of model.matrix(~0+x) until recently, when I found this excellent book chapter.
In mathematics 0+a is equal to a, and writing a term like 0+a looks very strange. However, here we are dealing with linear models: think of a simple high-school equation such as y=ax+b that describes the relationship between the predictor variable (x) and the observation (y).
So we can think of ~0+x, or equally ~x+0, as an equation of the form y=ax+b. By adding 0 we force b to be zero, which means we are looking for a line passing through the origin (no intercept). If we specified a model like ~x+1 or just ~x, the fitted equation could possibly contain a non-zero term b. Equally, we may restrict b with a formula such as ~x-1 or ~-1+x, both of which mean no intercept (the same way we exclude a row or column in R with a negative index). However, something like ~x-2 or ~x+3 is meaningless.
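A quick sketch with simulated data shows the effect on the fitted coefficients (x and y below are made up for illustration):

```r
set.seed(42)
x <- 1:20
y <- 3 * x + 5 + rnorm(20)

fit_default <- lm(y ~ x)       # estimates both the intercept b and the slope a
fit_zero    <- lm(y ~ 0 + x)   # forces the line through the origin
fit_minus   <- lm(y ~ x - 1)   # same model, written with - 1

print(names(coef(fit_default)))
print(names(coef(fit_zero)))
print(all.equal(coef(fit_zero), coef(fit_minus)))
```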
Thanks to @mnel for the useful comment. Finally, why use ~ and not =? In standard mathematical notation, y~x denotes that y is equivalent to x; it is somewhat weaker than y=x. When you are fitting a linear model, you aren't really saying y=x, but rather that you can model y as a linear function of x (y = ax+b, for example).
To answer part of your question, tilde is used to separate the left- and right-hand sides in model formula. See ?"~" for more help.
I once saw the GLMM modeling building process using the following script:
dative.glmm8 <- lmer(RealizationOfRecipient ~ AnimacyOfRec + DefinOfRec +
PronomOfRec * PronomOfTheme + I(AccessOfRec=="given") + AnimacyOfTheme + DefinOfTheme +
I(AccessOfTheme=="given") + log(RatioOfLengthsThemeOverRecipient) + (1|Verb),
family="binomial")
I do not understand the passed argument of "I(AccessOfTheme=="given")"? What is the physical meaning of this kind of argument setting?
This question is not actually lmer-specific, but applies to all model formulas in R. In a formula context, I() stands for "insulate": from http://cran.r-project.org/doc/manuals/R-intro.pdf ,
I(M) Insulate M. Inside M all operators have their normal arithmetic
meaning, and that term appears in the model matrix.
This is essentially creating a dummy (0/1) variable on the fly for AccessOfRec being equal to "given" (1) or anything else (0).
You could also do this by creating the variable beforehand, e.g. AccessOfRec_given <- (AccessOfRec=="given"), and then using the derived variable in the formula.
By the way, I would strongly recommend using the data argument to lmer, rather than either using variables from the global workspace or attach()ing data frames.
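The on-the-fly dummy variable, and the reason I() is needed for arithmetic inside a formula, can both be seen in a small sketch. The data frame d and its column names are hypothetical.

```r
# Hypothetical data: an access status and a numeric covariate.
d <- data.frame(
  access = c("given", "new", "given", "accessible"),
  x      = c(1, 2, 3, 4)
)

# I(access == "given") is evaluated as ordinary R code, giving TRUE/FALSE,
# which enters the model matrix as a single 0/1 column.
mm <- model.matrix(~ I(access == "given"), data = d)
print(mm)

# Without I(), ^ is a formula operator (crossing), so x^2 reduces to x.
print(attr(terms(~ x^2), "term.labels"))
print(attr(terms(~ I(x^2)), "term.labels"))
```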