Tilde (~) operator in R

According to the R documentation, the ~ operator is used in a formula to separate its left- and right-hand sides. The right-hand side holds the independent variables and the left-hand side holds the dependent variable. I understand how ~ is used in the lm() function. However, what does the following mean?
x~ 1
The right-hand side is 1. What does it mean? Can it be any other number instead of 1?

From ?formula:
[..] when fitting a linear model y ~ x - 1 specifies a line through the
origin [..]
The "-" in the formula removes a specified term.
So y ~ 1 is just a model with a constant (intercept) and no regressor.
lm(mtcars$mpg ~ 1)
#Call:
#lm(formula = mtcars$mpg ~ 1)
#
#Coefficients:
#(Intercept)
# 20.09
Can it be any other number instead of 1?
No, just try and see.
lm(mtcars$mpg ~ 0) tells R to remove the constant (equivalent to y ~ -1), and lm(mtcars$mpg ~ 2) gives an error (correctly).
You should read the 1 in y ~ 1 as "constant": inside a formula it is not an ordinary number.
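The three cases above can be checked directly. This is only a sketch on the built-in mtcars data; the variable names (fit_const, fit_none, other_number_fails) are made up for illustration.

```r
# Intercept-only model: the single coefficient is the constant term.
fit_const <- lm(mpg ~ 1, data = mtcars)
n_const <- length(coef(fit_const))   # one coefficient, named "(Intercept)"

# 0 (like -1) drops the constant, so nothing is left to estimate.
fit_none <- lm(mpg ~ 0, data = mtcars)
n_none <- length(coef(fit_none))

# Any other bare number is rejected when the formula is turned into terms.
other_number_fails <- inherits(
  try(lm(mpg ~ 2, data = mtcars), silent = TRUE),
  "try-error"
)
```

try(..., silent = TRUE) is used here only so that the failing call does not stop the script.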

Related

Do lm object coefficients always list intercept first?

In coef(l), where l is an object of class "lm", is (Intercept) always listed first?
R's source code for lm() is not so straightforward. lm() appears to call lm.fit(), which gets coefficients by calling a C function with .Call(C_Cdqrls, x, y, tol, FALSE), which ultimately calls a least squares fitting routine in FORTRAN according to this informative blog post. I'm not really familiar enough with R internals or actual code to do least squares regression to answer my question.
No, only when you have an intercept. The intercept is implicit in a formula, but you can specify a model without it using - 1 or 0 +:
x <- rnorm(20)
y <- rnorm(20, 10)
> coef(lm(y ~ x + I(x^2)))
(Intercept) x I(x^2)
10.3035412 -0.1506304 -0.3092836
> coef(lm(y ~ I(x^3) + x - 1))
I(x^3) x
-0.5094851 -0.6598634
The coefficients are generally listed in the order in which the terms appear in the formula (main effects are placed before interactions). If there is an intercept, it comes first. But as in many other situations in R, if you need the value of a specific component (the intercept or any other), it is good practice to extract it by its name. Indexing by name returns NA if the object doesn't have it:
intercept <- coef(model)["(Intercept)"]
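As a sketch of the name-based lookup, the snippet below refits the two models from above on freshly simulated data. The seed and the helper has_intercept() are hypothetical additions for illustration, not part of the original answer.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- rnorm(20)
y <- rnorm(20, 10)

with_int <- lm(y ~ x + I(x^2))
no_int   <- lm(y ~ I(x^3) + x - 1)

# Hypothetical helper: does the fitted model carry an intercept coefficient?
has_intercept <- function(fit) "(Intercept)" %in% names(coef(fit))

first_name      <- names(coef(with_int))[1]                    # intercept comes first
found_with      <- has_intercept(with_int)
found_without   <- has_intercept(no_int)
missing_by_name <- unname(is.na(coef(no_int)["(Intercept)"]))  # lookup by name gives NA
terms_flag      <- attr(terms(no_int), "intercept")            # 0 when the intercept was removed
```

Checking attr(terms(fit), "intercept") is an alternative to inspecting the coefficient names, since the terms object records whether the intercept was kept.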

I(variable_1/variable_2) in R regression

R> data("FoodExpenditure", package = "betareg")
R> fe_lm <- lm(I(food/income) ~ income + persons, data = FoodExpenditure)
From what I understand, I(food/income) creates a new variable that is the ratio of food to income. Is that correct? Are any other combinations (functions) possible?
Observe that these two results are the same
# transformation in formula
lm(I(food/income) ~ income + persons, data = FoodExpenditure)
# Call:
# lm(formula = I(food/income) ~ income + persons, data = FoodExpenditure)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
# transformation in data
dd <- transform(FoodExpenditure, ratio=food/income)
lm(ratio ~ income + persons, data = dd)
# Call:
# lm(formula = ratio ~ income + persons, data = dd)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
Inside a formula passed to lm(), I() lets you apply any transformation of the variables that you like. (Just make sure the transformation doesn't change the number of rows; otherwise the model can't be fitted properly.)
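To see why I() matters, it may help to compare a term written with and without it. The following is a sketch on simulated data; the data frame d and the fit names are assumptions made for the example.

```r
set.seed(1)                                   # arbitrary seed
d <- data.frame(x = runif(30, 1, 5))
d$y <- 2 + 0.5 * d$x^2 + rnorm(30, sd = 0.1)

# Without I(), ^ is the formula crossing operator, so x^2 collapses to x.
fit_symbolic <- lm(y ~ x^2, data = d)

# With I(), ^ is ordinary exponentiation and the regressor is x squared.
fit_arithmetic <- lm(y ~ I(x^2), data = d)

names_symbolic   <- names(coef(fit_symbolic))
names_arithmetic <- names(coef(fit_arithmetic))
```

The coefficient names show the difference: the first model regresses y on x itself, the second on the squared values.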
Yes and Yes.
Other possible combinations and operators are given in the documentation for formula ?formula. What is below is mostly taken from it.
: operator denotes interactions between terms.
* operator denotes factor crossing: ‘a*b’ is interpreted as ‘a+b+a:b’.
^ operator indicates crossing to the specified degree. For example, ‘(a+b+c)^2’ is identical to ‘(a+b+c)*(a+b+c)’, which in turn expands to a formula containing the main effects for ‘a’, ‘b’ and ‘c’ together with their second-order interactions.
%in% operator indicates that the terms on its left are nested within those on the right. For example, ‘a + b %in% a’ expands to the formula ‘a + a:b’.
- operator removes the specified terms, so that ‘(a+b+c)^2 - a:b’ is identical to ‘a + b + c + b:c + a:c’. It can also be used to remove the intercept term: when fitting a linear model, ‘y ~ x - 1’ specifies a line through the origin. A model with no intercept can also be specified as ‘y ~ x + 0’ or ‘y ~ 0 + x’.
Arithmetic expressions: while formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula ‘log(y) ~ a + log(x)’ is quite legal. When such expressions use operators that also have a symbolic meaning in model formulae, arithmetic and symbolic use can be confused.
I(): to avoid this confusion, the function ‘I()’ can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula ‘y ~ a + I(b+c)’, the term ‘b+c’ is to be interpreted as the sum of ‘b’ and ‘c’.
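Some of these expansions can be inspected without fitting anything, because terms() works on a bare formula. A minimal sketch follows; term_labels() is a hypothetical helper, and a, b, c, x, y are just symbols that need not exist as variables.

```r
# Hypothetical helper: list the terms a formula expands to.
term_labels <- function(f) attr(terms(f), "term.labels")

crossing <- term_labels(y ~ a * b)                # main effects plus a:b
second   <- term_labels(y ~ (a + b + c)^2)        # main effects plus all two-way interactions
dropped  <- term_labels(y ~ (a + b + c)^2 - a:b)  # same, with a:b removed

# The intercept is stored separately from the term labels.
with_icpt <- attr(terms(y ~ x), "intercept")      # 1
no_icpt   <- attr(terms(y ~ x - 1), "intercept")  # 0
```

Printing crossing, second and dropped is a quick way to confirm what a more complicated formula will actually fit.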

linear regression function creating a list instead of a model

I'm trying to fit an lm model using R. However, for some reason this code creates a list of the data instead of the usual regression model.
The code I use is this one
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden )
But instead of the usual coefficients, the title of the variable appears mixed with the data in this way:
(Intercept) Inclusivity0.631 Inclusivity0.681 Inclusivity0.716 Inclusivity0.9
35.00 -4.00 -6.74 -4.30 4.90
Does anybody know why this happened and how it can be fixed?
What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
It would help if you could provide a sample of your data, but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor. For example:
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
soc_vote=1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept) incl0.681 incl0.716
1 1 2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept) incl
-13.74 23.29
Note that I needed to convert to character first, since converting a factor directly to a number returns the integer codes of its levels rather than the values shown in the labels.
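The same pitfall can be reproduced in base R without the tidyverse. This is a sketch with made-up values; incl stands in for the Inclusivity column.

```r
# A numeric-looking column that was read in as a factor.
incl <- factor(c("0.631", "0.681", "0.716", "0.9"))

codes  <- as.numeric(incl)                # internal level codes: 1 2 3 4
values <- as.numeric(as.character(incl))  # the numbers shown in the labels

# Equivalent idiom that converts each level once and then indexes by the codes.
values_alt <- as.numeric(levels(incl))[incl]
```

Running str() on the data frame before modelling is a quick way to spot columns that are factors when they were expected to be numeric.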

What is the difference between Y ~ X and Y ~ X+1?

What is the difference between Y ~ X and Y ~ X+1 in R?
In the context of lm(), they are exactly the same. Both models include the intercept.
To remove the intercept, you could write Y ~ X - 1 or Y ~ X + 0.
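One way to convince yourself is to compare the design matrices that the formulas produce. The sketch below uses the built-in mtcars data; the object names are arbitrary.

```r
mm_plain <- model.matrix(mpg ~ wt, data = mtcars)      # implicit intercept
mm_plus1 <- model.matrix(mpg ~ wt + 1, data = mtcars)  # explicit intercept
mm_none  <- model.matrix(mpg ~ wt - 1, data = mtcars)  # intercept removed

same_design <- isTRUE(all.equal(mm_plain, mm_plus1))
cols_plain  <- colnames(mm_plain)
cols_none   <- colnames(mm_none)
```

The first two matrices both contain a column of ones named (Intercept); the third contains only wt.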
What is the context of your question?
As has been mentioned, in lm and other functions that use model.matrix internally, they are the same. But there are other cases where they differ. Consider the following code:
plot.new()
text( .5, .1, y ~ x )
text( .5, .3, y ~ x + 1 )
Here they are different (and running the code shows the difference).
For any other functions or contexts it will depend on the implementation.
These 2 lines give the same results:
plot( Petal.Length ~ Species, data=iris )
plot( Petal.Length ~ Species + 1, data=iris )
But these don't:
library(lattice)
bwplot( Petal.Length ~ Species, data=iris )
bwplot( Petal.Length ~ Species + 1, data=iris )
I remember seeing one time (though it may have been in S-Plus rather than R and may not be possible in R) a formula that included a 0+ or -1 at the beginning and a +1 later in the formula. It constructed the main effects without an intercept (fit mean for each level of 1st factor) but the +1 at the right place changed the way the interactions were coded.
In theory there could be modeling functions (I can't think of any, but they could exist or be written in the future) that take a formula but do not include an intercept by default and so the +1 would be needed to specify an intercept.
So, what context are you asking your question in?

What does the R formula y~1 mean?

I was reading the documentation on R Formula, and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, the sample formulas tend to look like y ~ 1.
For a simple case like y ~ x, the formula defines a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context, if you look at the depmixS4 documentation, there is one example below
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the symbol 1 indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To give operators in a model term their usual mathematical meaning, wrap the term in I().) An intercept is usually assumed, so a 1, 0 or -1 most commonly appears when a model is explicitly stated without an intercept.
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
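As a quick check that the three no-intercept spellings agree, the sketch below fits each on simulated data and compares the slope with the closed-form least-squares estimate for a line through the origin, sum(x * y) / sum(x^2). The data and object names are assumptions made for the example.

```r
set.seed(7)                        # arbitrary seed
x <- runif(25, 0, 10)
y <- 3 * x + rnorm(25)

b_zero_plus  <- unname(coef(lm(y ~ 0 + x)))
b_minus_lead <- unname(coef(lm(y ~ -1 + x)))
b_minus_tail <- unname(coef(lm(y ~ x - 1)))

# Least-squares slope of a line forced through the origin.
b_closed <- sum(x * y) / sum(x^2)
```

All four values should coincide up to floating-point error.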
In the specific case you mention ( y ~ 1 ), y is being predicted by no other variable so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city, package = "boot")  # the city data set ships with the boot package
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting formulae from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this would (roughly speaking) represent:
y = β0 + β1(x1) + β2(x2)
Which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit +1 in the above formula. So really, the formula above is y ~ 1 + x1 + x2.
We could have a very simple formula, in which y does not depend on any other variable. This is the formula that you are referring to,
y ~ 1, which would roughly equate to
y = β0(1) = β0
As @Paul points out, when you solve the simple model, you get β0 = mean(y).
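The reason is the least-squares criterion itself. As a short sketch of the standard derivation (assuming ordinary least squares with n observations y_i), minimising the sum of squared residuals over β0 gives:

```latex
S(\beta_0) = \sum_{i=1}^{n} (y_i - \beta_0)^2,
\qquad
\frac{dS}{d\beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0) = 0
\;\Longrightarrow\;
\hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
```

So the fitted intercept of y ~ 1 is the sample mean of y.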
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5
In general, such a formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side holds the dependent variables, the right-hand side the independent ones. The independent variables are used to calculate the trend component of the linear model; the residuals are then assumed to follow some kind of distribution. When the right-hand side is just 1, as in ~ 1, the trend component is a single value (for an ordinary linear model, the mean of the data), i.e. the model has only an intercept.
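The same reading of ~ 1 carries over to models with other response distributions. The sketch below uses glm() on the built-in mtcars data rather than depmixS4, which is an assumption made to keep the example self-contained.

```r
# Gaussian response: the intercept is the sample mean.
fit_gauss  <- glm(mpg ~ 1, family = gaussian(), data = mtcars)
gauss_mean <- unname(coef(fit_gauss))

# Binomial response: the intercept is on the logit scale, so back-transform it.
fit_binom  <- glm(am ~ 1, family = binomial(), data = mtcars)
binom_prob <- unname(plogis(coef(fit_binom)))
```

Here gauss_mean should match mean(mtcars$mpg) and binom_prob should match mean(mtcars$am): with no regressors, the model estimates a single constant level for the response.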
