What does the R formula y~1 mean?

I was reading the documentation on R Formula, and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, sample formulas tend to be something like y ~ 1.
For a simple case like y ~ x, the formula defines a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context: if you look at the depmixS4 documentation, there is this example:
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 confuse me. Can anyone explain what ~ 1 or y ~ 1 means?

Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 1 indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common mathematical meaning of such terms, wrap them in I().) Intercepts are usually assumed, so it is most common to see a 1 in the context of explicitly stating a model without an intercept.
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
In the specific case you mention (y ~ 1), y is being predicted by no other variable, so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city)
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting formulas from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
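For example, formula() pulls the formula back out of a fitted model object (a small sketch using the built-in cars data):
fit <- lm(dist ~ speed, data = cars)
formula(fit)   # recovers the formula stored in the model object
# dist ~ speed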

If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
Which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit +1 in the above formula. So really, the formula above is y ~ 1 + x1 + x2
We could have a very simple formula whereby y does not depend on any other variable. This is the formula that you are referencing:
y ~ 1, which roughly equates to
y = β0(1) = β0
As Paul points out, when you solve this simple model, you get β0 = mean(y).
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5

In general such a formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side contains the dependent variables, the right-hand side the independent ones. The independent variables are used to calculate the trend component of the linear model, and the residuals are then assumed to follow some distribution. When the independent side is just a one (~ 1), the trend component is a single value, e.g. the mean value of the data; i.e., the linear model only has an intercept.
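A small illustration of that last point with the built-in cars data: with ~ 1 the trend component is just the mean, so the residuals are the centered data:
fit <- lm(dist ~ 1, data = cars)
coef(fit)   # the intercept equals mean(cars$dist)
all.equal(unname(residuals(fit)), cars$dist - mean(cars$dist))
# [1] TRUE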

Related

Do lm object coefficients always list intercept first?

In coef(l), where l is an object of class "lm", is (Intercept) always listed first?
R's source code for lm() is not so straightforward. lm() appears to call lm.fit(), which gets coefficients by calling a C function with .Call(C_Cdqrls, x, y, tol, FALSE), which ultimately calls a least squares fitting routine in FORTRAN according to this informative blog post. I'm not really familiar enough with R internals or actual code to do least squares regression to answer my question.
No, only when you have an intercept. The intercept is implicit in the formula, but you can specify a model without it using - 1 or 0 +:
x <- rnorm(20)
y <- rnorm(20, 10)
> coef(lm(y ~ x + I(x^2)))
(Intercept) x I(x^2)
10.3035412 -0.1506304 -0.3092836
> coef(lm(y ~ I(x^3) + x - 1))
I(x^3) x
-0.5094851 -0.6598634
The coefficients will be listed in the order they appear in the formula. If there is an intercept, it will be first. But as in many other situations in R, if you need to obtain the value of a specific component (the intercept or any other), it is good practice to access it by name; it will return NA if the object doesn't have it:
intercept <- coef(model)["(Intercept)"]
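For example, reusing x and y from above, a model fitted without an intercept returns NA when you ask for one by name:
m <- lm(y ~ x - 1)       # no intercept
coef(m)["(Intercept)"]   # name not present, so indexing returns NA
# <NA>
#   NA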

I(variable_1/variable_2) in R regression

R> data("FoodExpenditure", package = "betareg")
R> fe_lm <- lm(I(food/income) ~ income + persons, data = FoodExpenditure)
From what I understand, I(food/income) creates a new variable which is the ratio of food over income; is that correct? Are there any other combinations (functions) possible?
Observe that these two results are the same
# transformation in formula
lm(I(food/income) ~ income + persons, data = FoodExpenditure)
# Call:
# lm(formula = I(food/income) ~ income + persons, data = FoodExpenditure)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
# transformation in data
dd <- transform(FoodExpenditure, ratio=food/income)
lm(ratio ~ income + persons, data = dd)
# Call:
# lm(formula = ratio ~ income + persons, data = dd)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
The I() function in a formula with lm() allows you to apply any function to the variables that you like. (Just make sure the function doesn't change the number of rows, otherwise you can't fit the model properly.)
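Any row-preserving transformation works; for instance, a log-ratio version of the same model (a sketch, assuming the FoodExpenditure data loaded above, where food and income are positive):
# log() is a function call, so the / inside it is ordinary division
lm(log(food/income) ~ income + persons, data = FoodExpenditure)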
Yes and Yes.
Other possible combinations and operators are given in the documentation for formula (?formula). What follows is mostly taken from it.
: denotes interactions between terms.
* denotes factor crossing: ‘a*b’ is interpreted as ‘a + b + a:b’.
^ indicates crossing to the specified degree. For example, ‘(a+b+c)^2’ is identical to ‘(a+b+c)*(a+b+c)’, which in turn expands to a formula containing the main effects for ‘a’, ‘b’ and ‘c’ together with their second-order interactions.
%in% indicates that the terms on its left are nested within those on the right. For example, ‘a + b %in% a’ expands to the formula ‘a + a:b’.
- removes the specified terms, so that ‘(a+b+c)^2 - a:b’ is identical to ‘a + b + c + b:c + a:c’. It can also be used to remove the intercept term: when fitting a linear model, ‘y ~ x - 1’ specifies a line through the origin. A model with no intercept can also be specified as ‘y ~ x + 0’ or ‘y ~ 0 + x’.
arithmetic expressions: while formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula ‘log(y) ~ a + log(x)’ is quite legal.
I(): to avoid confusion between the formula and arithmetic meanings of these operators, the function ‘I()’ can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula ‘y ~ a + I(b+c)’, the term ‘b+c’ is interpreted as the sum of ‘b’ and ‘c’.

Linear fit without slope in r

I want to fit a linear model with no slope and extract information from it. My objective is to find the best y-intercept for a horizontal line in a data set, and also to evaluate from the derived linear fit whether y shows a particular behavior (x is a date). I've been using range() to evaluate behavior, but I'm looking for a unitless index.
Removing y-intercept:
X <- 1:10
Y <- 2:11
lm1 <- lm(Y~X + 0, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 1
lm1 <- lm(Y~X - 1, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 2
lm1 <- lm(Y~0 + X, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 3
lm1$coefficients
X
1.142857
summary(lm1)$r.squared
[1] 0.9957567
All of the models above have a nonzero r.squared (0.9957567 here). But if I evaluate:
lm2 <- lm(Y~1, data = data.frame(X=X,Y=Y))
lm2$coefficients
(Intercept)
6.5
summary(lm2)$r.squared
[1] 0
Is there a way to calculate this outside of lm(), or to compute an index that identifies how well y is represented by a horizontal line?
Let lmObject be your linear model returned by lm (called with y = TRUE to return y).
If your model has intercept, then R-squared is computed as
with(lmObject, 1 - c(crossprod(residuals) / crossprod(y - mean(y))) )
If your model does not have an intercept, then R-squared is computed as
with(lmObject, 1 - c(crossprod(residuals) / crossprod(y)) )
Note: if your model has only an intercept (so it certainly falls under the first case above), you have
residuals = y - mean(y)
thus R-squared is always 1 - 1 = 0.
In regression analysis, it is always recommended to include an intercept in the model to get unbiased estimates. A model with only an intercept is the NULL model; any other model is compared with this NULL model for further analysis of variance.
A note: the value/quantity you want has nothing to do with regression. You can simply compute it as
c(crossprod(Y - mean(Y)) / crossprod(Y)) ## `Y` is your data
#[1] 0.1633663
Alternatively, use
(length(Y) - 1) * var(Y) / c(crossprod(Y))
#[1] 0.1633663
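As a quick check, the no-intercept formula above reproduces the r.squared reported in the question (a sketch reusing X and Y from there; note that lm() must be called with y = TRUE so that lmObject$y exists):
X <- 1:10
Y <- 2:11
fit <- lm(Y ~ 0 + X, y = TRUE)  # no-intercept model; keep the response in the fit
with(fit, 1 - c(crossprod(residuals) / crossprod(y)))
# [1] 0.9957567
summary(fit)$r.squared          # lm reports the same no-intercept R-squared
# [1] 0.9957567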

`rms::ols()`: how to fit a model without intercept

I'd like to use the ols() (ordinary least squares) function from the rms package to do a multivariate linear regression, but I would not like it to calculate the intercept. Using lm(), the syntax would be like:
model <- lm(formula = z ~ 0 + x + y, data = myData)
where the 0 stops it from calculating an intercept, and only two coefficients are returned, one for x and the other for y. How do I do this when using ols()?
Trying
model <- ols(formula = z ~ 0 + x + y, data = myData)
did not work; it still returns an intercept and a coefficient each for x and y.
Here is a link to a csv file
It has five columns; for this example, only the first three are used:
model <- ols(formula = CorrEn ~ intEn_anti_ncp + intEn_par_ncp, data = ccd)
Thanks!
rms::ols uses rms:::Design instead of model.frame.default. Design is called with the default intercept = 1, so there is no (obvious) way to specify that there is no intercept. I assume there is a good reason for this, but you can try changing ols using trace().

How to fit predefined offsets to models containing categorical variables in R

Using the following data:
http://pastebin.com/4wiFrsNg
I am wondering how to fit a predefined offset to the raw relationship of another model, i.e. how to fit the estimates from Model A, thus:
ModelA<-lm(Dependent1~Explanatory)
to model B thus:
ModelB<-lm(Dependent2~Explanatory)
where the explanatory variable is either the variable "Categorical" in my dataset, or the variable "Continuous". I got a useful answer to a similar question on CV:
https://stats.stackexchange.com/questions/62584/how-to-fit-a-specific-model-to-some-data
Here the explanatory variable was "Continuous". However, I had some extra questions that I thought might be more suited to SO. If this is not the case, tell me and I will delete this question :)
Specifically, I was advised in the link above that in order to fit a predefined slope for the continuous explanatory variable in my dataset I should do this:
lm( Dependent2 ~ 1 + offset( Slope * Continuous ) )
where Slope is the predefined slope taken from Model A. That worked great.
Now I am wondering, how do I do the same when x is a categorical variable with two levels, and then when x is a continuous variable with a quadratic term i.e. x+x^2?
For the quadratic term I am trying:
lm( Dependent2 ~ 1 + offset( Slope * Continuous )+ offset( Slope2 * I((Continuous)^2)) )
where Slope is the value of the fixed estimate for the Continuous term, and Slope2 is the value of the fixed estimate for the quadratic term.
I am unsure how to get this working for a categorical variable, however. When I try to fit an offset as:
lm( Dependent2 ~ 1 + offset( Slope * Categorical ) )
where again Slope is the fixed estimate taken from Model A, I get an error:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
In addition: Warning message:
In Ops.factor(0.25773, Categorical) : * not meaningful for factors"
If anyone has any input on how to create offsets for categorical variables, it would be greatly appreciated :)
The best you can probably do is compute the offset manually for each level of your factor:
x <- rep(1:3, each=10)
df <- data.frame(x=factor(x), y=3 - x)
# compute the offset for each level of x
df$o <- with(df, ifelse(x == "1", 2, ifelse(x == "2", 1, 0)))
# fitted coef's for the models below will all be zero due to presence of offset
lm(y - o ~ x - 1, data=df)
# or
lm(y ~ x + offset(o), data=df)
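If the factor had more levels, a more general approach (a sketch, not from the answer above; the slopes vector here is hypothetical and would come from Model A in practice) is to build the offset from the dummy-coded design matrix:
# predefined coefficients for the non-reference levels of x (assumed values)
slopes <- c(x2 = -1, x3 = -2)
# dummy columns for levels 2 and 3 (drop the intercept column)
mm <- model.matrix(~ x, data = df)[, -1]
df$o2 <- as.vector(mm %*% slopes[colnames(mm)])
# x2 and x3 coefficients come out zero; the intercept absorbs the baseline
lm(y ~ x + offset(o2), data = df)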
