What is the difference between Y ~ X and Y ~ X+1?

What is the difference between Y ~ X and Y ~ X+1 in R?

In the context of lm(), they are exactly the same. Both models include the intercept.
To remove the intercept, you could write Y ~ X - 1 or Y ~ X + 0.
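For example (a quick check of my own, using the built-in mtcars data), both forms fit the same no-intercept model:
coef(lm(mpg ~ wt - 1, data=mtcars))  # a single "wt" coefficient, no "(Intercept)"
coef(lm(mpg ~ wt + 0, data=mtcars))  # identical result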

What is the context of your question?
As has been mentioned, in lm() and other functions that use model.matrix() internally, they are the same. But there are other cases where they differ; consider the following code:
plot.new()
text( .5, .1, y ~ x )
text( .5, .3, y ~ x + 1 )
Here they are different, and running the code shows the difference: text() renders a formula via plotmath, where ~ produces spacing, so the second label displays the extra + 1.
For any other functions or contexts it will depend on the implementation.
These 2 lines give the same results:
plot( Petal.Length ~ Species, data=iris )
plot( Petal.Length ~ Species + 1, data=iris )
But these don't:
library(lattice)
bwplot( Petal.Length ~ Species, data=iris )
bwplot( Petal.Length ~ Species + 1, data=iris )
I remember once seeing (though it may have been in S-Plus rather than R, and may not be possible in R) a formula that included 0 + or - 1 at the beginning and a + 1 later in the formula. It constructed the main effects without an intercept (fitting a mean for each level of the first factor), but the + 1 in the right place changed the way the interactions were coded.
In theory there could be modeling functions (I can't think of any, but they could exist or be written in the future) that take a formula but do not include an intercept by default and so the +1 would be needed to specify an intercept.
So, what context are you asking your question in?

Related

tilde(~) operator in R

According to the R documentation, the ~ operator is used in a formula to separate its left- and right-hand sides: the right-hand side holds the independent variables and the left-hand side the dependent variable. I understand ~ when it is used with lm(). However, what does the following mean?
x ~ 1
The right-hand side is 1. What does it mean? Can it be any other number instead of 1?
From ?lm:
[..] when fitting a linear model y ~ x - 1 specifies a line through the
origin [..]
The "-" in the formula removes a specified term.
So y ~ 1 is just a model with a constant (intercept) and no regressor.
lm(mtcars$mpg ~ 1)
#Call:
#lm(formula = mtcars$mpg ~ 1)
#
#Coefficients:
#(Intercept)
# 20.09
Can it be any other number instead of 1?
No, just try and see.
lm(mtcars$mpg ~ 0) tells R to remove the constant (equivalent to y ~ -1), and lm(mtcars$mpg ~ 2) gives an error (correctly).
You should read y ~ 1 as y ~ constant inside the formula; the 1 there is not a simple number.
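A quick check of both claims (using mtcars again):
lm(mtcars$mpg ~ 0)    # prints "No coefficients": the constant has been removed
# lm(mtcars$mpg ~ 2)  # error: invalid model formula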

How to fit predefined offsets to models containing categorical variables in R

Using the following data:
http://pastebin.com/4wiFrsNg
I am wondering how to fit a predefined offset taken from one model within another model, i.e. how to apply the estimates from Model A, thus:
ModelA <- lm(Dependent1 ~ Explanatory)
to model B thus:
ModelB <- lm(Dependent2 ~ Explanatory)
Where the explanatory variable is either the variable "Categorical" in my dataset, or the variable "Continuous". I got a useful answer related to a similar question on CV:
https://stats.stackexchange.com/questions/62584/how-to-fit-a-specific-model-to-some-data
Here the explanatory variable was "Continuous". However, I had some extra questions that I thought might be more suited to SO. If this is not the case, tell me and I will delete this question :)
Specifically, I was advised in the link above that in order to fit a predefined slope for the continuous explanatory variable in my dataset I should do this:
lm( Dependent2 ~ 1 + offset( Slope * Continuous ) )
where Slope is the predefined slope taken from Model A. This worked great.
Now I am wondering: how do I do the same when x is a categorical variable with two levels, and when x is a continuous variable with a quadratic term, i.e. x + x^2?
For the quadratic term I am trying:
lm( Dependent2 ~ 1 + offset( Slope * Continuous ) + offset( Slope2 * I(Continuous^2) ) )
where Slope is the fixed estimate for the Continuous term and Slope2 is the fixed estimate for the quadratic term.
I am unsure how to get this working for a categorical variable however. When I try to fit an offset as:
lm( Dependent2 ~ 1 + offset( Slope * Categorical ) )
where, again, Slope is the fixed estimate taken from Model A, I get an error:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
In addition: Warning message:
In Ops.factor(0.25773, Categorical) : * not meaningful for factors"
If anyone has an input on how to create offsets for categorical variables it would be greatly appreciated :)
The best you can probably do is compute the offset manually for each level of your factor:
x <- rep(1:3, each=10)
df <- data.frame(x=factor(x), y=3 - x)
# compute the offset for each level of x
df$o <- with(df, ifelse(x == "1", 2, ifelse(x == "2", 1, 0)))
# fitted coef's for the models below will all be zero due to presence of offset
lm(y - o ~ x - 1, data=df)
# or
lm(y ~ x + offset(o), data=df)
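A more general sketch (my own extension, assuming Model B uses the same factor levels as Model A): build the offset from the model matrix and the fitted coefficients instead of hard-coding it:
fitA <- lm(y ~ x, data=df)  # plays the role of "Model A" here
df$o2 <- as.vector(model.matrix(~ x, df) %*% coef(fitA))
lm(y ~ 1 + offset(o2), data=df)  # intercept is ~0: the offset explains y exactly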

What does the R formula y~1 mean?

I was reading the documentation on R Formula, and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, sample formula tends to be something like y ~ 1.
For simple case like y ~ x, it is defining a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope, and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context, if you look at the depmixS4 documentation, there is the example below:
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 'one' symbol indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common arithmetic meaning of operators inside model terms, you wrap them in I().) Intercepts are usually assumed, so it is most common to see the 1 in the context of explicitly stating a model without an intercept.
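As a quick illustration of the I() point (using the built-in mtcars data, my choice of example):
lm(mpg ~ wt + 1, data=mtcars)     # formula '+': same fit as mpg ~ wt
lm(mpg ~ I(wt + 1), data=mtcars)  # arithmetic '+': regresses on the shifted value wt + 1 (same slope, different intercept)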
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
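All three give the same through-origin fit; a quick check with the built-in cars data:
coef(lm(dist ~ 0 + speed, data=cars))
coef(lm(dist ~ -1 + speed, data=cars))
coef(lm(dist ~ speed - 1, data=cars))
# each returns the same single "speed" coefficient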
In the specific case you mention ( y ~ 1 ), y is being predicted by no other variable so the natural prediction is the mean of y, as Paul Hiemstra stated:
> library(boot)  # the city data ships with the boot package
> data(city)
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting formulas from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit + 1 in the first formula. So really, that formula is y ~ 1 + x1 + x2.
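A quick sanity check (with made-up data of my own) that the implicit and explicit forms fit identically:
set.seed(1)
d <- data.frame(x1=rnorm(10), x2=rnorm(10))
d$y <- 1 + d$x1 - d$x2 + rnorm(10)
all.equal(coef(lm(y ~ x1 + x2, data=d)), coef(lm(y ~ 1 + x1 + x2, data=d)))  # TRUE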
We could have a very simple formula, whereby y is not dependent on any other variable. This is the formula that you are referencing,
y ~ 1 which roughly would equate to
y = β0(1) = β0
As @Paul points out, when you solve this simple model, you get β0 = mean(y).
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5
In general, such a formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side holds the dependent variables, the right-hand side the independent ones. The independent variables are used to calculate the trend component of the linear model, and the residuals are then assumed to follow some distribution. When the right-hand side is just ~ 1, the trend component is a single value, e.g. the mean of the data, i.e. the linear model only has an intercept.
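Reusing dat from the example above, a quick check that with ~ 1 the residuals are just deviations from the mean:
fit <- lm(y ~ 1, data=dat)
all.equal(unname(resid(fit)), dat$y - mean(dat$y))  # TRUE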

Linear Regression with a known fixed intercept in R

I want to calculate a linear regression using the lm() function in R. Additionally I want to get the slope of a regression, where I explicitly give the intercept to lm().
I found an example on the internet and I tried to read the R-help "?lm" (unfortunately I'm not able to understand it), but I did not succeed. Can anyone tell me where my mistake is?
lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
plot (lin$x, lin$y)
regImp = lm(formula = lin$x ~ lin$y)
abline(regImp, col="blue")
# Does not work:
# Use 1 as intercept
explicitIntercept = rep(1, length(lin$x))
regExp = lm(formula = lin$x ~ lin$y + explicitIntercept)
abline(regExp, col="green")
Thanks for your help.
You could subtract the explicit intercept from the regressand and then fit the intercept-free model:
> intercept <- 1.0
> fit <- lm(I(x - intercept) ~ 0 + y, lin)
> summary(fit)
The 0 + suppresses the fitting of the intercept by lm.
Edit: to plot the fit, use
> abline(intercept, coef(fit))
P.S. The variables in your model look the wrong way round: it's usually y ~ x, not x ~ y (i.e. the regressand should go on the left and the regressor(s) on the right).
I see that you have accepted a solution using I(). I had thought that an offset()-based solution would be more obvious, but tastes vary, and after working through the offset solution I can appreciate the economy of the I() solution:
with(lin, plot(y, x))
lm_shift_up <- lm(x ~ y + 0 + offset(rep(1, nrow(lin))), data=lin)
abline(1, coef(lm_shift_up))
I have used both offset() and I(). Like BondedDust, I find offset() easier to work with, since you can set your intercept.
Assuming the intercept is 10:
plot(lin$x, lin$y)
fit <- lm(lin$y ~ 0 + lin$x, offset=rep(10, length(lin$x)))
abline(10, coef(fit), col="blue")  # add the fixed intercept back when plotting; abline(fit) alone would ignore the offset

Formula for all first and second order predictors including interactions in R

In the statistics programming language R, the following formula (e.g. in lm() or glm())
z ~ (x+y)^2
is equivalent to
z ~ x + y + x:y
Assuming I only have continuous predictors, is there a concise way to obtain
z ~ I(x^2) + I(y^2) + I(x) + I(y) + I(x*y)
A formula that does the right thing for factor predictors is a plus.
One possible solution is
z ~ (poly(x,2) + poly(y,2))^2
I am looking for something more elegant.
I don't know if it is more elegant or not, but the poly function can take multiple vectors:
z ~ poly(x, y, degree=2)
This will create all the combinations that you asked for, without additional ones. Do note that you need to specify degree=2, not just 2 when doing it this way.
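A quick check (with made-up data of my own) that the expansion contains exactly the requested terms:
d <- data.frame(x=rnorm(20), y=rnorm(20), z=rnorm(20))
colnames(model.matrix(z ~ poly(x, y, degree=2), data=d))
# "(Intercept)" plus five poly columns, one per degree combination:
# (1,0), (2,0), (0,1), (1,1), (0,2)
Note that poly() uses orthogonal polynomials by default; add raw=TRUE if you want the literal x, x^2, y, x*y, y^2 columns, which span the same model space.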
