I(variable_1/variable_2) in R regression

R> data("FoodExpenditure", package = "betareg")
R> fe_lm <- lm(I(food/income) ~ income + persons, data = FoodExpenditure)
From what I understand, I(food/income) creates a new variable which is the ratio of food to income; is that correct? Are there any other combinations (functions) possible?

Observe that these two results are the same
# transformation in formula
lm(I(food/income) ~ income + persons, data = FoodExpenditure)
# Call:
# lm(formula = I(food/income) ~ income + persons, data = FoodExpenditure)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
# transformation in data
dd <- transform(FoodExpenditure, ratio=food/income)
lm(ratio ~ income + persons, data = dd)
# Call:
# lm(formula = ratio ~ income + persons, data = dd)
#
# Coefficients:
# (Intercept) income persons
# 0.341740 -0.002469 0.025767
The I() function in a formula with lm() allows you to perform any function of the variables that you like. (Just make sure the function doesn't change the number of rows, otherwise you can't fit the model properly.)
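For example (a small sketch using the same FoodExpenditure data; these particular models are just illustrations, not recommendations):
data("FoodExpenditure", package = "betareg")
# a quadratic income term: I() protects ^ from its formula meaning
lm(I(food/income) ~ income + I(income^2) + persons, data = FoodExpenditure)
# ordinary functions such as log() need no I(), since log has no special formula meaning
lm(log(food) ~ log(income) + persons, data = FoodExpenditure)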

Yes and Yes.
Other possible combinations and operators are given in the documentation for formulas (?formula). What follows is mostly taken from it; a short sketch demonstrating these expansions appears after the list.
: denotes an interaction between the terms.
* denotes factor crossing: ‘a*b’ is interpreted as ‘a + b + a:b’.
^ indicates crossing to the specified degree. For example ‘(a+b+c)^2’ is identical to ‘(a+b+c)*(a+b+c)’, which in turn expands to a formula containing the main effects for ‘a’, ‘b’ and ‘c’ together with their second-order interactions.
%in% indicates that the terms on its left are nested within those on the right. For example ‘a + b %in% a’ expands to the formula ‘a + a:b’.
- removes the specified terms, so that ‘(a+b+c)^2 - a:b’ is identical to ‘a + b + c + b:c + a:c’. It can also be used to remove the intercept term: when fitting a linear model ‘y ~ x - 1’ specifies a line through the origin. A model with no intercept can also be specified as ‘y ~ x + 0’ or ‘y ~ 0 + x’.
arithmetic expressions: while formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula ‘log(y) ~ a + log(x)’ is quite legal.
I(): because the operators above have special meanings in a formula, the function ‘I()’ is used to bracket those portions of a model formula where the operators should be taken in their ordinary arithmetic sense. For example, in the formula ‘y ~ a + I(b+c)’, the term ‘b+c’ is to be interpreted as the sum of ‘b’ and ‘c’.
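A quick way to see these expansions without fitting anything is to look at the term labels of a formula (a small sketch with placeholder variable names a, b and c):
attr(terms(~ a * b), "term.labels")                # "a" "b" "a:b"
attr(terms(~ (a + b + c)^2), "term.labels")        # main effects plus all second-order interactions
attr(terms(~ (a + b + c)^2 - a:b), "term.labels")  # the same, with the a:b interaction removed
attr(terms(~ a + I(b + c)), "term.labels")         # "a" "I(b + c)": an arithmetic sum, not crossing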

Related

linear regression function creating a list instead of a model

I'm trying to fit an lm model using R. However, for some reason this code creates a list of the data instead of the usual regression model.
The code I use is this one
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden )
But instead of the usual coefficients, the title of the variable appears mixed with the data in this way:
(Intercept) Inclusivity0.631 Inclusivity0.681 Inclusivity0.716 Inclusivity0.9
35.00 -4.00 -6.74 -4.30 4.90
Does anybody know why this happened and how it can be fixed?
What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
It would be good if you could provide a sample of your data but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor. e.g.,
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
            soc_vote = 1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept) incl0.681 incl0.716
1 1 2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept) incl
-13.74 23.29
Note that I needed to convert to character first since otherwise I just get the ordinal equivalent of each factor.
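To see why the as.character() step matters, here is a tiny sketch (made-up values, not the original data):
f <- factor(c(0.631, 0.681, 0.716))
as.double(f)                # 1 2 3  -- the underlying level codes
as.double(as.character(f))  # 0.631 0.681 0.716  -- the original values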

tilde(~) operator in R

According to the R documentation, the ~ operator is used in a formula to separate the left- and right-hand sides of the formula. The right-hand side is the independent variable and the left-hand side is the dependent variable. I understand how ~ is used with lm(). However, what does the following mean?
x ~ 1
The right-hand side is 1. What does that mean? Can it be any other number instead of 1?
From ?lm:
[..] when fitting a linear model y ~ x - 1 specifies a line through the
origin [..]
The "-" in the formula removes a specified term.
So y ~ 1 is just a model with a constant (intercept) and no regressor.
lm(mtcars$mpg ~ 1)
#Call:
#lm(formula = mtcars$mpg ~ 1)
#
#Coefficients:
#(Intercept)
# 20.09
Can it be any other number instead of 1?
No, just try and see.
lm(mtcars$mpg ~ 0) tells R to remove the constant (equivalent to y ~ -1), and lm(mtcars$mpg ~ 2) gives an error (correctly).
You should read y ~ 1 as y ~ constant inside the formula; the 1 is not a simple number there.
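You can check that the intercept-only fit above just recovers the sample mean:
mean(mtcars$mpg)
# [1] 20.09062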

PPML package gravity with time fixed effects

I'm trying to include time fixed effects (dummies for years generated with model.matrix) into a PPML regression in R.
Without time fixed effect the regression is:
require(gravity)
my_model <- PPML(y = "v", dist = "dist",
                 x = c("land", "contig", "comlang_ethno",
                       "smctry", "tech", "exrate"),
                 vce_robust = TRUE, data = database)
I've tried to add the argument fe = c("year") within the PPML call, but it doesn't work.
I'd appreciate any help on this.
I would comment on the previous answer but don't have enough reputation. The gravity model in your PPML command specifies v = dist × exp(land + contig + comlang_ethno + smctry + tech + exrate + TimeFE) = exp(log(dist) + land + contig + comlang_ethno + smctry + tech + exrate + TimeFE).
The formula inside of glm should have as its RHS the variables inside the exponential, because it represents the linear predictor produced by the link function (the Poisson default for which is natural log). So in sum, your command should be
glm(v ~ log(dist) + land + contig + comlang_ethno + smctry + tech + exrate +
      factor(year),
    family = 'quasipoisson')
and in particular, you need to have distance in logs on the RHS (unlike the previous answer).
Just make sure that year is a factor; then you can use the plain and simple glm function as
glm(y ~ dist + year, family = "quasipoisson")
which gives you the results with year as dummies/fixed effects. The robust SEs are then calculated (using the same fitted glm object in both arguments) with
lmtest::coeftest(EstimationResults.PPML, vcov = sandwich::vcovHC(EstimationResults.PPML, "HC1"))
The PPML function does nothing more; it just isn't very flexible.
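Putting the pieces together, here is a minimal sketch of the whole workflow (it assumes a data frame called database with the columns from the question, as in the PPML call above):
library(lmtest)
library(sandwich)
database$year <- factor(database$year)  # the time dummies come from the factor
fit <- glm(v ~ log(dist) + land + contig + comlang_ethno + smctry + tech +
             exrate + year,
           family = quasipoisson(link = "log"), data = database)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # robust (HC1) standard errors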
Alternatively to PPML and glm, you can also solve your problem using the function femlm (from package FENmlm) which deals with fixed-effect estimation for maximum likelihood models.
The two main advantages of function femlm are:
you can add as many fixed-effects as you want, and they are dealt with separately, leading to computing times far shorter than with glm (especially when the fixed-effects contain many categories)
standard-errors can be clustered with intuitive commands
Here's an example regarding your problem (with just two variables and the year fixed-effects):
library(FENmlm)
# (default family is Poisson, 'pipe' separates variables from fixed-effects)
res = femlm(v ~ log(dist) + land | year, base)
summary(res, se = "cluster")
This code estimates the coefficients of variables log(dist) and land with year fixed-effects; then it displays the coefficients table with clustered standard-errors (w.r.t. year) for the two variables.
Going beyond your initial question, now assume you have a more complex case with three fixed-effects: country_i, country_j and year. You'd write:
res = femlm(v ~ log(dist) + land | country_i + country_j + year, base)
You can then easily play around with clustered standard-errors:
# Cluster w.r.t. country_i (default is first cluster encountered):
summary(res, se = "cluster")
summary(res, se = "cluster", cluster = "year") # cluster w.r.t. year cluster
# Two-way clustering:
summary(res, se = "twoway") # two-way clustering w.r.t. country_i & country_j
# two way clustering w.r.t. country_i & year:
summary(res, se = "twoway", cluster = c("country_i", "year"))
For more information on the package, the vignette can be found at https://cran.r-project.org/web/packages/FENmlm/vignettes/FENmlm.html.

controlling the value (TRUE or FALSE) of dummy variables in interaction terms when using lm()

When I estimate a model that has an interaction between two variables that don't enter the model as standalone variables, and when one of these variables is a dummy (class "logical") variable, R "flips the sign" of the dummy variable. That is, it reports an estimate of the coefficient on the interaction term when the dummy is FALSE, not when it is TRUE. Here is an example:
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth <- trees$Girth - mean(trees$Girth)
lm(Volume ~ Girth + Girth:dHeight, data = trees) # estimate is for Girth:dHeightTRUE
lm(Volume ~ Girth + cGirth:dHeight, data = trees) # estimate is for cGirth:dHeightFALSE
Why does the regression in the last line produce an estimate for an interaction in which dHeight is FALSE rather than TRUE? (I would like R to report the estimate when dHeight is TRUE.)
This is not a big problem, but I would like to better understand why R is doing what it's doing. I know about relevel() and contrasts(), but I can't see that they would make a difference here.
dHeight is logical. Within the model it is coerced to a factor, and the levels are sorted lexicographically (i.e. FALSE before TRUE).
As noted in #Hongooi's answer, you can't estimate all 4 parameters, so R will fit the terms in the order they appear (FALSE before TRUE).
If you want to force R to fit the TRUE level first, you could fit the model with !dHeight:
lm(formula = Volume ~ Girth + cGirth:!dHeight, data = trees)
Note that !dHeightFALSE is the equivalent of dHeightTRUE.
You will also note that in this simple case you are simply changing the sign on the coefficient, so it doesn't really matter which model you fit.
EDIT: a far better approach
R can recognize that cGirth and Girth are collinear, so we can use nesting, remembering that a/b expands to a + a:b:
lm(formula = Volume ~ Girth + cGirth/dHeight, data = trees)
Coefficients:
(Intercept) Girth cGirth cGirth:dHeightTRUE
-27.198 4.251 NA 1.286
This provides coefficients with easy-to-interpret names, and R will sensibly fail to return a coefficient for cGirth.
R can tell that Girth and cGirth are collinear when they are both in the model as "main effect" or standalone terms.
There is no way R should be able to tell, when fitting Girth + cGirth:dHeight, that cGirth and Girth are collinear and that, because dHeight is logical, we want cGirth:dHeightTRUE to be the coefficient that is fitted. (You could write your own formula parser to do this if you really wanted.)
Another approach that would fit the model you wanted, without any collinear terms, would be to use
lm(formula = Volume ~ Girth + I(cGirth*dHeight), data = trees)
which coerces dHeight to numeric (TRUE becomes 1).
EDIT: to labor the point.
When you fit ~ Girth + Girth:dHeight
what you are saying is that there is a main effect for Girth plus adjustments for dHeight. R considers the first level of a factor the reference level. The slope for dHeightFALSE is simply the value for Girth; you then have the adjustment for when dHeight == TRUE (Girth:dHeightTRUE).
When you fit ~ Girth + cGirth:dHeight, R does not have a mind-reading parser that can tell that cGirth and Girth are collinear; when you fit the interaction of the two terms, it will assume that the second level of dHeight (TRUE) is effectively the reference level.
Imagine if you had a variable that was totally unrelated to Girth
eg
set.seed(1)
trees$cG <- runif(nrow(trees))
Then when you fit Girth + cG:dHeight, you will get 4 parameters estimated
lm(formula = Volume ~ Girth + cG:dHeight, data = trees)
Call:
lm(formula = Volume ~ Girth + cG:dHeight, data = trees)
Coefficients:
(Intercept) Girth cG:dHeightFALSE cG:dHeightTRUE
-31.79645 4.79435 -5.92168 0.09578
Which is sensible.
When R processes Girth + cGirth:dHeight, it will expand out (with the first level of the factor first) 1 + Girth + cGirth:dHeightFALSE + cGirth:dHeightTRUE -- and will work out that it can't estimate all 4 parameters, and will estimate the first 3.
R isn't flipping the sign on the dummy variable as such. When you fit ~ Girth + cGirth:dHeight, the cGirth variable is confounded with the intercept term. You can see what's going on by removing the intercept:
> lm(Volume ~ -1 + Girth + cGirth:dHeight, data = trees)
Call:
lm(formula = Volume ~ -1 + Girth + cGirth:dHeight, data = trees)
Coefficients:
Girth cGirth:dHeightFALSE cGirth:dHeightTRUE
2.199 2.053 3.339

What does the R formula y~1 mean?

I was reading the documentation on R Formula, and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, sample formula tends to be something like y ~ 1.
For a simple case like y ~ x, it defines a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context, if you look at the depmixS4 documentation, there is one example below
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 'one' symbol indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common mathematical meaning of model terms, you wrap them in I().) Intercepts are usually assumed, so it is most common to see the 1 in the context of explicitly stating a model without an intercept.
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
In the specific case you mention (y ~ 1), y is being predicted by no other variable, so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city)
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting formulas from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
Which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit +1 in the above formula, so really the formula above is y ~ 1 + x1 + x2.
We could have a very simple formula whereby y does not depend on any other variable. This is the formula you are referencing:
y ~ 1
which roughly equates to
y = β0(1) = β0
As #Paul points out, when you solve this simple model, you get β0 = mean(y).
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5
In general such a formula describes the relation between the dependent and independent variables in the form of a linear model: the left-hand side holds the dependent variables and the right-hand side the independent ones. The independent variables are used to calculate the trend component of the linear model, and the residuals are then assumed to follow some distribution. When the independent variables are replaced by 1 (~ 1), the trend component is a single value, e.g. the mean value of the data, i.e. the linear model only has an intercept.
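A tiny sketch of that last point (made-up numbers): with ~ 1 every fitted value is the same and equals the mean.
dat <- data.frame(y = c(2, 4, 9))
fit <- lm(y ~ 1, data = dat)
unique(fitted(fit))  # 5
mean(dat$y)          # 5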
