Dummy variable as slope shifter without intercept - r

This is my first time asking here.
I am having trouble generating slope dummy variables only (without intercept dummies).
However, if I multiply the dummy variable by the independent variable as shown below, both slope-dummy and intercept-dummy terms appear in the results.
I want to include the slope dummies only and exclude the intercept dummies.
I would appreciate your help.
Best,
yjkim
reg <- lm(year ~ as.factor(age)*log(v1269))
Call:
lm(formula = year ~ as.factor(age) * log(v1269))
Residuals:
Min 1Q Median 3Q Max
-6.083 -1.177 1.268 1.546 3.768
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.18076 2.16089 2.398 0.0167 *
as.factor(age)2 1.93989 2.75892 0.703 0.4821
as.factor(age)3 2.46861 2.39393 1.031 0.3027
as.factor(age)4 -0.56274 2.30123 -0.245 0.8069
log(v1269) -0.06788 0.23606 -0.288 0.7737
as.factor(age)2:log(v1269) -0.15628 0.29621 -0.528 0.5979
as.factor(age)3:log(v1269) -0.14961 0.25809 -0.580 0.5622
as.factor(age)4:log(v1269) 0.16534 0.24884 0.664 0.5065

Just add a -1 within the formula:
reg <- lm(year ~ as.factor(age)*log(v1269) -1)
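For reference, here is a minimal sketch of what that formula produces, using simulated stand-in data (the question's data set is not shown, so the variable values below are made up):
set.seed(42)
dat <- data.frame(age   = sample(1:4, 200, replace = TRUE),
                  v1269 = rlnorm(200),
                  year  = rnorm(200))
# -1 removes only the overall intercept: the fit still contains one
# intercept per age level plus log(v1269) and the interaction (slope) terms;
# the %in% approach in the next answer keeps a single common intercept instead
reg <- lm(year ~ as.factor(age) * log(v1269) - 1, data = dat)
coef(reg)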

If you want to estimate a different slope in each level of age, then you can use the %in% operator in the formula:
set.seed(1)
df <- data.frame(age = factor(sample(1:4, 100, replace = TRUE)),
v1269 = rlnorm(100),
year = rnorm(100))
m <- lm(year ~ log(v1269) %in% age, data = df)
summary(m)
This gives (for this entirely random, dummy, silly data set):
> summary(m)
Call:
lm(formula = year ~ log(v1269) %in% age, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.93108 -0.66402 -0.05921 0.68040 2.25244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02692 0.10705 0.251 0.802
log(v1269):age1 0.20127 0.21178 0.950 0.344
log(v1269):age2 -0.01431 0.24116 -0.059 0.953
log(v1269):age3 -0.02588 0.24435 -0.106 0.916
log(v1269):age4 0.06019 0.21979 0.274 0.785
Residual standard error: 1.065 on 95 degrees of freedom
Multiple R-squared: 0.01037, Adjusted R-squared: -0.0313
F-statistic: 0.2489 on 4 and 95 DF, p-value: 0.9097
Note that this fits a single constant term plus 4 different effects of log(v1269), one per level of age. Visually, this is sort of what the model is doing:
pred <- with(df,
expand.grid(age = factor(1:4),
v1269 = seq(min(v1269), max(v1269), length = 100)))
pred <- transform(pred, fitted = predict(m, newdata = pred))
library("ggplot2")
ggplot(df, aes(x = log(v1269), y = year, colour = age)) +
geom_point() +
geom_line(data = pred, mapping = aes(y = fitted)) +
theme_bw() + theme(legend.position = "top")
Clearly, this would only be suitable if there was no significant difference in the mean values of year (the response) in the different age categories.
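If you want to check that assumption, one option (a sketch that reuses the model m fitted above) is to compare the common-intercept model with one that also lets the intercept vary with age:
# add an age main effect and test whether it improves the fit;
# a significant F test would suggest the intercepts differ by age
m_full <- update(m, . ~ . + age)
anova(m, m_full)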
Note that a different parameterisation of the same model can be achieved via the / operator:
m2 <- lm(year ~ log(v1269)/age, data = df)
> m2
Call:
lm(formula = year ~ log(v1269)/age, data = df)
Coefficients:
(Intercept) log(v1269) log(v1269):age2 log(v1269):age3
0.02692 0.20127 -0.21559 -0.22715
log(v1269):age4
-0.14108
Note that now, the first log(v1269) term is the slope for age == 1, whilst the other terms are the adjustments that need to be applied to the log(v1269) term to get the slope for the indicated group:
coef(m)[-1]
coef(m2)[2] + c(0, coef(m2)[-(1:2)])
> coef(m)[-1]
log(v1269):age1 log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
> coef(m2)[2] + c(0, coef(m2)[-(1:2)])
log(v1269):age2 log(v1269):age3 log(v1269):age4
0.20127109 -0.01431491 -0.02588106 0.06018802
But they work out to the same estimated slopes.
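A quick programmatic check of that equivalence (ignoring the coefficient names, which differ between the two parameterisations):
# both parameterisations imply the same four slopes, up to naming
all.equal(unname(coef(m)[-1]),
          unname(coef(m2)[2] + c(0, coef(m2)[-(1:2)])))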

Related

R - Fixed-effects regression "plm" vs "lm + as.factor()": interpretation of R and R-Squared

I understand from this question here that coefficients are the same whether we use an lm regression with as.factor() or a plm regression with fixed effects.
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
However, the R-squared values differ significantly. Which one is correct, and how does the interpretation change between the two models? In my case, the R-squared is much larger for the plm specification and is even negative for the lm + factor one.

A linear model matrix where each level of a categorical is contrasted with the mean

I have xy data, where y is a continuous response and x is a categorical variable:
set.seed(1)
df <- data.frame(y = rnorm(27), group = c(rep("A",9),rep("B",9),rep("C",9)), stringsAsFactors = F)
I would like to fit the linear model y ~ group to it, in which each of the levels of df$group is contrasted with the mean.
I thought that using Deviation Coding does that:
lm(y ~ group,contrasts = "contr.sum",data=df)
But it skips contrasting group A with the mean:
> summary(lm(y ~ group,contrasts = "contr.sum",data=df))
Call:
lm(formula = y ~ group, data = df, contrasts = "contr.sum")
Residuals:
Min 1Q Median 3Q Max
-1.6445 -0.6946 -0.1304 0.6593 1.9165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2651 0.3457 -0.767 0.451
groupB 0.2057 0.4888 0.421 0.678
groupC 0.3985 0.4888 0.815 0.423
Residual standard error: 1.037 on 24 degrees of freedom
Multiple R-squared: 0.02695, Adjusted R-squared: -0.05414
F-statistic: 0.3324 on 2 and 24 DF, p-value: 0.7205
Is there any function that builds a model matrix to get each of the levels of df$group contrasted with the mean in the summary?
All I can think of is manually adding a "mean" level to df$group and setting it as the baseline with Dummy Coding:
library(dplyr)  # for the %>% pipe
df <- df %>% rbind(data.frame(y = mean(df$y), group = "mean"))
df$group <- factor(df$group, levels = c("mean","A","B","C"))
summary(lm(y ~ group,contrasts = "contr.treatment",data=df))
Call:
lm(formula = y ~ group, data = df, contrasts = "contr.treatment")
Residuals:
Min 1Q Median 3Q Max
-2.30003 -0.34864 0.07575 0.56896 1.42645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.14832 0.95210 0.156 0.878
groupA 0.03250 1.00360 0.032 0.974
groupB -0.06300 1.00360 -0.063 0.950
groupC 0.03049 1.00360 0.030 0.976
Residual standard error: 0.9521 on 24 degrees of freedom
Multiple R-squared: 0.002457, Adjusted R-squared: -0.1222
F-statistic: 0.01971 on 3 and 24 DF, p-value: 0.9961
Similarly, suppose I have data with two categorical variables:
set.seed(1)
df <- data.frame(y = rnorm(18),
group = c(rep("A",9),rep("B",9)),
class = as.character(rep(c(rep(1,3),rep(2,3),rep(3,3)),2)))
and I would like to estimate the interaction effect for each level (i.e., class1:groupB, class2:groupB, and class3:groupB) for:
lm(y ~ class*group,contrasts = c("contr.sum","contr.treatment"),data=df)
How would I obtain it?
Use + 0 in the lm formula to omit the intercept; then you should get the expected contrast coding:
summary(lm(y ~ 0 + group, contrasts = "contr.sum", data=df))
Result:
Call:
lm(formula = y ~ 0 + group, data = df, contrasts = "contr.sum")
Residuals:
Min 1Q Median 3Q Max
-2.3000 -0.3627 0.1487 0.5804 1.4264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
groupA 0.18082 0.31737 0.570 0.574
groupB 0.08533 0.31737 0.269 0.790
groupC 0.17882 0.31737 0.563 0.578
Residual standard error: 0.9521 on 24 degrees of freedom
Multiple R-squared: 0.02891, Adjusted R-squared: -0.09248
F-statistic: 0.2381 on 3 and 24 DF, p-value: 0.8689
If you want to do this for an interaction, here's one way:
lm(y ~ 0 + class:group,
contrasts = c("contr.sum","contr.treatment"),
data=df)

glm in R, give all comparisons

Simple logistic regression example.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
result <-glm(out~factor(y), family = 'binomial', data=df)
summary(result)
#Call:
#glm(formula = out ~ factor(y), family = "binomial", data = df)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.4823 -0.9005 -0.9005 0.9005 1.4823
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.931e-01 1.225e+00 -0.566 0.571
#factor(y)B 1.386e+00 1.732e+00 0.800 0.423
#factor(y)C 3.950e-16 1.732e+00 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 12.365 on 8 degrees of freedom
#Residual deviance: 11.457 on 6 degrees of freedom
#AIC: 17.457
#Number of Fisher Scoring iterations: 4
My reference category is now A; results for B and C relative to A are given. I would also like to get the results when B and C are the reference. One can change the reference manually by using the levels = argument of factor(), but this would require fitting 3 models. Is it possible to do this in one go? Or what would be a more efficient approach?
If you want to do all pairwise comparisons, you should usually also do a correction for alpha-error inflation due to multiple testing. You can easily do a Tukey test with package multcomp.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
# y is already a factor; if not, coerce it before fitting the model
result <-glm(out~y, family = 'binomial', data=df)
summary(result)
library(multcomp)
comps <- glht(result, linfct = mcp(y = "Tukey"))
summary(comps)
#Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: glm(formula = out ~ y, family = "binomial", data = df)
#
#Linear Hypotheses:
# Estimate Std. Error z value Pr(>|z|)
#B - A == 0 1.386e+00 1.732e+00 0.8 0.703
#C - A == 0 1.923e-16 1.732e+00 0.0 1.000
#C - B == 0 -1.386e+00 1.732e+00 -0.8 0.703
#(Adjusted p values reported -- single-step method)
#letter notation often used in graphs and tables
cld(comps)
# A B C
#"a" "a" "a"

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, which vary over time for a set of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not include time fixed effects and instead use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
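To make that explicit, here is a quick side-by-side of the slope estimates (a sketch assuming model.a, model.b and model.c from the code above are still in the workspace):
# the three estimators should agree on the coefficients of a and b
round(cbind(lm   = coef(model.a)[c("a", "b")],
            plm  = coef(model.b),
            felm = coef(model.c)), 6)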
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.

Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?

Consider the code:
x <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
header=TRUE)[,1:2]
fit <- glm(education ~ age, family="binomial", data=x)
summary(fit)
Where age has 4 levels: "<25" "25-29" "30-39" "40-49"
In the results (not reproduced here), one of the levels is used as a reference level by default. Is there a way to have glm output coefficients for all 4 levels plus the intercept (i.e. have no reference level)? Software packages like SAS do this by default, so I was wondering if there was any option for this.
Thanks!
See ?formula, specifically the meaning of including + 0 in your model specification:
# Sample data - explanatory variable (continuous)
x <- runif( 100 )
# explanatory data, factor with 3 levels
f <- as.factor( sample( 3 , 100 , TRUE ) )
# outcome data
y <- runif( 100 ) + rnorm(100) + rnorm( 100 , mean = c(1,3,6) )
# model without intercept
summary( glm( y ~ x + f + 0 ) )
#Call:
#glm(formula = y ~ x + f + 0)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-5.7316 -1.8923 0.0195 1.8918 5.9520
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#x 0.3216 0.9772 0.329 0.743
#f1 3.4493 0.6823 5.055 2.06e-06 ***
#f2 3.6349 0.6959 5.223 1.02e-06 ***
#f3 3.1962 0.6598 4.844 4.87e-06 ***
You'll want to use the model.matrix function to convert the age factor into binary indicator variables.
See this answer.
EDIT:
Here is an example:
x <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
header=TRUE)[,1:2]
binary_variables <- model.matrix(~ x$age -1, x)
fit <- glm(x$education ~ binary_variables, family="binomial")
summary(fit)
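One caveat worth noting (a sketch, not part of the original answer): because the four age indicators sum to one, keeping the default intercept makes this design rank-deficient and glm reports one coefficient as NA; dropping the intercept as well keeps estimates for all four levels:
# also drop the intercept so none of the four age indicators is aliased
fit2 <- glm(x$education ~ binary_variables - 1, family = "binomial")
summary(fit2)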
