I'm switching from Stata to R, and I get inconsistent results between R's prediction package, which I use to compute marginal predictions, and Stata's margins command when fixing a variable at a given value. Here is the example:
library(dplyr)
library(prediction)
d <- data.frame(x1 = factor(c(1, 1, 1, 2, 2, 2), levels = c(1, 2)),
                x2 = factor(c(1, 2, 3, 1, 2, 3), levels = c(1, 2, 3)),
                x3 = factor(c(1, 2, 1, 2, 1, 2), levels = c(1, 2)),
                y = c(3.1, 2.8, 2.5, 4.3, 4.0, 3.5))
m2 <- lm(y ~ x1 + x2 + x3, d)
summary(m2)
marg2a <- prediction(m2, at = list(x2 = "1"))
marg2b <- prediction(m2, at = list(x1 = "1"))
marg2a %>%
  select(x1, fitted) %>%
  group_by(x1) %>%
  summarise(error = mean(fitted))
marg2b %>%
  select(x2, fitted) %>%
  group_by(x2) %>%
  summarise(error = mean(fitted))
This is the result:
# A tibble: 2 x 2
x1 error
<fctr> <dbl>
1 1 3.133333
2 2 4.266667
# A tibble: 3 x 2
x2 error
<fctr> <dbl>
1 1 3.125
2 2 2.825
3 3 2.425
whereas if I try to replicate this using Stata's margins, this is the result:
regress y i.x1 i.x2 i.x3
margins i.x1, at(x2 == 1)
margins i.x2, at(x1 == 1)
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |
1 | 3.125 .0829157 37.69 0.017 2.071456 4.178544
2 | 4.275 .0829157 51.56 0.012 3.221456 5.328544
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x2 |
1 | 3.125 .0829157 37.69 0.017 2.071456 4.178544
2 | 2.825 .0829157 34.07 0.019 1.771456 3.878544
3 | 2.425 .0829157 29.25 0.022 1.371456 3.478544
------------------------------------------------------------------------------
The margins for x2 are the same in R and Stata, but when it comes to x1 there are differences and I don't know why. Really appreciate any help. Thanks,
P
Your Stata and R code are not equivalent. To replicate that Stata code you would need:
> prediction(m2, at = list(x1 = c("1", "2"), x2 = "1"))
Average predictions for 6 observations:
at(x1) at(x2) value
1 1 3.125
2 1 4.275
> prediction(m2, at = list(x2 = c("1", "2", "3"), x1 = "1"))
Average predictions for 6 observations:
at(x2) at(x1) value
1 1 3.125
2 1 2.825
3 1 2.425
That is because when you say margins i.x1 you are asking for predictions from counterfactual versions of the dataset in which x1 is replaced with 1 and then with 2, with the additional constraint that in both counterfactuals x2 is held at 1. The same thing is happening in your second Stata example.
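A minimal base-R sketch of that counterfactual logic, using only the model and data from the question, is the following (the two averages should reproduce the Stata margins of 3.125 and 4.275):
# Overwrite x1 and x2 in copies of the data, predict, then average; this is what
# prediction(at = ...) and Stata's margins do under the hood.
d_cf1 <- d
d_cf1$x1 <- factor(rep("1", nrow(d)), levels = c("1", "2"))
d_cf1$x2 <- factor(rep("1", nrow(d)), levels = c("1", "2", "3"))
d_cf2 <- d_cf1
d_cf2$x1 <- factor(rep("2", nrow(d)), levels = c("1", "2"))
mean(predict(m2, newdata = d_cf1))  # margin for x1 = 1 at x2 = 1
mean(predict(m2, newdata = d_cf2))  # margin for x1 = 2 at x2 = 1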
This is because Stata's margins command has two syntactic expressions that produce the same output. One is your code:
. margins i.x1, at(x2 == 1)
Predictive margins Number of obs = 6
Model VCE : OLS
Expression : Linear prediction, predict()
at : x2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |
1 | 3.125 .0829156 37.69 0.017 2.071457 4.178543
2 | 4.275 .0829156 51.56 0.012 3.221457 5.328543
------------------------------------------------------------------------------
The other is more explicit about what is actually happening in the above:
. margins, at(x1 = (1 2) x2 == 1)
Predictive margins Number of obs = 6
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : x1 = 1
x2 = 1
2._at : x1 = 2
x2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 3.125 .0829156 37.69 0.017 2.071457 4.178543
2 | 4.275 .0829156 51.56 0.012 3.221457 5.328543
------------------------------------------------------------------------------
Related
I am trying to replicate Stata's marginal effects from multinomial logit models in R, but with no success. For the multinomial logit model I used the multinom() function from the nnet package, and for the marginal effects I used the margins package, but the marginal_effects() function seems to display only the effects of a single variable. What if I want the marginal effects of a variable conditioned on another variable? Here is the output from Stata:
. margins, dydx(male) at(site=(1 2 3)) // male conditioned on site
Average marginal effects Number of obs = 615
Model VCE : OIM
dy/dx w.r.t. : 1.male
1._predict : Pr(insure==Indemnity), predict(pr outcome(1))
2._predict : Pr(insure==Prepaid), predict(pr outcome(2))
3._predict : Pr(insure==Uninsure), predict(pr outcome(3))
1._at : site = 1
2._at : site = 2
3._at : site = 3
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.male |
_predict#_at |
1 1 | -.1492951 .0728108 -2.05 0.040 -.2920016 -.0065885
1 2 | -.159346 .0723512 -2.20 0.028 -.3011517 -.0175403
1 3 | -.055138 .0875712 -0.63 0.529 -.2267745 .1164984
2 1 | .0763095 .0765406 1.00 0.319 -.0737074 .2263264
2 2 | .1747759 .0730055 2.39 0.017 .0316877 .3178641
2 3 | .0861997 .0843816 1.02 0.307 -.0791852 .2515846
3 1 | .0729855 .0516839 1.41 0.158 -.0283131 .1742842
3 2 | -.0154299 .0104982 -1.47 0.142 -.036006 .0051462
3 3 | -.0310617 .0495625 -0.63 0.531 -.1282025 .0660791
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
My attempt to calculate the marginal effects of male using the marginal_effects function:
library(nnet)
library(margins)
sysdsn1$insure <- as.factor(sysdsn1$insure)
sysdsn1$male <- as.factor(sysdsn1$male)
sysdsn1$site <- as.factor(sysdsn1$site)
sysdsn1$nonwhite <- as.factor(sysdsn1$nonwhite)
sysdsn1$insure <- relevel(sysdsn1$insure, ref = "3") #set the reference level
mn0 <- multinom(insure ~ age + male*site + nonwhite, data = sysdsn1) #multinomial logit model
head(marginal_effects(mn0, variables = "male")) # this only calculates the marginal effects of male; how do I condition on site?
dydx_male1
1 -0.01310874
2 -0.01744213
3 0.07911846
4 -0.03386199
5 -0.01728126
6 -0.01638176
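One direction I have considered but not verified (treat it as a sketch rather than a working solution): margins() itself takes an at argument, so something along these lines might do the conditioning on site, although I am not sure how well multinom objects are supported:
library(margins)
summary(margins(mn0, variables = "male", at = list(site = c("1", "2", "3"))))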
Data
Data can be downloaded from http://www.stata-press.com/data/r13/sysdsn1.dta and imported into R
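For example, a minimal import sketch (assuming the haven package) would be:
library(haven)
tmp <- tempfile(fileext = ".dta")
download.file("http://www.stata-press.com/data/r13/sysdsn1.dta", tmp, mode = "wb")
sysdsn1 <- as.data.frame(read_dta(tmp))  # convert the labelled tibble to a plain data frame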
Recently, I stumbled upon the fact that Stata and R handle regressions without an intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what Stata is doing, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, skipping over sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully explain why, in the second regression (with no constant), the dFALSE coefficient captures the effect of the intercept from the default regression (with a constant).)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_def),
x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
y = model.response(mod_frame_nocon),
x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_asnum),
x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
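Putting the last block together with the Stata output in the question: a hedged one-liner that should reproduce regress y x i.d_enc, noconstant directly from lm() is to enter d as numeric rather than as a factor, so only a single 0/1 column is estimated. This also accounts for the degrees-of-freedom difference (48 residual df in Stata versus 47 in R's y ~ x + d + 0, which estimates three coefficients).
# Numeric d contributes one 0/1 column instead of two dummies, matching Stata's
# noconstant parameterisation (x = .976214, d = -.2322011 above).
summary(lm(y ~ x + as.numeric(d) + 0, data = df))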
I'm using the lme4 package [the lmer() function] to estimate several Average Models, and I want to plot their estimated coefficients. I found the document "Plotting Estimates (Fixed Effects) of Regression Models" by Daniel Lüdecke, which explains how to plot estimates, and it works with Average Models, but it uses the Conditional Average values instead of the Full Average values.
Example of script:
library(lme4)
options(na.action = "na.omit")
PA_model_clima1_Om_ST <- lmer(O.matt ~ mes_N + Temperatura_Ar_PM_ST + RH_PM_ST + Vento_V_PM_ST + Evapotranspiracao_PM_ST + Preci_total_PM_ST + (1|ID), data=Abund)
library(MuMIn)
options(na.action = "na.fail")
PA_clima1_Om_ST<-dredge(PA_model_clima1_Om_ST)
sort.PA_clima1_Om_ST<- PA_clima1_Om_ST[order(PA_clima1_Om_ST$AICc),]
top.models_PA_clima1_Om_ST<-get.models(sort.PA_clima1_Om_ST, subset = delta < 2)
model.sel(top.models_PA_clima1_Om_ST)
Avg_PA_clima1_Om_ST<-model.avg(top.models_PA_clima1_Om_ST, fit = TRUE)
summary(Avg_PA_clima1_Om_ST)
Results of this script:
Term codes:
Evapotranspiracao_PM_ST Preci_total_PM_ST RH_PM_ST Temperatura_Ar_PM_ST
1 2 3 4
Vento_V_PM_ST
5
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) 5.4199 1.4094 1.4124 3.837 0.000124 ***
Preci_total_PM_ST -0.8679 1.0300 1.0313 0.842 0.400045
RH_PM_ST 0.6116 0.8184 0.8193 0.746 0.455397
Temperatura_Ar_PM_ST -1.9635 0.7710 0.7725 2.542 0.011026 *
Vento_V_PM_ST -0.6214 0.7043 0.7052 0.881 0.378289
Evapotranspiracao_PM_ST -0.1202 0.5174 0.5183 0.232 0.816654
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) 5.4199 1.4094 1.4124 3.837 0.000124 ***
Preci_total_PM_ST -1.2200 1.0304 1.0322 1.182 0.237249
RH_PM_ST 1.0067 0.8396 0.8410 1.197 0.231317
Temperatura_Ar_PM_ST -1.9635 0.7710 0.7725 2.542 0.011026 *
Vento_V_PM_ST -0.8607 0.6936 0.6949 1.238 0.215546
Evapotranspiracao_PM_ST -0.3053 0.7897 0.7912 0.386 0.699619
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Plot script:
library(sjPlot)
library(sjlabelled)
library(sjmisc)
library(ggplot2)
data(efc)
theme_set(theme_sjplot())
plot_model(Avg_PA_clima1_Om_ST, type="est", vline.color="black", sort.est = TRUE, show.values = TRUE, value.offset = .3, title= "O. mattogrossae")
Plot:
As you can see, it uses the Conditional Average values instead of the Full Average values.
How can I plot Estimates of Average Models using Full Average values?
I think plot_model() takes the conditional average, so unless you hack the function (or contact the author to request such an option), one way is to plot the coefficients yourself:
library(lme4)
library(MuMIn)
options(na.action = "na.fail")
set.seed(888)
dat= data.frame(y = rnorm(100),
var1 = rnorm(100),var2 = rnorm(100),
var3=rnorm(100),rvar = sample(1:2,replace=TRUE,100))
lme_mod <- lmer(y ~ var1+ var2+ var3 + (1|rvar), dat)
dre_mod <- dredge(lme_mod)
avg_mod = model.avg(dre_mod,fit=TRUE)
summary(avg_mod)
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -0.02988 0.18699 0.18936 0.158 0.875
var2 -0.03791 0.08817 0.08858 0.428 0.669
var1 -0.02999 0.07740 0.07778 0.386 0.700
var3 0.01521 0.05371 0.05404 0.281 0.778
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -0.02988 0.18699 0.18936 0.158 0.875
var2 -0.16862 0.11197 0.11339 1.487 0.137
var1 -0.15293 0.10841 0.10978 1.393 0.164
var3 0.11227 0.10200 0.10327 1.087 0.277
The full-average coefficient matrix is stored under:
summary(avg_mod)$coefmat.full
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -0.02988418 0.18698720 0.18935677 0.1578194 0.8745991
var2 -0.03791016 0.08816936 0.08857788 0.4279867 0.6686608
var1 -0.02998709 0.07740247 0.07778028 0.3855360 0.6998404
var3 0.01520633 0.05371407 0.05404100 0.2813850 0.7784151
We pull it out, turn it into a data frame, and plot:
library(ggplot2)
df = data.frame(summary(avg_mod)$coefmat.full)
df$variable = rownames(df)
colnames(df)[2] = "std_error"
df = df[df$variable !="(Intercept)",]
df$type = ifelse(df$Estimate>0,"pos","neg")
ggplot(df,aes(x=variable,y=Estimate))+
geom_point(aes(col=type),size=3) +
geom_errorbar(aes(col=type,ymin=Estimate-1.96*std_error,ymax=Estimate+1.96*std_error),width=0,size=1) +
geom_text(aes(label=round(Estimate,digits=2)),nudge_x =0.1) +
geom_hline(yintercept=0,col="black")+ theme_bw()+coord_flip()+
scale_color_manual(values=c("#c70039","#111d5e")) +
theme(legend.position="none")
You can also use parameters::model_parameters(), which is used internally by sjPlot::plot_model(). model_parameters() has a component argument to decide which component to return. However, plot_model() does not yet pass additional arguments down to model_parameters(). I'm going to address this in sjPlot. Meanwhile, using model_parameters() directly at least offers a quick plot() method.
library(lme4)
library(MuMIn)
options(na.action = "na.fail")
set.seed(888)
dat= data.frame(y = rnorm(100),
var1 = rnorm(100),var2 = rnorm(100),
var3=rnorm(100),rvar = sample(1:2,replace=TRUE,100))
lme_mod <- lmer(y ~ var1+ var2+ var3 + (1|rvar), dat)
dre_mod <- dredge(lme_mod)
avg_mod = model.avg(dre_mod, fit = TRUE)
library(parameters)
model_parameters(avg_mod)
#> Parameter | Coefficient | SE | 95% CI | z | df | p
#> --------------------------------------------------------------------
#> (Intercept) | -0.03 | 0.19 | [-0.40, 0.34] | 0.16 | 96 | 0.875
#> var2 | -0.17 | 0.11 | [-0.39, 0.05] | 1.49 | 96 | 0.137
#> var1 | -0.15 | 0.11 | [-0.37, 0.06] | 1.39 | 96 | 0.164
#> var3 | 0.11 | 0.10 | [-0.09, 0.31] | 1.09 | 96 | 0.277
model_parameters(avg_mod, component = "full")
#> Parameter | Coefficient | SE | 95% CI | z | df | p
#> --------------------------------------------------------------------
#> (Intercept) | -0.03 | 0.19 | [-0.40, 0.34] | 0.16 | 96 | 0.875
#> var2 | -0.04 | 0.09 | [-0.21, 0.14] | 0.43 | 96 | 0.669
#> var1 | -0.03 | 0.08 | [-0.18, 0.12] | 0.39 | 96 | 0.700
#> var3 | 0.02 | 0.05 | [-0.09, 0.12] | 0.28 | 96 | 0.778
plot(model_parameters(avg_mod, component = "full"))
You can make some minor modifications to the plot:
library(ggplot2)
plot(model_parameters(avg_mod, component = "full")) +
geom_text(aes(label = round(Coefficient, 2)), nudge_x = .2)
Created on 2020-06-27 by the reprex package (v0.3.0)
I can't seem to match Stata's xtreg command in R unless I use the fe option in Stata.
The coefficients are the same in Stata and R when I do a standard regression or a panel model with fixed effects.
Sample data:
library("plm" )
data("Cigar", package = "plm")
z <- Cigar[ Cigar$year %in% c( 63, 73) , ]
#saving so I can use in Stata
foreign::write.dta( z , "C:/Users/matthewr/Desktop/temp.dta")
So I get the same coefficient with this in R:
coef( lm( sales ~ pop , data = z ) )
and this in Stata
use "C:/Users/matthewr/Desktop/temp.dta" , clear
reg sales pop
And it works when I set up a panel and use the fixed-effects option.
z2 <- pdata.frame( z , index=c("state", "year") )
coef( plm( sales ~ pop , data= z2 , model="within" ) ) # matches xtreg , fe
Matches this in Stata
xtset state year
xtreg sales pop, fe
I can't figure out how to match Stata when I am not using the fixed-effects option. This is the result I am trying (and failing) to reproduce in R:
xtreg sales pop
Coefficient: -.0006838
Stata xtreg y x is equivalent to xtreg y x, re, so what you want is to calculate random effects.
summary(plm(sales ~ pop, data=z, model="random", index=c("state", "year")))$coe
# Estimate Std. Error z-value Pr(>|z|)
# (Intercept) 1.311398e+02 6.499511330 20.176878 1.563130e-90
# pop -6.837769e-04 0.001077432 -0.634636 5.256658e-01
Stata:
xtreg sales pop, re
sales | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pop | -.0006838 .0010774 -0.63 0.526 -.0027955 .001428
_cons | 131.1398 6.499511 20.18 0.000 118.401 143.8787
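One detail worth making explicit (my own addition, not something stated above): plm's default random.method is "swar" (Swamy-Arora), which is, as far as I know, also the variance-components estimator behind xtreg, re, and that is why the two match. Spelling it out:
# Same fit as above, with the variance-components estimator named explicitly.
summary(plm(sales ~ pop, data = z, model = "random",
            random.method = "swar", index = c("state", "year")))$coefficients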
Your question has been answered by @jay.sf. I'll just add something else, although it may not directly answer your question. Both Stata's xtreg and R's plm have a few options, and I find the RStata package a convenient tool for trying different options and comparing results from Stata and R directly in RStudio. I thought it could be helpful. The Stata path below is specific to my computer.
library("plm" )
library(RStata)
data("Cigar", package = "plm")
z <- Cigar[ Cigar$year %in% c( 63, 73) , ]
options("RStata.StataPath" = "\"C:\\Program Files (x86)\\Stata14\\StataSE-64\"")
options("RStata.StataVersion" = 14)
# Stata fe
stata_do1 <- '
xtset state year
xtreg sales pop, fe
'
stata(stata_do1, data.out = TRUE, data.in = z)
#> .
#> . xtset state year
#> panel variable: state (strongly balanced)
#> time variable: year, 63 to 73, but with gaps
#> delta: 1 unit
#> . xtreg sales pop, fe
#>
#> Fixed-effects (within) regression Number of obs = 92
#> Group variable: state Number of groups = 46
#>
#> R-sq: Obs per group:
#> within = 0.0118 min = 2
#> between = 0.0049 avg = 2.0
#> overall = 0.0048 max = 2
#>
#> F(1,45) = 0.54
#> corr(u_i, Xb) = -0.3405 Prob > F = 0.4676
#>
#> ------------------------------------------------------------------------------
#> sales | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> pop | -.0032108 .0043826 -0.73 0.468 -.0120378 .0056162
#> _cons | 141.5186 18.06909 7.83 0.000 105.1256 177.9116
#> -------------+----------------------------------------------------------------
#> sigma_u | 34.093409
#> sigma_e | 15.183908
#> rho | .83448264 (fraction of variance due to u_i)
#> ------------------------------------------------------------------------------
#> F test that all u_i=0: F(45, 45) = 8.91 Prob > F = 0.0000
# R
z2 <- pdata.frame( z , index=c("state", "year") )
coef( plm( sales ~ pop , data= z2 , model="within" ) )
#> pop
#> -0.003210817
# Stata re
stata_do2 <- '
xtset state year
xtreg sales pop, re
'
stata(stata_do2, data.out = TRUE, data.in = z)
#> .
#> . xtset state year
#> panel variable: state (strongly balanced)
#> time variable: year, 63 to 73, but with gaps
#> delta: 1 unit
#> . xtreg sales pop, re
#>
#> Random-effects GLS regression Number of obs = 92
#> Group variable: state Number of groups = 46
#>
#> R-sq: Obs per group:
#> within = 0.0118 min = 2
#> between = 0.0049 avg = 2.0
#> overall = 0.0048 max = 2
#>
#> Wald chi2(1) = 0.40
#> corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.5257
#>
#> ------------------------------------------------------------------------------
#> sales | Coef. Std. Err. z P>|z| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> pop | -.0006838 .0010774 -0.63 0.526 -.0027955 .001428
#> _cons | 131.1398 6.499511 20.18 0.000 118.401 143.8787
#> -------------+----------------------------------------------------------------
#> sigma_u | 30.573218
#> sigma_e | 15.183908
#> rho | .80214841 (fraction of variance due to u_i)
#> ------------------------------------------------------------------------------
# R random
coef(plm(sales ~ pop,
data=z,
model="random",
index=c("state", "year")))
#> (Intercept) pop
#> 1.311398e+02 -6.837769e-04
Created on 2020-01-27 by the reprex package (v0.3.0)
This question is related to https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, but a slightly different scenario.
I have a simple two-factor linear model with a continuous outcome Y. factor1 has ~350 levels, and factor2 has the same ~350 levels. I want to constrain the coefficients so that, for each level, the factor1 and factor2 coefficients sum to zero.
(The reason for this is that each level of factor1 and factor2 enters either positively or negatively in any training example, but never appears twice in the same example.)
Here is an example dataset illustrating the situation, where there are four levels of each factor:
Y factor1 factor2
1 -1.2470416 A B
2 4.3368592 C D
3 1.0005147 D A
4 -2.8309146 A C
5 1.7501315 B D
6 -0.8372193 B A
7 3.3542627 C A
8 4.3319422 D C
9 1.4937895 D B
10 2.0951559 A D
11 -2.6610207 C D
12 -4.9917367 D B
13 2.2424169 D A
14 1.0205409 C A
15 -3.4584576 C B
The statistical model I want to estimate is:
$$ y_{(i,j)} = \alpha_i-\beta_j+\varepsilon_{(i,j)} $$
where $(i,j)$ indexes an outcome that depends on the pair: factor1 marks $i$ and factor2 marks $j$. If group A shows up in factor2, its parameter should equal the negative of the parameter it would get in factor1. Thus, I would like to set $\alpha_k$ equal to $\beta_k$ for every level $k$.
I can estimate a (nonsensical) version of this model in lm() fairly easily as follows:
Y <- c( -1.2470416, 4.3368592, 1.0005147, -2.8309146, 1.7501315, -0.8372193, 3.3542627, 4.3319422, 1.4937895, 2.0951559, -2.6610207, -4.9917367, 2.2424169, 1.0205409, -3.4584576 )
factor1 <- c( "A" , "C" , "D" , "A" , "B" , "B" , "C" , "D" , "D" , "A" , "C" , "D" , "D" , "C" , "C")
factor2 <- c( "B", "D", "A", "C", "D", "A", "A", "C", "B", "D", "D", "B", "A", "A", "B")
DF <- data.frame(Y,factor1,factor2)
lm(Y~factor1+factor2,data=DF)
and I get the following output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5363 2.5856 0.207 0.841
factor1B -0.4579 3.1121 -0.147 0.887
factor1C 0.4047 2.4925 0.162 0.875
factor1D 1.8737 2.4098 0.778 0.459
factor2B -3.6252 2.2050 -1.644 0.139
factor2C -0.7226 2.8903 -0.250 0.809
factor2D 0.7561 2.2094 0.342 0.741
Note that, theoretically, factor1C should equal -factor2C as dictated by my model. This is not the case in the simple lm() output because I didn't impose any constraints.
So what I would like to do is to estimate
Y ~ factor1 + factor2 [subject to factor1+factor2=0 for each level of factor1, factor2]
In plain English, this would be something like
model2 <- lm(Y~factor1-factor2, data=DF)
But this of course is not how R interprets that expression (because putting a minus sign in a model statement tells R to exclude that variable from the model).
I've read up on contrasts, but I don't think there is a way to do this. I've also read up on glmc, but didn't see a straightforward way of incorporating it for factors that have this many levels. Also, it's not clear to me that generating a new factor3 = factor1-factor2 is a well-defined operation for this specific scenario. Finally, I tried running model3 <- lm(Y+factor2 ~ factor1, data=DF) but received an error.
My sense is that I would need to create a constraint matrix by looping through the levels of each variable. I'm sufficiently new to R that I'm not sure exactly how this is done. Any help would be appreciated.
Note that it is quite easy to do this in Stata, as follows:
input ID y factor1 factor2
1 -1.2470416 1 2
2 4.3368592 3 4
3 1.0005147 4 1
4 -2.8309146 1 3
5 1.7501315 2 4
6 -0.8372193 2 1
7 3.3542627 3 1
8 4.3319422 4 3
9 1.4937895 4 2
10 2.0951559 1 4
11 -2.6610207 3 4
12 -4.9917367 4 2
13 2.2424169 4 1
14 1.0205409 3 1
15 -3.4584576 3 2
end
constraint 1 2.factor1 = -2.factor2
constraint 2 3.factor1 = -3.factor2
constraint 3 4.factor1 = -4.factor2
cnsreg y i.factor1 i.factor2, constraints(1/3)
which gives the following output:
Constrained linear regression Number of obs = 15
F( 3, 11) = 0.73
Prob > F = 0.5554
Root MSE = 2.9875
( 1) 2.factor1 + 2.factor2 = 0
( 2) 3.factor1 + 3.factor2 = 0
( 3) 4.factor1 + 4.factor2 = 0
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
factor1 |
B | 2.104393 1.439085 1.46 0.172 -1.063011 5.271798
C | .5222649 1.377463 0.38 0.712 -2.509511 3.55404
D | .6589209 1.266188 0.52 0.613 -2.127941 3.445783
|
factor2 |
B | -2.104393 1.439085 -1.46 0.172 -5.271798 1.063011
C | -.5222649 1.377463 -0.38 0.712 -3.55404 2.509511
D | -.6589209 1.266188 -0.52 0.613 -3.445783 2.127941
|
_cons | .5054862 .829675 0.61 0.555 -1.320616 2.331589
------------------------------------------------------------------------------
How does one do the above in R?
As noted in the most popular (but not accepted) answer in https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, this problem is easily solved by creating a new variable which is the difference in the "one-hot" encoded factors.
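In the notation of the question, the differencing works because, under the constraint $\alpha_k = \beta_k$, the model can be rewritten as a regression on differenced indicator columns:
$$ y_{(i,j)} = \sum_k \alpha_k \left( \mathbf{1}[i = k] - \mathbf{1}[j = k] \right) + \varepsilon_{(i,j)} $$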
In Stata, one can do this as follows:
* one-hot encode each of the factors
qui tab factor1, gen(f1dum)
qui tab factor2, gen(f2dum)
* generate difference in one-hot vectors
forv x=1/4{
gen fdiffdum`x' = f1dum`x'-f2dum`x'
}
* regress y on differenced one-hot vectors
reg y fdiffdum2 fdiffdum3 fdiffdum4
Which gives the following output:
Source | SS df MS Number of obs = 15
-------------+---------------------------------- F(3, 11) = 0.73
Model | 19.5429062 3 6.51430205 Prob > F = 0.5554
Residual | 98.1766922 11 8.92515383 R-squared = 0.1660
-------------+---------------------------------- Adj R-squared = -0.0614
Total | 117.719598 14 8.40854274 Root MSE = 2.9875
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fdiffdum2 | 2.104393 1.439085 1.46 0.172 -1.063011 5.271798
fdiffdum3 | .5222648 1.377463 0.38 0.712 -2.509511 3.55404
fdiffdum4 | .6589209 1.266188 0.52 0.613 -2.127941 3.445783
_cons | .5054862 .829675 0.61 0.555 -1.320616 2.331589
------------------------------------------------------------------------------
In R, one can do this as follows:
factor1mat <- model.matrix(~factor1, DF)
factor2mat <- model.matrix(~factor2, DF)
factordiffmat <- factor1mat - factor2mat
summary(lm(Y~factordiffmat, data=DF))
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5055 0.8297 0.609 0.555
factordiffmat(Intercept) NA NA NA NA
factordiffmatfactor1B 2.1044 1.4391 1.462 0.172
factordiffmatfactor1C 0.5223 1.3775 0.379 0.712
factordiffmatfactor1D 0.6589 1.2662 0.520 0.613
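As a small follow-up (my own note, not part of the answer above): the NA row comes from the intercept column that model.matrix() places in both matrices, which cancels to an all-zero column in the difference; dropping it gives the same fit without the singularity message. The implied factor2 coefficients are simply the negatives of those shown, exactly as in the Stata cnsreg output.
# Drop the all-zero differenced intercept column before fitting.
factordiffmat_nc <- factordiffmat[, -1]
summary(lm(Y ~ factordiffmat_nc, data = DF))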