Trying to reproduce xtreg in Stata with plm in R

I can't seem to match Stata's xtreg command in R except when I use the fe option in Stata.
The coefficients are the same in Stata and R when I run a standard regression or a panel model with fixed effects.
Sample data:
library("plm")
z <- Cigar[Cigar$year %in% c(63, 73), ]
# saving so I can use it in Stata
foreign::write.dta(z, "C:/Users/matthewr/Desktop/temp.dta")
So I get the same coefficient with this in R:
coef(lm(sales ~ pop, data = z))
and this in Stata
use "C:/Users/matthewr/Desktop/temp.dta" , clear
reg sales pop
And it works when I set up a panel and use the fixed-effects option.
z2 <- pdata.frame(z, index = c("state", "year"))
coef(plm(sales ~ pop, data = z2, model = "within"))  # matches xtreg, fe
This matches the following in Stata:
xtset state year
xtreg sales pop, fe
I can't figure out how to match Stata when I am not using the fixed-effects option.
This is the result I am trying to reproduce in R:
xtreg sales pop
Coefficient: -.0006838

Stata's xtreg y x is equivalent to xtreg y x, re, so what you want is a random-effects model.
summary(plm(sales ~ pop, data=z, model="random", index=c("state", "year")))$coe
# Estimate Std. Error z-value Pr(>|z|)
# (Intercept) 1.311398e+02 6.499511330 20.176878 1.563130e-90
# pop -6.837769e-04 0.001077432 -0.634636 5.256658e-01
Stata:
xtreg sales pop, re
sales | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pop | -.0006838 .0010774 -0.63 0.526 -.0027955 .001428
_cons | 131.1398 6.499511 20.18 0.000 118.401 143.8787
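For completeness: plm's documented default for random.method is "swar" (Swamy-Arora), which evidently agrees with Stata's default variance-components estimator in this example. A hedged sketch that just makes the default explicit (it changes nothing here):
# same model, spelling out plm's default variance-components estimator
coef(plm(sales ~ pop, data = z, model = "random",
         random.method = "swar", index = c("state", "year")))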

Your question has been answered by @jay.sf. I will just add something else, although it may not directly answer your question. Both Stata's xtreg and R's plm have a few options, and I feel the RStata package could be a convenient tool for trying different options and comparing results from Stata and R directly in RStudio. I thought it could be helpful. Note that the Stata path below is specific to my computer.
library("plm")
library(RStata)
data("Cigar", package = "plm")
z <- Cigar[Cigar$year %in% c(63, 73), ]
options("RStata.StataPath" = "\"C:\\Program Files (x86)\\Stata14\\StataSE-64\"")
options("RStata.StataVersion" = 14)
# Stata fe
stata_do1 <- '
xtset state year
xtreg sales pop, fe
'
stata(stata_do1, data.out = TRUE, data.in = z)
#> .
#> . xtset state year
#> panel variable: state (strongly balanced)
#> time variable: year, 63 to 73, but with gaps
#> delta: 1 unit
#> . xtreg sales pop, fe
#>
#> Fixed-effects (within) regression Number of obs = 92
#> Group variable: state Number of groups = 46
#>
#> R-sq: Obs per group:
#> within = 0.0118 min = 2
#> between = 0.0049 avg = 2.0
#> overall = 0.0048 max = 2
#>
#> F(1,45) = 0.54
#> corr(u_i, Xb) = -0.3405 Prob > F = 0.4676
#>
#> ------------------------------------------------------------------------------
#> sales | Coef. Std. Err. t P>|t| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> pop | -.0032108 .0043826 -0.73 0.468 -.0120378 .0056162
#> _cons | 141.5186 18.06909 7.83 0.000 105.1256 177.9116
#> -------------+----------------------------------------------------------------
#> sigma_u | 34.093409
#> sigma_e | 15.183908
#> rho | .83448264 (fraction of variance due to u_i)
#> ------------------------------------------------------------------------------
#> F test that all u_i=0: F(45, 45) = 8.91 Prob > F = 0.0000
# R
z2 <- pdata.frame(z, index = c("state", "year"))
coef(plm(sales ~ pop, data = z2, model = "within"))
#> pop
#> -0.003210817
# Stata re
stata_do2 <- '
xtset state year
xtreg sales pop, re
'
stata(stata_do2, data.out = TRUE, data.in = z)
#> .
#> . xtset state year
#> panel variable: state (strongly balanced)
#> time variable: year, 63 to 73, but with gaps
#> delta: 1 unit
#> . xtreg sales pop, re
#>
#> Random-effects GLS regression Number of obs = 92
#> Group variable: state Number of groups = 46
#>
#> R-sq: Obs per group:
#> within = 0.0118 min = 2
#> between = 0.0049 avg = 2.0
#> overall = 0.0048 max = 2
#>
#> Wald chi2(1) = 0.40
#> corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.5257
#>
#> ------------------------------------------------------------------------------
#> sales | Coef. Std. Err. z P>|z| [95% Conf. Interval]
#> -------------+----------------------------------------------------------------
#> pop | -.0006838 .0010774 -0.63 0.526 -.0027955 .001428
#> _cons | 131.1398 6.499511 20.18 0.000 118.401 143.8787
#> -------------+----------------------------------------------------------------
#> sigma_u | 30.573218
#> sigma_e | 15.183908
#> rho | .80214841 (fraction of variance due to u_i)
#> ------------------------------------------------------------------------------
# R random
coef(plm(sales ~ pop,
         data = z,
         model = "random",
         index = c("state", "year")))
#> (Intercept) pop
#> 1.311398e+02 -6.837769e-04
Created on 2020-01-27 by the reprex package (v0.3.0)

Related

Missing standard errors and confidence limits for mixed model and ggeffects

I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results, and I can reproduce the problem with some simulated data. It seems to happen specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
  x = seq(-0.1, 0.1, length.out = n),
  g = floor(runif(n) * 3),
  f = letters[1 + g] %>% as.factor(),
  y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
  z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
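If that hypothesis is right, lme4's own diagnostics should show it. A hedged sketch of the check (my addition, not part of the original investigation):
# TRUE when a variance component was estimated at the boundary (zero)
lme4::isSingular(model)
# inspect the estimated variance components directly
lme4::VarCorr(model)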
This seems worth opening an issue on the ggeffects issues list ?

Printing name of outcome above coxph output and exponentiating coefficients (R version 4.1.2 (2021-11-01) -- "Bird Hippie")

I ran the code below, which fits Cox regressions across multiple outcome types (stroke, cancer, respiratory) that appear in separate columns. purrr seems to do this quite well. But I would also like to:
1. print the name of each outcome type above the corresponding regression model, and
2. print the coefficients as hazard ratios with 95% CIs.
I know this is quite a big ask, but it is important since my real dataset has almost 20 outcome types. Any help would be much appreciated!
library(survival)
library(purrr)
mydata <- read.table(header = TRUE, text = "
age Sex survival stroke cancer respiratory
51 2 1.419178082 2 1 1
60 1 5 1 2 2
49 2 1.082191781 2 2 2
83 1 0.038356164 1 1 2
68 2 0.77260274 2 1 2
44 2 2.336986301 1 2 1
76 1 1.271232877 1 2 2")
outcomes <- names(mydata[4:6])
purrr::map(outcomes, ~coxph(as.formula(paste("Surv(survival,", .x, ") ~ Sex + age")),
                            mydata))
I'm not quite sure if this is what you are looking for, but if you run the following code:
result <- purrr::map(outcomes, function(x) {
  f <- as.formula(paste("Surv(survival,", x, ") ~ Sex + age"))
  model <- coxph(f, mydata)
  model$call$formula <- f  # store the real formula so print(model) shows it
  s <- summary(model)
  cat(x, ':\n',
      paste0(apply(s$coefficients, 1, function(x) {
        # x[1] is the log hazard ratio, x[3] its standard error
        paste0("HR : ", round(exp(x[1]), 2),
               ' (95% CI ', round(exp(x[1] - 1.96 * x[3]), 2),
               ' - ', round(exp(x[1] + 1.96 * x[3]), 2), ')')
      }), collapse = '\n'),
      '\n\n', sep = '')
  invisible(model)
})
It will print out:
#> stroke:
#> HR : 650273590159.06 (95% CI 0 - Inf)
#> HR : 1.36 (95% CI 0.75 - 2.49)
#>
#> cancer:
#> HR : 1121.58 (95% CI 0 - 770170911.09)
#> HR : 1.33 (95% CI 0.78 - 2.28)
#>
#> respiratory:
#> HR : 24.1 (95% CI 0.31 - 1884.85)
#> HR : 1.2 (95% CI 0.99 - 1.45)
And your list of models will be stored with the correct call above them:
result
#> [[1]]
#> Call:
#> coxph(formula = Surv(survival, stroke) ~ Sex + age, data = mydata)
#>
#> coef exp(coef) se(coef) z p
#> Sex 2.720e+01 6.503e+11 2.111e+04 0.001 0.999
#> age 3.105e-01 1.364e+00 3.066e-01 1.013 0.311
#>
#> Likelihood ratio test=6.52 on 2 df, p=0.03834
#> n= 7, number of events= 3
#>
#> [[2]]
#> Call:
#> coxph(formula = Surv(survival, cancer) ~ Sex + age, data = mydata)
#>
#> coef exp(coef) se(coef) z p
#> Sex 7.0225 1121.5843 6.8570 1.024 0.306
#> age 0.2870 1.3325 0.2739 1.048 0.295
#>
#> Likelihood ratio test=2.58 on 2 df, p=0.2753
#> n= 7, number of events= 4
#>
#> [[3]]
#> Call:
#> coxph(formula = Surv(survival, respiratory) ~ Sex + age, data = mydata)
#>
#> coef exp(coef) se(coef) z p
#> Sex 3.18232 24.10259 2.22413 1.431 0.1525
#> age 0.18078 1.19815 0.09772 1.850 0.0643
#>
#> Likelihood ratio test=5.78 on 2 df, p=0.05552
#> n= 7, number of events= 5
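As an aside, the manual exp()/1.96 arithmetic can be avoided: summary.coxph already returns exponentiated coefficients with 95% confidence intervals in its conf.int component. A hedged sketch (hr_tables is just an illustrative name):
# pull hazard ratios and 95% CIs straight from summary()$conf.int
hr_tables <- purrr::map(outcomes, function(x) {
  f <- as.formula(paste("Surv(survival,", x, ") ~ Sex + age"))
  ci <- summary(coxph(f, mydata))$conf.int
  # conf.int columns: exp(coef), exp(-coef), lower .95, upper .95
  round(ci[, c("exp(coef)", "lower .95", "upper .95")], 2)
})
names(hr_tables) <- outcomes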

SAS proc glm random effects model with contrasts translated into R

My apologies for any errors; I only recently began learning SAS. I was given this SAS code (the code below is a reprex, not the exact code) that uses proc glm, presumably to fit a random-effects model. Instead of using color, the SAS code uses contrasts on idnumber to map onto color indirectly.
I would like to know how to replicate this in R. Several attempts using lme4 for the random effects and MASS::ginv for the contrasts were unsuccessful, so I may need a package I am unfamiliar with.
I would also like to know the difference between red-blue and red-blue2 and why their output differs. Thank you for your help.
data df1;
input idnumber color $ value1;
datalines;
1001 red 189
1002 red 145
1003 red 210
1004 red 194
1005 red 127
1006 red 189
1007 blue 145
1008 red 210
1009 red 194
1010 red 127
;
proc glm data=df1;
class idnumber;
model value1=idnumber/noint solution clparm;
contrast 'red vs. blue' idnumber 1 1 1 1 1 1 -9 1 1 1;
estimate 'red-blue' idnumber 1 1 1 1 1 1 -9 1 1 1/ divisor=10;
estimate 'red-blue2' idnumber .111 .111 .111 .111 .111 .111 -.999 .111 .111 .111;
run;
Below are a few attempts at replication.
idnumber <- c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010)
color <- c('red', 'red', 'red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red')
value1 <- c(189, 145, 210, 194, 127, 189, 145, 210, 194, 127)
df1 <- data.frame(idnumber, color, value1)
library(lme4)
library(MASS)
library(tidyverse)
options(contrasts = c(factor = "contr.SAS", ordered = "contr.poly"))
# attempt 1
mod1 <- lme4::lmer(value1 ~ (1|idnumber), data = df1) # error
#> Error: number of levels of each grouping factor must be < number of observations (problems: idnumber)
# attempt 2
mod2 <- lme4::lmer(value1 ~ (1|color), data = df1) # singular
#> boundary (singular) fit: see help('isSingular')
summary(mod2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: value1 ~ (1 | color)
#> Data: df1
#>
#> REML criterion at convergence: 90.9
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3847 -0.8429 0.4816 0.6321 1.1138
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> color (Intercept) 0 0.00
#> Residual 1104 33.22
#> Number of obs: 10, groups: color, 2
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 173.00 10.51 16.47
#> optimizer (nloptwrap) convergence code: 0 (OK)
#> boundary (singular) fit: see help('isSingular')
# attempt 3
mat1 <- rbind(c(-0.5, 0.5))
cMat1 <- MASS::ginv(mat1)
mod3 <- lm(value1 ~ color, data = df1, contrasts = list(color = cMat1))
summary(mod3)
#>
#> Call:
#> lm(formula = value1 ~ color, data = df1, contrasts = list(color = cMat1))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -49.11 -23.33 12.89 17.89 33.89
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 160.56 17.74 9.052 1.78e-05 ***
#> color1 15.56 17.74 0.877 0.406
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33.65 on 8 degrees of freedom
#> Multiple R-squared: 0.08771, Adjusted R-squared: -0.02633
#> F-statistic: 0.7691 on 1 and 8 DF, p-value: 0.4061
# attempt 4
con <- c(.1, .1, .1, .1, .1, .1, -.9, .1, .1, .1)
mod4 <- lm(value1 ~ idnumber, data = df1, contrasts = list(idnumber = con)) # error, but unsure how to fix
#> Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]): contrasts apply only to factors
Created on 2022-02-08 by the reprex package (v2.0.1)
Answering the part of this that is answerable: what is going on with the two different estimates.
The estimate statement includes a list of coefficients. Those are multiplied by the level means and then summed, giving the result. The reason the two estimates differ is simply that their coefficients differ: the first is (after the division) 0.1 / -0.9, while the second is 0.111 (roughly one ninth) / -0.999, which is effectively the first contrast with a divisor of 9 instead of 10. Hence, the math is different.
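A quick arithmetic check of that explanation in R (a hedged sketch; it works here only because each idnumber has a single observation, so each level's least-squares mean is simply its own value1):
# reconstruct the two estimate statements by hand
value1 <- c(189, 145, 210, 194, 127, 189, 145, 210, 194, 127)
est1 <- sum(c(1, 1, 1, 1, 1, 1, -9, 1, 1, 1) / 10 * value1)  # 'red-blue', divisor=10
est2 <- sum(c(rep(.111, 6), -.999, rep(.111, 3)) * value1)   # 'red-blue2'
c(est1, est2)  # the two differ by roughly the factor 10/9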
I'm also not sure about your reprex; it doesn't really make sense to use idnumber as the class variable. It seems more likely you'd use color as the class variable. Is it possible this is just bad SAS code? I'm not a GLM expert, but it seems odd to me to run GLM with the classification variable being the ID number (assuming it's a unique ID, anyway).

Regression without intercept in R and Stata

Recently, I stumbled upon the fact that Stata and R handle regressions without an intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with an intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also, the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what is going on, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, skipping over sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully understand why, in the second regression (with NO CONSTANT), the dFALSE coefficient captures the effect of the intercept from the default regression (with constant).)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
  y = model.response(mod_frame_def),
  x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
  y = model.response(mod_frame_nocon),
  x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
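One way to see why the two fits coincide (a hedged check of my own, not part of the original answer): the dFALSE and dTRUE columns of the no-constant design matrix sum to a constant column, so both design matrices span the same column space, and dTRUE's no-constant coefficient is just (Intercept) plus dTRUE from the default fit:
# dFALSE + dTRUE is identically 1, i.e. the intercept column
all(rowSums(mod_matrix_nocon[, c("dFALSE", "dTRUE")]) == 1)  # TRUE
# hence the default fit's 0.4375371 + 0.1512977 reproduces dTRUE above
0.4375371 + 0.1512977  # 0.5888348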
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
  y = model.response(mod_frame_asnum),
  x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
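So, assuming the walkthrough above is right, the plain lm() equivalent of Stata's noconstant fit would be this hedged one-liner (same numbers as the lm.fit result above and as Stata's output):
# a single numeric dummy column, matching Stata's design matrix
coef(lm(y ~ x + as.numeric(d) + 0, data = df))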
Created on 2021-03-18 by the reprex package (v1.0.0)

R prediction package VS Stata margins

I'm switching from Stata to R, and I find inconsistent results when I use prediction to compute marginal predictions compared with the results from the Stata command margins when fixing the values of a variable. Here is the example:
library(dplyr)
library(prediction)
d <- data.frame(x1 = factor(c(1, 1, 1, 2, 2, 2), levels = c(1, 2)),
                x2 = factor(c(1, 2, 3, 1, 2, 3), levels = c(1, 2, 3)),
                x3 = factor(c(1, 2, 1, 2, 1, 2), levels = c(1, 2)),
                y = c(3.1, 2.8, 2.5, 4.3, 4.0, 3.5))
m2 <- lm(y ~ x1 + x2 + x3, d)
summary(m2)
marg2a <- prediction(m2, at = list(x2 = "1"))
marg2b <- prediction(m2, at = list(x1 = "1"))
marg2a %>%
  select(x1, fitted) %>%
  group_by(x1) %>%
  summarise(error = mean(fitted))
marg2b %>%
  select(x2, fitted) %>%
  group_by(x2) %>%
  summarise(error = mean(fitted))
This is the result:
# A tibble: 2 x 2
x1 error
<fctr> <dbl>
1 1 3.133333
2 2 4.266667
# A tibble: 3 x 2
x2 error
<fctr> <dbl>
1 1 3.125
2 2 2.825
3 3 2.425
while if I try to replicate this using Stata's margins, this is the result:
regress y i.x1 i.x2 i.x3
margins i.x1, at(x2 == 1)
margins i.x2, at(x1 == 1)
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |
1 | 3.125 .0829157 37.69 0.017 2.071456 4.178544
2 | 4.275 .0829157 51.56 0.012 3.221456 5.328544
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x2 |
1 | 3.125 .0829157 37.69 0.017 2.071456 4.178544
2 | 2.825 .0829157 34.07 0.019 1.771456 3.878544
3 | 2.425 .0829157 29.25 0.022 1.371456 3.478544
------------------------------------------------------------------------------
The margins for x2 are the same in R and Stata, but when it comes to x1 there are differences, and I don't know why. I'd really appreciate any help. Thanks,
P
Your Stata and R code are not equivalent. To replicate that Stata code you would need:
> prediction(m2, at = list(x1 = c("1", "2"), x2 = "1"))
Average predictions for 6 observations:
at(x1) at(x2) value
1 1 3.125
2 1 4.275
> prediction(m2, at = list(x2 = c("1", "2", "3"), x1 = "1"))
Average predictions for 6 observations:
at(x2) at(x1) value
1 1 3.125
2 1 2.825
3 1 2.425
That is because when you say margins i.x1 you are asking for predictions from counterfactual versions of the dataset where x1 is replaced with 1 and then replaced with 2, with the additional constraint that in both counterfactuals x2 is held at 1. The same thing is occurring in your second Stata example.
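To make that counterfactual logic concrete, here is a base-R sketch (my own illustration, not part of the prediction package workflow) reproducing the first margin by hand:
# counterfactual dataset: every row gets x1 = "1" and x2 = "1"
d1 <- d
d1$x1 <- factor("1", levels = levels(d$x1))
d1$x2 <- factor("1", levels = levels(d$x2))
# average the predictions over the modified data: 3.125, matching margins
mean(predict(m2, newdata = d1))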
This is because Stata's margins command has an ambiguity, or rather two syntactic expressions that produce the same output. One is your code:
. margins i.x1, at(x2 == 1)
Predictive margins Number of obs = 6
Model VCE : OLS
Expression : Linear prediction, predict()
at : x2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |
1 | 3.125 .0829156 37.69 0.017 2.071457 4.178543
2 | 4.275 .0829156 51.56 0.012 3.221457 5.328543
------------------------------------------------------------------------------
The other is more explicit about what is actually happening in the above:
. margins, at(x1 = (1 2) x2 == 1)
Predictive margins Number of obs = 6
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : x1 = 1
x2 = 1
2._at : x1 = 2
x2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 3.125 .0829156 37.69 0.017 2.071457 4.178543
2 | 4.275 .0829156 51.56 0.012 3.221457 5.328543
------------------------------------------------------------------------------
