Linear model with equality constraints across two factor variables

This question is related to https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, but a slightly different scenario.
I have a simple 2-factor linear model with continuous outcome Y. factor1 has ~350 levels, and factor2 has the same ~350 levels. For each level, I want to constrain the coefficients on the two factors to sum to zero.
(The reason for this is that each level of factor1 and factor2 enters either positively or negatively in any training example, but never appears twice in the same example.)
Here is an example dataset illustrating the situation, where there are four levels of each factor:
Y factor1 factor2
1 -1.2470416 A B
2 4.3368592 C D
3 1.0005147 D A
4 -2.8309146 A C
5 1.7501315 B D
6 -0.8372193 B A
7 3.3542627 C A
8 4.3319422 D C
9 1.4937895 D B
10 2.0951559 A D
11 -2.6610207 C D
12 -4.9917367 D B
13 2.2424169 D A
14 1.0205409 C A
15 -3.4584576 C B
The statistical model I want to estimate is:
$$ y_{(i,j)} = \alpha_i-\beta_j+\varepsilon_{(i,j)} $$
where the outcome depends on the pair $(i,j)$: factor1 indexes $i$ and factor2 indexes $j$. If a level such as A shows up in factor2, its parameter should equal the negative of what it would be had it shown up in factor1. Thus, I would like to set $\alpha_k$ equal to $\beta_k$ for every level $k$.
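Written out, imposing $\alpha_k=\beta_k$ for every level $k$ and letting $d^{(1)}_k$ and $d^{(2)}_k$ denote the level-$k$ dummy variables of factor1 and factor2, the constrained model is just a regression on differenced dummies (a restatement of the model above, not an extra assumption):
$$ y = \sum_k \alpha_k \left( d^{(1)}_k - d^{(2)}_k \right) + \varepsilon $$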
I can estimate a (nonsensical) version of this model in lm() fairly easily as follows:
Y <- c(-1.2470416, 4.3368592, 1.0005147, -2.8309146, 1.7501315, -0.8372193, 3.3542627, 4.3319422, 1.4937895, 2.0951559, -2.6610207, -4.9917367, 2.2424169, 1.0205409, -3.4584576)
factor1 <- c( "A" , "C" , "D" , "A" , "B" , "B" , "C" , "D" , "D" , "A" , "C" , "D" , "D" , "C" , "C")
factor2 <- c( "B", "D", "A", "C", "D", "A", "A", "C", "B", "D", "D", "B", "A", "A", "B")
DF <- data.frame(Y,factor1,factor2)
lm(Y~factor1+factor2,data=DF)
and I get the following output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5363 2.5856 0.207 0.841
factor1B -0.4579 3.1121 -0.147 0.887
factor1C 0.4047 2.4925 0.162 0.875
factor1D 1.8737 2.4098 0.778 0.459
factor2B -3.6252 2.2050 -1.644 0.139
factor2C -0.7226 2.8903 -0.250 0.809
factor2D 0.7561 2.2094 0.342 0.741
Note that, theoretically, factor1C should equal -factor2C as dictated by my model. This is not the case in the simple lm() output because I didn't impose any constraints.
So what I would like to do is to estimate
Y ~ factor1 + factor2 [subject to factor1+factor2=0 for each level of factor1, factor2]
Naively, in R formula notation, this would be something like
model2 <- lm(Y~factor1-factor2, data=DF)
But this of course is not how R interprets that expression (because putting a minus sign in a model statement tells R to exclude that variable from the model).
I've read up on contrasts, but I don't think there is a way to do this. I've also read up on glmc, but didn't see a straightforward way of incorporating it for factors that have this many levels. Also, it's not clear to me that generating a new factor3 = factor1-factor2 is a well-defined operation for this specific scenario. Finally, I tried running model3 <- lm(Y+factor2 ~ factor1, data=DF) but received an error.
My sense is that I would need to create a constraint matrix by looping through the levels of each variable. I'm sufficiently new to R that I'm not sure exactly how this is done. Any help would be appreciated.
Note that it is quite easy to do this in Stata, as follows:
input ID y factor1 factor2
1 -1.2470416 1 2
2 4.3368592 3 4
3 1.0005147 4 1
4 -2.8309146 1 3
5 1.7501315 2 4
6 -0.8372193 2 1
7 3.3542627 3 1
8 4.3319422 4 3
9 1.4937895 4 2
10 2.0951559 1 4
11 -2.6610207 3 4
12 -4.9917367 4 2
13 2.2424169 4 1
14 1.0205409 3 1
15 -3.4584576 3 2
end
constraint 1 2.factor1 = -2.factor2
constraint 2 3.factor1 = -3.factor2
constraint 3 4.factor1 = -4.factor2
cnsreg y i.factor1 i.factor2, constraints(1/3)
which gives the following output:
Constrained linear regression Number of obs = 15
F( 3, 11) = 0.73
Prob > F = 0.5554
Root MSE = 2.9875
( 1) 2.factor1 + 2.factor2 = 0
( 2) 3.factor1 + 3.factor2 = 0
( 3) 4.factor1 + 4.factor2 = 0
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
factor1 |
B | 2.104393 1.439085 1.46 0.172 -1.063011 5.271798
C | .5222649 1.377463 0.38 0.712 -2.509511 3.55404
D | .6589209 1.266188 0.52 0.613 -2.127941 3.445783
|
factor2 |
B | -2.104393 1.439085 -1.46 0.172 -5.271798 1.063011
C | -.5222649 1.377463 -0.38 0.712 -3.55404 2.509511
D | -.6589209 1.266188 -0.52 0.613 -3.445783 2.127941
|
_cons | .5054862 .829675 0.61 0.555 -1.320616 2.331589
------------------------------------------------------------------------------
How does one do the above in R?

As noted in the most popular (but not accepted) answer in https://stats.stackexchange.com/questions/3143/linear-model-with-constraints, this problem is easily solved by creating a new variable which is the difference in the "one-hot" encoded factors.
In Stata, one can do this as follows:
* one-hot encode each of the factors
qui tab factor1, gen(f1dum)
qui tab factor2, gen(f2dum)
* generate difference in one-hot vectors
forv x=1/4{
gen fdiffdum`x' = f1dum`x'-f2dum`x'
}
* regress y on differenced one-hot vectors
reg y fdiffdum2 fdiffdum3 fdiffdum4
Which gives the following output:
Source | SS df MS Number of obs = 15
-------------+---------------------------------- F(3, 11) = 0.73
Model | 19.5429062 3 6.51430205 Prob > F = 0.5554
Residual | 98.1766922 11 8.92515383 R-squared = 0.1660
-------------+---------------------------------- Adj R-squared = -0.0614
Total | 117.719598 14 8.40854274 Root MSE = 2.9875
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fdiffdum2 | 2.104393 1.439085 1.46 0.172 -1.063011 5.271798
fdiffdum3 | .5222648 1.377463 0.38 0.712 -2.509511 3.55404
fdiffdum4 | .6589209 1.266188 0.52 0.613 -2.127941 3.445783
_cons | .5054862 .829675 0.61 0.555 -1.320616 2.331589
------------------------------------------------------------------------------
In R, one can do this as follows:
factor1mat <- model.matrix(~factor1, DF)
factor2mat <- model.matrix(~factor2, DF)
factordiffmat <- factor1mat - factor2mat
summary(lm(Y~factordiffmat, data=DF))
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5055 0.8297 0.609 0.555
factordiffmat(Intercept) NA NA NA NA
factordiffmatfactor1B 2.1044 1.4391 1.462 0.172
factordiffmatfactor1C 0.5223 1.3775 0.379 0.712
factordiffmatfactor1D 0.6589 1.2662 0.520 0.613
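The NA row is harmless: factordiffmat still contains the (Intercept) column produced by model.matrix, and the two intercept columns difference to all zeros, which lm() flags as singular. A minimal sketch of one way to tidy this up (assuming the column keeps its default name "(Intercept)"):
# drop the all-zero differenced intercept column before fitting
factordiffmat <- factordiffmat[, colnames(factordiffmat) != "(Intercept)"]
fit <- lm(Y ~ factordiffmat, data = DF)
summary(fit)
# under the constraint, the implied factor2 coefficients are simply the
# negatives of the estimates on the differenced dummies
-coef(fit)[-1]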

Related

What is wrong with my syntax in lme4::lmer() for a split-split plot design with unbalanced repeated measures?

I am trying to use the lme4 package in R and function lmer() to fit a model for my split-split plot design. I would have used a repeated measures ANOVA if I did not have a small number of observations missing, but the missing data should be no problem with a linear mixed effects model.
My data frame (data) has a simple structure with four factors and a numeric outcome variable called all_vai. Note that in this example data frame, not all levels of all factors are crossed even though they would be in my real data (except for the missing observations). It shouldn't matter for my question, which is an attempt to fix problematic syntax.
collected_vai <- rnorm(125, mean = 6, sd = 1)
missing <- rep(NA, times = 3)
all_vai <- c(collected_vai, missing)
year1 <- rep(2018, times = 32)
year2 <- rep(2019, times = 32)
year3 <- rep(2020, times = 32)
year4 <- rep(2021, times = 32)
year <- c(year1, year2, year3, year4)
disturbance_severity <- rep(c(0,45,65,85), each = 32)
treatment <- rep(c("B" , "T"), each = 64)
replicate <- rep(c("A", "B", "C", "D"), each = 32)
data = data.frame(all_vai, year, disturbance_severity, treatment, replicate)
data$year <- as.factor(data$year)
data$disturbance_severity <- as.factor(data$disturbance_severity)
data$treatment <- as.factor(data$treatment)
data$replicate <- as.factor(data$replicate)
Here is the model I ran for an identical data set with a different (normally distributed) numeric outcome and no missing observations -- i.e., this is the model I would be running if I didn't have unbalanced repeated measures now due to missing data:
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year + Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
When I run this, I get the error message: "Warning message:
In aov(mean_vai ~ disturbance_severity * treatment * Year + Error(Replicate/disturbance_severity/treatment/Year), :
Error() model is singular"
I have observations from different years nested within different treatments, which are nested within different disturbance severities, and all of this nested within replicates (which are experimental blocks). So I tried using this structure in lme4:
library(lme4)
library(lmerTest)
VAImodel2 <- lmer(all_vai ~ (year|replicate:disturbance_severity:treatment) + disturbance_severity*treatment*year, data = data)
summary(VAImodel2)
And this is the error message I get: "Error: number of observations (=125) <= number of random effects (=128) for term (Year | Replicate:disturbance_severity:treatment); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable"
Next I tried simplifying my model so that I was not running out of degrees of freedom, by removing the treatment variable and interaction term, like so:
VAImodel3 <- lmer(all_vai ~ (year|replicate:disturbance_severity) + disturbance_severity*year, data = data)
summary(VAImodel3)
This time I get a different error: "boundary (singular) fit: see ?isSingular
Warning message:
Model failed to converge with 1 negative eigenvalue: -1.2e-01 "
Thank you in advance for any help.
Your problem is incorrect data preparation.
Let's start by defining values for your variables year, disturbance_severity, treatment, replicate.
library(tidyverse)
set.seed(123)
yars = 2018:2021
disturbances = c(0,45,65,85)
treatments = c("B" , "T")
replicates = c("A", "B", "C", "D")
n = length(yars)*length(disturbances)*length(treatments)*length(replicates)*1
nNA=3
Please note that I first created the variables yars, disturbances, treatments and replicates with all the allowed values.
Then I calculated the total number of observations n (here 4 × 4 × 2 × 4 × 1 = 128; you can increase the last factor in the multiplication from 1 to, e.g., 10) and set the number of missing values in nNA.
The key step is the call expand.grid(yars, disturbances, treatments, replicates), which returns a table containing every combination of the supplied values exactly once.
Look at the first few lines of what expand.grid returns.
Var1 Var2 Var3 Var4
1 2018 0 B A
2 2019 0 B A
3 2020 0 B A
4 2021 0 B A
5 2018 45 B A
6 2019 45 B A
7 2020 45 B A
8 2021 45 B A
9 2018 65 B A
10 2019 65 B A
11 2020 65 B A
12 2021 65 B A
13 2018 85 B A
14 2019 85 B A
15 2020 85 B A
16 2021 85 B A
17 2018 0 T A
18 2019 0 T A
This is crucial here.
The next step is straightforward: we build a tibble around this grid and pass it to the aov() function.
data = tibble(sample(c(rnorm(n-nNA, mean = 6, sd = 1), rep(NA, nNA)), n)) %>%
mutate(expand.grid(yars, disturbances, treatments, replicates)) %>%
rename_with(~c("all_vai", "year", "disturbance_severity", "treatment", "replicate"))
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year +
Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
output
Error: replicate
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1341 0.1341 0.093 0.811
treatment 1 0.0384 0.0384 0.027 0.897
Residuals 1 1.4410 1.4410
Error: replicate:disturbance_severity
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1391 0.1391 0.152 0.763
treatment 1 0.1819 0.1819 0.199 0.733
year 1 1.4106 1.4106 1.545 0.431
Residuals 1 0.9129 0.9129
Error: replicate:disturbance_severity:treatment
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.4647 0.4647 0.698 0.491
year 1 0.8127 0.8127 1.221 0.384
Residuals 2 1.3311 0.6655
Error: replicate:disturbance_severity:treatment:year
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 2.885 2.8846 3.001 0.144
year 1 0.373 0.3734 0.388 0.560
treatment:year 1 0.002 0.0015 0.002 0.970
Residuals 5 4.806 0.9612
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.03 0.031 0.039 0.8430
year 1 1.29 1.292 1.662 0.2002
treatment:year 1 4.30 4.299 5.532 0.0206 *
Residuals 102 79.26 0.777
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now there is no "Error() model is singular" warning.
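With the data generated this way, the mixed-model route from the question can also be tried. A minimal sketch, assuming a plain random intercept for replicate (the experimental block) instead of the asker's nested random-slope term, and converting the design variables to factors first:
library(lme4)
library(lmerTest)
# treat the design variables as factors, as in the original question
data <- data %>%
  mutate(across(c(year, disturbance_severity, treatment, replicate), as.factor))
# random intercept per replicate block; with simulated data the block variance
# may be estimated near zero, so a boundary (singular) fit message is possible
VAImodel2 <- lmer(all_vai ~ disturbance_severity * treatment * year +
                    (1 | replicate), data = data)
summary(VAImodel2)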

Regression without intercept in R and Stata

Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why the two results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what Stata is doing, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, skipping over sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully understand why, in the second regression with no constant, the dFALSE coefficient captures the effect of the intercept from the default regression with a constant.)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_def),
x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
y = model.response(mod_frame_nocon),
x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_asnum),
x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
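To answer the replication part of the question directly: because Stata still drops the base level of i.d_enc even with noconstant, its design matrix contains only x and the TRUE dummy. A minimal sketch of the equivalent plain lm() call, building on the as.numeric() variant above:
# same columns Stata uses for `regress y x i.d_enc, noconstant`:
# x plus a single 0/1 dummy for d, and no intercept
fit_stata_like <- lm(y ~ x + as.numeric(d) + 0, data = df)
coef(fit_stata_like)
# expected to match Stata's coefficients on x and 2.d_enc
# (0.976214 and -0.2322011 in the Stata output above)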

Check for statistically significant differences between groups after running a logistic regression w/ interaction & random effect?

I ran an ordinal logistic regression (using the function clmm from the R package ordinal) with a two-factor interaction and a random effect.
The response is a factor with 5 levels (Likert scale: 1 2 3 4 5); the independent variables are a factor with 2 levels (time) and a factor with 3 levels (group).
The code looks like this:
library(ordinal)
# dataset
ID time group response
person1 1 a 3
person2 1 a 5
person3 1 c 5
person4 1 b 2
person5 1 c 2
person6 1 a 4
person1 2 a 2
person2 2 a 2
person3 2 c 1
person4 2 b 4
person5 2 c 3
person6 2 a 4
... ... ... ...
# model
model <- clmm(response ~ time*group + (1|ID), data = dataset)
# model results
formula: response ~ time * group + (1 | ID)
data: dataset
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 168 -226.76 473.52 508(4150) 9.42e-05 1.8e+02
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 5.18 2.276
Number of groups: ID 84
Coefficients:
Estimate Std. Error z value Pr(>|z|)
time2 0.2837 0.5289 0.536 0.59170
group_b 1.8746 0.6946 2.699 0.00695 **
group_c 4.0023 0.9383 4.265 2e-05 ***
time2:group_b -0.5100 0.7294 -0.699 0.48447
time2:group_c -0.8830 0.9749 -0.906 0.36508
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -2.6223 0.6440 -4.072
2|3 0.2474 0.5427 0.456
3|4 2.5384 0.5824 4.359
4|5 4.6786 0.7143 6.550
As you can see, the model results only show whether there are differences relative to the reference level (i.e. time1:group_a). However, I am also interested in checking whether the difference between time1:group_b and time2:group_b is statistically significant, and likewise for group_c.
Since I have to account for the random effect, I cannot use a simple chi-square test to check for statistically significant differences between groups. I therefore tried the function contrast from the R package emmeans, which takes the output of the function emmeans; see the code below:
library(emmeans)
em <- emmeans(model, ~ time | group) #calculates the estimated marginal means
contrast(em, "consec", simple = "each")
# contrast results
$`simple contrasts for time`
group = a:
contrast estimate SE df z.ratio p.value
2 - 1 0.284 0.529 Inf 0.536 0.5917
group = b:
contrast estimate SE df z.ratio p.value
2 - 1 -0.226 0.482 Inf -0.470 0.6386
group = c:
contrast estimate SE df z.ratio p.value
2 - 1 -0.599 0.816 Inf -0.734 0.4629
Note: contrasts are still on the as.factor scale
$`simple contrasts for group`
time = 1:
contrast estimate SE df z.ratio p.value
b - a 1.87 0.695 Inf 2.699 0.0137
c - b 2.13 0.871 Inf 2.443 0.0284
time = 2:
contrast estimate SE df z.ratio p.value
b - a 1.36 0.687 Inf 1.986 0.0897
c - b 1.75 0.838 Inf 2.095 0.0695
Note: contrasts are still on the as.factor scale
P value adjustment: mvt method for 2 tests
My questions are:
a) Is this a correct and valid method to check whether the differences are significant?
b) If not, what is the correct way to do this?
Of course any other suggestion is extremely welcome! Thanks a lot.

R prediction package VS Stata margins

I'm switching from Stata to R, and I get inconsistent results between the R package prediction, which I use to compute marginal predictions, and the Stata command margins when fixing the values of a variable. Here is the example:
library(dplyr)
library(prediction)
d <- data.frame(x1 = factor(c(1,1,1,2,2,2), levels = c(1, 2)),
x2 = factor(c(1,2,3,1,2,3), levels = c(1, 2, 3)),
x3 = factor(c(1,2,1,2,1,2), levels = c(1, 2)),
y = c(3.1, 2.8, 2.5, 4.3, 4.0, 3.5))
m2 <- lm(y ~ x1 + x2 + x3, d)
summary(m2)
marg2a <- prediction(m2, at = list(x2 = "1"))
marg2b <- prediction(m2, at = list(x1 = "1"))
marg2a %>%
select(x1, fitted) %>%
group_by(x1) %>%
summarise(error = mean(fitted))
marg2b %>%
select(x2, fitted) %>%
group_by(x2) %>%
summarise(error = mean(fitted))
This is the result:
# A tibble: 2 x 2
x1 error
<fctr> <dbl>
1 1 3.133333
2 2 4.266667
# A tibble: 3 x 2
x2 error
<fctr> <dbl>
1 1 3.125
2 2 2.825
3 3 2.425
while if I try to replicate this using Stata's margins, this is the result:
regress y i.x1 i.x2 i.x3
margins i.x1, at(x2 == 1)
margins i.x2, at(x1 == 1)
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |
1 | 3.125 .0829157 37.69 0.017 2.071456 4.178544
2 | 4.275 .0829157 51.56 0.012 3.221456 5.328544
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x2 |
1 | 3.125 .0829157 37.69 0.017 2.071456 4.178544
2 | 2.825 .0829157 34.07 0.019 1.771456 3.878544
3 | 2.425 .0829157 29.25 0.022 1.371456 3.478544
------------------------------------------------------------------------------
The margins for x2 are the same in R and Stata, but when it comes to x1 there are differences and I don't know why. Really appreciate any help. Thanks,
P
Your Stata and R code are not equivalent. To replicate that Stata code you would need:
> prediction(m2, at = list(x1 = c("1", "2"), x2 = "1"))
Average predictions for 6 observations:
at(x1) at(x2) value
1 1 3.125
2 1 4.275
> prediction(m2, at = list(x2 = c("1", "2", "3"), x1 = "1"))
Average predictions for 6 observations:
at(x2) at(x1) value
1 1 3.125
2 1 2.825
3 1 2.425
That is because when you say margins i.x1 you are asking for predictions for counterfactual versions of the dataset where x1 is replaced with 1 and then with 2, with the additional constraint that in both counterfactuals x2 is held at 1. The same thing is happening in your second Stata example.
This comes about because Stata's margins command has two syntactic expressions that produce the same output. One is your code:
. margins i.x1, at(x2 == 1)
Predictive margins Number of obs = 6
Model VCE : OLS
Expression : Linear prediction, predict()
at : x2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |
1 | 3.125 .0829156 37.69 0.017 2.071457 4.178543
2 | 4.275 .0829156 51.56 0.012 3.221457 5.328543
------------------------------------------------------------------------------
The other is more explicit about what is actually happening in the above:
. margins, at(x1 = (1 2) x2 == 1)
Predictive margins Number of obs = 6
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : x1 = 1
x2 = 1
2._at : x1 = 2
x2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 3.125 .0829156 37.69 0.017 2.071457 4.178543
2 | 4.275 .0829156 51.56 0.012 3.221457 5.328543
------------------------------------------------------------------------------

Fitting linear model / ANOVA by group [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I'm trying to run anova() in R and running into some difficulty. This is what I've done up to now to help shed some light on my question.
Here is the str() of my data to this point.
str(mhw)
'data.frame': 500 obs. of 5 variables:
$ r : int 1 2 3 4 5 6 7 8 9 10 ...
$ c : int 1 1 1 1 1 1 1 1 1 1 ...
$ grain: num 3.63 4.07 4.51 3.9 3.63 3.16 3.18 3.42 3.97 3.4 ...
$ straw: num 6.37 6.24 7.05 6.91 5.93 5.59 5.32 5.52 6.03 5.66 ...
$ Quad : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
Column r is a numerical value indicating the row of the field in which an individual plot resides
Column c is a numerical value indicating the column in which an individual plot resides
Column Quad corresponds to the geographical quadrant of the field in which each plot resides
Quad <- ifelse(mhw$c > 13 & mhw$r < 11, "NE",ifelse(mhw$c < 13 & mhw$r < 11,"NW", ifelse(mhw$c < 13 & mhw$r >= 11, "SW","SE")))
mhw <- cbind(mhw, Quad)
I have fit a lm() as follows
nov.model <- lm(mhw$grain ~ mhw$straw)
anova(nov.model)
This is an anova() for the entire field, which is testing grain yield against straw yield for each plot in the dataset.
My trouble is that I want to run a separate anova() for each level of the Quad column, to test grain yield against straw yield within each quadrant.
Perhaps a with() might fix that, but I have never used it before and I am still in the process of learning R. Any help would be greatly appreciated.
I think you are looking for the by facility in R.
fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))
Since you have 4 levels in Quad, you end up with 4 linear models in fit, i.e., fit is a "by" class object (a type of "list") of length 4.
To get the coefficients for each model, you can use
sapply(fit, coef)
To produce model summary, use
lapply(fit, summary)
To export ANOVA table, use
lapply(fit, anova)
As a reproducible example, I am taking the example from ?by:
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
class(tmp)
# [1] "by"
mode(tmp)
# [1] "list"
sapply(tmp, coef)
# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778
lapply(tmp, anova)
#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736
I was aware of this option but not familiar with it. Thanks to @Roland for providing code for the above reproducible example:
library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)
For your data I think it would be
fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)
You don't need to install nlme; it ships with R as one of the recommended packages.
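If you also want the per-quadrant coefficients collected in one table, lmList objects have a coef method; a minimal sketch, assuming fit is the lmList object created above:
coef(fit)
# one row per level of Quad, with columns (Intercept) and straw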
