Can I use a summary function on lm?

I am working with election survey data and have a dataset loaded into R, with objects already created; right now I am working in the tidyverse. I am trying to run a regression with male and another variable. However, male is one level of the gender variable, and I am trying to isolate just the males from gender overall. In the data, male is coded as 1 and female as 2. Running the regression produces these warnings:
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
I do get coefficients, but then when I try to get a summary:
gender_disgust_set<-lm(gender~dem_disgusted, data=my_data_set)
summary(gender_disgust_set)
then I get this error and warning:
Error in quantile.default(resid) : (unordered) factors are not allowed
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
lm(gender~dem_disgusted, data=my_data_set)
subset(my_data_set, gender = male)
male_total<-subset(my_data_set, gender = male)
summary(male_total)
lm(gender~dem_disgusted, data=my_data_set)

The lm() function is designed for linear regression, which generally assumes a continuous response.
From the lm() details page:
A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
Your gender variable is a factor, not continuous. If you really wanted to predict gender (a factor), you would need to use glm() for logistic regression.
Yes, you can use summary() on lm objects, but whether linear (or logistic) regression is best for your specific research question is a different question.
library(tidyverse)
set.seed(123)
gender <- sample(1:2, 10, replace = TRUE) %>% factor()
x1 <- sample(1:12, 10, replace = TRUE) %>% as.numeric()
x2 <- sample(1:100, 10, replace = TRUE) %>% as.numeric()
x3 <- sample(50:75, 10, replace = TRUE) %>% as.numeric()
my_data_set <- data.frame(gender, x1, x2, x3)
sapply(my_data_set, class)
#> gender x1 x2 x3
#> "factor" "numeric" "numeric" "numeric"
# error
# gender_disgust_set <- lm(gender ~ x1, data = my_data_set)
# summary(gender_disgust_set)
# logistic regression
gender_disgust_set1 <- glm(gender ~ x1, data = my_data_set, family = "binomial")
summary(gender_disgust_set1)
#>
#> Call:
#> glm(formula = gender ~ x1, family = "binomial", data = my_data_set)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2271 -0.9526 -0.8296 1.1571 1.5409
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.6530 1.7983 0.363 0.717
#> x1 -0.1342 0.2149 -0.625 0.532
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 13.46 on 9 degrees of freedom
#> Residual deviance: 13.06 on 8 degrees of freedom
#> AIC: 17.06
#>
#> Number of Fisher Scoring iterations: 4
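If you keep the logistic model, you may also want fitted probabilities. A minimal sketch using the fit above (predicting the probability of the second gender level):
predict(gender_disgust_set1, type = "response")  # fitted probabilities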
# or flip it around
# while this model works, please look into dummy-coding before using
# factors to predict continuous responses
gender_disgust_set2 <- lm(x1 ~ gender, data = my_data_set)
summary(gender_disgust_set2)
#>
#> Call:
#> lm(formula = x1 ~ gender, data = my_data_set)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.500 -2.438 0.500 2.688 3.750
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.500 1.371 6.199 0.00026 ***
#> gender2 -1.250 2.168 -0.577 0.58010
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.359 on 8 degrees of freedom
#> Multiple R-squared: 0.03989, Adjusted R-squared: -0.08012
#> F-statistic: 0.3324 on 1 and 8 DF, p-value: 0.5801
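As a side note on the subsetting attempted in the question: subset(my_data_set, gender = male) uses a single =, which subset() silently ignores (the whole data frame is returned). A comparison needs ==, and with the question's coding (male = 1) a sketch would be:
male_total <- subset(my_data_set, gender == 1)       # note '==', not '='
# or, since you are in the tidyverse:
male_total <- dplyr::filter(my_data_set, gender == 1)
summary(male_total)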

Related

How do I extract variables that have a low p-value in R

I have a logistic model with plenty of interactions in R.
I want to extract only the terms, whether predictor variables or interactions, that are significant.
It's fine if I can just look at every significant interaction, since I can still see which non-significant fields were used to build it.
Thank you.
This is the most I have
broom::tidy(logmod)[,c("term", "estimate", "p.value")]
Here is a way. After fitting the logistic model use a logical condition to get the significant predictors and a regex (logical grep) to get the interactions. These two index vectors can be combined with &, in the case below returning no significant interactions at the alpha == 0.05 level.
fit <- glm(am ~ hp + qsec*vs, mtcars, family = binomial)
summary(fit)
#>
#> Call:
#> glm(formula = am ~ hp + qsec * vs, family = binomial, data = mtcars)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.93876 -0.09923 -0.00014 0.05351 1.33693
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 199.02697 102.43134 1.943 0.0520 .
#> hp -0.12104 0.06138 -1.972 0.0486 *
#> qsec -10.87980 5.62557 -1.934 0.0531 .
#> vs -108.34667 63.59912 -1.704 0.0885 .
#> qsec:vs 6.72944 3.85348 1.746 0.0808 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 43.230 on 31 degrees of freedom
#> Residual deviance: 12.574 on 27 degrees of freedom
#> AIC: 22.574
#>
#> Number of Fisher Scoring iterations: 8
alpha <- 0.05
pval <- summary(fit)$coefficients[,4]
sig <- pval <= alpha
intr <- grepl(":", names(coef(fit)))
coef(fit)[sig]
#> hp
#> -0.1210429
coef(fit)[sig & intr]
#> named numeric(0)
Created on 2022-09-15 with reprex v2.0.2
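The same filtering can also be applied to the tidy output the question already mentions. A sketch (assuming the fit object and alpha from above, with dplyr loaded):
library(dplyr)
broom::tidy(fit) %>%
  filter(p.value <= alpha, grepl(":", term))  # significant interactions only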

Extract confidence interval for both values of binary variable for glm()?

I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.
The reproducible data set:
smoking_status  alcohol_drinks
1               2
0               5
1               2
0               1
1               0
1               0
0               0
1               9
1               6
1               5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
             Alcohol drinks, mean (95% CI)   p-value
Smokers      X (X - X)                       0.492
Non-smokers  X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, so a logistic regression is out of the question. Since the response is count data, I will fit a Poisson model.
To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
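To fill in the mean (95% CI) column of the requested table, note that exponentiating the coefficients of this no-intercept Poisson model gives the mean number of drinks per group, which lines up with the exponentiated profile intervals. A sketch (the group labels and column names are mine):
ci <- exp(confint(pois_fit))
data.frame(group = c("Non-smokers", "Smokers"),
           mean  = exp(coef(pois_fit)),
           lower = ci[, 1],
           upper = ci[, 2],
           row.names = NULL)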
Edit
The edit to the question states that the problem was reversed: what is asked is the effect of alcohol drinking on smoking status. With a binary response (individuals are smokers or not), a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired by this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)

How to extract the random effect in multilevel modeling using lmer in R?

For example, this is the result of a certain multilevel analysis:
MLM1<-lmer(y ~ 1 + con + ev1 + ev2 + (1 | pid),data=dat_ind)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: y ~ 1 + con + ev1 + ev2 + (1 | pid)
Data: dat_ind
REML criterion at convergence: 837
Scaled residuals:
Min 1Q Median 3Q Max
-2.57771 -0.52765 0.00076 0.54715 2.27597
Random effects:
Groups Name Variance Std.Dev.
pid (Intercept) 1.4119 1.1882
Residual 0.9405 0.9698
Number of obs: 240, groups: pid, 120
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.1727 0.1385 116.7062 1.247 0.21494
con 0.3462 0.1044 227.3108 3.317 0.00106 **
ev1 -0.3439 0.2083 116.8432 -1.651 0.10143
ev2 0.2525 0.1688 117.0168 1.495 0.13753
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) con ev1
con 0.031
ev1 0.171 -0.049
ev2 -0.423 0.065 -0.407
For example, I can extract a fixed effect like this:
summary(MLM1)[['coefficients']]['ev1','Pr(>|t|)']
How can I extract the random-effect estimates? For example, I want to extract 1.4119, 1.1882, 0.9405, and 0.9698 from:
Random effects:
Groups Name Variance Std.Dev.
pid (Intercept) 1.4119 1.1882
Residual 0.9405 0.9698
The random effects results are not coefficients, but to get the variance and standard deviation as reported in the summary output, you can use the VarCorr function.
For example,
library(lme4)
#> Loading required package: Matrix
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
summary(fm1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ Days + (Days | Subject)
#> Data: sleepstudy
#>
#> REML criterion at convergence: 1743.6
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -3.9536 -0.4634 0.0231 0.4634 5.1793
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> Subject (Intercept) 612.10 24.741
#> Days 35.07 5.922 0.07
#> Residual 654.94 25.592
#> Number of obs: 180, groups: Subject, 18
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 251.405 6.825 36.838
#> Days 10.467 1.546 6.771
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> Days -0.138
If you want the results as a table you could do:
cbind(Var = diag(VarCorr(fm1)$Subject),
stddev = attr(VarCorr(fm1)$Subject, "stddev"))
#> Var stddev
#> (Intercept) 612.10016 24.740658
#> Days 35.07171 5.922138
Obviously, you'll need pid instead of Subject in the code above - we don't have your data or model for a demo here.
Created on 2022-04-27 by the reprex package (v2.0.1)
VarCorr(MLM1)$pid is the basic object.
broom.mixed::tidy(MLM1, effects = "ran_pars") may give you a more convenient format.
library(lme4)
fm1 <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
## RE variance
v1 <- VarCorr(fm1)$Subject
s1 <- attr(VarCorr(fm1)$Subject, "stddev")
## or
s1 <- sqrt(v1)
attr(VarCorr(fm1), "sc") ## residual std dev
## or
sigma(fm1)
## square these values if you want the residual variance
Or:
broom.mixed::tidy(fm1, effects = "ran_pars") ## std devs
broom.mixed::tidy(fm1, effects = "ran_pars", scales = "vcov") ## variances
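Applied to the model in the question, the full variance table (including the residual) can also be pulled out in one step. A sketch assuming MLM1 is the fitted object from the question:
vc <- as.data.frame(VarCorr(MLM1))
vc$vcov   # variances: 1.4119, 0.9405
vc$sdcor  # standard deviations: 1.1882, 0.9698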

SAS proc glm random effects model with contrasts translated into R

My apologies for any errors; I only recently began learning SAS. I was given this SAS code (the code below is a reprex, not the exact code) that uses proc glm, presumably to fit a random effects model. Instead of using color directly, the SAS code uses contrasts on idnumber to indirectly map onto color.
I would like to know how to replicate this in R. Several attempts using lme4 for random effects and contrasts using MASS::ginv were unsuccessful, so I may need to use a package I am unfamiliar with.
I would also like to know the difference between red-blue and red-blue2 and why the output is different. Thank you for your help.
data df1;
input idnumber color $ value1;
datalines;
1001 red 189
1002 red 145
1003 red 210
1004 red 194
1005 red 127
1006 red 189
1007 blue 145
1008 red 210
1009 red 194
1010 red 127
;
proc glm data=df1;
class idnumber;
model value1=idnumber/noint solution clparm;
contrast 'red vs. blue' idnumber 1 1 1 1 1 1 -9 1 1 1;
estimate 'red-blue' idnumber 1 1 1 1 1 1 -9 1 1 1/ divisor=10;
estimate 'red-blue2' idnumber .111 .111 .111 .111 .111 .111 -.999 .111 .111 .111;
run;
Below are a few attempts at replication.
idnumber <- c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010)
color <- c('red', 'red', 'red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red')
value1 <- c(189, 145, 210, 194, 127, 189, 145, 210, 194, 127)
df1 <- data.frame(idnumber, color, value1)
library(lme4)
library(MASS)
library(tidyverse)
options(contrasts = c(factor = "contr.SAS", ordered = "contr.poly"))
# attempt 1
mod1 <- lme4::lmer(value1 ~ (1|idnumber), data = df1) # error
#> Error: number of levels of each grouping factor must be < number of observations (problems: idnumber)
# attempt 2
mod2 <- lme4::lmer(value1 ~ (1|color), data = df1) # singular
#> boundary (singular) fit: see help('isSingular')
summary(mod2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: value1 ~ (1 | color)
#> Data: df1
#>
#> REML criterion at convergence: 90.9
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3847 -0.8429 0.4816 0.6321 1.1138
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> color (Intercept) 0 0.00
#> Residual 1104 33.22
#> Number of obs: 10, groups: color, 2
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 173.00 10.51 16.47
#> optimizer (nloptwrap) convergence code: 0 (OK)
#> boundary (singular) fit: see help('isSingular')
# attempt 3
mat1 <- rbind(c(-0.5, 0.5))
cMat1 <- MASS::ginv(mat1)
mod3 <- lm(value1 ~ color, data = df1, contrasts = list(color = cMat1))
summary(mod3)
#>
#> Call:
#> lm(formula = value1 ~ color, data = df1, contrasts = list(color = cMat1))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -49.11 -23.33 12.89 17.89 33.89
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 160.56 17.74 9.052 1.78e-05 ***
#> color1 15.56 17.74 0.877 0.406
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33.65 on 8 degrees of freedom
#> Multiple R-squared: 0.08771, Adjusted R-squared: -0.02633
#> F-statistic: 0.7691 on 1 and 8 DF, p-value: 0.4061
# attempt 4
con <- c(.1, .1, .1, .1, .1, .1, -.9, .1, .1, .1)
mod4 <- lm(value1 ~ idnumber, data = df1, contrasts = list(idnumber = con)) # error, but unsure how to fix
#> Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]): contrasts apply only to factors
Created on 2022-02-08 by the reprex package (v2.0.1)
Answering the part of this that is answerable: what is going on with the two different estimates.
The estimate statement includes a list of coefficients. Those are multiplied by the cell estimates and then summed, giving the result. The reason the two estimates differ is simply that the coefficients differ: the first works out to 0.1 / -0.9 (after the divisor of 10), while the second uses 0.111 (about one ninth) / -0.999, which is effectively the same contrast with a divisor of 9 instead of 10. Hence, the math is different.
I'm also not sure about your reprex: it doesn't really make sense to use idnumber as the class variable; it seems more likely you'd use color. Is it possible this is just bad SAS code? I'm not a GLM expert, but it seems odd to use GLM with the classification variable being the ID number (assuming it's a unique ID, anyway).
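For what it's worth, the point estimate itself can be reproduced in R by fitting the same no-intercept cell-means model the SAS code specifies and applying the contrast weights by hand. A sketch; note that with one observation per idnumber there are no residual degrees of freedom, so only the point estimate (not its standard error) is meaningful:
mod <- lm(value1 ~ 0 + factor(idnumber), data = df1)
w <- c(1, 1, 1, 1, 1, 1, -9, 1, 1, 1) / 10  # the SAS 'red-blue' weights
sum(w * coef(mod))                          # the 'red-blue' point estimate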

Why does lm() with the subset argument give a different answer than subsetting in advance?

I am using lm() on a training set of data that includes a polynomial. When I subset in advance with [ ] I get different coefficients compared to using the subset argument in the lm() function call. Why?
library(ISLR2)
set.seed(1)
train <- sample(392, 196)
auto_train <- Auto[train,]
lm.fit.data <- lm(mpg ~ poly(horsepower, 2), data = auto_train)
summary(lm.fit.data)
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = auto_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.8711 -2.6655 -0.0096 2.0806 16.1063
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.8745 0.3171 75.298 < 2e-16 ***
#> poly(horsepower, 2)1 -89.3337 4.4389 -20.125 < 2e-16 ***
#> poly(horsepower, 2)2 33.2985 4.4389 7.501 2.25e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.439 on 193 degrees of freedom
#> Multiple R-squared: 0.705, Adjusted R-squared: 0.702
#> F-statistic: 230.6 on 2 and 193 DF, p-value: < 2.2e-16
lm.fit.subset <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
summary(lm.fit.subset)
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto, subset = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.8711 -2.6655 -0.0096 2.0806 16.1063
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.5496 0.3175 74.182 < 2e-16 ***
#> poly(horsepower, 2)1 -123.5881 6.4587 -19.135 < 2e-16 ***
#> poly(horsepower, 2)2 47.7189 6.3613 7.501 2.25e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.439 on 193 degrees of freedom
#> Multiple R-squared: 0.705, Adjusted R-squared: 0.702
#> F-statistic: 230.6 on 2 and 193 DF, p-value: < 2.2e-16
Created on 2021-12-26 by the reprex package (v2.0.1)
tl;dr As suggested in other comments and answers, the characteristics of the orthogonal polynomial basis are computed before the subsetting is taken into account.
To add more technical detail to @JonManes's answer, let's look at lines 545-553 of the R code where model.frame is defined.
First we have (lines 545-549)
if(is.null(attr(formula, "predvars"))) {
for (i in seq_along(varnames))
predvars[[i+1L]] <- makepredictcall(variables[[i]], vars[[i+1L]])
attr(formula, "predvars") <- predvars
}
At this point in the code, formula will not be an actual formula (that would be too easy!), but rather a terms object that contains various useful-to-developers info about model structures ...
predvars is the attribute that defines the information needed to properly reconstruct data-dependent bases like orthogonal polynomials and splines (see ?makepredictcall for a little more information; in general this stuff is really poorly documented). For example,
attr(terms(model.frame(mpg ~ poly(horsepower, 2), data = auto_train)), "predvars")
gives
list(mpg, poly(horsepower, 2, coefs = list(alpha = c(102.612244897959,
142.498828460405), norm2 = c(1, 196, 277254.530612245, 625100662.205702
))))
These are the coefficients for the polynomial, which depend on the distribution of the input data.
Only after this information has been established, on line 553, do we get
subset <- eval(substitute(subset), data, env)
In other words, the subsetting argument doesn't even get evaluated until after the polynomial characteristics are determined (all of this information is then passed to the internal C_modelframe function, which you really don't want to look at ...)
Note that this issue does not result in an information leak between training and testing sets in a statistical learning context: the parameterization of the polynomial doesn't affect the predictions of the model at all (in theory, although as usual with floating point the results are unlikely to be exactly identical). At worst (if the training and full sets were very different) it could reduce numerical stability a bit.
FWIW this is all surprising (to me) and seems worth raising on the r-devel@r-project.org mailing list (at least a note in the documentation seems in order).
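The prediction equivalence noted two paragraphs up is easy to check directly; a quick sketch using the two fits from the question, which should agree up to floating-point tolerance despite the different coefficients:
all.equal(unname(fitted(lm.fit.data)), unname(fitted(lm.fit.subset)))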
In your second call it looks like poly() is computed before subsetting. Compare the outputs of model.frame() below:
# first call
model.frame(mpg ~ poly(horsepower, 2), data = auto_train)[1:5,]
#> mpg poly(horsepower, 2).1 poly(horsepower, 2).2
#> 326 44.3 -0.1037171808 0.1498371034
#> 169 23.0 -0.0372467155 -0.0099055358
#> 131 26.0 -0.0429441840 -0.0000530004
#> 301 23.9 -0.0239526225 -0.0300950106
#> 272 23.2 0.0045347198 -0.0601592336
# second call
model.frame(mpg ~ poly(horsepower, 2), data = Auto, subset = train)[1:5,]
#> mpg poly(horsepower, 2).1 poly(horsepower, 2).2
#> 326 44.3 -0.0741931315 0.1133792778
#> 169 23.0 -0.0282078693 -0.0034299423
#> 131 26.0 -0.0321494632 0.0039029222
#> 301 23.9 -0.0190108168 -0.0185862638
#> 272 23.2 0.0006971527 -0.0418538153
# same as
model.frame(mpg ~ poly(horsepower, 2), data = Auto)[train,][1:5,]
#> mpg poly(horsepower, 2).1 poly(horsepower, 2).2
#> 326 44.3 -0.0741931315 0.1133792778
#> 169 23.0 -0.0282078693 -0.0034299423
#> 131 26.0 -0.0321494632 0.0039029222
#> 301 23.9 -0.0190108168 -0.0185862638
#> 272 23.2 0.0006971527 -0.0418538153
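If matching coefficients across the two calls matters, one workaround is a raw (data-independent) polynomial basis, which sidesteps the predvars machinery entirely; a sketch:
coef(lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto, subset = train))
coef(lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto[train, ]))
# both calls now return identical coefficients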
