I'm actually running into a problem with the T pipe (%T>%). I'm trying to do three things in the same chain:
Fit my GLM
Save it in a variable
Print its summary
So I'm trying the following syntax:
my_variable <- data %>%
glm(Response ~ Variables, family) %T>%
summary
Which does not work as expected. The glm gets fitted, but the summary won't show, so I'm forced to break it into two chains:
my_variable <- data %>%
glm(Response ~ Variables, family)
my_variable %>% summary
So I'm thinking: either I didn't get the functionality of the T-pipe, or it's not properly coded and messes with the summary function.
Because if I try:
my_variable <- data %>%
glm(Response ~ Variables, family) %T>%
plot
it works well.
Any ideas?
When you just type summary(something) in the console, print is called implicitly. That's not the case inside your pipe, so you need to call print explicitly.
Because the tee of %T>% only applies to the one operation that follows it, you'll have to compose print and summary:
my_variable <- data %>%
glm(Response ~ Variables, family) %T>%
{print(summary(.))}
You need the curly braces and the dot, otherwise the glm output would be passed as the first argument to print.
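For instance, a minimal self-contained sketch of that pattern, using mtcars and a placeholder formula/family in place of your own:
library(magrittr)
# fit the GLM, keep the model in my_variable, and print its summary in one chain
my_variable <- mtcars %>%
  glm(am ~ wt + hp, family = binomial, data = .) %T>%
  {print(summary(.))}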
I don't see why you would need %T>% here. If you want to force printing, just use regular pipes and add print() to your pipe. Keep in mind, however, that with this approach you store the summary in my_variable, not the model itself.
library(magrittr)
my_variable <- my_data %>%
glm(counts ~ outcome + treatment, family = poisson(), data = .) %>%
summary() %>%
print()
#>
#> Call:
#> glm(formula = counts ~ outcome + treatment, family = poisson(),
#> data = .)
#>
#> Deviance Residuals:
#> 1 2 3 4 5 6 7
#> -0.67125 0.96272 -0.16965 -0.21999 -0.95552 1.04939 0.84715
#> 8 9
#> -0.09167 -0.96656
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.045e+00 1.709e-01 17.815 <2e-16 ***
#> outcome2 -4.543e-01 2.022e-01 -2.247 0.0246 *
#> outcome3 -2.930e-01 1.927e-01 -1.520 0.1285
#> treatment2 1.338e-15 2.000e-01 0.000 1.0000
#> treatment3 1.421e-15 2.000e-01 0.000 1.0000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 10.5814 on 8 degrees of freedom
#> Residual deviance: 5.1291 on 4 degrees of freedom
#> AIC: 56.761
#>
#> Number of Fisher Scoring iterations: 4
Data
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
my_data <- data.frame(treatment, outcome, counts)
Related
I want to perform the following task using the fastfood dataset from the openintro package in R.
a) Create a regression predicting whether or not a restaurant is McDonalds or Subway
based on calories, sodium, and protein. (McDonalds should be 1, Subway 0).
Save the coefficients to Q2.
b) use data from only restaurants with between 50 and 60 items in the
data set. Predict total fat from cholesterol, total carbs, vitamin a, and restaurant.
Remove any nonsignificant predictors and run again.
Assign the strongest standardized regression coefficient to Q5.
Here's my code.
library(tidyverse)
library(openintro)
library(lm.beta)
fastfood <- openintro::fastfood
head(fastfood)
#Solving for part (a)
fit_1 <- lm(I(restaurant %in% c("Subway", "Mcdonalds")) ~ calories + sodium + protein, data = fastfood)
Q2 <- round(summary(fit_1)$coefficients,2)
#Solving for part (b)
newdata <- fastfood[ which(fastfood$item>=50 & fastfood$item <= 60), ]
df = sort(sample(nrow(newdata), nrow(data)*.7))
newdata_train<-data[df,]
newdata_test<-data[-df,]
fit_5 <- lm(I(total_fat) ~ cholesterol + total_carb + vit_a + restaurant, data = newdata)
prediction_5 <- predict(fit_5, newdata = newdata_test)
Q5 <- lm.beta(fit_5)
But I'm not getting the desired results.
Here is the desired output:
output for part (a):
output for part (b):
The first question requires logistic regression rather than linear regression, since the aim is to predict a binary outcome. The most sensible way to do this is, as the question suggests, to remove all the restaurants except McDonald's and Subway, then create a new binary variable to mark which rows are McDonald's and which aren't:
library(dplyr)
fastfood <- openintro::fastfood %>%
filter(restaurant %in% c("Mcdonalds", "Subway")) %>%
mutate(is_mcdonalds = restaurant == "Mcdonalds")
The logistic regression is done like this:
fit_1 <- glm(is_mcdonalds ~ calories + sodium + protein,
family = "binomial", data = fastfood)
And your coefficients are obtained like this:
Q2 <- round(coef(fit_1), 2)
Q2
#> (Intercept) calories sodium protein
#> -1.24 0.00 0.00 0.06
The second question requires that you filter out any restaurants with more than 60 or fewer than 50 items:
fastfood <- openintro::fastfood %>%
group_by(restaurant) %>%
filter(n() >= 50 & n() <= 60)
We now fit the described regression and examine it to look for non-significant regressors:
fit_2 <- lm(total_fat ~ cholesterol + vit_a + total_carb + restaurant,
data = fastfood)
summary(fit_2)
#>
#> Call:
#> lm(formula = total_fat ~ cholesterol + vit_a + total_carb + restaurant,
#> data = fastfood)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -24.8280 -2.9417 0.9397 5.1450 21.0494
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.20102 2.08029 -0.577 0.564751
#> cholesterol 0.26932 0.01129 23.853 < 2e-16 ***
#> vit_a 0.01159 0.01655 0.701 0.484895
#> total_carb 0.16327 0.03317 4.922 2.64e-06 ***
#> restaurantMcdonalds -4.90272 1.94071 -2.526 0.012778 *
#> restaurantSonic 6.43353 1.89014 3.404 0.000894 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.611 on 125 degrees of freedom
#> (34 observations deleted due to missingness)
#> Multiple R-squared: 0.8776, Adjusted R-squared: 0.8727
#> F-statistic: 179.2 on 5 and 125 DF, p-value: < 2.2e-16
We note that vit_a is non-significant and drop it from our model:
fit_3 <- update(fit_2, . ~ . - vit_a)
Now we get the standardized coefficients and round them:
coefs <- round(coef(lm.beta::lm.beta(fit_3)), 2)
and Q5 will be the maximum value of these coefficients:
Q5 <- coefs[which.max(coefs)]
Q5
#> cholesterol
#> 0.82
Created on 2022-02-26 by the reprex package (v2.0.1)
I am trying to write a function that would take a linear regression model and a design matrix X and return the marginal effects of our coefficients.
Marginal effects are given in the regression's summary output. However, if we have interactions present in our model, we have to aggregate the coefficients.
I have prepared a small example in R:
library(tidyverse)
# Design matrix for regression with interaction ---------------------------
X <- model.matrix(hp ~ factor(gear) + disp + mpg + wt + wt:factor(gear), data = mtcars)
y <- mtcars$hp
lm(y ~ X - 1, data = mtcars) %>%
summary()
# Model Interpretation ----------------------------------------------------
# Our reference category is 'Gear3'.
# Meaning that X(Intercept) is the intercept for Gear3.
# If I want an intercept for Gear4 and Gear5, the coefs Gear4 and Gear5 are deviations from the reference category.
# So the intercept for Gear4 is: X(Intercept) + Xfactor(gear)4 = 166.0265 + 30.2619 = 196.2884
# The same idea holds for Gear5: X(Intercept) + Xfactor(gear)5 = 166.0265 - 58.2724 = 107.7541
# Now if we want to interpret wt as a marginal effect, again we need to take into account our created interactions.
# Xwt is the marginal effect of wt for the reference category (Gear3).
# If we want the marginal effect for Gear4 cars we need: Xwt + Xfactor(gear)4:wt = -6.4985 - 9.8580 = -16.3565
# The same idea for Gear5: Xwt + Xfactor(gear)5:wt = -6.4985 + 49.6390 = 43.1405
I would like to write a function that would take the model object and the design matrix and return a data frame with our aggregated marginal effects:
In the case given, I want the function to return a data frame like this:
# X(Intercept)gear3: 166.0265
# X(Intercept)gear4: 196.2884
# X(Intercept)gear5: 107.7541
#
# Xwtgear3:-6.4985
# Xwtgear4:-16.3565
# Xwtgear5:43.1405
Where the marginal effects for the interactions are already precalculated/aggregated for the user.
So far I have only been able to come up with an approach that does not generalize well enough.
Therefore, I would like to ask for any ideas on how to write such a function using either dplyr or data.table.
I am not sure if you are just interested in the output or in the way of doing it yourself.
Regarding the former, you can use the {interactions} package, which will give you your desired output. Regarding the latter, you could have a look at how the {interactions} package calculates this output under the hood.
library(dplyr)
library(interactions)
mtcars2 <- mtcars %>% mutate(gear = factor(gear))
fit <- lm(hp ~ wt * gear + disp + mpg, data = mtcars2)
slopes <- sim_slopes(fit, pred = wt, modx = gear, centered = "none")
#> Warning: Johnson-Neyman intervals are not available for factor moderators.
slopes$ints
#> Value of gear Est. S.E. 2.5% 97.5% t val. p
#> 1 5 107.7540 103.74667 -106.36856 321.8766 1.038626 0.30932987
#> 2 4 196.2884 100.91523 -11.99042 404.5672 1.945082 0.06357154
#> 3 3 166.0265 71.71678 18.01029 314.0426 2.315029 0.02947996
slopes$slopes
#> Value of gear Est. S.E. 2.5% 97.5% t val. p
#> 1 5 43.140490 26.69021 -11.94539 98.22637 1.6163415 0.1190910
#> 2 4 -16.356572 20.22409 -58.09705 25.38390 -0.8087667 0.4265945
#> 3 3 -6.498535 15.37599 -38.23302 25.23595 -0.4226417 0.6763194
Created on 2021-01-03 by the reprex package (v0.3.0)
If you want both results in one data.frame you could easily bind them together:
bind_rows(mutate(slopes$ints, term = "(Intercept)"),
mutate(slopes$slopes, term = "(Interaction)")) %>%
select(term, "Value of gear", Est., p)
#> term Value of gear Est. p
#> 1 (Intercept) 5 107.754033 0.30932987
#> 2 (Intercept) 4 196.288380 0.06357154
#> 3 (Intercept) 3 166.026451 0.02947996
#> 4 (Interaction) 5 43.140490 0.11909097
#> 5 (Interaction) 4 -16.356572 0.42659451
#> 6 (Interaction) 3 -6.498535 0.67631941
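If you would rather do the aggregation yourself (e.g. as a basis for a dplyr or data.table function), a rough sketch using the fit object from above is to add the reference-level coefficient to each deviation/interaction coefficient; the coefficient names assumed here (gear4, gear5, wt:gear4, wt:gear5) are the ones lm() produces for this particular model:
cf <- coef(fit)
# intercepts: reference level (gear 3) plus each factor-level deviation
intercepts <- c(gear3 = unname(cf["(Intercept)"]),
                gear4 = unname(cf["(Intercept)"] + cf["gear4"]),
                gear5 = unname(cf["(Intercept)"] + cf["gear5"]))
# wt slopes: reference slope plus each interaction term
wt_slopes <- c(gear3 = unname(cf["wt"]),
               gear4 = unname(cf["wt"] + cf["wt:gear4"]),
               gear5 = unname(cf["wt"] + cf["wt:gear5"]))
data.frame(gear = c(3, 4, 5), intercept = intercepts, wt_slope = wt_slopes)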
I intend to run instrumental variable regressions with fixed effects using the fixest package's feols function. However, I am having issues with the syntax for specifying an estimation without further exogenous controls.
Consider the following example:
# Load package
require("fixest")
# Load data
df <- airquality
I would like to do something like the following, i.e. explain the outcome via the instrumented endogenous variable and fixed effects:
feols(Temp | Month + Day | Ozone ~ Wind, df)
This, however, produces an error:
The dependent variable is a constant. Estimation cannot be done.
It only works, when I add further exogenous covariates (as in the documentation's examples):
feols(Temp ~ Solar.R | Month + Day | Ozone ~ Wind, df)
How do I fix this? How do I run the estimation without further controls, such as Solar.R in this case?
Note: I post this on Stack Overflow rather than Cross Validated because the question relates to a coding syntax issue, and not to the econometric techniques underlying the estimations.
Actually, there seems to be a misunderstanding about how to write the formula.
The syntax is: Dep_var ~ Exo_vars | Fixed-effects | Endo_vars ~ Instruments.
The parts Fixed-effects and Endo_vars ~ Instruments are optional. The Exo_vars part, on the other hand, must always be there, even if it contains only the intercept.
Knowing that, the following works:
base = iris
names(base) = c("y", "x1", "x_endo", "x_inst", "fe")
feols(y ~ 1 | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Standard-errors: Standard
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.345900 0.08096 53.679 < 2.2e-16 ***
#> fit_x_endo 0.398477 0.01964 20.289 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.404769 Adj. R2: 0.757834
#> F-test (1st stage): stat = 1,882.45 , p < 2.2e-16 , on 1 and 148 DoF.
#> Wu-Hausman: stat = 3.9663, p = 0.048272, on 1 and 147 DoF.
# Same with fixed-effect
feols(y ~ 1 | fe | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Fixed-effects: fe: 3
#> Standard-errors: Clustered (fe)
#> Estimate Std. Error t value Pr(>|t|)
#> fit_x_endo 0.900061 0.117798 7.6407 0.016701 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.333489 Adj. R2: 0.833363
#> Within R2: 0.57177
#> F-test (1st stage): stat = 44.77 , p = 4.409e-10, on 1 and 146 DoF.
#> Wu-Hausman: stat = 0.001472, p = 0.969447 , on 1 and 145 DoF.
Getting back to the original example:
feols(Temp | Month + Day | Ozone ~ Wind, df) means that the dependent variable will be Temp | Month + Day | Ozone, with the pipe here meaning logical OR, which evaluates to 1 for all observations. Hence the error message.
To fix it and obtain an appropriate behavior, use feols(Temp ~ 1 | Month + Day | Ozone ~ Wind, df).
I've coded up an estimator in R and tried to follow standard R conventions. It goes something like this:
model <- myEstimator(y ~ x1 + x2, data = df)
model has the usual stuff: coefficients, standard errors, p-values, etc.
Now I want model to play nicely with the R ecosystem for summarizing models, such as summary(), sjPlot::plot_model() or stargazer::stargazer(), the way you might call summary(lm_model) where lm_model is an lm object.
How do I achieve this? Is there a standard protocol? Define a custom S3 class? Or just coerce model to an existing class like lm?
Create an S3 class and implement summary (and similar) methods.
myEstimator <- function(formula, data) {
result <- list(
coefficients = 1:3,
residuals = 1:3
)
class(result) <- "myEstimator"
result
}
model <- myEstimator(y ~ x1 + x2, data = df)
Functions like summary will just call summary.default.
summary(model)
#> Length Class Mode
#> coefficients 3 -none- numeric
#> residuals 3 -none- numeric
If you wish to have your own summary function, implement summary.myEstimator.
summary.myEstimator <- function(object, ...) {
value <- paste0(
"Coefficients: ", paste0(object$coefficients, collapse = ", "),
"; Residuals: ", paste0(object$residuals, collapse = ", ")
)
value
}
summary(model)
#> [1] "Coefficients: 1, 2, 3; Residuals: 1, 2, 3"
If your estimator is very similar to lm (your model is-a lm), then just add your class to the lm class.
myEstimatorLm <- function(formula, data) {
result <- lm(formula, data)
# Some customisation
result$coefficients <- pmax(result$coefficients, 1)
class(result) <- c("myEstimatorLm", class(result))
result
}
model_lm <- myEstimatorLm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
class(model_lm)
#> [1] "myEstimator" "lm"
Now, summary.lm will be used.
summary(model_lm)
#> Call:
#> lm(formula = formula, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.25582 -0.46922 -0.05741 0.45530 1.75599
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00000 0.56344 1.775 0.078 .
#> Sepal.Length 1.77559 0.06441 27.569 < 2e-16 ***
#> Sepal.Width 1.00000 0.12236 8.173 1.28e-13 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.6465 on 147 degrees of freedom
#> Multiple R-squared: 0.8677, Adjusted R-squared: 0.8659
#> F-statistic: 482 on 2 and 147 DF, p-value: < 2.2e-16
You can still implement summary.myEstimatorLm:
summary.myEstimatorLm <- summary.myEstimator
summary(model_lm)
#> [1] "Coefficients: 1, 1.77559254648113, 1; Residuals: ...
Say I have a data set for which I'd like to create an lm for each combination of variables A and B, where A has two values, 'a' and 'b', and B has three values, 1, 2, 3. This leaves me with six possible combinations of A and B.
That said, I would like to create six (6) models. For example, the first model would use the data subsetted to A = a and B = 1.
In SAS, for example, the code would be as follows (please note the by statement):
proc glm data = mydate;
by A B;
class Cat1 Cat2;
model Y = X + Cat1 + Cat2;
run;
The by statement will generate one model for each combination of A and B.
This is really just a split-apply step:
split the data into chunks
smydate <- split(mydate, list(A = mydate$A, B = mydate$B))
Each component of smydate represents the data for a particular combination of A and B. You may need to add drop = TRUE to the split call if your data doesn't have all combinations of the levels of A and B.
apply the lm() function over the components of the list smydate
lmFun <- function(dat) {
lm(y ~ x + cat1 + cat2, data = dat)
}
models <- lapply(smydate, lmFun)
Now you have a list, models, where each component contains a lm object for the particular combination of A and B.
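From there you can, for instance, pull out all the summaries or coefficient tables in one step:
summaries <- lapply(models, summary)
coefs <- lapply(models, coef)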
An example (based on the one shown by rawr in the comments) is:
models <- lapply(split(mtcars, list(mtcars$am, mtcars$gear), drop = TRUE),
function(x) {lm(mpg ~ wt + disp, data = x)})
str(models)
models
which gives:
> str(models, max = 1)
List of 4
$ 0.3:List of 12
..- attr(*, "class")= chr "lm"
$ 0.4:List of 12
..- attr(*, "class")= chr "lm"
$ 1.4:List of 12
..- attr(*, "class")= chr "lm"
$ 1.5:List of 12
..- attr(*, "class")= chr "lm"
> models
$`0.3`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
27.994610 -2.384834 -0.007983
$`0.4`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
219.1047 -106.8075 0.9953
$`1.4`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
43.27860 -3.03114 -0.09481
$`1.5`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
41.779042 -7.230952 -0.006731
As rawr notes in the comments, you can do this in fewer steps using by(), or any one of a number of other higher-level functions in say the plyr package, but doing things by hand at least once illustrates the generality of the approach; you can always use the short cuts once you are familiar with the general idea.
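For instance, a by() version of the mtcars example above (a sketch equivalent to the split/lapply approach) would be:
models_by <- by(mtcars, list(mtcars$am, mtcars$gear),
                function(d) lm(mpg ~ wt + disp, data = d))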
Using group_by in the dplyr package will run an analysis for each subgroup combination. Using the mtcars dataset:
library(dplyr)
res <- mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .))
res$mod
Will give you the list of lm objects.
Other packages will make this more elegant. You could do this in-line with the magrittr package and go straight to the list of lm objects:
library(magrittr)
mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .)) %>%
use_series(mod)
Or use the broom package to extract coefficient values from the lm objects:
library(broom)
mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .)) %>%
glance(mod)
Source: local data frame [4 x 13]
Groups: am, gear
am gear r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 0 3 0.6223489 0.5594070 2.2379851 9.887679 0.00290098 3 -31.694140 71.38828 74.22048 60.102926 12
2 0 4 0.9653343 0.8960028 0.9899495 13.923469 0.18618733 3 -2.862760 13.72552 11.27070 0.980000 1
3 1 4 0.7849464 0.6989249 2.9709337 9.125006 0.02144702 3 -18.182504 44.36501 44.68277 44.132234 5
4 1 5 0.9827679 0.9655358 1.2362092 57.031169 0.01723212 3 -5.864214 19.72843 18.16618 3.056426 2
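Note that glance() returns model-level statistics (R-squared, AIC, and so on); for the coefficient estimates themselves you could swap in tidy() from the same package (a sketch continuing the example above):
mtcars %>%
  group_by(am, gear) %>%
  do(tidy(lm(mpg ~ wt + disp, data = .)))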
More specifically, you can use lmList to fit linear models to categories, after using @bjoseph's strategy of generating an interaction variable:
mydate <- transform(mydate, ABcat=interaction(A,B,drop=TRUE))
library("lme4") ## or library("nlme")
lmList(Y ~ X + Cat1 + Cat2 | ABcat, data = mydate)
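The per-group coefficients can then be collected into a single table with coef() (a brief sketch reusing the fit above):
fits <- lmList(Y ~ X + Cat1 + Cat2 | ABcat, data = mydate)
coef(fits)  # one row of coefficients per A/B combination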
You could try several different things.
Let's say our data is:
structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), x = c(1, 2, 3, 4), y = c(2, 2, 2, 2)), .Names = c("A", "B", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
x
#>   A B x y
#> 1 A A 1 2
#> 2 A B 2 2
#> 3 B A 3 2
#> 4 B B 4 2
by()
This returns a list-type object. Notice that it doesn't return results in the order we might have expected. It's trying to keep the second factor as stable as possible when iterating. You could adjust this by using list(x$B,x$A)
by(x[c("x","y")],list(x$A,x$B),function(x){x[1]*x[2]})
[1] 2
-------------------------------------------------------------------------------------
[1] 6
-------------------------------------------------------------------------------------
[1] 4
-------------------------------------------------------------------------------------
[1] 8
expand.grid()
This is a simple for loop where we pre-generate the combinations of interest, subset the data in the loop, and perform the function of interest. expand.grid() can be slow with large sets of combinations, and for loops aren't necessarily fast, but you have a lot of control in the middle.
combinations = expand.grid(levels(x$A),levels(x$B))
for(i in 1:nrow(combinations)){
d = x[x$A==combinations[i,1] & x$B==combinations[i,2],c("x","y")]
print(d[1]*d[2])
}
#>   x
#> 1 2
#>   x
#> 3 6
#>   x
#> 2 4
#>   x
#> 4 8
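Adapted to the original model-fitting question, the loop body would fit and store an lm instead of printing a product (a sketch; mydate, Y, X, Cat1 and Cat2 are the names from the question):
combinations <- expand.grid(A = levels(mydate$A), B = levels(mydate$B))
models <- vector("list", nrow(combinations))
for (i in seq_len(nrow(combinations))) {
  d <- mydate[mydate$A == combinations$A[i] & mydate$B == combinations$B[i], ]
  models[[i]] <- lm(Y ~ X + Cat1 + Cat2, data = d)
}
names(models) <- paste(combinations$A, combinations$B, sep = ".")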
If you want the fit/predictions instead of summary stats (t-tests, etc.), it's easier to fit an interaction model of Y~(A:B)*(X + Cat1 + Cat2) - 1 - X - Cat1 - Cat2; by subtracting out the main effects, R will reparameterize and place all the variance on the interactions. Here's an example:
> mtcars <- within(mtcars, {cyl = as.factor(cyl); am=as.factor(am)})
> model <- lm(mpg~(cyl:am)*(hp+wt)-1-hp-wt, mtcars)
> summary(model)
Call:
lm(formula = mpg ~ (cyl:am) * (hp + wt) - 1 - hp - wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-2.6685 -0.9071 0.0000 0.7705 4.1879
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
cyl4:am0 2.165e+01 2.252e+01 0.961 0.3517
cyl6:am0 6.340e+01 4.245e+01 1.494 0.1560
cyl8:am0 2.746e+01 5.000e+00 5.492 6.20e-05 ***
cyl4:am1 4.725e+01 5.144e+00 9.184 1.51e-07 ***
cyl6:am1 2.320e+01 3.808e+01 0.609 0.5515
cyl8:am1 1.877e+01 1.501e+01 1.251 0.2302
cyl4:am0:hp -4.635e-02 1.107e-01 -0.419 0.6815
cyl6:am0:hp 7.425e-03 1.650e-01 0.045 0.9647
cyl8:am0:hp -2.110e-02 2.531e-02 -0.834 0.4175
cyl4:am1:hp -7.288e-02 4.457e-02 -1.635 0.1228
cyl6:am1:hp -2.000e-02 4.733e-02 -0.423 0.6786
cyl8:am1:hp -1.127e-02 4.977e-02 -0.226 0.8240
cyl4:am0:wt 1.762e+00 5.341e+00 0.330 0.7460
cyl6:am0:wt -1.332e+01 1.303e+01 -1.022 0.3231
cyl8:am0:wt -2.025e+00 1.099e+00 -1.843 0.0851 .
cyl4:am1:wt -6.465e+00 2.467e+00 -2.621 0.0193 *
cyl6:am1:wt -4.926e-15 1.386e+01 0.000 1.0000
cyl8:am1:wt NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.499 on 15 degrees of freedom
Multiple R-squared: 0.9933, Adjusted R-squared: 0.9858
F-statistic: 131.4 on 17 and 15 DF, p-value: 3.045e-13
Compare with the cyl4:am1 submodel:
> summary(lm(mpg~wt+hp, mtcars, subset=cyl=='4' & am=='1'))
Call:
lm(formula = mpg ~ wt + hp, data = mtcars, subset = cyl == "4" &
am == "1")
Residuals:
Datsun 710 Fiat 128 Honda Civic Toyota Corolla Fiat X1-9 Porsche 914-2
-2.66851 4.18787 -2.61455 3.25523 -2.62538 -0.77799
Lotus Europa Volvo 142E
1.17181 0.07154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.24552 6.57304 7.188 0.000811 ***
wt -6.46508 3.15205 -2.051 0.095512 .
hp -0.07288 0.05695 -1.280 0.256814
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.193 on 5 degrees of freedom
Multiple R-squared: 0.6378, Adjusted R-squared: 0.493
F-statistic: 4.403 on 2 and 5 DF, p-value: 0.07893
The estimates of the coefficients are exactly the same, and the standard errors are higher/more conservative here, because s is being estimated only from the subset rather than pooling across all the models. Pooling may or may not be an appropriate assumption for your use case, statistically.
It's also much easier to get predictions: predict(model, X) vs having to split-apply-combine again.