Aggregate coeffcients from linear regression with interactions [r] - r

I am trying to write a function that would take linear regression model, Design matrix X and would return marginal effects of our coefficients.
Marginal effects are given in regression's summary output. However, if we have interactions present in our model we have to aggregate the coefficients.
I have prepared a small example in R:
library(tidyverse)
# Desing Matrix for regression with interaction ---------------------------
X <- model.matrix(hp ~ factor(gear) + disp + mpg + wt + wt:factor(gear), data = mtcars)
y <- mtcars$hp
lm(y ~ X - 1, data = mtcars) %>%
summary()
# Model Interpretation ----------------------------------------------------
# Our reference category is 'Gear3'.
# Meaning that X(Intercept) is an intercept for Gear3
# If I want an intercept for Gear4 and Gear5 coefs Gear4 and Gear5 are deviations from refrence cathegory
# So Intercept for Gear4 is: X(Intercept) + Xfactor(gear)4 = 166.0265 + 30.2619 = 196.2884
# Same Idea hold for Gear5: X(Intercept) + Xfactor(gear)5 = 166.0265 -58.2724 = 107.7541
# Now if we want to interpret wt as a marginal affect, again we need to take into account our created interactions.
# Xwt is marginal effect of wt for the regrence cathegory (Gear3)
# If we want marginal Affect for (Gear4 cars) we need: Xwt + Xfactor(gear)4:wt = -6.4985 -9.8580 = -16.3565
# SAme Idea for gear5: Xwt + Xfactor(gear)5:wt = -6.4985 + 49.6390 = 43.1405
I would like to write a function that would take model object and desing matrix and would return a dataframe with our aggregated marginal effects:
In case given I want a function to return a dataframe:
# X(Intercept)gear3: 166.0265
# X(Intercept)gear4: 196.2884
# X(Intercept)gear5: 107.7541
#
# Xwtgear3:-6.4985
# Xwtgear4:-16.3565
# Xwtgear5:43.1405
Where the marginals effects for interactions are already precalculated/aggregated for user.
So far I was able to create some concept that is not generalize enought.
Therefore, I would like to ask for any ideas how to write function using either dplyr or datatable.

I am not sure if you are just interested in the output or the way of doing it yourself.
Regarding the former, you can use the {interactions} package, which will give you your desired output. Regarding the latter, you could have a look at how the {interactions} packages calculates this output under the hood.
library(dplyr)
library(interactions)
mtcars2 <- mtcars %>% mutate(gear = factor(gear))
fit <- lm(hp ~ wt * gear + disp + mpg, data = mtcars2)
slopes <- sim_slopes(fit, pred = wt, modx = gear, centered = "none")
#> Warning: Johnson-Neyman intervals are not available for factor moderators.
slopes$ints
#> Value of gear Est. S.E. 2.5% 97.5% t val. p
#> 1 5 107.7540 103.74667 -106.36856 321.8766 1.038626 0.30932987
#> 2 4 196.2884 100.91523 -11.99042 404.5672 1.945082 0.06357154
#> 3 3 166.0265 71.71678 18.01029 314.0426 2.315029 0.02947996
slopes$slopes
#> Value of gear Est. S.E. 2.5% 97.5% t val. p
#> 1 5 43.140490 26.69021 -11.94539 98.22637 1.6163415 0.1190910
#> 2 4 -16.356572 20.22409 -58.09705 25.38390 -0.8087667 0.4265945
#> 3 3 -6.498535 15.37599 -38.23302 25.23595 -0.4226417 0.6763194
Created on 2021-01-03 by the reprex package (v0.3.0)
If you want both results in one data.frame you could easily bind them together:
bind_rows(mutate(slopes$ints, term = "(Intercept)"),
mutate(slopes$slopes, term = "(Interaction)")) %>%
select(term, "Value of gear", Est., p)
#> term Value of gear Est. p
#> 1 (Intercept) 5 107.754033 0.30932987
#> 2 (Intercept) 4 196.288380 0.06357154
#> 3 (Intercept) 3 166.026451 0.02947996
#> 4 (Interaction) 5 43.140490 0.11909097
#> 5 (Interaction) 4 -16.356572 0.42659451
#> 6 (Interaction) 3 -6.498535 0.67631941

Related

Getting summary statistics from model using permutation values in R

I am trying to obtain the summary statistics (summary()) for the linear model (below), which uses 1000 permutations of the original dataset to create a 1000 random dataset (large matrix).
random_model <- rep(NA,1000)
for (i in c(1:1000)) {
random_data <- final_data
random_data$weighted_degree <- rowSums(node.perm_1000[i,,],na.rm=T)
random_model[i] <- coef(lm(weighted_degree ~ age + sex + age*sex, data=random_data))
}
I am not simply trying to compare the models to get an overall p-value but I want to get a t-value for each of the variables in the model that uses the random permutations as well.
Try with tidy() from broom package. It returns the expected values like this (example):
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.53 0.479 13.6 6.47e-28
2 iris$Sepal.Width -0.223 0.155 -1.44 1.52e- 1
In your case, the previous output will be stored for each element in the list of the loop according to your definition:
library(broom)
#Data
random_model <- rep(NA,1000)
#Loop
for (i in c(1:1000)) {
random_data <- final_data
random_data$weighted_degree <- rowSums(node.perm_1000[i,,],na.rm=T)
random_model[i] <- broom::tidy(lm(weighted_degree ~ age + sex + age*sex, data=random_data))
}
You should store the results of interest (estimated coefficients and t-values) in a list.
Here is a reproducible example using 10 replications on the mtcars dataset which is sampled at 50% rate for each replication.
The results of interest are retrieved using the $coefficients attribute of the summary() output on the lm object.
# The data
data(mtcars)
# Define sample size of each replication
N <- nrow(mtcars)
sample_size <- floor(N/2)
# Number of replications (model fits) and initialization of the list to store the results
set.seed(1717)
replications <- 10
random_model <- vector( "list", length=replications )
for (i in seq_along(random_model)) {
shuffle = sample(N, sample_size)
mtcars_shuffle = mtcars[shuffle, ]
random_model[[i]] <- summary(lm(mpg ~ cyl + disp + cyl*disp, data=mtcars_shuffle))$coefficients
}
For example, the model fitted for replications 1 and 10 are:
> random_model[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.26285335 8.219065181 5.872061 7.573836e-05
cyl -3.33999161 1.366231326 -2.444675 3.089262e-02
disp -0.12941685 0.063269362 -2.045490 6.337414e-02
cyl:disp 0.01394436 0.007877833 1.770076 1.020931e-01
> random_model[[10]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.27312267 7.662593317 7.082866 1.277746e-05
cyl -4.40545653 1.586392001 -2.777029 1.674235e-02
disp -0.15330770 0.047932153 -3.198431 7.654790e-03
cyl:disp 0.01792561 0.006707396 2.672514 2.031615e-02

R - passing object to function based on the NAME of the object

Suppose in R I have multiple GLM objects from multiple glm() function calls.
glm_01
glm_02
...
glm_nn
...and suppose that I want to do all possible pairwise comparisons using a chi-squared or F ANOVA test.
anova(glm_01, glm_02, test = "F")
anova(glm_01, glm_03, test = "F")
anova(glm_01, glm_04, test = "F")
...
I don't want to do this manually because the list of models is quite long. Instead I'd like to grab a list of relevant model objects (anything starting with "glm_") and do all pairwise comparisons automatically. However I'm unsure how to pass the model objects (rather than their names in string form) to the anova() function.
As a simple example:
data(mtcars)
# create some models
glm_01 <- glm(mpg ~ cyl , mtcars, family = gaussian())
glm_02 <- glm(mpg ~ cyl + disp , mtcars, family = gaussian())
glm_03 <- glm(mpg ~ cyl + disp + hp , mtcars, family = gaussian())
glm_04 <- glm(mpg ~ cyl + disp + hp + wt, mtcars, family = gaussian())
# get list of relevant model objects from the R environment
model_list <- ls()
model_list <- model_list[substr(model_list, 1, 4) == "glm_"]
# create a table to store the pairwise ANOVA results
n_models <- length(model_list)
anova_table <- matrix(0, nrow = n_models, ncol = n_models)
# loop through twice and do pairwise comparisons
for(row_index in 1:n_models) {
for(col_index in 1:n_models) {
anova_table[row_index, col_index] <- anova(model_list[row_index], model_list[col_index], test = "F")$'Pr(>F)'[2]
}
}
...but of course this loop at the end doesn't work because I'm not passing model objects to anova(), I'm passing the names of the objects as strings instead. How do I tell anova() to use the object that the string refers to, instead of the string itself?
Thank you.
======================
Possible solution:
data(mtcars)
glm_list <- list()
glm_list$glm_01 <- glm(mpg ~ cyl , mtcars, family = gaussian())
glm_list$glm_02 <- glm(mpg ~ cyl + disp , mtcars, family = gaussian())
glm_list$glm_03 <- glm(mpg ~ cyl + disp + hp , mtcars, family = gaussian())
glm_list$glm_04 <- glm(mpg ~ cyl + disp + hp + wt, mtcars, family = gaussian())
# create a table to store the pairwise ANOVA results
n_models <- length(glm_list)
anova_table <- matrix(0, nrow = n_models, ncol = n_models)
# loop through twice and do pairwise comparisons
row_idx <- 0
col_idx <- 0
for(row_glm in glm_list)
{
row_idx <- row_idx + 1
for(col_glm in glm_list)
{
col_idx <- col_idx + 1
anova_table[row_idx, col_idx] <- anova(row_glm, col_glm, test = "F")$'Pr(>F)'[2]
}
col_idx <- 0
}
row_idx <- 0
The easiest way to do this would be to keep all your models in a list. This makes it simple to iterate over them. For example, you can create all of your models and do a pairwise comparison between all of them like this:
data(mtcars)
f_list <- list(mpg ~ cyl,
mpg ~ cyl + disp,
mpg ~ cyl + disp + hp,
mpg ~ cyl + disp + hp + wt)
all_glms <- lapply(f_list, glm, data = mtcars, family = gaussian)
all_pairs <- as.data.frame(combn(length(all_glms), 2))
result <- lapply(all_pairs, function(i) anova(all_glms[[i[1]]], all_glms[[i[2]]]))
Which gives you:
result
#> $V1
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 29 270.74 1 37.594
#>
#> $V2
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp + hp
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 28 261.37 2 46.965
#>
#> $V3
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 27 170.44 3 137.89
#>
#> $V4
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp
#> Model 2: mpg ~ cyl + disp + hp
#> Resid. Df Resid. Dev Df Deviance
#> 1 29 270.74
#> 2 28 261.37 1 9.3709
#>
#> $V5
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 29 270.74
#> 2 27 170.44 2 100.3
#>
#> $V6
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp + hp
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 28 261.37
#> 2 27 170.44 1 90.925
Created on 2020-08-25 by the reprex package (v0.3.0)
If you want to reference arbitrary objects in an accessible environment by symbol without putting them into a list object, the standard way to return the top object on the search list whose symbol is equal to a string is get(), or the vector equivalent mget(). I.e. get("glm_01") gets you the top object on the search list that has the symbol glm_01. The most minimal modification to your approach would be to wrap your calls to model_list[row_index] and model_list[col_index] in get().
You can be more precise about where to look for objects by assigning the models in a named environment and only getting from that environment (using the envir parameter to get()).

How to write a for loop which creates a model and has a function which references that same model

I am trying to run a post hoc analysis on an unbalanced two way anova using the anova_test funciton in the rstatix package. I need to run this post hoc test iteratively, as I have ~26 response (y) variables. My first step is to create models of all my y variables with relation to group and treatment. I have successfully managed to do this, creating a single list with 26 models:
models <- map(data[,y1:y26], ~(lm(.x ~data$group*data$treatment)))
Now comes the part I'm stuck on. Referring to these models iteratively. I would like to run the following code for every y variable I have:
group_by(group) %>%
anova_test(y ~ treatment, error = models(y), type = 3)
where my y changes every time and as it does, the "model" (referred to in the error = term) is updated accordingly. I'm struggling with this bit since first set of models I make is used to inform the second set of models.
However, if I run just one y variable through this whole bit of code at one time, I get the appropriate results.
model <- lm(y ~ group*treatment, data = data)
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
I have tried creating a for loop as well as using the map function in the purrr package but I have been unsuccessful. I am new to for loops and purrr so I am sure it's a simple fix I just can't see it.
Basically I want a way to run
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
iteratively for different y variables (y1, y2, ..., y26) while also referring to the approprite model (model$y1, model$y2, ..., model$26).
Thanks for your help!
Well you didn't give any data so let's use toothgrowth. You seem to like the model format, so let's build a list of models. You could do this in an automated fashion but to make it clear lets do it by hand. The call purrr::map with the anova_test function. You'll get a list back. Since you're in charge of naming the list elements go to town.
Updated answer May 18th. Now using map2 since you want two different models passed build a list for each...
library(rstatix)
library(purrr)
ToothGrowth$len2 <- ToothGrowth$len^2 # for variety
models <- list(model1 = lm(len ~ supp*dose, ToothGrowth),
model2 = lm(len ~ dose*supp, ToothGrowth),
model3 = lm(len2 ~ dose*supp, ToothGrowth),
model4 = lm(len2 ~ supp*dose, ToothGrowth))
models2 <- list(model1 = lm(len ~ supp, ToothGrowth),
model2 = lm(len ~ dose, ToothGrowth),
model3 = lm(len2 ~ dose, ToothGrowth),
model4 = lm(len2 ~ supp, ToothGrowth))
# one model
purrr::map(models, ~ anova_test(.x, type = 3))
# now with model for error term
purrr::map2(models, models2, ~ anova_test(.x, error = .y, type = 3))
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> $model1
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 4.058 0.049000 * 0.065
#> 2 dose 1 58 12.717 0.000734 * 0.180
#> 3 supp:dose 1 58 1.588 0.213000 0.027
#>
#> $model2
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 33.626 2.92e-07 * 0.367
#> 2 supp 1 58 10.729 2.00e-03 * 0.156
#> 3 dose:supp 1 58 4.200 4.50e-02 * 0.068
#>
#> $model3
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 36.028 1.35e-07 * 0.383
#> 2 supp 1 58 7.128 1.00e-02 * 0.109
#> 3 dose:supp 1 58 2.709 1.05e-01 0.045
#>
#> $model4
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 2.684 0.107000 0.044
#> 2 dose 1 58 13.566 0.000508 * 0.190
#> 3 supp:dose 1 58 1.020 0.317000 0.017
Thanks to Nirgrahamuk from the rstudio community forum for this answer:
map(names(models_1) ,
~ anova_test(data=group_by(df,edge),
formula = as.formula(paste0(.x,"~ trt")),
error = models_1[[.x]],
type = 3))
(see their full answer at: https://community.rstudio.com/t/trouble-using-group-by-and-map2-together/66730/8?u=mvula)
Created on 2020-05-20 by the reprex package (v0.3.0)

map() model output to a dataframe

I've been using map() to calculate and extract certain statistics from multiple lm() models.
To give a reproducible example, using the mtcars dataset, I start with an input vector of formulae to be estimated using lm() models:
library(tidyverse)
df <- mtcars
input_char <- c("mpg ~ disp",
"mpg ~ disp + hp")
input_formula <- map(input_char, formula)
I've then got a function that calculates and extracts the relevant statistics for each model. For simplicity and reproducibility, here's a simplified function that just extracts the R-squared of the model.
get_rsquared <- function(a_formula) {
model1 <- lm(a_formula, data = df)
rsquared <- summary(model1)$r.squared
c(model = a_formula, rsquared = rsquared)
}
I've then used map to iterate through the formulae and extract the R-squared from each model.
models <- map(input_formula, get_rsquared)
models
which gives the output:
[[1]]
[[1]]$model
mpg ~ disp
<environment: 0x7f98987f4000>
[[1]]$rsquared
[1] 0.7183433
[[2]]
[[2]]$model
mpg ~ disp + hp
<environment: 0x7f98987f4000>
[[2]]$rsquared
[1] 0.7482402
My question is regarding the output being a list.
Is there a simple way to make the output a dataframe?
My desired output is:
#> model rsquared
#> 1 mpg ~ disp 0.7183433
#> 2 mpg ~ disp + hp 0.7482402
Keep the formulas as character strings and use as.formula() as part of the the get_rsquared() function as it's easier to work with them as character strings than formula objects.
library(purrr)
library(dplyr)
df <- mtcars
input_char <- c("mpg ~ disp",
"mpg ~ disp + hp")
get_rsquared <- function(a_formula) {
model1 <- lm(as.formula(a_formula), data = df)
rsquared <- summary(model1)$r.squared
list(model = a_formula, rsquared = rsquared)
}
map_df(input_char, get_rsquared)
# A tibble: 2 x 2
model rsquared
<chr> <dbl>
1 mpg ~ disp 0.718
2 mpg ~ disp + hp 0.748

In R, when creating a model, is there an equivalent to the by statement in SAS?

Say I have a data set that I'd like to create a lm, for each combination of variable A and B. Where A has two values: 'a' and 'b', and B has three values: 1,2,3. This leaving me with six possible combinations of variables A and B.
This said, I would like to create six (6) models. In example the first model would have the data subsetted where A = a and B = 1.
In SAS, in example, the code would be as follows (please note the by statement):
proc glm data = mydate;
by A B;
class Cat1 Cat2;
model Y = X + Cat1 + Cat2;
run;
The by statement will generate one model for combination of A and B.
This is really just a split-apply step:
split the data into chunks
smydate <- split(mydate, list(A = A, B = B))
Each component of smydate represents the data for a particular combination of A and B. You may need to add drop = TRUE to the split call if your data doesn't have all combinations of the levels of A and B.
apply the lm() function over the components of the list smydate
lmFun <- function(dat) {
lm(y ~ x + cat1 + cat2, data = dat)
}
models <- lapply(smydate, lmFun)
Now you have a list, models, where each component contains a lm object for the particular combination of A and B.
An example (based on the one shown by rawr in the comments is:
models <- lapply(split(mtcars, list(mtcars$am, mtcars$gear), drop = TRUE),
function(x) {lm(mpg ~ wt + disp, data = x)})
str(models)
models
which gives:
> str(models, max = 1)
List of 4
$ 0.3:List of 12
..- attr(*, "class")= chr "lm"
$ 0.4:List of 12
..- attr(*, "class")= chr "lm"
$ 1.4:List of 12
..- attr(*, "class")= chr "lm"
$ 1.5:List of 12
..- attr(*, "class")= chr "lm"
> models
$`0.3`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
27.994610 -2.384834 -0.007983
$`0.4`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
219.1047 -106.8075 0.9953
$`1.4`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
43.27860 -3.03114 -0.09481
$`1.5`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
41.779042 -7.230952 -0.006731
As rawr notes in the comments, you can do this in fewer steps using by(), or any one of a number of other higher-level functions in say the plyr package, but doing things by hand at least once illustrates the generality of the approach; you can always use the short cuts once you are familiar with the general idea.
Using group_by in the dplyr package will run an analysis for each subgroup combination. Using the mtcars dataset:
library(dplyr)
res <- mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .))
res$mod
Will give you the list of lm objects.
Other packages will make this more elegant. You could do this in-line with the magrittr package and go straight to the list of lm objects:
library(magrittr)
mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .)) %>%
use_series(mod)
Or use the broom package to extract coefficient values from the lm objects:
library(broom)
mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .)) %>%
glance(mod)
Source: local data frame [4 x 13]
Groups: am, gear
am gear r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 0 3 0.6223489 0.5594070 2.2379851 9.887679 0.00290098 3 -31.694140 71.38828 74.22048 60.102926 12
2 0 4 0.9653343 0.8960028 0.9899495 13.923469 0.18618733 3 -2.862760 13.72552 11.27070 0.980000 1
3 1 4 0.7849464 0.6989249 2.9709337 9.125006 0.02144702 3 -18.182504 44.36501 44.68277 44.132234 5
4 1 5 0.9827679 0.9655358 1.2362092 57.031169 0.01723212 3 -5.864214 19.72843 18.16618 3.056426 2
More specifically, you can use lmList to fit linear models to categories, after using #bjoseph's strategy of generating an interaction variable:
mydate <- transform(mydate, ABcat=interaction(A,B,drop=TRUE))
library("lme4") ## or library("nlme")
lmList(Y~X+Cat1+Cat2|ABcat,mydate)
You could try several different things.
Let's say our data is:
structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), x = c(1, 2, 3, 4), y = c(2, 2, 2, 2)), .Names = c("A", "B", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
x
#> A B x y
1 A A 1 2
2 A B 2 2
3 B A 3 2
4 B B 4 2
by()
This returns a list-type object. Notice that it doesn't return results in the order we might have expected. It's trying to keep the second factor as stable as possible when iterating. You could adjust this by using list(x$B,x$A)
by(x[c("x","y")],list(x$A,x$B),function(x){x[1]*x[2]})
[1] 2
-------------------------------------------------------------------------------------
[1] 6
-------------------------------------------------------------------------------------
[1] 4
-------------------------------------------------------------------------------------
[1] 8
expand.grid()
This is a simple for loop where we pre-generated the combinations of interest, subset the data in the loop and perform the function of interest. expand.grid() can be slow with large sets of combinations and for loops aren't necessarily fast but you have a lot of control in the middle.
combinations = expand.grid(levels(x$A),levels(x$B))
for(i in 1:nrow(combinations)){
d = x[x$A==combinations[i,1] & x$B==combinations[i,2],c("x","y")]
print(d[1]*d[2])
}
#> x
1 2
x
3 6
x
2 4
x
4 8
If you want the fit/predictions instead of summary stats(t-tests, etc), it's easier to fit an interaction model of Y~(A:B)*(X + Cat1 + Cat2) - 1 - X - Cat1 - Cat2; by subtracting out the main effects, R will reparameterize and place all the variance on the interactions. Here's an example:
> mtcars <- within(mtcars, {cyl = as.factor(cyl); am=as.factor(am)})
> model <- lm(mpg~(cyl:am)*(hp+wt)-1-hp-wt, mtcars)
> summary(model)
Call:
lm(formula = mpg ~ (cyl:am) * (hp + wt) - 1 - hp - wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-2.6685 -0.9071 0.0000 0.7705 4.1879
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
cyl4:am0 2.165e+01 2.252e+01 0.961 0.3517
cyl6:am0 6.340e+01 4.245e+01 1.494 0.1560
cyl8:am0 2.746e+01 5.000e+00 5.492 6.20e-05 ***
cyl4:am1 4.725e+01 5.144e+00 9.184 1.51e-07 ***
cyl6:am1 2.320e+01 3.808e+01 0.609 0.5515
cyl8:am1 1.877e+01 1.501e+01 1.251 0.2302
cyl4:am0:hp -4.635e-02 1.107e-01 -0.419 0.6815
cyl6:am0:hp 7.425e-03 1.650e-01 0.045 0.9647
cyl8:am0:hp -2.110e-02 2.531e-02 -0.834 0.4175
cyl4:am1:hp -7.288e-02 4.457e-02 -1.635 0.1228
cyl6:am1:hp -2.000e-02 4.733e-02 -0.423 0.6786
cyl8:am1:hp -1.127e-02 4.977e-02 -0.226 0.8240
cyl4:am0:wt 1.762e+00 5.341e+00 0.330 0.7460
cyl6:am0:wt -1.332e+01 1.303e+01 -1.022 0.3231
cyl8:am0:wt -2.025e+00 1.099e+00 -1.843 0.0851 .
cyl4:am1:wt -6.465e+00 2.467e+00 -2.621 0.0193 *
cyl6:am1:wt -4.926e-15 1.386e+01 0.000 1.0000
cyl8:am1:wt NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.499 on 15 degrees of freedom
Multiple R-squared: 0.9933, Adjusted R-squared: 0.9858
F-statistic: 131.4 on 17 and 15 DF, p-value: 3.045e-13
compare with a cyl4:am1 submodel:
> summary(lm(mpg~wt+hp, mtcars, subset=cyl=='4' & am=='1'))
Call:
lm(formula = mpg ~ wt + hp, data = mtcars, subset = cyl == "4" &
am == "1")
Residuals:
Datsun 710 Fiat 128 Honda Civic Toyota Corolla Fiat X1-9 Porsche 914-2
-2.66851 4.18787 -2.61455 3.25523 -2.62538 -0.77799
Lotus Europa Volvo 142E
1.17181 0.07154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.24552 6.57304 7.188 0.000811 ***
wt -6.46508 3.15205 -2.051 0.095512 .
hp -0.07288 0.05695 -1.280 0.256814
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.193 on 5 degrees of freedom
Multiple R-squared: 0.6378, Adjusted R-squared: 0.493
F-statistic: 4.403 on 2 and 5 DF, p-value: 0.07893
The estimates of the coefficients are exactly the same, and the standard errors are higher/more conservative here, because s is being estimated only from the subset rather than pooling across all the models. Pooling may or may not be an appropriate assumption for your use case, statistically.
It's also much easier to get predictions: predict(model, X) vs having to split-apply-combine again.

Resources