Let's say I have a numeric data matrix with columns w, x, y, and z, and I also want to add columns equivalent to w*x, w*y, w*z, x*y, x*z, and y*z, since I want my covariate matrix to include all pairwise interactions.
Is there a clean and effective way to do this?
If you mean in a model formula, then the ^ operator does this.
## dummy data
set.seed(1)
dat <- data.frame(Y = rnorm(10), x = rnorm(10), y = rnorm(10), z = rnorm(10))
The formula is
form <- Y ~ (x + y + z)^2
which gives (using model.matrix(), which is used internally by the standard model-fitting functions)
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
If you don't know how many variables you have, or it is tedious to write them all out, you can use the . notation instead:
R> form <- Y ~ .^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
The "power" in the ^ operator, here 2, controls the order of interactions. With ^2 we get second order interactions of all pairs of variables considered by the ^ operator. If you want up to 3rd-order interactions, then use ^3.
R> form <- Y ~ .^3
R> head(model.matrix(form, data = dat))
(Intercept) x y z x:y x:z y:z x:y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.05403 1.24860 1.887604
2 1 0.38984 0.78214 -0.10279 0.304911 -0.04007 -0.08039 -0.031341
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.24084 0.02891 -0.017958
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.11916 0.10704 -0.237055
5 1 1.12493 0.61983 -1.37706 0.697261 -1.54910 -0.85354 -0.960170
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.01865 0.02329 -0.001047
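If what you actually want is the augmented data itself, not just a formula for a model fit, you can build the same expansion with model.matrix() and bind the product columns back on. A minimal sketch, assuming the covariates are the x, y, z columns of dat above:
X <- model.matrix(~ .^2, data = dat[, c("x", "y", "z")])[, -1]  # drop the intercept column
## keep only the pairwise product columns and attach them to the data
dat2 <- cbind(dat, X[, grepl(":", colnames(X)), drop = FALSE])
head(dat2)
The x:y style column names survive because cbind() on a data frame does not re-check names.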
If you are doing a regression, you can just do something like
reg <- lm(w ~ (x + y + z)^2)
and it will figure things out for you. For example,
lm(Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2, iris)
# Call:
# lm(formula = Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2,
# data = iris)
#
# Coefficients:
# (Intercept) Sepal.Length Sepal.Width
# -1.05768 0.07628 0.22983
# Petal.Length Sepal.Length:Sepal.Width Sepal.Length:Petal.Length
# 0.47586 -0.03863 -0.03083
# Sepal.Width:Petal.Length
# 0.06493
Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?
Here is an example:
library(fixest)
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
# lm(formula = y ~ x1 + x2, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.82680 -0.41503 0.05499 0.67185 0.97830
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7494 0.2885 2.598 0.0317 *
# x1 2.3905 0.3157 7.571 6.48e-05 ***
# x2 NA NA NA NA
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared: 0.8775, Adjusted R-squared: 0.8622
# F-statistic: 57.33 on 1 and 8 DF, p-value: 6.476e-05
Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:
model.matrix(fit_lm)
# (Intercept) x1 x2
#1 1 1.41175158 1.41175158
#2 1 0.06164133 0.06164133
#3 1 0.09285047 0.09285047
#4 1 -0.63202909 -0.63202909
#5 1 0.25189850 0.25189850
#6 1 -0.18553830 -0.18553830
#7 1 0.65630180 0.65630180
#8 1 -1.77536852 -1.77536852
#9 1 -0.30571009 -0.30571009
#10 1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
The model.matrix() method from fixest, by contrast, allows you to drop x2:
fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept) x1
# [1,] 1 1.41175158
# [2,] 1 0.06164133
# [3,] 1 0.09285047
# [4,] 1 -0.63202909
# [5,] 1 0.25189850
# [6,] 1 -0.18553830
# [7,] 1 0.65630180
# [8,] 1 -1.77536852
# [9,] 1 -0.30571009
# [10,] 1 -1.47296229
Is there a way to drop x2 when calling model.matrix.lm()?
So long as the overhead of running the linear model is not too high, you could write a little function like the one below to do it:
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
model.matrix2 <- function(model){
  # keep only the columns whose coefficients are not NA (i.e. not aliased)
  bn <- names(na.omit(coef(model)))
  X <- model.matrix(model)
  X[, colnames(X) %in% bn, drop = FALSE]
}
model.matrix2(fit_lm)
#> (Intercept) x1
#> 1 1 -0.04654473
#> 2 1 2.14473751
#> 3 1 0.02688125
#> 4 1 0.95071038
#> 5 1 -1.41621259
#> 6 1 1.47840480
#> 7 1 0.56580182
#> 8 1 0.14480401
#> 9 1 -0.02404072
#> 10 1 -0.14393258
In the code above, model.matrix2() is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.
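Equivalently, since for an lm fit the coefficient vector lines up one-to-one with the model matrix columns (aliased terms show up as NA), you can subset directly without a helper. A one-line sketch:
## keep only the columns whose coefficient was estimated (not NA)
model.matrix(fit_lm)[, !is.na(coef(fit_lm)), drop = FALSE]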
I simulated some data that I wanted to split into a list of data frames based on id, but it seems that the split() function is not working properly.
set.seed(323)
#simulate some data
tsfunc2 <- function () {
x1 = rnorm(25, mean = 3, sd = 1)
x2.sample = rnorm(5, mean = 2, sd = 0.5)
x2 = rep(x2.sample, each = 5)
mu = rnorm(25, mean = 10, sd = 2)
y=as.numeric(mu + x1 + x2)
data.frame(id=rep(1:5, each=5), time=1:5, x1=x1, x2=x2, y=y)
}
set.seed(63)
#create a dataset in which the simulated data are randomly sampled in order to create unbalanced panel data
fd <- function() {
df <- tsfunc2()[sample(nrow(tsfunc2()), 20), ]
ds <- df[with(df, order(id, time)),]
return(ds)
}
set.seed(124)
split(fd(), fd()$id) #it seems that data are not properly split based on id (e.g., the first row of id2)
$`1`
id time x1 x2 y
1 1 1 1.614929 1.900059 13.43994
3 1 3 2.236970 1.900059 14.49136
4 1 4 3.212306 1.900059 15.08736
$`2`
id time x1 x2 y
5 1 5 4.425538 1.900059 15.53696 #this row is supposed to be in id1
7 2 2 3.700229 2.027456 17.48522
8 2 3 2.770645 2.027456 15.20741
9 2 4 3.197094 2.027456 13.44979
$`3`
id time x1 x2 y
12 3 2 1.576201 1.658917 16.40684
13 3 3 2.594909 1.658917 14.34763
14 3 4 3.995387 1.658917 16.36730
15 3 5 3.958818 1.658917 15.37498
$`4`
id time x1 x2 y
16 4 1 3.918088 1.636148 15.48205
17 4 2 2.849030 1.636148 12.52288
18 4 3 1.776931 1.636148 12.54456
19 4 4 2.131176 1.636148 13.63235
20 4 5 1.957515 1.636148 15.55745
$`5`
id time x1 x2 y
21 5 1 1.896362 1.569048 12.54131
22 5 2 3.444185 1.569048 14.56303
23 5 3 2.795049 1.569048 12.67120
25 5 5 2.868678 1.569048 13.88765
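Note that split() is almost certainly behaving correctly here: split(fd(), fd()$id) calls fd() twice, and each call draws a fresh random sample (the RNG state advances between the two calls), so the data frame being split and the id vector used for grouping come from two different samples. (fd() itself also calls tsfunc2() twice, which burns extra random draws.) Calling fd() once and reusing the result lines everything up:
set.seed(124)
d <- fd()       # draw the random sample once
split(d, d$id)  # the ids now come from the same draw as the rows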
I have a list of dataframes (and parameters for sensitivity analyses for a study), and I want to mutate each dataframe in the same way. The expected output is generated by the code below (a new column x2). Is there a way to assign the resulting dataframes (newdfs) to the list without using a for loop?
models <- list(m1 = list('params' = list('start'='2014-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(4,5,6))),
m2 = list('params' = list('start'='2017-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(7,8,9))))
newdfs <- lapply(models, function(z) {
  z$data$x2 <- z$data$x1 + 1
  z$data
})
# Can I do this without "for"?
for(x in 1:length(models)) models[[x]]$data <- newdfs[[x]]
You can try this:
newdfs <- lapply(models, function(z) {
  z$data$x2 <- z$data$x1 + 1
  return(z)
})
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10
Revise the function in lapply() to return z instead of z$data:
lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1 ; z})
For completeness, here are two purrr solutions:
library(purrr)
map() + map_at()
map(models, map_at, "data", transform, x2 = x1 + 1)
transpose() + map()
models %>%
transpose %>%
`[[<-`(., "data", map(.$data, transform, x2 = x1 + 1)) %>%
transpose
Output
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10
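Going back to the question's own newdfs (the list of plain data frames), Map() also lets you write them into models without a for loop. A small sketch:
## pair each model with its updated data frame and write it back
models <- Map(function(m, d) { m$data <- d; m }, models, newdfs)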
I would like to specify a regression in R that estimates separate coefficients on x depending on whether a third variable, z, is greater than 0. For example:
y ~ a + x*1(z>0) + x*1(z<=0)
What is the correct way to do this in R using formulas?
The ":" (colon) operator is used to construct conditional interactions (when used with disjoint predictors constructed with I). Should be used with predict
> y=rnorm(10)
> x=rnorm(10)
> z=rnorm(10)
> mod <- lm(y ~ x:I(z>0) )
> mod
Call:
lm(formula = y ~ x:I(z > 0))
Coefficients:
(Intercept) x:I(z > 0)FALSE x:I(z > 0)TRUE
-0.009983 -0.203004 -0.655941
> predict(mod, newdata=data.frame(x=1:10, z=c(-1, 1)) )
1 2 3 4 5 6 7
-0.2129879 -1.3218653 -0.6189968 -2.6337471 -1.0250057 -3.9456289 -1.4310147
8 9 10
-5.2575108 -1.8370236 -6.5693926
> plot(1:10, predict(mod, newdata=data.frame(x=1:10, z=c(-1)) ) )
> lines(1:10, predict(mod, newdata=data.frame(x=1:10, z=c(1)) ) )
It might help to look at the model matrix: each row carries x in exactly one of the two interaction columns, depending on the sign of z.
> model.matrix(mod)
(Intercept) x:I(z > 0)FALSE x:I(z > 0)TRUE
1 1 -0.2866252 0.00000000
2 1 0.0000000 -0.03197743
3 1 -0.7427334 0.00000000
4 1 2.0852202 0.00000000
5 1 0.8548904 0.00000000
6 1 0.0000000 1.00044600
7 1 0.0000000 -1.18411791
8 1 0.0000000 -1.54110256
9 1 0.0000000 -0.21173300
10 1 0.0000000 0.17035257
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$`I(z > 0)`
[1] "contr.treatment"
y <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
z <- sample(x = -10:10, size = length(y), replace = TRUE)
x <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
a <- rnorm(n = length(x))
lm(y ~ a + I(x * 1 * I(z > 0)) + I(x * 1 * I(z <= 0)))
But I think using the : operator as in DWin's solution is more elegant.
Edit
lm(y ~ a + I(x * 1 * I(z > 0)) + I(x * 1 * I(z <= 0)))
Call:
lm(formula = y ~ a + I(x * 1 * I(z > 0)) + I(x * 1 * I(z <= 0)))
Coefficients:
(Intercept) a I(x * 1 * I(z > 0)) I(x * 1 * I(z <= 0))
6.5775 -0.1345 -0.3352 -0.3366
> lm(formula = y ~ a+ x:I(z > 0))
Call:
lm(formula = y ~ a + x:I(z > 0))
Coefficients:
(Intercept) a x:I(z > 0)FALSE x:I(z > 0)TRUE
6.5775 -0.1345 -0.3366 -0.3352
Apart from the ordering of the two slope terms, the coefficients are identical, so the two formulations agree.
Data:
Y X levels
y1 x1 2
...
lm(Y~X,I(levels==1))
Does I(levels==1) mean "restricted to rows where levels == 1"? If not, how can I run the regression of Y on X only for the rows where levels equals 1?
Have a look at lmList() from the nlme package:
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)
library(nlme)
models <- lmList(y ~ x|levels, data = dataset)
The output is a list of lm models, one per level:
models
Call:
Model: y ~ x | levels
Data: dataset
Coefficients:
(Intercept) x
1 4.964104 1.227478
2 10.085231 2.158683
Degrees of freedom: 100 total; 96 residual
Residual standard error: 1.019202
Here is the summary of the first model:
summary(models[[1]])
Call:
lm(formula = form, data = dat, na.action = na.action)
Residuals:
Min 1Q Median 3Q Max
-2.16569 -1.04457 -0.00318 0.78667 2.65927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9641 0.1617 30.703 < 2e-16 ***
x 1.2275 0.1469 8.354 6.47e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.128 on 48 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.584
F-statistic: 69.78 on 1 and 48 DF, p-value: 6.469e-11
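If you would rather not depend on nlme, the same per-level fits can be built in base R by combining split() and lapply(). A minimal sketch:
## one lm per level, base R only
fits <- lapply(split(dataset, dataset$levels), function(d) lm(y ~ x, data = d))
lapply(fits, coef)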
lm() has a subset parameter; here is an example.
x <- rnorm(100)
y <- rnorm(100, sd=0.1)
y[1:50] <- y[1:50] + 3*x[1:50] + 10 # line y = 3x+10
y[51:100] <- y[51:100] + 8*x[51:100] - 5 # line y = 8x-5
levels <- rep(1:2, each=50, len=100)
data <- data.frame(x = x, y = y, levels = levels)
lm(y ~ x, data=data, subset=levels==1) # regression for the first part
Coefficients:
(Intercept)            x
     10.015        2.996
lm(y ~ x, data=data, subset=levels==2) # second part
Coefficients:
(Intercept)            x
     -4.986        8.000
You are passing I(levels==1) positionally to subset inside lm(). (This works when data is supplied by name; in lm(Y~X, I(levels==1)) as written, with no data argument, the unnamed second argument would be matched to data instead.)
I was not sure, but the code below seems to suggest that this is correct: the fit with I(level==1) in the subset position matches a fit on only the level == 1 rows.
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, I(level == 1), data = my.data2)  # I(level==1) is matched to the subset argument
my.data <- "x y level
1 2 1
3 4 1
5 5 1
7 7 1
9 10 1"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, data = my.data2)
Both calls return the same coefficients, which confirms that I(level==1) is acting as a subset.