Specifying a regression in R with an indicator variable

I would like to specify a regression in R that estimates coefficients on x conditional on a third variable, z, being greater than 0. For example:
y ~ a + x*1(z>0) + x*1(z<=0)
What is the correct way to do this in R using formulas?

The ":" (colon) operator is used to construct conditional interactions (when used with disjoint predictors constructed with I). Should be used with predict
> y=rnorm(10)
> x=rnorm(10)
> z=rnorm(10)
> mod <- lm(y ~ x:I(z>0) )
> mod
Call:
lm(formula = y ~ x:I(z > 0))
Coefficients:
(Intercept) x:I(z > 0)FALSE x:I(z > 0)TRUE
-0.009983 -0.203004 -0.655941
> predict(mod, newdata=data.frame(x=1:10, z=c(-1, 1)) )
1 2 3 4 5 6 7
-0.2129879 -1.3218653 -0.6189968 -2.6337471 -1.0250057 -3.9456289 -1.4310147
8 9 10
-5.2575108 -1.8370236 -6.5693926
> plot(1:10, predict(mod, newdata=data.frame(x=1:10, z=c(-1)) ) )
> lines(1:10, predict(mod, newdata=data.frame(x=1:10, z=c(1)) ) )
Might help to look at its model matrix:
> model.matrix(mod)
(Intercept) x:I(z > 0)FALSE x:I(z > 0)TRUE
1 1 -0.2866252 0.00000000
2 1 0.0000000 -0.03197743
3 1 -0.7427334 0.00000000
4 1 2.0852202 0.00000000
5 1 0.8548904 0.00000000
6 1 0.0000000 1.00044600
7 1 0.0000000 -1.18411791
8 1 0.0000000 -1.54110256
9 1 0.0000000 -0.21173300
10 1 0.0000000 0.17035257
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$`I(z > 0)`
[1] "contr.treatment"

y <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
z <- sample(x=-10:10, size=length(y), replace=TRUE)
x <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
a <- rnorm(n=length(x))
lm(y~a+I(x*1*I(z>0))+ I(x*1*I(z<=0)))
But I think using the : operator as in DWin's solution is more elegant.
Edit
lm(y~a+I(x*1*I(z>0))+ I(x*1*I(z<=0)))
Call:
lm(formula = y ~ a + I(x * 1 * I(z > 0)) + I(x * 1 * I(z <= 0)))
Coefficients:
(Intercept) a I(x * 1 * I(z > 0)) I(x * 1 * I(z <= 0))
6.5775 -0.1345 -0.3352 -0.3366
> lm(formula = y ~ a+ x:I(z > 0))
Call:
lm(formula = y ~ a + x:I(z > 0))
Coefficients:
(Intercept) a x:I(z > 0)FALSE x:I(z > 0)TRUE
6.5775 -0.1345 -0.3366 -0.3352
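A quick check (a sketch reusing the simulated y, a, x, z from above) confirms the two parameterisations describe the same model and differ only in how the columns are labelled, so their fitted values agree:
fit1 <- lm(y ~ a + x:I(z > 0))
fit2 <- lm(y ~ a + I(x * (z > 0)) + I(x * (z <= 0)))
# Same column space, so identical fits; only the coefficient labels differ
all.equal(fitted(fit1), fitted(fit2))
# [1] TRUE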

Related

R model.matrix drop multicollinear variables

Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?
Here is an example:
library(fixest)
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
# lm(formula = y ~ x1 + x2, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.82680 -0.41503 0.05499 0.67185 0.97830
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7494 0.2885 2.598 0.0317 *
# x1 2.3905 0.3157 7.571 6.48e-05 ***
# x2 NA NA NA NA
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared: 0.8775, Adjusted R-squared: 0.8622
# F-statistic: 57.33 on 1 and 8 DF, p-value: 6.476e-05
Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:
model.matrix(fit_lm)
# (Intercept) x1 x2
#1 1 1.41175158 1.41175158
#2 1 0.06164133 0.06164133
#3 1 0.09285047 0.09285047
#4 1 -0.63202909 -0.63202909
#5 1 0.25189850 0.25189850
#6 1 -0.18553830 -0.18553830
#7 1 0.65630180 0.65630180
#8 1 -1.77536852 -1.77536852
#9 1 -0.30571009 -0.30571009
#10 1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
The model.matrix() method from fixest, by contrast, can drop x2:
fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept) x1
# [1,] 1 1.41175158
# [2,] 1 0.06164133
# [3,] 1 0.09285047
# [4,] 1 -0.63202909
# [5,] 1 0.25189850
# [6,] 1 -0.18553830
# [7,] 1 0.65630180
# [8,] 1 -1.77536852
# [9,] 1 -0.30571009
# [10,] 1 -1.47296229
Is there a way to drop x2 when calling model.matrix.lm()?
So long as the overhead from running the linear model is not too high, you could write a little function like the one here to do it:
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
model.matrix2 <- function(model){
  bn <- names(na.omit(coef(model)))
  X <- model.matrix(model)
  X[, colnames(X) %in% bn]
}
model.matrix2(fit_lm)
#> (Intercept) x1
#> 1 1 -0.04654473
#> 2 1 2.14473751
#> 3 1 0.02688125
#> 4 1 0.95071038
#> 5 1 -1.41621259
#> 6 1 1.47840480
#> 7 1 0.56580182
#> 8 1 0.14480401
#> 9 1 -0.02404072
#> 10 1 -0.14393258
Created on 2022-05-02 by the reprex package (v2.0.1)
In the code above, model.matrix2() is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.
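If the cost of refitting with lm() is a concern, a small alternative sketch is to detect the rank-deficient columns yourself with a pivoted QR decomposition of the model matrix (the same device lm() uses internally); drop_collinear() here is a hypothetical helper, not an existing function:
# Hypothetical helper: keep only the columns of the model matrix that
# a pivoted QR decomposition identifies as linearly independent.
drop_collinear <- function(formula, data) {
  X <- model.matrix(formula, data)
  qr_X <- qr(X)                            # pivoted QR, as used by lm.fit()
  keep <- qr_X$pivot[seq_len(qr_X$rank)]   # columns spanning the column space
  X[, sort(keep), drop = FALSE]
}
drop_collinear(y ~ x1 + x2, df)
# returns the (Intercept) and x1 columns; the duplicated x2 is dropped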

lm(y~poly(x1, x2,x3, degree=2, raw=TRUE), data)

Is
lm(y~poly(x1, x2,x3, degree=2, raw=TRUE), data)
equal to
lm(y~x1 + x2 + x3 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2 , data)
?
If yes, why do we need to set raw=TRUE?
You can test this yourself easily:
DF <- data.frame(x1 = 1:2, x2 = 3:4, x3 = 5:6)
with(DF, poly(x1, x2, x3, degree = 2, raw = TRUE))
# 1.0.0 2.0.0 0.1.0 1.1.0 0.2.0 0.0.1 1.0.1 0.1.1 0.0.2
#[1,] 1 1 3 3 9 5 5 15 25
#[2,] 2 4 4 8 16 6 12 24 36
#attr(,"degree")
#[1] 1 2 1 2 2 1 2 2 2
#attr(,"class")
#[1] "poly" "matrix"
The column names show the degree each variable has in the corresponding product term. E.g., 1.1.0 means x1^1 * x2^1 * x3^0.
Of course, you see this also in the output of the regression model.
You need raw = TRUE if you want the coefficients to correspond to raw polynomials, i.e., alpha0 + alpha11 * x1^1 + alpha12 * x1^2 + .... If you don't need that, you should not set raw = TRUE because orthogonal polynomials have some desirable properties for regression analysis.
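A short sketch (with made-up data) shows why raw = TRUE only changes the parameterisation: raw and orthogonal polynomials span the same column space, so the fitted values agree even though the coefficients differ:
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
d$y <- with(d, 1 + x1 + x2^2 + x1 * x3 + rnorm(20))
fit_raw  <- lm(y ~ poly(x1, x2, x3, degree = 2, raw = TRUE), data = d)
fit_orth <- lm(y ~ poly(x1, x2, x3, degree = 2), data = d)
# Same fitted values, different coefficient parameterisation
all.equal(fitted(fit_raw), fitted(fit_orth))
# [1] TRUE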

quadratic optimization in R with both equality and inequality constraints

I'm trying to find how to solve quadratic problem in R with both equality and inequality constraints as well as with upper and lower bounds:
min  0.5 * x' * H * x + f' * x
subject to:  A * x   <= b
             Aeq * x  = beq
             LB <= x <= UB
I've checked the 'quadprog' and 'kernlab' packages but ... I must be missing something, as I have no idea how to specify both 'A' and 'Aeq' for solve.QP() or ipop().
Here's a working example:
library('quadprog')
# min
# -8 x1 -16 x2 + x1^2 + 4 x2^2
#
# s.t.
#
# x1 + 2 x2 == 12 # equalities
# x1 + x2 <= 10 # inequalities (N.B. you need to turn it into "greater-equal" form )
# 1 <= x1 <= 3 # bounds
# 1 <= x2 <= 6 # bounds
H <- rbind(c(2, 0),
           c(0, 8))
f <- c(8,16)
# equalities
A.eq <- rbind(c(1,2))
b.eq <- c(12)
# inequalities
A.ge <- rbind(c(-1,-1))
b.ge <- c(-10)
# lower-bounds
A.lbs <- rbind(c( 1, 0),
               c( 0, 1))
b.lbs <- c(1, 1)
# upper-bounds on variables
A.ubs <- rbind(c(-1, 0),
               c( 0,-1))
b.ubs <- c(-3, -6)
# solve
sol <- solve.QP(Dmat = H,
                dvec = f,
                Amat = t(rbind(A.eq, A.ge, A.lbs, A.ubs)),
                bvec = c(b.eq, b.ge, b.lbs, b.ubs),
                meq  = 1)  # the first "meq" constraints are treated as equalities
sol
> sol
$solution
[1] 3.0 4.5
$value
[1] -6
$unconstrained.solution
[1] 4 2
$iterations
[1] 3 0
$Lagrangian
[1] 10 0 0 0 12 0
$iact
[1] 1 5
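As a quick sanity check (a sketch reusing the objects defined above), you can verify that the returned solution satisfies each constraint group; recall that solve.QP() wants every inequality in "greater-or-equal" form, which is why the <= constraint and the upper bounds were negated:
x <- sol$solution
A.eq %*% x - b.eq     # equality constraint: should be exactly 0
A.ge %*% x - b.ge     # inequality: should be >= 0
A.lbs %*% x - b.lbs   # lower bounds: should be >= 0
A.ubs %*% x - b.ubs   # upper bounds: should be >= 0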

MGCV get design matrix

GAM regression with a spline basis is defined by the following cost function:
cost = ||y - S \beta ||^2 + scale * integral(|S'' \beta|^2)
where S is the design matrix defined by the splines.
In R I can compute gam with the following code:
library('mgcv')
data = data.frame('x'=c(1,2,3,4,5), 'y'=c(1,0,0,0,1))
g = gam(y~s(x, k = 4),family = 'binomial', data = data, scale = 0.5)
plot(g)
I would like to get the design matrix S that is generated by s() function.
How can I do that?
I believe there are two ways to get the design matrix from a gamObject:
library('mgcv')
data <- data.frame('x'=c(1,2,3,4,5), 'y'=c(1,0,0,0,1))
g <- gam(y~s(x, k = 4),family = 'binomial', data = data, scale = 0.5)
plot(g)
(option1 <- predict(g, type = "lpmatrix"))
# (Intercept) s(x).1 s(x).2 s(x).3
# 1 1 1.18270529 -0.39063809 -1.4142136
# 2 1 0.94027407 0.07402655 -0.7071068
# 3 1 -0.03736554 0.32947477 0.0000000
# 4 1 -0.97272283 0.21209396 0.7071068
# 5 1 -1.11289099 -0.22495720 1.4142136
# attr(,"model.offset")
# [1] 0
(option2 <- model.matrix.gam(g))
# (Intercept) s(x).1 s(x).2 s(x).3
# 1 1 1.18270529 -0.39063809 -1.4142136
# 2 1 0.94027407 0.07402655 -0.7071068
# 3 1 -0.03736554 0.32947477 0.0000000
# 4 1 -0.97272283 0.21209396 0.7071068
# 5 1 -1.11289099 -0.22495720 1.4142136
# attr(,"model.offset")
# [1] 0
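As a sanity check (a sketch using the fitted object above), multiplying this design matrix by the estimated coefficients reproduces the linear predictor, which confirms it really is the model matrix S used in the fit:
X <- predict(g, type = "lpmatrix")
# Design matrix times coefficients equals the link-scale fit
all.equal(as.numeric(X %*% coef(g)),
          as.numeric(predict(g, type = "link")))
# [1] TRUE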

Model matrix with all pairwise interactions between columns

Let's say that I have a numeric data matrix with columns w, x, y, z and I also want to add in the columns that are equivalent to w*x, w*y, w*z, x*y, x*z, y*z since I want my covariate matrix to include all pairwise interactions.
Is there a clean and effective way to do this?
If you mean in a model formula, then the ^ operator does this.
## dummy data
set.seed(1)
dat <- data.frame(Y = rnorm(10), x = rnorm(10), y = rnorm(10), z = rnorm(10))
The formula is
form <- Y ~ (x + y + z)^2
which gives (using model.matrix() - which is used internally by the standard model fitting functions)
model.matrix(form, data = dat)
R> form <- Y ~ (x + y + z)^2
R> form
Y ~ (x + y + z)^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
If you don't know how many variables you have, or it is tedious to write out all of them, use the . notation too
R> form <- Y ~ .^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
The "power" in the ^ operator, here 2, controls the order of interactions. With ^2 we get second order interactions of all pairs of variables considered by the ^ operator. If you want up to 3rd-order interactions, then use ^3.
R> form <- Y ~ .^3
R> head(model.matrix(form, data = dat))
(Intercept) x y z x:y x:z y:z x:y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.05403 1.24860 1.887604
2 1 0.38984 0.78214 -0.10279 0.304911 -0.04007 -0.08039 -0.031341
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.24084 0.02891 -0.017958
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.11916 0.10704 -0.237055
5 1 1.12493 0.61983 -1.37706 0.697261 -1.54910 -0.85354 -0.960170
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.01865 0.02329 -0.001047
If you are doing a regression, you can just do something like
reg <- lm(w ~ (x + y + z)^2)
and it will figure things out for you. For example,
lm(Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2, iris)
# Call:
# lm(formula = Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2,
# data = iris)
#
# Coefficients:
# (Intercept) Sepal.Length Sepal.Width
# -1.05768 0.07628 0.22983
# Petal.Length Sepal.Length:Sepal.Width Sepal.Length:Petal.Length
# 0.47586 -0.03863 -0.03083
# Sepal.Width:Petal.Length
# 0.06493
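If you are starting from a plain numeric matrix rather than a data frame (as the question describes, with columns w, x, y, z), a minimal sketch is to convert it and let model.matrix() do the expansion; the [, -1] simply drops the intercept column if you only want the covariates:
set.seed(1)
mat <- matrix(rnorm(40), ncol = 4,
              dimnames = list(NULL, c("w", "x", "y", "z")))
# Expand to all main effects plus all pairwise products
X <- model.matrix(~ .^2, data = as.data.frame(mat))[, -1]
colnames(X)
# [1] "w"   "x"   "y"   "z"   "w:x" "w:y" "w:z" "x:y" "x:z" "y:z"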
