Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?
Here is an example:
library(fixest)
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
# lm(formula = y ~ x1 + x2, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.82680 -0.41503 0.05499 0.67185 0.97830
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7494 0.2885 2.598 0.0317 *
# x1 2.3905 0.3157 7.571 6.48e-05 ***
# x2 NA NA NA NA
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared: 0.8775, Adjusted R-squared: 0.8622
# F-statistic: 57.33 on 1 and 8 DF, p-value: 6.476e-05
Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:
model.matrix(fit_lm)
# (Intercept) x1 x2
#1 1 1.41175158 1.41175158
#2 1 0.06164133 0.06164133
#3 1 0.09285047 0.09285047
#4 1 -0.63202909 -0.63202909
#5 1 0.25189850 0.25189850
#6 1 -0.18553830 -0.18553830
#7 1 0.65630180 0.65630180
#8 1 -1.77536852 -1.77536852
#9 1 -0.30571009 -0.30571009
#10 1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
The model.matrix method from fixest, by contrast, can drop x2:
fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept) x1
# [1,] 1 1.41175158
# [2,] 1 0.06164133
# [3,] 1 0.09285047
# [4,] 1 -0.63202909
# [5,] 1 0.25189850
# [6,] 1 -0.18553830
# [7,] 1 0.65630180
# [8,] 1 -1.77536852
# [9,] 1 -0.30571009
# [10,] 1 -1.47296229
Is there a way to drop x2 when calling model.matrix.lm()?
So long as the overhead from running the linear model is not too high, you could write a little function like the one below to do it:
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
model.matrix2 <- function(model){
  # keep only the columns whose coefficients were actually estimated (non-NA)
  bn <- names(na.omit(coef(model)))
  X <- model.matrix(model)
  # drop = FALSE keeps the result a matrix even if only one column survives
  X[, colnames(X) %in% bn, drop = FALSE]
}
model.matrix2(fit_lm)
#> (Intercept) x1
#> 1 1 -0.04654473
#> 2 1 2.14473751
#> 3 1 0.02688125
#> 4 1 0.95071038
#> 5 1 -1.41621259
#> 6 1 1.47840480
#> 7 1 0.56580182
#> 8 1 0.14480401
#> 9 1 -0.02404072
#> 10 1 -0.14393258
Created on 2022-05-02 by the reprex package (v2.0.1)
In the code above, model.matrix2() post-processes the model matrix so that it contains only the variables with non-missing coefficients in the fitted model.
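Since the columns of model.matrix() on an lm fit line up one-to-one with the coefficient vector, a slightly more direct variant is to index by the non-NA coefficient positions, which avoids name matching entirely. This is just a sketch along the same lines as the function above; model.matrix3 is a made-up name:
model.matrix3 <- function(model){
  # columns of model.matrix(model) appear in the same order as coef(model),
  # so the NA pattern in the coefficients identifies the aliased columns
  model.matrix(model)[, !is.na(coef(model)), drop = FALSE]
}
model.matrix3(fit_lm)  # keeps only (Intercept) and x1, as above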
What I'm currently struggling with in this problem is understanding how to simulate 10,000 draws while keeping the covariates fixed.
    Y      X1   X2 X3
1 4264 305.657 7.17  0
2 4496 328.476 6.20  0
3 4317 317.164 4.61  0
4 4292 366.745 7.02  0
5 4945 265.518 8.61  1
6 4325 301.995 6.88  0
That is the head of the grocery data.
What I've done so far for related problems:
#5.
#using beta_hat
#create the design matrix: a column of ones (intercept) plus the three Xs, 52 rows
X <- cbind(rep(1, 52), grocery$X1, grocery$X2, grocery$X3)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% grocery$Y
round(t(beta_hat), 2)
#using lm formula and residuals
#lm formula
lm0 <- lm(formula = Y ~ X1 + X2 + X3, data = grocery)
#6.
residuals(lm0)[1:5]
Below is the summary output of the lm() fit above:
Call:
lm(formula = Y ~ X1 + X2 + X3, data = grocery)
Residuals:
Min 1Q Median 3Q Max
-264.05 -110.73 -22.52 79.29 295.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4149.8872 195.5654 21.220 < 2e-16 ***
X1 0.7871 0.3646 2.159 0.0359 *
X2 -13.1660 23.0917 -0.570 0.5712
X3 623.5545 62.6409 9.954 2.94e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared: 0.6883, Adjusted R-squared: 0.6689
F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12
The goal is a loop that produces the sampling distribution of the t-statistic for the test. Right now what I have is code for another problem, which focuses on fitting the model to the data.
Here I'm given the true model (under which the null hypothesis holds), but I'm not sure where to begin with the loop.
Okay, have a look at the following:
# get some sample data:
set.seed(42)
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
# note how X1 gets multiplied by 0, to highlight that the null is imposed.
df$y_star <- with(df, 4200 + 0*X1 - 15*X2 + 620 * X3)
head(df)
X1 X2 X3 y_star
1 1.37095845 1.3048697 0 4180.427
2 -0.56469817 2.2866454 0 4165.700
3 0.36312841 -1.3888607 0 4220.833
4 0.63286260 -0.2787888 1 4824.182
5 0.40426832 -0.1333213 0 4202.000
# define function to get the t statistic
get_tstat <- function(){
  # declare the outcome, with random noise added;
  # the added random noise here will be different in each draw
  df$y <- with(df, y_star + rnorm(10, mean = 0, sd = sqrt(20500)))
  # run linear model
  mod <- lm(y ~ X1 + X2 + X3, data = df)
  return(summary(mod)$coefficients["X1", "t value"])
}
# get 10 values from the t-statistic:
replicate(10, get_tstat())
[1] -0.8337737 -1.2567709 -1.2303073 0.3629552 -0.1203216 -0.1150734 0.3533095 1.6261360
[9] 0.8259006 -1.3979176
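Scaling this up to the 10,000 draws mentioned in the question is then just a matter of increasing the replicate() count. As a rough sanity check (a sketch, assuming the setup above), the simulated statistics should behave like a t distribution with 10 - 4 = 6 degrees of freedom, since the null on X1 is true by construction:
tstats <- replicate(10000, get_tstat())
# under the null, roughly 5% should exceed the two-sided 5% critical values
mean(abs(tstats) > qt(0.975, df = 6))
# compare the empirical distribution with the theoretical t(6) density
hist(tstats, freq = FALSE, breaks = 50)
curve(dt(x, df = 6), add = TRUE)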
Error: variable lengths differ
I am confused with this error and I don't know what to do.
n1 <- 20
m1 <- 0
sd1 <- 1
y <- rnorm(n1, m1, sd1)
x <- rnorm(n1, m1, sd1)
e <- rnorm(n1, m1, sd1)
b0 <- 0
b1 <- 1
modelfit1 <- lm(y ~ b0 + b1*x + e)
Error in model.frame.default(formula = y ~ b0 + b1 * x + e) :
  variable lengths differ (found for 'b0')
Edit:
I am working on a case where n = 20, the true parameters are b0 = 0 and b1 = 1, and both the independent variable and the error are normally distributed with mean = 0 and sd = 1.
Is this possible?
Thanks a lot!
I may be wrong, but I believe you want to simulate an outcome and then estimate its parameters. If that is true, you would rather do the following:
n1 <- 20
m1 <- 0
sd1 <- 1
b0 <- 0
b1 <- 1
x <- rnorm(n1, m1, sd1)
e <- rnorm(n1, m1, sd1)
y <- b0 + b1*x + e
summary(lm(y ~ x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.66052 -0.40203 0.05659 0.44115 1.38798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3078 0.1951 -1.578 0.132
x 1.1774 0.2292 5.137 6.9e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.852 on 18 degrees of freedom
Multiple R-squared: 0.5945, Adjusted R-squared: 0.572
F-statistic: 26.39 on 1 and 18 DF, p-value: 6.903e-05
And in case you want to do this multiple times, consider the following:
repetitions <- 5
betas <- t(sapply(1:repetitions, function(i){
  y <- b0 + b1*x + rnorm(n1, m1, sd1)
  coefficients(lm(y ~ x))
}))
betas
(Intercept) x
[1,] 0.21989182 0.8185690
[2,] -0.12820726 0.7289041
[3,] -0.27596844 0.9794432
[4,] 0.06145306 1.0575050
[5,] -0.31429950 0.9984262
Now you can look at the mean of the estimated betas:
colMeans(betas)
(Intercept) x
-0.08742606 0.91656951
and the variance-covariance matrix:
var(betas)
(Intercept) x
(Intercept) 0.051323041 -0.007976803
x -0.007976803 0.018834711
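With only 5 repetitions those summaries are noisy. As a quick sketch (reusing the objects defined above, with x held fixed across draws), increasing the number of repetitions shows the average estimates settling near the true values b0 = 0 and b1 = 1:
betas_big <- t(sapply(1:10000, function(i){
  y <- b0 + b1*x + rnorm(n1, m1, sd1)
  coefficients(lm(y ~ x))
}))
colMeans(betas_big)  # should be close to 0 and 1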
I suggest you put everything in a data.frame and deal with it that way:
set.seed(2)
n1 <- 20
m1 <- 0
sd1 <- 1
y <- rnorm(n1, m1, sd1)
x <- rnorm(n1, m1, sd1)
b0 <- 0
b1 <- 1
d <- data.frame(y, b0, b1, x, e = rnorm(20, 0, 1))
head(d)
# y b0 b1 x e
# 1 -0.89691455 0 1 2.090819205 -0.3835862
# 2 0.18484918 0 1 -1.199925820 -1.9591032
# 3 1.58784533 0 1 1.589638200 -0.8417051
# 4 -1.13037567 0 1 1.954651642 1.9035475
# 5 -0.08025176 0 1 0.004937777 0.6224939
# 6 0.13242028 0 1 -2.451706388 1.9909204
Now things work nicely:
modelfit1 <- lm(y~b0+b1*x+e, data=d)
modelfit1
# Call:
# lm(formula = y ~ b0 + b1 * x + e, data = d)
# Coefficients:
# (Intercept) b0 b1 x e b1:x
# 0.19331 NA NA -0.06752 0.02240 NA
summary(modelfit1)
# Call:
# lm(formula = y ~ b0 + b1 * x + e, data = d)
# Residuals:
# Min 1Q Median 3Q Max
# -2.5006 -0.4786 -0.1425 0.6211 1.8488
# Coefficients: (3 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.19331 0.25013 0.773 0.450
# b0 NA NA NA NA
# b1 NA NA NA NA
# x -0.06752 0.21720 -0.311 0.760
# e 0.02240 0.20069 0.112 0.912
# b1:x NA NA NA NA
# Residual standard error: 1.115 on 17 degrees of freedom
# Multiple R-squared: 0.006657, Adjusted R-squared: -0.1102
# F-statistic: 0.05697 on 2 and 17 DF, p-value: 0.9448
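A short note on why those NAs appear (this explanation is mine, not part of the original answer): b0 and b1 are constant columns, so they are collinear with the intercept, and b1:x is identical to x, so lm() drops all three as singularities, just like x2 in the collinearity example at the top of this page. If the aim is to estimate the intercept and slope, the model reduces to:
lm(y ~ x, data = d)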
As one option for model selection for MCMCglmm (see also this related question) I am trying out model averaging using the package MuMIn. It doesn't seem to work; see the output below. Any ideas why? The output looks like nonsense. In particular, there are a bunch of NA values for the z values, and where they are not NA, they are all exactly 1. This may stem from the fact that all but one model has been assigned a weight of 0, which again seems unrealistic.
Note that in the documentation for MuMIn, it is listed as being compatible with MCMCglmm objects.
Reproducible example:
set.seed(1234)
library(MCMCglmm)
data(bird.families)
n <- Ntip(bird.families)
# Create some dummy variables
d <- data.frame(taxon = bird.families$tip.label,
                X1 = rnorm(n),
                X2 = rnorm(n),
                X3 = sample(c("A", "B", "C"), n, replace = TRUE),
                X4 = sample(c("A", "B", "C"), n, replace = TRUE))
# Simulate a phenotype composed of phylogenetic, fixed and residual effects
d$phenotype <- rbv(bird.families, 1, nodes="TIPS") +
d$X1*0.7 +
ifelse(d$X3 == "B", 0.5, 0) +
ifelse(d$X3 == "C", 0.8, 0) +
rnorm(n, 0, 1)
# Inverse matrix of shared phylogenetic history
Ainv <- inverseA(bird.families)$Ainv
# Set priors
prior <- list(R = list(V = 1, nu = 0.002),
G = list(G1 = list(V = 1, nu = 0.002)))
uMCMCglmm <- updateable(MCMCglmm)
model <- uMCMCglmm(phenotype ~ X1 + X2 + X3 + X4,
random = ~taxon,
ginverse = list(taxon=Ainv),
data = d,
prior = prior,
verbose = FALSE)
# Explore possible simplified models
options(na.action = "na.fail")
dred <- dredge(model)
# Calculate a model average
avg <- model.avg(dred)
summary(avg)
Output:
Call:
model.avg(object = dred)
Component model call:
uMCMCglmm(fixed = phenotype ~ <16 unique rhs>, random = ~taxon, data = d,
prior = prior, verbose = FALSE, ginverse = list(taxon = Ainv))
Component models:
df logLik AICc delta weight
3 5 -49.24 108.93 0.00 1
4 5 -71.18 152.82 43.89 0
(Null) 3 -76.98 160.13 51.20 0
34 7 -90.35 195.56 86.63 0
23 6 -95.03 202.71 93.78 0
24 6 -105.79 224.22 115.29 0
1 4 -134.87 278.04 169.11 0
123 7 -137.36 289.59 180.66 0
2 4 -154.82 317.93 209.00 0
234 8 -162.69 342.51 233.58 0
13 6 -167.74 348.12 239.19 0
124 7 -171.06 356.99 248.05 0
14 6 -172.53 357.70 248.77 0
134 8 -171.60 360.33 251.40 0
12 5 -181.16 372.78 263.84 0
1234 9 -189.33 398.07 289.14 0
Term codes:
X1 X2 X3 X4
1 2 3 4
Model-averaged coefficients:
(full average)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.642e-01 NA NA NA
X3B 6.708e-01 6.708e-01 1 0.317
X3C 9.802e-01 9.802e-01 1 0.317
X4B -9.505e-11 9.505e-11 1 0.317
X4C -7.822e-11 7.822e-11 1 0.317
X2 -3.259e-22 3.259e-22 1 0.317
X1 1.378e-37 1.378e-37 1 0.317
(conditional average)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.76421 NA NA NA
X3B 0.67078 NA NA NA
X3C 0.98025 NA NA NA
X4B -0.32229 NA NA NA
X4C -0.26522 NA NA NA
X2 -0.07528 NA NA NA
X1 0.72300 NA NA NA
Relative variable importance:
X3 X4 X2 X1
Importance: 1 <0.01 <0.01 <0.01
N containing models: 8 8 8 8
Let's say that I have a numeric data matrix with columns w, x, y, z and I also want to add in the columns that are equivalent to w*x, w*y, w*z, x*y, x*z, y*z since I want my covariate matrix to include all pairwise interactions.
Is there a clean and effective way to do this?
If you mean in a model formula, then the ^ operator does this.
## dummy data
set.seed(1)
dat <- data.frame(Y = rnorm(10), x = rnorm(10), y = rnorm(10), z = rnorm(10))
The formula is
form <- Y ~ (x + y + z)^2
which gives (using model.matrix() - which is used internally by the standard model fitting functions)
model.matrix(form, data = dat)
R> form <- Y ~ (x + y + z)^2
R> form
Y ~ (x + y + z)^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
If you don't know how many variables you have, or it is tedious to write them all out, use the . notation too:
R> form <- Y ~ .^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
The "power" in the ^ operator, here 2, controls the order of interactions. With ^2 we get second order interactions of all pairs of variables considered by the ^ operator. If you want up to 3rd-order interactions, then use ^3.
R> form <- Y ~ .^3
R> head(model.matrix(form, data = dat))
(Intercept) x y z x:y x:z y:z x:y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.05403 1.24860 1.887604
2 1 0.38984 0.78214 -0.10279 0.304911 -0.04007 -0.08039 -0.031341
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.24084 0.02891 -0.017958
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.11916 0.10704 -0.237055
5 1 1.12493 0.61983 -1.37706 0.697261 -1.54910 -0.85354 -0.960170
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.01865 0.02329 -0.001047
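If your data really live in a numeric matrix rather than a data frame (as the question suggests), you can convert first and use the same formula machinery. A small sketch of the same idea; mat and X are made-up names here:
mat <- as.matrix(dat[, c("x", "y", "z")])  # stand-in for your numeric matrix
X <- model.matrix(~ .^2, data = as.data.frame(mat))
head(X[, -1])  # drop the intercept column if you only want the covariates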
If you are doing a regression, you can just do something like
reg <- lm(w ~ (x + y + z)^2)
and it will figure things out for you. For example,
lm(Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2, iris)
# Call:
# lm(formula = Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2,
# data = iris)
# Coefficients:
# (Intercept) Sepal.Length Sepal.Width
# -1.05768 0.07628 0.22983
# Petal.Length Sepal.Length:Sepal.Width Sepal.Length:Petal.Length
# 0.47586 -0.03863 -0.03083
# Sepal.Width:Petal.Length
# 0.06493
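And if you then want the expanded covariate matrix itself (the point of the original question), model.matrix() on the fitted object returns exactly the columns lm() used; for example:
fit <- lm(Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2, iris)
head(model.matrix(fit))  # intercept, main effects, and all pairwise interactions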
Fitting linear model / ANOVA by group
Data:
Y X levels
y1 x1 2
...
lm(Y~X,I(levels==1))
Does the I(levels==1) mean the regression is restricted to rows where levels == 1? If not, how can I run the regression of Y vs X only when levels equals 1?
Have a look at lmList from the nlme package:
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)
library(nlme)
models <- lmList(y ~ x|levels, data = dataset)
The output is a list of lm models, one per level:
models
Call:
Model: y ~ x | levels
Data: dataset
Coefficients:
(Intercept) x
1 4.964104 1.227478
2 10.085231 2.158683
Degrees of freedom: 100 total; 96 residual
Residual standard error: 1.019202
Here is the summary of the first model:
summary(models[[1]])
Call:
lm(formula = form, data = dat, na.action = na.action)
Residuals:
Min 1Q Median 3Q Max
-2.16569 -1.04457 -0.00318 0.78667 2.65927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9641 0.1617 30.703 < 2e-16 ***
x 1.2275 0.1469 8.354 6.47e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.128 on 48 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.584
F-statistic: 69.78 on 1 and 48 DF, p-value: 6.469e-11
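To pull the per-level estimates out in one step, coef() on the lmList object returns a small table of coefficients, one row per level (the values match the ones printed above):
coef(models)
#   (Intercept)        x
# 1    4.964104 1.227478
# 2   10.085231 2.158683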
lm() has a subset parameter; here is an example.
x <- rnorm(100)
y <- rnorm(100, sd=0.1)
y[1:50] <- y[1:50] + 3*x[1:50] + 10 # line y = 3x+10
y[51:100] <- y[51:100] + 8*x[51:100] - 5 # line y = 8x-5
levels <- rep(1:2, each=50, len=100)
data = data.frame(x=x, y=y, levels=levels)
lm(y ~ x, data=data, subset=levels==1) # regression for the first part
Coefficients:
(Intercept)            x
     10.015        2.996
lm(y ~ x, data=data, subset=levels==2) # second part
Coefficients:
(Intercept)            x
     -4.986        8.000
You are passing I(levels==1) implicitly to subset inside lm.
I was not sure, but the code below seems to confirm that you are correct:
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, I(level==1), data = my.data2)
my.data <- "x y level
1 2 1
3 4 1
5 5 1
7 7 1
9 10 1"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, data=my.data2)
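To confirm the equivalence explicitly, you can refit with the full data and pass subset by name; the coefficients match the fit on the pre-filtered data above (my.data.full is just a fresh copy of the original ten-row data):
my.data.full <- read.table(textConnection("x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"), header = TRUE)
lm(x ~ y, data = my.data.full, subset = level == 1)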