error variable length differs
I am confused with this error and I don't know what to do.
n1<-20
m1<-0
sd1<-1
y<-rnorm(n1,m1, sd1)
x<-rnorm(n1,m1, sd1)
e<-rnorm(n1,m1, sd1)
b0<-0
b1<-1
modelfit1<-lm(y~ b0 + b1*x + e)
Error in model.frame.default(formula = y ~ b0 + b1 * x + e:
variable lengths differ (found for 'b0')
edited:
I am working on a case where n=20, the true parameters are b0=0 and b1=1, and both the independent variable and the error are normally distributed with mean=0 and sd=1.
Is this possible?
Thanks a lot!
I may be wrong, but I believe you want to simulate an outcome and then estimate its parameters. If that is true, you would rather do the following:
n1 <- 20
m1 <- 0
sd1<- 1
b0 <- 0
b1 <- 1
x <- rnorm(n1,m1, sd1)
e <- rnorm(n1,m1, sd1)
y <- b0 + b1*x + e
summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.66052 -0.40203 0.05659 0.44115 1.38798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3078 0.1951 -1.578 0.132
x 1.1774 0.2292 5.137 6.9e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.852 on 18 degrees of freedom
Multiple R-squared: 0.5945, Adjusted R-squared: 0.572
F-statistic: 26.39 on 1 and 18 DF, p-value: 6.903e-05
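As for why the original call failed: lm() treats every name in the formula as a data variable, and all of them must have the same length as the response. There, b0 is a single number while y has 20 values. A minimal sketch that reproduces and captures the message (the name msg and the use of tryCatch() are only for illustration, and the seed is an arbitrary assumption):

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility
n1 <- 20
y <- rnorm(n1)
x <- rnorm(n1)
b0 <- 0  # a scalar, not a length-20 variable

# lm() looks up y, b0 and x as variables; their lengths must agree
msg <- tryCatch(
  {
    lm(y ~ b0 + x)
    "no error"
  },
  error = function(err) conditionMessage(err)
)
print(msg)

# compare the lengths directly
print(sapply(list(y = y, b0 = b0, x = x), length))
```

The true parameter values belong in the step that generates y, not in the formula that is fitted.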
And in case you want to do this multiple times, consider the following:
repetitions <- 5
betas <- t(sapply(1:repetitions, function(i){
y <- b0 + b1*x + rnorm(n1,m1, sd1)
coefficients(lm(y~x))
}))
betas
(Intercept) x
[1,] 0.21989182 0.8185690
[2,] -0.12820726 0.7289041
[3,] -0.27596844 0.9794432
[4,] 0.06145306 1.0575050
[5,] -0.31429950 0.9984262
Now you can look at the mean of the estimated betas:
colMeans(betas)
(Intercept) x
-0.08742606 0.91656951
and the variance-covariance matrix:
var(betas)
(Intercept) x
(Intercept) 0.051323041 -0.007976803
x -0.007976803 0.018834711
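With only 5 repetitions these summaries are very noisy. As a rough sketch, assuming x is held fixed across draws, the empirical covariance of the estimates can be compared with the textbook OLS result sd1^2 * (X'X)^(-1). The number of repetitions (2000), the seed, and the names empirical and theoretical are assumptions made for this illustration:

```r
set.seed(123)  # arbitrary seed
n1 <- 20; m1 <- 0; sd1 <- 1
b0 <- 0; b1 <- 1
x <- rnorm(n1, m1, sd1)   # covariate drawn once, then held fixed
X <- cbind(1, x)          # design matrix with an intercept column

repetitions <- 2000       # assumed value, chosen only for illustration
betas <- t(replicate(repetitions, {
  y <- b0 + b1 * x + rnorm(n1, m1, sd1)  # fresh noise in every draw
  coefficients(lm(y ~ x))
}))

empirical   <- var(betas)                  # simulated covariance of the estimates
theoretical <- sd1^2 * solve(t(X) %*% X)   # sigma^2 (X'X)^{-1}

print(colMeans(betas))
print(empirical)
print(theoretical)
```

If the two matrices are close and the column means sit near b0 and b1, the simulation is behaving as the theory predicts.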
I suggest you put everything in a data.frame and deal with it that way:
set.seed(2)
n1<-20
m1<-0
sd1<-1
y<-rnorm(n1,m1, sd1)
x<-rnorm(n1,m1, sd1)
b0<-0
b1<-1
d <- data.frame(y,b0,b1,x,e=rnorm(20,0,1))
head(d)
# y b0 b1 x e
# 1 -0.89691455 0 1 2.090819205 -0.3835862
# 2 0.18484918 0 1 -1.199925820 -1.9591032
# 3 1.58784533 0 1 1.589638200 -0.8417051
# 4 -1.13037567 0 1 1.954651642 1.9035475
# 5 -0.08025176 0 1 0.004937777 0.6224939
# 6 0.13242028 0 1 -2.451706388 1.9909204
Now the call runs without the length error:
modelfit1 <- lm(y~b0+b1*x+e, data=d)
modelfit1
# Call:
# lm(formula = y ~ b0 + b1 * x + e, data = d)
# Coefficients:
# (Intercept) b0 b1 x e b1:x
# 0.19331 NA NA -0.06752 0.02240 NA
summary(modelfit1)
# Call:
# lm(formula = y ~ b0 + b1 * x + e, data = d)
# Residuals:
# Min 1Q Median 3Q Max
# -2.5006 -0.4786 -0.1425 0.6211 1.8488
# Coefficients: (3 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.19331 0.25013 0.773 0.450
# b0 NA NA NA NA
# b1 NA NA NA NA
# x -0.06752 0.21720 -0.311 0.760
# e 0.02240 0.20069 0.112 0.912
# b1:x NA NA NA NA
# Residual standard error: 1.115 on 17 degrees of freedom
# Multiple R-squared: 0.006657, Adjusted R-squared: -0.1102
# F-statistic: 0.05697 on 2 and 17 DF, p-value: 0.9448
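The NA rows in that output are worth a second look. A small sketch of why they appear, assuming the same kind of data frame (the draws here are not the same as above because the variables are generated in a different order): b0 is a column of zeros, b1 is a column of ones identical to the intercept, and b1:x is identical to x, so the design matrix has 6 columns but only rank 3.

```r
set.seed(2)
n1 <- 20
d <- data.frame(y = rnorm(n1), b0 = 0, b1 = 1, x = rnorm(n1), e = rnorm(n1))

# build the design matrix that lm() would use for this formula
X <- model.matrix(y ~ b0 + b1 * x + e, data = d)
print(colnames(X))

n_cols <- ncol(X)      # number of columns requested by the formula
n_rank <- qr(X)$rank   # number of linearly independent columns
print(c(columns = n_cols, rank = n_rank))

# the redundant columns carry no information of their own
print(all(X[, "b0"] == 0))
print(all(X[, "b1"] == X[, "(Intercept)"]))
print(all(X[, "b1:x"] == X[, "x"]))
```

In other words, putting constants into the data frame silences the error but does not estimate b0 and b1; the estimates of interest are the (Intercept) and x coefficients of y ~ x.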
Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?
Here is an example:
library(fixest)
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
# lm(formula = y ~ x1 + x2, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.82680 -0.41503 0.05499 0.67185 0.97830
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7494 0.2885 2.598 0.0317 *
# x1 2.3905 0.3157 7.571 6.48e-05 ***
# x2 NA NA NA NA
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared: 0.8775, Adjusted R-squared: 0.8622
# F-statistic: 57.33 on 1 and 8 DF, p-value: 6.476e-05
Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:
model.matrix(fit_lm)
# (Intercept) x1 x2
#1 1 1.41175158 1.41175158
#2 1 0.06164133 0.06164133
#3 1 0.09285047 0.09285047
#4 1 -0.63202909 -0.63202909
#5 1 0.25189850 0.25189850
#6 1 -0.18553830 -0.18553830
#7 1 0.65630180 0.65630180
#8 1 -1.77536852 -1.77536852
#9 1 -0.30571009 -0.30571009
#10 1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
The model.matrix method from fixest, by contrast, lets you drop x2:
fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept) x1
# [1,] 1 1.41175158
# [2,] 1 0.06164133
# [3,] 1 0.09285047
# [4,] 1 -0.63202909
# [5,] 1 0.25189850
# [6,] 1 -0.18553830
# [7,] 1 0.65630180
# [8,] 1 -1.77536852
# [9,] 1 -0.30571009
# [10,] 1 -1.47296229
Is there a way to drop x2 when calling model.matrix.lm()?
So long as the overhead from running the linear model is not too high, you could write a little function like the one here to do it:
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
model.matrix2 <- function(model){
bn <- names(na.omit(coef(model)))
X <- model.matrix(model)
X[, colnames(X) %in% bn, drop = FALSE]
}
model.matrix2(fit_lm)
#> (Intercept) x1
#> 1 1 -0.04654473
#> 2 1 2.14473751
#> 3 1 0.02688125
#> 4 1 0.95071038
#> 5 1 -1.41621259
#> 6 1 1.47840480
#> 7 1 0.56580182
#> 8 1 0.14480401
#> 9 1 -0.02404072
#> 10 1 -0.14393258
Created on 2022-05-02 by the reprex package (v2.0.1)
In the code above, model.matrix2() is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.
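A possible variant, sketched under the assumption that the fitted object is a plain lm: the fit already stores the pivoted QR decomposition and its rank, so the retained columns can be read from model$qr$pivot without looking at the coefficients. The helper name model_matrix_full_rank() is hypothetical.

```r
set.seed(1)  # arbitrary seed
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)

# hypothetical helper: keep only the columns lm() actually estimated
model_matrix_full_rank <- function(model) {
  # the first `rank` pivot entries index the non-aliased columns
  keep <- model$qr$pivot[seq_len(model$rank)]
  model.matrix(model)[, sort(keep), drop = FALSE]
}

X_kept <- model_matrix_full_rank(fit_lm)
print(colnames(X_kept))
```

Both approaches need the model to be fitted first, since it is the fit that decides which columns are aliased.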
What I am currently having trouble with is understanding how to simulate 10,000 draws while keeping the covariates fixed.
      Y      X1    X2    X3
  <int>   <dbl> <dbl> <int>
1  4264 305.657  7.17     0
2  4496 328.476  6.20     0
3  4317 317.164  4.61     0
4  4292 366.745  7.02     0
5  4945 265.518  8.61     1
6  4325 301.995  6.88     0
That is the head of the grocery data.
What I've done so far for other, related problems:
#5.
#using beta_hat
#create the design matrix: a column of 1s (intercept) plus X1, X2 and X3 for the 52 observations
X <- cbind(rep(1,52), grocery$X1, grocery$X2, grocery$X3)
beta_hat <- solve((t(X) %*% X)) %*% t(X) %*% grocery$Y
round(t(beta_hat), 2)
#using lm formula and residuals
#lm formula
lm0 <- lm(formula = Y ~ X1 + X2 + X3, data = grocery)
#6.
residuals(lm0)[1:5]
Below is the summary of the lm() fit on the original data:
Call:
lm(formula = Y ~ X1 + X2 + X3, data = grocery)
Residuals:
Min 1Q Median 3Q Max
-264.05 -110.73 -22.52 79.29 295.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4149.8872 195.5654 21.220 < 2e-16 ***
X1 0.7871 0.3646 2.159 0.0359 *
X2 -13.1660 23.0917 -0.570 0.5712
X3 623.5545 62.6409 9.954 2.94e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared: 0.6883, Adjusted R-squared: 0.6689
F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12
The result should be a loop that simulates the sampling distribution of the t statistic. What I have right now is for another problem that focuses on fitting the model to the data.
Here I'm given the true model (under the null hypothesis), but I'm not sure where to begin with the loop.
Okay, have a look at the following:
# get some sample data:
set.seed(42)
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
# note how X1 gets multiplied with 0, to highlight that the null is imposed.
df$y_star <- with(df, 4200 + 0*X1 - 15*X2 + 620 * X3)
head(df)
X1 X2 X3 y_star
1 1.37095845 1.3048697 0 4180.427
2 -0.56469817 2.2866454 0 4165.700
3 0.36312841 -1.3888607 0 4220.833
4 0.63286260 -0.2787888 1 4824.182
5 0.40426832 -0.1333213 0 4202.000
# define function to get the t statistic
get_tstat <- function(){
# declare the outcome, with random noise added:
# The added random noise here will be different in each draw
df$y <- with(df, y_star + rnorm(10, mean = 0, sd = sqrt(20500)))
# run linear model
mod <- lm(y ~ X1 + X2 + X3, data = df)
return(summary(mod)$coefficients["X1", "t value"])
}
# get 10 values from the t-statistic:
replicate(10, get_tstat())
[1] -0.8337737 -1.2567709 -1.2303073 0.3629552 -0.1203216 -0.1150734 0.3533095 1.6261360
[9] 0.8259006 -1.3979176
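To go from 10 values to the 10,000 draws asked about, the same function can simply be replicated more often. A sketch, assuming the sample data above; n_draws, crit and reject_rate are names introduced here. Because the null (coefficient 0 on X1) is true in the simulated data, a two-sided 5% t test should reject in roughly 5% of the draws.

```r
set.seed(42)
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
df$y_star <- with(df, 4200 + 0 * X1 - 15 * X2 + 620 * X3)

get_tstat <- function() {
  # covariates stay fixed; only the noise changes from draw to draw
  df$y <- df$y_star + rnorm(nrow(df), mean = 0, sd = sqrt(20500))
  mod <- lm(y ~ X1 + X2 + X3, data = df)
  summary(mod)$coefficients["X1", "t value"]
}

n_draws <- 10000
tstats <- replicate(n_draws, get_tstat())

# residual degrees of freedom: observations minus 4 estimated coefficients
resid_df <- nrow(df) - 4
crit <- qt(0.975, df = resid_df)

# share of draws in which the true null would be rejected at the 5% level
reject_rate <- mean(abs(tstats) > crit)
print(reject_rate)
print(quantile(tstats, c(0.025, 0.5, 0.975)))
```

A histogram of tstats, for example hist(tstats, breaks = 50), gives a visual impression of the simulated sampling distribution.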
I'm running a linear regression with all predictors (I have 384 predictors), but I only get 373 coefficients from summary. Why does R not return all coefficients, and how can I get all 384?
full_lm <- lm(Y ~ ., data=dat[,2:385]) #384 predictors
coef_lm <- as.matrix(summary(full_lm)$coefficients[,4]) #only gives me 373
First, summary(full_lm)$coefficients[,4] returns the p-values, not the coefficients. Now, to actually answer your question, I believe that some of your variables drop out of the estimation because they are perfectly collinear with others. If you run summary(full_lm), you will see that the estimates for these variables are NA in all fields, so they are not included in summary(full_lm)$coefficients. As an example:
x<- rnorm(1000)
x1<- 2*x
x2<- runif(1000)
eps<- rnorm(1000)
y<- 5+3*x + x1 + x2 + eps
full_lm <- lm(y ~ x + x1 + x2)
summary(full_lm)
#Call:
#lm(formula = y ~ x + x1 + x2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-2.90396 -0.67761 -0.02374 0.71906 2.88259
#
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 4.96254 0.06379 77.79 <2e-16 ***
#x 5.04771 0.03497 144.33 <2e-16 ***
#x1 NA NA NA NA
#x2 1.05833 0.11259 9.40 <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 1.024 on 997 degrees of freedom
#Multiple R-squared: 0.9546, Adjusted R-squared: 0.9545
#F-statistic: 1.048e+04 on 2 and 997 DF, p-value: < 2.2e-16
coef_lm <- as.matrix(summary(full_lm)$coefficients[,1])
coef_lm
#(Intercept) 4.962538
#x 5.047709
#x2 1.058327
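If a vector with one entry per predictor is needed, coef() on the fitted model keeps the aliased terms as NA instead of dropping them. A short sketch using the same kind of simulated data (the seed and the names all_coefs, dropped and n_in_summary are assumptions for illustration):

```r
set.seed(1)  # arbitrary seed
x  <- rnorm(1000)
x1 <- 2 * x              # perfectly collinear with x
x2 <- runif(1000)
y  <- 5 + 3 * x + x1 + x2 + rnorm(1000)
full_lm <- lm(y ~ x + x1 + x2)

all_coefs <- coef(full_lm)                       # one entry per term, NA if aliased
dropped   <- names(all_coefs)[is.na(all_coefs)]  # terms lm() could not estimate
n_in_summary <- nrow(summary(full_lm)$coefficients)

print(all_coefs)
print(dropped)
print(c(in_coef = length(all_coefs), in_summary = n_in_summary))
```

The names in dropped tell you which predictors are linear combinations of the others.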
E.g., if some columns in your data are linear combinations of others, then the coefficient will be NA and if you index the way you do, it'll be omitted automatically.
a <- rnorm(100)
b <- rnorm(100)
c <- rnorm(100)
d <- b + 2*c
e <- lm(a ~ b + c + d)
gives
Call:
lm(formula = a ~ b + c + d)
Coefficients:
(Intercept) b c d
0.088463 -0.008097 -0.077994 NA
But indexing...
> as.matrix(summary(e)$coefficients)[, 4]
(Intercept) b c
0.3651726 0.9435427 0.3562072
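To find out which combination causes the NA, alias() from the stats package can be applied to the fitted model. A sketch with the same construction as above (the seed is an assumption, and the object is called al only for illustration):

```r
set.seed(1)  # arbitrary seed
a <- rnorm(100)
b <- rnorm(100)
c <- rnorm(100)
d <- b + 2 * c           # exact linear combination of b and c
e <- lm(a ~ b + c + d)

# alias() reports the linear dependencies among the model terms
al <- alias(e)
print(al$Complete)
```

Each row of al$Complete corresponds to an aliased term and shows how it depends on the terms that were kept.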
library(lmPerm)
x <- lmp(formula = a ~ b * c + d + e, data = df, perm = "Prob")
summary(x) # truncated output, I can see `NA` rows here!
#Coefficients: (1 not defined because of singularities)
# Estimate Iter Pr(Prob)
#b 5.874 51 1.000
#c -30.060 281 0.263
#b:c NA NA NA
#d1 -31.333 60 0.633
#d2 33.297 165 0.382
#d3 -19.096 51 1.000
#e 1.976 NA NA
I want to pull out the Pr(Prob) results for everything, but
y <- summary(x)$coef[, "Pr(Prob)"]
#(Intercept) b c d1 d2
# 0.09459459 1.00000000 0.26334520 0.63333333 0.38181818
# d3 e
# 1.00000000 NA
This is not what I want. I need the b:c row too, in the right position.
An example of the output I would like from the above would be:
# (Intercept) b c b:c d1 d2
# 0.09459459 1.00000000 0.26334520 NA 0.63333333 0.38181818
# d3 e
# 1.00000000 NA
I also would like to pull out the Iter column that corresponds to each variable. Thanks.
lmp is based on lm and summary.lmp also behaves like summary.lm, so I will first use lm for illustration, then show that we can do the same for lmp.
lm and summary.lm
Have a read on ?summary.lm and watch out for the following returned values:
coefficients: a p x 4 matrix with columns for the estimated
coefficient, its standard error, t-statistic and
corresponding (two-sided) p-value. Aliased coefficients are
omitted.
aliased: named logical vector showing if the original coefficients are
aliased.
When you have rank-deficient models, NA coefficients are omitted in the coefficient table, and they are called aliased variables. Consider the following small, reproducible example:
set.seed(0)
zz <- xx <- rnorm(10)
yy <- rnorm(10)
fit <- lm(yy ~ xx + zz)
coef(fit) ## we can see `NA` here
#(Intercept) xx zz
# 0.1295147 0.2706560 NA
a <- summary(fit) ## it is also printed to screen
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1295 0.3143 0.412 0.691
#xx 0.2707 0.2669 1.014 0.340
#zz NA NA NA NA
b <- coef(a) ## but no `NA` returned in the matrix / table
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1295147 0.3142758 0.4121051 0.6910837
#xx 0.2706560 0.2669118 1.0140279 0.3402525
d <- a$aliased
#(Intercept) xx zz
# FALSE FALSE TRUE
If you want to pad the coefficient table / matrix with NA rows, you can do
## an augmented matrix of `NA`
e <- matrix(nrow = length(d), ncol = ncol(b),
dimnames = list(names(d), dimnames(b)[[2]]))
## fill rows for non-aliased variables
e[!d] <- b
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1295147 0.3142758 0.4121051 0.6910837
#xx 0.2706560 0.2669118 1.0140279 0.3402525
#zz NA NA NA NA
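The padding steps can be wrapped in a small helper so they are easy to reuse. This is only a sketch: pad_coef_table() is a hypothetical name, and it indexes rows explicitly with out[!aliased, ] instead of relying on recycling of the logical vector.

```r
# hypothetical helper: coefficient table with NA rows for aliased terms
pad_coef_table <- function(fit) {
  s <- summary(fit)
  tab <- coef(s)          # table without the aliased terms
  aliased <- s$aliased    # named logical vector, TRUE for aliased terms
  out <- matrix(NA_real_, nrow = length(aliased), ncol = ncol(tab),
                dimnames = list(names(aliased), colnames(tab)))
  out[!aliased, ] <- tab  # fill the rows of the estimated terms
  out
}

set.seed(0)
zz <- xx <- rnorm(10)
yy <- rnorm(10)
padded <- pad_coef_table(lm(yy ~ xx + zz))
print(padded)
```

A column can then be pulled out by name, for example padded[, "Pr(>|t|)"], with the aliased term in its original position.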
lmp and summary.lmp
Nothing needs to be changed.
library(lmPerm)
fit <- lmp(yy ~ xx + zz, perm = "Prob")
a <- summary(fit) ## `summary.lmp`
b <- coef(a)
# Estimate Iter Pr(Prob)
#(Intercept) -0.0264354 241 0.2946058
#xx 0.2706560 241 0.2946058
d <- a$aliased
#(Intercept) xx zz
# FALSE FALSE TRUE
e <- matrix(nrow = length(d), ncol = ncol(b),
dimnames = list(names(d), dimnames(b)[[2]]))
e[!d] <- b
# Estimate Iter Pr(Prob)
#(Intercept) -0.0264354 241 0.2946058
#xx 0.2706560 241 0.2946058
#zz NA NA NA
If you want to extract Iter and Pr(Prob), just do
e[, 2] ## e[, "Iter"]
#(Intercept) xx zz
# 241 241 NA
e[, 3] ## e[, "Pr(Prob)"]
#(Intercept) xx zz
# 0.2946058 0.2946058 NA
Data:
Y X levels
y1 x1 2
...
lm(Y~X,I(levels==1))
Does I(levels==1) mean "only where levels==1"? If not, how can I regress Y on X using only the rows where levels equals 1?
Have a look at lmList from the nlme package:
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)
library(nlme)
models <- lmList(y ~ x|levels, data = dataset)
The output is a list of lm models, one per level:
models
Call:
Model: y ~ x | levels
Data: dataset
Coefficients:
(Intercept) x
1 4.964104 1.227478
2 10.085231 2.158683
Degrees of freedom: 100 total; 96 residual
Residual standard error: 1.019202
Here is the summary of the first model:
summary(models[[1]])
Call:
lm(formula = form, data = dat, na.action = na.action)
Residuals:
Min 1Q Median 3Q Max
-2.16569 -1.04457 -0.00318 0.78667 2.65927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9641 0.1617 30.703 < 2e-16 ***
x 1.2275 0.1469 8.354 6.47e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.128 on 48 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.584
F-statistic: 69.78 on 1 and 48 DF, p-value: 6.469e-11
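If you would rather not depend on nlme, a similar per-level list of fits can be sketched in base R with split() and lapply(). The names models_base and coef_by_level are introduced here for illustration:

```r
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
  y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)

# split the data by level and fit one lm per piece
models_base <- lapply(split(dataset, dataset$levels),
                      function(piece) lm(y ~ x, data = piece))

# collect the coefficients, one row per level
coef_by_level <- t(sapply(models_base, coef))
print(coef_by_level)
```

Each element of models_base is an ordinary lm object, so summary(models_base[["1"]]) works as usual.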
lm has a subset argument; here is an example.
x <- rnorm(100)
y <- rnorm(100, sd=0.1)
y[1:50] <- y[1:50] + 3*x[1:50] + 10 # line y = 3x+10
y[51:100] <- y[51:100] + 8*x[51:100] - 5 # line y = 8x-5
levels <- rep(1:2, each=50, len=100)
data = data.frame(x=x, y=y, levels=levels)
lm(y ~ x, data=data, subset=levels==1) # regression for the first part
Coefficients: (Intercept) x
10.015 2.996
lm(y ~ x, data=data, subset=levels==2) # second part
Coefficients: (Intercept) x
-4.986 8.000
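In a call such as lm(y ~ x, I(levels==1), data=data), data is matched by name, so the unnamed I(levels==1) is matched positionally to the next free argument of lm, which is subset. Without a named data argument, the second unnamed argument would be matched to data instead, so it is safer to write subset= explicitly.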
I was not sure. But this code seems to suggest that you are correct.
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(textConnection(my.data), header = TRUE)
my.data2
lm(x ~ y,I(level==1), data=my.data2)
my.data <- "x y level
1 2 1
3 4 1
5 5 1
7 7 1
9 10 1"
my.data2 <- read.table(textConnection(my.data), header = TRUE)
my.data2
lm(x ~ y, data=my.data2)
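Rather than comparing the printed output by eye, the fits can be compared programmatically. A sketch, assuming the same toy data; the object names fit_positional, fit_subset and fit_manual are introduced here:

```r
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(text = my.data, header = TRUE)

# order of lm()'s first arguments, which governs positional matching
print(names(formals(lm))[1:3])

fit_positional <- lm(x ~ y, I(level == 1), data = my.data2)            # unnamed 2nd argument
fit_subset     <- lm(x ~ y, data = my.data2, subset = level == 1)      # explicit subset
fit_manual     <- lm(x ~ y, data = my.data2[my.data2$level == 1, ])    # filtered by hand

print(all.equal(coef(fit_positional), coef(fit_subset)))
print(all.equal(coef(fit_subset), coef(fit_manual)))
print(nobs(fit_positional))
```

Writing subset = level == 1 explicitly avoids depending on the argument order.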