Linear regression not returning all coefficients - r

I'm running a linear regression with all predictors (I have 384 of them), but summary only gives me 373 coefficients. Why does R not return all of the coefficients, and how can I get all 384?
full_lm <- lm(Y ~ ., data=dat[,2:385]) #384 predictors
coef_lm <- as.matrix(summary(full_lm)$coefficients[,4]) #only gives me 373

First, summary(full_lm)$coefficients[,4] returns the p-values, not the coefficients (the estimates are in column 1). Now, to actually answer your question: I believe some of your variables drop out of the estimation because they are perfectly collinear with others. If you run summary(full_lm), you will see that the estimates for these variables are NA in all fields, so they are not included in summary(full_lm)$coefficients. As an example:
x<- rnorm(1000)
x1<- 2*x
x2<- runif(1000)
eps<- rnorm(1000)
y<- 5+3*x + x1 + x2 + eps
full_lm <- lm(y ~ x + x1 + x2)
summary(full_lm)
#Call:
#lm(formula = y ~ x + x1 + x2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-2.90396 -0.67761 -0.02374 0.71906 2.88259
#
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 4.96254 0.06379 77.79 <2e-16 ***
#x 5.04771 0.03497 144.33 <2e-16 ***
#x1 NA NA NA NA
#x2 1.05833 0.11259 9.40 <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 1.024 on 997 degrees of freedom
#Multiple R-squared: 0.9546, Adjusted R-squared: 0.9545
#F-statistic: 1.048e+04 on 2 and 997 DF, p-value: < 2.2e-16
coef_lm <- as.matrix(summary(full_lm)$coefficients[,1])
coef_lm
#(Intercept) 4.962538
#x 5.047709
#x2 1.058327

Similarly, if some columns in your data are linear combinations of others, then their coefficients will be NA, and if you index the way you do, they'll be omitted automatically.
a <- rnorm(100)
b <- rnorm(100)
c <- rnorm(100)
d <- b + 2*c
e <- lm(a ~ b + c + d)
gives
Call:
lm(formula = a ~ b + c + d)
Coefficients:
(Intercept) b c d
0.088463 -0.008097 -0.077994 NA
But indexing...
> as.matrix(summary(e)$coefficients)[, 4]
(Intercept) b c
0.3651726 0.9435427 0.3562072
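If what you ultimately want is a coefficient vector of the full length, NAs included, note that coef() on the fitted model, unlike the summary table, keeps the aliased entries. A minimal sketch using the toy model above (for the original data this would simply be coef(full_lm)):
# coef() keeps NA entries for aliased (collinear) variables,
# so the result has one entry per predictor plus the intercept
coef(e)
#(Intercept)           b           c           d
#   0.088463   -0.008097   -0.077994          NA
# which predictors were dropped:
names(coef(e))[is.na(coef(e))]
#[1] "d"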

Related

How to write a loop to simulate sampling distribution of t-statistic under null using a true model?

What I'm currently struggling with is how to simulate 10,000 draws while keeping the covariates fixed.
     Y      X1   X2 X3
1 4264 305.657 7.17  0
2 4496 328.476 6.20  0
3 4317 317.164 4.61  0
4 4292 366.745 7.02  0
5 4945 265.518 8.61  1
6 4325 301.995 6.88  0
That is the head of the grocery data.
What I've done so far for other problems related:
#5.
#using beta_hat
#create the design matrix: a column of 52 ones (for the intercept) plus the Xs
X <- cbind(rep(1,52), grocery$X1, grocery$X2, grocery$X3)
beta_hat <- solve((t(X) %*% X)) %*% t(X) %*% grocery$Y
round(t(beta_hat), 2)
#using lm formula and residuals
#lm formula
lm0 <- lm(formula = Y ~ X1 + X2 + X3, data = grocery)
#6.
residuals(lm0)[1:5]
Below is the summary output of the lm() fit to the original data:
Call:
lm(formula = Y ~ X1 + X2 + X3, data = grocery)
Residuals:
Min 1Q Median 3Q Max
-264.05 -110.73 -22.52 79.29 295.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4149.8872 195.5654 21.220 < 2e-16 ***
X1 0.7871 0.3646 2.159 0.0359 *
X2 -13.1660 23.0917 -0.570 0.5712
X3 623.5545 62.6409 9.954 2.94e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared: 0.6883, Adjusted R-squared: 0.6689
F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12
The result should be a loop that produces the sampling distribution of the t-statistic. What I have right now is for another problem that focuses on fitting the model to the data.
Here I'm given the true model (under which the null is imposed), but I'm not sure where to begin with the loop.
Okay, have a look at the following:
# get some sample data:
set.seed(42)
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
# note how X1 gets multiplied by 0, to highlight that the null (no effect of X1) is imposed.
df$y_star <- with(df, 4200 + 0*X1 - 15*X2 + 620 * X3)
head(df)
X1 X2 X3 y_star
1 1.37095845 1.3048697 0 4180.427
2 -0.56469817 2.2866454 0 4165.700
3 0.36312841 -1.3888607 0 4220.833
4 0.63286260 -0.2787888 1 4824.182
5 0.40426832 -0.1333213 0 4202.000
# define function to get the t statistic
get_tstat <- function(){
  # declare the outcome, with random noise added:
  # the added random noise here will be different in each draw
  df$y <- with(df, y_star + rnorm(10, mean = 0, sd = sqrt(20500)))
  # run linear model
  mod <- lm(y ~ X1 + X2 + X3, data = df)
  return(summary(mod)$coefficients["X1", "t value"])
}
# get 10 values from the t-statistic:
replicate(10, get_tstat())
[1] -0.8337737 -1.2567709 -1.2303073 0.3629552 -0.1203216 -0.1150734 0.3533095 1.6261360
[9] 0.8259006 -1.3979176
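To get the full sampling distribution the question asks for, the same function can simply be replicated 10,000 times. A sketch (my addition); the degrees of freedom, 10 - 4 = 6, match the toy data above, not the grocery data:
# 10,000 draws of the t-statistic under the null
t_draws <- replicate(10000, get_tstat())
# compare with the theoretical t distribution (df = n - p = 10 - 4 = 6 here)
hist(t_draws, breaks = 50, freq = FALSE, main = "t-statistic for X1 under the null")
curve(dt(x, df = 6), add = TRUE, col = "red", lwd = 2)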

r lm parameter estimates

Error: variable lengths differ
I am confused by this error and don't know what to do.
n1<-20
m1<-0
sd1<-1
y<-rnorm(n1,m1, sd1)
x<-rnorm(n1,m1, sd1)
e<-rnorm(n1,m1, sd1)
b0<-0
b1<-1
modelfit1<-lm(y~ b0 + b1*x + e)
Error in model.frame.default(formula = y ~ b0 + b1 * x + e) :
variable lengths differ (found for 'b0')
edited:
I am working on a case where n=20, the true parameters are b0=0 and b1=1, and both the independent variable and the error are normally distributed with mean=0 and sd=1.
Is this possible?
Thanks a lot!
I may be wrong, but I believe you want to simulate an outcome and then estimate its parameters. If that is true, you would rather do the following:
n1 <- 20
m1 <- 0
sd1<- 1
b0 <- 0
b1 <- 1
x <- rnorm(n1,m1, sd1)
e <- rnorm(n1,m1, sd1)
y <- b0 + b1*x + e
summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.66052 -0.40203 0.05659 0.44115 1.38798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3078 0.1951 -1.578 0.132
x 1.1774 0.2292 5.137 6.9e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.852 on 18 degrees of freedom
Multiple R-squared: 0.5945, Adjusted R-squared: 0.572
F-statistic: 26.39 on 1 and 18 DF, p-value: 6.903e-05
And in case you want to do this multiple times, consider the following:
repetitions <- 5
betas <- t(sapply(1:repetitions, function(i){
  y <- b0 + b1*x + rnorm(n1, m1, sd1)
  coefficients(lm(y ~ x))
}))
betas
(Intercept) x
[1,] 0.21989182 0.8185690
[2,] -0.12820726 0.7289041
[3,] -0.27596844 0.9794432
[4,] 0.06145306 1.0575050
[5,] -0.31429950 0.9984262
Now you can look at the mean of the estimated betas:
colMeans(betas)
(Intercept) x
-0.08742606 0.91656951
and the variance-covariance matrix:
var(betas)
(Intercept) x
(Intercept) 0.051323041 -0.007976803
x -0.007976803 0.018834711
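With many more repetitions the average estimates should settle near the true values b0 = 0 and b1 = 1; a quick sketch (my addition) extending the code above:
# increase the number of repetitions; x stays fixed across draws, only the noise is redrawn
repetitions <- 10000
betas <- t(sapply(1:repetitions, function(i){
  y <- b0 + b1*x + rnorm(n1, m1, sd1)
  coefficients(lm(y ~ x))
}))
colMeans(betas)   # should now be close to c(0, 1)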
I suggest you put everything in a data.frame and deal with it that way:
set.seed(2)
n1 <- 20
m1 <- 0
sd1 <- 1
y<-rnorm(n1,m1, sd1)
x<-rnorm(n1,m1, sd1)
b0<-0
b1<-1
d <- data.frame(y,b0,b1,x,e=rnorm(20,0,1))
head(d)
# y b0 b1 x e
# 1 -0.89691455 0 1 2.090819205 -0.3835862
# 2 0.18484918 0 1 -1.199925820 -1.9591032
# 3 1.58784533 0 1 1.589638200 -0.8417051
# 4 -1.13037567 0 1 1.954651642 1.9035475
# 5 -0.08025176 0 1 0.004937777 0.6224939
# 6 0.13242028 0 1 -2.451706388 1.9909204
Now things work nicely:
modelfit1 <- lm(y~b0+b1*x+e, data=d)
modelfit1
# Call:
# lm(formula = y ~ b0 + b1 * x + e, data = d)
# Coefficients:
# (Intercept) b0 b1 x e b1:x
# 0.19331 NA NA -0.06752 0.02240 NA
summary(modelfit1)
# Call:
# lm(formula = y ~ b0 + b1 * x + e, data = d)
# Residuals:
# Min 1Q Median 3Q Max
# -2.5006 -0.4786 -0.1425 0.6211 1.8488
# Coefficients: (3 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.19331 0.25013 0.773 0.450
# b0 NA NA NA NA
# b1 NA NA NA NA
# x -0.06752 0.21720 -0.311 0.760
# e 0.02240 0.20069 0.112 0.912
# b1:x NA NA NA NA
# Residual standard error: 1.115 on 17 degrees of freedom
# Multiple R-squared: 0.006657, Adjusted R-squared: -0.1102
# F-statistic: 0.05697 on 2 and 17 DF, p-value: 0.9448

Getting the NA coefficients from summary of lm.fit object in R [duplicate]

library(lmPerm)
x <- lmp(formula = a ~ b * c + d + e, data = df, perm = "Prob")
summary(x) # truncated output, I can see `NA` rows here!
#Coefficients: (1 not defined because of singularities)
# Estimate Iter Pr(Prob)
#b 5.874 51 1.000
#c -30.060 281 0.263
#b:c NA NA NA
#d1 -31.333 60 0.633
#d2 33.297 165 0.382
#d3 -19.096 51 1.000
#e 1.976 NA NA
I want to pull out the Pr(Prob) results for everything, but
y <- summary(x)$coef[, "Pr(Prob)"]
#(Intercept) b c d1 d2
# 0.09459459 1.00000000 0.26334520 0.63333333 0.38181818
# d3 e
# 1.00000000 NA
This is not what I want. I need the b:c row too, in the right position.
An example of the output I would like from the above would be:
# (Intercept) b c b:c d1 d2
# 0.09459459 1.00000000 0.26334520 NA 0.63333333 0.38181818
# d3 e
# 1.00000000 NA
I also would like to pull out the Iter column that corresponds to each variable. Thanks.
lmp is based on lm and summary.lmp also behaves like summary.lm, so I will first use lm for illustration, then show that we can do the same for lmp.
lm and summary.lm
Have a read of ?summary.lm and watch out for the following returned values:
coefficients: a p x 4 matrix with columns for the estimated
coefficient, its standard error, t-statistic and
corresponding (two-sided) p-value. Aliased coefficients are
omitted.
aliased: named logical vector showing if the original coefficients are
aliased.
When you have rank-deficient models, NA coefficients are omitted in the coefficient table, and they are called aliased variables. Consider the following small, reproducible example:
set.seed(0)
zz <- xx <- rnorm(10)
yy <- rnorm(10)
fit <- lm(yy ~ xx + zz)
coef(fit) ## we can see `NA` here
#(Intercept) xx zz
# 0.1295147 0.2706560 NA
a <- summary(fit) ## it is also printed to screen
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1295 0.3143 0.412 0.691
#xx 0.2707 0.2669 1.014 0.340
#zz NA NA NA NA
b <- coef(a) ## but no `NA` returned in the matrix / table
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1295147 0.3142758 0.4121051 0.6910837
#xx 0.2706560 0.2669118 1.0140279 0.3402525
d <- a$aliased
#(Intercept) xx zz
# FALSE FALSE TRUE
If you want to pad NA rows back into the coefficient table / matrix, we can do
## an augmented matrix of `NA`
e <- matrix(nrow = length(d), ncol = ncol(b),
            dimnames = list(names(d), dimnames(b)[[2]]))
## fill rows for non-aliased variables
e[!d] <- b
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1295147 0.3142758 0.4121051 0.6910837
#xx 0.2706560 0.2669118 1.0140279 0.3402525
#zz NA NA NA NA
lmp and summary.lmp
Nothing needs to be changed.
library(lmPerm)
fit <- lmp(yy ~ xx + zz, perm = "Prob")
a <- summary(fit) ## `summary.lmp`
b <- coef(a)
# Estimate Iter Pr(Prob)
#(Intercept) -0.0264354 241 0.2946058
#xx 0.2706560 241 0.2946058
d <- a$aliased
#(Intercept) xx zz
# FALSE FALSE TRUE
e <- matrix(nrow = length(d), ncol = ncol(b),
            dimnames = list(names(d), dimnames(b)[[2]]))
e[!d] <- b
# Estimate Iter Pr(Prob)
#(Intercept) -0.0264354 241 0.2946058
#xx 0.2706560 241 0.2946058
#zz NA NA NA
If you want to extract Iter and Pr(Prob), just do
e[, 2] ## e[, "Iter"]
#(Intercept) xx zz
# 241 241 NA
e[, 3] ## e[, "Pr(Prob)"]
#(Intercept) xx zz
# 0.2946058 0.2946058 NA
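The same padding logic can be wrapped in a small helper. This is my own sketch (pad_aliased is not part of lm or lmPerm); it assumes the summary object exposes coef() and $aliased as shown above:
# hypothetical helper: rebuild the coefficient table with aliased rows as NA
pad_aliased <- function(s) {
  b <- coef(s)          # coefficient table without aliased rows
  d <- s$aliased        # named logical: TRUE for aliased variables
  e <- matrix(nrow = length(d), ncol = ncol(b),
              dimnames = list(names(d), colnames(b)))
  e[!d, ] <- b          # fill rows for non-aliased variables
  e
}
pad_aliased(summary(fit))                  # the lmp fit above
pad_aliased(summary(lm(yy ~ xx + zz)))     # works for plain lm too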

Is there a way to change the way R labels the interaction parameter in model output?

I am having a seemingly simple but very frustrating problem. When you run a model with an interaction term in R, R names the generated parameter "var1:var2" etc. Unfortunately, this naming convention prevents me from calculating predicted values and CIs where newdata is required, because ":" is not a character that can be included in a column header, and the names in the original data frame must exactly match those in newdata. Has anyone else had this problem?
Here is a sample of my code:
wemedist2.exp = glm(survive/trials ~ sitedist + type + sitedist*type + roaddist, family = binomial(logexp(wemedata$expos)), data=wemedata)
summary(wemedist2.exp)
wemepredict3 = with(wemedata, data.frame(sitedist=mean(sitedist),roaddist=mean(roaddist), type=factor(1:2)))
wemepredict3 = cbind(wemepredict3, predict(wemedist2.exp, newdata = wemepredict3, type = "link", se = TRUE))
This produces a table with predicted values for each of the variables at the specified levels, but not for the interaction.
For your newdata data frame, you shouldn't include columns for the interactions. The product of the interactive variables will be calculated for you (and multiplied by the estimated coefficient) when calling predict.
For example:
Create some dummy data:
set.seed(1)
n <- 10000
X <- data.frame(x1=runif(n), x2=runif(n))
X$x1x2 <- X$x1 * X$x2
head(X)
# x1 x2 x1x2
# 1 0.2655087 0.06471249 0.017181728
# 2 0.3721239 0.67661240 0.251783646
# 3 0.5728534 0.73537169 0.421260147
# 4 0.9082078 0.11129967 0.101083225
# 5 0.2016819 0.04665462 0.009409393
# 6 0.8983897 0.13091031 0.117608474
b <- runif(4)
y <- b[1] + c(as.matrix(X) %*% b[-1]) + rnorm(n, sd=0.1)
Fit the model and compare the estimated vs. true coefficients:
M <- lm(y ~ x1 * x2, X)
summary(M)
# Call:
# lm(formula = y ~ x1 * x2, data = X)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.43208 -0.06743 -0.00170 0.06601 0.37197
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.202040 0.003906 51.72 <2e-16 ***
# x1 0.128237 0.006809 18.83 <2e-16 ***
# x2 0.156942 0.006763 23.21 <2e-16 ***
# x1:x2 0.292582 0.011773 24.85 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.09906 on 9996 degrees of freedom
# Multiple R-squared: 0.5997, Adjusted R-squared: 0.5996
# F-statistic: 4992 on 3 and 9996 DF, p-value: < 2.2e-16
b
# [1] 0.2106027 0.1147864 0.1453641 0.3099322
Create example data to predict to, and do prediction. Note that we only create x1 and x2, and do not create x1:x2:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
head(X.predict)
# x1 x2
# 1 0.26037592 0.7652155
# 2 0.73988333 0.3352932
# 3 0.02650689 0.9788743
# 4 0.84083874 0.1446228
# 5 0.85052685 0.7674547
# 6 0.13568509 0.9612156
predict(M, newdata=X.predict)
# 1 2 3 4 5 6 7
# 0.4138194 0.4221251 0.3666572 0.3681432 0.6225354 0.4084543 0.4711018
# 8 9 10
# 0.7092744 0.3401867 0.2320834
Or...
An alternative approach is to include the interaction in your model-fitting data by calculating the product of the interacting variables yourself, and then to include that product in your new data as well. We've already done the first step above, where we created a column called x1x2.
Then we would fit the model with: lm(y ~ x1 + x2 + x1x2, X)
And predict to data that carries the same pre-computed product:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
X.predict$x1x2 <- X.predict$x1 * X.predict$x2
If you have categorical variables involved in interactions...
When you have interactions involving categorical variables, the model estimates coefficients describing the effect of belonging to each level relative to belonging to a reference level. So for instance if we have one continuous predictor (x1) and one categorical predictor (x2, with levels a, b, and c), then the model y ~ x1 * x2 will estimate six coefficients, describing:
the intercept (i.e. the predicted y when x1 is zero and the observation belongs to the reference level of x2);
the effect of varying x1 when the observation belongs to the reference level of x2 (i.e. the slope, for the reference level of x2);
the effect of belonging to the second level (i.e. the change in intercept due to belonging to the second level, relative to belonging to the reference level);
the effect of belonging to the third level (i.e. the change in intercept due to belonging to the third level, relative to belonging to the reference level);
the change in the effect of x1 (i.e. change in slope) due to belonging to the second level, relative to belonging to the reference level; and
the change in the effect of x1 (i.e. change in slope) due to belonging to the third level, relative to belonging to the reference level.
If you want to fit and predict the model with/to pre-calculated data describing the interaction, you can create a dataframe that includes columns: x1; x2b (binary, indicating whether the observation belongs to level b); x2c (binary, indicating whether the observation belongs to level c); x1x2b (the product of x1 and x2b); and x1x2c (the product of x1 and x2c).
A quick way to do this is with model.matrix:
set.seed(1)
n <- 1000
d <- data.frame(x1=runif(n), x2=sample(letters[1:3], n, replace=TRUE))
head(d)
# x1 x2
# 1 0.2655087 b
# 2 0.3721239 c
# 3 0.5728534 b
# 4 0.9082078 c
# 5 0.2016819 a
# 6 0.8983897 a
X <- model.matrix(~x1*x2, d)
head(X)
# (Intercept) x1 x2b x2c x1:x2b x1:x2c
# 1 1 0.2655087 1 0 0.2655087 0.0000000
# 2 1 0.3721239 0 1 0.0000000 0.3721239
# 3 1 0.5728534 1 0 0.5728534 0.0000000
# 4 1 0.9082078 0 1 0.0000000 0.9082078
# 5 1 0.2016819 0 0 0.0000000 0.0000000
# 6 1 0.8983897 0 0 0.0000000 0.0000000
b <- rnorm(6) # coefficients
y <- X %*% b + rnorm(n, sd=0.1)
You can rename the columns of X to whatever you want, as long as you use consistent naming when predicting the model to new data later.
Now fit the model. Here I tell lm not to calculate an intercept (with -1), since the variable (Intercept) already exists in X and will have a coefficient calculated for it. We could have also done this by fitting to data as.data.frame(X[, -1]):
(M <- lm(y ~ . - 1, as.data.frame(X)))
# Call:
# lm(formula = y ~ . - 1, data = as.data.frame(X))
#
# Coefficients:
# `(Intercept)` x1 x2b x2c `x1:x2b` `x1:x2c`
# 1.14389 1.09168 -0.88879 0.20405 0.09085 -1.63769
Create some new data to predict to, and carry out the prediction:
d.predict <- expand.grid(x1=seq(0, 1, 0.1), x2=letters[1:3])
X.predict <- model.matrix(~x1*x2, d.predict)
y.predict <- predict(M, as.data.frame(X.predict))
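As a quick sanity check (my addition, not part of the original answer), the predictions from the model.matrix route should match the linear predictor computed by hand, since the coefficients are in the same order as the columns of X.predict:
# manual linear predictor vs. predict()
head(cbind(predict = y.predict, manual = drop(X.predict %*% coef(M))))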

lm options, do regression of each category [duplicate]

This question already has an answer here: Fitting linear model / ANOVA by group (closed 6 years ago as a duplicate).
Data:
Y X levels
y1 x1 2
...
lm(Y~X,I(levels==1))
Does the I(levels==1) mean the regression is restricted to rows where levels==1? If not, how can I regress Y on X only for the rows where levels equals 1?
Have a look at lmList from the nlme package
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
  y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)
library(nlme)
models <- lmList(y ~ x|levels, data = dataset)
The output is a list of lm models, one per level:
models
Call:
Model: y ~ x | levels
Data: dataset
Coefficients:
(Intercept) x
1 4.964104 1.227478
2 10.085231 2.158683
Degrees of freedom: 100 total; 96 residual
Residual standard error: 1.019202
Here is the summary of the first model:
summary(models[[1]])
Call:
lm(formula = form, data = dat, na.action = na.action)
Residuals:
Min 1Q Median 3Q Max
-2.16569 -1.04457 -0.00318 0.78667 2.65927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9641 0.1617 30.703 < 2e-16 ***
x 1.2275 0.1469 8.354 6.47e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.128 on 48 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.584
F-statistic: 69.78 on 1 and 48 DF, p-value: 6.469e-11
lm has a subset parameter; here is an example.
x <- rnorm(100)
y <- rnorm(100, sd=0.1)
y[1:50] <- y[1:50] + 3*x[1:50] + 10 # line y = 3x+10
y[51:100] <- y[51:100] + 8*x[51:100] - 5 # line y = 8x-5
levels <- rep(1:2, each=50, len=100)
data = data.frame(x=x, y=y, levels=levels)
lm(y ~ x, data=data, subset=levels==1) # regression for the first part
Coefficients: (Intercept) x
10.015 2.996
lm(y ~ x, data=data, subset=levels==2) # second part
Coefficients: (Intercept) x
-4.986 8.000
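If you want a fit for every level in one go without nlme, here is a small base-R sketch (my addition) using the data frame from the example above:
# one lm per level via split + lapply
fits <- lapply(split(data, data$levels), function(d) lm(y ~ x, data = d))
lapply(fits, coef)   # coefficients for level 1 and level 2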
You are passing I(levels==1) implicitly to subset inside lm.
I was not sure, but this code seems to suggest that you are correct:
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y,I(level==1), data=my.data2)
my.data <- "x y level
1 2 1
3 4 1
5 5 1
7 7 1
9 10 1"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, data=my.data2)
