I am looking for a function in R to calculate the covariance matrix of the residuals of an OLS regression. I cannot tell from the documentation whether the cov() function accounts for the model's degrees of freedom and the number of data points when it computes the covariance matrix.
Update: I am trying to do an optimization process that minimizes the residuals of an OLS regression. The unbiased estimator of the OLS residual variance satisfies E[RSS/(N − p − 1)] = σ², where RSS is the residual sum of squares, N is the number of observations, and p is the number of predictors (so p + 1 coefficients including the intercept). I am trying to see whether such a correction is needed when computing the covariance matrix and, if so, whether there is a function in R that applies it.
You can use the vcov() function on the summary object after fitting a regression with lm(). It returns the variance-covariance matrix of the estimated coefficients, computed with the degrees-of-freedom-corrected residual variance (RSS divided by the residual degrees of freedom).
Here's an example using the mtcars dataset:
vcov(summary(lm(mpg ~ disp + wt + cyl + carb, data = mtcars)))
            (Intercept)          disp          wt          cyl        carb
(Intercept)  8.55669203  0.0293259201 -2.08615285 -1.491482503  0.29243798
disp         0.02932592  0.0001528819 -0.00919016 -0.006308583  0.00142303
wt          -2.08615285 -0.0091901600  1.12326190  0.137990642 -0.09283828
cyl         -1.49148250 -0.0063085825  0.13799064  0.454163264 -0.10918226
carb         0.29243798  0.0014230298 -0.09283828 -0.109182256  0.12568429
This is another way of writing the syntax:
model <- lm(mpg ~ disp + wt + cyl + carb, data = mtcars)
modelsum <- summary(model)
vcov(modelsum)
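If you want to check the degrees-of-freedom handling yourself, you can rebuild the matrix by hand from the classical OLS formula; a minimal sketch using only base R (vcov(model) is equivalent to vcov on the summary object):
model <- lm(mpg ~ disp + wt + cyl + carb, data = mtcars)
X <- model.matrix(model)                   # design matrix, intercept included
rss <- sum(residuals(model)^2)             # residual sum of squares
sigma2 <- rss / df.residual(model)         # RSS / (N - p), p = number of coefficients
manual_vcov <- sigma2 * solve(t(X) %*% X)  # sigma^2 * (X'X)^-1
all.equal(manual_vcov, vcov(model))        # should be TRUE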
I seem to be running into some issues when I run the code below:
library(lme4)
columns <- c("disp", "hp", "wt", "qsec", "vs")
X <- mtcars[,c(columns, 'cyl', 'carb', 'gear')] # data
rf <- lmer(cyl ~ (carb | gear) + ., data = X)
rf
### The output that I don't want (lists 'carb' and 'gear' as fixed variables):
Linear mixed model fit by REML ['lmerMod']
Formula: cyl ~ (carb | gear) + .
Data: X
REML criterion at convergence: 76.9662
Random effects:
Groups Name Std.Dev. Corr
gear (Intercept) 0.2887
carb 0.2039 -1.00
Residual 0.5202
Number of obs: 32, groups: gear, 3
Fixed Effects:
(Intercept) carb gear disp hp wt
10.179140 0.025990 -0.873174 0.003883 0.008190 0.089656
qsec vs
-0.159582 -0.779400
As you can see, it is treating 'carb' and 'gear' as fixed effects, when I only need them in the random-effects term.
My goal is to keep the code in a similar format and also be able to run the model without the variables 'carb' and 'gear' being taken in as fixed effects (only as random effects).
How can I prevent "~." in the first model from selecting 'carb' and 'gear' as fixed variables so that it may produce the same output as the second model below?
### The output that I need ('carb' and 'gear' listed ONLY as random effects):
> el <- lmer(cyl~ disp + hp + wt + qsec + vs + (carb | gear), data = mtcars)
> el
Linear mixed model fit by REML ['lmerMod']
Formula: cyl ~ disp + hp + wt + qsec + vs + (carb | gear)
Data: mtcars
REML criterion at convergence: 79.7548
Random effects:
Groups Name Std.Dev. Corr
gear (Intercept) 0.9932
carb 0.1688 -0.82
Residual 0.5263
Number of obs: 32, groups: gear, 3
Fixed Effects:
(Intercept) disp hp wt qsec vs
6.848103 0.004024 0.006929 0.172789 -0.169145 -0.785878
Any help at all is greatly appreciated!
The . shortcut in R formula notation means "every variable in data except the variable(s) on the left-hand side of the formula." It is just a convenient shorthand, originally intended for use with functions like lm(), and it is not very flexible, as you have found by applying it outside that context. An lmer() formula contains the response, the fixed effects, and the random effects, and the . expands to every column of the data, including carb and gear, so they enter as fixed effects too. To avoid hard-coding the variable names, you can put the formula together by pasting strings, like this:
columns <- c("disp", "hp", "wt", "qsec", "vs")
fixed_effects <- paste(columns, collapse = '+')
model_formula <- as.formula(paste('cyl ~ (carb|gear) +', fixed_effects))
model_formula
# cyl ~ (carb | gear) + disp + hp + wt + qsec + vs
rf <- lmer(model_formula, data = mtcars)
Here we use paste() with the collapse argument to concatenate all elements of columns with a + between each one. Then we paste() again to combine the part of the formula with the left-hand side and the random effects with the fixed effects. That gives us the full model formula as a string, which we can convert to a formula using as.formula().
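If you would rather not paste the right-hand side together by hand, base R's reformulate() builds a formula from a character vector of term labels; the random-effects term can be passed as a literal string (a small sketch, equivalent to the paste() version above):
model_formula <- reformulate(c("(carb | gear)", columns), response = "cyl")
model_formula
# cyl ~ (carb | gear) + disp + hp + wt + qsec + vs
rf <- lmer(model_formula, data = mtcars)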
I am running regression models with the function cv.glmnet(). The argument standardize = TRUE standardises all x variables (predictors) prior to fitting the model, but the coefficients are always returned on the original scale in the output.
Is there a way of receiving standardized coefficients (beta weights) for the output, so that coefficients are comparable?
When you standardize or scale, you compute (x - mean(x))/sd(x). When a regression is fit on these values, the centering part (subtracting mean(x)) is absorbed into the intercept, so only the division by sd(x) affects the slope coefficients.
To go from the unscaled coefficients to the scaled ones, you can therefore multiply each coefficient by the standard deviation of its predictor.
We can check this, first the regression on scaled x variables:
scaled_mt = mtcars
scaled_mt[,-1] = scale(scaled_mt[,-1])
fit_scaled = lm(mpg ~ .,data=scaled_mt)
The regression on original:
fit = lm(mpg ~ .,data=mtcars)
The glmnet fit, where I set very small lambda values so that all terms stay in the model:
library(glmnet)
fit_lasso = cv.glmnet(y = as.matrix(mtcars[, 1]), x = as.matrix(mtcars[, -1]),
                      lambda = c(0.0001, 0.00001))
Standard deviation for all x variables:
AllSD = apply(mtcars[, -1], 2, sd)
To show the transformation is correct:
cbind(scaled = coefficients(fit_scaled)[-1],
      from_lm = coefficients(fit)[-1] * AllSD,
      from_glmnet = coefficients(fit_lasso)[-1] * AllSD)
         scaled    from_lm from_glmnet
cyl  -0.1990240 -0.1990240  -0.1762826
disp  1.6527522  1.6527522   1.6167872
hp   -1.4728757 -1.4728757  -1.4677513
drat  0.4208515  0.4208515   0.4268243
wt   -3.6352668 -3.6352668  -3.6071975
qsec  1.4671532  1.4671532   1.4601126
vs    0.1601576  0.1601576   0.1615794
am    1.2575703  1.2575703   1.2563485
gear  0.4835664  0.4835664   0.4922507
carb -0.3221020 -0.3221020  -0.3412025
But note, this does not necessarily make them comparable, because they are scaled by the standard deviation. The more important purpose of scaling is to center the variables, so you can interpret positive or negative relationships more easily.
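If you want standardized coefficients straight out of glmnet instead of rescaling afterwards, one option is to scale the predictors yourself and switch the built-in standardization off. A minimal sketch (note that scale() uses the n-1 denominator for the standard deviation while glmnet's internal standardization uses n, so the numbers can differ slightly from the rescaled ones above):
library(glmnet)
x_scaled = scale(as.matrix(mtcars[, -1]))  # standardize the predictors up front
fit_std = cv.glmnet(y = as.matrix(mtcars[, 1]), x = x_scaled,
                    standardize = FALSE, lambda = c(0.0001, 0.00001))
coef(fit_std)[-1]  # coefficients already on the standardized scale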
Will anyone be able to explain how to set constants for different levels of categorical variables in R?
I have read the following: How to set the Coefficient Value in Regression; R and it does a good job of explaining how to set a constant for the whole of a categorical variable. I would like to know how to set one for each level.
As an example, let us look at the mtcars dataset:
df <- as.data.frame(mtcars)
df$cyl <- as.factor(df$cyl)
set.seed(1)
glm(mpg ~ cyl + hp + gear, data = df)
This gives me the following output:
Call: glm(formula = mpg ~ cyl + hp + gear, data = df)
Coefficients:
(Intercept) cyl6 cyl8 hp gear
19.80268 -4.07000 -2.29798 -0.05541 2.79645
Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
Null Deviance: 1126
Residual Deviance: 219.5 AIC: 164.4
If I wanted to set cyl6 to -0.34 and cyl8 to -1.4, and then rerun to see how it affects the other variables, how would I do that?
I think this is what you can do:
# copy the response and shift it by the chosen constant for each cyl level
df$mpgCyl = df$mpg
df$mpgCyl[df$cyl == 6] = df$mpgCyl[df$cyl == 6] - 0.34
df$mpgCyl[df$cyl == 8] = df$mpgCyl[df$cyl == 8] - 1.4
# refit without cyl, on the adjusted response
model2 = glm(mpgCyl ~ hp + gear, data = df)
> model2
Call: glm(formula = mpgCyl ~ hp + gear, data = df)
Coefficients:
(Intercept) hp gear
16.86483 -0.07146 3.53128
UPDATE with comments:
cyl is a factor, so by default each level contributes an intercept shift in the glm, not a slope. The reference level cyl==4 is 'hidden' but present in the glm as well, absorbed into the intercept. So what your first model says is:
1) for cyl==4: mpg = 19.8 - 0.055*hp + 2.79*gear
2) for cyl==6: mpg = (19.8 - 4.07) - 0.055*hp + 2.79*gear
3) for cyl==8: mpg = (19.8 - 2.29) - 0.055*hp + 2.79*gear
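If you would rather pin the level effects without editing the response, glm's offset() does the same bookkeeping explicitly: an offset enters the linear predictor with its coefficient fixed at 1. A minimal sketch, assuming the targets are -0.34 for cyl6 and -1.4 for cyl8:
# the offset term is added to the linear predictor with a fixed coefficient of 1,
# so hp and gear are estimated with the cyl effects held constant
df$cyl_shift = -0.34 * (df$cyl == "6") - 1.4 * (df$cyl == "8")
model3 = glm(mpg ~ hp + gear + offset(cyl_shift), data = df)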
Maybe you can also check here https://stats.stackexchange.com/questions/213710/level-of-factor-taken-as-intercept and here Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?
Hope this helps
I am trying to use the between and within effects model specifications. My question is how to extract the intraclass correlation from these models. In Stata, this output is rho, the share of variance due to differences across panels. Below is what I have using the mtcars dataset. (Hopefully the between and within effects models are correctly specified.)
library(lme4)
library(lfe)  # provides felm()

between <- lmer(mpg ~ disp + hp + (1|cyl), mtcars)
summary(between)
within <- felm(mpg ~ disp + hp | factor(cyl), data = mtcars)
summary(within)
I think I answered my own question regarding the lmer model specification. Running
between.null <- lmer(mpg ~ 1 + (1|cyl), mtcars)
summary(between.null)
reports both the variance of cyl and the variance of the Residual. Rho should then just be variance of cyl / (variance of cyl + variance of Residual).
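To pull rho out programmatically rather than reading the variances off the summary, the components are available via VarCorr(); a small sketch:
vc <- as.data.frame(VarCorr(between.null))      # one row per variance component
rho <- vc$vcov[vc$grp == "cyl"] / sum(vc$vcov)  # var(cyl) / total variance
rho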
I'm trying to loop through all the column names of my data.frame and use them
as predictor variable in a linear regression.
What I currently have is:
for (i in 1:11) {
  for (j in 1:11) {
    if (i != j) {
      var1 = names(newData)[i]
      var2 = names(newData)[j]
      glm.fit = glm(re78 ~ as.name(var1):as.name(var2), data = newData)
      summary(glm.fit)
      cv.glm(newData, glm.fit, K = 10)$delta[1]
    }
  }
}
Where newData is my data.frame and there are 11 columns in total. This code gives me the following error:
Error in model.frame.default(formula = re78 ~ as.name(var1), data = newData, :
invalid type (symbol) for variable 'as.name(var1)'
How can I fix this, and make it work?
It looks like you want models that use all combinations of two variables. Here's another way to do that using the built-in mtcars data frame for illustration and using mpg as the outcome variable.
We get all combinations of two variables (excluding the outcome variable, mpg in this case) using combn. combn returns a list where each list element is a vector containing the names of a pair of variables. Then we use map (from the purrr package) to create models for each pair of variables and store the results in a list.
We use reformulate to construct the model formula. .x refers back to the vectors of variables names (each element of vars). If you run, for example, reformulate(paste(c("cyl", "disp"),collapse="*"), "mpg"), you can see what reformulate is doing.
library(purrr)
# Get all combinations of two variables
vars = combn(names(mtcars)[-grep("mpg", names(mtcars))], 2, simplify=FALSE)
Now we want to run regression models on all pairs of variables and store results in a list:
# No interaction
models = map(vars, ~ glm(reformulate(.x, "mpg"), data=mtcars))
# Interaction only (no main effects)
models = map(vars, ~ glm(reformulate(paste(.x, collapse=":"), "mpg"), data=mtcars))
# Interaction and main effects
models = map(vars, ~ glm(reformulate(paste(.x, collapse="*"), "mpg"), data=mtcars))
Name each list element with the formula for that model:
names(models) = map(models, ~ .x[["terms"]])
To create the model formulas using paste instead of reformulate you could do (change + to : or *, depending on what combination of interactions and main effects you want to include):
models = map(vars, ~ glm(paste("mpg ~", paste(.x, collapse=" + ")), data=mtcars))
To see how paste is being used here, you can run:
paste("mpg ~", paste(c("cyl", "disp"), collapse=" * "))
Here's what the first two models look like when the models include both main effects and the interaction:
models[1:2]
$`mpg ~ cyl * disp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept) cyl disp cyl:disp
49.03721 -3.40524 -0.14553 0.01585
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 198.1 AIC: 159.1
$`mpg ~ cyl * hp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept) cyl hp cyl:hp
50.75121 -4.11914 -0.17068 0.01974
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 247.6 AIC: 166.3
To assess model output, you can use functions from the broom package. The code below returns data frames with, respectively, the coefficients and performance statistics for each model.
library(broom)
model_coefs = map_df(models, tidy, .id="Model")
model_performance = map_df(models, glance, .id="Model")
Here are what the results look like for models with both main effects and the interaction:
head(model_coefs, 8)
Model term estimate std.error statistic p.value
1 mpg ~ cyl * disp (Intercept) 49.03721186 5.004636297 9.798357 1.506091e-10
2 mpg ~ cyl * disp cyl -3.40524372 0.840189015 -4.052950 3.645320e-04
3 mpg ~ cyl * disp disp -0.14552575 0.040002465 -3.637919 1.099280e-03
4 mpg ~ cyl * disp cyl:disp 0.01585388 0.004947824 3.204212 3.369023e-03
5 mpg ~ cyl * hp (Intercept) 50.75120716 6.511685614 7.793866 1.724224e-08
6 mpg ~ cyl * hp cyl -4.11913952 0.988229081 -4.168203 2.672495e-04
7 mpg ~ cyl * hp hp -0.17068010 0.069101555 -2.469989 1.987035e-02
8 mpg ~ cyl * hp cyl:hp 0.01973741 0.008810871 2.240120 3.320219e-02
You can use fit <- glm(as.formula(paste0("re78 ~ ", var1)), data = newData), as @akrun suggests. Further, you likely do not want to call your object glm.fit, as there is a base R function with the same name.
Caveat: I do not know why you have the double loop and the :. Do you not want a regression with a single covariate? I have no idea what you are trying to achieve otherwise.
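For completeness, here is one way the original double loop could be repaired with reformulate(), keeping the interaction-only structure; a sketch assuming newData holds re78 plus the predictor columns:
library(boot)  # for cv.glm()
predictors <- setdiff(names(newData), "re78")
cv_errors <- list()
for (i in seq_along(predictors)) {
  for (j in seq_along(predictors)) {
    if (i != j) {
      # build re78 ~ var1:var2 from strings instead of as.name() inside the formula
      f <- reformulate(paste(predictors[i], predictors[j], sep = ":"),
                       response = "re78")
      fit <- glm(f, data = newData)
      cv_errors[[deparse(f)]] <- cv.glm(newData, fit, K = 10)$delta[1]
    }
  }
}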