glmnet: extracting standardized coefficients - r

I am running regression models with the function cv.glmnet(). The argument standardize = TRUE standardises all x variables (predictors) prior to fitting the model. However, the coefficients in the output are always returned on the original scale.
Is there a way of receiving standardized coefficients (beta weights) for the output, so that coefficients are comparable?

When you standardize (scale), you compute (x - mean(x))/sd(x). When the regression is run on this, the centering part (- mean(x)) is absorbed into the intercept, so only the division by the standard deviation affects your coefficient.
To go from the unscaled coefficients to the scaled ones, you can therefore multiply each coefficient by the standard deviation of its predictor.
We can check this, first the regression on scaled x variables:
scaled_mt = mtcars
scaled_mt[,-1] = scale(scaled_mt[,-1])
fit_scaled = lm(mpg ~ .,data=scaled_mt)
The regression on original:
fit = lm(mpg ~ .,data=mtcars)
The glmnet fit, where I set a very low lambda so that all terms are included:
library(glmnet)
fit_lasso = cv.glmnet(y=as.matrix(mtcars[,1]),x=as.matrix(mtcars)[,-1],lambda=c(0.0001,0.00001))
Standard deviation for all x variables:
allSD = apply(mtcars[,-1],2,sd)
To show the transformation is ok:
cbind(scaled=coefficients(fit_scaled)[-1],
from_lm = coefficients(fit)[-1]*allSD,
from_glmnet = coefficients(fit_lasso)[-1]*allSD)
scaled from_lm from_glmnet
cyl -0.1990240 -0.1990240 -0.1762826
disp 1.6527522 1.6527522 1.6167872
hp -1.4728757 -1.4728757 -1.4677513
drat 0.4208515 0.4208515 0.4268243
wt -3.6352668 -3.6352668 -3.6071975
qsec 1.4671532 1.4671532 1.4601126
vs 0.1601576 0.1601576 0.1615794
am 1.2575703 1.2575703 1.2563485
gear 0.4835664 0.4835664 0.4922507
carb -0.3221020 -0.3221020 -0.3412025
But note, this does not necessarily make them comparable, because they are merely scaled by their standard deviations. The more important purpose of scaling is centering, which makes positive and negative relationships easier to interpret.
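As a small sketch of the point that centering is absorbed into the intercept (using the built-in mtcars data), compare a raw fit with a center-only fit; the slope is unchanged and the new intercept is simply the mean of the outcome:

```r
# Centering a predictor only moves the intercept; the slope is unchanged.
fit_raw      <- lm(mpg ~ wt, data = mtcars)
fit_centered <- lm(mpg ~ scale(wt, scale = FALSE), data = mtcars)  # center only

coef(fit_raw)[2]       # slope on the original wt
coef(fit_centered)[2]  # identical slope
coef(fit_centered)[1]  # intercept now equals mean(mtcars$mpg)
```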

Related

How to calculate marginal effects of logit model with fixed effects by using a sample of more than 50 million observations

I have a sample of more than 50 million observations. I estimate the following model in R:
model1 <- feglm(rejection~ variable1+ variable1^2 + variable2+ variable3+ variable4 | city_fixed_effects + year_fixed_effects, family=binomial(link="logit"), data=database)
Based on the estimates from model1, I calculate the marginal effects:
mfx2 <- marginaleffects(model1)
summary(mfx2)
This line of code also calculates the marginal effects of each fixed effect, which slows down R. I only need the average marginal effects of variables 1, 2, and 3. If I calculate them separately, using mfx2 <- marginaleffects(model1, variables = "variable1"), then it does not show the standard error and the p-value of the average marginal effects.
Any solution for this issue?
Both the fixest and the marginaleffects packages have made recent
changes to improve interoperability. The next official CRAN releases
will be able to do this, but as of 2021-12-08 you can use the
development versions. Install:
library(remotes)
install_github("lrberge/fixest")
install_github("vincentarelbundock/marginaleffects")
I recommend converting your fixed effects variables to factors before
fitting your models:
library(fixest)
library(marginaleffects)
dat <- mtcars
dat$gear <- as.factor(dat$gear)
mod <- feglm(am ~ mpg + mpg^2 + hp + hp^3 | gear,
family = binomial(link = "logit"),
data = dat)
(Note that in R's formula language ^ denotes interaction crossing, not exponentiation; wrap powers in I(), e.g. I(mpg^2), if a genuine polynomial term is intended.)
Then, you can use marginaleffects and summary to compute average
marginal effects:
mfx <- marginaleffects(mod, variables = "mpg")
summary(mfx)
## Average marginal effects
## type Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 response mpg 0.3352 40 0.008381 0.99331 -78.06 78.73
##
## Model type: fixest
## Prediction type: response
Note that computing average marginal effects requires calculating a
distinct marginal effect for every single row of your dataset. This can
be computationally expensive when your data includes millions of
observations.
Instead, you can compute marginal effects for specific values of the
regressors using the newdata argument and the typical function.
Please refer to the marginaleffects documentation for details on
those:
marginaleffects(mod,
variables = "mpg",
newdata = typical(mpg = 22, gear = 4))
## rowid type term dydx std.error hp mpg gear predicted
## 1 1 response mpg 1.068844 50.7849 146.6875 22 4 0.4167502

What is actually occurring in this multiple linear regression analysis done in R?

I am working on a project with data analysis in R.
What I am seeking to do is determine whether a dataset can be described with a linear regression model, and then test whether certain subgroups of that dataset have a stronger correlation than the whole dataset. More specifically, I am comparing a dataset in which students recorded their pulse and time estimations, and checking whether the correlation is stronger in the subgroup of students not found to have a daily rhythm in either variable than in the subgroup calculated to have a daily rhythm in both time estimation and heart rate. The values I am using are the daily averages of both time estimation and heart rate.
I ran a linear model of the whole dataset:
> summary(ptmod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-11.7310 -1.6725 -0.0162 2.0134 9.8548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.82047 2.99244 22.998 <2e-16 ***
avg.pulse -0.10449 0.04115 -2.539 0.0125 *
and also attempted to run a linear regression of each subgroup
> summary(ptmod2)
Call:
lm(formula = avg.time ~ avg.pulse + Group, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-12.9884 -1.7723 -0.1873 2.4900 8.7424
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.45350 2.92287 23.420 < 2e-16 ***
avg.pulse -0.08566 0.03985 -2.149 0.03388 *
GroupOne -1.22325 0.91444 -1.338 0.18386
GroupThree 0.11062 0.97666 0.113 0.91003
GroupTwo -3.09096 0.95446 -3.238 0.00161 **
However, I wanted to make sure that what I was seeing was correct, because I did not really expect so many of the groups to have significant coefficients. So I cut the groups up into their own .csv files and generated linear models for each of them individually. Cutting them up into their own files also made it easier to run a Chow test as a post-hoc analysis. When I ran regressions on them again, I got quite different coefficients.
For example, here is the summary for Group One:
> summary(mod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = group1)
Residuals:
Min 1Q Median 3Q Max
-7.1048 -1.6529 -0.7279 1.4063 5.6574
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.41445 4.15917 16.209 8.99e-15 ***
avg.pulse -0.08916 0.05657 -1.576 0.128
This makes me question what exactly my results from the summary of ptmod2 actually mean? I was uncertain of how to set up the R code for the linear model sorted by individual subgroups, so my code for it was
> ptmod2<-lm(avg.time~avg.pulse+Group, data=pulsetime)
In my spreadsheet file, I have three columns: avg.pulse, avg.time, and Group. "Group" is a column of the words "One", "Two", "Three", and "Four" assigned based on subgroup.
Did the summary for ptmod2 fit a linear regression across the whole dataset? I am really not sure what happened.
Thank you so much for any insight you can provide. Perhaps my code for comparing regressions by group was incorrect.
This is somewhat of a split between a programming and a statistics question, and might be better suited for Cross Validated. However, the question is simple enough to get an understanding of here.
Your question can be split into the following sub-questions:
Am I fitting a model on the full (whole) dataset in ptmod2?
How do I estimate multiple models across grouped datasets?
What is the correct way to analyse the coefficients of such a situation?
Am I fitting a model on the full (whole) dataset in ptmod2?
The long and short of it is "yes". In R and statistics, adding a "group" variable to your model is not equivalent to splitting your dataset into multiple groups. Instead, it adds indicator variables (0 or 1) for the specific groups, with one level serving as the reference. So in your case you have 4 groups, and you are adding indicators for being in group One, Two or Three, with group Four as the reference level. These measure how much the intercept differs between the groups. E.g. these variables have the interpretation:
If the models share a common slope for avg.pulse, is there a significant difference in avg.time explained by the specific group?
The reason why you see only 3 groups and not 4 is that the fourth group is implied by setting all the other indicators to 0. E.g. if you are not in group One, Two or Three, you are in group Four. So the "effect" of being in group Four is the effect of not being in groups One, Two or Three (in this case).
A method for studying this, which many of my students have found helpful, is to look at a small version of the model.matrix, for example:
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
where you can see directly that there is a column for the (intercept), hp, and 2 columns for cyl6 and cyl8 (no column for cyl4, which is the reference). Matching the indices in cyl6 and cyl8 to the values in mtcars illustrates that a 1 in the cyl6 column indicates cyl == 6.
How do I estimate multiple models across grouped datasets?
There are multiple methods, depending on the question being asked. In your case you seem interested in the question "Is there a significant difference in the effect of avg.pulse between the groups?" I.e., you want to estimate the avg.pulse coefficient for each group. One approach is to do as you did later and estimate a separate model for each group:
groups <- split(pulsetime, pulsetime$Group)
models <- lapply(groups, function(df)lm(avg.time ~ avg.pulse, data = df))
lapply(models, summary)
which gives the estimates. The problem then becomes how to compare them. There are methods for doing so, by comparing the covariance between the parameters of each model, as in multivariate statistical analysis or multiple regression models. This is overly complicated here, however, as the models share a common outcome.
A much simpler method is to incorporate the group-specific estimates by adding the "extra" effect for each group using indicator variables. This works similarly to adding the group variable, but instead of adding it alone (indicating the effect of being in group X), we interact it with the variable in question, using one of
# Let each group have their own `avg.pulse` variable
ptmod2_1 <- lm(formula = avg.time ~ avg.pulse : Group, data = pulsetime)
# Let each group have their own `avg.pulse` variable and account for the effect of `Group`
ptmod2_2 <- lm(formula = avg.time ~ avg.pulse * Group, data = pulsetime)
In the former you'll see avg.pulse:GroupX for all 4 groups, meaning these are the "effect of avg.pulse in group X", while in the latter you'll once again have a reference level. Note a stark difference between the 2 methods: in the former all groups share the same intercept, while in the latter each group can have its own intercept.
In general statistics the latter is the preferred method, unless you have a very good reason not to expect each group to have a different average. It is very similar to the rule of thumb: "Don't test your intercept, unless you have a very good reason, and even then you probably shouldn't." Basically because it makes a lot of logical sense to follow these rules (though it can take a few days of reflection to realize why).
What is the correct way to analyse the coefficients of such a situation?
If you've used one of the 2 latter methods, the analysis is similar to a normal regression analysis. Coefficients can be tested using t-tests, ANOVA and so on (via summary/drop1 and anova), and if you have a reason you can test group merging using standard tests as well (although if the group effects are insignificant, there is rarely a reason to merge them either way). The whole trick becomes "how do I interpret the coefficients".
For method 1 it is glaringly obvious: "group 1 has an avg.pulse effect of so much", and so on. For method 2 it is slightly more subtle. The effect of avg.pulse in group One is avg.pulse + avg.pulse:GroupOne, because avg.pulse does **not** disappear when you change group. It is the reference level, and every other term is the **additional** effect on avg.pulse of going from group X to group Y. Visually, your slope in the graph becomes steeper (flatter) if the interaction coefficient is positive (negative).
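As a sketch of this interpretation, using mtcars with hp as the numeric variable and cyl (as a factor) as the grouping variable, the group-specific slopes of an interaction model can be recovered by adding the interaction terms to the reference slope:

```r
# Recover group-specific slopes from an interaction (method 2) model.
# cyl == 4 is the reference level, so its slope is the bare hp coefficient;
# the other groups add their interaction terms.
mtcars$cyl <- as.factor(mtcars$cyl)
fit <- lm(mpg ~ hp * cyl, data = mtcars)
b <- coef(fit)

slopes <- c(cyl4 = unname(b["hp"]),
            cyl6 = unname(b["hp"] + b["hp:cyl6"]),
            cyl8 = unname(b["hp"] + b["hp:cyl8"]))
slopes  # matches the slopes from fitting lm(mpg ~ hp) within each group
```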
I've given a visualization below using the mtcars dataset, with mpg as outcome, hp as the numeric variable, and cyl (as a factor) as the grouping variable. Confidence intervals are removed as they are not important for the illustration. The important part is to note how different the 2 models are (the cyl == 4 slope is positive in one, negative in the other!). This further supports the idea that method 2 is often "more correct" than the former.
Code for reproducibility
Below is the code I've used for my illustrations and examples
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
#split-fit across groups and run summary
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(df)lm(mpg ~ hp, data = df))
lapply(models, summary)
#Fit using interaction effects
fit_11 <- lm(mpg ~ hp:cyl , mtcars)
fit_12 <- lm(mpg ~ hp*cyl , mtcars)
summary(fit_11)
summary(fit_12)
#Illustrate interaction effects
library(sjPlot)
library(sjmisc)
library(ggplot2)
library(patchwork)
theme_set(theme_sjplot())
p1 <- plot_model(fit_11, type = "pred", terms = c("hp","cyl"), ci.lvl = 0) + ggtitle("Same intercept different slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p2 <- plot_model(fit_12, type = "pred", terms = c("hp", "cyl"), ci.lvl = 0) + ggtitle("Different intercept and slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p1 / p2

Covariance function in R for covariance matrix of residuals

I am looking for a function in R to calculate the covariance matrix of the residuals of an OLS regression. I am unable to find if the cov() function takes into account degrees of freedom of the model and the number of data points in the model when it computes the covariance matrix.
Update: I am trying to do an optimization process that minimizes the residuals of an OLS regression. Typically the unbiased OLS residual variance estimator satisfies E(RSS/(N − p − 1)) = σ², where RSS is the residual sum of squares, N the number of observations, and p the number of predictors (excluding the intercept). I am trying to see whether such a correction is needed when computing the covariance matrix and, if so, whether there is a function in R that does it.
You can use the vcov() function with the summary object after running a regression using the lm function.
Here's an example using the mtcars dataset:
vcov(summary(lm(mpg ~ disp + wt + cyl + carb, data = mtcars)))
(Intercept) disp wt cyl carb
(Intercept) 8.55669203 0.0293259201 -2.08615285 -1.491482503 0.29243798
disp 0.02932592 0.0001528819 -0.00919016 -0.006308583 0.00142303
wt -2.08615285 -0.0091901600 1.12326190 0.137990642 -0.09283828
cyl -1.49148250 -0.0063085825 0.13799064 0.454163264 -0.10918226
carb 0.29243798 0.0014230298 -0.09283828 -0.109182256 0.12568429
This is another way of writing the syntax:
model <- lm(mpg ~ disp + wt + cyl + carb, data = mtcars)
modelsum <- summary(model)
vcov(modelsum)
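To address the update about the degrees-of-freedom correction, here is a minimal check (base R only) confirming that the residual variance reported by summary.lm is indeed RSS divided by the residual degrees of freedom, i.e. N minus the number of estimated coefficients (intercept included):

```r
# The unbiased residual variance uses the residual degrees of freedom.
model <- lm(mpg ~ disp + wt + cyl + carb, data = mtcars)

rss <- sum(resid(model)^2)
n   <- nobs(model)
p   <- length(coef(model))   # estimated coefficients, intercept included
s2  <- rss / (n - p)         # same as rss / df.residual(model)

all.equal(s2, summary(model)$sigma^2)  # TRUE
```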

Standardizing regression coefficients changed significance

I originally had this formula:
lm(PopDif ~ RailDensityDif + Ports + Coast, data = Pop) and got a coefficient of 1,419,000 for RailDensityDif, -0.1011 for ports, and 3418 for Coast. After scaling the variables: lm(scale(PopDif) ~ scale(RailDensityDif) + scale(Ports) + scale(Coast), data = Pop), my coefficient for RailDensityDif is 0.02107 and 0.2221 for Coast, so now Coast is more significant than RailDensityDif. I know scaling isn't supposed to change the significance—why did this happen?
tldr; The p-values characterising the statistical significance of parameters in a linear model may change following scaling (standardising) variables.
As an example, I will work with the mtcars dataset, and regress mpg on disp and drat; or in R's formula language mpg ~ disp + drat.
1. Three linear models
We implement three different (OLS) linear models, the difference being different scaling strategies of the variables.
To start, we don't do any scaling.
m1 <- lm(mpg ~ disp + drat, data = mtcars)
Next, we scale values using scale which by default does two things: (1) It centers values at 0 by subtracting the mean, and (2) it scales values to have unit variance by dividing the (centered) values by their standard deviation.
m2 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars)))
Note that we can apply scale to the data.frame directly, which will scale values by column. scale returns a matrix so we need to transform the resulting object back to a data.frame.
Finally, we scale values using scale without centering. (With center = FALSE, scale divides each column by its root mean square rather than its standard deviation, but any positive column-wise rescaling behaves the same for what follows.)
m3 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars, center = F)))
2. Comparison of parameter estimates and statistical significance
Let's inspect the parameter estimates for m1
summary(m1)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 21.84487993 6.747971087 3.237252 3.016655e-03
#disp -0.03569388 0.006652672 -5.365345 9.191388e-06
#drat 1.80202739 1.542091386 1.168561 2.520974e-01
We get the t values from the ratio of the parameter estimates and standard errors; the p-values then follow from the area under the pdf of the t distribution with df = nrow(mtcars) - 3 (as we estimate 3 parameters) where |x| > |t| (corresponding to a two-sided t-test). So for example, for disp we confirm the t value
summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"]
#[1] -5.365345
and the p-value
2 * pt(abs(summary(m1)$coef["disp", "t value"]), nrow(mtcars) - 3, lower.tail = FALSE)
#[1] 9.191388e-06
Let's take a look at results from m2:
summary(m2)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -1.306994e-17 0.09479281 -1.378790e-16 1.000000e+00
#disp -7.340121e-01 0.13680614 -5.365345e+00 9.191388e-06
#drat 1.598663e-01 0.13680614 1.168561e+00 2.520974e-01
Notice that the t values for disp and drat (i.e. the ratios of the estimates and standard errors) are unchanged compared to those of m1; only the intercept differs, since centering makes its estimate (numerically) zero.
If however, we don't center values and only scale them to have unit variance
summary(m3)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0263872 0.31705513 3.237252 3.016655e-03
#disp -0.4446985 0.08288348 -5.365345 9.191388e-06
#drat 0.3126834 0.26757994 1.168561 2.520974e-01
we can see that while the estimates and standard errors are different compared to the (unscaled) results from m1, their respective ratios (i.e. the t values) are identical. So (default) scale(...) changes the statistical significance of the intercept (through centering), while scale(..., center = FALSE) leaves every t value, and hence p-value, unchanged.
It's easy to see why dividing values by their standard deviation does not change the ratio of OLS parameter estimates and standard errors when taking a look at the closed form for the OLS parameter estimate and standard error, see e.g. here.
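A quick numerical check of this invariance, refitting the same m1 and m3 as above:

```r
# Rescaling columns by positive constants rescales each estimate and its
# standard error by the same factor, so every t value is unchanged.
m1 <- lm(mpg ~ disp + drat, data = mtcars)
m3 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars, center = FALSE)))

all.equal(summary(m1)$coef[, "t value"],
          summary(m3)$coef[, "t value"])  # TRUE
```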

Categorical Regression with Centered Levels

R's standard way of doing regression on categorical variables is to select one factor level as a reference level and constraining the effect of that level to be zero. Instead of constraining a single level effect to be zero, I'd like to constrain the sum of the coefficients to be zero.
I can hack together coefficient estimates for this manually after fitting the model the standard way:
x <- lm(data = mtcars, mpg ~ factor(cyl))
z <- c(coef(x), "factor(cyl)4" = 0)
y <- mean(z[-1])
z[-1] <- z[-1] - y
z[1] <- z[1] + y
z
## (Intercept) factor(cyl)6 factor(cyl)8 factor(cyl)4
## 20.5021645 -0.7593074 -5.4021645 6.1614719
But that leaves me without standard error estimates for the former reference level that I just added as an explicit effect, and I need to have those as well.
I did some searching and found the contrasts functions, and tried
lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
but this still only produces two effect estimates. Is there a way to change which constraint R uses for linear regression on categorical variables properly?
Think I've figured it out. Using contrasts actually is the right way to go about it, you just need to do a little work to get the results into a convenient looking form. Here's the fit:
fit <- lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
Then a sum-to-zero contrast matrix, cs <- contr.sum(levels(factor(mtcars$cyl))), is used to get the effect estimates and the standard errors.
The effect estimates just come from multiplying the contrast matrix by the effect estimates lm spits out, like so:
cs %*% coef(fit)[-1]
The standard errors can be calculated using the contrast matrix and the variance-covariance matrix of the coefficients; note the square root, since the diagonal holds variances:
sqrt(diag(cs %*% vcov(fit)[-1,-1] %*% t(cs)))
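Putting it together, a self-contained sketch: the recovered effects reproduce the manually centered estimates computed earlier, and they sum to zero by construction.

```r
# Sum-to-zero contrasts: recover all level effects and their standard errors.
mtcars$cyl <- as.factor(mtcars$cyl)
fit <- lm(mpg ~ C(cyl, contr = contr.sum), data = mtcars)

cs <- contr.sum(levels(mtcars$cyl))    # 3 x 2 contrast matrix
effects <- drop(cs %*% coef(fit)[-1])  # one effect per level, summing to zero
ses <- sqrt(diag(cs %*% vcov(fit)[-1, -1] %*% t(cs)))

data.frame(level = levels(mtcars$cyl), effect = effects, se = ses)
```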
