I originally had this formula:
lm(PopDif ~ RailDensityDif + Ports + Coast, data = Pop)
and got a coefficient of 1,419,000 for RailDensityDif, -0.1011 for Ports, and 3,418 for Coast. After scaling the variables:
lm(scale(PopDif) ~ scale(RailDensityDif) + scale(Ports) + scale(Coast), data = Pop)
my coefficient for RailDensityDif is 0.02107 and the coefficient for Coast is 0.2221, so now Coast appears more significant than RailDensityDif. I know scaling isn't supposed to change the significance, so why did this happen?
tl;dr: The p-values characterising the statistical significance of parameters in a linear model may change after scaling (standardising) the variables; in the example below this happens only for the intercept, and only when the variables are centered.
As an example, I will work with the mtcars dataset, and regress mpg on disp and drat; or in R's formula language mpg ~ disp + drat.
1. Three linear models
We fit three (OLS) linear models that differ only in how the variables are scaled.
To start, we don't do any scaling.
m1 <- lm(mpg ~ disp + drat, data = mtcars)
Next, we scale values using scale which by default does two things: (1) It centers values at 0 by subtracting the mean, and (2) it scales values to have unit variance by dividing the (centered) values by their standard deviation.
m2 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars)))
Note that we can apply scale to the data.frame directly, which will scale values by column. scale returns a matrix so we need to transform the resulting object back to a data.frame.
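As a quick sanity check (my addition, not part of the original answer), we can confirm on a single column that scale() reproduces the manual (x - mean(x)) / sd(x) transformation:
x <- mtcars$mpg
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
#[1] TRUE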
Finally, we scale values using scale without centering. (With center = FALSE, scale divides each column by its root mean square rather than its standard deviation; either way, each column is simply divided by a positive constant, which is all that matters for the argument below.)
m3 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars, center = FALSE)))
2. Comparison of parameter estimates and statistical significance
Let's inspect the parameter estimates for m1
summary(m1)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 21.84487993 6.747971087 3.237252 3.016655e-03
#disp -0.03569388 0.006652672 -5.365345 9.191388e-06
#drat 1.80202739 1.542091386 1.168561 2.520974e-01
We get the t values as the ratio of the parameter estimates and their standard errors; the p-values then follow from the t distribution with df = nrow(mtcars) - 3 degrees of freedom (as we estimate 3 parameters), taking the probability of observing a value more extreme than |t| in either tail (a two-sided t-test). So for example, for disp we confirm the t value
summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"]
#[1] -5.365345
and the p-value
# using -abs(t) gives the correct two-sided p-value regardless of the sign of t
2 * pt(-abs(summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"]), nrow(mtcars) - 3)
#[1] 9.191388e-06
Let's take a look at results from m2:
summary(m2)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -1.306994e-17 0.09479281 -1.378790e-16 1.000000e+00
#disp -7.340121e-01 0.13680614 -5.365345e+00 9.191388e-06
#drat 1.598663e-01 0.13680614 1.168561e+00 2.520974e-01
Notice how the t value (and p-value) of the intercept differs from that in m1: because the response and the predictors have been centered, the intercept is estimated as (numerically) zero and is no longer significant. The t values of the slopes for disp and drat (i.e. the ratios of the estimates and standard errors) are unchanged.
If, however, we don't center the values and only rescale them
summary(m3)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0263872 0.31705513 3.237252 3.016655e-03
#disp -0.4446985 0.08288348 -5.365345 9.191388e-06
#drat 0.3126834 0.26757994 1.168561 2.520974e-01
we can see that while the estimates and standard errors differ from the (unscaled) results of m1, their respective ratios (i.e. the t values) are identical for every term, including the intercept. So the default scale(...) changes the statistical significance of the intercept (through centering), while scale(..., center = FALSE) leaves all t values, and hence all p-values, unchanged.
It's easy to see why dividing values by a constant (such as their standard deviation) does not change the ratio of the OLS parameter estimates and standard errors once you look at the closed forms of the OLS parameter estimate and standard error, see e.g. here.
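As a sketch of that argument (my notation, not part of the original answer): with

$$\hat\beta = (X^\top X)^{-1} X^\top y, \qquad \widehat{\mathrm{se}}(\hat\beta_j) = \hat\sigma \sqrt{\big[(X^\top X)^{-1}\big]_{jj}},$$

replacing $X$ by $X D^{-1}$, where $D = \mathrm{diag}(c_1, \dots, c_p)$ holds the positive scaling constants, maps $\hat\beta \mapsto D \hat\beta$ and $(X^\top X)^{-1} \mapsto D (X^\top X)^{-1} D$, while the fitted values and $\hat\sigma$ are unchanged. Each estimate and its standard error are therefore multiplied by the same constant $c_j$, so the t value $\hat\beta_j / \widehat{\mathrm{se}}(\hat\beta_j)$ is unaffected. Rescaling the response multiplies all estimates, standard errors and $\hat\sigma$ by the same factor, so again the t values do not change; only centering alters the intercept (and its significance).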
Related
I have fitted a logistic regression for an outcome (a type of side effect, i.e. whether patients have it or not). The formula and results of this model are below:
model <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.888112 0.859847 -9.174 < 2e-16 ***
age 0.028529 0.009212 3.097 0.00196 **
bmi 0.095759 0.015265 6.273 3.53e-10 ***
surgery11 0.923723 0.524588 1.761 0.07826 .
surgery21 1.607389 0.600113 2.678 0.00740 **
surgery31 1.544822 0.573972 2.691 0.00711 **
cvd1 0.624692 0.290005 2.154 0.03123 *
rt1 -0.816374 0.353953 -2.306 0.02109 *
I want to check my model, so I have plotted the residuals against the predictors and fitted values. I know that if a model is properly fitted, there should be no correlation between the residuals and the predictors or fitted values, so I essentially run:
residualPlots(model)
My plots look funny, because from the examples I have seen online they should be symmetrical around 0. Also, my factor variables aren't shown as box plots, although I have checked the structure of my data and coded surgery1, surgery2, surgery4, cvd, rt as factors. Can someone help me interpret my plots and guide me on how to plot box plots for my factor variables?
Thanks
What you see is expected when your label (response) variable is imbalanced. From your plots, most of your residuals actually fall below the dotted line, so I suspect this is the case.
Very briefly, the symmetry of the residuals around 0 only holds for logistic regression when your classes are balanced. If the data are heavily imbalanced towards the reference label (the 0 label), the intercept will be forced towards a low value (i.e. towards the 0 label), and the positive labels will have very large Pearson residuals (because they deviate a lot from what the model expects). You can read more about imbalanced classes and logistic regression in this post.
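For intuition (a short sketch I am adding, not part of the original answer): the Pearson residual for observation $i$ in a logistic regression is

$$r_i = \frac{y_i - \hat p_i}{\sqrt{\hat p_i \,(1 - \hat p_i)}},$$

so when the imbalance pushes the fitted probabilities $\hat p_i$ towards 0, the few positive cases ($y_i = 1$) get large positive residuals while the many negative cases get small negative ones, which produces exactly the asymmetric band of points you describe.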
Here's an example to demonstrate this, using a dataset where you see evenly distributed residuals:
library(mlbench)
library(car)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)
neg pos
500 268
mdl = glm(diabetes ~ .,data=PimaIndiansDiabetes,family="binomial")
residualPlots(mdl)
Let's make it more imbalanced, and you get a plot exactly like yours:
da = PimaIndiansDiabetes
wh = c(which(da$diabetes=="neg"),which(da$diabetes == "pos")[1:100])
da = da[wh,]
table(da$diabetes)
neg pos
500 100
mdl = glm(diabetes ~ .,data=da,family="binomial")
residualPlots(mdl)
I have a sample of more than 50 million observations. I estimate the following model in R:
model1 <- feglm(rejection~ variable1+ variable1^2 + variable2+ variable3+ variable4 | city_fixed_effects + year_fixed_effects, family=binomial(link="logit"), data=database)
Based on the estimates from model1, I calculate the marginal effects:
mfx2 <- marginaleffects(model1)
summary(mfx2)
This line of code also calculates the marginal effects of each fixed effect, which slows down R. I only need to calculate the average marginal effects of variables 1, 2, and 3. If I calculate the marginal effects separately, using mfx2 <- marginaleffects(model1, variables = "variable1"), then it does not show the standard error and p-value of the average marginal effects.
Any solution for this issue?
Both the fixest and the marginaleffects packages have made recent changes to improve interoperability. The next official CRAN releases will be able to do this, but as of 2021-12-08 you can use the development versions. Install:
library(remotes)
install_github("lrberge/fixest")
install_github("vincentarelbundock/marginaleffects")
I recommend converting your fixed effects variables to factors before fitting your models:
library(fixest)
library(marginaleffects)
dat <- mtcars
dat$gear <- as.factor(dat$gear)
mod <- feglm(am ~ mpg + mpg^2 + hp + hp^3| gear,
family = binomial(link = "logit"),
data = dat)
Then, you can use marginaleffects and summary to compute average marginal effects:
mfx <- marginaleffects(mod, variables = "mpg")
summary(mfx)
## Average marginal effects
## type Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 response mpg 0.3352 40 0.008381 0.99331 -78.06 78.73
##
## Model type: fixest
## Prediction type: response
Note that computing average marginal effects requires calculating a distinct marginal effect for every single row of your dataset. This can be computationally expensive when your data includes millions of observations.
Instead, you can compute marginal effects for specific values of the regressors using the newdata argument and the typical function. Please refer to the marginaleffects documentation for details on those:
marginaleffects(mod,
variables = "mpg",
newdata = typical(mpg = 22, gear = 4))
## rowid type term dydx std.error hp mpg gear predicted
## 1 1 response mpg 1.068844 50.7849 146.6875 22 4 0.4167502
I am working on a project with data analysis in R.
What I am seeking to do is determine whether a dataset can be described with a linear regression model, and then test whether certain subgroups of that dataset show a stronger correlation than the dataset as a whole. More specifically, I am working with a dataset in which students recorded their pulse and time estimations, and I am checking whether the correlation is stronger in the subgroup of students who were not found to have a daily rhythm in either variable versus the subgroup who were calculated to have a daily rhythm in both time estimation and heart rate. The values I am using are the daily averages of both time estimation and heart rate.
I ran a linear model of the whole dataset:
> summary(ptmod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-11.7310 -1.6725 -0.0162 2.0134 9.8548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.82047 2.99244 22.998 <2e-16 ***
avg.pulse -0.10449 0.04115 -2.539 0.0125 *
and also attempted to run a linear regression of each subgroup
> summary(ptmod2)
Call:
lm(formula = avg.time ~ avg.pulse + Group, data = pulsetime)
Residuals:
Min 1Q Median 3Q Max
-12.9884 -1.7723 -0.1873 2.4900 8.7424
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.45350 2.92287 23.420 < 2e-16 ***
avg.pulse -0.08566 0.03985 -2.149 0.03388 *
GroupOne -1.22325 0.91444 -1.338 0.18386
GroupThree 0.11062 0.97666 0.113 0.91003
GroupTwo -3.09096 0.95446 -3.238 0.00161 **
However, I wanted to make sure that what I was seeing was correct, because I did not really expect so many of the groups to have significant coefficients. So I cut the groups up into their own .csv files and generated linear models for each of them individually. Cutting them up into their own files also made it easier to run a Chow test as a post-hoc analysis. When I ran regressions on them again, I got quite different coefficients.
For example, here is the summary for Group One:
> summary(mod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = group1)
Residuals:
Min 1Q Median 3Q Max
-7.1048 -1.6529 -0.7279 1.4063 5.6574
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.41445 4.15917 16.209 8.99e-15 ***
avg.pulse -0.08916 0.05657 -1.576 0.128
This makes me question what my results from the summary of ptmod2 actually mean. I was uncertain how to set up the R code for a linear model sorted by individual subgroups, so my code for it was
> ptmod2<-lm(avg.time~avg.pulse+Group, data=pulsetime)
In my spreadsheet file, I have three columns: avg.pulse, avg.time, and Group. "Group" is a column of the words "One", "Two", "Three", and "Four" assigned based on subgroup.
Did the summary for ptmod2 fit a linear regression across the whole dataset? I am really not sure what happened.
Thank you so much for any insight you can provide. Perhaps my code for comparing regressions by group was incorrect.
This is somewhere between a programming and a statistics question, so it might be better suited for Cross Validated. However, the question is simple enough to address here.
Your question can be split into the following sub-questions:
Am I fitting a model on the full (whole) dataset in ptmod2?
How do I estimate multiple models across grouped datasets?
What is the correct way to analyse the coefficients of such a situation?
Am I fitting a model on the full (whole) dataset in ptmod2?
The long and short of it is "yes". In R and statistics, adding a "group" variable to your model is not equivalent to splitting your dataset into multiple groups. Instead it adds indicator variables (0 or 1) for the specific groups, with one group serving as the reference level. So in your case you have 4 groups, 1 through 4, and you are adding indicators for whether someone is in group 1, group 2 or group 3, with group 4 as the reference level. This measures how much the intercept differs between the groups. E.g. these variables have the interpretation:
If the models share a common slope for avg.pulse, is there a significant difference in avg.time explained by being in a specific group?
The reason why you see only 3 groups and not 4 is that the fourth group is captured by setting all the other group indicators to 0. E.g. if you are not in group 1, 2 or 3, you are part of group 4. So the "effect" of being in group 4 is the effect of not being in group 1, 2 or 3 (in this case).
A method for studying this, which many of my students have found helpful, is to look at a small version of the model.matrix, for example:
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
where you can see directly that there is a column for the (Intercept), a column for hp, and 2 columns cyl6 and cyl8 (no column for cyl4, which is the reference). Matching the rows of cyl6 and cyl8 to the values in mtcars illustrates that a 1 in the cyl6 column indicates that cyl == 6, as shown below.
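For instance (a small illustrative addition of mine), printing the first few rows of the model matrix next to the original cyl values makes the pattern explicit:
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
mm <- model.matrix(mpg ~ hp + cyl, data = mtcars)
# rows with cyl == 6 have a 1 in the cyl6 column, rows with cyl == 8 a 1 in cyl8,
# and rows with cyl == 4 (the reference) have a 0 in both
head(data.frame(mm, cyl = mtcars$cyl, check.names = FALSE), 8)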
How do I estimate multiple models across grouped datasets?
There are multiple methods for doing this, depending on the question being asked. In your case you seem interested in the question "Is there a significant difference in the effect of avg.pulse between the groups?". That is, you want to estimate the avg.pulse coefficient for each group. One option is to do as you did later and estimate a separate model for each group
groups <- split(pulsetime, pulsetime$Group)
models <- lapply(groups, function(df)lm(avg.time ~ avg.pulse, data = df))
lapply(models, summary)
which gives the estimates. The problem then is how to compare them. There are methods for doing so, by comparing the covariances of the parameters across the models (so-called multivariate statistical analysis, or multiple regression models), but this is overly complicated here, as the models share a common outcome.
A much simpler method is to incorporate the group-specific estimates by adding an "extra" effect for each group using indicator variables. This works similarly to adding the group variable, but instead of adding it on its own (indicating the effect of being in group X), we interact it with the variable in question, using one of
# Let each group have their own `avg.pulse` variable
ptmod2_1 <- lm(formula = avg.time ~ avg.pulse : Group, data = pulsetime)
# Let each group have their own `avg.pulse` variable and account for the effect of `Group`
ptmod2_2 <- lm(formula = avg.time ~ avg.pulse * Group, data = pulsetime)
In the former you'll see avg.pulse:GroupX for all 4 groups, meaning these are the "effect of avg.pulse in group X", while in the latter you'll once again have a reference level. Note that a stark difference between the 2 methods is that in the former all groups share the same intercept, while in the latter each group can have its own intercept.
In general the latter is the preferred method, unless you have a very good reason not to expect each group to have a different average. It is very similar to the rule of thumb: "Don't test your intercept, unless you have a very good reason, and even then you probably shouldn't." Basically, it makes a lot of logical sense to follow these rules (though it can take a few days of reflection to realize why).
What is the correct way to analyse the coefficients of such a situation?
If you've stuck with one of the 2 latter methods, the analysis is similar to a normal regression analysis. Coefficients can be tested using t-tests, ANOVA and so on (using summary/drop1 and anova), and if you have a reason you can test group merging using standard tests as well (although if the group terms are insignificant there is rarely a reason to merge them either way). The whole trick becomes how to interpret the coefficients.
For method 1 it is glaringly obvious: "Group 1 has an effect of avg.pulse of so much", and so on. For method 2 it is slightly more subtle. The effect of avg.pulse in group 1 is avg.pulse + avg.pulse:GroupOne. Note that avg.pulse does not disappear when you change group: it is the reference level, and every other coefficient is the additional effect on avg.pulse of going from group X to group Y. Visually, your slope in the graph becomes steeper (flatter) if the coefficient is positive (negative).
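To make that arithmetic concrete (a minimal sketch I am adding, using the mtcars illustration below with hp as the numeric variable and cyl as the grouping factor; the coefficient names assume cyl == 4 is the reference level):
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
fit_int <- lm(mpg ~ hp * cyl, data = mtcars)
b <- coef(fit_int)
b["hp"]                  # slope of hp in the reference group (cyl == 4)
b["hp"] + b["hp:cyl6"]   # slope of hp in the cyl == 6 group
b["hp"] + b["hp:cyl8"]   # slope of hp in the cyl == 8 group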
I've given a visualization below using the mtcars dataset, with mpg as the outcome, hp as the numeric variable and cyl (as a factor) as the grouping variable. Confidence intervals are removed as they are not important for the illustration. The important part is to note how different the 2 models are (the slope for cyl == 4 is positive in one, negative in the other!). This further supports the idea that method 2 is often "more correct" than the former.
Code for reproducibility
Below is the code I've used for my illustrations and examples
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)
#split-fit across groups and run summary
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(df)lm(mpg ~ hp, data = df))
lapply(models, summary)
#Fit using interaction effects
fit_11 <- lm(mpg ~ hp:cyl , mtcars)
fit_12 <- lm(mpg ~ hp*cyl , mtcars)
summary(fit_11)
summary(fit_12)
#Illustrate interaction effects
library(sjPlot)
library(sjmisc)
library(ggplot2)
library(patchwork)
theme_set(theme_sjplot())
p1 <- plot_model(fit_11, type = "pred", terms = c("hp","cyl"), ci.lvl = 0) + ggtitle("Same intercept different slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p2 <- plot_model(fit_12, type = "pred", terms = c("hp", "cyl"), ci.lvl = 0) + ggtitle("Different intercept and slope") +
geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p1 / p2
I am running regression models with the function cv.glmnet(). The argument standardize = TRUE standardises all x variables (predictors) prior to fitting the model. However, the coefficients are always returned on the original scale for the output / result.
Is there a way of receiving standardized coefficients (beta weights) for the output, so that coefficients are comparable?
When you standardize or scale, you compute (x - mean(x)) / sd(x). When a regression is fitted on this, the centering part (- mean(x)) goes into the intercept, so only the division by the standard deviation affects your coefficients.
To go from the unscaled coefficients to the scaled ones, you can multiply by the standard deviation of the corresponding x variable.
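In symbols (a one-line sketch of mine for a single predictor, leaving the response unscaled, as in the check below): if the unscaled fit is $y = a + b x$ and $x^{*} = (x - \bar{x}) / \mathrm{sd}(x)$, then

$$y = \bigl(a + b\,\bar{x}\bigr) + b\,\mathrm{sd}(x)\, x^{*},$$

so the slope on the standardized scale is the unscaled slope multiplied by $\mathrm{sd}(x)$, and the centering only shifts the intercept.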
We can check this, first the regression on scaled x variables:
scaled_mt = mtcars
scaled_mt[,-1] = scale(scaled_mt[,-1])
fit_scaled = lm(mpg ~ .,data=scaled_mt)
The regression on original:
fit = lm(mpg ~ .,data=mtcars)
The glmnet fit, where I set a very low lambda so that all terms are retained:
library(glmnet)
fit_lasso = cv.glmnet(y = as.matrix(mtcars[, 1]), x = as.matrix(mtcars[, -1]), lambda = c(0.0001, 0.00001))
Standard deviation for all x variables:
allSD = apply(mtcars[, -1], 2, sd)
To show the transformation is ok:
cbind(scaled=coefficients(fit_scaled)[-1],
from_lm = coefficients(fit)[-1]*allSD,
from_glmnet = coefficients(fit_lasso)[-1]*allSD)
scaled from_lm from_glmnet
cyl -0.1990240 -0.1990240 -0.1762826
disp 1.6527522 1.6527522 1.6167872
hp -1.4728757 -1.4728757 -1.4677513
drat 0.4208515 0.4208515 0.4268243
wt -3.6352668 -3.6352668 -3.6071975
qsec 1.4671532 1.4671532 1.4601126
vs 0.1601576 0.1601576 0.1615794
am 1.2575703 1.2575703 1.2563485
gear 0.4835664 0.4835664 0.4922507
carb -0.3221020 -0.3221020 -0.3412025
But note, this does not necessarily make them comparable, because they are scaled by their standard deviations. The more important purpose of scaling is to center the variables, so you can interpret positive and negative relationships more easily.
This is the example that I am working on:
data2 = data.frame( X = c(0,2,4,6,8,10),
Y = c(300,220,210,90,80,10))
attach(data2)
model <- glm(log(Y)~X)
model
Call: glm(formula = log(Y) ~ X)
Coefficients:
(Intercept) X
6.0968 -0.2984
Degrees of Freedom: 5 Total (i.e. Null); 4 Residual
Null Deviance: 7.742
Residual Deviance: 1.509 AIC: 14.74
My question is: is there an option in the glm function that allows me to fix the intercept coefficient at a value that I choose, and then make predictions from the model?
For example: I want my curve to start at the largest "Y" value, i.e. I want to set the intercept to log(300).
You are using glm(...) incorrectly, which IMO is a much bigger problem than offsets.
The main underlying assumption in least squares regression is that the error in the response is normally distributed with constant variance. If the error in Y is normally distributed, then the error in log(Y) most certainly is not. So, while you can "run the numbers" on a fit of log(Y) ~ X, the results will not be meaningful. The theory of generalized linear modelling was developed to deal with this problem. So using glm, rather than fitting log(Y) ~ X you should fit Y ~ X with family = poisson. The former fits
log(Y) = b0 + b1x
while the latter fits
Y = exp(b0 + b1x)
In the latter case, if the error in Y is normally distributed, and if the model is valid, then the residuals will be normally distributed, as required. Note that these two approaches give very different results for b0 and b1.
fit.incorrect <- glm(log(Y)~X,data=data2)
fit.correct <- glm(Y~X,data=data2,family=poisson)
coef(summary(fit.incorrect))
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.0968294 0.44450740 13.71592 0.0001636875
# X -0.2984013 0.07340798 -4.06497 0.0152860490
coef(summary(fit.correct))
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 5.8170223 0.04577816 127.06982 0.000000e+00
# X -0.2063744 0.01122240 -18.38951 1.594013e-75
In particular, the coefficient of X is almost 30% smaller when using the correct approach.
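As a quick interpretive check (my addition, using the coefficients printed above), exponentiating the slopes shows how different the implied multiplicative change in Y per unit of X is:
exp(-0.2984013)   # log-Y fit: Y changes by a factor of roughly 0.74 per unit of X
exp(-0.2063744)   # Poisson (log link) fit: a factor of roughly 0.81 per unit of X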
Notice how the models differ:
plot(Y~X,data2)
curve(exp(coef(fit.incorrect)[1]+x*coef(fit.incorrect)[2]),
add=T,col="red")
curve(predict(fit.correct, type="response",newdata=data.frame(X=x)),
add=T,col="blue")
The result of the correct fit (blue curve) passes through the data with residuals scattered more or less randomly, while the result of the incorrect fit grossly overestimates the data for small X and underestimates the data for larger X. I wonder if this is why you want to "fix" the intercept. Looking at the other answer, you can see that when you do fix Y0 = 300, the fit underestimates throughout.
In contrast, let's see what happens when we fix Y0 using glm properly.
data2$b0 <- log(300) # add the offset as a separate column
# b0 not fixed
fit <- glm(Y~X,data2,family=poisson)
plot(Y~X,data2)
curve(predict(fit,type="response",newdata=data.frame(X=x)),
add=TRUE,col="blue")
# b0 fixed so that Y0 = 300
fit.fixed <-glm(Y~X-1+offset(b0), data2,family=poisson)
curve(predict(fit.fixed,type="response",newdata=data.frame(X=x,b0=log(300))),
add=TRUE,col="green")
Here, the blue curve is the unconstrained fit (done properly), and the green curve is the fit constraining Y0 = 300. You can see that they do not differ very much, because the correct (unconstrained) fit is already quite good.
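If you want a formal check of whether fixing the intercept is consistent with the data (my addition, reusing fit and fit.fixed from above; fit.fixed is nested in fit since it only constrains the intercept to log(300)), you can compare the two deviances:
# a small p-value would indicate that constraining the intercept noticeably worsens the fit
anova(fit.fixed, fit, test = "Chisq")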
data2 <- data.frame(X = c(0,2,4,6,8,10),
                    Y = c(300,220,210,90,80,10))
m1 <- lm(log(Y)~X-1+offset(rep(log(300),nrow(data2))),data2)
There is a predict() function, but here it's probably easier to just predict by hand.
par(las=1,bty="l")
plot(Y~X,data=data2)
curve(300*exp(coef(m1)*x),add=TRUE)
For what it's worth, if you want to compare log-Normal and Poisson models, you can do it via
library("ggplot2")
theme_set(theme_bw())
ggplot(data2, aes(X, Y)) + geom_point() +
    geom_smooth(method = "glm", method.args = list(family = quasipoisson)) +
    geom_smooth(method = "glm",
                method.args = list(family = quasi(link = "log", variance = "mu^2")),
                colour = "red", fill = "red")