Print (display) reference category in R's lm summary? - r

How can you print the reference category used when a categorical/nominal variable is entered into a linear model? Here's an example:
summary(lm(data = iris, Sepal.Length ~ Species))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.76 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.03 8.8e-16 ***
Speciesvirginica 1.5820 0.1030 15.37 < 2e-16 ***
Here's what I'd like:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.76 < 2e-16 ***
**Reference: Speciessetosa**
Speciesversicolor 0.9300 0.1030 9.03 8.8e-16 ***
Speciesvirginica 1.5820 0.1030 15.37 < 2e-16 ***
If there is a way to make this work generally (when there are multiple categorical predictors, then each reference group is easily identifiable), that would be most excellent. And if there is a way to make the formatting particularly clear, that would be doubly excellent (I'm not beholden to the example formatting above).

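You can drop the intercept (set it to zero). Every level of the factor then gets its own coefficient, which is the group mean rather than a difference from the reference level. Note that this only expands the first factor in the formula; any further factors are still coded against a reference level.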
summary(lm(Sepal.Length ~ Species + 0, data = iris))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#Speciessetosa 5.0060 0.0728 68.76 <2e-16 ***
#Speciesversicolor 5.9360 0.0728 81.54 <2e-16 ***
#Speciesvirginica 6.5880 0.0728 90.49 <2e-16 ***
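For the general case with several categorical predictors, one option is to read the reference levels off the fitted model rather than change the parameterization. A minimal sketch, assuming the default treatment contrasts (where the first level of each factor is the reference); the second predictor `size` is a hypothetical factor added only for illustration:

```r
# Hypothetical second factor so that there are two categorical predictors
iris2 <- transform(iris, size = factor(ifelse(Petal.Length > 4, "large", "small")))

fit <- lm(Sepal.Length ~ Species + size, data = iris2)

# lm objects store the levels of each factor predictor in `xlevels`;
# under treatment contrasts the first level is the reference category
ref_levels <- vapply(fit$xlevels, function(lv) lv[1], character(1))

for (v in names(ref_levels)) {
  cat("Reference for ", v, ": ", ref_levels[[v]], "\n", sep = "")
}
# Reference for Species: setosa
# Reference for size: large
```

Printing this next to `summary(fit)` keeps the usual coefficient table intact while making each reference group explicit.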

Related

How do I make my variable levels appear in my regression summary table in R?

My model has two IVs: t and story_type. The levels of t are t1, t2, t3 and for story_type are (for simplicity, we'll call them) A, B, C, D. Previously everything worked as usual--I would see story_typeA, story_typeB, story_typeC in my model summary. For some reason, the summary (as of my last refresh) now shows the following, with numeric markers (story_type1, story_type2, story_type3) rather than the actual level labels:
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.2047159 0.0175622 239.419 < 2e-16 ***
t1 0.0001681 0.0464882 0.004 0.997
t2 -0.2313327 0.0392468 -5.894 3.76e-09 ***
story_type1 -0.0934701 0.0034883 -26.795 < 2e-16 ***
story_type2 -0.1252931 0.0035278 -35.516 < 2e-16 ***
story_type3 0.2304953 0.0031908 72.238 < 2e-16 ***
I tried converting story_type to a factor (it was originally a character vector), but this didn't help. I've also carefully run through all of the preceding code several times to check whether something had been changed accidentally, also to no avail.
Does anyone have any idea why this would be happening, and how I'd be able to see my level names again?
(That is, I'd like a summary that looks as follows:)
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.2047159 0.0175622 239.419 < 2e-16 ***
t1 0.0001681 0.0464882 0.004 0.997
t2 -0.2313327 0.0392468 -5.894 3.76e-09 ***
story_typeA -0.0934701 0.0034883 -26.795 < 2e-16 ***
story_typeB -0.1252931 0.0035278 -35.516 < 2e-16 ***
story_typeC 0.2304953 0.0031908 72.238 < 2e-16 ***
EDIT:
So I spun up a toy dataset and am getting the same problem:
set.seed(16)
scores <- rnorm(n = 20, mean = 0, sd = 1)
type <- rep(LETTERS[1:4], each = 5)
df <- data.frame(scores, type)
model <- lm(scores ~ type, data = df)
summary(model)
----------------------------------------------
> Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1105 0.2335 0.47 0.64
type1 -0.0731 0.4045 -0.18 0.86
type2 0.6081 0.4045 1.50 0.15
type3 -0.1164 0.4045 -0.29 0.78
> str(df)
'data.frame': 20 obs. of 2 variables:
$ scores: num -0.4684 -1.006 0.0636 1.025 0.5731 ...
$ type : chr "A" "A" "A" "A" ...
As a factor it returns the same summary, without the level names appearing in the output.
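One common cause of numbered coefficient names is a non-default contrast coding; this is an assumption here, since the question does not show the session's settings. Contrast matrices such as `contr.sum` carry no column names, so R labels the coefficients by number. A minimal sketch comparing the two codings on a toy dataset like the one above:

```r
set.seed(16)
df <- data.frame(scores = rnorm(20),
                 type   = factor(rep(LETTERS[1:4], each = 5)))

# Sum-to-zero contrasts: columns are unnamed, so coefficients are numbered
fit_sum <- lm(scores ~ type, data = df, contrasts = list(type = "contr.sum"))
names(coef(fit_sum))
# "(Intercept)" "type1" "type2" "type3"

# Treatment contrasts: columns are named after the non-reference levels
fit_trt <- lm(scores ~ type, data = df, contrasts = list(type = "contr.treatment"))
names(coef(fit_trt))
# "(Intercept)" "typeB" "typeC" "typeD"

# Shows which contrasts the current session uses by default
getOption("contrasts")
```

If `getOption("contrasts")` reports something other than `contr.treatment` for unordered factors, a package or an earlier `options()` call may have changed it.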

Predict a value using the "output equation" of a heckit-model (sampleSelection)

I estimate a heckit model using the heckit function from the sampleSelection package.
The model looks as follows:
library(sampleSelection)
Heckman = heckit(
  AgencyTRACE ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) + log(daystomaturity) + EoW + dMon + EoM + VIX_95_Dummy + quarter,
  Avg_Spread_Choi ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) + log(daystomaturity) + VIX_95_Dummy + TresholdHYIG_II,
  data = heckmandata, method = "2step")
The summary generates a probit selection equation and an outcome equation - see below:
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
2019085 observations (1915401 censored and 103684 observed)
26 free parameters (df = 2019060)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038164 0.043275 0.882 0.378
SizeCat2 0.201571 0.003378 59.672 < 2e-16 ***
SizeCat3 0.318331 0.008436 37.733 < 2e-16 ***
log(Amt_Issued) -0.099472 0.001825 -54.496 < 2e-16 ***
log(daysfromissuance) 0.079691 0.001606 49.613 < 2e-16 ***
log(daystomaturity) -0.036434 0.001514 -24.066 < 2e-16 ***
EoW 0.021169 0.003945 5.366 8.04e-08 ***
dMon -0.003409 0.003852 -0.885 0.376
EoM 0.008937 0.007000 1.277 0.202
VIX_95_Dummy1 0.088558 0.006521 13.580 < 2e-16 ***
quarter2019.2 -0.092681 0.005202 -17.817 < 2e-16 ***
quarter2019.3 -0.117021 0.005182 -22.581 < 2e-16 ***
quarter2019.4 -0.059833 0.005253 -11.389 < 2e-16 ***
quarter2020.1 -0.005230 0.004943 -1.058 0.290
quarter2020.2 0.073175 0.005080 14.406 < 2e-16 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.29436 6.26019 7.395 1.41e-13 ***
SizeCat2 -25.63433 0.79836 -32.109 < 2e-16 ***
SizeCat3 -34.25275 1.48030 -23.139 < 2e-16 ***
log(Amt_Issued) -0.38051 0.39506 -0.963 0.33547
log(daysfromissuance) 0.02452 0.34197 0.072 0.94283
log(daystomaturity) 7.92338 0.24498 32.343 < 2e-16 ***
VIX_95_Dummy1 -2.34875 0.89133 -2.635 0.00841 **
TresholdHYIG_II1 10.36993 1.07267 9.667 < 2e-16 ***
Multiple R-Squared:0.0406, Adjusted R-Squared:0.0405
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -23.8204 3.6910 -6.454 1.09e-10 ***
sigma 68.5011 NA NA NA
rho -0.3477 NA NA NA
Now I'd like to estimate a value using the outcome equation. I'd like to predict Spread_Choi_All using the following data:
newdata = data.frame(SizeCat=as.factor(1),
Amt_Issued=50*1000000,
daysfromissuance=5*365,
daystomaturity=5*365,
VIX_95_Dummy=as.factor(0),
TresholdHYIG_II=as.factor(0))
SizeCat is a categorical/factor variable with the value 1, 2 or 3.
I have tried various approaches, e.g.
predict(Heckman, part ="outcome", newdata = newdata)
I aim to predict a value (with the data from newdata) using the outcome equation (incl. the invMillsRatio). Is there a way to predict a value from the outcome equation?
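Independently of what predict() supports, the outcome equation can be evaluated by hand from the printed coefficients. A rough sketch in base R, not the package's own method: it uses the standard Heckman conditional mean (linear predictor plus the invMillsRatio coefficient times the inverse Mills ratio of the selection index). The selection equation also needs EoW, dMon, EoM and quarter, which newdata does not contain; the values used for them below are hypothetical (all zero, reference quarter).

```r
# Coefficients copied from the printed summary above
b_out <- c(intercept = 46.29436, log_amt = -0.38051,
           log_issue = 0.02452,  log_mat = 7.92338)
g_sel <- c(intercept = 0.038164, log_amt = -0.099472,
           log_issue = 0.079691, log_mat = -0.036434)
b_imr <- -23.8204  # coefficient on invMillsRatio

# Values from newdata. SizeCat = 1, VIX_95_Dummy = 0 and TresholdHYIG_II = 0
# are the reference levels in the output, so their dummies contribute nothing.
x <- c(1, log(50 * 1000000), log(5 * 365), log(5 * 365))

# Linear part of the outcome equation
xb <- sum(b_out * x)

# Inverse Mills ratio for a selection index z
inv_mills <- function(z) dnorm(z) / pnorm(z)

# Selection index. ASSUMPTION: EoW = dMon = EoM = 0 and the reference quarter,
# because newdata gives no values for these variables.
zg <- sum(g_sel * x)

# Expected outcome conditional on being observed (selected)
pred_conditional <- xb + b_imr * inv_mills(zg)
c(linear = xb, conditional = pred_conditional)
```

Because the invMillsRatio coefficient is negative, the conditional prediction is lower than the purely linear one; which of the two is appropriate depends on whether the prediction is meant for observed cases only.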

Using predict() and calculating manually in Logistic regression in R does not match. What is the reason?

When I run a logistic regression and compare the result of predict() with a manual calculation using the formula p=1/(1+e^-(b0+b1*x1...)), I cannot get the same answer. What could be the reason?
>test[1,]
loan_status loan_Amount interest_rate period sector sex age grade
10000 0 608 41.72451 12 Online Shop Female 44 D3
sector and period were insignificant, so I removed them from the regression.
glm gives:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1542256 0.7610472 -1.517 0.12936
interest_rate -0.0479765 0.0043415 -11.051 < 2e-16 ***
sexMale -0.8814945 0.0656296 -13.431 < 2e-16 ***
age -0.0139100 0.0035193 -3.953 7.73e-05 ***
gradeB 0.3209587 0.8238955 0.390 0.69686
gradeC1 -0.7113279 0.8728260 -0.815 0.41509
gradeC2 -0.4730014 0.8427544 -0.561 0.57462
gradeC3 0.0007541 0.7887911 0.001 0.99924
gradeD1 0.5637668 0.7597531 0.742 0.45806
gradeD2 1.3207785 0.7355950 1.796 0.07257 .
gradeD3 0.9201400 0.7303779 1.260 0.20774
gradeE1 1.7245351 0.7208260 2.392 0.01674 *
gradeE2 2.1547773 0.7242669 2.975 0.00293 **
gradeE3 3.1163245 0.7142881 4.363 1.28e-05 ***
>predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
>predictions_1st
answer: **0.05478904**
But when I calculate like this:
>prob_1 <- 1/(1+e^-((-0.0479764603)*41.72451)-0.0139099563*44)
>prob_1
answer: 0.09081154
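I also calculated it with the insignificant coefficients included, but the answer is still not the same. What could be the reason?
Your manual calculation leaves out two terms: the (Intercept), -1.1542256, and the coefficient for this observation's grade, gradeD3, 0.9201400. (sex is Female, the reference level, so sexMale contributes nothing.) Including them reproduces the predict() result: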
1/(1+exp(-1*(-1.1542256 -0.0479764603*41.72451 -0.0139099563*44 + 0.9201400)))
#[1] 0.05478904
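The same calculation can be written so that each coefficient is paired with its value, which makes a missing term easier to spot. A small sketch using the coefficients printed above and base R's `plogis()` (the logistic function); the vector names are only labels chosen for illustration:

```r
# Coefficients that apply to test[1, ]: Female (reference), grade D3
b <- c(intercept     = -1.1542256,
       interest_rate = -0.0479764603,
       age           = -0.0139099563,
       gradeD3       =  0.9201400)

# Matching values: 1 for the intercept and for the active dummy variable
x <- c(intercept = 1, interest_rate = 41.72451, age = 44, gradeD3 = 1)

eta <- sum(b * x)  # linear predictor on the log-odds scale
p <- plogis(eta)   # same as 1 / (1 + exp(-eta))
p
# [1] 0.05478904
```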

What happens to the coefficients when we switch labels (0/1) - in practice?

I am trying to see in practice what was explained here about what happens to the coefficients once the labels are switched, but I am not getting what I expected. Here is my attempt:
I am using the natality public-use data from "Practical Data Science with R", where the outcome is a logical variable, atRisk, with levels FALSE and TRUE, that classifies newborn babies as at risk or not.
load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
"URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
This results in the following coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
OK, now let us switch the labels in our atRisk variable:
sdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")
and re-run the above analysis where I am expecting to see a change in the signs of the above reported coefficients, however, I am getting exactly the same coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
What am I doing wrong here? Can you help, please?
It's because you set train <- sdata[sdata$ORIGRANDGROUP<=5,] first and only afterwards changed sdata$atRisk <- factor(sdata$atRisk). Your model is fit on the train dataset, whose levels DID NOT get changed.
Instead, you can negate the response in the formula:
y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2641 0.1358 0.1511 0.1818 0.9732
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.412189 0.289352 15.249 < 2e-16 ***
PWGT -0.003762 0.001487 -2.530 0.011417 *
UPREVIS 0.063289 0.015252 4.150 3.33e-05 ***
CIG_RECTRUE -0.313169 0.187230 -1.673 0.094398 .
GESTREC3< 37 weeks -1.545183 0.140795 -10.975 < 2e-16 ***
DPLURALtriplet or higher -1.394193 0.498866 -2.795 0.005194 **
DPLURALtwin -0.312319 0.241088 -1.295 0.195163
ULD_MECOTRUE -0.818426 0.235798 -3.471 0.000519 ***
ULD_PRECIPTRUE -0.191720 0.357680 -0.536 0.591951
ULD_BREECHTRUE -0.749237 0.178129 -4.206 2.60e-05 ***
URF_DIABTRUE 0.346467 0.287514 1.205 0.228187
URF_CHYPERTRUE -0.560025 0.389678 -1.437 0.150676
URF_PHYPERTRUE -0.161599 0.250003 -0.646 0.518029
URF_ECLAMTRUE -0.498064 0.776948 -0.641 0.521489
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2698.7 on 14211 degrees of freedom
Residual deviance: 2463.0 on 14198 degrees of freedom
AIC: 2491
Number of Fisher Scoring iterations: 7
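A related point worth checking: for a factor response, binomial glm treats the first level as failure and the others as success, and `levels(x) <- ...` only renames the levels without changing which observations belong to the first one. A minimal sketch on the built-in `mtcars` data (not the natality data from the question) comparing renaming with reordering via `relevel()`:

```r
d <- transform(mtcars, y = factor(am == 1))  # levels: "FALSE", "TRUE"
fit <- glm(y ~ wt, data = d, family = binomial)

# Renaming the levels keeps the underlying codes, so the fit is unchanged
d$y_renamed <- d$y
levels(d$y_renamed) <- c("TRUE", "FALSE")
fit_renamed <- glm(y_renamed ~ wt, data = d, family = binomial)

# Reordering makes "TRUE" the first (failure) level, which flips the signs
d$y_flipped <- relevel(d$y, ref = "TRUE")
fit_flipped <- glm(y_flipped ~ wt, data = d, family = binomial)

all.equal(unname(coef(fit)), unname(coef(fit_renamed)))
all.equal(unname(coef(fit)), -unname(coef(fit_flipped)), tolerance = 1e-6)
```

Both comparisons should come back TRUE: renaming leaves the coefficients as they were, while reordering negates them, just as negating the response in the formula does.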

How to rename a complicated formula?

Suppose I have a formula y~x1+x2+I( (x1==x2)*x3 ) and estimate a linear model
summary(lm(y~x1+x2+I( (x1==x2)*x3 ), data=some_data ))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6027 1.6069 2.242 0.0662 .
x1 1.8685 1.9769 0.945 0.3811
x2 2.6041 2.0286 1.284 0.2466
I((x1 == x2) * x3) 0.5666 1.5456 0.367 0.7265
Besides creating a new 'data.frame', is there any method to modify the formula so that the summary table would become
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6027 1.6069 2.242 0.0662 .
x1 1.8685 1.9769 0.945 0.3811
x2 2.6041 2.0286 1.284 0.2466
some_name 0.5666 1.5456 0.367 0.7265
Using within you could hack something up.
Instead of:
summary(lm(Sepal.Length ~ I(Sepal.Width * Petal.Length), data=iris))
#...
#(Intercept) 4.252934 0.069396 61.28 <2e-16 ***
#I(Sepal.Width * Petal.Length) 0.142483 0.005632 25.30 <2e-16 ***
#...
You can use:
summary(lm(Sepal.Length ~ newvar, data=
within(iris, newvar <- Sepal.Width * Petal.Length)))
#...
#(Intercept) 4.252934 0.069396 61.28 <2e-16 ***
#newvar 0.142483 0.005632 25.30 <2e-16 ***
#...
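If the formula itself has to stay as it is, another option is to rename the row in the coefficient table after fitting. A minimal sketch, assuming only the printed table matters; `some_name` is the placeholder label from the question:

```r
fit <- lm(Sepal.Length ~ I(Sepal.Width * Petal.Length), data = iris)

# coef(summary(fit)) is a plain matrix, so its row names can be edited freely
tab <- coef(summary(fit))
rownames(tab)[2] <- "some_name"

# Print with significance stars, similar to the summary() output
printCoefmat(tab)
```

This changes only the displayed table; the model object and its formula still use the original term name.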
