Preparation of variables for binary logistic regression in R

I want to perform a binary logistic regression. The dependent variable "burdened" (German: "belastet") has two values: agree (1) and disagree (2).
I am fitting the model with glm() and family = "binomial".
When I put my independent variables into the model (both categorical and metric) and then calculate the p-value using pchisq, I get 0.
belastet0 <- glm(belastetB ~ 1, data = MF, family = binomial(), subset = (sex == 2))
summary(belastet0)
belastet1 <- glm(belastetB ~ age + SES_3 + eig_Kinder + Zufriedenh_BZ + LZ + Angst + guteSeiten + finanzielleEinb + persKontakt, data = MF, family = "binomial", subset = (sex == 2))
summary(belastet1)
bel_chi <- belastet0$null.deviance - belastet1$deviance
bel_chidf <- belastet1$df.null - belastet1$df.residual
bel_pchisq <- 1 - pchisq(bel_chi, bel_chidf)
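An equivalent form requests the upper tail directly and can report the log of the p-value when it is too small for double precision. A minimal sketch (using belastet1's own null deviance so that both deviances come from the same complete-case sample):
# Likelihood-ratio test against the intercept-only model
bel_chi   <- belastet1$null.deviance - belastet1$deviance
bel_chidf <- belastet1$df.null - belastet1$df.residual
pchisq(bel_chi, bel_chidf, lower.tail = FALSE)               # upper tail directly
pchisq(bel_chi, bel_chidf, lower.tail = FALSE, log.p = TRUE) # log of the p-value; stays finite even when p underflows to 0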
The output I get:
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0832 -0.5579 0.4269 0.7315 2.1323
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.933019 0.345034 11.399 < 2e-16 ***
age -0.017936 0.005805 -3.090 0.00200 **
SES_3mittel -0.252995 0.081740 -3.095 0.00197 **
SES_3niedrig -0.426660 0.131045 -3.256 0.00113 **
eig_Kinder 0.195782 0.044914 4.359 1.31e-05 ***
Zufriedenh_BZ 0.074256 0.021855 3.398 0.00068 ***
LZ -0.452521 0.026458 -17.103 < 2e-16 ***
Angststimme zu 0.955357 0.073680 12.966 < 2e-16 ***
Angstweder noch 0.554067 0.109405 5.064 4.10e-07 ***
guteSeitenstimme zu -1.312848 0.105667 -12.424 < 2e-16 ***
guteSeitenweder noch -0.451338 0.144038 -3.133 0.00173 **
finanzielleEinbstimme zu 0.759940 0.092765 8.192 2.57e-16 ***
finanzielleEinbweder noch 0.814164 0.136931 5.946 2.75e-09 ***
persKontaktstimme zu 1.001333 0.082896 12.079 < 2e-16 ***
persKontaktweder noch 0.538896 0.124962 4.312 1.61e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6691.7 on 5928 degrees of freedom
Residual deviance: 5366.5 on 5914 degrees of freedom
(14325 observations deleted due to missingness)
AIC: 5396.5
And for:
bel_pchisq <- 1 - pchisq(bel_chi, bel_chidf)
I receive: bel_pchisq = 0
I think the problem is that I have not cleaned my data properly.
I have already revised my metric variables with mean imputation: data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
and my categorical ones by converting them to factors: MF$belastetB <- as.factor(MF$belastetB), unfortunately with only partial success. In addition, by applying the imputation in place, I overwrote all my variables, but I still need them in their original form.
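A sketch of how the imputation could run on a copy instead, so the original column survives (assuming the column is MF$age, as in the model; the _imp name is made up for illustration):
MF$age_imp <- MF$age                                          # keep the original intact
MF$age_imp[is.na(MF$age_imp)] <- mean(MF$age, na.rm = TRUE)   # impute only the copy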
Unfortunately, I am not at all sure how to prepare my variables for binary logistic regression so that I get a p-value that is not 0, because a 0 suggests that I have an error in my formula or that my variables are not prepared correctly.
My categorical independent variables are: SES (high, medium, low), Angst (agree, neither, disagree), guteSeiten (agree, neither, disagree), finanzielleEinb (agree, neither, disagree), persKontakt (agree, neither, disagree)
My metric independent variables are: age, eig_Kinder, Zufriedenh_BZ (scale: 0-10), LZ (scale: 0-10)
For example, the output of LZ (metric) looks like this:
summary(MF$LZ)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 6.000 7.000 6.794 8.000 10.000 707
table(MF$LZ)
0 1 2 3 4 5 6 7 8 9 10
231 261 728 1551 1775 4024 4166 7937 9792 4085 1710
And for Angst (categorical):
table(MF$Angst)
stimme nicht zu stimme zu weder noch
16918 14607 5255
summary(MF$Angst)
Length Class Mode
36967 character character
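Since summary() shows that Angst is still stored as character, converting it to a factor with an explicit reference level would be one preparation step. A sketch (level labels taken from the table(MF$Angst) output above):
MF$Angst <- factor(MF$Angst,
                   levels = c("stimme nicht zu", "weder noch", "stimme zu"))
# glm() treats the first level as the reference category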
What formulas can I apply, or how do I need to change or adjust my variables, so that the p-value I get is something other than 0?

Related

Linear model regressing on every level of a numeric field

I am currently trying to run a linear model on a large data set, but am running into issues with some specific variables.
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage, data = train)
summary(pv_model)
Here is the code for my regression. SalePrice, MSSubClass, GarageArea, and LotFrontage are all numeric fields, while LotConfig is a factor variable.
Here is the output of my pv_model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98154.64 17235.51 5.695 1.75e-08 ***
MSSubClass 50.05 58.38 0.857 0.391539
LotConfigCulDSac 69949.50 12740.62 5.490 5.42e-08 ***
LotConfigFR2 19998.34 14592.31 1.370 0.170932
LotConfigFR3 21390.99 34126.44 0.627 0.530962
LotConfigInside 21666.04 5597.33 3.871 0.000118 ***
GarageArea 175.67 10.96 16.035 < 2e-16 ***
LotFrontage101 42571.20 42664.89 0.998 0.318682
LotFrontage102 26051.49 35876.54 0.726 0.467968
LotFrontage103 36528.81 35967.56 1.016 0.310131
LotFrontage104 218129.42 58129.56 3.752 0.000188 ***
LotFrontage105 61737.12 27618.21 2.235 0.025673 *
LotFrontage106 40806.22 58159.42 0.702 0.483120
LotFrontage107 36744.69 29494.94 1.246 0.213211
LotFrontage108 71537.30 42565.91 1.681 0.093234 .
LotFrontage109 -29193.02 42528.98 -0.686 0.492647
LotFrontage110 73589.28 27706.92 2.656 0.008068 **
As you can see, the first variables behave correctly: both the factor and the numeric fields respond appropriately. That is, until it gets to LotFrontage. For whatever reason, the model estimates a separate coefficient for every single level of LotFrontage.
For reference, LotFrontage describes the square footage of the subject's front yard. I have properly cleaned the data and replaced NA values. I really am at a loss for why this particular column is acting so unusually.
Any help is greatly appreciated.
If I download the data from the Kaggle link or use a GitHub link and do:
train = read.csv("train.csv")
class(train$LotFrontage)
[1] "integer"
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage,
data = train)
summary(pv_model)
Call:
lm(formula = SalePrice ~ MSSubClass + LotConfig + GarageArea +
LotFrontage, data = train)
Residuals:
Min 1Q Median 3Q Max
-380310 -33812 -4418 24345 487970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11915.866 9455.677 1.260 0.20785
MSSubClass 105.699 45.345 2.331 0.01992 *
LotConfigCulDSac 81789.113 10547.120 7.755 1.89e-14 ***
LotConfigFR2 17736.355 11787.227 1.505 0.13266
LotConfigFR3 17649.409 31418.281 0.562 0.57439
LotConfigInside 13073.201 5002.092 2.614 0.00907 **
GarageArea 208.708 8.725 23.920 < 2e-16 ***
LotFrontage 722.380 88.294 8.182 7.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since LotFrontage then comes in as an integer (and the model above estimates a single coefficient for it), I suggest that you read the CSV in again as above.
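If re-reading the file is not an option, a sketch for coercing an accidentally stringified column back to numeric (column names as in the question):
# If LotFrontage was turned into character or factor (e.g. by replacing
# NAs with a text value), coerce it back; non-numeric strings become NA
train$LotFrontage <- as.numeric(as.character(train$LotFrontage))
sum(is.na(train$LotFrontage))   # check how many values failed to convert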

Why am I getting good accuracy but low ROC AUC for multiple models?

My dataset size is 42542 x 14 and I am trying to build different models, like logistic regression, KNN, RF, and decision trees, and compare their accuracies.
I get high accuracy but a low ROC AUC for every model.
The data has about 85% of samples with target variable = 1 and 15% with target variable = 0. I tried resampling to handle this imbalance, but it still gives the same results.
The coefficients for glm are as follows:
glm(formula = loan_status ~ ., family = "binomial", data = lc_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7617 0.3131 0.4664 0.6129 1.6734
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.264e+00 8.338e-01 -9.911 < 2e-16 ***
annual_inc 5.518e-01 3.748e-02 14.721 < 2e-16 ***
home_own 4.938e-02 3.740e-02 1.320 0.186780
inq_last_6mths1 -2.094e-01 4.241e-02 -4.938 7.88e-07 ***
inq_last_6mths2-5 -3.805e-01 4.187e-02 -9.087 < 2e-16 ***
inq_last_6mths6-10 -9.993e-01 1.065e-01 -9.380 < 2e-16 ***
inq_last_6mths11-15 -1.448e+00 3.510e-01 -4.126 3.68e-05 ***
inq_last_6mths16-20 -2.323e+00 7.946e-01 -2.924 0.003457 **
inq_last_6mths21-25 -1.399e+01 1.970e+02 -0.071 0.943394
inq_last_6mths26-30 1.039e+01 1.384e+02 0.075 0.940161
inq_last_6mths31-35 -1.973e+00 1.230e+00 -1.604 0.108767
loan_amnt -1.838e-05 3.242e-06 -5.669 1.43e-08 ***
purposecredit_card 3.286e-02 1.130e-01 0.291 0.771169
purposedebt_consolidation -1.406e-01 1.032e-01 -1.362 0.173108
purposeeducational -3.591e-01 1.819e-01 -1.974 0.048350 *
purposehome_improvement -2.106e-01 1.189e-01 -1.771 0.076577 .
purposehouse -3.327e-01 1.917e-01 -1.735 0.082718 .
purposemajor_purchase -7.310e-03 1.288e-01 -0.057 0.954732
purposemedical -4.955e-01 1.530e-01 -3.238 0.001203 **
purposemoving -4.352e-01 1.636e-01 -2.661 0.007800 **
purposeother -3.858e-01 1.105e-01 -3.493 0.000478 ***
purposerenewable_energy -8.150e-01 3.036e-01 -2.685 0.007263 **
purposesmall_business -9.715e-01 1.186e-01 -8.191 2.60e-16 ***
purposevacation -4.169e-01 2.012e-01 -2.072 0.038294 *
purposewedding 3.909e-02 1.557e-01 0.251 0.801751
open_acc -1.408e-04 4.147e-03 -0.034 0.972923
gradeB -4.377e-01 6.991e-02 -6.261 3.83e-10 ***
gradeC -5.858e-01 8.340e-02 -7.024 2.15e-12 ***
gradeD -7.636e-01 9.558e-02 -7.990 1.35e-15 ***
gradeE -7.832e-01 1.115e-01 -7.026 2.13e-12 ***
gradeF -9.730e-01 1.325e-01 -7.341 2.11e-13 ***
gradeG -1.031e+00 1.632e-01 -6.318 2.65e-10 ***
verification_statusSource Verified 6.340e-02 4.435e-02 1.429 0.152898
verification_statusVerified 6.864e-02 4.400e-02 1.560 0.118739
dti -4.683e-03 2.791e-03 -1.678 0.093373 .
fico_range_low 6.705e-03 9.292e-04 7.216 5.34e-13 ***
term 5.773e-01 4.499e-02 12.833 < 2e-16 ***
emp_length2-4 years 6.341e-02 4.911e-02 1.291 0.196664
emp_length5-9 years -3.136e-02 5.135e-02 -0.611 0.541355
emp_length10+ years -2.538e-01 5.185e-02 -4.895 9.82e-07 ***
delinq_2yrs2+ 5.919e-02 9.701e-02 0.610 0.541754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 25339 on 29779 degrees of freedom
Residual deviance: 23265 on 29739 degrees of freedom
AIC: 23347
Number of Fisher Scoring iterations: 10
The confusion matrix for LR is as below:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 32 40
1 1902 10788
Accuracy : 0.8478
95% CI : (0.8415, 0.854)
No Information Rate : 0.8485
P-Value [Acc > NIR] : 0.5842
Kappa : 0.0213
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.016546
Specificity : 0.996306
Pos Pred Value : 0.444444
Neg Pred Value : 0.850118
Prevalence : 0.151544
Detection Rate : 0.002507
Detection Prevalence : 0.005642
Balanced Accuracy : 0.506426
'Positive' Class : 0
Is there any way I can improve the AUC?
If someone presents a confusion matrix and talks about low ROC AUC, it usually means that they converted the predictions/probabilities into 0s and 1s. The ROC AUC formula does not require that: it works on raw probabilities, which gives much better results. If the aim is to obtain the best AUC value, it is good to set it as the evaluation metric during training, which enables better results than with other metrics.
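For example, with the pROC package (a sketch; glm_fit and lc_test are assumed names for the fitted model and the hold-out data):
library(pROC)
# Raw probabilities, not hard 0/1 labels
probs <- predict(glm_fit, newdata = lc_test, type = "response")
# AUC computed on the probabilities directly
roc_obj <- roc(lc_test$loan_status, probs)
auc(roc_obj)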

Logistic regression from R returning values greater than one

I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n = 7896. All variables are binary with 0 and 1 for false and true, except BMI, which is continuous numeric. For gender, male = 1 and female = 0.
When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I quite often get values greater than one for very standard individuals. I apologize for the large code block, but I thought more information might be helpful.
library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)
logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")
summary(logistic)
hoslem.test(logistic$y, fitted(logistic))
confint(logistic)
min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)
e=2.71828
The output is as follows:
Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 +
HT1993, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0362 -1.0513 -0.7831 1.1844 1.8807
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.346104 0.158043 -14.845 < 2e-16 ***
Gen1 -0.748286 0.048398 -15.461 < 2e-16 ***
Smoke19931 -0.059342 0.064606 -0.919 0.358
BMI1993 0.084056 0.006005 13.997 < 2e-16 ***
HC19931 0.388217 0.047820 8.118 4.72e-16 ***
HT19931 0.341375 0.058423 5.843 5.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10890 on 7895 degrees of freedom
Residual deviance: 10309 on 7890 degrees of freedom
AIC: 10321
Number of Fisher Scoring iterations: 4
Hosmer and Lemeshow goodness of fit (GOF) test
data: logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1 -0.84336906 -0.65364134
Smoke19931 -0.18619647 0.06709748
BMI1993 0.07233866 0.09588198
HC19931 0.29454661 0.48200673
HT19931 0.22690608 0.45595006
[1] 18
[1] 26
[1] 43
A non-smoking female w/ median BMI (26), hypertension, and high cholesterol yields the following:
e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)
[1] 1.7664
I think the issue is related somehow to BMI considering that is the only variable that is numeric. Does anyone know why this regression produces probabilities greater than 1?
By default, family = "binomial" uses the logit link function (see ?family), so e raised to the linear predictor gives an odds, not a probability. The probability you're looking for is 1.7664 / (1 + 1.7664) ≈ 0.64.
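Equivalently, R can apply the inverse link for you. A sketch for the individual described above (factor codings as in the question; predict(type = "response") returns probabilities, and plogis() is the inverse logit):
newpt <- data.frame(Gen = factor(0, levels = c(0, 1)),
                    Smoke1993 = factor(0, levels = c(0, 1)),
                    BMI1993 = 26,
                    HC1993 = factor(1, levels = c(0, 1)),
                    HT1993 = factor(1, levels = c(0, 1)))
predict(logistic, newdata = newpt, type = "response")
# Or apply the inverse logit to the linear predictor by hand:
plogis(26 * 0.084056 + 0.388217 + 0.341375 - 2.346104)   # ~0.64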

What happens to the coefficients when we switch labels (0/1) - in practice?

I am trying to see in practice what was explained here: what happens to the coefficients once the labels are switched. Here is my attempt, but I am not getting what is expected.
I am using the natality public-use data given as an example in "Practical Data Science with R", where the outcome is a logical variable that classifies newborn babies as atRisk, with levels FALSE and TRUE.
load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
"URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
This results to the following coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
OK, now let us switch the labels in our atRisk variable:
sdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")
and re-run the above analysis, where I expect to see a change in the signs of the reported coefficients. However, I get exactly the same coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
What am I doing wrong here? Can you help, please?
It's because you set train <- sdata[sdata$ORIGRANDGROUP<=5,] and then change sdata$atRisk <- factor(sdata$atRisk), but your model uses the train dataset, whose levels did NOT get changed.
Instead, you can do:
y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2641 0.1358 0.1511 0.1818 0.9732
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.412189 0.289352 15.249 < 2e-16 ***
PWGT -0.003762 0.001487 -2.530 0.011417 *
UPREVIS 0.063289 0.015252 4.150 3.33e-05 ***
CIG_RECTRUE -0.313169 0.187230 -1.673 0.094398 .
GESTREC3< 37 weeks -1.545183 0.140795 -10.975 < 2e-16 ***
DPLURALtriplet or higher -1.394193 0.498866 -2.795 0.005194 **
DPLURALtwin -0.312319 0.241088 -1.295 0.195163
ULD_MECOTRUE -0.818426 0.235798 -3.471 0.000519 ***
ULD_PRECIPTRUE -0.191720 0.357680 -0.536 0.591951
ULD_BREECHTRUE -0.749237 0.178129 -4.206 2.60e-05 ***
URF_DIABTRUE 0.346467 0.287514 1.205 0.228187
URF_CHYPERTRUE -0.560025 0.389678 -1.437 0.150676
URF_PHYPERTRUE -0.161599 0.250003 -0.646 0.518029
URF_ECLAMTRUE -0.498064 0.776948 -0.641 0.521489
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2698.7 on 14211 degrees of freedom
Residual deviance: 2463.0 on 14198 degrees of freedom
AIC: 2491
Number of Fisher Scoring iterations: 7
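Alternatively, flip the outcome's levels on sdata before creating train, so the model actually sees the switched labels. A minimal sketch with the question's objects:
sdata$atRisk <- factor(sdata$atRisk, levels = c(TRUE, FALSE))  # old TRUE becomes the first (reference) level
train <- sdata[sdata$ORIGRANDGROUP <= 5, ]                     # split AFTER recoding, so train inherits it
model2 <- glm(fmla, data = train, family = binomial(link = "logit"))
# model2's coefficients have flipped signs relative to the original fit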

Logistic Regression: Strange Variable arises

I am using R to perform logistic regression on my data set. My data set has more than 50 variables.
I am running the following code:
glm(X...ResponseFlag ~ NetWorth + LOR + IntGrandChld + OccupInput, family = binomial, data = data)
When I see summary() I got the following output:
> summary(ResponseModel)
Call:
glm(formula = X...ResponseFlag ~ NetWorth + LOR + IntGrandChld +
OccupInput, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2785 -0.9576 -0.8925 1.3736 1.9721
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.971166 0.164439 -5.906 3.51e-09 ***
NetWorth 0.082168 0.019849 4.140 3.48e-05 ***
LOR -0.019716 0.006494 -3.036 0.0024 **
IntGrandChld -0.021544 0.085274 -0.253 0.8005
OccupInput2 0.005796 0.138390 0.042 0.9666
OccupInput3 0.471020 0.289642 1.626 0.1039
OccupInput4 -0.031880 0.120636 -0.264 0.7916
OccupInput5 -0.148898 0.129922 -1.146 0.2518
OccupInput6 -0.481183 0.416277 -1.156 0.2477
OccupInput7 -0.057485 0.218309 -0.263 0.7923
OccupInput8 0.505676 0.123955 4.080 4.51e-05 ***
OccupInput9 -0.382375 0.821362 -0.466 0.6415
OccupInputA -12.903334 178.064831 -0.072 0.9422
OccupInputB 0.581272 1.003193 0.579 0.5623
OccupInputC -0.034188 0.294507 -0.116 0.9076
OccupInputD 0.224634 0.385959 0.582 0.5606
OccupInputE -1.292358 1.072864 -1.205 0.2284
OccupInputF 14.132144 308.212341 0.046 0.9634
OccupInputH 0.622677 1.006982 0.618 0.5363
OccupInputU 0.087526 0.095740 0.914 0.3606
OccupInputV -1.010939 0.637746 -1.585 0.1129
OccupInputW 0.262031 0.256238 1.023 0.3065
OccupInputX 0.332209 0.428806 0.775 0.4385
OccupInputY 0.059771 0.157135 0.380 0.7037
OccupInputZ 0.638520 0.711979 0.897 0.3698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5885.1 on 4467 degrees of freedom
Residual deviance: 5809.6 on 4443 degrees of freedom
AIC: 5859.6
Number of Fisher Scoring iterations: 12
From the output, you can see that some new variables like OccupInput2 have appeared. OccupInput actually had the values 1, 2, 3, ... A, B, C, D, but nothing similar happened for NetWorth or LOR.
I am new to R and have no explanation for why there are new variables.
Can anybody give me an explanation? Thank you in advance.
I would assume that OccupInput in your model is a factor variable. R introduces so-called dummy variables when you include factor regressors in a linear model.
What you see as OccupInput2 and so forth in the table are the coefficients associated with the individual factor levels (the reference level OccupInput1 is covered by the intercept term).
You can verify the type of OccupInput in the output of sapply(data, class), which yields the data types of the columns in your input data frame.
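You can also see the dummy coding directly with model.matrix(); a toy sketch:
occ <- factor(c("1", "2", "A", "B"))
model.matrix(~ occ)
#   (Intercept) occ2 occA occB
# 1           1    0    0    0    <- level "1" is the reference
# 2           1    1    0    0
# 3           1    0    1    0
# 4           1    0    0    1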
