I have some convergence concerns for one of my models; the data are in the attached file: https://mon-partage.fr/f/06LTiBGt/
To explain the data briefly: the question is whether the formation of the pre-nymphs is affected by the different treatments (modalities). The Obs column records whether a nymph formed or not. The day variable is very important in the model, and I would like to use at least the hive variable as a random effect.
I tried adding the bobyqa optimizer to the control, but the convergence warning persists. I also followed the advice in ?convergence, and all optimizers converge.
Can I consider this a false positive?
Thank you in advance,
library("lme4", lib.loc="~/R/win-library/3.3")
> glmpn <- glmer(Obs ~ moda*jour + (1|ruch) + (1|code_test), data = dataall_pn, family = binomial(logit), control = glmerControl(optimizer = "bobyqa"))
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0054486 (tol = 0.001, component 1)
> summary(glmpn)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: Obs ~ moda * jour + (1 | ruch) + (1 | code_test)
Data: dataall_pn
Control: glmerControl(optimizer = "bobyqa")
AIC BIC logLik deviance df.resid
22477.2 22651.2 -11216.6 22433.2 20072
Scaled residuals:
Min 1Q Median 3Q Max
-8.4511 -0.9370 0.4435 0.6890 1.5559
Random effects:
Groups Name Variance Std.Dev.
ruch (Intercept) 0.2811 0.5302
code_test (Intercept) 0.2475 0.4975
Number of obs: 20094, groups: ruch, 7; code_test, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.47806 0.46287 -7.514 5.73e-14 ***
modaA 0.46866 0.50489 0.928 0.353281
modaL -2.77363 0.75599 -3.669 0.000244 ***
modaLA 2.04869 0.52218 3.923 8.73e-05 ***
modaP 2.19098 0.48984 4.473 7.72e-06 ***
modaB 1.75376 0.49874 3.516 0.000437 ***
modaBP 2.27875 0.52120 4.372 1.23e-05 ***
modaPL 2.01771 0.48696 4.143 3.42e-05 ***
modaBL 1.06337 0.48795 2.179 0.029312 *
modaBLP 1.93218 0.51939 3.720 0.000199 ***
jour 0.41973 0.02981 14.079 < 2e-16 ***
modaA:jour -0.06559 0.04188 -1.566 0.117369
modaL:jour 0.19876 0.06491 3.062 0.002198 **
modaLA:jour -0.22419 0.04267 -5.254 1.49e-07 ***
modaP:jour -0.25363 0.04012 -6.322 2.58e-10 ***
modaB:jour -0.19555 0.04097 -4.773 1.82e-06 ***
modaBP:jour -0.23478 0.04262 -5.509 3.61e-08 ***
modaPL:jour -0.24454 0.03988 -6.131 8.71e-10 ***
modaBL:jour -0.16590 0.04003 -4.145 3.40e-05 ***
modaBLP:jour -0.21726 0.04245 -5.119 3.08e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation matrix not shown by default, as p = 20 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
convergence code: 0
Model failed to converge with max|grad| = 0.0054486 (tol = 0.001, component 1)
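One way to probe whether this is a false positive, following the approach described in ?convergence: refit the model with all available optimizers and compare the estimates. A minimal sketch (allFit() is exported by reasonably recent versions of lme4):
## refit with every available optimizer and summarize the results
aa <- allFit(glmpn)
ss <- summary(aa)
ss$msgs   # convergence messages per optimizer
ss$fixef  # fixed effects per optimizer; if all optimizers agree to within
          # a small tolerance, the warning is very likely a false positive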
My dataset is 42542 x 14, and I am trying to build different models (logistic regression, KNN, random forest, decision trees) and compare their accuracies.
I get high accuracy but low ROC AUC for every model.
About 85% of the samples have target variable 1 and 15% have target variable 0. I tried resampling to handle this imbalance, but it still gives the same results.
The coefficients for the glm are as follows:
glm(formula = loan_status ~ ., family = "binomial", data = lc_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7617 0.3131 0.4664 0.6129 1.6734
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.264e+00 8.338e-01 -9.911 < 2e-16 ***
annual_inc 5.518e-01 3.748e-02 14.721 < 2e-16 ***
home_own 4.938e-02 3.740e-02 1.320 0.186780
inq_last_6mths1 -2.094e-01 4.241e-02 -4.938 7.88e-07 ***
inq_last_6mths2-5 -3.805e-01 4.187e-02 -9.087 < 2e-16 ***
inq_last_6mths6-10 -9.993e-01 1.065e-01 -9.380 < 2e-16 ***
inq_last_6mths11-15 -1.448e+00 3.510e-01 -4.126 3.68e-05 ***
inq_last_6mths16-20 -2.323e+00 7.946e-01 -2.924 0.003457 **
inq_last_6mths21-25 -1.399e+01 1.970e+02 -0.071 0.943394
inq_last_6mths26-30 1.039e+01 1.384e+02 0.075 0.940161
inq_last_6mths31-35 -1.973e+00 1.230e+00 -1.604 0.108767
loan_amnt -1.838e-05 3.242e-06 -5.669 1.43e-08 ***
purposecredit_card 3.286e-02 1.130e-01 0.291 0.771169
purposedebt_consolidation -1.406e-01 1.032e-01 -1.362 0.173108
purposeeducational -3.591e-01 1.819e-01 -1.974 0.048350 *
purposehome_improvement -2.106e-01 1.189e-01 -1.771 0.076577 .
purposehouse -3.327e-01 1.917e-01 -1.735 0.082718 .
purposemajor_purchase -7.310e-03 1.288e-01 -0.057 0.954732
purposemedical -4.955e-01 1.530e-01 -3.238 0.001203 **
purposemoving -4.352e-01 1.636e-01 -2.661 0.007800 **
purposeother -3.858e-01 1.105e-01 -3.493 0.000478 ***
purposerenewable_energy -8.150e-01 3.036e-01 -2.685 0.007263 **
purposesmall_business -9.715e-01 1.186e-01 -8.191 2.60e-16 ***
purposevacation -4.169e-01 2.012e-01 -2.072 0.038294 *
purposewedding 3.909e-02 1.557e-01 0.251 0.801751
open_acc -1.408e-04 4.147e-03 -0.034 0.972923
gradeB -4.377e-01 6.991e-02 -6.261 3.83e-10 ***
gradeC -5.858e-01 8.340e-02 -7.024 2.15e-12 ***
gradeD -7.636e-01 9.558e-02 -7.990 1.35e-15 ***
gradeE -7.832e-01 1.115e-01 -7.026 2.13e-12 ***
gradeF -9.730e-01 1.325e-01 -7.341 2.11e-13 ***
gradeG -1.031e+00 1.632e-01 -6.318 2.65e-10 ***
verification_statusSource Verified 6.340e-02 4.435e-02 1.429 0.152898
verification_statusVerified 6.864e-02 4.400e-02 1.560 0.118739
dti -4.683e-03 2.791e-03 -1.678 0.093373 .
fico_range_low 6.705e-03 9.292e-04 7.216 5.34e-13 ***
term 5.773e-01 4.499e-02 12.833 < 2e-16 ***
emp_length2-4 years 6.341e-02 4.911e-02 1.291 0.196664
emp_length5-9 years -3.136e-02 5.135e-02 -0.611 0.541355
emp_length10+ years -2.538e-01 5.185e-02 -4.895 9.82e-07 ***
delinq_2yrs2+ 5.919e-02 9.701e-02 0.610 0.541754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 25339 on 29779 degrees of freedom
Residual deviance: 23265 on 29739 degrees of freedom
AIC: 23347
Number of Fisher Scoring iterations: 10
The confusion matrix for logistic regression is below:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 32 40
1 1902 10788
Accuracy : 0.8478
95% CI : (0.8415, 0.854)
No Information Rate : 0.8485
P-Value [Acc > NIR] : 0.5842
Kappa : 0.0213
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.016546
Specificity : 0.996306
Pos Pred Value : 0.444444
Neg Pred Value : 0.850118
Prevalence : 0.151544
Detection Rate : 0.002507
Detection Prevalence : 0.005642
Balanced Accuracy : 0.506426
'Positive' Class : 0
Is there any way I can improve the AUC?
If someone presents a confusion matrix and talks about low ROC AUC, it usually means they have converted the predictions/probabilities into 0s and 1s, but the ROC AUC calculation does not require that: it works on the raw probabilities, which gives much better results. If the aim is to obtain the best AUC value, it also helps to set AUC as the evaluation metric during training, which yields better results than optimizing for other metrics.
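As a minimal sketch of computing AUC on raw probabilities (assuming a held-out set lc_test with the same columns as lc_train; those names are placeholders):
## use predicted probabilities, not hard 0/1 labels
library(pROC)
fit <- glm(loan_status ~ ., family = "binomial", data = lc_train)
p_hat <- predict(fit, newdata = lc_test, type = "response")
roc_obj <- roc(lc_test$loan_status, p_hat)
auc(roc_obj)             # AUC computed from the raw probabilities
coords(roc_obj, "best")  # a cutoff balancing sensitivity and specificity
On imbalanced data like this, moving the classification cutoff away from 0.5 (for example to the "best" point above) usually improves sensitivity for the rare class even though the AUC itself is unchanged.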
I use lme4 in R to fit the mixed model
model <- glmer(responsevariable ~ fixedvariable1 + fixedvariable2 + fixedvariable3 + fixedvariable4 + (1|randomvariable1) + (1|randomvariable2) + (1|randomvariable3), data=Dataset, family=binomial)
And I get
Data: Dataset
AIC BIC logLik deviance df.resid
5005.8 5072.2 -2492.9 4985.8 5612
Scaled residuals:
Min 1Q Median 3Q Max
-3.5750 -0.4896 -0.2675 0.5618 11.6250
Random effects:
Groups Name Variance Std.Dev.
randomvariable1 (Intercept) 0.007826 0.08847
randomvariable2 (Intercept) 1.366346 1.16891
randomvariable3 (Intercept) 0.011879 0.10899
Number of obs: 5622, groups: randomvariable1, 49; randomvariable2, 5; randomvariable3, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.98557 0.66851 -17.929 < 2e-16 ***
fixedvariable1a -0.31754 0.09732 -3.263 0.00110 **
fixedvariable1b 0.26805 0.08614 3.112 0.00186 **
fixedvariable2a -0.61098 0.09521 -6.417 1.39e-10 ***
fixedvariable2b -0.50402 0.10526 -4.788 1.68e-06 ***
fixedvariable3 7.57652 0.26308 28.799 < 2e-16 ***
fixedvariable4 -0.30746 0.07852 -3.915 9.03e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
How can I know whether the effect of a random variable is significant?
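One common approach (a sketch, not the only option) is a likelihood-ratio test: refit the model without the random effect in question and compare the two fits. Note that the resulting p-value is conservative, because the null hypothesis (variance = 0) lies on the boundary of the parameter space.
## drop one random effect and compare via a likelihood-ratio test
model_no_r1 <- update(model, . ~ . - (1 | randomvariable1))
anova(model, model_no_r1)  # a small p-value suggests the random effect matters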
I am doing submodel testing. The smaller model is nested in the bigger model; the bigger model has one additional continuous variable. I use the likelihood ratio test, and the result is quite strange: both models have the same statistics, such as residual deviance and df. I also find that the two models have the same estimated coefficients and standard errors. How is that possible?
summary(m2221)
Call:
glm(formula = clm ~ veh_age + veh_body + agecat + veh_value:veh_age +
veh_value:area, family = "binomial", data = Car)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9245 -0.3939 -0.3683 -0.3437 2.7095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.294118 0.382755 -3.381 0.000722 ***
veh_age2 0.051790 0.098463 0.526 0.598897
veh_age3 -0.166801 0.094789 -1.760 0.078457 .
veh_age4 -0.239862 0.096154 -2.495 0.012611 *
veh_bodyCONVT -2.184124 0.707884 -3.085 0.002033 **
veh_bodyCOUPE -0.850675 0.393625 -2.161 0.030685 *
veh_bodyHBACK -1.105087 0.374134 -2.954 0.003140 **
veh_bodyHDTOP -0.973472 0.383404 -2.539 0.011116 *
veh_bodyMCARA -0.649036 0.469407 -1.383 0.166765
veh_bodyMIBUS -1.295135 0.404691 -3.200 0.001373 **
veh_bodyPANVN -0.903032 0.395295 -2.284 0.022345 *
veh_bodyRDSTR -1.108488 0.826541 -1.341 0.179883
veh_bodySEDAN -1.097931 0.373578 -2.939 0.003293 **
veh_bodySTNWG -1.129122 0.373713 -3.021 0.002516 **
veh_bodyTRUCK -1.156099 0.384088 -3.010 0.002613 **
veh_bodyUTE -1.343958 0.377653 -3.559 0.000373 ***
agecat2 -0.198002 0.058382 -3.391 0.000695 ***
agecat3 -0.224492 0.056905 -3.945 7.98e-05 ***
agecat4 -0.253377 0.056774 -4.463 8.09e-06 ***
agecat5 -0.441906 0.063227 -6.989 2.76e-12 ***
agecat6 -0.447231 0.072292 -6.186 6.15e-10 ***
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
veh_age3:veh_value 0.114485 0.036690 3.120 0.001806 **
veh_age4:veh_value 0.189866 0.057573 3.298 0.000974 ***
veh_value:areaB 0.044099 0.021550 2.046 0.040722 *
veh_value:areaC 0.021892 0.019189 1.141 0.253931
veh_value:areaD -0.023616 0.024939 -0.947 0.343658
veh_value:areaE -0.013506 0.026886 -0.502 0.615415
veh_value:areaF 0.057780 0.026602 2.172 0.029850 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33767 on 67855 degrees of freedom
Residual deviance: 33592 on 67826 degrees of freedom
AIC: 33652
Number of Fisher Scoring iterations: 5
summary(m222)
Call:
glm(formula = clm ~ veh_value + veh_age + veh_body + agecat +
veh_value:veh_age + veh_value:area, family = "binomial",
data = Car)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9245 -0.3939 -0.3683 -0.3437 2.7095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.294118 0.382755 -3.381 0.000722 ***
veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2 0.051790 0.098463 0.526 0.598897
veh_age3 -0.166801 0.094789 -1.760 0.078457 .
veh_age4 -0.239862 0.096154 -2.495 0.012611 *
veh_bodyCONVT -2.184124 0.707884 -3.085 0.002033 **
veh_bodyCOUPE -0.850675 0.393625 -2.161 0.030685 *
veh_bodyHBACK -1.105087 0.374134 -2.954 0.003140 **
veh_bodyHDTOP -0.973472 0.383404 -2.539 0.011116 *
veh_bodyMCARA -0.649036 0.469407 -1.383 0.166765
veh_bodyMIBUS -1.295135 0.404691 -3.200 0.001373 **
veh_bodyPANVN -0.903032 0.395295 -2.284 0.022345 *
veh_bodyRDSTR -1.108488 0.826541 -1.341 0.179883
veh_bodySEDAN -1.097931 0.373578 -2.939 0.003293 **
veh_bodySTNWG -1.129122 0.373713 -3.021 0.002516 **
veh_bodyTRUCK -1.156099 0.384088 -3.010 0.002613 **
veh_bodyUTE -1.343958 0.377653 -3.559 0.000373 ***
agecat2 -0.198002 0.058382 -3.391 0.000695 ***
agecat3 -0.224492 0.056905 -3.945 7.98e-05 ***
agecat4 -0.253377 0.056774 -4.463 8.09e-06 ***
agecat5 -0.441906 0.063227 -6.989 2.76e-12 ***
agecat6 -0.447231 0.072292 -6.186 6.15e-10 ***
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
veh_value:veh_age3 0.115122 0.039476 2.916 0.003543 **
veh_value:veh_age4 0.190503 0.058691 3.246 0.001171 **
veh_value:areaB 0.044099 0.021550 2.046 0.040722 *
veh_value:areaC 0.021892 0.019189 1.141 0.253931
veh_value:areaD -0.023616 0.024939 -0.947 0.343658
veh_value:areaE -0.013506 0.026886 -0.502 0.615415
veh_value:areaF 0.057780 0.026602 2.172 0.029850 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33767 on 67855 degrees of freedom
Residual deviance: 33592 on 67826 degrees of freedom
AIC: 33652
anova(m2221, m222, test = "LRT")
Analysis of Deviance Table
Model 1: clm ~ veh_age + veh_body + agecat + veh_value:veh_age +
veh_value:area
Model 2: clm ~ veh_value + veh_age + veh_body + agecat + veh_value:veh_age +
veh_value:area
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 67826 33592
2 67826 33592 0 0
You have specified the same model in two different ways. To show this, I'll first explain what is going on under the hood a bit, then I'll walk through the coefficients of your two models to show they are the same, and I'll end with some higher-level intuition.
Different formulas create different interaction terms
First, the only difference between your two models is that the second includes veh_value as a standalone predictor, whereas the first does not. However, veh_value is interacted with other predictors in both models.
So let's consider a simple reproducible example and see what R does when we do this. I'll use model.matrix to feed in two different formulas and view the continuous and dummy variables that R creates as a result.
colnames(model.matrix(am ~ mpg + factor(cyl):mpg, mtcars))
#[1] "(Intercept)" "mpg" "mpg:factor(cyl)6" "mpg:factor(cyl)8"
colnames(model.matrix(am ~ factor(cyl):mpg, mtcars))
#"(Intercept)" "factor(cyl)4:mpg" "factor(cyl)6:mpg" "factor(cyl)8:mpg"
Notice that in the first call I included the continuous predictor mpg, whereas I did not include it in the second call (a simpler version of what you are doing).
Now note that the second model matrix contains an extra interaction variable (factor(cyl)4:mpg) that the first does not. In other words, because we did not include mpg in the model directly, all levels of cyl get included in the interaction.
Your models are the same
Your models are doing the same thing as the simple example above, and we can show that the coefficients are actually the same once added together.
In your first model, all 4 levels of veh_age appear in the interaction with veh_value because veh_value itself is not included in the model.
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
veh_age3:veh_value 0.114485 0.036690 3.120 0.001806 **
veh_age4:veh_value 0.189866 0.057573 3.298 0.000974 ***
In your second model, only 3 levels of veh_age appear in the interaction with veh_value because veh_value is included in the model.
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
veh_value:veh_age3 0.115122 0.039476 2.916 0.003543 **
veh_value:veh_age4 0.190503 0.058691 3.246 0.001171 **
Now, here is the critical piece for seeing that the models are actually the same. It's easiest to show by walking through all of the levels of veh_age.
First consider veh_age = 1
For both models, the coefficient on veh_value conditioned on veh_age when veh_age = 1 is -0.000637
# For first model
veh_age1:veh_value -0.000637 0.026387 -0.024 0.980740
# For second model
veh_value -0.000637 0.026387 -0.024 0.980740
Now consider veh_age = 2
For both models, the coefficient on veh_value conditioned on veh_age when veh_age = 2 is 0.035386
# For first model
veh_age2:veh_value 0.035386 0.031465 1.125 0.260753
# For second model - note sum of the two below is 0.035386
veh_value -0.000637 0.026387 -0.024 0.980740
veh_value:veh_age2 0.036023 0.034997 1.029 0.303331
Intuition
When you include the interaction veh_value:veh_age, you are essentially saying that you want the coefficient of veh_value, a continuous variable, to be conditioned on veh_age, a categorical variable. Including both the interaction veh_value:veh_age and veh_value as a predictor says the same thing: you want the coefficient of veh_value, conditioned on the value of veh_age.
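A quick numerical check of the equivalence, using the same mtcars example from above (a sketch): since the two parameterizations span the same column space, the fitted values, and hence the deviances, must be identical.
## two parameterizations of the same model
fit1 <- glm(am ~ mpg + factor(cyl):mpg, family = binomial, data = mtcars)
fit2 <- glm(am ~ factor(cyl):mpg, family = binomial, data = mtcars)
all.equal(fitted(fit1), fitted(fit2))  # TRUE: identical fits
c(deviance(fit1), deviance(fit2))      # identical deviances, as in your anova()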
If I use the code:
plot(allEffects(AP18), multiline = TRUE, rescale.axis = FALSE, main = NULL)
for the following model:
AP18 <- glmer(cbind(NPP, NPC) ~ Cond + OTTM + SlopeAp + EssCondEnd + SlopeAp*Cond + SlopeAp*OTTM + (1|Ind) + (1|Groupe), family = binomial, data = SElearn)
I get a set of effect plots (the image is not reproduced here).
Cond has 3 levels (A, B, and C). In the third plot, of the interaction between OTTM and SlopeAp, what is it doing with Cond? I ask because I am trying to recreate the graphs using predict. With a continuous variable that isn't in the plotted term I would set it to 0, but with the categorical variable I don't know what to do. It is not set to the reference level (in this case A), because doing that does not give the same graph.
For reference, the model summary is:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(NPP, NPC) ~ Cond + OTTM + SlopeAp + EssCondEnd + SlopeAp *
Cond + SlopeAp * OTTM + (1 | Ind) + (1 | Groupe)
Data: SElearn
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
AIC BIC logLik deviance df.resid
1023.1 1063.8 -500.5 1001.1 287
Scaled residuals:
Min 1Q Median 3Q Max
-3.4584 -0.9527 0.0458 0.8498 3.5794
Random effects:
Groups Name Variance Std.Dev.
Ind (Intercept) 3.338e-01 5.777e-01
Groupe (Intercept) 1.997e-15 4.468e-08
Number of obs: 298, groups: Ind, 20; Groupe, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.535316 0.618425 -2.483 0.013042 *
CondB 0.937548 0.411961 2.276 0.022857 *
CondC 2.415791 0.402306 6.005 1.92e-09 ***
OTTMCBA 3.610364 1.085016 3.327 0.000876 ***
SlopeAp 15.657892 7.896060 1.983 0.047367 *
EssCondEnd 0.019554 0.006844 2.857 0.004275 **
CondB:SlopeAp -6.188813 4.297260 -1.440 0.149817
CondC:SlopeAp -14.062097 4.117263 -3.415 0.000637 ***
OTTMCBA:SlopeAp -35.422102 11.157396 -3.175 0.001500 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Thanks in advance!
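For what it's worth, my understanding is that effects averages over the levels of factors that do not appear in the displayed term, weighting by their observed frequencies on the link scale, rather than fixing them at the reference level; see ?Effect for the exact rule. A rough sketch of reproducing that with predict, where the weighting rule, the 50-point grid, and holding EssCondEnd at its mean are all assumptions to verify:
## population-level predictions on the link scale over a grid
grid <- expand.grid(
  SlopeAp    = seq(min(SElearn$SlopeAp), max(SElearn$SlopeAp), length.out = 50),
  OTTM       = levels(SElearn$OTTM),
  Cond       = levels(SElearn$Cond),
  EssCondEnd = mean(SElearn$EssCondEnd)
)
grid$eta <- predict(AP18, newdata = grid, type = "link", re.form = NA)
## weight each Cond level by its frequency in the data and average
w <- prop.table(table(SElearn$Cond))
grid$weta <- grid$eta * as.numeric(w[as.character(grid$Cond)])
agg <- aggregate(weta ~ SlopeAp + OTTM, data = grid, FUN = sum)
agg$p <- plogis(agg$weta)  # back-transform to the probability scale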
I am trying to see in practice what was explained here about what happens to the coefficients once the labels are switched, but I am not getting what I expected. Here is my attempt.
I am using the natality public-use data given as an example in "Practical Data Science with R", where the outcome is a logical variable that classifies newborn babies as atRisk, with levels FALSE and TRUE:
load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
"URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
This results in the following coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
OK, now let us switch the labels in our atRisk variable:
sdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")
and re-run the above analysis, where I expect to see a change in the signs of the reported coefficients. However, I get exactly the same coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
What am I doing wrong here? Can you help, please?
It's because you set train <- sdata[sdata$ORIGRANDGROUP<=5,] and only afterwards change sdata$atRisk <- factor(sdata$atRisk); your model is fit on the train dataset, whose levels DID NOT get changed.
Instead, you can do:
y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2641 0.1358 0.1511 0.1818 0.9732
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.412189 0.289352 15.249 < 2e-16 ***
PWGT -0.003762 0.001487 -2.530 0.011417 *
UPREVIS 0.063289 0.015252 4.150 3.33e-05 ***
CIG_RECTRUE -0.313169 0.187230 -1.673 0.094398 .
GESTREC3< 37 weeks -1.545183 0.140795 -10.975 < 2e-16 ***
DPLURALtriplet or higher -1.394193 0.498866 -2.795 0.005194 **
DPLURALtwin -0.312319 0.241088 -1.295 0.195163
ULD_MECOTRUE -0.818426 0.235798 -3.471 0.000519 ***
ULD_PRECIPTRUE -0.191720 0.357680 -0.536 0.591951
ULD_BREECHTRUE -0.749237 0.178129 -4.206 2.60e-05 ***
URF_DIABTRUE 0.346467 0.287514 1.205 0.228187
URF_CHYPERTRUE -0.560025 0.389678 -1.437 0.150676
URF_PHYPERTRUE -0.161599 0.250003 -0.646 0.518029
URF_ECLAMTRUE -0.498064 0.776948 -0.641 0.521489
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2698.7 on 14211 degrees of freedom
Residual deviance: 2463.0 on 14198 degrees of freedom
AIC: 2491
Number of Fisher Scoring iterations: 7
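Alternatively, a sketch of the more direct fix: reorder the factor levels on the data the model actually sees (train), so that glm models the probability of the other level. fmla2 and model2 below are placeholder names.
## reorder levels on train itself; glm treats the first factor level as
## "failure", so all coefficient signs flip relative to the original fit
train$atRisk <- factor(train$atRisk, levels = c(TRUE, FALSE))
fmla2 <- paste("atRisk", paste(x, collapse = "+"), sep = "~")
model2 <- glm(fmla2, data = train, family = binomial(link = "logit"))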