I'm fitting a glm model with 10-fold cross-validation using the caret package. I'd like to remove the non-significant variables from the model, for example TX_RESP_Q108B, TX_RESP_Q108C, TX_RESP_Q065.Q, etc.
Input
tc <- trainControl("cv", 10, savePredictions = T, classProbs = T)
fit1 <- train(response~., data = my_data, method = "glm", family = "binomial", trControl = tc)
Output
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3268 -0.6676 -0.5238 -0.3620 3.0841
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.507239 0.171328 -20.471 < 2e-16 ***
ID_TURNO2 -0.412723 0.024630 -16.757 < 2e-16 ***
ID_TURNO3 -2.089175 0.234093 -8.925 < 2e-16 ***
TX_RESP_Q089B 0.452723 0.047093 9.613 < 2e-16 ***
TX_RESP_Q108B -0.001968 0.330746 -0.006 0.99525
TX_RESP_Q108C 0.220013 0.222716 0.988 0.32322
TX_RESP_Q108D 0.371279 0.178481 2.080 0.03751 *
TX_RESP_Q108E 0.115137 0.164007 0.702 0.48266
TX_RESP_Q108F 0.301288 0.162907 1.849 0.06439 .
TX_RESP_Q079B 0.410358 0.027035 15.179 < 2e-16 ***
TX_RESP_Q005.L 0.558060 0.047220 11.818 < 2e-16 ***
TX_RESP_Q005.Q 0.090621 0.036774 2.464 0.01373 *
TX_RESP_Q005.C -0.144208 0.029758 -4.846 1.26e-06 ***
`TX_RESP_Q005^4` -0.015192 0.023936 -0.635 0.52563
TX_RESP_Q078B 0.400272 0.046226 8.659 < 2e-16 ***
TX_RESP_Q009B 0.035596 0.131552 0.271 0.78671
TX_RESP_Q009C -0.072846 0.150077 -0.485 0.62740
TX_RESP_Q009D 0.270092 0.028457 9.491 < 2e-16 ***
TX_RESP_Q009E 0.184751 0.033084 5.584 2.35e-08 ***
TX_RESP_Q009F 0.118077 0.066456 1.777 0.07561 .
TX_RESP_Q070B 0.483582 0.026233 18.434 < 2e-16 ***
TX_RESP_Q013.L -0.142476 0.069901 -2.038 0.04152 *
TX_RESP_Q013.Q -0.096416 0.056399 -1.710 0.08735 .
TX_RESP_Q013.C 0.075822 0.052626 1.441 0.14965
`TX_RESP_Q013^4` 0.089282 0.049417 1.807 0.07081 .
`TX_RESP_Q013^5` -0.039960 0.041490 -0.963 0.33550
`TX_RESP_Q013^6` 0.084890 0.031894 2.662 0.00778 **
TX_RESP_Q048.L 0.569328 0.063179 9.011 < 2e-16 ***
TX_RESP_Q048.Q -0.016526 0.057944 -0.285 0.77548
TX_RESP_Q048.C 0.112465 0.052459 2.144 0.03204 *
TX_RESP_Q107B 0.167977 0.275342 0.610 0.54182
TX_RESP_Q107C -0.084265 0.198600 -0.424 0.67135
TX_RESP_Q107D 0.017702 0.159386 0.111 0.91156
TX_RESP_Q107E 0.225705 0.146949 1.536 0.12455
TX_RESP_Q107F 0.253538 0.146698 1.728 0.08394 .
TX_RESP_Q062.L -0.027898 0.104988 -0.266 0.79045
TX_RESP_Q062.Q 0.054648 0.076644 0.713 0.47584
TX_RESP_Q062.C -0.021481 0.044358 -0.484 0.62819
TX_RESP_Q045B 0.348216 0.071149 4.894 9.87e-07 ***
TX_RESP_Q045C 0.118404 0.071593 1.654 0.09816 .
TX_RESP_Q045D -0.067446 0.077291 -0.873 0.38287
TX_RESP_Q058B 0.073366 0.076204 0.963 0.33567
TX_RESP_Q058C 0.095275 0.081153 1.174 0.24039
TX_RESP_Q058D 0.167319 0.085421 1.959 0.05014 .
TX_RESP_Q059B -0.206194 0.103281 -1.996 0.04589 *
TX_RESP_Q059C -0.185812 0.105676 -1.758 0.07869 .
TX_RESP_Q059D -0.098488 0.108455 -0.908 0.36383
TX_RESP_Q060B 0.273180 0.060671 4.503 6.71e-06 ***
TX_RESP_Q060C 0.368747 0.063615 5.797 6.77e-09 ***
TX_RESP_Q060D 0.396086 0.067710 5.850 4.92e-09 ***
TX_RESP_Q061B 0.066926 0.087237 0.767 0.44298
TX_RESP_Q061C -0.006212 0.092005 -0.068 0.94617
TX_RESP_Q061D 0.012422 0.096713 0.128 0.89780
TX_RESP_Q063.L 0.024938 0.098261 0.254 0.79965
TX_RESP_Q063.Q 0.070767 0.071592 0.988 0.32291
TX_RESP_Q063.C -0.040853 0.042945 -0.951 0.34146
TX_RESP_Q065.L -0.039285 0.051561 -0.762 0.44612
TX_RESP_Q065.Q 0.019015 0.036056 0.527 0.59792
TX_RESP_Q065.C 0.025068 0.025339 0.989 0.32252
TX_RESP_Q067B -0.075401 0.093803 -0.804 0.42150
TX_RESP_Q067C -0.108572 0.093832 -1.157 0.24724
TX_RESP_Q067D -0.038972 0.094387 -0.413 0.67969
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 52275 on 56144 degrees of freedom
Residual deviance: 48930 on 56083 degrees of freedom
AIC: 49054
Number of Fisher Scoring iterations: 6
I tried (for example)
f2 <- update(f~., TURMA_PROFICIENTE~ID_TURNO + TX_RESP_Q089B + TX_RESP_Q108D)
fit2 <- train(f2, data = dados, method = "glm", family = "binomial", trControl = tc)
# Error message
Error in eval(expr, envir, enclos) : object 'TX_RESP_Q089B' not found
This might be a bad idea. You probably should not look at a single model fit, decide what is significant using the same data, and then eliminate terms; it is something of a self-fulfilling prophecy.
You do seem to have a lot of data, which helps minimize the risk, but I would instead use caret's sbf function to do the feature selection inside the cross-validation. This will help protect you from selection bias and give you honest estimates of performance.
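A minimal sketch of how that could look (untested against your data; my_data and the response column come from your example, and the control settings here are assumptions):
library(caret)
# Run the univariate filter inside each resampling iteration so the
# selection itself is cross-validated rather than done once on all data.
filterCtrl <- sbfControl(functions = caretSBF,  # caretSBF wraps train()
                         method = "cv", number = 10)
fit_sbf <- sbf(response ~ ., data = my_data,
               method = "glm", family = "binomial",
               sbfControl = filterCtrl)
fit_sbf              # resampled performance with selection built in
predictors(fit_sbf)  # variables that survived the filter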
I am trying to figure out how to calculate the marginal effects of my model using the clogit function in the survival package. The margins package does not seem to work with this type of model, but it does work with multinom and mclogit. However, I am investigating the effects of choice characteristics, not individual characteristics, so it needs to be a conditional logit model. The mclogit function works with the margins package, but its results are widely different from the results of the clogit function; why is that? Any help calculating the marginal effects from the clogit function would be greatly appreciated.
mclogit output:
Call:
mclogit(formula = cbind(selected, caseID) ~ SysTEM + OWN + cost +
ENVIRON + NEIGH + save, data = atl)
Estimate Std. Error z value Pr(>|z|)
SysTEM 0.139965 0.025758 5.434 5.51e-08 ***
OWN 0.008931 0.026375 0.339 0.735
cost -0.103012 0.004215 -24.439 < 2e-16 ***
ENVIRON 0.675341 0.037104 18.201 < 2e-16 ***
NEIGH 0.419054 0.031958 13.112 < 2e-16 ***
save 0.532825 0.023399 22.771 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 18380
Residual Deviance: 16670
Number of Fisher Scoring iterations: 4
Number of observations: 8364
clogit output:
Call:
coxph(formula = Surv(rep(1, 25092L), selected) ~ SysTEM + OWN +
cost + ENVIRON + NEIGH + save + strata(caseID), data = atl,
method = "exact")
n= 25092, number of events= 8364
coef exp(coef) se(coef) z Pr(>|z|)
SysTEM 0.133184 1.142461 0.034165 3.898 9.69e-05 ***
OWN -0.015884 0.984241 0.036346 -0.437 0.662
cost -0.179833 0.835410 0.005543 -32.442 < 2e-16 ***
ENVIRON 1.186329 3.275036 0.049558 23.938 < 2e-16 ***
NEIGH 0.658657 1.932195 0.042063 15.659 < 2e-16 ***
save 0.970051 2.638079 0.031352 30.941 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
SysTEM 1.1425 0.8753 1.0685 1.2216
OWN 0.9842 1.0160 0.9166 1.0569
cost 0.8354 1.1970 0.8264 0.8445
ENVIRON 3.2750 0.3053 2.9719 3.6091
NEIGH 1.9322 0.5175 1.7793 2.0982
save 2.6381 0.3791 2.4809 2.8053
Concordance= 0.701 (se = 0.004 )
Rsquare= 0.103 (max possible= 0.688 )
Likelihood ratio test= 2740 on 6 df, p=<2e-16
Wald test = 2465 on 6 df, p=<2e-16
Score (logrank) test = 2784 on 6 df, p=<2e-16
margins output for mclogit
margins(model2A)
SysTEM OWN cost ENVIRON NEIGH save
0.001944 0.000124 -0.001431 0.00938 0.00582 0.0074
margins output for clogit
margins(model2A)
Error in match.arg(type) :
'arg' should be one of “risk”, “expected”, “lp”
I am currently trying to run a linear model on a large data set, but am running into issues with some specific variables.
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage, data = train)
summary(pv_model)
Here is the code for my regression. SalePrice, MSSubClass, GarageArea, and LotFrontage are all numeric fields, while LotConfig is a factor variable.
Here is the output of my pv_model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98154.64 17235.51 5.695 1.75e-08 ***
MSSubClass 50.05 58.38 0.857 0.391539
LotConfigCulDSac 69949.50 12740.62 5.490 5.42e-08 ***
LotConfigFR2 19998.34 14592.31 1.370 0.170932
LotConfigFR3 21390.99 34126.44 0.627 0.530962
LotConfigInside 21666.04 5597.33 3.871 0.000118 ***
GarageArea 175.67 10.96 16.035 < 2e-16 ***
LotFrontage101 42571.20 42664.89 0.998 0.318682
LotFrontage102 26051.49 35876.54 0.726 0.467968
LotFrontage103 36528.81 35967.56 1.016 0.310131
LotFrontage104 218129.42 58129.56 3.752 0.000188 ***
LotFrontage105 61737.12 27618.21 2.235 0.025673 *
LotFrontage106 40806.22 58159.42 0.702 0.483120
LotFrontage107 36744.69 29494.94 1.246 0.213211
LotFrontage108 71537.30 42565.91 1.681 0.093234 .
LotFrontage109 -29193.02 42528.98 -0.686 0.492647
LotFrontage110 73589.28 27706.92 2.656 0.008068 **
As you can see, the first variables behave correctly: both the factor and the numeric fields respond appropriately. That is, until it gets to LotFrontage. For whatever reason, the model runs the regression on every single level of LotFrontage.
For reference, LotFrontage describes the square footage of the subject's front yard. I have cleaned the data and replaced NA values. I am at a loss as to why this particular column is behaving so unusually.
Any help is greatly appreciated.
If I download the data from the Kaggle link or use a GitHub link and do:
train <- read.csv("train.csv")
class(train$LotFrontage)
[1] "integer"
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage,
data = train)
summary(pv_model)
Call:
lm(formula = SalePrice ~ MSSubClass + LotConfig + GarageArea +
LotFrontage, data = train)
Residuals:
Min 1Q Median 3Q Max
-380310 -33812 -4418 24345 487970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11915.866 9455.677 1.260 0.20785
MSSubClass 105.699 45.345 2.331 0.01992 *
LotConfigCulDSac 81789.113 10547.120 7.755 1.89e-14 ***
LotConfigFR2 17736.355 11787.227 1.505 0.13266
LotConfigFR3 17649.409 31418.281 0.562 0.57439
LotConfigInside 13073.201 5002.092 2.614 0.00907 **
GarageArea 208.708 8.725 23.920 < 2e-16 ***
LotFrontage 722.380 88.294 8.182 7.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I suggest you read the CSV in again as above. In your session, LotFrontage has evidently been converted to a factor (perhaps during your NA cleaning), which is why lm creates a dummy coefficient for every level.
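If re-reading the file is not an option, a factor column can be converted back to numeric in place; a minimal sketch, assuming the levels are the original digit strings:
# as.numeric() on a factor directly would return the internal level
# codes, so convert through character first.
train$LotFrontage <- as.numeric(as.character(train$LotFrontage))
class(train$LotFrontage)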
My dataset is 42542 x 14 and I am trying to build different models (logistic regression, KNN, random forest, decision trees) and compare their accuracies.
I get high accuracy but a low ROC AUC for every model.
About 85% of the samples have target variable = 1 and 15% have target variable = 0. I tried resampling to handle this imbalance, but it still gives the same results.
The coefficients for the glm are as follows:
glm(formula = loan_status ~ ., family = "binomial", data = lc_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7617 0.3131 0.4664 0.6129 1.6734
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.264e+00 8.338e-01 -9.911 < 2e-16 ***
annual_inc 5.518e-01 3.748e-02 14.721 < 2e-16 ***
home_own 4.938e-02 3.740e-02 1.320 0.186780
inq_last_6mths1 -2.094e-01 4.241e-02 -4.938 7.88e-07 ***
inq_last_6mths2-5 -3.805e-01 4.187e-02 -9.087 < 2e-16 ***
inq_last_6mths6-10 -9.993e-01 1.065e-01 -9.380 < 2e-16 ***
inq_last_6mths11-15 -1.448e+00 3.510e-01 -4.126 3.68e-05 ***
inq_last_6mths16-20 -2.323e+00 7.946e-01 -2.924 0.003457 **
inq_last_6mths21-25 -1.399e+01 1.970e+02 -0.071 0.943394
inq_last_6mths26-30 1.039e+01 1.384e+02 0.075 0.940161
inq_last_6mths31-35 -1.973e+00 1.230e+00 -1.604 0.108767
loan_amnt -1.838e-05 3.242e-06 -5.669 1.43e-08 ***
purposecredit_card 3.286e-02 1.130e-01 0.291 0.771169
purposedebt_consolidation -1.406e-01 1.032e-01 -1.362 0.173108
purposeeducational -3.591e-01 1.819e-01 -1.974 0.048350 *
purposehome_improvement -2.106e-01 1.189e-01 -1.771 0.076577 .
purposehouse -3.327e-01 1.917e-01 -1.735 0.082718 .
purposemajor_purchase -7.310e-03 1.288e-01 -0.057 0.954732
purposemedical -4.955e-01 1.530e-01 -3.238 0.001203 **
purposemoving -4.352e-01 1.636e-01 -2.661 0.007800 **
purposeother -3.858e-01 1.105e-01 -3.493 0.000478 ***
purposerenewable_energy -8.150e-01 3.036e-01 -2.685 0.007263 **
purposesmall_business -9.715e-01 1.186e-01 -8.191 2.60e-16 ***
purposevacation -4.169e-01 2.012e-01 -2.072 0.038294 *
purposewedding 3.909e-02 1.557e-01 0.251 0.801751
open_acc -1.408e-04 4.147e-03 -0.034 0.972923
gradeB -4.377e-01 6.991e-02 -6.261 3.83e-10 ***
gradeC -5.858e-01 8.340e-02 -7.024 2.15e-12 ***
gradeD -7.636e-01 9.558e-02 -7.990 1.35e-15 ***
gradeE -7.832e-01 1.115e-01 -7.026 2.13e-12 ***
gradeF -9.730e-01 1.325e-01 -7.341 2.11e-13 ***
gradeG -1.031e+00 1.632e-01 -6.318 2.65e-10 ***
verification_statusSource Verified 6.340e-02 4.435e-02 1.429 0.152898
verification_statusVerified 6.864e-02 4.400e-02 1.560 0.118739
dti -4.683e-03 2.791e-03 -1.678 0.093373 .
fico_range_low 6.705e-03 9.292e-04 7.216 5.34e-13 ***
term 5.773e-01 4.499e-02 12.833 < 2e-16 ***
emp_length2-4 years 6.341e-02 4.911e-02 1.291 0.196664
emp_length5-9 years -3.136e-02 5.135e-02 -0.611 0.541355
emp_length10+ years -2.538e-01 5.185e-02 -4.895 9.82e-07 ***
delinq_2yrs2+ 5.919e-02 9.701e-02 0.610 0.541754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 25339 on 29779 degrees of freedom
Residual deviance: 23265 on 29739 degrees of freedom
AIC: 23347
Number of Fisher Scoring iterations: 10
The confusion matrix for LR is as below:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 32 40
1 1902 10788
Accuracy : 0.8478
95% CI : (0.8415, 0.854)
No Information Rate : 0.8485
P-Value [Acc > NIR] : 0.5842
Kappa : 0.0213
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.016546
Specificity : 0.996306
Pos Pred Value : 0.444444
Neg Pred Value : 0.850118
Prevalence : 0.151544
Detection Rate : 0.002507
Detection Prevalence : 0.005642
Balanced Accuracy : 0.506426
'Positive' Class : 0
Is there any way I can improve the AUC?
When someone presents a confusion matrix and talks about low ROC AUC, it usually means they have thresholded the predicted probabilities into 0s and 1s first. The ROC AUC calculation does not require that: it works on the raw probabilities, which gives a much more informative result. If the aim is to obtain the best AUC, it also helps to set AUC as the evaluation metric during training, which tends to produce better results than optimizing other metrics.
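A sketch of both points, with assumed names (fit for the glm above, lc_test for a held-out set) and the pROC package standing in for any AUC implementation:
library(pROC)
library(caret)
# 1. Score AUC on the raw predicted probabilities, not on 0/1 classes.
p <- predict(fit, newdata = lc_test, type = "response")
auc(roc(lc_test$loan_status, p))
# 2. Optimize ROC AUC directly during training with caret. Note that
#    classProbs = TRUE requires factor levels that are valid R names
#    (e.g. "no"/"yes" rather than 0/1).
tc <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                   summaryFunction = twoClassSummary)
fit_roc <- train(loan_status ~ ., data = lc_train, method = "glm",
                 family = "binomial", trControl = tc, metric = "ROC")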
I'm trying to fit a polynomial function (for learning purposes) to the BostonHousing data with the mlr package.
I'm unable to figure out how to provide a custom formula to the underlying lm call; specifically, I want to apply a polynomial to one of the input variables (for testing purposes).
How can this best be achieved?
library(mlr)
data("BostonHousing", package = "mlbench")
regr.task <- makeRegrTask(data = BostonHousing, target = "medv")
regr.learner <- makeLearner("regr.lm")
# I would like to specify the formula used by "regr.lm" myself, how can this be achieved?
regr.train <- train(regr.learner, regr.task)
lm.results <- getLearnerModel(regr.train)
summary(lm.results)
Call:
stats::lm(formula = f, data = d)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas1 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
b 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
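Since mlr builds the model from the task's data rather than from a user-supplied formula (note the stats::lm(formula = f, data = d) in the Call above), one workaround, sketched here rather than taken from mlr's documentation, is to add the transformed columns to the data before creating the task:
# Hypothetical example: add a quadratic term for lstat by hand.
BostonHousing$lstat2 <- BostonHousing$lstat^2
regr.task <- makeRegrTask(data = BostonHousing, target = "medv")
regr.learner <- makeLearner("regr.lm")
regr.train <- train(regr.learner, regr.task)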
I am trying to see in practice what was explained here: what happens to the coefficients once the labels are switched. But I am not getting what is expected. Here is my attempt:
I am using the natality public-use data given as an example in "Practical Data Science with R", where the outcome is a logical variable that classifies newborn babies as atRisk, with levels FALSE and TRUE.
load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
"URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)
This results to the following coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
OK, now let us switch the labels in our atRisk variable:
sdata$atRisk <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")
and re-run the above analysis. I am expecting to see a change in the signs of the reported coefficients; however, I get exactly the same coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.412189 0.289352 -15.249 < 2e-16 ***
PWGT 0.003762 0.001487 2.530 0.011417 *
UPREVIS -0.063289 0.015252 -4.150 3.33e-05 ***
CIG_RECTRUE 0.313169 0.187230 1.673 0.094398 .
GESTREC3< 37 weeks 1.545183 0.140795 10.975 < 2e-16 ***
DPLURALtriplet or higher 1.394193 0.498866 2.795 0.005194 **
DPLURALtwin 0.312319 0.241088 1.295 0.195163
ULD_MECOTRUE 0.818426 0.235798 3.471 0.000519 ***
ULD_PRECIPTRUE 0.191720 0.357680 0.536 0.591951
ULD_BREECHTRUE 0.749237 0.178129 4.206 2.60e-05 ***
URF_DIABTRUE -0.346467 0.287514 -1.205 0.228187
URF_CHYPERTRUE 0.560025 0.389678 1.437 0.150676
URF_PHYPERTRUE 0.161599 0.250003 0.646 0.518029
URF_ECLAMTRUE 0.498064 0.776948 0.641 0.521489
What am I doing wrong here? Can you help, please?
It's because you set train <- sdata[sdata$ORIGRANDGROUP<=5,] and only then change sdata$atRisk <- factor(sdata$atRisk), but your model uses the train data set, whose levels did NOT get changed.
Instead you can do
y <- "!atRisk"  # negating the logical response flips the modeled class
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications, riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2641 0.1358 0.1511 0.1818 0.9732
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.412189 0.289352 15.249 < 2e-16 ***
PWGT -0.003762 0.001487 -2.530 0.011417 *
UPREVIS 0.063289 0.015252 4.150 3.33e-05 ***
CIG_RECTRUE -0.313169 0.187230 -1.673 0.094398 .
GESTREC3< 37 weeks -1.545183 0.140795 -10.975 < 2e-16 ***
DPLURALtriplet or higher -1.394193 0.498866 -2.795 0.005194 **
DPLURALtwin -0.312319 0.241088 -1.295 0.195163
ULD_MECOTRUE -0.818426 0.235798 -3.471 0.000519 ***
ULD_PRECIPTRUE -0.191720 0.357680 -0.536 0.591951
ULD_BREECHTRUE -0.749237 0.178129 -4.206 2.60e-05 ***
URF_DIABTRUE 0.346467 0.287514 1.205 0.228187
URF_CHYPERTRUE -0.560025 0.389678 -1.437 0.150676
URF_PHYPERTRUE -0.161599 0.250003 -0.646 0.518029
URF_ECLAMTRUE -0.498064 0.776948 -0.641 0.521489
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2698.7 on 14211 degrees of freedom
Residual deviance: 2463.0 on 14198 degrees of freedom
AIC: 2491
Number of Fisher Scoring iterations: 7
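An equivalent fix, offered here as a sketch rather than as part of the original answer, is to relevel the factor on the data frame actually used for fitting. With a factor response, glm models the probability of the second level, so reversing the level order flips every sign:
# Relevel on train (created before the labels were changed), not on sdata.
train$atRisk <- factor(train$atRisk, levels = c("TRUE", "FALSE"))
model2 <- glm(fmla, data = train, family = binomial(link = "logit"))
summary(model2)  # coefficients should now show opposite signs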