I have a dataset with a number of binary predictors and a binary outcome, and I am trying to predict the outcome with logistic regression using the caret package.
For some reason, training my model finishes without any errors but produces no results. However, when I train with cross-validation, I do get results.
> Model = train(success ~ . - contestid - index - tags, data = p.train,
+ method = "glm",
+ family = binomial(link = "logit"),
+ trControl = trainControl(method = "none"));
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> Model$results
[1] Accuracy Kappa parameter
<0 rows> (or 0-length row.names)
With cross-validation:
> Model = train(success ~ . - contestid - index - tags, data = p.train,
+ method = "glm",
+ family = binomial(link = "logit"),
+ trControl = trainControl(method = "cv"));
There were 22 warnings (use warnings() to see them)
> Model$results
parameter Accuracy Kappa AccuracySD KappaSD
1 none 0.8 0.4208333 0.1972027 0.460482
> Model$resample
Accuracy Kappa Resample
1 0.75 0.5000000 Fold01
2 0.50 0.2000000 Fold02
3 1.00 1.0000000 Fold03
4 0.75 0.5000000 Fold04
5 1.00 1.0000000 Fold05
6 1.00 NA Fold06
7 0.75 0.5000000 Fold07
8 0.75 0.0000000 Fold08
9 0.50 -0.3333333 Fold09
10 1.00 NA Fold10
All the warnings are the same, about the fitted probabilities, since my data allows perfect separation. However, this does not prevent training with CV from producing results.
What might be the reason for the absence of results in the first case?
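For reference, a sketch of how I would check that the first model is still usable for prediction; p.test is a hypothetical hold-out set, not something from my session:
# sketch: with trainControl(method = "none") caret fits a single model and
# skips resampling, so no Accuracy/Kappa rows appear in Model$results;
# the fitted model can still be used for prediction
preds <- predict(Model, newdata = p.test)   # p.test: hypothetical hold-out set
confusionMatrix(preds, p.test$success)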
Thanks
I am performing logistic regression of CHD sickness on a few variables (see the data frame).
ind sbp tobacco ldl adiposity typea obesity alcohol age chd
1 1 160 12.00 5.73 23.11 49 25.30 97.20 52 1
2 2 144 0.01 4.41 28.61 55 28.87 2.06 63 1
...
I performed backward stepwise selection on this model to obtain the best model, but the result is a model that contains only the intercept. Why might that be? What does it mean?
model <- glm(chd ~ ., data = CHD, family = binomial(link = logit))
intercept_only <- glm(chd ~ 1, data = CHD, family = binomial(link = logit))
#perform backward stepwise regression
back <- step(intercept_only, direction='backward', scope=formula(model), trace=0)
#view results of backward stepwise regression
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 461 596.1084 598.1084
To do backward selection, you should start from the full model, not the intercept-only model:
back <- step(model, direction='backward', scope=formula(model), trace=0)
The intercept_only model should only be used with direction='forward' or direction='both', as in the sketch below.
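A minimal sketch of the forward variant, assuming the same model and intercept_only objects as above:
# forward selection starts from the intercept-only model and grows it;
# `scope` tells step() how large the model is allowed to become
forward <- step(intercept_only, direction = 'forward', scope = formula(model), trace = 0)
summary(forward)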
I have noticed that for some runs of:
train=as.h2o(u)
mod = h2o.glm(family= "binomial", x= c(1:15), y="dc",
training_frame=train, missing_values_handling = "Skip",
lambda = 0, compute_p_values = TRUE, nfolds = 10,
keep_cross_validation_predictions= TRUE)
there are NaNs in the cross-validation metrics summary of AUC for some CV folds of the model.
For example:
print(mod@model$cross_validation_metrics_summary["auc",])
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
auc 0.63244045 0.24962118 0.25 0.6666667 0.8095238 1.0 0.6666667 0.46666667 NaN NaN 1.0 0.2
NaNs seem to appear less frequently when I set a smaller nfolds = 7.
How should these NaN values be interpreted, and when does h2o cross-validation output them?
I suppose it happens when the AUC can't be assessed correctly in a fold. My training set has 70 complete rows.
Can such AUC cross-validation results (containing NaNs) be considered reliable?
There are specific cases that cause a division by zero when calculating the ROC curve, which makes the AUC come out as NaN. It is likely that, because your data is small, some folds contain no true positives, and that is what causes this issue.
We can test this by keeping the fold column and then counting the values of dc in each fold:
...
train <- as.h2o(u)
mod <- h2o.glm(family = "binomial"
, x = c(1:15)
, y = "dc"
, training_frame = train
, missing_values_handling = "Skip"
, lambda = 0
, compute_p_values = TRUE
, nfolds = 10
, keep_cross_validation_fold_assignment = TRUE
, seed = 1234)
fold <- as.data.frame(h2o.cross_validation_fold_assignment(mod))
df <- cbind(u,fold)
table(df[c("dc","fold_assignment")])
fold_assignment
dc 0 1 2 3 4 5 6 7 8 9
0 4 6 6 2 9 6 6 4 4 6
1 2 2 3 4 0 2 0 0 1 2
mod@model$cross_validation_metrics_summary["auc",]
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid
auc 0.70238096 0.19357596 0.875 0.6666667 0.5 0.375 NaN 0.5833333 NaN
cv_8_valid cv_9_valid cv_10_valid
auc NaN 1.0 0.9166667
We see that the folds with NaN are exactly the folds that contain only dc = 0.
Leaving the NaNs aside, the wide range of AUC across your folds (from 0.2 to 1) tells us that this is not a robust model and that it is likely overfitting. Can you add more data?
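One mitigation worth trying (a sketch, assuming the same frame as above): ask h2o to stratify the fold assignment by the response, so that every fold keeps both classes:
# stratified fold assignment keeps both levels of dc in each fold,
# which avoids folds where the AUC is undefined
mod_strat <- h2o.glm(family = "binomial", x = c(1:15), y = "dc",
                     training_frame = train,
                     missing_values_handling = "Skip",
                     lambda = 0, compute_p_values = TRUE,
                     nfolds = 10, fold_assignment = "Stratified",
                     seed = 1234)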
I have Likert-type data, some on a 4-point scale and some on a 3-point scale. For example:
variable.x <- c(5,2,4,5,3,1)
variable.y <- c(0,1,1,0,2,1)
my.data <- cbind(variable.x, variable.y)
library(psych)
polychoric(my.data)
#Call: polychoric(x = my.data)
#Polychoric correlations
# vrbl.x vrbl.y
#variable.x 1.00
#variable.y -0.25 1.00
# with tau of
# 0 1 2 3 4
#variable.x -0.97 -0.43 0 0.43 Inf
#variable.y -0.43 0.97 Inf Inf Inf
How can I obtain a significance value for the correlation of -0.25?
Strangely, the polycor package gives totally different results:
library(polycor)
polychor(variable.x, variable.y, ML=TRUE, std.err=TRUE)
#Polychoric Correlation, ML est. = -0.7599 (0.2588)
#Test of bivariate normality: Chisquare = 9.088, df = 7, p = 0.2464
# Row Thresholds
# Threshold Std.Err.
#1 -0.9388 0.5519
#2 -0.5774 0.5040
#3 -0.1906 0.5267
#4 0.3979 0.5690
# Column Thresholds
# Threshold Std.Err.
#1 -0.3891 0.5683
#2 0.9692 0.5568
Now the correlation coefficient is -0.76, not -0.25. Again, how would I find its significance?
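One way I can think of (an assumption on my part, not something from the polycor documentation): since polychor(..., ML = TRUE, std.err = TRUE) returns the estimate together with its sampling variance, a Wald z-test would give a p-value:
# sketch: Wald test for the ML polychoric estimate; the fitted object
# stores the correlation in $rho and its sampling variance in $var[1, 1]
est <- polychor(variable.x, variable.y, ML = TRUE, std.err = TRUE)
z <- est$rho / sqrt(est$var[1, 1])
2 * pnorm(-abs(z))   # two-sided p-value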
I'm doing survival analysis with interval-censored data, using the intcox() function from the intcox package in R, which is based on the coxph() function.
The function returns its output without standard errors or likelihood ratio test values:
> intcox(surv~sexo,data=dados)
Call:
intcox(formula = surv ~ sexo, data = dados)
coef exp(coef) se(coef) z p
sexojuvenil 2.596 13.4 NA NA NA
sexomacho -0.105 0.9 NA NA NA
Likelihood ratio test=NA on 2 df, p=NA n= 156
I don't know why this is happening... Here is the application of the coxph() function to the same data:
> coxph(Surv(dias_seg,status)~sexo,data=dados)
Call:
coxph(formula = Surv(dias_seg, status) ~ sexo, data = dados)
coef exp(coef) se(coef) z p
sexojuvenil 2.320 10.172 0.630 3.684 0.00023
sexomacho -0.169 0.844 0.252 -0.671 0.50000
Likelihood ratio test=9.28 on 2 df, p=0.00967 n= 156, number of events= 77
str(dados$sexo)
Factor w/ 3 levels "femea","juvenil",..: 3 3 3 3 3 3 3 3 3 3 ...
Can you help me to solve this problem?
Thanks in advance.
I was told by Volkmar Henschel (author of the intcox package) that "the fit with intcox gives an object of class 'coxph' without the standard errors of the regression coefficients".
There is more detail in this document:
ftp://ftp.auckland.ac.nz/pub/software/CRAN/doc/vignettes/intcox/intcox.pdf
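One workaround (my own suggestion, a sketch only) is to bootstrap the standard errors by resampling rows and refitting; left and right are hypothetical names for the interval-bound columns in dados:
# sketch: bootstrap se(coef) for intcox, which returns NA standard errors;
# `left`/`right` are hypothetical interval-bound columns of dados
library(intcox)
set.seed(1)
boot_coefs <- replicate(200, {
  d <- dados[sample(nrow(dados), replace = TRUE), ]
  coef(intcox(Surv(left, right, type = "interval2") ~ sexo, data = d))
})
apply(boot_coefs, 1, sd)   # bootstrap standard error for each coefficient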
Suppose I have to estimate the coefficients a, b in the regression:
y = a*x + b*z + c
I know in advance that y is always in the range 0 <= y <= x, but the regression model sometimes produces y outside this range.
Sample data:
mydata<-data.frame(y=c(0,1,3,4,9,11),x=c(1,3,4,7,10,11),z=c(1,1,1,9,6,7))
round(predict(lm(y~x+z,data=mydata)),2)
1 2 3 4 5 6
-0.87 1.79 3.12 4.30 9.34 10.32
The first predicted value is < 0.
I tried a model without an intercept: now all predictions are > 0, but the third prediction of y exceeds x (4.03 > 4):
round(predict(lm(y~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.76 2.94 4.03 4.67 8.92 9.68
I also considered modelling the proportion y/x instead of y:
mydata$y2x<-mydata$y/mydata$x
round(predict(lm(y2x~x+z,data=mydata)),2)
1 2 3 4 5 6
0.15 0.39 0.50 0.49 0.97 1.04
round(predict(lm(y2x~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.08 0.33 0.46 0.47 0.99 1.07
But now the sixth prediction is > 1, although a proportion should lie in the range [0, 1].
I also tried to apply the method where glm is used with an offset option, as in Regression for a Rate variable in R and
http://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset
but this was not successful.
Please note that in my data the dependent variable, the proportion y/x, is both zero-inflated and one-inflated.
Any idea what a suitable approach would be to build such a model in R (glm, lm)?
You're on the right track: if 0 ≤ y ≤ x then 0 ≤ (y/x) ≤ 1. This suggests fitting y/x to a logistic model in glm(...). Details are below, but considering that you've only got 6 points, this is a pretty good fit.
The major concern is that the model is not valid unless the error in (y/x) is Normal with constant variance (or, equivalently, the error in y increases with x). If this is true then we should get a (more or less) linear Q-Q plot, which we do.
One nuance: the interface to the glm logistic model wants two columns for y: "number of successes (S)" and "number of failures (F)". It then calculates the probability as S/(S+F). So we have to provide two columns which mimic this: y and x-y. Then glm(...) will calculate y/(y+(x-y)) = y/x.
Finally, the fit summary suggests that x is important and z may or may not be. You might want to try a model that excludes z and see whether that improves the AIC; a sketch of this follows at the end.
fit = glm(cbind(y,x-y)~x+z, data=mydata, family=binomial(logit))
summary(fit)
# Call:
# glm(formula = cbind(y, x - y) ~ x + z, family = binomial(logit),
# data = mydata)
# Deviance Residuals:
# 1 2 3 4 5 6
# -0.59942 -0.35394 0.62705 0.08405 -0.75590 0.81160
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.0264 1.2177 -1.664 0.0961 .
# x 0.6786 0.2695 2.518 0.0118 *
# z -0.2778 0.1933 -1.437 0.1507
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 13.7587 on 5 degrees of freedom
# Residual deviance: 2.1149 on 3 degrees of freedom
# AIC: 15.809
par(mfrow=c(2,2))
plot(fit) # residuals, Q-Q, Scale-Location, and Leverage Plots
mydata$pred <- predict(fit, type="response")
par(mfrow=c(1,1))
plot(mydata$y/mydata$x,mydata$pred,xlim=c(0,1),ylim=c(0,1), xlab="Actual", ylab="Predicted")
abline(0,1, lty=2, col="blue")
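Following the suggestion above, a quick sketch of comparing the AIC with and without z:
# refit without z and compare AIC; the lower value indicates the preferred model
fit2 <- glm(cbind(y, x - y) ~ x, data = mydata, family = binomial(logit))
AIC(fit, fit2)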