Logistic stepwise regression with a fixed number of predictors - r

For a course I'm attending, I have to perform a logistic stepwise regression to reduce the number of predictors of a feature to a fixed number and estimate the accuracy of the resulting model.
I've been trying with regsubsets() from the leaps package, but I can't get its accuracy.
Now I'm trying caret, since I can set metric = "Accuracy", but I can't fix the number of predictors when I use method = "glmStepAIC" in train(), because that method has no tuning parameters.
step.model <- train(Outcome ~ .,
                    data = myDataset,
                    method = "glmStepAIC",
                    metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 10),
                    trace = FALSE)
I've found this question: stepwise regression using caret in R, but the answer and the links don't seem to work for me.
If not with caret, what would be the best way to achieve the reduced model with fixed number of predictors?

You can specify the number of variables to keep in stepwise selection using the glmulti package. In this example, columns a through g are related to the outcome, while columns A through E are not. In glmulti, confsetsize is the number of models to return, and setting minsize equal to maxsize fixes the number of variables to keep.
library(MASS)
library(dplyr)
set.seed(100)
# simulate 12 independent standard-normal predictors
dat <- data.frame(a = rnorm(10000))
for (i in 2:12) {
  dat[, i] <- rnorm(10000)
}
names(dat) <- c("a", letters[2:7], LETTERS[1:5])
# only columns a-g contribute to the linear predictor
Yy <- rep(0, 10000)
for (i in 1:7) {
  Yy <- Yy + i * dat[, i]
}
Yy <- 1 / (1 + exp(-Yy))
# draw a binary outcome from the resulting probabilities
outcome <- c()
for (i in 1:10000) {
  outcome[i] <- sample(c(1, 0), 1, prob = c(Yy[i], 1 - Yy[i]))
}
dat <- mutate(dat, outcome = factor(outcome))
library(glmulti)
mod <- glmulti(outcome ~ .,
               data = dat,
               level = 1,              # main effects only
               method = "g",           # genetic algorithm search
               crit = "aic",
               confsetsize = 5,        # number of models to keep
               plotty = FALSE, report = TRUE,
               fitfunction = "glm",
               family = "binomial",
               minsize = 7,            # fix the number of predictors...
               maxsize = 7,            # ...by setting minsize = maxsize
               conseq = 3)
Output:
mod@objects[[1]]
Call: fitfunc(formula = as.formula(x), family = "binomial", data = data)
Coefficients:
(Intercept)        a        b        c        d        e        f        g
   -0.01386  1.11590  1.99116  3.00459  4.00436  4.86382  5.94198  6.89312
Degrees of Freedom: 9999 Total (i.e. Null); 9992 Residual
Null Deviance: 13860
Residual Deviance: 2183 AIC: 2199
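To connect this back to the original question, here is a minimal sketch, assuming the glmulti fit above, of refitting the selected formula with caret's train() to estimate the 10-fold CV accuracy of the reduced 7-predictor model:
library(caret)
best_formula <- formula(mod@objects[[1]])   # formula of the best AIC model
acc_fit <- train(best_formula,
                 data = dat,
                 method = "glm",
                 family = "binomial",
                 metric = "Accuracy",
                 trControl = trainControl(method = "cv", number = 10))
acc_fit$results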

Related

How to set a ppv in caret for random forest in r?

So I'm interested in creating a model that optimizes PPV. I've created an RF model (below) that outputs a confusion matrix, from which I then manually calculate sensitivity, specificity, PPV, NPV, and F1. I know accuracy is optimized right now, but I'm willing to forgo some sensitivity and specificity to get a much higher PPV.
data_ctrl_null <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                               summaryFunction = twoClassSummary,
                               savePredictions = TRUE, sampling = NULL)
set.seed(5368)
model_htn_df <- train(outcome ~ ., data = htn_df, ntree = 1000,
                      tuneGrid = data.frame(mtry = 38),
                      trControl = data_ctrl_null, method = "rf",
                      preProc = c("center", "scale"), metric = "ROC",
                      importance = TRUE)
model_htn_df$finalModel #provides confusion matrix
Results:
Call:
randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 38
OOB estimate of error rate: 16.2%
Confusion matrix:
     no yes class.error
no  274  19  0.06484642
yes  45  57  0.44117647
My manual calculation: sen = 55.9%, spec = 93.5%, ppv = 75.0%, npv = 85.9%. (The confusion matrix switches my no and yes outcomes, so I also switch the numbers when I calculate the performance metrics.)
So what do I need to do to get a PPV = 90%?
This is a similar question, but I'm not really following it.
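For reference, a hedged sketch, assuming savePredictions = TRUE as above and classes labelled "no"/"yes", of computing these metrics from the saved cross-validated predictions instead of by hand:
library(caret)
cm <- confusionMatrix(model_htn_df$pred$pred,
                      model_htn_df$pred$obs,
                      positive = "yes")
cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]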
We define a function to calculate PPV and return the results with a name:
PPV <- function(data, lev = NULL, model = NULL) {
  value <- posPredValue(data$pred, data$obs, positive = lev[1])
  c(PPV = value)
}
Let's say we have the following data:
library(randomForest)
library(caret)
data=iris
data$Species = ifelse(data$Species == "versicolor","versi","others")
trn = sample(nrow(iris),100)
Then we train by specifying PPV to be the metric:
mdl <- train(Species ~ ., data = data[trn, ],
             method = "rf",
             metric = "PPV",
             trControl = trainControl(summaryFunction = PPV,
                                      classProbs = TRUE))
Random Forest
100 samples
4 predictor
2 classes: 'others', 'versi'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
mtry PPV
2 0.9682811
3 0.9681759
4 0.9648426
PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Now you can see it is trained on PPV. However, you cannot force the training to achieve a PPV of 0.9; it really depends on the data. If your independent variables have no predictive power, PPV will not improve no matter how much you train.
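One hedged way to see the PPV the chosen model actually achieves, assuming the mdl, data, and trn objects above and that both classes appear in the held-out rows, is to score it on the rows not used for training:
pred <- predict(mdl, newdata = data[-trn, ])
# same positive class as the PPV() summary function (lev[1] = "others")
posPredValue(pred, factor(data$Species[-trn]), positive = "others")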

Standard Error of Ridge Logistic Regression Coefficient using caret

I am using the caret package in R to perform ridge logistic regression.
Now I am able to find the coefficients for each variable.
The question is: how do I get the standard error of each coefficient produced by the ridge logistic regression?
Here is the sample code that I have:
Ridge1 <- train(Group ~ ., data = train, method = 'glmnet',
                trControl = trainControl("cv", number = 10),
                tuneGrid = expand.grid(alpha = 0,
                                       lambda = lambda),
                family = "binomial")
Coefficients of the ridge logistic regression:
coef(Ridge1$finalModel, Ridge1$bestTune$lambda)
How can I get results like those from a logistic regression model (i.e. the standard error, Wald statistic, p-value, etc.)?
You don't get p-values and confidence intervals from ridge or glmnet regressions because it is very difficult to estimate the distribution of the estimator when a penalization term is present. The first part of the publication for the R package hdi touches on this, and you can check out posts such as this and this.
We can try something like the following: get the optimal lambda from caret and use it in the hdi package to estimate confidence intervals and p-values. But I would interpret these with caution; they can be very different from what an ordinary logistic glm gives.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
X = as.matrix(PimaIndiansDiabetes[,-ncol(PimaIndiansDiabetes)])
y = as.numeric(PimaIndiansDiabetes$diabetes)-1
lambda = 10^seq(-5,4,length.out=25)
Ridge1 <- train(x = X, y = factor(y), method = 'glmnet', family = "binomial",
                trControl = trainControl("cv", number = 10),
                tuneGrid = expand.grid(alpha = 0,
                                       lambda = lambda))
bestLambda = Ridge1$bestTune$lambda
Use hdi, but note that the coefficients will not be exactly the same as what you get with caret, or glmnet:
library(hdi)
fit = ridge.proj(X,y,family="binomial",lambda=bestLambda)
cbind(fit$bhat,fit$se,fit$pval)
[,1] [,2] [,3]
pregnant 0.1137868935 0.0314432291 2.959673e-04
glucose 0.0329008177 0.0035806920 3.987411e-20
pressure -0.0122503030 0.0051224313 1.677961e-02
triceps 0.0009404808 0.0067935741 8.898952e-01
insulin -0.0012293122 0.0008902878 1.673395e-01
mass 0.0787408742 0.0145166392 5.822097e-08
pedigree 0.9120151630 0.2927090989 1.834633e-03
age 0.0116844697 0.0092017927 2.041546e-01
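For comparison, a small sketch of the standard errors an unpenalized logistic GLM reports on the same X and y; it illustrates how different the two sets of estimates can be:
plain <- glm(y ~ X, family = binomial)
summary(plain)$coefficients   # estimate, std. error, z value, Pr(>|z|)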

How to calculate R-squared after using bagging function to develop CART decision trees?

I am using the following bagging function with ipred to bootstrap the sample 500 times in R in order to develop decision trees:
baggedsample <- bagging(p ~ ., data, nbagg = 500, coob = TRUE,
                        control = list(minbucket = 5))
After this, I would like to know the R-squared.
I notice that if I do the bagging with caret's train function, R-squared is calculated automatically, as follows:
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
baggedsample <- train(
p ~ .,
data,
method = "treebag",
trControl = ctrl,
importance = TRUE
)
# assess results
baggedsample
##     RMSE  Rsquared      MAE
## 36477.25 0.7001783 24059.85
Appreciate any guidance on this issue, thanks.
Since you do not provide any data, I will illustrate using the built-in iris data.
You can simply compute R-squared from the formula.
library(ipred)
attach(iris)
BAG <- bagging(Sepal.Length ~ ., data = iris)
R2 <- 1 - sum((Sepal.Length - predict(BAG))^2) /
          sum((Sepal.Length - mean(Sepal.Length))^2)
R2
[1] 0.824782
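Note that this R-squared is computed from in-sample predictions. A hedged alternative, assuming coob = TRUE as in the question and that the err component holds the out-of-bag RMSE (as shown by print()), is to build R-squared from the out-of-bag error, which is usually less optimistic:
BAG_oob <- bagging(Sepal.Length ~ ., data = iris, coob = TRUE)
R2_oob <- 1 - BAG_oob$err^2 / mean((Sepal.Length - mean(Sepal.Length))^2)
R2_oob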

How to calculate R Squared value for Lasso regression using glmnet in R

I am performing lasso regression in R using the glmnet package:
fit.lasso <- glmnet(x,y)
plot(fit.lasso,xvar="lambda",label=TRUE)
Then using cross-validation:
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
One tutorial (last slide) suggests the following for R^2:
R_Squared = 1 - cv.lasso$cvm/var(y)
But it did not work.
I want to understand the model's performance in fitting the data, the way we usually get R^2 and adjusted R^2 from the lm() function in R.
If you are using the "gaussian" family, you can access the R-squared value with
fit.lasso$glmnet.fit$dev.ratio
I'll use the example data to demonstrate it:
library(glmnet)
# load data
data(BinomialExample)
head(x)
head(y)

# cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
rsq <- 1 - cvfit$cvm / var(y)
plot(cvfit$lambda, rsq)
First fit the lasso model with the selected lambda:
...
lasso.model <- glmnet(x=X,y=Y, family = "binomial", alpha=1, lambda = cv.model$lambda.min )
Then you can get the pseudo-R2 from the fitted model:
lasso.model$dev.ratio
This value gives the deviance explained by the model divided by the null deviance.
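For the "gaussian" family case mentioned in the first answer, here is a short self-contained sketch (simulated data, for illustration only) of reading dev.ratio at the CV-selected lambda, which for least-squares fits is the usual R^2:
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(100)
cvfit <- cv.glmnet(x, y)                 # family = "gaussian" by default
i_min <- which.min(abs(cvfit$glmnet.fit$lambda - cvfit$lambda.min))
cvfit$glmnet.fit$dev.ratio[i_min]        # fraction of deviance explained at lambda.min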

How to get coefficients and p values of SVM model in R

I wonder if there is a way to get all the coefficients and p-values with the svmLinear method from the e1071 package. I tried summary(modelname), but that did not work.
Below is the code for my svm model with 10-fold cross validation:
library("e1071")
library("caret")
load(df) ## my dataset
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE) ## 10 fold cross validation
fitsvm <- train(Attrition ~., data=df, method = "svmLinear", trControl = ctrl) ##train model
summary (fitsvm)
Length Class Mode
1 ksvm S4
I could get them with glm - logistic regression:
fit <- train(Attrition ~., data= df, method="glm", family="binomial", trControl= tc)
summary(fit)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.424e+00 1.254e+00 2.731 0.006318 **
I'd be glad if someone can show me a way, thanks a lot!
SVM does not assume a probabilistic model, so there are no standard errors or p-values.
You can get the coefficients, though. In the e1071 package, the alpha*y values are stored in fit$coefs and the support vectors in fit$SV. You have to be careful about how you extract them. If you only have a binary classification, the coefficients of the separating plane b + w1*x1 + w2*x2 + ... = 0 are simply:
w = t(fit$SV) %*% fit$coefs
b = -fit$rho
If you only have 2-D features, you can plot the separating line using:
abline(-b/w[2], -w[1]/w[2])
It is a bit trickier for multi-class problems. You can check my answer here for a detailed explanation of how to extract w and b from the coefs and SV in that case.
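A small end-to-end sketch of the binary case, using e1071::svm directly on a two-class subset of iris (an illustration, not the asker's data), with scale = FALSE so the extracted plane is in the original units:
library(e1071)
d <- droplevels(iris[iris$Species != "setosa", ])
fit <- svm(Species ~ Petal.Length + Petal.Width, data = d,
           kernel = "linear", scale = FALSE)
w <- t(fit$SV) %*% fit$coefs   # weights of the separating plane
b <- -fit$rho                  # intercept
plot(Petal.Width ~ Petal.Length, data = d, col = d$Species)
abline(-b / w[2], -w[1] / w[2])   # decision boundary for the two features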
