Standard Error of Ridge Logistic Regression Coefficient using caret - r

I am using the caret package in R to perform ridge logistic regression.
I am now able to find the coefficients for each variable.
The question is: how can I find the standard error of the coefficient for each variable produced by ridge logistic regression?
Here is the sample code that I have:
Ridge1 <- train(Group ~ ., data = train, method = 'glmnet',
                trControl = trainControl("cv", number = 10),
                tuneGrid = expand.grid(alpha = 0, lambda = lambda),
                family = "binomial")
Coefficients of the ridge logistic regression:
coef(Ridge1$finalModel, Ridge1$bestTune$lambda)
How can I get results like those from an ordinary logistic regression model (i.e. the standard error, Wald statistic, p-value, etc.)?

You don't get p-values and confidence intervals from ridge or glmnet regressions because it is very difficult to estimate the distribution of the estimator when a penalization term is present. The first part of the publication for the R package hdi touches on this, and you can check out posts such as this and this.
We can try something like the approach below: get the optimal lambda from caret and use it in another package, hdi, to estimate confidence intervals and p-values. I would interpret these with caution, though; they are very different from an ordinary logistic glm.
library(caret)
library(mlbench)

data(PimaIndiansDiabetes)
X = as.matrix(PimaIndiansDiabetes[, -ncol(PimaIndiansDiabetes)])
y = as.numeric(PimaIndiansDiabetes$diabetes) - 1
lambda = 10^seq(-5, 4, length.out = 25)

Ridge1 <- train(x = X, y = factor(y), method = 'glmnet', family = "binomial",
                trControl = trainControl("cv", number = 10),
                tuneGrid = expand.grid(alpha = 0, lambda = lambda))

bestLambda = Ridge1$bestTune$lambda
Use hdi, but note that the coefficients will not be exactly the same as what you get from caret or glmnet:
library(hdi)
fit = ridge.proj(X,y,family="binomial",lambda=bestLambda)
cbind(fit$bhat, fit$se, fit$pval)  # columns: estimate, standard error, p-value
[,1] [,2] [,3]
pregnant 0.1137868935 0.0314432291 2.959673e-04
glucose 0.0329008177 0.0035806920 3.987411e-20
pressure -0.0122503030 0.0051224313 1.677961e-02
triceps 0.0009404808 0.0067935741 8.898952e-01
insulin -0.0012293122 0.0008902878 1.673395e-01
mass 0.0787408742 0.0145166392 5.822097e-08
pedigree 0.9120151630 0.2927090989 1.834633e-03
age 0.0116844697 0.0092017927 2.041546e-01
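If you also want approximate confidence intervals, a rough sketch (assuming the bhat and se components shown above, and simple normal-approximation Wald intervals):
# Approximate 95% Wald-type confidence intervals from the ridge.proj output
ci <- cbind(lower = fit$bhat - 1.96 * fit$se,
            upper = fit$bhat + 1.96 * fit$se)
cbind(estimate = fit$bhat, ci, pval = fit$pval)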

Related

How to add p-values to odds ratio with logistic svyglm()?

I am using the code below to get odds ratios and confidence intervals for my svyglm model.
model <- svyglm(y ~ x + covariate,
                design = survey_design,
                family = quasibinomial(link = logit))
exp(cbind(OR = coef(model), confint(model)))
I get the p-values when I use summary(); however, that returns coefficients that then need to be exponentiated. How do I add these p-values to the odds-ratio and confint table?
There's a tidy method for svyglm objects in the broom package.
library(broom)
tidy(model, exponentiate = TRUE, conf.int = TRUE)
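If you prefer to stay in base R instead of broom, one possible sketch binds the p-values from summary() onto the exponentiated table (the fourth coefficient column is assumed to hold the p-values):
tab <- exp(cbind(OR = coef(model), confint(model)))
# summary(model)$coefficients: estimate, SE, test statistic, p-value
cbind(tab, p.value = summary(model)$coefficients[, 4])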

Logistic stepwise regression with a fixed number of predictors

For a course I'm attending, I have to perform a logistic stepwise regression to reduce the number of predictors of a feature to a fixed number and estimate the accuracy of the resulting model.
I've been trying with regsubsets() from the leaps package, but I can't get its accuracy.
Now I'm trying with caret, because I can set its metric to "Accuracy", but I can't fix the number of predictors when I use method = "glmStepAIC" in the train() function, because it has no tuning parameters.
step.model <- train(Outcome ~ .,
                    data = myDataset,
                    method = "glmStepAIC",
                    metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 10),
                    trace = FALSE)
I've found this question (stepwise regression using caret in R), but the answer and the links don't seem to work for me.
If not with caret, what would be the best way to achieve the reduced model with fixed number of predictors?
You can specify the number of variables to keep in stepwise selection using the glmulti package. In this example, columns a through g are related to the outcome, but columns A through E are not. In glmulti, confsetsize is the number of models to keep; set minsize equal to maxsize to fix the number of variables retained.
library(MASS)
library(dplyr)
set.seed(100)

# Simulate 12 standard-normal predictors: a-g matter, A-E are noise
dat = data.frame(a = rnorm(10000))
for (i in 2:12) {
  dat[, i] = rnorm(10000)
}
names(dat) = c("a", letters[2:7], LETTERS[1:5])

# Linear predictor uses only the first seven columns, converted to probabilities
Yy = rep(0, 10000)
for (i in 1:7) {
  Yy = Yy + i * dat[, i]
}
Yy = 1 / (1 + exp(-Yy))

# Draw a binary outcome for each row with probability Yy
outcome = c()
for (i in 1:10000) {
  outcome[i] = sample(c(1, 0), 1, prob = c(Yy[i], 1 - Yy[i]))
}
dat = mutate(dat, outcome = factor(outcome))
library(glmulti)
mod = glmulti(outcome ~ .,
              data = dat,
              level = 1,            # main effects only
              method = "g",         # genetic algorithm search
              crit = "aic",
              confsetsize = 5,      # number of models to keep
              plotty = F, report = T,
              fitfunction = "glm",
              family = "binomial",
              minsize = 7,          # minsize = maxsize fixes the number of predictors
              maxsize = 7,
              conseq = 3)
Output:
mod#objects[[1]]
Call: fitfunc(formula = as.formula(x), family = "binomial", data = data)
Coefficients:
(Intercept) a b c d e f g
-0.01386 1.11590 1.99116 3.00459 4.00436 4.86382 5.94198 6.89312
Degrees of Freedom: 9999 Total (i.e. Null); 9992 Residual
Null Deviance: 13860
Residual Deviance: 2183 AIC: 2199
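To answer the accuracy part of the question, one possibility is to refit the selected model with caret; a sketch, reusing the simulated data above and writing out the seven selected variables a through g explicitly:
library(caret)
# Cross-validated accuracy of the selected 7-variable logistic model
cv.fit <- train(outcome ~ a + b + c + d + e + f + g,
                data = dat,
                method = "glm",
                family = "binomial",
                metric = "Accuracy",
                trControl = trainControl(method = "cv", number = 10))
cv.fit$results   # cross-validated Accuracy and Kappa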

How to get an MSE for a random forest regression model on the test dataset

I have a random forest regression model trained on the training dataset. I want to use it on the test dataset and get the MSE of the forest. How can I do this?
My model is this:
fit = randomForest(out.cut$Vir_factor_freq ~ .,out.cut[,-1], importance = TRUE, ntree = 700, replace = TRUE, mtry = 7)
Where out.cut is the training data. I have used:
pred = predict(fit, out.cut, type="response", OOB = TRUE)
Where out.cut here is the test data. All well and good, I get the predictions, but I want to know the accuracy in the same way you can extract the MSE from the fitted model via fit$mse and isolate the i-th value.
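One minimal way to get that number is to compute the test-set MSE by hand; a sketch, assuming the out.cut passed to predict() above is the test data and still contains the observed Vir_factor_freq column:
# Mean squared error of the predictions on the test set
test.mse <- mean((pred - out.cut$Vir_factor_freq)^2)
test.mse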

How to calculate R Squared value for Lasso regression using glmnet in R

I am performing lasso regression in R using the glmnet package:
fit.lasso <- glmnet(x, y)
plot(fit.lasso, xvar = "lambda", label = TRUE)
Then using cross-validation:
cv.lasso = cv.glmnet(x, y)
plot(cv.lasso)
One tutorial (last slide) suggests the following for R^2:
R_Squared = 1 - cv.lasso$cvm/var(y)
But it did not work.
I want to understand the model's efficiency/performance in fitting the data, the way we usually get R^2 and adjusted R^2 from the lm() function in R.
If you are using "gaussian" family, you can access R-squared value by
fit.lasso$glmnet.fit$dev.ratio
I use the example data to demonstrate it:
library(glmnet)
# load data
data(BinomialExample)
head(x)
head(y)
For cross-validation:
cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
rsq = 1 - cvfit$cvm / var(y)
plot(cvfit$lambda, rsq)
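To read off a single number, a sketch that picks the value of rsq at the lambda selected by cross-validation (lambda.min is used here; lambda.1se would work the same way):
# Pseudo R-squared at the cross-validated lambda
rsq[which(cvfit$lambda == cvfit$lambda.min)]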
First fit the lasso model with the selected lambda:
...
lasso.model <- glmnet(x=X,y=Y, family = "binomial", alpha=1, lambda = cv.model$lambda.min )
Then you can get the pseudo-R^2 from the fitted model:
lasso.model$dev.ratio
This value gives the deviance explained by the model divided by the null deviance.
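For a continuous ("gaussian") outcome as in the original question, you can also compute R^2 directly from the predictions; a sketch assuming the x, y, fit.lasso and cv.lasso objects defined in the question:
# R-squared from predictions at the cross-validated lambda
pred <- predict(fit.lasso, newx = x, s = cv.lasso$lambda.min)
1 - sum((y - pred)^2) / sum((y - mean(y))^2)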

How to get coefficients and p values of SVM model in R

I wonder if there is a way to get all the coefficients and p-values for the svmLinear method from the e1071 package. I tried summary(modelname), but that did not work.
Below is the code for my svm model with 10-fold cross validation:
library("e1071")
library("caret")
load(df) ## my dataset
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE) ## 10 fold cross validation
fitsvm <- train(Attrition ~., data=df, method = "svmLinear", trControl = ctrl) ##train model
summary (fitsvm)
Length Class Mode
1 ksvm S4
I could get them with glm (logistic regression):
fit <- train(Attrition ~., data= df, method="glm", family="binomial", trControl= tc)
summary(fit)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.424e+00 1.254e+00 2.731 0.006318 **
I'd be glad if someone can show me a way, thanks a lot!
An SVM does not assume a probabilistic model, so there are no standard errors or p-values.
You can get the coefficients, though. In the e1071 package, the alpha*y values are stored in fit$coefs and the support vectors are stored in fit$SV. You have to be careful about how you extract them. If you have a binary classification, then the coefficients of the separating plane b + w1*x1 + w2*x2 + ... = 0 are simply:
w = t(fit$SV) %*% fit$coefs
b = -fit$rho
If you have only two features, you can plot the separating line using:
abline(-b/w[2], -w[1]/w[2])
It is a bit trickier for the multi-class case. You can check my answer here for a detailed explanation of how to extract w and b from the coefs and SV.
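A small self-contained sketch of the binary, two-feature case with e1071 (the iris data restricted to two species is just an illustration, not part of the original question):
library(e1071)

# Binary, linear-kernel example with two features
dat <- droplevels(subset(iris, Species != "setosa"))
fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = dat,
           kernel = "linear", scale = FALSE)

# Recover the separating plane b + w1*x1 + w2*x2 = 0
w <- t(fit$SV) %*% fit$coefs
b <- -fit$rho

plot(dat$Sepal.Length, dat$Sepal.Width, col = as.integer(dat$Species))
abline(-b / w[2], -w[1] / w[2])   # decision boundary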
