How to get coefficients and p values of SVM model in R - r

I wonder if there is a way to get all coefficients and p-values in svmLinear method from the e1071package. I tried summary(modelname) but that did not work.
Below is the code for my svm model with 10-fold cross validation:
library("e1071")
library("caret")
load(df) ## my dataset
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE) ## 10 fold cross validation
fitsvm <- train(Attrition ~., data=df, method = "svmLinear", trControl = ctrl) ##train model
summary (fitsvm)
Length Class Mode
1 ksvm S4
I could get them with glm - logistic regression:
fit <- train(Attrition ~., data= df, method="glm", family="binomial", trControl= tc)
summary(fit)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.424e+00 1.254e+00 2.731 0.006318 **
I'd be glad if someone can show me a way, thanks a lot!

SVM does not assume a probabilistic model, so there are no Std-Errors or p. values.
You can get the coefficients, though. In the e1071 package, the alpha*y are stored in fit$coefs, and the Support-Vectors are stored in fit$SV. You have to be careful how to extract them. If you only have a binary classification, then the coefs of the separation plane b+w1*x1+w2*x2+...=0 are simply:
w = t(fit$SV) %*% fit$coefs
b = -fit$rho
If you only have 2d features, you can plot the separation-line using:
abline(-b/w[2], -w[1]/w[2])
It is a bit more tricky for multi-class. You can check my answer here for a detailed explanation how to extract w and b from the coefs and SV.

Related

Logistic stepwise regression with a fixed number of predictors

For a course I'm attending, I have to perform a logistic stepwise regression to reduce the number of predictors of a feature to a fixed number and estimate the accuracy of the resulting model.
I've been trying with regsubsets() from the leaps package, but I can't get its accuracy.
Now I'm trying with caret, because I can set its metric to "Accuracy", but I can't fix the number of predictors when I use method = "glmStepAIC" in the train() function, because it has no tune parameters.
step.model <- train(Outcome ~ .,
data = myDataset,
method = "glmStepAIC",
metric = "Accuracy",
trControl = trainControl(method = "cv", number = 10),
trace = FALSE)
I've found this question, but the answer and the links don't seem to work for me. stepwise regression using caret in R
If not with caret, what would be the best way to achieve the reduced model with fixed number of predictors?
You can specify the number of variables to keep in stepwise selection using the glmulti package. In this example columns a through g are related to the outcome, but columns A through E are not. In glmulti, confsetsize is the number of models to select and set minsize equal to maxsize for the number of variables to keep.
library(MASS)
library(dplyr)
set.seed(100)
dat=data.frame(a=rnorm(10000))
for (i in 2:12) {
dat[,i]=rnorm(10000)
}
names(dat)=c("a", letters[2:7], LETTERS[1:5])
Yy=rep(0, 10000)
for (i in 1:7) {
Yy=Yy+i*dat[,i]
}
Yy=1/(1+exp(-Yy))
outcome=c()
for (i in 1:10000) {
outcome[i]=sample(c(1,0), 1, prob=c(Yy[i], 1-Yy[i]))
}
dat=mutate(dat, outcome=factor(outcome))
library(glmulti)
mod=glmulti(outcome ~ .,
data=dat,
level=1,
method="g",
crit="aic",
confsetsize=5,
plotty=F, report=T,
fitfunction="glm",
family="binomial",
minsize=7,
maxsize=7,
conseq=3)
output
mod#objects[[1]]
Call: fitfunc(formula = as.formula(x), family = "binomial", data = data)
Coefficients:
(Intercept) a b c d e f g
-0.01386 1.11590 1.99116 3.00459 4.00436 4.86382 5.94198 6.89312
Degrees of Freedom: 9999 Total (i.e. Null); 9992 Residual
Null Deviance: 13860
Residual Deviance: 2183 AIC: 2199

Standard Error of Ridge Logistic Regression Coefficient using caret

I am using caret package in R, to perform Ridge Logistic Regression.
Now I am able to find the coefficients for each variable.
Question is: How to know the standard error of coefficient for each variable produce using Ridge logistic regression?
Here is the sample code that I have:-
Ridge1 <- train(Group ~., data = train, method = 'glmnet',
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 0,
lambda = lambda),
family="binomial")
Coefficient of Ridge logistic regression
coef(Ridge1$finalModel, Ridge1$bestTune$lambda)
How to get a result as in logistic regression model (ie: the standard error, wald statistic, p-value.. etc?)
You don't get p-values and confidence intervals from ridge or glmnet regressions because it is very difficult to estimate the distribution of the estimator when a penalization term is present. The first part of the publication for R package hmi touches on this and you can check out post such as this and this
We can try something below, for example getting the optimal lambda from caret and using that in another package hmi to estimate confidence intervals and p-values, but I would interpret these with caution, they are very different from a custom logistic glm.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
X = as.matrix(PimaIndiansDiabetes[,-ncol(PimaIndiansDiabetes)])
y = as.numeric(PimaIndiansDiabetes$diabetes)-1
lambda = 10^seq(-5,4,length.out=25)
Ridge1 <- train(x=X,y=factor(y), method = 'glmnet',family="binomial",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 0,
lambda = lambda))
bestLambda = Ridge1$bestTune$lambda
Use hdi, but note that the coefficients will not be exactly the same as what you get with caret, or glmnet:
library(hdi)
fit = ridge.proj(X,y,family="binomial",lambda=bestLambda)
cbind(fit$bhat,fit$se,fit$pval)
[,1] [,2] [,3]
pregnant 0.1137868935 0.0314432291 2.959673e-04
glucose 0.0329008177 0.0035806920 3.987411e-20
pressure -0.0122503030 0.0051224313 1.677961e-02
triceps 0.0009404808 0.0067935741 8.898952e-01
insulin -0.0012293122 0.0008902878 1.673395e-01
mass 0.0787408742 0.0145166392 5.822097e-08
pedigree 0.9120151630 0.2927090989 1.834633e-03
age 0.0116844697 0.0092017927 2.041546e-01

How to calculate R-squared after using bagging function to develop CART decision trees?

I am using the following bagging function with ipred to bootstrap the sample 500 times in R in order to develop decision trees:
baggedsample <- bagging(p ~., data, nbagg=500, coob=TRUE, control = list
(minbucket=5))
After this, I would like to know the R-squared.
I notice that if I do the bagging with caret function, R-squared would be automatically calculated as follows:
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
baggedsample <- train(
p ~ .,
data,
method = "treebag",
trControl = ctrl,
importance = TRUE
)
# assess results
baggedsample
RMSE Rsquared MAE
## 36477.25 0.7001783 24059.85
Appreciate any guidance on this issue, thanks.
Since you do not provide any data, I will illustrate using the built-in iris data.
You can simply compute R-squared from the formula.
attach(iris)
BAG = bagging(Sepal.Length ~ ., data=iris)
R2 = 1 - sum((Sepal.Length - predict(BAG))^2) /
sum((Sepal.Length - mean(Sepal.Length))^2)
R2
[1] 0.824782

How to calculate R Squared value for Lasso regression using glmnet in R

I am performing lasso regression in R using glmnet package:
fit.lasso <- glmnet(x,y)
plot(fit.lasso,xvar="lambda",label=TRUE)
Then using cross-validation:
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
One tutorial (last slide) suggest the following for R^2:
R_Squared = 1 - cv.lasso$cvm/var(y)
But it did not work.
I want to understand the model efficiency/performance in fitting the data. As we usually get R^2 and adjusted R^2 when performing lm() function in r.
If you are using "gaussian" family, you can access R-squared value by
fit.lasso$glmnet.fit$dev.ratio
I use the example data to demonstrate it
library(glmnet)
load data
data(BinomialExample)
head(x)
head(y)
For cross validation
cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
rsq = 1 - cvfit$cvm/var(y)
plot(cvfit$lambda,rsq)
Firtst fit the Lasso model with the selected lambda
...
lasso.model <- glmnet(x=X,y=Y, family = "binomial", alpha=1, lambda = cv.model$lambda.min )
then you could get the pseudo R2 from the fitted model
`lasso.model$dev.ratio`
this value give the deviance explained by the model/Null deviance

caret: different RMSE on the same data

I think, that my problem is quite weird. When I use RMSE metric for the best model selection by train function, I obtain different RMSE value from computed by my own function on the same data. Where ist the problem? Does my function work wrong?
library(caret)
library(car)
library(nnet)
data(oil)
ztest=fattyAcids[c(81:96),]
fit<-list(r1=c(1:80))
pred<-list(r1=c(81:96))
ctrl <- trainControl(method = "LGOCV",index=fit,indexOut=pred)
model <- train(Palmitic~Stearic+Oleic+Linoleic+Linolenic+Eicosanoic,
fattyAcids,
method='nnet',
linout=TRUE,
trace=F,
maxit=10000,
skip=F,
metric="RMSE",
tuneGrid=expand.grid(.size=c(10,11,12,9),.decay=c(0.005,0.001,0.01)),
trControl = ctrl,
preProcess = c("range"))
model
forecast <- predict(model, ztest)
Blad<-function(zmienna,prognoza){
RMSE<-((sum((zmienna-prognoza)^2))/length(zmienna))^(1/2)
estymatory <- c(RMSE)
names(estymatory) <-c('RMSE')
estymatory
}
Blad(ztest$Palmitic,forecast)
The resampled estimates shown in the output of train are calculated using rows 81:96. Once train figures out the right tuning parameter settings, it refits using all the data (1:96). The model from that data is used to make the new predictions.
For this reason, the model performance
> getTrainPerf(model)
TrainRMSE TrainRsquared method
1 0.9230175 0.8364212 nnet
is worse than the other predictions:
> Blad(ztest$Palmitic,forecast)
RMSE
0.3355387
The predictions in forecast are created from a model that included those same data points, which is why it looks better.
Max

Resources