I have trained a binary SVM classifier and made predictions like the following:
classifier = svm(formula = type ~ .,
data = train,
type = 'C-classification',
kernel = 'polynomial')
y_pred = predict(classifier, newdata = test[1:57])
The label I am training against (type) is a factor, and the prediction (y_pred) is returned as a factor as well. How can I obtain the probabilities (or decision scores) behind these predictions so that I can produce a ROC curve?
To solve this problem, probability = TRUE needs to be specified both when constructing the classifier and when making predictions:
classifier = svm(formula = type ~ .,
data = train,
type = 'C-classification',
probability=TRUE,
kernel = 'polynomial')
y_pred = predict(classifier, newdata = test[1:57], probability=TRUE)
Then attr() can be used to retrieve the probability scores:
prob = as.data.frame(attr(y_pred, "probabilities"))
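With those scores, a ROC curve can be drawn, for example with the pROC package (a minimal sketch; "positive_level" stands in for whichever level of type you treat as the positive class - the columns of prob are named after the factor levels):
library(pROC)
roc_obj <- roc(response = test$type, predictor = prob[["positive_level"]])
plot(roc_obj, print.auc = TRUE)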
I am new to data analysis with R, so any help is appreciated.
I have a dataset with some explanatory variables and one target variable. The target variable takes only the values Yes and No, so I would like to use logistic regression for model fitting.
This is how I plot a ROC curve:
myModel = train(
myTarget ~ .,
myTrainData,
method = "glm",
metric = "ROC",
trControl = myControl,
na.action = na.pass
)
myPred = predict(myModel , newdata = myTestData, type="prob")
eval <- evalm(data.frame(myPred, myTestData$myTarget))
eval$roc
Now, I would like to find the sensitivity for a given alpha value (Type I error),
and show the information like the following if possible. How can I achieve that?
confusionMatrix(myPred, reference = myTestData$myTarget)
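One way to read off the sensitivity at a fixed Type I error is via pROC's coords(), since an alpha level corresponds to a specificity of 1 - alpha. A sketch, assuming "Yes" is the positive class and myPred holds the class probabilities as above:
library(pROC)
rocObj <- roc(myTestData$myTarget, myPred$Yes)
# sensitivity at alpha = 0.05, i.e. specificity = 0.95
coords(rocObj, x = 0.95, input = "specificity", ret = c("specificity", "sensitivity"))
# confusionMatrix() expects class labels rather than probabilities:
myClassPred <- predict(myModel, newdata = myTestData, type = "raw")
confusionMatrix(myClassPred, reference = myTestData$myTarget)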
I am using the Stata dataset ANES.dta, which contains information about the 2000 presidential election in the USA. I built two models on this dataset - one logit and one LPM - and I want to compare them using the following goodness-of-fit measures: accuracy, sensitivity and specificity.
I am new to R (I have mainly used Stata so far), which is why I am wondering: is it normal to get exactly the same values in the confusion matrices of a logit model and an LPM model based on the same data? Am I doing something wrong?
rm(list=ls())
library(foreign)
dat <- read.dta("ANES.dta", convert.factors = FALSE)
dat_clear <- na.omit(dat)
head(dat_clear)
#Logit model
m1_logit <- glm(vote ~ gender + income + pro_choice ,
data = dat_clear, family = binomial(link = "logit") ,
na.action = na.omit)
summary(m1_logit)
#LPM
m2_lpm <- lm(vote ~ gender + income + pro_choice,
data = dat_clear, na.action = na.omit)
summary(m2_lpm)
#Confusion matrix for logit model
dat_clear$prediction_log <- predict(m1_logit, newdata = dat_clear, type = "response")
dat_clear$vote_pred_log <- as.numeric(dat_clear$prediction_log > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_log)
#Confusion matrix for LPM model
dat_clear$prediction_lpm <- predict(m2_lpm, newdata = dat_clear, type = "response")
dat_clear$vote_pred_lpm <- as.numeric(dat_clear$prediction_lpm > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_lpm)
This is what the resulting confusion matrices look like.
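To compare the two models on the measures mentioned above, the tables can be turned into accuracy, sensitivity and specificity with a small helper. A sketch, assuming vote is coded 0/1 with 1 as the positive outcome:
cm_stats <- function(tab) {
  tn <- tab["0", "0"]; fp <- tab["0", "1"]
  fn <- tab["1", "0"]; tp <- tab["1", "1"]
  c(accuracy    = (tp + tn) / sum(tab),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
cm_stats(table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_log))
cm_stats(table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_lpm))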
I am trying to find the AUC on the training data for my logistic regression model fitted with glm.
I split the data into a train and a test set, fitted a logistic regression model using glm, computed the predicted values, and am now trying to find the AUC:
library(pROC)                                 # roc() below comes from the pROC package
d <- read.csv(file.choose(), header = TRUE)
set.seed(12345)
train <- runif(nrow(d)) < .5                  # logical indicator for the training rows
table(train)
d_train <- d[train, ]
fit <- glm(y ~ ., family = binomial, data = d_train)
d_train$phat <- predict(fit, type = "response")   # in-sample predicted probabilities
g <- roc(y ~ phat, data = d_train)
plot(g, print.auc = TRUE)                     # print.auc is an argument of plot.roc()
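If only the number is needed, the AUC of the pROC object can also be printed directly:
auc(g)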
Another user-friendly option is to use the caret library, which makes it pretty straightforward to fit and compare regression/classification models in R. The following example code uses the GermanCredit dataset to predict credit worthiness using a logistic regression model. The code is adapted from this blog: https://www.r-bloggers.com/evaluating-logistic-regression-models/.
library(caret)
## example from https://www.r-bloggers.com/evaluating-logistic-regression-models/
data(GermanCredit)
## 60% training / 40% test data
trainIndex <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)
GermanCreditTrain <- GermanCredit[trainIndex, ]
GermanCreditTest <- GermanCredit[-trainIndex, ]
## logistic regression based on 10-fold cross-validation
trainControl <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
fit <- train(
form = Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical,
data = GermanCreditTrain,
trControl = trainControl,
method = "glm",
family = "binomial",
metric = "ROC"
)
## AUC ROC for training data
print(fit)
## AUC ROC for test data
## See https://topepo.github.io/caret/measuring-performance.html#measures-for-class-probabilities
predictTest <- data.frame(
obs = GermanCreditTest$Class, ## observed class labels
predict(fit, newdata = GermanCreditTest, type = "prob"), ## predicted class probabilities
pred = predict(fit, newdata = GermanCreditTest, type = "raw") ## predicted class labels
)
twoClassSummary(data = predictTest, lev = levels(predictTest$obs))
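To also draw the test-set ROC curve rather than just the summary numbers, the predicted probabilities in predictTest can be passed to pROC (a sketch; "Bad" is the first level of Class in GermanCredit and therefore the event of interest for twoClassSummary):
library(pROC)
rocTest <- roc(response = predictTest$obs, predictor = predictTest$Bad)
plot(rocTest, print.auc = TRUE)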
I like using the performance() function from the ROCR library.
library(ROCR)
# responsev = response variable
d.prediction <- prediction(predict(fit, type = "response"), train$responsev)
d.performance <- performance(d.prediction, measure = "tpr", x.measure = "fpr")
d.test.prediction <- prediction(predict(fit, newdata = d.test, type = "response"), d.test$responsev)
d.test.performance <- performance(d.test.prediction, measure = "tpr", x.measure = "fpr")
# What is the actual numeric performance of our model?
performance(d.prediction,measure="auc")
performance(d.test.prediction,measure="auc")
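To draw the two curves and pull the AUC out as a plain number (a small addition; add = TRUE overlays the test curve on the training curve):
plot(d.performance, col = "blue")
plot(d.test.performance, col = "red", add = TRUE)
performance(d.prediction, measure = "auc")@y.values[[1]]
performance(d.test.prediction, measure = "auc")@y.values[[1]]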
I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = T,
summaryFunction = twoClassSummary
)
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling that bootstrapping uses to generate the 'test' sets, I am not sure how caret calculates the ROC value or how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune)) {
  model$pred <-
    model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]
}
Please advise.
Thanks!
First an explanation:
If you are not going to check how each possible hyperparameter combination predicted on each sample in each re-sample, you can set savePredictions = "final" in trainControl to save space:
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = "final",
summaryFunction = twoClassSummary
)
After running the model:
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
The results of interest are in model$pred.
Here you can check how many samples were held out in each re-sample (number = 10 re-samples in this example):
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides predictions from rows not used to build the model.
nrow(my_data) #208
83 held-out rows out of 208 makes sense for the boot632 test samples, since on average about 37% of the rows are left out of each bootstrap re-sample.
Now to build the ROC curve. You may opt for one of several options here:
- average the probability for each sample and use that (this is usual for CV, since every sample is held out the same number of times, but it can also be done with bootstrapping); see the sketch after the ggplot example below,
- plot everything as is, without averaging,
- plot a ROC curve for each re-sample.
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
plot ROC:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
predictor = for_lift$xgbTree,
levels = c("M", "R")),
lwd=1.5)
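The AUC of these pooled hold-out predictions can be printed as well (it will be roughly comparable to the ROC metric caret reports, which is an average over re-samples):
pROC::auc(pROC::roc(response = for_lift$Class,
                    predictor = for_lift$xgbTree,
                    levels = c("M", "R")))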
You can also do this with ggplot2; to do so, I find it easiest to make a lift object using caret's lift function:
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
The class argument specifies which class's probability was used.
library(ggplot2)
ggplot(lift_obj$data)+
geom_line(aes(1-Sp , Sn, color = liftModelVar))+
scale_color_discrete(guide = guide_legend(title = "method"))
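And a rough sketch of the first option listed above - averaging the class-R probability per row across the re-samples; rowIndex in model$pred identifies the original row of my_data:
avg_pred = aggregate(R ~ rowIndex + obs, data = model$pred, FUN = mean)
pROC::plot.roc(pROC::roc(response = avg_pred$obs,
                         predictor = avg_pred$R,
                         levels = c("M", "R")),
               lwd = 1.5)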
I would like to make sure that I am using the prediction method correctly here; maybe I am misinterpreting the parameter s? My intent is to use the best lambda obtained from cross-validation to make my final predictions on a holdout dataset.
# set alpha to 1 for lasso
cv.fit <- cv.glmnet(x = mat, y = class, family = "binomial", alpha = 1, nfolds = 10)
val.m <- as.matrix(val.df[, -match(c("Id", "class"), names(val.df))])
preds <- predict(cv.fit, newx = val.m, type = "response", s = cv.fit$lambda.min)
It would be nice if someone could give me reassurance.
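For reference, cv.glmnet's predict method also accepts the character shortcut s = "lambda.min", and the holdout performance can then be checked with pROC. A sketch, assuming val.df$class holds the true 0/1 labels:
preds <- predict(cv.fit, newx = val.m, type = "response", s = "lambda.min")
library(pROC)
auc(roc(response = val.df$class, predictor = as.numeric(preds)))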