How to avoid getting grid arrangements with plotROCCurves()?

I want to plot several ROC curves in one plot so that I can compare them. An easy-to-follow tutorial for this can be found on MLR-org.com, where you find this example:
library(mlr)

# split the sonar task into training and test sets
n = getTaskSize(sonar.task)
train.set = sample(n, size = round(2/3 * n))
test.set = setdiff(seq_len(n), train.set)

# two learners with probability predictions
lrn1 = makeLearner("classif.lda", predict.type = "prob")
mod1 = train(lrn1, sonar.task, subset = train.set)
pred1 = predict(mod1, task = sonar.task, subset = test.set)

lrn2 = makeLearner("classif.ksvm", predict.type = "prob")
mod2 = train(lrn2, sonar.task, subset = train.set)
pred2 = predict(mod2, task = sonar.task, subset = test.set)

# FPR/TPR data for both learners, then the ROC plot
df = generateThreshVsPerfData(list(lda = pred1, ksvm = pred2),
                              measures = list(fpr, tpr))
plotROCCurves(df)
This should produce a single plot with both ROC curves in it, but instead I always get a grid of separate panels. Can anybody help me get everything into one plot?

Related

ROC-AUC plot is in elbow shape

I want to create a ROC curve for my logistic regression model, but my current code is not giving me the traditional or desired result. Below is the code:
library(DMwR)   # assumption: SMOTE() comes from the DMwR package here
library(pROC)

# oversample with SMOTE
over3 <- SMOTE(pol ~ ., data = train, perc.under = 150)
over3

set.seed(645)
logit_model4 <- glm(pol ~ ., data = over3, family = "binomial")
logit_model4
summary(logit_model4)

# predicted probabilities on the test set, then dichotomized at 0.5
fitted.results4 <- predict(logit_model4, test, type = "response")
fitted.results4
fitted.results4 <- ifelse(fitted.results4 > 0.5, 1, 0)
fitted.results4
table(test$pol, fitted.results4)

pim <- roc(response = test$pol, predictor = fitted.results4, partial.auc = c(100, 90),
           partial.auc.correct = TRUE, percent = TRUE)
plot(pim)
The resulting figure has an elbow shape rather than the traditional ROC curve I was expecting. I hope someone can help me out.
Don't dichotomize your results!
fitted.results4 <- ifelse(fitted.results4 > 0.5, 1, 0)
The ROC curve is built by assessing every possible threshold. By dichotomizing your result, you are preventing that from happening.
Instead you should use the quantitative predicted probabilities directly:
fitted.results4 <- predict(logit_model4, test, type = "response")

library(pROC)
pim <- roc(response = test$pol, predictor = fitted.results4,
           partial.auc = c(100, 90), partial.auc.correct = TRUE, percent = TRUE)
plot(pim)
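If you also want the conventional full-range curve rather than just the partial-AUC region, a minimal sketch (not from the original answer) is to drop the partial.auc arguments; legacy.axes = TRUE is only needed if you prefer 1 - specificity on the x-axis:

roc_full <- roc(response = test$pol, predictor = fitted.results4)
plot(roc_full, legacy.axes = TRUE)  # full ROC curve, x-axis as 1 - specificity
auc(roc_full)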

Use pdp package to get probability scale partial dependence plots for all classes

I have been following the example here to create partial dependence plots, but I would like to combine the approach that produces plots for all levels of a multiclass outcome with the one that gives predictions on the probability scale (see pages 430-431).
This is my approach, but it doesn't work because pred.fun is not allowed to have a third argument:
library(e1071)
library(pdp)    # needed for partial()

iris.svm <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.75,
                cost = 0.25, probability = TRUE)

# prediction wrapper that returns the mean probability of class i
pred.prob <- function(object, newdata, i) {  # see ?predict.svm
  pred <- predict(object, newdata, probability = TRUE)
  prob.class <- attr(pred, which = "probabilities")[, i]
  mean(prob.class)
}
pred.prob(iris.svm, iris, "setosa")

pd <- NULL
for (i in 1:3) {
  tmp <- partial(iris.svm, pred.var = c("Petal.Width", "Petal.Length"),
                 pred.fun = pred.prob,
                 which.class = i, grid.resolution = 101, progress = "text")
  pd <- rbind(pd, cbind(tmp, Species = levels(iris$Species)[i]))
}
Any recommendations for how to get around this requirement or a different approach?
It looks like the package has actually been updated since the article I referred to was published. Now all you need to do is set the prob argument to TRUE and it will predict on the probability scale.
pd <- NULL
for (i in 1:3) {
  tmp <- partial(iris.svm, pred.var = c("Petal.Width", "Petal.Length"),
                 prob = TRUE,
                 which.class = i, grid.resolution = 101, progress = "text")
  pd <- rbind(pd, cbind(tmp, Species = levels(iris$Species)[i]))
}
I hope this helps someone else to avoid wasting an afternoon!
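For a quick visual check of the combined data frame, one option (not from the original answer) is a faceted heat map with ggplot2; this assumes the predictions sit in the default yhat column returned by partial():

library(ggplot2)

# one probability surface per class
ggplot(pd, aes(x = Petal.Width, y = Petal.Length, fill = yhat)) +
  geom_tile() +
  facet_wrap(~ Species) +
  labs(fill = "Probability")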

error in linear regression while using the train function in caret package

I have a data set called value that has four variables (ER is the dependent variable) and 400 observations (after removing NAs). I tried to divide the data into training and test sets and train the model using linear regression in the caret package, but I always get this warning:
In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ... :
extra argument ‘trcontrol’ is disregarded.
Below is my code:
library(caret)

ctrl_lm <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
value_rm = na.omit(value)
set.seed(1)
datasplit <- createDataPartition(y = value_rm[[1]], p = 0.8, list = FALSE)
train.value <- value_rm[datasplit,]
test.value <- value_rm[-datasplit,]
lmCVFit <- train(ER ~ ., data = train.value, method = "lm",
                 trcontrol = ctrl_lm, metric = "Rsquared")
predictedVal <- predict(lmCVFit, test.value)
modelvalues <- data.frame(obs = test.value$ER, pred = predictedVal)
lmcv.out = defaultSummary(modelvalues)
The right syntax is trControl, not trcontrol. Try this:
library(caret)
set.seed(1)
n <- 100
value <- data.frame(ER=rnorm(n), X=matrix(rnorm(3*n),ncol=3))
ctrl_lm <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
value_rm = na.omit(value)
set.seed(1)
datasplit <- createDataPartition(y = value_rm[[1]], p = 0.8, list = FALSE)
train.value <- value_rm[datasplit,]
test.value <- value_rm[-datasplit,]
lmCVFit <- train(ER~., data = train.value, method = "lm",
trControl = ctrl_lm, metric = "Rsquared")
predictedVal <- predict(lmCVFit, test.value)
modelvalues <- data.frame(obs = test.value$ER, pred = predictedVal)
( lmcv.out <- defaultSummary(modelvalues) )
# RMSE Rsquared MAE
# 1.2351006 0.1190862 1.0371477
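As a side note (not part of the original answer), the cross-validated performance of the tuned model is also stored on the train object itself, so you can compare it with the hold-out numbers above:

lmCVFit$results   # resampled RMSE, Rsquared and MAE from the 5-fold CV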

Error with prediction - ROCR package (using probabilities)

I have used "rfe" function with svm to create a model with reduced features. Then I use "predict" on test data which outputs class labels (binary), 0 class probabilities, 1 class probabilities. I then tried using prediction function, in ROCR package, on predicted probabilities and true class labels but get the following error and am not sure why as the lengths of the 2 arrays are equal:
> pred_svm <- prediction(pred_svm_2class[,2], as.numeric(as.character(y)))
Error in prediction(pred_svm_2class[, 2], as.numeric(as.character(y))) :
Number of predictions in each run must be equal to the number of labels for each run.
My code is below; the input is a small dataset with binary classification, so the code runs fast.
library("caret")
library("ROCR")
sensor6data_2class <- read.csv("/home/sensei/clustering/svm_2labels.csv")
sensor6data_2class <- within(sensor6data_2class, Class <- as.factor(Class))
set.seed("1298356")
inTrain_svm_2class <- createDataPartition(y = sensor6data_2class$Class, p = .75, list = FALSE)
training_svm_2class <- sensor6data_2class[inTrain_svm_2class,]
testing_svm_2class <- sensor6data_2class[-inTrain_svm_2class,]
trainX <- training_svm_2class[,1:20]
y <- training_svm_2class[,21]
ctrl_svm_2class <- rfeControl(functions = rfFuncs , method = "repeatedcv", number = 5, repeats = 2, allowParallel = TRUE)
model_train_svm_2class <- rfe(x = trainX, y = y, data = training_svm_2class, sizes = c(1:20), metric = "Accuracy", rfeControl = ctrl_svm_2class, method="svmRadial")
pred_svm_2class = predict(model_train_svm_2class, newdata=testing_svm_2class)
pred_svm <- prediction(pred_svm_2class[,2], y)
Thanks and appreciate your help.
This is because in the line
pred_svm <- prediction(pred_svm_2class[,2], y)
pred_svm_2class[,2] holds the predictions on the test data, while y holds the labels of the training data, so the lengths differ. Just put the test labels in a separate variable:
y_test <- testing_svm_2class[,21]
If you now call
pred_svm <- prediction(pred_svm_2class[,2], y_test)
there will be no error. Full code below:
# install.packages("caret")
# install.packages("ROCR")
# install.packages("e1071")
# install.packages("randomForest")
library("caret")
library("ROCR")
sensor6data_2class <- read.csv("svm_2labels.csv")
sensor6data_2class <- within(sensor6data_2class, Class <- as.factor(Class))
set.seed("1298356")
inTrain_svm_2class <- createDataPartition(y = sensor6data_2class$Class, p = .75, list = FALSE)
training_svm_2class <- sensor6data_2class[inTrain_svm_2class,]
testing_svm_2class <- sensor6data_2class[-inTrain_svm_2class,]
trainX <- training_svm_2class[,1:20]
y <- training_svm_2class[,21]
y_test <- testing_svm_2class[,21]
ctrl_svm_2class <- rfeControl(functions = rfFuncs , method = "repeatedcv", number = 5, repeats = 2, allowParallel = TRUE)
model_train_svm_2class <- rfe(x = trainX, y = y, data = training_svm_2class, sizes = c(1:20), metric = "Accuracy", rfeControl = ctrl_svm_2class, method="svmRadial")
pred_svm_2class = predict(model_train_svm_2class, newdata=testing_svm_2class)
pred_svm <- prediction(pred_svm_2class[,2], y_test)
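With the prediction object built correctly, the usual ROCR steps give the ROC curve and AUC; a short sketch, not part of the original answer:

# ROC curve (TPR vs FPR) and AUC from the ROCR prediction object
perf_roc <- performance(pred_svm, measure = "tpr", x.measure = "fpr")
plot(perf_roc)
perf_auc <- performance(pred_svm, measure = "auc")
perf_auc@y.values[[1]]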

L2-regularized MLR using caret, and how to make sure I am using the best model when predicting

I am trying to do L2-regularized MLR on a data set using caret. Following is what I have done so far to achieve this:
library(caret)   # glmnet must also be installed for method = "glmnet"

r_squared <- function(pred, actual) {
  mean_actual <- mean(actual)
  ss_e <- sum((pred - actual)^2)
  ss_total <- sum((actual - mean_actual)^2)
  1 - ss_e / ss_total
}

df = as.data.frame(matrix(rnorm(10000, 10, 3), 1000))
colnames(df)[1] = "response"

set.seed(753)
inTraining <- createDataPartition(df[["response"]], p = .75, list = FALSE)
training <- df[inTraining,]
testing <- df[-inTraining,]
testing_response <- base::subset(testing, select = "response")

# alpha = 0 gives a pure L2 (ridge) penalty; lambda is tuned over the grid
gridsearch_for_lambda = data.frame(alpha = 0,
                                   lambda = c(2^c(-15:15), 3^c(-15:15)))
regression_formula = as.formula(paste("response", "~", ".", sep = " "))

train_control = trainControl(method = "cv", number = 10,
                             savePredictions = TRUE, allowParallel = FALSE)
model = train(regression_formula,
              data = training,
              trControl = train_control,
              method = "glmnet",
              tuneGrid = gridsearch_for_lambda,
              preProcess = NULL)

prediction = predict(model, newdata = testing)
testing_response[["predicted"]] = prediction
r_sq = round(r_squared(testing_response[["predicted"]],
                       testing_response[["response"]]), 3)
My concern is how to be sure that the model I am using for prediction is the best one, i.e. the one with the optimally tuned lambda value.
P.S.: The data is sampled from a random normal distribution, which does not give a good R^2 value, but I just want to get the idea right.
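Regarding that concern: caret's train() stores the winning tuning parameters in model$bestTune and refits the final model on the full training set with them, and predict() on the train object uses that final model, so the prediction above already comes from the best lambda. A quick way to check (a sketch; the column names assume caret's default regression summary):

model$bestTune                                   # selected alpha / lambda
model$results[which.min(model$results$RMSE), ]   # CV performance at that lambda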
