I am new to using SVM in R. I am analyzing a breast cancer dataset and would like to plot the decision boundary over two of the features, but the boundary does not appear where it is supposed to. The following is the code:
library(caret)
library(e1071)
set.seed(134)
fitc = trainControl(method = "cv", number = 10, search = "random", savePredictions = T)
modfitsvm = train(diagnosis~., data = trainset, method = "svmLinear", trControl = fitc, tuneLength = 10)
plot(modfitsvm)
bestmod = svm(diagnosis~., data = trainset, kernel = "linear", cost = 0.18)
# train data error
svm.ytrain<-trainset$diagnosis
svm.predy<-predict(bestmod, trainset)
mean(svm.ytrain!=svm.predy)
# test data error
svm.ytest<-testset$diagnosis
svm.predtesty<-predict(bestmod, testset)
mean(svm.ytest!=svm.predtesty)
svmtab = confusionMatrix(svm.predtesty,testset$diagnosis)
fourfoldplot(svmtab$table, conf.level = 0, margin = 1, main=paste("SVM Testing(",round(svmtab$overall[1]*100,3),"%)",sep=""))
I am not seeing the boundary, even though the model achieved 94% accuracy.
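For reference, the kind of plot I am after is sketched below. This is only a rough sketch, not my actual plotting code: radius_mean and texture_mean are placeholder column names for whichever two features are plotted, and the SVM is refit on just those two features so that its boundary lives in the same two-dimensional space as the plot (the model above uses all features, so its boundary cannot be drawn directly against only two of them).
# Rough sketch: refit the SVM on two features only and colour a prediction grid
# ("radius_mean" and "texture_mean" are placeholder names; substitute your own)
library(e1071)
library(ggplot2)
svm2d <- svm(diagnosis ~ radius_mean + texture_mean,
             data = trainset, kernel = "linear", cost = 0.18)
grid <- expand.grid(
  radius_mean  = seq(min(trainset$radius_mean),  max(trainset$radius_mean),  length.out = 200),
  texture_mean = seq(min(trainset$texture_mean), max(trainset$texture_mean), length.out = 200)
)
grid$pred <- predict(svm2d, grid)
ggplot() +
  geom_tile(data = grid, aes(radius_mean, texture_mean, fill = pred), alpha = 0.3) +
  geom_point(data = trainset, aes(radius_mean, texture_mean, colour = diagnosis))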
I'm having doubts about the hyperparameter tuning step. I think I might be confusing things.
I split my dataset into training (70%), validation (15%) and testing (15%). Below is the code used for regression with Random Forest.
1. Training
I perform the initial training with the dataset, as follows:
library(ranger)
library(caret)  # for RMSE()
rf_model <- ranger(y ~ .,
                   data = train,
                   num.trees = 500,
                   mtry = 5,
                   min.node.size = 100,
                   importance = "impurity")
I get the R squared and the RMSE using the actual and predicted data from the training set.
pred_rf <- predict(rf_model, train)
pred_rf <- data.frame(pred = pred_rf$predictions, obs = train$y)
RMSE_rf <- RMSE(pred_rf$pred, pred_rf$obs)
R2_rf <- (cor(pred_rf$pred, pred_rf$obs))^2
2. Parameter optimization
Using a parameter grid, the best model is chosen based on performance.
hyper_grid <- expand.grid(mtry = seq(3, 12, by = 4),
sample_size = c(0.5,1),
min.node.size = seq(20, 500, by = 100),
MSE = as.numeric(NA),
R2 = as.numeric(NA),
OOB_RMSE = as.numeric(NA)
)
Then I search for the best model according to, for example, the smallest OOB error.
for (i in 1:nrow(hyper_grid)) {
  model <- ranger(formula = y ~ .,
                  data = train,
                  num.trees = 500,
                  mtry = hyper_grid$mtry[i],
                  sample.fraction = hyper_grid$sample_size[i],
                  min.node.size = hyper_grid$min.node.size[i],
                  importance = "impurity",
                  replace = TRUE,
                  oob.error = TRUE,
                  verbose = TRUE
  )
  hyper_grid[i, "MSE"] <- model$prediction.error
  hyper_grid[i, "R2"] <- model$r.squared
  hyper_grid[i, "OOB_RMSE"] <- sqrt(model$prediction.error)
}
Choose the best performing model
x <- hyper_grid[which.min(hyper_grid$OOB_RMSE), ]
The final model:
rf_fit_model <- ranger(formula = y ~ .,
                       data = train,
                       num.trees = 100,
                       mtry = x$mtry,
                       sample.fraction = x$sample_size,
                       min.node.size = x$min.node.size,
                       oob.error = TRUE,
                       verbose = TRUE,
                       importance = "impurity"
)
Perform model prediction with validation data
rf_predict_val <- predict(rf_fit_model, validation)
rf_predict_val <- data.frame(pred = rf_predict_val$predictions, obs = validation$y)
RMSE_rf_fit <- RMSE(rf_predict_val$pred, rf_predict_val$obs)
R2_rf_fit <- (cor(rf_predict_val$pred, rf_predict_val$obs))^2
Well, now I wonder if I should replicate the model evaluation with the test data.
The fact is that the validation data is being used only as a "test" and is not effectively helping to validate the model.
I've used cross-validation with other methods, but I'd like to do it more manually. One of the reasons is that CV via caret is very slow.
Am I on the right track?
The code using caret, which is very slow:
ctrl <- trainControl(method = "repeatedcv",
repeats = 10)
grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = 1000,
shrinkage = c(0.01,0.1),
n.minobsinnode = 50)
gbmTune <- train(y ~ ., data = train,
method = "gbm",
tuneGrid = grid,
verbose = TRUE,
trControl = ctrl)
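For comparison, the kind of manual cross-validation I have in mind is roughly the sketch below (only a rough sketch, assuming a numeric response y in train and caret's createFolds() and RMSE() helpers):
# Rough sketch of a manual 5-fold CV for a single ranger configuration
library(ranger)
library(caret)  # createFolds(), RMSE()
set.seed(123)
folds <- createFolds(train$y, k = 5)  # list of held-out row indices
cv_rmse <- sapply(folds, function(idx) {
  fit <- ranger(y ~ ., data = train[-idx, ],
                num.trees = 500, mtry = 5, min.node.size = 100)
  RMSE(predict(fit, train[idx, ])$predictions, train$y[idx])
})
mean(cv_rmse)  # CV estimate of RMSE for this configuration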
I am running a kNN classification with PCA as preprocessing (threshold = 0.8). However, when training seems to have finished and I run the varImp() function, it does not give the importance of the principal components (as it should). When I inspect the model fit, the preProcess element is also NULL.
It seems that the PCA is not being applied.
The code for training control and training that I am using is:
ctrl <- trainControl(method = "cv",
number = 10,
summaryFunction = defaultSummary,
preProcOptions = list(thresh=0.8),
classProbs = TRUE
)
set.seed(150)
knn.fit = train(defective~.,
data = fTR,
method = "knn",
preProcess = 'pca',
tuneGrid = data.frame(k = 4),
#tuneGrid = data.frame(k = seq(3,5,1)),
#tuneLength = 10,
trControl = ctrl,
metric = "Kappa")
What am I doing wrong?
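As a sanity check, I would expect to be able to reproduce the PCA step outside train() with something like the rough sketch below (assuming the columns of fTR other than defective are the numeric predictors):
# Run the PCA preprocessing directly on the predictors to see how many
# components a 0.8 variance threshold keeps (check outside train())
pp <- preProcess(fTR[, setdiff(names(fTR), "defective")],
                 method = "pca", thresh = 0.8)
pp
head(predict(pp, fTR[, setdiff(names(fTR), "defective")]))  # PC1, PC2, ...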
I am trying to produce ROC curves for a test dataset using the MLeval package.
# Load data
train <- readRDS("Data/train.rds")
test <- readRDS("Data/test.rds")
# Create factor class
train$class <- factor(ifelse(train$class == 1, 'yes', 'no'))
# Set up control function for training
ctrl <- trainControl(method = "cv",
                     number = 5,
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T,
                     verboseIter = F)
gbmGrid <- expand.grid(interaction.depth = 10,
n.trees = 18000,
shrinkage = 0.01,
n.minobsinnode = 4)
# Build using a gradient boosted machine
set.seed(5627)
gbm <- train(class ~ .,
data = train,
method = "gbm",
metric = "ROC",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
# Predict results -
pred <- predict(gbm, newdata = test, type = "prob")[,"yes"]
roc <- evalm(data.frame(pred, test$class))
I have used the following post, ROC curve for the testing set using Caret package, to try to plot the ROC from the test data using MLeval, and yet I get the following error message:
MLeval: Machine Learning Model Evaluation
Input: data frame of probabilities of observed labels
Error in names(x) <- value :
'names' attribute [3] must be the same length as the vector [2]
Can anyone please help? Thanks.
Please provide a reproducible example with sample data so we can replicate the error and test for solutions (i.e., we cannot access train.rds or test.rds).
Nevertheless, the below may fix your issue: evalm() appears to expect one probability column per class plus the observed labels, so passing only the "yes" column leaves it one column short.
pred <- predict(gbm, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))
Two questions:
1. Visualizing the error of a model
2. Calculating the log loss
(1) I'm trying to tune a multinomial GBM classifier, but I'm not sure how to interpret the output. I understand that log loss is meant to be minimized, but in the plot below, for any range of iterations or trees, it only appears to increase.
inTraining <- createDataPartition(final_data$label, p = 0.80, list = FALSE)
training <- final_data[inTraining,]
testing <- final_data[-inTraining,]
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE, savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:5)*2, .n.trees = (1:10)*25, .shrinkage = 0.1, .n.minobsinnode = 10)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, metric = "ROC", tuneGrid = gbmGrid1)
plot(gbmFit1)
--
(2) On a related note, when I try to call mnLogLoss directly I get this error, which keeps me from quantifying the loss:
mnLogLoss(testing, levels(testing$label)) : 'lev' cannot be NULL
I suspect you set the learning rate too high. So using an example dataset:
final_data = iris
final_data$label=final_data$Species
final_data$Species=NULL
inTraining <- createDataPartition(final_data$label, p = 0.80, list = FALSE)
training <- final_data[inTraining,]
testing <- final_data[-inTraining,]
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3,
verboseIter = FALSE, savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:10)*10, .shrinkage = 0.1, .n.minobsinnode = 10)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, tuneGrid = gbmGrid1,metric="logLoss")
plot(gbmFit1)
A bit different from yours, but you can see the upward trend after about 20 trees. It really depends on your data, but with a high learning rate you reach a minimum very quickly, and anything after that introduces noise. You can see the illustration in Boehmke's book and also check out a more statistics-based discussion.
Let's lower the learning rate and you can see:
gbmGrid1 <- expand.grid(.interaction.depth = 1:3, .n.trees = (1:10)*10, .shrinkage = 0.01, .n.minobsinnode = 10)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, tuneGrid = gbmGrid1,metric="logLoss")
plot(gbmFit1)
Note that with the lower learning rate you will most likely need more trees to reach a loss as low as what you saw with the first fit.
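As for (2), mnLogLoss() is meant to be used as a summary function, so it expects a data frame with an obs column plus one probability column per class, and the class levels passed through lev. A rough sketch using the gbmFit1 fitted above (an illustration only, not tested on your data):
# Class probabilities on the hold-out set; columns are named after the class levels
probs <- predict(gbmFit1, newdata = testing, type = "prob")
# mnLogLoss() wants an 'obs' column, one probability column per level, and 'lev'
loss_df <- data.frame(obs = testing$label, probs)
mnLogLoss(loss_df, lev = levels(testing$label))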
I am running kNN regression on my data and would like to:
a) cross-validate with repeatedcv to find an optimal k;
b) when building the kNN model, use PCA with a 90% variance threshold to reduce dimensionality.
library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)) %>%
data.frame()
colnames(data) = c('True', paste0('Day',1:20))
tr = data[1:15, ] #training set
tt = data[16:20,] #test set
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
#trying to find the optimal k from 1:10
trControl = train.control,
preProcess = c('scale','pca'),
metric = "RMSE",
data = tr)
My questions:
(1) I notice that someone suggested changing the PCA options in trainControl:
ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
mod <- train(Class ~ ., data = Sonar, method = "pls",
trControl = ctrl)
If I change the option in trainControl, does that mean PCA is still applied during the kNN training? Similar concern as in this question.
(2) I found another example that fits my situation. I am hoping to change the threshold to 90%, but I don't know where I can change it in caret's train function, especially since I still need the scale option.
I apologize for the tediously long description and the scattered references. Thank you in advance!
(Thank you Camille for the suggestions to make the code work!)
To answer your questions:
I notice that someone suggested changing the PCA options in trainControl:
mod <- train(Class ~ ., data = Sonar, method = "pls",trControl = ctrl)
If I change the option in trainControl, does that mean PCA is still applied during the kNN training?
Yes, if you do it with:
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3,preProcOptions = list(thresh = 0.9))
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = train.control,
preProcess = c('scale','pca'),
metric = "RMSE",
data = tr)
You can check under preProcess:
k$preProcess
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
This also answers (2), which is to use preProcess separately:
mdl = preProcess(tr[,-1],method=c("scale","pca"),thresh=0.9)
mdl
Created from 15 samples and 20 variables
Pre-processing:
- centered (20)
- ignored (0)
- principal component signal extraction (20)
- scaled (20)
PCA needed 9 components to capture 90 percent of the variance
train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)
k = train(True ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = train.control,
metric = "RMSE",
data = predict(mdl,tr))
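One follow-up note on the second approach: the hold-out set tt has to go through the same preProcess object before predicting, roughly like this short sketch:
# Transform the hold-out set with the same preprocessing object before predicting
pred_tt <- predict(k, newdata = predict(mdl, tt))
RMSE(pred_tt, tt$True)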