Setting random seeds does not affect classification methods C5.0 and ctree - r

I want to compare two classification methods, ctree and C5.0, from the party and C50 libraries respectively. The comparison tests their sensitivity to the initial starting points. The test should be run 30 times; each time, the number of misclassified items is calculated and stored in a vector, and then I hope to use a t-test to see whether the two methods really differ.
library("foreign"); # for read.arff
library("party") # for ctree
library("C50") # for C5.0
trainTestSplit <- function(data, trainPercentage){
newData <- list();
all <- nrow(data);
splitPoint <- floor(all * trainPercentage);
newData$train <- data[1:splitPoint, ];
newData$test <- data[(splitPoint + 1):all, ]; # start after splitPoint so train and test do not share a row
return (newData);
}
ctreeErrorCount <- function(st,ss){
set.seed(ss);
model <- ctree(Class ~ ., data=st$train);
class <- st$test$Class;
st$test$Class <- NULL;
pre = predict(model, newdata=st$test, type="response");
errors <- length(which(class != pre)); # counting number of misclassified items
return(errors);
}
C50ErrorCount <- function(st,ss){
model <- C5.0(Class ~ ., data=st$train, seed=ss);
class <- st$test$Class;
pre = predict(model, newdata=st$test, type="class");
errors <- length(which(class != pre)); # counting number of misclassified items
return(errors);
}
compare <- function(n = 30){
data <- read.arff(file.choose());
set.seed(100);
errors = list(ctree = c(), c50 = c());
seeds <- floor(abs(rnorm(n) * 10000));
for(i in 1:n){
splitData <- trainTestSplit(data, 0.66);
errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i]);
errors$c50[i] <- C50ErrorCount(splitData, seeds[i]);
}
cat("\n\n");
cat("============= ctree Vs C5.0 =================\n");
cat(paste(errors$ctree, " ", errors$c50, "\n"))
tt <- t.test(errors$ctree, errors$c50);
print(tt);
}
The program is supposed to do the comparison, but because the error counts in the vectors never change, the t.test function produces an error. To test it I used iris from R (renaming the class column to Class) and the Wisconsin breast cancer data, which can be downloaded here, but any data can be used as long as it has a Class attribute.
The problem is that the results of both methods stay constant while I change the random seed. In theory, as described in their documentation, both functions use random seeds: ctree uses set.seed(x), while C5.0 takes an argument called seed. Unfortunately, I cannot see any effect.
Could you please tell me how to control the initial state of these functions?

ctree depends on a random seed only when you configure it to use a random selection of input variables (i.e. mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11)
For C5.0 trees, the seed is used this way:
ctrl = C5.0Control(sample=0.5, seed=ss);
model <- C5.0(Class ~ ., data=st$train, control = ctrl);
Notice that the seed is used to select a sample of the data, not within the algorithm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5)
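A minimal sketch of both points, assuming the built-in iris data stands in for the asker's data set and that the C50 and party packages are installed. The helper names c50_errors and ctree_errors are hypothetical. Each fits a tree under a given seed and counts misclassified rows; the seed can only matter because sample (for C5.0) and mtry (for ctree) switch the randomness on.

```r
library(C50)    # C5.0(), C5.0Control()
library(party)  # ctree(), ctree_control()

# C5.0: with sample = 0.5 a random half of the rows is used for training,
# and the seed controls which rows that is.
c50_errors <- function(seed, data = iris) {
  ctrl  <- C5.0Control(sample = 0.5, seed = seed)
  model <- C5.0(Species ~ ., data = data, control = ctrl)
  sum(predict(model, newdata = data) != data$Species)
}

# ctree: with mtry > 0 a random subset of the inputs is considered at each
# split, which is where set.seed() comes into play.
ctree_errors <- function(seed, data = iris) {
  set.seed(seed)
  model <- ctree(Species ~ ., data = data,
                 controls = ctree_control(mtry = 2))
  sum(predict(model, newdata = data) != data$Species)
}

seeds        <- c(1, 22, 333, 4444, 55555)  # arbitrary illustrative seeds
c50_counts   <- sapply(seeds, c50_errors)
ctree_counts <- sapply(seeds, ctree_errors)
print(rbind(seed = seeds, C5.0 = c50_counts, ctree = ctree_counts))
```

Whether the counts actually differ between seeds depends on the data; on an easy data set the trees may still coincide.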

Related

R: Understanding K-Fold Validation Correctly?

Good Afternoon.
I wanted a sanity check after doing research about k-Fold Cross-Validation. I will provide my understanding, and then provide an example of how to execute the preconceived understanding in R.
I would really appreciate any help on if I'm thinking about this incorrectly, or if my code is not reflecting my thought process / the correct procedures. Take the basic predictive modeling scenario on a continuous response variable:
Have a population dataset (xDF)
I want to split the dataset into k=10 separate parts, train a model on 9 of them (bound together), and then validate on the remaining part
I then want to loop through each validation set to observe how the model performs on segments of the data it was not trained on
If the model performance measure (RMSE for this example) on the kth-fold validation set is similar to the measures on the other nine validation sets, the model generalizes well
R Code:
#Declaring randomly sampled validation indices
ind <- sample(seq_len(nrow(xDF)), size = nrow(xDF))
n <- (nrow(xDF)/10)
nr <- nrow(xDF)
validation_ind <- split(ind, rep(1:ceiling(nr/n), each=n, length.out=nr))
#Looping through validation sets to obtain Model Performance measure of each set
RMSEsF <- double(10)
RMSEsFT <- double(10)
R2F <- double(10)
R2FT <- double(10)
rsq <- function (x, y) cor(x, y) ^ 2
for (i in 1:10){
validate = as.data.frame(xDF[unlist(validation_ind[i]),])
train = as.data.frame(xDF[unlist(validation_ind[-i]),])
rf_train = randomForest(y~.,data=train,mtry=3)
predictions_rf = predict(rf_train,validate)
predictions_rft = predict(rf_train, train)
RMSEsF[i] = RMSE(predictions_rf, validate$y)
RMSEsFT[i] = RMSE(predictions_rft, train$y)
R2F[i] = rsq(predictions_rf, validate$y)
R2FT[i] = rsq(predictions_rft, train$y)
print(".")
}
RMSEsF
RMSEsFT
Am I going about this correctly?
Many thanks in advance.
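For comparison, the procedure described above can be sketched with base R only. This is an illustration under assumptions, not the asker's setup: mtcars stands in for xDF, mpg for y, a linear model for the random forest, and rmse is a hand-written helper. Fold membership is assigned by shuffling a repeated 1..k label vector, which avoids a non-integer fold size when the row count is not divisible by 10.

```r
set.seed(42)

xDF <- mtcars                     # stand-in for the population data set
k   <- 10

# Hand-written RMSE helper (assumption: replaces the RMSE() used above)
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Give every row a fold label 1..k, in random order
fold_id <- sample(rep(seq_len(k), length.out = nrow(xDF)))

rmse_valid <- numeric(k)
rmse_train <- numeric(k)

for (i in seq_len(k)) {
  validate <- xDF[fold_id == i, ]   # held-out fold
  train    <- xDF[fold_id != i, ]   # remaining k - 1 folds
  fit <- lm(mpg ~ wt + hp, data = train)
  rmse_valid[i] <- rmse(predict(fit, validate), validate$mpg)
  rmse_train[i] <- rmse(predict(fit, train), train$mpg)
}

print(round(rmse_valid, 2))
print(round(rmse_train, 2))
```

Every row lands in exactly one validation fold, so the ten validation RMSE values are computed on disjoint data.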

How to predict in kknn function? library(kknn)

I am trying to use kknn in a loop to build a leave-one-out cross-validation for a model, and to compare that with train.kknn.
I have split the data into two parts: training (80% of the data) and test (20%). In the training data, I exclude one point per loop iteration to create LOOCV manually.
I think something goes wrong in predict(knn.fit, data.test). I have looked for how to predict with kknn in the package documentation and online, but all the examples use "summary(model)" and "table(validation...)" rather than prediction on separate test data. The code predict(model, dataset) works with train.kknn, so I thought I could use similar arguments with kknn.
I am not sure whether kknn has such a prediction function. If it does, what arguments should I give?
I look forward to your suggestions. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
# train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
stop("invalid type for prediction")) : EXPR must be a length 1
vector
What I am actually trying to do is compare train.kknn with kknn + loop on the results of leave-one-out CV. I have a few more questions:
1) In kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) In train.kknn: I split the data, use 80% of the whole data, and intend to use the remaining 20% for prediction. Is this a correct, common practice?
3) Or should I just use the original data (the whole data set) for train.kknn, and create a loop with data[-i,] for training and data[i,] for validation in kknn, so that the two are counterparts?
I find that if I use the training data in train.kknn and then predict on the test data set, the best k and kernel are selected and used directly to generate the predicted values for the test data set.
In contrast, if I use kknn and loop over different k values, the model generates predictions for the test data set each time k changes. In the end, with kknn + loop, the best k is selected by the best actual prediction accuracy on the test data. In short, the best k that train.kknn selects may not work best on the test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)
predict(knn.fit, type="prob")
The predict command also works on objects returned by train.kknn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by @vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
train.data <- data.train[-i,]
validation.data <- data.train[i,]
knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
kernel = "rectangular", scale = TRUE)
pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
# pred.kknn
# pred.train.kknn 1 2
# 0 374 14
# 1 9 103
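On question 1), kknn takes the test set as its third argument, so a separate test set can be scored by passing it there and reading the fitted values off the result. A minimal sketch, assuming iris stands in for the real data and k = 7 is an arbitrary choice:

```r
library(kknn)

set.seed(1)
train_rows <- sample(nrow(iris), 120)   # 80% of the rows for training
data.train <- iris[train_rows, ]
data.test  <- iris[-train_rows, ]       # remaining 20% for testing

# The third argument is the data to predict on
knn.fit <- kknn(Species ~ ., data.train, data.test, k = 7,
                kernel = "rectangular", scale = TRUE)

pred.test <- fitted(knn.fit)            # predicted class for each test row
print(table(pred.test, data.test$Species))
```

Calling kknn once per candidate k with the same train and test arguments gives the per-k test predictions described in the question.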

Custom caret metric for "precision at k" by groups

What is the proper way to create a custom metric function to use in caret::train that contains an argument and can summarize subsets of the training data?
Imagine we have credit score and loan data and would like to train a model to predict the top lending prospects within different categories of loans (home mortgage, car loan, student loan, etc.) We have a limited amount of money and want to diversify our portfolio, so we want to identify a handful of low-risk loans to make in each category.
As an example, we can use the GermanCredit data from the caret package. In this training data, each loan is classified as either "Good" or "Bad". After rearranging some columns, we have the column Purpose that identifies the type of loan requested.
## Load packages
library(data.table); library(caret); library(xgboost); library(Metrics)
## Load data and convert dependent variable (Class) to factor
data(GermanCredit)
setDT(GermanCredit, keep.rownames=TRUE)
GermanCredit[, `:=`(rn=as.numeric(rn), Class=factor(Class, levels=c("Good", "Bad")))]
## Now we need to collapse a few columns...
## - Columns containing purpose for getting loan
colsPurpose <- names(GermanCredit)[names(GermanCredit) %like% "Purpose."]
## - Replace purpose columns with a single factor column
GermanCredit[, Purpose:=melt(GermanCredit, id.var="rn", measure.vars=colsPurpose)[
value==1][order(rn), factor(sub("Purpose.", "", variable))]]
## - Drop purpose columns
GermanCredit[, (colsPurpose) := NULL] # parentheses make data.table treat colsPurpose as a vector of column names
Now we need to create the custom metric function. Something like precision at k (where k is the number of loans we'd like to make in each category) averaged over groups seems appropriate, but I am open to suggestions. In any case, the function should look something like this:
twoClassGroup <- function (data, lev=NULL, model=NULL, k, ...) {
if(length(levels(data$obs)) > 2)
stop(paste("Your outcome has", length(levels(data$obs)),
"levels. The twoClassGroup() function isn't appropriate."))
if (!all(levels(data$pred) == levels(data$obs)))
stop("levels of observed and predicted data do not match")
## [subset the data, probably using data$rowIndex]
## [calculate the metrics, based on data$pred and data$obs]
## [return a named vector of metrics]
}
Finally, we can train the model.
## Train a model (just an example; may or may not be appropriate for this problem)
creditModel <- train(
Class ~ . - Purpose, data=GermanCredit, method="xgbTree",
trControl=trainControl(
method="cv", number=6, returnResamp="none", summaryFunction=twoClassGroup,
classProbs=TRUE, allowParallel=TRUE, verboseIter=TRUE),
tuneGrid = expand.grid(
nrounds=500, max_depth=6, eta=0.02, gamma=0, colsample_bytree=1, min_child_weight=6),
metric="someCustomMetric", preProc=c("center", "scale"))
## Add predictions
GermanCredit[, `:=`(pred=predict(creditModel, GermanCredit, type="raw"),
prob=predict(creditModel, GermanCredit, type="prob")[[levels(creditModel)[1]]])]
Questions
How do I pass the value of k to twoClassGroup from the train call? Adding it within the main function arguments doesn't work, nor does adding it within trControl or tuneGrid.
How do I subset the data within twoClassGroup in order to calculate the model precision for the top k values within each value of Purpose? The data object within the twoClassGroup function is not the same as the one passed to the original train function.
This attempt mostly works, but I'm hoping someone can share a better method. Rather than passing dt and k arguments from train, they're "hardcoded" in twoClassGroup. Also, the value from Metrics::mapk seems very low, although the resulting model does appear to pick the best loan prospects.
library(Metrics)
twoClassGroup <- function (data, lev=NULL, model=NULL, dt=GermanCredit, k=10) {
if(length(levels(data$obs)) > 2)
stop(paste("Your outcome has", length(levels(data$obs)),
"levels. The twoClassGroup() function isn't appropriate."))
if (!all(levels(data$pred) == levels(data$obs)))
stop("levels of observed and predicted data do not match")
data <- data.table(data, group=dt[data$rowIndex, Purpose])
## You can ignore these extra metrics...
## <-----
sens <- sensitivity(data$pred, data$obs, positive=lev[1])
spec <- specificity(data$pred, data$obs, positive=lev[1])
precision <- posPredValue(data$pred, data$obs)
recall <- sens
Fbeta <- function(precision, recall, beta=1) {
val <- (1+beta^2)*(precision*recall)/(precision*beta^2 + recall)
if(is.nan(val)) val <- 0
return(val)
}
F0.5 <- Fbeta(precision, recall, beta=0.5)
F1 <- Fbeta(precision, recall, beta=1)
F2 <- Fbeta(precision, recall, beta=2)
## ----->
## This is the important one...
mapk <- data[, .(obs=list(obs), pred=list(pred)), by=group][, mapk(k, obs, pred)]
return(c(sensitivity=sens, specificity=spec, F0.5=F0.5, F1=F1, F2=F2, mapk=mapk))
}
In the train call from the original post, the value of metric would be "mapk" rather than "someCustomMetric".
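One way to avoid hardcoding dt and k is a closure: a factory function takes k and the grouping vector and returns a function with the (data, lev, model) signature that caret calls. The sketch below is an assumption-laden illustration in base R: make_precision_at_k and the metric name precAtK are hypothetical, it computes plain precision at k per group rather than Metrics::mapk, and it relies on classProbs=TRUE so that data carries a probability column named after the first class level.

```r
# Hypothetical factory: k and groups are captured by the returned function
make_precision_at_k <- function(k, groups) {
  function(data, lev = NULL, model = NULL) {
    grp  <- groups[data$rowIndex]   # group of each held-out row
    prob <- data[[lev[1]]]          # predicted probability of the first class
    rows_by_group <- split(seq_len(nrow(data)), grp, drop = TRUE)
    per_group <- vapply(rows_by_group, function(rows) {
      ranked <- rows[order(prob[rows], decreasing = TRUE)]
      top    <- ranked[seq_len(min(k, length(ranked)))]
      mean(data$obs[top] == lev[1]) # share of the top k that are truly lev[1]
    }, numeric(1))
    c(precAtK = mean(per_group))    # average over groups
  }
}

# Toy check with two groups of three loans each
toy <- data.frame(
  obs = factor(c("Good", "Bad", "Good", "Good", "Good", "Bad"),
               levels = c("Good", "Bad")),
  Good = c(0.9, 0.8, 0.1, 0.7, 0.6, 0.2),
  rowIndex = 1:6)
toy_summary <- make_precision_at_k(k = 2, groups = rep(c("a", "b"), each = 3))
toy_result  <- toy_summary(toy, lev = c("Good", "Bad"))
print(toy_result)
```

In the train call this would be passed as summaryFunction=make_precision_at_k(k=10, groups=GermanCredit$Purpose) together with metric="precAtK".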

Variable importance for support vector machine and naive Bayes classifiers in R

I’m working on building predictive classifiers in R on a cancer dataset.
I’m using random forest, support vector machine and naive Bayes classifiers. I’m unable to calculate variable importance for the SVM and NB models;
I end up receiving the following error.
Error in UseMethod("varImp") :
no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
I would greatly appreciate it if anyone could help me.
Given
library(e1071)
model <- svm(Species ~ ., data = iris)
class(model)
# [1] "svm.formula" "svm"
library(caret)
varImp(model)
# Error in UseMethod("varImp") :
# no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
methods(varImp)
# [1] varImp.bagEarth varImp.bagFDA varImp.C5.0* varImp.classbagg*
# [5] varImp.cubist* varImp.dsa* varImp.earth* varImp.fda*
# [9] varImp.gafs* varImp.gam* varImp.gbm* varImp.glm*
# [13] varImp.glmnet* varImp.JRip* varImp.lm* varImp.multinom*
# [17] varImp.mvr* varImp.nnet* varImp.pamrtrained* varImp.PART*
# [21] varImp.plsda varImp.randomForest* varImp.RandomForest* varImp.regbagg*
# [25] varImp.rfe* varImp.rpart* varImp.RRF* varImp.safs*
# [29] varImp.sbf* varImp.train*
methods(varImp) lists no varImp.svm, hence the error. You might want to have a look at this post on Cross Validated, too.
If you use R, variable importance can be calculated with the Importance method in the rminer package. This is my sample code:
library(rminer)
M <- fit(y~., data=train, model="svm", kpar=list(sigma=0.10), C=2)
svm.imp <- Importance(M, data=train)
For details, refer to https://cran.r-project.org/web/packages/rminer/rminer.pdf
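If adding another package is not an option, a model-agnostic alternative is permutation importance: shuffle one predictor at a time and measure how much accuracy drops. This is a different technique from the ones in these answers, sketched here on iris with e1071 under the assumption that accuracy on the training data is an acceptable yardstick for illustration.

```r
library(e1071)

set.seed(1)
model    <- svm(Species ~ ., data = iris)
base_acc <- mean(predict(model, iris) == iris$Species)

predictors <- setdiff(names(iris), "Species")

# Accuracy drop after shuffling each predictor in turn
perm_importance <- sapply(predictors, function(v) {
  shuffled <- iris
  shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and Species
  base_acc - mean(predict(model, shuffled) == iris$Species)
})

print(sort(perm_importance, decreasing = TRUE))
```

A larger drop suggests the model leans more heavily on that predictor. Repeating the shuffle several times and averaging gives steadier numbers.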
I have created a loop that iteratively removes one predictor at a time and captures, in a data frame, various performance measures derived from the confusion matrix. This is not meant to be a one-size-fits-all solution (I don't have the time for that), but it should not be difficult to modify.
Make sure that the predicted variable is last in the data frame.
I mainly needed specificity values from the models. By removing one predictor at a time I can evaluate the importance of each predictor: the predictor whose removal gives the model the smallest specificity is the most important one. You need to decide which indicator you will base importance on.
You can also add another for loop inside to switch between kernels (linear, polynomial, radial), but you might have to account for the other parameters, e.g. gamma. Replace "label_fake" with your target variable and df_final with your data frame.
SVM version:
set.seed(1)
varimp_df <- NULL # df with results
ptm1 <- proc.time() # Start the clock!
for(i in 1:(ncol(df_final)-1)) { # the last var is the dep var, hence the -1
smp_size <- floor(0.70 * nrow(df_final)) # 70/30 split
train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
training <- df_final[train_ind, -c(i)] # receives all the df less 1 var
testing <- df_final[-train_ind, -c(i)]
tune.out.linear <- tune(svm, label_fake ~ .,
data = training,
kernel = "linear",
ranges = list(cost =10^seq(1, 3, by = 0.5))) # you can choose any range you see fit
svm.linear <- svm(label_fake ~ .,
kernel = "linear",
data = training,
cost = tune.out.linear[["best.parameters"]][["cost"]])
train.pred.linear <- predict(svm.linear, testing)
testing_y <- as.factor(testing$label_fake)
conf.matrix.svm.linear <- caret::confusionMatrix(train.pred.linear, testing_y)
varimp_df <- rbind(varimp_df,data.frame(
var_no=i,
variable=colnames(df_final)[i], # name of the removed predictor; also works for a plain data.frame
cost_param=tune.out.linear[["best.parameters"]][["cost"]],
accuracy=conf.matrix.svm.linear[["overall"]][["Accuracy"]],
kappa=conf.matrix.svm.linear[["overall"]][["Kappa"]],
sensitivity=conf.matrix.svm.linear[["byClass"]][["Sensitivity"]],
specificity=conf.matrix.svm.linear[["byClass"]][["Specificity"]]))
runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed # time for running this loop
runtime1 # divide by 60 and you get minutes, /3600 you get hours
}
Naive Bayes version:
varimp_nb_df <- NULL
ptm1 <- proc.time() # Start the clock!
for(i in 1:(ncol(df_final)-1)) {
smp_size <- floor(0.70 * nrow(df_final))
train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
training <- df_final[train_ind, -c(i)]
testing <- df_final[-train_ind, -c(i)]
x = training[, names(training) != "label_fake"]
y = training$label_fake
model_nb_var = train(x,y,'nb', trControl=ctrl) # ctrl must be a caret::trainControl() object defined beforehand
predict_nb_var <- predict(model_nb_var, newdata = testing )
confusion_matrix_nb_1 <- caret::confusionMatrix(predict_nb_var, testing$label_fake)
varimp_nb_df <- rbind(varimp_nb_df, data.frame(
var_no=i,
variable=colnames(df_final)[i], # name of the removed predictor; also works for a plain data.frame
accuracy=confusion_matrix_nb_1[["overall"]][["Accuracy"]],
kappa=confusion_matrix_nb_1[["overall"]][["Kappa"]],
sensitivity=confusion_matrix_nb_1[["byClass"]][["Sensitivity"]],
specificity=confusion_matrix_nb_1[["byClass"]][["Specificity"]]))
runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed # time for running this loop
runtime1 # divide by 60 and you get minutes, /3600 you get hours
}
Have fun!

How to perform 10 fold cross validation with LibSVM in R?

I know that in MATLAB this is really easy ('-v 10').
But I need to do it in R. I found a comment saying that adding cross = 10 as a parameter would do it, but I could not confirm this in the help file, so I am sceptical about it.
svm(Outcome ~. , data= source, cost = 100, gamma =1, cross=10)
Any examples of a successful SVM script for R would also be appreciated, as I am still running into some dead ends.
Edit: I forgot to mention outside of the tags that I use the libsvm package for this.
I am also trying to perform a 10 fold cross validation. I think that using tune is not the right way in order to perform it, since this function is used to optimize the parameters, but not to train and test the model.
I have the following code to perform a Leave-One-Out cross validation. Suppose that dataset is a data.frame with your data stored. In each LOO step, the observed vs. predicted matrix is added, so that at the end, result contains the global observed vs. predicted matrix.
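#LOOValidation
library(e1071) # svm, classAgreement
result <- 0 # running sum of the observed vs. predicted tables
for (i in 1:nrow(dataset)){ # one iteration per row; length(dataset) would count columns
fit = svm(classes ~ ., data=dataset[-i,], type='C-classification', kernel='linear')
pred = predict(fit, dataset[i,])
result <- result + table(true=dataset[i,]$classes, pred=pred);
}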
classAgreement(result)
So in order to perform a 10-fold cross validation, I guess we should manually partition the dataset, and use the folds to train and test the model.
results <- list() # one observed vs. predicted table per fold
for (i in 1:10) {
# getFoldTrainSet / getFoldTestSet are placeholders for your own fold-splitting helpers
train <- getFoldTrainSet(dataset, i)
test <- getFoldTestSet(dataset,i)
fit = svm(classes ~ ., train, type='C-classification', kernel='linear')
pred = predict(fit, test)
results[[i]] <- table(true=test$classes, pred=pred);
}
# compute mean accuracies and kappas using results, which stores the result of each fold
I hope this helps.
Here is a simple way to create 10 test and training folds using no packages:
#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]
#Create 10 equally sized folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- yourData[testIndexes, ]
trainData <- yourData[-testIndexes, ]
#Use test and train data however you desire...
}
Here is my generic code to run a k-fold cross validation aided by cvsegments to generate the index folds.
# k fold-cross validation
set.seed(1)
k <- 80;
result <- 0;
library('pls'); # cvsegments
library('e1071'); # svm, classAgreement
folds <- cvsegments(nrow(imDF), k);
for (fold in 1:k){
currentFold <- folds[[fold]];
fit = svm(classes ~ ., data=imDF[-currentFold,], type='C-classification', kernel='linear')
pred = predict(fit, imDF[currentFold,])
result <- result + table(true=imDF[currentFold,]$classes, pred=pred);
}
classAgreement(result)
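On the original doubt about cross = 10: the svm function in e1071 (the R interface to libsvm) does accept a cross argument and stores the per-fold results on the fitted object. A minimal sketch, assuming iris stands in for the asker's data and reusing the cost and gamma values from the question:

```r
library(e1071)

set.seed(1)  # fold assignment is random, so fix the seed for repeatability
fit <- svm(Species ~ ., data = iris, cost = 100, gamma = 1, cross = 10)

print(fit$accuracies)    # accuracy (in percent) on each of the 10 folds
print(fit$tot.accuracy)  # overall cross-validated accuracy (in percent)
```

This reports accuracy only; for a cross-validated confusion matrix or kappa, the manual fold loops above are still needed.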
