Recursive feature elimination and variable selection - r

How can I get the OA (overall accuracy) and Kappa values for each number of variables, like the table in the figure below?
The study used RFE with the caret package.

You get this table whenever you run rfe() on a classification dataset. It looks like the article cleaned up and renamed some of the column names, but that is all.
library(caret)
data(mdrr)
# drop near-zero-variance and highly correlated predictors
mdrrDescr <- mdrrDescr[, -nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]
set.seed(1)
inTrain <- createDataPartition(mdrrClass, p = .75, list = FALSE)[, 1]
train <- mdrrDescr[ inTrain, ]
test <- mdrrDescr[-inTrain, ]
trainClass <- mdrrClass[ inTrain]
testClass <- mdrrClass[-inTrain]
set.seed(2)
# 5-fold CV with the random-forest RFE helper functions
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5, verbose = FALSE)
rf_profile <- rfe(train, trainClass,
                  ntree = 50,
                  rfeControl = ctrl)
rf_profile$results contains the results that you can see in the table.
rf_profile$results
Variables Accuracy Kappa AccuracySD KappaSD
1 4 0.7355696 0.4599432 0.06290770 0.1274150
2 8 0.7934494 0.5736408 0.08328405 0.1725036
3 16 0.8060759 0.6011138 0.05961418 0.1222687
4 61 0.8260759 0.6411303 0.07101790 0.1483737
and if you want the variable names that go with those entries, you can get them like this:
rf_profile$optVariables[rf_profile$results$Variables]
[1] "VRA1" "TI2" "Xt" "G.O..Cl."

Related

R: Multiclass Matrices

I am working with the R programming language. I am trying to learn how to make a "confusion matrix" for multiclass variables (e.g. How to construct the confusion matrix for a multi class variable).
Suppose I generate some data and fit a decision tree model:
#load libraries
library(rpart)
library(caret)
#generate data
a <- rnorm(1000, 10, 10)
b <- rnorm(1000, 10, 5)
d <- rnorm(1000, 5, 10)
group_1 <- sample(LETTERS[1:3], 1000, replace = TRUE, prob = c(0.33, 0.33, 0.34))
e <- data.frame(a, b, d, group_1)
e$group_1 <- as.factor(e$group_1)
#split data into train and test set
trainIndex <- createDataPartition(e$group_1, p = .8,
                                  list = FALSE,
                                  times = 1)
training <- e[trainIndex,]
test <- e[-trainIndex,]
fitControl <- trainControl(## 5-fold CV
                           method = "repeatedcv",
                           number = 5,
                           ## repeated once
                           repeats = 1)
#fit decision tree model
TreeFit <- train(group_1 ~ ., data = training,
                 method = "rpart2",
                 trControl = fitControl)
From here, I am able to store the results into a "confusion matrix":
pred <- predict(TreeFit,test)
table_example <- table(pred,test$group_1)
This satisfies my requirements, but this "table" requires me to manually calculate the different accuracy metrics for "A", "B" and "C" (as well as the total accuracy).
My question: Is it possible to use the caret::confusionMatrix() command for this problem?
e.g.
pred <- predict(TreeFit, test, type = "prob")
labels_example <- as.factor(ifelse(pred[,2]>0.5, "1", "0"))
con <- confusionMatrix(labels_example, test$group_1)
This way, I would be able to directly access the accuracy measurements from the confusion matrix. E.g. metric = con$overall[1]
Thanks
Is this what you're looking for?
pred <- predict(TreeFit, test)
con <- confusionMatrix(test$group_1, pred)
con
con$overall[1]
Same output as in:
table(test$group_1, pred)
Plus accuracy metrics.
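As a small follow-up of my own (not part of the answer above): caret documents the call as confusionMatrix(data = predictions, reference = truth), and the per-class statistics sit directly on the returned object, so nothing needs to be computed by hand:
pred <- predict(TreeFit, test)
con  <- confusionMatrix(data = pred, reference = test$group_1)
con$overall["Accuracy"]   # total accuracy
con$byClass               # sensitivity, specificity, precision, etc. for classes A, B and C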

Use of PCA results as input to XGboost model throwing an error: Feature names stored in `object` and `newdata` are different

I run PCA on my training split and then project the test split onto those components, after removing irrelevant columns.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I transfer the labels into integer, and combine components and labels into xgbDMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
When I run this there is a warning, but the training still runs:
xgboost: label will be ignored
I can predict the training dataset with the model, but when I try to predict the test dataset I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
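A likely cause, offered only as a guess since the data are not available here: traincom keeps just the first 500 principal components (columns PC1..PC500), while testcom keeps every component, so predict() sees different feature names; scale() with its default scale = TRUE also rescales the columns in a way prcomp() did not. A minimal sketch of a projection that keeps the two matrices aligned:
# hypothetical fix sketch: predict.prcomp() applies the training centering and
# returns columns named PC1, PC2, ..., so the same 500 columns can be selected
testcom  <- predict(cuisine.pca, newdata = dataTest[, -1])[, 1:500]
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
# the "label will be ignored" warning comes from passing `cuisine ~ .` to xgboost();
# the label already lives inside the xgb.DMatrix, so no formula is needed
xgb.fit <- xgboost(data = xgb_train, nrounds = 40,
                   params = list(objective = "multi:softmax", num_class = 10),  # assumed objective
                   early_stopping_rounds = 5)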

Meaning: improvement in RMSE during cross-validation, although not on the test set?

In the code below I train a neural network with cross-validation on the first 20,000 records of the dataset. The dataset contains 8 predictors.
First I split my data into two parts:
the first 20,000 rows (train set)
and the last 4,003 rows (out-of-sample test set)
I did two runs:
run 1) a run with 3 predictors
run 2) a run with all 8 predictors (see code below)
Based on cross-validation within the 20,000 rows of the train set, the RMSE (for the optimal parameter setting) improves from 2.30 (run 1) to 2.11 (run 2).
However, when I test both models on the 4,003 rows of the out-of-sample test set, the RMSE improves only negligibly, from 2.64 (run 1) to 2.63 (run 2).
What can be concluded from this contradiction in the results?
Thanks!
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Chapter 7: Non-Linear Regression Models
### Required packages: AppliedPredictiveModeling, caret, doMC (optional),
### earth, kernlab, lattice, nnet
################################################################################
library(caret)

### Load the data
mydata <- read.csv(file = "data.csv", header = TRUE, sep = ",")

### last 4003 rows: out-of-sample test set
validatiex <- mydata[20001:24003, c(1:8)]
validatiey <- mydata[20001:24003, 9]

### first 20000 rows: train set
mydata <- mydata[1:20000, ]
x <- mydata[, c(1:8)]
y <- mydata[, 9]

parti <- createDataPartition(y, times = 1, p = 0.8, list = FALSE)
x_train <- x[parti, ]
x_test  <- x[-parti, ]
y_train <- y[parti]
y_test  <- y[-parti]

set.seed(100)
indx <- createFolds(y_train, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

## train neural net:
nnetGrid <- expand.grid(decay = c(.1),
                        size = c(5, 15, 30),
                        bag = FALSE)
set.seed(100)
nnetTune <- train(x = x_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 30 * (ncol(x_train) + 1) + 30 + 1,
                  maxit = 1000,
                  repeats = 25,
                  allowParallel = FALSE)
nnetTune
plot(nnetTune)

## out-of-sample RMSE on the last 4003 rows
predictions <- predict(nnetTune, validatiex, type = "raw")
rmse <- sqrt(mean((validatiey - predictions)^2))
print(rmse)
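To put the two numbers the question compares side by side, a small sketch of my own using caret's helpers:
getTrainPerf(nnetTune)                 # cross-validated RMSE of the selected tuning parameters
caret::RMSE(predictions, validatiey)   # out-of-sample RMSE, same as the manual sqrt(mean(...)) above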

I want to use AUPRC as the performance measure in a GBM run using the caret package. How can I use a customized metric such as AUPRC?

I am trying to use AUPRC as a custom metric for a gbm model fit because I have an imbalanced classification problem. However, when I try to incorporate the custom metric I get the error shown below, and I am not sure what I am doing wrong.
Also, auprcSummary() works on its own when I run it interactively; it only fails when I try to use it inside train().
library(dplyr) # for data manipulation
library(caret) # for model-building
library(pROC)  # for AUC calculations
library(PRROC) # for Precision-Recall curve calculations

auprcSummary <- function(data, lev = NULL, model = NULL){
  index_class2 <- data$Class == "Class2"
  index_class1 <- data$Class == "Class1"
  the_curve <- pr.curve(data$Class[index_class2],
                        data$Class[index_class1],
                        curve = FALSE)
  out <- the_curve$auc.integral
  names(out) <- "AUPRC"
  out
}
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     summaryFunction = auprcSummary,
                     classProbs = TRUE)

set.seed(5627)
orig_fit <- train(Class ~ .,
                  data = toanalyze.train,
                  method = "gbm",
                  verbose = FALSE,
                  metric = "AUPRC",
                  trControl = ctrl)
This is the error I am getting:
Error in order(scores.class0) : argument 1 is not a vector
Is it because pr.curve() takes only numeric vectors (scores/probabilities) as inputs?
caret has a built-in function called prSummary that computes that for you. You don't have to write your own.
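For reference, a minimal sketch of my own (assuming the same toanalyze.train data as in the question) of what that built-in route could look like; prSummary needs the MLmetrics package and reports the precision-recall AUC under the name "AUC":
library(MLmetrics)  # prSummary uses MLmetrics for the precision-recall AUC
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     summaryFunction = prSummary, classProbs = TRUE)
set.seed(5627)
pr_fit <- train(Class ~ ., data = toanalyze.train, method = "gbm",
                verbose = FALSE, metric = "AUC", trControl = ctrl)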
I think this approach yields an appropriate custom summary function:
library(caret)
library(pROC)
library(PRROC)
library(mlbench) #for the data set
data(Ionosphere)
In the pr.curve() function, the classification scores may either be provided separately for the data points of each class (scores.class0 for the positive/foreground class and scores.class1 for the negative/background class), or the scores for all data points may be provided as scores.class0 with the class labels supplied as numerical values (1 for the positive class, 0 for the negative class) via weights.class0 (paraphrased from the function's help page; apologies if it is unclear).
I opted for the latter: the probability for every observation in scores.class0 and the class assignment in weights.class0.
caret states that if the classProbs argument of the trainControl object is set to TRUE, additional columns will be present in data that contain the class probabilities. So for the Ionosphere data the columns good and bad should be present:
levels(Ionosphere$Class)
#output
[1] "bad" "good"
To convert to 0/1 labeling one can just do:
as.numeric(Ionosphere$Class) - 1
good becomes 1 and bad becomes 0.
Now we have all the data needed for the custom function:
auprcSummary <- function(data, lev = NULL, model = NULL){
  prob_good <- data$good   # take the probability of the good class
  the_curve <- pr.curve(scores.class0 = prob_good,
                        weights.class0 = as.numeric(data$obs) - 1, # provide the class labels as 0/1
                        curve = FALSE)
  out <- the_curve$auc.integral
  names(out) <- "AUPRC"
  out
}
Instead of using data$good, which works on this data set alone, one can extract the class names and use them to get the desired column:
lvls <- levels(data$obs)
prob_good <- data[,lvls[2]]
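Folding that in, a sketch of the generalized summary function (still assuming the second factor level is the class whose probability column is scored, as with good in Ionosphere):
auprcSummary <- function(data, lev = NULL, model = NULL){
  lvls <- levels(data$obs)
  the_curve <- pr.curve(scores.class0 = data[, lvls[2]],           # probability of the second level
                        weights.class0 = as.numeric(data$obs) - 1, # 0/1 labels, second level = 1
                        curve = FALSE)
  out <- the_curve$auc.integral
  names(out) <- "AUPRC"
  out
}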
It is important to note that each time you update the summaryFunction you need to update the trainControl object.
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     summaryFunction = auprcSummary,
                     classProbs = TRUE)

orig_fit <- train(y = Ionosphere$Class, x = Ionosphere[, c(1, 3:34)], # omit column 2 to avoid a bunch of warnings related to the data set
                  method = "gbm",
                  verbose = FALSE,
                  metric = "AUPRC",
                  trControl = ctrl)
orig_fit$results
#output
shrinkage interaction.depth n.minobsinnode n.trees AUPRC AUPRCSD
1 0.1 1 10 50 0.9722775 0.03524882
4 0.1 2 10 50 0.9758017 0.03143379
7 0.1 3 10 50 0.9739880 0.03316923
2 0.1 1 10 100 0.9786706 0.02502183
5 0.1 2 10 100 0.9817447 0.02276883
8 0.1 3 10 100 0.9772322 0.03301064
3 0.1 1 10 150 0.9809693 0.02078601
6 0.1 2 10 150 0.9824430 0.02284361
9 0.1 3 10 150 0.9818318 0.02287886
Seems reasonable

"The format of predictions is incorrect"

Implementation of an ROCR curve with kNN and 10-fold cross-validation.
I am using the Ionosphere dataset.
Here is the attribute information for your reference:
-- All 34 are continuous, as described above
-- The 35th attribute is either "good" or "bad" according to the definition
summarized above. This is a binary classification task.
data1<-read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data',header = FALSE)
knn on its own works, and kNN with k-fold cross-validation also works, but when I add the ROCR code it fails.
I get the error: "The format of predictions is incorrect".
I checked the data frames pred and Class1; the dimensions are the same. I also tried data.test$V35 instead of Class1 and got the same error.
install.packages("class")
library(class)
nrFolds <- 10
data1[,35]<-as.numeric(data1[,35])
# generate array containing fold-number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data1))
# actual cross validation
for(k in 1:nrFolds) {
# actual split of the data
fold <- which(folds == k)
data.train <- data1[-fold,]
data.test <- data1[fold,]
Class<-data.train[,35]
Class1<-data.test[,35]
# train and test your model with data.train and data.test
pred<-knn(data.train, data.test, Class, k = 5, l = 0, prob = FALSE, use.all = TRUE)
data<-data.frame('predict'=pred, 'actual'=Class1)
count<-nrow(data[data$predict==data$actual,])
total<-nrow(data.test)
avg = (count*100)/total
avg =format(round(avg, 2), nsmall = 2)
method<-"KNN"
accuracy<-avg
cat("Method = ", method,", accuracy= ", accuracy,"\n")
}
install.packages("ROCR")
library(ROCR)
rocrPred=prediction(pred, Class1, NULL)
rocrPerf=performance(rocrPred, 'tpr', 'fpr')
plot(rocrPerf, colorize=TRUE, text.adj=c(-.2,1.7))
Any help is appreciated.
This worked for me:
install.packages("class")
library(class)
library(ROCR)
nrFolds <- 10
data1[,35]<-as.numeric(data1[,35])
# generate array containing fold-number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data1))
# actual cross validation
for(k in 1:nrFolds) {
# actual split of the data
fold <- which(folds == k)
data.train <- data1[-fold,]
data.test <- data1[fold,]
Class<-data.train[,35]
Class1<-data.test[,35]
# train and test your model with data.train and data.test
pred<-knn(data.train, data.test, Class, k = 5, l = 0, prob = FALSE, use.all = TRUE)
data<-data.frame('predict'=pred, 'actual'=Class1)
count<-nrow(data[data$predict==data$actual,])
total<-nrow(data.test)
avg = (count*100)/total
avg =format(round(avg, 2), nsmall = 2)
method<-"KNN"
accuracy<-avg
cat("Method = ", method,", accuracy= ", accuracy,"\n")
pred <- prediction(Class1,pred)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=T, add=TRUE)
abline(0,1)
}
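A variation of my own (not from the answer above): ROCR's prediction() expects numeric scores rather than a factor, so inside the loop one can ask knn() for the vote proportions and turn them into a score for class "2" (the level that "g"/good maps to after as.numeric()):
# sketch only; reuses data.train, data.test, Class and Class1 from the loop above
pred  <- knn(data.train, data.test, Class, k = 5, prob = TRUE)
votes <- attr(pred, "prob")                      # proportion of votes for the *predicted* class
score <- ifelse(pred == "2", votes, 1 - votes)   # convert to a score for class "2"
rocrPred <- prediction(score, Class1)            # numeric scores first, true labels second
rocrPerf <- performance(rocrPred, "tpr", "fpr")
plot(rocrPerf, colorize = TRUE)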
