Why is rpart more accurate than Caret rpart in R

This post mentions that Caret rpart is more accurate than rpart due to bootstrapping and cross-validation:
Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?
Yet when I compare both methods, I get an accuracy of 0.4879 for Caret rpart and 0.7347 for rpart (I have copied my code below).
Besides that, the classification tree for Caret rpart has only a few nodes (splits) compared to the rpart tree.
Does anyone understand these differences?
Thank you!
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading libraries and the data
This is an R Markdown document. First we load the libraries and the data, and split the training data into a training set and a test set.
```{r section1, echo=TRUE}
# load libraries
library(knitr)
library(caret)
library(rpart)   # used directly for the second model below
suppressMessages(library(rattle))
library(rpart.plot)
# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the datasets
training <- read.csv(url(wwwTrain))
testing <- read.csv(url(wwwTest))
# set seed for reproducibility (before partitioning, so the split is reproducible)
set.seed(12345)
# create a partition with the training dataset
inTrain <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
```
## Cleaning the data
```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)
# remove variables that are mostly NA
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, !AllNA]
TestSet <- TestSet[, !AllNA]
dim(TrainSet)
dim(TestSet)
# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
```
## Prediction modelling
First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}
mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)
```
Second we build a similar model using rpart:
```{r section7, echo=TRUE}
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree
```

A simple explanation is that you did not tune either model, and at the default settings rpart performed better by pure chance. caret's train(method = "rpart") tunes only the complexity parameter cp, trying just three values by default (tuneLength = 3) and picking the winner by bootstrap resampling, so it can easily settle on a large cp that prunes the tree down to a few splits, whereas rpart() uses a fixed default of cp = 0.01. When you use the same parameters, you should expect similar performance.
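You can check this on the untuned fit with caret's standard train-object accessors:

```r
mod_rpart$results   # resampled accuracy for each cp value tried (3 by default)
mod_rpart$bestTune  # the cp value used for the final model
```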
Let's do some tuning with caret:
```r
set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50,
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
```
```
# output
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7009

 Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603
```
That is a bit better than rpart with default settings (cp = 0.01). How about if we set the optimal cp as chosen by caret:
```r
modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       control = rpart.control(cp = mod_rpart$bestTune$cp))

predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class")
confusionMatrix(predictDecTree, TestSet$classe)
```
```
# part of output
Accuracy : 0.7628
```
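To see how the resampled accuracy varies across the 50 cp values that were tried, caret's plot method for train objects draws the tuning profile (a quick follow-up, not part of the original output):

```r
# Resampled accuracy as a function of cp
plot(mod_rpart, metric = "Accuracy")
```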

Related

Approximated SHAP values for multi-classification problem using randomForest

I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-classification problem using a random forest classifier. I could only find chunks of the code around, but no explanation on how to proceed from the beginning in obtaining the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
```r
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(.75*nrow(ITA)), replace=F)
train <- ITA[sample,]
test <- ITA[-sample,]
set.seed(42)
rftrain <- randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3 <- function(object, newdata)
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[,3]
shap_values_G <- fastshap::explain(rftrain,
                                   X = train,
                                   pred_wrapper = p_function_3,
                                   nsim = 50,
                                   newdata = train[which(y==3),])
```
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
```r
library(randomForest)
library(fastshap)

set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]

xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"

fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)

pred_3 <- function(model, newdata) {
  predict(model, newdata = newdata, type = "prob")[, "virginica"]
}

shap_values_3 <- fastshap::explain(
  fit,
  X = train,                                       # reference data
  feature_names = xvars,
  pred_wrapper = pred_3,
  nsim = 50,
  newdata = train[train$Species == "virginica", ]  # rows to explain
)
head(shap_values_3)
#   Sepal.Width Sepal.Length
#         <dbl>        <dbl>
# 1      0.101        0.381
# 2      0.159       -0.0109
# 3      0.0736      -0.0285
# 4      0.0564       0.161
# 5      0.0649       0.594
# 6      0.232        0.0305
```
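If you also want a global importance summary from these explanations (a sketch, not part of the original answer), averaging the absolute SHAP values per feature is a common choice:

```r
# Mean absolute SHAP value per feature, sorted: a simple global importance measure
sort(colMeans(abs(shap_values_3)), decreasing = TRUE)
```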

How to obtain confusion matrix using caret package?

I was trying to analyse the example provided by the caret package for confusionMatrix, i.e.:
```r
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
                levels = rev(lvs))
pred <- factor(
  c(
    rep(lvs, times = c(54, 32)),
    rep(lvs, times = c(27, 231))),
  levels = rev(lvs))

xtab <- table(pred, truth)
confusionMatrix(xtab)
```
However, to be sure, I don't quite understand it. Let's just pick, for example, this very simple model:
```r
set.seed(42)
x <- sample(0:1, 100, T)
y <- rnorm(100)
glm(x ~ y, family = binomial('logit'))
```
And I don't know how I can analogously build a confusion matrix for this glm model. Do you understand how it can be done?
EDIT
I tried to run an example provided in the comments:
```r
train <- data.frame(LoanStatus_B = as.numeric(rnorm(100) > 0.5),
                    b = rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(LoanStatus_B ~ ., data = train, family = binomial(link = "logit"))

library(caret)
# Use your model to make predictions; in this example newdata = training set,
# but replace with your test set
pdata <- predict(logitMod, newdata = train, type = "response")
confusionMatrix(data = as.numeric(pdata > 0.5), reference = train$LoanStatus_B)
```
but I get the error: `data` and `reference` should be factors with the same levels.
Am I doing something incorrectly?
You just need to turn them into factors:
```r
confusionMatrix(data = as.factor(as.numeric(pdata > 0.5)),
                reference = as.factor(train$LoanStatus_B))
# Confusion Matrix and Statistics
#
#           Reference
# Prediction  0  1
#          0 61 31
#          1  2  6
#
#                Accuracy : 0.67
#                  95% CI : (0.5688, 0.7608)
#     No Information Rate : 0.63
#     P-Value [Acc > NIR] : 0.2357
#
#                   Kappa : 0.1556
#
#  Mcnemar's Test P-Value : 1.093e-06
#
#             Sensitivity : 0.9683
#             Specificity : 0.1622
#          Pos Pred Value : 0.6630
#          Neg Pred Value : 0.7500
#              Prevalence : 0.6300
#          Detection Rate : 0.6100
#    Detection Prevalence : 0.9200
#       Balanced Accuracy : 0.5652
#
#        'Positive' Class : 0
```
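Note that confusionMatrix treats the first factor level ("0" here) as the positive class by default; if you want "1" to be the positive class, pass the positive argument:

```r
# Same call, but report sensitivity, specificity, etc. with "1" as the positive class
confusionMatrix(data = as.factor(as.numeric(pdata > 0.5)),
                reference = as.factor(train$LoanStatus_B),
                positive = "1")
```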

manually making a random forest model doesn't give the same results

I tried recreating a random forest model using caret and I appear to get slightly different results.
```r
## Set up data
library(caret)
library(dplyr)
library(psych)   # provides the sat.act data set

sat.act <- na.omit(sat.act)
# rename outcome and make it a factor
sat.act <- sat.act %>% mutate(gender = ifelse(gender == 1, "male", "female"))
sat.act$gender <- as.factor(sat.act$gender)

# create train and test sets
set.seed(123)
indexes <- createDataPartition(y = sat.act$gender, p = 0.7, list = FALSE)
train <- sat.act[indexes, ]
test <- sat.act[-indexes, ]
```
Create a model using 5-fold cross-validation to find the best mtry:
```r
set.seed(123)
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
model <- train(gender ~ ., data = train,
               trControl = ctrl,
               method = "rf",
               preProc = c("center", "scale"),
               metric = "ROC",
               importance = TRUE)
```
```
> model$finalModel
# Call:
#  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of error rate: 39%
# Confusion matrix:
#        female male class.error
# female    238   72   0.2322581
# male      116   56   0.6744186
```
Cross-validation showed the best mtry is 2. Make another model, input mtry = 2, and see the results.
```r
set.seed(123)
ctrl_other <- trainControl(method = "none", savePredictions = TRUE,
                           summaryFunction = twoClassSummary, classProbs = TRUE)
model_other <- train(gender ~ ., data = train, trControl = ctrl_other,
                     importance = TRUE, tuneGrid = data.frame(mtry = 2))
```
```
> model_other$finalModel
# Call:
#  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of error rate: 37.34%
# Confusion matrix:
#        female male class.error
# female    245   65   0.2096774
# male      115   57   0.6686047
```
So you can see what appears to be two of the same models (both with mtry=2 and ntree=500) but you get different results for the final model. Why?
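A likely explanation (my assumption, not confirmed in this thread): random forests are stochastic, and with method = "cv" caret draws random numbers to build the resampling folds before it fits the final model, so the two final forests are grown from different RNG states even though both scripts start with set.seed(123). (The second call also omits preProc, although centering and scaling rarely affects a tree ensemble.) A minimal sketch showing that the fit itself is deterministic given the RNG state:

```r
# Assumption: the discrepancy comes from RNG state, not from the model settings.
# Two direct randomForest fits from the same seed produce identical forests:
library(randomForest)
set.seed(123)
rf1 <- randomForest(gender ~ ., data = train, mtry = 2, ntree = 500)
set.seed(123)
rf2 <- randomForest(gender ~ ., data = train, mtry = 2, ntree = 500)
identical(rf1$err.rate, rf2$err.rate)  # TRUE: same seed, same forest
```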

Missing values in Confusion matrix in Decision Tree in R

I have a problem building a confusion matrix using the decision tree method. The data set is extremely imbalanced and the population of the third label ("C") is about 1%.
I have no idea why the prediction results for C are all zero (0).
```r
# load the packages
install.packages('rpart')
library(rpart)
library(caret)

# load data
data <- read.csv("Drisk0122_01.csv", header = TRUE)
data <- data[, c(3:43)]
data$Class <- factor(data$Class, levels = c(1, 2, 3), labels = c("A", "B", "C"))

set.seed(42)
training.samples <- createDataPartition(y = data$Class, p = 0.7, list = FALSE)
training.samples
train <- data[training.samples, ]
test <- data[-training.samples, ]

############ tree
install.packages("tree")
library(tree)
treemod <- tree(Class ~ ., data = train)
plot(treemod)
text(treemod)
cv.trees <- cv.tree(treemod, FUN = prune.misclass)  # for a classification decision tree
plot(cv.trees)
prune.trees <- prune.misclass(treemod, best = 4)    # for a regression decision tree, use prune.tree
plot(prune.trees)
text(prune.trees, pretty = 0)

library(e1071)
treepred <- predict(prune.trees, test, type = 'class')
confusionMatrix(treepred, test$Class)
```
The results are as follows:
```
confusionMatrix(treepred, test$Class)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C
         A 2324  360   28
         B  211  427    3
         C    0    0    0

Overall Statistics

               Accuracy : 0.8205
                 95% CI : (0.807, 0.8333)
    No Information Rate : 0.756
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.4775

 Mcnemar's Test P-Value : 4.526e-15

Statistics by Class:

                     Class: A Class: B  Class: C
Sensitivity            0.9168   0.5426  0.000000
Specificity            0.5257   0.9166  1.000000
Pos Pred Value         0.8569   0.6661       NaN
Neg Pred Value         0.6708   0.8673  0.990755
Prevalence             0.7560   0.2347  0.009245
Detection Rate         0.6931   0.1273  0.000000
Detection Prevalence   0.8088   0.1912  0.000000
Balanced Accuracy      0.7212   0.7296  0.500000
```
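One common way to push the tree toward the rare class (a sketch, not from the original post, using rpart, which is already loaded above, instead of tree): make misclassifying a true "C" more expensive via rpart's loss matrix:

```r
# Loss matrix: rows are true classes (A, B, C), columns are predicted classes.
# Misclassifying a true "C" costs 10 instead of 1; the weights are illustrative.
lossmat <- matrix(c( 0,  1,  1,
                     1,  0,  1,
                    10, 10,  0), nrow = 3, byrow = TRUE)
treemod_w <- rpart(Class ~ ., data = train, method = "class",
                   parms = list(loss = lossmat))
```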

Calculating Brier Score and Integrated Brier Score using ranger R package

I want to calculate the Brier score and the integrated Brier score for my analysis using the "ranger" R package.
As an example, I use the veteran data from the "survival" package as follows:
install.packages("ranger")
library(ranger)
install.packages("survival")
library(survival)
#load veteran data
data(veteran)
data <- veteran
# training and test data
n <- nrow(data)
testind <- sample(1:n,n*0.7)
trainind <- (1:n)[-testind]
#train ranger
rg <- ranger(Surv(time, status) ~ ., data = data[trainind,])
# use rg to predict test data
pred <- predict(rg,data=data[testind,],num.trees=rg$num.trees)
#cummulative hazard function for each sample
pred$chf
#survival probability for each sample
pred$survival
How can I calculate Brier score and integrated Brier score?
The Integrated Brier Score (IBS) can be calculated using the pec function from the pec package, but you need to define a predictSurvProb method to extract survival probability predictions from the ranger model (see ?pec:::predictSurvProb for a list of supported models).
A possible solution is:
```r
predictSurvProb.ranger <- function(object, newdata, times, ...) {
  ptemp <- ranger:::predict.ranger(object, data = newdata, importance = "none")$survival
  pos <- prodlim::sindex(jump.times = object$unique.death.times,
                         eval.times = times)
  p <- cbind(1, ptemp)[, pos + 1, drop = FALSE]
  if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
    stop(paste("\nPrediction matrix has wrong dimensions:\nRequested newdata x times: ",
               NROW(newdata), " x ", length(times), "\nProvided prediction matrix: ",
               NROW(p), " x ", NCOL(p), "\n\n", sep = ""))
  p
}
```
This function can be used as follows:
```r
library(ranger)
library(survival)
data(veteran)
dts <- veteran
n <- nrow(dts)

set.seed(1)
testind <- sample(1:n, n*0.7)
trainind <- (1:n)[-testind]

rg <- ranger(Surv(time, status) ~ ., data = dts[trainind,])

# A formula to be passed to the pec command
frm <- as.formula(paste("Surv(time, status)~",
                        paste(rg$forest$independent.variable.names, collapse = "+")))

library(pec)
# Using pec for IBS estimation
PredError <- pec(object = rg,
                 formula = frm, cens.model = "marginal",
                 data = dts[testind,], verbose = FALSE, maxtime = 200)
```
The IBS can be inspected using the print.pec method, with the times argument indicating the time points at which to report the IBS:
```r
print(PredError, times = seq(10, 200, 50))
# ...
# Integrated Brier score (crps):
#
#           IBS[0;time=10)  IBS[0;time=60)  IBS[0;time=110)  IBS[0;time=160)
# Reference          0.043           0.183            0.212            0.209
# ranger             0.041           0.144            0.166            0.176
```
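If you want the IBS as a value rather than printed text, pec also exposes it through crps() (the time point below is illustrative):

```r
# Integrated Brier score up to t = 200 (the maxtime used above), returned as a matrix
crps(PredError, times = 200)
```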
