What is the meaning of component err.rate of class randomForest? - r

I'm using the function randomForest from the package randomForest. One of the components of a randomForest object is err.rate, which the documentation describes as
(classification only) vector error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.
Could you please explain what this component means? Thank you so much for your help!
I take the dataset Sonar, Mines vs. Rocks, as a code example.
library(mlbench)
data(Sonar)
library(boot)
library(randomForest)
n <- 208
ntrain <- 100
ntest <- 108
train.idx <- sample(1:n, ntrain, replace = FALSE)
train.set <- Sonar[train.idx, ]
test.set <- Sonar[-train.idx, ]
rf <- randomForest(Class ~ ., data = train.set, keep.inbag = TRUE, importance = TRUE)
head(rf$err.rate)
Here is the result of the code
           OOB         M         R
[1,] 0.1891892 0.1500000 0.2352941
[2,] 0.2931034 0.2307692 0.3437500
[3,] 0.2739726 0.2647059 0.2820513
[4,] 0.2911392 0.2894737 0.2926829
[5,] 0.2413793 0.2682927 0.2173913
[6,] 0.2555556 0.2142857 0.2916667
[7,] 0.2553191 0.2444444 0.2653061
[8,] 0.2268041 0.1956522 0.2549020
[9,] 0.2783505 0.2608696 0.2941176

One component of random forests is bagging, where you get a consensus prediction from an ensemble of trees.
As you add trees, the OOB error is recomputed at each step. The OOB error is not calculated by comparing the prediction of a single tree against the samples that are out of bag for that tree; rather, each sample is predicted by aggregating the votes of all trees for which that sample was not used in training, and those aggregated predictions are compared with the true labels. I recommend checking this for an overview.
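As a quick check on that description (reusing the rf object fitted above), the OOB prediction for each training row, aggregated over all trees, is stored in rf$predicted, so the error computed from it should line up with the last row of err.rate:
# overall OOB error from the aggregated OOB predictions
mean(rf$predicted != train.set$Class)
# last row of err.rate, "OOB" column -- should match
rf$err.rate[rf$ntree, "OOB"]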
So in the example you have, we can visualize this:
library(ggplot2)
library(tidyr)
plotdf <- pivot_longer(data.frame(ntrees=1:nrow(rf$err.rate),rf$err.rate),-ntrees)
ggplot(plotdf,aes(x=ntrees,y=value,col=name)) +
geom_line() + theme_bw()
M and R are the error rates for predictions of that specific class, and OOB (your first column) is the overall error rate across all OOB samples, i.e. the two class-specific errors weighted by how many samples of each class are out of bag (not a plain average of the two). As the number of trees increases, the OOB error tends to drop and stabilize, because the aggregated prediction from more trees is more reliable.
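A rough way to see that weighting, again with the rf from above (with hundreds of trees essentially every training case has OOB votes, so the relation should hold almost exactly):
# per-class OOB errors in the last row, weighted by the class counts
n_cls <- table(train.set$Class)
weighted.mean(rf$err.rate[rf$ntree, c("M", "R")], n_cls[c("M", "R")])
# compare with the overall OOB error in the same row
rf$err.rate[rf$ntree, "OOB"]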
The nice thing about randomForest is that you often don't need separate cross-validation, because the OOB estimate is usually quite indicative. Below we can try to show that the two give similar results:
set.seed(12)
# split the rows into 5 roughly equal folds
trn = split(1:nrow(Sonar), sample(1:nrow(Sonar) %% 5))
sim = vector("list", 5)
# the number of trees after each increment (we grow by 10 trees per step below)
ntrees = c(1, 10*(1:50) + 1)
for(CV in 1:5){
  idx = trn[[CV]]
  train.set <- Sonar[-idx, ]
  test.set <- Sonar[idx, ]
  # start from a one-tree forest so we can grow it incrementally
  mdl <- randomForest(Class ~ ., data = train.set, ntree = 1,
                      keep.inbag = TRUE, importance = TRUE, keep.forest = TRUE)
  err_rate <- vector("numeric", 51)
  err_rate[1] <- mean(predict(mdl, test.set) != test.set$Class)
  # grow the forest by 10 trees at a time and track the test error
  for(i in 1:50){
    mdl <- grow(mdl, 10)
    err_rate[i+1] <- mean(predict(mdl, test.set) != test.set$Class)
  }
  sim[[CV]] <- data.frame(ntrees = ntrees, err_rate = err_rate, CV = CV)
}
sim = do.call(rbind, sim)
# plot the per-fold curves and their mean
ggplot(sim, aes(x = ntrees, y = err_rate)) + geom_line(aes(group = CV), alpha = 0.2) +
  stat_summary(fun = mean, geom = "line", col = "blue") + theme_bw()
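One more way to eyeball the claim is to overlay the running OOB error on the cross-validation curves. This is just a sketch, not part of the comparison above: it refits a forest on all of Sonar purely for illustration.
rf_all <- randomForest(Class ~ ., data = Sonar, ntree = max(ntrees))
oob <- data.frame(ntrees = 1:nrow(rf_all$err.rate),
                  err_rate = rf_all$err.rate[, "OOB"])
ggplot(sim, aes(x = ntrees, y = err_rate)) +
  geom_line(aes(group = CV), alpha = 0.2) +
  stat_summary(fun = mean, geom = "line", col = "blue") +
  geom_line(data = oob, col = "red") +   # running OOB error from the full-data forest
  theme_bw()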

Related

kNN algorithm not working while using caret

I am trying to run LOOCV kNN on this dataset (104 x 182, where the first 62 samples are B and the following 42 are C). I first conducted a PCA on the standardized version of this dataset (giving me 104 PCs). I then try to perform LOOCV kNN for i = 3:98, where i refers to the number of PCs I will use for my kNN model. For each i I pull out the highest accuracy and the k at which it occurs, and store them in a data frame.
# required packages
library(MASS)
library(class)
library(tidyverse)
library(caret)
# reading in and cleaning data
data <- read.csv("chowdary.csv")
og_data <- data[, -1]
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
# on standardized data
pca_all <- prcomp(og_data[, -1], center=TRUE, scale=TRUE)
# creating data frame to store best k value for each number of PCs
kdf_pca_all_cc <- tibble(i=as.numeric(),            # this is for storing number of PCs used,
                         pca_all_k=as.numeric(),    # k value,
                         pca_all_acc=as.numeric(),  # accuracy value,
                         pca_all_kapp=as.numeric()) # and kappa value
# kNN
k_kNN <- 3:97 # number of PCs to use in each iteration of the model
train_control <- trainControl(method="LOOCV")
kNN_data <- as.data.frame(cbind(as.factor(st_data[, 1]), pca_all$x)) # data used in kNN model below
for (i in k_kNN){
  a111 <- train(V1 ~ .,
                method="knn",
                tuneGrid=expand.grid(k=1:25),
                trControl=train_control,
                metric="Accuracy",
                data=kNN_data[, 1:i])
  # store the best accuracy rate, along with its k and kappa value
  b111 <- a111$results[as.integer(a111$bestTune), ]
  kdf_pca_all_cc <- kdf_pca_all_cc %>%
    add_row(i=i-1,
            pca_all_k=b111[, 1],
            pca_all_acc=b111[, 2],
            pca_all_kapp=b111[, 3])
}
For example, for i = 5, the kNN model would be using the following data:
head(kNN_data[, 1:5])
V1 PC1 PC2 PC3 PC4
1 1 3.299844 0.2587487 -1.00501632 2.0273727
2 1 1.427856 -1.0455044 -1.79970790 2.5244021
3 1 3.087657 1.2563404 1.67591441 -1.4270431
4 1 3.107778 1.5893396 2.65871270 -2.8217264
5 1 3.244306 0.5982652 0.37011029 0.3642425
6 1 3.000098 0.5471276 -0.01178315 1.0857886
However, whenever I try to run the for-loop, I get the following error and warning:
Error: Metric Accuracy not applicable for regression models
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
I have no idea how to fix this. Any help would be much appreciated.
Also, as a side note, is there a faster way to run this for-loop? It takes quite a while but I have no idea how to make it more efficient. Thank you.

How to convert one-fold cross-validation to K-fold cross-validation in R

I have a GAM model for which I would like to calculate AUC, TSS (True Skill Statistic) and RMSE through 5-fold cross-validation in R. Unfortunately, the caret package does not support GAM and therefore cannot be used. As I didn’t find any alternative, I tried to build the code for cross-validation myself, and it works well, with the only problem that it is only one-fold cross-validation. Could anybody help me to make this 5-fold? Sorry if this is an elementary question, I am new to R.
sample <- sample(c(TRUE, FALSE), nrow(DF), replace=TRUE, prob=c(0.8,0.2))
train <- DF[sample, ]
test <- DF[!sample, ]
predicted <- predict(GAM, test, type="response")
# Calculating RMSE
RMSE(test$Y, predicted)
# Calculating AUC
auc(test$Y, predicted)
GAM_TSS <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3), train, family = "binomial")
test$pred <- predict(GAM_TSS, type="response", newdata=test)
roc.curve <- roc(test$Y, test$pred, ci=T)
plot(roc.curve)
threshold <- 0.1
CM <- confusionMatrix(factor(test$pred>threshold), factor(test$P_A==1), positive="TRUE")
CM <- CM$byClass
Sensitivity <- CM[['Sensitivity']]
Specificity <- CM[['Specificity']]
# Calculating TSS
TSS = Sensitivity + Specificity - 1
TSS
I have come across precisely this problem with GAM in the past. My approach was to create a vector that splits the data randomly into parts as equally sized as possible, then loop through the fold ids as follows:
k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)
FoldID <- sample(FoldID, replace = FALSE)
for(fold in 1:k){
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Create training model and predictions
  # Calculate RMSE data etc.
  # Add a line with fold validation results to a dataframe
}
# Calculate column means of your validation results frame
I will leave you to fill in the gaps to suit your own requirements. It would also be a good idea to add an outer loop (outside the FoldID creation) for repeats.
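For concreteness, here is a minimal sketch of one way to fill those gaps under the setup in the question. The formula, the names DF, Y and X1..X5, and the 0.1 threshold are taken from the question; mgcv, pROC and caret are assumed to be the packages providing gam(), auc() and confusionMatrix()/RMSE().
library(mgcv)   # gam()
library(pROC)   # auc()
library(caret)  # confusionMatrix(), RMSE()

k <- 5
FoldID <- rep(1:k, ceiling(nrow(DF)/k))
length(FoldID) <- nrow(DF)
FoldID <- sample(FoldID, replace = FALSE)
results <- data.frame(fold = 1:k, RMSE = NA, AUC = NA, TSS = NA)

for(fold in 1:k){
  train_data <- DF[FoldID != fold, ]
  val_data <- DF[FoldID == fold, ]
  # refit the GAM on the training folds only
  mdl <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3),
             data = train_data, family = "binomial")
  pred <- predict(mdl, newdata = val_data, type = "response")
  results$RMSE[fold] <- RMSE(pred, val_data$Y)
  results$AUC[fold] <- as.numeric(auc(val_data$Y, pred))
  # TSS from sensitivity and specificity at the 0.1 threshold used in the question
  cm <- confusionMatrix(factor(pred > 0.1, levels = c("FALSE", "TRUE")),
                        factor(val_data$Y == 1, levels = c("FALSE", "TRUE")),
                        positive = "TRUE")$byClass
  results$TSS[fold] <- cm[["Sensitivity"]] + cm[["Specificity"]] - 1
}
colMeans(results[, -1])   # mean RMSE, AUC and TSS across the 5 folds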

SVM performance not consistent with AUC score

I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick).
I have tried to implement an SVM model to predict patient status based on these variables.
library(e1071)
library(ROCR)   # assumed: prediction() and performance() below come from ROCR

Index <- order(Ytrain, decreasing = FALSE)
SVMfit_Var <- svm(Xtrain[Index, ], Ytrain[Index],
                  type = "C-classification", gamma = 0.005,
                  probability = TRUE, cost = 0.001, epsilon = 0.1)
preds1 <- predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <- attr(preds1, "probabilities")[, 1]
samples <- !is.na(Ytest)
pred <- prediction(preds1[samples], Ytest[samples])
AUC <- performance(pred, "auc")@y.values[[1]]
prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
To test the performance of the model, I have calculated the ROC AUC, and with the validation set I obtain an AUC = 0.997.
But when I look at the predictions, every patient has been classified as healthy.
AUC = 0.997
> xtab
     prediction
Ytest  0  1
    0 72  0
    1 52  0
Can anyone help me with this problem?
Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
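As a quick, hypothetical check reusing the objects from the question, you can put the hard class predictions next to the class that gets the highest probability and see whether they disagree:
prob_mat <- attr(predict(SVMfit_Var, Xtest, probability = TRUE), "probabilities")
head(data.frame(hard_class = predict(SVMfit_Var, Xtest),
                prob_class = colnames(prob_mat)[max.col(prob_mat)]))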
If you want to look at the performance you can use the library DescTools and the function Conf or with the library caret and the function confusionMatrix. (They provide the same output.)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
Your question isn't reproducible, so I went through this with the iris data. The probability was the same for every observation; I included this so you can see the issue with another data set.
library(e1071)
library(ROCR)
library(caret)
library(dplyr)  # for filter() and %>% below
data("iris")
# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)
set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70, # 70%
replace = F)
# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
type = "C-classification",
gamma = .005, probability = T,
cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)
# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[,1]
# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 that's abysmal!!
# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful
# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good

R: how to improve gradient boosting model fit

I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using gbm in the gbm package. I set the number of iterations to M = 1000 with a learning rate of learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.
library(rpart)
library(gbm)
data(iris)
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M <- 1000
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian", n.trees = M,
interaction.depth = 2, shrinkage = learning.rate, bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)
tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)
> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438
In the iris dataset there are 3 different species: the first 50 rows are setosa, the next 50 are versicolor and the last 50 are virginica. So I think it's better if you shuffle the rows, and also keep Species in the model, because it is clearly informative:
library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()
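As a minimal sketch of the shuffling point (the seed and the 100/50 split size are arbitrary assumptions; everything else reuses the settings from the question):
set.seed(1)
idx <- sample(nrow(iris), 100)
train.dat <- iris[idx, ]
test.dat <- iris[-idx, ]
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.001,
                 bag.fraction = 1, train.fraction = 1)
# test RMSE on a random, rather than species-sorted, holdout
sqrt(mean((test.dat$Sepal.Length - predict(gbm.model, test.dat, n.trees = 1000))^2))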
Secondly, you should do this over a few replicates to see the uncertainty. For this we can use caret, where we can define the training samples beforehand and also provide a fixed grid. What we are interested in is the cross-validated error during training, which is similar to what you are doing:
library(caret)  # for trainControl() and train()
set.seed(999)
idx = split(sample(nrow(iris)), 1:nrow(iris) %% 3)
tr = trainControl(method = "cv", index = idx)
this_grid = data.frame(interaction.depth = 2, shrinkage = 0.001,
                       n.minobsinnode = 10, n.trees = 1000)
gbm_fit = train(Sepal.Width ~ ., data = iris, method = "gbm",
                distribution = "gaussian", tuneGrid = this_grid, trControl = tr)
Then we use the same samples to fit rpart:
# the default cp for rpart
this_grid = data.frame(cp = 0.01)
rpart_fit = train(Sepal.Width ~ ., data = iris, method = "rpart",
                  trControl = tr, tuneGrid = this_grid)
Finally we compare them, and they are very similar:
gbm_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3459311 0.5000575 0.2585884        0
2 0.3421506 0.4536114 0.2631338        1
3 0.3428588 0.5600722 0.2693837        2

rpart_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3492542 0.3791232 0.2695451        0
2 0.3320841 0.4276960 0.2550386        1
3 0.3284239 0.4343378 0.2570833        2
So I suspect the gap in your example comes mainly from the unshuffled train/test split. It always depends on your data; for some data, like iris, rpart might be good enough because there are very strong predictors. And for complex models like gbm, you most likely need to tune the parameters with a procedure like the above to find a good combination.
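If you want to go further, a hypothetical tuning sketch along those lines (the grid values are arbitrary; it reuses the tr resampling defined above) would be:
tune_grid <- expand.grid(interaction.depth = c(1, 2, 3),
                         shrinkage = c(0.001, 0.01, 0.1),
                         n.minobsinnode = 10,
                         n.trees = c(100, 500, 1000))
gbm_tuned <- train(Sepal.Width ~ ., data = iris, method = "gbm",
                   distribution = "gaussian", tuneGrid = tune_grid,
                   trControl = tr, verbose = FALSE)
gbm_tuned$bestTune   # the best combination found by cross-validation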

Prediction of 'mlm' linear model object from `lm()`

I have three datasets:
response - matrix of 5(samples) x 10(dependent variables)
predictors - matrix of 5(samples) x 2(independent variables)
test_set - matrix of 10(samples) x 2(independent variables defined in predictors)
response <- matrix(sample.int(15, size = 5*10, replace = TRUE), nrow = 5, ncol = 10)
colnames(response) <- c("1_DV","2_DV","3_DV","4_DV","5_DV","6_DV","7_DV","8_DV","9_DV","10_DV")
predictors <- matrix(sample.int(15, size = 5*2, replace = TRUE), nrow = 5, ncol = 2)
colnames(predictors) <- c("1_IV","2_IV")
test_set <- matrix(sample.int(15, size = 10*2, replace = TRUE), nrow = 10, ncol = 2)
colnames(test_set) <- c("1_IV","2_IV")
I'm doing a multivariate linear model using a training set defined as the combination of response and predictor sets, and I would like to use this model to make predictions for the test set:
training_dataframe <- data.frame(predictors, response)
fit <- lm(response ~ predictors, data = training_dataframe)
predictions <- predict(fit, data.frame(test_set))
However, the results for predictions are really odd:
predictions
First off, the matrix dimensions are 5 x 10, i.e. the number of samples in the response set by the number of DVs.
I'm not very skilled with this type of analysis in R, but shouldn't I be getting a 10 x 10 matrix, so that I have predictions for each row in my test_set?
Any help with this issue would be greatly appreciated,
Martin
You are stepping into a poorly supported part of R. The model class you have is "mlm", i.e., "multiple linear models", which is not the standard "lm" class. You get it when you have several (independent) response variables for a common set of covariates / predictors. Although the lm() function can fit such a model, the predict method is poor for the "mlm" class. If you look at methods(predict), you will see a predict.mlm*. Normally for a linear model with class "lm", predict.lm is called when you call predict; but for an "mlm" class object, predict.mlm* is called.
predict.mlm* is too primitive. It does not allow se.fit, i.e., it cannot produce prediction errors, confidence / prediction intervals, etc., although this is possible in theory. It can only compute the prediction mean. If so, why would we want to use predict.mlm* at all? The prediction mean can be obtained by a trivial matrix-matrix multiplication (in the standard "lm" class this is a matrix-vector multiplication), so we can do it on our own.
Consider this small, reproducible example.
set.seed(0)
## 2 response of 10 observations each
response <- matrix(rnorm(20), 10, 2)
## 3 covariates with 10 observations each
predictors <- matrix(rnorm(30), 10, 3)
fit <- lm(response ~ predictors)
class(fit)
# [1] "mlm" "lm"
beta <- coef(fit)
# [,1] [,2]
#(Intercept) 0.5773235 -0.4752326
#predictors1 -0.9942677 0.6759778
#predictors2 -1.3306272 0.8322564
#predictors3 -0.5533336 0.6218942
When you have a prediction data set:
# 2 new observations for 3 covariates
test_set <- matrix(rnorm(6), 2, 3)
we first need to pad an intercept column
Xp <- cbind(1, test_set)
Then do this matrix multiplication
pred <- Xp %*% beta
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Perhaps you have noticed that I did not even use a data frame here. Yes, it is unnecessary because you have everything in matrix form. For those R wizards, maybe using lm.fit or even qr.solve is more straightforward.
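For completeness, here is what that lm.fit route might look like with the same matrices (just a sketch; the qr.solve route would be similar):
X <- cbind(1, predictors)              # design matrix with an intercept column
beta2 <- lm.fit(X, response)$coefficients
Xp %*% beta2                           # same predictions as Xp %*% beta above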
But for a complete answer, it is a must to demonstrate how to use predict.mlm* to get the desired result.
## still using previous matrices
training_dataframe <- data.frame(response = I(response), predictors = I(predictors))
fit <- lm(response ~ predictors, data = training_dataframe)
newdat <- data.frame(predictors = I(test_set))
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Note the I() when I use data.frame(). This is a must when we want to obtain a data frame of matrices. You can compare the difference between:
str(data.frame(response = I(response), predictors = I(predictors)))
#'data.frame': 10 obs. of 2 variables:
# $ response : AsIs [1:10, 1:2] 1.262954.... -0.32623.... 1.329799.... 1.272429.... 0.414641.... ...
# $ predictors: AsIs [1:10, 1:3] -0.22426.... 0.377395.... 0.133336.... 0.804189.... -0.05710.... ...
str(data.frame(response = response, predictors = predictors))
#'data.frame': 10 obs. of 5 variables:
# $ response.1 : num 1.263 -0.326 1.33 1.272 0.415 ...
# $ response.2 : num 0.764 -0.799 -1.148 -0.289 -0.299 ...
# $ predictors.1: num -0.2243 0.3774 0.1333 0.8042 -0.0571 ...
# $ predictors.2: num -0.236 -0.543 -0.433 -0.649 0.727 ...
# $ predictors.3: num 1.758 0.561 -0.453 -0.832 -1.167 ...
Without I() to protect the matrix input, the data are messy. It is amazing that this does not cause a problem for lm, but predict.mlm will have a hard time obtaining the correct matrix for prediction if you don't use I().
Well, I would recommend using a "list" instead of a "data frame" in this case. The data argument in lm, as well as the newdata argument in predict, allows list input. A "list" is a more general structure than a data frame and can hold any data structure without difficulty. We can do:
## still using previous matrices
training_list <- list(response = response, predictors = predictors)
fit <- lm(response ~ predictors, data = training_list)
newdat <- list(predictors = test_set)
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Perhaps at the very end, I should stress that it is always safer to use the formula interface rather than the matrix interface. I will use the R built-in dataset trees as a reproducible example.
fit <- lm(cbind(Girth, Height) ~ Volume, data = trees)
## use the first two rows as prediction dataset
predict(fit, newdata = trees[1:2, ])
# Girth Height
#1 9.579568 71.39192
#2 9.579568 71.39192
Perhaps you still remember my saying that predict.mlm* is too primitive to support se.fit. This is the chance to test it.
predict(fit, newdata = trees[1:2, ], se.fit = TRUE)
#Error in predict.mlm(fit, newdata = trees[1:2, ], se.fit = TRUE) :
# the 'se.fit' argument is not yet implemented for "mlm" objects
Oops... How about confidence / prediction intervals? (Actually, without the ability to compute standard errors it is impossible to produce those intervals.) Well, predict.mlm* will just ignore them.
predict(fit, newdata = trees[1:2, ], interval = "confidence")
# Girth Height
#1 9.579568 71.39192
#2 9.579568 71.39192
So this is very different from predict.lm.
