How to create a confusion matrix for an upsampled ML model in R

I am using the caret package in RStudio to create predictive models. I want to create a confusion matrix, but I do not know how to access the observed values after the resampling has been performed.
I have an imbalanced dataset so I've used upsampling with the following code:
library(caret)
control <- trainControl(method = "LGOCV", number = 1000)
control$sampling = "up"
# I create my predictive model with random forest:
metric = "Accuracy"
set.seed(123)
fit.rand = train(Diet~., data = year3data, method = "ranger", metric = metric, trControl = control)
Now I want to find weighted accuracy using a confusion matrix, but all the code I know of requires input of 'true' values and predicted values. I do not know how to access the true observation values from the upsampled dataset, and I know they won't be the same as those in the original dataset. Below is an example of the code I'd like to use:
confusionMatrix(data = fit.rand$pred, reference = fit.rand$obs, mode = "prec_recall")
The object fit.rand does have fit.rand$pred, but it does not have fit.rand$obs. I would like to know how to access the observations (post-resampling) that were used to create fit.rand please. Thank you!
TL;DR: the code and the problem are below.
library(caret)
control <- trainControl(method = "LGOCV", number = 1000)
control$sampling = "up"
metric = "Accuracy"
set.seed(123)
fit.rand = train(Diet~., data = year3data, method = "ranger", metric = metric, trControl = control)
confusionMatrix(data = fit.rand$pred, reference = fit.rand$obs, mode = "prec_recall")
confusionMatrix is the part of the code that I am having a problem with because fit.rand$obs does not exist. I would like to know how to access the observation values used to create fit.rand, because the resampling process has changed them from the original values in my 'year3data' dataframe.
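A sketch of one way around this, assuming caret's savePredictions option (not shown in the original code): caret's sampling = "up" is applied only to the training portion of each resample, so the held-out observations keep their original class labels, and train() will save them alongside the predictions if you ask it to:
library(caret)

# Ask caret to keep the held-out predictions for the final tuning parameters
control <- trainControl(method = "LGOCV", number = 1000,
                        sampling = "up",
                        savePredictions = "final")

set.seed(123)
fit.rand <- train(Diet ~ ., data = year3data, method = "ranger",
                  metric = "Accuracy", trControl = control)

# fit.rand$pred is now a data frame with `pred` and `obs` columns
confusionMatrix(data = fit.rand$pred$pred,
                reference = fit.rand$pred$obs,
                mode = "prec_recall")
Since fit.rand$pred$obs comes from the held-out rows of year3data, these are the original class labels, not upsampled copies.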

Related

Scaling a continuous feature in the test set according to the train set

I'm building a GBM classifier to predict a certain target variable.
My data contains many continuous variables, and I want to scale only one of them (age) using the scale function. I should scale this variable in the train set and then scale it in the test set according to the train set, so that I don't get information leakage. My question is: how do I apply this in R?
The way I'm doing it now is by scaling the age feature separately in the train set and the test set, which is not quite right. Here is my code (I use the caret package):
library(caret)  # createDataPartition(), train()
library(pROC)   # roc(), auc()

for (i in 1:10) {
  print(i)
  set.seed(i)
  IND = createDataPartition(y = MYData$Target_feature, p = 0.8, list = FALSE)
  TRAIN_set = MYData[IND, ]
  TEST_set = MYData[-IND, ]
  # Each set is scaled by its own mean/sd -- this is the problematic part
  TRAIN_set$age = scale(TRAIN_set$age)
  TEST_set$age = scale(TEST_set$age)
  GBMModel <- train(Target_feature ~ ., data = TRAIN_set,
                    method = "gbm",
                    metric = "ROC",
                    trControl = ctrlCV,
                    tuneGrid = gbmGRID,
                    verbose = FALSE)
  AUCs_Trn[i] = auc(roc(TRAIN_set$Target_feature, predict(GBMModel, TRAIN_set, type = 'prob')[, 1]))
  AUCs_Tst[i] = auc(roc(TEST_set$Target_feature, predict(GBMModel, TEST_set, type = 'prob')[, 1]))
}
NOTE: I only want to scale the age feature.
One way to do it is to scale the test data manually by the mean and standard deviation of the (unscaled) training set, which is equivalent to what scale() does:
TEST_set$age_scaled = (TEST_set$age - mean(TRAIN_set$age)) / sd(TRAIN_set$age)
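Alternatively, caret's preProcess can learn the centering and scaling parameters from the training set and apply the same transformation to the test set. A minimal sketch using the variable names from the question:
library(caret)

# Learn the center/scale parameters from the training set only
pp <- preProcess(TRAIN_set[, "age", drop = FALSE], method = c("center", "scale"))

# Apply the training-set transformation to both sets (no leakage)
TRAIN_set$age <- predict(pp, TRAIN_set[, "age", drop = FALSE])$age
TEST_set$age <- predict(pp, TEST_set[, "age", drop = FALSE])$age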

Storing Multiple Models in a Loop and Saving Them to Compare Variables

I'm interested in using RandomForest as my model for a classification problem. I have been able to run a very simple model for initial testing. However, I want to try a nested loop to run various models and save these to a vector. This is to achieve two principal objectives:
To extract the best model of these from my loop (or maybe get an average of these models?)
To compare the most important variables between my models and see which are the most commonly top selected features per prediction.
I am currently testing with the iris dataset to see how feasible this is before applying it to a larger dataset with many more features (> 100).
Nested Model Example
What I have so far is the following:
library(randomForest)
library(caret)

#Set Control
myControl = trainControl(method = "cv", number = 10)

#Set a counter
myCounter <- 0
RFModel_Vector <- c()

#Nested Loop to select best model
for (i in 0:2) {
  # Train a default Random Forest Model
  # (trControl and metric are caret::train arguments; randomForest
  # silently ignores them via its ... argument)
  RFModel_Vector <- randomForest(y = factor(iris$Species),
                                 x = iris[, colnames(iris) != "Species"],
                                 importance = TRUE,
                                 proximity = TRUE,
                                 trControl = myControl,
                                 metric = "Accuracy",
                                 ntree = 100)
  # Count Number of Loops
  myCounter = myCounter + 1
  print(myCounter)
}
I have also seen that there is a function caretList that can be used for ensemble methods.
I'm not entirely sure how to go about this. Any help?
Create a list to store the output: model objects are lists themselves, so it is better to store them in a list.
RFModel_Vector <- vector('list', 3)

for (i in seq_along(RFModel_Vector)) {
  # Train a default Random Forest Model and store it in the list
  RFModel_Vector[[i]] <- randomForest(y = factor(iris$Species),
                                      x = iris[, colnames(iris) != "Species"],
                                      importance = TRUE,
                                      proximity = TRUE,
                                      trControl = myControl,
                                      metric = "Accuracy",
                                      ntree = 100)
}
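To compare the most important variables across the stored models (the second objective above), the importance matrices can be pulled out of the list and averaged. A rough sketch with randomForest::importance():
# Extract the importance matrix from each stored model
imp_list <- lapply(RFModel_Vector, importance)

# Average MeanDecreaseGini across models and rank the features
mean_gini <- rowMeans(sapply(imp_list, function(m) m[, "MeanDecreaseGini"]))
sort(mean_gini, decreasing = TRUE)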

Caret trainControl method="none" gives Accuracy numeric(0), other methods work perfectly

I have a dataset on which I want to perform kNN.
Here's the code:
errboot <- function(data, k, number) {
  require(caret)
  classes <- data[, 1]
  fit <- train(as.factor(classes) ~ .,
               method = "knn",
               tuneGrid = expand.grid(k = k),
               metric = "Accuracy",
               data = data,
               trControl = trainControl(method = "none",
                                        number = number))
  err <- 1 - fit$results$Accuracy
  return(err)
}
According to the manual, "none" should fit one model to the training set.
Note: changing the method to "boot632" or "boot" etc. works perfectly well, but somehow changing it to "none" gives numeric(0) in the results.
The data I am using is a data frame whose first column is the classes and whose other two columns are features.
Can anyone see what the error is?
For the record, I am using the latest caret version (6.0-92).
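A sketch of a likely explanation (my reading of the caret docs, not from the original thread): with method = "none", caret fits a single model and performs no resampling, so there are no resampled performance values to store, and fit$results$Accuracy comes back as numeric(0). The error rate has to be computed manually from predictions, for example:
library(caret)

errnone <- function(data, k) {
  classes <- as.factor(data[, 1])
  fit <- train(x = data[, -1], y = classes,
               method = "knn",
               tuneGrid = expand.grid(k = k),
               trControl = trainControl(method = "none"))
  # Resubstitution error shown here for brevity; use a proper
  # holdout set in practice
  pred <- predict(fit, newdata = data[, -1])
  mean(pred != classes)
}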

R function for finding the sensitivity given an alpha value

I am new to data analysis with R so any help is appreciated.
I have a dataset with some explanatory variables and one target variable. The target variable is either Yes or No only. So I would like to use logistic regression for model fitting.
This is how I plot an ROC curve:
library(caret)
library(MLeval)  # for evalm()

myModel = train(
  myTarget ~ .,
  myTrainData,
  method = "glm",
  metric = "ROC",
  trControl = myControl,
  na.action = na.pass
)

myPred = predict(myModel, newdata = myTestData, type = "prob")
eval <- evalm(data.frame(myPred, myTestData$myTarget))
eval$roc
Now I would like to find the sensitivity given an alpha value (Type I error), and show the information like the following, if possible. How can I achieve it?
confusionMatrix(myPred, reference = myTestData$myTarget)
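One possible route (a sketch, not from the original thread): for a binary classifier, a Type I error of alpha corresponds to a specificity of 1 - alpha, so pROC::coords() can read the sensitivity at that point off the ROC curve. The positive-class column name "Yes" below follows the Yes/No target described in the question:
library(pROC)

# ROC object from the test labels and the predicted probability of "Yes"
rocObj <- roc(myTestData$myTarget, myPred[, "Yes"])

# Type I error alpha = 1 - specificity, so query specificity = 1 - alpha
alpha <- 0.05
coords(rocObj, x = 1 - alpha, input = "specificity", ret = "sensitivity")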

How to pass predictions from tuneRanger to a confusion matrix?

I am trying to predict a binary outcome (class1 and class2) with the tuneRanger function in R as follows:
library(mlr)
library(tuneRanger)

task = makeClassifTask(data = train, target = "outcome")
estimateTimeTuneRanger(task)
res = tuneRanger(task, measure = list(multiclass.brier),
                 num.trees = 1000, num.threads = 8, iters = 70)
a <- predict(res$model, newdata = test)
My question is how to get a confusion matrix after this. predict() gives me probabilities, and if I use
confusionMatrix(a, test$outcome, positive = 'Class2')
I get the error: Error: data and reference should be factors with the same levels.
Do I need to define another random forest model and use the optimal parameters from tuneRanger?
Thank you in advance for your attention.
I had the same problem and I used:
a <- predict(res$model$learner.model, data = test)  # learner.model is the underlying ranger fit; ranger's predict() takes `data`
From there you can get a$predictions, which you can use to build the confusion matrix.
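Since the forest was tuned on a probabilistic measure (the Brier score), a$predictions will typically be a matrix of class probabilities; one way to feed it to confusionMatrix() is to convert it to class labels first. A sketch, assuming the test labels live in test$outcome:
library(caret)

probs <- a$predictions  # one column of probabilities per class

# Pick the most probable class per row and align factor levels
# with the reference before building the confusion matrix
predClass <- factor(colnames(probs)[max.col(probs)],
                    levels = levels(test$outcome))
confusionMatrix(predClass, test$outcome, positive = "Class2")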
