I'm trying to use the "lime" package to interpret a Random Forest model with the "import85" dataset, but when I run the explain command it returns an error:
library(lime)
library(caret)
stopifnot(require(randomForest))

# Load the data, drop the second column, and keep complete cases only
data("imports85", package = "randomForest")
imp85 <- imports85[, -2]
imp85 <- imp85[complete.cases(imp85), ]

# Drop unused factor levels
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[, drop = TRUE] else x)

# Sample 135 row indices (roughly 70% of the rows); column 5 is used as the response
NROW(imp85) * 0.7
idx <- sample(1:NROW(imp85), 135)
test <- imp85[idx, c(1:4, 6:25)]
train <- imp85[-idx, c(1:4, 6:25)]
resp <- imp85[[5]][-idx]

model <- train(train, resp, method = "rf")
explainer <- lime(train, model)
explanation <- explain(test, explainer, n_labels = 1, n_features = 2)
Error in predict.randomForest(modelFit, newdata, type = "prob") :
Type of predictors in new data do not match that of the training data.
How can I solve it?
EDIT 1:
I tried to force the factor levels to be the same in both the train and test sets, but it didn't work. A sketch of that attempt is below.
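For reference, this is roughly what "forcing the levels" looks like (the loop is illustrative and reuses train, test, and explainer from the code above); in my case the error persists even after this step:

# Give every factor column in `test` exactly the levels seen in `train`
for (v in names(train)) {
  if (is.factor(train[[v]])) {
    test[[v]] <- factor(test[[v]], levels = levels(train[[v]]))
  }
}
explanation <- explain(test, explainer, n_labels = 1, n_features = 2)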
I am using the R programming language. I am trying to use the "caret" library to train a multiclass classification algorithm.
I found the following post, which contains a summary function that can be used for multiclass classification problems: https://www.r-bloggers.com/2012/07/error-metrics-for-multi-class-problems-in-r-beyond-accuracy-and-kappa/
require(compiler)
multiClassSummary <- cmpfun(function(data, lev = NULL, model = NULL) {
  # Load libraries
  require(Metrics)
  require(caret)

  # Check data
  if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
    stop("levels of observed and predicted data do not match")

  # Calculate custom one-vs-all stats for each class
  prob_stats <- lapply(levels(data[, "pred"]), function(class) {
    # Grab one-vs-all data for the class
    pred <- ifelse(data[, "pred"] == class, 1, 0)
    obs <- ifelse(data[, "obs"] == class, 1, 0)
    prob <- data[, class]

    # Calculate one-vs-all AUC and logLoss and return
    cap_prob <- pmin(pmax(prob, .000001), .999999)
    prob_stats <- c(auc(obs, prob), logLoss(obs, cap_prob))
    names(prob_stats) <- c('ROC', 'logLoss')
    return(prob_stats)
  })
  prob_stats <- do.call(rbind, prob_stats)
  rownames(prob_stats) <- paste('Class:', levels(data[, "pred"]))

  # Calculate confusion-matrix-based statistics
  CM <- confusionMatrix(data[, "pred"], data[, "obs"])

  # Aggregate and average class-wise stats
  # TODO: add weights
  class_stats <- cbind(CM$byClass, prob_stats)
  class_stats <- colMeans(class_stats)

  # Aggregate overall stats
  overall_stats <- c(CM$overall)

  # Combine overall with class-wise stats and remove some stats we don't want
  stats <- c(overall_stats, class_stats)
  stats <- stats[!names(stats) %in% c('AccuracyNull',
                                      'Prevalence', 'Detection Prevalence')]

  # Clean names and return
  names(stats) <- gsub('[[:blank:]]+', '_', names(stats))
  return(stats)
})
I would like to use this function to train a "Decision Tree" model using the "F1 Score" (or any metric suitable for a multiclass problem). For example:
library(caret)
library(plyr)
library(C50)
library(dplyr)
library(compiler)

train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = multiClassSummary, classProbs = TRUE)
train_model <- train(my_data$response ~ ., data = my_data, method = "C5.0",
                     trControl = train.control,
                     preProcess = c("center", "scale"),
                     tuneLength = 15,
                     metric = "F1")
But this produces the following error:
Error in ctrl$summaryFunction(testOutput, lev, method):
Your outcome has 4 levels. The prSummary() function isn't appropriate.
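For comparison, here is a minimal sketch of the same call using the multiClassSummary that ships with recent versions of caret, which reports a Mean_F1 metric. my_data and response are placeholders from the question; note that the formula should name the response column directly (my_data$response ~ . adds a copy of the response as a predictor), and metric must match one of the names the summary function actually returns:

library(caret)
library(C50)

# Sketch, assuming `my_data` is a data frame whose factor column `response`
# has syntactically valid level names (required when classProbs = TRUE)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = multiClassSummary,  # caret's built-in version
                              classProbs = TRUE)

train_model <- train(response ~ ., data = my_data, method = "C5.0",
                     trControl = train.control,
                     preProcess = c("center", "scale"),
                     tuneLength = 15,
                     metric = "Mean_F1")  # a name returned by caret's multiClassSummary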
library(rio)          # provides import()
library(caTools)      # provides sample.split()
library(randomForest)

birth <- import("smoker_data1.xlsx")
## Split the dataset into train and test sets
mysplit <- sample.split(birth, SplitRatio = 0.65)
train <- subset(birth, mysplit == TRUE)
test <- subset(birth, mysplit == FALSE)

## Build the random forest model on the training set
mod1 <- randomForest(smoke ~ ., train)
Error message:
Error in y - ymean : non-numeric argument to binary operator
I think the best approach is to check the data type of the smoke variable first. If it is not a factor, convert it with as.factor():
library(readxl)
library(caTools)      # provides sample.split()
library(randomForest)

birth <- read_excel("smoker_data1.xlsx")

## Split the dataset into train and test sets
mysplit <- sample.split(birth, SplitRatio = 0.65)
train <- subset(birth, mysplit == TRUE)
test <- subset(birth, mysplit == FALSE)

## The response must be a factor for classification
train$smoke <- as.factor(train$smoke)

## Build the random forest model on the training set
mod1 <- randomForest(smoke ~ ., train)
I tried this with the data you provided; you just need to specify the column types correctly before calling randomForest().
# Specify the column types before fitting
data1$baby_wt <- as.numeric(data1$baby_wt)
data1$income <- as.factor(data1$income)
data1$mother_a <- as.numeric(data1$mother_a)
data1$smoke <- as.factor(data1$smoke)
data1$gestation <- as.numeric(data1$gestation)
data1$mother_wt <- as.numeric(data1$mother_wt)

library(caret)
library(randomForest)

predictors <- names(data1)[!names(data1) %in% "smoke"]
inTrainingSet <- createDataPartition(data1$smoke, p = 0.7, list = FALSE)
train <- data1[inTrainingSet, ]
test <- data1[-inTrainingSet, ]

m.rf <- randomForest(smoke ~ ., data = train, mtry = floor(sqrt(length(predictors))),
                     ntree = 5000, importance = TRUE, proximity = TRUE)
m.rf

#############################################
# Test performance
#############################################
m.pred <- predict(m.rf, test[-4], type = "response")  # predicted classes
m.table <- table(m.pred, test$smoke)
confusionMatrix(m.table)
I want to extract the predictions for new unseen data using the function caret::extractPrediction with a random forest model, but I cannot figure out why my code throws the error Error: $ operator is invalid for atomic vectors. How should the input parameters be structured to use this function?
Here is my reproducible code:
library(caret)
dat <- as.data.frame(ChickWeight)
# create column set
dat$set <- rep("train", nrow(dat))
# split into train and validation set
set.seed(1)
dat[sample(nrow(dat), 50), which(colnames(dat) == "set")] <- "validation"
# predictors and response
all_preds <- dat[which(dat$set == "train"), which(names(dat) %in% c("Time", "Diet"))]
response <- dat[which(dat$set == "train"), which(names(dat) == "weight")]
# set train control parameters
contr <- caret::trainControl(method="repeatedcv", number=3, repeats=5)
# train a random forest with caret
set.seed(1)
model <- caret::train(x = all_preds,
                      y = response,
                      method = "rf",
                      ntree = 250,
                      metric = "RMSE",
                      trControl = contr)

# validation set
vali <- dat[which(dat$set == "validation"), ]

# not working
caret::extractPrediction(models = model, testX = vali[, -c(3, 5, 1)], testY = vali[, 1])
caret::extractPrediction(models = model, testX = vali, testY = vali)

# works without problems
predict(model, newdata = vali)
I found a solution by looking at the documentation of extractPrediction. The models argument doesn't take a single model instance but a list of models, so I passed list(my_rf = model) instead of just model.
caret::extractPrediction(models = list(my_rf = model), testX = vali[,-c(3,5,1)], testY = vali[,1])
I have imbalanced data, and I want to do stratified cross-validation and use the precision-recall AUC as my evaluation metric.
I use prSummary from the R package caret with a stratified index, and I encounter an error when the performance is computed.
The following sample reproduces the problem. I found that only ten samples are used to compute the p-r AUC, and because of the imbalance they all belong to one class, so the p-r AUC cannot be computed. (I discovered that only ten samples are used because I modified prSummary to print out the data it receives.)
library(randomForest)
library(mlbench)
library(caret)
# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]
# make this data very imbalanced
y[4:length(y)] <- "M"
y <- as.factor(y)
dataset$Class <- y
# create index and indexOut
seed <- 1
set.seed(seed)
folds <- 2
idxAll <- 1:nrow(x)
cvIndex <- createFolds(factor(y), folds, returnTrain = T)
cvIndexOut <- lapply(1:length(cvIndex), function(i) {
  idxAll[-cvIndex[[i]]]
})
names(cvIndexOut) <- names(cvIndex)
# set the index, indexOut and prSummaryCorrect
control <- trainControl(index = cvIndex, indexOut = cvIndexOut,
                        method = "cv", summaryFunction = prSummary, classProbs = TRUE)
metric <- "AUC"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
Here is the error message:
Error in ROCR::prediction(y_pred, y_true) :
Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.
I think I found the odd part...
Even though I specified the cross-validation indices, the summary function (whether prSummary or any other) is still called once on a small set of about ten samples; judging by the testOutput in the error above, this appears to be caret's internal check of the summary function's output.
My workaround is to define a summary function that uses tryCatch so this error does not abort the run.
prSummaryCorrect <- function(data, lev = NULL, model = NULL) {
  # Debug output: show the data the summary function actually receives
  print(data)
  print(dim(data))

  library(MLmetrics)

  if (length(levels(data$obs)) > 2)
    stop(paste("Your outcome has", length(levels(data$obs)),
               "levels. The prSummary() function isn't appropriate."))
  if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
    stop("levels of observed and predicted data do not match")

  # Return NA instead of aborting when the PR AUC cannot be computed
  res <- tryCatch({
    MLmetrics::PRAUC(y_pred = data[, lev[2]],
                     y_true = ifelse(data$obs == lev[2], 1, 0))
  }, warning = function(war) {
    print(war)
    NA
  }, error = function(e) {
    print(dim(data))
    NA
  })

  c(AUC = res,
    Precision = caret::precision(data = data$pred, reference = data$obs, relevant = lev[2]),
    Recall = caret::recall(data = data$pred, reference = data$obs, relevant = lev[2]),
    F = caret::F_meas(data = data$pred, reference = data$obs, relevant = lev[2]))
}
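With that in place, the trainControl call from the reproducible example can point at the tolerant summary function (reusing cvIndex, cvIndexOut, tunegrid, and seed from above):

control <- trainControl(index = cvIndex, indexOut = cvIndexOut,
                        method = "cv", summaryFunction = prSummaryCorrect,
                        classProbs = TRUE)

set.seed(seed)
rf_default <- train(Class ~ ., data = dataset, method = "rf",
                    metric = "AUC", tuneGrid = tunegrid, trControl = control)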
My test data and my training data have different factor levels. I tried to merge the levels, but it doesn't work.
library(mgcv)
library(ff)
myData <- read.csv.ffdf(file = "myFile.csv")
myData$myVar <- as.factor(myData$myVar)
testData <- read.csv(file = "test.csv")
testData$myVar <- as.factor(testData$myVar)
form <- dependent ~ .
model <- gam(form, data=myData)
model$xlevels[["myVar"]] <- union(model$xlevels[["myVar"]], levels(testData$myVar))
predictedData <- predict(model, newdata=testData)
then R gives me this error:
Error in predict.gam(model, newdata = testData) : 1001, 1213,1231 not in original fit
Calls: predict -> predict.gam
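Editing model$xlevels does not help here: the fitted model has no coefficients for levels it never saw, so predict.gam rejects them. A minimal sketch of one workaround, assuming rows with unseen levels can simply be excluded (myVar, form, myData, and testData follow the question):

library(mgcv)

model <- gam(form, data = myData)

# Record the factor levels the model was actually fitted with
known <- model$xlevels[["myVar"]]

# Keep only test rows whose myVar level occurred in training;
# predictions for unseen levels are undefined for this model
testKnown <- testData[testData$myVar %in% known, ]
testKnown$myVar <- factor(testKnown$myVar, levels = known)

predictedData <- predict(model, newdata = testKnown)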