I have a question regarding the feature importance function in the Caret package.
I have a dataset which has more numeric and factor features.
I used the command below to get the feature importance of the model. It gives me the importance of each (sub_feature) for the factor variables. However, I just want the importance of the feature itself without go in detail for each factor of the feature.
gbmImp <- caret::varImp(xgb1, scale = TRUE)
I will create some example data as we don't have any from your question:
library(caret)
# example data
df <- data.frame("x" = rnorm(100),
"fac" = as.factor(sample(c(rep("A", 30), rep("B", 35), rep("C", 35)))),
"y" = as.numeric((rpois(100, 4))))
# model
model <- train(y ~ ., method = "glm", data = df)
# feature importance
varImp(model, scale = TRUE)
This returns the feature importance that you do not want in your question:
# glm variable importance
#
# Overall
# facB 100.00
# facC 13.08
# x 0.00
You can convert the factor variables to numeric and do the same thing:
# make the factor variable numeric
trans_df <- transform(df, fac = as.numeric(fac))
# model
trans_model <- train(y ~ ., method = "glm", data = trans_df)
# feature importance
varImp(trans_model, scale = TRUE)
This returns the importance for the 'overall' feature:
# glm variable importance
#
# Overall
# x 100
# fac 0
However, I do not know whether the as.numeric() operation on the factor variable doesn't result in a different feature importance when we run varImp(trans_model, scale = TRUE).
Also, check out this SO thread if you find that your specific factor/character variables are problematic when converting to numeric.
The df is splitted in the train and test dataframes. the train dataframe is splitted in training and testing dataframes. The dependent variable Y is binary (factor) with values 0 and 1. I'm trying to predict the probability with this code (neural networks, caret package):
library(caret)
model_nn <- train(
Y ~ ., training,
method = "nnet",
metric="ROC",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE,
classProbs=TRUE
)
)
model_nn_v2 <- model_nn
nnprediction <- predict(model_nn, testing, type="prob")
cmnn <-confusionMatrix(nnprediction,testing$Y)
print(cmnn) # The confusion matrix is to assess/compare the model
However, it gives me this error:
Error: At least one of the class levels is not a valid R variable name;
This will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels
that can be used as valid R variable names (see ?make.names for help).
I don't understand what means "use factor levels that can be used as valid R variable names". The dependent variable Y is already a factor, but is not a valid R variable name?.
PS: The code works perfectly if you erase classProbs=TRUE in trainControl() and metric="ROC" in train(). However, the "ROC" metric is my metric of comparison for the best model in my case, so I'm trying to make a model with "ROC" metric.
EDIT: Code example:
# You have to run all of this BEFORE running the model
classes <- c("a","b","b","c","c")
floats <- c(1.5,2.3,6.4,2.3,12)
dummy <- c(1,0,1,1,0)
chr <- c("1","2","2,","3","4")
Y <- c("1","0","1","1","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
classes <- c("a","a","a","b","c")
floats <- c(5.5,2.6,7.3,54,2.1)
dummy <- c(0,0,0,1,1)
chr <- c("3","3","3,","2","1")
Y <- c("1","1","1","0","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
There are two separate issues here.
The first is the error message, which says it all: you have to use something else than "0", "1" as values for your dependent factor variable Y.
You can do this by at least two ways, after you have built your dataframe df; the first one is hinted at the error message, i.e. use make.names:
df$Y <- make.names(df$Y)
df$Y
# "X1" "X1" "X1" "X0" "X0"
The second way is to use the levels function, by which you will have explicit control over the names themselves; showing it here again with names X0 and X1
levels(df$Y) <- c("X0", "X1")
df$Y
# [1] X1 X1 X1 X0 X0
# Levels: X0 X1
After adding either one of the above lines, the shown train() code will run smoothly (replacing training with df), but it will still not produce any ROC values, giving instead the warning:
Warning messages:
1: In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
which bring us to the second issue here: in order to use the ROC metric, you have to add summaryFunction = twoClassSummary in the trControlargument of train():
model_nn <- train(
Y ~ ., df,
method = "nnet",
metric="ROC",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE,
classProbs=TRUE,
summaryFunction = twoClassSummary # ADDED
)
)
Running the above snippet with the toy data you have provided still gives an error (missing ROC values), but probably this is due to the very small dataset used here combined with the large number of CV folds, and it will not happen with your own, full dataset (it works OK if I reduce the CV folds to number=3)...
I'm dealing with a large dataset that involves more than 100 features (which are all relevant because they have already been filtered; the original dataset had over 500 features). I created a random forest model via the train() function from the caret package and using the "ranger" method.
Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.
Here's some sample code (minus the training set, which is very large):
library(caret)
rforest_model <- train(target_variable ~ .,
data = train_data_set,
method = "ranger",
importance = "impurity)
And here's the code for extracting variable importance:
varImp(rforest_model)
The varImp function extracts importance for all variables (even if they are not used by the model), it just prints out the top 20 variables. Consider this example:
library(mlbench) #for data set
library(caret)
library(tidyverse)
set.seed(998)
data(Ionosphere)
rforest_model <- train(y = Ionosphere$Class,
x = Ionosphere[,1:34],
method = "ranger",
importance = "impurity")
nrow(varImp(rforest_model)$importance) #34 variables extracted
lets check them:
varImp(rforest_model)$importance %>%
as.data.frame() %>%
rownames_to_column() %>%
arrange(Overall) %>%
mutate(rowname = forcats::fct_inorder(rowname )) %>%
ggplot()+
geom_col(aes(x = rowname, y = Overall))+
coord_flip()+
theme_bw()
note that V2 is a zero variance feature in this data set hence it has 0 importance and is not used by the model at all.
I am using caret for my non-binary (three classes) decision tree classification. My dataset is skewed so I want to use F1 instead of accuracy for my training and testing. How do I set this?
For an MWE lets predict the cut in the diamonds dataset:
library(ggplot2)
library(caret)
inTrain <- createDataPartition(diamonds$cut, p=0.75, list=FALSE)
training <- diamonds[inTrain,]
testing <- diamonds[-inTrain,]
fitModel <- train(cut ~ ., training, method = "rpart")
How to use F1 here?
The page at http://topepo.github.io/caret/training.html details how to create a new metric for the train function -
You need to create a new function with three parameters -
data - "is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification)"
lev - "is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function."
name - "is a character string for the model being used"
The function should calculate the F-score for the observed labels and predicted labels in the data object, and name the result based on the metric -
for example a function calculating accuracy
summaryStats <- function (data, lev = NULL, model = NULL) {
cor <- sum(data$pred==data$obs)
incor <- sum(data$pred!=data$obs)
out <- cor/(cor + incor)
names(out) <- c("acc")
out
}
Then create a new trainControl object and train your model --
fitControl <- trainControl(summaryFunction = summaryStats)
fitModel <- train(cut ~ ., training, trControl = fitControl, metric = "acc", maximize=TRUE)
I have been using the predict function in R to predict a randomForests model outcomes for a testing set when it suddenly it would only return the predicted levels instead of the probabilities. I specified the type as response but it still returns factors. What possibly could cause this?
The data consists in 23 variables, 20 of which are factors (unordered) and two of which are numeric. I am trying to predict whether a product will sell or not (0 or 1). Here is the code for the prediction:
library(randomForest)
rf = randomForest(sold ~., data = train, ntree=200, nodesize=25)
prf <- predict(rf, newdata = test, type ="response")
set type="prob"
data(iris)
library(randomForest)
seed(1234)
train.key = sort(sample(1:dim(iris)[1],100))
iris.train = iris[train.key,]
iris.test = iris[-train.key,]
rf = randomForest(Species ~., data = iris.train)
predicted.prob = predict(rf,newData=iris.test,type ="prob")