about recipes package in R - r

Hi, I am using recipes for feature engineering in machine learning models.
However, when I use step_dummy, the dummy variables are treated as numeric variables, not factors.
I think this might be problematic with random forests or other tree models.
How can we change this? The partial dependence plot (PDP) shows that the dummy predictor is treated as numeric, so the x-axis has values such as 0.25, 0.5, ...
It should have only 0 and 1 (since the variable is a dummy).
library(modeldata)
library(recipes)
library(caret)
library(ranger)
library(ggplot2)
library(pdp)
data(okc)
okc <- okc[complete.cases(okc),]
rec <- recipe(~ diet + age + height, data = okc)
dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)
dummy_data <- bake(dummies, new_data = okc)
summary(dummy_data)
dummy_data<-na.omit(dummy_data )
dummy_data<-dummy_data[1:2000,]
dummy_data$diet_strictly.anything <- factor(dummy_data$diet_strictly.anything,
                                            labels = c("No", "Yes"))
myTrainingControl <- trainControl(method = "cv",
                                  number = 5,
                                  savePredictions = TRUE,
                                  classProbs = TRUE,
                                  summaryFunction = twoClassSummary,
                                  verboseIter = FALSE)
fit_rf <- caret::train(diet_strictly.anything ~ .,
                       data = dummy_data,
                       method = "ranger",
                       tuneLength = 2,
                       importance = "permutation",
                       trControl = myTrainingControl)
# Define a prediction function wrapper which requires two arguments
predict.function <- function(object, newdata) {
  predict(object, newdata, type = "prob")[, 2] %>% as.vector()
}
plt_ICE <- pdp::partial(fit_rf,
                        pred.var = "diet_mostly.vegetarian",
                        pred.fun = predict.function,
                        train = dummy_data) %>% autoplot(alpha = 0.1)
plt_ICE

From the step_dummy documentation:
step_dummy creates a specification of a recipe step that will convert nominal data (e.g. character or factors) into one or more numeric binary model terms for the levels of the original data.
The function appears to be working as expected in this case, by converting the categorical variable diet (stored as a character type in the okc data) into a set of binary numeric variables corresponding to the levels of diet.
If you're treating the variables as outcomes (i.e. trying to predict if someone has a specific type of diet), you're right that the dummy variables should not be encoded as numeric. If you're interested in changing the 'diet' dummies back to factors, a tidy approach might be:
library(tidyverse)
dummy_data <- dummy_data %>%
  mutate_at(vars(starts_with('diet')), list(as.factor))
If you're using those dummy variables as predictors, tree-based modeling tools in R (I've primarily used rpart, randomForest and ranger) can handle dummy variables as predictors encoded as numeric, and the interpretation of variable importance measures would be the same as if the variables were encoded as 2-level factors or logical variables.
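To illustrate why a numeric 0/1 encoding is harmless for tree splits, here is a minimal base-R sketch. It uses a made-up diet vector rather than the okc data, and model.matrix() as a stand-in for step_dummy (the column names differ from what recipes produces). The point is only that any split on a 0/1 column falls between 0 and 1, so it partitions the rows exactly as a two-level factor would.

```r
# Toy categorical predictor (hypothetical data, not the okc set)
diet <- factor(c("anything", "vegan", "anything", "vegetarian", "vegan", "anything"))

# model.matrix() builds 0/1 numeric indicator columns, similar in spirit to step_dummy()
dummies <- model.matrix(~ diet, data = data.frame(diet = diet))[, -1, drop = FALSE]

# A tree split on a 0/1 numeric column can only fall between 0 and 1 (e.g. at 0.5) ...
numeric_split <- dummies[, "dietvegan"] > 0.5

# ... which gives the same partition as splitting a two-level factor
factor_split <- factor(dummies[, "dietvegan"], labels = c("No", "Yes")) == "Yes"

identical(unname(numeric_split), unname(factor_split))
```

The intermediate x-axis values in the PDP come from the grid the plotting function evaluates, not from how the tree uses the variable.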

Related

Feature Importance for machine learning models in (Caret)package

I have a question regarding the feature importance function in the Caret package.
I have a dataset which has both numeric and factor features.
I used the command below to get the feature importance of the model. It gives me the importance of each sub-feature (level) of the factor variables. However, I just want the importance of the feature itself, without going into detail for each level of the feature.
gbmImp <- caret::varImp(xgb1, scale = TRUE)
I will create some example data as we don't have any from your question:
library(caret)
# example data
df <- data.frame("x" = rnorm(100),
"fac" = as.factor(sample(c(rep("A", 30), rep("B", 35), rep("C", 35)))),
"y" = as.numeric((rpois(100, 4))))
# model
model <- train(y ~ ., method = "glm", data = df)
# feature importance
varImp(model, scale = TRUE)
This returns the feature importance that you do not want in your question:
# glm variable importance
#
# Overall
# facB 100.00
# facC 13.08
# x 0.00
You can convert the factor variables to numeric and do the same thing:
# make the factor variable numeric
trans_df <- transform(df, fac = as.numeric(fac))
# model
trans_model <- train(y ~ ., method = "glm", data = trans_df)
# feature importance
varImp(trans_model, scale = TRUE)
This returns the importance for the 'overall' feature:
# glm variable importance
#
# Overall
# x 100
# fac 0
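However, I am not sure whether the as.numeric() conversion distorts the feature importance that varImp(trans_model, scale = TRUE) reports. Note that as.numeric() replaces the factor levels with their integer codes (1, 2, 3), so the model treats fac as a single ordered numeric predictor rather than as separate categories.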
Also, check out this SO thread if you find that your specific factor/character variables are problematic when converting to numeric.
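As a rough illustration of what the conversion does, the sketch below rebuilds similar example data (with a seed added for reproducibility, which is an assumption) and fits plain glm() models instead of going through caret. It shows that the factor version estimates one coefficient per non-reference level, while the numeric version estimates a single slope.

```r
set.seed(1)
df <- data.frame(
  x   = rnorm(100),
  fac = factor(sample(rep(c("A", "B", "C"), times = c(30, 35, 35)))),
  y   = as.numeric(rpois(100, 4))
)

# as.numeric() on a factor returns the integer level codes: A -> 1, B -> 2, C -> 3
head(data.frame(level = df$fac, code = as.numeric(df$fac)))

# Factor coding: one coefficient per non-reference level (facB, facC)
fit_factor <- glm(y ~ x + fac, data = df)

# Numeric coding: a single slope, which assumes A < B < C are equally spaced
fit_numeric <- glm(y ~ x + fac, data = transform(df, fac = as.numeric(fac)))

names(coef(fit_factor))   # "(Intercept)" "x" "facB" "facC"
names(coef(fit_numeric))  # "(Intercept)" "x" "fac"
```

So the two importance tables describe two different models, and the numeric one is only meaningful if the levels really are ordered and evenly spaced.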

ROC metric in train(), caret package

The df is split into train and test dataframes, and the train dataframe is further split into training and testing dataframes. The dependent variable Y is binary (factor) with values 0 and 1. I'm trying to predict the probability with this code (neural networks, caret package):
library(caret)
model_nn <- train(
  Y ~ ., training,
  method = "nnet",
  metric = "ROC",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE,
    classProbs = TRUE
  )
)
model_nn_v2 <- model_nn
nnprediction <- predict(model_nn, testing, type="prob")
cmnn <-confusionMatrix(nnprediction,testing$Y)
print(cmnn) # The confusion matrix is to assess/compare the model
However, it gives me this error:
Error: At least one of the class levels is not a valid R variable name;
This will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels
that can be used as valid R variable names (see ?make.names for help).
I don't understand what "use factor levels that can be used as valid R variable names" means. The dependent variable Y is already a factor, so why is it not a valid R variable name?
PS: The code works perfectly if you erase classProbs=TRUE in trainControl() and metric="ROC" in train(). However, the "ROC" metric is my metric of comparison for the best model in my case, so I'm trying to make a model with "ROC" metric.
EDIT: Code example:
# You have to run all of this BEFORE running the model
classes <- c("a","b","b","c","c")
floats <- c(1.5,2.3,6.4,2.3,12)
dummy <- c(1,0,1,1,0)
chr <- c("1","2","2,","3","4")
Y <- c("1","0","1","1","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
classes <- c("a","a","a","b","c")
floats <- c(5.5,2.6,7.3,54,2.1)
dummy <- c(0,0,0,1,1)
chr <- c("3","3","3,","2","1")
Y <- c("1","1","1","0","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
There are two separate issues here.
The first is the error message, which says it all: you have to use something other than "0" and "1" as the values of your dependent factor variable Y.
You can do this in at least two ways after you have built your dataframe df. The first one is hinted at in the error message, i.e. use make.names:
df$Y <- make.names(df$Y)
df$Y
# "X1" "X1" "X1" "X0" "X0"
The second way is to use the levels function, which gives you explicit control over the names themselves; it is shown here again with the names X0 and X1:
levels(df$Y) <- c("X0", "X1")
df$Y
# [1] X1 X1 X1 X0 X0
# Levels: X0 X1
After adding either one of the above lines, the shown train() code will run smoothly (replacing training with df), but it will still not produce any ROC values, giving instead the warning:
Warning messages:
1: In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
which brings us to the second issue here: in order to use the ROC metric, you have to add summaryFunction = twoClassSummary in the trControl argument of train():
model_nn <- train(
  Y ~ ., df,
  method = "nnet",
  metric = "ROC",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE,
    classProbs = TRUE,
    summaryFunction = twoClassSummary # ADDED
  )
)
Running the above snippet with the toy data you have provided still gives an error (missing ROC values), but probably this is due to the very small dataset used here combined with the large number of CV folds, and it will not happen with your own, full dataset (it works OK if I reduce the CV folds to number=3)...
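For reference, here is a small base-R sketch of the level-name check on its own, outside of caret. The labels "no" and "yes" are hypothetical; any syntactically valid names would do. Using factor() with explicit levels and labels has the side benefit of working whether Y is stored as a character vector or as a factor.

```r
Y <- c("1", "0", "1", "1", "0")

# "0" and "1" are not syntactically valid R names, which is what caret objects to
make.names(c("0", "1"))            # "X0" "X1"

# Relabel explicitly; this works whether Y is a character vector or a factor
Y_fixed <- factor(Y, levels = c("0", "1"), labels = c("no", "yes"))
levels(Y_fixed)                     # "no" "yes"

# Check: a level is valid if make.names() leaves it unchanged
all(levels(Y_fixed) == make.names(levels(Y_fixed)))  # TRUE
```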

extracting more than 20 variables by importance via varImp

I'm dealing with a large dataset that involves more than 100 features (which are all relevant because they have already been filtered; the original dataset had over 500 features). I created a random forest model via the train() function from the caret package and using the "ranger" method.
Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.
Here's some sample code (minus the training set, which is very large):
library(caret)
rforest_model <- train(target_variable ~ .,
                       data = train_data_set,
                       method = "ranger",
                       importance = "impurity")
And here's the code for extracting variable importance:
varImp(rforest_model)
The varImp function extracts importance for all variables (even if they are not used by the model); it is only the print method that shows just the top 20. Consider this example:
library(mlbench) #for data set
library(caret)
library(tidyverse)
set.seed(998)
data(Ionosphere)
rforest_model <- train(y = Ionosphere$Class,
                       x = Ionosphere[, 1:34],
                       method = "ranger",
                       importance = "impurity")
nrow(varImp(rforest_model)$importance) #34 variables extracted
Let's check them:
varImp(rforest_model)$importance %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(Overall) %>%
  mutate(rowname = forcats::fct_inorder(rowname)) %>%
  ggplot() +
  geom_col(aes(x = rowname, y = Overall)) +
  coord_flip() +
  theme_bw()
Note that V2 is a zero-variance feature in this data set, hence it has 0 importance and is not used by the model at all.
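If a sorted table is enough and no plot is needed, the importance data frame can also be ordered with base R. The sketch below uses a small made-up data frame as a stand-in for varImp(rforest_model)$importance, so the variable names and values are purely illustrative.

```r
# Stand-in for varImp(rforest_model)$importance (values are made up for illustration)
imp <- data.frame(Overall = c(12.5, 0, 100, 47.3),
                  row.names = c("V1", "V2", "V3", "V4"))

# Sort all rows by importance, largest first; drop = FALSE keeps the data frame shape
imp_sorted <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
imp_sorted

# Variable names in ranked order
rownames(imp_sorted)
```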

How to train non-binary classification rpart with F1 as metric instead of accuracy?

I am using caret for my non-binary (three classes) decision tree classification. My dataset is skewed, so I want to use F1 instead of accuracy for my training and testing. How do I set this?
For an MWE, let's predict the cut in the diamonds dataset:
library(ggplot2)
library(caret)
inTrain <- createDataPartition(diamonds$cut, p=0.75, list=FALSE)
training <- diamonds[inTrain,]
testing <- diamonds[-inTrain,]
fitModel <- train(cut ~ ., training, method = "rpart")
How to use F1 here?
The page at http://topepo.github.io/caret/training.html details how to create a new metric for the train function.
You need to create a new function with three parameters:
data - "is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification)"
lev - "is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function."
model - "is a character string for the model being used"
The function should calculate the F-score for the observed labels and predicted labels in the data object, and name the result after the metric.
For example, here is a function calculating accuracy:
summaryStats <- function(data, lev = NULL, model = NULL) {
  cor <- sum(data$pred == data$obs)
  incor <- sum(data$pred != data$obs)
  out <- cor / (cor + incor)
  names(out) <- c("acc")
  out
}
Then create a new trainControl object and train your model:
fitControl <- trainControl(summaryFunction = summaryStats)
fitModel <- train(cut ~ ., training, method = "rpart", trControl = fitControl, metric = "acc", maximize = TRUE)
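The same pattern can be adapted to an F-score. Below is a minimal sketch of one possible multi-class version, a macro-averaged F1 (the unweighted mean of the per-class F1 values). The name macroF1 and the choice of macro averaging are assumptions for illustration; other averaging schemes may suit a skewed dataset better.

```r
# Hypothetical summary function: macro-averaged F1 across all classes.
# The (data, lev, model) signature follows caret's summaryFunction convention.
macroF1 <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(factor(data$obs))
  f1 <- sapply(lev, function(cl) {
    tp <- sum(data$pred == cl & data$obs == cl)  # true positives for class cl
    fp <- sum(data$pred == cl & data$obs != cl)  # false positives
    fn <- sum(data$pred != cl & data$obs == cl)  # false negatives
    # F1 is taken as 0 when there are no true positives (this also avoids 0/0)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  })
  c(F1 = mean(f1))
}

# Quick check on toy predictions
toy <- data.frame(
  obs  = factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c")),
  pred = factor(c("a", "b", "b", "b", "c", "a"), levels = c("a", "b", "c"))
)
macroF1(toy, lev = levels(toy$obs))
```

It would then be passed as trainControl(summaryFunction = macroF1), with metric = "F1" and maximize = TRUE in the call to train().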

Predict outcome in R

I have been using the predict function in R to predict the outcomes of a randomForest model on a testing set, when it suddenly started returning only the predicted levels instead of the probabilities. I specified the type as response, but it still returns factors. What could possibly cause this?
The data consists of 23 variables, 20 of which are factors (unordered) and two of which are numeric. I am trying to predict whether a product will sell or not (0 or 1). Here is the code for the prediction:
library(randomForest)
rf = randomForest(sold ~., data = train, ntree=200, nodesize=25)
prf <- predict(rf, newdata = test, type ="response")
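Set type = "prob". With type = "response", predict returns the predicted classes for a classification forest, which is why you get factors: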
data(iris)
library(randomForest)
set.seed(1234)
train.key = sort(sample(1:dim(iris)[1], 100))
iris.train = iris[train.key, ]
iris.test = iris[-train.key, ]
rf = randomForest(Species ~ ., data = iris.train)
# the argument is newdata (lower case); a misspelled newData is silently ignored
predicted.prob = predict(rf, newdata = iris.test, type = "prob")
