I have a question regarding the feature importance function in the Caret package.
I have a dataset with both numeric and factor features.
I used the command below to get the feature importance of the model. It gives me the importance of each sub-feature (level) of the factor variables. However, I just want the importance of the feature itself, without going into detail for each level of the factor.
gbmImp <- caret::varImp(xgb1, scale = TRUE)
I will create some example data as we don't have any from your question:
library(caret)
# example data
df <- data.frame("x" = rnorm(100),
"fac" = as.factor(sample(c(rep("A", 30), rep("B", 35), rep("C", 35)))),
"y" = as.numeric((rpois(100, 4))))
# model
model <- train(y ~ ., method = "glm", data = df)
# feature importance
varImp(model, scale = TRUE)
This returns the feature importance broken down by factor level, which is what you do not want:
# glm variable importance
#
# Overall
# facB 100.00
# facC 13.08
# x 0.00
You can convert the factor variables to numeric and do the same thing:
# make the factor variable numeric
trans_df <- transform(df, fac = as.numeric(fac))
# model
trans_model <- train(y ~ ., method = "glm", data = trans_df)
# feature importance
varImp(trans_model, scale = TRUE)
This returns a single 'overall' importance value for the factor feature:
# glm variable importance
#
# Overall
# x 100
# fac 0
However, I do not know whether the as.numeric() conversion of the factor variable changes the resulting feature importance when we run varImp(trans_model, scale = TRUE).
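If you would rather not convert the factor at all, another option is to aggregate the per-level importances back to the parent feature. This is only a sketch (not part of the original answer); it assumes the dummy columns are named <factor><level>, as in the varImp() output above, and summing is just one possible aggregation:
# collapse the facB/facC rows into a single "fac" row by summing their importances
imp <- varImp(model, scale = TRUE)$importance
imp$feature <- ifelse(grepl("^fac", rownames(imp)), "fac", rownames(imp))
aggregate(Overall ~ feature, data = imp, FUN = sum)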
Also, check out this SO thread if you find that your specific factor/character variables are problematic when converting to numeric.
I am using the package neuralnet for R and would like to use supervised learning.
In my setting I have 15 explanatory variables (8 of them are dummy variables, every dummy contains 0 and 1 as values, the other explanatory variables are real numbered).
However, I want to use all explanatory variables to predict a target vector (real numbered).
So, my setting is a regression problem.
If I run my neural net without the dummies, the neuralnet()-function produces results.
However, when I incorporate the dummies, I get an error message.
Obviously, the dummies cause the error. Running the function without them works fine.
How can I make neuralnet handle the dummies properly and produce an output?
Please find below a reproducible example as well as the neural network setting:
#### install packages
# install.packages("devtools")
# require(devtools)
# devtools::install_github("bips-hb/neuralnet") # CRAN version contains bug, use github version
# require(neuralnet)
### create data
set.seed(1)
dt <- matrix(rnorm(200 * 3), nrow = 200, ncol = 3) # 200 x 3 matrix of random data (rnorm(200) would be recycled into identical columns)
dummy1 <- as.factor( c(rep(1,100), rep(0,100)) ) # create vector with data (1,0) for first dummy, save as factor
dummy2 <- as.factor( c(rep(0,100), rep(1,100)) ) # create vector with data (0,1) for second dummy, save as factor
dummy_df <- data.frame(dummy1, dummy2) # merging both dummies into dataframe
class(dummy_df[,1]) # factor
class(dummy_df[,2]) # factor
# bringing original data and dummies together
train <- cbind(as.data.frame(dt), dummy_df)
# see colnames
colnames(train)
# start neural net
nnet <- neuralnet(formula = V1 ~ V2 + V3 + dummy1 + dummy2, # use V1 as target for supervised learning
data = train,
hidden = 1, # neurons
threshold = 0.01, # stop once the partial derivatives of the error function fall below this value
rep = 5, # number of training repetitions
startweights = NULL, # starting weights
learningrate.factor = list(minus = 0.5, plus = 1.2), # factors for decreasing (minus) and increasing (plus) the learning rate
algorithm = "rprop+", # resilient backpropagation with weight backtracking
err.fct = "sse", # use sum squared errors as error function
act.fct = "tanh", # use hyperbolic tangent
linear.output = TRUE, # output function is linear, regression problem
lifesign = "full", # print behavior
stepmax = 200000)
Thank you for your help!
It appears that neuralnet does not take factors as input (predictors). Just use the numeric values of the dummies, then it works.
# note: as.numeric() on a factor returns the level codes (here 1/2), not the original 0/1 labels
dummy1 <- as.numeric(dummy1)
dummy2 <- as.numeric(dummy2)
dummy_df <- data.frame(dummy1, dummy2)
This works fine as long as you have factors with only two levels. If you have more than two levels, use dummy coding.
Adding to Dominik's answer, which seems absolutely correct: incorporating the dummy variables is rather simple, either with a package (I believe dummies is one) or with base R and model.matrix, as shown below
dt <- matrix(rnorm(200 * 3), nrow = 200, ncol = 3) # 200 x 3 matrix of random data (rnorm(200) would be recycled into identical columns)
dummy1 <- as.factor( c(rep(1,100), rep(0,100)) ) # create vector with data (1,0) for first dummy, save as factor
dummy2 <- as.factor( c(rep(0,100), rep(1,100)) ) # create vector with data (0,1) for second dummy, save as factor
dummy_df <- data.frame(dummy1, dummy2) # merging both dummies into dataframe
mm <- model.matrix(~ dummy1 + dummy2 - 1, data = dummy_df) #-1 removes intercept.
train <- cbind(as.data.frame(dt), mm) # as.data.frame() supplies the V1/V2/V3 column names used in the formula below
neuralnet::neuralnet(V1 ~ . , data = train) #possibly adding -1 to the formula is sensible as well.
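For a factor with more than two levels (the case mentioned above), the same model.matrix approach yields one 0/1 column per non-reference level. A tiny sketch with an illustrative three-level factor f:
f <- factor(c("low", "mid", "high", "mid")) # three-level factor, for illustration only
model.matrix(~ f)[, -1] # two 0/1 dummy columns (treatment coding); [, -1] drops the intercept column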
Hi, I am using recipes for feature engineering in machine learning models.
However, when I use step_dummy, the dummy variables are treated as numeric variables, not factors.
I think this might be problematic when we use random forests or other tree models.
How can we change this? The PDP shows that the dummy predictor is treated as numeric, so the x-axis has values like 0.25, 0.5, and so on.
It should have only 0 and 1 (since it is a dummy).
library(modeldata)
library(recipes)
library(caret)
library(ranger)
library(ggplot2)
library(pdp)
data(okc)
okc <- okc[complete.cases(okc),]
rec <- recipe(~ diet + age + height, data = okc)
dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)
dummy_data <- bake(dummies, new_data = okc)
summary(dummy_data)
dummy_data <- na.omit(dummy_data)
dummy_data <- dummy_data[1:2000, ]
dummy_data$diet_strictly.anything <- factor(dummy_data$diet_strictly.anything) %>% factor(labels = c("No", "Yes"))
myTrainingControl <- trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
classProbs = TRUE,
summaryFunction = twoClassSummary,
verboseIter = F)
fit_rf <- caret::train(diet_strictly.anything ~ .,
data =dummy_data,
method = "ranger",
tuneLength = 2,
importance = "permutation",
trControl = myTrainingControl)
# Define a prediction function wrapper which requires two arguments
predict.function <- function(object, newdata) {
predict(object, newdata, type="prob")[,2] %>% as.vector()
}
plt_ICE <- pdp::partial(fit_rf,
pred.var = "diet_mostly.vegetarian",
pred.fun = predict.function,
train = dummy_data) %>% autoplot(alpha = 0.1)
plt_ICE
From the step_dummy documentation:
step_dummy creates a specification of a recipe step that will convert nominal data (e.g. character or factors) into one or more numeric binary model terms for the levels of the original data.
The function appears to be working as expected in this case, by converting the categorical variable diet (stored as a character type in the okc data) into a set of binary numeric variables corresponding to the levels of diet.
If you're treating the variables as outcomes (i.e. trying to predict if someone has a specific type of diet), you're right that the dummy variables should not be encoded as numeric. If you're interested in changing the 'diet' dummies back to factors, a tidy approach might be:
library(tidyverse)
dummy_data <- dummy_data %>%
mutate_at(vars(starts_with('diet')), list(as.factor))
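Alternatively (this is not part of the original answer, and it assumes the diet_* column naming produced by step_dummy above), recipes can do the conversion inside the recipe itself with step_bin2factor, which turns 0/1 dummies into two-level factors:
rec_fct <- recipe(~ diet + age + height, data = okc) %>%
  step_dummy(diet) %>%
  step_bin2factor(starts_with("diet_")) # default factor levels are "yes"/"no"
dummy_fct <- bake(prep(rec_fct, training = okc), new_data = okc)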
If you're using those dummy variables as predictors, tree-based modeling tools in R (I've primarily used rpart, randomForest and ranger) can handle dummy variables as predictors encoded as numeric, and the interpretation of variable importance measures would be the same as if the variables were encoded as 2-level factors or logical variables.
I've fitted a mixed model using the lme4 package. I transformed my independent variables with the scale() function prior to fitting the model. I now want to display my results on a graph using predict(), so I need the predicted data to be back on the original scale. How do I do this?
Simplified example:
database <- mtcars
# Scale data
database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)
# Make model
model.1 <- glmer(vs ~ scale(wt) + scale(am) + (1|carb), database, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
xweight <- as.data.frame(lapply(lapply(database[, -1], mean), rep, 100))
# make new values for wt
xweight$wt <- (wt = seq(min(database$wt), max(database$wt), length = 100))
# predict from new values
a <- predict(model.1, newdata = xweight, type="response", re.form=NA)
# returns scaled prediction
I've tried using this example to back-transform the predictions:
# save scale and center values
scaleList <- list(scale = attr(database$wt, "scaled:scale"),
center = attr(database$wt, "scaled:center"))
# back-transform predictions
a.unscaled <- a * scaleList$scale + scaleList$center
# Make model with unscaled data to compare
un.model.1 <- glmer(vs ~ wt + am + (1|carb), mtcars, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
un.xweight <- as.data.frame(lapply(lapply(mtcars[, -1], mean), rep, 100))
# make new values for wt
un.xweight$wt <- (wt = seq(min(mtcars$wt), max(mtcars$wt), length = 100))
# predict from new values
b <- predict(un.model.1, newdata = xweight, type="response", re.form=NA)
all.equal(a.unscaled,b)
# [1] "Mean relative difference: 0.7223061"
This doesn't work - there shouldn't be any difference. What have I done wrong?
I've also looked at a number of similar questions but not managed to apply any to my case (How to unscale the coefficients from an lmer()-model fitted with a scaled response, unscale and uncenter glmer parameters, Scale back linear regression coefficients in R from scaled and centered data, https://stats.stackexchange.com/questions/302448/back-transform-mixed-effects-models-regression-coefficients-for-fixed-effects-f).
The problem with your approach is that it only "unscales" based on the wt variable, whereas you scaled all of the variables in your regression model. One approach that works is to adjust all of the variables in your new (prediction) data frame using the centering/scaling values that were used on the original data frame:
## scale variable x using center/scale attributes
## of variable y
scfun <- function(x,y) {
scale(x,
center=attr(y,"scaled:center"),
scale=attr(y,"scaled:scale"))
}
## scale prediction frame
xweight_sc <- transform(xweight,
wt = scfun(wt, database$wt),
am = scfun(am, database$am))
## predict
p_unsc <- predict(model.1,
newdata=xweight_sc,
type="response", re.form=NA)
Comparing this p_unsc to your predictions from the unscaled model (b in your code), i.e. all.equal(b,p_unsc), gives TRUE.
Another reasonable approach would be to
unscale/uncenter all of your parameters using the "unscaling" approaches presented in one of the linked questions (such as this one), generating a coefficient vector beta_unsc
construct the appropriate model matrix from your prediction frame:
X <- model.matrix(formula(model.1, fixed.only = TRUE)[-2], # [-2] drops the response from the formula
                  data = pred_frame)
compute the linear predictor and back-transform:
pred <- plogis(X %*% beta_unsc)
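Putting those three steps together for this example, here is a minimal sketch (not from the original answer; it assumes only wt and am were scaled and reuses the centering/scaling attributes stored on database$wt and database$am):
## step 1: move the fixed-effect coefficients back to the original (mtcars) scale
cc  <- lme4::fixef(model.1)                     # (Intercept), scale(wt), scale(am)
cen <- c(wt = attr(database$wt, "scaled:center"),
         am = attr(database$am, "scaled:center"))
sc  <- c(wt = attr(database$wt, "scaled:scale"),
         am = attr(database$am, "scaled:scale"))
beta_unsc <- c(cc[1] - sum(cc[-1] * cen / sc),  # intercept absorbs the centering terms
               cc[-1] / sc)                     # slopes per original unit
## step 2: model matrix from a prediction frame on the original scale
pred_frame <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length = 100),
                         am = mean(mtcars$am))
X <- model.matrix(~ wt + am, data = pred_frame)
## step 3: linear predictor, pushed through the inverse link (logit model)
pred <- plogis(X %*% beta_unsc)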
The df is split into the train and test data frames. The train data frame is split into training and testing data frames. The dependent variable Y is binary (a factor) with values 0 and 1. I'm trying to predict the probability with this code (neural networks, caret package):
library(caret)
model_nn <- train(
Y ~ ., training,
method = "nnet",
metric="ROC",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE,
classProbs=TRUE
)
)
model_nn_v2 <- model_nn
nnprediction <- predict(model_nn, testing, type="prob")
cmnn <-confusionMatrix(nnprediction,testing$Y)
print(cmnn) # The confusion matrix is to assess/compare the model
However, it gives me this error:
Error: At least one of the class levels is not a valid R variable name;
This will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels
that can be used as valid R variable names (see ?make.names for help).
I don't understand what "use factor levels that can be used as valid R variable names" means. The dependent variable Y is already a factor, so why are its levels not valid R variable names?
PS: The code works perfectly if you erase classProbs=TRUE in trainControl() and metric="ROC" in train(). However, the "ROC" metric is my metric of comparison for the best model in my case, so I'm trying to make a model with "ROC" metric.
EDIT: Code example:
# You have to run all of this BEFORE running the model
classes <- c("a","b","b","c","c")
floats <- c(1.5,2.3,6.4,2.3,12)
dummy <- c(1,0,1,1,0)
chr <- c("1","2","2,","3","4")
Y <- c("1","0","1","1","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
classes <- c("a","a","a","b","c")
floats <- c(5.5,2.6,7.3,54,2.1)
dummy <- c(0,0,0,1,1)
chr <- c("3","3","3,","2","1")
Y <- c("1","1","1","0","0")
df <- cbind(classes, floats, dummy, chr, Y)
df <- as.data.frame(df)
df$floats <- as.numeric(df$floats)
df$dummy <- as.numeric(df$dummy)
There are two separate issues here.
The first is the error message, which says it all: you have to use something other than "0" and "1" as the values of your dependent factor variable Y.
You can do this in at least two ways after you have built your data frame df; the first one is hinted at in the error message, i.e. use make.names:
df$Y <- make.names(df$Y)
df$Y
# "X1" "X1" "X1" "X0" "X0"
The second way is to use the levels function, by which you will have explicit control over the names themselves; showing it here again with names X0 and X1
levels(df$Y) <- c("X0", "X1")
df$Y
# [1] X1 X1 X1 X0 X0
# Levels: X0 X1
After adding either one of the above lines, the shown train() code will run smoothly (replacing training with df), but it will still not produce any ROC values, giving instead the warning:
Warning messages:
1: In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
which brings us to the second issue here: in order to use the ROC metric, you have to add summaryFunction = twoClassSummary to the trControl argument of train():
model_nn <- train(
Y ~ ., df,
method = "nnet",
metric="ROC",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE,
classProbs=TRUE,
summaryFunction = twoClassSummary # ADDED
)
)
Running the above snippet with the toy data you have provided still gives an error (missing ROC values), but probably this is due to the very small dataset used here combined with the large number of CV folds, and it will not happen with your own, full dataset (it works OK if I reduce the CV folds to number=3)...
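For reference, the reduced-folds run mentioned above only changes the resampling specification (ctrl_small is just an illustrative name):
# fewer CV folds for the tiny toy data; as noted above, number = 3 worked here
ctrl_small <- trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# model_nn <- train(Y ~ ., df, method = "nnet", metric = "ROC", trControl = ctrl_small)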
How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)
will fail, as the Purchase variable is missing. If I instead replace it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification problem but a regression problem.
At least the example code seems to have a few issues indicated in the comments below. To answer your questions:
The result of ifelse is a numeric vector, not a factor, so the train function defaults to regression (see the short illustration after this list)
Passing the dummyVars output directly to train is done by using the train(x = , y = , ...) interface instead of a formula
To avoid these problems, check the class of your objects carefully.
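As a short illustration of the first point (not from the original answer; the level labels are just an example), wrapping the recoded target in factor() keeps train() in classification mode:
# illustrative only: factor() keeps the recoded target categorical rather than numeric
purchase_fac <- factor(ifelse(data$Purchase == "CH", 1, 0), labels = c("MM", "CH"))
class(purchase_fac) # "factor", so train() would treat this as classification, not regression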
Be aware that option preProcess in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also shift the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
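To evaluate the Option 2 fit on the held-out half, you could reattach the target to the test data in the same way and compare predictions (a small addition, not part of the original answer):
# Reattach the target to the test data (it was also dropped by predict(dummies, ...))
test$Purchase <- data$Purchase[-train_idx]
# Predict on the held-out half and compare against the true labels
pred <- predict(modelFit, newdata = test)
confusionMatrix(pred, test$Purchase)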