I'm dealing with a large dataset that involves more than 100 features (which are all relevant because they have already been filtered; the original dataset had over 500 features). I created a random forest model via the train() function from the caret package and using the "ranger" method.
Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.
Here's some sample code (minus the training set, which is very large):
library(caret)
rforest_model <- train(target_variable ~ .,
data = train_data_set,
method = "ranger",
importance = "impurity)
And here's the code for extracting variable importance:
varImp(rforest_model)
The varImp function extracts importance for all variables (even if they are not used by the model), it just prints out the top 20 variables. Consider this example:
library(mlbench) #for data set
library(caret)
library(tidyverse)
set.seed(998)
data(Ionosphere)
rforest_model <- train(y = Ionosphere$Class,
x = Ionosphere[,1:34],
method = "ranger",
importance = "impurity")
nrow(varImp(rforest_model)$importance) #34 variables extracted
lets check them:
varImp(rforest_model)$importance %>%
as.data.frame() %>%
rownames_to_column() %>%
arrange(Overall) %>%
mutate(rowname = forcats::fct_inorder(rowname )) %>%
ggplot()+
geom_col(aes(x = rowname, y = Overall))+
coord_flip()+
theme_bw()
note that V2 is a zero variance feature in this data set hence it has 0 importance and is not used by the model at all.
Related
I have three questions about the sample code below which illustrates the use of step_lencode_mixed.
I read in the vignette that: "For each factor predictor, a generalized linear model is fit to the outcome and the coefficients are returned as the encoding."
In the output from the example below the column 'partial' is the return from step_lencode_mixed. My questions:
Should I use this partial as encoded catagorical variabele "where_town" in the new model to be fitted?
Is there a complete model (Class ~ ., data = okc_train) with all variables on Class fitted in the background and is the contribution from variabele "where_town" returned as partial?
If I convert the partial with the logit2prob function, I notice that the outcome is almost identical to the rate. For that reason I suppose the outcome is not a coefficient?
Thanks a lot!
# ------------------------------------------------------------------------------
# Feature Engineering and Selection: A Practical Approach for Predictive Models
# by Max Kuhn and Kjell Johnson
#
# ------------------------------------------------------------------------------
#
# Code for Section 5.4 at
# https://bookdown.org/max/FES/categorical-supervised-encoding.html
#
# ------------------------------------------------------------------------------
#
# Code requires these packages:
library(tidymodels)
library(embed)
# Create example data ----------------------------------------------------------
load("../Data_Sets/OkCupid/okc.RData")
load("../Data_Sets/OkCupid/okc_binary.RData")
options(width = 120)
partial_rec <-
recipe(Class ~ ., data = okc_train) %>%
step_lencode_mixed(
where_town,
outcome = vars(Class)
) %>%
prep()
okc_train2 <- okc_train %>% select(where_town, Class)
partial_rec2 <-
recipe(Class ~ ., data = okc_train2) %>%
step_lencode_mixed(
where_town,
outcome = vars(Class)
) %>%
prep()
# Organize results -------------------------------------------------------------
partial_pooled <-
tidy(partial_rec, number = 1) %>%
dplyr::select(-terms, -id) %>%
setNames(c("where_town", "partial"))
partial_pooled <- left_join(partial_pooled, okc_props)
partial_pooled2 <-
tidy(partial_rec2, number = 1) %>%
dplyr::select(-terms, -id) %>%
setNames(c("where_town", "partial"))
all.equal(partial_pooled, partial_pooled2)
>
[1] TRUE
Should I use this partial as encoded catagorical variabele "where_town" in the new model to be fitted?
Yes. You don't have to do it manually though. The bake() function does that for you automatically (same as if you include the recipe in a workflow)
Is there a complete model (Class ~ ., data = okc_train) with all variables on Class fitted in the background and is the contribution from variable "where_town" returned as partial?
Yes. There is more information in the tidymodels book (section 17.3).
If I convert the partial with the logit2prob function, I notice that the outcome is almost identical to the rate. For that reason, I suppose the outcome is not a coefficient?
A simpler method to do the conversion to the rate is binomial()$linkinv(partial_pooled$partial).
The value given in the partial column is the log-odds value (hence the negative numbers); we use logistic regression (mixed model) to estimate. It uses an empirical Bayes estimation method that shrinks the coefficient estimates toward the overall (population) estimate.
The amount of shrinkage, for this model, is based on a few things but is mostly driven by the per-category sample size. Smaller sample sizes are affected more than categories with larger amounts of data. So the raw and shrunken estimates for berkeley are about the same since there were 2676 data points there but belvedere_tiburon has larger differences in estimates because the sample size was 35.
I would like to map the function pdp::partial to nested randomForest models. I'll then use the output to plot the 3d partial dependency plots for each group via facet_wrap(). When mapping the function to the models I get an error that the predictor variables can not be found in the training data -- but they are there when I check the tibble so I'm at a loss for what to do.
library(tidyverse)
library(pdp)
library(randomForest)
data(boston)
glimpse(boston)
#Make groups, nest data by groups, apply random forest model to nested data
boston %>%
mutate(grp=ifelse(age<80, "young", "old"))%>%
nest(data= -grp)%>%
mutate(fit = map(data, ~ randomForest(cmedv ~ ., data = boston, importance = TRUE)))%>%
{.->>GrpModels}
#Map pdp::partial to fitted models for two predictor variables
GrpModels%>%
mutate(p=map2(fit,data, ~pdp::partial(fit,train=data, pred.var=c("lstat", "rm"))))%>%
unnest(p)%>%{.->>checkpdp}
Error: Problem with mutate() column p. i p = map2(...). x lstat,
rm not found in the training data.
This seems to work, although I'm not sure why plotting with geom_tile() does not quite do what I thought it would. I used geom_point() instead. In short, I needed to get pred.var as a list and then pass the three inputs (fit, data, and predictor variables) to pmap.
GrpModels %>%
mutate(preds = data.table::transpose(as.list(c('lstat','rm')))) %>%
mutate(p = pmap(list(fit, data, preds),
.f = ~pdp::partial(object=..1, train = ..2,
pred.var = ..3)))%>%
select(-data,-fit,-preds)%>%
unnest_wider(p)%>%
unnest(c(yhat,lstat,rm))%>%{.->>checkpdp}%>%
ggplot(.,aes(x=lstat,y=rm,color=yhat))+
#geom_tile()+
geom_point(shape=15, size=2)+
facet_wrap(~grp, scales='free')
Hi I am using recipes for feature engineering in machine learning models.
However, when I used step_dummy, dummy variables are regarded as numeric variables, not factor.
I think this might be problematic when we use random forest or other tree models.
How can we change this? PDP shows that dummy predictor is treated as numeric. so X-axis has 0.25, 0.5.......
This should have only 0 and 1 (since dummy).
library(modeldata)
library(recipes)
library(caret)
library(ranger)
library(ggplot2)
library(pdp)
data(okc)
okc <- okc[complete.cases(okc),]
rec <- recipe(~ diet + age + height, data = okc)
dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)
dummy_data <- bake(dummies, new_data = okc)
summary(dummy_data)
dummy_data<-na.omit(dummy_data )
dummy_data<-dummy_data[1:2000,]
dummy_data$diet_strictly.anything<-factor(dummy_data$diet_strictly.anything)%>% factor(labels = c("No", "Yes"))
myTrainingControl <- trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
classProbs = TRUE,
summaryFunction = twoClassSummary,
verboseIter = F)
fit_rf <- caret::train(diet_strictly.anything ~ .,
data =dummy_data,
method = "ranger",
tuneLength = 2,
importance = "permutation",
trControl = myTrainingControl)
# Define a prediction function wrapper which requires two arguments
predict.function <- function(object, newdata) {
predict(object, newdata, type="prob")[,2] %>% as.vector()
}
plt_ICE <- pdp::partial(fit_rf,
pred.var = "diet_mostly.vegetarian",
pred.fun = predict.function,
train = dummy_data) %>% autoplot(alpha = 0.1)
plt_ICE
From the step_dummy documentation:
step_dummy creates a a specification of a recipe step that will convert nominal data (e.g. character or factors) into one or more numeric binary model terms for the levels of the original data.
The function appears to be working as expected in this case, by converting the categorical variable diet (stored as a character type in the okc data) into a set of binary numeric variables corresponding to the levels of diet.
If you're treating the variables as outcomes (i.e. trying to predict if someone has a specific type of diet), you're right that the dummy variables should not be encoded as numeric. If you're interested in changing the 'diet' dummies back to factors, a tidy approach might be:
library(tidyverse)
dummy_data <- dummy_data %>%
mutate_at(vars(starts_with('diet')), list(as.factor))
If you're using those dummy variables as predictors, tree-based modeling tools in R (I've primarily used rpart, randomForest and ranger) can handle dummy variables as predictors encoded as numeric, and the interpretation of variable importance measures would be the same as if the variables were encoded as 2-level factors or logical variables.
Here is a workflow that trains an XGB model using tidr list columns, rsmaple folds and purrr map:
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
train_cv <- vfold_cv(training_data, 5) %>%
# create training and validation sets within each fold
mutate(train = map(splits, ~training(.x)),
validate = map(splits, ~testing(.x)))
# xgb across each fold
mod.xgb <- train_cv %>%
# convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>%
mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
mutate(predictions =map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
mutate(validation_actuals = map(validate, ~.x$carat)) %>% # get the actuals for computing evaluation metrics
mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse
My actual script and data uses crossing() and other models with their own hyper parameters in order to pick the best model. So, the real block the above is based on allows me to compare several models since it actually contains several models.
I like this workflow because using dplyr verbs and the pipe operator, I can make changes as needed while progressing through each step, then apply them to each fold using map functions.
Now that I'm at the test phase and passed the cross validation phase, I'd like to emulate that 'flow' except I do not have folds so there is no need for map_* functions.
However, I still need to make transformations such as the one above adding an xgb.DMatrix since I am using xgboost.
Example, below what I created to test my chosen xgb model:
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
# create xgb.DMatrix'
training_data_xgb_matrix <- xgb.DMatrix(training_data %>% select(-price) %>% as.matrix(), label = training_data$price)
test_data_xgb_matrix <- xgb.DMatrix(testing_data %>% select(-price) %>% as.matrix(), label = testing_data$price)
# create a regression
model_xgb <- xgboost(training_data_xgb_matrix, nrounds = 100, objective = "reg:squarederror")
# predict on test data
xgb_predictions <- predict(model_xgb, test_data_xgb_matrix)
# evaluate using rmse
test_rmse <- rmse(actual = testing_data$price, predicted = xgb_predictions)
test_rmse
# 1370.185
So, that is doing it step by step. My question is, can I somehow do this in a similar way to using the approach above during cross validation, particularity just adding a new column to a existing df / list column?
What is the 'tidy' way of evaluating a model on test data? Is it possible to start with training_data, append test data in a new column and start a workflow to reach the same result with rmse in it's own column following a call to mutate()?
training_data %>%
(add test data in a new column) %>%
mutate(convert training data to a xgb.DMatrix) %>%
mutate(convert test data to a xgb.DMatrix) %>%
mutate(fit a regression model based on the training data xgb.DMatrix) %>%
mutate(predict with the regression model on test data xgb.DMatrix) %>%
mutate(calculate rmse)
Is this possible?
Can anyone walk me through how to get the slopes and intercepts produced by the segmented package out and placed in a data frame? This will ultimately be used to line up slopes and intercepts back to their original value. See data (that I took from another post) below.
#load packages
library(segmented)
library(tidyverse)
#set seed and develop data
set.seed(1)
Y<-c(13,21,12,11,16,9,7,5,8,8)
X<-c(74,81,80,79,89,96,69,88,53,72)
age<-c(50.45194,54.89382,46.52569,44.84934,53.25541,60.16029,50.33870,
51.44643,38.20279,59.76469)
dat=data.frame(Y=Y,off.set.term=log(X),age=age)
#run initial GLM
glm.fit=glm(Y~age+off.set.term,data=dat,family=poisson)
summary(glm.fit)
#run segmented glm
glm.fitted.segmented <- segmented(glm.fit, seg.Z=~age + off.set.term, psi =
list(age = c(50,53), off.set.term = c(4.369448)))
#Get summary, slopes and intercepts
summary(glm.fitted.segmented)
slope(glm.fitted.segmented)
intercept(glm.fitted.segmented)
library(broom)
library(dplyr)
library(tidyr)
library(stringr)
slopes <-
bind_rows(lapply(slope(glm.fitted.segmented), tidy), .id = "variable") %>%
mutate(type = str_extract(.rownames, "^[a-z]+"),
model = str_extract(.rownames, "[0-9]+$")) %>%
select(variable, model, type, estimate = "Est.")
intercepts <-
bind_rows(lapply(intercept(glm.fitted.segmented), tidy), .id = "variable") %>%
mutate(type = str_extract(.rownames, "^[a-z]+"),
model = str_extract(.rownames, "[0-9]+$")) %>%
select(variable, model, type, estimate = "Est.")
bind_rows(slopes, intercepts) %>%
spread(type, estimate)
Using the tidy function, you can easily pull out the data.frame for each variable then extract the model and type of unit. Bind it all together and spread the type and estimate value to end with variable, model, intercept, and slope.