See vip for each class in a multi-class random forest?

In this example of multi-class classification using a random forest model, the author creates a vip chart as shown below. Is there a way to see which variables are influencing the model to predict each of the response categories? For example, I'd like to be able to see that variable "x" is driving something to be classified as volcano type "x" (a response category).
https://juliasilge.com/blog/multinomial-volcano-eruptions/
library(vip)

rf_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(
    volcano_type ~ .,
    data = juice(volcano_prep) %>%
      select(-volcano_number) %>%
      janitor::clean_names()
  ) %>%
  vip(geom = "point")

Created an explainer using explain_tidymodels(), then followed the DALEX multilabel vignette; a sketch of that approach follows the links below.
https://www.tmwr.org/explain.html
https://modeloriented.github.io/DALEX/articles/multilabel_classification.html
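A minimal sketch of that approach, assuming a fitted parsnip model rf_fit and its training data volcano_df (both names are mine, not from the post): build one "one vs. rest" explainer per volcano type, then compute permutation importance per class with model_parts().

library(DALEX)
library(dplyr)

explainers <- lapply(levels(volcano_df$volcano_type), function(cls) {
  DALEX::explain(
    rf_fit,
    data = dplyr::select(volcano_df, -volcano_type),
    y = as.integer(volcano_df$volcano_type == cls),
    predict_function = function(model, newdata) {
      # probability of this single class from parsnip's type = "prob" output
      predict(model, new_data = newdata, type = "prob")[[paste0(".pred_", cls)]]
    },
    label = cls
  )
})

# permutation importance per class
importance_by_class <- lapply(explainers, model_parts)
plot(importance_by_class[[1]])  # variables driving the first volcano type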

Related

Extracting estimates with ranger decision trees

I am getting the error message "Error: No tidy method for objects of class ranger" when trying to extract the estimates for a regression model built with the ranger package in R.
Here is my code:
# libraries
library(tidymodels)
library(textrecipes)
library(LiblineaR)
library(ranger)
library(tidytext)

# create the recipe
comments.rec <- recipe(year ~ comments, data = oa.comments) %>%
  step_tokenize(comments, token = "ngrams", options = list(n = 2, n_min = 1)) %>%
  step_tokenfilter(comments, max_tokens = 1e3) %>%
  step_stopwords(comments, stopword_source = "stopwords-iso") %>%
  step_tfidf(comments) %>%
  step_normalize(all_predictors())
# workflow with recipe
comments.wf <- workflow() %>%
  add_recipe(comments.rec)

# create the regression model using support vector machine
svm.spec <- svm_linear() %>%
  set_engine("LiblineaR") %>%
  set_mode("regression")

svm.fit <- comments.wf %>%
  add_model(svm.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the support vector machine model
svm.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)
Below is the table of estimates for each tokenized term in the data set (this is a dirty data set for demo purposes):
   term                     estimate
   <chr>                       <dbl>
 1 Bias                    2015.
 2 tfidf_comments_2021        0.877
 3 tfidf_comments_2019        0.851
 4 tfidf_comments_2020        0.712
 5 tfidf_comments_2018        0.641
 6 tfidf_comments_https       0.596
 7 tfidf_comments_plan s      0.462
 8 tfidf_comments_plan        0.417
 9 tfidf_comments_2017        0.399
10 tfidf_comments_libraries   0.286
However, when using the ranger engine to create a regression model with random forests, I have no such luck and get the error message above:
# create the regression model using random forests
rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf.fit <- comments.wf %>%
  add_model(rf.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the random forests model
rf.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)
To put this back to you in a simpler form that I think highlights the issue: if you had a decision tree model, how would you produce coefficients for the data in the dataset? What would they mean?
I think what you are looking for here is some form of attribution for each column. There are tools to do this built into tidymodels, but you should read up on what they actually report.
You can get a basic idea of what those numbers look like by using the vip package, though the numbers it produces are definitely not directly comparable to your SVM ones.
install.packages("vip")
library(vip)

rf.fit %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  labs(title = "Random forest variable importance")
This produces a plot of relative importance scores. To get the numbers themselves:
rf.fit %>%
  pull_workflow_fit() %>%
  vi()
tidymodels has a decent walkthrough of this (linked below), but given that you have a model that can estimate importance scores, you should be good to go.
Tidymodels tutorial page - 'A case study'
Edit: if you HAVEN'T done this, you may need to refit your initial model with an extra argument passed in the set_engine() step that tells ranger what kind of importance scores you want and how they should be computed; see the sketch below.
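For example, a minimal sketch reusing rf.spec from the question (ranger accepts importance modes such as "permutation" or "impurity"):

# refit with an importance mode so ranger records importance scores
rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger", importance = "permutation") %>%  # or "impurity"
  set_mode("regression")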

How to Get Variable/Feature Importance From Tidymodels ranger object?

I have a ranger object from the tidymodels rand_forest function:
rf <- rand_forest(mode = "regression", trees = 1000) %>%
  fit(pay_rate ~ age + profession)
I want to get the feature importance of each variable (I have many more than in this example). I've tried things like rf$variable.importance, or importance(rf), but the former returns NULL and the latter function doesn't exist. I tried using the vip package, but that doesn't work for a ranger object. How can I extract feature importances from this object?
You need to add importance = "impurity" when you set the engine for ranger. This will provide variable importance scores. Once this is set, you can use extract_fit_parsnip with vip to plot the variable importance.
A small example:
library(tidymodels)
library(vip)

rf_mod <- rand_forest(mode = "regression", trees = 100) %>%
  set_engine("ranger", importance = "impurity")

rf_recipe <- recipe(mpg ~ ., data = mtcars)

rf_workflow <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)

rf_workflow %>%
  fit(mtcars) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
More information is available in the tidymodels Get Started guide.
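If you want the scores as a table rather than a plot, vip's vi() returns them as a tibble; a small addition to the example above (not part of the original answer):

rf_workflow %>%
  fit(mtcars) %>%
  extract_fit_parsnip() %>%
  vi()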

How can I pull the slope and intercept values produced by the segmented package into a data frame using R?

Can anyone walk me through how to get the slopes and intercepts produced by the segmented package out and placed in a data frame? This will ultimately be used to match the slopes and intercepts back to their original values. See the data (which I took from another post) below.
# load packages
library(segmented)
library(tidyverse)

# set seed and develop data
set.seed(1)
Y <- c(13, 21, 12, 11, 16, 9, 7, 5, 8, 8)
X <- c(74, 81, 80, 79, 89, 96, 69, 88, 53, 72)
age <- c(50.45194, 54.89382, 46.52569, 44.84934, 53.25541, 60.16029, 50.33870,
         51.44643, 38.20279, 59.76469)
dat <- data.frame(Y = Y, off.set.term = log(X), age = age)

# run initial GLM
glm.fit <- glm(Y ~ age + off.set.term, data = dat, family = poisson)
summary(glm.fit)

# run segmented glm
glm.fitted.segmented <- segmented(glm.fit, seg.Z = ~age + off.set.term,
                                  psi = list(age = c(50, 53),
                                             off.set.term = c(4.369448)))

# get summary, slopes, and intercepts
summary(glm.fitted.segmented)
slope(glm.fitted.segmented)
intercept(glm.fitted.segmented)
library(broom)
library(dplyr)
library(tidyr)
library(stringr)

slopes <-
  bind_rows(lapply(slope(glm.fitted.segmented), tidy), .id = "variable") %>%
  mutate(type = str_extract(.rownames, "^[a-z]+"),
         model = str_extract(.rownames, "[0-9]+$")) %>%
  select(variable, model, type, estimate = "Est.")

intercepts <-
  bind_rows(lapply(intercept(glm.fitted.segmented), tidy), .id = "variable") %>%
  mutate(type = str_extract(.rownames, "^[a-z]+"),
         model = str_extract(.rownames, "[0-9]+$")) %>%
  select(variable, model, type, estimate = "Est.")

bind_rows(slopes, intercepts) %>%
  spread(type, estimate)
Using the tidy() function, you can easily pull out a data frame for each variable, then extract the model number and the type of term. Bind it all together and spread the type and estimate columns to end up with variable, model, intercept, and slope.

Extracting more than 20 variables by importance via varImp

I'm dealing with a large dataset that involves more than 100 features (all of them relevant, because they have already been filtered; the original dataset had over 500 features). I created a random forest model via the train() function from the caret package, using the "ranger" method.
Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.
Here's some sample code (minus the training set, which is very large):
library(caret)

rforest_model <- train(target_variable ~ .,
                       data = train_data_set,
                       method = "ranger",
                       importance = "impurity")
And here's the code for extracting variable importance:
varImp(rforest_model)
The varImp function extracts importance for all variables (even if they are not used by the model); it just prints only the top 20. Consider this example:
library(mlbench)  # for the data set
library(caret)
library(tidyverse)

set.seed(998)
data(Ionosphere)

rforest_model <- train(y = Ionosphere$Class,
                       x = Ionosphere[, 1:34],
                       method = "ranger",
                       importance = "impurity")

nrow(varImp(rforest_model)$importance)  # 34 variables extracted
Let's check them:
varImp(rforest_model)$importance %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(Overall) %>%
  mutate(rowname = forcats::fct_inorder(rowname)) %>%
  ggplot() +
  geom_col(aes(x = rowname, y = Overall)) +
  coord_flip() +
  theme_bw()
Note that V2 is a zero-variance feature in this data set, hence it has zero importance and is not used by the model at all.
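As a side note, if you only want the console printout to show more than 20 rows, the print method for varImp objects takes a top argument (a small sketch, not part of the original answer):

# print all rows instead of the default top 20
imp <- varImp(rforest_model)
print(imp, top = nrow(imp$importance))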

Fitting several regression models after group_by with dplyr and applying the resulting models to test sets

I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in Fitting several regression models with dplyr, I wrote the following code:
lifetimemodels <- data %>%
  group_by(lifetime) %>%
  sample_frac(0.7) %>%
  do(lifeModel = glm(churn ~ ., x = TRUE, family = binomial(link = "logit"), data = .))
My question now is how I can use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not chosen), which should again be grouped by lifetime.
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework: use grouping/nesting together with the mutate and map functions to create list columns that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework; specifically, the crossv_mc and crossv_kfold functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

analysis <- mtcars %>%
  nest(-cyl) %>%
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))
This:
- takes mtcars
- nests it into a list column called data, grouped by cyl
- splits each data element into training and test sets by mapping crossv_mc over it, then uses unnest to create the train and test list columns
- maps the lm model over each train set and stores it in model
- maps the predict function over model and train and stores the result in pred
- maps the rmse function over model and the test sets and stores the result in error
There are probably users out there more familiar with this workflow than I am, so please correct/elaborate.
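To bring this back to the original question, below is an untested sketch of the same pattern adapted to one logistic model per lifetime group, scored by AUC on the held-out 30%. The data, lifetime, and churn names come from the question; using pROC for the AUC is my own assumption.

library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
library(pROC)  # assumed choice for computing AUC

per_lifetime <- data %>%
  nest(-lifetime) %>%
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~glm(churn ~ ., family = binomial(link = "logit"),
                                 data = as.data.frame(.x)))) %>%
  mutate(auc = map2_dbl(model, test, function(m, t) {
    test_df <- as.data.frame(t)  # convert the resample object to a data frame
    probs <- predict(m, newdata = test_df, type = "response")
    as.numeric(pROC::auc(test_df$churn, probs))
  }))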
