How to map pdp::partial to nested randomForest models?

I would like to map the function pdp::partial over nested randomForest models, then use the output to plot the 3D partial dependence plots for each group via facet_wrap(). When mapping the function over the models I get an error that the predictor variables cannot be found in the training data -- but they are there when I check the tibble, so I'm at a loss for what to do.
library(tidyverse)
library(pdp)
library(randomForest)

data(boston)
glimpse(boston)

# Make groups, nest the data by group, and fit a random forest to each group
boston %>%
  mutate(grp = ifelse(age < 80, "young", "old")) %>%
  nest(data = -grp) %>%
  # note: each model should be fit on the nested data (.x), not the full boston data
  mutate(fit = map(data, ~ randomForest(cmedv ~ ., data = .x, importance = TRUE))) %>%
  {. ->> GrpModels}
#Map pdp::partial to fitted models for two predictor variables
# Map pdp::partial over the fitted models for two predictor variables
GrpModels %>%
  mutate(p = map2(fit, data,
                  ~ pdp::partial(fit, train = data, pred.var = c("lstat", "rm")))) %>%
  unnest(p) %>%
  {. ->> checkpdp}
Error: Problem with `mutate()` column `p`.
i `p = map2(...)`.
x `lstat`, `rm` not found in the training data.

This seems to work, although I'm not sure why plotting with geom_tile() does not quite do what I expected, so I used geom_point() instead. In short, I needed to supply pred.var as a list and then pass the three inputs (fit, data, and predictor variables) to pmap().
GrpModels %>%
  mutate(preds = data.table::transpose(as.list(c("lstat", "rm")))) %>%
  mutate(p = pmap(list(fit, data, preds),
                  .f = ~ pdp::partial(object = ..1, train = ..2, pred.var = ..3))) %>%
  select(-data, -fit, -preds) %>%
  unnest_wider(p) %>%
  unnest(c(yhat, lstat, rm)) %>%
  {. ->> checkpdp} %>%
  ggplot(aes(x = lstat, y = rm, color = yhat)) +
  # geom_tile() +
  geom_point(shape = 15, size = 2) +
  facet_wrap(~ grp, scales = "free")
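Incidentally, the original map2() call may have failed simply because the formula referred to the column names fit and data rather than map2()'s lambda arguments .x and .y; inside the formula, those names resolve to the whole list columns rather than to individual elements. A minimal sketch of that fix (the standard purrr idiom, assuming the models were fit on the nested data):

GrpModels %>%
  mutate(p = map2(fit, data,
                  ~ pdp::partial(.x, train = .y,
                                 pred.var = c("lstat", "rm")))) %>%
  unnest(p)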

Related

Function to update `ggplot2::labs(caption=)` using data passed to `ggplot2::ggplot(data=)`

I've recently written my first ggplot2 stat and geom methods. I want to write another that uses the data passed to ggplot2::ggplot(data=) to add a p-value as a caption to the figure. Is that possible?
For example, I would like to write something like this:
library(ggplot2)

mtcars |>
  ggplot(aes(x = mpg, y = cyl)) +
  add_pvalue()
Here, add_pvalue() would calculate a p-value (e.g. an ANOVA p-value for different mean MPG by number of cylinders) and add it as a caption, labs(caption = "p = 0.45").
Thank you!
Daniel, it's possible. You can use this example. Hope that helps!
library(ggplot2)
library(glue)

p_value <- 0.05

mtcars |>
  ggplot(aes(x = mpg, y = cyl)) +
  labs(caption = glue("p = {p_value}"))
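If you want the caption computed rather than hardcoded, you could calculate the ANOVA p-value first; a sketch using base R's aov() (the signif() rounding is just one formatting choice):

library(ggplot2)
library(glue)

# ANOVA p-value for mean mpg differing by number of cylinders
p_value <- summary(aov(mpg ~ factor(cyl), data = mtcars))[[1]][["Pr(>F)"]][1]

mtcars |>
  ggplot(aes(x = mpg, y = cyl)) +
  labs(caption = glue("p = {signif(p_value, 2)}"))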
You could do something like the following, picking your preferred statistical model, "types" of p-values, and formatting of the p-value. If you wanted to build in lots of functionality to make it useful for a wide variety of models, you'd want to add conditional extractor functions for those models.
# Packages
library(ggplot2)
library(dplyr)
library(rlang)

# Define add_pvalue(): adds the p-value from a linear regression of y on x.
# Note that this assumes x and y are reals or integers.
add_pvalue <- function(ggplot_obj) {
  # Get the x and y variable names from the ggplot object
  x <- ggplot_obj$mapping$x %>%
    rlang::quo_get_expr() %>%
    deparse()
  y <- ggplot_obj$mapping$y %>%
    rlang::quo_get_expr() %>%
    deparse()
  # Build the regression formula, fit the model, keep the model summary
  mod <- paste0(y, " ~ ", x) %>%
    as.formula() %>%
    lm(data = ggplot_obj$data) %>%
    summary()
  # Extract the two-tailed t-test p-value from the model object (reformat as desired)
  pval <- mod$coefficients[x, "Pr(>|t|)"]
  # Add the p-value as the plot caption
  ggplot_obj +
    labs(caption = paste0("p = ", pval))
}
# Example with the p-value for a linear model plus 95% confidence intervals
mtcars %>%
  ggplot(aes(x = mpg, y = cyl)) %>%
  add_pvalue() +
  geom_smooth(method = "lm", se = TRUE, level = 0.95)
#> `geom_smooth()` using formula 'y ~ x'
Note that blindly fitting a linear regression or ANOVA to your data is probably not the best decision, since x or y may not be real or integer types. If they aren't, this won't really make sense: some models will throw runtime errors, while others silently one-hot encode other variable types.
Similarly, the p-values you obtain may be utterly meaningless if, for example, each row in the data is not an independent observation, you run many models in a way that invalidates the sampling assumptions behind p-values, your hypothesis doesn't match the test, etc.
Finally, you could also try using the output of stat_smooth(), which is produced when you call geom_smooth(). The upside is that you wouldn't need to fit the model twice to get both that geom and the p-value (using the standard error and coefficients plus a normal approximation to get the p-value). That's a bit outside the scope here and more limiting, since you're stuck with the models stat_smooth() employs, and the same issues plague those as well. It's also pretty annoying to extract those values: Method to extract stat_smooth line fit
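If you do want to reuse stat_smooth()'s computations rather than refit, one approach is to pull the built layer data with ggplot_build(); a rough sketch (the [[1]] index assumes the smooth is the plot's first layer):

# Build the plot once, then extract the data computed for the smooth layer
p <- ggplot(mtcars, aes(x = mpg, y = cyl)) +
  geom_smooth(method = "lm", se = TRUE)
smooth_data <- ggplot_build(p)$data[[1]] # includes x, y, ymin, ymax, se
head(smooth_data)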

extracting more than 20 variables by importance via varImp

I'm dealing with a large dataset that involves more than 100 features (which are all relevant because they have already been filtered; the original dataset had over 500 features). I created a random forest model via the train() function from the caret package and using the "ranger" method.
Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.
Here's some sample code (minus the training set, which is very large):
library(caret)

rforest_model <- train(target_variable ~ .,
                       data = train_data_set,
                       method = "ranger",
                       importance = "impurity")
And here's the code for extracting variable importance:
varImp(rforest_model)
The varImp() function extracts importance for all variables (even if they are not used by the model); it just prints out only the top 20. Consider this example:
library(mlbench) # for the data set
library(caret)
library(tidyverse)

set.seed(998)
data(Ionosphere)

rforest_model <- train(y = Ionosphere$Class,
                       x = Ionosphere[, 1:34],
                       method = "ranger",
                       importance = "impurity")

nrow(varImp(rforest_model)$importance) # 34 variables extracted
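So to get every variable rather than the printed top 20, work with the importance data frame directly, for example:

# The full table, nothing truncated; sorted by importance for readability
all_imp <- varImp(rforest_model)$importance
all_imp[order(-all_imp$Overall), , drop = FALSE]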
Let's check them:
varImp(rforest_model)$importance %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(Overall) %>%
  mutate(rowname = forcats::fct_inorder(rowname)) %>%
  ggplot() +
  geom_col(aes(x = rowname, y = Overall)) +
  coord_flip() +
  theme_bw()
Note that V2 is a zero-variance feature in this data set; hence it has 0 importance and is not used by the model at all.
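A quick way to confirm that, using caret's nearZeroVar() helper:

# saveMetrics = TRUE returns a zeroVar flag for every predictor;
# V2 should be flagged as zero variance
nearZeroVar(Ionosphere[, 1:34], saveMetrics = TRUE)["V2", ]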

Extract Model for Specific Factor

Say I've fit a model as follows: fit = lm(Y ~ X + Dummy1 + Dummy2)
How can I extract the regression for a specific dummy variable?
I'm hoping to do something like the following to plot all the regressions:
plot(...)
abline(extracted.lm.dummy1)
abline(extracted.lm.dummy2)
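One low-tech option first: with a common slope, each dummy only shifts the intercept, so you can build every line directly from the fitted coefficients. A sketch, assuming Dummy1 and Dummy2 are numeric 0/1 indicators (factor dummies get different coefficient names):

fit <- lm(Y ~ X + Dummy1 + Dummy2)
cf <- coef(fit)
plot(X, Y)
abline(a = cf["(Intercept)"], b = cf["X"])                # both dummies 0
abline(a = cf["(Intercept)"] + cf["Dummy1"], b = cf["X"]) # Dummy1 = 1
abline(a = cf["(Intercept)"] + cf["Dummy2"], b = cf["X"]) # Dummy2 = 1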
I would look into the sjPlot package. Here is the documentation for sjp.lm, which can be used to visualize linear models in various ways. The package also has some nice tools for tabular summaries of models.
An example:
library(sjPlot)
library(dplyr)

# Add a second categorical variable to the iris dataset,
# then fit a linear model
set.seed(123)
fit <- iris %>%
  mutate(Category = factor(sample(c("A", "B"), 150, replace = TRUE))) %>%
  lm(Sepal.Length ~ Sepal.Width + Species + Category, data = .)
Different kinds of plot include:

Marginal effects plot, probably closest to what you want:
sjp.lm(fit, type = "eff", vars = c("Category", "Species"))

"Forest plot" (beta coefficients + confidence intervals):
sjp.lm(fit)
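Note that newer sjPlot releases have retired sjp.lm() in favor of plot_model(); an approximately equivalent pair of calls (check the version you have installed) would be:

# plot_model() supersedes sjp.lm() in newer sjPlot versions
plot_model(fit, type = "eff", terms = c("Category", "Species"))
plot_model(fit) # the coefficient ("forest") plot is the default type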

Fitting several regression models after group_by with dplyr and applying the resulting models into test sets

I have a big dataset that I want to partition based on the values of a particular variable (in my case, lifetime), and then run logistic regression on each partition. Following @tchakravarty's answer in Fitting several regression models with dplyr, I wrote the following code:
lifetimemodels <- data %>%
  group_by(lifetime) %>%
  sample_frac(0.7) %>%
  do(lifeModel = glm(churn ~ ., x = TRUE, family = binomial(link = "logit"), data = .))
My question now is how I can use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not sampled), which should again be grouped by lifetime?
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework. Look at grouping/nesting, and at the mutate() and map() functions, to create list columns that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework -- specifically the crossv_mc() and crossv_kfold() functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

analysis <- mtcars %>%
  nest(-cyl) %>%
  unnest(map(data, ~ crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~ lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))
This:
- takes mtcars,
- nests it by cyl into a list column called data,
- splits each data element into training and test sets by mapping crossv_mc() over it, then uses unnest() to create the test and train list columns,
- maps lm() over each train set and stores the fit in model,
- maps predict() over model and train and stores the result in pred,
- maps rmse() over the model and test sets and stores the result in error.
There are probably users out there more familiar with the workflow than me, so please correct/elaborate.
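Translated back to the original question, the same pattern would look roughly like this. This is only a sketch: it assumes a binary churn column and a lifetime grouping variable as in your code, and it uses the pROC package (my choice, not from the question) for the AUC:

library(pROC)

lifetime_analysis <- data %>%
  nest(-lifetime) %>%
  unnest(map(data, ~ crossv_mc(.x, 1, test = 0.3))) %>%
  # fit the logistic model on each training set
  mutate(model = map(train,
                     ~ glm(churn ~ ., family = binomial(link = "logit"),
                           data = as.data.frame(.x)))) %>%
  # score the held-out 0.3 fraction and compute the AUC per lifetime group
  mutate(auc = map2_dbl(model, test,
                        ~ as.numeric(auc(as.data.frame(.y)$churn,
                                         predict(.x, newdata = as.data.frame(.y),
                                                 type = "response")))))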

Plotting predicted survival curves for continuous covariates in ggplot

How can I plot survival curves for representative values of a continuous covariate in a cox proportional hazards model? Specifically, I would like to do this in ggplot using a "survfit.cox" "survfit" object.
This may seem like a question that has already been answered, but I have searched through everything in SO with the terms 'survfit' and 'newdata' (plus many other search terms). This is the thread that comes closest to answering my question so far: Plot Kaplan-Meier for Cox regression
In keeping with the reproducible example offered in one of the answers to that post:
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
df <- read.table(url, header = TRUE)
library(dplyr)
library(ggplot2)
library(survival)
library(magrittr)
library(broom)
# Identifying the 25th and 75th percentiles for prio (continuous covariate)
summary(df$prio)
# Cox proportional hazards model with other covariates
# 'prio' is our explanatory variable of interest
m1 <- coxph(Surv(week, arrest) ~
fin + age + race + prio,
data = df)
# Creating new df to get survival predictions
# Want separate curves for the the different 'fin' and 'race'
# groups as well as the 25th and 75th percentile of prio
newdf <- df %$%
expand.grid(fin = levels(fin),
age = 30,
race = levels(race),
prio = c(1,4))
# Obtain the fitted survival curve, then tidy
# into a dataframe that can be used in ggplot
survcurv <- survfit(m1, newdata = newdf) %>%
tidy()
The problem is that once I have this data frame survcurv, I cannot tell which 'estimate' variable belongs to which covariate pattern, because none of the original variables are retained. For example, which 'estimate' variable represents the fitted curve for a 30-year-old with race = 'other', prio = 4, and fin = 'no'?
In all the other examples I've seen, one usually puts the survfit object into a generic plot() call and does not add a legend. I want to use ggplot and add a legend for each of the predicted curves.
In my own dataset the model is a lot more complex and there are many more curves than I show here, so as you can imagine, seeing 40 different 'estimate.1' ... 'estimate.40' variables makes it hard to understand what is what.
Thanks for providing a well-phrased question and a good example. I'm a little surprised that tidy() does a relatively poor job here of creating sensible output. Please see below for my attempt at creating some plottable data:
library(tidyr)

newdf$group <- as.character(1:nrow(newdf))

survcurv <- survfit(m1, newdata = newdf) %>%
  tidy() %>%
  gather('key', 'value', -time, -n.risk, -n.event, -n.censor) %>%
  mutate(group = substr(key, nchar(key), nchar(key)),
         key = substr(key, 1, nchar(key) - 2)) %>%
  left_join(newdf, 'group') %>%
  spread(key, value)
And then create a plot (perhaps you'd like to use geom_step() instead, but unfortunately there is no step-shaped ribbon):
ggplot(survcurv, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high,
                     col = race, fill = race)) +
  geom_line(size = 1) +
  geom_ribbon(alpha = 0.2, col = NA) +
  facet_grid(prio ~ fin)
Try defining your survcurv like this:
survcurv <-
  lapply(1:nrow(newdf),
         function(x, m1, newdata) {
           cbind(newdata[x, ], survfit(m1, newdata = newdata[x, ]) %>% tidy())
         },
         m1,
         newdf) %>%
  bind_rows()
This will include all of the predictor values as columns alongside the predicted estimates.
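Because the predictor values are now ordinary columns, you can map them to aesthetics to get the legend you were after, for example:

# One curve per covariate pattern, labelled in the legend
ggplot(survcurv, aes(x = time, y = estimate,
                     color = interaction(fin, race, prio, sep = " / "))) +
  geom_step() +
  labs(color = "fin / race / prio")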
