feature_spec in TFdatasets with multiple response variables - r

I'm looking to predict a set of responses using a common set of features using the tensorflow package in R. I've worked through a couple of single response regression examples, and am attempting to modify that code to accommodate my data. Here I'm using the hearts dataset to demonstrate.
library(keras)
library(tensorflow)
library(tfdatasets)
library(tfprobability)
data(hearts)
head(hearts)
hearts <- tensor_slices_dataset(hearts) %>% dataset_batch(32)
#This works
spec <- feature_spec(hearts, x = c(age, sex, cp, trestbps, chol, fbs, restecg,
exang, oldpeak, slope, ca, thal, thalach),
y = target)
#This doesn't
spec <- feature_spec(hearts, x = c(age, sex, cp, trestbps, chol, fbs, restecg,
exang, oldpeak, slope, ca, thal, thalach),
y = c(target, thalach))
So, how can I pass multiple response variables to feature_spec? It seems like this is possible in Python from this post (https://towardsdatascience.com/bayesian-neural-networks-with-tensorflow-probability-fbce27d6ef6).
See the code chunk under 'Data Handling'. Is this just not an option in R?

Related

ggcoef_model error when two random intercepts

When trying to graph the conditional fixed effects of a glmmTMB model with two random intercepts in GGally I get the error:
There was an error calling "tidy_fun()". Most likely, this is because the
function supplied in "tidy_fun=" was misspelled, does not exist, is not
compatible with your object, or was missing necessary arguments (e.g. "conf.level=" or "conf.int="). See error message below.
Error: Error in "stop_vctrs()":
! Can't recycle "..1" (size 3) to match "..2" (size 2).
I have tinkered with the issue, and it seems to be related to the two random intercepts included in the model. I have also tried extracting the coefficient and standard error information separately through broom.mixed::tidy and then feeding the data frame into GGally::ggcoef(), to no avail. Any suggestions?
# Example with built-in randu data set
library(glmmTMB)
library(GGally)
data(randu)
randu$A <- factor(rep(c(1,2), 200))
randu$B <- factor(rep(c(1,2,3,4), 100))
# Model
test <- glmmTMB(y ~ x + z + (0 +x|A) + (1|B), family="gaussian", data=randu)
# A few of my attempts at graphing--works fine when only one random effects term is in model
ggcoef_model(test)
ggcoef_model(test, tidy_fun = broom.mixed::tidy)
ggcoef_model(test, tidy_fun = broom.mixed::tidy, conf.int = T, intercept=F)
ggcoef_model(test, tidy_fun = broom.mixed::tidy(test, effects="fixed", component = "cond", conf.int = TRUE))
There are some (old!) bugs that have recently been fixed (here, here) that would make confidence interval reporting on RE parameters break for any model with multiple random terms (I think). I believe that if you are able to install updated versions of both glmmTMB and broom.mixed:
remotes::install_github("glmmTMB/glmmTMB/glmmTMB@ci_tweaks")
remotes::install_github("bbolker/broom.mixed")
then ggcoef_model(test) will work.

Is it possible to use lqmm with a mira object?

I am using the package lqmm, to run a linear quantile mixed model on an imputed object of class mira from the package mice. I tried to make a reproducible example:
library(lqmm)
library(mice)
summary(airquality)
imputed<-mice(airquality,m=5)
summary(imputed)
fit1<-lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, data=airquality,na.action=na.omit)
fit1
summary(fit1)
fit2<-with(imputed, lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, na.action=na.omit))
"Error in lqmm(Ozone ~ Solar.R + Wind + Temp + Day, random = ~1, tau = 0.5, :
`data' must be a data frame"
Yes, it is possible to get lqmm() to work in mice. Viewing the code for lqmm(), it turns out that it's a picky function. It requires that the data argument is supplied, and although it appears to check if the data exists in another environment, it doesn't seem to work in this context. Fortunately, all we have to do to get this to work is capture the data supplied from mice and give it to lqmm().
fit2 <- with(imputed,
lqmm(Ozone ~ Solar.R + Wind + Temp + Day,
data = data.frame(mget(ls())),
random = ~1, tau = 0.5, group = Month, na.action = na.omit))
The explanation is that ls() gets the names of the variables available, mget() gets those variables as a list, and data.frame() converts them into a data frame.
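To see the ls() / mget() / data.frame() mechanism in isolation, here is a minimal base-R sketch. The list vars is a made-up stand-in for the variables of one completed data set; it is not how mice stores data internally.

```r
# Hypothetical stand-in for the variables that with() exposes for one imputed data set
vars <- list(Ozone = c(41, 36, 12), Wind = c(7.4, 8.0, 12.6))

# ls() lists the visible variable names, mget() fetches them as a named list,
# and data.frame() binds that list into a data frame
rebuilt <- with(vars, data.frame(mget(ls())))

str(rebuilt)
```

Note that ls() returns names in alphabetical order, so the rebuilt data frame may order its columns differently from the original data.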
The next problem you're going to find is that mice::pool() requires tidy() and glance() methods to properly pool the multiple imputations. It looks like neither broom nor broom.mixed has those defined for lqmm. I threw together a very quick and dirty implementation, which you could use if you can't find anything else.
To get pool(fit2) to run you'll need to create the function tidy.lqmm() as below. Then pool() will assume the sample size is infinite and perform the calculations accordingly. You can also create the glance.lqmm() function before running pool(fit2), which will tell pool() the residual degrees of freedom. Afterwards you can use summary(pooled) to find the p-values.
tidy.lqmm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
broom:::as_tidy_tibble(data.frame(
estimate = coef(x),
std.error = sqrt(
diag(summary(x, covariance = TRUE,
R = 50)$Cov[names(coef(x)),
names(coef(x))]))))
}
glance.lqmm <- function(x, ...) {
broom:::as_glance_tibble(
logLik = as.numeric(stats::logLik(x)),
df.residual = summary(x, R = 2)$rdf,
nobs = stats::nobs(x),
na_types = "rii")
}
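For intuition about what pooling does with the estimate and std.error columns that tidy.lqmm() returns, here is a minimal sketch of Rubin's rules in base R. The numbers are invented for illustration, and the real mice::pool() also handles degrees of freedom, which this sketch leaves out.

```r
# Hypothetical per-imputation results for one coefficient (m = 5 imputations)
estimates  <- c(1.9, 2.1, 2.0, 2.3, 1.7)
std_errors <- c(0.50, 0.55, 0.48, 0.52, 0.51)

m <- length(estimates)
pooled_estimate <- mean(estimates)       # average of the point estimates
within_var      <- mean(std_errors^2)    # average squared standard error
between_var     <- var(estimates)        # variance of the estimates across imputations
total_var       <- within_var + (1 + 1 / m) * between_var
pooled_se       <- sqrt(total_var)

c(estimate = pooled_estimate, std.error = pooled_se)
```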
Note: lqmm uses bootstrapping to estimate the standard error. By default it uses R = 50 bootstrapping replicates, which I've copied in the tidy.lqmm() function. You can change that line to increase the number of replicates if you like.
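To illustrate why the number of replicates matters, here is a generic nonparametric bootstrap of a standard error in base R. It is only a sketch of the general idea on simulated data, with a hypothetical helper boot_se(); it is not lqmm's internal procedure.

```r
set.seed(1)
# Hypothetical sample; the statistic is the median (the 0.5 quantile)
x <- rnorm(200, mean = 10, sd = 2)

boot_se <- function(x, statistic, R = 50) {
  # Recompute the statistic on R resamples drawn with replacement
  replicates <- replicate(R, statistic(sample(x, replace = TRUE)))
  # The spread of the replicates estimates the standard error
  sd(replicates)
}

se_50   <- boot_se(x, median, R = 50)
se_1000 <- boot_se(x, median, R = 1000)  # more replicates, less Monte Carlo noise
c(R50 = se_50, R1000 = se_1000)
```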
WARNING: Use these functions and the results with caution. I know just enough to be dangerous. To me it looks like these functions work to give sensible results, but there are probably intricacies that I'm not aware of. If you can find a more authoritative source for similar functions that work, or someone who is familiar with lqmm or pooling mixed models, I'd trust them more than me.

Create SHAP plots for tidymodel objects

This question refers to Obtaining summary shap plot for catboost model with tidymodels in R. Given the comment below the question, the OP found a solution but did not share it with the community so far.
I want to analyze my tree ensembles fitted with the tidymodels package using SHAP value plots, both plots for single observations (image not shown)
and plots that summarize the effect of all features of my dataset (image not shown).
DALEXtra provides a function to create SHAP values for tidymodels, explain_tidymodels(). force_plot from the fastshap package provides a wrapper for the plot function of the underlying Python package SHAP. But I can't understand how to make the function work with the output of the explain_tidymodels() function.
Question: How can one generate such SHAP plots in R using tidymodels and explain_tidymodels()?
MWE (for SHAP values with explain_tidymodels)
library(MASS)
library(tidyverse)
library(tidymodels)
library(parsnip)
library(treesnip)
library(catboost)
library(fastshap)
library(DALEXtra)
set.seed(1337)
rec <- recipe(crim ~ ., data = Boston)
split <- initial_split(Boston)
train_data <- training(split)
test_data <- testing(split) %>% dplyr::select(-crim) %>% as.matrix()
model_default<-
parsnip::boost_tree(
mode = "regression"
) %>%
set_engine(engine = 'catboost', loss_function = 'RMSE')
# Sometimes catboost is not loaded correctly; the following two lines
# prevent fitting errors
# (see the last post in https://github.com/curso-r/treesnip/issues/21)
set_dependency("boost_tree", eng = "catboost", "catboost")
set_dependency("boost_tree", eng = "catboost", "treesnip")
model_fit_wf <- workflow() %>% add_model(model_default) %>% add_recipe(rec) %>% {parsnip::fit(object = ., data = train_data)}
# X was not defined in the original post; the training predictors are assumed here
X <- train_data %>% dplyr::select(-crim)
SHAP_wf <- explain_tidymodels(model_fit_wf, data = X, y = train_data$crim, new_data = test_data)
Perhaps this will help. At the very least, it is a step in the right direction.
First, ensure you have fastshap and reticulate installed (i.e., install.packages("...")). Next, set up a virtual environment and install shap (pip install ...). Also, install matplotlib 3.2.2 for the dependency plots (check out GitHub issues on this -- an older version of matplotlib is necessary).
RStudio has great information on virtual environment setup. That said, virtual environment setup requires more or less troubleshooting depending on the IDE in use. (Sadly, some work settings restrict the use of open source RStudio due to licensing.)
Docs for library(fastshap) are also helpful on this front.
Here's a workflow for lightgbm (from treesnip docs, lightly modified).
library(tidymodels)
library(treesnip)
data("diamonds", package = "ggplot2")
diamonds <- diamonds %>% sample_n(1000)
# vfold resamples
diamonds_splits <- vfold_cv(diamonds, v = 5)
model_spec <- boost_tree(mtry = 5, trees = 500) %>% set_mode("regression")
# model specs
lightgbm_model <- model_spec %>%
set_engine("lightgbm", nthread = 6)
#workflows
lightgbm_wf <- workflow() %>%
add_model(
lightgbm_model
)
rec_ordered <- recipe(
price ~ .
, data = diamonds
)
lightgbm_fit_ordered <- fit_resamples(
add_recipe(
lightgbm_wf, rec_ordered
), resamples = diamonds_splits)
Prior to prediction we want to fit our workflow
fit_workflow <- lightgbm_wf %>%
add_recipe(rec_ordered) %>%
fit(data = diamonds)
Now we have a fitted workflow and can predict. To use the fastshap::explain function, we need to create a predict function (this doesn't always hold: depending on the engine used it may or may not work out of the box -- see docs).
predict_function_gbm <- function(model, newdata) {
predict(model, newdata) %>% pluck(.,1)
}
Let's get the mean prediction value (used below) while we're at it. This also serves as a check to ensure the function is functioning.
mean_preds <- mean(
predict_function_gbm(
fit_workflow, diamonds %>% select(-price)
)
)
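The shape such a prediction wrapper needs is easier to see on a model that requires no extra packages. The sketch below uses a plain lm() fit on mtcars as a stand-in; the names predict_function_lm, X and baseline are made up for illustration. The key points are the two-argument signature (object, newdata) and the plain numeric vector it returns.

```r
# Stand-in model: an ordinary linear regression on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Wrapper with the (object, newdata) signature; returns a plain numeric vector
predict_function_lm <- function(object, newdata) {
  as.numeric(predict(object, newdata = newdata))
}

X <- mtcars[, c("wt", "hp")]          # predictors only, response removed
preds <- predict_function_lm(fit, X)

# Sanity check: one numeric prediction per row
stopifnot(is.numeric(preds), length(preds) == nrow(X))

baseline <- mean(preds)               # mean prediction, usable as a baseline value
baseline
```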
Now we create our explanations (shap values). Note the pred_wrapper and X arguments here (see fastshap github issues for other examples -- i.e. glmnet).
fastshap::explain(
fit_workflow,
X = as.data.frame(diamonds %>% select(-price)),
pred_wrapper = predict_function_gbm,
nsim = 10
) -> explanations_gbm
This should produce a force plot.
fastshap::force_plot(
object = explanations_gbm[1,],
feature_values = as.data.frame(diamonds %>% select(-price))[1,],
display = "viewer",
baseline = mean_preds)
This allows multiple, vertically stacked:
fastshap::force_plot(
object = explanations_gbm[1:20,],
feature_values = as.data.frame(diamonds %>% select(-price))[1:20,],
display = "viewer",
baseline = mean_preds)
Add link = "logit" for classification. Change display to "html" for Rmarkdown rendering.
Now for summary plots and dependency plots.
The trick is using reticulate to access the functions directly. Note that the same logic holds for libraries like transformers, numpy, etc.
First, for dependency plot.
library(reticulate)
shap = import("shap")
np = import("numpy")
shap$dependence_plot(
"rank(3)",
data.matrix(explanations_gbm),
data.matrix(diamonds %>% select(-price))
)
See shap docs for explanation of rank(3) -- rank(1) etc will also work.
Unfortunately, it threw an error when I attempted naming the feature directly (i.e., "cut").
Now for the summary plot:
shap$summary_plot(
data.matrix(explanations_gbm),
data.matrix(diamonds %>% select(-price))
)
Final note: rendering the plot repeatedly will produce buggy visualizations. Hopefully this provides a point of departure for catboost visualizations.

Plotting SVM Linear Separator in R

I'm trying to plot the 2-dimensional hyperplanes (lines) separating a 3-class problem with e1071's svm. I used the default method (so there is no formula involved) like so:
library('e1071')
## S3 method for class 'default':
machine <- svm(x, y, kernel="linear")
I cannot seem to plot it by using the plot.svm method:
plot(machine, x)
Error in plot.svm(machine, x) : missing formula.
But I did not use the formula method, I used the default one, and if I pass '~' or '~.' as a formula argument it'll complain about the matrix x not being a data.frame.
Is there a way of plotting the fitted separator/s for the 2D problem while using the default method?
How may I achieve this?
Thanks in advance.
It appears that although svm() allows you to specify your input using either the default or formula method, plot.svm() only allows a formula method. Also, by only giving x to plot.svm(), you are not giving it all the info it needs. It also needs y.
Try this:
library(e1071)
x <- prcomp(iris[,1:4])$x[,1:2]
y <- iris[,5]
df <- data.frame(x, y = y)  # keeps y as a factor so svm() fits a classifier
machine <- svm(y ~ PC1 + PC2, data=df)
plot(machine, data=df)
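When assembling a data frame for the formula interface, it is worth checking that the response is still a factor. cbind() on a numeric matrix and a factor yields a numeric matrix, so the response arrives as integer codes; svm() then defaults to regression, and plot.svm() only draws classification models. A small base-R check (the names df_cbind and df_named are made up for this sketch):

```r
x <- prcomp(iris[, 1:4])$x[, 1:2]    # numeric matrix with columns PC1 and PC2
y <- iris[, 5]                        # factor with three levels

df_cbind <- data.frame(cbind(x, y))   # cbind() builds a numeric matrix first
df_named <- data.frame(x, y = y)      # data.frame() keeps y as a factor

class(df_cbind$y)  # "numeric"
class(df_named$y)  # "factor"
```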
It appears that your x has more than two feature-variables or columns.
Since plot.svm() plots only 2-Dimensions at a time, you need to specify these dimensions explicitly by providing a formula argument.
Ex:- ## more than two variables: fix 2 dimensions
data(iris)
m2 <- svm(Species~., data = iris)
plot(m2, iris, Petal.Width ~ Petal.Length,slice = list(Sepal.Width = 3, Sepal.Length = 4))
In cases where the data frame has only two predictor columns (plus the response), you can omit the formula argument.
Ex:- ## a simple example
data(cats, package = "MASS")
m <- svm(Sex~., data = cats)
plot(m, cats)
These details can be found at plot.svm() documentation here https://www.rdocumentation.org/packages/e1071/versions/1.7-3/topics/plot.svm

Error in plot, formula missing when using svm

I am trying to plot my svm model.
library(foreign)
library(e1071)
x <- read.arff("contact-lenses.arff")
#alt: x <- read.arff("http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/contact-lenses.arff")
model <- svm(`contact-lenses` ~ . , data = x, type = "C-classification", kernel = "linear")
The contact lens arff is the inbuilt data file in weka.
However, now I run into an error trying to plot the model.
plot(model, x)
Error in plot.svm(model, x) : missing formula.
The problem is that in your model, you have multiple covariates. The plot() will only run automatically if your data= argument has exactly three columns (one of which is a response). For example, in the ?plot.svm help page, you can call
data(cats, package = "MASS")
m1 <- svm(Sex~., data = cats)
plot(m1, cats)
Since you can only show two dimensions on a plot, you need to specify what you want to use for x and y when you have more than two predictors to choose from:
cplus<-cats
cplus$Oth<-rnorm(nrow(cplus))
m2 <- svm(Sex~., data = cplus)
plot(m2, cplus) #error
plot(m2, cplus, Bwt~Hwt) #Ok
plot(m2, cplus, Hwt~Oth) #Ok
So that's why you're getting the "Missing Formula" error.
There is another catch as well. The plot.svm will only plot continuous variables along the x and y axes. The contact-lenses data.frame has only categorical variables. The plot.svm function simply does not support this as far as I can tell. You'll have to decide how you want to summarize that information in your own visualization.
