sampling from posterior predictive distribution (stan vs inla) - r

I'm trying to use functions from the bayesplot package on an INLA object and am a little unsure how to draw from the posterior predictive distribution. I think I almost have it, but the rstan draws are more variable than the INLA ones.
In rstan, using the simplified example from bayesplot vignette I can:
library(bayesplot)
library(ggplot2)
library(rstanarm)
library(ggpubr)
library(tidyverse)
#rstan model set up
roaches$roach100 <- roaches$roach1 / 100 # pre-treatment number of roaches (in 100s)
fit_poisson <- stan_glm(y ~ roach100 + treatment + senior, offset = log(exposure2), family = poisson(link = "log"), data = roaches, seed = 1111, refresh = 0)
#In order to use the PPC functions from the bayesplot package we need a vector y of outcome values:
y <- roaches$y
#and a matrix yrep of draws from the posterior predictive distribution,
yrep_poisson <- posterior_predict(fit_poisson, draws = 500)
#then plot:
p1 <- bayesplot::ppc_dens_overlay(y, yrep_poisson[1:50, ])
p1
I want to replicate that plot on an INLA object. According to the bayesplot vignette this is possible: it provides code to define a simple pp_check method for fitted model objects of some class, e.g. foo:
pp_check.foo <- function(object, type = c("multiple", "overlaid"), ...) {
  type <- match.arg(type)
  y <- object[["y"]]
  yrep <- object[["yrep"]]
  stopifnot(nrow(yrep) >= 50)
  samp <- sample(nrow(yrep), size = ifelse(type == "overlaid", 50, 5))
  yrep <- yrep[samp, ]
  if (type == "overlaid") {
    ppc_dens_overlay(y, yrep, ...)
  } else {
    ppc_hist(y, yrep, ...)
  }
}
To use pp_check.foo we can just make a list with y and yrep components and give it class foo:
x <- list(y = rnorm(200), yrep = matrix(rnorm(1e5), nrow = 500, ncol = 200))
class(x) <- "foo"
#create plot above:
pp_check(x, type = "overlaid")
INLA
#create same model but in inla:
library(INLA)
fit_poisson_inla <- inla(y ~ roach100 + treatment + senior, offset = log(exposure2), data = roaches,
                         control.predictor = list(compute = TRUE),
                         family = "poisson")
inla_object_name$marginals.fitted.values returns a posterior predictive distribution for each y:
fit_poisson_inla$marginals.fitted.values
#so to get the distribution for the first observation:
fitted.Predictor.1 <- fit_poisson_inla$marginals.fitted.values[[1]]
I think repeatedly sampling from this would give me what I need, but only 75 values per observation (see dim(fitted.Predictor.1)) are used to create this distribution, when in reality I would want to sample from the full range of values. I think we can do this (section 4.3 here) by applying inla.tmarginal to the linear predictor:
fitted_dist <- fit_poisson_inla$marginals.linear.predictor
#should i have used "inla.rmarginal(n, marginal)"?
marginal_dist <- lapply(fitted_dist, function(y) inla.tmarginal(function(x) {exp(x)}, y)) %>% map(~ as.data.frame(.) %>% rename(., xx = x))
#resample 500 times
yrep_poisson_inla <- as.matrix(bind_rows(rerun(500, lapply(marginal_dist, function(x) sample(x$xx, 1)) %>% as.data.frame())))
#convert to class foo for pp_check
x <- list(y = y, yrep = yrep_poisson_inla[1:50, ])
class(x) <- "foo"
p2 <- pp_check(x, type = "overlaid")
#plot
ggarrange(p1, p2, ncol = 1, nrow = 2, labels = c("rstan", "inla sample"))
My question is how do I correctly get a matrix of draws from the posterior predictive distribution from this inla (fit_poisson_inla) object to pass into pp_check? yrep_poisson produces discrete values while yrep_poisson_inla produces continuous values. There is a lot more variation in the rstan draws than INLA (second plot). Is what I have done correct and this is just some sampling issue or is it an artifact of the different methods? In more complicated examples the differences could be substantial.
Thanks
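A likely source of the difference, offered as a hedged sketch rather than a definitive diagnosis: the marginals of the fitted values describe uncertainty in the Poisson mean exp(eta), whereas posterior_predict() also adds the Poisson sampling noise, which would explain why yrep_poisson is discrete and more variable. The base R example below uses simulated stand-in draws of the linear predictor (eta_draws is hypothetical; with INLA such joint draws would presumably come from inla.posterior.sample() on a model fitted with control.compute = list(config = TRUE)) and shows the extra rpois() step that turns draws of the mean into posterior predictive draws.

```r
set.seed(1)
n_obs   <- 200   # number of observations
n_draws <- 500   # number of posterior draws

# Hypothetical stand-in for posterior draws of the linear predictor:
# a draws x observations matrix (column j holds the draws for observation j).
eta_mean  <- rnorm(n_obs, mean = 1, sd = 0.5)
eta_draws <- matrix(rnorm(n_draws * n_obs,
                          mean = rep(eta_mean, each = n_draws), sd = 0.1),
                    nrow = n_draws, ncol = n_obs)

# Draws of the Poisson mean: continuous, and missing the observation noise.
mu_draws <- exp(eta_draws)

# Posterior predictive draws: one Poisson variate per (draw, observation).
yrep <- matrix(rpois(length(mu_draws), lambda = mu_draws),
               nrow = n_draws, ncol = n_obs)

# Per-observation variance, averaged: mean draws vs predictive draws.
c(mean_only  = mean(apply(mu_draws, 2, var)),
  predictive = mean(apply(yrep, 2, var)))
```

yrep has one row per draw and one column per observation, which is the layout ppc_dens_overlay(y, yrep) expects.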

Related

Does caret::train() in r have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models and compare their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. The biggest issue is that caret doesn't provide all the model fit statistics that I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way in which caret outputs each model's fit.
Another post (sorry don't have a link) created this function which pulls from fit$results and fit$bestTune to get pre calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate using raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output not the original dataset due to the models dropping x variables (feature selection) or rows (omitting NAs) based on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
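As a sanity check on the formula, here is a minimal base R sketch (using lm() rather than caret, with simulated data; all object names are illustrative). It recovers the residuals as y_actual - y_predicted via predict() on the rows actually used, then relates the simplified BIC above to stats::BIC(). The two differ because stats::BIC() uses the full Gaussian log-likelihood and counts the intercept and the residual variance as parameters.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)
d$y[c(3, 17)] <- NA                      # two rows will be dropped by na.omit

fit <- lm(y ~ x1 + x2, data = d, na.action = na.omit)

# Residuals from actual and predicted values on the rows the model used.
used <- complete.cases(d)
res  <- d$y[used] - predict(fit, newdata = d[used, ])

n   <- length(res)                       # rows used in the fit
k   <- 2                                 # number of x variables
mse <- mean(res^2)

# Simplified BIC, as in calculate_BIC() above.
bic_simple <- n * log(mse) + k * log(n)

# Full Gaussian BIC: adds the likelihood constant and counts
# the intercept and sigma as two further parameters.
bic_full <- n * log(mse) + n * (1 + log(2 * pi)) + (k + 2) * log(n)

all.equal(bic_full, BIC(fit))
```

For models compared on the same n rows, the gap between the two versions is a constant, so the ranking by BIC should not change.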
The real question is: is there a standardized output of caret::train() for the x variables, or for any of y_actual, y_predicted, or the residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(caret)
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducibility
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c("y1", "y2"), c("x1", "x2", "x3"), linear_models)

SHAP Importance for Ranger in R

I have a binary classification problem:
how can I get the SHAP contributions of the variables for a ranger model?
Sample data:
library(ranger)
library(tidyverse)
# Binary Dataset
df <- iris
df$Target <- if_else(df$Species == "setosa",1,0)
df$Species <- NULL
# Train Ranger Model
model <- ranger(
x = df %>% select(-Target),
y = df %>% pull(Target))
I have tried several libraries (DALEX, shapr, fastshap, shapper) but did not find a solution.
I would like to get results like those SHAPforxgboost gives for xgboost, such as:
the output of shap.values which is the shap contribution of variables
the shap.plot.summary
Good morning!
From what I have found, you can use ranger() with the fastshap package as follows:
library(fastshap)
library(ranger)
library(tidyverse)
data(iris)
# Binary Dataset
df <- iris
df$Target <- if_else(df$Species == "setosa",1,0)
df$Species <- NULL
x <- df %>% select(-Target)
# Train Ranger Model
model <- ranger(
x = df %>% select(-Target),
y = df %>% pull(Target))
# Prediction wrapper
pfun <- function(object, newdata) {
predict(object, data = newdata)$predictions
}
# Compute fast (approximate) Shapley values using 10 Monte Carlo repetitions
system.time({ # estimate run time
set.seed(5038)
shap <- fastshap::explain(model, X = x, pred_wrapper = pfun, nsim = 10)
})
# Load required packages
library(ggplot2)
theme_set(theme_bw())
# Aggregate Shapley values
shap_imp <- data.frame(
Variable = names(shap),
Importance = apply(shap, MARGIN = 2, FUN = function(x) mean(abs(x)))
)
Then for example, for variable importance, you can do:
# Plot Shap-based variable importance
ggplot(shap_imp, aes(reorder(Variable, Importance), Importance)) +
geom_col() +
coord_flip() +
xlab("") +
ylab("mean(|Shapley value|)")
Also, if you want individual predictions, the following is possible:
# Plot individual explanations
expl <- fastshap::explain(model, X = x ,pred_wrapper = pfun, nsim = 10, newdata = x[1L, ])
autoplot(expl, type = "contribution")
All this information has been found in here, and there is more to it: https://bgreenwell.github.io/fastshap/articles/fastshap.html
Check the link and solve your doubts ! :)
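To make concrete what a matrix of SHAP contributions contains, here is a small base R sketch that does not use ranger or fastshap at all. It relies on a linear model, for which the contribution of feature j to observation i has the closed form beta_j * (x_ij - mean(x_j)) when features are treated as independent. The variable names are illustrative; the point is only the shape of the output and the additivity property that Monte Carlo estimates approximate.

```r
# Linear model on iris; for this model class the Shapley values
# are available in closed form (assuming independent features).
X   <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

beta     <- coef(fit)[names(X)]
centered <- sweep(as.matrix(X), 2, colMeans(X), "-")
shap     <- sweep(centered, 2, beta, "*")   # one row per observation, one column per feature

# Additivity: contributions plus the average prediction give each prediction.
pred <- predict(fit, newdata = X)
all.equal(unname(rowSums(shap) + mean(pred)), unname(pred))

# Global importance: mean absolute contribution per feature.
shap_imp <- sort(colMeans(abs(shap)), decreasing = TRUE)
shap_imp
```

The shap matrix here plays the same role as the output of fastshap::explain() above: rows are observations, columns are features, and the column-wise mean of absolute values is what the importance plot displays.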
I launched two R packages to perform such tasks: One is "kernelshap" (crunching), the other one is "shapviz" (plotting).
library(randomForest)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris)
# bg_X is usually a small (50-200 rows) subset of the data
# Step 1: Calculate Kernel SHAP values
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, show_numbers = TRUE)
sv_dependence(sv, v = "Petal.Length", color_var = "auto")

Predict segmented lm outside of package

I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
# S3 method for class 'lm':
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(30, 60)),
               control = seg.control(display = FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base R predict() function (not the predict.segmented() function) with the coefficients and breakpoints saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). By reverse engineering the predict.segmented() function, I built the design matrix and used it to predict values instead of calling predict() directly. I do not consider this a full answer to the original question yet, because predict() can also produce confidence intervals for the prediction and I have not implemented that. The question is still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# (the segmented lm object, obj in the examples below, is not itself an argument; it is only used to extract the inputs)
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer
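A more compact variant of the same idea, as a hedged sketch: if the breakpoints are treated as fixed, a segmented fit is an ordinary linear model in the columns 1, x and pmax(x - psi_k, 0), so predictions need only the intercept, the slope, the slope differences (the U coefficients) and the breakpoints. The function name predict_piecewise and its arguments are hypothetical, and the interval uses a normal approximation with a user-supplied covariance matrix of those coefficients, which ignores the uncertainty in the breakpoints themselves.

```r
# coefs: c(intercept, slope, U1, U2, ...); psi: breakpoints, same order as the U's.
# vcov:  optional covariance matrix of coefs (breakpoints treated as fixed).
predict_piecewise <- function(x, coefs, psi, vcov = NULL) {
  # Design matrix: intercept, x, and one hinge column per breakpoint.
  hinges <- outer(x, psi, function(a, b) pmax(a - b, 0))
  X <- cbind(1, x, hinges)
  fit <- drop(X %*% coefs)
  if (is.null(vcov)) return(data.frame(x = x, fit = fit))
  # Standard error of each fitted value: sqrt(diag(X V X')).
  se <- sqrt(rowSums((X %*% vcov) * X))
  data.frame(x = x, fit = fit, lwr = fit - 1.96 * se, upr = fit + 1.96 * se)
}

# Hypothetical saved outputs: flat start, slope 1.5 after 35, flat again after 70.
saved_coefs <- c(2, 0, 1.5, -1.5)
saved_psi   <- c(35, 70)
predict_piecewise(c(10, 50, 80), saved_coefs, saved_psi)
```

Because only numeric vectors are needed, this can be applied to hundreds of saved models in a loop without reloading the segmented objects.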

Is it possible to sample from a conditional density in R given some conditional data?

In R, using the np package, I have created the bandwidths for a conditional density. What I would like to do is, given some new conditional vector, sample from the resulting distribution.
Current code:
library('np')
# Generate some test data.
somedata = data.frame(replicate(10,runif(100, 0, 1)))
# Conditional variables.
X <- data.frame(somedata[, c('X1', 'X2', 'X3')])
# Dependent variables.
Y <- data.frame(somedata[, c('X4', 'X5', 'X6')])
# Warning, this can be slow (but shouldn't be too bad).
bwsome = npcdensbw(xdat=X, ydat=Y)
# TODO: Given some vector t of conditional data, how can I sample from the resulting distribution?
I am quite new to R, so while I did read the package documentation, I haven't been able to figure out whether what I envision makes sense or is possible. If necessary, I would happily use a different package.
Here is Example 2.49 from https://cran.r-project.org/web/packages/np/vignettes/np_faq.pdf , which gives the following solution for 2 variables:
###
library(np)
data(faithful)
n <- nrow(faithful)
x1 <- faithful$eruptions
x2 <- faithful$waiting
## First compute the bandwidth vector
bw <- npudensbw(~x1 + x2, ckertype = "gaussian")
plot(bw, view = "fixed", ylim = c(0, 3))
## Next generate draws from the kernel density (Gaussian)
n.boot <- 1000
i.boot <- sample(1:n, n.boot, replace = TRUE)
x1.boot <- rnorm(n.boot,x1[i.boot],bw$bw[1])
x2.boot <- rnorm(n.boot,x2[i.boot],bw$bw[2])
## Plot the density for the bootstrap sample using the original
## bandwidths
plot(npudens(~x1.boot+x2.boot,bws=bw$bw), view = "fixed")
Following this hint from @coffeejunky, the following is a possible solution to your problem with 6 variables:
## Generate some test data.
somedata = data.frame(replicate(10, runif(100, 0, 1)))
## Conditional variables.
X <- data.frame(somedata[, c('X1', 'X2', 'X3')])
## Dependent variables.
Y <- data.frame(somedata[, c('X4', 'X5', 'X6')])
## First compute the bandwidth vector
n <- nrow(somedata)
bw <- npudensbw(~X$X1 + X$X2 + X$X3 + Y$X4 + Y$X5 + Y$X6, ckertype = "gaussian")
plot(bw, view = "fixed", ylim = c(0, 3))
## Next generate draws from the kernel density (Gaussian)
n.boot <- 1000
i.boot <- sample(1:n, n.boot, replace=TRUE)
x1.boot <- rnorm(n.boot, X$X1[i.boot], bw$bw[1])
x2.boot <- rnorm(n.boot, X$X2[i.boot], bw$bw[2])
x3.boot <- rnorm(n.boot, X$X3[i.boot], bw$bw[3])
x4.boot <- rnorm(n.boot, Y$X4[i.boot], bw$bw[4])
x5.boot <- rnorm(n.boot, Y$X5[i.boot], bw$bw[5])
x6.boot <- rnorm(n.boot, Y$X6[i.boot], bw$bw[6])
## Plot the density for the bootstrap sample using the original
## bandwidths
ob1 <- npudens(~x1.boot + x2.boot + x3.boot + x4.boot + x5.boot + x6.boot, bws = bw$bw)
plot(ob1, view = "fixed", ylim = c(0, 3))
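The draws above come from the joint kernel density of all six variables. To condition on a given X vector, one option is the following base R sketch, which assumes Gaussian product kernels: the conditional kernel density estimate is then a mixture of Gaussians centred at the observed Y rows, with mixture weights given by the X kernel evaluated at the conditioning point. The function sample_conditional and its arguments are hypothetical; the bandwidths bw_x and bw_y would presumably be taken from the npcdensbw() result (its xbw and ybw components), which is an assumption here, and fixed values are used in the example instead.

```r
# Draw from the Gaussian-kernel estimate of f(Y | X = x_new).
sample_conditional <- function(n_draws, X, Y, x_new, bw_x, bw_y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  # Log kernel weight of every training row at x_new (constants cancel).
  z <- sweep(sweep(X, 2, x_new, "-"), 2, bw_x, "/")
  log_w <- -0.5 * rowSums(z^2)
  w <- exp(log_w - max(log_w))           # stabilised, unnormalised weights
  # Pick training rows in proportion to their weight ...
  idx <- sample(nrow(X), n_draws, replace = TRUE, prob = w)
  # ... and jitter the selected Y rows with the Y bandwidths.
  noise <- matrix(rnorm(n_draws * ncol(Y)), nrow = n_draws)
  Y[idx, , drop = FALSE] + sweep(noise, 2, bw_y, "*")
}

# Toy data in which Y clearly depends on X, so the result can be checked by eye.
set.seed(42)
X <- data.frame(X1 = runif(500))
Y <- data.frame(X4 = X$X1 + rnorm(500, sd = 0.05))
draws <- sample_conditional(1000, X, Y, x_new = 0.8, bw_x = 0.05, bw_y = 0.02)
colMeans(draws)
```

With several conditioning and dependent variables, x_new, bw_x and bw_y simply become vectors with one entry per column of X or Y.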

binning continuous variables by IV value in R

I am building a logistic regression model in R. I want to bin continuous predictors in an optimal way in relationship to the target variable. There are two things that I know of:
the continuous variables are binned such that its IV (information value) is maximized
maximize the chi-square in the two way contingency table -- the target has two values 0 and 1, and the binned continuous variable has the binned buckets
Does anyone know of any functions in R that can perform such binning?
Your help will be greatly appreciated.
For the first point, you could bin using the weight of evidence (WOE) with the package woeBinning, which optimizes the number of bins for the IV
library(woeBinning)
# get the bin cut points from your dataframe
cutpoints <- woe.binning(dataset, "target_name", "Variable_name")
woe.binning.plot(cutpoints)
# apply the cutpoints to your dataframe
dataset_woe <- woe.binning.deploy(dataset, cutpoints, add.woe.or.dum.var = "woe")
It returns your dataset with two extra columns
Variable_name.binned which is the labels
Variable_name.woe.binned which holds the replacement values that you can then pass into your regression instead of Variable_name
For the second point, on chi2, the package discretization seems to handle it but I haven't tested it.
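For reference, the information value of any candidate binning can be computed directly, which makes it possible to compare cut points by hand. The sketch below uses the usual definitions, WOE_i = log(share of 1s in bin i / share of 0s in bin i) and IV = sum over bins of (share of 1s - share of 0s) * WOE_i. The function name information_value is hypothetical, and the sign convention for WOE varies between packages, although IV is the same either way.

```r
# IV of a continuous variable x against a 0/1 target for given bin breaks.
information_value <- function(x, target, breaks) {
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)
  tab  <- table(bins, factor(target, levels = c(0, 1)))
  dist0 <- tab[, "0"] / sum(tab[, "0"])   # share of all 0s falling in each bin
  dist1 <- tab[, "1"] / sum(tab[, "1"])   # share of all 1s falling in each bin
  woe <- log(dist1 / dist0)               # infinite if a bin lacks one class
  sum((dist1 - dist0) * woe)
}

# Tiny worked example: two bins, 3:1 and 1:3 splits of 0s and 1s.
x      <- 1:8
target <- c(0, 0, 0, 1, 0, 1, 1, 1)
iv <- information_value(x, target, breaks = c(0, 4, 8))
iv   # equals log(3) for this split
```

Bins that contain only one class give an infinite WOE, so in practice such bins are usually merged with a neighbour before computing IV.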
The methods used by regression splines to set knot locations might be considered. The rpart package probably has relevant code. You do need to penalize the inferential statistics because this results in an implicit hiding of the degrees of freedom expended in the process of moving the breaks around to get the best fit. Another common method is to specify breaks at equally spaced quantiles (quartiles or quintiles) within the subset with IV=1. Something like this untested code:
# cont.var.vec: character vector with the names of all your continuous variables
breaks <- function(x, n) quantile(x,
                                  probs = seq(0, 1, length.out = n),
                                  na.rm = TRUE)
lapply(dfrm[dfrm$IV == 1, cont.var.vec, drop = FALSE], breaks, n = 5)
setwd("D:")
rm(list=ls())
options (scipen = 999)
read.csv("dummy_data.txt") -> dt
head(dt)
summary(dt)
mydata <- dt
head(mydata)
summary(mydata)
##Capping
for(i in 1:ncol(mydata)){
if(is.numeric(mydata[,i])){
val.quant <- unname(quantile(mydata[,i],probs = 0.75))
mydata[,i] = sapply(mydata[,i],function(x){if(x > (1.5*val.quant+1)){1.5*val.quant+1}else{x}})
}
}
library(randomForest)
x <- mydata[,!names(mydata) %in% c("Cust_Key","Y")]
y <- as.factor(mydata$Y)
set.seed(21)
fit <- randomForest(x,y,importance=T,ntree = 70)
mydata2 <- mydata[,!names(mydata) %in% c("Cust_Key")]
mydata2$Y <- as.factor(mydata2$Y)
fit$importance
####var reduction#####
vartoremove <- ncol(mydata2) - 20
library(rminer)
#####
for(i in 1:vartoremove){
rf <- fit(Y~.,data=mydata2,model = "randomForest", mtry = 10 ,ntree = 100)
varImportance <- Importance(rf,mydata2,method="sensg")
Z <- order(varImportance$imp,decreasing = FALSE)
IND <- Z[2]
var_to_remove <- names(mydata2[IND])
mydata2[IND] = NULL
print(i)
}
###########
library(smbinning)
as.data.frame(mydata2) -> inp
summary(inp)
attach(inp)
rm(result)
str(inp)
inp$target <- as.numeric(inp$Y) *1
table(inp$target)
ftable(inp$Y,inp$target)
inp$target <- inp$target -1
result= smbinning(df=inp, y="target", x="X37", p=0.0005)
result$ivtable
smbinning.plot(result,option="badrate",sub="test")
summary(inp)
result$ivtable
boxplot(inp$X2~inp$Y,horizontal=T, frame=F, col="red",main="Distribution")
###Sample
require(caTools)
inp$Y <- NULL
sample = sample.split(inp$target, SplitRatio = .7)
train = subset(inp, sample == TRUE)
test = subset(inp, sample == FALSE)
head(train)
nrow(train)
fit1 <- glm(train$target~.,data=train,family = binomial)
summary(rf)
prediction1 <- data.frame(actual = test$target, predicted = predict(fit1,test ,type="response") )
result= smbinning(df=prediction1, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="test")
tail(prediction1)
write.csv(prediction1 , "test_pred_logistic.csv")
predict_train <- data.frame(actual = train$target, predicted = predict(fit1,train ,type="response") )
write.csv(predict_train , "train_pred_logistic.csv")
result= smbinning(df=predict_train, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="train")
####random forest
rf <- fit(target~.,data=train,model = "randomForest", mtry = 10 ,ntree = 200)
prediction2 <- data.frame(actual = test$target, predicted = predict(rf,train))
result= smbinning(df=prediction2, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="train")
###########IV
library(devtools)
install_github("riv","tomasgreif")
library(woe)
##### K-fold Validation ########
library(caret)
cv_fold_count = 2
folds = createFolds(mydata2$Y,cv_fold_count,list=T);
smpl = folds[[i]];
g_train = mydata2[-smpl,!names(mydata2) %in% c("Y")];
g_test = mydata2[smpl,!names(mydata2) %in% c("Y")];
cost_train = mydata2[-smpl,"Y"];
cost_test = mydata2[smpl,"Y"];
rf <- randomForest(g_train,cost_train)
logit.data <- cbind(cost_train,g_train)
logit.fit <- glm(cost_train~.,data=logit.data,family = binomial)
prediction <- data.frame(actual = test$Y, predicted = predict(rf,test))
