Predict function in R not giving an interval - r

I'm trying to use predict() in R to compute a prediction interval for a linear model. When I tried this on a simpler model with only one covariate, it gave the expected output of a point estimate with a confidence interval. When I added a categorical predictor to the model, the predict() output gives what seems like a single-point estimate with no interval. I've Googled to no avail. Can anyone tell me what I've done wrong here?
medcost <- data.frame(
ID = c(1:100),
charges = sample(0:100000, 100, replace = T),
bmi = sample(18:40, 100, replace = T),
smoker = factor(sample(c("smoker", "nonsmoker"), 100, replace = TRUE))
)
mod2 <- glm(charges ~ bmi + smoker, data = medcost)
predict(mod2, interval="predict",
newdata = data.frame(bmi=c(29, 31.5), smoker=c("smoker", "smoker")))

If you want to have the standard error, you could use se.fit = TRUE like this:
mod2 <- glm(charges ~ bmi + smoker, data = medcost)
predict(mod2, interval="predict",
newdata = data.frame(bmi=c(29, 31.5), smoker=c("smoker", "smoker")),
se.fit = TRUE)
#> $fit
#> 1 2
#> 47638.66 47106.14
#>
#> $se.fit
#> 1 2
#> 4304.220 4475.473
#>
#> $residual.scale
#> [1] 28850.85
Created on 2023-01-17 with reprex v2.0.2
I would recommend you having a look at this post: R: glm(...,family=poisson) plot confidence and prediction intervals

Related

Approximated SHAP values for multi-classification problem using randomForest

I would like to use the fastshap package to obtain SHAP values plots for every category of my outcome in a multi-classification problem using a random forest classifier. I could only found chunks of the code around, but no explanation on how to procede from the beginning in obtaining the SHAP values in this case. Here is the code I have so far (my y has 5 classes, here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(.75*nrow(ITA)), replace=F)
train <- ITA [sample,]
test <- ITA [-sample,]
set.seed(42)
rftrain <-randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3<- function(object, newdata)
caret::predict.train(object,
newdata = newdata,
type = "prob")[,3]
shap_values_G <- fastshap::explain(rftrain,
X = train,
pred_wrapper = p_function_3,
nsim = 50,
newdata=train[which(y==3),])
Now, I took the code largely from an example I found online, and I tried to adapt it (I am not an expert R user), but it does not work.. Can you please help me in correcting it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
fit,
X = train, # Reference data
feature_names = xvars,
pred_wrapper = pred_3,
nsim = 50,
newdata = train[train$Species == "virginica", ] # For these rows, you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305

Training, Tuning, Cross-Validating, and Testing Ranger (Random Forest) Quantile Regression Model? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed last year.
Improve this question
May someone share how to train, tune (hyperparameters), cross-validate, and test a ranger quantile regression model, along with error evaluation? With the iris or Boston housing dataset?
The reason I ask is because I have not been able to find many examples or walkthroughs using quantile regression on Kaggle, random blogs, Youtube. Most problems I encountered are classification problems.
I am currently using a quantile regression model but I am hoping to see other examples in particular with hyperparameter tuning
There are a lot of parameters for this function. Since this isn't a forum for what it all means, I really suggest that you hit up Cross Validates with questions on the how and why. (Or look for questions that may already be answered.)
library(tidyverse)
library(ranger)
library(caret)
library(funModeling)
data(iris)
#----------- setup data -----------
# this doesn't include exploration or cleaning which are both necessary
summary(iris)
df_status(iris)
#----------------- create training sample ----------------
set.seed(395280469) # for replicability
# create training sample partition (70/20 split)
tr <- createDataPartition(iris$Species,
p = .8,
list = F)
There are a lot of ways to split the data, but I tend to prefer Caret, because they word to even out factors if that's what you feed it.
#--------- First model ---------
fit.r <- ranger(Sepal.Length ~ .,
data = iris[tr, ],
write.forest = TRUE,
importance = 'permutation',
quantreg = TRUE,
keep.inbag = TRUE,
replace = FALSE)
fit.r
# Ranger result
#
# Call:
# ranger(Sepal.Length ~ ., data = iris[tr, ], write.forest = TRUE,
# importance = "permutation", quantreg = TRUE, keep.inbag = TRUE,
# replace = FALSE)
#
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 2
# Target node size: 5
# Variable importance mode: permutation
# Splitrule: variance
# OOB prediction error (MSE): 0.1199364
# R squared (OOB): 0.8336928
p.r <- predict(fit.r, iris[-tr, -1],
type = 'quantiles')
It defaults to .1, .5, and .9:
postResample(p.r$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.5165946 0.7659124 0.4036667
postResample(p.r$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3750556 0.7587326 0.3133333
postResample(p.r$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6488991 0.7461830 0.5703333
To see what this looks like in practice:
# this performance is the best so far, let's see what it looks like visually
ggplot(data.frame(p.Q1 = p.r$predictions[, 1],
p.Q5 = p.r$predictions[, 2],
p.Q9 = p.r$predictions[, 3],
Actual = iris[-tr, 1])) +
geom_point(aes(x = Actual, y = p.Q1, color = "P.Q1")) +
geom_point(aes(x = Actual, y = p.Q5, color = "P.Q5")) +
geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
geom_line(aes(Actual, Actual, color = "Actual")) +
scale_color_viridis_d(end = .8, "Error",
direction = -1)+
theme_bw()
# since Quantile .1 performed the best
ggplot(data.frame(p.Q9 = p.r$predictions[, 3],
Actual = iris[-tr, 1])) +
geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
geom_segment(aes(x = Actual, xend = Actual,
y = Actual, yend = p.Q9)) +
geom_line(aes(Actual, Actual, color = "Actual")) +
scale_color_viridis_d(end = .8, "Error",
direction = -1)+
theme_bw()
#------------ ranger model with options --------------
# last call used default
# splitrule: variance, use "extratrees" (only 2 for this one)
# mtry = 2, use 3 this time
# min.node.size = 5, using 6 this time
# using num.threads = 15 ** this is the number of cores on YOUR device
# change accordingly --- if you don't know, drop this one
set.seed(326)
fit.r2 <- ranger(Sepal.Length ~ .,
data = iris[tr, ],
write.forest = TRUE,
importance = 'permutation',
quantreg = TRUE,
keep.inbag = TRUE,
replace = FALSE,
splitrule = "extratrees",
mtry = 3,
min.node.size = 6,
num.threads = 15)
fit.r2
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 3
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1107299
# R squared (OOB): 0.8464588
This model produced similarly.
p.r2 <- predict(fit.r2, iris[-tr, -1],
type = 'quantiles')
postResample(p.r2$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4932883 0.8144309 0.4000000
postResample(p.r2$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3610171 0.7643744 0.3100000
postResample(p.r2$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6555939 0.8141144 0.5603333
The prediction was pretty similar overall, as well.
This isn't a very large set of data, with few predictors.
How much do they contribute?
importance(fit.r2)
# Sepal.Width Petal.Length Petal.Width Species
# 0.06138883 0.71052453 0.22956522 0.18082998
#------------ ranger model with options --------------
# drop a predictor, lower mtry, min.node.size
set.seed(326)
fit.r3 <- ranger(Sepal.Length ~ .,
data = iris[tr, -4], # dropped Sepal.Width
write.forest = TRUE,
importance = 'permutation',
quantreg = TRUE,
keep.inbag = TRUE,
replace = FALSE,
splitrule = "extratrees",
mtry = 2, # has to change (var count lower)
min.node.size = 4, # lowered
num.threads = 15)
fit.r3
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 3
# Mtry: 2
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1050143
# R squared (OOB): 0.8543842
The second most important predictor was removed and it improved.
p.r3 <- predict(fit.r3, iris[-tr, -c(1, 4)],
type = 'quantiles')
postResample(p.r3$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4760952 0.8089810 0.3800000
postResample(p.r3$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3738315 0.7769388 0.3250000
postResample(p.r3$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6085584 0.8032592 0.5170000
importance(fit.r3)
# almost everthing relies on Petal.Length
# Sepal.Width Petal.Length Species
# 0.08008264 0.95440333 0.32570147

Out-of-fold vs training error in caret

Using cross validation in model tuning, I get different error rates from caret::train's results object and calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).
The RMSE for cross validation is not calculated the way you show, but rather for each fold and then averaged. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
m$pred %>%
group_by(Resample) %>%
mutate(rmse = caret::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
m$pred %>%
group_by(Resample) %>%
mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
> MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately a pseudo-random process to be reproducible, then use set.seed immediately prior to the call.

Plotting a survival curve from a survreg prediction

I'm relatively new to survival analysis and have been used some standard telco churn data example with a sample below called 'telco':
telco <- read.csv(text = "State,Account_Length,Area_Code,Intl_Plan,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn
IN,65,415,no,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,TRUE
RI,74,415,no,187.7,127,31.91,163.4,148,13.89,196,94,8.82,9.1,5,2.46,0,FALSE
IA,168,408,no,128.8,96,21.9,104.9,71,8.92,141.1,128,6.35,11.2,2,3.02,1,FALSE
MT,95,510,no,156.6,88,26.62,247.6,75,21.05,192.3,115,8.65,12.3,5,3.32,3,FALSE
IA,62,415,no,120.7,70,20.52,307.2,76,26.11,203,99,9.14,13.1,6,3.54,4,FALSE
NY,161,415,no,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,TRUE")
I've run:
library(survival)
dependentvars = Surv(telco$Account_Length, telco$Churn)
telcosurvreg = survreg(dependentvars ~ -Churn -Account_Length, dist="gaussian",data=telco)
telcopred = predict(telcosurvreg, newdata=telco, type="quantile", p=.5)
...to get the predicted lifetime of each customer.
What I'm struggling with is how to visualise a survival curve for this. Is there a way (preferably in ggplot2) to do this from the data I have?
Here is a base R version that plots the predicted survival curves. I have changed the formula so the curves differ for each row
> # change setup so we have one covariate
> telcosurvreg = survreg(
+ Surv(Account_Length, Churn) ~ Eve_Charge, dist = "gaussian", data = telco)
> telcosurvreg # has more than an intercept
Call:
survreg(formula = Surv(Account_Length, Churn) ~ Eve_Charge, data = telco,
dist = "gaussian")
Coefficients:
(Intercept) Eve_Charge
227.274695 -3.586121
Scale= 56.9418
Loglik(model)= -12.1 Loglik(intercept only)= -12.4
Chisq= 0.54 on 1 degrees of freedom, p= 0.46
n= 6
>
> # find linear predictors
> vals <- predict(telcosurvreg, newdata = telco, type = "lp")
>
> # use the survreg.distributions object. See ?survreg.distributions
> x_grid <- 1:400
> sur_curves <- sapply(
+ vals, function(x)
+ survreg.distributions[[telcosurvreg$dist]]$density(
+ (x - x_grid) / telcosurvreg$scale)[, 1])
>
> # plot with base R
> matplot(x_grid, sur_curves, type = "l", lty = 1)
Here is the result

"response" prediciton with earth (MARS) and caret in R

I hope this is not too naive of a question.
I am performing a series of binomial regressions with different models in the caret package in R. All are working so far except for earth (MARS). Typically, the earth family is passed to the glm function through the earth function as glm=list(family=binomial). This seems to be working ok (as evident below). For the general predict() function, I would use the type="response' to properly scale the prediction. The examples below show the non-caret approach in fit1 with the correct prediction in pred1. pred1a is the improperly scaled prediction without type='response'. fit2 is the approach with caret and pred2 is the prediction; it is the same as the non-scaled prediction in pred1a. Digging through the fit2 object, the properly fitted values are present in the glm.list component. Therefore, the earth() function is behaving as it should.
The question is... since the caret prediction() function only takes type='prob' or 'raw', how can I instruct is to predict on the scale of the response?
Thank you very much.
require(earth)
library(caret)
data(mtcars)
fit1 <- earth(am ~ cyl + mpg + wt + disp, data = mtcars,
degree=1, glm=list(family=binomial))
pred1 <- predict(fit1, newdata = mtcars, type="response")
range(pred1)
[1] 0.0004665284 0.9979135993 # Correct - binomial with response
pred1a <- predict(fit1, newdata = mtcars)
range(pred1a)
[1] -7.669725 6.170226 # without "response"
fit2ctrl <- trainControl(method = "cv", number = 5)
fit2 <- train(am ~ cyl + mpg + wt + disp, data = mtcars, method = "earth",
trControl = fit2ctrl, tuneLength = 3,
glm=list(family='binomial'))
pred2 <- predict(fit2, newdata = mtcars)
range(pred2)
[1] -7.669725 6.170226 # same as pred1a
#within glm.list object in fit4
[1] 0.0004665284 0.9979135993
There are a few things:
the outcome (mtcars$am) is numeric 0/1 and train will treat this as a regression model
when the outcome is a factor, train will assume classification and will automatically add glm=list(family=binomial)
with classification and train, you will need to add classProbs = TRUE to trainControl for the model to produce class probabilities.
Here is an example with a different data set in the earth package:
library(earth)
library(caret)
data(etitanic)
a1 <- earth(survived ~ .,
data = etitanic,
glm=list(family=binomial),
degree = 2,
nprune = 5)
etitanic$survived <- factor(ifelse(etitanic$survived == 1, "yes", "no"),
levels = c("yes", "no"))
a2 <- train(survived ~ .,
data = etitanic,
method = "earth",
tuneGrid = data.frame(degree = 2, nprune = 5),
trControl = trainControl(method = "none",
classProbs = TRUE))
then:
> predict(a1, head(etitanic), type = "response")
survived
[1,] 0.8846552
[2,] 0.9281010
[3,] 0.8846552
[4,] 0.4135716
[5,] 0.8846552
[6,] 0.4135716
>
> predict(a2, head(etitanic), type = "prob")
yes no
1 0.8846552 0.11534481
2 0.9281010 0.07189895
3 0.8846552 0.11534481
4 0.4135716 0.58642840
5 0.8846552 0.11534481
6 0.4135716 0.58642840
Max

Resources