I hope this is not too naive of a question.
I am fitting a series of binomial regressions with different models through the caret package in R. All are working so far except for earth (MARS). Typically, the binomial family is passed through the earth function to glm as glm=list(family=binomial). This seems to be working OK (as shown below). With the general predict() function, I would use type="response" to put the prediction on the probability scale. The examples below show the non-caret approach in fit1, with the correct prediction in pred1. pred1a is the improperly scaled prediction without type="response". fit2 is the approach with caret and pred2 is its prediction; it is the same as the unscaled prediction in pred1a. Digging through the fit2 object, the properly scaled fitted values are present in the glm.list component. Therefore, the earth() function is behaving as it should.
The question is: since caret's predict() function only takes type='prob' or 'raw', how can I instruct it to predict on the scale of the response?
Thank you very much.
require(earth)
library(caret)
data(mtcars)
fit1 <- earth(am ~ cyl + mpg + wt + disp, data = mtcars,
degree=1, glm=list(family=binomial))
pred1 <- predict(fit1, newdata = mtcars, type="response")
range(pred1)
[1] 0.0004665284 0.9979135993 # Correct - binomial with response
pred1a <- predict(fit1, newdata = mtcars)
range(pred1a)
[1] -7.669725 6.170226 # without "response"
fit2ctrl <- trainControl(method = "cv", number = 5)
fit2 <- train(am ~ cyl + mpg + wt + disp, data = mtcars, method = "earth",
trControl = fit2ctrl, tuneLength = 3,
glm=list(family='binomial'))
pred2 <- predict(fit2, newdata = mtcars)
range(pred2)
[1] -7.669725 6.170226 # same as pred1a
#within glm.list object in fit2
[1] 0.0004665284 0.9979135993
There are a few things:
the outcome (mtcars$am) is numeric 0/1 and train will treat this as a regression model
when the outcome is a factor, train will assume classification and will automatically add glm=list(family=binomial)
with classification and train, you will need to add classProbs = TRUE to trainControl for the model to produce class probabilities.
Here is an example with a different data set in the earth package:
library(earth)
library(caret)
data(etitanic)
a1 <- earth(survived ~ .,
data = etitanic,
glm=list(family=binomial),
degree = 2,
nprune = 5)
etitanic$survived <- factor(ifelse(etitanic$survived == 1, "yes", "no"),
levels = c("yes", "no"))
a2 <- train(survived ~ .,
data = etitanic,
method = "earth",
tuneGrid = data.frame(degree = 2, nprune = 5),
trControl = trainControl(method = "none",
classProbs = TRUE))
then:
> predict(a1, head(etitanic), type = "response")
survived
[1,] 0.8846552
[2,] 0.9281010
[3,] 0.8846552
[4,] 0.4135716
[5,] 0.8846552
[6,] 0.4135716
>
> predict(a2, head(etitanic), type = "prob")
yes no
1 0.8846552 0.11534481
2 0.9281010 0.07189895
3 0.8846552 0.11534481
4 0.4135716 0.58642840
5 0.8846552 0.11534481
6 0.4135716 0.58642840
Max
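As a side note on the scales involved: the values in pred1a and pred2 are on the logit (link) scale, and the inverse logit maps them onto the probabilities seen in pred1. A minimal base-R sketch, using a plain glm() on mtcars as a stand-in for the earth model (an assumption, made so that it runs without earth or caret):

```r
# Ranges reported in the question: link scale (pred1a / pred2)
link_range <- c(-7.669725, 6.170226)

# plogis() is the inverse logit; it should reproduce the range of pred1
prob_range <- plogis(link_range)
print(prob_range)

# Same relationship on a plain binomial glm (stand-in model, not earth)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
link_pred <- predict(fit, newdata = mtcars, type = "link")
resp_pred <- predict(fit, newdata = mtcars, type = "response")
same_scale <- isTRUE(all.equal(plogis(link_pred), resp_pred))
print(same_scale)
```

This only recovers the numbers by hand; converting the outcome to a factor, as described in the answer, is still the way to get class probabilities from train.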
I'm trying to use predict() in R to compute a prediction interval for a linear model. When I tried this on a simpler model with only one covariate, it gave the expected output of a point estimate with a confidence interval. When I added a categorical predictor to the model, the predict() output gives what seems like a single-point estimate with no interval. I've Googled to no avail. Can anyone tell me what I've done wrong here?
medcost <- data.frame(
ID = c(1:100),
charges = sample(0:100000, 100, replace = T),
bmi = sample(18:40, 100, replace = T),
smoker = factor(sample(c("smoker", "nonsmoker"), 100, replace = TRUE))
)
mod2 <- glm(charges ~ bmi + smoker, data = medcost)
predict(mod2, interval="predict",
newdata = data.frame(bmi=c(29, 31.5), smoker=c("smoker", "smoker")))
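If you want to have the standard error, you could use se.fit = TRUE like this (note that predict() on a glm object has no interval argument, so interval="predict" is silently ignored):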
mod2 <- glm(charges ~ bmi + smoker, data = medcost)
predict(mod2, interval="predict",
newdata = data.frame(bmi=c(29, 31.5), smoker=c("smoker", "smoker")),
se.fit = TRUE)
#> $fit
#> 1 2
#> 47638.66 47106.14
#>
#> $se.fit
#> 1 2
#> 4304.220 4475.473
#>
#> $residual.scale
#> [1] 28850.85
Created on 2023-01-17 with reprex v2.0.2
I would recommend having a look at this post: R: glm(...,family=poisson) plot confidence and prediction intervals
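Since glm() without a family argument fits a Gaussian model, one option is to fit the same model with lm(), whose predict() method does support interval = "prediction". The sketch below assumes that is acceptable here; it also rebuilds the interval by hand from the se.fit and residual.scale values that predict() returns for the glm. The seed and the object names (mod_lm, new_obs, manual) are illustrative choices, not part of the original post.

```r
set.seed(1)  # illustrative seed so the simulated data are reproducible
medcost <- data.frame(
  charges = sample(0:100000, 100, replace = TRUE),
  bmi     = sample(18:40, 100, replace = TRUE),
  smoker  = factor(sample(c("smoker", "nonsmoker"), 100, replace = TRUE))
)
new_obs <- data.frame(bmi = c(29, 31.5), smoker = c("smoker", "smoker"))

# Option 1: lm() supports prediction intervals directly
mod_lm <- lm(charges ~ bmi + smoker, data = medcost)
pi_lm  <- predict(mod_lm, newdata = new_obs, interval = "prediction", level = 0.95)
print(pi_lm)

# Option 2: build the same interval from the Gaussian glm's se.fit output
mod_glm <- glm(charges ~ bmi + smoker, data = medcost)
p    <- predict(mod_glm, newdata = new_obs, se.fit = TRUE)
half <- qt(0.975, df.residual(mod_glm)) * sqrt(p$se.fit^2 + p$residual.scale^2)
manual <- cbind(fit = p$fit, lwr = p$fit - half, upr = p$fit + half)
print(manual)
```

The manual construction is only valid for the Gaussian family; for other families, such as Poisson, the linked post is the better starting point.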
I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-class classification problem using a random forest classifier. I could only find chunks of code around, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(.75*nrow(ITA)), replace=F)
train <- ITA [sample,]
test <- ITA [-sample,]
set.seed(42)
rftrain <-randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3<- function(object, newdata)
caret::predict.train(object,
newdata = newdata,
type = "prob")[,3]
shap_values_G <- fastshap::explain(rftrain,
X = train,
pred_wrapper = p_function_3,
nsim = 50,
newdata=train[which(y==3),])
I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
fit,
X = train, # Reference data
feature_names = xvars,
pred_wrapper = pred_3,
nsim = 50,
newdata = train[train$Species == "virginica", ] # For these rows, you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
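To cover every class rather than just one, the prediction wrapper can be generated per class with a small function factory. The sketch below is only an illustration under stated assumptions: make_pred_wrapper is a hypothetical helper name, and rpart (shipped with standard R installations) stands in for randomForest so the code runs without extra packages. Each resulting wrapper has the same (object, newdata) shape as pred_3 above and could be passed as pred_wrapper in its place.

```r
library(rpart)

# Hypothetical helper: build a wrapper that returns the probability of one class
make_pred_wrapper <- function(class_name) {
  force(class_name)  # fix the class name at creation time
  function(object, newdata) {
    predict(object, newdata = newdata, type = "prob")[, class_name]
  }
}

# Stand-in classifier (assumption): rpart instead of randomForest
fit <- rpart(Species ~ Sepal.Width + Sepal.Length, data = iris, method = "class")

class_names <- levels(iris$Species)
wrappers <- setNames(lapply(class_names, make_pred_wrapper), class_names)

# One column of predicted probabilities per class
probs <- sapply(wrappers, function(w) w(fit, iris))
print(head(probs))
```

With a random forest fit, looping over names(wrappers) and calling the explain function once per wrapper would give one set of SHAP values per class.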
I'm trying to run some tests on my linear regression model with functions such as ols_vif_tol(), ols_test_normality() or durbinWatsonTest(), which only work with lm() objects. However, I fitted my model using the train() function of the caret package.
> fitcontrol = trainControl( method = "repeatedcv", number = floor(0.4*nrow(TrainData)), repeats = RepeatsTC, returnResamp = "all", savePredictions = "all")
> BestModel = train(Formula2, data = TrainData, trControl = fitcontrol, method = "lm", metric = "RMSE")
At the end I get this output:
> BestModel
Linear Regression
10 samples
1 predictor
No pre-processing
Resampling: Cross-Validated (4 fold, repeated 100 times)
Summary of sample sizes: 7, 8, 8, 7, 7, 8, ...
Resampling results:
RMSE Rsquared MAE
10.75823 0.8911761 9.660638
Tuning parameter 'intercept' was held constant at a value of TRUE
What I want is to have this output:
> GoodModel = lm(Formula2, data = FinalData)
> GoodModel
Call:
lm(formula = Formula2, data = FinalData)
Coefficients:
(Intercept) Evol.INDUS.PROD
4.089 3.908
So, even though I used method = "lm", I don't get the same kind of object, which gives me an error when I run my tests.
> ols_test_normality(BestModel)
Error in ols_test_normality.default(BestModel) : y must be numeric
> ols_test_normality(GoodModel)
-----------------------------------------------
Test Statistic pvalue
-----------------------------------------------
Shapiro-Wilk 0.9042 0.1528
Kolmogorov-Smirnov 0.1904 0.6661
Cramer-von Mises 1.1026 0.0010
Anderson-Darling 0.4615 0.2156
-----------------------------------------------
I know there is an as.lm function, but I tried it and I don't have a version that can use it.
Does someone know how to get the same form as the lm() function after using train or a way to use the output of BestModel to do those tests?
EDIT
Here is a simpler case that gives rise to the same error and where you can try different tests.
install.packages("olsrr")
install.packages("caret")
library(olsrr)
library(caret)
first = sample(1:10, 10, rep = TRUE)
second = sample(10:20, 10, rep = TRUE)
third = sample(20:30, 10, rep = TRUE)
Df = data.frame(first, second, third)
Df
#Create a model with lm
Model1 = lm(first ~ second + third, data = Df)
Model1
summary(Model1)
ols_test_normality(Model1)
#Create a model with caret::train
Fold = sample(1:nrow(Df) ,size = 0.8*nrow(Df), replace = FALSE)
TrainData = Df[Fold,]
TestData = Df[-Fold,]
fitcontrol = trainControl(method = "repeatedcv", number = 2, repeats = 10)
Model2 = train(first ~ second + third, data = TrainData, trControl = fitcontrol, method = "lm")
Model2
summary(Model2)
ols_test_normality(Model2)
Thank you
Your Model2 is a train object, so ols_test_normality will not work on it:
class(Model2)
[1] "train" "train.formula"
The final lm model is stored under finalModel:
class(Model2$finalModel)
[1] "lm"
ols_test_normality(Model2$finalModel)
-----------------------------------------------
Test Statistic pvalue
-----------------------------------------------
Shapiro-Wilk 0.9843 0.9809
Kolmogorov-Smirnov 0.149 0.9822
Cramer-von Mises 0.4212 0.0611
Anderson-Darling 0.1677 0.9004
-----------------------------------------------
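If several diagnostics need to accept either kind of object, the unwrapping can be put in a small helper. This is a minimal sketch: unwrap_final_model is a hypothetical name, and the "train" object is mimicked with a bare list carrying only the class and the finalModel slot, so that the code runs without caret. Base R's shapiro.test() is used as the example diagnostic.

```r
# Hypothetical helper: return the underlying fit when given a caret "train" object
unwrap_final_model <- function(model) {
  if (inherits(model, "train")) model$finalModel else model
}

set.seed(1)  # illustrative seed
Df <- data.frame(first  = sample(1:10, 10, replace = TRUE),
                 second = sample(10:20, 10, replace = TRUE),
                 third  = sample(20:30, 10, replace = TRUE))
plain_lm <- lm(first ~ second + third, data = Df)

# Stand-in for a caret result (assumption): only class and finalModel are mimicked
mock_train <- structure(list(finalModel = plain_lm),
                        class = c("train", "train.formula"))

final_fit <- unwrap_final_model(mock_train)
print(class(final_fit))

# Any lm-based diagnostic can now be applied to the unwrapped fit
print(shapiro.test(residuals(final_fit)))
```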
I am trying to find the model with the lowest AIC. The models come from two nested for loops that build all possible combinations of columns. I am unable to make the function return the model with the lowest AIC. The code below demonstrates where I got stuck:
rm(list = ls())
data <- iris
data <- data[data$Species %in% c("setosa", "virginica"),]
data$Species = ifelse(data$Species == 'virginica', 0, 1)
mod_headers <- names(data[1:ncol(data)-1])
f <- function(mod_headers){
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- c(tab[,j])
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[c(mod_tab_new)], family = binomial(link = "logit"))
}
}
best_model <- model[which(AIC(model)[order(AIC(model))][1])]
print(best_model)
}
f(mod_headers)
Any suggestions? Thanks!
I replaced your for loops with vectorised alternatives:
library(tidyverse)
library(iterators)
# Column names you want to use in glm model, saved as list
whichcols <- Reduce("c", map(1:length(mod_headers), ~lapply(iter(combn(mod_headers,.x), by="col"),function(y) c(y))))
# glm model results using selected column names, saved as list
models <- map(1:length(whichcols), ~glm(Species ~., data=data[c(whichcols[[.x]], "Species")], family = binomial(link = "logit")))
# selects model with lowest AIC
best <- models[[which.min(sapply(1:length(models),function(x)AIC(models[[x]])))]]
Output
Call: glm(formula = Species ~ ., family = binomial(link = "logit"),
data = data[c(whichcols[[.x]], "Species")])
Coefficients:
(Intercept) Petal.Length
55.40 -17.17
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 1.208e-09 AIC: 4
Using your loop, just put all the models in one list.
Then compute the AIC of all these models.
Finally return the model with the minimum AIC.
f <- function(mod_headers) {
models <- list()
k <- 1
for (i in 1:length(mod_headers)) {
tab <- combn(mod_headers, i)
for(j in 1:ncol(tab)) {
mod_tab_new <- c(tab[, j], "Species")
models[[k]] <- glm(Species ~ ., data = data[mod_tab_new],
family = binomial(link = "logit"))
k <- k + 1
}
}
models[[which.min(sapply(models, AIC))]]
}
glm() uses an iteratively reweighted least squares algorithm. Here the algorithm reaches the maximum number of iterations before it converges; raising that limit helps in your case:
glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
There was another issue with the use of which. I replaced it with an if after each model fit that compares the new model's AIC with the lowest AIC so far. However, I think there are better solutions than this for-loop approach.
f <- function(mod_headers){
lowest_aic <- Inf # added
best_model <- NULL # added
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- tab[, j]
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
if(AIC(model) < lowest_aic){ # added
lowest_aic <- AIC(model) # added
best_model <- model # added
}
}
}
return(best_model)
}
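One loop-free alternative in base R is to enumerate the predictor subsets with combn(), build each formula with reformulate(), and pick the fit with the smallest AIC. The sketch below is an illustration under one stated assumption: it uses versicolor vs. virginica instead of the setosa vs. virginica pair from the question, because the latter is perfectly separable (hence the near-zero residual deviance and the convergence warnings). The column name y and the other object names are illustrative.

```r
# Two classes that are not perfectly separable (assumption: differs from the question)
dat <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])
dat$y <- as.integer(dat$Species == "virginica")

predictors <- setdiff(names(iris), "Species")

# All non-empty subsets of the predictors, as a flat list of character vectors
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# One logistic regression per subset
models <- lapply(subsets, function(vars) {
  glm(reformulate(vars, response = "y"), data = dat, family = binomial)
})

# Select the fit with the smallest AIC
aics <- vapply(models, AIC, numeric(1))
best <- models[[which.min(aics)]]
print(formula(best))
print(min(aics))
```

Because each formula names its variables explicitly, the selected model prints a readable call instead of Species ~ . with an indexed data argument.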
When I run a panel specification with clustered standard errors in plm and lfe, I get results that differ at the second significant figure. Does anyone know why they differ in their calculation of the SEs?
set.seed(572015)
library(lfe)
library(plm)
library(lmtest)
# clustering example
x <- c(sapply(sample(1:20), rep, times = 1000)) + rnorm(20*1000, sd = 1)
y <- 5 + 10*x + rnorm(20*1000, sd = 10) + c(sapply(rnorm(20, sd = 10), rep, times = 1000))
facX <- factor(sapply(1:20, rep, times = 1000))
mydata <- data.frame(y=y,x=x,facX=facX, state=rep(1:1000, 20))
model <- plm(y ~ x, data = mydata, index = c("facX", "state"), effect = "individual", model = "within")
plmTest <- coeftest(model,vcov=vcovHC(model,type = "HC1", cluster="group"))
lfeTest <- summary(felm(y ~ x | facX | 0 | facX))
data.frame(lfeClusterSE=lfeTest$coefficients[2],
plmClusterSE=plmTest[2])
lfeClusterSE plmClusterSE
1 0.06746538 0.06572588
The difference is in the degrees-of-freedom adjustment. This is the usual first guess when looking for differences in supposedly similar standard errors (see e.g., Different Robust Standard Errors of Logit Regression in Stata and R). Here, the problem can be illustrated when comparing the results from (1) plm+vcovHC, (2) felm, (3) lm+cluster.vcov (from package multiwayvcov).
First, I refit all models:
m1 <- plm(y ~ x, data = mydata, index = c("facX", "state"),
effect = "individual", model = "within")
m2 <- felm(y ~ x | facX | 0 | facX, data = mydata)
m3 <- lm(y ~ facX + x, data = mydata)
All lead to the same coefficient estimates. For m3 the fixed effects are explicitly reported while they are not for m1 and m2. Hence, for m3 only the last coefficient is extracted with tail(..., 1).
all.equal(coef(m1), coef(m2))
## [1] TRUE
all.equal(coef(m1), tail(coef(m3), 1))
## [1] TRUE
The non-robust standard errors also agree.
se <- function(object) tail(sqrt(diag(object)), 1)
se(vcov(m1))
## x
## 0.07002696
se(vcov(m2))
## x
## 0.07002696
se(vcov(m3))
## x
## 0.07002696
And when comparing the clustered standard errors we can now show that felm uses the degrees-of-freedom correction while plm does not:
se(vcovHC(m1))
## x
## 0.06572423
m2$cse
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX))
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX, df_correction = FALSE))
## x
## 0.06572423
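The size of the gap can be checked by hand. The sketch below assumes the usual small-sample factor G/(G-1) * (N-1)/(N-K) applied to the covariance matrix, which is the correction documented for cluster.vcov's df_correction argument. G, N and K are read off the simulated data: 20 clusters, 20,000 rows, and 21 coefficients in m3 (intercept, 19 dummies, x).

```r
G <- 20      # number of clusters (levels of facX)
N <- 20000   # number of observations
K <- 21      # coefficients in m3: intercept + 19 fixed-effect dummies + x

# Factor applied to the covariance matrix; its square root scales the SE
vcov_factor <- (G / (G - 1)) * ((N - 1) / (N - K))
se_factor   <- sqrt(vcov_factor)

se_uncorrected <- 0.06572423   # se(vcovHC(m1)) reported above
se_corrected   <- se_uncorrected * se_factor

print(se_factor)
print(se_corrected)   # compare with m2$cse reported above
```

Up to rounding of the printed inputs, the rescaled value agrees with the 0.06746538 reported by felm and by cluster.vcov with its default correction.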