Why do DALEX and tidymodels provide different GOF?

I wonder why DALEX model_performance and collect_metrics do not provide the same accuracy. Do they use different measures or different methods? I've compiled the following example code:
library(tidymodels)
library(parsnip)
library(DALEXtra)
set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
id <- c(1:1000)
y <- as.factor(rbinom(1000, 5, .5))
df <- tibble(y, x1, x2, x3, x4, id)
# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)
# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
update_role(id, new_role = "id variable") %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_pca(x1, x2, x3, threshold = 0.9, num_comp = 1)
# parsnip engine
boost_model <- boost_tree() %>%
set_mode("classification") %>%
set_engine("xgboost")
# create wf
boosted_wf <-
workflow() %>%
add_model(boost_model) %>%
add_recipe(rec_pca)
boosted_res <- last_fit(boosted_wf, split_dat)
collect_metrics(boosted_res)
The output of collect_metrics() reports an accuracy of 0.31:
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy multiclass 0.31 Preprocessor1_Model1
2 roc_auc hand_till 0.512 Preprocessor1_Model1
Continuing, I prepare the DALEX model explanation:
final_boosted <- generics::fit(boosted_wf, df)
# create an explanation object
explainer_xgb <- DALEXtra::explain_tidymodels(final_boosted,
data = df[,-1],
y = df$y)
perf <- model_performance(explainer_xgb)
perf
Now this provides the following output for the overall fit:
Measures for: multiclass
micro_F1 : 0.43
macro_F1 : 0.5743392
w_macro_F1 : 0.4775901
accuracy : 0.43
w_macro_auc: 0.7064296
Note that accuracy is 0.43 using model_performance and 0.31 using collect_metrics. Does anyone know why this is the case?

I believe it is because different resampling indices/schemes are being used; in other words, different data are being used to compute the performance statistics. last_fit() fits on the training portion of split_dat and computes its metrics on the 20% test set, while explain_tidymodels() above was handed the full df together with a model refit on all 1000 rows, so the two accuracies are computed on different observations with different fits.
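A minimal sketch of an apples-to-apples check, reusing the objects from the question: build the explainer from the workflow that last_fit() actually fitted, and hand it the same test set that collect_metrics() scored (extract_workflow() assumes a recent version of tune; on older versions, boosted_res$.workflow[[1]] is the equivalent).
# Score the last_fit() model on the held-out test set only
fitted_wf <- extract_workflow(boosted_res)
explainer_test <- DALEXtra::explain_tidymodels(fitted_wf,
data = test[, -1], # drop the outcome column, as in the question
y = test$y)
model_performance(explainer_test)
With both tools now looking at the same model and the same rows, the two accuracy figures should line up.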

Related

Extract categorical coefficients and all p-values from a mixed model into a data table

Here is reproducible code and sample data.
I want to end up with a data table with three columns: 1. exposure quantile, 2. OR/RR, 3. p-value.
library(dplyr)
library(nlme)
library(data.table)
set.seed(42)
n <- 100
dat = data.frame(ID = rep(c(1:25),times=4 ) ,
Score = rnorm(n, mean=0.3, sd=0.8))
dat = dat %>%
group_by(ID)%>%
dplyr::mutate(exposure1 = rep(c(rnorm(1, mean=6, sd=1.8))),
exposure2 = rep(c(rnorm(1, mean=3, sd=0.6))),
age = rep(c(rnorm(1, mean=40, sd=15))))%>%
ungroup()%>%
dplyr::mutate(exposure1_quantile = cut(exposure1, breaks = 4, labels = c("Q1","Q2","Q3","Q4")),
exposure2_quantile = cut(exposure2, breaks = 4, labels = c("Q1","Q2","Q3","Q4")))
exposures_var = c("exposure1_quantile","exposure2_quantile")
exposure_var_labels("exposure1 Q1","exposure1 Q2 ", "exposure 1 Q3",
"exposure2 Q1","exposure2 Q2 ", "exposure2 Q3")
age="age"
outcome = "Score"
exposure_data_table = c()
for(i in 1:length(exposures_var)){
exp = exposures_var[i]
fixed_effects_formula = paste0(outcome, "~",exp,"+",age)
fixed_effects_formula = as.formula(fixed_effects_formula)
mixedmodel = lme(fixed =fixed_effects_formula, random = ~1|ID, data=dat, method = "ML")
for(m in 2:4){
v = mixedmodel$coefficients$fixed[m]
vector = c(exp , v)
#P=p value for every quantile (HOW TO ADD?)
#exposure_name = exposure_var_labels[?] (HOW TO ADD LABEL)
exposure_data_table = rbind(exposure_data_table, vector)
}
}
exposure_data_table = as.data.table(exposure_data_table)
colnames(exposure_data_table)=c("Exposure","RR")#,"pv")
view(exposure_data_table)
I first tried anova() to get the p-values, but it didn't work.
I think a tidymodels approach using lme would work well here:
library(nlme)
library(tidymodels)
library(multilevelmod)
library(data.table)
lme_spec <-
linear_reg() %>%
set_engine("lme", random = ~ 1 | ID)
Map(function(exp) {
fixed_effects_formula <- as.formula(paste0("Score~",exp,"+ age +", 0))
lme_spec %>%
fit(fixed_effects_formula, data = dat) %>%
broom.mixed::tidy() %>%
filter(effect == "fixed", grepl("exposure", term)) %>%
select(term, estimate, std.error, p.value)
}, exposures_var) %>%
bind_rows() %>%
as.data.table()
#> term estimate std.error p.value
#> 1: exposure1_quantileQ1 -0.16147364 0.3532834 0.6525497
#> 2: exposure1_quantileQ2 0.22318505 0.2719366 0.4214784
#> 3: exposure1_quantileQ3 0.24976757 0.3484126 0.4817411
#> 4: exposure1_quantileQ4 0.14177064 0.4020702 0.7280757
#> 5: exposure2_quantileQ1 0.28976458 0.4191198 0.4972840
#> 6: exposure2_quantileQ2 0.19907863 0.2699164 0.4693496
#> 7: exposure2_quantileQ3 0.35040767 0.2827229 0.2295436
#> 8: exposure2_quantileQ4 -0.09587234 0.3533819 0.7889412
Created on 2022-08-07 by the reprex package (v2.0.1)
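If you also want the human-readable labels from the question, one option is to recode term after tidying. A small sketch, assuming the result of the pipeline above has been assigned to res (the label strings are illustrative):
# Recode model terms into display labels after tidying
res <- res %>%
dplyr::mutate(Exposure = dplyr::recode(term,
"exposure1_quantileQ1" = "exposure1 Q1",
"exposure1_quantileQ2" = "exposure1 Q2",
"exposure2_quantileQ1" = "exposure2 Q1",
"exposure2_quantileQ2" = "exposure2 Q2"))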

crr output: remove df$ from coefficient names?

I am using the cmprsk package to fit a series of regressions. In my real models, I specified them the same way as in the example that produces mel2 below. My problem: I want the Melanoma$ in front of the coefficient names to go away, as happens when the model is specified as in mel1. Is there a way to remove that data-frame prefix from the object without re-running the model?
library(cmprsk)
data(Melanoma, package = "MASS")
head(Melanoma)
mel1 <- crr(ftime = Melanoma$time, fstatus = Melanoma$status, cov1 = Melanoma[, c("sex", "age")], cencode = 2)
covs2 <- model.matrix(~ Melanoma$sex + Melanoma$age)[, -1]
mel2 <- crr(ftime = Melanoma$time, fstatus = Melanoma$status, cov1 = covs2, cencode = 2)
What I want: coefficients named sex and age, as in mel1. What I have: names prefixed with Melanoma$.
You could use the data argument in model.matrix, and wrap the crr call in with(Melanoma, ...)
covs2 <- model.matrix(~ sex + age, data = Melanoma)[, -1]
mel2 <- with(Melanoma, crr(ftime = time, fstatus = status,
cov1 = covs2, cencode = 2))
mel2$coef
#> sex age
#> 0.58838573 0.01259388
If you are stuck with existing models like this:
covs2 <- model.matrix(~ Melanoma$sex + Melanoma$age)[, -1]
mel2 <- crr(ftime = Melanoma$time, fstatus = Melanoma$status,
cov1 = covs2, cencode = 2)
You could simply rename the coefficients like this
names(mel2$coef) <- c("sex", "age")
mel2
#> convergence: TRUE
#> coefficients:
#> sex age
#> 0.58840 0.01259
#> standard errors:
#> [1] 0.271800 0.009301
#> two-sided p-values:
#> sex age
#> 0.03 0.18
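With many covariates, you could also strip the prefix programmatically instead of typing each name out; a one-line sketch, assuming the names carry the Melanoma$ prefix as above:
# Drop the literal "Melanoma$" prefix from every coefficient name at once
names(mel2$coef) <- sub("Melanoma$", "", names(mel2$coef), fixed = TRUE)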

Optimize function in R

Here is my code:
cee = abs(qnorm(.5*0.1)) # Bonferroni threshold for achieving study-wide significance = 0.1
p.value = (simAll %>% select("p.value"))
p.value1 <- as.numeric(unlist(p.value))
# we use "cee" so R does not get confused with the function 'c'
betahat = log(OR) # Reported OR
z = sign(betahat)*abs(qnorm(0.5*p.value1)) # Reported p-value = 5.7e-4, which we convert to a z-value
###################################################
# THE PROPOSED APPROACH #
###################################################
se = betahat/z # standard error of betahat
mutilde1 = optimize(f=conditional.like,c(-20,20),maximum=T,z=z,cee=cee)$maximum # the conditional mle
p.value holds the p-values from 1000 simulations, and OR likewise has 1000 entries, so the "se" line gives me 1000 different standard error values. But the mutilde1 line throws an error:
"Error in optimize(f = conditional.like, c(-20, 20), maximum = T, z = z, :
invalid function value in 'optimize'"
How can I fix this issue?
The conditional.like() function:
conditional.like=function(mu,cee,z){
like=dnorm(z-mu)/(pnorm(mu-cee)+pnorm(-cee-mu))
return((abs(z)>cee)*like) }
simAll is a tibble that looks like this (1,000 rows in total):
# A tibble: 1,000 x 6
id term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 .x 0.226 0.127 1.78 0.0747
2 2 .x 0.137 0.127 1.08 0.280
3 3 .x 0.304 0.127 2.38 0.0171
4 4 .x 0.497 0.128 3.87 0.000111
OR (1,000 values in total):
> OR
[1] 1.5537098 1.0939850 1.4491432 1.6377551 1.1646904 1.3387534 1.6377551 1.5009351 1.7918552
Also, here is my overall code:
library(tidyverse)
library(broom)
# create a tibble with an id column for each simulation and x wrapped in list()
sim <- tibble(id = 1:1000,
x = list(rbinom(1000,1,0.5))) %>%
# to generate z, pr, y, k use map and map2 from the purrr package to loop over the list column x
# `~ ... ` is similar to `function(.x) {...}`
# `.x` represents the variable you are using map on
mutate(z = map(x, ~ log(1.3) * .x),
pr = map(z, ~ 1 / (1 + exp(-.x))),
y = map(pr, ~ rbinom(1000, 1, .x)),
k = map2(x, y, ~ glm(.y ~ .x, family="binomial")),
# use broom::tidy to get the model summary in form of a tibble
sum = map(k, broom::tidy)) %>%
# select id and sum and unnest the tibbles
select(id, sum) %>%
unnest(cols = c(sum))
simOR <- sim %>%
# drop the intercepts and keep only terms with p < 0.05
filter(term !="(Intercept)",
p.value < 0.05)
sim
j1=exp(simOR %>% select("estimate"))
OR1=as.numeric(unlist(j1))
mean(OR1)
simAll <- sim %>%
filter(term !="(Intercept)")
j <- exp(simAll %>% select("estimate"))
OR2 <- as.numeric(unlist(j))
mean(OR2)
simOR2 <- sim %>%
filter(term !="(Intercept)",
p.value < 0.005)
j2 <- exp(simOR2 %>% select("estimate"))
OR3 <- as.numeric(unlist(j2))
mean(OR3)
#op <- par(mfrow = c(3, 1))
hga=hist(OR2, main = NULL, freq = T, breaks = 10) #OR2:Overall OR
hgb=hist(OR1, freq = T,col=2,breaks=10, main="OR:p-value<0.05") #OR1:p-value<0.05
hgc=hist(OR3, freq = T,col=2,breaks=10, main="OR:p-value<0.005") #OR3:p-value<0.005
plot(hga,col=rgb(0,1,0,0.5),main = "OR",xlim=c(0.8,2),ylim=c(0,250))
plot(hgb, add = TRUE,col=rgb(0,0,0.8,0.5),xlim=c(0.8,2),ylim=c(0,250))
plot(hgc, add = TRUE,col=rgb(1,0,0,0.5),xlim=c(0.8,2))
abline(v = mean(OR2), lwd = 4, col = 3)
abline(v = mean(OR3), lwd = 4, col=2)
text(1.65,240,"1.31",col=1)
arrows(1.5,240,1.31,240,length=0.1,col=1,lwd=2)
abline(v = mean(OR1), lwd = 4, col=4)
text(2.1,220,"1.43",col=4)
arrows(1.98,220,1.43,220,length=0.1,col=4,lwd=2)
text(2.1,220,"1.55",col=2)
arrows(1.98,220,1.55,220,length=0.1,col=2,lwd=2)
#########################################
## THE FUNCTIONS BELOW ARE USED TO OBTAIN THE
## BIAS-CORRECTED ESTIMATES
#########################################
conditional.like=function(mu,cee,z){
like=dnorm(z-mu)/(pnorm(mu-cee)+pnorm(-cee-mu))
return((abs(z)>cee)*like) }
conditional.like.z=function(mu,cee,z){
return(conditional.like(mu,cee,z)*mu)
}
#########################################
## THE FUNCTIONS BELOW ARE USED TO OBTAIN THE
## BIAS-CORRECTED CONFIDENCE INTERVAL
#########################################
ptruncnorm.lower=function(z,mu,cee,alpha){
A=pnorm(-cee+mu)+pnorm(-cee-mu)
term1=pnorm(z-mu)
term2=pnorm(-cee-mu)
term3=pnorm(-cee-mu)+pnorm(z-mu)-pnorm(cee-mu)
result=(1/A)*(term1*(z<= -cee)+term2*(abs(z)<cee)+term3*(z>=cee))
return(result-(alpha/2))
}
ptruncnorm.upper=function(z,mu,cee,alpha){
A=pnorm(-cee+mu)+pnorm(-cee-mu)
term1=pnorm(z-mu)
term2=pnorm(-cee-mu)
term3=pnorm(-cee-mu)+pnorm(z-mu)-pnorm(cee-mu)
result=(1/A)*(term1*(z<= -cee)+term2*(abs(z)<cee)+term3*(z>=cee))
return(result-(1-alpha/2))
}
find.lowerz=function(mu,z,cee,alpha){
lowerz=uniroot(ptruncnorm.lower,lower=-20,upper=20,mu=mu,cee=cee,alpha=alpha)$root
return(lowerz-z)
}
find.upperz=function(mu,z,cee,alpha){
upperz=uniroot(ptruncnorm.upper,lower=-20,upper=20,mu=mu,cee=cee,alpha=alpha)$root
return(upperz-z)
}
getCI=function(z,cee,alpha){
uppermu=uniroot(find.lowerz,interval=c(-15,15),cee=cee,z=z,alpha=alpha)$root
lowermu=uniroot(find.upperz,interval=c(-15,15),cee=cee,z=z,alpha=alpha)$root
out=list(lowermu,uppermu)
names(out)=c("lowermu","uppermu")
return(out)
}
source("GW-functions.R")# YOU READ IN THE FUNCTIONS FOR OUR METHOD
cee=abs(qnorm(.5*0.1)) # Bonferroni threshold for achieving study-wide significance = 0.1
p.value=(simAll %>% select("p.value"))
p.value1 <- as.numeric(unlist(p.value))
# we use "cee" so R does not get confused with the function 'c'
betahat=log(OR) # Reported OR
z=sign(betahat)*abs(qnorm(0.5*p.value1)) # Reported p-value = 5.7e-4, which we convert to a z-value
###################################################
# THE PROPOSED APPROACH #
###################################################
se=betahat/z # standard error of betahat
mutilde1=optimize(f=conditional.like,c(-20,20),maximum=T,z=z,cee=cee)$maximum
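For what it's worth, optimize() requires its objective to return a single finite value at each trial point; since z here is a length-1000 vector, conditional.like() returns a length-1000 vector and optimize() stops with "invalid function value". A minimal sketch of one way around this, maximizing once per simulated z (assuming the objects defined above):
# optimize() needs a scalar objective, so loop over the z values
# (for |z| <= cee the objective is flat at 0, so that maximum is arbitrary)
mutilde1 <- sapply(z, function(zi)
optimize(f = conditional.like, interval = c(-20, 20),
maximum = TRUE, z = zi, cee = cee)$maximum)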

How to fix this error: recipes fail to load in caret::train?

I have a problem when loading a recipe into caret::train. There is something wrong with the NA imputation, but I don't know how to fix it. If I remove the cross-validation, everything works fine.
Thanks in advance.
library(recipes)
library(rsample)
library(caret)
data(airquality)
set.seed(33) # for reproducibility
air_split <- initial_split(airquality, prop = 0.7)
air_train <- training(air_split)
air_test <- testing(air_split)
# Feature engineering - final recipe
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_knnimpute(all_numeric(), neighbors = 6) %>%
step_log(Ozone, Wind) %>%
step_other(Day, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal(), -all_outcomes())
# Validation
cv5 <- trainControl( method = "repeatedcv",
number = 5,
repeats = 5, allowParallel = TRUE)
# Fit an lm model
set.seed(12)
lm_fit <- train(
air_recipe,
data = air_train,
method = "lm",
trControl = cv5,
metric = "RMSE")
Error message
Error in quantile.default(y, probs = seq(0, 1, length = cuts)) : missing values and NaN's not allowed if 'na.rm' is FALSE
R.version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 6.1
year 2019
month 07
day 05
svn rev 76782
language R
version.string R version 3.6.1 (2019-07-05)
nickname Action of the Toes
Looks like the resamples are made before the recipe is applied.
So you could prep and juice the recipe and use the formula method (note that the recipe below also adds a step_naomit(), so rows still containing missing values are dropped before training):
library(recipes)
library(caret)
library(rsample)
data(airquality)
set.seed(33) # for reproducibility
air_split <- initial_split(airquality, prop = 0.7)
air_train <- training(air_split)
air_test <- testing(air_split)
# Feature engineering - final recipe
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_knnimpute(all_numeric(), neighbors = 6) %>%
step_log(Ozone, Wind) %>%
step_other(Day, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_naomit(all_outcomes(),all_predictors())
# Prep recipe
air_prep <- prep(air_recipe, retain = TRUE)
# Juice the prepared recipe
air_train <- juice(air_prep)
# Validation
cv5 <- trainControl( method = "repeatedcv",
number = 5,
repeats = 5, allowParallel = TRUE)
# Fit an lm model
set.seed(12)
lm_fit <- train(
Ozone ~ .,
data = air_train,
method = "lm",
trControl = cv5,
metric = "RMSE")
lm_fit
#> Linear Regression
#>
#> 108 samples
#> 5 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times)
#> Summary of sample sizes: 86, 88, 86, 86, 86, 86, ...
#> Resampling results:
#>
#> RMSE Rsquared MAE
#> 0.5091496 0.6568485 0.3793589
#>
#> Tuning parameter 'intercept' was held constant at a value of TRUE
Alternatively, you could use {parsnip} and {tune} to keep everything in the tidymodels idiom:
library(recipes)
library(rsample)
library(parsnip)
library(tune)
library(yardstick)
data(airquality)
set.seed(33) # for reproducibility
air_split <- initial_split(airquality, prop = 0.7)
air_train <- training(air_split)
air_test <- testing(air_split)
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_knnimpute(all_numeric(), neighbors = 6) %>%
step_log(Ozone, Wind) %>%
step_other(Day, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_naomit(all_outcomes(),all_predictors())
air_cv <- vfold_cv(air_train, v = 5, repeats = 5)
lm_mod <- linear_reg() %>% set_engine("lm")
lm_fits <- fit_resamples(air_recipe, lm_mod, air_cv)
show_best(lm_fits, metric = "rmse", maximize = FALSE)
#> # A tibble: 1 x 5
#> .metric .estimator mean n std_err
#> <chr> <chr> <dbl> <int> <dbl>
#> 1 rmse standard 0.526 25 0.0256
Created on 2020-04-05 by the reprex package (v0.3.0)
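As a side note for the resampling approach, collect_metrics() also works on the fit_resamples() result if you want both RMSE and R-squared summarized across the 25 resamples:
collect_metrics(lm_fits)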

How to obtain confidence intervals for non-linear mixed model (logistic growth curve) with an interaction

I'd like to calculate confidence intervals, using the delta method or bootstrapping, for a non-linear mixed model (logistic growth curve) with a two-way interaction, but I am unsure how to code this. This post is similar to https://stats.stackexchange.com/questions/231074/confidence-intervals-on-predictions-for-a-non-linear-mixed-model-nlme; however, I do not know how to modify it to handle the interaction term. I'm using the ChickWeight data frame, but I've modified it so that the curves are more sigmoidal and look similar to a data frame I'm using.
So far, I've tried bootstrapping with no success.
library(tidyverse)
library(broom)
library(nlme)
# Modify df so that there are more subjects (Chicks) per treatment (Diet)
Chicks.df <- ChickWeight
Chicks.df$Chick <- factor(Chicks.df$Chick, ordered=F)
Chicks.df <- Chicks.df %>% mutate(Diet = ifelse(Diet == 3 | Diet == 4, 2, 1),
Diet = as.factor(Diet))
# Create df with an additional time point (25th) to make curves more sigmoidal
t.25_diet1 <- rnorm(30, mean=178, sd=1) # new observations for Diet1
t.25_diet2 <- rnorm(20, mean=215, sd=1) # new observations for Diet2
weight <- c(t.25_diet1, t.25_diet2) # bind vectors together
Diet <- c(rep(1,30), rep(2,20)) # Create remaining variables for df
Chick <- c(seq(1:30), seq(31, 50))
Time <- rep(25, 50)
# Bind variables together to make new df, and then combine with original df (i.e. Chicks.df)
newdata <- data.frame(cbind(weight, Diet, Chick, Time))
newdata$Diet <- as.factor(newdata$Diet)
newdata$Chick <- as.factor(newdata$Chick)
Chicks.df <- bind_rows(Chicks.df, data.frame(newdata))
# Using nested ifelse, assign half of the individuals in each diet to new treatment
Chicks.df <- Chicks.df %>%
mutate(Drug = ifelse(Chick %in% c(1:15) & Diet == 1, 0,
ifelse(Chick %in% c(16:30) & Diet == 1, 1,
ifelse(Chick %in% c(31:40) & Diet == 2, 0, 1))))
# Check Curves
Chicks.df$Drug <- as.factor(Chicks.df$Drug)
ggplot(Chicks.df, aes(Time, weight, col=Diet, linetype=Drug)) + geom_smooth()
# Get self-starting values (use broom pkg)
Chicks.df %>%
group_by(Diet, Drug) %>%
do(fit = nls(weight ~ SSlogis(Time, Asym, xmid, scal), data = .)) %>%
tidy(fit) %>%
select(Diet, Drug, term, estimate) %>%
spread(term, estimate)
# Run model
mod <-nlme(weight ~ SSlogis(Time, Asym, xmid, scal),
fixed=list(Asym + xmid + scal ~ Diet*Drug),
random = Asym ~ 1 | Chick,
start=list(fixed=c(Asym=c(209, 215, 209, 249),
xmid=c(10.2, 10.5, 10.7, 9.7),
scal=c(6.3, 6.3, 5.1, 5.5))),
data=Chicks.df)
summary(mod)
# Build prediction df and attempt bootstrapping
pframe <- data.frame(Time = Chicks.df$Time,
Diet = Chicks.df$Diet,
Drug = Chicks.df$Drug,
Chick = Chicks.df$Chick[1])
pframe$weight <- predict(mod, pframe, re.form=~0)
myfun <- function(data, x){
fBoot <- nlme(weight ~ SSlogis(Time, Asym, xmid, scal),
fixed=list(Asym + xmid + scal ~ Diet*Drug),
random = Asym ~ 1 | Chick,
start=list(fixed=c(Asym=c(209, 215, 209, 249),
xmid=c(10.2, 10.5, 10.7, 9.7),
scal=c(6.3, 6.3, 5.1, 5.5))),
data=Chicks.df)
return((predict(fBoot, newdata=pframe)))
}
myboot <- boot::boot(pframe, myfun, R=100)
myboot # no values predicted
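One likely culprit, for reference: boot::boot() calls the statistic as statistic(data, indices), but myfun() ignores both arguments and refits the unchanged Chicks.df, so every replicate is identical; also, predict.nlme() uses level = 0 for population-level predictions (re.form is lme4 syntax and is silently ignored here). A rough sketch of a by-chick bootstrap written under those assumptions, without boot::boot():
# Resample whole chicks with replacement, refit the model, and keep the
# population-level (level = 0) predictions for pframe each time
set.seed(101)
chick_ids <- unique(Chicks.df$Chick)
boot_preds <- replicate(100, {
sampled <- sample(chick_ids, replace = TRUE)
boot_dat <- do.call(rbind, lapply(seq_along(sampled), function(i) {
d <- Chicks.df[Chicks.df$Chick == sampled[i], ]
d$Chick <- factor(i) # fresh id so repeated chicks stay distinct
d
}))
# update() re-evaluates the original nlme() call with the new data;
# if it misbehaves, repeat the full nlme() call with data = boot_dat
fit <- try(update(mod, data = boot_dat), silent = TRUE)
if (inherits(fit, "try-error")) rep(NA_real_, nrow(pframe))
else predict(fit, newdata = pframe, level = 0)
})
# Percentile confidence limits for each row of pframe
ci <- apply(boot_preds, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)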
