Getting sensitivity after fitting a decision tree with caret - R

I am trying to obtain the sensitivity, specificity, etc. after fitting a decision tree using caret (or even caret::confusionMatrix). I am using code along these lines:
fit <- rpart(
  Bla ~ ...
  , method = "class"
  , data = OrginalData
)
preds <- predict(fit, SomeData)
caret::sensitivity(factor(preds[, 2]), factor(OrginalData$Bla))
str(preds)
Unfortunately, I get NA as the sensitivity. str(preds) returns:
num [1:40, 1:2] 0.926 0.926 0.926 0.926 0.926 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:40] "1" "2" "3" "4" ...
..$ : chr [1:2] "NO" "YESR"
Any ideas?

The problem is that you are providing a numeric vector of probabilities as predictions, while caret::sensitivity expects a vector of classes. Here is working code using the Sonar data from mlbench:
library(mlbench)
library(rpart)
library(caret)
data(Sonar)
split data:
inds <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
train <- Sonar[inds,]
test <- Sonar[-inds,]
create model:
fit <- rpart(Class ~ .,
             method = "class",
             data = train)
pred <- predict(fit, test)
convert probability to prediction:
caret::sensitivity(as.factor(ifelse(pred[,2] > 0.5, "R", "M")), test$Class)
#output
[1] 0.6969697
Do note that 0.5 might not be an appropriate threshold; it should be tuned according to the desired model behavior. If there is no preference towards a certain type of error, balanced accuracy is a solid metric for tuning the probability threshold, as in the sketch below.
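This is a minimal sketch (not from the original answer) of such a grid search: it picks the threshold maximising balanced accuracy, tuned on the test set for brevity (in practice use a separate validation set or resampling):
# candidate thresholds for converting the class-2 ("R") probability into a label
thresholds <- seq(0.1, 0.9, by = 0.05)
bal_acc <- sapply(thresholds, function(t) {
  cls <- factor(ifelse(pred[, 2] > t, "R", "M"), levels = levels(test$Class))
  # balanced accuracy = mean of sensitivity and specificity
  mean(c(caret::sensitivity(cls, test$Class),
         caret::specificity(cls, test$Class)))
})
thresholds[which.max(bal_acc)]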
Alternatively, specify type = "class" when predicting with rpart:
pred <- predict(fit, test, type = "class")
caret::sensitivity(pred, test$Class)
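Since the question also mentions caret::confusionMatrix, note that it reports sensitivity, specificity and several related statistics in a single call; a small addition using the class predictions from above:
cm <- caret::confusionMatrix(pred, test$Class)
cm$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]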

Related

Looping TukeyHSD function through multiple columns of a dataframe

I'm trying to loop TukeyHSD tests through each column of a dataframe and compare treatment levels. Here's some mock data that's a simplified version of the data I have (my data has ~350 columns):
df1 <- data.frame(cmpd1 = c(500,436,1,1,1,1),
                  cmpd2 = c(1,1,1,1,1,253),
                  cmpd3 = c(1,1,300,57,150,260),
                  treatment = c("W","W","A","A","D","D"))
I've followed the suggestions in this post successfully and have created a loop that runs ANOVAs for each column, outputting only columns that had a p-value <0.07 for the treatment comparisons:
# specific compound differences
for (i in 1:3) {
  column <- names(df1[i])
  anova <- broom::tidy(aov(df1[, i] ~ treatment, data = df1))
  # only want aov with P < 0.07 printed
  if (anova$p.value[1] < 0.07) {
    print(column)
    print(anova)
  }
}
However, I'd like to run TukeyHSD tests on all columns in a similar way, only outputting the tukey results that have a p-value <0.07 for any given treatment comparison. I tried something like this but it doesn't work, giving the error "Error in if (tukey[["p adj"]] < 0.07) { : argument is of length zero":
for (i in 1:3) {
  column <- names(df1[i])
  anova <- aov(df1[, i] ~ treatment, data = df1)
  tukey <- TukeyHSD(anova)
  # only want tukey with P < 0.07 printed
  if (tukey[["p adj"]] < 0.07) {
    print(column)
    print(tukey)
  }
}
I can't figure out the right way to have it only output tukey tests that contain a p-value <0.07, so my ideal output would be something like this (this contains made-up values):
$cmpd1
            diff       lwr      upr     p adj
D-A 2.728484e-12 -29169.59 29169.59 1.0000000
W-A 3.637979e-12 -32278.10 32278.10 0.0001
W-D 1.484573e+04 -13620.88 43312.34 0.056
The output of TukeyHSD is a list, as is evident from its structure:
str(TukeyHSD(aov(df1[,1] ~ treatment, data = df1)))
List of 1
$ treatment: num [1:3, 1:4] -2.84e-14 4.67e+02 4.67e+02 -1.09e+02 3.58e+02 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "D-A" "W-A" "W-D"
.. ..$ : chr [1:4] "diff" "lwr" "upr" "p adj"
- attr(*, "class")= chr [1:2] "TukeyHSD" "multicomp"
- attr(*, "orig.call")= language aov(formula = df1[, 1] ~ treatment, data = df1)
- attr(*, "conf.level")= num 0.95
- attr(*, "ordered")= logi FALSE
We can extract the list element 'treatment', which is a matrix, so tukey[["p adj"]] returns nothing (hence the "argument is of length zero" error). Instead, index the matrix with [ and the column name, using the , to separate the row/column indices, and wrap the comparison in any(), since there are 3 values of 'p adj' while if() expects a single TRUE/FALSE logical input:
for (i in 1:3) {
  column <- names(df1[i])
  anova <- aov(df1[, i] ~ treatment, data = df1)
  tukey <- TukeyHSD(anova)
  # only want tukey with P < 0.07 printed
  if (any(tukey$treatment[, "p adj"] < 0.07)) {
    print(column)
    print(setNames(tukey, column))
  }
}
#output
[1] "cmpd1"
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = df1[, i] ~ treatment, data = df1)

$cmpd1
             diff       lwr      upr    p adj
D-A -2.842171e-14 -109.1823 109.1823 1.000000
W-A  4.670000e+02  357.8177 576.1823 0.000839
W-D  4.670000e+02  357.8177 576.1823 0.000839
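Since broom::tidy was already used for the aov models above, note that broom also has a tidy method for TukeyHSD objects; assuming a reasonably recent broom version (where the adjusted p-value column is named adj.p.value), the filtering can be sketched like this:
library(broom)
for (i in 1:3) {
  column <- names(df1[i])
  tk <- tidy(TukeyHSD(aov(df1[, i] ~ treatment, data = df1)))
  # keep only comparisons with an adjusted p-value below 0.07
  tk_sig <- tk[tk$adj.p.value < 0.07, ]
  if (nrow(tk_sig) > 0) {
    print(column)
    print(tk_sig)
  }
}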

Linear Regression Model based on Log-Cosh Loss Function in R

I have read about loss functions theoretically and also how to build regression models based on them in R.
I can apply all of the regression models based on different loss functions in R except for the log-cosh loss function.
For example, I would like to build a linear regression model on 5-fold subsets of the data, and then extract the coefficients and calculate the individual and aggregated variances, as follows.
library(caret)     # for createFolds
library(magrittr)  # for %>%

data <- read.csv("train.csv")  # "critical_temp" is the dependent variable
data_nom_df <- as.data.frame(scale(data))  # normalization

# Cross Validation
set.seed(12345)
k <- 5
folds <- createFolds(data_nom_df$critical_temp, k = k, list = TRUE, returnTrain = TRUE)
## Ordinary Least Square regression
# block A
lm <- list()
for (i in 1:k) {
  lm[[i]] <- lm(critical_temp ~ .,
                data = data_nom_df[folds[[i]], ])
}

# block B
lm_coef <- list()
lm_coef_var <- list()
for (j in 1:(lm[[1]]$coefficients %>% length())) {
  for (i in 1:k) {
    lm_coef[[i]] <- lm[[i]]$coefficients[j]
    lm_coef_var[[j]] <- lm_coef %>% unlist() %>% var()
  }
}

# block C
lm_var <- unlist(lm_coef_var)
lm_df <- cbind(coefficients = lm[[1]]$coefficients %>% names() %>% as.data.frame(),
               variance = lm_var %>% as.data.frame())
colnames(lm_df) <- c("coefficients", "variance_lm")
lm_df

# block D
lm_var_sum <- sum(lm_var)
lm_var_sum
I do the same for the rest of the regression models. However, I cannot find any code or package for fitting a regression model based on the log-cosh loss function in R.
Could you please point me to any source that would help me solve this problem?
This can be done from first principles. Also note the existence of the logcosh function in the limma package, which could be used in place of log(cosh(.)) if you run into numeric difficulties (a sketch of a stable alternative follows the plot below).
# log-cosh loss as a function of the coefficients b = (intercept, slope)
f <- function(b) with(cars, sum(log(cosh(dist - b[1] - b[2] * speed))))
fm0 <- lm(dist ~ speed, cars)                # OLS fit, used as starting values
res <- optim(coef(fm0), f, method = "BFGS")  # minimize the log-cosh loss
str(res)
## List of 5
## $ par : Named num [1:2] -12.82 3.47
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
## $ value : num 532
## $ counts : Named int [1:2] 28 10
## ..- attr(*, "names")= chr [1:2] "function" "gradient"
## $ convergence: int 0
## $ message : NULL
Graphics
# the black line is the ordinary least squares regression line and
# the red line is the log cosh regression line
plot(cars)
abline(fm0)
yfit <- res$par[1] + res$par[2] * cars$speed
lines(cars$speed, yfit, col = "red")
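As an aside on the numeric difficulties mentioned above: cosh() overflows once its argument exceeds roughly 710, so log(cosh(.)) can return Inf for residuals on a large scale. A minimal sketch of a stable drop-in replacement (limma::logcosh is another option), using the identity log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2):
# numerically stable log(cosh(x)); avoids overflow of cosh() for large |x|
logcosh_stable <- function(x) abs(x) + log1p(exp(-2 * abs(x))) - log(2)
f2 <- function(b) with(cars, sum(logcosh_stable(dist - b[1] - b[2] * speed)))
res2 <- optim(coef(fm0), f2, method = "BFGS")
res2$par  # essentially the same estimates as res$par above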
ADDED
Note that the optimization can also be written like this, which may be useful if you have many independent variables (a sketch with an analytic gradient follows the output below).
fm0 <- lm(dist ~ speed, cars)
X <- model.matrix(fm0)  # design matrix: works for any number of predictors
f <- function(b) with(cars, sum(log(cosh(dist - X %*% b))))
res <- optim(coef(fm0), f, method = "BFGS")
res
giving:
$par
(Intercept)       speed
 -12.816190    3.469536

$value
[1] 531.5872

$counts
function gradient
      28       10

$convergence
[1] 0

$message
NULL
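A further sketch (not part of the original answer): with many independent variables it can help to supply the analytic gradient to optim, and for the log-cosh loss this has a simple closed form, -t(X) %*% tanh(y - X %*% b):
# gradient of sum(log(cosh(dist - X %*% b))) with respect to b
g <- function(b) -drop(t(X) %*% tanh(cars$dist - X %*% b))
res_grad <- optim(coef(fm0), f, gr = g, method = "BFGS")
res_grad$par  # should agree with res$par above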

Regression table with clustered standard errors in R jupyter notebook?

I'm using export_summs in R to make a regression table, but when I use coeftest to get clustered standard errors, the table no longer reports N or R^2 properly in those columns. The coefficients and standard errors look good, just missing those additional stats. (I'm used to outreg2 in Stata which is much simpler.)
I tried using tidy_override() as suggested in the last example here (https://hughjonesd.github.io/huxtable/huxreg.pdf), no change.
# Reproducible example
library(jtools)    # export_summs
library(lmtest)    # coeftest
library(sandwich)  # vcovCL

datareg <- NULL
datareg$y <- rnorm(1000)
datareg$x <- rnorm(1000)
datareg$cluster_var <- rnorm(1000)
datareg <- data.frame(datareg)

reg0 <- lm(y ~ x, data = datareg)
reg1 <- coeftest(lm(y ~ x, data = datareg),
                 vcovCL, cluster = datareg$cluster_var)
export_summs(reg0, reg1,
             model.names = c("Basic", "Cluster SE"))
This issues a warning, and in the resulting table the N and R2 entries for the coeftest column are missing.
This is a case where the error message is fairly clear: the broom package does not have a glance method for coeftest objects. This is not an accident; the nature of the coeftest object does not allow broom to calculate model summary statistics. It retains very little information about the original model:
> str(reg1)
'coeftest' num [1:2, 1:4] 0.0483 0.0153 0.0329 0.0341 1.4668 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "(Intercept)" "x"
..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
- attr(*, "method")= chr "t test of coefficients"
- attr(*, "df")= int 998
One option is to use the lm_robust function from the estimatr package. It returns objects with robust standard errors that are amenable to both glance and tidy:
reg2 <- estimatr::lm_robust(y ~ x, data = datareg)
export_summs(reg0, reg2,
             model.names = c("Basic", "Cluster SE"), number_format = NA)
──────────────────────────────────────────────────────────────────
Basic Cluster SE
────────────────────────────────────────────────────
(Intercept) 0.0482678107925753 0.0482678107925755
(0.032842483472098) (0.0329070612421128)
x 0.0152928320138191 0.015292832013819
(0.0333488383365212) (0.034094868727288)
────────────────────────────────────────────────────
N 1000 1000
R2 0.000210664993144995 0.000210665
──────────────────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, Basic, Cluster SE
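One caveat worth flagging (not in the original answer): called as above, lm_robust gives heteroskedasticity-robust rather than cluster-robust standard errors. To actually cluster on cluster_var, pass it through the clusters argument; a sketch with the same datareg as above:
# cluster-robust standard errors, clustered on cluster_var
reg2_cl <- estimatr::lm_robust(y ~ x, data = datareg,
                               clusters = cluster_var)
export_summs(reg0, reg2_cl,
             model.names = c("Basic", "Cluster SE"))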
Huxtable author here. This is how to do it with tidy_override:
library(generics)
library(huxtable)
library(jtools)
library(lmtest)
library(sandwich)
datareg <- NULL
datareg$y <- rnorm(1000)
datareg$x <- rnorm(1000)
datareg$cluster_var <- rnorm(1000)
datareg <- data.frame(datareg)
reg0 <- lm(y ~ x, data = datareg)
reg1 <- coeftest(reg0, vcovCL, cluster = datareg$cluster_var)
reg1 <- tidy_override(reg1, glance = list(nobs = 1000L, r.squared = 0.000),
                      extend = TRUE)  # extend = TRUE is important
export_summs(reg0, reg1, model.names = c("Basic", "Cluster SE"))
Which gives:
────────────────────────────────────────────────────
Basic Cluster SE
───────────────────────────────────
(Intercept) -0.01 -0.01
(0.03) (0.03)
x -0.05 -0.05
(0.03) (0.03)
───────────────────────────────────
N 1000 1000
R2 0.00 0.00
────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, Basic, Cluster SE
This was fairly tricky and I appreciate your difficulties... I have improved the error reporting in huxreg as a result!
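A small variation on the snippet above, if you prefer not to hard-code the summary statistics: since reg1 is computed from reg0, nobs and R-squared can be pulled from the original lm fit (a sketch under the same setup):
reg1 <- coeftest(reg0, vcovCL, cluster = datareg$cluster_var)
reg1 <- tidy_override(reg1,
                      glance = list(nobs = nobs(reg0),
                                    r.squared = summary(reg0)$r.squared),
                      extend = TRUE)
export_summs(reg0, reg1, model.names = c("Basic", "Cluster SE"))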

Extracting thresholds or cutoff levels used to calculate ROC in R (`Epi`)

There are solutions to extract these values in the packages pROC and ROCR, but I don't know how to do the same with the Epi package.
Here is the example I'm trying to work through:
library(pROC)
library(Epi)
data(aSAH)
ROC(form = outcome ~ s100b, data=aSAH, plot = "ROC", MX = T)
I tried the following:
rc <- ROC(form = outcome ~ s100b, data=aSAH, plot="sp" )
rc$lr$weights
str(rc$lr$weights)
Named num [1:113] 0.185 0.19 0.171 0.143 0.185 ...
- attr(*, "names")= chr [1:113] "29" "30" "31" "32" ..
which could be correct, because the documentation for the package links weights to the model matrix, but I am not sure, especially because the weights' names seem to start at 29.
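No definitive answer here, but for what it is worth: the Epi documentation describes a res component of the returned object holding sens, spec, pvp and pvn for every cutpoint (on the lr.eta scale when the form interface is used), which appears to be where the per-threshold values live. A tentative sketch under that assumption:
rc <- ROC(form = outcome ~ s100b, data = aSAH, plot = "ROC", MX = TRUE)
head(rc$res)  # sens, spec, pvp, pvn at each cutpoint (assumed structure)
# cutpoint maximising Youden's index (sens + spec - 1), i.e. the MX point
rc$res[which.max(rc$res$sens + rc$res$spec), ]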

Plotting profile likelihood curves in R

I am trying to figure out how to plot the profile likelihood curve of a GLM parameter, with the 95% profile likelihood confidence intervals (pLCIs) on the same plot. The example I have been trying with is below. The plots I am getting are not the likelihood curves that I was expecting. The y-axis of the plots is tau, and I would like that axis to be the likelihood, so that I have a curve that peaks at the parameter estimate. I am not sure where to find those likelihood values? I may just be misinterpreting the theory behind this. Thanks for any help you can give.
Max
clotting <- data.frame(
  u = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))
glm2 <- glm(lot2 ~ log(u), data = clotting, family = Gamma)
prof <- profile(glm2)
plot(prof)
Regenerate your example:
clotting <- data.frame(
  u = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))
glm2 <- glm(lot2 ~ log(u), data = clotting, family = Gamma)
The profile.glm function actually lives in the MASS package:
library(MASS)
prof<-profile(glm2)
In order to figure out what profile.glm and plot.profile are doing, see ?profile.glm and ?plot.profile. However, in order to dig into the profile object it may also be useful to examine the code of MASS:::profile.glm and MASS:::plot.profile ... basically, what these tell you is that profile is returning the signed square root of the difference between the deviance and the minimum deviance, scaled by the dispersion parameter. The reason that this is done is so that the profile for a perfectly quadratic profile will appear as a straight line (it's much easier to detect deviations from a straight line than from a parabola by eye).
The other thing that may be useful to know is how the profile is stored. Basically, it's a list of data frames (one for each parameter profiled), except that the individual data frames are a little bit weird (containing one vector component and one matrix component).
> str(prof)
List of 2
$ (Intercept):'data.frame': 12 obs. of 3 variables:
..$ tau : num [1:12] -3.557 -2.836 -2.12 -1.409 -0.702 ...
..$ par.vals: num [1:12, 1:2] -0.0286 -0.0276 -0.0267 -0.0258 -0.0248 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "(Intercept)" "log(u)"
..$ dev : num [1:12] 0.00622 0.00753 0.00883 0.01012 0.0114 ...
$ log(u) :'data.frame': 12 obs. of 2 variables:
..$ tau : num [1:12] -3.516 -2.811 -2.106 -1.403 -0.701 ...
..$ par.vals: num [1:12, 1:2] -0.0195 -0.0204 -0.0213 -0.0222 -0.023 ...
.. ..- attr(*, "dimnames")=List of 2
It also contains attributes summary and original.fit that you can use to recover the dispersion and minimum deviance:
disp <- attr(prof,"summary")$dispersion
mindev <- attr(prof,"original.fit")$deviance
Now reverse the transformation for parameter 1:
dev1 <- prof[[1]]$tau^2
dev2 <- dev1*disp+mindev
Plot:
plot(prof[[1]][,1],dev2,type="b")
(This is the plot of the deviance. You can multiply by 0.5 to get the negative log-likelihood, or -0.5 to get the log-likelihood ...)
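To connect this back to the 95% pLCIs from the question: the profile confidence limits from confint() correspond (approximately) to |tau| = qnorm(0.975), i.e. a horizontal cutoff at mindev + disp * qnorm(0.975)^2 on the deviance scale. A sketch, reusing the quantities computed above:
ivals <- prof[[1]]$par.vals[, "(Intercept)"]  # profiled intercept values
plot(ivals, dev2, type = "b", xlab = "(Intercept)", ylab = "deviance")
abline(h = mindev + disp * qnorm(0.975)^2, lty = 2)   # approximate 95% cutoff
abline(v = confint(glm2)["(Intercept)", ], lty = 3)   # confint() endpoints, for comparison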
edit: some more general functions to transform the profile into a useful format for lattice/ggplot plotting ...
tmpf <- function(x, n) {
  data.frame(par = n, tau = x$tau,
             deviance = x$tau^2 * disp + mindev,
             x$par.vals, check.names = FALSE)
}
pp <- do.call(rbind,mapply(tmpf,prof,names(prof),SIMPLIFY=FALSE))
library(reshape2)
pp2 <- melt(pp,id.var=1:3)
pp3 <- subset(pp2,par==variable,select=-variable)
Now plot it with lattice:
library(lattice)
xyplot(deviance ~ value | par, type = "b", data = pp3,
       scales = list(x = list(relation = "free")))
Or with ggplot2:
library(ggplot2)
ggplot(pp3, aes(value, deviance)) + geom_line() + geom_point() +
  facet_wrap(~par, scale = "free_x")
FYI, for fun, I took the above and whipped it together into a single function using purrr::imap_dfr as I couldn't find a package that implements the above.
get_profile_glm <- function(aglm){
  prof <- MASS:::profile.glm(aglm)
  disp <- attr(prof, "summary")$dispersion
  # note: depending on the MASS/R version, the profile column is named "z" or "tau"
  # (the str() output above shows "tau"); adjust .x$z below accordingly
  purrr::imap_dfr(prof, .f = ~data.frame(par = .y,
                                         deviance = .x$z^2 * disp + aglm$deviance,
                                         values = as.data.frame(.x$par.vals)[[.y]],
                                         stringsAsFactors = FALSE))
}
Works great!
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
ggplot(get_profile_glm(glm.D93), aes(x = values, y = deviance)) +
  geom_point() +
  geom_line() +
  facet_wrap(~par, scale = "free_x")
