I was trying to analyse the example provided by the caret package for confusionMatrix, i.e.
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
                levels = rev(lvs))
pred <- factor(
  c(rep(lvs, times = c(54, 32)),
    rep(lvs, times = c(27, 231))),
  levels = rev(lvs))
xtab <- table(pred, truth)
confusionMatrix(xtab)
However, to be honest, I don't quite understand it. Let's pick, for example, this very simple model:
set.seed(42)
x <- sample(0:1, 100, replace = TRUE)
y <- rnorm(100)
glm(x ~ y, family = binomial('logit'))
And I don't know how I can analogously build a confusion matrix for this glm model. Do you know how it can be done?
EDIT
I tried to run an example provided in the comments:
train <- data.frame(LoanStatus_B = as.numeric(rnorm(100) > 0.5),
                    b = rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(LoanStatus_B ~ ., data=train, family=binomial(link="logit"))
library(caret)
# Use your model to make predictions, in this example newdata = training set, but replace with your test set
pdata <- predict(logitMod, newdata = train, type = "response")
confusionMatrix(data = as.numeric(pdata>0.5), reference = train$LoanStatus_B)
but I get the error: `data` and `reference` should be factors with the same levels
Am I doing something incorrectly?
You just need to turn them into factors:
confusionMatrix(data = as.factor(as.numeric(pdata>0.5)),
reference = as.factor(train$LoanStatus_B))
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 61 31
# 1 2 6
#
# Accuracy : 0.67
# 95% CI : (0.5688, 0.7608)
# No Information Rate : 0.63
# P-Value [Acc > NIR] : 0.2357
#
# Kappa : 0.1556
#
# Mcnemar's Test P-Value : 1.093e-06
#
# Sensitivity : 0.9683
# Specificity : 0.1622
# Pos Pred Value : 0.6630
# Neg Pred Value : 0.7500
# Prevalence : 0.6300
# Detection Rate : 0.6100
# Detection Prevalence : 0.9200
# Balanced Accuracy : 0.5652
#
# 'Positive' Class : 0
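Equivalently, since confusionMatrix also accepts a contingency table (as in the caret example at the top of the question), you can skip the factor conversion entirely, assuming both classes occur among the predictions so that the table comes out square:

confusionMatrix(table(as.numeric(pdata > 0.5), train$LoanStatus_B))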
Related
I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-classification problem using a random forest classifier. I could only find chunks of code here and there, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(0.75 * nrow(ITA)), replace = FALSE)
train <- ITA[sample, ]
test <- ITA[-sample, ]
set.seed(42)
rftrain <- randomForest(y ~ ., data = train, ntree = 500, importance = TRUE)
p_function_3 <- function(object, newdata)
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[, 3]
shap_values_G <- fastshap::explain(rftrain,
                                   X = train,
                                   pred_wrapper = p_function_3,
                                   nsim = 50,
                                   newdata = train[which(y == 3), ])
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
  predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
  fit,
  X = train,              # reference data
  feature_names = xvars,
  pred_wrapper = pred_3,
  nsim = 50,
  newdata = train[train$Species == "virginica", ]  # explanations are calculated for these rows
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
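The question also asks about plots. A minimal sketch, assuming you additionally have the shapviz package installed (it is not used in the thread itself), that turns the SHAP matrix into standard summary plots:

library(shapviz)

# Wrap the SHAP matrix and the rows it explains in a shapviz object
sv <- shapviz(as.matrix(shap_values_3),
              X = train[train$Species == "virginica", xvars])

sv_importance(sv)                     # bar plot of mean(|SHAP|) per feature
sv_importance(sv, kind = "beeswarm")  # beeswarm summary plot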
I would like to use nls to fit a global parameter and group-specific parameters. The closest I have found to a minimal reproducible example is below (found here: https://stat.ethz.ch/pipermail/r-help/2015-September/432020.html)
# Generate some data
d <- transform(data.frame(x = seq(0, 1, len = 17),
                          group = rep(c("A", "B", "B", "C"), len = 17)),
               y = round(1/(1.4 + x^ifelse(group == "A", 2.3,
                                           ifelse(group == "B", 3.1, 3.5))), 2))
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=rep(3,length(levels(d$group)))))
This gives me an error:
Error in numericDeriv(form[[3L]], names(ind), env, central = nDcentral) :
Missing value or an infinity produced when evaluating the model
I have not been able to figure out whether the error comes from bad guesses for the starting values or from the way this code handles the group-specific parameters. It seems the line with p=rep(3,length(levels(d$group))) is meant to generate c(3,3,3), but replacing this part of the code does not remove the problem (the same error as above is obtained):
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3, 3, 3)))
Switching to nlsLM gives a different error, which leads me to believe I am having an issue with the group-specific parameters:
# Generate some data
library(minpack.lm)
d <- transform(data.frame(x = seq(0, 1, len = 17),
                          group = rep(c("A", "B", "B", "C"), len = 17)),
               y = round(1/(1.4 + x^ifelse(group == "A", 2.3,
                                           ifelse(group == "B", 3.1, 3.5))), 2))
#Fit to model using nlsLM
nlsLM(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Any ideas?
I think you can do this much more easily with nlme::gnls:
fit2 <- nlme::gnls(y ~ 1/(b + x^p),
                   params = list(p ~ group - 1, b ~ 1),
                   data = d,
                   start = list(b = 1, p = rep(3, 3)))
Results:
Generalized nonlinear least squares fit
Model: y ~ 1/(b + x^p)
Data: d
Log-likelihood: 62.05887
Coefficients:
p.groupA p.groupB p.groupC b
2.262383 2.895903 3.475324 1.407561
Degrees of freedom: 17 total; 13 residual
Residual standard error: 0.007188101
The params argument allows you to specify fixed-effect submodels for each nonlinear parameter. Using p ~ group - 1 parameterizes the model with a separate estimate for each group, rather than fitting a baseline (intercept) value for the first group and the differences between successive groups. (In R's formula language, -1 or +0 means "fit the model without an intercept", i.e. set the intercept to 0, which in this case corresponds to fitting all three groups separately.)
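To see the difference between the two parameterizations concretely, you can compare the model matrices (a quick illustration using the d data frame from above):

head(model.matrix(~ group, data = d))      # intercept + offsets for B and C
head(model.matrix(~ group - 1, data = d))  # one indicator column per group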
I'm quite surprised that gnls and nls don't give identical results (although both give reasonable results); I'd like to dig in further ...
Parameter estimates (code below):
  term  nls gnls
1    b 1.41 1.40
2   pA 2.28 2.28
3   pB 3.19 3.14
4   pC 3.60 3.51
par(las = 1, bty = "l")
plot(y~x, data = d, col = d$group, pch = 16)
xvec <- seq(0, 1, length = 21)
f <- function(x) factor(x, levels = c("A","B","C"))
## fit1 is nls() fit
ll <- function(g, c = 1) {
  lines(xvec, predict(fit1, newdata = data.frame(group = f(g), x = xvec)), col = c)
}
Map(ll, LETTERS[1:3], 1:3)
d2 <- expand.grid(x = xvec, group = f(c("A","B","C")))
pp <- predict(fit2, newdata = d2)
ll2 <- function(g, c = 1) {
  lines(xvec, pp[d2$group == g], lty = 2, col = c)
}
Map(ll2, LETTERS[1:3], 1:3)
legend("bottomleft", lty = 1:2, col = 1, legend = c("nls", "gnls"))
library(tidyverse)
library(broom)
library(broom.mixed)
(purrr::map_dfr(list(nls=fit1, gnls=fit2), tidy, .id = "pkg")
%>% select(pkg, term, estimate)
%>% group_by(pkg)
## force common parameter names
%>% mutate(across(term, ~ c("b", paste0("p", LETTERS[1:3]))))
%>% pivot_wider(names_from = pkg, values_from = estimate)
)
I was able to get this to work by switching the class of group from character to factor. Note the addition of factor() when generating the dataset.
> d <- transform(data.frame(
+ x=seq(0,1,len=17),
+ group=rep(factor(c("A","B","B","C")),len=17)),
+ y=round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2)
+ )
> str(d)
'data.frame': 17 obs. of 3 variables:
$ x : num 0 0.0625 0.125 0.1875 0.25 ...
$ group: Factor w/ 3 levels "A","B","C": 1 2 2 3 1 2 2 3 1 2 ...
$ y : num 0.71 0.71 0.71 0.71 0.69 0.7 0.69 0.69 0.62 0.64 ...
> nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Nonlinear regression model
model: y ~ 1/(b + x^p[group])
data: d
b p1 p2 p3
1.406 2.276 3.186 3.601
residual sum-of-squares: 9.537e-05
Number of iterations to convergence: 5
Achieved convergence tolerance: 4.536e-06
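One small readability note: in p[group], the parameters are indexed by the factor's internal integer codes, so p1, p2, p3 correspond to the levels A, B, C in order. You can rename the fitted coefficients to make that explicit (this also assigns the fit1 object referenced in the plotting code above):

fit1 <- nls(y ~ 1/(b + x^p[group]), data = d, start = list(b = 1, p = c(3, 3, 3)))
# Rename p1..p3 after the factor levels
setNames(coef(fit1), c("b", paste0("p", levels(d$group))))
#      b     pA     pB     pC
#  1.406  2.276  3.186  3.601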
I tried computing a confusion matrix for my glm model, but I keep getting:
Error: data and reference should be factors with the same levels.
Below is my model:
model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
confusionMatrix(table(predict(model3, newdata=test_set, type="response")) >= 0.5,
train_set$winner == 1)
The winner variable contains team1 and team2.
srs.1 and srs.2 are numerical values.
What is my problem here?
I suppose your winner label is binary (0/1). So let's use the example below:
library(caret)
set.seed(111)
data = data.frame(
  srs.1 = rnorm(200),
  srs.2 = rnorm(200)
)
data$winner = ifelse(data$srs.1 * data$srs.2 > 0, 1, 0)
idx = sample(nrow(data), 150)
train_set = data[idx, ]
test_set = data[-idx, ]
model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
As you did, we predict and threshold: if the probability is > 0.5 the prediction is 1, else 0. You got the table() part about right; just note that the predictions and the reference must both come from the same set, whether that is test_set or train_set:
pred = as.numeric(predict(model3, newdata=test_set, type="response")>0.5)
ref = test_set$winner
confusionMatrix(table(pred,ref))
Confusion Matrix and Statistics
ref
pred 0 1
0 12 5
1 19 14
Accuracy : 0.52
95% CI : (0.3742, 0.6634)
No Information Rate : 0.62
P-Value [Acc > NIR] : 0.943973
Kappa : 0.1085
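The same pattern works on the training set; a quick sketch:

# Same idea on the training data: prediction and reference both from train_set
pred_tr = as.numeric(predict(model3, newdata = train_set, type = "response") > 0.5)
confusionMatrix(table(pred_tr, train_set$winner))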
I have a problem building a confusion matrix using the decision tree method. The data set is extremely imbalanced, and the third label ("C") makes up only about 1% of the data.
I have no idea why the predictions for C are all zero (0).
# load the package
install.packages('rpart')
library(rpart)
library(caret)
# load data
data<-read.csv("Drisk0122_01.csv", header=TRUE)
data<-data[ , c(3:43)]
data$Class<-factor(data$Class, levels = c(1,2, 3), labels=c("A", "B", "C"))
set.seed(42)
training.samples <- createDataPartition(y=data$Class, p = 0.7, list = FALSE)
training.samples
train <- data[training.samples, ]
test <- data[-training.samples, ]
############tree
install.packages("tree")
library(tree)
treemod<-tree(Class~. , data=train)
plot(treemod)
text(treemod)
cv.trees<-cv.tree(treemod, FUN=prune.misclass ) # for classification decision tree
plot(cv.trees)
prune.trees <- prune.misclass(treemod, best=4) # for regression decision tree, use prune.tree function
plot(prune.trees)
text(prune.trees, pretty=0)
library(e1071)
treepred <- predict(prune.trees, test, type='class')
confusionMatrix(treepred, test$Class)
The results are as follows:
confusionMatrix(treepred, test$Class)
Confusion Matrix and Statistics
Reference
Prediction A B C
A 2324 360 28
B 211 427 3
C 0 0 0
Overall Statistics
Accuracy : 0.8205
95% CI : (0.807, 0.8333)
No Information Rate : 0.756
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4775
Mcnemar's Test P-Value : 4.526e-15
Statistics by Class:
Class: A Class: B Class: C
Sensitivity 0.9168 0.5426 0.000000
Specificity 0.5257 0.9166 1.000000
Pos Pred Value 0.8569 0.6661 NaN
Neg Pred Value 0.6708 0.8673 0.990755
Prevalence 0.7560 0.2347 0.009245
Detection Rate 0.6931 0.1273 0.000000
Detection Prevalence 0.8088 0.1912 0.000000
Balanced Accuracy 0.7212 0.7296 0.500000
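The all-zero row for C is the classic symptom of severe class imbalance: with only ~1% of cases in C, the tree can minimize overall misclassification error by never predicting C. A minimal sketch of one common remedy, assuming the train/test split from above (this is not from the original thread): rebalance the training data with caret::downSample before refitting.

library(caret)
# Down-sample the majority classes so every class has as many rows as C,
# then refit the tree and re-evaluate on the untouched test set
set.seed(42)
train_bal <- downSample(x = train[, setdiff(names(train), "Class")],
                        y = train$Class, yname = "Class")
table(train_bal$Class)  # now balanced
treemod_bal <- tree(Class ~ ., data = train_bal)
treepred_bal <- predict(treemod_bal, test, type = "class")
confusionMatrix(treepred_bal, test$Class)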
I have tried all the solutions suggested on Stack Overflow for "data and reference should be factors with the same levels".
set.seed(10)
indices = sample.split(consumers$label, SplitRatio = 0.75)
train = consumers[indices,]
test = consumers[!(indices),]
##Build a logistic regression model
is.factor(train$label)
contrasts(train$label)
lr_model <- data.frame(label = as.numeric(rnorm(100)>0.5), b= rnorm(100), c = rnorm(100), d = rnorm(100))
logitMod <- glm(label ~ ., data=train, family=binomial(link="logit"))
pdata <- predict(logitMod, newdata = train, type = "response")
confusionMatrix(data = as.numeric(pdata>0.5), reference = train$label)
I still get "Error: data and reference should be factors with the same levels."
My dataset has three columns: ration, time and label (where label is male or female).
Going on a hunch here that you're using caret::confusionMatrix, so here goes. You're passing an integer vector as data and a factor as reference. Notice that the documentation calls for a factor of predicted classes or a table.
> library(caret)
>
> ref <- factor(sample(0:1, size = 100, replace = TRUE))
> data1 <- sample(0:1, size = 100, replace = TRUE)
> data2 <- factor(sample(0:1, size = 100, replace = TRUE))
# this is your case
> confusionMatrix(data = data1, reference = ref)
Error: `data` and `reference` should be factors with the same levels.
# pass in a factor (try a table for giggles)
> confusionMatrix(data = data2, reference = ref)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 24 19
1 33 24
Accuracy : 0.48
95% CI : (0.379, 0.5822)
No Information Rate : 0.57
P-Value [Acc > NIR] : 0.97198
Kappa : -0.02
Mcnemar's Test P-Value : 0.07142
Sensitivity : 0.4211
Specificity : 0.5581
Pos Pred Value : 0.5581
Neg Pred Value : 0.4211
Prevalence : 0.5700
Detection Rate : 0.2400
Detection Prevalence : 0.4300
Balanced Accuracy : 0.4896
'Positive' Class : 0
confusionMatrix(data = as.factor(as.numeric(pdata>0.5)), reference = train$label)
This should work.
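One caveat, given that label here is male/female rather than 0/1: as.factor(as.numeric(pdata > 0.5)) yields levels "0"/"1", which still won't match a male/female factor. A sketch, assuming label is a two-level factor (for a binomial glm, predict(..., type = "response") returns the probability of the second factor level):

# Assumes train$label is a factor with two levels, e.g. female/male;
# probabilities above 0.5 are mapped back to the second level
lv <- levels(train$label)
pred_class <- factor(ifelse(pdata > 0.5, lv[2], lv[1]), levels = lv)
confusionMatrix(data = pred_class, reference = train$label)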