I am using the class package in order to use the KNN algorithm. I am also using the ROCR package to calculate the AUC value.
knn_one<-knn(train, test, train$Digit, k=1)
To calculate the AUC value for another method, e.g. classification trees, I used these series of commands:
treeTrain_Pred<-predict(Tree_Train, test , type = "prob")[,2]
Pred<-prediction(treeTrain_Pred, test$Digit)
Perf<-performance(Pred, "auc")
Perf#y.values[[1]]
However, when I try
knn_one = predict(knn_one, test, type="prob")[,2]
I get the following error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "factor"
How can I fix this and obtain an AUC value for my KNN function?
There is no predict method for knn models, instead you train and receive predictions as part of a single call. Example on sonar data:
library(mlbench)
data(Sonar)
create data partition:
set.seed(1)
tr_ind <- sample(1:nrow(Sonar), 150)
train <- Sonar[tr_ind,]
test <- Sonar[-tr_ind,]
mod <- class::knn(cl = train$Class,
test = test[,1:60],
train = train[,1:60],
k = 5,
prob = TRUE)
Now the probability of the predictions are in:
attributes(mod)$prob
library(pROC)
roc(test$Class, attributes(mod)$prob)
#output
Call:
roc.default(response = test$Class, predictor = attributes(mod)$prob)
Data: attributes(mod)$prob in 30 controls (test$Class M) < 28 cases (test$Class R).
Area under the curve: 0.4667
plot(roc(test$Class, attributes(mod)$prob),
print.thres = T,
print.auc=T)
lets try with k = 4
mod <- class::knn(cl = train$Class,
test = test[,1:60],
train = train[,1:60],
k = 4,
prob = TRUE)
plot(roc(test$Class, attributes(mod)$prob),
print.thres = T,
print.auc = T,
print.auc.y = 0.2)
Related
I want to perform a logistic regression with the train() function from the caret package. My model looks something like that:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. For this I know how to receive the ROC and AUC. However I am also interested in receiveing the BRIER SCORE. The formula for the Brier Score is almost identical to the MSE.
The problem I am facing, is that the type argument in predict only allows "prob" (or "class" which I am not interested in) which gives the probability of one prediction beeing a ONE (e.g. 0.64) , and the complementing probability of beeing a ZERO (e.g. 0.37). For the Brier Score however, I need One probability estimate for each prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for receiving the Brier Score in the caret package. I am aware that with the package cv.glmnet the predict function allows the argument "response" which would solve my problem. However, for personal preferences I would like to stay with the caretpackage.
Thanks for the help!
If we go by the wiki definition of brier score:
The most common formulation of the Brier score is
where f_t is the probability that was forecast, o_t the actual outcome of the (0 or 1) and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data)
So we can see the probability of the class:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when you label your response "0:1". Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
ifelse(data$obs == lev[2], 1, 0)),
silent = TRUE)
brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
silent = TRUE)
rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
direction = "<", quiet = TRUE), silent = TRUE)
if (inherits(pr_auc, "try-error")) pr_auc <- NA
if (inherits(brscore, "try-error")) brscore <- NA
rocAUC <- if (inherits(rocObject, "try-error")) {
NA
} else {
rocObject$auc
}
tmp <- unlist(e1071::classAgreement(table(data$obs,
data$pred)))[c("diag", "kappa")]
out <- c(Acc = tmp[[1]],
Kappa = tmp[[2]],
AUCROC = rocAUC,
AUCPR = pr_auc,
Brier = brscore,
Precision = caret:::precision.default(data = data$pred,
reference = data$obs,
relevant = lev[2]),
Recall = caret:::recall.default(data = data$pred,
reference = data$obs,
relevant = lev[2]),
F = caret:::F_meas.default(data = data$pred, reference = data$obs,
relevant = lev[2]))
out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
I encountered this error message while I used prediction after training.
Anyone knows where I went wrong?
# Create train and test tasks
trainTask <- makeClassifTask(data = data.train, target = "SetAppt", positive = "1")
testTask <- makeClassifTask(data = data.test, target = "SetAppt", positive = "1")
ksvm <- makeLearner("classif.ksvm", predict.type = "prob")
tuned.svm <- setHyperPars(ksvm, par.vals = res$x)
getParamSet(tuned.svm)
getHyperPars(tuned.svm)
# Train final model with whole training set
svm.train <- train(tuned.svm, trainTask)
# Predict final model with test set
predict.svm <- predict(svm.train, testTask, subset = y.test)
# I also tried this
predict.svm <- predict(svm.train, newdata = x.test)
I am trying to implement lasso linear regression. I train my model but when I try to make prediction on unknown data it gives me the following error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
Summary of my data is:
I want to predict the unknown percent_gc. I initially train my model using data for which percent_gc is known
set.seed(1)
###training data
data.all <- tibble(description = c('Xylanimonas cellulosilytica XIL07, DSM 15894','Teredinibacter turnerae T7901',
'Desulfotignum phosphitoxidans FiPS-3, DSM 13687','Brucella melitensis bv. 1 16M'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('63.93','69.372','3.493.11','44.393.704'),
Longitude = c('-22.1','88.235','134.082.527','-0.130781'),
genome_size = c(8361599,2469596,2158157,3207552),
percent_gc = c(34,24,55,44),
percent_psuedo = c(0.0032987747,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
###data for prediction
data.prediction <- tibble(description = c('Liberibacter crescens BT-1','Saprospira grandis Lewin',
'Sinorhizobium meliloti AK83','Bifidobacterium asteroides ATCC 25910'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('39.53','69.372','5.493.12','44.393.704'),
Longitude = c('20.1','-88.235','134.082.527','-0.130781'),
genome_size = c(474832,2469837,2158157,3207552),
percent_gc = c(NA,NA,NA,NA),
percent_psuedo = c(0.0074639239,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
x=model.matrix(percent_gc~.,data.all)
y=data.all$percent_gc
cv.out <- cv.glmnet (x, y, alpha = 1,family = "gaussian")
best.lambda= cv.out$lambda.min
fit <- glmnet(x,y,alpha=1)
I then want to make predictions for which percent_gc in not known.
newX = matrix(data = data.prediction %>% select(-percent_gc))
data.prediction$percent_gc <-
predict(object = fit ,type="response", s=best.lambda, newx=newX)
And this generates the error I mentioned above.
I don't understand which format newX should be in order to get rid of this help. Insights would be appreciated.
I could not really figure out how to construct a appropiate matrix, but package glmnetUtils provides functionality to directly fit a formula on a dataframe and predict. With this I got it to predict values:
library(glmnetUtils)
fit <- glmnet(percent_gc~.,data.all,alpha=1)
cv.out <- cv.glmnet (percent_gc~.,data.all, alpha = 1,family = "gaussian")
best.lambda= cv.out$lambda.min
predict(object = fit,data.prediction,s=best.lambda)
I am trying to investigate my model with R with machine learning. Training model in general works not well.
# # Logistic regression multiclass
for (i in 1:30) {
# split data into training/test
trainPhyIndex <- createDataPartition(subs_phy$Methane, p=10/17,list = FALSE)
trainingPhy <- subs_phy[trainPhyIndex,]
testingPhy <- subs_phy[-trainPhyIndex,]
# Pre-process predictor values
trainXphy <- trainingPhy[,names(trainingPhy)!= "Methane"]
preProcValuesPhy <- preProcess(x= trainXphy,method = c("center","scale"))
# using boot to avoid over-fitting
fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
number = 10,
repeats = 4,
savePredictions="final",
classProbs = TRUE
)
fit_glmnet_phy <- train (Methane~.,
trainingPhy,
method = "glmnet",
tuneGrid = expand.grid(
.alpha =0.1,
.lambda = 0.00023),
metric = "Accuracy",
trControl = fitControlPhyGLMNET)
pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
# Get the confusion matrix to see accuracy value
u <- union(pred_glmnet_phy,testingPhy$Methane)
t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
accu_glmnet_phy <- confusionMatrix(t)
# accu_glmnet_phy<-confusionMatrix(pred_glmnet_phy,testingPhy$Methane)
glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
The program always stopped on fit_glmnet_phy <- train (Methane~., ..
this command and shows
Metric Accuracy not applicable for regression models
I have no idea about this error
I also attached the type of mathane
enter image description here
Try normalizing the input columns and mapping the output column as factors. This helped me resolve an issue similar to it.
I have an imbalanced data, and I want to do stratified cross validation and use precision recall auc as my evaluation metric.
I use prSummary in r package caret with stratified index, and I encounter an error when computing performance.
The following is a sample which can be reproduced. I found that there are only ten sample to compute p-r auc, and because of the imbalanced, there is only one class so that it cannot compute p-r auc. (The reason that I found that only ten sample to compute p-r auc is because I modified the prSummary to force this function to print out the data)
library(randomForest)
library(mlbench)
library(caret)
# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]
# make this data very imbalance
y[4:length(y)] <- "M"
y <- as.factor(y)
dataset$Class <- y
# create index and indexOut
seed <- 1
set.seed(seed)
folds <- 2
idxAll <- 1:nrow(x)
cvIndex <- createFolds(factor(y), folds, returnTrain = T)
cvIndexOut <- lapply(1:length(cvIndex), function(i){
idxAll[-cvIndex[[i]]]
})
names(cvIndexOut) <- names(cvIndex)
# set the index, indexOut and prSummaryCorrect
control <- trainControl(index = cvIndex, indexOut = cvIndexOut,
method="cv", summaryFunction = prSummary, classProbs = T)
metric <- "AUC"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
Here is the error message:
Error in ROCR::prediction(y_pred, y_true) :
Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.
I think I found the weird thing...
Even I specified the cross validation index, the summary function(no matter prSummary or other summary function) will still randomly(I am not sure) select ten sample to computing performance.
The way I did is defined a summary function with tryCatch to avoid this error occur.
prSummaryCorrect <- function (data, lev = NULL, model = NULL) {
print(data)
print(dim(data))
library(MLmetrics)
library(PRROC)
if (length(levels(data$obs)) != 2)
stop(levels(data$obs))
if (length(levels(data$obs)) > 2)
stop(paste("Your outcome has", length(levels(data$obs)),
"levels. The prSummary() function isn't appropriate."))
if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
stop("levels of observed and predicted data do not match")
res <- tryCatch({
auc <- MLmetrics::PRAUC(y_pred = data[, lev[2]], y_true = ifelse(data$obs == lev[2], 1, 0))
}, warning = function(war) {
print(war)
auc <- NA
}, error = function(e){
print(dim(data))
auc <- NA
}, finally = {
print("finally")
auc <- NA
})
c(AUC = res,
Precision = precision.default(data = data$pred, reference = data$obs, relevant = lev[2]),
Recall = recall.default(data = data$pred, reference = data$obs, relevant = lev[2]),
F = F_meas.default(data = data$pred, reference = data$obs, relevant = lev[2]))
}