Calculate AUC for test set (keras model in R) - r

Is there a function to calculate the AUC for a keras model in R on a test set?
I have searched on Google but nothing showed up.
From Keras model, we can extract the predicted values as either class or probability as follows:
Probability:
[1,] 9.913518e-01 1.087829e-02
[2,] 9.990101e-01 1.216531e-03
[3,] 9.445553e-01 6.256607e-02
[4,] 9.928864e-01 6.808311e-03
[5,] 9.993126e-01 1.028240e-03
[6,] 6.075442e-01 3.926141e-01
Class:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Many thanks,
Ho

Generally it does not matter which classifier (keras or not) made the predictions. All you need to estimate the AUC are two things: the predicted probabilities from the classifier and the actual category (for example, dead "yes" vs. "no"). From these you can calculate both the true positive rate and the false positive rate, so you can also draw a ROC plot and estimate the AUC. You can use
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
See here for some more explanation.
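If you would rather not depend on a package, the AUC can also be computed by hand from the same two inputs. This is a minimal sketch, assuming binary labels coded 0/1 and using the rank (Mann-Whitney) form of the AUC. The name auc_rank and the toy vectors are made up for illustration. For a keras model, scores would be the predicted probability of the positive class on the test set.

```r
# AUC via the rank-sum (Mann-Whitney) identity; labels must be 0/1.
auc_rank <- function(labels, scores) {
  r <- rank(scores)                # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (hypothetical, not from the question)
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_value <- auc_rank(labels, scores)
print(auc_value)  # 0.75
```

On real data this should agree with auc(roc_obj) from pROC, as long as the positive class is the one coded 1.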

I'm not sure this will meet your needs, as it depends on your data structure and the keras output format, but have a look at the evaluate function in the dismo package. You need to set up something like this:
library(dismo)
# predictors: raster stack of explanatory variables
# pres_test:  presence data held out from model fitting, kept for testing
# backg_test: true or random (background) absence data
# model:      the fitted model object
e <- evaluate(p = pres_test, a = backg_test, model = model, x = predictors)
e@auc  # AUC; you may bootstrap this step by randomly re-selecting pres_test and backg_test several times

Related

credit score from SVM probabilities in R

I am trying to calculate credit scores for the germancredit dataframe in R. I used linear SVM classifier to predict 0 and 1 (i.e. 0 = good, 1 = bad).
I managed to produce probabilities from SVM classifier using the following code.
final_pred = predict(classifier, newdata = data_treated[1:npredictors], decision.values = TRUE, probability = TRUE)
probs = attr(final_pred,"probabilities")
I want to know how to read this probability output. A sample of the output is below.
Does it mean that, if the prediction is 1 (Default) as in the fifth row, the probability of default is 0.53601166?
           0          1 Prediction
1 0.90312125 0.09687875          0
2 0.57899408 0.42100592          0
3 0.93079172 0.06920828          0
4 0.76600082 0.23399918          0
5 0.46398834 0.53601166          1
Can I then use these probabilities to develop a credit scorecard, as we usually do with a logistic regression model?
You get a probability for each outcome, 0 and 1. The first two columns of each row sum to one. Your interpretation seems correct to me: in the fifth row a default (p = 0.54) is more likely than no default (p = 0.46).
Yes, you could use that model to develop a credit scorecard. Keep in mind that you don't necessarily need to use 0.5 as the cutoff for deciding whether company or person X is going to default.
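As a rough illustration of the scorecard idea, one common convention maps the odds of being good to points on a log scale. The sketch below assumes hypothetical scaling constants (600 points at 50:1 good:bad odds, 20 points to double the odds) that are not from the question, and prob_to_score is an illustrative name. The probabilities are the "1" column from the sample output above.

```r
# Map P(bad) to a score via log-odds scaling (all constants are assumptions).
prob_to_score <- function(p_bad, base_score = 600, base_odds = 50, pdo = 20) {
  points_per_log_odds <- pdo / log(2)            # points added per unit of log-odds
  offset <- base_score - points_per_log_odds * log(base_odds)
  odds_good <- (1 - p_bad) / p_bad               # good:bad odds
  offset + points_per_log_odds * log(odds_good)
}

# P(default) for the five rows shown in the question
p_bad <- c(0.09687875, 0.42100592, 0.06920828, 0.23399918, 0.53601166)
scores <- round(prob_to_score(p_bad))
print(scores)
```

A higher score means a lower predicted probability of default. Whether SVM probabilities are calibrated well enough for this use is something you would need to check on held-out data.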

Correlation of categorical data to binomial response in R

I'm looking to analyze the correlation between a categorical input variable and a binomial response variable, but I'm not sure how to organize my data or if I'm planning the right analysis.
Here's my data table (variables explained below):
species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data<-cbind(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss)
data #check data table
data table explanation
I have individual species listed, and the next columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have multiple feeding types, while some have only one feeding type. The response variable I am interested in is "loss," indicating loss of a trait. I'm curious to know if any of the feeding types predict or are correlated with the status of "loss."
thoughts
I wasn't sure if there was a good way to include feeding types as one categorical variable with multiple categories. I don't think I can organize it as a single variable with the types c("scavenge","dung","pred", etc...) since some species have multiple feeding types, so I split them up into separate columns and indicated their status as 1 (yes) or 0 (no). At the moment I was thinking of trying to use a log-linear analysis, but examples I find don't quite have comparable data... and I'm happy for suggestions.
Any help or pointing in the right direction is much appreciated!
There are too few samples: you have 18 species with loss == 0 and only 4 with loss == 1. You will run into problems fitting a full logistic regression (i.e. including all variables). I suggest testing each feeding habit for association with loss using Fisher's exact test:
library(dplyr)
library(purrr)
library(tibble)  # for add_column()
# cbind() in the question gives a character matrix; use a data frame instead
data <- data.frame(species, scavenge, dung, pred, nectar, plant, blood, mushroom, loss)
# function for the fisher test
FISHER <- function(x, y) {
  FT <- fisher.test(table(x, y))
  data.frame(
    pvalue = FT$p.value,
    oddsratio = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}
# define variables to test
FEEDING <- c("scavenge","dung","pred","nectar","plant","blood","mushroom")
# we loop through and test association between each variable and "loss"
results <- data[, FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)
You get the results for each feeding habit:
> results
var pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465 0.002943469 2.817560
2 dung 1.000000000 1.1582683 0.017827686 20.132849
3 pred 0.263157895 0.0000000 0.000000000 3.189217
4 nectar 0.535201640 0.0000000 0.000000000 5.503659
5 plant 0.002597403 Inf 2.780171314 Inf
6 blood 1.000000000 0.0000000 0.000000000 26.102285
7 mushroom 0.337662338 5.0498688 0.054241930 467.892765
The pvalue column is the p-value from fisher.test. An odds ratio > 1 means the variable is positively associated with loss, and an odds ratio < 1 means it is negatively associated. Of all your variables, plant shows the strongest association, which you can check:
> table(loss,plant)
    plant
loss  0  1
   0 18  0
   1  1  3
All three species with plant=1 also have loss=1, and only one species with loss=1 has plant=0. With your current dataset, I think this is the best you can do. You would need a larger sample to see whether this still holds.
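The plant result can also be checked with base R alone, using the two vectors exactly as given in the question. This is only a sanity check of that single 2x2 table, not a replacement for the loop over all feeding types.

```r
# Vectors copied from the question
plant <- c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
loss  <- c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)

tab <- table(loss, plant)   # rows: loss, columns: plant
ft  <- fisher.test(tab)     # two-sided Fisher's exact test

print(tab)
print(ft$p.value)
```

With margins this small, the observed table is the most extreme one possible, so the two-sided p-value is just the hypergeometric probability of that one table.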

Predict mclust cluster membership outside R

I've used mclust to find clusters in a dataset. Now I want to implement these findings in external non-R software to classify new observations (so predict.Mclust, which has been suggested in previous similar questions, is not an option). I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it seemed reasonable to calculate the Mahalanobis distance from every observation to every cluster. Observations could then be assigned to the Mahalanobis-nearest cluster. However, this does not seem to work fully.
Example code with simulated data (in this example I only use one dataset, d, and try to obtain the same classification as mclust with the Mahalanobis approach outlined above):
library(MASS)    # mvrnorm()
library(mclust)  # Mclust()
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,function(x){
match(min(x),x)
})
table(int_class,mahal_classification)
#List Mahalanobis distances for misclassified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate misclassified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
         mahal_classification
int_class   1   2
        1 124   0
        2   5 171
> mahal_clust_dist[mahal_classification!=int_class,]
mahal_clust1 mahal_clust2
[1,] 1.340450 1.978224
[2,] 1.607045 1.717490
[3,] 3.545037 3.938316
[4,] 4.647557 5.081306
[5,] 1.570491 2.193004
Five observations are classified differently by the Mahalanobis approach and by mclust. In the plots they are intermediate points between the two clusters. Could someone tell me why it does not work and how I could mimic the internal classification of mclust and predict.Mclust?
After formulating the above question I did some additional research (thx LoBu) and found that the key is to calculate the posterior probability (pp) that an observation belongs to each cluster and to classify according to the maximal pp. The following works:
library(mvtnorm)  # dmvnorm()
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
pp_class<-apply(pp_matrix,1,function(x){
match(max(x),x)
})
table(pp_class,m$classification)
#Result:
pp_class   1   2
       1 124   0
       2   0 176
But if someone could explain in layman's terms the difference between the Mahalanobis and pp approaches, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?
In addition to the Mahalanobis distance, you also need to take the cluster weights (the mixing probabilities) into account.
These weight the relative importance of the clusters where they overlap.
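To make the difference concrete: for Gaussian components, the log of (weight x density) equals log(weight) - 0.5*log(det(Sigma)) - 0.5*(squared Mahalanobis distance), up to a constant shared by all clusters. So the nearest cluster by Mahalanobis distance wins only when the weights and covariance determinants are equal. Below is a minimal base R sketch with made-up parameters (not the fitted values from the question) in which the two rules disagree.

```r
# Hypothetical two-component mixture, chosen only for illustration
pro   <- c(0.3, 0.7)                      # mixing probabilities (cluster weights)
means <- list(c(0, 0), c(3, 3))
covs  <- list(diag(2, 2), diag(3, 2))     # 2*I and 3*I

x <- matrix(c(1.3, 1.3), nrow = 1)        # one observation between the clusters

# Squared Mahalanobis distance to each cluster
d2 <- sapply(1:2, function(k) mahalanobis(x, center = means[[k]], cov = covs[[k]]))

# Log posterior up to an additive constant common to both clusters
score <- sapply(1:2, function(k) {
  log(pro[k]) - 0.5 * log(det(covs[[k]])) - 0.5 * d2[k]
})

mahal_class <- which.min(d2)      # nearest cluster by Mahalanobis distance
post_class  <- which.max(score)   # cluster with the highest posterior probability
posterior   <- exp(score) / sum(exp(score))

print(c(mahal_class = mahal_class, post_class = post_class))
```

Here the point is slightly closer to cluster 1 in Mahalanobis terms, but the larger weight of cluster 2 outweighs that, so the posterior rule assigns it to cluster 2. The score formula uses only means, covariances and weights, so it is straightforward to port to non-R software.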

TPR & FPR Curve for different classifiers - kNN, NaiveBayes, Decision Trees in R

I'm trying to understand and plot TPR/FPR for different types of classifiers. I'm using kNN, NaiveBayes and Decision Trees in R. With kNN I'm doing the following:
clnum <- as.vector(diabetes.trainingLabels[,1], mode = "numeric")
dpknn <- knn(train = diabetes.training, test = diabetes.testing, cl = clnum, k=11, prob = TRUE)
prob <- attr(dpknn, "prob")
tstnum <- as.vector(diabetes.testingLabels[,1], mode = "numeric")
pred_knn <- prediction(prob, tstnum)
pred_knn <- performance(pred_knn, "tpr", "fpr")
plot(pred_knn, avg= "threshold", colorize=TRUE, lwd=3, main="ROC curve for Knn=11")
where diabetes.trainingLabels[,1] is a vector of the labels (class) I want to predict, diabetes.training is the training data and diabetes.testing is the testing data.
Plot looks like the following:
The values stored in the prob attribute form a numeric vector (decimals between 0 and 1). I convert the class label factor into numbers so that I can use it with the prediction/performance functions from the ROCR library. I'm not 100% sure I'm doing it correctly, but at least it works.
For NaiveBayes and Decision Trees, though, with the prob/raw parameter specified in the predict function I don't get a single numeric vector but a list of vectors or a matrix where the probability for each class is specified (I guess), e.g.:
diabetes.model <- naiveBayes(class ~ ., data = diabetesTrainset)
diabetes.predicted <- predict(diabetes.model, diabetesTestset, type="raw")
and diabetes.predicted is:
tested_negative tested_positive
[1,] 5.787252e-03 0.9942127
[2,] 8.433584e-01 0.1566416
[3,] 7.880800e-09 1.0000000
[4,] 7.568920e-01 0.2431080
[5,] 4.663958e-01 0.5336042
The question is how to use this to plot a ROC curve, and why does kNN give me one vector while the other classifiers give separate probabilities for both classes?
ROC curve
The ROC curve you provided for the knn11 classifier looks off: it is below the diagonal, indicating that your classifier assigns class labels correctly less than 50% of the time. Most likely you provided the wrong class labels or the wrong probabilities. If you used class labels of 0 and 1 in training, those same class labels should be passed to the ROC curve in the same order (without flipping 0 and 1).
Another, less likely, possibility is that you have a very weird dataset.
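One way the probabilities can end up "wrong" here: as far as I read the documentation of knn in the class package, the prob attribute is the proportion of votes for the winning class, not the probability of one fixed class. If that applies to your case, it has to be converted before it is passed to prediction(). A minimal sketch with made-up values (pred_label and win_prob stand in for dpknn and attr(dpknn, "prob")):

```r
# Hypothetical kNN output: predicted class and vote share of the winning class
pred_label <- factor(c("pos", "neg", "neg", "pos"), levels = c("neg", "pos"))
win_prob   <- c(0.90, 0.80, 0.60, 0.55)

# Probability of the event class ("pos"): keep the vote share when "pos" won,
# otherwise take the complement
pos_prob <- ifelse(pred_label == "pos", win_prob, 1 - win_prob)
print(pos_prob)  # 0.90 0.20 0.40 0.55
```

The converted vector is on the same scale as the tested_positive column you get from naiveBayes.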
Probabilities for other classifiers
The ROC curve was originally developed for detecting events from radar signals. It is closely tied to predicting a single event: the probability that you correctly call the event of a plane approaching. So it uses one probability. This can be confusing when someone classifies two classes where the "hit" probability is not evident, as in your case with cases and controls.
However, any two-class classification can be expressed in terms of "hits" and "misses"; you just have to select the class which you will call an "event". In your case, having diabetes might be called an event.
So from this table:
tested_negative tested_positive
[1,] 5.787252e-03 0.9942127
[2,] 8.433584e-01 0.1566416
[3,] 7.880800e-09 1.0000000
[4,] 7.568920e-01 0.2431080
[5,] 4.663958e-01 0.5336042
You only have to select one probability, that of the event, probably "tested_positive". The other one, "tested_negative", is just 1 - tested_positive, because when the classifier thinks that a particular person has diabetes with 79% chance, it at the same time "thinks" that there is a 21% chance of that person not having diabetes. You only need one number to express this idea, so knn returns only one, while other classifiers can return two.
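In code, that selection is a single column lookup. The sketch below rebuilds the five rows shown above, picks the event column, and computes one ROC point (TPR and FPR) at a 0.5 threshold. The true labels are hypothetical, invented only to make the example run, so the resulting rates say nothing about your model.

```r
# The five rows from the naiveBayes output above
predicted <- cbind(
  tested_negative = c(5.787252e-03, 8.433584e-01, 7.880800e-09, 7.568920e-01, 4.663958e-01),
  tested_positive = c(0.9942127, 0.1566416, 1.0000000, 0.2431080, 0.5336042)
)

# One probability per case: that of the event
event_prob <- predicted[, "tested_positive"]

# Hypothetical true labels (NOT from the original post): 1 = tested_positive
truth <- c(1, 0, 1, 0, 0)

# One point of the ROC curve at threshold 0.5
called <- as.integer(event_prob >= 0.5)
tpr <- sum(called == 1 & truth == 1) / sum(truth == 1)
fpr <- sum(called == 1 & truth == 0) / sum(truth == 0)
print(c(tpr = tpr, fpr = fpr))
```

With ROCR, event_prob is what you would pass as the first argument to prediction(), with the true labels as the second.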
I don't know which library you used for decision trees so cannot help with the output of that classifier.
It looks like you are doing something fundamentally wrong.
Ideally a KNN ROC curve looks like the one above. Here are a few points you can check.
Calculate the distances in your code.
Use the code below for prediction in Python.
Predicted class:
print(model_name.predict(test))
3 nearest neighbors:
print(model_name.kneighbors(test)[1])

Creating classification features from wavelet transformed time series

I'm interested in using a wavelet transform, Haar for example, to create classification variables from time series data to use in logistic regression.
Simple example. Let's say I'm trying to predict payment defaults and I have a person's monthly expense data, and someone with consistent expenses is a better risk than someone with increasing expenses in the most recent 4 months.
If I have two sample borrowers:
Borrower A - Good - expensesA = c(100,110,95,105), default = 0
Borrower B - Bad - expensesB = c(75,100,150,200), default = 1
If I am using logistic regression, glm() in R, to create a classification model, and the R wavelets package dwt() function for a "haar" transform of the time series what are the appropriate features to extract from the dwt() object to use in glm()?
The truncated output for Borrower A is:
tr = dwt(expensesA, filter = "haar")
tr
An object of class "dwt"
Slot "W":
$W1
[,1]
[1,] 7.071068
[2,] 7.071068
$W2
[,1]
[1,] -5
Slot "V":
$V1
[,1]
[1,] 148.4924
[2,] 141.4214
$V2
[,1]
[1,] 205
Slot "filter":
Filter Class: Daubechies
Name: HAAR
Length: 2
Level: 1
Wavelet Coefficients: 7.0711e-01 -7.0711e-01
Scaling Coefficients: 7.0711e-01 7.0711e-01
I know Ws are wavelet coefficients and the Vs the scaling coefficients.
Do I need to use all four W1 and V1 values as variables to properly model this or is it okay to try just the W1s without V1s (or vice versa)?
Is it worthwhile to try only the single W2 and V2s as variables?
Or is it better to try a clustering algorithm and label them based on clusters?
I know it of course also depends on the data, but I'm looking for a starting point regarding best practices.
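For reference, the W and V values in the output above can be reproduced by hand in base R, which may help when deciding what each candidate feature represents. This is a minimal sketch, not the wavelets package's implementation: haar_step is an illustrative helper, and the sign convention is chosen to match the output shown for Borrower A.

```r
# One level of the Haar transform on an even-length vector
haar_step <- function(x) {
  first  <- x[seq(1, length(x), by = 2)]   # 1st, 3rd, ... values
  second <- x[seq(2, length(x), by = 2)]   # 2nd, 4th, ... values
  list(
    W = (second - first) / sqrt(2),        # wavelet (detail) coefficients: local change
    V = (second + first) / sqrt(2)         # scaling (smooth) coefficients: local level
  )
}

expensesA <- c(100, 110, 95, 105)
level1 <- haar_step(expensesA)   # W1, V1
level2 <- haar_step(level1$V)    # W2, V2

print(level1$W)  # 7.071068 7.071068
print(level1$V)  # 148.4924 141.4214
print(level2$W)  # -5
print(level2$V)  # 205
```

Read this way, W1 captures month-to-month changes within each pair, W2 the change between the two halves of the window, and V2 the overall level (here the sum of expenses divided by 2).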
