Evaluating classifiers based on aggregated confidence scores - math
We have a "trained classifier". This is not necessarily a classifier such as an SVM, an MLP, etc.
Given an input, the classifier returns a list of matched outputs, each with a confidence score.
Given an input, the output could look like this:
Matched output 1 -> score 90
Matched output 2 -> score 85
Matched output 3 -> score 80
In this case we would consider the classifier result to be "bad", since the distance between the confidence scores of the different outputs is "low".
Given a different input, the output could look like this:
Matched output 1 -> score 90
Matched output 2 -> score 45
Matched output 3 -> score 25
In this case we would consider the classifier result to be "good", since the distance between the confidence scores of the different outputs is "high".
We have a lot of inputs that we can run through the system.
Is there a way to find out what a "high enough" distance is, so that I can say the model is "confident enough"?
This is not meant for comparing algorithms, but for tracking the system's performance against itself over time.
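One data-driven way to pick such a threshold is to compute the margin (top score minus runner-up score) for every input you can run through the system and set the cut-off at a chosen percentile of that empirical distribution. A minimal sketch in R; the score matrix and the 10th-percentile choice are illustrative assumptions, not something prescribed by the question:

```r
# Hypothetical helper: estimate a margin threshold empirically.
# Each row of `scores` holds the confidence scores for one input.
margin_threshold <- function(scores, quantile_level = 0.10) {
  # margin = distance between the best and second-best score per input
  margins <- apply(scores, 1, function(s) {
    s <- sort(s, decreasing = TRUE)
    s[1] - s[2]
  })
  # inputs whose margin falls below this value get flagged "not confident"
  quantile(margins, probs = quantile_level)
}

set.seed(1)
scores <- matrix(runif(300, 0, 100), ncol = 3)  # 100 fake inputs, 3 outputs each
threshold <- margin_threshold(scores)
```

Re-estimating the threshold on fresh batches of inputs over time would also show whether the system's confidence profile is drifting, which matches the "performance against itself" goal.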
Related
How to know probability output by a model corresponds to which class?
I am studying the fourth chapter of An Introduction to Statistical Learning with Applications in R, which discusses classification models. In section 4.7.3, Linear Discriminant Analysis (Lab), the model is applied to a dataset named Smarket to predict the ups and downs of the stock market. The total numbers of down and up predictions were computed by the following lines of code:

sum(lda.pred$posterior[, 1] >= .5)
sum(lda.pred$posterior[, 1] < .5)

and the writers note: "Notice that the posterior probability output by the model corresponds to the probability that the market will decrease". To verify this, the following lines were written:

lda.pred$posterior[1:20, 1]

which gives the posterior probabilities of 20 observations, and

lda.class[1:20]

which gives the classes corresponding to those probabilities. Also, when I wrote the line of code (thanks to the ISLR online course)

data.frame(lda.pred)[1:20, ]

which gives the classes and the corresponding probabilities, I saw that observations with probabilities < 0.5 are classified as the Down class and observations with probabilities >= 0.5 as the Up class.

This is all a bit confusing to me. My question is: in the first case, how do we know that a probability greater than or equal to 0.5 means the prediction is Down? Using the contrasts() function, it is seen that R has created a dummy variable with a 1 for Up, which means the values should correspond to the probability of the market going up, rather than down. And in the second case, why are observations with probabilities >= 0.5 classified as Up? Don't the first and second cases contradict each other?
You are predicting both: the posterior probability of "Down" is 1 minus that of "Up". It just so happens that class "Down" is stored in the first column of the posterior matrix, lda.pred$posterior[1:20, 1], while the probabilities of "Up" are stored in the second column, lda.pred$posterior[1:20, 2].
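This column layout is easy to verify on any two-class LDA fit from MASS: the two posterior columns are complementary, and the predicted class is simply whichever column reaches 0.5. A small illustration on an iris subset standing in for the Smarket data:

```r
library(MASS)

# Two-class problem: drop one iris species
two_class <- subset(iris, Species != "setosa")
two_class$Species <- droplevels(two_class$Species)

fit  <- lda(Species ~ Sepal.Length + Sepal.Width, data = two_class)
pred <- predict(fit, two_class)

# The two posterior columns always sum to 1 (up to floating point) ...
rowSums(pred$posterior)

# ... and the class label matches the column whose posterior is >= 0.5;
# the column order follows the factor levels, which you can inspect with
# colnames(pred$posterior) or contrasts(two_class$Species).
first_col_wins <- pred$posterior[, 1] >= 0.5
all(pred$class[first_col_wins] == colnames(pred$posterior)[1])  # TRUE
```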
Is it possible to write a function in R to perform a discriminant analysis with a cumulative, variable number of factors?
I am attempting to perform a linear discriminant analysis on geometric morphometric data. Because geometric morphometric data typically produce large numbers of variables, and discriminant analyses require more data points than variables to classify specimens accurately, a common solution in the literature is to perform a principal component analysis first and then use as input for the LDA a variable number of PCs representing less than 99% of the cumulative variance but returning the highest reclassification rate.

Right now I am doing this by running the LDA in R (using the functions in the Morpho and MASS packages) for every possible number of PCs and noting the classification accuracy by hand until I find the lowest number of PCs that returns the highest accuracy, but this is highly inefficient. I was wondering if there is any way to write a function that would run an LDA for all possible numbers of the first N PCs (up to a user-defined level representing 99% of the cumulative variance) and return the percent reclassification rate for each level, producing something like the following:

PCs  percent_accuracy
20   72.2
19   76.3
18   77.4
17   80.1
16   75.4
15   50.7
...  ...
 1   20.2

So row 1 would be the reclassification rate when the first 20 PCs are used, row 2 the rate when the first 19 PCs are used, and so on and so forth.
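Such a loop can be written with MASS alone; the sketch below (lda_by_pcs is a hypothetical helper, and iris stands in for the morphometric measurements and group labels) fits an LDA on the first k PCs for each k and records the resubstitution ("reclassification") accuracy:

```r
library(MASS)

# For k = 1..max_pcs: fit an LDA on the first k PC scores and compute the
# percentage of specimens reclassified into their true group.
lda_by_pcs <- function(measurements, groups, max_pcs) {
  pca <- prcomp(measurements, scale. = TRUE)
  sapply(seq_len(max_pcs), function(k) {
    scores <- pca$x[, seq_len(k), drop = FALSE]
    fit    <- lda(scores, grouping = groups)
    pred   <- predict(fit, scores)$class
    100 * mean(pred == groups)          # percent correctly reclassified
  })
}

acc <- lda_by_pcs(iris[, 1:4], iris$Species, max_pcs = 4)
data.frame(PCs = seq_along(acc), percent_accuracy = acc)
```

Note that resubstitution accuracy is optimistic; passing CV = TRUE to lda() would give a leave-one-out cross-validated rate instead, which is usually the fairer number to report.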
Correlation of categorical data to binomial response in R
I'm looking to analyze the correlation between a categorical input variable and a binomial response variable, but I'm not sure how to organize my data or whether I'm planning the right analysis. Here's my data table (variables explained below):

species <- c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge <- c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung     <- c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred     <- c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar   <- c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant    <- c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood    <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom <- c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss     <- c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) # 1 means yes, 0 means no
# data.frame rather than cbind, which would coerce everything to character
data <- data.frame(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss)
data # check data table

Data table explanation: I have individual species listed, and the next columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have multiple feeding types, while some have only one. The response variable I am interested in is "loss", indicating loss of a trait. I'm curious to know if any of the feeding types predict or are correlated with the status of "loss".

Thoughts: I wasn't sure if there was a good way to include feeding types as one categorical variable with multiple categories. I don't think I can organize it as a single variable with the types c("scavenge","dung","pred", etc...), since some species have multiple feeding types, so I split them up into separate columns and indicated their status as 1 (yes) or 0 (no). At the moment I am thinking of trying a log-linear analysis, but the examples I find don't quite have comparable data... and I'm happy for suggestions. Any help or pointing in the right direction is much appreciated!
There are too few samples: you have 18 observations with loss == 0 and only 4 with loss == 1, so you will run into problems fitting a full logistic regression (i.e. one including all variables). I suggest testing the association between loss and each feeding habit separately using a Fisher test:

library(dplyr)
library(purrr)
library(tibble)  # for add_column()

# function for the fisher test
FISHER <- function(x, y) {
  FT <- fisher.test(table(x, y))
  data.frame(
    pvalue = FT$p.value,
    oddsratio = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}

# define variables to test
FEEDING <- c("scavenge","dung","pred","nectar","plant","blood","mushroom")

# loop through and test the association between each variable and "loss"
results <- data[, FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)

You get the results for each feeding habit:

> results
       var      pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465    0.002943469       2.817560
2     dung 1.000000000 1.1582683    0.017827686      20.132849
3     pred 0.263157895 0.0000000    0.000000000       3.189217
4   nectar 0.535201640 0.0000000    0.000000000       5.503659
5    plant 0.002597403       Inf    2.780171314            Inf
6    blood 1.000000000 0.0000000    0.000000000      26.102285
7 mushroom 0.337662338 5.0498688    0.054241930     467.892765

The pvalue column is the p-value from fisher.test; an odds ratio > 1 means the variable is positively associated with loss. Of all your variables, plant is the strongest, and you can check:

> table(loss, plant)
    plant
loss  0 1
   0 18 0
   1  1 3

All three species with plant = 1 also have loss = 1. So with your current dataset, I think this is the best you can do; you should get a larger sample size to see whether this still holds.
Differential gene expression analysis: how to do a t-test on an expression matrix with groups from a separate clinical matrix?
I am beginning my PhD in transcriptome analysis (Affymetrix assays). I have an expression matrix (trans_data: 32000 genes x 620 samples) and a clinical matrix (clin_data: 620 samples x 42 clinical characteristics). The samples belong to one of four populations, A-B-C-D. I'd like to compare gene expression between populations A and B without trying to bind the two matrices. I'd like to obtain a matrix with the mean expression of each gene in the two populations, then the p-value, then the adjusted p-value, so that I can select only the differentially expressed genes (padj < 0.05). Thanks for your help. Alain
I can't answer your question directly without a clear reproducible example, but you might want to check out the rather excellent tableone package.
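The per-gene workflow described in the question can be sketched in base R: select the A and B columns through the clinical matrix (the two matrices stay separate, linked only by sample IDs), run one t-test per row, then adjust the p-values. The simulated matrices, the `sample` and `population` column names, and the BH adjustment are all assumptions standing in for the real trans_data and clin_data:

```r
# Simulated stand-ins (hypothetical) for the real matrices
set.seed(42)
trans_data <- matrix(rnorm(100 * 20), nrow = 100,
                     dimnames = list(paste0("gene", 1:100), paste0("s", 1:20)))
clin_data  <- data.frame(sample     = paste0("s", 1:20),
                         population = rep(c("A", "B", "C", "D"), each = 5))

# Pick the columns belonging to populations A and B via the clinical matrix
a_cols <- clin_data$sample[clin_data$population == "A"]
b_cols <- clin_data$sample[clin_data$population == "B"]

# One Welch t-test per gene (row), then a BH adjustment across all genes
res <- t(apply(trans_data, 1, function(g) {
  tt <- t.test(g[a_cols], g[b_cols])
  c(mean_A = mean(g[a_cols]), mean_B = mean(g[b_cols]), pvalue = tt$p.value)
}))
res <- data.frame(res, padj = p.adjust(res[, "pvalue"], method = "BH"))

# Differentially expressed genes at padj < 0.05
deg <- res[res$padj < 0.05, ]
```

For real microarray data, the moderated t-tests in the limma package are the standard choice over gene-by-gene t.test calls, since they share variance information across genes.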
How to control method of ranking for Kruskal-Wallis test in R?
In my experiments I tried different set-ups to balance the distribution between two tasks. Each set-up was run 32 times. I got the following task distributions [ratio from 0 to 1 of tasktype1/(tasktype1+tasktype2)]: http://oi63.tinypic.com/2cf6szb.jpg This is what (part of) the data frame looks like in R: http://oi67.tinypic.com/2z9fg28.jpg

I think ANOVA is not suitable, as the data do not seem to be normally distributed. (Is there a quick way to verify a low level of normality? Is there a standard for the point at which ANOVA is no longer suitable?) Therefore I decided to do the Kruskal-Wallis test. Reading up on the test, I figured the data need to be ranked. But how can I choose the method of ranking when computing the Kruskal-Wallis test in R? In my case the "desired" outcome is a balanced population (a ratio of 0.5), so the ranks would be:

rank  ratio
1     0.5
2     0.4 and 0.6
...   ...

Can kruskal.test() be adjusted accordingly? (Maybe I am misunderstanding the function...) My best guess would be just to try: kruskal.test(ratio ~ Method, data = ds)
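kruskal.test() always ranks the raw response values itself and has no argument for a custom ranking. One way to get the ordering described above (0.5 best; 0.4 and 0.6 tied; and so on) is to transform the response to its absolute deviation from 0.5 before testing. A sketch with simulated data; ds, Method, and the beta-distributed ratios are made-up stand-ins for the real 32-run experiments:

```r
# Hypothetical stand-in for the real data frame: 3 methods x 32 runs each
set.seed(7)
ds <- data.frame(
  Method = rep(c("A", "B", "C"), each = 32),
  ratio  = c(rbeta(32, 5, 5), rbeta(32, 2, 5), rbeta(32, 5, 2))
)

# Distance from the desired balanced outcome: 0.5 ranks best,
# 0.4 and 0.6 tie, and so on -- exactly the ordering in the question
ds$imbalance <- abs(ds$ratio - 0.5)

kw <- kruskal.test(imbalance ~ Method, data = ds)
kw
```

With this transformation the test asks "do the methods differ in how far they stray from a balanced 0.5 ratio?", which matches the stated goal better than ranking the raw ratios, where 0.1 and 0.9 would sit at opposite ends despite being equally unbalanced.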