The best way to calculate classification accuracy?

I know one formula to calculate classification accuracy: X = t / n * 100, where t is the number of correct classifications and n is the total number of samples.
But let's say we have 100 samples in total: 80 in class A, 10 in class B, and 10 in class C.
Scenario 1: All 100 samples are assigned to class A. By the formula, accuracy is 80%.
Scenario 2: The 10 samples belonging to B are correctly assigned to B; the 10 samples belonging to C are correctly assigned to C; 30 samples belonging to A are correctly assigned to A; the remaining 50 samples belonging to A are incorrectly assigned to C. By the formula, accuracy is 50%.
My question is:
1: Can we say scenario 1 has a higher accuracy rate than scenario 2?
2: Is there a better way to calculate an accuracy rate for a classification problem?
Many thanks in advance!

Classification accuracy is defined as "percentage of correct predictions". That is the case regardless of the number of classes. Thus, scenario 1 has a higher classification accuracy than scenario 2.
However, it sounds like what you are really asking is for an alternative evaluation metric or process that "rewards" scenario 2 for only making certain types of mistakes. I have two suggestions:
Create a confusion matrix: It describes the performance of a classifier so that you can see what types of errors your classifier is making.
Calculate the precision, recall, and F1 score for each class. The average F1 score might be the single-number metric you are looking for.
The Classification metrics section of the scikit-learn documentation has lots of good information about classifier evaluation, even if you are not a scikit-learn user.
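For instance, here is a minimal scikit-learn sketch using the class counts from scenario 2 in the question (the label vectors are constructed just to match those counts):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Scenario 2: all B's and C's correct, 30 of 80 A's correct,
# the remaining 50 A's misassigned to C.
y_true = np.array(["A"] * 80 + ["B"] * 10 + ["C"] * 10)
y_pred = np.array(["A"] * 30 + ["C"] * 50 + ["B"] * 10 + ["C"] * 10)

print(accuracy_score(y_true, y_pred))                           # 0.5
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"]))  # error types
print(f1_score(y_true, y_pred, average="macro"))                 # single-number summary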

Related

Metric for evaluating inter-rater agreement on a single subject rated by multiple raters

I'm making a rating survey in R (Shiny) and I'm trying to find a metric that can evaluate agreement, but for only one of the "questions" in the survey. The ratings range from 1 to 5. There are multiple raters, and each rater rates a set of 10 questions on this scale.
I've used Fleiss' Kappa and Krippendorff's Alpha on the whole set of questions and raters, and they work, but when evaluating each question separately these metrics give negative values. I tried calculating them by hand (from the formulas) and I still get the same results, so I guess they don't work for a small sample of subjects (in this case, a sample of 1).
I've looked at other metrics like rwg in the multilevel package, but thus far I can't seem to make it work. According to the R documentation:
rwg(x, grpid, ranvar=2)
Where:
x = A vector representing the construct on which to estimate agreement.
grpid = A vector identifying the groups from which x originated.
Can someone explain to me what the rwg function expects?
If someone knows of another agreement metric that might work better, please let me know.
Thanks.
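For what it's worth, the r_wg statistic that rwg implements (James, Demaree & Wolf, 1984) is just one minus the ratio of the observed variance of the ratings on a single item to the variance expected under purely random (uniform) responding. For an A-point scale that null variance is (A^2 - 1)/12, so for a 1-5 scale it is 24/12 = 2, which is where ranvar=2 comes from: x should be the vector of all raters' ratings on the one question, and grpid the group each rating belongs to. A minimal Python sketch of the computation (not the multilevel package itself):

import numpy as np

def rwg(ratings, ranvar=2.0):
    # r_wg = 1 - observed variance / expected random variance;
    # ranvar = (A**2 - 1) / 12 for an A-point scale, i.e. 2.0 for 1-5.
    obs_var = np.var(ratings, ddof=1)  # sample variance, like R's var()
    return 1 - obs_var / ranvar

print(rwg([4, 5, 4, 4, 5]))  # high agreement, close to 1 (0.85)
print(rwg([1, 5, 2, 4, 3]))  # low agreement; can go negative (-0.25), as you observed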

Why can a Strauss-hardcore model have a gamma bigger than 1?

The spatstat book states clearly that a Strauss model with gamma bigger than 1 is invalid, and that holds:
multiple.Strauss <- ppm(P1a4.multiple ~ 1, Strauss(r = 51), method = 'ho')
# Warning message:
# Fitted model is invalid - cannot be simulated
Since the L(r) function has a trough first, I refit the data as a Strauss-hardcore model:
Mo.hybrid <- Hybrid(H = Hardcore(), S = Strauss(51))
multiple.hybrid <- ppm(P1a4.multiple ~ 1, Mo.hybrid, method = 'ho')
# Hard core distance: 12.65963
# Fitted S interaction parameter gamma: 2.7466492
It is interesting to see that the model fitted successfully, with gamma > 1!
I want to know whether gamma in the Strauss-hardcore model has the same meaning as in the Strauss model, and can therefore be used as an indicator of aggregation.
Yes, the interpretation is similar and indicates some aggregation behaviour. The model with gamma > 1 may be less intuitive to understand. Say the hardcore distance is r = 12 and the Strauss interaction distance is R = 50. Then pairs of points within distance 12 of each other are heavily penalized (not permitted at all), while pairs of points separated by between 12 and 50 are encouraged (have a higher probability of occurring than at random). Pairs of points separated by more than 50 do not change the baseline probability (complete randomness).
Simulations from the Strauss-hardcore model often show strange aggregation behaviour, but it may be suitable for your data.

The accuracy limit of combining k weak classifiers

For a dataset D (n x m), n is the number of samples and m is the number of features.
There are k weak classifiers, and the accuracy of each classifier is 60%.
How many weak classifiers must be combined to improve accuracy to 90%?
Can this problem be solved by a mathematical formula?
If we use 2 classifiers, the accuracy is 60%.
If we use 3 classifiers, the accuracy is 64.8% (3 * 60%^2 * 40% + 60%^3).
Is that right?
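As a sanity check: under the usual assumption that the classifiers err independently and are combined by majority vote, the combined accuracy is a binomial tail sum, and the 64.8% figure for three classifiers is right. A minimal sketch, assuming independence:

from math import comb

def majority_vote_accuracy(k, p=0.6):
    # Probability that more than half of k independent classifiers,
    # each correct with probability p, are correct (odd k avoids ties).
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

print(majority_vote_accuracy(3))  # 0.648, matching the question

k = 3
while majority_vote_accuracy(k) < 0.9:
    k += 2  # keep k odd
print(k)  # smallest odd k whose majority vote reaches 90%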
The idea behind your question is ensembling.
This can work when each of the models you choose performs well for particular classes. You can then assign a per-class weight to each model and generate the final output.
For example, suppose you have 3 classes (C1, C2, C3).
Say model A predicts well for C1; then you can set the final probability of C1 as
prob_of_C1 = model_A_prob*0.7 + model_B_prob*0.2 + model_C_prob*0.1
Similarly, you can apply the same rule to the other classes, as sketched below. You may have to adjust the weights, which is generally done based on each model's precision for that particular class. This will work only if the models perform differently for different classes.
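A minimal sketch of this per-class weighted averaging; all weights and probabilities below are made-up placeholders:

import numpy as np

# Rows = models (A, B, C); columns = classes (C1, C2, C3).
# weights[m, j] is model m's weight for class j; column C1 uses
# the 0.7 / 0.2 / 0.1 split from the example above.
weights = np.array([[0.7, 0.2, 0.1],
                    [0.2, 0.6, 0.3],
                    [0.1, 0.2, 0.6]])

# Predicted class probabilities from each model for one sample.
probs = np.array([[0.5, 0.3, 0.2],   # model A
                  [0.4, 0.4, 0.2],   # model B
                  [0.3, 0.3, 0.4]])  # model C

final = (weights * probs).sum(axis=0)  # weighted sum per class
final = final / final.sum()            # renormalize to probabilities
print(final.argmax())                  # index of the predicted class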
If you want to learn more, you can look at the xgboost algorithm; this blog explains it nicely: XGBoost Algorithm: Long May She Reign!

k-nearest neighbors where # of objects in each class differs vastly

I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).
How should I combat this? If I use a k of 18, for example, and there are 7 B's among the neighbors (far more than the average number of B's in a group of 18), the test point will still be classified as A when it should probably be B.
I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?
There is no such rule; for your case I would try a very small k, probably between 3 and 6.
About the dataset: unless your test data or real-world data occur in roughly the same ratio you mentioned (18:1), I would remove some A's for more accurate results. I would not advise doing that if the ratio is indeed close to the real-world data, because you would lose the effect of the ratio (a lower probability of classifying into the lower-probability class).
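A minimal sketch of both suggestions (a small k plus undersampling the majority class) in scikit-learn; the data below are random placeholders:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Placeholder imbalanced training set: 18 A's for every B.
X = rng.normal(size=(1900, 2))
y = np.array(["A"] * 1800 + ["B"] * 100)

# Undersample class A down to the number of B's.
a_keep = rng.choice(np.flatnonzero(y == "A"), size=100, replace=False)
keep = np.concatenate([a_keep, np.flatnonzero(y == "B")])

# Small k, with distance weighting so nearer neighbors count more.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X[keep], y[keep])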

Mathematical model to build a ranking/scoring system

I want to rank a set of sellers. Each seller is described by parameters var1, var2, var3, var4, ..., var20, and I want to score each seller.
Currently I calculate the score by assigning weights to these parameters (say 10% to var1, 20% to var2, and so on), with the weights determined by gut feeling.
My score equation looks like
score = w1*var1 + w2*var2 + ... + w20*var20
score = 0.1*var1 + 0.5*var2 + 0.05*var3 + ... + 0.0001*var20
My score equation could also look like
score = w1^2*var1 + w2*var2 + ... + w20^5*var20
where var1, var2, ..., var20 are normalized.
Which equation should I use?
What methods are there to scientifically determine what weights to assign?
I want to optimize these weights with a data-oriented approach so the scoring mechanism produces a more relevant score.
example
I have following features for sellers
1] Order fulfillment rate [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] {1-2: worst, 3: average, 5: good} [categorical]
4] Time taken to confirm the order (the shorter the time, the better the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches for solving this problem and calculating the score? That is, I linearly added the various features; I want to know a better approach to building the ranking system.
How do I come up with values for the weights?
Apart from the features above, a few more I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How will these fit into my score equation?
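For concreteness, here is a minimal sketch of the linear scoring the question describes, with min-max normalization so the weights act on comparable scales (all numbers are made up):

import numpy as np

# Rows = sellers, columns = features (fulfillment rate, cancel rate, rating).
X = np.array([[0.95, 0.02, 4.5],
              [0.80, 0.10, 3.9],
              [0.90, 0.05, 4.2]])

# Min-max normalize each feature to [0, 1].
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Flip "lower is better" features such as the cancel rate.
X_norm[:, 1] = 1 - X_norm[:, 1]

w = np.array([0.5, 0.3, 0.2])      # gut-feeling weights summing to 1
scores = X_norm @ w
print(np.argsort(scores)[::-1])    # seller indices, best first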
As a disclaimer, I don't think this is a concise answer, but your question is quite broad. This has not been tested, but it is the approach I would most likely take given a similar problem.
As a possible direction, consider the multivariate Gaussian density
f(x) = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * exp(-(1/2) * (x - mu)' * Sigma^(-1) * (x - mu))
The idea is that each parameter gets its own dimension and can therefore be weighted by importance. For example, with Sigma = [1,0,0; 0,2,0; 0,0,3] and a vector [x1, x2, x3], x1 has the greatest importance: it has the smallest variance, so deviations in x1 are penalized most.
The covariance matrix Sigma takes care of scaling in each dimension. To achieve this, simply place the weights on the diagonal elements of an n x n diagonal matrix; you are not really concerned with the cross terms.
mu is the vector average over all records in your data for your sellers.
x is the vector of per-category means for a particular seller, x = (x1, x2, ..., xn). It is continuously updated as more data are collected.
The parameters of the function, being based on the total dataset, should evolve as well. That way, biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of f(x) can be played with to give the desired results. This is a probability density function, but its utility is not restricted to statistics.
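A minimal sketch of the density-based scoring described above, using scipy; the weights and seller data are made-up placeholders:

import numpy as np
from scipy.stats import multivariate_normal

# Diagonal covariance: a smaller variance means deviating from the
# average in that dimension costs more, i.e. that feature matters more.
Sigma = np.diag([1.0, 2.0, 3.0])

# Rows = sellers, columns = per-category means for each seller.
data = np.array([[0.9, 0.1, 4.5],
                 [0.8, 0.2, 3.9],
                 [0.7, 0.3, 4.2]])
mu = data.mean(axis=0)  # average over all sellers

scores = multivariate_normal.pdf(data, mean=mu, cov=Sigma)
print(np.argsort(scores)[::-1])  # sellers ranked by f(x)

Note that f(x) as written peaks at the most typical seller, so, as the answer says, the evaluation (for instance, where mu points) has to be played with so that good rather than merely average values are rewarded.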
