I'm trying to measure the agreement between two different systems of classification (one of them based on machine-learning algorithms, and the other based on human ground-truthing), and I'm looking for input from someone who's implemented a similar sort of system.
The classification schema allows each item to be classified into multiple different nodes in a category taxonomy, where each classification carries a weight coefficient. For example, if some item can be classified into four different taxonomy nodes, the result might look like this for the algorithmic and ground-truth classifiers:
ALGO TRUTH
CATEGORY A: 0.35 0.50
CATEGORY B: 0.30 0.30
CATEGORY C: 0.25 0.15
CATEGORY D: 0.10 0.05
The weights across all of an item's selected category nodes always add up to exactly 1.0 (the taxonomy itself contains about 200 nodes).
In the example above, it's important to note that both lists agree about rank ordering (ABCD), so they should be scored as being in strong agreement with one another (even though there are some differences in the weights assigned to each category). By contrast, in the next example, the two classifications are in complete disagreement with respect to rank-order:
ALGO TRUTH
CATEGORY A: 0.40 0.10
CATEGORY B: 0.35 0.15
CATEGORY C: 0.15 0.35
CATEGORY D: 0.10 0.40
So a result like this should get a very low score.
One final example demonstrates a common case where the human-generated ground-truth contains duplicate weight values:
ALGO TRUTH
CATEGORY A: 0.40 0.50
CATEGORY B: 0.35 0.50
CATEGORY C: 0.15 0.00
CATEGORY D: 0.10 0.00
So it's important that the metric tolerates lists without a strict rank ordering (since the ground truth could be validly interpreted as ABCD, ABDC, BACD, or BADC).
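For concreteness, that last example could be represented as two parallel weight arrays (the variable names here are just illustrative):
double[] algo  = { 0.40, 0.35, 0.15, 0.10 }; // categories A..D
double[] truth = { 0.50, 0.50, 0.00, 0.00 }; // ties in the ground truth are allowed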
Stuff I've tried so far:
Root Mean Squared Error (RMSE): Very problematic. It doesn't account for rank-order agreement, which means that gross disagreements between categories at the top of the list are swept under the rug by agreement about categories at the bottom of the list.
Spearman's Rank Correlation: Although it accounts for differences in rank, it gives equal weight to rank agreements at the top of the list and those at the bottom. I don't really care much about low-level discrepancies, as long as the high-level discrepancies contribute to the error metric. It also doesn't handle cases where multiple categories have tied ranks.
Kendall Tau Rank Correlation Coefficient: Has the same basic properties and limitations as Spearman's Rank Correlation, as far as I can tell.
I've been thinking about rolling my own ad-hoc metrics, but I'm no mathematician, so I'd be suspicious of whether my own little metric would provide much rigorous value. If there's some standard methodology for this kind of thing, I'd rather use that.
Any ideas?
Okay, I've decided to implement a weighted RMSE. It doesn't directly account for rank-ordering relationships, but the weighting system automatically emphasizes those entries at the top of the list.
Just for review (for anyone not familiar with RMSE), the equation looks like this, assuming two different classifiers A and B whose results are contained in arrays of the same names:
RMSE Equation http://benjismith.net/images/rmse.png
(In plain text: RMSE = sqrt( (1/N) * Σ (A_i - B_i)^2 ), summing over the N categories.)
In java the implementation looks like this:
double[] A = getAFromSomewhere();
double[] B = getBFromSomewhere();
// Assumes that A and B have the same length. If not, your classifier is broken.
int count = A.length;
double sumSquaredError = 0;
for (int i = 0; i < count; i++) {
    double aElement = A[i];
    double bElement = B[i];
    double error = aElement - bElement;
    double squaredError = error * error;
    sumSquaredError += squaredError;
}
double meanSquaredError = sumSquaredError / count;
double rootMeanSquaredError = Math.sqrt(meanSquaredError);
That's the starting point for my modified implementation. I needed to come up with a weighting system that accounts for the combined magnitude of the two values (from both classifiers). So I'll multiply each squared-error value by SQRT(Ai^2 + Bi^2), which is just the Euclidean magnitude of the pair (the distance of the point (Ai, Bi) from the origin).
Of course, since I use a weighted error in the numerator, I need to also use the sum of all weights in the denominator, so that my results are renormalized back into the (0.0, 1.0) range.
I call the new metric "RMWSE", since it's a Root Mean Weighted Squared Error. Here's what the new equation looks like:
RMWSE Equation http://benjismith.net/images/rmwse.png
And here's what it looks like in java:
double[] A = getAFromSomewhere();
double[] B = getBFromSomewhere();
// Assumes that A and B have the same length. If not, your classifier is broken.
int count = A.length;
double sumWeightedSquaredError = 0;
double sumWeights = 0;
for (int i = 0; i < count; i++) {
    double aElement = A[i];
    double bElement = B[i];
    double error = aElement - bElement;
    double squaredError = error * error;
    double weight = Math.sqrt((aElement * aElement) + (bElement * bElement));
    double weightedSquaredError = weight * squaredError;
    sumWeightedSquaredError += weightedSquaredError;
    sumWeights += weight;
}
double meanWeightedSquaredError = sumWeightedSquaredError / sumWeights;
double rootMeanWeightedSquaredError = Math.sqrt(meanWeightedSquaredError);
To give you an idea of how this weight works in practice, let's say my two classifiers produce 0.95 and 0.85 for some category. The error between these two values is 0.10, but the weight is 1.2748 (which I arrived at using SQRT(0.95^2 + 0.85^2)), so the weighted error is 0.12748. (In the metric itself the weight multiplies the squared error, but the relative emphasis works out the same.)
Likewise, if the classifiers produce 0.45 and 0.35 for some other category, the error is still just 0.10, but the weight is only 0.5701, and the weighted error is therefore only 0.05701.
So any category with high values from both classifiers will be more heavily weighted than categories with a high value from only a single classifier, or categories with low values from both classifiers.
This works best when my classification values are renormalized so that the maximum value in both A and B is 1.0, with all the other values scaled up proportionately. Consequently, the dimensions no longer sum to 1.0 for any given classifier, but that doesn't really matter anyhow, since I wasn't exploiting that property for anything useful.
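Here's a minimal sketch of that renormalization step, in case it's useful to anyone else (the method name is mine):
double[] rescaleToUnitMax(double[] weights) {
    double max = 0;
    for (double w : weights) {
        max = Math.max(max, w);
    }
    if (max == 0) {
        return weights.clone(); // nothing to scale
    }
    double[] scaled = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
        scaled[i] = weights[i] / max;
    }
    return scaled;
}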
Anecdotally, I'm pretty happy with the results this is giving in my dataset, but if anyone has any other ideas for improvement, I'd be totally open to suggestions!
I don't think you need to worry about rigour to this extent. If you want to weight certain types of agreement more than others, that is perfectly legitimate.
For example, only calculate Spearman's for the top k categories. I think you should get perfectly legitimate answers.
You can also do a z-transform etc. to map everything to [0,1] while preserving what you consider to be the "important" pieces of your data set (variance, difference etc.) Then you can take advantage of the large number of hypothesis testing functions available.
(As a side note, you can modify Spearman's to account for ties. See Wikipedia.)
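If it helps, here's a rough Java sketch of that idea: keep only the top k categories (ranked here by the ground-truth weights, which is just one possible convention) and compute Spearman's rho using average ranks for ties. All names are mine; this is a sketch, not a reference implementation.
import java.util.Arrays;

public class TopKSpearman {

    // Average ("fractional") ranks: tied values share the mean of the ranks they occupy.
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[b], values[a])); // descending
        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && values[order[j + 1]] == values[order[i]]) j++;
            double avgRank = (i + j) / 2.0 + 1.0; // 1-based ranks
            for (int k = i; k <= j; k++) ranks[order[k]] = avgRank;
            i = j + 1;
        }
        return ranks;
    }

    // Spearman's rho with ties = Pearson correlation of the rank vectors.
    static double spearman(double[] a, double[] b) {
        double[] ra = averageRanks(a), rb = averageRanks(b);
        double ma = 0, mb = 0;
        for (int i = 0; i < ra.length; i++) { ma += ra[i]; mb += rb[i]; }
        ma /= ra.length;
        mb /= rb.length;
        double num = 0, da = 0, db = 0;
        for (int i = 0; i < ra.length; i++) {
            num += (ra[i] - ma) * (rb[i] - mb);
            da += (ra[i] - ma) * (ra[i] - ma);
            db += (rb[i] - mb) * (rb[i] - mb);
        }
        return num / Math.sqrt(da * db); // NaN if either side is constant
    }

    // Restrict the comparison to the k categories with the largest ground-truth weights.
    static double topKSpearman(double[] algo, double[] truth, int k) {
        Integer[] order = new Integer[truth.length];
        for (int i = 0; i < truth.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(truth[b], truth[a]));
        double[] a = new double[k], t = new double[k];
        for (int i = 0; i < k; i++) { a[i] = algo[order[i]]; t[i] = truth[order[i]]; }
        return spearman(a, t);
    }
}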
I am a data mining beginner and am trying to first formulate an approach to a clustering problem I am solving.
Suppose we have x writers, each with a particular style (use of unique words etc.). They each write multiple short texts, let's say a haiku. We collect many hundreds of these haikus from the authors and try to understand from the haikus, using context analysis, how many authors we had in the first place (we somehow lost records of how many authors there were, after a great war!)
Let's assume I create a hash table of words for each of these haikus. Then I could write a distance function that would look at the repetition of similar words between each vector. This could allow me to implement some sort of k-means clustering function.
My problem now is to measure, probabilistically, the number of clusters, i.e. the number of authors, that would give me the optimum fit.
Something like:
number of authors | probability
1 | 0.05
2 | 0.1
3 | 0.2
4 | 0.4
5 | 0.1
6 | 0.05
7 | 0.03
8 | 0.01
The only constraint here is that as the number of authors (or clusters) goes to infinity, the sum of the probabilities should converge to 1, I think.
Does anyone have any thoughts or suggestions on how to implement this second part?
Let's formulate an approach using Bayesian statistics.
Pick a prior P(K) on the number of authors, K. For example, you might say K ~ Geometric(p) with support {1, 2, ... } where E[K] = 1 / p is the number of authors you expect there to be prior to seeing any writings.
Pick a likelihood function L(D|K) that assigns a likelihood to the writing data D given a fixed number of authors K. For example, you might say L(D|K) is the total amount of error in a k-component GMM found by expectation-maximization. To be really thorough, you could learn L(D|K) from data: the internet is full of haikus with known authors.
Find the value of K that maximizes the posterior probability P(K|D) - your best guess at the number of authors. Note that since P(K|D) = P(D|K)P(K)/P(D), P(D) is constant, and L(D|K) is proportional to P(D|K), you have:
arg max { P(K|D) | K = 1, 2, ... } = arg max { L(D|K)P(K) | K = 1, 2, ... }
With respect to your question, the first column in your table corresponds to K and the second column corresponds to a normalized P(K|D); that is, it is proportional to L(D|K)P(K).
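A rough sketch of step 3 in code (the likelihood is left as a stub to be replaced with whatever L(D|K) you settle on, e.g. one based on a K-component GMM; all names here are mine):
static int mostLikelyAuthorCount(double[][] data, int maxK, double p) {
    int bestK = 1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int k = 1; k <= maxK; k++) {
        // Geometric(p) prior with support {1, 2, ...}: P(K = k) = (1 - p)^(k - 1) * p
        double logPrior = (k - 1) * Math.log(1 - p) + Math.log(p);
        double score = logLikelihood(data, k) + logPrior; // proportional to log P(K|D)
        if (score > bestScore) {
            bestScore = score;
            bestK = k;
        }
    }
    return bestK;
}

// Placeholder for log L(D|K), e.g. the log-likelihood of a K-component mixture
// fitted to the haiku feature vectors by expectation-maximization.
static double logLikelihood(double[][] data, int k) {
    throw new UnsupportedOperationException("plug in your model of choice");
}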
I want to implement the algorithm for "Probability of Error and Average Correlation Coefficient" (more info on page 143). It is an algorithm for selecting useful features from a set of features. As far as I know, this algorithm is not limited to boolean-valued features, but I don't know how I can use it for continuous features.
This is the only example I could find of this algorithm:
Thus, X is the feature to be predicted and C is any other feature. To calculate the Probability of Error value of C, they select the values which mismatch the green pieces. Thus the PoE of C is (1 - 7/9) + (1 - 6/7) = 3/16 = 0.1875.
My question is thus: how can we use a continuous feature, instead of a boolean feature like in this example, to calculate the PoE? Or is it not possible?
The algorithm that you describe is a feature selection algorithm, similar to the forward selection technique. At each step, we find a new feature Fi that minimizes this criterion:
weight_1 * ErrorProbability(Fi) + weight_2 * Acc(Fi)
Acc(Fi) represents the mean correlation between the feature Fi and the features already selected. You want to minimize this in order to keep your features uncorrelated, and thus have a well-conditioned problem.
ErrorProbability(Fi) represents how well the feature describes the variable you want to predict. For example, let's say you want to predict whether tomorrow will be rainy depending on temperature (a continuous feature).
The Bayes error rate (http://en.wikipedia.org/wiki/Bayes_error_rate) is:
P = Sum_Ci { Integral_{x in Hi} P(x|Ci) * P(Ci) dx }
In our example
Ci belongs to {rainy; not rainy}
x ranges over temperature values
Hi represents all the temperatures that would lead to predicting Ci
What is interesting is that you can take any predictor you like.
Now, suppose you have all the temperatures in one vector, and all the rainy/not-rainy states in another vector:
In order to have P(x|Rainy), consider the following values :
temperaturesWhenRainy <- temperatures[which(state=='rainy')]
What you should do next is plot a histogram of these values, then try to fit a distribution to it. You will then have a parametric formula for P(x|Rainy).
If your distribution is Gaussian, you can do it simply:
m <- mean(temperaturesWhenRainy)
s <- sd(temperaturesWhenRainy)
Given some x value, you have the density of probability of P(x|Rainy) :
p <- dnorm(x, mean = m, sd = s)
You can do the same procedure for P(x|Not Rainy). Then P(Rainy) and P(Not Rainy) are easy to compute.
Once you have all that stuff you can use the Bayes error rate formula, which yields your ErrorProbability for a continuous feature.
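If it helps, here is a rough Java transcription of that last step for the two-class, one-feature case, assuming both class-conditional densities have been fitted as Gaussians (as in the R snippets above); all names and the numerical-integration details are mine:
// Gaussian density, i.e. the Java equivalent of dnorm(x, mean, sd).
static double gaussian(double x, double mean, double sd) {
    double z = (x - mean) / sd;
    return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
}

// Bayes error rate for two classes: integrate the smaller of the two joint
// densities P(x|Ci)*P(Ci) over x, since that mass is misclassified by the
// optimal predictor.
static double bayesErrorRate(double mRainy, double sRainy, double pRainy,
                             double mDry, double sDry, double pDry) {
    double lo = Math.min(mRainy - 6 * sRainy, mDry - 6 * sDry);
    double hi = Math.max(mRainy + 6 * sRainy, mDry + 6 * sDry);
    int steps = 10000;
    double dx = (hi - lo) / steps;
    double error = 0;
    for (int i = 0; i < steps; i++) {
        double x = lo + (i + 0.5) * dx;
        double jointRainy = gaussian(x, mRainy, sRainy) * pRainy;
        double jointDry = gaussian(x, mDry, sDry) * pDry;
        error += Math.min(jointRainy, jointDry) * dx;
    }
    return error;
}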
Cheers
This is a question about normalization of data that takes into account different parameters.
I have a set of articles on a website. Users rate the articles from 1 to 5 stars: 1 star marks the article 'bad', 2 stars give an 'average' rating, and 3, 4 and 5 stars rate it 'good', 'very good' and 'excellent'.
I want to normalize these ratings into the range [0 - 2]. The normalized value will represent a score and will be used as a factor for boosting the article up or down in the article listing. Articles with 2 or fewer stars should get a score in the range [0 - 1], so the boost factor will have a negative effect. Articles with 2 or more stars should get a score in the range [1 - 2], so the boost factor will have a positive effect.
So, for example, an article that has 3.6 stars will get a boost factor of 1.4, which will boost the article up in the article listing. An article with 1.9 stars will get a score of 0.8, which will push the article further down in the listing. An article with 2 stars will get a boost factor of 1 - no boost.
Furthermore, I want to take into account the number of votes each article has. An article with a single 3-star vote must rank worse than an article with 4 votes and a 2.8-star average (the boost factors could be 1.2 and 1.3, respectively).
If I understood you correctly, you could use a sigmoid function, such as the logistic function (a common special case). Sigmoid functions are often used in neural networks to shrink (compress or normalize) the input range of data (for example, to the [-1, 1] or [0, 1] range).
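As one concrete possibility (this parameterization is my own, not something standard): a logistic curve that maps the average star rating into [0, 2] and is pinned to exactly 1.0 at 2 stars:
// Boost in (0, 2): exactly 1.0 at 2 stars, approaching 0 and 2 at the extremes.
// 'steepness' controls how aggressively ratings away from 2 stars are rewarded or punished.
static double starsToBoost(double stars, double steepness) {
    return 2.0 / (1.0 + Math.exp(-steepness * (stars - 2.0)));
}
With steepness = 0.53, for instance, 2 stars gives 1.0, 3.6 stars gives roughly 1.40, and 1 star gives roughly 0.74; other anchor points need a different steepness (a single logistic can't hit every target exactly).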
I'm not going to solve your rating system, but a general way of normalising values is this.
Java method:
public static float normalise(float inValue, float min, float max) {
return (inValue - min)/(max - min);
}
C function:
float normalise(float inValue, float min, float max) {
return (inValue - min)/(max - min);
}
This method lets you have negative values for both max and min. For example:
variable = normalise(-21.9, -33.33, 18.7);
Note that you can't let max and min be the same value, or let max be less than min, and inValue should be within the given range.
Write a comment if you need more details.
Based on the numbers, and a few I made up myself, I came up with these 5 points
Rating Boost
1.0 0.5
1.9 0.8
2.0 1.0
3.6 1.4
5.0 2.0
Calculating an approximate linear regression for that, I got the formula y=0.3x+0.34.
So, you could create a conversion function
float ratingToBoost(float rating) {
return 0.3 * rating + 0.34;
}
Using this, you will get output that approximately fits your requirements. Sample data:
Rating Boost
1.0 0.64
2.0 0.94
3.0 1.24
4.0 1.54
5.0 1.84
This obviously has linear growth, which might not be what you're looking for, but with only three values specified, it's hard to know exactly what kind of growth you expect. If you're not satisfied with linear growth, and you want e.g. bad articles to be punished more by a lower boosting, you could always try to come up with some more values and generate an exponential or logarithmic equation.
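If you want to refit a line against your own anchor points (or against more of them), an ordinary least-squares fit is only a few lines; this is a generic sketch, with names of my own choosing:
// Returns { slope, intercept } of the least-squares line through the points (x[i], y[i]).
static double[] fitLine(double[] x, double[] y) {
    int n = x.length;
    double mx = 0, my = 0;
    for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n;
    my /= n;
    double sxy = 0, sxx = 0;
    for (int i = 0; i < n; i++) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
    }
    double slope = sxy / sxx;
    return new double[] { slope, my - slope * mx };
}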
I am trying to determine the volatility of a rank.
More specifically, the rank can be from 1 to 16 over X data points (the number of data points varies with a maximum of 30).
I'd like to be able to measure this volatility and then map it to a percentage somehow.
I'm not a math geek so please don't spit out complex formulas at me :)
I just want to code this in the simplest manner possible.
I think the easiest first pass would be Standard Deviation over X data points.
I think that Standard Deviation is what you're looking for. There are some formulas to deal with, but it's not hard to calculate.
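A minimal sketch of that calculation over the observed ranks (this is the population form, dividing by the number of points):
static double standardDeviation(int[] ranks) {
    double mean = 0;
    for (int r : ranks) mean += r;
    mean /= ranks.length;
    double sumSquaredDiff = 0;
    for (int r : ranks) sumSquaredDiff += (r - mean) * (r - mean);
    return Math.sqrt(sumSquaredDiff / ranks.length);
}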
Given that you have a small sample set (you say a maximum of 30 data points) and that the standard deviation is easily affected by outliers, I would suggest using the interquartile range as a measure of volatility. It is a trivial calculation and would give a meaningful representation of the data spread over your small sample set.
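A sketch of that calculation (quartiles taken here as the medians of the lower and upper halves; other quartile conventions exist and give slightly different numbers):
static double interquartileRange(int[] ranks) {
    int[] sorted = ranks.clone();
    java.util.Arrays.sort(sorted);
    int n = sorted.length;
    double q1 = median(sorted, 0, n / 2);       // lower half
    double q3 = median(sorted, n - n / 2, n);   // upper half (middle element excluded when n is odd)
    return q3 - q1;
}

// Median of sorted[from, to) -- half-open range.
static double median(int[] sorted, int from, int to) {
    int len = to - from;
    int mid = from + len / 2;
    return (len % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
}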
If you want something really simple, you could take the average of the absolute differences between successive ranks as the volatility. This has the added bonus of being easy to update incrementally. Use this for initialisation:
double sum = 0;
for (int i = 1; i < N; i++) {
    sum += Math.abs(ranks[i] - ranks[i - 1]);
}
double volatility = sum / (N - 1); // average of the N - 1 successive differences
Then, for updating the volatility when a new rank at time N+1 becomes available, you introduce a parameter K that determines the speed with which your volatility measurement adapts to changes in volatility. Higher K means slower adaptation, so K can be thought of as a "decay time" or some such:
double K = 14; // higher = slower change in volatility over time
double newVolatility = (oldVolatility * (K - 1) + Math.abs(ranks[N + 1] - ranks[N])) / K;
This is also known as an exponentially weighted moving average (of the absolute differences in ranks, in this case).
I want to rank a set of sellers. Each seller is defined by parameters var1,var2,var3,var4...var20. I want to score each of the sellers.
Currently I am calculating the score by assigning weights to these parameters (say 10% to var1, 20% to var2, and so on), with the weights determined based on my gut feeling.
My score equation looks like
score = w1* var1 +w2* var2+...+w20*var20
score = 0.1*var1+ 0.5 *var2 + .05*var3+........+0.0001*var20
My score equation could also look like
score = w1^2* var1 +w2* var2+...+w20^5*var20
where var1,var2,..var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine, what weights to assign?
I want to optimize these weights to revamp the scoring mechanism using some data oriented approach to achieve a more relevant score.
Example:
I have the following features for sellers:
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches for calculating the score? That is, I linearly added the various features; is there a better approach to building the ranking system?
How to come with the values for the weights?
Apart from the features above, a few more that I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How would these fit into my score equation?
Unfortunately Stack Overflow doesn't have LaTeX, so a plain-text formula will have to do.
Also, as a disclaimer, I don't think this is a concise answer, but your question is quite broad. This has not been tested, but it is the approach I would most likely take given a similar problem.
As a possible direction to go, below is the multivariate Gaussian density:
f(x) = (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp( -(1/2) * (x - Mu)^T * Sigma^(-1) * (x - Mu) )
The idea would be that each parameter gets its own dimension and can therefore be weighted by importance. Example:
Sigma = [1,0,0; 0,2,0; 0,0,3]: for a vector [x1, x2, x3], x1 would have the greatest importance (its variance is the smallest, so deviations in x1 change the density the most).
The covariance matrix Sigma takes care of the scaling in each dimension. To achieve this, simply put your weights on the diagonal elements of an n x n diagonal matrix; you are not really concerned with the cross terms.
Mu is the average over all the logs in your data for your sellers, and is a vector.
x is the mean of every category for a particular seller, as a vector x = {x1, x2, x3, ..., xn}. This is a continuously updated value as more data are collected.
The parameters of the function, estimated from the total dataset, should evolve as well. That way, biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of the function f(x) can be played with to give the desired results. It is a probability density function, but its utility is not restricted to statistics.
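To make that last step a bit more concrete, here is a hedged sketch of how f(x) could be evaluated for one seller with a purely diagonal Sigma (one variance per feature); all names are mine and this is just one way to wire it up:
// x: the seller's per-feature means, mu: the average vector over all sellers,
// sigma2: the diagonal of Sigma (one variance per feature; smaller variance = more important).
static double sellerScore(double[] x, double[] mu, double[] sigma2) {
    int n = x.length;
    double logDet = 0;
    double quad = 0;
    for (int i = 0; i < n; i++) {
        logDet += Math.log(sigma2[i]);
        double d = x[i] - mu[i];
        quad += d * d / sigma2[i]; // (x - Mu)^T * Sigma^(-1) * (x - Mu), diagonal case
    }
    double logDensity = -0.5 * (n * Math.log(2 * Math.PI) + logDet + quad);
    return Math.exp(logDensity);
}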