This is a question about normalizing data while taking different parameters into account.
I have a set of articles on a website. Users rate the articles from 1 to 5 stars: 1 star marks an article 'bad', 2 stars 'average', and 3, 4 and 5 stars 'good', 'very good' and 'excellent'.
I want to normalize these ratings into the range [0 - 2]. The normalized value will represent a score and will be used as a factor for boosting the article up or down in the article listing. Articles with 2 or fewer stars should get a score in the range [0 - 1], so the boost factor will have a negative effect. Articles with a rating of 2 or more stars should get a score in the range [1 - 2], so the boost factor will have a positive effect.
So for example, an article that has 3.6 stars will get a boost factor of 1.4, which will boost the article up in the article listing. An article with 1.9 stars will get a score of 0.8, which will push the article further down in the listing. An article with exactly 2 stars will get a boost factor of 1 - no boost.
Furthermore, I want to take into account the number of votes each article has. An article with a single vote of 3 stars should rank worse than an article with 4 votes and a 2.8-star average (the boost factors could be 1.2 and 1.3 respectively).
If I understood you correctly, you could use a sigmoid function, such as the logistic function. Sigmoid and other logistic-type functions are often used in neural networks to shrink (compress or normalize) the input range of data (for example, to [-1, 1] or [0, 1]).
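For concreteness, here is a minimal sketch (Java, with a hand-picked steepness k as an assumption) of a logistic curve centred on 2 stars, so that 2 stars maps to a boost of 1 and the output stays inside (0, 2):

public class SigmoidBoost {
    // A minimal sketch: map an average star rating onto a boost factor in (0, 2)
    // using a logistic curve centred on the neutral rating of 2 stars.
    // The steepness k is an assumption here; tune it against your own data.
    static double boost(double rating, double k) {
        return 2.0 / (1.0 + Math.exp(-k * (rating - 2.0)));
    }

    public static void main(String[] args) {
        for (double r = 1.0; r <= 5.0; r += 0.5) {
            System.out.printf("%.1f stars -> boost %.2f%n", r, boost(r, 1.0));
        }
    }
}

Larger k makes the curve saturate faster towards 0 and 2; you would tune it, and possibly fold in the vote count, against your own data.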
I'm not going to solve your rating system, but a general way of normalising values is this.
Java method:
public static float normalise(float inValue, float min, float max) {
    // Linearly maps inValue from [min, max] onto [0, 1].
    return (inValue - min) / (max - min);
}
C function:
float normalise(float inValue, float min, float max) {
    return (inValue - min) / (max - min);
}
This method lets you have negative values for both max and min. For example:
variable = normalise(-21.9, -33.33, 18.7);
Note that you can't let max and min be the same value, or let max be less than min, and inValue should be within the given range.
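For the star-rating case in the question, one way (just a sketch, and it will not reproduce the asker's sample numbers exactly) is to apply normalise piecewise so that 2 stars is pinned at 1.0:

public class RatingBoost {
    static float normalise(float inValue, float min, float max) {
        return (inValue - min) / (max - min);
    }

    // Hypothetical piecewise mapping of a 1-5 star average onto [0, 2], with
    // 2 stars pinned at 1.0. It keeps the right shape but will not hit the
    // asker's sample values (1.9 -> 0.8, 3.6 -> 1.4) exactly.
    static float starsToBoost(float rating) {
        if (rating <= 2.0f) {
            return normalise(rating, 1.0f, 2.0f);        // [1, 2] stars -> [0, 1]
        }
        return 1.0f + normalise(rating, 2.0f, 5.0f);     // [2, 5] stars -> [1, 2]
    }

    public static void main(String[] args) {
        System.out.println(starsToBoost(2.0f));  // 1.0
        System.out.println(starsToBoost(3.6f));  // ~1.53
        System.out.println(starsToBoost(1.9f));  // 0.9
    }
}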
Write a comment if you need more details.
Based on the numbers, and a few I made up myself, I came up with these 5 points:
Rating Boost
1.0 0.5
1.9 0.8
2.0 1.0
3.6 1.4
5.0 2.0
Calculating an approximate linear regression for that, I got the formula y=0.3x+0.34.
So, you could create a conversion function:
float ratingToBoost(float rating) {
    return 0.3 * rating + 0.34;
}
Using this, you will get output that approximately fits your requirements. Sample data:
Rating Boost
1.0 0.64
2.0 0.94
3.0 1.24
4.0 1.54
5.0 1.84
This obviously has linear growth, which might not be what you're looking for, but with only three values specified it's hard to know exactly what kind of growth you expect. If you're not satisfied with linear growth and want, for example, bad articles to be punished harder with a lower boost, you could always come up with some more values and generate an exponential or logarithmic equation instead.
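For example, here is one hypothetical exponential form (a sketch only), pinned at 2 stars -> 1.0 and 5 stars -> 2.0; it gives roughly 1.45 for 3.6 stars but only about 0.98 for 1.9 stars, so you would still want to tune it against more sample points:

public class ExponentialBoost {
    // Hypothetical exponential boost, pinned at 2 stars -> 1.0 and 5 stars -> 2.0:
    // boost(r) = 2^((r - 2) / 3). Adjust the base or the divisor if low ratings
    // should be punished more aggressively.
    static double boost(double rating) {
        return Math.pow(2.0, (rating - 2.0) / 3.0);
    }

    public static void main(String[] args) {
        System.out.println(boost(2.0));  // 1.0
        System.out.println(boost(3.6));  // ~1.45
        System.out.println(boost(5.0));  // 2.0
        System.out.println(boost(1.0));  // ~0.79
    }
}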
Main question: Suppose you have a discrete, finite data set $d$. Then the command summary(d) returns the Min, 1st quartile, Median, mean, 3rd quartile, and max. My question is: what formula does R use to compute the 1st quartile?
Background: My data set was d = c(1, 2, 3, 3, 4, 9). summary(d) returns 2.25 as the first quartile. Now, one way to compute the first quartile is to choose a value q1 such that 25% of the data set is less than or equal to q1. Clearly, this is not what R is using. So I was wondering: what formula does R use to compute the first quartile?
Google searches on this topic have left me even more puzzled, and I couldn't find the formula R uses. Typing help(summary) in R wasn't helpful to me either.
General discussion:
There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.
As a result, the wide variety of packages between them use many different definitions.
The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions for the quantile function, and mentions which (of a number of common) packages use which definitions. Its Introduction says:
the sample quantiles that are used in statistical packages are all based on one or two order statistics, and can be written as
$\hat{Q}_i(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j+1)}$, where $\frac{j-m}{n} \leq p < \frac{j-m+1}{n}$ (1)
for some $m \in \mathbb{R}$ and $0 \leq \gamma \leq 1$.
Which is to say, in general, the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that there's only weight on one of them).
In R:
In particular, R offers all nine definitions mentioned in Hyndman & Fan (with type 7 as the default). From Hyndman & Fan we see:
Definition 7. Gumbel (1939) also considered the modal position $p_k = \text{mode}\,F(X_{(k)}) = (k-1)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p)$ and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.
What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd observation of 9.
We can see that in R:
> quantile(1:9)
  0%  25%  50%  75% 100% 
   1    3    5    7    9 
For its behavior when n is not of the form 4k+1, the easiest thing to do is try it:
> quantile(1:10)
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
> quantile(1:11)
0% 25% 50% 75% 100%
1.0 3.5 6.0 8.5 11.0
> quantile(1:12)
0% 25% 50% 75% 100%
1.00 3.75 6.50 9.25 12.00
When k isn't integer, it's taking a weighted average of the adjacent order statistics, in proportion to the fraction it lies between them (that is, it does linear interpolation).
The nice thing is that on average you get 3 times as many observations above the first quartile as you get below. So for 9 observations, for example, you get 6 above and 2 below the third observation, which divides them into the ratio 3:1.
What's happening with your sample data:
You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, it takes 25% of the way between the second and third observation (which coincidentally are themselves 2 and 3), so the lower quartile is 2+0.25*(3-2) = 2.25.
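If it helps to see the type-7 rule outside of R, here is a minimal sketch of the same interpolation (Java used purely for illustration; R's quantile() is the authoritative implementation):

import java.util.Arrays;

public class Type7Quantile {
    // Type-7 sample quantile: h = (n - 1) * p is the 0-based version of
    // k = 1 + (n - 1) * p from above; we then linearly interpolate between
    // the two adjacent order statistics.
    static double quantile7(double[] data, double p) {
        double[] x = data.clone();
        Arrays.sort(x);
        double h = (x.length - 1) * p;
        int j = (int) Math.floor(h);
        double gamma = h - j;
        if (j + 1 >= x.length) return x[x.length - 1];
        return (1 - gamma) * x[j] + gamma * x[j + 1];
    }

    public static void main(String[] args) {
        double[] d = {1, 2, 3, 3, 4, 9};
        System.out.println(quantile7(d, 0.25));  // prints 2.25, matching summary(d) in R
    }
}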
Under the hood: some R details:
When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (i.e. the relevant summary for the class you called it on). Its existence is mentioned in the help on summary.
The summary.data.frame function (ultimately -- via summary.default applied to each column) calls quantile to compute quartiles (you won't see this in the help, unfortunately, since ?summary.data.frame simply takes you to the summary help and that doesn't give you any details on what happens when summary is applied to a numeric vector -- this is one of those really bad spots in the help).
So ?quantile (or help(quantile)) describes what R does.
Here are two things it says (based directly off Hyndman & Fan). First, it gives general information:
All sample quantiles are defined as weighted averages of consecutive order
statistics. Sample quantiles of type i are defined by:
Q[i](p) = (1 - γ) x[j] + γ x[j+1],
where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic,
n is the sample size, the value of γ is a function of j = floor(np + m) and
g = np + m - j, and m is a constant determined by the sample quantile type.
Second, there's specific information about method 7:
Type 7
m = 1 - p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on quantile pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.
Reference:
[1]: Rob J. Hyndman and Yanan Fan (1996),
"Sample Quantiles in Statistical Packages,"
The American Statistician, Vol. 50, No. 4 (Nov.), pp. 361-365.
Also see the discussion here.
Having read "How Not To Sort By Average Rating", I thought I should give it a try.
-- Lower bound of the Wilson score interval at 95% confidence.
-- Constants: 1.9208 = z^2/2, 3.8416 = z^2, 0.9604 = z^2/4, for z = 1.96.
CREATE FUNCTION `mydb`.`LowerBoundWilson95` (pos FLOAT, neg FLOAT)
RETURNS FLOAT DETERMINISTIC
RETURN
    IF(
        pos + neg <= 0,
        0,
        (
            (pos + 1.9208) / (pos + neg)
            -
            1.96 * SQRT(
                (pos * neg) / (pos + neg) + 0.9604
            ) / (pos + neg)
        )
        /
        (
            1 + 3.8416 / (pos + neg)
        )
    );
Running some tests, I discover that objects with pos=0 and neg>0 have very small, but non-negative scores, whereas an object with pos=neg=0 has a score of zero, ranking lower.
I am of the opinion that an unrated object should be listed above one which has no positive ratings but some negatives.
I reasoned that "the individual ratings are all really expressions of deviation from some baseline, so I'll move the baseline and give every object a 'neutral' initial score," so I came up with this:
CREATE FUNCTION `mydb`.`AdjustedRating` (pos FLOAT, neg FLOAT)
RETURNS FLOAT DETERMINISTIC
RETURN
(
SELECT `mydb`.`LowerBoundWilson95` (pos+4, neg+4)
);
Here are some sample outputs for AdjustedRating (rows are pos, columns are neg):

         neg=0   neg=1   neg=2
pos=0    0.215   0.188   0.168
pos=1    0.266   0.235   0.212
pos=2    0.312   0.280   0.235
This is closer to the sort of scores I want, and as a numerical hack I guess it's workable, but I can't mathematically justify it.
Is there a better way, a "right" way?
The problem arises because this approximation (lower confidence bound) is really meant for identifying the highest rated items of a list. If you were interested in the lowest ranked, you could take the upper confidence bound instead.
Alternatively, you can use Bayesian statistics, which is the formalization of exactly the second method you describe. Evan Miller actually had a follow-up post to this in which he said:
The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we'd like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.
Bayesian statistics lets us formalize this intuition...
Using the Bayesian ranking approach, any point that has zero data would fall back to the prior mean (what you refer to as the initial score) and then move away from it as it collects data. This is also the approach used at IMDB to compute their top Movies lists.
https://math.stackexchange.com/questions/169032/understanding-the-imdb-weighted-rating-function-for-usage-on-my-own-website
The specific method you suggest of crediting each object 4 upvotes and 4 downvotes is equivalent to putting a mean of 0.5 with a weight of 8 votes. Given an absence of any other data, this is a reasonable start. Laplace famously argued in the sunrise problem that events should be credited with 1 success and 1 failure. In the item ranking problem, we have a lot more knowledge, so it makes sense to set the prior mean equal to the average ranking. The weight of this prior mean (or how fast you move off it as a function of data, also called the prior variance) can be challenging to set.
For IMDB's ranking of the Top 250 Movies, they use a mean movie ranking of 7.1 with a weight of 25000 votes, which is equivalent to treating all movies as if they started with 25000 "free" votes with a rating of 7.1.
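As a sketch of that weighted-mean formulation (the prior mean C and the prior weight m are parameters you choose; the class and method names here are just for illustration):

public class BayesianRating {
    // IMDB-style Bayesian / weighted-mean rating: items with few votes are pulled
    // toward the prior mean C, and move toward their own average R as votes accumulate.
    //   R = average rating of this item
    //   v = number of votes for this item
    //   C = prior mean (e.g. the site-wide average rating)
    //   m = prior weight, expressed as a number of "free" votes
    static double weightedRating(double R, double v, double C, double m) {
        return (v / (v + m)) * R + (m / (v + m)) * C;
    }

    public static void main(String[] args) {
        // The asker's "+4 up / +4 down" scheme corresponds to C = 0.5 and m = 8 on a
        // 0..1 scale; IMDB's Top 250 uses C = 7.1 and m = 25000 on a 1..10 scale.
        System.out.println(weightedRating(0.0, 0, 0.5, 8));   // no data -> 0.5 (the prior)
        System.out.println(weightedRating(1.0, 4, 0.5, 8));   // 4 perfect votes -> ~0.67
    }
}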
I am generating a user-user similarity matrix from user-rating data (specifically the MovieLens 100K data). Computing the correlation leads to some NaN values. I have tested this on a smaller dataset:
User-Item rating matrix
I1 I2 I3 I4
U1 4 0 5 5
U2 4 2 1 0
U3 3 0 2 4
U4 4 4 0 0
User-User Pearson Correlation similarity matrix
U1 U2 U3 U4 U5
U1 1 -1 0 -nan 0.755929
U2 -1 1 1 -nan -0.327327
U3 0 1 1 -nan 0.654654
U4 -nan -nan -nan -nan -nan
U5 0.755929 -0.327327 0.654654 -nan 1
For computing the Pearson correlation, only co-rated items are considered between two users. (See "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", Gediminas Adomavicius, Alexander Tuzhilin.)
How can I handle the NaN values?
EDIT
Here is the code with which I compute the Pearson correlation in R. The matrix R is the user-item rating matrix, containing ratings on a 1 to 5 scale, where 0 means not rated. S is the user-user correlation matrix.
for (i in 1:nrow(R))
{
  cat("user: ", i, "\n");
  for (k in 1:nrow(R))
  {
    if (i != k)
    {
      # items rated (non-zero) by both users
      corated_list <- which(((R[i,] != 0) & (R[k,] != 0)) == TRUE);
      ui <- (R[i,corated_list] - mean(R[i,corated_list]));
      uk <- (R[k,corated_list] - mean(R[k,corated_list]));
      temp <- sum(ui * uk) / sqrt(sum(ui^2) * sum(uk^2));
      S[i,k] <- ifelse(is.nan(temp), 0, temp)
    }
    else
    {
      S[i,k] <- 0;
    }
  }
}
Note that in the S[i,k] <- ifelse (is.nan (temp), 0, temp) line I am replacing the NaNs with 0.
I recently developed a recommender system in Java for user-user and user-item matrices. Firstly, as you have probably already found, recommender systems are difficult. For my implementation I used the Apache Commons Math library, which is fantastic; you are using R, which is probably relatively similar in how it calculates Pearson's.
Your question was: how can I handle NaN values? It was followed by an edit saying you are treating NaN as 0.
My answer is this:
You shouldn't really handle NaN values as 0, because what you are saying is that there is absolutely no correlation between users or users/items. This might be the case, but it is likely not always the case. Ignoring this will skew your recommendations.
Firstly you should be asking yourself, "why am I getting NaN values?" Here are some reasons from the Wikipedia page on NaN detailing why you might get a NaN value:
There are three kinds of operations that can return NaN:
Operations with a NaN as at least one operand.
Indeterminate forms
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
The standard has alternative functions for powers:
The standard pow function and the integer exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
Real operations with complex results, for example:
The square root of a negative number.
The logarithm of a negative number.
The inverse sine or cosine of a number that is less than −1 or greater than +1.
You should debug your application and step through each step to see which of the above reasons is the offending cause.
Secondly, understand that Pearson's correlation can be computed in a number of different ways; you need to consider whether you are calculating it across a sample or a population, and then find the appropriate method of calculating it, e.g. for a sample:
cor(X, Y) = Σ[(x_i - E(X))(y_i - E(Y))] / [(n - 1) s(X) s(Y)]
where E(X) is the mean of the X values,
E(Y) is the mean of the Y values,
s(X), s(Y) are the sample standard deviations,
the standard deviation is the positive square root of the variance,
variance = sum((x_i - mean)^2) / (n - 1),
mean is the arithmetic mean, and
n is the number of sample observations.
This is probably where your NaNs are appearing, i.e. dividing by 0 for not rated. If you can, I would suggest not using the value 0 to mean "not rated"; use null instead. I would do this for 2 reasons:
1. The 0 is probably what is cocking up your results with NaNs, and
2. Readability / understandability. Your scale is 1 - 5, so 0 should not feature; it confuses things. So avoid it if possible.
Thirdly, from a recommender standpoint, think about things from a recommendation point of view. If you have 2 users and they only have 1 rating in common, say U1 and U4 for I1 in your smaller dataset, is that 1 item in common really enough to offer recommendations on? The answer is, of course, no. So can I also suggest you set a minimum threshold of ratingsInCommon to ensure that the quality of recommendation is better. The minimum you can set for this threshold is 2, but consider setting it a bit higher. If you read the MovieLens research, you'll see they set it to somewhere between 5 and 10 (can't remember off the top of my head). The higher you set this, the less coverage you will get, but you will achieve "better" (lower error scores) recommendations. If you've done your reading of the academic literature then you will probably have picked up on this point, but I thought I would mention it anyway.
On the above point: look at U4 and compare with every other user. Notice how U4 does not have more than 1 item in common with any user. Now hopefully you will notice that the NaNs appear exclusively with U4. If you have followed this answer, then you will hopefully now see that the reason you are getting NaNs is that you can't actually compute Pearson's with just 1 item in common: with a single co-rated item, each centred rating is 0, so the formula ends up dividing 0 by 0.
Finally, one thing that slightly bothers me about the sample dataset above is the number of correlations that are 1s and -1s. Think about what that is actually saying about these users' preferences, then sense-check them against the actual ratings. E.g. look at the U1 and U2 ratings: for Item 1 they are in strong agreement (both rated it a 4), then for Item 3 they are in strong disagreement (U1 rated it 5, U2 rated it 1). It seems strange that the Pearson correlation between these two users is -1 (i.e. their preferences are completely opposite). That is clearly not the case; really the Pearson score should be a bit above or a bit below 0. This issue links back to the points about using 0 on the scale and also comparing only a small number of items together.
Now, there are strategies for "filling in" items that users have not rated. I am not going to go into them; you can read up on that, but essentially it is something like using the average score for that item or the average rating for that user. Both methods have their downsides, and personally I don't really like either of them. My advice is to only calculate Pearson correlations between users when they have 5 or more items in common, and to ignore items with 0 (or better, null) ratings.
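To make that concrete, here is a minimal Java sketch (not your R code; the names are made up) that computes Pearson's only over co-rated items, treats 0 as "not rated", and returns "no score" instead of NaN or a fake 0 when two users have fewer than minCommon items in common:

import java.util.OptionalDouble;

public class UserSimilarity {
    // Pearson correlation over co-rated items only; empty() means "not enough data".
    static OptionalDouble pearson(double[] u, double[] v, int minCommon) {
        int n = 0;
        double sumU = 0, sumV = 0;
        for (int i = 0; i < u.length; i++) {
            if (u[i] != 0 && v[i] != 0) {      // 0 (better: null/NA) means not rated
                n++;
                sumU += u[i];
                sumV += v[i];
            }
        }
        if (n < minCommon) return OptionalDouble.empty();     // too little overlap to trust
        double meanU = sumU / n, meanV = sumV / n;
        double num = 0, du = 0, dv = 0;
        for (int i = 0; i < u.length; i++) {
            if (u[i] != 0 && v[i] != 0) {
                num += (u[i] - meanU) * (v[i] - meanV);
                du  += (u[i] - meanU) * (u[i] - meanU);
                dv  += (v[i] - meanV) * (v[i] - meanV);
            }
        }
        if (du == 0 || dv == 0) return OptionalDouble.empty(); // constant ratings: undefined
        return OptionalDouble.of(num / Math.sqrt(du * dv));
    }
}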
So to conclude.
NaN does not equal 0 so do not set it to 0.
0s in your scale are better represented as null.
You should only calculate Pearson Correlations when the number of items in common between two users is >1, preferably greater than 5/10.
Only calculate the Pearson Correlation for two users where they have commonly rated items, do not include items in the score that have not been rated by the other user.
Hope that helps and good luck.
I'm trying to calculate the AUC for a large-ish data set and I'm having trouble finding an implementation that both handles values that aren't just 0s or 1s and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0's and 1's, and the pROC package will give me an answer but could take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1, but they are not necessarily exactly 0 or 1.
EDIT: both the answers and predictions fall between 0 - 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
"So far I've tried the ROCR package, but it only handles 0's and 1's"
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR, have a look at its example data set ROCR.simple.
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you think of an optimized algorithm (as ROCR developers did), it'll probably take long, too. In that case you'll also have to think what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that grouping the reference (actual) values together loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values were really actual = 5/10 and prediction = 8/10.
By summarizing the 10 tests into two numbers, you lose the information of which individual cases overlap. Without this, actual = 5/10 and prediction = 8/10 is consistent with anything between 30 % and 70 % correct recognition (depending on whether the 8 predicted positives contain all 5 of the actual positives, or only 3 of them)!
There is an illustration discussing the sensitivity (i.e. correct recognition, e.g. of click-through) in the material linked below.
You can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of mean absolute, mean squared, root mean squared etc. errors can be used as well.
However, all those different ways to express the same performance characteristic of the model (e.g. sensitivity = % correct recognition of actual click-through events) do have different meanings, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they will react differently to ambiguous reference / partial reference class membership.
Note also, as you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance? (e.g. Mean Absolute Error, Root Mean Square Error)?
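Both are only a few lines in any language; here is a minimal Java sketch, assuming the actual and predicted vectors have the same length (the values in main are from Ex.3 in the question):

public class ErrorMeasures {
    // Mean absolute error between actual and predicted values in [0, 1].
    static double mae(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) sum += Math.abs(actual[i] - predicted[i]);
        return sum / actual.length;
    }

    // Root mean squared error between actual and predicted values in [0, 1].
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double e = actual[i] - predicted[i];
            sum += e * e;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual    = {0.2, 0.6, 0.98, 0.05, 0.72};
        double[] predicted = {0.25, 0.1, 0.9, 0.01, 0.88};
        System.out.println("MAE  = " + mae(actual, predicted));
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}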
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html
I'm trying to measure the agreement between two different systems of classification (one of them based on machine-learning algorithms, and the other based on human ground-truthing), and I'm looking for input from someone who's implemented a similar sort of system.
The classification schema allows each item to be classified into multiple different nodes in a category taxonomy, where each classification carries a weight coefficient. For example, if some item can be classified into four different taxonomy nodes, the result might look like this for the algorithmic and ground-truth classifiers:
ALGO TRUTH
CATEGORY A: 0.35 0.50
CATEGORY B: 0.30 0.30
CATEGORY C: 0.25 0.15
CATEGORY D: 0.10 0.05
The weights will always add up to exactly 1.0, for all selected category nodes (of which there are about 200 in the classification taxonomy).
In the example above, it's important to note that both lists agree about the rank ordering (ABCD), so they should be scored as being in strong agreement with one another (even though there are some differences in the weights assigned to each category). By contrast, in the next example, the two classifications are in complete disagreement with respect to rank order:
ALGO TRUTH
CATEGORY A: 0.40 0.10
CATEGORY B: 0.35 0.15
CATEGORY C: 0.15 0.35
CATEGORY D: 0.10 0.40
So a result like this should get a very low score.
One final example demonstrates a common case where the human-generated ground-truth contains duplicate weight values:
ALGO TRUTH
CATEGORY A: 0.40 0.50
CATEGORY B: 0.35 0.50
CATEGORY C: 0.15 0.00
CATEGORY D: 0.10 0.00
So it's important that the algorithm allows lists without a perfect rank ordering (since the ground truth could be validly interpreted as ABCD, ABDC, BACD, or BADC).
Stuff I've tried so far:
Root Mean Squared Error (RMSE): Very problematic. It doesn't account for rank-order agreement, which means that gross disagreements between categories at the top of the list are swept under the rug by agreement about categories at the bottom of the list.
Spearman's Rank Correlation: Although it accounts for differences in rank, it gives equal weight to rank agreements at the top of the list and those at the bottom of the list. I don't really care much about low-level discrepancies, as long as the high-level discrepancies contribute to the error metric. It also doesn't handle cases where multiple categories can have tie-value ranks.
Kendall Tau Rank Correlation Coefficient: Has the same basic properties and limitations as Spearman's Rank Correlation, as far as I can tell.
I've been thinking about rolling my own ad-hoc metrics, but I'm no mathematician, so I'd be suspicious of whether my own little metric would provide much rigorous value. If there's some standard methodology for this kind of thing, I'd rather use that.
Any ideas?
Okay, I've decided to implement a weighted RMSE. It doesn't directly account for rank-ordering relationships, but the weighting system automatically emphasizes those entries at the top of the list.
Just for review (for anyone not familiar with RMSE), the equation looks like this, assuming two different classifiers A and B, whose results are contained in an array of the same name:
RMSE = sqrt( (1/count) * Σ (A[i] - B[i])^2 )
(original equation image: http://benjismith.net/images/rmse.png)
In Java the implementation looks like this:

double[] A = getAFromSomewhere();
double[] B = getBFromSomewhere();

// Assumes that A and B have the same length. If not, your classifier is broken.
int count = A.length;

double sumSquaredError = 0;
for (int i = 0; i < count; i++) {
    double aElement = A[i];
    double bElement = B[i];
    double error = aElement - bElement;
    double squaredError = error * error;
    sumSquaredError += squaredError;
}

double meanSquaredError = sumSquaredError / count;
double rootMeanSquaredError = Math.sqrt(meanSquaredError);
That's the starting point for my modified implementation. I needed to come up with a weighting system that accounts for the combined magnitude of the two values (from both classifiers). So I'll multiply each squared-error value by SQRT(Ai^2 + Bi^2), which is a plain-old Euclidean distance function.
Of course, since I use a weighted error in the numerator, I need to also use the sum of all weights in the denominator, so that my results are renormalized back into the (0.0, 1.0) range.
I call the new metric "RMWSE", since it's a Root Mean Weighted Squared Error. Here's what the new equation looks like:
RMWSE = sqrt( Σ ( w[i] * (A[i] - B[i])^2 ) / Σ w[i] ), where w[i] = sqrt(A[i]^2 + B[i]^2)
(original equation image: http://benjismith.net/images/rmwse.png)
And here's what it looks like in Java:

double[] A = getAFromSomewhere();
double[] B = getBFromSomewhere();

// Assumes that A and B have the same length. If not, your classifier is broken.
int count = A.length;

double sumWeightedSquaredError = 0;
double sumWeights = 0;
for (int i = 0; i < count; i++) {
    double aElement = A[i];
    double bElement = B[i];
    double error = aElement - bElement;
    double squaredError = error * error;
    double weight = Math.sqrt((aElement * aElement) + (bElement * bElement));
    double weightedSquaredError = weight * squaredError;
    sumWeightedSquaredError += weightedSquaredError;
    sumWeights += weight;
}

double meanWeightedSquaredError = sumWeightedSquaredError / sumWeights;
double rootMeanWeightedSquaredError = Math.sqrt(meanWeightedSquaredError);
To give you an idea for how this weight works in practice, let's say my two classifiers produce 0.95 and 0.85 values for some category. The error between these two values is 0.10, but the weight is 1.2748 (which I arrived at using SQRT(0.95^2 + 0.85^2)). The weighted error is 0.12748.
Likewise, if the classifiers produce 0.45 and 0.35 for some other category, the error is still just 0.10, but the weight is only 0.5701, and the weighted error is therefore only 0.05701.
So any category with high values from both classifiers will be more heavily weighted than categories with a high value from only a single classifier, or categories with low values from both classifiers.
This works best when my classification values are renormalized so that the maximum values in both A and B are 1.0, and all the other values are scaled up proportionately. Consequently, the dimensions no longer sum to 1.0 for any given classifier, but that doesn't really matter anyhow, since I wasn't exploiting that property for anything useful.
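That renormalization is just max-scaling; here's a small hypothetical helper in the same style as the snippets above (the method name is made up):

// Hypothetical helper: scale a classifier's weights so its maximum becomes 1.0,
// keeping all other values in proportion.
static double[] normalizeToUnitMax(double[] values) {
    double max = 0;
    for (double v : values) max = Math.max(max, v);
    if (max == 0) return values.clone();   // all-zero input: nothing to scale
    double[] scaled = new double[values.length];
    for (int i = 0; i < values.length; i++) {
        scaled[i] = values[i] / max;
    }
    return scaled;
}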
Anecdotally, I'm pretty happy with the results this is giving in my dataset, but if anyone has any other ideas for improvement, I'd be totally open to suggestions!
I don't think you need to worry about rigour to this extent. If you want to weight certain types of agreement more than others, that is perfectly legitimate.
For example, only calculate Spearman's for the top k categories. I think you should get perfectly legitimate answers.
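As a rough sketch of that idea in Java (class and method names are made up): compute average ranks so that ties are handled, take the Pearson correlation of the ranks (which is Spearman's rho), and call it only on the k categories with the largest ground-truth weights.

import java.util.Arrays;

public class TopKSpearman {
    // Average ("fractional") ranks: rank 1 = largest value; tied values share
    // the mean of the positions they occupy.
    static double[] ranks(double[] v) {
        Integer[] idx = new Integer[v.length];
        for (int i = 0; i < v.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(v[b], v[a]));
        double[] r = new double[v.length];
        int i = 0;
        while (i < v.length) {
            int j = i;
            while (j + 1 < v.length && v[idx[j + 1]] == v[idx[i]]) j++;
            double avgRank = (i + j) / 2.0 + 1;
            for (int t = i; t <= j; t++) r[idx[t]] = avgRank;
            i = j + 1;
        }
        return r;
    }

    // Spearman's rho = Pearson correlation of the rank vectors (ties handled above).
    static double spearman(double[] x, double[] y) {
        double[] rx = ranks(x), ry = ranks(y);
        double mx = Arrays.stream(rx).average().orElse(0);
        double my = Arrays.stream(ry).average().orElse(0);
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < x.length; i++) {
            num += (rx[i] - mx) * (ry[i] - my);
            dx  += (rx[i] - mx) * (rx[i] - mx);
            dy  += (ry[i] - my) * (ry[i] - my);
        }
        if (dx == 0 || dy == 0) return Double.NaN;   // undefined if one rank vector is constant
        return num / Math.sqrt(dx * dy);
    }
}

The caller would pick the top k categories by TRUTH weight and pass the matching ALGO and TRUTH sub-vectors; everything below the cut-off is simply ignored, which matches not caring about low-level discrepancies.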
You can also do a z-transform etc. to map everything to [0,1] while preserving what you consider to be the "important" pieces of your data set (variance, difference etc.) Then you can take advantage of the large number of hypothesis testing functions available.
(As a side note, you can modify Spearman's to account for ties. See Wikipedia.)