Why Does LogLikelihoodSimilarity function return values greater than 1.0 for a dataset of 0s and 1s?

I have a large dataset of preferences that are expressed as 1.0, and I am using the Tanimoto Similarity functions and the Generic Boolean User and Item Preference Recommenders. Recommendations are generally values between 0 and 1.0.
Many sources, such as the Mahout in Action book and this prior SO thread, recommend the LogLikelihoodSimilarity metric over Tanimoto for boolean datasets. When I switched to LogLikelihoodSimilarity, it generated some scores in a much higher range, such as 11, and I had to go back to Tanimoto to get more sensible scores. Can you suggest any potential fixes, or am I misunderstanding the return values of the recommended item scores?

In the case of boolean data, with no ratings, the value you observe is not a predicted rating; after all, the preferences are all 1.0 and so couldn't be used for ranking. The result is actually a sum of similarities, which is why it can be arbitrarily large. It is not supposed to lie in [0,1] or anything like that.
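For intuition, here is a minimal Python sketch (not Mahout's actual code; user_items and similarity are hypothetical stand-ins for the data model and similarity metric) of how a boolean user-based recommender can score an unseen item by summing neighbour similarities:
def estimate_boolean_preference(user, item, neighbours, user_items, similarity):
    # With boolean (all 1.0) preferences there is no rating to average,
    # so the score for an unseen item is just the sum of the similarities
    # of the neighbours who have that item. More and closer neighbours
    # mean a larger value; the sum is not bounded by 1.0.
    score = 0.0
    for other in neighbours:
        if item in user_items[other]:
            score += similarity(user, other)
    return score

# Toy usage: two of three neighbours have item "A", each with similarity 0.8.
user_items = {"u1": {"A", "B"}, "u2": {"B"}, "u3": {"A", "C"}}
sim = lambda a, b: 0.8
print(estimate_boolean_preference("u0", "A", ["u1", "u2", "u3"], user_items, sim))  # 1.6
A dozen close neighbours with similarities around 0.9 who all share the item would already push such a sum past 10, which is in line with the scores you are seeing; the values are only meaningful for ranking, not as predicted ratings.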

Related

Setting the "tpow" and "expcost" arguments in TraMineR::seqdist

I'm actually working on the pathways of inpatients during their hospital stay. These pathways are represented as states sequences (the current medical unit at each time unit) and I'm trying to find typical pathways through clustering algorithms.
I create the distance matrix by using the seqdist function from the R package TraMineR, with the method "OMspell". I've already read the R documentation and the related articles, but I can't find how to set the arguments tpow and expcost.
As the time unit is an hour, I don't want small differences in duration to have a big impact on the clustering result (unlike a transfer between medical units, for example). But I don't want duration to have no impact at all either...
Also, is there a proper way to choose their values? Or do I just keep groping around for a good configuration? (I'm using the Dunn, Davies-Bouldin and Silhouette criteria to compare the results of hierarchical clustering, besides the medical opinion on the resulting clusters.)
The parameter tpow is an exponential coefficient applied to transform the actual spell lengths (durations). The default value is 1, for which the spell lengths are taken as they are. With tpow=0 you would simply ignore spell durations, and with tpow=0.5 you would consider the square root of the spell lengths.
The expcost parameter is the expansion cost, i.e. the cost of expanding a (transformed) spell length by one unit. In other words, when in the editing of one sequence into the other a spell of length t1 has to be expanded to length t2, it costs expcost * |t2^tpow - t1^tpow|. With expcost=0, spells in the same state (e.g. AA and AAAAA) would be equivalent whatever their lengths.
With tpow=.5, for example, increasing the spell length from 1 to 2 costs more than increasing it from 3 to 4. If you do not want to give too much importance to small differences in spell lengths, use a low expcost. Note, however, that expcost applies to the transformed spell lengths, so you may want to adjust it when you change the tpow value.
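To make the interaction concrete, here is a small Python sketch that just evaluates the cost expression above for a few spell-length pairs (it is not TraMineR code, and the expcost value used here is only illustrative):
def expansion_cost(t1, t2, tpow=1.0, expcost=0.5):
    # Cost of expanding a spell of length t1 into one of length t2,
    # following expcost * |t2^tpow - t1^tpow|.
    return expcost * abs(t2 ** tpow - t1 ** tpow)

# With tpow = 0.5, a one-hour difference matters less for longer spells:
print(expansion_cost(1, 2, tpow=0.5))  # ~0.21
print(expansion_cost(3, 4, tpow=0.5))  # ~0.13
# With tpow = 1, both one-hour expansions cost the same:
print(expansion_cost(1, 2, tpow=1.0))  # 0.5
print(expansion_cost(3, 4, tpow=1.0))  # 0.5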

Precision at k when fewer than k documents are retrieved

In information retrieval evaluation, what would precision@k be if fewer than k documents are retrieved? Let's say only 5 documents were retrieved, of which 3 are relevant. Would the precision@10 be 3/10 or 3/5?
It can be hard to find text defining edge cases of measures like this, and the mathematical formulations often don't deal with the incompleteness of data. For issues like this, I tend to turn to the decision made by trec_eval, a tool distributed by NIST that has implementations of all the common retrieval measures, especially those used in the Text REtrieval Conference (TREC) challenges.
Per the metric description in m_P.c of trec_eval 9.0 (listed as the latest version at the time of writing):
Precision measured at various doc level cutoffs in the ranking.
If the cutoff is larger than the number of docs retrieved, then
it is assumed nonrelevant docs fill in the rest. Eg, if a method
retrieves 15 docs of which 4 are relevant, then P20 is 0.2 (4/20).
Precision is a very nice user oriented measure, and a good comparison
number for a single topic, but it does not average well. For example,
P20 has very different expected characteristics if there 300
total relevant docs for a topic as opposed to 10.
This means you should always divide by k even if fewer than k documents were retrieved, so the precision would be 0.3 instead of 0.6 in your particular case (the system is punished for retrieving fewer than k).
The other tricky case is when there are fewer than k relevant documents. This is why they note that precision is a helpful measure but does not average well.
Some measures that are more robust to these issues are Normalized Discounted Cumulative Gain (NDCG), which compares the ranking to an ideal ranking (at a cutoff), and the simpler R-Precision, which calculates precision at R, the number of relevant documents, rather than at a fixed k. So one query may effectively compute P@15 (R=15), while another computes P@200 (R=200).
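A minimal Python sketch of the trec_eval convention (pad with non-relevant documents up to the cutoff), with R-Precision alongside for comparison; retrieved is the ranked list of doc ids and relevant is the set of relevant ones:
def precision_at_k(retrieved, relevant, k):
    # trec_eval convention: always divide by k, as if the missing ranks
    # were filled with non-relevant documents.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def r_precision(retrieved, relevant):
    # Precision at R, where R is the total number of relevant documents.
    r = len(relevant)
    hits = sum(1 for doc in retrieved[:r] if doc in relevant)
    return hits / r if r else 0.0

retrieved = ["d1", "d2", "d3", "d4", "d5"]   # only 5 documents retrieved
relevant = {"d1", "d3", "d5"}                # 3 of them are relevant
print(precision_at_k(retrieved, relevant, 10))  # 0.3, not 0.6
print(r_precision(retrieved, relevant))         # ~0.67 (2 of the top R=3 are relevant)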

Generate random small numbers with a target average

I need to write a function that returns one of the numbers (-2, -1, 0, 1, 2) randomly, but I need the average of the output to be a specific number (say, 1.2).
I saw similar questions, but all the answers seem to rely on the target range being wide enough.
Is there a way to do this (without saving state) with this small selection of possible outputs?
UPDATE: I want to use this function for (randomized) testing, as a stub for an expensive function which I don't want to run. The consumer of this function runs it a couple of hundred times and takes an average. I've been using a simple randint function, but the average is always very close to 0, which is not realistic.
Point is, I just need something simple that won't always average to 0. I don't really care what the actual average is. I may have asked the question wrong.
Do you really mean to require that specific value to be the average, or rather the expected value? In other words, if the generated sequence happened to contain an extraordinary number of small values in its initial part, should the rest of the sequence attempt to compensate for that to get the overall average right? I assume not; I assume you want all your samples to be computed independently (after all, you said you don't want any state), in which case you can only control the expected value.
If you assign a probability p_i to each of your possible choices, then the expected value will be the sum of those choices, weighted by their probabilities:
EV = -2*p_-2 - p_-1 + p_1 + 2*p_2 = 1.2
As additional constraints, you have to require that each of these probabilities is non-negative and that the above four add up to at most 1, with the remainder taken by the fifth probability p_0.
There are many possible assignments that satisfy these requirements, and any one of them will do what you asked for. Which of them is reasonable for your application depends on what that application does.
You can use a PRNG which generates variables uniformly distributed in the range [0,1), and then map these to the cases you described by taking the cumulative sums of the probabilities as cut points.
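A minimal, stateless Python sketch of this approach; the probabilities below are just one of many valid choices whose expected value is 1.2:
import random

VALUES = [-2, -1, 0, 1, 2]
PROBS = [0.00, 0.05, 0.10, 0.45, 0.40]  # one possible assignment; EV = -0.05 + 0.45 + 0.80 = 1.2

def sample():
    # Each call is independent (no state); E[sample()] = 1.2.
    u = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for value, p in zip(VALUES, PROBS):
        cumulative += p
        if u < cumulative:
            return value
    return VALUES[-1]  # guard against floating-point round-off

# Sanity check: the empirical mean of many draws is close to 1.2.
print(sum(sample() for _ in range(100000)) / 100000)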

Proper similarity measure for clustering

I have problems finding a proper similarity measure for clustering. I have around 3000 arrays of sets, where each set contains features of a certain domain (e.g., numbers, colors, days, letters, etc.). I'll explain my problem with an example.
Let's assume I have only 2 arrays (a1 & a2) and I want to find the similarity between them. Each array contains 4 sets (in my actual problem there are 250 sets (domains) per array), and a set can be empty.
a1: {a,b}, {1,4,6}, {mon, tue, wed}, {red, blue,green}
a2: {b,c}, {2,4,6}, {}, {blue, black}
I have come with a similarity measure using Jaccard index (denoted as J):
sim(a1,a2) = [J(a1[0], a2[0]) + J(a1[1], a2[1]) + ... + J(a1[3], a2[3])]/4
Note: I divide by the total number of sets (4 in the above example) to keep the similarity between 0 and 1.
Is this a proper similarity measure, and are there any flaws in this approach? I am applying the Jaccard index to each set separately because I want to compare the similarity between related domains (i.e. color with color, etc.).
I am not aware of any other proper similarity measure for my problem.
Further, can I use this similarity measure for clustering purpose?
This should work for most clustering algorithms. Don't use k-means - it can handle numeric vector spaces only. But you have a vector-of-sets type of data.
You may want to use a different mean than the arithmetic average for combining the four Jaccard measures. Try the harmonic or geometric means. See, the average over 250 values will likely be somewhere close to 0.5 all the time, so you need a mean that is more "aggressive".
So the plan sounds good. Just try it: implement this similarity and plug it into various clustering algorithms and see if they find something. I like OPTICS for exploring data and distance functions, as the OPTICS plot can be very indicative of whether (or not!) there is something to be found based on the distance function. If the plot is too flat, there just is not much to be found; it is essentially a representative sample of the distances in the data set...
I use ELKI, and they even have a tutorial on adding custom distance functions: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions, although you can probably just compute the distances with whatever tool you like and write them to a similarity matrix. At 3000 objects this remains very manageable: roughly 4.5 million pairwise distances, which is only a few dozen MB of doubles.
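For reference, a minimal Python sketch of the per-domain Jaccard similarity described in the question, with the combining mean left as a parameter so you can swap in a harmonic or geometric mean (note those require all per-domain values to be strictly positive); how two empty sets should compare is a choice you have to make yourself:
from statistics import mean

def jaccard(s1, s2):
    # Two empty sets are treated as identical here; pick the convention you need.
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

def similarity(a1, a2, combine=mean):
    # Combine the per-domain Jaccard indices of two arrays of sets.
    return combine([jaccard(s1, s2) for s1, s2 in zip(a1, a2)])

a1 = [{"a", "b"}, {1, 4, 6}, {"mon", "tue", "wed"}, {"red", "blue", "green"}]
a2 = [{"b", "c"}, {2, 4, 6}, set(), {"blue", "black"}]
print(similarity(a1, a2))  # arithmetic mean of 1/3, 1/2, 0 and 1/4, i.e. ~0.27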

Formula to prioritize tasks based on weight and date

Is there a formula or algorithm which can prioritize items based on weight and a date? For instance, a critical item would always be at the top of the list, while two normal items would be prioritized based on their due date.
Scheduling is one of the most-studied areas of computer science, which is convenient, because it gives a lot of prior art that you can learn from.
Perhaps the easiest approach is Earliest Deadline First, where you schedule the task with the earliest deadline and work on it until it blocks, then work on the task with the next earliest deadline. The downside is that low-priority tasks that take a long time might stall higher-priority tasks.
It might be worthwhile to determine if your scheduling must be hard, firm, or soft -- sometimes it makes sense to drop tasks completely and finish nearly everything on time than to finish everything but half a second too late.
Yes. One way is to define a comparison function that checks priority first, e.g.:
// Returns n < 0, 0, or n > 0 if value1 is less than, equal to,
// or greater than value2.
compare(value1, value2) {
    if (value1.priority != value2.priority) {
        return value1.priority - value2.priority;
    }
    return value1.date - value2.date;
}
Alternatively, compute a single sort key from the date and the priority; that value can then be used to compare tasks and order them by priority first and date second:
// Returns a sort key that orders tasks by priority first, then by date.
task.GetValue() {
    return me.GetDateAsIntegerValue() + MAX_DATE_VALUE * me.GetPriority();
}
But just as sarnold mentioned, this is a highly studied area.
A different way to look at this is as a ranking problem. If you take these two values, weight and priority, as inputs, you can create a table of paired comparisons whose inputs are the items' weights and priorities and whose outputs are their relative orderings.
Consider, say, item 42 and item 69, denoted X42 and X69: if you have their weights and priorities (W42, P42) and (W69, P69), you'd like to know whether X42 should appear before X69, after it, or at an equal position. If you have a training set, you can tag whether one is preferred to the other.
What we're lacking here is a method for comparing these. A very simple method is to use logistic regression on the differences, i.e. a simple function f((W_A - W_B), (P_A - P_B)), which in this case is f((W42 - W69), (P42 - P69)). If the result is above some threshold, then A is preferred to B, otherwise B is preferred to A. You can use this to sort the results.
As usual, most of the results online are not very accessible to beginners. Here's a short chapter that may be helpful in understanding the logistic regression. However, if you'd like to address such matters in more depth, the statistics StackExchange site would be better.
You'll have to decide: (1) if what you're looking at can be decomposed into an additive function of the weight and priority, and, if so, (2) the loss function or objective function that you need to minimize, so that you can get the optimal parameters for this additive function. An ordinal logistic model is one choice, ordinal probit another, and there are tons of other options. If you don't use an additive function (i.e. a linear combination), you'll have a challenging range of possibilities to consider, so it's best to start with something simple.
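As an illustration only, here is a sketch of the pairwise approach using scikit-learn's LogisticRegression on feature differences; the training pairs and labels below are made up purely to show the shape of the data:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is (W_A - W_B, P_A - P_B) for one pair of items; the label is 1
# if item A should be ranked ahead of item B, otherwise 0. (Toy data.)
X = np.array([[0.5, 1.0], [-0.3, 2.0], [0.1, -1.5], [-0.8, -0.5]])
y = np.array([1, 1, 0, 0])
model = LogisticRegression().fit(X, y)

def prefer_a_over_b(w_a, p_a, w_b, p_b, threshold=0.5):
    # True if the model predicts A should appear before B.
    diff = np.array([[w_a - w_b, p_a - p_b]])
    return model.predict_proba(diff)[0, 1] > threshold

print(prefer_a_over_b(3.0, 8, 2.5, 5))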
You can separate the tasks by rating their impact from 1-10 (10 being highest) and the output needed from 1-10 (also 10 being hardest).
Add the two numbers together and divide by two. The result is the priority ranking of your task from 1-10 (10 being most important).
Example:
Check emails: impact 2, output 1, priority (2 + 1) / 2 = 1.5
Call potential customer: impact 10, output 2, priority (10 + 2) / 2 = 6
In this example, calling the customer would then be given a higher priority than checking emails.
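A tiny Python sketch, just to make the arithmetic and the resulting ordering concrete:
def priority(impact, output):
    # Average of impact (1-10) and output needed (1-10); higher means do it first.
    return (impact + output) / 2

tasks = [("Check emails", 2, 1), ("Call potential customer", 10, 2)]
for name, impact, output in sorted(tasks, key=lambda t: -priority(t[1], t[2])):
    print(name, priority(impact, output))
# Call potential customer 6.0
# Check emails 1.5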
