Precision at k when fewer than k documents are retrieved - information-retrieval

In information retrieval evaluation, what would precision@k be if fewer than k documents are retrieved? Say only 5 documents were retrieved, of which 3 are relevant. Would precision@10 be 3/10 or 3/5?

It can be hard to find text defining edge cases of measures like this, and the mathematical formulations often don't deal with incomplete data. For issues like this, I tend to defer to the decision made by trec_eval, a tool distributed by NIST that implements all the common retrieval measures, especially those used in the Text REtrieval Conference (TREC) challenges.
Per the metric description in m_P.c of trec_eval 9.0 (listed as the latest version):
Precision measured at various doc level cutoffs in the ranking.
If the cutoff is larger than the number of docs retrieved, then
it is assumed nonrelevant docs fill in the rest. Eg, if a method
retrieves 15 docs of which 4 are relevant, then P20 is 0.2 (4/20).
Precision is a very nice user oriented measure, and a good comparison
number for a single topic, but it does not average well. For example,
P20 has very different expected characteristics if there are 300
total relevant docs for a topic as opposed to 10.
This means that you should always divide by k even if fewer than k documents were retrieved, so in your case the precision@10 would be 0.3, not 0.6. (The system is punished for retrieving fewer than k.)
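To make the convention concrete, here is a minimal R sketch (the function name precision_at_k and the toy relevance vector are my own, not part of trec_eval):

precision_at_k <- function(relevance, k) {
  # pad the ranking with implicit non-relevant documents up to rank k,
  # then divide by k (not by the number actually retrieved)
  padded <- c(relevance, rep(FALSE, max(0, k - length(relevance))))
  sum(padded[1:k]) / k
}
retrieved <- c(TRUE, FALSE, TRUE, TRUE, FALSE)   # 5 docs retrieved, 3 relevant
precision_at_k(retrieved, 10)                    # 0.3, matching trec_eval's convention above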
The other tricky case is when there are fewer than k relevant documents. This is why they note that precision is a helpful measure but does not average well.
Some measures that are more robust to these issues are Normalized Discounted Cumulative Gain (NDCG), which compares the ranking to an ideal ranking (at a cutoff), and (simpler) R-Precision, which calculates precision at R, the number of relevant documents for the query, rather than at a fixed k. So one query may calculate P@15 for R=15, and another may calculate P@200 for R=200.
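And a matching sketch of R-Precision under the same padding convention (again just an illustration, not library code):

r_precision <- function(relevance, R) {
  # R is the number of relevant documents for this query; cut off (and pad) at rank R
  padded <- c(relevance, rep(FALSE, max(0, R - length(relevance))))
  sum(padded[1:R]) / R
}
r_precision(c(TRUE, FALSE, TRUE, TRUE, FALSE), R = 3)   # 2/3: two of the top 3 ranks are relevant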

Related

Why does the textreuse package in R make LSH buckets way larger than the original minhashes?

As far as I understand, one of the main functions of the LSH method is data reduction, even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed rOpenSci package, so I assume it does its job correctly, but my question persists.
Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.
If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:
object.size(trtd$minhashes)
> 1072 bytes
Now let's create the LSH buckets for this object (64 bands) and assign them to l; I will have:
object.size(l$buckets)
> 6704 bytes
So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses an md5 digest to create the bucket hashes.
But isn't this too wasteful / overkill, and can't I improve it? Is it normal that our data reduction technique ends up bloating the data to this extent? And wouldn't it be more efficient to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?
Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.
Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)
The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliably detect potential matches that share only a partial overlap (i.e., with a Jaccard score closer to 0), then you need more hashes/bands.
Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented in the package itself, which can help you calculate how many hashes and bands you need: lsh_threshold() calculates the threshold Jaccard score that will be detected, while lsh_probability() tells you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
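For instance, something along these lines; the argument names (h for hashes, b for bands, s for the Jaccard score) follow my reading of the textreuse documentation, so check ?lsh_threshold and ?lsh_probability before relying on them:

library(textreuse)

lsh_threshold(h = 256, b = 64)              # approximate Jaccard score at which detection kicks in
lsh_probability(h = 256, b = 64, s = 0.5)   # chance that a pair with Jaccard 0.5 becomes a candidate

Lowering h and b shrinks the stored data but raises the detectable threshold, which is exactly the trade-off the question is about.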

How to determine upper bound of c when estimating jaccard similarity between documents?

Let's say I have a million documents that I preprocessed (calculated minhash signatures for) in O(D*sqrt(D)) time, where D is the number of documents. When I'm given a query document, I have to return, in O(sqrt(D)) time, the first of the million preprocessed documents whose Jaccard similarity with the query is greater than or equal to, say, 0.8.
If there's no document similar enough to the query document to reach that score, I have to return a document with similarity at least c * 0.8 (where c < 1) with probability at least 1 - 1/e^2. How may I find the maximum value of c for this minhash scheme?
Your orders of complexity/time don't sound right. Calculating the minhashes (signature) for a document should be roughly O(n), where n is the number of features (e.g., words, or shingles).
Finding all similar documents to a given document (with estimated similarity above a given threshold) should be roughly O(log(n)), where n is the number of candidate documents.
A document with an (estimated) minimum 0.8 Jaccard similarity will have at least 80% of its minhashes matching the given document. You haven't defined c and e for us, so I can't tell what your minimum threshold is -- I'll leave that to you -- but you can easily achieve this efficiently in a single pass:
Work through all your base document's hashes one by one. For each hash, look in your hash dictionary for all other docs that share that hash. Keep a tally for each document found of how many hashes it shares. As soon as one of these tallies reaches 80% of the total number of hashes, you have found the winning document and can halt calculations. But if none of the tallies ever reach the .8 threshold, then continue to the end. Then you can choose the document with the highest tally and decide whether that passes your minimum threshold.
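A rough R sketch of that single pass (the inverted index and all names here are mine, not from any particular minhash library):

find_similar <- function(query_hashes, index, min_frac = 0.8) {
  needed <- ceiling(min_frac * length(query_hashes))
  tally <- integer(0)
  for (h in query_hashes) {
    for (doc in index[[h]]) {                     # unseen hash -> NULL -> loop skipped
      tally[doc] <- if (doc %in% names(tally)) tally[doc] + 1L else 1L
      if (tally[doc] >= needed)                   # early exit: estimated similarity reached
        return(list(best = doc, share = tally[doc] / length(query_hashes)))
    }
  }
  if (length(tally) == 0) return(NULL)            # no candidate shared any hash
  best <- names(which.max(tally))                 # otherwise report the highest tally
  list(best = best, share = max(tally) / length(query_hashes))
}

index <- list(h1 = c("docA", "docB"), h2 = "docA", h3 = c("docB", "docC"), h4 = c("docA", "docC"))
find_similar(c("h1", "h2", "h4", "h9"), index)    # docA shares 3 of 4 query hashes (0.75)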

agrep max.distance arguments in R

I need some help with the specific arguments of the agrep function in R.
Within max.distance, the components cost, all, insertions, deletions and substitutions each take either an integer (a maximum number) or a fraction as input.
I've read the documentation on it, but I still cannot figure out some specifics:
What is the difference between "cost=1" and "all=1"?
How is a decimal interpreted, such as "cost=0.1", "inserts=0.9", "all=0.25", etc.?
I understand the basics of the Levenshtein Distance, but how is it applied in terms of the cost or all arguments?
Sorry if this is fairly basic, but like I said, the documentation I have read on it is slightly confusing.
Thanks in advance
Not 100% certain, but here is my understanding:
In max.distance, cost and all are interchangeable if you don't specify a costs argument (that is the next argument of agrep); if you do specify costs, then cost limits matches based on the weighted (as per costs) total cost of the insertions/deletions/substitutions, whereas all limits on the raw count of those operations.
The fraction represents what fraction of the number of characters in your pattern argument you want to allow as insertions/deletions/substitutions (i.e., 0.1 on a 10-character pattern would allow 1 change). If you specify costs, then it is a fraction of (number of characters in pattern) * max(costs), though presumably the fractions in the insertions/deletions/substitutions components of max.distance will be (number of characters) * the corresponding costs value.
I agree that the documentation is not as complete as it could be. I discovered the above by building simple test examples and messing around with them. You should be able to do the same and confirm for yourself, particularly the last part (i.e., whether costs affects the fraction measure of the insertions/deletions/substitutions components of max.distance), which I haven't tested.
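A small illustration of the cost vs. all distinction under that interpretation (the expected matches are my reading of the answer above, so run it and check):

x <- c("banana", "banxna", "bananas", "bandanas")

# With unit costs (costs = NULL), cost = 1 and all = 1 should give the same matches:
agrep("banana", x, max.distance = list(cost = 1), value = TRUE)
agrep("banana", x, max.distance = list(all = 1),  value = TRUE)

# With substitutions weighted 2, the one-substitution match "banxna" should still
# pass all = 1 (one raw edit) but exceed cost = 1 (weighted cost 2):
agrep("banana", x, max.distance = list(all = 1),
      costs = list(substitutions = 2), value = TRUE)
agrep("banana", x, max.distance = list(cost = 1),
      costs = list(substitutions = 2), value = TRUE)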

How to select stop words using tf-idf? (non english corpus)

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.
Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in (its document frequency) and filter out those that appear in more than 50% of the documents, or the top 500, or some other threshold that you will have to tune.
The best (as in more representative) terms in a document are those with higher tf-idf because those terms are common in the document, while being rare in the collection.
As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop words) produce very low tf-idf anyway. However, they still affect some computations, and that would be wrong if you assume they are pure noise (which might not be true depending on the task). In addition, if they are included your algorithm will be slightly slower.
edit:
As @FelipeHammel says, you can directly use the IDF (remember to invert the order) as a measure that is (inversely) proportional to df. This is completely equivalent for ranking purposes, and therefore for selecting the top k terms. However, you cannot use it directly to select based on ratios (e.g., words that appear in more than 50% of the documents), although a simple threshold fixes that (i.e., selecting terms with idf lower than a specific value). In general, a fixed number of terms is used.
I hope this helps.
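A plain-R sketch of the document-frequency filter described above (the toy corpus and names are mine):

docs <- list(
  c("the", "cat", "sat", "on", "the", "mat"),
  c("the", "dog", "ate", "the", "bone"),
  c("a", "cat", "and", "a", "dog")
)
N   <- length(docs)
df  <- table(unlist(lapply(docs, unique)))   # document frequency of each term
idf <- log(N / df)                           # inverse document frequency, if you prefer that form
stop_words <- names(df[df / N > 0.5])        # terms appearing in more than 50% of the documents
stop_words                                   # "cat" "dog" "the" for this toy corpus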
From "Introduction to Information Retrieval" book:
tf-idf assigns to term t a weight in document d that is
highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
So the words with the lowest tf-idf can be considered stop words.

Problem for LSI

I am using Latent semantic analysis for text similarity. I have 2 questions.
How do I select the k value for dimension reduction?
I have read in many places that LSI works for words with similar meanings, for example car and automobile. How is that possible? What is the magic step I am missing here?
The typical choice for k is 300. Ideally, you set k based on an evaluation metric that uses the reduced vectors. For example, if you're clustering documents, you could select the k that maximizes the clustering solution score. If you don't have a benchmark to measure against, then I would set k based on how big your data set is. If you only have 100 documents, then you wouldn't expect to need several hundred latent factors to represent them. Likewise, if you have a million documents, then 300 may be too small. However, in my experience the resulting vectors are fairly robust to large changes in k, provided that k is not too small (i.e., k = 300 does about as well as k = 1000).
You might be confusing LSI with Latent Semantic Analysis (LSA). They're very related techniques, with the difference being that LSI operates on documents, and LSA operates on words. Both approaches use the same input (a term x document matrix). There are several good open source LSA implementations if you would like to try them. The LSA wikipedia page has a comprehensive list.
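To make the role of k concrete, here is a bare-bones R sketch of the rank-k truncation at the heart of LSI/LSA (the toy term-by-document matrix is illustrative only):

tdm <- matrix(c(2, 0, 1, 0,
                1, 0, 2, 0,
                0, 3, 0, 1,
                0, 1, 0, 2), nrow = 4, byrow = TRUE)   # rows = terms, columns = documents
k <- 2
s <- svd(tdm)
term_vectors <- s$u[, 1:k] %*% diag(s$d[1:k])          # terms in the k-dimensional latent space
doc_vectors  <- diag(s$d[1:k]) %*% t(s$v[, 1:k])       # documents in the same space
# Terms that occur in the same documents (e.g., "car" and "automobile") end up with
# similar rows in term_vectors, which is the source of the "magic" for synonyms.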
Try a couple of different values from [1..n] and see what works for whatever task you are trying to accomplish.
Make a word-word correlation matrix [i.e., cell (i,j) holds the number of docs where terms i and j co-occur] and use something like PCA on it (see the sketch below).
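A quick sketch of that second suggestion, a term-term co-occurrence matrix followed by PCA (again just a toy example):

inc <- matrix(c(1, 0, 1, 0,
                1, 0, 1, 0,
                0, 1, 0, 1,
                0, 1, 0, 1), nrow = 4, byrow = TRUE)   # does term i occur in document j?
cooc <- inc %*% t(inc)                                 # cell (i, j): number of docs where i and j co-occur
pc <- prcomp(cooc)                                     # PCA on the co-occurrence matrix
pc$x[, 1:2]                                            # terms projected onto the first two components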

Resources