Some ideas and direction on how to measure ranking, AP, MAP, recall for IR evaluation - information-retrieval

I have a question about how to evaluate whether an information retrieval result is good or not, for example by calculating the relevant document rank, recall, precision, AP, MAP, and so on.
Currently, the system is able to retrieve documents from the database once the user enters a query. The problem is that I do not know how to do the evaluation.
I got a public data set, the "Cranfield collection" (dataset link). It contains:
1. documents 2. queries 3. relevance assessments
           DOCS   QRYS  SIZE*
Cranfield  1,400  225   1.6
May I know how to do the evaluation using the "Cranfield collection" to calculate the relevant document rank, recall, precision, AP, MAP, and so on?
I might need some ideas and direction; I am not asking how to code the program.

Document Ranking
Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). See the Wikipedia page for more details.
Precision and Recall
Precision measures "of all the documents we retrieved as relevant how many are actually relevant?".
Precision = No. of relevant documents retrieved / No. of total documents retrieved
Recall measures "Of all the actual relevant documents how many did we retrieve as relevant?".
Recall = No. of relevant documents retrieved / No. of total relevant documents
Suppose that when a query "q" is submitted to an information retrieval system (e.g., a search engine) that has 100 relevant documents w.r.t. the query "q", the system retrieves 68 documents out of a total collection of 600 documents. Out of the 68 retrieved documents, 40 are relevant. So, in this case:
Precision = 40 / 68 = 58.8% and Recall = 40 / 100 = 40%
F-Score / F-measure is the weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is:
F-Score = 2 * Precision * Recall / (Precision + Recall)
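A minimal sketch of these three measures in Java, using the counts from the example above:

    // Minimal sketch: precision, recall and F-score from raw counts
    // (numbers from the example above: 40 relevant out of 68 retrieved, 100 relevant in total).
    public class PrecisionRecall {
        static double precision(int relevantRetrieved, int totalRetrieved) {
            return (double) relevantRetrieved / totalRetrieved;
        }
        static double recall(int relevantRetrieved, int totalRelevant) {
            return (double) relevantRetrieved / totalRelevant;
        }
        static double fScore(double p, double r) {
            return 2 * p * r / (p + r);
        }
        public static void main(String[] args) {
            double p = precision(40, 68);   // 0.588
            double r = recall(40, 100);     // 0.400
            System.out.printf("P=%.3f R=%.3f F=%.3f%n", p, r, fScore(p, r));
        }
    }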
Average Precision
You can think of it this way: you type something in Google and it shows you 10 results. It’s probably best if all of them were relevant. If only some are relevant, say five of them, then it’s much better if the relevant ones are shown first. It would be bad if first five were irrelevant and good ones only started from sixth, wouldn’t it? AP score reflects this.
Giving an example below:
AvgPrec of the two rankings:
Ranking#1: (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6) / 6 = 0.78
Ranking#2: (0.5 + 0.4 + 0.5 + 0.57 + 0.56 + 0.6) / 6 = 0.52
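The two rankings themselves were shown as figures in the original post and are not reproduced here; the general recipe is to average the precision at each rank where a relevant document appears. A minimal sketch, using a hypothetical judged ranking (1 = relevant, 0 = not relevant):

    // Sketch: average precision from a ranked list of binary relevance labels.
    // The example ranking is hypothetical, not the one from the original figures.
    public class AveragePrecision {
        static double averagePrecision(int[] relevance) {
            int relevantSeen = 0;
            double sum = 0.0;
            for (int i = 0; i < relevance.length; i++) {
                if (relevance[i] == 1) {
                    relevantSeen++;
                    sum += (double) relevantSeen / (i + 1);  // precision at this rank
                }
            }
            return relevantSeen == 0 ? 0.0 : sum / relevantSeen;
        }
        public static void main(String[] args) {
            int[] ranking = {1, 0, 1, 1, 0, 1};   // hypothetical judged ranking
            System.out.printf("AP = %.2f%n", averagePrecision(ranking));  // 0.77
        }
    }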
Mean Average Precision (MAP)
MAP is the mean of average precision across multiple queries/rankings. An example for illustration:
Mean average precision for the two queries:
For query 1, AvgPrec: (1.0+0.67+0.5+0.44+0.5) / 5 = 0.62
For query 2, AvgPrec: (0.5+0.4+0.43) / 3 = 0.44
So, MAP = (0.62 + 0.44) / 2 = 0.53
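Since MAP is just the arithmetic mean of the per-query AP values, a minimal sketch using the two AP numbers above:

    // Sketch: MAP as the mean of per-query average precision values
    // (AP values taken from the two-query example above).
    public class MeanAveragePrecision {
        static double meanAveragePrecision(double[] averagePrecisions) {
            double sum = 0.0;
            for (double ap : averagePrecisions) sum += ap;
            return sum / averagePrecisions.length;
        }
        public static void main(String[] args) {
            double[] apPerQuery = {0.62, 0.44};   // AP for query 1 and query 2
            System.out.printf("MAP = %.2f%n", meanAveragePrecision(apPerQuery));  // 0.53
        }
    }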
Sometimes, people use precision@k and recall@k as performance measures of a retrieval system. You would need to build a retrieval system for such tests. If you want to write your program in Java, you should consider Apache Lucene to build your index.

Calculating precision and recall is simple:
Precision is the fraction of the retrieved documents that are relevant.
Recall is the fraction of the relevant documents that are retrieved.
For example, if a query has 20 relevant documents and you retrieved 25 documents of which only 14 are relevant to the query, then:
Precision = 14/25 and
Recall = 14/20.
But precision and recall should be combined in some way; that combination is called the F-measure and is the harmonic mean of precision and recall:
F-Score = 2 * Precision * Recall / (Precision + Recall).
Precision@k tells you the proportion of relevant documents among a specific number of retrieved documents. Assume you retrieved 25 documents and, in the first 10 documents, 8 relevant documents were retrieved; then precision@10 = 8/10.
If you compute precision@k at each rank k where a relevant document appears, sum those values, and divide by N, the total number of relevant documents for that query in your data set, you get the average precision (AP) for the query; MAP is then the mean of AP over all queries.
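A minimal sketch of precision@k, using a hypothetical ranking in which 8 of the first 10 results are relevant:

    // Sketch: precision at cutoff k from a ranked list of binary relevance labels.
    // The ranking below is hypothetical, chosen so that 8 of the first 10 results are relevant.
    public class PrecisionAtK {
        static double precisionAtK(int[] relevance, int k) {
            int relevant = 0;
            for (int i = 0; i < Math.min(k, relevance.length); i++) {
                if (relevance[i] == 1) relevant++;
            }
            return (double) relevant / k;
        }
        public static void main(String[] args) {
            int[] ranking = {1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1};
            System.out.println("P@10 = " + precisionAtK(ranking, 10));  // 0.8
        }
    }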

Related

Design an algorithm that minimises the load on the most heavily loaded server

Reading the book by Aziz & Prakash (2021), I am a bit stuck on problem 3.7 and the associated solution, which I am trying to implement.
The problem says:
You have n users with unique hashes h1 through hn and
m servers, numbered 1 to m. User i has Bi bytes to store. You need to
find numbers K1 through Km such that all users with hashes between
Kj and Kj+1 get assigned to server j. Design an algorithm to find the
numbers K1 through Km that minimizes the load on the most heavily
loaded server.
The solution says:
Let L(a,b) be the maximum load on a server when
users with hash h1 through ha are assigned to servers S1 through Sb in
an optimal way so that the max load is minimised. We observe the
following recurrence:
L(a, b) = min over 1 <= x <= a of max( L(x, b-1), B_{x+1} + ... + B_a )
In other words, we find the right value of x such that if we pack the
first x users into b - 1 servers and the remaining users into the last server, the max
load on a given server is minimized.
Using this relationship, we can tabulate the values of L until we get
L(n,m). While computing L(a,b), when the values of L are tabulated
for all lower values of a and b, we need to find the right value of x to
minimize the load. As we increase x, L(x,b-1) in the above expression increases and the sum term decreases. We can do a binary search over x to find the x that minimises their max.
I know that we can probably use some sort of dynamic programming, but how could we possibly turn this idea into code?
The dynamic programming algorithm is defined fairly well given that formula: implementing a top-down DP algorithm just needs you to loop from x = 1 to a and record which x minimizes that max(L(x,b-1), sum(B_i)) expression.
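A sketch of that top-down DP; the byte counts B, the server count m and all names below are illustrative, not taken from the book:

    import java.util.Arrays;

    // Sketch of the top-down DP described above (illustrative, not the book's code).
    // L(a, b) = minimal possible max load when users 1..a go to servers 1..b:
    //   L(a, b) = min over x in 0..a of max( L(x, b-1), B[x+1] + ... + B[a] )
    public class MinMaxLoadDP {
        static long[] prefix;      // prefix[i] = B[0] + ... + B[i-1]
        static long[][] memo;

        static long rangeSum(int from, int to) {   // sum of B[from..to-1], 0-based
            return prefix[to] - prefix[from];
        }

        static long load(int a, int b) {
            if (b == 1) return rangeSum(0, a);     // one server gets everything
            if (a == 0) return 0;                  // no users, no load
            if (memo[a][b] != -1) return memo[a][b];
            long best = Long.MAX_VALUE;
            for (int x = 0; x <= a; x++) {         // first x users go to servers 1..b-1
                best = Math.min(best, Math.max(load(x, b - 1), rangeSum(x, a)));
            }
            return memo[a][b] = best;
        }

        public static void main(String[] args) {
            long[] B = {10, 20, 30, 40, 50, 60};   // bytes per user, in hash order
            int m = 3;                             // number of servers
            prefix = new long[B.length + 1];
            for (int i = 0; i < B.length; i++) prefix[i + 1] = prefix[i] + B[i];
            memo = new long[B.length + 1][m + 1];
            for (long[] row : memo) Arrays.fill(row, -1);
            System.out.println("Minimal max load: " + load(B.length, m));  // 90
        }
    }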
There is, however, a simpler (and faster) greedy/binary search algorithm for this problem that you should consider, which goes like this:
Compute prefix sums for B
Find the minimum value of L such that we can partition B into m contiguous subarrays whose maximum sum is equal to L.
We know 1 <= L <= sum(B). So, perform a binary search to find L, with a helper function canSplit(v) that tests whether we can split B into such subarrays of sum <= v.
canSplit(v) works greedily: Remove as many elements from the start of B as possible so that our sum does not exceed v. Repeat this a total of m times; return True if we've used all of B.
You can use the prefix sums to run canSplit in O(m log n) time, with an additional inner binary search.
Given L, use the same strategy as the canSplit function to determine the m-1 partition points; find the m partition boundaries from there.
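A sketch of that greedy/binary-search idea. For simplicity the canSplit below scans B linearly in O(n) instead of using the prefix-sum/inner-binary-search variant described above; the outer binary search over the answer is the same:

    // Sketch: binary-search the answer v; canSplit(v) greedily checks whether B can be
    // cut into at most m contiguous pieces, each summing to at most v.
    public class MinMaxLoadBinarySearch {
        static boolean canSplit(long[] B, int m, long v) {
            int pieces = 1;
            long current = 0;
            for (long b : B) {
                if (b > v) return false;           // a single user already exceeds v
                if (current + b > v) {             // start filling the next server
                    pieces++;
                    current = b;
                    if (pieces > m) return false;
                } else {
                    current += b;
                }
            }
            return true;
        }

        static long minMaxLoad(long[] B, int m) {
            long lo = 0, hi = 0;
            for (long b : B) { lo = Math.max(lo, b); hi += b; }
            while (lo < hi) {                      // smallest v with canSplit(v) == true
                long mid = lo + (hi - lo) / 2;
                if (canSplit(B, m, mid)) hi = mid; else lo = mid + 1;
            }
            return lo;
        }

        public static void main(String[] args) {
            long[] B = {10, 20, 30, 40, 50, 60};
            System.out.println(minMaxLoad(B, 3));  // 90
        }
    }

The m-1 partition points (and from them the K values) can then be recovered by replaying the same greedy pass with v set to the answer.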

Compressed Bloom Filters with fixed false positive probability

I am trying to implement a compressed Bloom filter according to the paper Compressed Bloom Filters by Michael Mitzenmacher. I need to calculate m (the number of bits) and k (the number of hash functions) for a given fixed false positive probability. For example:
I know that if I have n = 1000 elements (to be inserted in the Bloom filter) and a given probability p = 0.01, the "optimal" number of bits for the Bloom filter will be (-n * Math.log(p) / (Math.log(2) * Math.log(2))) = 9585.
I also need k = (9585/1000) * Math.log(2) = 7 hash functions. That is, I will get a false positive rate of 0.01.
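A minimal sketch of those two standard sizing formulas (the exact value of -n*ln(p)/(ln 2)^2 for n = 1000 and p = 0.01 is about 9585.06):

    // Sketch: the standard "optimal" Bloom filter sizing formulas quoted above.
    public class BloomSizing {
        static int optimalBits(int n, double p) {
            return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        }
        static int optimalHashes(int m, int n) {
            return (int) Math.round((double) m / n * Math.log(2));
        }
        public static void main(String[] args) {
            int n = 1000;
            double p = 0.01;
            int m = optimalBits(n, p);    // 9586 after rounding up (9585 if truncated)
            int k = optimalHashes(m, n);  // 7 (6.6 rounded)
            System.out.println("m = " + m + ", k = " + k);
        }
    }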
To "compress" bloom filter we need to build more "sparse" filter - get lesser hash functions and more number of bits in vector.
But I did not get the idea how to calculate number of hash functions and number of bits for this sparse filter. If we decrease k by 1 how will increase number of bits? what is the ratio?
Well, I did not find any concrete ratio between the number of hash functions and the number of bits for a fixed false positive rate. But I have found a precalculated table of such values. Here you can find values of the number of hash functions and the bit vector length with the corresponding false positive rate. Since we have such a table, we can pick a number of hash functions (less than the optimal one) and the corresponding bit vector length with a false positive rate less than or equal to the given one.
Here is an implementation for building "sparse" Bloom filters.
Hope this will save someone's time in the future.
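For completeness, one way to reproduce entries like those in such a table (this is an assumption on my part, not something taken from the paper or the linked table) is to invert the usual approximation p ≈ (1 - e^(-kn/m))^k for a chosen k below the optimum, which gives m = -k*n / ln(1 - p^(1/k)):

    // Sketch (assumption): bits needed for a *fixed* k and target false positive rate p,
    // from the standard approximation p ≈ (1 - e^(-kn/m))^k solved for m.
    public class SparseBloomSizing {
        static long bitsForFixedK(int n, int k, double p) {
            double bitsPerElement = -k / Math.log(1.0 - Math.pow(p, 1.0 / k));
            return (long) Math.ceil(bitsPerElement * n);
        }
        public static void main(String[] args) {
            int n = 1000;
            double p = 0.01;
            for (int k = 7; k >= 2; k--) {   // fewer hash functions -> more bits needed
                System.out.println("k = " + k + " -> m ~ " + bitsForFixedK(n, k, p));
            }
        }
    }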

In tf-idf why do we normalize by document frequency and not average term frequency across all documents in the corpus?

Average term frequency would be the average frequency with which the term appears in the other documents. Intuitively, I want to compare how frequently it appears in this document relative to the other documents in the corpus.
An example:
d1 has the word "set" 100 times, d2 has the word "set" 1 time, d3 has the word "set" 1 time, d4-N does not have the word set
d1 has the word "theory" 100 times, d2 has the word "theory" 100 times, d3 has the word "theory" 100 times, d4-N does not have the word set
Document 1 has the same tf-idf for the word "set" and the word "theory" even though the word set is more important to d1 than theory.
Using average term frequency would distinguish these two examples. Is tf-iatf (inverse average term frequency) a valid approach? To me it would give me more important keywords, rather than just "rare" and "unique" keywords. If idf is "an estimate of how rare that word is" wouldn't iatf be a better estimate? It seems only marginally harder to implement (especially if the data is pre-processed).
I am thinking of running an experiment and manually analyzing the highest ranked keywords with each measure, but wanted to pass it by some other eyes first.
A follow-up question:
Why is tf-idf used so frequently as opposed to alternative methods like this which MAY be more accurate? (If this is a valid approach that is).
Update:
Ran an experiment where I manually analyzed the scores and corresponding top words for a few dozen documents, and it seems like iatf and inverse collection frequency (the standard approach to what I described) have super similar results.
Tf-idf is not meant to compare the importance of a word in a document across two corpora.
It is rather meant to distinguish the importance of a word within a document in relation to the distribution of the same term in the other documents of the same collection (not across collections).
A standard approach that you can apply for your case is: collection frequency, cf(t), instead of document frequency, df(t).
cf(t) measures how many times a term t occurs in the corpus.
cf(t) divided by the total collection size gives you the probability of sampling t from the collection.
And then you can compute a linear combination of tf(t,d) and cf(t) values, which gives you the probability of sampling a term t either from a document or from the collection.
P(t,d) = \lambda P(t|d) + (1-\lambda) P(t|Collection)
This is known as the Jelinek-Mercer smoothed language model.
For your example (letting \lambda=0.5):
Corpus 1: P("set",d1) = 0.5*100/100 + 0.5*100/102
Corpus 2: P("set",d1) = 0.5*100/100 + 0.5*100/300
Clearly, P("set",d1) for corpus 2 is less (almost one-third) of that in corpus 1.

Using TF-IDF for relative frequency, cosine similarity

I'm trying to use TF-IDF for relative frequency to calculate cosine distance. I've selected 10 words from one document, say File 1, and selected another 10 files from my folder, using the 10 words and their frequencies to check which of the 10 files are similar to File 1. The total number of files in the folder is 46. I know that DF is the number of documents the word appears in, IDF is log(total number of files (46) / DF), and TF-IDF is the product of TF (the frequency of the word in one document) and IDF.
QUESTION:
Assuming what I said above is 100% correct, after getting the TF-IDF for all 10 words in one document, say File 2, do I add the TF-IDF values of the 10 words together to get the TF-IDF for File 2?
What is the cosine distance?
Could anyone help with an example?
The problem is that you are confusing cosine similarity with tf-idf. The former is a measure of similarity between two vectors (in this case documents), while the latter is simply a technique for setting the components of the vectors that are eventually used in the former.
As for your question, it is rather inconvenient to select 10 terms from each document. I'd suggest working with all terms. Let V be the total number of terms (the cardinality of the union of term sets over all documents in the collection). You can then represent each document as a vector of V dimensions. The ith component of a particular document D can be set to the tf-idf weight of the corresponding term (say t), i.e. D_i = tf(t,D)*idf(t).
Once you represent every document in your collection in this way, you can then compute the inter-document similarities in the following way.
cosine-sim(D, D') = (1 / (|D| * |D'|)) * \sum_{i=1}^{V} D_i * D'_i
                  = (1 / (|D| * |D'|)) * \sum_{i=1}^{V} tf(t_i,D)*idf(t_i) * tf(t_i,D')*idf(t_i)
Note that the contributing terms in this summation are only those ones which occur in both documents. If a term t occurs in D but not in D' then tf(t,D')=0 which thus contributes 0 to the sum.
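A minimal sketch of that computation over sparse tf-idf vectors (term -> weight maps); the terms and weights below are illustrative:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: cosine similarity between two documents represented as sparse
    // tf-idf vectors (term -> weight). Only terms present in both maps contribute
    // to the dot product, as noted above.
    public class CosineSimilarity {
        static double cosine(Map<String, Double> d1, Map<String, Double> d2) {
            double dot = 0.0;
            for (Map.Entry<String, Double> e : d1.entrySet()) {
                Double w = d2.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
            }
            return dot / (norm(d1) * norm(d2));
        }
        static double norm(Map<String, Double> v) {
            double sumOfSquares = 0.0;
            for (double w : v.values()) sumOfSquares += w * w;
            return Math.sqrt(sumOfSquares);
        }
        public static void main(String[] args) {
            Map<String, Double> doc1 = new HashMap<>();
            doc1.put("set", 2.4); doc1.put("theory", 0.3);
            Map<String, Double> doc2 = new HashMap<>();
            doc2.put("set", 1.2); doc2.put("cosine", 0.9);
            System.out.println(cosine(doc1, doc2));   // dot product over shared terms, divided by the norms
        }
    }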

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both, and then on the Wikipedia page for cosine similarity I found this sentence: "In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90 degrees."
Now I'm wondering....aren't they 2 different things?
Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.
I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
If d2 and q are tf-idf vectors, then
cos θ = (d2 · q) / (||d2|| * ||q||)
where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
(Formula taken from the Wikipedia, hence the d2.)
TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).
To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the general concept of document similarity in general.
You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
TF-IDF adjusts the raw term frequency by taking into account how frequently each term occurs in general (the document frequency). Inverse document frequency is usually the log of the number of documents divided by the number of documents the term occurs in:
idf(t) = log( N / df(t) )
Think of the 'log' as a minor nuance that helps things work out in the long run: it grows when its argument grows, so if the term is rare the IDF will be high (lots of documents divided by very few documents), and if the term is common the IDF will be low (lots of documents divided by lots of documents ~= 1, and the log of ~1 is ~0).
Say you have 100 recipes, and all but one require eggs; now you have three more documents that all contain the word "egg": once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents; let's stick with 99).
The TF-IDF of 'egg' is:
1 * log (100/99) = 0.01 # document 1
2 * log (100/99) = 0.02 # document 2
1 * log (100/99) = 0.01 # document 3
These are all pretty small numbers; in contrast, let's look at another word that occurs in only 9 of the 100 recipes in your corpus: 'arugula'. It occurs once in the first doc, twice in the second, and does not occur in the third document.
The TF-IDF for 'arugula' is:
1 * log (100/9) = 2.40 # document 1
2 * log (100/9) = 4.81 # document 2
0 * log (100/9) = 0 # document 3
'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').
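A minimal sketch reproducing the arithmetic above (raw term frequency times log(N/df), with natural log, N = 100, df('egg') = 99, df('arugula') = 9):

    // Sketch: tf-idf weight = raw term frequency * log(N / document frequency),
    // reproducing the 'egg' / 'arugula' numbers above (natural log).
    public class TfIdfExample {
        static double tfIdf(int termFrequency, int totalDocs, int documentFrequency) {
            return termFrequency * Math.log((double) totalDocs / documentFrequency);
        }
        public static void main(String[] args) {
            System.out.printf("egg, doc 2:     %.2f%n", tfIdf(2, 100, 99));  // ~0.02
            System.out.printf("arugula, doc 2: %.2f%n", tfIdf(2, 100, 9));   // ~4.8
        }
    }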
The complete mathematical procedure for cosine similarity is explained in these tutorials
part-I
part-II
part-III
If you want to calculate the cosine similarity between two documents, the first step is to calculate the tf-idf vectors of the two documents, and then find the dot product of these two vectors. Those tutorials will help you :)
tf-idf weighting has some cases where it fails and generates a NaN error in code while computing. It's very important to read this:
http://www.p-value.info/2013/02/when-tfidf-and-cosine-similarity-fail.html
Tf-idf is just used to build the vectors for the documents, based on tf (term frequency, which counts how many times the term occurs in the document) and inverse document frequency (which measures how rare the term is across the whole collection).
Then you can find the cosine similarity between the documents.
TF-IDF gives you a weighted term-document matrix, and computing cosine similarity against that document matrix returns the most similar listings.
