Can we use F-measure, precision, recall, with ranked retrieval results? - information-retrieval

I'm using Indri with TrecEval, and I'm wondering whether F-measure, precision, and recall can be used with ranked retrieval results.
If so, what would the F-measure (and the others) mean? Are those values relevant in some way, for example for evaluating whether the queries are close to the corpus?
I know that MAP is used to evaluate ranked results, but I'm wondering whether F-measure and the others may be useful for something else. I'm confused here; I did some research, but there is something I don't get.
Thanks for your help.

Precision, recall, and F1 are set-based measures. This means that they score a set of documents, not a ranking.
We typically evaluate these sorts of measures at fixed cutoffs in the ranking: the top 5, 10, 20, 50, 100, 500, or 1000 documents. Plotting the values across cutoffs then gives a curve that summarizes the whole ranking.
Or you can report the precision/recall at 20, i.e. within the first two pages of results for most interfaces. F1 isn't used much in IR, as our ranking measures (AP, NDCG, etc.) balance these anyway.
F1@20 will give you a number representing the harmonic mean of recall and precision within the best 20 documents according to your ranker.
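As a rough illustration of scoring a ranking at a cutoff, here is a minimal Python sketch. The function name, the document ids, and the relevance judgments are made up for the example; it assumes binary relevance and divides precision by k itself, and it is not how Indri or TrecEval implement these measures internally.

```python
def precision_recall_f1_at_k(ranked_ids, relevant_ids, k):
    """Score the top-k prefix of a ranking as if it were an unordered set."""
    top_k = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)

    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ranking and relevance judgments, for illustration only.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3"}

for k in (2, 4, 6):
    p, r, f1 = precision_recall_f1_at_k(ranking, relevant, k)
    print(f"k={k}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Evaluating at several cutoffs, as in the loop, is what produces the points of the curve mentioned above.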

Discrepancy Between Two Methods of Finding Information Entropy

So I learned about the concept of information entropy from Khan Academy, where it was phrased in the form of "average amount of yes or no questions needed per symbol". They also gave an alternative form using logarithms.
So let's say we have a symbol generator that produces A,B, and C.
P(A)=1/2, P(B)=1/3, and P(C)=1/6
According to their method, I would get a chart like this:
[Image: First method]
Then I would multiply each symbol's probability of occurring by the number of questions needed for it, giving
(1/2)*1 + (1/3)*2 + (1/6)*2 = 1.5 bits
but their other method gives
-(1/2)*log2(1/2) - (1/3)*log2(1/3) - (1/6)*log2(1/6) = 1.459... bits
The difference is small, but still significant. I've tried this with different combinations and probabilities and got similar results. Is there something I'm missing? Am I using either method wrong, or is one of them more conditional?
Your second calculation is correct.
The problem with your decision tree approach is that the decision tree is not optimal (and indeed, no binary decision tree could be for those probabilities). Your “is it B” decision node represents less than one bit of information, since once you get there you already know it’s probably B. So your decision tree represents a potential encoding of symbols which is expected to consume 1.5 bits on average, but it represents slightly less than 1.5 bits of information.
In order to have a binary tree which represents an optimal encoding, each node needs to have balanced probabilities. This is not possible if some symbol has a probability whose denominator is not a power of 2.
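The gap between the two numbers can be reproduced with a short Python sketch. It uses Huffman coding as a stand-in for "the best possible question tree" (an assumption of this example, not something the question mentions), and the helper names are illustrative.

```python
import heapq
from fractions import Fraction
from math import log2


def entropy_bits(probs):
    """Shannon entropy in bits of a {symbol: probability} mapping."""
    return -sum(float(p) * log2(float(p)) for p in probs.values() if p > 0)


def huffman_code_lengths(probs):
    """Return {symbol: code length} for a binary Huffman code."""
    # Heap entries are (probability, tie_breaker, {symbol: depth}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


probs = {"A": Fraction(1, 2), "B": Fraction(1, 3), "C": Fraction(1, 6)}
lengths = huffman_code_lengths(probs)
expected_length = sum(probs[s] * lengths[s] for s in probs)

print("code lengths:", lengths)
print("expected questions per symbol:", float(expected_length))
print("entropy in bits:", entropy_bits(probs))
```

Even the optimal per-symbol tree needs 1.5 questions on average here, while the entropy is about 1.459 bits, because 1/3 and 1/6 cannot be split into exactly balanced yes/no questions.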

Why do we use log probability in deep learning?

I got curious while reading the paper 'Sequence to Sequence Learning with Neural Networks'.
In fact, not only this paper but also many other papers use log probabilities. Is there a reason for that?
Please check the attached photo.
Two reasons:
Theoretical - The probability of two independent events A and B occurring together is given by P(A)*P(B). Taking logs maps this to a sum, i.e. log(P(A)) + log(P(B)). It is thus easier to address the neuron firing 'events' as a linear function.
Practical - Probability values lie in [0, 1], so multiplying many such small numbers can easily lead to underflow in floating-point arithmetic (e.g. consider repeatedly multiplying values such as 0.0001 and 0.00001). A practical solution is to work with logs, which avoids the underflow.
For any given problem we need to optimise the likelihood of the parameters. But optimising the product directly requires all the data at once and a huge amount of computation.
We know that a sum is a lot easier to optimise, as the derivative of a sum is the sum of the derivatives. So taking the log converts the product into a sum and makes the computation faster.
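The underflow argument is easy to see in a small Python sketch. The list of 100 probabilities of 1e-5 is invented for the demonstration and does not come from the paper.

```python
import math

# Hypothetical per-token probabilities for a 100-step sequence.
probs = [1e-5] * 100

# Naive product: the true value is 1e-500, far below the smallest
# positive double, so the running product collapses to exactly 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale stays well within range.
log_sum = sum(math.log(p) for p in probs)

print("product of probabilities:", product)
print("sum of log probabilities:", log_sum)
```

Once the product has become 0.0, every sequence looks equally impossible, whereas the log-space sums can still be compared with one another.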

Behaviour of dfmax in glmnet

(NB: This is a slightly modified version of a post I'd made on a different forum. I received no responses there, hence the post here. If this is not allowed, please let me know, will take down the question).
I am new to glmnet, so I do not yet understand fully what the various
parameters do. I am trying to build a multinomial classifier which restricts
the number of features used in the model. From reading the docs and some
answers on this forum, I understand dfmax is the way to do it. I
played around with it a bit; I have a couple of questions and would appreciate some help:
Setup
For a particular dataset, I want to restrict the number of features to 3;
the original data has 126 features. Here's what I run:
library(glmnet)  # provides glmnet()
library(broom)   # provides tidy()
fit <- glmnet(data.matrix(X), data.matrix(y), family = 'multinomial', dfmax = 3)
d <- data.frame(tidy(fit))
This is the value of d:
My questions about the output:
I see multiple values of lambda in there; it looks like glmnet tries to fit lambdas that get the number of terms close to dfmax=3. So it's less like the LARS algorithm (where we move stagewise by adding variables and can stop at an exact number of variables) and more about finding the right lambdas for regularization that lead to the intended dfmax. Is this right?
I'm guessing alpha plays a role in how close we can get to dfmax. At alpha=1 we're doing lasso, and so it's easier to get close to dfmax than when alpha=0 and we're doing ridge. Is this understanding correct?
A "neighborhood" of dfmax is the best we can do, it would seem. Or am I missing a parameter that gets me to the model with the exact dfmax? (FYI: alpha=1 doesn't seem to get me to the precise number of nonzero terms either, at least on this dataset.)
In the first solution - step=1, there are no variables used. Does this mean the relative odds equal a constant?
What does pmax do?
Thanks in advance!

Document similarity selfplagiarism

I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated the cosine similarity of each author's texts with one another. For example, author x has 100 texts, so I have come up with a 100 x 100 similarity matrix. Author y has 50 texts, so I have come up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I average the columns or rows and then average the resulting vector of means, I arrive at a single number per author, so I can compare these two means of means, but I am not sure whether this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly is your quantity of interest. If this is a single summary of how similar are an author's documents to one another, then some distribution across the document similarities, within author, is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution using a mean. To capture the variance I would also characterise the standard deviation of this similarity.
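To make the summary concrete, here is a minimal Python sketch that reduces one author's similarity matrix to a mean and a standard deviation. It assumes a square, symmetric matrix and at least three documents; the 3 x 3 matrix and the function names are invented for illustration and are not quanteda output. It skips the diagonal, since every document has similarity 1 with itself and those entries make up a larger share of a 50 x 50 matrix than of a 100 x 100 one.

```python
from statistics import mean, stdev


def off_diagonal_pairs(sim):
    """Similarities for each unordered pair of distinct documents."""
    n = len(sim)
    return [sim[i][j] for i in range(n) for j in range(i + 1, n)]


def summarise(sim):
    """Mean and sample standard deviation of the pairwise similarities."""
    pairs = off_diagonal_pairs(sim)
    return mean(pairs), stdev(pairs)


# Hypothetical cosine similarity matrix for one author with three texts.
author_x = [
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]

avg, spread = summarise(author_x)
print(f"mean similarity = {avg:.3f}, standard deviation = {spread:.3f}")
```

The list returned by `off_diagonal_pairs` is also what one would plot as a density to compare the full distributions between authors.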
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R, it is designed for the sort of text analysis of reuse that you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between kitten and sitting is 3, but this means absolutely nothing in substantive terms about their semantic relationship or one being an example of "re-use" of the other. An argument could be made that LD based on words might show re-use, but that's not how most algorithms e.g. http://turnitin.com implement detection for plagiarism.

What is the meaning of "Inf" in S_Dbw output in R commander?

I have run the clv package, which provides the S_Dbw and SD validity indexes for clustering, in R commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from the DBSCAN, K-Means, and Kohonen algorithms with the S_Dbw index, but for all three algorithms S_Dbw is "Inf".
Does it mean "infinite"? Why did I get "Inf"? Is there a problem with my clustering results?
In general, when is the S_Dbw index result "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering will find the partition with the optimal minimum distance between partitions. Well, DBSCAN will find the partitioning of the dataset, where all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously. Do not assume that because one algorithm scores higher than another on a particular measure, that algorithm works better. All that you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure for comparing different results of the same algorithm is different. Then obviously there shouldn't be a benefit from one algorithm over itself. There might still be a similar effect with respect to parameters. For example the in-cluster distances in k-means obviously should go down when you increase k.
In fact, many of the measures are not even well-defined on DBSCAN results, because DBSCAN has the concept of noise points, which the indexes do not, AFAIK.
Do not assume either that the measure will give you an indication of what is "true" or "correct", and even less of what is useful or new. You should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data, which probably is not some measure number.
Back to the indices. They are usually designed entirely around k-means. From a short look at S_Dbw I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value will become infinity - aka: undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this without deviating from the original index and turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfectly, or every clustering can trivially be improved by assigning each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster", then I do not know of any appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits, or that a nonzero value was divided by zero.
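A small Python sketch shows how an infinity appears and propagates in double-precision arithmetic. It is only an illustration of the floating-point behaviour, not of the clv package; note also that Python raises an error on float division by zero, whereas R returns Inf for 1/0, so the example uses overflow instead.

```python
import math

big = 1e308             # close to the largest finite double (about 1.797e308)
overflowed = big * 10   # exceeds the representable range and becomes inf

print(overflowed)                # inf
print(math.isinf(overflowed))    # True
print(-overflowed)               # -inf

# Once a value is infinite, further arithmetic stays infinite or turns
# into NaN ("not a number"), for example when subtracting inf from inf.
print(overflowed + 1.0)          # inf
print(overflowed - overflowed)   # nan
```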
