How to select stop words using tf-idf? (non-English corpus) - information-retrieval

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in (its document frequency) and filter out those that appear in more than 50% of them, or the top 500, or some other threshold that you will have to tune.
The best (as in most representative) terms in a document are those with the highest tf-idf, because those terms are common in the document while being rare in the collection.
As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop words) get very low tf-idf anyway. However, keeping them still changes some computations, which would be wrong if you assume they are pure noise (which might not be true depending on the task). In addition, if they are included your algorithm will be slightly slower.
Edit:
As @FelipeHammel says, you can directly use the IDF (remember to invert the order), which is inversely related to the df. For ranking purposes this is completely equivalent, and therefore so is selecting the top "k" terms. However, it cannot be used directly to select based on ratios (e.g., words that appear in more than 50% of the documents), although a simple threshold will fix that (i.e., selecting terms with an idf lower than a specific value). In general, a fixed number of terms is used.
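For illustration, here is a minimal pure-Python sketch of both ideas (the toy docs corpus, the 50% ratio and the top-k cutoff are placeholders you would tune for your own collection):

    import math
    from collections import Counter

    # toy, already-tokenized corpus -- replace with your own documents
    docs = [
        ["the", "cow", "eats", "the", "grass"],
        ["the", "flower", "eats", "nothing"],
        ["the", "grass", "is", "green"],
    ]

    N = len(docs)
    df = Counter()                        # number of documents each term appears in
    for doc in docs:
        df.update(set(doc))

    # Option 1: terms appearing in more than 50% of the documents are stop-word candidates
    stop_words = {t for t, n in df.items() if n / N > 0.5}

    # Option 2: rank by idf (low idf = very common = stop-word-like) and take the top k
    idf = {t: math.log(N / n) for t, n in df.items()}
    top_k_stop_words = sorted(idf, key=idf.get)[:2]

    # Most representative terms of one document: highest tf-idf within that document
    tf = Counter(docs[0])
    tfidf = {t: tf[t] * idf[t] for t in tf}
    best_terms = sorted(tfidf, key=tfidf.get, reverse=True)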
I hope this helps.

From "Introduction to Information Retrieval" book:
tf-idf assigns to term t a weight in document d that is
highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
So words with the lowest tf-idf can be considered stop words.
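For reference, the weighting behind that description (in its most common form; variants exist) is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A term that occurs in virtually all documents has df(t) close to N, so log(N / df(t)) is close to 0 and its tf-idf stays near zero no matter how often it occurs, which is exactly why the lowest-weighted terms behave like stop words.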

Related

Why does every word vector have 300 dimensions?

Why does every word have a 300-dimensional vector in spaCy, Gensim and fastText? I didn't understand what the dimension is and what it is used for. What does that 300-dimensional vector mean? Why does every word or sentence have 300 dimensions, and what information is stored in that 300-dimensional vector?
To generally understand what's happening with word-vectors, and more generally vector-modelling of text, you should seek out online intro articles & tutorials.
You can make word-vectors in any dimensionality when you train them from large corpora of text. 300 dimensions has been commonly used since the original research papers from Google, as they seemed to think that offered a good tradeoff of being large enough to capture the desired functionality without being overlarge (which requires more memory, training data, and training time). The set of 3 million word-vectors they released, trained on over 100B words of news articles (the GoogleNews vectors), were 300-dimensional.
Gensim defaults to 100 dimensions when creating new vectors, as something more often appropriate for the somewhat smaller datasets and systems that may be used by individual developers – but supports any value you'd like. You'll sometimes see other papers/projects using word-vectors of 400, 600, or 1000 dimensions.
In a 'dense embedding' like word2vec vectors, the individual dimensions are generally not interpretable. No dimension has a fixed or specific meaning. Rather, most 'neighborhoods' will (in a well-trained model) have meanings that roughly map to human-describable ideas, and certain 'directions' in the high-dimensional space may also correlate with human-describable concepts.
So, words that are synonyms or otherwise similarly used tend to be each other's nearest neighbors in the space, and travelling in certain directions may also lead to other words whose meaning shifts in particular ways - the aspect that lets word2vec sort-of solve analogies, as in the famous "king - man + woman = queen" semantic-arithmetic example.
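As a rough sketch of how the dimensionality is chosen and how the analogy arithmetic is queried (using gensim 4.x; the tiny sentences corpus is a placeholder, and meaningful analogies need far more training text):

    from gensim.models import Word2Vec

    # placeholder corpus -- real vectors need large amounts of text
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["a", "man", "and", "a", "woman", "walk"],
    ]

    # vector_size is the dimensionality you choose: 300, 100, 1000, ...
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=10)

    print(model.wv["king"].shape)   # (300,) -- one 300-dimensional vector per word

    # the "king - man + woman ~ queen" arithmetic, only sensible with well-trained vectors
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))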

Precision at k when fewer than k documents are retrieved

In information retrieval evaluation, what would precision@k be if fewer than k documents are retrieved? Let's say only 5 documents were retrieved, of which 3 are relevant. Would the precision@10 be 3/10 or 3/5?
It can be hard to find text defining edge cases of measures like this, and the mathematical formulations often don't deal with the incompleteness of data. For issues like this, I tend to turn to the decisions made by trec_eval, a tool distributed by NIST that has implementations of all common retrieval measures, especially those used by the Text REtrieval Conference (TREC) challenges.
Per the metric description in m_P.c of trec_eval 9.0 (called the latest on this page):
Precision measured at various doc level cutoffs in the ranking.
If the cutoff is larger than the number of docs retrieved, then
it is assumed nonrelevant docs fill in the rest. Eg, if a method
retrieves 15 docs of which 4 are relevant, then P20 is 0.2 (4/20).
Precision is a very nice user oriented measure, and a good comparison
number for a single topic, but it does not average well. For example,
P20 has very different expected characteristics if there are 300
total relevant docs for a topic as opposed to 10.
This means that you should always divide by k even if fewer than k documents were retrieved, so the precision would be 0.3 rather than 0.6 in your particular case. (It punishes the system for retrieving fewer than k.)
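In code, that convention is simply (a minimal sketch with binary relevance judgements, not trec_eval itself):

    # trec_eval convention: always divide by k, so missing ranks count as non-relevant
    def precision_at_k(relevance, k):
        # relevance: list of 0/1 judgements for the retrieved ranking, best first
        retrieved = relevance[:k]          # anything beyond rank k is ignored
        return sum(retrieved) / k          # unfilled ranks implicitly count as 0

    print(precision_at_k([1, 1, 0, 1, 0], 10))   # 5 retrieved, 3 relevant -> 0.3
    print(precision_at_k([1, 1, 0, 1, 0], 5))    # precision@5 -> 0.6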
The other tricky case is when there are fewer than k relevant documents. This is why they note that precision is a helpful measure but does not average well.
Some measures that are more robust to these issues are Normalized Discounted Cumulative Gain (NDCG), which compares the ranking to an ideal ranking (at a cutoff), and the simpler R-Precision, which calculates precision at the number of relevant documents rather than at a fixed k. So one query may calculate P@15 for R=15, and another may calculate P@200 for R=200.
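R-Precision in the same style (again a sketch under the binary-judgement assumption above):

    def r_precision(relevance, num_relevant):
        # relevance: 0/1 judgements for the retrieved ranking; num_relevant: R for this query
        retrieved = relevance[:num_relevant]
        return sum(retrieved) / num_relevant

    print(r_precision([1, 0, 1, 1, 0, 0, 1, 0], num_relevant=4))   # P@4 when R=4 -> 0.75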

Series of numbers with minimized risk of collision

I want to generate some numbers which should share as few common bit patterns as possible, so that collisions happen as rarely as possible. Until now it's "simple" hashing with a given number of output bits. However, there is another 'constraint': I want to minimize the risk that, if you take one number and change it by toggling a small number of bits, you end up with another number you've just generated. Note: I don't want it to be impossible or something, I want to minimize the risk!
How to calculate the probability for a list with n numbers, where each number has m bits? And, of course, what would be a suitable method to generate those numbers? Any good articles about this?
To answer this question precisely, you need to say what exactly you mean by "collision", and what you mean by "generate". If you just want the strings to be far apart from each other in hamming distance, you could hope to make an optimal, deterministic set of such strings. It is true that random strings will have this property with high probability, so you could use random strings instead.
When you say
Note: I don't want it to be impossible or something, I want to minimize the risk!
this sounds like an XY problem. If some outcome is the "bad thing" then why do you want it to be possible, but just low probability? Shouldn't you want it not to happen at all?
In short, I think you should look up the term "error-correcting code". The codewords of any good error-correcting code, with whatever parameters you like, will have the minimal risk of collision in the presence of random noise for that number of codewords of that length, and they can typically be generated very easily using matrix multiplication.
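As a minimal sketch of that idea, here are the codewords of the classic [7,4] Hamming code generated by matrix multiplication over GF(2); for your actual bit length and count you would pick a code with matching parameters:

    from itertools import product

    # generator matrix of the [7,4] Hamming code, systematic form [I_4 | P]
    G = [
        [1, 0, 0, 0, 1, 1, 0],
        [0, 1, 0, 0, 1, 0, 1],
        [0, 0, 1, 0, 0, 1, 1],
        [0, 0, 0, 1, 1, 1, 1],
    ]

    def encode(msg):
        # multiply the 4-bit message by G over GF(2)
        return tuple(sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G))

    codewords = [encode(msg) for msg in product([0, 1], repeat=4)]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    min_dist = min(hamming(a, b) for i, a in enumerate(codewords)
                   for b in codewords[i + 1:])
    print(len(codewords), min_dist)   # 16 codewords of 7 bits, minimum distance 3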

agrep max.distance arguments in R

I need some help with the specific arguments of the agrep function in R.
In max.distance, the components cost, all, insertions, deletions and substitutions each take a "maximum number/fraction" value, given either as an integer or as a fraction.
I've read the documentation on it, but I still cannot figure out some specifics:
What is the difference between "cost=1" and "all=1"?
How is a decimal interpreted, such as "cost=0.1", "inserts=0.9", "all=0.25", etc.?
I understand the basics of the Levenshtein Distance, but how is it applied in terms of the cost or all arguments?
Sorry if this is fairly basic, but like I said, the documentation I have read on it is slightly confusing.
Thanks in advance
Not 100% certain, but here is my understanding:
In max.distance, cost and all are interchangeable if you don't specify a costs argument (that is the next argument); if you do, then cost will limit based on the weighted (as per costs) cost of the insertions/deletions/substitutions you specified, whereas all will limit on the raw count of those operations.
The fraction represents what fraction of the number of characters in your pattern argument you want to allow as insertions/deletions/substitutions (e.g. 0.1 on a 10-character pattern would allow 1 change). If you specify costs, then it is that fraction of the number of characters in pattern * max(costs), though presumably the fractions in max.distance{insertions/deletions/substitutions} will be the number of characters * the corresponding costs value.
I agree that the documentation is not as complete as it could be. I discovered the above by building simple test examples and messing around with them. You should be able to do the same and confirm it for yourself, particularly the last part (i.e. whether costs affects the fraction measure of max.distance{insertions/deletions/substitutions}), which I haven't tested.
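To make the fraction arithmetic concrete, here is a rough Python illustration of that reading (just the interpretation described above, not R's agrep implementation; the pattern and candidates are made up):

    # approximate matching budget as a fraction of the pattern length
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    pattern = "lasy"        # hypothetical 4-character pattern
    max_distance = 0.25     # fraction: 0.25 * 4 characters = 1 allowed change
    budget = int(max_distance * len(pattern))

    for candidate in ["lazy", "last", "fuzzy"]:
        print(candidate, levenshtein(pattern, candidate) <= budget)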

Recursive hypothesis-building with ambiguities - what's it called?

There's a problem I've encountered a lot (in the broad fields of data analysis or AI). However, I can't name it, probably because I don't have a formal CS background. Please bear with me, I'll give two examples:
Imagine natural language parsing:
The flower eats the cow.
You have a program that takes each word, and determines its type and the relations between them. There are two ways to interpret this sentence:
1) flower (substantive) -- eats (verb) --> cow (object)
using the usual SVO word order, or
2) cow (substantive) -- eats (verb) --> flower (object)
using a more poetic word order. The program would rule out other possibilities, e.g. "flower" as a verb, since it follows "the". It would then rank the remaining possibilities: 1) has a more natural word order than 2), so it gets more points. But including the world knowledge that flowers can't eat cows, 2) still wins. So it might return both hypotheses, and give 1) a score of 30, and 2) a score of 70.
Then, it remembers both hypotheses and continues parsing the text, branching off. One branch assumes 1), one 2). If a branch reaches a contradiction, or a ranking of ~0, it is discarded. In the end it presents ranked hypotheses again, but for the whole text.
For a different example, imagine optical character recognition:
** **
** ** *****
** *******
******* **
* ** **
** **
I could look at the strokes and say, sure this is an "H". After identifying the H, I notice there are smudges around it, and give it a slightly poorer score.
Alternatively, I could run my smudge recognition first, and notice that the horizontal line looks like an artifact. After removal, I recognize that this is ll or Il, and give it some ranking.
After processing the whole image, it can be Hlumination, lllumination or Illumination. Using a dictionary and the total ranking, I decide that it's the last one.
The general problem is always some kind of parsing / understanding. Examples:
Natural languages or ambiguous languages
OCR
Path finding
Dealing with ambiguous or incomplete user input - which interpretations make sense, which is the most plausible?
It's recursive.
It can bail out early (when a branch / interpretation doesn't make sense, or will certainly end up with a score of 0). So it's probably some kind of backtracking.
It keeps all options in mind in light of ambiguities.
It's based on simple rules at the bottom, e.g. can_eat(cow, flower) = true.
It keeps a plausibility ranking of interpretations.
It's recursive on a meta level: It can fork / branch off into different 'worlds' where it assumes different hypotheses when dealing with the next part of data.
It'll forward the individual rankings, probably using Bayesian probability, to dependent hypotheses.
In practice, there will be methods to train this thing, determine ranking coefficients, and there will be cutoffs if the tree becomes too big.
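In pseudo-Python, the mechanism I have in mind looks roughly like this (completely made-up names and numbers, just to illustrate the branching, ranking and pruning):

    # each partial interpretation carries a score; ambiguous readings fork it,
    # and hopeless branches are pruned early
    def expand(hypotheses, readings, prune_below=0.05):
        new = []
        for interpretation, score in hypotheses:
            for label, plausibility in readings:
                combined = score * plausibility      # forward the ranking
                if combined >= prune_below:          # bail out early otherwise
                    new.append((interpretation + [label], combined))
        return sorted(new, key=lambda h: h[1], reverse=True)

    hypotheses = [([], 1.0)]
    hypotheses = expand(hypotheses, [("flower-eats-cow", 0.3), ("cow-eats-flower", 0.7)])
    hypotheses = expand(hypotheses, [("literal", 0.9), ("poetic", 0.1)])
    print(hypotheses)   # ranked surviving interpretations of the input so far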
I have no clue what this is called. One might guess 'decision tree' or 'recursive descent', but I know those terms mean different things.
I know Prolog can solve simple cases of this, like genealogies and finding out who is whose uncle. But you have to give it all the data in code, and it doesn't seem convenient or powerful enough to do this for my real-life cases.
I'd like to know: what is this problem called, and are there common strategies for dealing with it? Is there good literature on the topic? Are there libraries, ideally for C(++) or Python, where you can just define a bunch of rules and it works out all the rankings and hypotheses?
I don't think there is one answer that fits all the bullet points you have. But I hope my links will lead you closer to an answer or might give you a different question.
I think the closest answer is a Bayesian network, since you have probabilities affecting each other. As I understand it, it is also related to conditional probability and fuzzy logic.
You also describe a bit of genetic programming, as well as artificial neural networks.
I can name drop some more topics which might be related:
http://en.wikipedia.org/wiki/Rule-based_programming
http://en.wikipedia.org/wiki/Expert_system
http://en.wikipedia.org/wiki/Knowledge_engineering
http://en.wikipedia.org/wiki/Fuzzy_system
http://en.wikipedia.org/wiki/Bayesian_inference
